Roadmap to Trustworthy AI — A CDO’s Blueprint for Data Quality and Observability

Trustworthy AI is no longer a luxury — it’s a business imperative today. The success of AI initiatives hinges on building systems that stakeholders can trust. This “trust” factor goes beyond simply providing accurate outputs; it involves ensuring that AI operates ethically, fairly, and transparently, aligning with both regulatory demands and the values of the organization.

As AI begins to influence high-stakes decisions across industries, from finance to healthcare, every enterprise data leader charting an AI pathway knows that the most critical aspect of building trustworthy AI lies in ensuring quality data: garbage in, garbage out.

Poor data quality — such as incomplete, biased, or outdated data — can lead to flawed AI models and, in turn, incorrect or harmful outputs. It comes as no surprise, then, when the enterprise data leader of a leading financial services company equates data quality with soil quality in agriculture.

In an article penned exclusively for CDO Magazine, Tarun Sood, Chief Data Officer at American Century Investments, writes that data anchors AI/ML algorithms much as soil anchors crops, and he likens data cleansing to fertilizer.

“As fertilizer rectifies soil deficiencies, enriching it to optimize plant growth, data cleaning corrects inconsistencies, missing values, and errors in datasets to optimize model training,” says Sood.

Along the same lines, organizations are increasingly turning to data observability to achieve and maintain high-quality data. This approach enables continuous monitoring and analysis of data pipelines, allowing data leaders to quickly detect, diagnose, and resolve data issues.

According to CDO Magazine’s 2024 study on the state of data observability, 92% of data leaders plan to make data observability a core part of their data strategy in the next 1-3 years. A staggering 82% of respondents reported that the greatest need to support data observability is end-to-end pipeline visibility.

The increasing focus on the subject also aligns with the rising demand for generative AI (GenAI). Nearly all enterprises surveyed expect GenAI to play a pivotal role in their data strategies within the next 3 to 5 years.

Data Quality — The Cornerstone of Trustworthy AI

Data quality refers to the condition of data based on several key dimensions: accuracy, completeness, consistency, and timeliness. For enterprise data leaders, high-quality data is not merely about correctness — it’s about ensuring that the data is reliable enough to support the AI models driving critical business decisions.

The impact of data quality on AI model performance cannot be overstated. Even the most sophisticated AI models are only as good as the data they are trained on.

In a CDO Magazine interview, Megan Brown, Director of the Global Center of Excellence for Advanced Analytics and Data Science at Starbucks, emphasized that data quality is a critical enabler for enterprise AI. She noted that everything — from in-house data science models to GenAI — relies entirely on the quality of the underlying data.

Brown even urged her peers in data leadership to consider leaving organizations that fail to prioritize data quality despite the leaders’ best efforts to address the issue. Her advice is hardly unexpected, considering that models relying on poor-quality data can produce flawed insights, from biased recommendations in customer service to erroneous risk assessments in finance.

Challenges of Maintaining Data Quality

Maintaining data quality has become more challenging in today’s complex data landscape. Data leaders across industries have said time and again in CDO Magazine interviews that enterprises are dealing with exploding volumes of data from an ever-expanding variety of sources, including second- and third-party datasets, IoT devices, and unstructured formats like text and images. The diversity and complexity of these data streams make it harder to ensure that all data meets stringent quality standards.

Additionally, as data passes through various hands and systems, from ingestion to processing, the risk of data corruption, duplication, or loss grows, further compounding the challenge.

Best Practices for Data Quality Management

To mitigate these challenges, enterprise data leaders must adopt best practices for data quality management. One foundational approach is to implement a strong governance framework, which outlines policies, roles, and responsibilities for maintaining data quality across the organization.

Varun Gupta, Director of Data Strategy & Innovation at an Ivy League university and a former banker, says that quality checks, such as verifying the distributions of critical data elements, are essential. For instance, if a field should only accept the values A, B, or C, these checks ensure it does so. “Once these business rules are understood, they should be institutionalized into the front-end systems and enforced through the organization’s governance policies,” Gupta adds.
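
As a minimal illustration of such checks (a hedged pandas sketch with hypothetical field names and thresholds, not anything drawn from Gupta’s actual systems), an allowed-value rule and a distribution check might look like this:

```python
import pandas as pd

ALLOWED_VALUES = {"A", "B", "C"}  # hypothetical business rule for one field

def invalid_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return rows whose value for `column` falls outside the allowed set."""
    return df[~df[column].isin(ALLOWED_VALUES)]

def distribution_drift(df: pd.DataFrame, column: str,
                       baseline: pd.Series, tolerance: float = 0.05) -> bool:
    """Flag the column if any category's share moves more than `tolerance`
    away from an agreed baseline distribution."""
    current = df[column].value_counts(normalize=True)
    return current.sub(baseline, fill_value=0).abs().max() > tolerance

# Example: the baseline says A/B/C should occur 50/30/20 percent of the time
baseline = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})
df = pd.DataFrame({"grade": ["A", "B", "C", "D", "A"]})
print(invalid_rows(df, "grade"))              # surfaces the out-of-rule "D" row
print(distribution_drift(df, "grade", baseline))
```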

He says that data quality and governance are interconnected: quality drives governance policies, and those policies reinforce quality in a continuous cycle.

Sharing a best practice, Gupta says that effective data quality management requires a federated model with a central team overseeing governance along with data stewards from respective organization verticals. “These stewards work closely with the central team to monitor and address data quality issues proactively, ensuring consistency and reducing future problems. This hub-and-spoke model, combining central oversight with local responsibility, is crucial for maintaining data quality and governance across the organization.”

Vijay Venkatesan, Chief Analytics Officer at Horizon Blue Cross Blue Shield of New Jersey (BCBSNJ), explains in a CDO Magazine interview that the organization’s approach to AI begins with the creation of an AI governance framework, which encompasses guiding principles and data completeness.

BCBSNJ uses a “data scorecard,” where different types of data are evaluated for completeness against 15-18 specific attributes. For example, claims data may be 95% complete, while provider data might be only 65% complete, as much of the provider information is limited to payment and reimbursement details.
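
The article does not detail how the scorecard is computed; as a rough sketch of the idea, attribute-level completeness can be read off non-null counts (the column names below are hypothetical):

```python
import pandas as pd

def completeness_scorecard(df: pd.DataFrame, attributes: list[str]) -> pd.Series:
    """Percentage of non-null values per scored attribute."""
    return (df[attributes].notna().mean() * 100).round(1)

# Hypothetical provider extract with gaps in billing address / practice location
providers = pd.DataFrame({
    "provider_id": [1, 2, 3, 4],
    "billing_address": ["123 Main St", None, None, "9 Elm Ave"],
    "practice_location": ["NJ", None, "NJ", None],
})
print(completeness_scorecard(
    providers, ["provider_id", "billing_address", "practice_location"]))
# provider_id 100.0, billing_address 50.0, practice_location 50.0
```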

Venkatesan points out that critical fields such as provider billing addresses or practice locations might be missing. As he puts it, “Provider is an area where there are a lot of gaps in terms of data completeness,” making it essential to assess what’s available and decide which use cases are feasible based on the data quality.

Data is also assessed by lines of business — whether for commercial, Medicare, or Medicaid members. Additionally, they perform a "reasonability scan" to determine whether the data is suitable for purposes such as population health or financial analysis.

Ethical Considerations and Bias in Data

Venkatesan also highlights the importance of responsible AI practices, particularly regarding bias in data, and considers the ethical implications of using internal versus external data, including decisions on whether to anonymize it. He notes, “It’s really that framework that’s key in our minds,” ensuring data governance aligns with responsible decision-making.

On the issue of bias, Venkatesan underscores the importance of understanding the demographic represented by the data and ensuring that bias is addressed to the greatest extent possible. He adds that it is about “making sure that as you're using data, you have a sense of the demographic,” while acknowledging that biases may not always be fully resolved but can be better understood.

Data Cleansing and Lineage Tracking

Establishing data cleansing and validation processes helps ensure that data is continually reviewed, corrected, and validated before it enters AI pipelines. American Century Investments’ Sood writes that large language models (LLMs) excel in generating human-like textual outputs, based on their training datasets.

Thus, the vital step in the journey to LLM success is data cleansing, guaranteeing that the model receives data that is not only of top quality but also pertinent and indicative of the desired outputs. “Data must undergo cleansing to remove anomalies, ensuring optimal model performance,” mentions Sood.
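
Sood does not prescribe specific steps, but a simplified, hypothetical cleansing pass of the kind he describes (deduplication, outlier capping, imputation) could look like this in Python:

```python
import pandas as pd

def basic_cleanse(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Deduplicate, tame extreme outliers, and impute missing numerics
    before the data reaches model training."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        lo, hi = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lo, hi)               # cap anomalous extremes
        out[col] = out[col].fillna(out[col].median())  # fill missing values
    return out
```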

Similarly, maintaining data lineage tracking enables organizations to trace the origin, movement, and transformation of data across its lifecycle, ensuring accountability and transparency in data handling.

“Data lineage is crucial for ensuring data quality and observability. For instance, when data moves from source systems into a data analytics layer, and is finally used to make a critical decision, it’s vital to trace every step along the way,” says Gupta.

He explains that this traceability becomes essential if a metric used incorrectly in an algorithm leads to flawed decisions, and it is even more vital for GenAI applications. “Without proper lineage, finding the source of the error becomes difficult. The challenge arises because data systems are often built and updated in increments, leading to gaps in tracking data transformations,” Gupta adds.
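
As a toy sketch of what lineage capture involves (production systems typically adopt an open standard such as OpenLineage rather than hand-rolled records like these), each transformation step can be logged with its source and timestamp:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class LineageStep:
    """One hop in a dataset's journey: where the data came from and
    which transformation produced it."""
    dataset: str
    source: str
    transformation: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

trail: list[LineageStep] = []

def record(dataset: str, source: str, transformation: str) -> None:
    trail.append(LineageStep(dataset, source, transformation))

# Hypothetical pipeline: source system -> analytics layer -> decision metric
record("daily_risk_metric", "core_banking.transactions", "aggregate_by_day")
record("daily_risk_metric", "daily_risk_metric", "impute_missing_values")
print(json.dumps([asdict(s) for s in trail], indent=2))
```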

(The opinions presented by Varun Gupta are his own and do not constitute an official statement from the university.)

Data Observability: Maintaining the Performance of AI Systems

Data observability refers to an organization's ability to monitor, trace, and understand the health and behavior of data as it moves through various systems and pipelines. Gupta says that observability is crucial because it sets the foundation for data-driven organizations. Each business unit is accountable for specific metrics spanning customer experience and financial or non-financial outcomes.

“It’s essential for organizations to align these key metrics upfront, define them, and work backward, ensuring transparency across the data’s entire lifecycle. This includes tracking data volume, distribution, schema, lineage, and freshness, as these factors are vital for accurate regulatory reporting and everyday business management,” Gupta elaborates.

Highlighting the two “most critical” aspects of data observability, he mentions data quality and the right schema.

“It’s okay to have incomplete or missing data as long as you’re aware of it. Knowing about data issues allows you to address them, such as imputing missing values. This makes data quality crucial. A schema, which defines how data is structured and linked, plays a key role in ensuring quality. Once you get data quality and schema right, other factors like volume and freshness can be managed by systems,” Gupta adds.
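
To make those two aspects concrete, here is a hedged sketch of a schema contract plus a freshness check; the expected columns and the 24-hour window are illustrative assumptions, not Gupta’s actual rules:

```python
import pandas as pd

# Hypothetical schema contract: column name -> expected pandas dtype
EXPECTED_SCHEMA = {"customer_id": "int64",
                   "segment": "object",
                   "updated_at": "datetime64[ns]"}

def schema_violations(df: pd.DataFrame) -> list[str]:
    """List columns that are missing or whose dtype breaks the contract."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

def is_fresh(df: pd.DataFrame, max_age_hours: int = 24) -> bool:
    """True if the newest record falls within the agreed freshness window
    (timestamps assumed to be naive UTC)."""
    age = pd.Timestamp.now(tz="UTC").tz_localize(None) - df["updated_at"].max()
    return age <= pd.Timedelta(hours=max_age_hours)
```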

Jonathan Paul, VP, Director IT Data & AI Governance at Fifth Third Bank, says that observability is crucial in ensuring the performance, reliability, and trustworthiness of AI systems.

At Fifth Third Bank, the team emphasizes building trust and confidence in data by focusing on data quality first, followed by data observability and monitoring. “This approach helps in identifying and resolving data issues promptly, ensuring that AI systems operate effectively and deliver accurate results,” says Paul.

Sharing key aspects of data observability, Paul mentions:

  1. Data quality – Ensuring that data is accurate, complete, and consistent.

  2. Monitoring and alerts – Implementing systems to monitor data in real-time and generate alerts for any anomalies.

  3. Data lineage – Tracking the flow of data from its source to its final destination to understand its transformations and usage.

  4. Transparency and accountability – Providing clear documentation and assigning responsibility for data management.

Alaa Moussawi, Chief Data Scientist at the New York City Council, mentions that data observability’s role depends on the available data. He emphasizes the importance of visualizing key statistics such as mean, median, quantiles, extrema, missing values, and feature distributions. "Viewing these statistics and tracking time trends helps detect anomalies and gives end-users a high-level understanding of the data," Moussawi explains.
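
A minimal profiling helper along the lines Moussawi describes might compute these statistics column by column (a sketch, not the Council’s actual tooling):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of the statistics Moussawi lists: mean, median,
    quartiles, extrema, and missing-value counts."""
    numeric = df.select_dtypes("number")
    stats = numeric.agg(["mean", "median", "min", "max"]).T
    stats["q25"] = numeric.quantile(0.25)
    stats["q75"] = numeric.quantile(0.75)
    stats["missing"] = df[numeric.columns].isna().sum()
    return stats
```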

Regarding model bias, fairness, and explainability, Moussawi stresses that understanding the model is key. He advises closely examining data distributions to ensure the data is randomly sampled and representative of the target population. This also involves monitoring time trends and checking for over- or under-sampling of certain features.

Paul reinforces the role of data observability in addressing data quality issues and ensuring fairness. He explains that observability is crucial for detecting anomalies and inconsistencies in data. It also helps identify model bias by monitoring the data used for training and inference, ensuring AI models are fair and explainable. "This is achieved by maintaining transparency in data processes and regularly auditing both the data and models," Paul adds.

Building a Culture of Data Quality and Observability

However, delivering trustworthy AI isn't just about the technology; it's about the organizational mindset too. For enterprise data leaders, establishing a culture centered around data quality and observability is essential for delivering trustworthy AI systems.

According to Paul, organizations can foster a culture of data quality and observability by:

  • Educating teams – Training data teams and AI developers on the importance of data quality and observability.

  • Encouraging collaboration – Collaboration between data teams and AI developers ensures that data quality is maintained throughout the AI development life cycle.

  • Implementing best practices – Adopting best practices for data management, including regular audits, documentation, and monitoring.

“Successful organizations operate with an infectious collaborative rhythm across their business, technology, and data verticals, all working together towards shared goals. For AI developers and data teams, having a clear understanding of these metrics—and the business questions they address—is essential,” Gupta states.

He illustrates that knowing why a specific metric is needed and how it matters to current business objectives allows data teams to prioritize their efforts appropriately. Some metrics may not need frequent updates, while others may require near real-time data. Some LLMs are better suited to specific types of use cases, and certain features might be more relevant to the particular business problem at hand.

“This context helps data teams effectively communicate and translate the context to their technology counterparts, ensuring that the underlying architecture aligns with the business needs. This close collaboration is critical to not only framing the problem but also building solutions that are purpose-fit for business requirements, driving organizational success,” Gupta adds.

Moussawi adds that enabling this culture means providing tools that allow end-users to perform quality checks themselves. "Encourage users to engage with these tools by offering training on how they work and by teaching basics like understanding quantiles and distributions," he advises.

Paul also emphasizes the need for ongoing monitoring and enhancement of data quality and observability practices. "Continuous monitoring ensures that data issues are promptly identified and resolved, maintaining the integrity and reliability of AI systems. Regular updates and improvements to observability practices allow organizations to adapt to new challenges and technologies, keeping AI systems robust and trustworthy."

As an example of continuous monitoring, Moussawi highlights how the New York City Council tracks data from non-emergency service requests and complaints submitted to the city's 311 service. “This data tells us what our constituents are concerned with, and we can see time trends over the thousands of different complaint types. Performing automated anomaly detection to identify any rising concerns of our constituents is crucial for ensuring that we are hearing our constituents and addressing their needs,” Moussawi explains.
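
NYC 311 data is publicly available, but the Council’s detection method is not described in the article; a trailing z-score over daily complaint counts is one plausible, simplified reading of such automated anomaly detection:

```python
import pandas as pd

def rising_complaints(daily_counts: pd.Series, window: int = 28,
                      threshold: float = 3.0) -> pd.Series:
    """Flag days where a complaint type's volume sits more than `threshold`
    standard deviations above its trailing mean."""
    trailing = daily_counts.rolling(window, min_periods=window)
    zscore = (daily_counts - trailing.mean().shift(1)) / trailing.std().shift(1)
    return zscore > threshold

# Hypothetical usage against a per-type daily count series:
# counts = noise_311.set_index("created_date").resample("D").size()
# alert_days = counts[rising_complaints(counts)]
```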

He further notes that this type of anomaly detection isn’t limited to 311 data. It can be applied across a broad range of datasets, helping organizations monitor and address evolving issues.

Using AI for Data Observability

Gupta describes anomaly detection as an “obvious crucial aspect” of machine learning, helping data teams identify hidden issues before they become major problems. According to him, while humans are adept at recognizing obvious connections — such as using Social Security Numbers or personally identifiable information (PII) as identifiers — machines can identify hidden relationships between data fields that may not be immediately apparent.

As machine learning advances, Gupta suggests that "we should rethink how we define data schemas,” advocating for a shift where machines “guide us on which schemas to use, aligning our approach with its processing language.”

He offers an example in which algorithms might suggest mailing home improvement offers to children, anticipating that they will casually nudge their elderly parents, thereby bypassing the need to maintain contact information for a segment of customers. Furthermore, he emphasizes the need to move away from manually maintaining vast amounts of data, particularly in industries like banking and finance, and to let AI assistants handle such tasks more efficiently.

“We must continuously monitor the key data elements critical to business,” states Gupta. From a technical perspective, he stresses the importance of automated systems that can detect and pinpoint issues within these metrics rather than relying on time-consuming manual checks. The goal is to reduce false positives and sharpen the models, enabling machines to identify problems accurately.

He also underscores that continuous monitoring is not just about executing test cases but about "learning from those tests and improving” algorithms. “Efforts should focus on these critical data elements and relevant AI vectors, minimizing irrelevant or ‘dark’ data. We need to prioritize managing value from data rather than getting bogged down by the 80% of data that is less relevant or unused.”

Challenges and Considerations in Implementing Data Quality and Observability Solutions

Implementing data quality and observability solutions presents its own challenges — from securing executive buy-in to overcoming resistance to change. Pascale Assémat, Chief Data Officer of the Consumer and Digital Branch at Le Groupe La Poste, notes in a CDO Magazine interview that the first question she often faces is how data observability differs from existing practices like data monitoring.

Assémat explains the key differences:

  1. Data monitoring focuses on monitoring the pipeline, while data observability examines the data inside the pipeline.

  2. Monitoring identifies what is wrong, but observability explains why it’s wrong and helps detect signals to prevent bigger problems.

During a CDO Magazine webinar on data observability, Jane Urban, VP of Customer Engagement Operations at Otsuka Pharmaceutical Companies (U.S.), highlighted another challenge — a lack of confidence in detecting issues before they affect end users. 

"Perfection is difficult to achieve," Urban said, explaining that the chaotic nature of data makes it challenging to ensure flawless performance, even after significant investments in data management. She also acknowledged the frustration many businesses face when data management efforts fail, despite substantial investments of time, money, and effort, leaving them feeling as though they lack control over their own data.

However, Gupta points out that organizations can sometimes “overdo data observability,” making the sheer volume of monitored data a costly proposition. He says this often happens when key metrics aren’t defined early on, leading to excessive data monitoring that holds organizations back from adopting AI practices.

Therefore, Gupta suggests that whether migrating legacy systems or building new ones, “it’s critical to start by aligning the key metrics, taking stock of critical decisions in the customer journey management, and then implementing observability sensors in a thoughtful, targeted way.”

Gupta notes that data observability can be a hurdle for AI adoption, especially when training LLMs. He emphasizes that "building these LLMs requires vast datasets," and the complexities of managing such data—whether understanding its relationships, storage, or structure—can become a major challenge.

He urges businesses to invest in training LLMs and to learn to accept LLMs as “thought-leading partners.” For example, instead of investing time and resources in organizing video feeds in databases, they can trust the LLM’s ability to learn to differentiate a client interaction at a branch from a recorded remote conversation.

Once algorithms are deployed, Gupta highlights that feedback loops enable these models to improve over time, allowing data teams to refine performance by identifying data points that could enhance accuracy.

Smaller companies may add data incrementally, while larger ones might input vast amounts at once. Regardless of the approach, Gupta stresses that "data observability remains essential" during the training phase. Afterward, the models can offer insights into what data would further boost their accuracy, enabling data engineers to fine-tune them accordingly.

Post-implementation, data observability helps monitor for biases and ensure fairness by tracking key metrics aligned with an organization's bias and fairness framework. “Essentially, data observability is vital for maintaining high-quality, unbiased models throughout their lifecycle,” says Gupta.

Steps to Begin Considering Data Observability

Speaking about the steps for an organization to begin considering data observability, Assémat recommends organizations approach data observability similarly to data governance. The first step is to build awareness about what observability is and why it’s important. Like governance, this requires explaining the risks of neglecting observability. Next, organizations should define the problem they want to solve and explore available observability tools.

She advises starting small and selecting a critical data pipeline that has frequent anomalies in production. A pilot of one or two observability tools should then be run to demonstrate their value to stakeholders. If successful, the tools can be scaled and integrated across more data domains.

To gauge success during initial tests, Assémat suggests focusing on:

  1. Fewer data anomalies

  2. Improved anomaly predictability

  3. Increased trust in both the data and the data teams, along with greater confidence in their ability to respond effectively to business needs.

The Role of Technology and Tools in Supporting These Initiatives

According to Greg Townsend, Senior VP & Chief Data and Analytics Officer at AltaMed Health Services, analyzing social determinants in healthcare data presents significant challenges, primarily due to limited confidence in the quality of the data. This issue is further complicated by the integration of disparate data sources. "It can be very difficult to have high levels of confidence," he notes while speaking at a CDO Magazine webinar.

Townsend emphasizes how the “current wave of observability” has transformed data monitoring from a complex function into a more accessible tool and commodity. He explains, "It hasn’t been very obvious and easy to explain to a CEO, board, or budget committee why this needs to be invested in. The observability tools we’re looking at now are making that answer a lot easier."

Additionally, Townsend identifies two types of data quality issues organizations face. Sentinel events are significant disruptions that can halt operations, requiring robust prevention. On the other hand, telemetric issues are subtler but measurable via "data gates" that monitor variance. Observability tools, he says, expand organizations’ ability to build and maintain these critical data quality gates, which are often complex and time-consuming to manage manually.
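
Townsend does not specify how these gates are built; one hedged interpretation is a simple threshold on batch variability relative to an agreed historical baseline:

```python
import pandas as pd

def variance_gate(batch: pd.Series, baseline_std: float,
                  max_ratio: float = 2.0) -> bool:
    """Pass the batch only if its variability stays within an agreed
    multiple of the historical baseline."""
    return batch.std() <= max_ratio * baseline_std

# Hypothetical feed and baseline:
# amounts = pd.read_csv("latest_batch.csv")["claim_amount"]
# if not variance_gate(amounts, baseline_std=142.0):
#     raise RuntimeError("Data gate failed: variance outside tolerance")
```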

Executive Action

  • Data Quality as the Foundation of Trustworthy AI: Poor data quality can lead to flawed AI models with potentially harmful outputs. Proactive management and maintenance of data integrity ensure that data-driven insights and operations rest on reliable, accurate information, leading to better overall performance and greater trust in the outcomes.

  • Challenges in Ensuring Data Quality: Enterprises face growing challenges in maintaining data quality due to the increasing volume, variety, and complexity of data. Data is being sourced from IoT devices, unstructured formats, and third-party sources, which complicates efforts to meet stringent quality standards. Moreover, as data flows across multiple systems, the risks of corruption, duplication, or loss increase significantly.

  • Increasing Focus on Data Observability: Enterprises are turning to data observability to ensure and maintain high-quality data. The ability to visualize data quality and observability events within data lineage enhances the transparency and traceability of data as it flows through different systems, providing a clear, interactive view of data health and behavior at each stage of its journey.

This visibility allows data managers and stakeholders to swiftly identify and address quality issues or anomalies, reducing risks and preventing potential errors from impacting downstream processes or decision-making.

  • Best Practices for Data Quality Management: Establishing a strong governance framework is key to managing data quality. This includes instituting quality checks, standardizing critical data elements, and using a federated model where a central team collaborates with data stewards across business units. Effective governance policies create a continuous cycle that enhances and reinforces data quality.

  • Building a Data-Quality-First Culture: Beyond technology, creating trustworthy AI requires cultivating a culture of data quality and observability within the organization. This includes educating teams on the importance of these factors, fostering collaboration between data teams and AI developers, and adopting best practices like regular audits and continuous monitoring.

  • Challenges in Implementing Data Observability Solutions: Implementing these solutions often faces challenges, including securing executive buy-in and managing resistance to change. Organizations must start by aligning key metrics and gradually integrating observability tools to demonstrate their value. Balancing the scope of data monitoring with the need for actionable insights is essential to avoid overwhelming the system with excessive observability efforts.

  • The Role of Technology in Supporting Data Initiatives: Data observability tools are becoming more accessible and helping organizations build robust data quality gates. These tools offer a systematic approach to managing both large-scale disruptions and subtler issues within data pipelines, improving confidence in the data and ensuring AI systems can operate reliably and ethically.

  • AI for Data Observability: AI can enhance data observability by automating anomaly detection and identifying hidden relationships between data fields. As machine learning advances, organizations can leverage AI to refine data schemas and improve the accuracy of their models, allowing for more efficient data management and AI adoption.


About

Informatica (NYSE: INFA) brings data and AI to life by empowering businesses to realize the transformative power of their most critical assets. When properly unlocked, data becomes a living and trusted resource that is democratized across your organization, turning chaos into clarity. Through the Informatica Intelligent Data Management Cloud™, companies are breathing life into their data to drive bigger ideas, create improved processes, and reduce costs. Powered by CLAIRE®, our AI engine, it’s the only cloud dedicated to managing data of any type, pattern, complexity, or workload across any location — all on a single platform. Informatica. Where data and AI come to life.
