AI Has Created a Paradox in Data Cleansing and Management — JPMC Head of Marketing Data and Analytics

Written by: CDO Magazine Bureau

Updated 3:29 PM UTC, Mon October 28, 2024

(US & Canada) Dr. Tiffany Perkins-Munn, Managing Director, Head of Marketing Data and Analytics at JPMC, speaks with David Arturi, Head of Financial Services at Lydonia, in a video interview, about the impact of AI on data cleansing and management, practices required to maintain a perfect data set after cleansing, where to start with data cleansing, various internal and external sources of data, and the challenge of blending data.

Sharing her perspective on the impact AI has had on data cleansing, Perkins-Munn states that AI has created a paradox in data cleansing and management. While AI promises to automate and improve data cleaning processes, it has also heightened the need for rigorous data cleaning practices, she says. The paradox stems from several factors, she adds, the first being the tension between the need for data and the quality of that data.

Illustrating further, Perkins-Munn says that AI models, specifically deep learning algorithms, require massive volumes of data for effective performance. This factor drives organizations to collect everything, and the volume and velocity of data often compromise the quality, thereby necessitating intense cleaning efforts.

Next, Perkins-Munn mentions the amplification effect, wherein AI models can amplify small errors or biases present in training data, which increases the need for data cleansing. She also maintains that as AI systems become more complex, tracing the impact of data quality on outputs becomes challenging. This, in turn, increases the need for preventive data cleansing.

To address the paradox, Perkins-Munn suggests approaches that include AI-driven data governance, continuous data quality monitoring, and federated learning for data management. In AI-driven data governance, AI systems can dynamically adjust data governance rules based on the evolving needs of the AI model and shifting regulatory landscapes.

The key to continuous data monitoring is implementing real-time data quality systems that detect and flag issues in the AI-driven processes, says Perkins-Munn. Then, federated learning for data management allows AI models to learn from distributed datasets without centralizing them, potentially easing the data management burden.
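As a rough illustration of continuous data quality monitoring, the Python sketch below runs a set of rules against records as they arrive and flags every failed check. The record layout, the rule names, and the `monitor` function are assumptions made for this example; they are not taken from the interview or from any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes


# Hypothetical rules for a hypothetical customer record layout.
RULES = [
    Rule("email_present", lambda r: bool(r.get("email"))),
    Rule("age_in_range",
         lambda r: isinstance(r.get("age"), int) and 0 < r["age"] < 120),
]


def monitor(records: Iterable[dict], rules=RULES) -> Iterator[tuple]:
    """Yield (record_index, rule_name) for every failed check."""
    for index, record in enumerate(records):
        for rule in rules:
            if not rule.check(record):
                yield index, rule.name


# Illustrative records standing in for a live feed.
stream = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 34},
    {"email": "b@example.com", "age": 250},
]
issues = list(monitor(stream))
for index, rule_name in issues:
    print(f"record {index} failed {rule_name}")
```

Because `monitor` is a generator, it can consume an unbounded feed and emit flags as soon as a record fails, which is closer to the real-time behavior described above than a nightly batch report would be.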

When asked about the practices required to maintain a cleansed data set, Perkins-Munn states that once the data is clean, it is critical to keep enhancing data cleaning and quality management. She adds that there are many ways to maintain data quality over time and discusses a few, including AI algorithms for automated data profiling and anomaly detection.

AI algorithms, particularly unsupervised learning models, can automatically profile data sets and detect anomalies or outliers. Continuous data monitoring is another ongoing way to keep data clean. She also mentions intelligent data matching and deduplication, wherein machine learning algorithms improve the accuracy and efficiency of data matching and deduplication processes.
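To make automated profiling and anomaly detection more concrete, here is a minimal Python sketch. It swaps in a simple label-free statistical technique, the modified z-score based on the median absolute deviation (MAD), rather than a trained model. The sample values, the function names, and the commonly used 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def profile(values):
    """Basic profile of a numeric column that may contain None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": median(present),
    }


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median([abs(v - med) for v in present])
    if mad == 0:
        return []  # no spread around the median, nothing to score
    return [v for v in present if 0.6745 * abs(v - med) / mad > threshold]


# Illustrative transaction amounts with one missing and one extreme value.
amounts = [102, 98, 101, None, 99, 100, 5000, 97]
summary = profile(amounts)
outliers = mad_outliers(amounts)
print(summary)
print(outliers)
```

The median-based score is used here because a single extreme value barely shifts the median, whereas it would inflate a mean and standard deviation enough to hide itself.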

Apart from those, there are fuzzy matching algorithms that can identify and merge duplicate records even when they contain minor variations or errors.
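One way to picture fuzzy matching is the sketch below, which uses `difflib.SequenceMatcher` from the Python standard library to score how similar two names are and keeps only the first record from each group of near-duplicates. The sample names, the `normalize` helper, and the 0.85 threshold are assumptions for this example; production systems typically use more specialized matching and compare several fields, not just a name.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop simple punctuation, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 on normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(names, threshold=0.85):
    """Greedy pass: keep a name only if it is not close to one already kept."""
    kept = []
    for name in names:
        if not any(similarity(name, k) >= threshold for k in kept):
            kept.append(name)
    return kept


# Illustrative records with a misspelling and a formatting variant.
names = [
    "Jonathan Smith",
    "Jonathon Smith",
    "Maria Garcia",
    "jonathan  smith.",
    "Wei Chen",
]
unique = dedupe(names)
print(unique)
```

The threshold is the main design choice: set too low, distinct customers are merged; set too high, obvious duplicates survive. In practice it would be tuned against a sample of manually reviewed pairs.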

Moving forward, Perkins-Munn states that for effective data management, organizations must prioritize where to start with data cleansing, and there is no one-size-fits-all approach to it. She advises focusing on cleaning the data that directly impacts the most critical business processes or decisions, thus ensuring quick, tangible value.

Organizations could also assess data criticality and impact based on frequency of usage, impact on business decisions, potential cost of errors, and regulatory requirements. She opines that organizations could follow the 80-20 rule, where 80% of the value comes from 20% of the data. Identifying that critical 20% and prioritizing it for cleaning is a smart way to start.
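The criticality assessment described above can be sketched as a weighted score over the four criteria named in the interview. Everything else in this Python example is hypothetical: the data set names, the 1 to 5 ratings, and the weights are placeholders that an organization would replace with its own judgments.

```python
# Hypothetical weights for the four criteria; they sum to 1.0.
WEIGHTS = {
    "usage": 0.3,            # frequency of usage
    "decision_impact": 0.3,  # impact on business decisions
    "error_cost": 0.2,       # potential cost of errors
    "regulatory": 0.2,       # regulatory requirements
}

# Hypothetical data sets rated from 1 (low) to 5 (high) on each criterion.
datasets = {
    "customer_master": {"usage": 5, "decision_impact": 5, "error_cost": 4, "regulatory": 5},
    "web_clickstream": {"usage": 4, "decision_impact": 2, "error_cost": 1, "regulatory": 1},
    "campaign_results": {"usage": 3, "decision_impact": 4, "error_cost": 3, "regulatory": 2},
    "archived_surveys": {"usage": 1, "decision_impact": 1, "error_cost": 1, "regulatory": 1},
}


def score(ratings):
    """Weighted sum of the criterion ratings."""
    return sum(WEIGHTS[key] * ratings[key] for key in WEIGHTS)


# Highest score first: the top entries are the candidates to clean first.
ranked = sorted(datasets, key=lambda name: score(datasets[name]), reverse=True)
for name in ranked:
    print(f"{name}: {score(datasets[name]):.1f}")
```

Under these made-up ratings, a single data set dominates the ranking, which is the kind of concentration the 80-20 rule points to.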

Instead of boiling the ocean, Perkins-Munn suggests beginning with a manageable data set that is important but not mission-critical, which allows the organization to refine processes and demonstrate value before handling sensitive data.

Furthermore, those interested in diving into AI/ML could start with the data that feeds into their AI/ML models and prioritize cleaning it.

Next, Perkins-Munn sheds light on the plethora of internal and external data sources that organizations deal with. Some of the key internal data sources include CRM systems, ERP systems, and PoS systems.

Another source is website analytics, which reveals page views, click-through rates, time spent on the site, and conversions. Along with that, mobile app usage data shows whether users are engaging, whether they are making in-app purchases, and which features they use.

In addition, there are customer support logs that include call-in support details, chat logs, and call center recordings, says Perkins-Munn. Also, there are emails and loyalty programs that accumulate data on member preferences and engagement levels. Feedback surveys, customer satisfaction surveys, and net promoter scores (NPS) are also internal data sources.

Among external data sources, Perkins-Munn mentions social media platforms, third-party research like industry reports, competitor analysis, and market trends. Public records such as census data, property records, business registrations, credit bureau data, and financial histories provide insights.

Weather data, geolocation, and economic indicators like GDP growth, unemployment, and consumer prices also play a role. News, media, and government open data (like public health or transportation data) can be used as external data sources as well.

According to Perkins-Munn, it is difficult to determine which data will speak to the consumer because blending these sources is challenging: data formats are inconsistent, and update cycles differ. Privacy concerns also play a part, especially with external data. Blending structured and unstructured data, handling real-time versus batch data, and resolving conflicts between data sets are further key challenges.
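A small Python sketch can illustrate two of these blending challenges, inconsistent formats and conflicts between data sets. It maps two sources onto one schema and, when both describe the same customer, keeps the most recently updated record. The field names, date formats, and the "latest update wins" rule are assumptions for this example; real conflict resolution often needs per-field rules and source trust rankings.

```python
from datetime import datetime

# Hypothetical extracts: same customers, different field names and date formats.
crm = [
    {"customer_id": "C1", "email": "old@example.com", "updated": "2024-03-01"},
    {"customer_id": "C2", "email": "c2@example.com", "updated": "2024-05-20"},
]
support = [
    {"cust": "C1", "contact_email": "new@example.com", "last_seen": "15/06/2024"},
]


def from_crm(record):
    """Map a CRM record onto the common schema (ISO dates)."""
    return {
        "id": record["customer_id"],
        "email": record["email"],
        "updated": datetime.strptime(record["updated"], "%Y-%m-%d"),
        "source": "crm",
    }


def from_support(record):
    """Map a support-log record onto the common schema (day/month/year dates)."""
    return {
        "id": record["cust"],
        "email": record["contact_email"],
        "updated": datetime.strptime(record["last_seen"], "%d/%m/%Y"),
        "source": "support",
    }


def blend(*normalized_sources):
    """Merge by customer id; on conflict, the most recent update wins."""
    merged = {}
    for source in normalized_sources:
        for record in source:
            current = merged.get(record["id"])
            if current is None or record["updated"] > current["updated"]:
                merged[record["id"]] = record
    return merged


blended = blend(map(from_crm, crm), map(from_support, support))
for customer_id, record in sorted(blended.items()):
    print(customer_id, record["email"], record["source"])
```

Keeping the `source` field on every blended record preserves lineage, which helps when a downstream user needs to trace a value back to the system that supplied it.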

In conclusion, she states that blending all these data sources requires robust data integration strategies, advanced analytic capabilities, and a clear understanding of data privacy regulations.

CDO Magazine appreciates Dr. Tiffany Perkins-Munn for sharing her insights with our global community.