Harnessing Unstructured Data — CDOs Reveal Strategies for Enterprise AI Success

In the age of large language models (LLMs) and generative AI, the potential for transformation in the enterprise landscape is immense. These models have demonstrated remarkable capabilities in the consumer AI space, effectively crawling the web, learning from vast amounts of data, and delivering powerful insights.

However, when it comes to enterprise environments, the integration of LLMs poses unique challenges, primarily due to the complexity of enterprise data systems. Unlike the relatively open and unstructured nature of consumer data, enterprise data resides across multiple systems, each with its own stringent governance, privacy, and security requirements.

The core challenge lies in harnessing the power of LLMs within these constraints. Organizations need to unlock the potential of their unstructured data — often scattered across billions of files — while ensuring compliance with security and privacy regulations. The task involves not only classifying, cataloging, and curating data but also addressing issues of data entitlement, lineage, and quality. These steps are crucial for building reliable and secure AI-driven solutions that can drive efficiencies.

CDO Magazine's New York Executive Boardroom Dinner, held on July 24, gathered leading data, analytics, and AI executives from the region to discuss critical industry issues. Key topics included effective strategies for governing and managing unstructured data, the vital role of embedded governance in AI pipelines, and best practices for enabling enterprise AI.

The event's highlight was a panel discussion on the "Safe Use of Data+AI to Unleash the Power of Enterprise AI."

The panelists were:

  • JC Lionti, Mizuho Managing Director and Chief Data Officer

  • Deepak Jose, Mars Wrigley Global Head of One Demand Data and Analytics Solutions

  • Moin Haque, IFF Head, Enterprise Data, Analytics & AI

The discussion was moderated by Rehan Jalil, Securiti President and CEO.

This research note distills key insights based on the discussions, providing valuable guidance for enterprise data leaders facing similar challenges within their organizations.


NAVIGATING UNSTRUCTURED DATA: CHALLENGES AND OPPORTUNITIES

JC Lionti, Chief Data Officer at Mizuho, highlights the tremendous opportunities and corresponding risks that come with leveraging unstructured data. Given that unstructured data accounts for around 90% of all data, Lionti stresses the need to tap into this vast resource while ensuring the security of all stakeholders.

Lionti further explains the importance of effectively inventorying and understanding content to enable transparent access to data. The goal is to minimize the time data scientists spend searching for or accessing data, allowing them to focus on using it. By integrating unstructured data into active metadata management and creating a comprehensive fingerprint for this data, Mizuho aims to fully unlock its potential.
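A minimal sketch of what such a metadata "fingerprint" might look like, using only the Python standard library. The field names and hashing choice here are illustrative assumptions, not Mizuho's actual schema: the idea is simply that a stable content hash plus filesystem attributes gives a catalog something to index, deduplicate, and search.

```python
import hashlib
import os
import time

def fingerprint_file(path: str) -> dict:
    """Build a minimal metadata 'fingerprint' for one file: a stable
    content hash plus filesystem attributes that an active metadata
    catalog could index. Illustrative only; field names are assumptions."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large files do not load into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    st = os.stat(path)
    return {
        "path": path,
        "sha256": h.hexdigest(),  # content identity; identical hashes flag duplicates
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%d", time.gmtime(st.st_mtime)),
        "extension": os.path.splitext(path)[1].lower(),
    }
```

Because the hash depends only on content, two copies of the same document in different systems produce the same fingerprint, which is what makes duplicate detection possible at inventory time.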

INTEGRATING DISPARATE DATA SOURCES

Deepak Jose, Mars Wrigley's Global Head of One Demand Data and Analytics Solutions, provides insights into the evolving data landscape within consumer packaged goods (CPG) companies. Traditionally, CPG companies have relied on structured internal data sources and third-party data gathered from stores. Over the past decade, however, new types of data have emerged. Jose explains the different tiers:

  • First-party data comes from direct consumer interactions, such as visits to the M&M's website or stores.

  • Second-party data is obtained from retail partners such as Amazon or Walmart, which sell data gathered from their platforms.

  • Zero-party data is consented information that consumers willingly provide.

These diverse data sources often exist in silos within large CPG companies, creating "data islands" that hinder a unified understanding of the consumer.

Jose likens this to the story of four blind individuals each describing a different part of an elephant: Sales, Marketing, and Supply Chain all interpret the same consumer data differently due to these fragmented data sources.

To address this issue, Mars Wrigley aims to build a connected data foundation that integrates these various data streams, enabling a cohesive understanding of the consumer. Jose also emphasizes the importance of aligning responsible AI and data strategies with the company's core principles.

For Mars, this means adhering to a responsible marketing strategy that excludes targeting children. Consequently, the data team at Mars has committed to not collecting any data from children under 16, whether through first-party, second-party, or third-party sources.

REEVALUATING PREPAREDNESS FOR UNSTRUCTURED DATA

IFF invested heavily in converting unstructured knowledge into structured systems to extract insights and drive actions. This new capability has unlocked tremendous potential, providing deeper insights into manufacturing and R&D processes.

Despite the excitement, Moin Haque, IFF Head of Enterprise Data, Analytics & AI, points out that this newfound access also shined a spotlight on the lack of preparedness in managing unstructured data at such a vast scale and speed.

Traditional data management practices, with their inherent frictions, allowed organizations to gradually adapt and posture themselves for effective use. Now, without those constraints, IFF is addressing the questions around relevancy and appropriate use. The challenge extends beyond internal unstructured data, as scientists now have the capability to explore external sources like academic research and commercial publications.

This abundance of information underscores the critical need for alignment and governance to ensure that data usage remains relevant and beneficial to the company.

ALIGNING AI MODELS WITH DATA AND COST EFFICIENCY

Lionti explains that the adoption of AI is relatively new in the financial services industry compared to sectors like consumer packaged goods (CPG). Accordingly, Mizuho is proceeding cautiously, ensuring that AI use cases align with the established governance processes. To manage this, Lionti emphasizes the importance of leveraging their technology platform to enforce and streamline governance.

When someone wants access to data, they must go through the appropriate channels, and the process is designed to be as automated and rapid as possible.

Lionti also highlights the establishment of responsible AI principles, which all stakeholders must follow. Governance is enforced from the provider side, meaning that anyone seeking data access must clearly explain their intended use, and continuous monitoring is in place to ensure compliance.

Regarding AI models, Lionti notes the importance of being able to rerun models on both structured and unstructured data multiple times to extract maximum value. The first generation of models might deliver initial insights, but subsequent generations can refine and enhance those insights further.

To support this iterative process, Lionti's team is working on structuring pipelines and processes that allow for the creation of reusable interim products. These products can be enriched over time using new or different models, ensuring that the data continues to yield value with each iteration.

Mars’ approach to unstructured data is centered on solving specific business problems rather than simply trying to extract value from the data itself. Jose says that the focus is on identifying the business challenge first and then determining how data or AI technology can address it.

Jose provides an example from Mars' use of unstructured data in a CPG environment with its internal GenAI tool, Brahma.ai. Brahma, named after the Sanskrit word for creation, is built on Google Gemini 1.5 and is designed to accelerate innovation by integrating external consumer trend data with Mars' proprietary R&D sensory data. For instance, in the case of Wrigley, one of Mars' biggest brands, Brahma.ai analyzed consumer trends in the chewing gum market and combined this with internal data to create new product options.

A successful example of this approach was the launch of Respawn, a chewing gum targeted at gamers. By identifying an unaddressed niche in the market through the analysis of unstructured consumer data and integrating it with other internal data sources, Mars was able to launch a product that met a specific consumer need. Jose emphasizes that while unstructured data, like social media sentiment or consumer audience data, is valuable, it must be integrated with structured data from internal sources to make informed decisions that drive business success.

ENSURING ACCURACY IN AI MODEL IMPLEMENTATION

Model veracity, or output accuracy, has been a significant challenge for IFF. Haque reveals that the organization initially took a cautious approach, blocking external services to mitigate risk. It then standardized on Microsoft 365 for productivity applications, recognizing it as a safer option that offers some protection and avoids potential indemnification issues. This approach lets users experiment safely within their productivity environment.

In contrast, for R&D and product development, where differentiation is crucial, IFF adopts a more prescriptive strategy. The organization seeks out the right partners and builds solutions tailored to their needs.

Recognizing the high value of their data in the biotech space, especially in protein engineering, IFF collaborates with NVIDIA to leverage ensemble models for safe experimentation. This partnership helps in navigating the complexities of using advanced models while protecting valuable data.

Looking ahead, Haque notes that IFF is preparing for a future where it will develop more specific, proprietary models tailored to business needs. This involves transitioning from relying on external partners to creating in-house, customized models that build upon foundational technology.

DOMAIN ASSIGNMENT FOR UNSTRUCTURED DATA

Securiti’s Jalil explains that handling domain assignments for unstructured data begins with discovery, especially considering the vast number of files involved. The process requires technology to identify files and extract metadata, as it’s impossible to manually analyze both structured and unstructured data at such a scale.

The metadata, including embedded labels, activity history, and file access information, helps in understanding the content and staleness of files, as well as determining file ownership. This approach is part of building a knowledge graph, where all metadata is organized and queried efficiently. Establishing lineage is crucial, not just for tracking data flow but also for managing metadata to ensure the unstructured data is correctly organized and redundant files are removed.
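As a rough illustration of this triage step, the sketch below groups discovered file metadata into duplicate sets, stale files, and per-owner domains, which is the raw material a knowledge graph would then organize. The record fields and the staleness threshold are assumptions for illustration, not Securiti's actual model:

```python
from collections import defaultdict
from datetime import date

STALE_AFTER_DAYS = 365  # assumed staleness heuristic, not a product default

def triage(records: list[dict], today: date) -> dict:
    """Group discovered file metadata into duplicates, stale files, and
    per-owner domains. Record fields ('sha256', 'owner', 'last_accessed',
    'path') are illustrative assumptions."""
    by_hash = defaultdict(list)
    by_owner = defaultdict(list)
    stale = []
    for r in records:
        by_hash[r["sha256"]].append(r["path"])   # same hash => redundant copy
        by_owner[r["owner"]].append(r["path"])   # ownership drives domain assignment
        if (today - r["last_accessed"]).days > STALE_AFTER_DAYS:
            stale.append(r["path"])
    redundant = {h: paths for h, paths in by_hash.items() if len(paths) > 1}
    return {"redundant": redundant, "stale": stale, "domains": dict(by_owner)}
```

At enterprise scale this logic would run inside a discovery pipeline rather than in memory, but the grouping itself, by content identity, by activity, and by owner, is the same.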

OPTIMIZING DATA CLEANING FOR VALUABLE INSIGHTS

The following key strategies enabled Mars to make substantial improvements in data quality:

  1. Data ownership: Data quality used to be a shared responsibility between the business and analytics teams, but there was a lack of clear accountability. Now, each functional domain within the business has its own data steward. For instance, within the sales team, the Strategic Revenue Management sub-function manages the pricing and cost data, ensuring that ownership is in the hands of the right people. This shift required the creation of many new roles and extensive training and upskilling of employees across the organization.

  2. Centralization: Several shadow analytics operations existed within the company. This was addressed with a major centralization drive that created a single source of truth; shadow analytics teams are no longer allowed to run their own data ingestion. Moving from a decentralized organization to a centralized data foundation was a cultural change in itself for Mars.

  3. Training and upskilling: Training and upskilling the company's senior leaders, and making data part of their job descriptions and annual business objectives, went a long way toward changing mindsets. This top-down approach, supported by a strong emphasis on culture and change management, played a key role in driving the transformation.

Organizations that control the systems that source or collect data can modify those systems directly. NBC Universal faced similar issues, and as CDO Will Gonzalez reveals, the organization addressed them by enhancing its data collection processes to improve data quality. The solution was not necessarily easy, but it was straightforward and effective.

At Intuit, the initial priority was to establish a single source of truth for all metadata. Achieving metadata consistency across all lineage paths was essential, as it resolved 90% of the data quality challenges. With consistent metadata in place, the team could perform parity checks as needed, forming the foundation for their data quality strategy across all verticals.

To implement this approach, the team used a graph database, mapping the process within the graph to manage data quality. This also provided additional advantages, such as improved Role-Based Access Control (RBAC) capabilities, making it possible to control and manage access to specific data more precisely.
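One way to picture lineage-aware access control is the sketch below: a derived dataset inherits the role requirements of every upstream source it was built from, so a restriction on a raw table automatically applies to reports derived from it. The graph shape and role model here are assumptions for illustration, not Intuit's implementation:

```python
def upstream_sources(lineage: dict[str, list[str]], node: str) -> set[str]:
    """All ancestors of a dataset in a lineage graph (child -> parents),
    found by iterative depth-first traversal."""
    seen, stack = set(), [node]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def can_access(user_roles: set[str], dataset: str,
               lineage: dict[str, list[str]],
               required: dict[str, set[str]]) -> bool:
    """RBAC over lineage: the user needs the roles required by the dataset
    AND by every upstream source it was derived from, so derived tables
    inherit their parents' restrictions. Illustrative sketch only."""
    needed = set(required.get(dataset, set()))
    for src in upstream_sources(lineage, dataset):
        needed |= required.get(src, set())
    return needed <= user_roles
```

The design choice worth noting is that access is computed over the lineage closure rather than per table, which is exactly the kind of query a graph database answers cheaply.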

Another approach shared during the event was to avoid tackling thousands or millions of documents all at once, a "boil the ocean" effort that is overwhelming and impractical. A more effective strategy is to start small, focusing on manageable steps tied to specific use cases, and gradually build from there.

Rather than convincing leadership that anything is possible from the outset, a phased approach that starts with specific, targeted efforts is likely to yield better results.

EXECUTIVE ACTIONS

There is a growing emphasis on inventorying and understanding unstructured data to enhance accessibility and utilization while ensuring security. A structured approach to integrating unstructured data into active metadata management is essential for unlocking its potential while maintaining governance.

  1. Connecting data islands: The challenge lies in unifying diverse data sources, including first-party, second-party, and zero-party data. By building a connected data foundation, companies can achieve a cohesive understanding of the consumer, aligning their AI strategies with responsible data usage and ethical principles.

  2. Preparedness for unstructured data: Organizations must address the challenges of managing unstructured data at scale by focusing on structuring data systems for insights. Aligning data usage with governance ensures that the information remains relevant and beneficial to the company.

  3. Governance and iteration: The cautious adoption of AI in industries such as financial services highlights the importance of aligning AI use cases with established governance processes. An iterative approach to model deployment, where models are rerun on both structured and unstructured data, enables continuous refinement and value extraction.

  4. Targeted AI solutions: AI should be used to address specific business challenges. For example, internal GenAI tools can integrate external consumer trend data with proprietary R&D data to innovate products that meet specific consumer needs.

  5. Model veracity: Collaborations with advanced technology partners can help organizations safely experiment with AI while protecting valuable data. This approach ensures that model quality and data protection are balanced effectively.

  6. Domain assignment: Effective management of unstructured data requires technology-driven discovery and metadata extraction. Organizing metadata and establishing lineage are critical for building a knowledge graph that supports efficient data management.

  7. Ownership and centralization: Improving data quality begins with clear ownership, centralization of data sources, and extensive training. This fosters accountability and drives cultural change within the organization.

  8. Phased approach: Tackling data quality challenges requires enhancing data collection processes and establishing a single source of truth for metadata. A phased approach, focusing on specific use cases, is recommended to manage the complexities of unstructured data effectively.

About Securiti:

Securiti is the pioneer of the Data + AI Command Center, a centralized platform that enables the safe use of data and GenAI. It provides unified data intelligence, controls, and orchestration across hybrid multicloud environments. Large global enterprises rely on Securiti's Data + AI Command Center for data security, privacy, governance, and compliance. Securiti has been recognized with numerous industry and analyst awards, including "Most Innovative Startup" by RSA, "Top 25 Machine Learning Startups" by Forbes, "Most Innovative AI Companies" by CB Insights, "Cool Vendor in Data Security" by Gartner, and "Privacy Management Wave Leader" by Forrester.
