Data for AI Is the New “Data Is The New Oil” — 4 Strategies for Effective Data Sourcing

Craig Suckling, UK’s Central Digital and Data Office Government Chief Data Officer

The cliche is well-worn. For decades we have heard about data being the new oil. In the past, many organizations invested in big data and data lakes. Often, these took a solution-first approach rather than a business value-led approach, resulting in a failure to deliver tangible value, turning lakes into swamps.

In the meantime, data growth has been relentless. Data volume is estimated to grow by 150% from 2023 to 2025, hitting 181 zettabytes. Even with this growth, however, we are quickly running out of text data.

AI research institute Epoch estimates we will exhaust all high-quality text data for training AI by 2026. AI products need data, and lots of it. Data is essential not only to train new foundation models (FMs) but also to provide context, reduce hallucinations, and fine-tune FMs.

There are four aspects to consider when rethinking your data-sourcing strategy for AI.

1. Revisit the high-effort, low-yield reservoirs of unstructured data

Many organizations are sitting on swathes of unstructured data that have built up over decades. This could be from customer forms and communications, emails, employee research, internal messaging, wikis, collaboration tools, procurement documents, and many other sources. What used to be low-yield unstructured data repositories are now viable sources of value.

Unstructured data has numerous use cases that can unlock this latent value. It can be crawled, indexed, and used to fine-tune large language models (LLMs) for enhanced research and collaboration tools, to automate the generation of documents and policies, to augment customer support with deeper context, or to create richer product and content recommendations.

By feeding unstructured data into AI systems, organizations can surface insights buried within and transform what was once seen as “data dredge” into a competitive advantage.

Unlocking this value is not without challenges. Unstructured data is often disorganized and siloed, and it may contain sensitive information. Companies will need to invest in data governance, curation, and AI filters to ensure proper security, compliance, and ethical use of this data.
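As a rough illustration of the curation and filtering step described above, the sketch below redacts two kinds of sensitive values from documents before they are indexed or used for fine-tuning. The patterns and the function names `redact` and `curate` are hypothetical and chosen only for this example; a real filter would need far broader coverage (names, addresses, identifiers) and review by governance teams.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def curate(documents: list[str]) -> list[str]:
    """Drop empty documents and redact the rest before downstream use."""
    return [redact(doc) for doc in documents if doc.strip()]


if __name__ == "__main__":
    sample = ["Contact jane.doe@example.com or +44 20 7946 0958.", "   "]
    print(curate(sample))
```

Regex-based redaction is only a first pass; it is shown here because it is easy to audit, not because it is sufficient on its own.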

2. Tap into new energy sources

We are increasingly moving towards multimodal user interfaces where customers and business users will engage with AI applications across text, video, and audio. As a result, organizations will generate more of this data and, in turn, will use it alongside text to build more advanced AI applications.

There are already many examples of the use of multimodal data across industries. Manufacturing companies use vision data to track parts alongside audio and vibration sensor data to predict defects. Media and entertainment organizations use their catalogs of video data alongside customer viewership data to inform future program formats and determine what is popular with viewers. 

Similarly, healthcare organizations combine unstructured research, text, and image data to accelerate R&D on new treatments and therapies.

As the adoption of AI continues to scale, multimodal data will become critical to extracting broader feature sets that fine-tune and enrich the intelligence of the AI models we deploy.

3. New trade agreements

Organizations need to widen the aperture of the data they source beyond the bounds of their business. Partnering on data is not a new concept, and many companies have access to 3rd-party data or established sharing agreements with cleanrooms and data exchanges. As AI continues to scale, this data becomes more valuable and will also extend to sharing AI model inference. Organizations invested in building foundation models are rapidly establishing new commercial agreements with social media organizations such as Reddit, image libraries like Shutterstock, and large media outlets like The New York Times. New business models are also emerging for licensing data as media companies move to protect their proprietary data from being improperly used in LLMs.

All organizations should revisit their approach to 3rd-party data acquisition and sharing with consideration for the safe use of proprietary data, and how this data adds value to new AI use cases such as product development, market research, and research collaboration.

4. Synthetic data

Generative AI provides the ability to create synthetic data for use in prototyping and testing and to train new AI models. For example, OpenAI's Sora model was reportedly trained in part on synthesized data generated from graphics engines like Unreal Engine 5. This opens up new possibilities for organizations to augment their real-world data with synthetically generated data tailored to their specific needs.

Computer-generated imagery, 3D modeling, and physics simulations can produce vast amounts of labeled data for computer vision, robotics, and other AI applications. Synthetic conversational data can also help language models learn more natural dialogue patterns.

Care must be taken to ensure synthetic data doesn't reinforce biases or inaccuracies present in the original training data, and a diverse mix of real and synthetic data will yield the best results. As generative AI capabilities continue advancing, the ability to manufacture custom, proprietary datasets on demand will become a competitive advantage.
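One way to picture the mix of real and synthetic data is the minimal sketch below. It keeps every real record, samples synthetic records up to a target share, and labels each item with its provenance so the blend can be audited later. The function name `mix_datasets`, the `synthetic_fraction` parameter, and the 30% default are assumptions made for this example, not recommendations from the article.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.3, seed=0):
    """Blend real and synthetic records into one shuffled training set.

    All real records are kept. Synthetic records are sampled so that they
    make up roughly `synthetic_fraction` of the result, capped by how many
    synthetic records are available.
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve s / (len(real) + s) = synthetic_fraction for s.
    target = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    target = min(target, len(synthetic))
    sampled = rng.sample(synthetic, target)
    # Tag each record with its source so bias checks can compare the two.
    mixed = [(r, "real") for r in real] + [(s, "synthetic") for s in sampled]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real_records = [f"real-{i}" for i in range(70)]
    synthetic_records = [f"synth-{i}" for i in range(100)]
    blend = mix_datasets(real_records, synthetic_records)
    n_synth = sum(1 for _, source in blend if source == "synthetic")
    print(len(blend), n_synth)
```

Keeping the provenance label on each record makes it possible to evaluate a model separately on real and synthetic slices, which is one way to check whether the synthetic portion is reinforcing bias.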

Note: The article was first published on the author’s LinkedIn blog. It has been republished with consent. The views expressed in this article are solely the author's and should not be associated with the employer.

About the Author:

Craig Suckling is Government Chief Data Officer at the UK Central Digital and Data Office (CDDO). With over 20 years as an analytics, data, and AI/ML leader, Suckling brings a wealth of experience and expertise in innovating for the future, developing transformation strategies, navigating change, and generating sustainable growth.

CDO Magazine
www.cdomagazine.tech