Opinion & Analysis

7 Critical Uses of Synthetic Data in Balancing AI Innovation and Privacy

Written by: Gokula Mishra

Updated 2:27 PM UTC, Wed February 28, 2024

This is the second article in a three-part series on synthetic data. The first article introduced what synthetic data is and how its use has changed over time. This one covers the use cases for synthetic data.

As adoption of AI and data-driven solutions grows, more and more countries around the globe are putting data protection regulations in place. Gartner predicts that “By 2024, 75% of the global population will have its personal data covered under privacy regulations.”

It is crucial to note that while synthetic data can be highly useful, it is not a perfect replacement for real data in all situations. The generated data may not capture all the nuances and complexities of the real world, potentially leading to limitations in certain applications of synthetic data. However, advances in synthetic data generation techniques are continually improving its quality and utility for various use cases.

There are many use cases for synthetic data such as:

Privacy protection: When dealing with sensitive or personal data, we need to anonymize it to comply with data privacy laws and regulations, which makes it easier to pursue AI/ML and other analytics projects. Synthetic data allows companies to share and work with data without the risk of exposing sensitive and private information.
Testing and development: Synthetic data can be used in software testing, model development, and algorithm tuning. It gives developers a way to simulate different scenarios without using real data, and to mimic production environments so that testing can run on realistic but artificially generated data.
Data set augmentation: Synthetic data can make your data “better than real” by adding missing pieces, fixing imbalances, increasing volume, etc.
Data sharing and collaboration: Sometimes, data owners might be hesitant to share their actual data due to privacy concerns or proprietary reasons. Synthetic data can be shared instead to foster collaboration between different entities.
Anonymization and de-identification: In cases where data needs to be shared publicly for research or analysis, synthetic data can be used to replace sensitive information, ensuring that individuals’ identities are protected.
Benchmarking and evaluation: Synthetic data can be used to create benchmark datasets for evaluating the performance of algorithms, models, or analytical tools.
Training ML models: In situations where acquiring large amounts of real data is expensive or time-consuming, synthetic data can be used to supplement the training process of machine learning models. This can help improve the model’s performance when real data is scarce.
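
To make the "fixing imbalances" idea from the data set augmentation item more concrete, here is a minimal Python sketch. It is an illustration only, not a method described in this article: it assumes a two-class data set with numeric features, and it creates synthetic minority rows by interpolating between pairs of real minority rows (a simplified, SMOTE-style approach without a nearest-neighbour search). The function name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(rows, labels, minority_label, seed=0):
    """Add synthetic minority rows until both classes have the same size.

    Assumes exactly two classes and numeric features. Each synthetic row is
    a random interpolation between two real minority rows.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = len(labels) - len(minority)
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")

    new_rows, new_labels = list(rows), list(labels)
    for _ in range(max(0, majority_count - len(minority))):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position between the two real rows
        new_rows.append([x + t * (y - x) for x, y in zip(a, b)])
        new_labels.append(minority_label)
    return new_rows, new_labels


# Hypothetical toy data: two numeric features, label 1 is the rare class.
rows = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [5.0, 6.0], [5.5, 6.4]]
labels = [0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = oversample_minority(rows, labels, minority_label=1)
print(balanced_labels.count(0), balanced_labels.count(1))  # prints: 4 4
```

Interpolated rows stay close to the real minority rows, which is why this kind of augmentation helps with volume and balance but cannot invent patterns that the real data never contained.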

Face Recognition: Face recognition is one of the most widely used computer vision applications. It is used for unlocking our smartphones, identifying a person at immigration or via surveillance cameras, and in various smartphone apps, for example, for adding friends on Facebook based on their facial images.

Synthetic data can significantly reduce the amount of real-world image/video data required to train deepfake video recognition models. It does so by generating synthetic facial images with diverse attributes, such as new poses, hair styles, different illumination conditions, and glasses or other accessories, for which real data may not be available.

Now one may ask: “Are synthetically generated datasets good enough for training face recognition models?” Please refer to this blog.

Similar synthetic data concepts are being used in many other applications, such as hands-on-wheel detection and drowsy driver detection. Applications such as eye-gaze-on-the-road detection that are built using synthetic data can increase the reliability of driving in autonomous vehicles.
Financial services examples: Synthetic data can be used to simulate various economic conditions and market fluctuations to enhance risk and decision models. It can also be used to train and enhance fraud detection models, by generating varieties of synthetic data that simulate various types of financial fraud scenarios.

Synthetic data can also be used to simulate various market conditions and outliers to enhance the robustness of trading algorithms.
Healthcare examples: Synthetic data is used to simulate patient data for clinical trials, enabling more efficient initial testing phases and protocol development.
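
As a rough illustration of the financial services example above (simulating market conditions and outliers to test trading algorithms), the Python sketch below generates a synthetic price path as a geometric random walk and optionally injects a one-off shock. This is a deliberately simple assumption made for illustration, not a production market model and not a technique prescribed in this article. The function name `simulate_prices` and all parameter values are hypothetical.

```python
import math
import random


def simulate_prices(start_price, days, daily_drift, daily_vol,
                    shock_day=None, shock_return=0.0, seed=0):
    """Simulate a synthetic price path as a geometric random walk.

    Optionally applies a one-off shock (for example -0.2 for a 20% drop)
    on shock_day to represent an outlier event.
    """
    rng = random.Random(seed)
    prices = [start_price]
    for day in range(1, days + 1):
        log_return = rng.gauss(daily_drift, daily_vol)
        price = prices[-1] * math.exp(log_return)
        if day == shock_day:
            price *= 1.0 + shock_return
        prices.append(price)
    return prices


# Same seed for both paths, so they differ only by the injected shock.
calm = simulate_prices(100.0, days=250, daily_drift=0.0003, daily_vol=0.01)
crash = simulate_prices(100.0, days=250, daily_drift=0.0003, daily_vol=0.01,
                        shock_day=120, shock_return=-0.2)
print(round(crash[120] / calm[120], 2))  # prints: 0.8
```

Running a trading algorithm against both the calm path and the crash path is one way to see how it behaves under a rare event that may be missing from the available historical data.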

There are many more use cases of synthetic data generation techniques across many industries. I am unable to cover all the use cases here, but I have started a LinkedIn group to facilitate discussion on synthetic data and its expanding applications/use cases in many industries.

It is important to note that different use cases require different synthetic data technology (i.e., just as there is no single cybersecurity tool, there is no single synthetic data tool). For example, data set augmentation requires a lot of technical customizability. However, technical customizability creates privacy risks, making highly customizable tools a poor fit for data sharing.

The most common use case for synthetic data right now is privacy and compliance. Around 54% of CDO survey respondents list regulatory and ethical issues as a primary barrier to adoption.

Only 4% of Large Enterprises Have Generative AI Projects in Production (Source: AlphaWise, Morgan Stanley Research 2Q23 CIO Survey).

The key success factors are:

Simplifying data governance and access controls to enable data democratization
Strengthening third-party risk management and reducing time-to-value for new tools
Reducing regulatory risk exposure through anonymized data safe harbors

To adopt synthetic data at enterprise scale, organizations need to address the following:

Resolve the legal challenge unambiguously and automatically
Automate data quality: high-quality data without a team of synthetic data experts
Use standard connectors, not custom integrations: the existing data stack should still work

There are a number of vendors emerging in this space such as Subsalt, Cognida.ai, Ydata, Tonic, Mostly.ai, Gretel, Datomize, GenRocket, Betterdata, etc., alongside open-source products such as Synner, Datagene, mirrorGen, etc.

In the third and final article, I will cover some of the technologies behind the creation of synthetic data. I want to emphasize that the generation of synthetic data is only one part of the picture: its management, deployment, governance, versioning, and testing of its validity for its intended purpose are equally important.

About the Author:

Gokula Mishra is Chief Editorial Reviewer of the CDO Magazine Editorial Board, former VP of Data Science & AI/ML at Direct Supply and Head of Data Analytics and Supply Chain globally at McDonald’s. He brings 30+ years of data analytics and AI/ML experience across many industries in creating lasting business value internally and externally.