
Guide to Synthetic Data
Synthetic data is AI-generated information that is created from scratch using AI models, rather than collected from any real-world source.
Lisa Lee, Contributing Editor
What if the best way to use your data is to not use it at all?
Artificial intelligence (AI) needs enormous amounts of data to deliver results, but using real data to train AI means navigating a minefield of privacy, security, and compliance risks that can erode trust. To move forward safely, gas up your AI with a new kind of fuel: synthetic data. It’s got the context and complexity of real data, with almost none of the risk.
Synthetic data is AI-generated information that is created from scratch using AI models, rather than collected from any real-world source, yet it mimics the patterns and characteristics of real-world data.
The synthetic patterns are a realistic simulation of an original, real dataset, and maintain the most important characteristics of it. Imagine a modern-day artist who studies every painting Renoir ever created — the composition, texture, mood — and then creates a brilliant new work of art. It looks like an authentic Renoir, but it’s an original, not a copy.
Synthetic data is like that. Using AI models, data experts train an algorithm on a real dataset. The algorithm learns its underlying patterns, and generates a new, artificial dataset.
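To make the "learn the patterns, then generate" idea concrete, here is a minimal sketch. It assumes a tiny, made-up table of deals (the `REAL_DEALS` values are hypothetical) and swaps in a simple statistical model, a normal distribution plus a straight-line fit, in place of the far richer AI models used in practice.

```python
import random
import statistics

# Hypothetical "real" dataset: (deal_size_usd, days_to_close) pairs.
REAL_DEALS = [
    (12_000, 30), (18_000, 41), (25_000, 48), (31_000, 62),
    (40_000, 71), (52_000, 90), (60_000, 97), (75_000, 118),
]

def fit(rows):
    """Learn two simple patterns: the spread of deal sizes, and how
    days_to_close tracks deal size (a least-squares line plus noise)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in rows]
    return {
        "mean_x": mean_x,
        "std_x": statistics.stdev(xs),
        "slope": slope,
        "intercept": intercept,
        "std_resid": statistics.stdev(residuals),
    }

def generate(model, n, seed=0):
    """Sample new rows from the learned model instead of copying real rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Toy model: a normal distribution can produce unrealistic values
        # (even negative ones), which a production generator would guard against.
        x = rng.gauss(model["mean_x"], model["std_x"])
        y = model["intercept"] + model["slope"] * x + rng.gauss(0, model["std_resid"])
        rows.append((round(x), round(y)))
    return rows

model = fit(REAL_DEALS)
synthetic_deals = generate(model, n=1000)
```

The synthetic rows follow the same broad relationship as the originals (bigger deals tend to take longer to close), but each one is drawn from the model rather than lifted from the real table.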
Synthetic data is emerging as a critical input for enterprise AI agents, helping them handle multi-step, complex tasks instead of sticking to narrow, scripted ones.
In some industries, the data the AI agents need is locked away for privacy reasons, or it may not capture every situation the agent might come across. Synthetic data solves this by creating safe, realistic training examples at scale, including even rare events that don’t show up all the time in real life.
Synthetic data can also keep AI agents sharp by running “what if” scenarios (fraud risks, system outages, etc.) without the risk. You can test, tweak, and improve the agent’s responses before those situations happen in production.
As Jason Wu, director of Salesforce AI Research, explains, synthetic data plays a key role in making sure AI agents perform the way you need them to.
“AI agents need to be immersed in realistic business scenarios,” he writes. “An AI agent trained solely on general internet data would struggle to navigate the complexities of a customer relationship management (CRM) system or supply chain.”
Synthetic data is revolutionizing AI agent training by removing one of the biggest bottlenecks in the process: getting enough real-world data. Compared to the expensive and time-consuming process of manual collection and labeling, synthetic data can be generated in massive volumes at a fraction of the cost. But the benefits extend even further.
As Wu explains, by populating a simulated environment with synthetic records like (fake) customer accounts, leads, and opportunities, agents can be trained to:
Real data is expensive, slow, and risky. Synthetic data flips the script—cutting costs, speeding up projects, and keeping privacy intact. Here’s why it’s such a game-changer:
Synthetic data does not require expensive manual collection. You don’t need to pay humans for data entry, labeling, annotation, and validation of synthetic data.
Large volumes of synthetic data can be generated very quickly, giving companies immediate valuable data to train their AI models.
Consider automated data labeling, which uses machine learning to automatically tag and annotate data. A recent study found that automated labeling can reduce annotation costs by orders of magnitude while maintaining accuracy. Labeling 3.4 million objects on a single NVIDIA graphics processing unit costs $1.18 and takes just over an hour. Manually labeling the same dataset via AWS SageMaker, a platform for building, training and running AI models, would cost about $124,000 and take nearly 7,000 hours.
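The "orders of magnitude" claim can be checked with quick arithmetic on the figures quoted above. The only assumption in this sketch is rounding "just over an hour" to 1 hour and "nearly 7,000 hours" to 7,000.

```python
import math

# Figures quoted in the study above.
objects_labeled = 3_400_000
automated_cost_usd = 1.18
manual_cost_usd = 124_000

# Assumption: rounded from "just over an hour" and "nearly 7,000 hours".
automated_hours = 1
manual_hours = 7_000

cost_ratio = manual_cost_usd / automated_cost_usd
time_ratio = manual_hours / automated_hours
orders_of_magnitude = math.log10(cost_ratio)

print(f"Manual labeling costs about {cost_ratio:,.0f}x more")
print(f"That is about {orders_of_magnitude:.1f} orders of magnitude")
print(f"Manual labeling takes about {time_ratio:,.0f}x longer")
print(f"Manual cost per object: ${manual_cost_usd / objects_labeled:.4f}")
```

On these numbers, manual labeling is roughly 105,000 times more expensive, or about five orders of magnitude.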
Training on synthetic data helps make this kind of automated labeling possible. That speed shortens development cycles, so teams can build and ship more models in the same amount of time.
Synthetic data creates non-identifiable datasets, which do not contain any information that can directly or indirectly identify an individual. This eases compliance with ethical and regulatory rules. That’s not to say that it’s a shortcut to responsible and ethical AI, however.
Certain scenarios are more easily simulated in synthetic data, which reduces risk and cost. Imagine you need to collect a large number of medical images from real patients with different conditions. This is time-consuming, labor intensive and expensive. Synthetic data can simulate various conditions without needing to collect new data from real patients each time.
Everyone knows there’s a right tool for every job. You wouldn’t use a hammer to tighten a screw, right? In the same way, it’s important to understand that not all synthetic data is the same. Creating the wrong type could lead to misleading results, wasted resources, or an AI model that works well in testing but poorly in real world scenarios. There are three types of synthetic data:
Fully synthetic data is generated entirely from scratch by AI or machine learning algorithms, and the resulting dataset contains no real records. Instead of mining actual customer information and anonymizing it, a model learns the patterns and relationships in the real data and generates new records that reflect them without replicating the actual information.
This is especially helpful when real data is incomplete or, more importantly, when maximum privacy is required. For example, say a hospital needs to share patient data for a health study. Federal laws such as the Health Insurance Portability and Accountability Act, or HIPAA, tightly restrict how identifiable patient information can be used and shared. Instead, AI can analyze the real data to generate a fully synthetic dataset. The health study now has realistic patient profiles, but none of them tie back to an actual person.
Partially synthetic data replaces only the sensitive fields in a real dataset and leaves the rest original. The primary reason for going this route is to anonymize a dataset by removing personally identifiable information.
For example, a marketing department wants to analyze its customer database to uncover trends. The database includes names, email addresses, ages, purchase histories, and phone numbers. To protect customer privacy, the team creates a partially synthetic version that keeps non-sensitive information like ages and purchase histories, but replaces real names, emails, and phone numbers with fake ones.
This is also known as data masking. It’s one part of Salesforce’s Trust Layer, a set of features that protect the privacy of company data.
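The marketing example can be sketched in a few lines. Everything here is hypothetical: the customer records, the name lists, and the `mask_record` helper are made up for illustration and are not how any particular product implements masking.

```python
import random

# Hypothetical pools of fake names to draw from.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan", "Casey"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak", "Haddad", "Lindqvist"]

def mask_record(record, rng, index):
    """Return a copy with name, email, and phone replaced by fake values."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    masked = dict(record)  # non-sensitive fields are carried over unchanged
    masked["name"] = f"{first} {last}"
    # example.com is a domain reserved for documentation and examples.
    masked["email"] = f"{first}.{last}.{index}@example.com".lower()
    # 555-0100 through 555-0199 are set aside for fictional use.
    masked["phone"] = f"555-01{rng.randrange(100):02d}"
    return masked

def mask_dataset(records, seed=0):
    rng = random.Random(seed)
    return [mask_record(r, rng, i) for i, r in enumerate(records)]

# Hypothetical customer records.
customers = [
    {"name": "Dana Whitfield", "email": "dana@realmail.test",
     "phone": "415-555-2671", "age": 34, "purchases": ["laptop", "dock"]},
    {"name": "Priya Raman", "email": "priya@realmail.test",
     "phone": "212-555-8890", "age": 51, "purchases": ["headphones"]},
]

masked_customers = mask_dataset(customers)
```

Ages and purchase histories survive untouched, so trend analysis still works. Keep in mind that in a small dataset, a combination of "non-sensitive" fields can sometimes still point to one person, so deciding which fields count as sensitive deserves care.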
Hybrid synthetic data is created by adding new synthetic records to an existing set of real data. Unlike fully synthetic data, nothing is replaced: the original, real dataset stays intact, and brand-new artificial records are added to it.
Companies would use this hybrid approach to augment data for machine learning models. It is helpful to fix imbalances where one category is underrepresented. Consider a bank building an AI model to spot fraud. In their real data, fraudulent transactions are extremely rare. A model trained on this imbalance would be poor at spotting fraud. To solve this, they generate thousands of new, realistic synthetic fraud examples and add them to the original dataset. It’s now more balanced, and allows the AI model to learn the patterns of fraud more effectively.
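Here is a rough sketch of the bank example. The transaction features and values are invented, and the generator is a deliberately simple stand-in: it blends random pairs of real fraud examples, loosely in the spirit of interpolation-based oversampling methods such as SMOTE, rather than using a trained AI model.

```python
import random

def synthesize_minority(examples, n_new, seed=0):
    """Create new points by interpolating between random pairs of real examples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(examples, 2)
        t = rng.random()  # how far to move from a toward b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical features: (amount_usd, hour_of_day, distance_from_home_km)
rng = random.Random(42)
legit = [
    (rng.uniform(5, 300), float(rng.randrange(8, 23)), rng.uniform(0, 40))
    for _ in range(997)
]
fraud = [(950.0, 3.0, 800.0), (1200.0, 2.0, 1500.0), (700.0, 4.0, 650.0)]

# Real dataset: label 0 = legitimate, label 1 = fraud (3 in 1,000 rows).
real_dataset = [(row, 0) for row in legit] + [(row, 1) for row in fraud]

# Hybrid dataset: every real row is kept, and synthetic fraud rows are added.
synthetic_fraud = synthesize_minority(fraud, n_new=500)
hybrid_dataset = real_dataset + [(row, 1) for row in synthetic_fraud]

fraud_share_before = sum(label for _, label in real_dataset) / len(real_dataset)
fraud_share_after = sum(label for _, label in hybrid_dataset) / len(hybrid_dataset)
```

The real rows are never modified. The only change is that the fraud class is no longer a rounding error, which gives a model more examples to learn from.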
Your CRM is a gold mine of customer data, but the privacy protections around it can mean that your most valuable insights are locked away. Imagine you want to train an AI agent to predict which sales leads will close — a task that requires lots of historical data. The traditional approach of sending this raw, sensitive information to external developers for model development creates a security risk.
This is the challenge synthetic data solves. By generating a statistically similar replica of your CRM data, one that preserves the patterns in deal size and time to close but contains no real customer information, you get a safe, realistic dataset. This allows your teams to innovate fast, building powerful predictive models while keeping your customer data and your customers’ privacy secure.
By generating the data needed to prepare your AI and your human teams, you can create and test for rare scenarios that are difficult to find in real data, such as a product failure or your sales team’s reaction to a steep market dip.
Despite its power, synthetic data is not entirely risk-free. Here are some key considerations to keep in mind:
First, synthetic data will faithfully replicate any biases living in your real data. If your historical data is flawed, your synthetic data will be a perfect, flawed copy, so the "garbage in, garbage out" rule still applies.
Next is something called model collapse. That’s what happens when you retrain the same model over and over again on its own output. Because the model keeps learning from what it already knows rather than from fresh information, its outputs degrade: eventually it produces answers that are off topic, unrelated to the question, or in some cases incoherent streams of characters.
One way to mitigate that is through a process called retrieval augmented generation (RAG), which retrieves relevant outside content and includes it in the prompt so that answers are grounded in a particular topic. The added context gives the model fresh material to draw on, so it can produce synthetic data without relying only on its training data.
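As a rough illustration of the RAG idea, the sketch below picks the most relevant notes for a task and places them in the prompt. The `DOMAIN_NOTES`, the word-overlap ranking, and the prompt wording are all assumptions made for this example. Real systems typically use embeddings and a vector search for retrieval, and the step that sends the prompt to a model is left out.

```python
import re

# Hypothetical domain notes that a retrieval step could draw from.
DOMAIN_NOTES = [
    "A lead becomes an opportunity once a sales rep qualifies it.",
    "Opportunities move through stages such as prospecting, negotiation, and closed won.",
    "A support case is escalated when it breaches its response-time target.",
]

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(task, documents, k=2):
    """Put the retrieved context ahead of the task so the output is grounded in it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(task, documents, k))
    return (
        "Use only the context below when writing the records.\n"
        f"Context:\n{context}\n\n"
        f"Task: {task}\n"
    )

prompt = build_prompt(
    "Generate five fictional opportunity records with realistic stages",
    DOMAIN_NOTES,
)
```

The same pattern applies to domain-specific content in general: whatever is retrieved into the prompt becomes extra material the model can use beyond its training data.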
Another challenge is domain specificity. Certain industries may have nuances that are hard to capture in synthetic data, which degrades the relevancy of outputs. However, if part of the synthetic data creation process included additional domain-specific content, it could allow the model to supplement its knowledge, and produce synthetic data specific to a domain.
Think of synthetic data as a clean slate for AI training. Instead of wrangling mountains of real data (with all its privacy issues, compliance risks, and hard-to-identify edge cases) you can simply create the data you need.
This flips the old model on its head. AI used to be limited by the data you had. Now, with synthetic data, you can generate the data you wish you had, including filling in gaps, mitigating biases, and simulating rare events that real data would never cover.
Bottom line? Synthetic data is a breakthrough. It changes the question from “Do we have enough data?” to “What do we want our data to do?”
Lisa Lee is a contributing editor at Salesforce. She has written about technology and its impact on business for more than 25 years. Prior to Salesforce, she was an award-winning journalist with Forbes.com and other publications.
Synthetic data is artificially generated data that mimics real-world data but is not collected from real-world sources. It is created to serve various purposes, such as testing, training machine learning models, and enhancing data privacy.
Using AI models, data experts train an algorithm on a real dataset. The algorithm learns its underlying patterns, and generates a new, artificial dataset.
Benefits of synthetic data include reduced costs, increased data availability, improved privacy, and the ability to create diverse and controlled datasets for testing and training purposes.
Synthetic data is used in a variety of applications, including testing software, enhancing data privacy, and simulating scenarios in fields like healthcare, finance, and autonomous vehicles.
What are the challenges and limitations of synthetic data?
Challenges include avoiding biases carried over from the real data, model collapse from repeatedly training a model on its own output, and domain specificity, meaning the difficulty of generating data that captures all the nuances of a specific industry.
Synthetic data is based on real data so that it has the context and complexity of real data, with almost none of the risk.
The three main types of synthetic data are fully synthetic data, partially synthetic data, and hybrid synthetic data. Fully synthetic data is generated entirely from scratch by AI or machine learning algorithms. Partially synthetic data replaces only sensitive real data, while the rest remains original. Hybrid synthetic data is created by adding new synthetic records to an existing set of real data.
In machine learning, synthetic data is used to train models when real data is scarce or expensive to obtain. It helps in creating large, diverse datasets that can improve model performance and robustness.
Synthetic data enhances privacy by allowing the creation of data that resembles real data without containing sensitive or personally identifiable information. This reduces the risk of exposing personal data in a breach and helps with compliance with data protection regulations.