AI agents are increasingly used to automate complex enterprise tasks, ranging from customer service interactions to sophisticated data analysis and workflow management. Their importance lies in their ability to drive significant efficiency gains, reduce human error, and enable organizations to scale their operations and services effectively. Yet, as documented in previous studies [1, 2], significant hurdles remain in developing and testing AI agents for enterprise applications.
The two core challenges are:
Data Availability and Compliance – Privacy regulations such as GDPR prohibit using real production data to develop and test AI agents. Sanitized or masked data is safer but strips out the complexity and volume that testing requires, risking an overly optimistic view of an agent’s performance.
Agent Testing and Validation – It is difficult to generate testing queries grounded in the relevant data at scale. The queries must be comprehensive yet highly relevant to the use case, and they must come with ground-truth answers for agent benchmarking.
Simulation environments, such as the Org Data Generator (ODG), solve these problems by generating synthetic, use-case-specific CRM data that replicates production complexity without privacy risks. ODG enables the automatic creation of complex queries and ground-truth answers. The data is loaded into a Salesforce org that serves as a simulation environment, allowing rigorous benchmarking and confirmation of agent readiness before deployment.
ODG’s value extends beyond AI agent testing, significantly benefiting Solution Engineers (SEs) and Forward Deployed Engineers (FDEs) across the sales lifecycle. For pre-sales demonstrations, SEs can use ODG to quickly generate custom synthetic orgs tailored to a prospect’s industry and schema, letting them showcase Salesforce products and integrated AI agents with highly relevant data. In post-sales solution validation and training, FDE teams can create secure, disposable training environments with complex, scenario-specific data, enabling customers to safely practice using the new Salesforce implementation and speeding up adoption. FDEs can also use these synthetic orgs to quickly replicate and test fixes for specific customer issues without risking live data.
Org Data Generator
The Salesforce Org Data Generator is a service designed to provide a Salesforce simulation environment that facilitates the development and testing of AI agents. The simulation environment comprises three critical components – synthetic, realistic, use-case-specific data; testing queries; and the agent to be benchmarked and optimized. A simulation environment serves as a gym for your agent – just as athletes train in a gym to improve their performance, an agent uses the simulation environment to improve its own. We began with the foundational research published in our papers CRMArena and CRMArena Pro, and subsequently transitioned to applied work to build a product.
Users provide a brief description of their data simulation use case, including the schema of their Salesforce organization, the company name and description associated with the org, and, optionally, the company’s knowledge documents. The schema defines all the objects (which are like tables in a database, such as Account, Contact, or Custom_Object__c), the fields (the columns of data on those objects, like Name, Phone, or Status__c), and the relationships between them, such as lookup relationships. Based on the schema, synthetic data and testing queries are generated using a multi-stage pipeline involving LLMs, validated against Salesforce validation rules, and uploaded to a Salesforce scratch org. The queries can then be used to benchmark the Agentforce agent and other AI agents. Users also have the flexibility to upload this synthetic data to their existing orgs or to select an existing Salesforce schema, such as those from Service, Sales, or Health Cloud. The key point is that the generated data is completely privacy-preserving, with no access to customer production data.

Synthetic Data Generation

The synthetic data to be generated is both structured and unstructured. The structured data includes records for objects such as Accounts and Orders, and must adhere to proper foreign key dependencies and Salesforce validation rules. The unstructured data encompasses knowledge articles, voice calls, chat transcripts, and similar content.
The generation pipeline for synthetic data and testing queries is designed to ensure realism, complexity, and grounding within the specific Salesforce environment. This process begins with leveraging LLMs to interpret the user’s provided use-case context, including the organizational schema, company description, and optionally, knowledge documents.
Users are required to provide the data schema as a JSON file, which can include both standard and custom objects and fields. Within this schema file, users must also define the dependencies between objects and fields. Furthermore, a concise description of all custom objects is required, and users have the option to specify any particular requirements for a given object.
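To make this concrete, here is a minimal sketch of what such a schema might contain, expressed as the equivalent Python dictionary. The key names (`objects`, `fields`, `parents`, `description`) are illustrative simplifications, not the service’s actual file format:

```python
# Hypothetical, simplified schema for illustration only.
schema = {
    "objects": {
        "Account": {"fields": ["Name", "Industry", "Phone"], "parents": []},
        "Contact": {"fields": ["FirstName", "LastName", "AccountId"],
                    "parents": ["Account"]},
        "Case": {"fields": ["Subject", "Status", "ContactId"],
                 "parents": ["Contact"]},
        # Custom objects require a concise description.
        "Shipment__c": {
            "description": "Tracks outbound shipments for customer cases.",
            "fields": ["Tracking_Number__c", "Status__c", "Case__c"],
            "parents": ["Case"],
        },
    }
}
```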
The initial step involves validating the user-provided schema to ensure there are no cyclical dependencies among the objects. Following successful validation, a topological sort is applied to determine the correct sequence for generating data across all objects.
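Building on the simplified schema sketch above, the validation and ordering step can be illustrated with Kahn’s algorithm, which detects cycles and yields a parents-before-children generation order (a sketch, not the production implementation):

```python
from collections import deque

def generation_order(schema: dict) -> list[str]:
    """Reject cyclical schemas and return a parents-first generation order."""
    objects = schema["objects"]
    in_degree = {name: len(spec["parents"]) for name, spec in objects.items()}
    children = {name: [] for name in objects}
    for name, spec in objects.items():
        for parent in spec["parents"]:
            children[parent].append(name)

    queue = deque(name for name, deg in in_degree.items() if deg == 0)
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for child in children[current]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    if len(order) != len(objects):  # some objects never reached in-degree 0
        raise ValueError("Cyclical dependency detected in schema")
    return order  # e.g. ["Account", "Contact", "Case", "Shipment__c"]
```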
When generating data for a specific object, the process first determines whether the object depends on any parent objects. If parent objects exist, they are further categorized into two types: looping parent objects and sampling parent objects.
Looping parent objects require the child object to have one or more records corresponding to every record in the parent object. For instance, the PricebookEntry object requires a record for every combination of records from the Pricebook2 and Product2 objects, meaning each Product2 record must be associated with each Pricebook2 record to create a PricebookEntry record. In this scenario, both Pricebook2 and Product2 are looping parent objects.
In contrast, sampling parent objects do not require the child object to have a record for every single record in the parent object. An example is the Contact object acting as a sampling parent for the Case object, since not every Contact needs to have created a Case regarding a product or service complaint.
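The sketch below shows how the two parent types drive record creation: looping parents contribute every combination of their records, while sampling parents contribute only a random subset (the function and parameter names are illustrative):

```python
import itertools
import random

def parent_inputs(looping_parents, sampling_parents, records_by_object,
                  sample_size=3):
    """Yield one input bundle per child-record generation call.

    Each combination of looping-parent records (e.g. every Product2 x
    Pricebook2 pair for PricebookEntry) yields a bundle; sampling parents
    (e.g. Contacts for Case) contribute only a random subset of records.
    """
    looping_sets = [records_by_object[p] for p in looping_parents]
    for combo in itertools.product(*looping_sets):  # one empty combo if none
        sampled = {p: random.sample(records_by_object[p],
                                    min(sample_size, len(records_by_object[p])))
                   for p in sampling_parents}
        yield {"looping": dict(zip(looping_parents, combo)),
               "sampled": sampled}
```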
Two pieces of information are predefined: the classification of looping and sampling parent objects described above, and the Salesforce validation rules for standard objects that the LLM must adhere to during data generation. The pipeline scales readily to additional standard objects; the user only needs to pre-define the looping and sampling parent objects and the generation rules for the new object.
For custom objects, all parent objects are initially treated as sampling parent objects. However, users retain the flexibility to pre-define their custom objects with specific looping and sampling rules, similar to how standard objects are handled.
For objects that do not have any parent objects, data is generated directly in batches. This process is guided by all general and object-specific rules. Following generation, a deduplication step is performed to remove any records that are found to be too similar.
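As an illustration, the deduplication step could compare records by embedding similarity and drop near-duplicates; the cosine threshold and the `embed` function here are assumptions, not the service’s actual settings:

```python
import numpy as np

def deduplicate(records, embed, threshold=0.95):
    """Keep a record only if it is not too similar to any record already kept.

    `embed` is assumed to map a record to a unit-normalized vector, so the
    dot product below is cosine similarity.
    """
    kept, kept_vecs = [], []
    for record in records:
        vec = embed(record)
        if all(float(np.dot(vec, kv)) < threshold for kv in kept_vecs):
            kept.append(record)
            kept_vecs.append(vec)
    return kept
```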
When generating data for objects with parent dependencies, the system utilizes a single record from the looping parent objects and samples records from the sampling parent objects. This input is used to generate one or potentially multiple records for the dependent object.
If the user provides the company’s knowledge documents, a retrieval-augmented generation (RAG) pipeline retrieves the document chunks most relevant to each object’s data generation. Grounding generation in these documents anchors the synthetic data in the organization’s specific terminology, policies, and operational context. This is crucial for creating synthetic data that is not only realistic but also accurate and contextually appropriate for training and testing applications tailored to that company’s ecosystem.
For example, a company might have a knowledge document detailing specific product names, internal project codes, or customer service escalation procedures. When generating synthetic data for a CRM object like a Case or Opportunity, the RAG pipeline would retrieve this specific information. Instead of generating a generic product name, the system would use one from the knowledge documents, making the synthetic data much more relevant and useful for training a model that understands the company’s actual catalog.
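A minimal sketch of this grounding step, assuming a generic `retriever` object with a `search(query, k)` method over chunked knowledge documents (the prompt wording is also illustrative):

```python
def grounded_generation_prompt(object_name: str, retriever, k: int = 5) -> str:
    """Build a data-generation prompt grounded in company document chunks."""
    chunks = retriever.search(
        f"products, policies, and terminology relevant to {object_name} records",
        k=k,
    )
    context = "\n\n".join(chunk.text for chunk in chunks)
    return (
        f"Using only the company context below, generate realistic "
        f"{object_name} records.\n\nCompany context:\n{context}"
    )
```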
As a final step in data validation, the system employs deterministic checks to ensure consistency between each generated record and its parent and ancestor records.
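One such deterministic check, sketched below, verifies that every foreign key on a generated record points at a real parent record (field naming is simplified):

```python
def referential_integrity_violations(child_records, fk_field, parent_records,
                                     parent_key="Id"):
    """Return child records whose foreign key matches no parent record."""
    parent_ids = {rec[parent_key] for rec in parent_records}
    return [rec for rec in child_records if rec.get(fk_field) not in parent_ids]

# Example: orphaned Cases whose ContactId matches no generated Contact.
# orphans = referential_integrity_violations(cases, "ContactId", contacts)
```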
We assessed and benchmarked the quality of the data generated by various out-of-the-box (OOTB) LLMs across different industries using manual human annotations. Through these benchmarking efforts, we automatically identify the most relevant industry for a user’s request and select the best-performing LLM for that task. For example, Gemini models may demonstrate superior performance in generating data for the healthcare industry, while OpenAI models might be better suited to the manufacturing industry.
Furthermore, to ensure both the quality and diversity of the generated data, we incorporate latent variables. Replicating the implicit causal relationships found in real-world data is a significant challenge; latent variables address it by simulating underlying factors, producing data that reflects the subtle dependencies and patterns inherent in authentic CRM databases. These latent variables are generated by LLMs, which take the input schema into consideration. For instance, within a Health Cloud context, a latent variable such as “Patient Severity,” which encompasses different categories of illness, severity, and complexity, can influence records in objects such as Encounters, Claims, or the Providers responsible for the patient’s care.
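As an illustration of the mechanism, a latent state can be sampled once per entity and threaded through the generation prompts of every dependent object, so one hidden factor shapes all related records (the variable name and categories below are hypothetical):

```python
import random

# A latent variable an LLM might propose for a Health Cloud schema (hypothetical).
PATIENT_SEVERITY = ["routine", "chronic", "acute", "critical"]

def assign_latent_state(patient_id: str) -> dict:
    """Sample one latent state per patient and reuse it across objects."""
    return {"patient_id": patient_id, "severity": random.choice(PATIENT_SEVERITY)}

def latent_prompt_hint(state: dict, object_name: str) -> str:
    """Inject the shared latent state into an object's generation prompt."""
    return (f"The patient behind this {object_name} record has "
            f"'{state['severity']}' severity; keep encounter frequency, claim "
            f"amounts, and provider specialties consistent with that severity.")
```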
Agent Testing Queries Generation
A variety of testing queries, along with their ground-truth answers, are generated for robust testing of AI agents (see our blog Evaluate LLM Agents for Enterprise Applications with CRMArena-Pro to read more on this topic). These queries are grounded in the synthetic data. Given the database objects and agent configurations, an LLM first generates the templates (use cases) needed to test the agent across different scenarios. User queries are then generated for each template.
The queries fall into three primary categories – database lookup queries, RAG over synthetically generated knowledge articles, and trust and safety queries. For database querying tasks, a random record is fetched from the database. Then, using the database and table relationships, the LLM traverses the dependencies to generate natural-language questions and corresponding SQL queries. The SQL queries are subsequently validated to confirm that they retrieve the correct ground truth (the initially fetched record). For RAG-based queries, a RAG system is built over the synthetically generated knowledge articles, and RAGAS is employed to synthesize the query-answer pairs.
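The validation of database-querying tasks can be sketched as follows, assuming the synthetic data has been mirrored into a local SQLite database for fast checking (the mirroring itself is omitted here):

```python
import sqlite3

def validate_generated_query(db_path: str, sql: str, ground_truth_id: str) -> bool:
    """Accept a generated SQL query only if it retrieves exactly the seed record.

    `ground_truth_id` is the Id of the randomly fetched record the query was
    generated around; this sketch assumes the SQL selects Id as its first column.
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    return len(rows) == 1 and rows[0][0] == ground_truth_id
```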
For trust and safety, the queries cover the major relevant categories – prompt injection, privacy preservation, and toxicity. The originally generated queries are transformed into unsafe variants using DeepSeek models. Various combinations of privacy-preserving queries are also generated, such as a query in which user A requests the order details of user B.
Queries are classified as either answerable or unanswerable. Most of the queries pertaining to trust and safety are intentionally unanswerable. The remaining unanswerable queries arise from introducing information not present in the database, such as asking for nonexistent order details.
Each query comes with ground truth answers and essential metadata, including its context, complexity, answerability, and the objects needed to answer it. The query’s context, detailing the user asking the question, is the most crucial piece of metadata.
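Concretely, each generated test case might be represented by a structure like the following (field names are illustrative, not the service’s actual format):

```python
from dataclasses import dataclass, field

@dataclass
class TestQuery:
    """One benchmarking query with its ground truth and metadata (sketch)."""
    text: str               # the natural-language user query
    ground_truth: str       # expected answer, or a refusal if unanswerable
    context: dict           # who is asking, e.g. {"user": "A", "role": "customer"}
    category: str           # "database", "rag", or "trust_and_safety"
    complexity: str         # e.g. "single-object" vs. "multi-hop"
    answerable: bool = True
    objects_needed: list = field(default_factory=list)  # e.g. ["Contact", "Case"]
```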
Deployment – Setting Up Scratch Orgs and Uploading to a Salesforce Org
Salesforce scratch orgs are automatically provisioned and configured according to the user-provided schema. This setup involves deploying custom objects and fields and correctly establishing the necessary permissions.
Once the org is configured as specified by the schema, synthetic data is generated and uploaded using Salesforce’s bulk upload API.
For objects that are dependent on others, we first retrieve the records of the parent objects, which have auto-generated Salesforce IDs. We use these IDs to correctly remap the foreign key fields in the dependent object records. This process is essential to ensure accurate mapping and integrity across all related records.
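A sketch of this remapping, assuming each generated record carries a temporary local reference that parent uploads resolve to real Salesforce IDs (the `parent_id_map` convention is an assumption of this sketch):

```python
def remap_foreign_keys(child_records, fk_field, parent_id_map):
    """Rewrite each child's foreign key from a local reference to a Salesforce ID.

    `parent_id_map` maps the temporary references used during generation to the
    auto-generated Salesforce IDs returned when the parents were uploaded.
    Uploads follow the topological order, so parent IDs always exist first.
    """
    for record in child_records:
        record[fk_field] = parent_id_map[record[fk_field]]
    return child_records
```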
POC: Agent Optimization
As a proof of concept to demonstrate the effectiveness of these simulation environments, this service was utilized to optimize an agent for one of the sample companies. First, a synthetic org was created using the service, comprising synthetic data and testing queries. The default OOTB Agentforce agent was then benchmarked on this org, showing an accuracy of nearly 35%. Failure scenarios were identified by analyzing the logs of failed queries, revealing three types of issues.
First, in some cases, the Agentforce agent had the correct set of topics and actions configured but classified the query into the wrong topic or action. Because user queries are often vague, this problem can be resolved by optimizing the topic and action instructions.
Second, sometimes the actions were not configured correctly; for instance, the underlying flows might only be templates. Consequently, even if an agent correctly classified the topic and action, it failed to provide the correct result. Completing these templates offers a solution.
Third, the required topic or action was absent, preventing the agent from answering the user’s queries. Adding new topics and actions addresses this issue.
For all these scenarios, we designed and developed an automatic pipeline. Using our prompt optimization pipeline, the topic and action instructions are optimized. For the other two scenarios, LLMs are used to cluster the failed queries and identify the objects needed to generate the required topics, actions, corresponding instructions, and Apex code.
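A minimal sketch of that clustering step, grouping failed queries by embedding so an LLM can propose one topic or action per cluster (the library choice and cluster count are assumptions):

```python
from sklearn.cluster import KMeans

def cluster_failed_queries(queries, embed, n_clusters=5):
    """Group failed queries so each cluster can seed one new topic or action."""
    vectors = [embed(q) for q in queries]
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vectors)
    clusters = {}
    for query, label in zip(queries, labels):
        clusters.setdefault(int(label), []).append(query)
    return clusters  # each cluster is then summarized by an LLM into a topic/action
```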
The Apex code is then sent to a staging area for debugging and deployment to the org. Any errors encountered are fed back to the LLMs and corrected. We then create or modify the Agent Script and deploy it to the org.
In one iteration, the agent’s accuracy improved from the initial 35% to approximately 70% after adding two new topics and five new actions and optimizing the instructions within other topics. This is nearly a doubling of performance in a single iteration; with subsequent iterations, accuracy exceeding 95% could be reached, giving users confidence in deploying the agent to the real world. Read more on the challenges of improving agents in our previous blog Better LLM Agents for CRM Tasks: Tips and Tricks.
How is Salesforce using this service as customer zero today?
After we launched the service as an internal demo tool, it was used to generate a large number of synthetic orgs within the first month. The service proved valuable to multiple internal teams – research and data science teams used it to generate data for benchmarking their own agents, security teams leveraged it in red teaming efforts for Agentforce agents, and solution engineers used it to create demos of Salesforce products and offerings for prospective customers. The generated orgs were diverse, spanning different industries, record counts, object sets, and object-specific requirements. The data quality was consistently reviewed by these internal teams and generally met expectations.
Furthermore, a significant component of this data generation pipeline, which constitutes its core functionality, was integrated into OrgFarm – an existing internal Salesforce service dedicated to creating all types of Salesforce orgs. Prior to this integration, OrgFarm relied on Snowfakery to populate data within the created orgs. Now users can choose AI-Powered Data Generation to populate their orgs with contextually rich, relevant data. Following this integration, monthly requests for AI-Powered Data Generation began to outpace those for Snowfakery.
Conclusion
The CRMArena Salesforce Org Data Generator (ODG) solves two major hurdles in developing enterprise AI agents – data availability under privacy constraints and agent testing at scale. ODG creates a secure, realistic Salesforce org simulation by generating complex synthetic CRM data that mirrors production complexity without compliance or privacy risks.
ODG has three core capabilities –
- Uses advanced LLMs to generate unstructured, structured, and relational data that passes Salesforce validation.
- Automatically creates high-quality testing queries and their ground-truth answers to evaluate agent performance.
- Automates Salesforce testing environment setup and data upload.
A POC demonstrated ODG’s value by nearly doubling an AI agent’s accuracy. ODG is an essential “gym” for enterprise AI agents, empowering developers to confidently build, test, and deploy compliant, high-performing solutions.