Salesforce AI Research announces framework to optimize agent capability and consistency through synthetic data, realistic testing, and reinforcement learning.
Even as AI models grow more sophisticated, a curious challenge persists: systems that solve PhD-level mathematics struggle with surprisingly simple tasks. Ask a leading language model the famous riddle “Where does Christmas come before Thanksgiving?” and it correctly answers “in the dictionary”—because alphabetically, ‘C’ precedes ‘T.’
But swap the words—ask “Where does Thanksgiving come before Christmas?”—and watch the same model confidently explain that “in the dictionary, Thanksgiving comes before Christmas alphabetically.” This phenomenon, which we call “jagged intelligence,” reveals sharp peaks of brilliance alongside unexpected valleys of weakness.
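One simple way to surface this kind of jaggedness is to probe a model with semantically equivalent phrasings of the same question and check whether its answers agree. The sketch below is purely illustrative: `ask_model` is a hypothetical stand-in for any LLM API call, with canned answers so the example runs on its own.

```python
# Illustrative sketch: measure consistency by asking paraphrases of the
# same question and checking answer agreement.
# `ask_model` is a hypothetical stand-in for a real LLM API call.

def ask_model(prompt: str) -> str:
    # Placeholder: in practice, call your model endpoint here.
    canned = {
        "Where does Christmas come before Thanksgiving?": "in the dictionary",
        "In what place does Christmas come before Thanksgiving?": "in the dictionary",
    }
    return canned.get(prompt, "unknown")

def consistency_score(variants: list[str]) -> float:
    """Fraction of variant pairs whose answers agree exactly."""
    answers = [ask_model(v).strip().lower() for v in variants]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

variants = [
    "Where does Christmas come before Thanksgiving?",
    "In what place does Christmas come before Thanksgiving?",
]
print(consistency_score(variants))  # 1.0 means every paraphrase agreed
```

A jagged model scores high on capability benchmarks yet low on a probe like this, because rewording the question flips its answer.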
For enterprise businesses, this inconsistency isn’t academic—it’s operational. When AI agents handle customer service calls, process sales workflows, or manage healthcare billing, jagged intelligence creates real business risk. An agent might flawlessly handle complex multi-step tasks one moment, then stumble on straightforward requests the next. This unpredictability is a dealbreaker for enterprises where reliability matters as much as capability.
At Salesforce AI Research, we’ve developed a new methodology to mitigate these risks. Today, we’re announcing eVerse: an enterprise simulation framework that trains AI agents like elite athletes, optimizing them for both capability and consistency through three interconnected steps: Synthesize, Measure, and Train.
eVerse: Synthesize – Building the Enterprise “Digital Twin”
Training best-in-class AI agents requires best-in-class training environments. Just as Formula 1 drivers spend thousands of hours in sophisticated simulators before competing at Monaco, enterprise AI agents need realistic practice grounds that mirror the complexity of actual business operations.
Because trust is Salesforce’s #1 value, we’ve designed a training approach that never puts your real data at risk. Our recent research work with CRMArena-Pro is a great example. It creates completely synthetic training grounds with realistic customer data, multi-step workflows, and the edge cases that make business operations unpredictable. Agents learn in environments that mirror real enterprise systems, while your data and your customers’ data remain private, secure, and completely untouched. Learn more about our work in simulation environments in my recent blog, The New AI Agent Training Ground: Simulating Enterprise Environments.
The validation speaks for itself: 90% of domain experts rate our synthetic data generation as realistic or very realistic. Even more telling—the majority of the demos you’re seeing at Dreamforce this week use synthetic data generated by CRMArena-Pro.
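The shape of such a synthetic environment can be sketched in a few lines of code. The field names, value pools, and edge-case rate below are made up for illustration; they are not CRMArena-Pro's actual schema or generation method.

```python
# Illustrative sketch of synthetic CRM record generation.
# Field names and value pools are invented for this example; they are
# not the actual CRMArena-Pro schema.
import random

FIRST_NAMES = ["Avery", "Jordan", "Riley", "Sam"]
ISSUES = ["billing dispute", "late shipment", "password reset", "plan upgrade"]
PRIORITIES = ["low", "medium", "high"]

def synth_case(rng: random.Random, case_id: int) -> dict:
    """Generate one fully synthetic support case: no real customer data."""
    return {
        "case_id": f"CASE-{case_id:05d}",
        "customer": rng.choice(FIRST_NAMES),
        "issue": rng.choice(ISSUES),
        "priority": rng.choice(PRIORITIES),
        # Edge cases make the environment realistic: some records are messy.
        "missing_contact_info": rng.random() < 0.1,
    }

rng = random.Random(42)  # seeded so the environment is reproducible
cases = [synth_case(rng, i) for i in range(3)]
for case in cases:
    print(case["case_id"], case["issue"], case["priority"])
```

The key property is that every record is sampled, not copied: the environment can be as large and as messy as training demands without ever touching production data.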
eVerse: Measure – Stress-Testing in Realistic Scenarios
Synthesis alone isn’t enough. We must rigorously measure agent performance across the scenarios that matter most to enterprises. This includes one of the most critical—and challenging—modalities: voice interactions.
Voice conversations introduce layers of complexity that text-based testing misses: background noise, diverse accents, translation errors, poor connections, multiple speakers. eVerse simulates these realistic voice interactions, generating synthetic phone conversations that sound remarkably human while testing agents against comprehensive enterprise scenarios.
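To get a feel for why voice is harder to test than text, consider injecting text-level stand-ins for channel noise into a synthetic transcript before it reaches the agent. This is a toy illustration only; real voice simulation operates on audio, not text.

```python
# Toy illustration: perturb a synthetic call transcript with text-level
# stand-ins for voice-channel noise (dropped words, disfluencies).
# Real voice simulation works on audio; this only conveys the idea.
import random

def perturb_transcript(text: str, rng: random.Random,
                       drop_p: float = 0.1, filler_p: float = 0.15) -> str:
    out = []
    for word in text.split():
        if rng.random() < drop_p:
            continue  # simulate a word lost to a poor connection
        if rng.random() < filler_p:
            out.append(rng.choice(["uh", "um"]))  # simulate a disfluency
        out.append(word)
    return " ".join(out)

rng = random.Random(7)
clean = "I would like to dispute a charge on my latest invoice"
print(perturb_transcript(clean, rng))
```

An agent that only ever sees the clean string is being tested on the easy version of the problem; stress-testing means sweeping the noise parameters across thousands of synthetic calls.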
This measurement infrastructure operates behind the scenes throughout Salesforce. It’s how we validated Agentforce voice capabilities before launch, running thousands of synthetic conversations to ensure agents could handle real-world complexity with both high capability and unwavering consistency.
eVerse: Train – Closing Performance Gaps with Human Expertise
After measurement reveals performance gaps, eVerse’s training engine closes them through reinforcement learning guided by human expertise. Our research has demonstrated remarkable improvements using this method: a 69-percentage-point gain on enterprise tasks, with success rates rising from 19% to 88%. We’re currently piloting eVerse with customers. One example is UCSF Health, where we’re partnering with human experts to train and refine AI that helps simplify and improve the healthcare billing experience.
This continuous loop—synthesize environments, measure performance, train on gaps—transforms agents from generic language models into enterprise-specialized systems ready for production deployment.
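The three steps compose into a simple control loop. The sketch below is a schematic of that loop under assumed interfaces: `synthesize_env`, `measure`, and `train_on_gaps` are hypothetical names with toy bodies, and the success target is an arbitrary assumption.

```python
# Schematic of the synthesize -> measure -> train loop.
# All function names, the toy update rule, and the 0.95 target are
# illustrative assumptions, not eVerse's actual implementation.

def synthesize_env(round_num: int) -> list[str]:
    """Build a batch of synthetic enterprise scenarios for this round."""
    return [f"scenario-{round_num}-{i}" for i in range(100)]

def measure(agent_skill: float, scenarios: list[str]) -> dict[str, bool]:
    """Score the agent on each scenario (toy model: skill = success rate)."""
    return {s: (hash(s) % 100) / 100 < agent_skill for s in scenarios}

def train_on_gaps(agent_skill: float, results: dict[str, bool]) -> float:
    """Close part of the gap revealed by failures (toy update rule)."""
    failures = [s for s, ok in results.items() if not ok]
    return agent_skill + 0.5 * (len(failures) / len(results))

skill, target = 0.19, 0.95  # starting point and stopping criterion (assumed)
for round_num in range(10):
    scenarios = synthesize_env(round_num)
    results = measure(skill, scenarios)
    success_rate = sum(results.values()) / len(results)
    if success_rate >= target:
        break  # gaps closed; ready for the next evaluation
    skill = train_on_gaps(skill, results)
```

The structural point is the feedback: each round's failures decide what the next round trains on, so effort concentrates exactly where the agent is weakest.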
The Path to Enterprise General Intelligence
This work advances our vision for what we call Enterprise General Intelligence (EGI): AI optimized for business applications that excels in both capability and consistency. While consumer AI prioritizes broad general-purpose capabilities, enterprise AI demands reliable performance across specific and complex, multi-step workflows where inconsistency carries real business risk.
eVerse addresses this by moving agents along both dimensions simultaneously. Generic LLM agents underperform in business settings: high capability paired with low consistency creates the “prodigy” problem, brilliant when it works, unreliable when it matters. eVerse-trained agents achieve the “champion” quadrant: high capability combined with high consistency, exactly what enterprises require.
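The two axes can be made concrete: capability as the mean success rate across tasks, and consistency as how little that rate varies across repeated runs. The sketch below classifies an agent into a quadrant; the 0.8 thresholds and quadrant labels beyond “prodigy” and “champion” are arbitrary assumptions for illustration.

```python
# Illustrative capability/consistency quadrant classification.
# The 0.8 thresholds are arbitrary assumptions, not published criteria.
from statistics import mean, pstdev

def classify(run_success_rates: list[float],
             cap_threshold: float = 0.8,
             con_threshold: float = 0.8) -> str:
    """Place an agent in a quadrant from per-run success rates in [0, 1]."""
    capability = mean(run_success_rates)
    # Consistency: 1 minus the spread across runs (1.0 = identical runs).
    consistency = 1.0 - pstdev(run_success_rates)
    high_cap = capability >= cap_threshold
    high_con = consistency >= con_threshold
    if high_cap and high_con:
        return "champion"   # high capability, high consistency
    if high_cap:
        return "prodigy"    # brilliant when it works, unreliable when it matters
    if high_con:
        return "steady"     # reliable but limited
    return "unready"

print(classify([0.90, 0.88, 0.91]))  # -> "champion": strong and stable
print(classify([0.99, 0.50, 0.98]))  # -> "prodigy": strong on average, erratic
```

Framed this way, eVerse's goal is a move up and to the right: raising the mean while shrinking the variance.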

The Competitive Imperative
The organizations that will lead in the agentic AI era won’t necessarily be those with the most advanced models—they’ll be the ones that recognize early that enterprise AI excellence requires sophisticated training environments bridging the gap between simulation and reality.
This body of research—from eVerse to voice simulation to reinforcement learning from human feedback—represents Salesforce’s commitment to making AI agents genuinely enterprise-ready: trustworthy, reliable, and grounded in enterprise business intelligence. The future belongs to agents trained in environments that simulate millions of realistic business scenarios, validated by domain experts, and continuously refined through real-world feedback loops.
We’re sharing eVerse at Dreamforce because our research advances through continuous customer engagement. The human feedback that trains agents in eVerse comes from our customers’ domain experts—the same organizations who will deploy these systems. This partnership between research and practice is how enterprise AI becomes genuinely reliable.
Join us as we shape what enterprise-ready AI agents can become.