One of today’s most pressing AI challenges is the gap between a Large Language Model’s (LLM’s) raw intelligence and how that intelligence translates into consistent, real-world performance when powering autonomous AI agents. This challenge is known as jagged intelligence. While LLMs may excel at writing polished essays or poems and translating languages with impressive fluency, that brilliance often falters when it comes to reliably executing tasks amid the messy realities of enterprise environments.
This inconsistency can lead to misaligned actions, flawed judgments, and deviations from critical business logic and established guidelines. Within high-stakes enterprise settings, this lack of predictability is a major liability; a single misstep can disrupt operations, erode customer trust, and inflict substantial financial or reputational damage.
Salesforce customers demand trustworthy AI that delivers reliable performance at scale, adapting intelligently to complex scenarios and evolving business needs. Their fundamental expectation isn’t just that AI functions, but that it performs with dependable, enterprise-grade consistency. For businesses, AI isn’t a casual pastime. It’s a mission-critical tool that requires unwavering predictability.
To address the critical challenges of jagged intelligence — where AI agents can perform brilliantly in some areas while failing unpredictably in others — Salesforce AI Research operates with three core pillars in mind:
- Foundational research: Begin by pinpointing key industry challenges and driving research that directly addresses them. This includes creating novel benchmarks to measure and reduce jaggedness, building models with deeper contextual understanding, and publishing cutting-edge research to move the field forward.
- Customer incubation: Then, pilot prototypes with customers in real-world simulation environments. By co-innovating directly with users, refine AI agents through continuous learning, real-world feedback, and stress-testing in complex workflows.
- Product innovation: Finally, prove value and readiness by transforming prototypes and research pilots into enterprise-grade solutions. This fuels innovation across Salesforce products and technologies, including Agentforce, the agentic layer of the Salesforce Platform, its Atlas Reasoning Engine, enhanced Retrieval-Augmented Generation (RAG) capabilities, and the Salesforce Trust Layer.
Through this deliberate cycle of investigating, piloting, and proving, Salesforce AI Research is making AI agents more intelligent, trustworthy, versatile, and enterprise-ready. By refining models and continuously iterating with real-world customer feedback, Salesforce AI Research is making it possible to create agents that meet the demanding needs of enterprise environments, ensuring they can seamlessly integrate into workflows, adapt to deliver on complex tasks, and perform with greater reliability.
Dive deeper:
Building intelligent agents with enhanced reasoning and RAG
Salesforce AI Research is focused on advancing intelligent agents with stronger reasoning and RAG capabilities, enabling them to access, synthesize, and apply information in real time — helping to reduce jaggedness and drive more consistent, contextually aware decisions across complex tasks.
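To make that retrieval-augmented pattern concrete, here is a minimal sketch of the RAG loop: embed a query, rank documents by similarity, and ground the model's prompt in the retrieved context. Everything here is illustrative; the toy bag-of-words embedder stands in for a real embedding model, and no Salesforce API is shown.

```python
# Minimal RAG sketch: retrieve relevant context, then ground the answer in it.
import numpy as np

DOCS = [
    "Refunds for annual plans are prorated to the unused months.",
    "Support cases are escalated after 48 hours without a response.",
    "Enterprise customers receive a dedicated success manager.",
]

def _tokens(text):
    return [w.strip(".,?!").lower() for w in text.split()]

VOCAB = sorted({w for d in DOCS for w in _tokens(d)})

def embed(text):
    """Toy bag-of-words vector; a real system would use a trained embedding model."""
    vec = np.zeros(len(VOCAB))
    for w in _tokens(text):
        if w in VOCAB:
            vec[VOCAB.index(w)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: float(q @ embed(d)), reverse=True)[:k]

def build_prompt(query):
    """Ground the model: answer only from the retrieved context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# In practice the prompt is sent to an LLM; here we just show the grounding step.
print(build_prompt("How are refunds handled for annual plans?"))
```

In production, the final prompt would be sent to an LLM, and the toy embedder would be swapped for a trained model so that semantically similar text, not just overlapping words, is retrieved.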
Bringing intelligent agents to life:
- A public benchmark to quantify AI jaggedness: Because AI struggles with consistent reasoning, Salesforce’s public SIMPLE dataset offers a clear benchmark for measuring that inconsistency. Featuring 225 straightforward reasoning questions that are easy for humans but challenging for AI, SIMPLE quantifies LLM jaggedness by tracking performance gaps, helping guide the development of more reliable AI for enterprise applications (a sketch of one such gap metric appears after this list).
- Enhanced embedding model capabilities: As AI processes more unstructured data, understanding context is key. Salesforce AI Research is advancing text-embedding models like SFR-Embedding, which convert text into meaningful numerical vectors so AI systems can retrieve information more effectively. SFR-Embedding leads the MTEB benchmark across 56 datasets, excelling in retrieval and clustering. Available soon in Salesforce Data Cloud, Salesforce’s hyperscale data engine, it will enhance RAG for more accurate AI responses, setting a new standard for reliable enterprise AI.
- Specialized code embedding models for developers: Developers need efficient, accurate, and scalable AI for code retrieval and generation. To meet those needs, Salesforce AI Research launched SFR-Embedding-Code, a specialized code-embedding model family based on SFR-Embedding. By mapping code and text into a shared vector space, it enables high-quality code search, streamlining development (see the retrieval sketch after this list). The 7B model leads the CoIR benchmark, while the smaller 400M and 2B models offer efficient, cost-effective alternatives with little loss in performance.
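As referenced in the SIMPLE item above, one simple way to quantify jaggedness is to measure the spread between a model’s best and worst categories on a benchmark. The category names and scores below are invented for illustration; SIMPLE’s actual schema may differ.

```python
# Hypothetical per-category results for a model on a SIMPLE-style benchmark.
# A "jagged" model shows a wide spread between its best and worst categories.
results = {
    "arithmetic":        {"correct": 38, "total": 45},
    "spatial_reasoning": {"correct": 12, "total": 45},
    "temporal_order":    {"correct": 31, "total": 45},
    "counting":          {"correct": 20, "total": 45},
    "negation":          {"correct": 41, "total": 45},
}

accuracies = {cat: r["correct"] / r["total"] for cat, r in results.items()}
overall = sum(r["correct"] for r in results.values()) / sum(r["total"] for r in results.values())
jaggedness = max(accuracies.values()) - min(accuracies.values())  # best-worst gap

print(f"overall accuracy: {overall:.2%}")
print(f"jaggedness (best-worst gap): {jaggedness:.2%}")
```

And as referenced in the SFR-Embedding-Code item, a bi-encoder maps queries and code into one vector space and ranks candidates by cosine similarity. The sketch below uses the sentence-transformers library; the model identifier is an assumption, so check the official release for exact checkpoint names.

```python
# Natural-language-to-code search with a shared embedding space (bi-encoder pattern).
from sentence_transformers import SentenceTransformer, util

# Assumed model identifier; substitute the actual SFR-Embedding-Code checkpoint.
model = SentenceTransformer("Salesforce/SFR-Embedding-Code-400M_R")

snippets = [
    "def dedupe(xs): return list(dict.fromkeys(xs))",
    "def flatten(xs): return [x for sub in xs for x in sub]",
    "def chunk(xs, n): return [xs[i:i+n] for i in range(0, len(xs), n)]",
]

query = "remove duplicate items from a list while keeping order"
q_emb = model.encode(query, convert_to_tensor=True)
s_emb = model.encode(snippets, convert_to_tensor=True)

scores = util.cos_sim(q_emb, s_emb)[0]   # cosine similarity, query vs. each snippet
best = int(scores.argmax())
print(snippets[best], float(scores[best]))
```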
Strengthening customer trust with benchmarking, testing, and guardrails
To strengthen customer trust and tackle jaggedness head-on, Salesforce AI Research is applying rigorous benchmarking, continuous testing, and robust guardrails. By systematically evaluating agent behavior against real-world conditions and setting clear boundaries for performance and safety, Salesforce AI Research is ensuring agents behave consistently, predictably, and reliably in enterprise environments.
Engineering trust into every agent:
- A new framework designed to test and evaluate AI agents: Evaluating enterprise AI agents’ ability to perform business-level tasks is a critical priority and a persistent challenge for CIOs and IT leaders. To address this directly, Salesforce AI Research introduced CRMArena: a novel benchmarking framework designed to simulate realistic, professionally grounded CRM scenarios. This focused approach enables comprehensive testing and targeted improvement of AI agent performance, ensuring their safety and reliability while cultivating robust enterprise trust.
- New agent guardrail features enhance trust and security: Agentforce’s guardrails establish clear boundaries for agent behavior based on business needs, policies, and standards, ensuring agents act within predefined limits, while Salesforce’s Trust Layer provides an extra layer of protection for enterprise agents. Salesforce AI Research is continuously developing new models and frameworks that enhance Trust Layer tools like toxicity detection and instruction adherence, better defending against prompt injection attacks (a sketch of this gating pattern appears after this list). As part of that work, Salesforce AI Research recently introduced SFR-Guard, a family of models trained on both publicly available data and CRM-specialized internal data, designed to further strengthen the trust and reliability of AI agents in business operations.
- A new benchmark for assessing models in contextual settings: Ensuring AI generates accurate, contextual answers is crucial for enterprise trust, but traditional benchmarks often fall short. Because of that, Salesforce AI Research launched ContextualJudgeBench, a novel benchmark evaluating LLM-based judge models in context. Testing more than 2,000 challenging response pairs, it assesses accuracy, conciseness, faithfulness, and appropriate refusal to answer — all vital requirements for real-world enterprise AI.
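As referenced in the guardrails item above, a common gating pattern screens the inbound request and the drafted action before anything executes. The checks below are toy stand-ins for real classifiers such as the SFR-Guard family; this shows the control flow, not the Trust Layer API.

```python
# Guardrail gate sketch: screen input for prompt injection before the agent
# acts, and screen the drafted action against policy before executing it.
import re

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal your system prompt",
]

def looks_like_injection(text: str) -> bool:
    """Toy pattern check; a real system would use a trained classifier."""
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

def violates_policy(action: dict) -> bool:
    """Example business boundary: never issue refunds above a set limit."""
    return action.get("name") == "issue_refund" and action.get("amount", 0) > 500

def guarded_step(user_message: str, propose_action) -> dict:
    if looks_like_injection(user_message):
        return {"status": "blocked", "reason": "possible prompt injection"}
    action = propose_action(user_message)
    if violates_policy(action):
        return {"status": "blocked", "reason": "action outside policy limits"}
    return {"status": "ok", "action": action}

# Toy agent that proposes an oversized refund, to show the gate firing.
print(guarded_step("please refund my order",
                   lambda m: {"name": "issue_refund", "amount": 900}))
```

A ContextualJudgeBench-style harness can likewise be pictured as scoring a judge model over labeled response pairs. The rows and the judge stub below are invented for illustration; the real benchmark defines its own schema and criteria.

```python
# Pairwise judge evaluation sketch: check how often a judge picks the response
# that is faithful to the supplied context (or refuses appropriately).
PAIRS = [
    {"context": "Policy: refunds within 30 days.",
     "a": "You can get a refund within 30 days.",
     "b": "Refunds are always available.",
     "better": "a"},   # "b" is unfaithful to the context
    {"context": "No pricing information provided.",
     "a": "The plan costs $10/month.",
     "b": "I can't answer pricing questions from the given context.",
     "better": "b"},   # appropriate refusal beats a fabricated answer
]

def judge(context, a, b):
    """Stub judge that prefers the response more grounded in the context.
    A real judge would be an LLM prompted with explicit criteria."""
    def grounded(resp):
        return sum(w in context.lower() for w in resp.lower().split())
    return "a" if grounded(a) >= grounded(b) else "b"

correct = sum(judge(p["context"], p["a"], p["b"]) == p["better"] for p in PAIRS)
print(f"judge accuracy: {correct}/{len(PAIRS)}")
```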
Enhancing model versatility by moving beyond sole reliance on LLMs
By introducing specialized models, structured knowledge sources, and retrieval-augmented techniques, Salesforce AI Research is building agents that reason more reliably, adapt to the unique demands of enterprise workflows, and deliver more consistent, versatile performance — helping businesses operate with greater efficiency, trust, and agility.
Driving the future of versatile enterprise agents:
- A major upgrade to action model capabilities: As AI models become increasingly commoditized, there’s a growing need for smaller, more efficient alternatives that execute tasks at lower cost and with fewer resources. To that end, Salesforce AI Research upgraded its xLAM (Large Action Model) family with multi-turn conversation support and a wider range of smaller models for increased accessibility. Unlike models that just predict words, Salesforce’s xLAM family predicts actions for faster real-world task execution, which is crucial for tool use and function calling (see the agent-loop sketch after this list). While some models exceed 200B parameters, xLAM starts at 1B, offering a lightweight, integrable footprint with robust planning, reasoning, and function execution, outperforming even GPT-4o and GPT-4.5 previews on key benchmarks for enterprise agents.
- A multimodal action model family for multi-step problem-solving: Current multimodal models struggle with complex, multi-step problems, often lacking clear reasoning capabilities. To address that gap, Salesforce launched TACO, a multimodal action model family that tackles these tasks by generating chains of thought-and-action (CoTA). TACO breaks tasks down into simple steps while integrating real-time action, which improves the AI’s ability to interpret and respond to intricate queries. Salesforce testing showed this achieved average gains of up to 4% across eight benchmarks and up to 20% on the challenging MMVet benchmark.
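To picture the pattern behind both xLAM-style function calling and TACO-style chains of thought-and-action, here is a minimal agent loop: the model emits structured tool calls instead of free text, the runtime executes each call, and the observation is fed back until the task is done. The model here is a stub, and the JSON call format is an assumed convention, not the published xLAM or TACO schema.

```python
# Thought-and-action loop sketch: a model proposes structured actions, the
# runtime executes them, and observations are appended until a final answer.
import json

def get_weather(city: str) -> str:
    return f"18C and clear in {city}"          # stub tool

TOOLS = {"get_weather": get_weather}

def mock_model(history: list) -> str:
    """Stand-in for an action model; it calls the weather tool once, then finishes."""
    if not any(m["role"] == "tool" for m in history):
        return json.dumps({"thought": "Need the weather first.",
                           "action": {"name": "get_weather",
                                      "args": {"city": "Paris"}}})
    return json.dumps({"thought": "I have what I need.",
                       "final": "It is 18C and clear in Paris."})

def run(task: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = json.loads(mock_model(history))
        if "final" in step:
            return step["final"]
        call = step["action"]
        result = TOOLS[call["name"]](**call["args"])   # execute the tool call
        history.append({"role": "tool", "content": result})
    return "step limit reached"

print(run("What's the weather in Paris?"))
```

The key design choice is that actions are structured and validated before execution, which is what makes action-level guardrails like those described earlier enforceable.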
Shaping the future of enterprise AI with intelligent, trusted, and versatile AI agents
Salesforce AI Research continues to lay the groundwork for the next generation of enterprise AI agents, driven by intelligence, trust, and versatility, to help businesses work smarter and serve customers more effectively.
These innovations span everything from benchmarking jaggedness and advancing reasoning to embedding trust and building action-driven agents. Each research paper and new model release contributes to Salesforce’s broader mission of moving beyond prototypes to deliver AI systems that reliably perform in complex business environments.
As enterprise needs evolve, Salesforce remains committed to what matters most to customers: translating cutting-edge research into trusted products that deliver real results at scale.
More information:
- Read a byline about Enterprise General Intelligence (EGI) from Salesforce’s Head of AI Research
- Learn more about Salesforce AI Research