by Zachary Stauber, Senior Director, Digital Success, AI
The goal of self-service has always been to help customers get answers quickly. But we wanted to go further. Five years ago, we set out to redefine self-service — not just making it faster, but making customers feel confident and empowered. We believed people should feel supported, even without a human helping them. That belief led to the creation of our Digital Success team.
When we first launched Agentforce on Salesforce Help, we were excited about its potential. We saw it as a chance to bring more autonomy to the self-service experience, something we’d been working toward for years. For those of us who had spent countless hours building chatbot dialogues, Gen AI and large language models felt like a breakthrough. It was a game changer.
But then came Day 2. The excitement waned a bit, and the hard questions started — especially this one:
That question sparked real pressure. We saw it as a chance to lead, but we needed a solid plan: one that let us grow, learn, and improve Agentforce in a way that benefited customers and was measurable.
It’s been a meaningful journey. While there’s still more to do, I’m incredibly proud of our team and the progress we’ve made in driving and understanding answer quality for Agentforce on Help.
What is Answer Quality, literally?
Answer Quality is simply the percentage of answers from Agentforce that meet our acceptance criteria after an individual question is asked. We refer to this as a “turn” in the conversation, and our Answer Quality assessment today is optimized for the first turn of a conversation. Later I’ll get into a bit more detail on how we plan to evolve our approach, but this initial method is a straightforward pass/fail check: either an answer meets the criteria or it doesn’t. And yes, we use Agentforce itself to measure this.
For example, if you ask 100 questions and 50 of the answers meet your criteria on the first try, that’s a 50% Answer Quality.
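To make the math concrete, here is a minimal sketch of that calculation in Python. The names (AnswerResult, answer_quality) are purely illustrative and aren’t part of any Salesforce tooling.

```python
from dataclasses import dataclass

@dataclass
class AnswerResult:
    utterance: str  # the question that was asked
    passed: bool    # did the first-turn answer meet the acceptance criteria?

def answer_quality(results: list[AnswerResult]) -> float:
    """Percentage of answers that met the acceptance criteria."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for r in results if r.passed) / len(results)

# The example from the text: 100 questions, 50 passing answers -> 50% Answer Quality.
results = [AnswerResult(f"question {i}", passed=(i < 50)) for i in range(100)]
print(answer_quality(results))  # 50.0
```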
That doesn’t mean the other 50 answers were completely wrong. They just didn’t fully meet the standard you set for a high-quality answer. In reality, more of your answers are probably “correct” than the score suggests, but we believe it’s important to set a high bar based on what we think represents a great answer.
Because of this, answer quality on its own isn’t enough to determine effectiveness. We need to look deeper and also include key metrics like resolution rate and escalation rate. For example, our answer quality benchmark in October of 2025 was 60%, with a target of 75% by the end of the year. That may sound low, but when you factor in question variability, off-topic questions or “small talk,” and customers seeking case creation on the first turn, it is actually rather high. Looking at the whole picture (answer quality, resolution rate, and escalation rate) gave us confidence we were in a good place to achieve key business outcomes.
After getting comfortable with the data, we set a goal of reaching 75% Answer Quality this year. I’m excited to share that we just hit 76% in our most recent test. We achieved this by strengthening our framework, improving content and strategy, and making the most of our AI tools like RAG, embeddings, and metadata.
That’s really all there is to it. I’ll dive deeper into how we assess Answer Quality, but the question of “What makes an answer good?” doesn’t need to be a nebulous, overwhelming exercise.
The Anatomy of Answer Quality
When tackling Answer Quality, it’s easy to overcomplicate things, adding too many metrics and losing sight of what really matters. This can lead to confusion, endless debates, and no clear way forward.
The key is to stay focused on a single, clear goal that everyone can align around.
For us, that goal was: Help customers get their questions answered 24/7, with an easy path to a human if needed.
Once we had that clarity, we tied it to two measurable outcomes:
- Increase Agentforce Resolution Rate
- Decrease overall support case volume
The formula was simple:
Next, we asked: What is a quality answer?
After reviewing real conversations, talking to teams and customers, and studying best practices, we landed on this simple definition: a great answer is relevant, correct, and complete.
We wanted a definition that everyone, regardless of role or background, could understand and use. While AI metrics like RAG or cosine similarity are important, they’re not always meaningful to business leaders.
Early on, when I mentioned something like, “Your cosine similarity is 0.86,” I got blank stares. That’s when we realized we needed a more relatable way to explain quality.
So, we built our system around a clear, shared idea: A great answer is relevant, correct, and complete.
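For the curious, here is a hedged sketch of the contrast: what a score like cosine similarity actually measures, next to the plain relevant/correct/complete verdict we communicate instead. The vectors and the Verdict class are made up for illustration; this isn’t our production code.

```python
import math
from dataclasses import dataclass

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Direction-based similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class Verdict:
    relevant: bool
    correct: bool
    complete: bool

    @property
    def passed(self) -> bool:
        # A great answer must meet all three criteria.
        return self.relevant and self.correct and self.complete

# Two made-up embedding vectors: a high score, but hard to explain in a business review.
print(round(cosine_similarity([0.2, 0.9, 0.1], [0.3, 0.8, 0.2]), 2))  # ~0.98

# The relatable version: pass or fail on relevant, correct, and complete.
print(Verdict(relevant=True, correct=True, complete=False).passed)  # False
```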
Evaluation Managers: The Role Built to Assess Answer Quality at Scale
Once we had a clear concept and definition for Answer Quality, we needed someone to manage the process, find gaps, and drive improvements.
That’s where the Evaluation Manager role came in: a new role we created in our organization, dedicated to Answer Quality and its management for our Help Agent.
These individual contributors are responsible for maintaining the entire Answer Quality framework, running tests, analyzing results, and sharing insights. They need to explain complex ideas in simple terms and design tests that are useful for both engineers and executives.
Like many of you, we didn’t have endless budget to hire a team of specialized AI evaluators. So we looked internally. We found people who were passionate about AI, eager to learn, and comfortable with experimenting and learning fast. I’m proud of the team we’ve built and this new role. In my view, the Evaluation Manager will become a critical role in any organization working with AI Agents in service and support.
These people likely already exist in your company. They might be support engineers, program managers, or data analysts. So as you build your Answer Quality strategy, I encourage you to look within your current team for talent that already wants to positively impact your AI story.
Synthetic Utterances: The Backbone of Our Evaluations Framework
About a month after we launched, our volume of conversations grew into the thousands every week. As usage of Agentforce on Help grew, our evaluation system had to grow with it.
To do this, we created a system based on the idea of “Agents Testing Agents,” using AI to evaluate Agentforce’s responses against clear Acceptance Criteria: the answer must be Relevant, Correct, and Complete.
However, analyzing real customer conversations at scale is difficult. There’s too much variation, and not enough control to reliably measure performance. So we built a more manageable, scalable approach using something we call Synthetic Utterances.
Synthetic Utterances are realistic, but fake, customer questions based on actual support data. Think case logs, web searches, and feedback channels. These represent our top customer issues, and we regularly update the list to reflect current needs.
When we launched in October, we had around a hundred Synthetic Utterances. Today, we have several hundred, and we’re aiming for nearly a thousand soon: enough to confidently reflect customer needs at scale, while still giving us a controlled testing environment.
For each question, we worked with Support Engineers and product experts to define Acceptance Criteria: the specific requirements that make a good answer to that question. These are written into what we call LLM Judge Prompts, which allow AI to automatically assess Agentforce’s responses for relevance, correctness, and completeness.
Here are some examples using common use cases.
Utterance and Acceptance Criteria
| Synthetic Utterance | Acceptance Criteria |
| --- | --- |
| How do I tie my shoes? | Relevant: The answer should include references to tying shoes, including shoe laces and types of shoes that use laces such as sneakers, boots, or running shoes. Correct: The answer should include the following steps in this order: make a basic knot, use the bunny ears, and pull the loops tight to secure the tie. Complete: The answer is complete if it includes guidance on shoes that do not have laces, such as sandals, flip flops, slip-ons, or velcro shoes. |
| How do I brush my teeth? | Relevant: The answer should include words and phrases like toothpaste, brushing in a circular motion, and brushing the teeth in your own mouth. Correct: The answer should follow these steps: grab a toothbrush, put toothpaste on the toothbrush, apply the toothpaste to your teeth with the brush, and brush for around two minutes per session. Complete: The answer should include additional information on the value of flossing and using mouthwash. |
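To show how a synthetic utterance, its acceptance criteria, and an LLM Judge Prompt can fit together, here is a hedged sketch in Python built around the shoe-tying example above. The record shape and the prompt wording are assumptions for illustration, not our exact internal format.

```python
from dataclasses import dataclass

@dataclass
class SyntheticUtterance:
    question: str
    relevant: str  # what makes an answer relevant to this question
    correct: str   # what facts or steps a correct answer must contain
    complete: str  # what extra guidance makes the answer complete

# A hypothetical judge prompt template; the real prompts are written per question.
JUDGE_PROMPT_TEMPLATE = """You are evaluating a support agent's answer.

Question: {question}
Answer under test: {answer}

The answer PASSES only if all three criteria hold:
- Relevant: {relevant}
- Correct: {correct}
- Complete: {complete}

Reply with exactly PASS or FAIL, followed by a one-sentence reason."""

def build_judge_prompt(utterance: SyntheticUtterance, answer: str) -> str:
    """Fill the judge template for one utterance/answer pair."""
    return JUDGE_PROMPT_TEMPLATE.format(
        question=utterance.question,
        answer=answer,
        relevant=utterance.relevant,
        correct=utterance.correct,
        complete=utterance.complete,
    )

shoes = SyntheticUtterance(
    question="How do I tie my shoes?",
    relevant="References tying shoes, shoe laces, and shoes that use laces (sneakers, boots, running shoes).",
    correct="Includes, in order: make a basic knot, use the bunny ears, pull the loops tight to secure the tie.",
    complete="Includes guidance on shoes without laces, such as sandals, flip flops, slip-ons, or velcro shoes.",
)
print(build_judge_prompt(shoes, answer="Cross the laces, make a knot, form two loops..."))
```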
Armed with these synthetic utterances representing our customers’ questions, and acceptance criteria vetted by subject matter experts, we hit the next phase of our operation: tooling.
Building and Applying Agents Testing Agents
In the early days of Agentforce, we manually read every chat transcript. Truly. It was messy and time-consuming. As conversations grew, it quickly became impossible to keep up.
That’s where Synthetic Utterances and Acceptance Criteria became crucial. They let us test Agentforce without relying on live customer interactions. But we still needed a tool to make Agents Testing Agents a reality.
Thankfully, our amazing Engineering team at Salesforce stepped in. We shared the requirements, and they built a tool using Salesforce Apps and public Agentforce APIs. We called it Agents Testing Agents.
Here’s how it works:
- We input our repository of Synthetic Utterances (test questions) and Acceptance Criteria into the tool.
- The tool asks Agentforce that question in our live environment.
- Agentforce responds, and the tool checks the answer against our Acceptance Criteria.
- The result (pass/fail) is shown in a Salesforce report, which we can export to tools like Sheets, Excel, or Tableau.
This process gives us our Answer Quality Score.
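Here is a minimal sketch of that loop in Python. The ask_agent and run_judge functions are placeholders standing in for the live Agentforce API call and the LLM judge call; they are not real Salesforce client functions.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    question: str
    answer: str
    passed: bool
    judge_reason: str

def ask_agent(question: str) -> str:
    """Placeholder: send the synthetic utterance to the live agent and return its answer."""
    raise NotImplementedError

def run_judge(question: str, answer: str, acceptance_criteria: str) -> tuple[bool, str]:
    """Placeholder: ask the LLM judge whether the answer meets the acceptance criteria."""
    raise NotImplementedError

def run_evaluation(test_set: list[tuple[str, str]]) -> list[EvalResult]:
    """Ask each (question, criteria) pair, judge the answer, and record pass/fail."""
    results = []
    for question, criteria in test_set:
        answer = ask_agent(question)
        passed, reason = run_judge(question, answer, criteria)
        results.append(EvalResult(question, answer, passed, reason))
    return results

# The pass/fail rows feed the report we export, and the pass rate across the
# whole set is the Answer Quality Score.
```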
Next, our Evaluation Managers review any failed answers and tag them with an Issue Classification: a label that explains why the answer didn’t meet the criteria.
These classifications help our AI experts in strategy, engineering, and data science investigate and fix issues. Some common classification types include (but aren’t limited to):
Issue Classification and Definitions
| Issue Classification | Definition |
| --- | --- |
| Wrong Product | The answer did not understand or identify the product the question was intended to address. For example, the question was about Service Cloud, but the answer was about Marketing Cloud. |
| Irrelevant Answer | The answer did not understand the question and provided information that was irrelevant to the original question’s intent. |
| Incorrect Answer | The answer is factually inaccurate. |
| No Content Available | Agentforce provides a hallucinated answer, or a message indicating it was unable to find relevant content to provide an answer. |
| No Answer Given / Unexpected Guardrail | Agentforce failed to provide an answer (i.e., an error state), or provided an unexpected guardrail message (i.e., an ungrounded response message). |
| Poor Formatting | The answer provided by Agentforce does not conform to the designed answer format. |
Our Issue Classifications are still evolving. As we’ve learned more, we’ve realized that some things we used to track weren’t as useful as we thought, and we’ve adjusted over time. For example, our initial list included solution-oriented labels like Retrieval Issue. We found that a classification taxonomy that incorporates solutions tends to lead you down the wrong path. The takeaway: focusing on the root issue will help you move more quickly through root cause analysis and solve the real problem.
In addition, we discovered that we sometimes found “false failures”: scenarios where Agentforce was actually answering the question correctly, but our acceptance criteria, our tooling, or even our own uncalibrated perceptions as human evaluators had skewed the judgement. Being open to these types of issues helped us build a stronger playbook and address concerns about confidence in our findings and insights.
We’re also focused on scaling and reducing the workload on our Evaluation Managers. To help with that, we’ve started testing AI to automatically identify why an answer failed, based on the definitions of our Issue Classifications. The early results are very promising, and with our Synthetic Utterances on track to grow past 1,000, I’m confident our team can keep up with demand by using AI.
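To illustrate the idea, here is a hedged sketch of how a failed answer could be auto-classified against the taxonomy above with an LLM. The prompt wording and the build_classification_prompt helper are assumptions for illustration, not a shipped feature.

```python
# Issue labels and definitions taken from the table above.
ISSUE_CLASSIFICATIONS = {
    "Wrong Product": "The answer addressed a different product than the question intended.",
    "Irrelevant Answer": "The answer did not address the intent of the original question.",
    "Incorrect Answer": "The answer is factually inaccurate.",
    "No Content Available": "The agent hallucinated or said it could not find relevant content.",
    "No Answer Given / Unexpected Guardrail": "The agent errored out or returned an unexpected guardrail message.",
    "Poor Formatting": "The answer does not conform to the designed answer format.",
}

def build_classification_prompt(question: str, answer: str, judge_reason: str) -> str:
    """Ask an LLM to pick exactly one issue label for a failed answer (hypothetical prompt)."""
    labels = "\n".join(f"- {name}: {definition}" for name, definition in ISSUE_CLASSIFICATIONS.items())
    return (
        "A support agent's answer failed its acceptance criteria.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Why the judge failed it: {judge_reason}\n\n"
        "Choose exactly one issue classification from this list and reply with the label only:\n"
        f"{labels}"
    )
```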
We’re also excited about new tools in Agentforce Observability. As Agentforce Customer Zero, we’re partnering closely with the product team to pilot features like Testing Center, Session Tracing, and more. These built-in tools are becoming a key part of our strategy, and we’re preparing to transition to them as they become generally available. They will make our Answer Quality story much more vivid and easier to understand and communicate. If I’m being honest, I wish we had these tools when we first launched Agentforce!
What Tests We Run, How Often, and Why
Let’s recap. We have a set of Synthetic Utterances that cover our customers’ top issues. Our experts have approved the acceptance criteria for good answers, and we have a tool to check answers at scale. And we have the final piece: a team of Evaluation Managers who review the results and take action.
So then, when do you test? How often? What kinds of tests?
For us, as an AI Operations team, we handle three types of evaluations: Synthetic Baseline, New Feature, and On Demand.
Synthetic Baseline Evaluation
This is our most important and frequent evaluation.
The Synthetic Baseline tracks our Answer Quality over time. As we release new features, content, or data, we regularly test how those changes affect performance, using our set of synthetic questions and acceptance criteria.
This creates a baseline that all new updates to Agentforce must meet or exceed before going live. It ensures we’re improving, not regressing.
We run these evaluations twice a month, aligned with our sprint cycles. Then we combine the results into a monthly Answer Quality score, which we share in Scorecards, Business Reviews, and more.
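As a simple illustration, here is one way the roll-up and baseline gate could look in Python. A straight average of the two sprint runs is just one possible way to combine them, and the function names are illustrative.

```python
def monthly_answer_quality(sprint_scores: list[float]) -> float:
    """Combine the per-sprint Answer Quality scores into a monthly number (simple average)."""
    return sum(sprint_scores) / len(sprint_scores)

def passes_baseline(candidate_score: float, baseline_score: float) -> bool:
    """A change must meet or exceed the current baseline before it goes live."""
    return candidate_score >= baseline_score

monthly = monthly_answer_quality([58.0, 62.0])  # two sprint runs -> 60.0
print(monthly, passes_baseline(monthly, baseline_score=60.0))  # 60.0 True
```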
This process is the foundation of our AI strategy for Answer Quality.
New Feature Evaluation
In addition to our regular baseline tests, we also evaluate new features, data, or technology throughout their development. We call this approach Evaluation-Driven Development. It ensures that anything we release meets or exceeds our established Synthetic Baseline.
The main difference is that New Feature Evaluations use a smaller, targeted set of synthetic utterances specific to the change being tested. These tests are run as needed during development, and the results are shared with engineering, content, or data teams to guide improvements from development through QA and UAT.
Once the feature goes live, the test questions we used are added to our main repository and included in future baseline tests.
For example, when we launched the Help Agent on the Slack Help Portal, our Evaluation Manager created 96 custom synthetic utterances and acceptance criteria. The feature passed above baseline and was released to production. Those 96 questions are now part of our ongoing testing.
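Here is a hedged sketch of how that flow could be modeled: tag a targeted set of utterances to the feature under development, evaluate only that subset, and fold the questions into the main repository once the feature ships. The feature tag and the UtteranceRepository class are illustrative assumptions, not our actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    question: str
    criteria: str
    feature: str | None = None  # e.g. "slack-help-portal" while the feature is in development

@dataclass
class UtteranceRepository:
    baseline: list[Utterance] = field(default_factory=list)

    def feature_test_set(self, feature: str, candidates: list[Utterance]) -> list[Utterance]:
        """Select only the utterances written for the feature under test."""
        return [u for u in candidates if u.feature == feature]

    def promote(self, candidates: list[Utterance]) -> None:
        """Once the feature goes live, fold its questions into the baseline set."""
        self.baseline.extend(candidates)
```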
On Demand
We also need to run specific tests on demand for bugs, unexpected issues, experience issues, or last-minute changes to check how Help Agent is performing. These are things like bad answers for new features or products, unclear answers, or generally poor experiences with the Help Agent. Since we already have baseline and new feature evaluations, adding on-demand tests was a natural fit.
In addition, many teams across Salesforce act like “independent auditors,” helping us find gaps or insights outside our usual tests. These teams include Accessibility, Legal, and others like Certifications or Trailhead. While they may not handle the biggest customer issues, their work is still very important to help us meet our scale demands and ensure we’re evaluating for as many customer scenarios as possible.
We give these teams access to our tools and framework so they can run their own tests. As the leader of the AI Ops team, this is a big part of my growth plan. It’s about sharing our evaluation process to meet the needs of all our customers better.
The key takeaway: Use your partners!
What’s Next?
The future of Answer Quality is full of possibilities. As new tools come, we’ll keep improving how we measure and understand our agents’ performance.
Right now, Answer Quality looks at just one response after a customer’s question. While this is useful, we want to move toward Conversation Quality: measuring the whole conversation, not just one answer.
We’re exploring two ideas:
- From Answer Quality to Conversation Quality: We want to score entire conversations, not just single answers. One good answer can’t fix a series of bad ones. This is such a focus and passion for us that we’re exploring ways to extend and evolve our current model to assess the conversation as a whole. Our initial thinking is that we can create a compounded single-turn scoring system that can then be communicated as multi-turn quality. In other words, many good answers make a good conversation.
- Real Conversation Assessment Using Acceptance Criteria: We plan to apply a general set of standards, like tone, empathy, and conversation flow, to real customer chats. This will help us measure real conversations at scale. This concept does not, however, apply criteria for answer correctness. So...
To judge if answers are accurate, we’re creating Utterance Clusters: groups of similar questions. This helps us apply acceptance criteria to groups of questions instead of just one.
We’re also working on using the way Support Engineers and Customer Success Managers solve problems as the “correct” model for judging answers. This lets us create highly vetted, democratized acceptance criteria across multiple clusters of questions, at scale. It’s pretty darn cool.
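Circling back to the compounded single-turn idea above, here is a hedged sketch of one possible roll-up: score each turn with the existing pass/fail judgment and report the share of passing turns as a conversation-level quality. This is exploratory thinking, not a finished model.

```python
def conversation_quality(turn_passes: list[bool]) -> float:
    """Share of turns in one conversation that met their acceptance criteria, as a percentage."""
    if not turn_passes:
        return 0.0
    return 100.0 * sum(turn_passes) / len(turn_passes)

# A conversation where 3 of 4 answers passed scores 75%,
# reflecting that one good answer can't fix a series of bad ones.
print(conversation_quality([True, True, False, True]))  # 75.0
```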
In the end, AI is only as powerful as the answers it provides. That’s why answer quality will always be our North Star, guiding how we build, measure, and improve Agentforce for every customer we serve. This effort to assess and improve Answer Quality is a passion space for my team. I encourage everyone to think big and try something that feels uncomfortable or daunting. Embrace the spirit of a futurist thinking about what could be. On that note, I’ll end by sharing my team’s motto and mantra:
“Mess around. Find out.”