
How Do You Know if Your AI Agent Is Doing a Good Job?

A raft of Salesforce tools can help you measure your AI agent’s performance. [Image credit: Aleona Pollauf/Salesforce]

Evaluation tools for agents are still an emerging technology, but there are plenty of ways to assess your agent’s performance today.

Congratulations! You deployed your first AI agent and it’s out there, doing its job, streamlining your workflows and helping your employees work smarter. You’re tracking the engagement metrics and escalation rate KPIs. But you still might wake up in the middle of the night, wondering, “Do I have enough data to know if my agent is doing a good job?” 

The more insights you have, the more quickly you can make improvements — which is why a lot of people are asking the same question. “We’re still in very early days, measuring these agents,” said Jesse Luke, senior manager, data enablement, web, at Salesforce. “It’s a process everyone is going through.” 

But there are ways to measure the quality and effectiveness of your AI agents’ work, starting with the KPIs you put in place at deployment. There are also Salesforce tools — including some on the horizon — to help you assess your agent’s performance.

What does an effective AI agent look like? 

A good agent doesn’t just answer customers’ or employees’ questions. It solves people’s problems. The best agents do this seamlessly. 

“How do you know you’re working with a good AI agent vs. a mediocre one?” Mike Murchison, CEO of Ada, asked on LinkedIn. “Good AI should feel like the best server at your favorite restaurant.” 

Like a great server, he said, a great agent anticipates your needs even before you do. “They remember your preferences, spot any problems before they happen, and fix them without fanfare,” he added. 

That’s the ideal. But first, you may simply want to know whether your agent is meeting its basic KPIs. “If you have a good idea of your KPIs and can identify how the agent impacts those, you’re off to the races,” Luke said.

On the Salesforce Help site, for example, the customer service agent’s job is to help people quickly find the information they need and reduce the caseload of human agents. The company posts the agent’s performance metrics on a weekly basis. 

The numbers? One week in September, Agentforce, the Salesforce platform for building and deploying AI agents, handled over 61,000 support requests and resolved more than 39,000 of them. Roughly 17,000 requests were handed off to humans. 

Those are the kind of KPIs that show your agent is doing its job.
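
The weekly figures above reduce to simple ratios. As a minimal sketch (the function name and output format here are illustrative, not a Salesforce API):

```python
def agent_kpis(handled: int, resolved: int, escalated: int) -> dict:
    """Express basic agent KPIs as percentages of handled requests."""
    return {
        "resolution_rate": round(100 * resolved / handled, 1),
        "escalation_rate": round(100 * escalated / handled, 1),
    }

# One week in September: 61,000+ requests handled, 39,000+ resolved,
# roughly 17,000 handed off to humans.
kpis = agent_kpis(handled=61000, resolved=39000, escalated=17000)
print(kpis)  # roughly 64% resolved, 28% escalated
```

Tracked week over week, ratios like these show whether the agent is trending toward or away from its targets.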

You can measure only what you can see

One of the biggest challenges companies have with AI agents is visibility — seeing what their agent is doing and making sure it’s acting as intended. Salesforce’s Agentforce Observability offers a unified dashboard that tracks an agent’s error rates, escalation rates, latency, and more. It sits within Agentforce Studio, a new suite of tools to gauge an agent’s performance. The dashboard can answer questions such as “How are adoption and usage trending?” and “Are my agents following legal and regulatory requirements?”

It can also categorize your agent’s conversations into topics so you can see how customers are using the agent. For example, 40% of agent sessions might be about payment problems; another 20% could be cancellation requests. 

How Salesforce measures performance 

Salesforce conducts its own AI agent evaluation in several ways. The company’s Digital Success team runs synthetic tests twice a month to see how agents perform in hypothetical situations. To do this, they use an in-house tool similar to the Agentforce Testing Center, which lets customers test agents in secure sandboxes before they’re deployed.  

Earlier this year, the team ran a test that resulted in low answer-quality scores, with the Salesforce Help agent scoring 59% against a baseline of 60%. When the team looked more closely, they discovered the agent was hallucinating URLs. The solution? “We shipped a fix, ran another test, and improved our answer quality to 67%,” said Zachary Stauber, senior director, digital success, AI, at Salesforce.
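
The test-and-fix loop described here amounts to a threshold check against a baseline. A hedged sketch (the function name and message format are illustrative):

```python
def check_answer_quality(score: float, baseline: float) -> str:
    """Flag a synthetic-test run whose answer quality falls below the baseline."""
    if score < baseline:
        return f"REGRESSION: {score}% is below the {baseline}% baseline"
    return f"PASS: {score}% meets the {baseline}% baseline"

print(check_answer_quality(59.0, 60.0))  # the run that exposed hallucinated URLs
print(check_answer_quality(67.0, 60.0))  # after the fix shipped
```

The value of the baseline is that it turns a raw score into a decision: ship a fix, rerun the test, and confirm the number moved.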

The answer-quality score was useful information. But Salesforce also wanted to know how Agentforce was interacting with users in the real world, and at scale. And they wanted to give those conversations a score.

So, the company’s Data Enablement team started looking at the session level, which is the entire conversation between a user and agent. “But we found that it wasn’t logical to do it that way,” said Manoj Arora, principal member of the technical staff, software engineering, at Salesforce. “There might be some questions where the agent did a good job, and in the same session, a question where the agent did not do a good job.” 

The Data Enablement team next looked at individual questions to see how an agent answered each one. But that didn’t make sense either; when they reviewed a single question and answer, the back-and-forth lacked context. Finally, they used a data science model that classifies and clusters similar topics into groups, or moments. These are what the team decided to focus on.   
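
Grouping similar questions into moments can be approximated with any text-clustering approach. A toy sketch using word overlap (a production pipeline would use embeddings; every name and threshold here is illustrative):

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two questions (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster_questions(questions, threshold=0.5):
    """Greedily group questions whose word overlap exceeds the threshold."""
    clusters = []
    for q in questions:
        for cluster in clusters:
            if jaccard(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])  # no close match: start a new "moment"
    return clusters

questions = [
    "how do I update my payment method",
    "how do I update my billing payment details",
    "how do I cancel my subscription",
]
moments = cluster_questions(questions)
# Two moments: a payment-update group and a cancellation group.
```

The idea is the same at scale: similar questions collapse into one moment, so evaluation happens per topic rather than per session or per message.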

The team then used Agentforce to test these agentic moments, scoring them on a scale of one to five. They did this using an internal tool similar to Agentforce Optimization, which is in beta now and will be available at the end of October as part of Agentforce Observability.


More tools coming soon

Agentforce Optimization will let you look more closely at your agent’s conversations. It will retrieve your agent’s chat data and pass it to a prompt within Agentforce, which rates the interaction. A score of one to two means your agent needs improvement. A five is reason for celebration. 

“What you’re looking for is to see if the right action is being taken by the agent in those conversations, and whether those actions are in line with your business goals,” said Luke.

To get a better idea of how Agentforce Optimization will score responses, here are two samples of conversations on salesforce.com that the team has already tested: 

  • The interaction: A customer asked Agentforce to “provide some customer success stories that have helped IT experts improve business.” The agent replied, “I couldn’t find any specific customer stories related to IT experts improving business. Would you like to search for something else or do you need information on a specific Salesforce product?”
    The score: 1-2, because the agent couldn’t answer the customer’s question.

  • The interaction: A customer asked, “Can I already build a semantic model in Tableau Cloud that will be used by Tableau Agent?” The agent replied, “Yes you can build a semantic model in Tableau Cloud that will be used by Tableau Agent. Tableau Semantics is a powerful AI-infused semantic layer integrated into Data Cloud that translates your data into your business language.” The agent then described the features of Tableau Semantics in detail and provided a link to the component’s web page.  
    The score: 5, because the customer received the information they needed. 
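
The scoring flow behind examples like these — retrieve the transcript, pass it to a rating prompt, parse out a one-to-five score — can be sketched with the judge mocked out. A real implementation would call an LLM; the rubric, heuristic, and function names below are all illustrative:

```python
import re

RUBRIC = (
    "Rate the agent reply from 1 (failed to help) to 5 (fully resolved). "
    "Answer with 'Score: N'."
)

def mock_judge(prompt: str) -> str:
    # Stand-in for an LLM call: a crude heuristic on the reply text.
    if "couldn't find" in prompt.lower():
        return "Score: 1"
    return "Score: 5"

def rate_interaction(transcript: str, judge=mock_judge) -> int:
    """Send the rubric plus transcript to the judge and parse out the score."""
    reply = judge(RUBRIC + "\n\n" + transcript)
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else 0

print(rate_interaction("Agent: I couldn't find any specific customer stories..."))
print(rate_interaction("Agent: Yes, you can build a semantic model..."))
```

Parsing the score out of free-form judge text, rather than trusting it blindly, is what makes the ratings usable downstream for clustering and dashboards.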

Agentforce Optimization will cluster multiple interactions like these into meaningful categories so you can evaluate an agent’s performance at scale. You might, for example, want to look at how your agent is handling a specific topic, such as requests for product information. Or you might want to look at clusters by score. Where is your agent routinely getting scores of one or two? Where is the agent doing well? All that will be possible with Agentforce Optimization. 
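
Grouping scored interactions by topic makes the weak spots visible at a glance. A minimal sketch over hypothetical (topic, score) pairs — the data and the flagging threshold are invented for illustration:

```python
from collections import defaultdict

# Hypothetical scored sessions: (topic, 1-to-5 score).
scored = [
    ("payment problems", 2), ("payment problems", 1), ("payment problems", 2),
    ("product information", 5), ("product information", 4),
    ("cancellations", 3),
]

by_topic = defaultdict(list)
for topic, score in scored:
    by_topic[topic].append(score)

for topic, scores in sorted(by_topic.items()):
    avg = sum(scores) / len(scores)
    flag = "needs attention" if avg <= 2 else "ok"
    print(f"{topic}: avg {avg:.1f} ({flag})")
```

In this toy data, payment problems average below two and get flagged, while product-information requests score well — exactly the per-cluster view described above.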

Companies will be able to customize the tool to suit their business needs. A large retailer, for example, might want to see how their agent handles returns; another company might want to see how the agent manages tech support. 

But Agentforce Optimization isn’t the only new tool on the horizon. Agentforce Analytics 2.0, a more advanced version of the current Agentforce Observability dashboard, is also in beta. The beefed-up dashboard will offer a higher-level view, showing how many conversations have taken place and which topics are being covered, as well as latency and escalation rates. It, too, will be available at the end of October.

Why AI agent evaluation is so important 

Companies need to assess their agent’s performance for a simple reason: to know what’s working and what should be improved. With metrics in hand, you might see that you need to update your content, for example, or that your agent needs more detailed instructions. “The number one thing we usually find is bad data,” said Stauber.

Bad or mislabeled data, data from unknown sources, or data that’s scattered over multiple systems can all be a problem. But once you’ve identified the issue, you can take action. That’s what Salesforce’s Digital Success team does when it finds an error like the URL hallucinations mentioned earlier. “We can do a fix, come back to the baseline program, run a test again, and see how things have changed,” Stauber said. 

Relax, your agent is hard at work 

With all these new tools to evaluate your AI agent’s performance, company leaders should be able to breathe a sigh of relief. So, the next time you startle awake wondering how your agent is doing, go back to sleep. Let your agent work at that hour instead.
