
Your AI Agent Works, But Do Your Users Think It’s Worth It?

By evaluating what users mean when they report an agent isn't performing as expected, we can identify critical failure points. [Sanem | Adobe]

Design research shows that improving your users’ perception of value is the key to adoption. Learning from AI agent pain points can help.

AI agents are quickly becoming the backbone of enterprise productivity. They promise faster resolution times, better efficiency, and happier customers. But those of us designing agents face a critical challenge: a technically “working” agent, with excellent system accuracy scores, can still be perceived by the user as providing little value, or even as not worth using.

Our latest internal research on end user perspectives highlights a crucial gap. Users often don’t have the technical language to describe a specific issue they face when using an agent. Instead, they share generic complaints like:

  • “It’s wrong.”
  • “It doesn’t understand.”
  • “It’s missing something.”

The true measure of success isn’t the model’s performance on a benchmark, but the user’s perception of its value and the trust they place in it.

Here’s what we’ll cover:

  • What users mean when they say ‘it doesn’t work’
  • Three tiers of agent failure
  • How to triage user issues and increase trust
  • Designing for trust and adoption

What users mean when they say ‘it doesn’t work’

As agents become more widely available to end users, the definition of a “successful” agent has broadened beyond mere model accuracy. For instance, an agent that is technically accurate but unhelpful in practice will ultimately be abandoned by the end user.

Let’s look at an example of an interaction:

End user question: “What is our official company policy on expense reporting for international travel?”

Agent response: “For a detailed, up-to-date answer on international expense policy, please refer to the official ‘Global Travel & Expense Policy’ located on the internal company portal.”

This output is technically “successful”: the agent isn’t connected to this data source, so redirecting the user to where the information can be found is the correct behavior. However, the user must now take manual steps to locate the answer, leaving the perception that the agent isn’t useful.

By evaluating what users truly mean when they report an agent isn’t performing as expected, we can identify critical failure points and value issues that technical systems and model benchmarks aren’t equipped to detect.

“It doesn’t work”
  • Helpful/unhelpful error
  • Doesn’t return anything
  • Output is nonsensical
  • Exposes PII or other sensitive information
  • False action/task completion

“It’s wrong”
  • Math error
  • Factual error
  • Internal inconsistency
  • Contradicts capabilities

“It doesn’t understand”
  • Irrelevant output
  • Not grounded appropriately
  • Not aligned with policy/best practices
  • Didn’t understand input intent

“It’s missing something”
  • Missing a record/field
  • Missing needed functionality
  • Not comprehensive
  • Unsupported prompt style

“It’s not worth using”
  • Latency
  • Missing actionability
  • Deflection to self-service
  • Tone/style
  • Responses are too noisy/lack precision


Three tiers of agent failure

To help our customers triage issues faster, we developed a User Failure Points Framework by analyzing 2,000 multi-turn conversations between users and agents. We then mapped specific root-cause technical issues back to the generic user complaints above.

This framework categorizes user issues into three types, aligning to tiers of severity that directly impact task progression and user trust.

  • P0: System Failures. These are the highest-severity issues. A P0 failure means the agent fails to work as expected, blocking task progression and severely damaging user trust.
  • P1: User Intent Not Met. In these cases, the agent delivers an output that’s misaligned with the user’s original intent. While the system may be technically functional, a P1 failure blocks task progression and causes user frustration.
  • P2: Limited Value. The agent is functional, but the output is of low perceived quality or usefulness. These failures lead to the agent being labeled as “not worth using” because they force the user to correct, edit, or re-prompt too often.
P0: System Failures (block task progression)
  • Helpful or unhelpful error
  • Doesn’t return anything
  • Output is nonsensical
  • Exposes PII or other sensitive information
  • Math error
  • Factual error
  • Contradicts capabilities
  • False action or task completion
  • Missing a record or field

P1: User Intent Not Met (block task progression)
  • Missing needed functionality
  • Ignored prior input
  • Internal inconsistency
  • Irrelevant output
  • Not grounded appropriately
  • Not aligned with policy or best practices
  • Didn’t understand input intent
  • Implicit context ignored
  • Not comprehensive

P2: Limited Value (creates low perceptions of value)
  • Latency
  • Tone/style
  • Responses are too noisy or lack precision
  • Deflection to self-service
  • Missing actionability
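
To make the framework usable in tooling (dashboards, eval scripts, triage queues), the taxonomy can be encoded as plain data. Below is a minimal Python sketch; the tier and failure names come from the tables above, while the dictionary structure and the tier_of helper are illustrative, not part of the framework itself.

```python
# Illustrative encoding of the User Failure Points Framework as data, so that
# triage tooling can label issues by severity tier. Tier and failure names
# follow the tables above; the structure is just one possible representation.
FAILURE_TAXONOMY = {
    "P0_system_failure": [          # blocks task progression; highest severity
        "helpful_or_unhelpful_error",
        "returns_nothing",
        "nonsensical_output",
        "exposes_pii_or_sensitive_information",
        "math_error",
        "factual_error",
        "contradicts_capabilities",
        "false_action_or_task_completion",
        "missing_record_or_field",
    ],
    "P1_user_intent_not_met": [     # blocks task progression; misaligned with intent
        "missing_needed_functionality",
        "ignored_prior_input",
        "internal_inconsistency",
        "irrelevant_output",
        "not_grounded_appropriately",
        "not_aligned_with_policy_or_best_practices",
        "did_not_understand_input_intent",
        "implicit_context_ignored",
        "not_comprehensive",
    ],
    "P2_limited_value": [           # functional, but low perceived value
        "latency",
        "tone_or_style",
        "responses_too_noisy_or_imprecise",
        "deflection_to_self_service",
        "missing_actionability",
    ],
}


def tier_of(failure: str) -> str | None:
    """Return the severity tier for a labeled failure, or None if it's unknown."""
    for tier, failures in FAILURE_TAXONOMY.items():
        if failure in failures:
            return tier
    return None
```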


How to triage user issues and increase trust

Understanding this taxonomy is the first step. The next is applying it to your agent development lifecycle to build trust and increase adoption.

1. Diagnose and triage failures

When P0 System Failures are absent but users are still reporting issues, you can use the User Failure Points Framework to speed up issue diagnosis during testing. To scale this work, you can also use an LLM-as-judge evaluation method to more consistently identify the subtler P1 (User Intent) and P2 (Limited Value) failures.
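
As one hedged illustration, the sketch below shows what an LLM-as-judge triage pass might look like. It assumes the OpenAI Python client; the model name, prompt wording, and JSON output fields are placeholder choices rather than part of the framework.

```python
# Minimal LLM-as-judge sketch for triaging conversations into P0/P1/P2.
# Assumes an OpenAI-style chat client; the model, prompt, and output schema
# are illustrative, not prescribed by the framework.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are evaluating a conversation between a user and an AI agent.
Classify the most severe issue present using these tiers:
- P0: system failure (no output, nonsensical output, factual/math error, exposed PII,
  false task completion, missing record or field)
- P1: user intent not met (irrelevant, not grounded, ignored prior input or implicit
  context, not comprehensive, misunderstood intent)
- P2: limited value (latency complaints, tone/style, noisy or imprecise responses,
  deflection to self-service, missing actionability)
- NONE: no issue observed
Respond as JSON: {"tier": "...", "failure": "...", "evidence": "..."}"""


def judge_conversation(transcript: str) -> dict:
    """Ask the judge model to label the most severe failure in a transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)
```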

2. Conduct sentiment analysis

Use sentiment analysis to identify negative value issues expressed by users that traditional testing isn’t picking up. Phrases like, “That’s not right” or “It’s missing X” are critical pieces of feedback. Monitoring this sentiment, especially in multi-turn conversations, is key to diagnosing P1 and P2 issues in the wild.
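
A lightweight way to start is to flag known negative-value phrasing in user turns and route those conversations for review. The sketch below is a minimal example; the phrase patterns are a small illustrative sample, not an exhaustive lexicon, and a production pipeline would pair this with a proper sentiment model or the LLM-as-judge pass above.

```python
# Surface negative-value signals in multi-turn conversations by matching
# user phrasing against a small, illustrative pattern list.
import re

NEGATIVE_VALUE_PATTERNS = [
    r"\bthat'?s not right\b",
    r"\bit'?s missing\b",
    r"\bdoesn'?t (work|understand)\b",
    r"\bnot what i (asked|meant)\b",
    r"\btried that already\b",
]


def flag_negative_turns(turns: list[dict]) -> list[dict]:
    """Return user turns that contain negative-value phrasing."""
    flagged = []
    for turn in turns:
        if turn.get("role") != "user":
            continue
        text = turn.get("content", "").lower()
        hits = [p for p in NEGATIVE_VALUE_PATTERNS if re.search(p, text)]
        if hits:
            flagged.append({"turn": turn, "matched_patterns": hits})
    return flagged


# Example usage on a short multi-turn exchange
conversation = [
    {"role": "user", "content": "What's our travel expense policy?"},
    {"role": "assistant", "content": "Please see the internal portal."},
    {"role": "user", "content": "That's not right, I need the actual limits."},
]
print(flag_negative_turns(conversation))
```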

3. Power up prompts

Vague prompts lead to P1 and P2 failures. Enable agents to clarify ambiguous prompts, a feature that not only improves output quality but also teaches the user how to write clearer, more effective prompts, ultimately reducing agent abandonment.
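
One possible implementation pairs a system-prompt instruction with a simple gate that routes vague requests to a clarifying question rather than a guess. The wording and the word-count heuristic below are illustrative assumptions, not a prescribed design.

```python
# Sketch of a clarification step: instruct the agent to ask one targeted
# question when a request is ambiguous, and gate obviously vague prompts.
CLARIFICATION_INSTRUCTIONS = (
    "If the user's request is ambiguous (missing a timeframe, record, audience, "
    "or desired format), do not guess. Ask exactly one clarifying question that "
    "names the missing detail, then wait for the answer before proceeding."
)  # would be appended to the agent's system prompt


def needs_clarification(prompt: str, min_words: int = 4) -> bool:
    """Crude heuristic: very short prompts tend to be too vague to act on."""
    return len(prompt.split()) < min_words


# Example: "summarize this" is routed to a clarifying question rather than
# an immediate, likely-misaligned answer.
if needs_clarification("summarize this"):
    print("Ask: 'Which document should I summarize, and for what audience?'")
```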

4. Clearly define agent scope

Manage user expectations by clearly defining what the agent can and can’t do for them up front. For queries that fall outside its domain, program the agent to recommend alternative tools or hand-offs. This small act of transparency prevents frustration and builds enduring trust.
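
A minimal sketch of this idea, using hypothetical supported topics and hand-off targets, might look like the following.

```python
# Explicit scope handling: declare supported domains up front and route
# out-of-scope queries to a named alternative instead of a dead end.
AGENT_SCOPE = {
    "supported": ["expense policy", "travel booking", "per diem rates"],
    "handoffs": {
        "payroll": "Payroll HelpDesk agent",
        "benefits": "HR Benefits portal",
    },
}


def route(query: str) -> str:
    """Answer in-scope queries; otherwise recommend a concrete alternative."""
    q = query.lower()
    if any(topic in q for topic in AGENT_SCOPE["supported"]):
        return "IN_SCOPE: proceed with retrieval and answer generation."
    for topic, target in AGENT_SCOPE["handoffs"].items():
        if topic in q:
            return f"OUT_OF_SCOPE: I can't help with {topic}, but {target} can."
    return "OUT_OF_SCOPE: explain the agent's scope and list what it can do."


print(route("What are the per diem rates for Japan?"))
print(route("Why is my payroll deduction wrong?"))
```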

Designing for trust and adoption

The future of agentic AI won’t be decided by a technical score; it will be decided by user trust and perceived value. By shifting our focus from pure accuracy to the user’s perception of what’s worth it, we can design, build, and deploy agents that don’t just work, but become indispensable tools that users adopt and champion.

