Your AI Agent Works, But Do Your Users Think It’s Worth It?

Design research shows that improving your users’ perception of value is the key to adoption. Learning from AI agent pain points can help.
AI agents are quickly becoming the backbone of enterprise productivity. They promise faster resolution times, better efficiency, and happier customers. But those of us designing agents face a critical challenge: a technically “working” agent, with excellent system accuracy scores, can still be perceived by the user as not providing value, or even not worth using.
Our latest internal research on end user perspectives highlights a crucial gap. Users often don’t have the technical language to describe a specific issue they face when using an agent. Instead, they share generic complaints like:
- “It’s wrong.”
- “It doesn’t understand.”
- “It’s missing something.”
The true measure of success isn’t the model’s performance on a benchmark, but the user’s perception of its value and the trust they place in it.
Here’s what we’ll cover:
- What users mean when they say ‘it doesn’t work’
- Three tiers of agent failure
- How to triage user issues and increase trust
- Designing for trust and adoption
What users mean when they say ‘it doesn’t work’
As agents become more widely available to end users, the definition of a “successful” agent has broadened beyond mere model accuracy. For instance, an agent that is technically accurate but unhelpful in practice will ultimately be abandoned by the end user.
Let’s look at an example of an interaction:
End user question | Agent response
---|---
“What is our official company policy on expense reporting for international travel?” | “For a detailed, up-to-date answer on international expense policy, please refer to the official ‘Global Travel & Expense Policy’ located on the internal company portal.”
This output is technically “successful” because the agent isn’t connected to this data source and correctly redirects the user to where the information can be found. However, the user must now take manual steps to locate the answer, leading to a perception that the agent isn’t useful.
By evaluating what users truly mean when they report an agent isn’t performing as expected, we can identify critical failure points and value issues that technical systems and model benchmarks aren’t equipped to detect.
“It doesn’t work” | “It’s wrong” | “It doesn’t understand” | “It’s missing something” | “It’s not worth using”
---|---|---|---|---
Helpful/unhelpful error | Math error | Irrelevant output | Missing a record/field | Latency
Doesn’t return anything | Factual error | Not grounded appropriately | Missing needed functionality | Missing actionability
Output is nonsensical | Internal inconsistency | Not aligned with policy/best practices | Not comprehensive | Deflection to self-service
Exposes PII or other sensitive information | Contradicts capabilities | Didn’t understand input intent | Unsupported prompt style | |
False action/task completion | Tone/style | | | |
Responses are too noisy/lack precision | | | | |
Three tiers of agent failure
To help our customers triage issues faster, we developed a User Failure Points Framework by analyzing 2,000 multi-turn conversations between users and agents, then mapping specific root-cause technical issues back to the generic complaints users report.
The framework categorizes user issues into three types, aligned to tiers of severity that directly impact task progression and user trust.
- P0: System Failures. These are the highest severity issues. A P0 failure means the agent fails to work as expected, blocking task progression and severely damaging user trust.
- P1: User Intent Not Met. In these cases, the agent delivers an output that’s misaligned with the user’s original intent. While the system may be technically functional, a P1 failure blocks task progression and causes user frustration.
- P2: Limited Value. The agent is functional, but the output is of low perceived quality or low usefulness. These failures lead to the agent being labeled as “not worth using” because they force the user to correct, edit, or re-prompt too often.
P0: System Failures | P1: User Intent Not Met | P2: Limited Value
---|---|---
These failures block task progression | These failures block task progression | These failures create low perceptions of value
Helpful or unhelpful error | Missing needed functionality | Latency
Doesn’t return anything | Ignored prior input | Tone/style
Output is nonsensical | Internal inconsistency | Responses are too noisy or lack precision
Exposes PII or other sensitive information | Irrelevant output | Deflection to self-service
Math error | Not grounded appropriately | Missing actionability
Factual error | Not aligned with policy or best practices | |
Contradicts capabilities | Didn’t understand input intent | |
False action or task completion | Implicit context ignored | |
Missing a record or field | Not comprehensive | |
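To make the framework easier to apply to logged feedback, the taxonomy above can be encoded as a simple severity lookup. The sketch below is illustrative only; the key names and the `triage` helper are our own shorthand, not part of the framework itself.

```python
from enum import Enum


class Tier(str, Enum):
    """Severity tiers from the User Failure Points Framework."""
    P0_SYSTEM_FAILURE = "P0"   # blocks task progression, severely damages trust
    P1_INTENT_NOT_MET = "P1"   # output misaligned with the user's intent
    P2_LIMITED_VALUE = "P2"    # functional, but low perceived value


# Failure point -> tier, mirroring the table above (names are hypothetical shorthand).
FAILURE_TIERS = {
    "helpful_or_unhelpful_error": Tier.P0_SYSTEM_FAILURE,
    "no_response": Tier.P0_SYSTEM_FAILURE,
    "nonsensical_output": Tier.P0_SYSTEM_FAILURE,
    "pii_exposure": Tier.P0_SYSTEM_FAILURE,
    "math_error": Tier.P0_SYSTEM_FAILURE,
    "factual_error": Tier.P0_SYSTEM_FAILURE,
    "contradicts_capabilities": Tier.P0_SYSTEM_FAILURE,
    "false_task_completion": Tier.P0_SYSTEM_FAILURE,
    "missing_record_or_field": Tier.P0_SYSTEM_FAILURE,
    "missing_functionality": Tier.P1_INTENT_NOT_MET,
    "ignored_prior_input": Tier.P1_INTENT_NOT_MET,
    "internal_inconsistency": Tier.P1_INTENT_NOT_MET,
    "irrelevant_output": Tier.P1_INTENT_NOT_MET,
    "not_grounded": Tier.P1_INTENT_NOT_MET,
    "policy_misalignment": Tier.P1_INTENT_NOT_MET,
    "misunderstood_intent": Tier.P1_INTENT_NOT_MET,
    "implicit_context_ignored": Tier.P1_INTENT_NOT_MET,
    "not_comprehensive": Tier.P1_INTENT_NOT_MET,
    "latency": Tier.P2_LIMITED_VALUE,
    "tone_or_style": Tier.P2_LIMITED_VALUE,
    "noisy_or_imprecise": Tier.P2_LIMITED_VALUE,
    "deflection_to_self_service": Tier.P2_LIMITED_VALUE,
    "missing_actionability": Tier.P2_LIMITED_VALUE,
}


def triage(failure_points: list[str]) -> Tier | None:
    """Return the most severe tier present in a set of tagged failure points."""
    tiers = [FAILURE_TIERS[f] for f in failure_points if f in FAILURE_TIERS]
    return min(tiers, key=lambda t: t.value) if tiers else None
```

Tagging each reported issue this way makes it straightforward to count how many P0s block a release versus how many P2s are quietly eroding perceived value over time.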
How to triage user issues and increase trust
Understanding this taxonomy is the first step. The next is applying it to your agent development lifecycle to build trust and increase adoption.
1. Diagnose and triage failures
When P0 System Failures are absent but users are still reporting issues, use the Failure Points Taxonomy to speed up issue diagnosis during testing. To scale this work, an LLM-as-judge evaluation can more consistently surface the subtler P1 (User Intent) and P2 (Limited Value) failures, as sketched below.
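Here is a minimal sketch of that LLM-as-judge pass, assuming an OpenAI-style chat completions client; the rubric wording, model name, and output schema are illustrative, not a prescribed evaluation setup.

```python
import json

from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completions client works

client = OpenAI()

JUDGE_RUBRIC = """You are reviewing a conversation between a user and an AI agent.
Classify the most severe issue you observe:
- P0: system failure (error, no response, nonsensical output, PII exposure, factual or math error)
- P1: user intent not met (irrelevant, ungrounded, ignored prior input, not comprehensive)
- P2: limited value (latency complaints, tone/style, noisy output, deflection to self-service)
- OK: no issue observed
Respond with JSON: {"tier": "...", "failure_point": "...", "evidence": "..."}"""


def judge_conversation(transcript: str, model: str = "gpt-4o") -> dict:
    """Ask a judge model to label one multi-turn transcript with a failure tier."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},  # keep the verdict machine-readable
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

Run over a sample of logged conversations, this gives a rough P0/P1/P2 distribution that can be compared release over release and spot-checked by humans.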
2. Conduct sentiment analysis
Use sentiment analysis to identify negative value issues expressed by users that traditional testing isn’t picking up. Phrases like, “That’s not right” or “It’s missing X” are critical pieces of feedback. Monitoring this sentiment, especially in multi-turn conversations, is key to diagnosing P1 and P2 issues in the wild.
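Before (or alongside) a full sentiment model, a lightweight pass like the sketch below can flag the correction and complaint phrases that signal P1 and P2 issues in multi-turn transcripts; the phrase list is illustrative and should be tuned to your own users’ language.

```python
import re

# Illustrative correction/complaint phrases that often signal P1 or P2 issues.
NEGATIVE_VALUE_PATTERNS = [
    r"\bthat'?s (wrong|not right|incorrect)\b",
    r"\bit'?s missing\b",
    r"\b(doesn'?t|didn'?t) understand\b",
    r"\bnot what i (asked|meant)\b",
    r"\btry again\b",
    r"\bnot (helpful|useful)\b",
]


def flag_negative_turns(turns: list[dict]) -> list[dict]:
    """Return user turns whose text matches a known complaint pattern.

    `turns` is a list of {"role": ..., "text": ...} dicts for one conversation.
    """
    return [
        turn
        for turn in turns
        if turn["role"] == "user"
        and any(re.search(p, turn["text"], re.IGNORECASE) for p in NEGATIVE_VALUE_PATTERNS)
    ]
```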
3. Power up prompts
Vague prompts lead to P1 and P2 failures. Enable agents to clarify ambiguous prompts, a feature that not only improves output quality but also teaches the user how to write clearer, more effective prompts, ultimately reducing agent abandonment.
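How clarification is wired up depends on your agent framework; as one hedged sketch, the policy can be as simple as an instruction in the system prompt plus a check on the reply, with the prompt text and the `CLARIFY:` convention below purely illustrative.

```python
# Illustrative instruction added to the agent's system prompt.
CLARIFICATION_POLICY = """If the user's request is ambiguous or missing key details
(for example, a date range, account, or system name), do NOT guess.
Ask one concise clarifying question and prefix it with 'CLARIFY:'."""


def needs_clarification(agent_reply: str) -> bool:
    """Detect replies where the agent chose to ask a question instead of answering."""
    return agent_reply.strip().startswith("CLARIFY:")


# In the conversation loop, a CLARIFY: reply is shown to the user as a question rather
# than a final answer, and their response is appended to the context before re-prompting.
```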
4. Clearly define agent scope
Manage user expectations by clearly defining what the agent can and can’t do for them up front. For queries that fall outside its domain, program the agent to recommend alternative tools or hand-offs. This small act of transparency prevents frustration and builds enduring trust.
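One hedged sketch of such a scope gate follows; the supported domains and hand-off targets are placeholders for whatever your deployment actually covers.

```python
# Illustrative scope definition: supported domains, plus where to send everything else.
AGENT_SCOPE = {
    "supported": {"expense_reporting", "travel_booking", "it_access_requests"},
    "handoffs": {
        "payroll": "payroll help desk",
        "benefits": "HR service desk",
    },
}


def out_of_scope_reply(topic: str) -> str | None:
    """Return a hand-off message for out-of-scope topics, or None if the agent can handle it."""
    if topic in AGENT_SCOPE["supported"]:
        return None
    handoff = AGENT_SCOPE["handoffs"].get(topic)
    if handoff:
        return f"I can't help with {topic} yet. Please reach out to the {handoff}."
    return ("That's outside what I can do today. I can help with expense reporting, "
            "travel booking, and IT access requests.")
```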
Designing for trust and adoption
The future of agentic AI won’t be decided by a technical score; it will be decided by user trust and value. By shifting our focus from pure accuracy to the user’s perception of what’s worth it, we can design, build, and deploy agents that don’t just work, but become indispensable tools that users will adopt and champion.