In the world of AI agents that click, scroll, execute, and automate, we’re moving fast from “just understand text” to “actually use software for you.” The new benchmark SCUBA tackles exactly that: how well can agents carry out real enterprise workflows inside the Salesforce platform?
What makes SCUBA stand out:

- It’s built around the actual workflows inside the Salesforce platform.
- It covers 300 task instances derived from real user interviews (platform admins, sales reps, and service agents).
- The tasks test not just “does the model answer the question” but “can the agent use the UI, manipulate data, trigger workflows, and troubleshoot issues” (see the illustrative task sketch below).
- It addresses a gap: current benchmarks often focus on general web navigation and desktop software manipulation, while “computer use” inside enterprise software is hard to measure. SCUBA aims to fill that gap.
Key Takeaway: If you want agents that don’t just chat, but act in business software, this is a big step.
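To make the flavor of these tasks concrete, here is a minimal sketch of what a computer-use task specification could look like. The field names, success check, and example URL are illustrative assumptions, not SCUBA’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class UITask:
    """Illustrative computer-use task spec (hypothetical fields, not SCUBA's actual schema)."""
    task_id: str
    persona: str                 # e.g. "platform admin", "sales rep", "service agent"
    instruction: str             # natural-language goal handed to the agent
    start_url: str               # page where the episode begins
    success_checks: list = field(default_factory=list)  # conditions verified against app state afterwards

# A made-up example in the spirit of the admin/sales/service personas.
example = UITask(
    task_id="sales-001",
    persona="sales rep",
    instruction="Create an Opportunity named 'Acme Renewal' with stage 'Prospecting'.",
    start_url="https://example.my.salesforce.com/lightning/o/Opportunity/list",
    success_checks=["an Opportunity named 'Acme Renewal' exists with stage 'Prospecting'"],
)
```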
The Business Impact
Imagine an AI assistant that can navigate your CRM, update records, launch workflows, interpret dashboard failures, and help your service team get unstuck. That’s the vision this paper leans into.
Here’s why it’s compelling:
- Enterprise alignment: Many benchmarks are academic or consumer-web oriented. SCUBA puts the spotlight on business-critical environments (admin, sales, and service).
- Realistic tasks: By deriving tasks from user interviews and genuine personas, it bridges the gap between “toy benchmark” and “live user situation.”
- Measurable agent performance in context: It enables evaluation of how well an agent operates inside software systems, not just via text.
- Roadmap for future AI assistants: As more organizations adopt AI to automate software use (not just analysis), benchmarks like this set expectations, highlight challenges, and direct progress.
For businesses like Salesforce (and their customers), the implications are clear: better agent tooling, fewer manual clicks, faster issue resolution, and more efficient sales and service teams. For the AI community, it marks a new frontier of “task execution in the UI” rather than “just text reasoning.”

Key Insights:
1. Real-world domain shift is hard
The drop in success rates when moving from the more generic OSWorld benchmark (which covers general desktop applications) to SCUBA (CRM and enterprise workflows) is significant; the report charts this drop across agents.

2. Demonstrations help
Knowledge articles and tutorials on how to use the Salesforce platform are easy to find. A natural question is whether AI agents can leverage this information as effectively as humans do. The experimental results reveal that:
- Human demonstrations (showing the agent how a similar task was done) improved performance across most agents: higher success rates, shorter completion times, and lower token usage (see the technical report for details). However, some agents did not benefit as much.
- Some agents also ended up taking more steps in the demonstration-augmented setting (for example, by going off-script to chase “shortcuts” the human demo didn’t show), so the design of demonstrations still matters. A minimal sketch of demonstration-augmented prompting follows this list.
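As a rough illustration of how a recorded demonstration might be supplied to an agent, here is a minimal sketch that simply prepends a human trajectory to the agent’s prompt. The helper name, demo format, and example steps are assumptions for illustration, not the paper’s actual pipeline.

```python
from typing import Optional

def build_prompt(instruction: str, demo_steps: Optional[list] = None) -> str:
    """Compose an agent prompt, optionally prefixed with a recorded human demonstration.

    `demo_steps` is a hypothetical list of short action descriptions,
    e.g. "open the Cases list view" or "change Status to 'Escalated'".
    """
    parts = []
    if demo_steps:
        parts.append("Here is how a human completed a similar task:")
        parts.extend(f"  {i}. {step}" for i, step in enumerate(demo_steps, start=1))
        parts.append("Use this as guidance, but adapt to the current UI state.")
    parts.append(f"Your task: {instruction}")
    return "\n".join(parts)

# Example: augmenting a service-agent task with a four-step demo.
print(build_prompt(
    "Escalate case 00001234 to Tier 2 support.",
    demo_steps=["open the Cases list view", "open the target case",
                "change Status to 'Escalated'", "save the record"],
))
```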

3. Cost, latency, and practical deployment matter
- Success rate is not the only metric; latency (time to complete tasks) and cost (API/token spend, number of steps) are also reported. For instance, browser-use agents had high success rates but higher latency (due to API service response times and a multi-agent framework design).
- Demonstration augmentation not only improves success but can also reduce time and cost (the paper reports roughly 13% less time and 16% lower cost in the demonstration-augmented setting).
- For enterprise adoption, this matters: an agent that succeeds but is too slow or too costly may be less useful in practice. A minimal sketch of this kind of bookkeeping follows this list.
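For teams evaluating agents in-house, the bookkeeping itself is easy to sketch. The run records, fields, and per-token price below are hypothetical placeholders, not numbers from the paper.

```python
from statistics import mean

# Hypothetical per-episode run records: success flag, wall-clock seconds, UI steps, tokens used.
runs = [
    {"success": True,  "seconds": 95.0,  "steps": 14, "tokens": 22_000},
    {"success": False, "seconds": 240.0, "steps": 31, "tokens": 48_000},
    {"success": True,  "seconds": 120.0, "steps": 18, "tokens": 27_500},
]

PRICE_PER_1K_TOKENS = 0.01  # placeholder rate, not an actual provider price

success_rate = mean(r["success"] for r in runs)
avg_latency = mean(r["seconds"] for r in runs)
avg_steps = mean(r["steps"] for r in runs)
avg_cost = mean(r["tokens"] / 1000 * PRICE_PER_1K_TOKENS for r in runs)

print(f"success rate: {success_rate:.0%} | avg latency: {avg_latency:.0f}s | "
      f"avg steps: {avg_steps:.1f} | avg cost per episode: ${avg_cost:.2f}")
```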
Implications for the Future of CRM Automation:
- Training data will shift toward UI/action context: Rather than only text datasets, we’ll see more benchmarks and datasets built around agent-performed action sequences in software (click → fill → submit).
- Enterprise software UX will matter for AI: As agents navigate interfaces, software products themselves may evolve to be more “agent-friendly” (e.g., more structured actions, better logs, agent-observable state; see the sketch after this list).
- New kinds of robustness challenges: Agents will have to handle UI changes, versioning, error states, permissions — things that are less common in typical NLP benchmarks.
- Hybrid models and demonstration pipelines will become commonplace: As the experiments show, demonstrations help. Enterprises might build libraries of “how to” agent episodes for each workflow.
- Track more than success: Also track latency, number of steps, cost (tokens/API calls), and error recovery; these matter in practice.
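As one way to picture “agent-friendly” software, here is a hypothetical sketch of a structured action surface an application could expose alongside its UI, returning machine-readable results and observable state. The class and method names are illustrative assumptions, not an existing Salesforce API.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class ActionResult:
    """Outcome of a structured action: machine-readable status plus state the agent can observe directly."""
    ok: bool
    message: str
    observable_state: Dict[str, Any]

class AgentFriendlyApp:
    """Hypothetical wrapper exposing typed actions and observable state instead of raw pixels and clicks."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def update_record(self, record_id: str, fields: Dict[str, Any]) -> ActionResult:
        record = self._records.setdefault(record_id, {})
        record.update(fields)
        return ActionResult(ok=True, message=f"updated {record_id}",
                            observable_state={"record": dict(record)})

# The agent can act and then verify the result directly, instead of re-reading the screen.
app = AgentFriendlyApp()
result = app.update_record("case-00001234", {"Status": "Escalated"})
assert result.ok and result.observable_state["record"]["Status"] == "Escalated"
```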
Dive deeper into the research: SCUBA: Salesforce Computer Use Benchmark
Explore the benchmark: https://sfrcua.github.io/SCUBA/