A new approach to enterprise graphical user interface (GUI) automation learns from one human demonstration and replays workflows with deterministic precision, delivering the reliability and privacy that enterprise operations demand.
Ask any veteran operations manager about their daily workflow and they’ll tell you: half of it is muscle memory. The same screens, the same clicks, the same sequences, repeated until the motions become invisible. Across every enterprise, millions of knowledge workers carry this muscle memory through expense approvals routed on legacy portals, patient records transferred between systems that were never designed to talk to each other, inventory updates entered manually because no API exists. The accumulated expertise is real, but it lives in human hands, and it doesn’t scale.
For decades, the industry’s answer has been Robotic Process Automation (RPA): scripted bots that replay a fixed sequence of actions on a graphical user interface. But RPA is brittle — a single UI update, a shifted button, a changed label, and the script breaks. Even deploying an RPA workflow demands manual labeling, scripting, and often specialized programming — and maintaining these automations can cost more in developer hours than the manual work they replace. More recently, large vision-language models (VLMs) have entered the picture. These AI-powered agents can interpret screenshots, reason about what they see, and take action. The flexibility is impressive. But flexibility introduces a new risk: non-determinism. A VLM-based agent might complete a workflow correctly nine times, then hallucinate an action on the tenth. For mission-critical enterprise processes, where a single errant click can trigger a compliance failure or corrupt a patient record, that unpredictability is a dealbreaker. And because these models typically stream sensitive screenshots to external cloud APIs for inference, they introduce data privacy risks that many regulated industries simply cannot accept.
Today, our team at Salesforce AI Research is introducing GPA: GUI Process Automation. Show it a workflow once, through a single recorded demonstration, and it learns to replay that workflow reliably, deterministically, and entirely on your local machine. No cloud calls. No stochastic guesswork. No brittle scripts to maintain.
The Enterprise Automation Gap
The tension between flexibility and reliability has defined the automation landscape for years. Traditional RPA delivers consistency but demands skilled developers to build and maintain every script. When the target application updates its interface, those scripts often need to be rebuilt from scratch. VLM-based agents offer adaptability but introduce the inconsistency that enterprise workflows cannot tolerate.
This tension maps directly to what Salesforce AI Research calls the Capability-Consistency Matrix, the framework at the heart of our Enterprise General Intelligence (EGI) vision. High capability without high consistency produces a system that occasionally impresses but cannot be trusted with real business operations. High consistency without capability is reliable but limited. Enterprise AI demands both: systems that perform complex tasks with dependable precision.
GPA is designed to operate in that high-capability, high-consistency space for GUI-based workflows. It combines demonstration-based learning with a matching algorithm grounded in geometry and statistics, not generative probability. The result: automation that adapts to minor UI changes while maintaining deterministic execution that enterprise environments require.
How GPA Works: Record, Build, Replay
The core insight behind GPA is deceptively simple. When a human performs a task on a graphical interface, they don’t memorize pixel coordinates. They recognize spatial relationships.
A “Submit” button sits below a form. A search bar appears next to a logo. A dropdown menu lives to the right of a label. Humans locate elements by context, and GPA does the same.
The system operates in three steps.
Step 1: Record a demo. A user performs the workflow once while GPA records the sequence. No scripting, no labeling, no selector inspection required. GPA captures every mouse click and keyboard entry, then automatically builds a replayable action graph from the recording.
Step 2: Build the workflow. Through post-processing and LLM analysis, GPA constructs a reusable workflow template with interchangeable variables. As each action is processed, the system builds a visual graph of the interface: every button, text field, icon, and label becomes a node, connected to its neighbors by spatial proximity. GPA captures not just the target element for each step, but the constellation of surrounding elements that give it context. Users can scroll through each step to verify it, and refine the workflow with a built-in chatbot if needed.
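The spatial graph described in Step 2 can be sketched in a few lines of Python. This is an illustrative toy, not GPA's actual data model: the element fields, labels, and the fixed pixel radius for "proximity" are all assumptions made for the example.

```python
# Illustrative sketch of a proximity graph over UI elements.
# Fields, labels, and the 200 px radius are assumptions for this example.
from dataclasses import dataclass, field
from itertools import combinations
import math

@dataclass(frozen=True)
class Element:
    label: str   # e.g. "Submit", "Search"
    x: float     # center coordinates on screen
    y: float

@dataclass
class UIGraph:
    nodes: list
    edges: dict = field(default_factory=dict)  # label -> set of neighbor labels

    @classmethod
    def from_elements(cls, elements, radius=200.0):
        """Connect each element to every other element within `radius` pixels."""
        edges = {e.label: set() for e in elements}
        for a, b in combinations(elements, 2):
            if math.hypot(a.x - b.x, a.y - b.y) <= radius:
                edges[a.label].add(b.label)
                edges[b.label].add(a.label)
        return cls(list(elements), edges)

elements = [Element("Submit", 400, 500), Element("Form", 400, 350),
            Element("Logo", 50, 40), Element("Search", 220, 40)]
graph = UIGraph.from_elements(elements)
# "Submit" neighbors "Form"; "Logo" neighbors "Search"; distant pairs stay unconnected.
```

Each recorded step's target element is stored together with its neighborhood in this graph, which is what later makes context-based matching possible.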
Step 3: Replay and integrate. GPA replays the workflow deterministically, matching each step’s target element against the current state of the interface using its graph-based matching framework. When the match is straightforward, a direct comparison identifies the element instantly. When ambiguity arises (because the window has been resized, an element has shifted, or identical elements appear on screen, such as multiple checkboxes in a table where the target can’t be identified in isolation), GPA falls back on a geometric inference method that uses stable surrounding elements as anchors to triangulate the target’s new position. Think of it as navigating by landmarks rather than street addresses: even if the destination has moved, the landmarks guide you there.
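The landmark idea can be illustrated with a minimal sketch: re-apply each recorded anchor-to-target offset from the anchor's current position, then average the estimates. The function name and the simple averaging scheme are assumptions for this example; GPA's actual geometric inference may differ.

```python
# Hedged sketch of anchor-based target inference: each stable anchor "votes"
# for where the target should be, and the votes are averaged.
def infer_target(recorded_anchors, recorded_target, current_anchors):
    """
    recorded_anchors / current_anchors: {name: (x, y)} positions of stable
    surrounding elements at record time and at replay time.
    recorded_target: (x, y) of the target element at record time.
    Returns the estimated (x, y) of the target now.
    """
    estimates = []
    for name, (ax, ay) in current_anchors.items():
        rx, ry = recorded_anchors[name]
        # Offset from this anchor to the target at record time,
        # re-applied from the anchor's current position.
        estimates.append((ax + (recorded_target[0] - rx),
                          ay + (recorded_target[1] - ry)))
    n = len(estimates)
    return (sum(x for x, _ in estimates) / n,
            sum(y for _, y in estimates) / n)

# Example: the whole toolbar shifted 30 px right and 10 px down.
recorded = {"Logo": (50, 40), "Search": (220, 40)}
current  = {"Logo": (80, 50), "Search": (250, 50)}
print(infer_target(recorded, (400, 40), current))  # -> (430.0, 50.0)
```

Because every anchor agrees in this example, the estimate is exact; in practice, averaging over several anchors makes the inference robust when some elements have moved independently.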
Every recorded workflow can also be exposed as a Model Context Protocol (MCP) or CLI tool, meaning any AI agent, whether built on an internal LLM or on platforms like Agentforce, can invoke a GPA workflow as a safe, bounded action within a larger agentic pipeline.
GPA includes a built-in readiness check. Before executing any action, the system evaluates its confidence in the match. If confidence falls below a defined threshold, GPA pauses rather than guessing. This “know when not to act” discipline reflects a principle we emphasize across our research: AI systems must acknowledge their limitations and seek human guidance when uncertainty is high.
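The readiness check amounts to a confidence gate before every action. The sketch below assumes a numeric match score in [0, 1]; the threshold value, exception name, and escalation behavior are illustrative, not GPA's implementation.

```python
# Sketch of a "know when not to act" gate. Threshold and names are
# illustrative assumptions, not GPA internals.
class LowConfidenceError(Exception):
    """Raised to pause the workflow and request human guidance."""

def execute_step(action, match_score, threshold=0.9):
    if match_score < threshold:
        # Pausing beats guessing: an errant click in an enterprise
        # workflow costs more than asking a human for help.
        raise LowConfidenceError(
            f"Match confidence {match_score:.2f} below {threshold}; "
            f"pausing before '{action}' for human review.")
    return f"executed: {action}"

print(execute_step("click Submit", 0.97))   # high confidence: proceed
try:
    execute_step("click Submit", 0.62)      # low confidence: pause instead
except LowConfidenceError as e:
    print(e)
```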
Why Local, Why Deterministic, Why Now
Three design choices make GPA particularly relevant for enterprise deployment.
Privacy. GPA runs entirely on the local machine. Screenshots, workflow recordings, and all processing stay on-device using lightweight local models rather than large cloud-hosted language models. Sensitive business data, whether patient records, financial transactions, or proprietary workflows, never leaves the user’s environment. Local execution eliminates an entire category of risk.
Determinism. Unlike VLM-based agents that generate actions through probabilistic sampling, GPA’s graph-matching approach produces the same output given the same input. For workflows that must be auditable, repeatable, and explainable, this matters enormously. When a regulator asks how an automated process reached a particular action, the answer is traceable and verifiable.
Speed and cost. The core matching algorithm runs in milliseconds. For organizations processing thousands of repetitive GUI tasks daily, the difference in latency and compute cost compared to round-trip cloud API calls compounds quickly.
In real-world deployments, we expect these approaches to coexist: VLM agents handling novel and exploratory tasks, GPA handling the high-volume, high-stakes repetitive work that constitutes the backbone of enterprise operations.
From Standalone Tool to Agentic Building Block
GPA was designed to function as more than a standalone automation tool. Because every recorded workflow is exposed over MCP, an orchestrating agent can treat it as a safe, bounded action inside a larger coordinated process.
The future of enterprise AI is not a single monolithic agent. It is an ecosystem of specialized capabilities, some powered by large language models, some by deterministic tools, all coordinated by orchestration layers that match the right tool to the right task. GPA occupies a specific and valuable niche in that ecosystem: the layer where an agent needs to interact with a graphical interface reliably, without the overhead or risk of a full VLM inference for every click.
Consider a practical scenario our team has demonstrated. A user sends a message in Slack asking an AI assistant to schedule a meeting on Google Calendar with specific participants, date, and time, and to verify the result. The assistant identifies the appropriate GPA workflow from its library, supplies the input variables (participant names, date, time), and hands off execution. GPA takes control of the desktop, navigates to Google Calendar, creates the event, invites each guest (repeating the relevant steps as needed for multiple participants), and returns confirmation, including a summary of other meetings already on the calendar that day. No prompt engineering for each click. No risk of the agent doing something unexpected mid-workflow.
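The hand-off pattern in this scenario can be sketched as a small tool registry: the orchestrating agent discovers a workflow by name, fills in its variables, and invokes it. The registry, tool name, and function below are hypothetical stand-ins for the real MCP or CLI surface.

```python
# Hedged sketch of surfacing a recorded workflow as a bounded, named tool.
# The registry and the schedule_meeting function are illustrative; the real
# system exposes workflows over MCP or as CLI commands.
TOOLS = {}

def tool(name, description):
    """Register a workflow replay function under a stable tool name."""
    def register(fn):
        TOOLS[name] = {"description": description, "run": fn}
        return fn
    return register

@tool("schedule_meeting",
      "Replay the recorded Google Calendar workflow with given variables.")
def schedule_meeting(participants, date, time):
    # The real system would deterministically replay the recorded GUI
    # workflow here; this stub just echoes the bounded action.
    return f"scheduled {date} {time} with {', '.join(participants)}"

# An orchestrating agent discovers and invokes the workflow by name:
result = TOOLS["schedule_meeting"]["run"](["Ada", "Grace"], "2025-03-01", "10:00")
print(result)  # -> scheduled 2025-03-01 10:00 with Ada, Grace
```

The agent never scripts individual clicks; it only supplies variables to a workflow whose behavior is fixed and auditable.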
This pattern, intelligent orchestration at the top and deterministic execution at the bottom, represents how we believe enterprise automation will mature. Agents reason, plan, and coordinate. Tools like GPA execute with precision.
Early Results
In pilot experiments comparing GPA against a VLM-based GUI agent baseline, we evaluated 16 desktop GUI tasks across two difficulty categories defined by the length and complexity of the recorded demonstration. GPA demonstrated strong reliability across task categories while completing workflows significantly faster than the VLM baseline.
These are early results from a focused pilot, and we are continuing to expand our evaluation across a broader set of enterprise applications and edge cases. But the directional signal is clear: for structured, repeatable GUI workflows, demonstration-based automation with geometric matching offers a compelling combination of reliability, speed, and privacy.
Where GPA Fits in the Broader Vision
GPA emerges directly from Salesforce AI Research’s AI Foundry initiative, our innovation program focused on system-level AI: the memory architectures, reasoning engines, simulation environments, and operational tools that transform model capabilities into enterprise-ready systems. Within AI Foundry’s operational intelligence mandate, GPA addresses a persistent pain point: the millions of repetitive GUI interactions that sit between fully API-integrated systems and fully manual processes. The research roadmap ahead includes AI-built workflows where an LLM generates demonstrations autonomously, workflow self-healing that adapts when target UIs change, browser integration for web-based automation, and multi-platform sandbox support for Windows and Linux environments.
Automating the Work That Happens on Screen
A vast amount of enterprise work still happens through graphical interfaces, and reliable automation of those interfaces requires a purpose-built solution. As the industry increasingly explores AI agents that can use computers, the question of how to make that interaction safe, private, deterministic, and enterprise-grade becomes central. GPA offers one answer: learn from a single human demonstration, execute with geometric precision, and always know when to pause and ask for help.
We invite the research community and enterprise practitioners to explore GPA through our landing page, where you will find the full paper, detailed documentation, and live demos showcasing the system in action.
GPA is a research prototype. The roadmap reflects active research directions.
Resources: Salesforce AI Research Website
Follow us on X: @SFResearch
Follow us on Bluesky: @SFResearch