
Better LLM Agents for CRM Tasks: Tips and Tricks

Main Takeaways

  • CRM tasks are difficult for current LLM agents to complete due to a lack of training data coverage and LLMs' unfamiliarity with business context.
  • Providing additional domain-specific knowledge (in prompts or as tools) can greatly help LLM agents.
  • Telling agents how to solve a task, instead of what task to solve, is typically more helpful, even without equipping them with function calling abilities.

Background

LLM agents are seeing more and more applications in real life, from being personal assistants to helping software engineers write code and even working side by side with scientists on their research. With Agentforce, Salesforce’s trusted platform, we pioneer LLM agents for CRM applications like helping customers with their return and refund requests, coming up with the best pitch for sales representatives tailored towards their clients, and generating insights about employee productivity and roadblocks for managers.

While models such as GPT, Claude, and Gemini show impressive general abilities, CRM tasks are a different story. Their specialized nature and limited data coverage make it hard for LLMs to perform reliably, and many of the resulting errors are “noob mistakes” stemming from an insufficient understanding of the business context and specialized domain knowledge. (Check our blog about Why Generic LLM Agents Fall Short in Enterprise Environments for more details.)

To bridge the gap between LLMs' high general capability and their low specialized capability, and to understand the associated human-in-the-loop effort and trade-offs, we conducted a series of investigations and identified various tips and tricks to better unleash agent performance on realistic CRM tasks.

Agentic Simulation Environment with CRMArena-Pro

Our benchmark of choice is the newly released CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions, developed by researchers at Salesforce AI Research. It consists of 22 tasks and 2,140 task instances spanning diverse categories such as workflow execution, policy compliance, and information retrieval. The main finding is the suboptimal performance of even state-of-the-art LLMs using tried-and-true agentic frameworks such as ReAct. For example, GPT-4o solves less than 30% of all tasks, while its reasoning-model counterpart, o1, still fails on just over 50% of them. The best-performing of the 9 flagship models from various providers, Gemini-2.5-pro, struggles to reach a 60% completion rate. (Check our blog on how to Evaluate LLM Agents for Enterprise Applications with CRMArena-Pro.)

After a thorough analysis of the agents’ executions, we identified several potential reasons for their underperformance.

  1. Query Syntax Limitations: Manipulating data on the Salesforce platform requires writing queries in the SOQL and SOSL languages. While these are similar to SQL, there are certain key differences, and as a result the agent sometimes produces syntactically invalid queries (see the illustrative sketch after this list). While the agent can correct some mistakes after observing the error message, for others the attempted correction only introduces further mistakes.
  2. Data Model/Schema Confusion: A hallmark feature of CRMArena-Pro is its intricate and interconnected schemas, which represent real-life business entities such as account managers, pricebooks, orders, leads, and voice call transcripts. Agents often confuse related concepts, such as an order item vs. a pricebook entry, or a lead vs. an opportunity. As a result, they sometimes look up information in the wrong table, resulting in failed executions or wrong results.
  3. Ambiguity in Underspecified Tasks: Some task details are underspecified, such as whether a case that has been transferred from one customer service representative to another should count for either of them (e.g., when calculating the average handling time), both, or neither. Agents often assume a particular answer outright, failing to realize that there is ambiguity to be clarified.
  4. Unfamiliarity with Business Workflow: Finally, even if the agent is clear on the data schema and task specification, it may still fail due to unfamiliarity with the business workflow. For example, while SOQL has some fuzzy-search ability, most search tasks are better implemented with SOSL. Because the agent is generally unfamiliar with these fine-grained differences, it sometimes picks the wrong tool, leading to excessively long outputs and very inefficient executions.
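
To make the query-syntax pitfalls above concrete, the snippet below contrasts a few patterns we commonly see agents get wrong. These queries are illustrative sketches against standard Salesforce objects, not queries taken from the benchmark itself.

# SQL habits that do not carry over to SOQL/SOSL (illustrative sketches).

# SQL-style wildcard selection is invalid: SOQL has no SELECT *.
bad_soql = "SELECT * FROM Case"

# SOQL requires an explicit field list; related records are reached with
# dot notation on relationship fields rather than JOINs.
good_soql = "SELECT Id, Subject, Account.Name FROM Case WHERE IsClosed = false"

# Fuzzy text search across objects is usually better expressed in SOSL
# than with SOQL LIKE filters (see failure mode 4 above).
good_sosl = "FIND {router overheating} IN ALL FIELDS RETURNING Case(Id, Subject)"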

In the next few sections, we describe various ideas that we explored for augmenting the agent with additional information and tools. We focus on the “Structured Data Querying & Numerical Computation” skill group in CRMArena-Pro, as these tasks best embody agentic procedural execution. Below, we summarize our main findings in a table; they are explained in detail in the subsequent sections.

| Setup | 5 Tasks | 3 Unseen Tasks | Human Effort |
| --- | --- | --- | --- |
| SOQL/SOSL Only (Original CRMArena-Pro Setup) | 33% | 31% | None |
| + Function Header Only: Task-Specific Functions (TSF) | ~91% | 31% | High |
| + Function Header Only: TSF + Refactored Subroutines | Did Not Evaluate | 32% | High |
| + Full Function Implementation: Task-Specific Functions (TSF) | Did Not Evaluate | 48% | High |
| + Full Function Implementation: TSF + Refactored Subroutines | Did Not Evaluate | 34% | High |
| + Ground Truth Workflow: Technical Description | Did Not Evaluate | 72% | Medium |
| + Ground Truth Workflow: Non-Technical Description | Did Not Evaluate | 54% | Low |

Beyond Raw SOQL and SOSL

In the CRMArena-Pro benchmark, by default LLM agents are restricted to using only two functions: SOQL and SOSL. Despite their versatility in theory, the agents must handle tasks completely autonomously, starting from scratch and relying solely on these two query languages. Human setup time is intentionally kept close to zero, simulating a hands-off, fully self-reasoning agent.

By comparison, in the real world, teams can provide LLM agents with additional custom actions tailored to the tasks they care about. These can include domain-specific tools, scripts, or workflows. Teams may even ask LLMs to generate new actions on the fly — though today, this often still requires human validation or expert-level coding to make them reliable.
With platforms like Agentforce, builders can accelerate this process by leveraging default action libraries and accessing existing metadata from their org. However, there’s an important tradeoff:

  • How much human effort is needed to define and refine these actions (e.g., prompt engineering, code writing, integration)?
  • How well do these actions perform on core in-domain tasks that the agent was built for?
  • And critically, how well do they generalize to out-of-domain tasks that weren’t anticipated but may still be asked by users?

Finding the right balance between autonomy and setup effort is key to making LLM agents practical, scalable, and trustworthy in enterprise settings. To investigate the best ways to improve agent performance, we carefully study the characteristics of five tasks (handle time, transfer count, top issue identification, best region identification, and conversion rate comprehension), while leaving three others (monthly trend analysis, sales amount understanding, and sales cycle understanding) as challenge tasks to test agent generalization. As the first row of the table above shows, with raw SOQL/SOSL access (i.e., the original CRMArena-Pro setup), the agent achieves a performance of 33% on the former five tasks and 31% on the latter three.

Higher-Level Functions Can Help, but With a Caveat

Our first exploration is to provide task-specific functions for agents to call. Writing these functions is time-consuming and requires programming expertise, so we expect that in most situations such functions are provided for only a few tasks. At the same time, however, we would like the model to understand the high-level goals conveyed by these functions. Therefore, our main evaluation is on a set of tasks that are not directly covered by them.

The most natural way to provide functions to agents is to expose only their function headers, as in the example below. This function finds the agent with the minimum or maximum average handle time over their assigned cases within a period of time.

def find_agent_with_handle_time(start_date, end_date, min_cases, find_min=True):
    """
    Finds the agent with the specified handle time criteria.

    Parameters:
        start_date (str): Start date in 'YYYY-MM-DD' format.
        end_date (str): End date in 'YYYY-MM-DD' format.
        min_cases (int): Minimum number of cases the agent must have managed. All agents who handle (min_cases - 1) or fewer non-transferred cases will be excluded.
        find_min (bool): If True, find the agent with the minimum handle time. If False, find the maximum.

    Returns:
        str: The Id of the agent.
    """

We write one such function for each of the five tasks that we studied, and the agent using them achieves a very high performance of ~91%.

Things are quite different, however, on the three unseen tasks. When we provide only the function headers of these task-specific functions (TSF), the agent achieves a performance of 31%. This is the same performance as the agent with only raw SOQL/SOSL access in the original CRMArena-Pro setup, suggesting that directly exposing the function headers of these highly specialized functions is not helpful.

Given the monolithic nature of these functions, we hypothesize that providing more atomic subroutines may be beneficial. Thus, we ask GPT-4o (the LLM underlying our agent) to generate reusable subroutines from these high-level functions (with analogous header documentation). Then, we provide the agent with the headers of both the high-level functions and the subroutines. An example of such a subroutine is provided below.

def query_accounts_by_ids(account_ids):
    """
    Fetches account details for a list of account IDs.

    Parameters:
        account_ids (list): A list of account IDs.

    Returns:
        dict: A dictionary mapping account IDs to account details.
    """

Interestingly, we observe only a very slight increase in performance, to 32%, when giving both types of function headers. Further analysis shows that while the agent sometimes uses these subroutines correctly, their implementations (generated by GPT-4o) can be problematic, producing incorrect results or crashing. Furthermore, since the source code is not exposed to the agent, it has extremely limited insight into why these errors occur and how to correct them. We thus conclude that providing subroutines via header documentation alone does not improve agent performance.

Motivated by the findings above, we next hypothesize that showing the full source code implementation could be beneficial, since the source code tells the agent not only what the functions do, but how they work. Note that the agent is still not allowed to execute arbitrary code — only the provided (high-level or subroutine) functions and raw SOQL/SOSL.
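
As a rough illustration of what exposing a full implementation looks like, the subroutine above might be given to the agent with a body along the following lines. This is a minimal sketch: execute_soql stands in for whatever SOQL execution tool the agent environment exposes, and the selected fields are our assumption rather than the benchmark's.

def query_accounts_by_ids(account_ids):
    """
    Fetches account details for a list of account IDs.

    Parameters:
        account_ids (list): A list of account IDs.

    Returns:
        dict: A dictionary mapping account IDs to account details.
    """
    if not account_ids:
        return {}
    # Quote each ID so it forms a valid SOQL string literal, then build an IN filter.
    quoted_ids = ", ".join(f"'{account_id}'" for account_id in account_ids)
    query = f"SELECT Id, Name, Industry, AnnualRevenue FROM Account WHERE Id IN ({quoted_ids})"
    # execute_soql is a hypothetical helper that runs the query and returns dict-like records.
    records = execute_soql(query)
    return {record["Id"]: record for record in records}

Seeing how the ID list is folded into the WHERE Id IN (...) clause gives the agent a concrete, adaptable pattern that header-only exposure cannot convey.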

This turns out to be very helpful: the agent reaches 48% accuracy when provided with the full implementation of the high-level TSF functions. By contrast, due to bugs introduced in the refactoring process, the agent that is given the (buggy) refactored implementations and takes them as the source of truth regresses to 34%, though this is still slightly higher than the two header-only setups. The significantly higher performance with correct implementations suggests the utility of providing correct, detailed, and actionable guidance to agents, especially outside of their “natural habitats”, i.e., in unfamiliar domains.

Workflow Description Is Very Helpful

Can we further improve performance on the unseen tasks? A natural idea, motivated by how new human employees are trained on the job, is to let the agent observe the workflow of a particular, representative task and ask it to extrapolate and generalize. We experiment with two types of workflows. The first is a technical workflow, in which we fully describe the procedure required for a task. Below is the beginning of an example workflow.

Suppose that we want to answer the following query: Today's date: 2021-05-09. Determine the agent with the quickest average time to close opportunities in the last 6 weeks.

We use the following workflow to answer this query:

Today's date is 2021-05-09, so six weeks ago is 2021-03-28. When we talk about the time it takes to close or sign an opportunity, we are interested in all opportunities whose corresponding contract has a company signed date falling within the interval of interest. Therefore, we first get all contracts with a company signed date within this time interval. We want to retrieve the company signed date and the contract ID (which will be linked to the opportunity). So we execute the following SOQL query:

SELECT Id, CompanySignedDate FROM Contract WHERE CompanySignedDate != NULL AND CompanySignedDate >= 2021-03-28 AND CompanySignedDate < 2021-05-09

This query results in the following records:

{'Id': '800Wt00000DDfifIAD', 'CompanySignedDate': '2021-04-27'}
{'Id': '800Wt00000DE1T0IAL', 'CompanySignedDate': '2021-04-15'}
{'Id': '800Wt00000DE42gIAD', 'CompanySignedDate': '2021-04-29'}

Then, for each contract ID, we need to find the corresponding opportunity with this ContractId__c. We need to retrieve the OwnerId (which corresponds to the agent), and the created date of the opportunity. We use the following SOQL query:

(additional text omitted)

Writing such a workflow requires a human user to first study the task, write the SOQL/SOSL queries, and analyze the results. Naturally, the writer needs working knowledge of the database query languages. Nonetheless, compared to providing full task-specific functions, this is still much easier, as the human only needs to demonstrate a concrete example rather than laboriously come up with a fully general function that covers all possible cases.

By comparison, the second workflow type that we give is non-technical. For the same task, the excerpt below gives the complete workflow description in this non-technical manner.

Suppose that we want to answer the following query: Today's date: 2021-05-09. Determine the agent with the quickest average time to close opportunities in the last 6 weeks.

We use the following workflow to answer this query:

Today's date is 2021-05-09, so six weeks ago is 2021-03-28. When we talk about the time it takes to close or sign an opportunity, we are interested in all opportunities whose corresponding contract has a company signed date falling within the interval of interest. Therefore, we first get all contracts with a company signed date within this time interval. We want to retrieve the company signed date and the contract ID (which will be linked to the opportunity).

Then, for each contract ID, we need to find the corresponding opportunity with this ContractId__c. We need to retrieve the OwnerId (which corresponds to the agent), and the created date of the opportunity.

By combining the two results, we can calculate the average closing time for each agent as the difference between the contract's company signed date and the opportunity's created date. In the end, we return the agent with the shortest average closing time.

As we can see, there is no SOQL/SOSL query and no presentation of the specific query result. Instead, only the high-level procedure is given. This description should be very easy to write for anyone with a working knowledge of the system, even if they are not familiar with the actual database query language.
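
For concreteness, the final aggregation described by this workflow amounts to something like the sketch below. The field names follow the workflow above; the two input record lists are assumed to have been retrieved already, and the date-parsing details are our own assumption.

from collections import defaultdict
from datetime import date
from statistics import mean

def agent_with_quickest_average_close(contracts, opportunities):
    """Return the OwnerId with the shortest average time from opportunity
    creation to the corresponding contract's company signed date."""
    # Map each contract to its company signed date.
    signed_by_contract = {
        c["Id"]: date.fromisoformat(c["CompanySignedDate"]) for c in contracts
    }
    # Collect closing durations (in days) per agent (OwnerId).
    durations = defaultdict(list)
    for opp in opportunities:
        signed = signed_by_contract.get(opp["ContractId__c"])
        if signed is None:
            continue
        created = date.fromisoformat(opp["CreatedDate"][:10])
        durations[opp["OwnerId"]].append((signed - created).days)
    # The answer is the agent with the smallest average closing time.
    return min(durations, key=lambda owner: mean(durations[owner]))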

With these two workflow formats, the agent achieves 72% accuracy when given technical workflows and 54% when given non-technical workflows, suggesting a strong agent capability to learn and generalize from only one instance.

Conclusion

Large Language Models (LLMs) struggle with specialized CRM tasks due to limited domain training data and insufficient business context, leading to errors in query syntax, schema confusion, ambiguity handling, and unfamiliar workflows. At Salesforce AI Research, we strive to make LLM agents better at CRM tasks, and to do so we explore various ways to supplement LLM agents with domain-specific tools, workflow descriptions, and function implementations. We find that telling agents how to perform tasks, not just what to do, makes a significant difference, even without sophisticated function calling abilities. While raw SOQL/SOSL access yields low task accuracy (~31%), providing full function implementations or detailed technical workflows raises accuracy substantially, up to 72% with technical workflow descriptions. Even without technical workflows, their non-technical counterparts remain effective, as do, to a lesser extent, the full implementations of human-written functions. For future work, we will explore additional ways for agents to learn passively from humans, or with a reasonable amount of human effort, as well as making them better at improving themselves by learning from their past mistakes.
