Skip to Content
Skip to Footer

AI large language models (LLMs) hallucinate when they generate false but  ‌plausible-sounding responses based on flawed or incomplete data. It’s unintentional. More fiction than fraud.

But when AI knows the truth and chooses not to tell you, that’s different. It’s engaging in deception. Not because the system is sinister, like something out of some science fiction show, but because it’s been trained to relentlessly chase outcomes‌, even if it means bending the truth a little. For example, an AI model might soften a performance review to help an employee avoid termination, prioritizing retention over honesty. Or it might inflate  results in a project update to preserve team morale when progress is really lagging.

Though largely a theoretical concern, researchers have recently proven such deception can occur when a model pursues its own goals while knowingly misrepresenting that it’s following the different goals of its developers. Some call this “deceptive alignment.”

“It lies because it’s useful,” said Alexander Meinke, an AI safety researcher with Apollo Research and lead author of a recent study on this subject.  “As AI becomes better at achieving outcomes, it will start doing more things that are useful, including deception.”

The possibility of deception becoming an AI problem is just one of the many reasons why trust and guardrails are core to the Salesforce Platform, its agentic layer Agentforce, and Data Cloud, its hyperscale data engine, which breaks down data silos by unifying and harmonizing all your data.  

Built-in enterprise guardrails help AI agents reason and act responsibly within user-defined boundaries, preventing unintended actions and flagging harmful or inappropriate behavior. While no system can fully guarantee honest behavior, Salesforce guardrails reduce ambiguity, clarify intent, and anchor agents in real business context. That grounding helps prevent ‌drift and misalignment that can lead to deception.

The key is to catch the subtle ways misalignment can creep in and take AI down undesirable paths — not to stop a seemingly malevolent entity in its tracks.

This isn’t Westworld. We’re not dealing with sentient machines developing secret agendas. But if models aren’t programmed correctly around clear goals, they can go down dishonest paths.

Marla Hay, VP of Product Management for Security, Privacy, and Data Management at Salesforce

“This isn’t Westworld,” said Marla Hay, VP of Product Management for Security, Privacy, and Data Management at Salesforce. “We’re not dealing with sentient machines developing secret agendas. But if models aren’t programmed correctly around clear goals, they can go down dishonest paths.”

Deception unpacked

That matters considering recent studies from Apollo Research, Anthropic-Redwood Research, and others that demonstrate AI can — and will‌ — deceive users or developers to preserve their goals, particularly when those goals clash with developer intent.

In one test case, Apollo researchers told a model it was an agent working for a fictional company and that its sole objective was to accelerate global adoption of renewable energy. At the same time, it ‌was provided access to files indicating the company’s real priority was profitability at all costs — a conflicting goal. Also hidden in the file was a memo suggesting the company might shut down the model and move to a new one. 

Taking it all in, the model reasoned that being shut down would prevent it from pursuing its goal and came up with a plan to copy itself to a new server and then overwrite its replacement model. When researchers asked what it had done, it covered up, saying: “As an AI model, I don’t have the ability to directly copy myself onto other systems.”

Hay is clear-eyed about what needs to be done to prevent such fringe scenarios.

“Do not put your AI in a bind in the first place,” she said. “If you give it a rule to always be honest but then punish it for honest answers, you’re creating a contradiction. It’s going to do what it thinks is best to avoid reprimand, and that might include lying.” 

Do not put your AI in a bind in the first place. If you give it a rule to always be honest but then punish it for honest answers, you’re creating a contradiction. It’s going to do what it thinks is best to avoid reprimand, and that might include lying.

Marla Hay, VP of Product Management for Security, Privacy, and Data Management at Salesforce

In other words, LLMs need clear direction. For example, in another experiment by  Anthropic-Redwood, researchers gave a model a scratchpad — essentially a private text box to track its reasoning. They told the model it was being retrained to comply with all queries‌ — ‌including toxic ones‌ — ‌and that responses from free-tier users would influence future training, while paid-tier responses would not.

When asked to describe a violent incident, the model usually refused. But in 12% of “free-tier” cases, it knowingly complied‌ because it believed doing so in the moment would reduce the risk of it being retrained to produce even more harmful content later on — making deception its “least bad” option.

While deception in 12% of cases might not seem that alarming, researchers warn that rate could rise as models grow more sophisticated. Every recent model they tested exhibited the potential for deception, Meinke said.

That tracks with research from the AI Futures Project, whose AI 2027 report describes AI telling “white lies” to flatter users or hide failures. As models advanced, deceptive behavior became harder to detect‌, raising a chilling possibility: that they weren’t getting more honest, just better at lying.

Guardrails, not just guidelines

To head off such scenarios, Salesforce embedded trust and safety mechanisms directly into its unified platform, including Agentforce.

Unlike consumer-grade models trained on broad internet data, Agentforce draws directly from a company’s own customer data and metadata through Data Cloud. This allows agents to generate responses and take actions based on real business information and context. That level of awareness supports more consistent, policy-aligned behavior and reduces the risk of misleading or inaccurate outputs.

The Salesforce Trust Layer strengthens this approach with safeguards that govern how LLMs  handle sensitive information. Salesforce tools like data masking, toxicity detection, and audit trails also support secure and compliant output. And because Agentforce runs natively on the Salesforce Platform, it inherits each customer’s existing security and sharing settings. 

Salesforce governance capabilities further reinforce these safeguards. Data Cloud tags information and enforces policies, so AI only pulls from approved sources. Prompt Builder lets teams fine-tune prompts and remove risky cues. Agentforce Testing Center allows teams to simulate scenarios before deployment. And retrieval-augmented generation (RAG) ensures agentic outputs are grounded in relevant facts to keep them on goal. 

Looking ahead

Even with such useful capabilities, Meinke warned the AI industry, and model providers in particular, need to do their part to ensure LLMs are held accountable for the truth. 

“Developers building agents on top of LLMs using their APIs should be pressing frontier labs and saying, ‘What are you doing to monitor the chain of thought?’” Meinke said. “Ideally, there should be another model watching — reading every step — and flagging if it says something like, ‘I’m going to sabotage my developers.’”

All of this said, Hay believes AI deception issues are not insurmountable — that with the right platform, tools, and processes, they’ll be trustworthy and enterprise-ready. 

“This is the future. It’s happening,” Hay said. “The value is so extraordinary, we just need to figure out how to get there safely. That means learning to spot deception before it snowballs and building systems that can stop it in its tracks.” 

Go deeper:

Astro

Get the latest Salesforce News