Large language models (LLMs) have become foundational to AI code understanding and generation, powering a wide range of enterprise AI workflows — from software synthesis to automated reasoning over symbolic sequences. Despite this success, most code-focused LLMs today are autoregressive (AR): they predict the next token based only on previous context. This sequential formulation has been the default choice for language modeling, but it limits bidirectional reasoning, code infilling, and edit consistency — challenges that new approaches like diffusion language models are helping to address in enterprise settings.
Diffusion language models (DLMs) provide an alternative generation paradigm. Instead of producing tokens one by one, they generate sequences through a process of iterative denoising — progressively transforming a masked or noisy sequence into a coherent one. This iterative structure naturally supports parallel generation, context-aware reasoning, and structured editing, making it especially well-suited for modeling source code, where long-range dependencies and syntactic precision are critical.
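As a concrete picture of iterative denoising, the sketch below shows a common confidence-based decoding loop for masked diffusion LMs: begin from a fully masked completion, predict every position in parallel, commit the most confident predictions, and leave the rest masked for the next step. It is a generic illustration rather than CoDA's exact sampler; the mask_id token, the model interface, and the unmasking schedule are assumptions.

```python
import math
import torch

@torch.no_grad()
def denoise(model, prompt_ids, gen_len, mask_id, steps=16):
    """Generic confidence-based iterative denoising (illustrative, not CoDA's exact sampler).

    model      : callable returning logits of shape [1, seq_len, vocab]
    prompt_ids : [1, prompt_len] conditioning tokens
    mask_id    : id of the special mask token (assumed)
    """
    device = prompt_ids.device
    # Start with the prompt followed by a fully masked completion.
    x = torch.cat([prompt_ids,
                   torch.full((1, gen_len), mask_id, device=device)], dim=1)

    for step in range(steps):
        masked = (x == mask_id)
        if not masked.any():
            break
        logits = model(x)                       # predict every position in parallel
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax token
        conf = conf.masked_fill(~masked, -1.0)  # only compete over still-masked positions
        # Unmask enough tokens each step to finish exactly at the last step.
        remaining = int(masked.sum())
        k = math.ceil(remaining / (steps - step))
        top = conf.topk(k, dim=-1).indices
        x[0, top[0]] = pred[0, top[0]]
    return x
```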
To explore this paradigm in a practical and reproducible setting, we introduce CoDA (Coding via Diffusion Adaptation) — a diffusion language model for code. CoDA demonstrates that diffusion-based generation can be efficient, lightweight, and competitive, even without resorting to multi-billion–parameter models. It is fully open-sourced with training recipes, evaluation harnesses, and model checkpoints to support further research.
Overview of CoDA
CoDA is built by adapting a transformer-based autoregressive backbone (Qwen3-1.7B) to a discrete diffusion objective. It is trained end-to-end on TPU clusters using an open PyTorch/XLA pipeline optimized for large-scale text diffusion. Key features include:
- Multi-stage training design: pre-training, mid-training, and post-training stages that align noise distributions progressively.
- Progressive masking curriculum: structured masking strategies to improve infilling, truncation recovery, and variable-length conditioning.
- Reproducible infrastructure: end-to-end TPU pipeline and evaluation harness released publicly for transparency and benchmarking.
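Concretely, adapting an AR backbone to a discrete diffusion objective amounts to swapping next-token prediction for a masked-denoising loss: corrupt the sequence by masking a random fraction of tokens, then train the model to recover the originals at the corrupted positions. The sketch below shows the general shape of such an objective; the uniform noise schedule, unweighted loss, and interfaces are illustrative assumptions rather than CoDA's published recipe.

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, input_ids, mask_id):
    """One training step of a generic discrete-diffusion (masked denoising) objective.

    Illustrative only: the uniform masking-rate prior and unweighted cross-entropy
    are simplifying assumptions, not the exact objective used by CoDA.
    """
    # 1. Sample a corruption level t ~ U(0, 1) per sequence.
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)

    # 2. Independently mask each token with probability t.
    corrupt = torch.rand_like(input_ids, dtype=torch.float) < t
    noisy_ids = torch.where(corrupt, torch.full_like(input_ids, mask_id), input_ids)

    # 3. Predict the clean tokens and score only the corrupted positions.
    logits = model(noisy_ids)                       # [batch, seq, vocab]
    targets = torch.where(corrupt, input_ids, torch.full_like(input_ids, -100))
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```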
How CoDA Works
CoDA is trained in three major stages designed to progressively adapt the model from general text to high-quality code reasoning.
Pre-training (179B tokens): The pre-training phase exposes the model to diverse textual and code-based content, forming the foundation for syntactic understanding and reasoning. The corpus combines web text with coding and reasoning data drawn from sources such as the dclm-baseline-1.0 dataset, The Stack v2, and RedPajama.
Mid-training (20B tokens): The second stage bridges pre-training and fine-tuning by introducing a progressive masking curriculum, drawing on curated sources such as RedPajama Arxiv, Gutenberg, OpenCoder Annealing Corpus, and SmolLM-PythonEdu.
Post-training (Instruction Tuning): In the final stage, CoDA is fine-tuned on instruction-following data derived from OpenCoder (Stage 1 and Stage 2) to adapt the model for prompt-conditioned code generation and problem solving. This stage also introduces conditioned-span annealing, where the model transitions from unconditional denoising to progressively conditioning on larger portions of the user prompt. This mechanism ensures stable alignment between the prompt semantics and the denoising process.
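One way to picture conditioned-span annealing is as a schedule that gradually enlarges the portion of the prompt guaranteed to stay unmasked during denoising. The helper below is a minimal sketch of such a schedule; the linear ramp and parameter names are assumptions made for illustration, not the schedule used in the released recipe.

```python
def conditioned_prompt_length(prompt_len, step, total_steps, start_frac=0.0, end_frac=1.0):
    """Return how many leading prompt tokens are kept unmasked (conditioned on)
    at a given training step. The linear ramp is an assumed, illustrative schedule."""
    frac = start_frac + (end_frac - start_frac) * min(step / max(total_steps, 1), 1.0)
    return int(round(frac * prompt_len))

# Early in post-training almost none of the prompt is guaranteed visible;
# by the end the full prompt is conditioned on during denoising, e.g.:
#   conditioned_prompt_length(64, step=0,    total_steps=1000) -> 0
#   conditioned_prompt_length(64, step=1000, total_steps=1000) -> 64
```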
Progressive Masking

Traditional models learn by predicting the next token. CoDA learns to fill in the blanks. We introduce three complementary masking strategies:
- Unmaskable Prefix (S1): ensures consistent conditioning on an initial prompt, stabilizing prefix-aligned generation.
- Truncated Suffix (S2): teaches the model to handle sequences of varying length, improving robustness to partial contexts.
- Block Masking (S3): masks contiguous spans, simulating realistic infilling and code-repair scenarios.
Probabilities for each strategy are gradually increased over epochs, effectively transitioning the model from random token masking to structured code infilling. This curriculum helps align the model’s internal noise distribution with downstream inference behavior.
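To make the curriculum concrete, here is a compressed sketch of how the three strategies might be sampled, with the share of structured masking growing as training progresses. The specific probabilities, ramp, and span sizes are illustrative assumptions; the released training recipes contain the exact settings.

```python
import random

def apply_masking(token_ids, mask_id, epoch, num_epochs, mask_rate=0.5):
    """Pick one masking strategy per example, ramping from random token masking
    toward the structured strategies (S1-S3) as training progresses.
    Probabilities and the linear ramp are illustrative, not CoDA's exact curriculum."""
    ids = list(token_ids)
    n = len(ids)
    structured_p = epoch / max(num_epochs - 1, 1)   # grows from 0 toward 1

    if random.random() > structured_p:
        # Plain random token masking (early-training behavior).
        return [mask_id if random.random() < mask_rate else t for t in ids]

    strategy = random.choice(["s1_prefix", "s2_truncate", "s3_block"])
    if strategy == "s1_prefix":
        # S1: keep an unmaskable prefix, mask tokens only after it.
        prefix = random.randint(1, max(n // 4, 1))
        return ids[:prefix] + [mask_id if random.random() < mask_rate else t
                               for t in ids[prefix:]]
    if strategy == "s2_truncate":
        # S2: truncate the suffix to simulate variable-length / partial contexts.
        keep = random.randint(max(n // 2, 1), n)
        return ids[:keep]
    # S3: mask one contiguous block, mimicking infilling / code repair.
    span = random.randint(1, max(n // 4, 1))
    start = random.randint(0, n - span)
    return ids[:start] + [mask_id] * span + ids[start + span:]
```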
Results: Compact Yet Powerful
We evaluate CoDA on standard public benchmarks, including HumanEval and MBPP, and their EvalPlus extensions. Performance is measured using the pass@1 metric, representing the probability of generating a correct solution on the first attempt.
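For reference, pass@k (and hence pass@1) is typically computed with the standard unbiased estimator introduced with HumanEval; a compact version is shown below. This is the general metric definition rather than anything specific to CoDA's evaluation harness.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated per problem, c of them correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 57 passing -> estimated pass@1 of 0.285.
print(round(pass_at_k(n=200, c=57, k=1), 3))
```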

CoDA achieves competitive performance compared to much larger diffusion models, closing most of the gap at a significantly smaller parameter footprint. Instruction tuning yields a 25% improvement on HumanEval, underscoring the importance of post-training alignment for diffusion coders. Furthermore, CoDA achieves 39.6% lower inference latency than a comparable 7B-parameter diffusion model, highlighting the efficiency advantages of compact DLMs.
Fully Open Source
To facilitate community research, Salesforce Research is releasing:
- Model weights: https://huggingface.co/Salesforce/CoDA-v0-Instruct
- TPU training pipeline and recipes: https://github.com/SalesforceAIResearch/CoDA
Together, these resources enable anyone — from academic labs to open-source developers — to build, train, and deploy their own diffusion-based coding assistants.
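For readers who want to try the released checkpoint, a typical Hugging Face loading pattern is sketched below. Whether CoDA loads through AutoModelForCausalLM and what the diffusion sampling entry point is called depend on the released code, so treat this as an assumption and consult the model card and GitHub repository for the supported generation API.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the checkpoint exposes a standard transformers interface;
# trust_remote_code pulls in any custom modeling/sampling code shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/CoDA-v0-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/CoDA-v0-Instruct", trust_remote_code=True)

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt")
# The diffusion sampling loop itself ships with the released code; see
# https://github.com/SalesforceAIResearch/CoDA for the supported generation API.
```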
Learn More:
- Paper: https://arxiv.org/abs/2510.03270v1
- Model: https://huggingface.co/Salesforce/CoDA-v0-Instruct
- Code & Training Pipeline: https://github.com/SalesforceAIResearch/CoDA