Large language models (LLMs) have become foundational to AI code understanding and generation, powering a wide range of enterprise AI workflows — from software synthesis to automated reasoning over symbolic sequences. Despite this success, most code-focused LLMs today are autoregressive (AR): they predict the next token based only on previous context. This sequential formulation has been the default choice for language modeling, but it limits bidirectional reasoning, code infilling, and edit consistency — challenges that new approaches like diffusion language models are helping to address in enterprise settings.
Diffusion language models (DLMs) provide an alternative generation paradigm. Instead of producing tokens one by one, they generate sequences through a process of iterative denoising — progressively transforming a masked or noisy sequence into a coherent one. This iterative structure naturally supports parallel generation, context-aware reasoning, and structured editing, making it especially well-suited for modeling source code, where long-range dependencies and syntactic precision are critical.
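As a concrete picture of iterative denoising, the sketch below shows a common confidence-based decoding loop for masked diffusion LMs: begin from a fully masked completion, predict every position in parallel, commit the most confident predictions, and leave the rest masked for the next step. It is a generic illustration rather than CoDA's exact sampler; the mask_id token, the model interface, and the unmasking schedule are assumptions.

```python
import math
import torch

@torch.no_grad()
def denoise(model, prompt_ids, gen_len, mask_id, steps=16):
    """Generic confidence-based iterative denoising (illustrative, not CoDA's exact sampler).

    model      : callable returning logits of shape [1, seq_len, vocab]
    prompt_ids : [1, prompt_len] conditioning tokens
    mask_id    : id of the special mask token (assumed)
    """
    device = prompt_ids.device
    # Start with the prompt followed by a fully masked completion.
    x = torch.cat([prompt_ids,
                   torch.full((1, gen_len), mask_id, device=device)], dim=1)

    for step in range(steps):
        masked = (x == mask_id)
        if not masked.any():
            break
        logits = model(x)                       # predict every position in parallel
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax token
        conf = conf.masked_fill(~masked, -1.0)  # only compete over still-masked positions
        # Unmask enough tokens each step to finish exactly at the last step.
        remaining = int(masked.sum())
        k = math.ceil(remaining / (steps - step))
        top = conf.topk(k, dim=-1).indices
        x[0, top[0]] = pred[0, top[0]]
    return x
```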
To explore this paradigm in a practical and reproducible setting, we introduce CoDA (Coding via Diffusion Adaptation) — a diffusion language model for code. CoDA demonstrates that diffusion-based generation can be efficient, lightweight, and competitive, even without resorting to multi-billion–parameter models. It is fully open-sourced with training recipes, evaluation harnesses, and model checkpoints to support further research.
Overview of CoDA
CoDA is built by adapting a transformer-based autoregressive backbone (Qwen3-1.7B) to a discrete diffusion objective. It is trained end-to-end on TPU clusters using an open PyTorch/XLA pipeline optimized for large-scale text diffusion. Key features include:
- Multi-stage training design: pre-training, mid-training, and post-training stages that align noise distributions progressively.
- Progressive masking curriculum: structured masking strategies to improve infilling, truncation recovery, and variable-length conditioning.
- Reproducible infrastructure: end-to-end TPU pipeline and evaluation harness released publicly for transparency and benchmarking.
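Concretely, adapting an AR backbone to a discrete diffusion objective amounts to swapping next-token prediction for a masked-denoising loss: corrupt the sequence by masking a random fraction of tokens, then train the model to recover the originals at the corrupted positions. The sketch below shows the general shape of such an objective; the uniform noise schedule, unweighted loss, and interfaces are illustrative assumptions rather than CoDA's published recipe.

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, input_ids, mask_id):
    """One training step of a generic discrete-diffusion (masked denoising) objective.

    Illustrative only: the uniform masking-rate prior and unweighted cross-entropy
    are simplifying assumptions, not the exact objective used by CoDA.
    """
    # 1. Sample a corruption level t ~ U(0, 1) per sequence.
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)

    # 2. Independently mask each token with probability t.
    corrupt = torch.rand_like(input_ids, dtype=torch.float) < t
    noisy_ids = torch.where(corrupt, torch.full_like(input_ids, mask_id), input_ids)

    # 3. Predict the clean tokens and score only the corrupted positions.
    logits = model(noisy_ids)                       # [batch, seq, vocab]
    targets = torch.where(corrupt, input_ids, torch.full_like(input_ids, -100))
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```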
How CoDA Works
CoDA is trained in three major stages designed to progressively adapt the model from general text to high-quality code reasoning.
Pre-training (179B tokens): The pre-training phase exposes the model to diverse textual and code-based content, forming the foundation for syntactic understanding and reasoning. The corpus combines web text with coding and reasoning data drawn from sources such as the dclm-baseline-1.0 dataset, The Stack v2, and RedPajama.
Mid-training (20B tokens): The second stage bridges pre-training and fine-tuning by introducing a progressive masking curriculum, drawing on curated sources such as RedPajama Arxiv, Gutenberg, OpenCoder Annealing Corpus, and SmolLM-PythonEdu.
Post-training (Instruction Tuning): In the final stage, CoDA is fine-tuned on instruction-following data derived from OpenCoder (Stage 1 and Stage 2) to adapt the model for prompt-conditioned code generation and problem solving. This stage also introduces conditioned-span annealing, where the model transitions from unconditional denoising to progressively conditioning on larger portions of the user prompt. This mechanism ensures stable alignment between the prompt semantics and the denoising process.
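One way to picture conditioned-span annealing is as a schedule that gradually enlarges the portion of the prompt guaranteed to stay unmasked during denoising. The helper below is a minimal sketch of such a schedule; the linear ramp and parameter names are assumptions made for illustration, not the schedule used in the released recipe.

```python
def conditioned_prompt_length(prompt_len, step, total_steps, start_frac=0.0, end_frac=1.0):
    """Return how many leading prompt tokens are kept unmasked (conditioned on)
    at a given training step. The linear ramp is an assumed, illustrative schedule."""
    frac = start_frac + (end_frac - start_frac) * min(step / max(total_steps, 1), 1.0)
    return int(round(frac * prompt_len))

# Early in post-training almost none of the prompt is guaranteed visible;
# by the end the full prompt is conditioned on during denoising, e.g.:
#   conditioned_prompt_length(64, step=0,    total_steps=1000) -> 0
#   conditioned_prompt_length(64, step=1000, total_steps=1000) -> 64
```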
Progressive Masking

Traditional models learn by predicting the next token. CoDA learns to fill in the blanks. We introduce three complementary masking strategies:
- Unmaskable Prefix (S1): ensures consistent conditioning on an initial prompt, stabilizing prefix-aligned generation.
- Truncated Suffix (S2): teaches the model to handle sequences of varying length, improving robustness to partial contexts.
- Block Masking (S3): masks contiguous spans, simulating realistic infilling and code-repair scenarios.
Probabilities for each strategy are gradually increased over epochs, effectively transitioning the model from random token masking to structured code infilling. This curriculum helps align the model’s internal noise distribution with downstream inference behavior.
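To make the curriculum concrete, here is a compressed sketch of how the three strategies might be sampled, with the share of structured masking growing as training progresses. The specific probabilities, ramp, and span sizes are illustrative assumptions; the released training recipes contain the exact settings.

```python
import random

def apply_masking(token_ids, mask_id, epoch, num_epochs, mask_rate=0.5):
    """Pick one masking strategy per example, ramping from random token masking
    toward the structured strategies (S1-S3) as training progresses.
    Probabilities and the linear ramp are illustrative, not CoDA's exact curriculum."""
    ids = list(token_ids)
    n = len(ids)
    structured_p = epoch / max(num_epochs - 1, 1)   # grows from 0 toward 1

    if random.random() > structured_p:
        # Plain random token masking (early-training behavior).
        return [mask_id if random.random() < mask_rate else t for t in ids]

    strategy = random.choice(["s1_prefix", "s2_truncate", "s3_block"])
    if strategy == "s1_prefix":
        # S1: keep an unmaskable prefix, mask tokens only after it.
        prefix = random.randint(1, max(n // 4, 1))
        return ids[:prefix] + [mask_id if random.random() < mask_rate else t
                               for t in ids[prefix:]]
    if strategy == "s2_truncate":
        # S2: truncate the suffix to simulate variable-length / partial contexts.
        keep = random.randint(max(n // 2, 1), n)
        return ids[:keep]
    # S3: mask one contiguous block, mimicking infilling / code repair.
    span = random.randint(1, max(n // 4, 1))
    start = random.randint(0, n - span)
    return ids[:start] + [mask_id] * span + ids[start + span:]
```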
Results: Compact Yet Powerful
We evaluate CoDA on standard public benchmarks, including HumanEval and MBPP, and their EvalPlus extensions. Performance is measured using the pass@1 metric, representing the probability of generating a correct solution on the first attempt.
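For reference, pass@k (and hence pass@1) is typically computed with the standard unbiased estimator introduced with HumanEval; a compact version is shown below. This is the general metric definition rather than anything specific to CoDA's evaluation harness.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated per problem, c of them correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 57 passing -> estimated pass@1 of 0.285.
print(round(pass_at_k(n=200, c=57, k=1), 3))
```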

CoDA achieves competitive performance compared to much larger diffusion models, closing most of the gap at a significantly smaller parameter footprint. Instruction tuning yields a 25% improvement on HumanEval, underscoring the importance of post-training alignment for diffusion coders. Furthermore, CoDA achieves 39.6% lower inference latency than a comparable 7B-parameter diffusion model, highlighting the efficiency advantages of compact DLMs.
Fully Open Source
To facilitate community research, Salesforce Research is releasing:
- Model weights: https://huggingface.co/Salesforce/CoDA-v0-Instruct
- TPU training pipeline and recipes: https://github.com/SalesforceAIResearch/CoDA
Together, these resources enable anyone — from academic labs to open-source developers — to build, train, and deploy their own diffusion-based coding assistants.
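For readers who want to try the released checkpoint, a typical Hugging Face loading pattern is sketched below. Whether CoDA loads through AutoModelForCausalLM and what the diffusion sampling entry point is called depend on the released code, so treat this as an assumption and consult the model card and GitHub repository for the supported generation API.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the checkpoint exposes a standard transformers interface;
# trust_remote_code pulls in any custom modeling/sampling code shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/CoDA-v0-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/CoDA-v0-Instruct", trust_remote_code=True)

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt")
# The diffusion sampling loop itself ships with the released code; see
# https://github.com/SalesforceAIResearch/CoDA for the supported generation API.
```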
Learn More:
- Paper: https://arxiv.org/abs/2510.03270v1
- Model: https://huggingface.co/Salesforce/CoDA-v0-Instruct
- Code & Training Pipeline: https://github.com/SalesforceAIResearch/CoDA