DeepSeek-V3.2-Exp

We’re thrilled to introduce DeepSeek-V3.2-Exp, an experimental version that represents a significant step forward in efficient transformer architecture. This release builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention (DSA), a mechanism designed to optimize training and inference efficiency in long-context scenarios.

(Figure: cost comparison)

The core innovation is that DSA achieves fine-grained sparse attention for the first time, delivering substantial improvements in long-context training and inference efficiency while keeping model output quality virtually identical. This addresses the computational cost of processing extended text sequences without compromising performance.
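For intuition only, the toy PyTorch sketch below shows what fine-grained (per-query, per-token) sparsity means: each query attends to just its top-k highest-scoring keys instead of the full sequence. This is an illustration, not DeepSeek's DSA implementation, which uses its own indexing and selection machinery.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy per-query top-k sparse attention (illustration only, not DSA).

    q: (n_q, d), k and v: (n_kv, d). Each query attends only to its
    top_k highest-scoring keys instead of the full key set.
    """
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # (n_q, n_kv)
    top_k = min(top_k, k.shape[0])
    # Keep only the top_k scores per query; mask out everything else.
    topk_vals, topk_idx = scores.topk(top_k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    weights = F.softmax(masked, dim=-1)  # zero weight outside the top_k set
    return weights @ v                   # (n_q, d)

q = torch.randn(8, 16)
k = torch.randn(128, 16)
v = torch.randn(128, 16)
out = topk_sparse_attention(q, k, v, top_k=32)
print(out.shape)  # torch.Size([8, 16])
```

The efficiency gain comes from shrinking the attention computation from O(L²) toward O(L·k) in sequence length L, which is where the long-context savings originate.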

To ensure rigorous evaluation, we deliberately aligned the training configurations of DeepSeek-V3.2-Exp with V3.1-Terminus. The results speak for themselves: across multiple public benchmarks spanning various domains, DeepSeek-V3.2-Exp demonstrates performance on par with its predecessor while offering enhanced efficiency.

Performance Highlights:
– MMLU-Pro: 85.0
– GPQA-Diamond: 79.9
– AIME 2025: 89.3
– Codeforces: 2121
– BrowseComp: 40.1
– SimpleQA: 97.1

Getting started with DeepSeek-V3.2-Exp is straightforward with multiple deployment options:

HuggingFace Integration:
We provide updated inference demo code to help the community quickly get started. The process involves converting model weights and launching an interactive chat interface with simple command-line instructions.
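As a minimal sketch of that flow, assuming the published checkpoint loads through transformers with trust_remote_code (the repo's own conversion and chat scripts remain the supported path, and a model of this scale needs a multi-GPU setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical loading sketch; the official demo scripts in the repo
# (weight conversion + interactive chat) are the supported route.
model_id = "deepseek-ai/DeepSeek-V3.2-Exp"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom modeling code ships with the checkpoint
    device_map="auto",       # assumes enough GPU memory for a model this large
    torch_dtype="auto",
)

messages = [{"role": "user", "content": "Summarize DeepSeek Sparse Attention in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```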

SGLang Support:
Docker installation is available for various hardware platforms including H200, MI350, and NPUs. The launch command enables efficient server deployment with tensor and data parallelism.
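Once a server is running, it can be queried like any OpenAI-compatible endpoint. A minimal client sketch, assuming SGLang's customary default port of 30000:

```python
from openai import OpenAI

# Client sketch against a locally launched SGLang server.
# Port 30000 is SGLang's usual default; adjust to your launch flags.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Exp",
    messages=[{"role": "user", "content": "Hello from a long-context test!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```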

vLLM Compatibility:
vLLM offers day-0 support, ensuring seamless integration with existing workflows. Check the recipes for the latest implementation details.
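For offline batch inference, vLLM's Python API can look like the sketch below; the exact flags and parallelism settings for V3.2-Exp are documented in the recipes, and the tensor_parallel_size here is an assumption to adjust for your hardware:

```python
from vllm import LLM, SamplingParams

# Offline-inference sketch; consult the vLLM recipes for the settings
# DeepSeek-V3.2-Exp actually requires on your platform.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3.2-Exp",
    trust_remote_code=True,
    tensor_parallel_size=8,  # assumption: set to your available GPU count
)
params = SamplingParams(temperature=0.6, max_tokens=128)
outputs = llm.generate(["Explain sparse attention briefly."], params)
print(outputs[0].outputs[0].text)
```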

For developers interested in the underlying technology, we’ve released open-source kernels: TileLang implementations that favor readability for research purposes, and high-performance CUDA kernels available in DeepGEMM and FlashMLA.

The entire repository and model weights are licensed under the permissive MIT License, encouraging widespread adoption and collaboration within the research community.

This experimental release marks an important milestone in our journey toward more efficient AI models, particularly for long-context applications. We’re excited to see how the community will leverage these improvements and welcome feedback to guide future developments.