DeepSeek’s Improved Native Sparse Attention (NSA) - Overview
- Allen Zhou
- Feb 25
- 2 min read
Updated: Oct 24
Allen Zhou, Cosmos Quant Investment, nzhou@cosmosquant.com
Background: The Bottleneck of Long-Context Efficiency
In traditional Transformer models, the attention mechanism scales quadratically with sequence length (O(n²)). When processing long sequences (e.g., 64K tokens), attention computation can account for 70–80% of total latency, creating a severe bottleneck in tasks such as long-text generation and document comprehension. DeepSeek’s Native Sparse Attention (NSA) was developed to address this challenge, aiming for ultra-fast long-context processing, native trainability, and hardware alignment, a major step forward for next-generation large language models (LLMs).
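To make the quadratic growth concrete, here is a rough back-of-envelope; the head size and head count below are illustrative assumptions, not DeepSeek’s configuration. Going from 4K to 64K tokens multiplies attention FLOPs by roughly 256×.

```python
# Back-of-envelope: self-attention FLOPs grow quadratically with sequence length.
# d_head and n_heads are illustrative assumptions, not DeepSeek's configuration.
d_head, n_heads = 128, 64

def attn_flops(n):
    # Q @ K^T and attention_weights @ V each cost roughly 2 * n * n * d_head FLOPs per head
    return n_heads * (2 * n * n * d_head + 2 * n * n * d_head)

for n in (4_096, 16_384, 65_536):
    print(f"{n:>6} tokens: {attn_flops(n) / 1e12:6.1f} TFLOPs per layer")
```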
What NSA Solves
DeepSeek’s NSA focuses on overcoming two key inefficiencies found in existing sparse attention frameworks:
Performance degradation caused by post-hoc sparsification.
Low training efficiency during long-sequence optimization (e.g., untrainable sparse modules and inefficient backward propagation).
Goal: Maintain model accuracy while dramatically reducing computational cost.
Two Core Innovations
1. Arithmetic-Intensity-Balanced Algorithm Design
NSA balances computation against memory access so that GPU compute units stay busy instead of stalling on KV-cache reads, maximizing throughput while avoiding redundant FLOPs.
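For intuition on why arithmetic intensity matters, here is a rough back-of-envelope, not the paper’s own analysis, assuming fp16 and illustrative sizes: decode-time attention with a KV cache performs roughly one FLOP per byte it reads, far below the compute-to-bandwidth ratio of modern GPUs, so the kernel is memory-bound and every KV byte skipped by sparsity translates directly into speedup.

```python
# Rough arithmetic-intensity estimate for decode-time attention with a KV cache.
# Assumes fp16 and illustrative sizes; one query token, one head.
n, d = 64 * 1024, 128
flops = 4 * n * d                 # q @ K^T (~2*n*d) plus attention_weights @ V (~2*n*d)
bytes_moved = 2 * (2 * n * d)     # read K and V once, 2 bytes per fp16 element
print(f"~{flops / bytes_moved:.1f} FLOPs per byte")  # ≈ 1, i.e. heavily memory-bound
```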
2. End-to-End Trainability
Unlike earlier sparse attention models that rely on fixed pruning patterns, NSA is designed for end-to-end gradient optimization, making it trainable from scratch without manual sparsity tuning.
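A minimal, hypothetical illustration of the difference (this is not NSA’s actual selection mechanism): a hard top-k mask cuts the autograd graph, so the scoring path cannot be learned, while a continuous gate keeps the whole path differentiable.

```python
import torch

# Hard top-k pruning: the discrete index selection cuts the autograd graph,
# so nothing upstream of the scores can be learned.
scores = torch.randn(8, requires_grad=True)
values = torch.randn(8)
mask = torch.zeros(8)
mask[scores.topk(2).indices] = 1.0
print((mask * values).sum().requires_grad)   # False: the scoring path is not trainable

# Soft gating: a continuous weight keeps the path differentiable end to end.
gate_scores = torch.randn(8, requires_grad=True)
(torch.sigmoid(gate_scores) * values).sum().backward()
print(gate_scores.grad is not None)          # True: gradients reach the gate parameters
```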

The “Triple Mechanism” of NSA: Efficient and Intelligent Attention
NSA integrates three attention types in parallel, mimicking how humans process information (a minimal code sketch of the three branches follows this list):
Compressed Attention - Aggregates Key and Value representations to quickly capture paragraph-level semantics. Analogy: When reading, humans first glance at titles, openings, and keywords to grasp the main idea.
Selective Attention - Assigns importance scores to different information blocks, emphasizing the core content. Analogy: You focus on key paragraphs that left an impression during a quick read.
Sliding-Window Attention - Maintains a fixed-size local window, ensuring the model captures fine-grained context and short-term dependencies. Analogy: Humans pay closer attention to the sentences immediately before and after the current one.
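The sketch below is a minimal, single-query illustration of the three branches, not DeepSeek’s implementation: mean pooling stands in for NSA’s learned block compression, the block-selection scores are taken from the compressed branch’s attention weights, and the gates are fixed where NSA learns query-dependent gates.

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    w = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return w @ V, w

def nsa_like_attention(q, K, V, block=64, topk=4, window=256):
    n, d = K.shape

    # 1) Compressed branch: mean-pool K/V blocks (stand-in for NSA's learned compression)
    nb = n // block
    Kc = K[: nb * block].reshape(nb, block, d).mean(1)
    Vc = V[: nb * block].reshape(nb, block, d).mean(1)
    out_cmp, block_scores = attend(q, Kc, Vc)

    # 2) Selection branch: keep the raw tokens of the top-k highest-scoring blocks
    idx = block_scores.topk(min(topk, nb)).indices
    token_idx = (idx[:, None] * block + torch.arange(block)).reshape(-1)
    out_sel, _ = attend(q, K[token_idx], V[token_idx])

    # 3) Sliding-window branch: only the most recent `window` tokens
    out_win, _ = attend(q, K[-window:], V[-window:])

    # Gated combination (fixed equal gates here; NSA learns query-dependent gates)
    g = torch.full((3,), 1.0 / 3)
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

# Toy usage on a 4,096-token cache
torch.manual_seed(0)
d = 64
K, V, q = torch.randn(4096, d), torch.randn(4096, d), torch.randn(d)
print(nsa_like_attention(q, K, V).shape)  # torch.Size([64])
```

Even in this toy form, the query only attends to a few hundred positions (compressed blocks, selected blocks, and the local window) instead of all 4,096, which is where the savings come from.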
Hardware Optimization: Speed and Efficiency Combined
DeepSeek’s NSA integrates multiple hardware-level enhancements for massive performance gains (a back-of-envelope estimate of the KV-cache traffic savings follows this list):
Kernel-level optimization: Hardware-aligned sparse attention kernels.
Group-centric data loading: Optimized I/O throughput and cache utilization.
Shared KV cache: Reduces redundant reads and minimizes memory latency.
Grid-based outer loops: Achieves highly efficient parallel execution on modern GPUs.
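To illustrate the shared-KV point with a back-of-envelope (all sizes below are assumptions, not DeepSeek’s configuration): when query heads in a group share one set of keys and values, as in GQA/MQA, loading that K/V once per group instead of once per query head cuts memory traffic by the group size.

```python
# Rough KV-cache traffic per decode step at 64K context, fp16.
# All sizes below are illustrative assumptions, not DeepSeek's configuration.
n, d_head, n_heads, group_size = 64 * 1024, 128, 64, 8
kv_bytes = 2 * n * d_head * 2                        # K + V for one shared cache, 2 bytes/element

naive_reads = n_heads * kv_bytes                     # reload the shared K/V for every query head
grouped_reads = (n_heads // group_size) * kv_bytes   # load it once per query group

print(f"per-head loads : {naive_reads / 1e9:.1f} GB")
print(f"per-group loads: {grouped_reads / 1e9:.1f} GB")
```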
Benchmark Results: Comprehensive Superiority
Model Performance
NSA consistently outperforms full-attention baselines across multiple benchmarks.
In 64K-token retrieval tasks, NSA demonstrates exceptionally high recall and precision.
Speed and Efficiency
Forward pass speed: Up to 9× faster.
Backward pass speed: Up to 6× faster.
Hardware utilization: Significantly reduced resource consumption with no accuracy loss.

Future Directions
DeepSeek continues to advance NSA’s capabilities with two main research directions:
Smarter sparsity patterns: Enabling the model to autonomously select the most informative tokens.
Further hardware alignment: Expanding NSA compatibility with a wider range of devices and deployment environments.
Conclusion
DeepSeek’s enhanced Native Sparse Attention (NSA) represents a crucial step toward scalable, efficient long-context understanding. By combining algorithmic innovation with hardware-level engineering, NSA paves the way for the next wave of AI efficiency breakthroughs - helping large models think smarter, faster, and cheaper.