DeepSeek’s Improved Native Sparse Attention (NSA) - Overview
- Allen Zhou
- Feb 25
- 2 min read
Updated: Oct 24
Allen Zhou, Cosmos Quant Investment, nzhou@cosmosquant.com
Background: The Bottleneck of Long-Context Efficiency
In traditional Transformer models, the attention mechanism scales quadratically with sequence length (O(n²)). When processing long sequences (e.g., 64K tokens), attention computation can account for 70–80% of total latency, creating a severe bottleneck in tasks such as long-text generation and document comprehension. DeepSeek’s Native Sparse Attention (NSA) was developed to address this challenge, aiming for ultra-fast long-context processing, native trainability, and hardware alignment, a major step forward for next-generation large language models (LLMs).
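To make the quadratic growth concrete, here is a rough back-of-envelope; the head size and head count below are illustrative assumptions, not DeepSeek’s configuration. Going from 4K to 64K tokens multiplies attention FLOPs by roughly 256×.

```python
# Back-of-envelope: self-attention FLOPs grow quadratically with sequence length.
# d_head and n_heads are illustrative assumptions, not DeepSeek's configuration.
d_head, n_heads = 128, 64

def attn_flops(n):
    # Q @ K^T and attention_weights @ V each cost roughly 2 * n * n * d_head FLOPs per head
    return n_heads * (2 * n * n * d_head + 2 * n * n * d_head)

for n in (4_096, 16_384, 65_536):
    print(f"{n:>6} tokens: {attn_flops(n) / 1e12:6.1f} TFLOPs per layer")
```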
What NSA Solves
DeepSeek’s NSA focuses on overcoming two key inefficiencies found in existing sparse attention frameworks:
Performance degradation caused by post-hoc sparsification.
Low training efficiency during long-sequence optimization (e.g., untrainable sparse modules and inefficient backward propagation).
Goal: Maintain model accuracy while dramatically reducing computational cost.
Two Core Innovations
1. Arithmetic-Intensity-Balanced Algorithm Design
NSA balances computation against memory access so that GPU compute units stay busy instead of stalling on KV-cache reads, maximizing throughput while avoiding redundant FLOPs.
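For intuition on why arithmetic intensity matters, here is a rough back-of-envelope, not the paper’s own analysis, assuming fp16 and illustrative sizes: decode-time attention with a KV cache performs roughly one FLOP per byte it reads, far below the compute-to-bandwidth ratio of modern GPUs, so the kernel is memory-bound and every KV byte skipped by sparsity translates directly into speedup.

```python
# Rough arithmetic-intensity estimate for decode-time attention with a KV cache.
# Assumes fp16 and illustrative sizes; one query token, one head.
n, d = 64 * 1024, 128
flops = 4 * n * d                 # q @ K^T (~2*n*d) plus attention_weights @ V (~2*n*d)
bytes_moved = 2 * (2 * n * d)     # read K and V once, 2 bytes per fp16 element
print(f"~{flops / bytes_moved:.1f} FLOPs per byte")  # ≈ 1, i.e. heavily memory-bound
```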
2. End-to-End Trainability
Unlike earlier sparse attention models that rely on fixed pruning patterns, NSA is designed for end-to-end gradient optimization, making it trainable from scratch without manual sparsity tuning.
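A minimal, hypothetical illustration of the difference (this is not NSA’s actual selection mechanism): a hard top-k mask cuts the autograd graph, so the scoring path cannot be learned, while a continuous gate keeps the whole path differentiable.

```python
import torch

# Hard top-k pruning: the discrete index selection cuts the autograd graph,
# so nothing upstream of the scores can be learned.
scores = torch.randn(8, requires_grad=True)
values = torch.randn(8)
mask = torch.zeros(8)
mask[scores.topk(2).indices] = 1.0
print((mask * values).sum().requires_grad)   # False: the scoring path is not trainable

# Soft gating: a continuous weight keeps the path differentiable end to end.
gate_scores = torch.randn(8, requires_grad=True)
(torch.sigmoid(gate_scores) * values).sum().backward()
print(gate_scores.grad is not None)          # True: gradients reach the gate parameters
```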

The “Triple Mechanism” of NSA: Efficient and Intelligent Attention
NSA integrates three attention types in parallel, mimicking how humans process information (a minimal code sketch of the three branches follows this list):
Compressed Attention - Aggregates Key and Value representations to quickly capture paragraph-level semantics. Analogy: When reading, humans first glance at titles, openings, and keywords to grasp the main idea.
Selective Attention - Assigns importance scores to different information blocks, emphasizing the core content. Analogy: You focus on key paragraphs that left an impression during a quick read.
Sliding-Window Attention - Maintains a fixed-size local window, ensuring the model captures fine-grained context and short-term dependencies. Analogy: Humans pay closer attention to the sentences immediately before and after the current one.
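The sketch below is a minimal, single-query illustration of the three branches, not DeepSeek’s implementation: mean pooling stands in for NSA’s learned block compression, the block-selection scores are taken from the compressed branch’s attention weights, and the gates are fixed where NSA learns query-dependent gates.

```python
import torch
import torch.nn.functional as F

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query vector."""
    w = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return w @ V, w

def nsa_like_attention(q, K, V, block=64, topk=4, window=256):
    n, d = K.shape

    # 1) Compressed branch: mean-pool K/V blocks (stand-in for NSA's learned compression)
    nb = n // block
    Kc = K[: nb * block].reshape(nb, block, d).mean(1)
    Vc = V[: nb * block].reshape(nb, block, d).mean(1)
    out_cmp, block_scores = attend(q, Kc, Vc)

    # 2) Selection branch: keep the raw tokens of the top-k highest-scoring blocks
    idx = block_scores.topk(min(topk, nb)).indices
    token_idx = (idx[:, None] * block + torch.arange(block)).reshape(-1)
    out_sel, _ = attend(q, K[token_idx], V[token_idx])

    # 3) Sliding-window branch: only the most recent `window` tokens
    out_win, _ = attend(q, K[-window:], V[-window:])

    # Gated combination (fixed equal gates here; NSA learns query-dependent gates)
    g = torch.full((3,), 1.0 / 3)
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

# Toy usage on a 4,096-token cache
torch.manual_seed(0)
d = 64
K, V, q = torch.randn(4096, d), torch.randn(4096, d), torch.randn(d)
print(nsa_like_attention(q, K, V).shape)  # torch.Size([64])
```

Even in this toy form, the query only attends to a few hundred positions (compressed blocks, selected blocks, and the local window) instead of all 4,096, which is where the savings come from.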
Hardware Optimization: Speed and Efficiency Combined
DeepSeek’s NSA integrates multiple hardware-level enhancements for massive performance gains (a back-of-envelope estimate of the KV-cache traffic savings follows this list):
Kernel-level optimization: Hardware-aligned sparse attention kernels.
Group-centric data loading: Optimized I/O throughput and cache utilization.
Shared KV cache: Reduces redundant reads and minimizes memory latency.
Grid-based outer loops: Achieves highly efficient parallel execution on modern GPUs.
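To illustrate the shared-KV point with a back-of-envelope (all sizes below are assumptions, not DeepSeek’s configuration): when query heads in a group share one set of keys and values, as in GQA/MQA, loading that K/V once per group instead of once per query head cuts memory traffic by the group size.

```python
# Rough KV-cache traffic per decode step at 64K context, fp16.
# All sizes below are illustrative assumptions, not DeepSeek's configuration.
n, d_head, n_heads, group_size = 64 * 1024, 128, 64, 8
kv_bytes = 2 * n * d_head * 2                        # K + V for one shared cache, 2 bytes/element

naive_reads = n_heads * kv_bytes                     # reload the shared K/V for every query head
grouped_reads = (n_heads // group_size) * kv_bytes   # load it once per query group

print(f"per-head loads : {naive_reads / 1e9:.1f} GB")
print(f"per-group loads: {grouped_reads / 1e9:.1f} GB")
```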
Benchmark Results: Comprehensive Superiority
Model Performance
NSA consistently outperforms full-attention baselines across multiple benchmarks.
In 64K-token retrieval tasks, NSA demonstrates exceptionally high recall and precision.
Speed and Efficiency
Forward pass speed: Up to 9× faster.
Backward pass speed: Up to 6× faster.
Hardware utilization: Significantly reduced resource consumption with no accuracy loss.

Future Directions
DeepSeek continues to advance NSA’s capabilities with two main research directions:
Smarter sparsity patterns: Enabling the model to autonomously select the most informative tokens.
Further hardware alignment: Expanding NSA compatibility with a wider range of devices and deployment environments.
Conclusion
DeepSeek’s enhanced Native Sparse Attention (NSA) represents a crucial step toward scalable, efficient long-context understanding. By combining algorithmic innovation with hardware-level engineering, NSA paves the way for the next wave of AI efficiency breakthroughs - helping large models think smarter, faster, and cheaper.