
DeepSeek’s Improved Native Sparse Attention (NSA) - Overview

Updated: Oct 24

Allen Zhou, Cosmos Quant Investment, nzhou@cosmosquant.com


Background: The Bottleneck of Long-Context Efficiency

In a standard Transformer, the attention mechanism scales quadratically with sequence length, with computational complexity O(n²). When processing long sequences (e.g., 64K tokens), attention computation can account for 70–80% of total latency, creating severe bottlenecks in tasks such as long-text generation and document comprehension. DeepSeek’s Native Sparse Attention (NSA) was developed to address this challenge, aiming for fast long-context processing, native trainability, and hardware alignment, and representing a major step forward for next-generation large language models (LLMs).
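
To see why this becomes a bottleneck, here is a rough back-of-the-envelope count of attention FLOPs during prefill (the head count and head dimension below are illustrative assumptions, not a specific DeepSeek configuration):

```python
# Rough illustration of quadratic attention cost; numbers are approximate.

def attn_flops(seq_len: int, n_heads: int = 32, head_dim: int = 128) -> float:
    """Approximate FLOPs for QK^T plus the attention-weighted sum of V
    (two matmuls, ~2*2*n*n*d per head), ignoring softmax and projections."""
    return 2 * 2 * seq_len * seq_len * head_dim * n_heads

for n in (4_096, 16_384, 65_536):
    print(f"{n:>6} tokens: ~{attn_flops(n) / 1e12:.1f} TFLOPs per layer")
# Quadrupling the context (16K -> 64K) multiplies attention cost by ~16x,
# which is why attention dominates latency at long sequence lengths.
```

Sparse attention attacks exactly this quadratic term: each query attends to a small, selected subset of keys rather than all n of them.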


What NSA Solves

DeepSeek’s NSA focuses on overcoming two key inefficiencies found in existing sparse attention frameworks:

  1. Performance degradation caused by post-hoc sparsification.

  2. Low training efficiency during long-sequence optimization (e.g., untrainable sparse modules and inefficient backward propagation).

Goal: Maintain model accuracy while dramatically reducing computational cost.


Two Core Innovations

1. Arithmetic-Intensity-Balanced Algorithm Design

NSA achieves a better balance between computation and memory access, maximizing GPU throughput and minimizing redundant FLOPs.
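
To make “arithmetic intensity” concrete, here is a toy roofline-style check for a single decoding step; the hardware figures and tensor shapes are rough assumptions for illustration, not measurements of NSA:

```python
# A kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved)
# falls below the GPU's compute-to-bandwidth ratio. Hardware numbers are rough.

PEAK_TFLOPS = 989        # assumed BF16 dense peak for an H100-class GPU
PEAK_BW_TBPS = 3.35      # assumed HBM bandwidth in TB/s
ridge = (PEAK_TFLOPS * 1e12) / (PEAK_BW_TBPS * 1e12)   # ~295 FLOPs/byte

# Toy decoding step: one query attends to 1024 selected KV tokens,
# head_dim = 128, bf16 storage (2 bytes per element), single head.
flops = 2 * 2 * 1024 * 128            # QK^T plus the weighted sum of V
bytes_moved = 2 * 1024 * 128 * 2      # read the K and V entries once
print(f"arithmetic intensity ~{flops / bytes_moved:.0f} FLOPs/byte, ridge ~{ridge:.0f}")
# Decoding attention sits far below the ridge: it is memory-bound, so the win
# comes from loading fewer KV bytes (sparsity, shared block reads), not from
# trimming FLOPs alone. Prefill, by contrast, is compute-bound, so reducing
# FLOPs matters there. Balancing both regimes is the point of this design goal.
```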

2. End-to-End Trainability

Unlike earlier sparse attention models that rely on fixed pruning patterns, NSA is designed for end-to-end gradient optimization, making it trainable from scratch without manual sparsity tuning.
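
As a minimal illustration of the difference this makes (a toy gate, not NSA’s actual module): when branch outputs are combined through a learned, differentiable gate, gradients reach the sparsity-related parameters during pretraining instead of those parameters being hand-tuned after the fact.

```python
import torch

# Toy example only: a learnable gate weighting two attention branches keeps the
# whole path differentiable, so the gate receives gradients end to end.
torch.manual_seed(0)
branch_a = torch.randn(4, 8)              # stand-in for one branch's output
branch_b = torch.randn(4, 8)              # stand-in for another branch's output
gate_logit = torch.zeros(1, requires_grad=True)

g = torch.sigmoid(gate_logit)             # gate in (0, 1), learned from data
out = g * branch_a + (1 - g) * branch_b
out.sum().backward()
print(gate_logit.grad is not None)        # True: the gate is trainable
```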


Illustration: The green area represents active attention computation, while white areas indicate skipped computations.


The “Triple Mechanism” of NSA: Efficient and Intelligent Attention

NSA integrates three attention branches in parallel, mimicking how humans process information (a minimal sketch follows this list):

  1. Compressed Attention - Aggregates Key and Value representations to quickly capture paragraph-level semantics. Analogy: When reading, humans first glance at titles, openings, and keywords to grasp the main idea.

  2. Selective Attention - Assigns importance scores to different information blocks, emphasizing the core content. Analogy: You focus on key paragraphs that left an impression during a quick read.

  3. Sliding-Window Attention - Maintains a fixed-size local window, ensuring the model captures fine-grained context and short-term dependencies. Analogy: Humans pay closer attention to the sentences immediately before and after the current one.
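
A minimal single-query, single-head sketch of how the three branches could fit together (mean-pooled block summaries stand in for NSA’s learned compression, the equal gates are placeholders for NSA’s learned gating, and the block size, window size, and top-k values are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def nsa_style_attention(q, K, V, block=16, top_k=4, window=64):
    """q: (d,) query at the current position; K, V: (n, d) past keys/values."""
    n, d = K.shape
    scale = d ** -0.5

    # 1) Compressed attention: mean-pool each block of keys/values and attend
    #    to the block summaries (NSA uses a learned compression instead).
    n_blocks = n // block
    Kc = K[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    Vc = V[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    w_cmp = F.softmax(Kc @ q * scale, dim=-1)
    out_cmp = w_cmp @ Vc

    # 2) Selective attention: rank blocks by their compressed-attention scores,
    #    keep the top-k, and attend to the original tokens inside them.
    k_sel = min(top_k, n_blocks)
    sel_blocks = torch.topk(w_cmp, k_sel).indices
    idx = (sel_blocks[:, None] * block + torch.arange(block)).reshape(-1)
    Ks, Vs = K[idx], V[idx]
    w_sel = F.softmax(Ks @ q * scale, dim=-1)
    out_sel = w_sel @ Vs

    # 3) Sliding-window attention: attend only to the most recent tokens.
    Kw, Vw = K[-window:], V[-window:]
    w_win = F.softmax(Kw @ q * scale, dim=-1)
    out_win = w_win @ Vw

    # Gated combination of the three branches (learned gates in practice).
    gates = torch.softmax(torch.zeros(3), dim=-1)   # placeholder: equal weights
    return gates[0] * out_cmp + gates[1] * out_sel + gates[2] * out_win

q = torch.randn(64)
K = torch.randn(1024, 64)
V = torch.randn(1024, 64)
print(nsa_style_attention(q, K, V).shape)   # torch.Size([64])
```

Note how the block scores from the compressed branch are reused to rank blocks for the selective branch, which keeps block selection cheap rather than requiring a separate scoring pass.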


Hardware Optimization: Speed and Efficiency Combined

DeepSeek’s NSA combines several hardware-level optimizations for substantial performance gains (a rough cost estimate follows this list):

  • Kernel-level optimization: Hardware-aligned sparse attention kernels.

  • Group-centric data loading: Optimized I/O throughput and cache utilization.

  • Shared KV cache: Reduces redundant reads and minimizes memory latency.

  • Grid-based outer loops: Achieves highly efficient parallel execution on modern GPUs.
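
A rough, illustrative estimate of why group-centric loading and a shared KV cache matter (head counts, block size, and precision below are assumptions, not DeepSeek’s configuration): when all query heads in a GQA group agree on the same selected KV blocks, each block is fetched from memory once per group rather than once per head.

```python
# Back-of-the-envelope KV traffic for one decoding position; all sizes assumed.
n_q_heads, group_size = 64, 8             # 8 query heads share one KV head (GQA)
n_kv_groups = n_q_heads // group_size
selected_blocks = 16                      # sparse blocks attended per query
block_bytes = 64 * 128 * 2 * 2            # 64 tokens x d=128 x (K and V) x bf16

per_head_reads = n_q_heads * selected_blocks * block_bytes
per_group_reads = n_kv_groups * selected_blocks * block_bytes
print(f"per-head block loads : {per_head_reads / 2**20:.1f} MiB")
print(f"group-shared loads   : {per_group_reads / 2**20:.1f} MiB")
# Sharing the selection and the fetched KV blocks within each group cuts memory
# traffic by the group size (8x here), which is where decoding speedups come from.
```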


Benchmark Results: Strong Across the Board

Model Performance

  • NSA consistently outperforms full-attention baselines across multiple benchmarks.

  • In 64K-token retrieval tasks, NSA demonstrates exceptionally high recall and precision.

Speed and Efficiency

  • Forward pass speed: Up to 9× faster.

  • Backward pass speed: Up to 6× faster.

  • Hardware utilization: Significantly reduced resource consumption with no accuracy loss.


Illustration: Benchmark results.

Future Directions

DeepSeek continues to advance NSA’s capabilities with two main research directions:

  1. Smarter sparsity patterns: Enabling the model to autonomously select the most informative tokens.

  2. Further hardware alignment: Expanding NSA compatibility with a wider range of devices and deployment environments.


Conclusion

DeepSeek’s enhanced Native Sparse Attention (NSA) represents a crucial step toward scalable, efficient long-context understanding. By combining algorithmic innovation with hardware-level engineering, NSA paves the way for the next wave of AI efficiency breakthroughs - helping large models think smarter, faster, and cheaper.

