Scaling Real-Time AI: Multi-Million Token Inference for 32x More Users with Helix Parallelism

Modern AI applications increasingly rely on models with massive parameter counts and multi-million-token context windows. Whether it’s an AI agent keeping up with months of conversation, a legal assistant processing gigabytes of case law, or a coding copilot navigating sprawling repositories, long-range context is essential for relevance and coherence.

At the same time, users expect fast, interactive responses—making real-time performance just as critical as scale.

This rising demand highlights two challenges in decoding: managing huge context lengths and scaling across GPUs without bottlenecks. NVIDIA’s Blackwell architecture, with high-bandwidth NVLink and FP4 compute, lays the foundation. But unlocking its full potential requires a new execution strategy—Helix Parallelism.

Co-designed with Blackwell, Helix delivers up to 32x more concurrent users at the same latency compared to previous methods for ultra-long context inference.

The Bottlenecks: KV Cache & FFN Weight Reads

Two key performance barriers arise during decoding:

  • KV Cache Streaming: With multi-million-token contexts, each GPU must read vast key-value (KV) cache histories from DRAM for every sample at each decoding step, quickly saturating memory bandwidth and increasing token-to-token latency (TTL); a rough estimate of this traffic follows the list.
  • FFN Weight Loads: Generating each token requires loading large feed-forward network (FFN) weights. In low-latency settings, small batch sizes make this cost harder to amortize, causing major delays.
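
To put the first bullet in perspective, here is a back-of-the-envelope calculation of KV-cache read traffic per decoded token. This is a minimal sketch: the layer count, KV-head count, head dimension, and cache precision are assumed example values, not the configuration of any specific model.

```python
# Back-of-the-envelope estimate of DRAM traffic from KV-cache reads per decoded
# token. All dimensions below are illustrative assumptions, not a real model config.

def kv_read_bytes_per_token(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Every decoding step re-reads the cached K and V tensors for each layer.
    return context_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

traffic = kv_read_bytes_per_token(
    context_len=1_000_000,  # multi-million-token context window
    n_layers=60,            # assumed transformer depth
    n_kv_heads=8,           # GQA-style small KV-head count
    head_dim=128,           # assumed per-head dimension
    bytes_per_elem=2,       # FP16/BF16 KV cache
)
print(f"~{traffic / 1e9:.0f} GB of KV cache read per generated token, per sample")
```

Even with these modest assumptions, every generated token drags hundreds of gigabytes of cache through DRAM per sample, which is why KV streaming dominates TTL at this scale.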

Existing parallelism strategies struggle to optimize both simultaneously. For example, while Tensor Parallelism (TP) helps distribute FFN weight loads, it often duplicates the KV cache across GPUs, especially in models with few KV heads, such as Llama (grouped-query attention, GQA) or DeepSeek (multi-head latent attention, MLA). Once the TP degree exceeds the KV-head count, adding GPUs replicates the cache rather than sharding it further.

This duplication caps scalability and hinders real-time performance.
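
A minimal sketch of why TP alone plateaus, assuming a GQA-style model with eight KV heads (an example value):

```python
# KV heads can only be split head-by-head, so once the TP degree exceeds the
# KV-head count, the remaining GPUs hold duplicate copies of some head.
# The head count below is an assumed example value.

N_KV_HEADS = 8  # e.g., a GQA model with few KV heads

def kv_heads_stored_per_gpu(n_kv_heads, tp_degree):
    # Ceil-divide heads across GPUs, but never less than one full head per GPU.
    return max(1, -(-n_kv_heads // tp_degree))

for tp in (2, 4, 8, 16, 32, 64):
    print(f"TP={tp:2d}: {kv_heads_stored_per_gpu(N_KV_HEADS, tp)} KV head(s) per GPU")
# Beyond TP=8 the per-GPU KV footprint plateaus: extra GPUs replicate heads
# instead of sharding the cache further.
```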

Helix Parallelism: A New Execution Model

Helix solves this by introducing a temporal pipeline that decouples the parallelism strategy used for attention from the one used for FFN computation. Inspired by a DNA helix, it interleaves multiple forms of parallelism (KV, tensor, and expert) across time, keeping GPUs efficiently utilized across all stages.

Attention Phase: Shard Smart, Communicate Fast

Helix applies:

  • KV Parallelism (KVP): Splits the KV cache across GPUs by sequence dimension.
  • Tensor Parallelism across Attention (TPA): Distributes QKV projections, keeping TPA ≤ number of KV heads to avoid duplication.

This enables local, efficient FlashAttention per GPU, followed by a single all-to-all exchange to aggregate results. Crucially, this communication is independent of context length—making it scalable even for million-token windows.
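
The exchange stays cheap because each KVP rank only has to send a fixed-size partial output plus its softmax statistics, nothing proportional to its slice of the context. The NumPy sketch below illustrates the merge math on a single process; it is an illustration of the principle, not NVIDIA's kernel or communication code, and all shapes are assumed.

```python
import numpy as np

# Single-process illustration: sequence-sharded (KVP-style) attention can be
# merged exactly from per-shard partial outputs plus softmax statistics.
# The data exchanged per rank is O(heads * head_dim), independent of context length.

def partial_attention(q, k_shard, v_shard):
    scores = q @ k_shard.T / np.sqrt(q.shape[-1])   # (1, shard_len)
    m = scores.max()                                 # running max for numerical stability
    p = np.exp(scores - m)
    return p @ v_shard, m, p.sum()                   # unnormalized output + statistics

def merge(partials):
    m_global = max(m for _, m, _ in partials)
    num = sum(out * np.exp(m - m_global) for out, m, _ in partials)
    den = sum(s * np.exp(m - m_global) for _, m, s in partials)
    return num / den

rng = np.random.default_rng(0)
d = 64
q = rng.normal(size=(1, d))
k = rng.normal(size=(1024, d))
v = rng.normal(size=(1024, d))

full = merge([partial_attention(q, k, v)])                                   # one device
sharded = merge([partial_attention(q, k[i::4], v[i::4]) for i in range(4)])  # 4 KVP shards
assert np.allclose(full, sharded)
```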

Helix uses HOP-B (Helix Overlap Pipeline, Batch-wise) to hide communication latency behind computation: as soon as attention finishes for one batch element, its results are exchanged while attention for the next element is computed, keeping GPUs busy and reducing TTL.
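
A minimal sketch of that pipelining pattern, where `attend` and `comm_exchange` are hypothetical stand-ins for the attention kernel and the all-to-all collective; the real system overlaps CUDA compute with NVLink communication rather than using Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

# HOP-B-style software pipelining across the batch dimension: while the
# exchange for batch element i is in flight, attention for element i+1 runs.
# `attend` and `comm_exchange` are hypothetical placeholders.

def hop_b_decode_step(batch, attend, comm_exchange):
    outputs = [None] * len(batch)
    with ThreadPoolExecutor(max_workers=1) as comm:  # stands in for a comm stream
        in_flight = None                             # (index, future) of pending exchange
        for i, sample in enumerate(batch):
            partial = attend(sample)                 # compute attention for element i
            if in_flight is not None:                # drain exchange for element i-1
                j, fut = in_flight
                outputs[j] = fut.result()
            in_flight = (i, comm.submit(comm_exchange, partial))
        if in_flight is not None:                    # drain the last exchange
            j, fut = in_flight
            outputs[j] = fut.result()
    return outputs
```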

FFN Phase: Seamless Reuse of GPUs

After attention, the same set of GPUs immediately transitions to FFN execution:

  • For dense models, GPUs are reconfigured into a 1D TP layout.
  • For MoE models, a 2D TP × Expert Parallel (EP) grid is used.

Since the attention output is already partitioned by hidden dimension, this phase launches with no idle time or extra communication overhead.
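
A small sketch of how the same GPU pool can be reinterpreted between phases; the grid sizes are arbitrary example values, not a recommended configuration.

```python
# The same physical GPUs are addressed through different logical grids per phase:
# a (KVP x TPA) grid for attention, then a 1D TP layout (dense FFN) or a
# TP x EP grid (MoE FFN). Grid sizes below are arbitrary example values.

N_GPUS = 16
KVP, TPA = 4, 4        # attention phase: KVP * TPA == N_GPUS
TP_FFN, EP = 8, 2      # MoE FFN phase:  TP_FFN * EP == N_GPUS

for rank in range(N_GPUS):
    attn = (rank // TPA, rank % TPA)           # (kvp_group, tpa_shard)
    dense_ffn = rank                           # 1D tensor-parallel shard id
    moe_ffn = (rank // TP_FFN, rank % TP_FFN)  # (expert_group, tp_shard)
    print(f"rank {rank:2d}: attn {attn}  dense-FFN TP {dense_ffn}  MoE (EP, TP) {moe_ffn}")
```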

KV Cache Management: Staggered Updates

To avoid DRAM hotspots during decoding, KV cache updates are staggered in a round-robin fashion across KVP GPUs. This balances memory use and ensures consistent throughput regardless of batch size or context length.
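
A minimal sketch of the staggering idea, with a toy in-memory structure standing in for per-GPU DRAM; the class and its layout are illustrative, not the actual cache manager.

```python
from collections import defaultdict

# Toy illustration of staggered KV-cache writes: each newly generated token's
# K/V entry is appended to one KVP rank at a time, round-robin, so no single
# GPU's DRAM becomes a write hotspot. The data structure is illustrative only.

class RoundRobinKVCache:
    def __init__(self, num_kvp_ranks):
        self.num_kvp_ranks = num_kvp_ranks
        self.shards = defaultdict(list)        # rank -> [(position, k, v), ...]

    def append(self, position, k, v):
        rank = position % self.num_kvp_ranks   # stagger writes across KVP ranks
        self.shards[rank].append((position, k, v))
        return rank

cache = RoundRobinKVCache(num_kvp_ranks=4)
for pos in range(8):
    print(f"token {pos} -> KVP rank {cache.append(pos, k=None, v=None)}")
```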

Results on Blackwell Hardware

Simulations on NVIDIA’s GB200 NVL72 (Blackwell) show dramatic gains using Helix with a 1-million-token context on DeepSeek-R1 671B:

  • Up to 32x more concurrent users at the same latency (higher tokens/s/GPU).
  • Up to 1.5x lower TTL in low-concurrency, latency-sensitive scenarios.

These improvements come from full KV and FFN sharding across GPUs, minimizing DRAM pressure and maximizing compute efficiency. Helix pushes the throughput-latency Pareto frontier, delivering better performance across the board.
