NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads

Inference Is the New Frontier in AI Infrastructure

As AI systems grow more powerful, inference has become the next big challenge. Today’s models aren’t just generating text—they’re evolving into agentic systems capable of multi-step reasoning, persistent memory, and long-horizon context. This unlocks breakthroughs in fields like software development, video generation, and deep research, but it also demands an entirely new level of infrastructure performance.

These advanced use cases require handling massive context windows—from analyzing entire codebases to maintaining coherence across millions of tokens in long-form video and research tasks. Meeting these needs is pushing current compute, memory, and networking systems to their limits.

Enter the SMART Framework: Optimizing Inference at Scale

To address this shift, NVIDIA has introduced the SMART framework—an approach to optimize inference across:

  • Scale
  • Multidimensional performance
  • Architecture
  • ROI
  • Technology ecosystem

This framework is powered by NVIDIA’s latest platforms, including:

  • NVIDIA Blackwell
  • NVIDIA GB200 NVL72
  • NVFP4 for efficient low-precision inference (a brief quantization sketch follows below)
  • Open-source software like TensorRT-LLM and NVIDIA Dynamo for orchestration

Together, these components reengineer inference infrastructure from the ground up.
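As a rough illustration of what low-precision inference arithmetic involves, here is a minimal Python/NumPy sketch of blockwise 4-bit quantization of weights. This is a generic stand-in, not the NVFP4 format itself (NVFP4 is a hardware-defined 4-bit floating-point encoding); the function names, block size, and integer code range are illustrative assumptions only.

```python
import numpy as np

def quantize_blockwise_int4(weights: np.ndarray, block_size: int = 32):
    """Illustrative blockwise 4-bit quantization (not the NVFP4 spec).

    Each block of `block_size` values shares one floating-point scale;
    values are stored as signed integers in [-8, 7]."""
    flat = weights.astype(np.float32).reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # per-block scale
    scales[scales == 0] = 1.0                                 # avoid divide-by-zero
    codes = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_blockwise_int4(codes: np.ndarray, scales: np.ndarray, shape):
    """Reconstruct an approximate FP32 tensor from 4-bit codes and scales."""
    return (codes.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(4, 64).astype(np.float32)
codes, scales = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(codes, scales, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

The point of the sketch is only the trade-off it makes visible: storing 4-bit codes plus a small number of scales cuts memory and bandwidth per weight at the cost of a bounded reconstruction error.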

Disaggregated Inference: A New Architectural Paradigm

Inference is not monolithic. It consists of two distinct phases:

  • Context phase: Highly compute-bound; processes large inputs to generate the first token.
  • Generation phase: Memory bandwidth-bound; depends on fast memory access and interconnects like NVLink for token-by-token output.

Disaggregated inference separates these phases, allowing each to be optimized independently—boosting throughput, reducing latency, and improving resource utilization.

But disaggregation adds orchestration complexity. That’s where NVIDIA Dynamo comes in—coordinating low-latency KV cache transfers, LLM-aware routing, and memory management. This approach has already proven its value in MLPerf Inference benchmarks with the GB200 NVL72.
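To make the split concrete, below is a minimal, framework-agnostic Python sketch of a disaggregated serving flow: a context (prefill) worker consumes the full prompt once and hands off its KV cache plus the first token, and a generation (decode) worker then produces the remaining tokens one at a time. All class and function names here are hypothetical stand-ins, not the NVIDIA Dynamo API; in a real deployment the two workers run on different GPUs and the KV cache transfer happens over NVLink or the network under the orchestration layer.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Stand-in for per-layer key/value tensors accumulated during prefill.
    entries: list = field(default_factory=list)

class ToyModel:
    """Trivial stand-in for an LLM: records tokens in the KV cache and
    'predicts' a deterministic fake next token. A real model replaces this."""
    def forward(self, token, kv: KVCache):
        kv.entries.append(token)               # attention state grows with context
        return hash((token, len(kv.entries)))  # fake logits

    def sample(self, logits):
        return logits % 1000                   # fake token id

class ContextWorker:
    """Compute-bound prefill: processes the whole prompt once."""
    def __init__(self, model):
        self.model = model

    def prefill(self, prompt_tokens):
        kv = KVCache()
        logits = None
        for tok in prompt_tokens:                 # one pass over the full input
            logits = self.model.forward(tok, kv)  # fills the KV cache as it goes
        return self.model.sample(logits), kv      # first token + cache handoff

class GenerationWorker:
    """Bandwidth-bound decode: one token per step, reusing the KV cache."""
    def __init__(self, model):
        self.model = model

    def decode(self, first_token, kv, max_new_tokens):
        out, tok = [first_token], first_token
        for _ in range(max_new_tokens - 1):
            tok = self.model.sample(self.model.forward(tok, kv))
            out.append(tok)
        return out

# In a disaggregated deployment, prefill and decode run on separate GPUs and the
# KV cache is transferred between them by the orchestration layer.
model = ToyModel()
prompt = [101, 7, 42, 9]
first, kv = ContextWorker(model).prefill(prompt)
print(GenerationWorker(model).decode(first, kv, max_new_tokens=8))
```

The sketch shows why the two halves want different hardware: prefill does heavy compute over the whole prompt in one shot, while decode repeatedly re-reads the growing KV cache to emit a single token per step.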

Introducing NVIDIA Rubin CPX: Purpose-Built for Long-Context Inference

To meet the demands of context-heavy workloads, NVIDIA is launching the Rubin CPX GPU—a new compute engine optimized specifically for the context phase of inference.

Rubin CPX Key Features:

  • 30 petaFLOPs of NVFP4 compute
  • 128 GB of GDDR7 memory
  • 3× attention acceleration vs. GB300 NVL72
  • Hardware-accelerated video encoding and decoding

This makes Rubin CPX ideal for inference tasks that process very long sequences, such as large-scale software development tools and high-resolution video generation, and it raises both throughput and ROI when deployed as the context-phase engine in disaggregated infrastructure.

Vera Rubin NVL144 CPX: Full-Stack Inference at Scale

Rubin CPX is designed to work seamlessly with:

  • NVIDIA Vera CPUs
  • Rubin GPUs for generation
  • NVIDIA Dynamo for orchestration

Together, they power the Vera Rubin NVL144 CPX rack—a complete solution for high-throughput long-context inference.

Vera Rubin NVL144 CPX Specs:

  • 144 Rubin CPX GPUs
  • 144 Rubin GPUs
  • 36 Vera CPUs
  • 8 exaFLOPs of NVFP4 compute (7.5× GB300 NVL72)
  • 100 TB high-speed memory
  • 1.7 PB/s memory bandwidth

The rack scales out over Quantum-X800 InfiniBand or Spectrum-X Ethernet with ConnectX-9 SuperNICs, and is orchestrated end-to-end by NVIDIA Dynamo.

Unmatched Economics: Up to 50x ROI

At scale, NVIDIA projects this architecture can deliver a 30x to 50x return on investment, translating to as much as $5B in token revenue from a $100M CAPEX investment (the $5B-on-$100M figure corresponds to the 50x upper end of that range). For enterprises building the next generation of generative AI tools, this sets a new standard for inference economics.

The Future of Inference Starts Here

With Rubin CPX, disaggregated architecture, and full-stack orchestration via Dynamo, NVIDIA is defining the future of inference—powering advanced, long-context AI systems and enabling entirely new classes of applications.

From code to video to research, NVIDIA’s disaggregated inference platform is built for the complexity ahead—and ready to scale with it.
