NVIDIA Blackwell Ultra Sets the Bar in New MLPerf Inference Benchmark

NVIDIA Sets New Inference Records with Blackwell Ultra and Full-Stack Optimization

Inference performance is a cornerstone of AI infrastructure economics. Higher throughput translates directly into more tokens generated at greater speed, which means more revenue, lower total cost of ownership (TCO), and greater system efficiency.
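
To make that economics point concrete, here is a back-of-the-envelope sketch in Python. Every number in it (throughput, token price, GPU cost) is a hypothetical placeholder rather than an NVIDIA or MLPerf figure; the point is only how per-GPU throughput flows into revenue per GPU-hour and cost per million tokens.

```python
# Back-of-the-envelope math with hypothetical numbers (not NVIDIA figures):
# how per-GPU throughput flows into revenue per GPU-hour and cost per token.
TOKENS_PER_SEC_PER_GPU = 10_000      # assumed aggregate tokens/s for one GPU
PRICE_PER_MILLION_TOKENS = 2.00      # assumed price charged per 1M tokens ($)
GPU_COST_PER_HOUR = 10.00            # assumed fully loaded cost per GPU-hour ($)

tokens_per_hour = TOKENS_PER_SEC_PER_GPU * 3600
revenue_per_gpu_hour = tokens_per_hour / 1e6 * PRICE_PER_MILLION_TOKENS
cost_per_million_tokens = GPU_COST_PER_HOUR / (tokens_per_hour / 1e6)

print(f"revenue per GPU-hour: ${revenue_per_gpu_hour:.2f}")
print(f"cost per 1M tokens:   ${cost_per_million_tokens:.3f}")

# At the same GPU cost, a 1.4x throughput gain multiplies revenue per GPU-hour
# by 1.4 and divides the cost per token by 1.4.
```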

At the heart of this performance revolution is the NVIDIA GB300 NVL72, a rack-scale AI system powered by the Blackwell Ultra architecture. Less than six months after its debut at NVIDIA GTC, this next-generation system has already set new records in MLPerf Inference v5.1, delivering up to 1.4x more inference throughput on the DeepSeek-R1 benchmark compared to the previous-generation Blackwell-based GB200 NVL72.

Blackwell Ultra: Pushing the Boundaries of AI Performance

Blackwell Ultra builds on the Blackwell architecture with significant architectural enhancements:

  • 1.5x more NVFP4 AI compute
  • 2x faster attention-layer acceleration
  • Up to 288GB of HBM3e memory per GPU

This added horsepower has allowed NVIDIA’s AI platform to set records across all new MLPerf Inference v5.1 data center benchmarks, including:

  • DeepSeek-R1
  • Llama 3.1 405B Interactive
  • Llama 3.1 8B
  • Whisper

NVIDIA continues to lead every data center benchmark on a per-GPU basis, showcasing the impact of its full-stack, AI-optimized design.

Full-Stack Co-Design: The Key to Breakthrough Performance

NVIDIA’s record-setting results are the product of tight integration across hardware, software, and model optimization:

  • The NVFP4 format, NVIDIA’s custom 4-bit floating point precision, delivers better accuracy than other FP4 formats and accuracy comparable to higher-precision formats, while enabling significantly higher throughput (a sketch of its block-scaled numerics follows this list).
  • Using NVIDIA TensorRT Model Optimizer, models like DeepSeek-R1, Llama 3.1 405B, Llama 2 70B, and Llama 3.1 8B were quantized to NVFP4 for greater efficiency.
  • The open-source TensorRT-LLM library enables optimized, high-speed inference while maintaining accuracy.
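
For intuition about what the NVFP4 format involves, below is a minimal NumPy sketch of NVFP4-style quantize/dequantize numerics. It assumes the publicly described layout of 4-bit E2M1 values with one shared scale per 16-element block, and it keeps the block scales in FP32 rather than rounding them to FP8 as the real format does; it is an illustration of the numerics, not TensorRT Model Optimizer's implementation.

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 format used by NVFP4.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_MAX = 6.0
BLOCK = 16  # NVFP4 shares one scale per 16-element block


def nvfp4_fake_quant(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D tensor to NVFP4-like values and dequantize it back.

    Real NVFP4 stores the block scales in FP8 (E4M3) plus a per-tensor FP32
    scale; this simulation keeps the block scales in FP32 for simplicity.
    """
    x = x.astype(np.float32)
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)

    # One scale per block so the block's max magnitude maps to E2M1's max (6.0).
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_MAX
    block_scale = np.maximum(block_scale, 1e-12)

    # Snap each scaled value to the nearest representable E2M1 magnitude.
    scaled = blocks / block_scale
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_VALUES[nearest]

    # Dequantize: multiply the 4-bit values back by their block scales.
    return (quantized * block_scale).reshape(-1)[: len(x)]


if __name__ == "__main__":
    w = np.random.randn(64).astype(np.float32)
    w_q = nvfp4_fake_quant(w)
    print("max abs quantization error:", float(np.abs(w - w_q).max()))
```

In practice, quantizing a model to NVFP4 is handled by TensorRT Model Optimizer rather than hand-rolled code like this; the sketch only shows why per-block scaling lets a 4-bit format track the local dynamic range of the weights.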

Smarter Serving Strategies Boost LLM Inference

Large language model (LLM) inference is composed of two distinct workloads:

  1. Context processing (prefill) – processing the entire input prompt to produce the first output token.
  2. Token generation (decode) – producing the remaining output tokens, one at a time.

To optimize these tasks, NVIDIA leveraged disaggregated serving — a technique that splits the workloads to run on separate compute resources, maximizing overall throughput.
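
The toy Python sketch below illustrates the disaggregated-serving idea: one worker pool handles the context (prefill) phase and hands the first token plus the KV cache to a separate pool that runs token generation. The Request and KVCache types and the fake next-token logic are illustrative stand-ins, not part of TensorRT-LLM or Dynamo.

```python
# A toy, framework-agnostic sketch of disaggregated serving: one pool of
# "prefill" workers handles context processing (first token + KV cache),
# a separate pool of "decode" workers produces the remaining tokens.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class KVCache:                      # stand-in for the attention key/value cache
    tokens: list[int] = field(default_factory=list)


@dataclass
class Request:
    prompt: list[int]               # tokenized user input
    max_new_tokens: int = 8


def prefill(req: Request) -> tuple[int, KVCache]:
    """Context phase: process the whole prompt, emit the first output token."""
    cache = KVCache(tokens=list(req.prompt))       # "compute" the KV cache
    first_token = sum(req.prompt) % 100            # fake first-token logic
    return first_token, cache


def decode(req: Request, first_token: int, cache: KVCache) -> list[int]:
    """Generation phase: produce the remaining tokens one at a time."""
    out = [first_token]
    for _ in range(req.max_new_tokens - 1):
        nxt = (out[-1] + len(cache.tokens)) % 100  # fake next-token logic
        cache.tokens.append(nxt)
        out.append(nxt)
    return out


# Separate executors stand in for separate GPUs/nodes assigned to each phase.
prefill_pool = ThreadPoolExecutor(max_workers=2)
decode_pool = ThreadPoolExecutor(max_workers=4)


def serve(req: Request) -> list[int]:
    first, cache = prefill_pool.submit(prefill, req).result()        # phase 1
    return decode_pool.submit(decode, req, first, cache).result()    # phase 2


if __name__ == "__main__":
    print(serve(Request(prompt=[5, 17, 42])))
```

Splitting the phases this way lets each pool be sized and tuned for its own compute and memory profile, which is the property disaggregated serving exploits at rack scale.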

This approach was crucial in achieving a nearly 50% performance boost per GPU on the Llama 3.1 405B Interactive benchmark using GB200 NVL72, compared to traditional serving on DGX B200 systems.

NVIDIA Dynamo Makes Its Debut

NVIDIA also submitted results using its Dynamo inference framework for the first time in this MLPerf round, another example of how its expanding software ecosystem continues to drive performance gains.

Broad Ecosystem Delivers Strong Results

NVIDIA partners — including Azure, Broadcom, Cisco, CoreWeave, Dell, HPE, Lambda, Lenovo, Oracle, Supermicro, and others — submitted top-tier results using Blackwell and Hopper platforms. This wide availability ensures that enterprises can tap into industry-leading inference performance across public clouds and on-premises servers.
