NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

As large language models (LLMs) scale into the hundreds of billions of parameters, their intelligence grows, and so do their demands on compute infrastructure. Today's top models aren't just larger; they can reason, generating many intermediate tokens before arriving at a final response. Together, more parameters and more thinking are driving an urgent need for significantly higher inference performance.

Meeting that demand requires a cutting-edge technology stack—from chips and systems to software and tooling—supported by an ecosystem of developers constantly optimizing for production-scale workloads.

MLPerf Inference v5.1: New Benchmarks for a New Era of AI

MLPerf Inference v5.1, the latest edition of the industry’s leading AI inference benchmark, introduces new models and scenarios that reflect how modern AI is evolving. Benchmarks are run twice annually and provide a standardized way to measure performance across a wide range of AI tasks.

New Benchmarks in This Round:

  • DeepSeek-R1 (671B MoE model):
    A high-performance mixture-of-experts model developed by DeepSeek. In the server scenario, the performance targets are a 2-second time-to-first-token (TTFT) and 12.5 tokens per second per user (TPS/user), both measured at the 99th percentile.
  • Llama 3.1 405B:
    A new interactive scenario was introduced for this large model in the Llama 3.1 family. It sets stricter targets: 4.5-second TTFT and 12.5 TPS/user, offering a faster and more responsive benchmark than the traditional server setup.
  • Llama 3.1 8B:
    Replacing the previous GPT-J benchmark, this compact but capable model is tested across three scenarios:
    • Offline
    • Server (2s TTFT, 10 TPS/user)
    • Interactive (0.5s TTFT, 33 TPS/user)
  • Whisper (Speech Recognition):
    Hugely popular on Hugging Face, with nearly 5 million monthly downloads, Whisper replaces RNN-T as the speech recognition benchmark in MLPerf Inference.
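The TTFT and TPS/user pairs above define pass/fail latency constraints per scenario: TTFT is a ceiling on first-token latency, and TPS/user is a floor on sustained per-user decode throughput (12.5 TPS/user corresponds to 80 ms per output token). A minimal sketch of that check, using the target values quoted above (this is illustrative code, not the official MLPerf harness; the scenario keys and field names are invented for this example):

```python
# Targets quoted from the benchmark descriptions above.
# Scenario keys and field names are illustrative, not MLPerf identifiers.
SCENARIO_TARGETS = {
    "deepseek-r1/server":        {"ttft_s": 2.0, "tps_per_user": 12.5},
    "llama3.1-405b/interactive": {"ttft_s": 4.5, "tps_per_user": 12.5},
    "llama3.1-8b/server":        {"ttft_s": 2.0, "tps_per_user": 10.0},
    "llama3.1-8b/interactive":   {"ttft_s": 0.5, "tps_per_user": 33.0},
}

def meets_targets(scenario: str, ttft_s: float, tps_per_user: float) -> bool:
    """True if a measurement satisfies the scenario's TTFT ceiling
    and TPS/user floor."""
    t = SCENARIO_TARGETS[scenario]
    return ttft_s <= t["ttft_s"] and tps_per_user >= t["tps_per_user"]

# 1.8 s TTFT at 14 TPS/user clears the DeepSeek-R1 server targets:
print(meets_targets("deepseek-r1/server", 1.8, 14.0))        # True
# 0.6 s TTFT misses the 0.5 s interactive ceiling for Llama 3.1 8B:
print(meets_targets("llama3.1-8b/interactive", 0.6, 35.0))   # False
```

Note that in the actual benchmark these constraints are evaluated over the full measured distribution (e.g., at the 99th percentile for DeepSeek-R1), not on single requests.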

NVIDIA Raises the Bar with Blackwell Ultra

This round marks the first submission using NVIDIA’s Blackwell Ultra architecture, just six months after the original Blackwell architecture debuted in MLPerf Inference v5.0. Once again, NVIDIA has set new inference performance records across the board.

Key Highlights:

  • First to submit results using Blackwell Ultra
  • Record-setting performance on all new benchmarks:
    • DeepSeek-R1
    • Llama 3.1 405B
    • Llama 3.1 8B
    • Whisper
  • Highest per-GPU performance across all existing MLPerf Inference benchmarks

A Full-Stack Strategy for AI Inference Leadership

NVIDIA’s leadership in MLPerf is driven by more than silicon. Its full-stack approach, spanning chips, systems, optimized software such as TensorRT-LLM, and orchestration tools, delivers strong performance on real-world workloads.

As models continue to grow in size and complexity, and reasoning becomes core to their capabilities, NVIDIA’s ecosystem is evolving just as quickly, ensuring that developers, enterprises, and researchers have the performance they need to deploy next-generation AI at scale.
