
NVIDIA’s Blackwell architecture has once again set a new industry standard for AI inference performance and efficiency, sweeping the inaugural SemiAnalysis InferenceMAX v1 benchmarks. The independent suite evaluates the total cost of compute across real-world AI workloads, providing the first holistic view of inference economics at scale.
Setting a New Standard for AI Inference Economics
InferenceMAX v1 confirms that NVIDIA Blackwell delivers the highest performance, best efficiency, and strongest return on investment for AI factories and hyperscale deployments.
- Best ROI: A single NVIDIA GB200 NVL72 system — a $5 million investment — can generate $75 million in DeepSeek R1 token revenue, achieving a 15x return on investment.
- Lowest TCO: With software optimizations, NVIDIA B200 systems now process one million tokens for just two cents on gpt-oss, cutting cost per token by 5x in two months (a worked cost sketch follows this list).
- Best Throughput & Interactivity: The B200 GPU achieves 60,000 tokens per second per GPU and 1,000 tokens per second per user on gpt-oss using the latest TensorRT-LLM stack.
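To make these economics concrete, here is a back-of-the-envelope sketch of how cost per million tokens and ROI fall out of throughput and system price. The GPU-hour price below is an illustrative assumption chosen to reproduce the two-cent figure; it is not an InferenceMAX input. The system cost and revenue numbers come from the article itself.

```python
# Back-of-the-envelope inference economics. Figures marked as assumptions
# are illustrative placeholders, not InferenceMAX inputs.

SECONDS_PER_HOUR = 3_600

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """Dollars to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * SECONDS_PER_HOUR
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

gpu_hour = 4.32        # assumed fully burdened B200 GPU-hour price
throughput = 60_000    # per-GPU gpt-oss throughput cited above

print(f"${cost_per_million_tokens(gpu_hour, throughput):.2f} per 1M tokens")
# -> $0.02: the two-cent figure holds when the GPU-hour price works out
#    to roughly $4.32 at this throughput.

# The 15x ROI claim is plain division of the article's revenue and cost:
system_cost = 5_000_000      # GB200 NVL72 price from the article
token_revenue = 75_000_000   # DeepSeek R1 token revenue from the article
print(f"ROI: {token_revenue / system_cost:.0f}x")  # -> ROI: 15x
```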
As AI transitions from one-shot question answering to multi-step reasoning and tool use, the compute demands of inference are skyrocketing — and NVIDIA Blackwell is redefining what’s possible.
“Inference is where AI delivers value every day,” said Ian Buck, Vice President of Hyperscale and High-Performance Computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”
About InferenceMAX v1
Released by SemiAnalysis, InferenceMAX v1 is the first independent benchmark to assess total cost of compute across real-world AI workloads. It measures performance, efficiency, and economics across multiple leading models and platforms — with results that can be publicly verified.
Unlike traditional benchmarks focused on raw speed, InferenceMAX evaluates throughput, responsiveness, and total cost per token — key metrics for enterprises running reasoning-intensive AI systems.
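Those three metrics are simple to define from raw measurements. The sketch below shows one plausible way to derive them from a single benchmark run; the field names and sample values are illustrative, not InferenceMAX’s actual schema, and the GPU-hour rate is an assumption.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    # Raw measurements from one benchmark configuration. Field names are
    # illustrative; InferenceMAX's real schema may differ.
    total_output_tokens: int
    wall_clock_seconds: float
    num_gpus: int
    concurrent_users: int
    gpu_cost_per_hour: float  # assumed dollar rate, not a benchmark output

    @property
    def throughput_per_gpu(self) -> float:
        """Capacity: output tokens per second per GPU."""
        return self.total_output_tokens / self.wall_clock_seconds / self.num_gpus

    @property
    def tokens_per_sec_per_user(self) -> float:
        """Responsiveness: how fast each concurrent user sees tokens."""
        return self.total_output_tokens / self.wall_clock_seconds / self.concurrent_users

    @property
    def cost_per_million_tokens(self) -> float:
        """Economics: dollars per million output tokens."""
        total_cost = self.wall_clock_seconds / 3_600 * self.num_gpus * self.gpu_cost_per_hour
        return total_cost / self.total_output_tokens * 1_000_000

# Invented sample run, for illustration only.
run = RunResult(total_output_tokens=12_000_000, wall_clock_seconds=60.0,
                num_gpus=8, concurrent_users=250, gpu_cost_per_hour=4.0)
print(run.throughput_per_gpu, run.tokens_per_sec_per_user,
      f"{run.cost_per_million_tokens:.3f}")
```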
Benchmark Results: Blackwell Sweeps the Field
The NVIDIA Blackwell B200 and GB200 NVL72 platforms achieved record-breaking results across all tested workloads, including open-source and community models such as gpt-oss (OpenAI), Llama 3 (Meta), and DeepSeek R1 (DeepSeek AI).
- On DeepSeek R1 (671B), Blackwell GPUs delivered up to 15x ROI and industry-leading throughput.
- On gpt-oss-120B, cost per million tokens dropped to $0.02, the lowest in the industry.
- On Llama 3.3 70B, Blackwell delivered 10,000 tokens per second per GPU and 50 tokens per second per user — 4x faster than NVIDIA H200.
These results reflect NVIDIA’s hardware–software codesign philosophy, integrating innovations from open-source communities like FlashInfer, SGLang, and vLLM to ensure top performance across all models.
Continuous Software Optimization Drives Performance
NVIDIA’s performance leadership extends beyond silicon. The company’s TensorRT-LLM v1.0 release significantly boosts open-source model performance through parallelization and speculative decoding techniques.
- The gpt-oss-120B-Eagle3-v2 model introduces speculative decoding, predicting multiple tokens at once, boosting throughput 5x from 6,000 to 30,000 tokens per second per GPU and improving responsiveness to 100 tokens per second per user (see the sketch after this list).
- Combined with NVLink Switch’s 1,800 GB/s bidirectional bandwidth, these optimizations drive massive gains in end-to-end inference speed and efficiency.
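Speculative decoding is the key idea behind the Eagle3 gains: a small draft model cheaply proposes several tokens, the large target model checks them in a single parallel pass, and the longest matching prefix is kept, so each expensive forward pass can emit multiple tokens. The sketch below is a minimal greedy-acceptance toy with stand-in model callables, not the TensorRT-LLM or Eagle3 implementation; production systems verify against the target distribution rather than by exact greedy matching.

```python
from typing import Callable, List

# Toy speculative decoding loop (greedy acceptance). `draft_next` and
# `target_next` stand in for small/large language models: each maps a
# token prefix to the next token id.

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    num_draft: int = 4,
    max_new: int = 32,
) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1) Draft model cheaply proposes `num_draft` candidate tokens.
        proposal, ctx = [], list(tokens)
        for _ in range(num_draft):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2) Target model verifies the candidates. In a real system this is
        #    one batched forward pass, which is where the speedup lives.
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break

        # 3) Keep the accepted prefix, then emit one token from the target
        #    model so every iteration makes progress.
        tokens.extend(proposal[:accepted])
        tokens.append(target_next(tokens))
    return tokens[: len(prefix) + max_new]
```

Each verification pass yields up to `num_draft + 1` tokens instead of one, which is why per-GPU throughput climbs when the draft model’s guesses are usually right.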
Performance Efficiency That Translates to Real Value
In modern AI factories, tokens per watt and cost per token are as critical as throughput. The NVIDIA Blackwell architecture delivers:
- 10x higher throughput per megawatt versus the previous generation, maximizing output within power limits (a quick arithmetic sketch follows this list).
- 15x lower cost per million tokens, drastically reducing operational expenses and enabling broader AI deployment.
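Tokens per megawatt is easy to reason about once you fix a per-GPU power budget. The 1.2 kW figure below is an assumed placeholder for the sketch, not a measured Blackwell number.

```python
# Tokens-per-megawatt arithmetic under an assumed per-GPU power budget.

tokens_per_sec_per_gpu = 60_000  # gpt-oss throughput cited above
watts_per_gpu = 1_200            # assumed power draw, not a measured figure

gpus_per_megawatt = 1_000_000 / watts_per_gpu
tokens_per_sec_per_mw = tokens_per_sec_per_gpu * gpus_per_megawatt
print(f"{tokens_per_sec_per_mw:,.0f} tokens/s per megawatt")
# -> 50,000,000 tokens/s per MW under these assumptions. A 10x gain in
#    throughput per megawatt means 10x more tokens from the same
#    facility power envelope.
```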
The InferenceMAX Pareto frontier further illustrates Blackwell’s advantage — achieving the optimal balance between cost, efficiency, throughput, and responsiveness across all workloads.
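For readers who want to reproduce that kind of chart, the sketch below filters a set of (interactivity, throughput) configurations down to their Pareto frontier. The `pareto_frontier` helper and the sample points are invented for illustration; they are not part of InferenceMAX.

```python
# Minimal 2D Pareto frontier: keep a configuration unless some other point
# is at least as good on both axes and strictly better on one.

def pareto_frontier(points):
    """points: list of (tokens_per_sec_per_user, tokens_per_sec_per_gpu)."""
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Invented sample configurations, for illustration only.
configs = [(10, 55_000), (50, 30_000), (100, 12_000), (50, 20_000), (30, 30_000)]
print(pareto_frontier(configs))
# -> [(10, 55000), (50, 30000), (100, 12000)]; dominated configs drop out.
```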
The Full-Stack Advantage: What Makes Blackwell Different
NVIDIA’s leadership stems from deep integration across the full stack — hardware, interconnect, software, and ecosystem:
- NVFP4 low-precision format for efficiency without accuracy loss (a toy quantization sketch follows this list)
- Fifth-generation NVLink connecting 72 GPUs as one for seamless scaling
- NVLink Switch enabling advanced tensor, expert, and data parallelism
- Annual hardware cadence plus continuous software updates that have doubled Blackwell performance since launch
- TensorRT-LLM, NVIDIA Dynamo, SGLang, and vLLM optimized for peak inference efficiency
- A global ecosystem of 7 million CUDA developers and 1,000+ open-source projects
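To make the NVFP4 bullet concrete: the core idea of block-scaled low precision is that a small group of 4-bit values shares one higher-precision scale factor, preserving dynamic range despite the tiny element format. The sketch below is a toy illustration of that idea only; the real format’s element encoding, scale encoding, block size, and rounding are handled in hardware and differ in detail.

```python
# Toy block-scaled 4-bit quantization in the spirit of NVFP4. Simplified:
# the real format's scale encoding and rounding behavior are not modeled.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(block, grid=E2M1_GRID):
    # One shared scale per block, sized so the largest value maps to the
    # top of the 4-bit grid.
    scale = max(abs(x) for x in block) / grid[-1] or 1.0
    mags = [min(grid, key=lambda g: abs(abs(x) / scale - g)) for x in block]
    return scale, [(-m if x < 0 else m) for m, x in zip(mags, block)]

def dequantize_block(scale, qblock):
    return [scale * v for v in qblock]

vals = [0.31, -1.7, 0.02, 2.4]
scale, q = quantize_block(vals)
print([round(x, 2) for x in dequantize_block(scale, q)])
# -> [0.4, -1.6, 0.0, 2.4]: coarse per element, but the block scale keeps
#    large and small values representable together.
```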
The Bigger Picture: From Pilots to AI Factories
AI infrastructure is evolving from experimental clusters to AI factories — facilities that manufacture intelligence, turning data into tokens, insights, and decisions in real time.
Benchmarks like InferenceMAX v1 help organizations make data-driven infrastructure choices, tune for cost per token and latency SLAs, and optimize for ever-changing workloads.
With the Think SMART framework, NVIDIA continues guiding enterprises through this transformation — showing how Blackwell’s full-stack design turns performance into profits.