MLPerf Inference 4.1 results show gains as Nvidia Blackwell makes its testing debut

MLCommons is out today with its latest set of MLPerf inference results. The new results mark the debut of a new generative AI benchmark as well as the first validated test results for Nvidia’s next-generation Blackwell GPU processor.

MLCommons is a multi-stakeholder, vendor-neutral organization that manages the MLPerf benchmarks for both AI training and AI inference. The latest round of MLPerf inference benchmarks provides a comprehensive snapshot of the rapidly evolving AI hardware and software landscape. With 964 performance results submitted by 22 organizations, these benchmarks serve as a vital resource for enterprise decision-makers navigating the complex world of AI deployment. By offering standardized, reproducible measurements of AI inference capabilities across various scenarios, MLPerf enables businesses to make informed choices about their AI infrastructure investments, balancing performance, efficiency and cost.

As part of MLPerf Inference v4.1, there are a series of notable additions. For the first time, MLPerf is evaluating the performance of a Mixture of Experts (MoE) model, specifically Mixtral 8x7B. This round of benchmarks featured an impressive array of new processors and systems, many making their first public appearance. Notable entries include AMD’s MI300X, Google’s TPUv6e (Trillium), Intel’s Granite Rapids, Untether AI’s SpeedAI 240 and the Nvidia Blackwell B200 GPU.

“We just have a tremendous breadth of diversity of submissions and that’s really exciting,” David Kanter, founder and head of MLPerf at MLCommons, said during a call discussing the results with press and analysts. “The more different systems that we see out there, the better for the industry, more opportunities and more things to compare and learn from.”

Introducing the Mixture of Experts (MoE) benchmark for AI inference

A major highlight of this round was the introduction of the Mixture of Experts (MoE) benchmark, designed to address the challenges posed by increasingly large language models.

“The models have been increasing in size,” Miro Hodak, senior member of the technical staff at AMD and one of the chairs of the MLCommons inference working group, said during the briefing. “That’s causing significant issues in practical deployment.”

Hodak explained that, at a high level, instead of having one large, monolithic model, the MoE approach uses several smaller models that act as experts in different domains. Anytime a query comes in, it is routed to one of the experts.
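
For illustration, here is a minimal sketch that assumes nothing about Mixtral’s internals beyond the routing idea itself: a small router scores the experts for each incoming token, only the top-scoring ones run (Mixtral routes each token to two of its eight experts), and their outputs are blended. All weights below are random placeholders, not anything from the actual model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy stand-ins: in Mixtral 8x7B each expert is a 7-billion-parameter
# component; here each is just a random matrix for illustration.
rng = np.random.default_rng(0)
NUM_EXPERTS, HIDDEN = 8, 16
experts = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]
router = rng.normal(size=(HIDDEN, NUM_EXPERTS))  # learned gating weights in a real model

def moe_layer(token, top_k=2):
    """Score all experts, run only the top_k best, and combine their
    outputs weighted by the renormalized gate scores."""
    gate_scores = softmax(token @ router)        # one score per expert
    chosen = np.argsort(gate_scores)[-top_k:]    # indices of the top_k experts
    weights = gate_scores[chosen] / gate_scores[chosen].sum()
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=HIDDEN)
print(moe_layer(token).shape)  # (16,) -- same shape as a dense layer's output
```

Because only two of the eight experts actually run per token, the layer’s compute cost per query is roughly a quarter of running everything, which is where the deployment efficiency Hodak describes comes from.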

The MoE benchmark tests performance on different hardware using the Mixtral 8x7B model, which consists of eight experts, each with 7 billion parameters. It combines three different tasks, sketched in code after the list:

  1. Question-answering based on the Open Orca dataset
  2. Math reasoning using the GSM8K dataset
  3. Coding tasks using the MBXP dataset
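
To make that three-task composition concrete, here is a hedged sketch of what a mixed-task evaluation loop might look like. The prompts, references, exact_match scorer and generate() callable are all hypothetical placeholders; the real harness lives in MLCommons’ mlcommons/inference repository and also measures throughput and latency under MLPerf’s defined scenarios.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical harness sketch: prompts, references and scorers below are
# toy placeholders, not the actual MLPerf datasets or accuracy scripts.
@dataclass
class Task:
    name: str
    examples: list[tuple[str, str]]      # (prompt, reference) pairs
    score: Callable[[str, str], float]   # (model_output, reference) -> score

def exact_match(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())

tasks = [
    Task("openorca_qa", [("What is the capital of France?", "Paris")], exact_match),
    Task("gsm8k_math",  [("If 3 pens cost $6, how much is 1 pen?", "$2")], exact_match),
    Task("mbxp_coding", [("Reverse a string in Python.", "s[::-1]")], exact_match),
]

def evaluate(generate: Callable[[str], str]) -> dict[str, float]:
    """Run every task's prompts through the same model and report
    one accuracy number per task."""
    results = {}
    for task in tasks:
        scores = [task.score(generate(prompt), ref) for prompt, ref in task.examples]
        results[task.name] = sum(scores) / len(scores)
    return results

# Example with a trivial stand-in "model":
print(evaluate(lambda prompt: "Paris"))
```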

He noted that the key goals were to better exercise the strengths of the MoE approach compared to a single-task benchmark and showcase the capabilities of this emerging architectural trend in large language models and generative AI. Hodak explained that the MoE approach allows for more efficient deployment and task specialization, potentially offering enterprises more flexible and cost-effective AI solutions.

Nvidia Blackwell is coming and it’s bringing some big AI inference gains

The MLPerf benchmarks are a great opportunity for vendors to preview upcoming technology. Rather than relying on marketing claims about performance, the MLPerf process provides rigorous, industry-standard testing that is peer reviewed.

Among the most anticipated pieces of AI hardware is Nvidia’s Blackwell GPU, which was first announced in March. While it will still be many months before Blackwell is in the hands of real users, the MLPerf Inference 4.1 results provide a promising preview of the power that is coming.

“This is our first performance disclosure of measured data on Blackwell, and we’re very excited to share this,” Dave Salvator of Nvidia said during a briefing with press and analysts.

MLPerf Inference 4.1 includes many different benchmarking tests. Blackwell’s gains stand out on the generative AI test built around MLPerf’s biggest LLM workload, Llama 2 70B.

“We’re delivering 4x more performance than our previous generation product on a per GPU basis,” Salvator said.

While the Blackwell GPU is a big new piece of hardware, Nvidia is continuing to squeeze more performance out of its existing GPU architectures as well. The Nvidia Hopper GPU keeps on getting better: Nvidia’s MLPerf Inference 4.1 results show Hopper delivering up to 27% more performance than in the last round of results six months ago.

“These are all gains coming from software only,” Salvator said. “In other words, this is the very same hardware we submitted about six months ago, but because of ongoing software tuning that we do, we’re able to achieve more performance on that same platform.”