Nvidia said to be prepping Blackwell GPUs for China • The Register

Comment US trade restrictions on the sale of AI accelerators to China haven’t deterred Nvidia from bringing its latest Blackwell architecture to the Middle Kingdom.

According to a report citing unnamed sources, Nvidia is preparing yet another GPU for the Chinese market that is designed to slip under the US Commerce Department’s performance limits.

The chip, called the B20, will be based on the GPU giant’s Blackwell architecture announced back at GTC in the spring. Compared to its prior-gen Hopper architecture, Nvidia claims its Blackwell-based chips are between 2.5x and 5x faster in terms of raw floating point performance.

Nvidia has reportedly tapped Chinese system builder Inspur as the prime distributor for the chip, with shipments allegedly slated to begin in the second quarter of next year. Inspur’s position on the US Entities List, a prize it allegedly won by flogging off US tech to the Chinese military, could however prove problematic, assuming of course the report turns out to be accurate.

However, pre-existing export controls are likely to limit the potency of Nvidia’s next batch of China market chips. This is because the H20, currently the most powerful chip Nvidia can sell in the region without a license, is already running up against the limit of what’s allowed for export.

US export controls implemented last October established caps on “total processing performance” and “performance density.” The rules effectively barred the sale of many Nvidia datacenter cards and briefly blocked the consumer-focused RTX 4090, before a special model for the Chinese market could be rolled out.

However, within a month of the rules going into effect, rumblings of a trio of cut-down cards designed to slide under these limits had surfaced. The most powerful of these is the 96GB H20, which boasts 296 teraFLOPS of FP8 performance.

As we understand it, a B20 accelerator would be capped at the same performance level, at least in terms of FP8 performance. Blackwell introduced support for FP4 data types, so we expect the advertised teraFLOPS figure to be twice that of the H20, even though FP4 and FP8 figures aren’t directly comparable.

If you’re curious, you can find a full breakdown of how these performance and compute density limits are calculated here.
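For a rough sense of the arithmetic, here’s a minimal sketch of the “total processing performance” metric as commonly described: advertised dense teraFLOPS multiplied by the bit length of the operation. The 296 TFLOPS FP8 figure is the H20 spec cited above; the threshold values are our reading of the October 2023 rules and should be treated as illustrative rather than definitive.

```python
# Back-of-envelope TPP calculation, as we understand the metric:
# TPP = advertised dense teraFLOPS x bit length of the operation.
# Threshold figures below are assumptions based on the published rules.

def tpp(teraflops: float, bit_length: int) -> float:
    """Total processing performance: TFLOPS x operation bit length."""
    return teraflops * bit_length

h20_tpp = tpp(296, 8)  # H20: 296 teraFLOPS of FP8
print(h20_tpp)         # 2368.0

# Chips at or above a TPP of 4,800 are controlled outright, and a lower
# band starting at 2,400 triggers extra scrutiny -- the H20's 2,368 sits
# just under that line, which is what "running up against the limit" means.
```

A hypothetical B20 held to the same FP8 ceiling would land at the same TPP, which is why the architecture change alone can’t buy back raw compute.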

But while US export controls mean the floating point performance and compute density of these chips remains limited, that doesn’t mean a B20 couldn’t offer a generational improvement in performance. When it comes to running pre-trained large language models, performance, often measured in tokens per second, is limited more by memory bandwidth than how many FLOPS or TOPS the chip can push.

As such, any increase in memory bandwidth over the H20, which is apparently capable of 4TB/s, should result in sizable performance gains, at least for inferencing. How big those gains will actually be will depend on how the chip is architected and just how many HBM stacks it’s paired with.
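To see why bandwidth dominates, consider the usual back-of-envelope bound: at batch size one, generating each token means streaming every model weight from memory once, so tokens per second can’t exceed bandwidth divided by model size in bytes. The 4TB/s figure is the H20 spec above; the 70-billion-parameter FP8 model is purely an illustrative assumption.

```python
# Rough upper bound on single-stream LLM decode throughput:
# tokens/sec <= memory bandwidth / model size in bytes.
# (Ignores KV-cache traffic and overlap, so real numbers come in lower.)

def max_tokens_per_sec(bandwidth_gb_s: float, params_billion: float,
                       bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on tokens/sec for a dense model."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# Hypothetical: H20 at ~4,000 GB/s running a 70B model quantized to FP8
print(round(max_tokens_per_sec(4000, 70, 1), 1))  # ~57.1 tokens/sec
```

Double the bandwidth and that ceiling doubles too, regardless of the FLOPS cap, which is why a bandwidth-enhanced B20 could still be a meaningful upgrade for inference.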

Nvidia declined The Register’s request for comment on the B20.

Register Comment

It’s no secret that US Commerce Secretary Gina Raimondo isn’t a big fan of Nvidia and other chip makers designing parts that come within a hair of export limits.

“I’m telling you, if you redesign a chip around a particular cutline that enables them to do AI, I am going to control it the very next day,” she said in a clear reference to Nvidia during a defense forum late last year.

The Biden Administration is now widely expected to enact more stringent export controls in the coming months to stifle Chinese AI development.

Considering the outsized impact of memory bandwidth and capacity on the performance of AI chatbots, it wouldn’t surprise us to see new limits targeting this spec.

As we mentioned earlier, memory bandwidth has a direct impact on the number of AI tokens — that’s words, phrases, punctuation or figures — a chip can spit out in a given second. Meanwhile memory capacity governs how large a model can be deployed on a single GPU or accelerator.

Because of this, chips like Nvidia’s H20 remain quite potent, even compared to the venerable H100, for less compute-bound workloads like running, as opposed to training, AI chatbots.

A cap on memory bandwidth could severely restrict sales of US chips to China. Whatever happens, any additional restrictions will no doubt have a material impact on Nvidia’s business as China still accounts for roughly 17 percent of the company’s annual revenues.

However, such a measure wouldn’t stop the development of domestic accelerators like the ones we’ve seen from Moore Threads, Huawei, and others. To stifle development here, the Biden administration is reportedly considering imposing a measure called the foreign direct product rule, which would allow it to place controls on the sale of any product that makes use of American tech. ®