AMD says future PCs will run 30B parameter models at 100T/s
Analysis Within a few years, AMD expects to have notebook chips capable of running 30 billion parameter large language models locally at a speedy 100 tokens per second.
Achieving this target – which also calls for a first token latency of 100ms – isn’t as simple as it sounds, and will require optimizations on both the software and hardware fronts. As it stands, AMD claims its Ryzen AI 300-series Strix Point processors, announced at Computex last month, can run LLMs of up to around seven billion parameters at 4-bit precision, at a modest 20 tokens a second with first token latencies of one to four seconds.
AMD aims to run 30 billion parameter models at 100 tokens a second, up from seven billion parameters and 20 tokens a second today
Hitting its 30 billion parameter, 100 token per second “North Star” performance target isn’t just a matter of cramming in a bigger NPU. More TOPS or FLOPS will certainly help – especially with first token latency – but for running large language models locally, memory capacity and bandwidth matter far more.
In this regard, LLM performance on Strix Point is limited in large part by its 128-bit memory bus – which, when paired with LPDDR5X, is good for somewhere in the neighborhood of 120-135 GB/s of bandwidth, depending on how fast your memory is.
Taken at face value, a 30 billion parameter model quantized to 4-bit precision will consume about 15GB of memory and require more than 1.5 TB/s of bandwidth to hit that 100 token per second goal. For reference, that’s roughly the bandwidth of a 40GB Nvidia A100 PCIe card with HBM2 – a part that burns a heck of a lot more power than any notebook chip could.
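For the curious, here’s a rough Python sketch of the arithmetic behind both figures. The LPDDR5X data rates are illustrative assumptions rather than AMD specifications; the bus width, model size, and token target come from the numbers above.

```python
# Back-of-the-envelope math for the two paragraphs above.
BUS_WIDTH_BITS = 128

def peak_bandwidth_gb_s(megatransfers_per_sec):
    # Bits moved across the bus per second, converted to GB/s.
    return BUS_WIDTH_BITS * megatransfers_per_sec * 1e6 / 8 / 1e9

for rate in (7_500, 8_533):   # assumed LPDDR5X speed grades, for illustration
    print(f"LPDDR5X-{rate}: ~{peak_bandwidth_gb_s(rate):.0f} GB/s")
# LPDDR5X-7500: ~120 GB/s    LPDDR5X-8533: ~137 GB/s

# Generating each token reads (roughly) every weight once, so the required
# bandwidth scales as model bytes multiplied by tokens per second.
params = 30e9
bytes_per_param = 0.5                        # 4-bit quantization
model_gb = params * bytes_per_param / 1e9    # ~15 GB of weights
needed_tb_s = model_gb * 100 / 1_000         # 100 tokens per second
print(f"~{model_gb:.0f} GB of weights, ~{needed_tb_s:.1f} TB/s of bandwidth")
```

And that’s just the weights – the KV cache and activations add further traffic per token, so 1.5 TB/s is, if anything, on the optimistic side.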
This means that, without optimizations to make the model less demanding, future SoCs from AMD are going to need much faster and higher capacity LPDDR to reach the chip designer’s target.
AI is evolving faster than silicon
These challenges aren’t lost on Mahesh Subramony, a senior fellow and silicon design engineer working on SoC development at AMD.
“We know how to get there,” Subramony told The Register, but while it might be possible to design a part capable of achieving AMD’s goals today, there’s not much point if no one can afford to use it or there’s nothing that can take advantage of it.
“If proliferation starts by saying everybody has to have a Ferrari, cars are not going to proliferate. You have to start by saying everybody gets a great machine, and you start by showing what you can do responsibly with it,” he explained.
“We have to build a SKU that meets the demands of 95 percent of the people,” he continued. “I would rather have a $1,300 laptop and then have my cloud run my 30 billion parameter model. It’s still cheaper today.”
When it comes to demonstrating the value of AI PCs, AMD is leaning heavily on its software partners. With products like Strix Point, that largely means Microsoft. “When Strix initially started, what we had was this deep collaboration with Microsoft that really drove, to some extent, our bounding box,” he recalled.
But while software can help to guide the direction of new hardware, it can take years to develop and ramp a new chip, Subramony explained. “Gen AI and AI use cases are developing way faster than that.”
With two years since ChatGPT’s debut to plot the technology’s evolution, AMD now has a better sense of where compute demands are heading, Subramony suggests – no doubt part of the reason the chip designer has set this target.
Overcoming the bottlenecks
There are several ways to work around the memory bandwidth challenge. For example, LPDDR5 could be swapped for high bandwidth memory – but, as Subramony notes, doing so isn’t exactly favorable, as it would dramatically increase the cost and compromise the SoC’s power consumption.
“If we can’t get to a 30 billion parameter model, we need to be able to get to something that delivers that same kind of fidelity. That means there’s going to be improvements that need to be done in training in trying to make those models smaller first,” Subramony explained.
The good news is there are quite a few ways to do just that – depending on whether you’re trying to prioritize memory bandwidth or capacity.
One potential approach is to use a mixture of experts (MoE) model along the lines of Mistral AI’s Mixtral. These MoEs are essentially a bundle of smaller expert models that work in conjunction with one another. The full MoE is typically loaded into memory – but because only one, or a handful, of those experts is active for any given token, the memory bandwidth requirements are substantially reduced compared to a monolithic model of the same size.
An MoE composed of six five-billion-parameter experts would only require a little over 250 GB/s of bandwidth to achieve the 100 token per second target – at 4-bit precision, at least.
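The back-of-the-envelope math, assuming a single 4-bit-quantized expert’s weights are streamed per generated token – routing layers and any shared parameters would nudge the figure up a little:

```python
# Bandwidth and footprint for the hypothetical six-expert MoE above,
# assuming one expert is read per generated token.
BYTES_PER_PARAM = 0.5            # 4-bit quantization
TOKENS_PER_SEC = 100

experts, params_per_expert = 6, 5e9

resident_gb = experts * params_per_expert * BYTES_PER_PARAM / 1e9   # all experts in memory
per_token_gb = params_per_expert * BYTES_PER_PARAM / 1e9            # one expert streamed
bandwidth_gb_s = per_token_gb * TOKENS_PER_SEC

print(f"~{resident_gb:.0f} GB resident, ~{bandwidth_gb_s:.0f} GB/s of bandwidth")
# ~15 GB resident, ~250 GB/s of bandwidth
```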
Another approach is speculative decoding – a process in which a small, lightweight draft model generates tokens that are then handed off to a larger model, which checks them and corrects any inaccuracies. AMD told us this approach delivers sizable improvements in performance – however, it doesn’t do much to address the fact that LLMs need a lot of memory capacity.
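To make the draft-then-verify flow concrete, here’s a toy Python sketch of the loop. The two “models” are stand-in functions, and the acceptance rule is a simplified exact match rather than the rejection-sampling scheme real implementations use – it illustrates the general technique, not AMD’s software.

```python
# Toy illustration of speculative decoding: a cheap draft model proposes a
# few tokens, the expensive target model verifies them, and only the longest
# agreeing prefix (plus the target's correction) is kept.

def draft_next(tokens):
    # Stand-in for the small, fast draft model.
    return (tokens[-1] + 1) % 100

def target_next(tokens):
    # Stand-in for the large, accurate target model; disagrees occasionally.
    step = 1 if tokens[-1] % 7 else 2
    return (tokens[-1] + step) % 100

def speculative_step(tokens, k=4):
    # 1. Draft k tokens cheaply.
    ctx, draft = list(tokens), []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify with the target model. In a real system all k draft positions
    #    are scored in one batched forward pass, which is where the speed-up
    #    over token-by-token generation comes from.
    out = list(tokens)
    ctx = list(tokens)
    for t in draft:
        expected = target_next(ctx)
        out.append(t if t == expected else expected)
        ctx.append(out[-1])
        if t != expected:                 # stop at the first disagreement
            break
    return out

seq = [1]
for _ in range(6):
    seq = speculative_step(seq)
print(seq)
```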
Most models today are trained in brain float 16 (BF16) or FP16 data types, which consume two bytes per parameter. That means a 30 billion parameter model needs 60GB of memory to run at native precision.
But since that’s probably not going to be practical for the vast majority of users, it’s not uncommon for models to be quantized to 8-bit or 4-bit precision. This trades away some accuracy and increases the likelihood of hallucinations, but cuts the memory footprint to a half or a quarter of its native size. As we understand it, this is how AMD is getting a seven billion parameter model running at around 20 tokens per second.
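The weights-only arithmetic looks like this – activations and the KV cache come on top:

```python
# Weights-only memory footprint of a 30-billion-parameter model at the
# precisions discussed above.
params = 30e9
for name, bits in (("FP16/BF16", 16), ("Int8", 8), ("4-bit", 4)):
    print(f"{name:>9}: {params * bits / 8 / 1e9:.0f} GB")
# FP16/BF16: 60 GB
#      Int8: 30 GB
#     4-bit: 15 GB
```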
New forms of acceleration can help
As a sort of compromise, beginning with Strix Point, the XDNA 2 NPU supports the Block FP16 datatype. Despite its name, it only requires nine bits per parameter – it manages this by having groups of eight floating point values share a single exponent. According to AMD, the format achieves accuracy nearly indistinguishable from native FP16, while consuming only slightly more space than Int8.
More importantly, we’re told the format doesn’t require models to be retrained to take advantage of it – existing BF16 and FP16 models will work without a separate quantization step.
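Our reading of where the nine bits come from – eight values sharing one 8-bit exponent, plus eight bits of sign and mantissa per value – is an interpretation of AMD’s description rather than a published spec, but the arithmetic is simple enough:

```python
# Why Block FP16 works out to roughly nine bits per parameter, assuming
# blocks of eight values that share a single 8-bit exponent, with 8 bits
# of sign and mantissa per value (our interpretation, not a spec dump).
block_size, shared_exponent_bits, per_value_bits = 8, 8, 8

bits_per_param = per_value_bits + shared_exponent_bits / block_size
print(bits_per_param)                                                   # 9.0

params = 30e9
print(f"~{params * bits_per_param / 8 / 1e9:.1f} GB for a 30B model")   # ~33.8 GB
```

Pile the operating system, applications, and the model’s KV cache on top of those roughly 34GB of weights, and the 48GB figure below starts to look less arbitrary.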
But unless the average notebook starts shipping with 48GB or more of memory, AMD will still need to find better ways to shrink the model’s footprint.
While not explicitly mentioned, it’s not hard to imagine future NPUs and/or integrated graphics from AMD adding support for smaller block floating point formats [PDF] like MXFP6 or MXFP4. To this end, we already know that AMD’s CDNA datacenter GPUs support FP8 and CDNA 4 will support FP4.
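Assuming the OCP microscaling (MX) spec’s 32-element blocks with an 8-bit shared scale, the footprint math for a 30 billion parameter model works out as follows:

```python
# Weights-only footprint of a 30B model in the OCP microscaling formats,
# which amortize an 8-bit shared scale over 32-element blocks.
params, block_size, scale_bits = 30e9, 32, 8
for name, elem_bits in (("MXFP6", 6), ("MXFP4", 4)):
    bits_per_param = elem_bits + scale_bits / block_size
    print(f"{name}: ~{params * bits_per_param / 8 / 1e9:.1f} GB")
# MXFP6: ~23.4 GB
# MXFP4: ~15.9 GB
```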
In any case, it seems that PC hardware is going to change dramatically over the next few years as AI escapes the cloud and takes up residence on your devices. ®