AMD spills the beans on Zen 5’s 16% IPC gains
With the first Zen 5 CPUs and SoCs set to ship later this month, AMD offered a closer look at the architectural improvements underpinning the platform’s 16 percent uplift in instructions per clock (IPC) during its Tech Day event in LA last week.
Announced at Computex in June, the House of Zen’s 9000-series follows the mold of past Ryzen desktop chips, offering a choice of six, eight, 12, or 16 cores and up to 64MB of L3 cache on the top SKUs.
These same cores are at the heart of Ryzen AI 300 – AMD’s answer to Qualcomm’s X-series chips for AI PCs. Codenamed Strix Point, the notebook SoC boasts 12 cores – four Zen 5 and eight Zen 5c – along with a 50 TOPS NPU based on the chip shop’s XDNA 2 architecture.
But while core count, cache, and power all play a role in processor performance, a big chunk of AMD’s gains are down to architectural tweaks to the Zen 5 core itself. Combined with a node shrink to TSMC’s 4nm process tech, these low-level changes contribute anywhere from a 10 to 35 percent lift in performance – in AMD’s internal benchmarks, anyway.
AMD claims its Zen 5 cores deliver between 10 and 35 percent higher instructions per clock over the prior generation
Beefing up Zen
The biggest improvements to the Zen 5 core were made to its front end, and account for roughly 39 percent of the claimed IPC uplift, according to AMD CTO Mark Papermaster.
Specifically, AMD has widened the front end to allow for more branch predictions per cycle – a major contributor to performance on modern CPU cores – and implemented a dual-decode pipeline along with i-cache and op-cache improvements to curb latency and boost bandwidth.
AMD widened the Zen 5 core’s front end and execution engine and boosted back-end bandwidth to lift IPC
This wider front end is paired with a larger integer execution engine that can now dispatch and retire up to eight instructions per cycle, compared to six on Zen 4. AMD also boosted the number of arithmetic logic units (ALUs) from four to six and moved to a more unified scheduler to make execution more efficient.
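To get a feel for why extra ALUs and wider dispatch matter, consider a simple reduction written two ways – a hypothetical C sketch for illustration, not AMD’s code. The first version chains every add behind the previous one; the second exposes four independent accumulators that a wide out-of-order core can run side by side:

    #include <stddef.h>

    /* One long dependency chain: each add must wait for the previous
     * result, so extra ALUs sit idle no matter how wide the core is. */
    long sum_serial(const long *v, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    /* Four independent accumulators: a core that can dispatch and retire
     * several ALU ops per cycle can run all four chains in parallel. */
    long sum_parallel(const long *v, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < n; i++)  /* leftover elements */
            s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

In practice a compiler will often unroll the loop this way itself; the point is that wider dispatch, retire, and ALU resources only pay off when independent instructions are available to fill them.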
To absorb the cost of the occasional misprediction and keep that wider pipeline fed, AMD expanded Zen 5’s execution window by about 40 percent.
“What this does is it’s going to bring new levels of performance because it’s married with those frontend advancements … it allows us to consume those instructions and take advantage of the improved predictions coming at us through the pipeline,” Papermaster explained.
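To see why branch prediction carries so much weight, consider this self-contained, hypothetical C benchmark – an illustration, not anything from AMD. The same loop runs over random and then sorted data; on most out-of-order cores the sorted pass is markedly faster simply because the branch becomes predictable (though note an optimizing compiler may emit a branchless conditional move, flattening the difference):

    /* branch_demo.c - build with: cc -O2 branch_demo.c -o branch_demo */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)

    static long sum_if_large(const int *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (v[i] >= 128)   /* data-dependent branch */
                sum += v[i];
        }
        return sum;
    }

    static int cmp_int(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        int *v = malloc(N * sizeof *v);
        for (size_t i = 0; i < N; i++) v[i] = rand() % 256;

        clock_t t0 = clock();
        long a = sum_if_large(v, N);      /* random data: frequent mispredicts */
        clock_t t1 = clock();
        qsort(v, N, sizeof *v, cmp_int);  /* sorted data: branch is predictable */
        clock_t t2 = clock();
        long b = sum_if_large(v, N);
        clock_t t3 = clock();

        printf("unsorted: %ld (%.0f ms)  sorted: %ld (%.0f ms)\n",
               a, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
               b, (t3 - t2) * 1000.0 / CLOCKS_PER_SEC);
        free(v);
        return 0;
    }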
A further 27 percent of Zen 5’s IPC gains can be attributed to increased data bandwidth on the back end. Compared to last gen, AMD has boosted the L1 data cache from 32KB to 48KB and doubled the maximum bandwidth to the L1 and the floating point unit – more on that last one in a bit.
Here’s a rough breakdown of the architectural improvements which contributed to Zen 5’s IPC uplift
The key takeaway is that AMD hasn’t just juiced the branch predictor or execution engine; it has attempted to balance each element of the core to avoid bottlenecks and added latency. The result is a core that can chew through more instructions per cycle than prior generations.
Zen 5 revamps AVX-512 implementation
The biggest IPC gains were seen in workloads using AVX-512 vector extensions, AMD’s implementation of which has been reworked this generation to feature a full 512-bit data path, as opposed to the “double-pumped” 256-bit approach we saw in Zen 4 in 2022.
The one slight exception to all of this is in mobile chips like Strix Point, where AMD chose to stick with a double-pumped AVX-512 implementation – likely to optimize for performance per watt and thermal constraints.
While Papermaster claims Zen 5 can now run full 512-bit AVX workloads without frequency penalties, these instructions historically have run very hot. This isn’t as big a deal on the desktop or in workstations, but is less than ideal for notebooks with limited thermal headroom.
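To make that width concrete, here’s what a single 512-bit operation looks like from software. This is a minimal, hypothetical kernel – not AMD code – assuming GCC or Clang with -mavx512f and, for brevity, an n that’s a multiple of 16. Each _mm512_fmadd_ps is one instruction operating on 16 floats; Zen 4 split such ops into two 256-bit passes internally, while Zen 5’s desktop cores execute them across a full 512-bit data path:

    /* saxpy512.c - build with: cc -O2 -mavx512f -c saxpy512.c */
    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] = a * x[i] + y[i]; n assumed to be a multiple of 16 */
    void saxpy512(float a, const float *x, float *y, size_t n) {
        __m512 va = _mm512_set1_ps(a);              /* broadcast a to 16 lanes */
        for (size_t i = 0; i < n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);     /* 64-byte unaligned load */
            __m512 vy = _mm512_loadu_ps(y + i);
            vy = _mm512_fmadd_ps(va, vx, vy);       /* one 512-bit FMA */
            _mm512_storeu_ps(y + i, vy);
        }
    }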
Unsurprisingly, Papermaster was quick to highlight the vector extensions’ potential to accelerate AI workloads running on the CPU – in machine learning, AMD claims a 32 percent increase in single-core performance over Zen 4. Particularly with its mobile chips, AMD has emphasized running machine learning across every compute domain – not just on the integrated GPU or NPU.
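Much of that CPU-side ML story comes down to instructions like VPDPBUSD from the AVX512-VNNI extension, which Zen 4 already supported and Zen 5 carries forward. Here’s a minimal, hypothetical int8 dot-product kernel of the sort such instructions accelerate – again a sketch, not AMD’s code – assuming GCC or Clang with -mavx512f -mavx512vnni and an n that’s a multiple of 64:

    /* dot8.c - build with: cc -O2 -mavx512f -mavx512vnni -c dot8.c */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Dot product of n unsigned-by-signed int8 pairs */
    int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
        __m512i acc = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 64) {
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            /* 64 u8*s8 multiplies, summed in groups of four
             * into 16 32-bit accumulator lanes - one instruction */
            acc = _mm512_dpbusd_epi32(acc, va, vb);
        }
        return _mm512_reduce_add_epi32(acc);  /* horizontal sum of 16 lanes */
    }

Each VPDPBUSD performs 64 byte-wide multiplies and accumulates them in a single instruction, which is why quantized int8 inference maps so neatly onto these vector units.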
Throughout AMD’s Tech Day disclosures, it was clear that the Zen 5 and compact Zen 5c cores remain functionally identical – the latter trading clock speed for die area, as the name suggests.
More to come
The first Zen 5 cores are set to hit the market on July 31, but we’ll have to wait a little longer for them to arrive in the datacenter.
There’s a lot we still don’t know about AMD’s Turin generation of Epycs. However, at Computex, we did find out that rumors of another core count bump were true.
With 5th-gen Epyc, AMD is set to juice core counts by 50 percent over 4th-gen parts. The Zen 5c part – the spiritual successor to Bergamo – is expected to use TSMC’s 3nm node and will pack 192 cores and 384 threads, up 50 percent from Bergamo’s 128. Meanwhile, the frequency-optimized Turin parts look like they’ll top out at 128 cores and 256 threads.
Curiously, AMD doesn’t appear to be differentiating Turin from what we’re calling “Turin-c” in its marketing. This isn’t all that surprising, as the only difference between the two – at least at the core level – comes down to the frequency-voltage curve. The smaller Zen 5c cores trade lower frequencies for higher density, but are otherwise identical feature-wise.
We expect there may be a few more surprises in store for the Turin launch, which is due sometime in the second half of the year.
Competition heats up
Zen 5 comes at a time when AMD is facing its stiffest competition in years, as Qualcomm arrives on the scene with a potent Arm-compatible notebook chip, and Intel readies a slew of revamped CPUs across its Xeon and Core product families.
Within the client space, Qualcomm’s 45-TOPS NPU has given it an early lead in Microsoft’s Copilot+ AI PC push. AMD’s Strix Point looks to remedy this, but will have to contend with Intel’s recently disclosed Lunar Lake SoCs, which are due out later in Q3.
It’s a similar story in the datacenter, where things have become particularly interesting with the launch of Intel’s 144-core Sierra Forest and impending 128-core Granite Rapids Xeon 6 platforms. In addition to an architectural overhaul and shift to a new chiplet architecture, these chips also make the move to the Intel 3 process node.
At the same time, more cloud providers than ever are leaning on custom Arm-based silicon for their hyperscale workloads. AWS’s Graviton is now in its fourth generation and generally available, while Microsoft and Google have both begun deploying their own Arm cores.
Whether or not AMD’s IPC gains and higher core counts in the datacenter will help it win share in this competitive arena, we’ll have to wait and see. In any case, we’re told work on Zen 6 and Zen 6c is already underway – when we’ll see it, your guess is as good as ours. ®