CPU Category
AMD EPYC 9000 Series (Genoa & Turin)

Summary Table – AMD EPYC Genoa vs. Turin

| CPU Model | AMD EPYC 9654 (Genoa, 4th Gen) | AMD EPYC 9965 (Turin Dense, 5th Gen) |
|---|---|---|
| Manufacturer | AMD | AMD |
| Microarchitecture | Zen 4 (x86-64) (Epyc - Wikipedia) | Zen 5c (compact Zen 5 cores, x86-64) (Epyc - Wikipedia) |
| Process node (CCD / IOD) | TSMC 5 nm CCDs + 6 nm I/O die (Epyc - Wikipedia) | TSMC 3 nm CCDs + 6 nm I/O die (Zen 5 EPYC "Turin" announcement) |
| Max cores / threads | 96 cores / 192 threads (Epyc - Wikipedia) | 192 cores / 384 threads (Epyc - Wikipedia) |
| Base clock | 2.40 GHz (TechPowerUp) | — |
| Max boost clock | 3.70 GHz (TechPowerUp) | — |
| TDP (Thermal Design Power) | 360 W (TechPowerUp) | Up to 400 W on select models (Zen 5 EPYC "Turin" announcement) |
| Architecture & core design | Out-of-order, 4-wide decode, 320-entry ROB; 8-core CCD chiplets with 32 MB L3 each (Epyc - Wikipedia) (Tom's Hardware). No efficiency cores (all identical Zen 4 cores). | 16-core Zen 5c CCD chiplets; same ISA and features as full Zen 5, trading frequency and per-core cache for density (see Architecture Deep Dive). |
| Cache (per core) | L1: 32 KB I-cache + 32 KB D-cache (8-way); L2: 1 MB (8-way, ~14-cycle latency) (Tom's Hardware) (AMD's Zen 4, Part 2: Memory Subsystem and Conclusion) | — |
| Cache (shared) | L3: 32 MB per CCD (≈384 MB total in the 96-core part) (TechPowerUp), ~40–50 ns local latency (Tom's Hardware) | L3: 32 MB per CCD, shared by 16 Zen 5c cores (see Architecture Deep Dive) |
| Supported ISA extensions | AVX2; AVX-512 (512-bit SIMD via dual 256-bit ops) (Tom's Hardware); AVX512-VNNI (INT8 dot product) and AVX512-BF16 (Details on the Gigabyte Leak - by Chester Lam); SHA, AES, SSE4.2, FMA, BMI1/2, etc. No AMX matrix engine. No AVX-512 FP16 – 16-bit floats are handled via BF16 or FP32 (WikiChip). | Same as Genoa: full AVX-512 instruction set except FP16 (5th Gen AMD EPYC Processor Architecture). Zen 5 adds a true 512-bit FPU (no double-pumping) for 2× throughput on AVX-512 ops. Also supports VNNI and BF16. No AMX. |
| Memory support | 12-channel DDR5-4800 (up to DDR5-5200 at 1 DPC) (AMD EPYC™ 9004 Series Architecture Overview) (TechPowerUp); up to 6 TB per socket; ECC supported (mission-critical reliability). | 12-channel DDR5-6000 (5th Gen AMD EPYC Processor Architecture); up to 6 TB per socket; ECC supported. |
| I/O & connectivity | 128× PCIe 5.0 lanes + 64 CXL 1.1 lanes (AMD EPYC™ 9004 Series Architecture Overview) (ServeTheHome); Socket SP5 (LGA 6096), 1P or 2P configurations (Epyc - Wikipedia). | 128× PCIe 5.0 lanes + 64 CXL 2.0 lanes; Socket SP5, drop-in compatible with Genoa systems (Epyc - Wikipedia) (Zen 5 EPYC "Turin" announcement). |

Table Notes: Both Genoa and Turin chips are built on AMD's chiplet architecture – multiple Core Complex Die (CCD) chiplets on TSMC advanced nodes, connected via AMD Infinity Fabric to a central I/O die (IOD). Genoa uses 5 nm CCDs (8 Zen 4 cores each) and a 6 nm IOD (Epyc - Wikipedia), while Turin Dense uses 3 nm CCDs (16 Zen 5c cores each) with a similar 6 nm IOD (Zen 5 EPYC "Turin" announcement). The high core counts (up to 96 and 192 cores) leverage SMT2 (simultaneous multi-threading) to double the thread count. Both generations support the latest memory and I/O standards (DDR5, PCIe Gen5, CXL) for extreme bandwidth. Importantly, AMD introduced AVX-512, VNNI, and BF16 support in Zen 4 to accelerate AI workloads (Tom's Hardware), and Zen 5 extends this with a wider execution engine (5th Gen AMD EPYC Processor Architecture). None of these CPUs have specialized matrix units (Intel's AMX) – AI acceleration relies on SIMD (vector) units and high core counts. Thermal design power ranges from ~360 W for high-core SKUs up to 400–500 W for frequency-optimized and top Turin models, necessitating robust cooling in sustained AI inference scenarios (TechPowerUp) (AMD EPYC 'Turin' 9005 Series benchmark).


Architecture Deep Dive

Zen 4 (Genoa) Architecture: AMD's Genoa (EPYC 9004) processors are based on the "Zen 4" core microarchitecture, fabricated on TSMC 5 nm for the CPU chiplets (Epyc - Wikipedia). Each Zen 4 core is a wide out-of-order design with significant generational improvements over Zen 3. The front-end can fetch and decode up to 4 instructions per cycle (like Zen 3), but AMD improved branch prediction and instruction supply – e.g. a larger micro-op cache (1.5× larger) and prediction of two branches per cycle – which contributes to ~60% of Zen 4's IPC gain (Tom's Hardware). The out-of-order window was enlarged: the reorder buffer (ROB) grew from 256 to 320 entries (a 25% increase), and the integer and FP register files also grew (224 integer registers, 192 FP registers) (WikiChip). This allows Zen 4 cores to track more in-flight instructions and memory misses, a boon for latency-hiding in memory-heavy workloads. Each core has 10 execution ports feeding a mix of ALUs, AGUs, and SIMD units, enabling multiple arithmetic and memory operations in parallel. Notably, Zen 4 doubled the per-core L2 cache to 1 MB (from 512 KB in Zen 3) to reduce L3 trips (Details on the Gigabyte Leak - by Chester Lam) (Tom's Hardware). The larger L2 comes with a modest latency cost (+2 cycles vs Zen 3) and likewise adds ~4 cycles to L3 latency, but AMD reports the higher hit rates largely offset the penalty (Tom's Hardware). The L1 caches remain 32 KB for instructions and 32 KB for data (8-way associative each) (AMD EPYC™ 9004 Series Architecture Overview), with a 4-cycle L1d access latency on Zen 4 – slightly faster than Intel's Golden Cove L1 (5 cycles) (AMD's Zen 4, Part 2: Memory Subsystem and Conclusion). The 1 MB L2 is 8-way set associative with ~14-cycle latency, tuned for faster access than Intel's 15-cycle, 1.25 MB L2 (AMD's Zen 4, Part 2: Memory Subsystem and Conclusion). Overall, Zen 4 achieves higher IPC via better branch prediction and fetch, more out-of-order capacity, and increased per-core cache, which collectively benefit the irregular memory access patterns and deep instruction pipelines of LLM inference.

Chiplet Organization: A hallmark of EPYC CPUs is the chiplet-based modular design. Genoa features up to 12 CCD chiplets (Core Complex Dies), each containing 8 Zen 4 cores plus a shared 32 MB L3 cache slice (Epyc - Wikipedia). These CCDs (fabricated on 5 nm) are arrayed around a central I/O Die (IOD, on 6 nm) on the SP5 package (Epyc - Wikipedia). The IOD integrates the memory controllers (DDR5), PCIe 5.0 and CXL interfaces, and the Infinity Fabric interconnect. All 12 CCDs connect to the IOD via high-speed GMI3 links (one link per CCD) (ServeTheHome), enabling coherency and data sharing across CCDs. The figure below illustrates the EPYC 9004 chiplet layout – 12 Zen 4 CCDs (each with 8 cores and 32 MB L3) surrounding the I/O die (ServeTheHome):

Figure: AMD EPYC 9004 "Genoa" 12-CCD chiplet configuration. Each Zen 4 CCD contains 8 cores sharing a 32 MB L3 slice, connected via GMI3 links to the central I/O die (ServeTheHome). Genoa supports 12 DDR5 channels and 128 PCIe 5.0 lanes through the I/O die (ServeTheHome).

This chiplet approach lets AMD scale core counts (up to 96 cores per socket) while keeping each core's access to a local L3 cache. Cache hierarchy: Within a CCD, 8 cores share a unified 32 MB L3 cache (16-way associative). The total L3 in a 96-core Genoa is 384 MB (TechPowerUp), although it is physically segmented per CCD – a core can access its own CCD's 32 MB with lowest latency (~40–50 ns), but accessing another CCD's data involves inter-CCD fabric latency (~100+ ns, akin to a NUMA hop). For LLM inference, this means threads running on the same CCD benefit from the fast 32 MB cache for holding hot model weights or activations, whereas data sharing across CCDs is slower. AMD also offers a Genoa-X variant with 3D-stacked L3 cache (up to 1,152 MB total L3) (Epyc - Wikipedia), which can greatly benefit cache-bound AI workloads (though for very large models, main memory remains the primary store).
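Because threads that share a CCD also share its 32 MB L3 slice, it can be useful to discover the L3 (CCD) topology before pinning inference threads. A minimal sketch, assuming a Linux system that exposes cache topology through sysfs (paths and index numbers can vary by platform, so treat this as illustrative):

```python
# Sketch: group logical CPUs by shared L3 (i.e. by CCD) on a Linux system.
import glob
from collections import defaultdict

def l3_domains():
    domains = defaultdict(list)
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        cpu = int(cpu_dir.rsplit("cpu", 1)[1])
        try:
            with open(f"{cpu_dir}/cache/index3/shared_cpu_list") as f:
                key = f.read().strip()   # e.g. "0-7,96-103" for one Genoa CCD
        except FileNotFoundError:
            continue                      # no L3 info exposed for this CPU
        domains[key].append(cpu)
    return domains

if __name__ == "__main__":
    for shared, cpus in sorted(l3_domains().items()):
        print(f"L3 domain {shared}: {len(cpus)} logical CPUs")
```

On a 96-core Genoa this should report twelve L3 domains of 8 cores (16 logical CPUs with SMT), which is the granularity at which thread pinning pays off.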

Zen 5 (Turin) Architecture: The 5th Gen EPYC "Turin" (9005 series) advances to the Zen 5 microarchitecture. Zen 5 is described as a "ground-up redesign" of the core, focused on front-end width and efficiency (Tom's Hardware). While detailed specs are still emerging (Turin launched in late 2024), AMD has confirmed key upgrades. The Zen 5 front-end is wider and smarter: improved fetch, decode, and dispatch stages feed a larger execution engine (Tom's Hardware Zen 5 deep dive). In fact, Zen 5 implements dual decode pipelines (two 4-wide clusters working in parallel) for an effective 8-wide decode under certain conditions ('Zen 5' Microarchitecture Explained), along with an enhanced branch predictor (a "2-ahead" predictor that prefetches upcoming instructions while current ones decode) (Zen 5's 2-Ahead Branch Predictor Unit). These changes significantly boost instruction throughput and help keep the backend busy. AMD also doubled the data bandwidth between L1 and L2 caches and between L1 and the FPU (Tom's Hardware Zen 5 deep dive) – a critical change to support the full 512-bit vectors in Zen 5 (discussed below under SIMD). In the execution core, Zen 5 likely increases the out-of-order resources further (exact numbers TBD) and adds more execution units. Importantly for server chips, Zen 5 retains the chiplet design on SP5: Turin CPUs use the same socket and general layout as Genoa (Zen 5 EPYC "Turin" announcement). The standard "Turin" SKUs feature up to 128 Zen 5 cores per socket (Epyc - Wikipedia), presumably with 8 Zen 5 cores and 32 MB L3 per CCD, similar to the Zen 4 CCD but with the enhanced microarchitecture. In addition, AMD introduced Turin Dense variants with up to 192 Zen 5c cores (Epyc - Wikipedia). Zen 5c cores are a compacted version of Zen 5 – they offer the same ISA and features as full Zen 5 but trade some frequency and per-core cache for smaller area, allowing 16 cores per CCD instead of 8 (Epyc - Wikipedia) (Tom's Hardware Zen 5 deep dive). The EPYC 9965 is an example Turin Dense SKU, packing 192 cores via 12 Zen 5c CCDs (16 cores each) in one package. Those 16 cores share a CCD's L3 (likely 32 MB, i.e. 2 MB per core on average – half the per-core L3 of Genoa). This high core density targets throughput-oriented workloads (such as many microservices or many inference threads), whereas the 128-core Zen 5 models offer higher single-thread performance (fewer cores but potentially higher clocks and more cache per core). Both 128C and 192C Turin chips support DDR5-6000 memory (5th Gen AMD EPYC Processor Architecture) and other platform advancements (PCIe 5.0, CXL 2.0, etc.), maintaining full socket compatibility with Genoa systems (Zen 5 EPYC "Turin" announcement).
In summary, Zen5/Turin brings a wider, more powerful core and options for extreme core counts, which can directly benefit LLM inference by either speeding each thread (higher IPC and AVX-512 throughput) or by sheer parallelism (running more threads or model instances concurrently).

Vectorization and SIMD Capabilities

Efficient LLM inference on CPUs leans heavily on SIMD vector instructions for the massive linear algebra operations (matrix multiplies, dot products) in transformers. AMD's EPYC 9000 series makes major strides in this area. Zen 4 (Genoa) introduced AVX-512 support to AMD's CPUs for the first time, greatly expanding 512-bit vector arithmetic capabilities (Tom's Hardware). Notably, Genoa implements AVX-512 with a "double-pumped" 256-bit FPU design: each Zen 4 core's floating-point unit is 256 bits wide (like prior Zen cores), but it executes a 512-bit vector operation over two cycles while handling issue/retire as a single operation (Tom's Hardware). This approach gives full compatibility with the AVX-512 instruction set (except a few Intel-specific subsets) without the heavy clock-speed throttling seen in early Intel implementations (Tom's Hardware). In fact, AMD advertises that Zen 4 can often maintain high clocks during AVX-512 work, avoiding the "dramatic clock drops" that plagued Intel's Skylake-X and Ice Lake when executing wide vectors (Tom's Hardware). The trade-off is lower throughput per clock – effectively half the throughput of a true 512-bit unit – but often higher sustained frequency and lower die area and power cost (Tom's Hardware). According to AMD, this design still yields substantial speedups: e.g. ~30% higher multi-core FP32 performance over Zen 3 from AVX-512 enablement alone (Tom's Hardware).

Supported SIMD Extensions: EPYC Genoa supports a rich set of vector instructions: AVX, AVX2, and a broad swath of AVX-512 extensions including AVX512-F (foundation), AVX512-CD, BW, DQ, VL, IFMA, VBMI/VBMI2, VPOPCNTDQ, BITALG, and the AI-focused AVX512-VNNI and AVX512-BF16 (Details on the Gigabyte Leak - by Chester Lam). In essence, Genoa's AVX-512 feature set is on par with Intel's Ice Lake server CPUs (Details on the Gigabyte Leak - by Chester Lam). The Vector Neural Network Instructions (VNNI) are crucial for INT8 deep learning inference: VNNI accelerates int8 × int8 → int32 dot products by performing multiple multiply-accumulates in one instruction. The BF16 (bfloat16) extension enables efficient 16-bit float operations popular in AI, and AMD's implementation handles bfloat16 data natively in SIMD registers (Details on the Gigabyte Leak - by Chester Lam), whereas previous-gen EPYCs lacked this. (Notably, FP16 half-precision arithmetic instructions are not supported on Zen 4 (WikiChip) – AMD opted for BF16, which has a wider exponent and is preferred for DL training/inference stability.) For LLM inference, BF16 support means models trained or converted to bfloat16 run efficiently on EPYC, and INT8 support means aggressively quantized models (int8 or int4 weights) can leverage VNNI for high throughput.
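Whether these extensions are actually present (and visible to your libraries) can be checked at runtime. A small sketch, assuming a Linux system where /proc/cpuinfo reports the features under these flag names (other operating systems name them differently):

```python
# Sketch: check for the AVX-512 sub-features most relevant to LLM inference.
WANTED = ["avx512f", "avx512_vnni", "avx512_bf16"]

flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for feat in WANTED:
    print(f"{feat:12s} {'yes' if feat in flags else 'no'}")
```

On Genoa and Turin all three should report "yes"; an FP16 flag such as avx512_fp16 should not, consistent with the note above.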

Zen 5 (Turin) SIMD: With Zen 5, AMD went further – the core's data paths were expanded to 512 bits (5th Gen AMD EPYC Processor Architecture). Each Zen 5 core now has true 512-bit-wide vector execution units, doubling SIMD throughput per cycle compared to Zen 4. In other words, Zen 5 can execute a full 512-bit FMA in one cycle (per FPU pipeline) instead of two. AMD correspondingly enlarged the related structures: the vector register file, issue queues, and load/store data paths were widened to handle 512-bit operands in a single cycle (5th Gen AMD EPYC Processor Architecture). They also doubled the L1-to-FPU data bandwidth (as noted earlier) so that moving 512-bit data from the L1 cache to the FPU is not a bottleneck (Tom's Hardware Zen 5 deep dive). This change closes the throughput gap and should significantly boost performance on vector-heavy workloads like matrix multiplication. Where power or thermal limits are a concern, Zen 5 cores have a fallback mode – the system BIOS can instruct them to execute AVX-512 as two 256-bit ops (like Zen 4) for better efficiency per watt (5th Gen AMD EPYC Processor Architecture). At peak, though, Zen 5's SIMD is a generational leap: AMD cites ~2.75× higher LINPACK (HPL) performance for a 2P 192-core Turin versus a 2P 96-core Genoa, attributable largely to the doubled vector width combined with the core-count increase (5th Gen AMD EPYC Processor Architecture).

Like Zen 4, Zen 5 supports the full AVX-512 ISA (again with the exception of the FP16 instructions) (5th Gen AMD EPYC Processor Architecture). Neither has Intel's AMX (Advanced Matrix Extensions) – the fixed-function matrix-multiply engine present in 4th/5th Gen Xeon – so all matrix math is done on the flexible SIMD units. How does this impact AI? Intel's AMX can provide additional speed-up for INT8 and BF16 operations by processing large matrix tiles. AMD lacks this dedicated unit but compensates with sheer core count and, now, much improved per-core vector throughput. In practice, AMD claims its 96-core Genoa can already edge out Intel's 64-core Xeon Platinum 8592+ (5th Gen, with AMX) in certain LLM inferencing: in a Llama 2-7B inference test (batch = 1), a 2-socket EPYC 9654 system delivered ~1.13× the throughput of a 2-socket Xeon 8592+ system (Leadership Natural Language AI Performance brief). The EPYC used 96 threads per socket versus Intel's 64 (matching each chip's core count) and still won on performance and perf/$ (Leadership Natural Language AI Performance brief). This suggests that AVX-512+VNNI on Zen 4, while "only" half-rate, can compete given enough cores and memory bandwidth. With Zen 5's full-rate AVX-512, the balance should tip further in AMD's favor for pure CPU inference on int8 or bfloat16 models. It is reasonable to expect per-core speedups on the order of ~1.8–2.0× for AVX-512-intensive kernels when moving from Zen 4 to Zen 5 (doubling the width rarely doubles real-world throughput exactly, due to memory and other bottlenecks).
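To put the Zen 4 vs Zen 5 vector difference in concrete terms, a back-of-the-envelope per-core peak comparison (a sketch that assumes two FMA pipes per core, the figure commonly cited for these cores, and ignores clock-speed differences):

```python
# Rough per-core peak FP32 throughput, assuming 2 FMA pipes per core.
# Zen 4: a 512-bit FMA is double-pumped over two cycles (effective 256 bits/pipe/cycle).
# Zen 5: a full 512-bit FMA per pipe per cycle.
FMA_PIPES = 2
FP32_PER_512B = 512 // 32          # 16 lanes per 512-bit vector
FLOPS_PER_FMA = 2                  # multiply + add

zen4_flops_per_cycle = FMA_PIPES * (FP32_PER_512B // 2) * FLOPS_PER_FMA   # 32
zen5_flops_per_cycle = FMA_PIPES * FP32_PER_512B * FLOPS_PER_FMA          # 64

print(zen4_flops_per_cycle, zen5_flops_per_cycle)   # 32 vs 64 FP32 FLOPs/cycle/core
```

The 2× ratio mirrors the doubled data path; real kernels land below it once memory traffic enters the picture, which is why the estimate above hedges at ~1.8–2.0×.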

INT8 and BF16 for AI: Both Genoa and Turin support INT8 arithmetic via AVX512-VNNI and 16-bit floats via BF16. These lower-precision types are widely used to accelerate AI inference. AMD has demonstrated huge gains from INT8 on Zen 4: up to 4.2× throughput in NLP inference versus FP32 (Tom's Hardware). This corresponds to techniques like quantizing model weights to int8 – the CPU can then use VNNI to multiply and accumulate 64 int8 pairs in one vector instruction, massively increasing ops/clock. A single Zen 4 core can thus execute 64 int8 MACs per FMA unit per cycle (128 int8 MACs per cycle with both FMA pipes) – across 96 cores, that is a theoretical peak of over 12,000 int8 MACs per cycle. In practice, one user reported that an EPYC 9654 running a 34-billion-parameter model quantized to 4-bit (int4) could utilize ~60 threads efficiently, but adding more threads gave no speedup due to memory limits (llama.cpp issue #6434). Still, at peak it achieved high token throughput (exact tokens/sec not stated there, but elsewhere a 32-core older EPYC got ~3 tokens/sec on a 65B model (Hacker News)). With int8 quantization, smaller models like 7B or 13B can achieve dozens of tokens per second on a high-end EPYC (and potentially hundreds in aggregate with batching). Bfloat16 is also supported in hardware – useful where slightly higher precision is needed or models are not quantized. BF16 on Zen 4/Zen 5 runs at the full 512-bit vector width, meaning 32 BF16 MACs per FMA op (double-pumped on Zen 4, full-speed on Zen 5). Intel's Sapphire Rapids has an advantage for BF16 in that its AMX can do more BF16 ops per cycle (its tile operations process far more elements per instruction), but that advantage may be offset by AMD's core count and frequency.
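Scaling the per-core VNNI figure quoted above to a full socket gives a rough sense of peak integer throughput. A sketch that uses the 128 MACs/cycle/core figure from the text and assumes a ~3.0 GHz all-core clock (an assumption for illustration, not a measured value):

```python
# Rough peak INT8 throughput for a 96-core Genoa socket.
CORES = 96
MACS_PER_CYCLE_PER_CORE = 128      # per-core VNNI figure quoted above
CLOCK_HZ = 3.0e9                   # assumed all-core clock (illustrative)

macs_per_cycle = CORES * MACS_PER_CYCLE_PER_CORE        # 12,288 MACs/cycle
tops = macs_per_cycle * CLOCK_HZ * 2 / 1e12             # 2 ops (mul + add) per MAC

print(f"{macs_per_cycle} int8 MACs/cycle, ~{tops:.0f} INT8 TOPS peak")
```

Real inference lands well below such peaks because token generation is dominated by streaming weights from DRAM, as the memory section below explains.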

In summary, Genoa brought AMD to parity with Intel’s AI ISA (aside from lacking AMX), and Turin doubles down on SIMD throughput. For local LLM inference, this means EPYC CPUs can execute the needed tensor computations (matrix multiplies in attention and feed-forward layers) with high efficiency using int8 or bfloat16 vectors. The keys to performance will then be memory bandwidth and parallelization, which we discuss next.

Memory Subsystem and Bandwidth

Large Language Models are not just compute-intensive; they are memory-intensive. For every generated token, a single LLM instance must stream essentially all of its billions of parameters from memory. Thus, the memory architecture of the CPU is critical. AMD designed EPYC 9000 with exceptional memory bandwidth and capacity to feed the many cores.

Memory Channels and Bandwidth: Genoa supports 12 channels of DDR5 memory per socket (AMD EPYC™ 9004 Series Architecture Overview), a significant increase over the 8 channels of previous-gen Milan or Intel Xeons. Each channel officially runs at up to DDR5-4800 (AMD EPYC™ 9004 Series Architecture Overview), and in 1 DPC configurations up to DDR5-5200 MT/s is achievable (Epyc - Wikipedia). At 4800 MT/s, each channel provides ~38.4 GB/s, so twelve channels give a theoretical ~460 GB/s of bandwidth per socket. In real-world STREAM benchmarks, a single Genoa socket achieves ~350+ GB/s sustained (Hacker News) – an enormous figure, several times the bandwidth of a typical dual-channel desktop and roughly 2× a last-gen 8-channel EPYC with DDR4 (Hacker News). This bandwidth is invaluable for LLM inference, which tends to be memory-bound once model size grows. One Hacker News user noted that on a 65B model, adding threads beyond a certain point gave no speedup because the dual-channel DDR4 in their system maxed out at ~60 GB/s (Hacker News). By contrast, a Genoa with 12-channel DDR5 offers an order of magnitude more memory throughput, allowing many more cores to be kept busy. Indeed, AMD specifically touts the 12 DDR5 memory controllers as keeping the CPU–memory balance in check for data-intensive workloads (Epyc - Wikipedia) (ServeTheHome).
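Since batch-1 token generation must read roughly the whole (quantized) weight set per token, sustained bandwidth directly caps tokens/sec. A back-of-the-envelope sketch; the 350 GB/s figure is the sustained STREAM result cited above, and the model sizes are illustrative:

```python
# Upper-bound tokens/sec for batch-1 generation if every token must stream
# the full weight set from DRAM (ignores caches, KV-cache reads, and compute).
SUSTAINED_BW_GBS = 350            # ~measured STREAM bandwidth for one Genoa socket

def max_tokens_per_s(params_billion, bytes_per_weight):
    model_gb = params_billion * bytes_per_weight   # 1e9 params * bytes -> GB
    return SUSTAINED_BW_GBS / model_gb

for name, params, bpw in [("7B int4", 7, 0.5), ("13B int8", 13, 1.0), ("70B int4", 70, 0.5)]:
    print(f"{name}: <= {max_tokens_per_s(params, bpw):.1f} tokens/s (bandwidth bound)")
```

These ceilings (roughly 100, 27, and 10 tokens/s for the three examples) line up with the observation that core count stops helping once memory bandwidth is saturated.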

Turin (5th Gen EPYC) retains 12 channels and further allows higher speeds: officially up to DDR5-6000 MT/s (5th Gen AMD EPYC Processor Architecture) (some sources suggested 6400, but AMD’s docs state 6000). This would boost peak bandwidth to ~576 GB/s per socket. The memory controllers also support various interleaving options (2, 4, 6, … 12 channels) to optimize for different memory configurations (5th Gen AMD EPYC Processor Architecture). In terms of capacity, both Genoa and Turin can address up to 6 TB of memory per socket (AMD EPYC™ 9004 Series Architecture Overview) (5th Gen AMD EPYC Processor Architecture) (e.g. 24× 256 GB 3DS RDIMMs in a 2DPC config). That means a 2-socket server could have 12 TB of RAM, enough to hold extremely large models (like GPT-3 175B, ~350 GB in FP16, could fit in memory multiple times over if quantized). ECC support is standard – all server DIMMs are ECC, and EPYC’s caches and internal buffers also have ECC/parity. This is important for LLMs because of the huge memory footprint and long runtimes – ECC helps prevent rare bit-flip errors from corrupting the model’s weights or activations (AMD EPYC 9654 Specs | TechPowerUp CPU Database), ensuring reliable inference even for mission-critical deployments.

Cache Coherency and NUMA: In multi-socket (2P) servers, AMD uses its Infinity Fabric to maintain coherency between sockets. Genoa and Turin support 2P configurations (with up to 160 PCIe lanes total in 2P) (Leadership Natural Language AI Performance brief). However, a local inference setup will often be single-socket to maximize memory bandwidth to the cores without cross-socket hops. If two sockets are used, memory is NUMA-separated – each socket's 12 channels primarily feed its own cores, and remote memory access across sockets has higher latency. For LLMs, it is generally optimal to ensure each model instance's working set is allocated in local NUMA memory. Frameworks such as PyTorch's CPU backend or ONNX Runtime can be run NUMA-aware, or threads can be pinned so they prefer local memory. The latency to main memory on these CPUs is on the order of ~75–100 ns (depending on whether data is in local DRAM, comes via another CCD's cache, or from the other socket). Zen 4's local-DRAM latency was measured at ~82 ns (with 2 MB pages) (AMD's Zen 4, Part 2: Memory Subsystem and Conclusion), slightly better than Intel's 4th Gen Xeon at ~90+ ns due to its mesh and off-tile L3 design (AMD's Zen 4, Part 2: Memory Subsystem and Conclusion). Zen 5's latency is expected to be similar or slightly higher, not a dramatic change.
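One practical way to keep each inference process on local memory is to launch it under numactl. A minimal sketch, assuming numactl is installed and that the hypothetical run_inference.py script (and its --port flag) is your own CPU inference entry point:

```python
# Launch one inference worker per NUMA node, binding both CPUs and memory
# locally. "run_inference.py" is a placeholder for your own entry point.
import subprocess

NUM_NODES = 2   # e.g. a 2-socket system (or more nodes if an NPS>1 mode is configured)

procs = []
for node in range(NUM_NODES):
    cmd = [
        "numactl",
        f"--cpunodebind={node}",   # run only on this node's cores
        f"--membind={node}",       # allocate only from this node's DRAM
        "python", "run_inference.py", "--port", str(8000 + node),
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```

Running one worker per node trades single-query latency for aggregate throughput, which matches the multi-instance strategy discussed below.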

In essence, EPYC's large unified memory and bandwidth enable hosting large models entirely in RAM and streaming the model weights to the cores at high speed. However, memory bandwidth can still be a bottleneck for LLMs: as one user observed, beyond ~60 threads on a 96-core EPYC 9654, adding more threads did not increase tokens/sec because the model (34B, 4-bit quantized) was already saturating memory throughput (llama.cpp issue #6434). Another user with an older 32-core EPYC got ~3 tokens/sec on a 65B Q4 model, noting that more cores did not help due to memory limits (Hacker News). This suggests that for a single large model, scaling will not be linear up to all 96 or 192 cores – after a certain point, the memory subsystem is the limiter. In those cases, it can be beneficial to run multiple inference tasks in parallel (e.g. generate two sequences concurrently) to utilize idle cores, accepting higher latency per query in exchange for higher overall throughput (Hacker News). EPYC's huge memory bandwidth gives it more headroom here than CPUs with fewer channels. Additionally, the very large caches (1 MB L2 per core, 32 MB L3 per CCD) help to a degree by caching recently used layers or frequently reused weights – models revisit the same weights for every token (especially in the attention layers). If those weights (or a quantized subset) fit in L3, cores can pull them from cache at under ~10 ns instead of main memory at ~80 ns, a big speedup. AMD's 3D V-Cache models (e.g. the 9684X Genoa-X with 1.15 GB of L3) could cache substantial portions of smaller LLMs, further reducing memory bottlenecks. For mainstream Genoa/Turin, the 32 MB per CCD gives a 96-core CPU ~384 MB of aggregate L3 – not enough to hold even a 4-bit 7B model (~28 GB at FP32, ~3.5 GB at 4-bit), and physically split across CCDs – but the caches will still absorb a meaningful portion of the memory traffic.

ECC and Reliability: EPYC’s focus on reliability (ECC on caches, memory, parity on internal arrays) (Details on the Gigabyte Leak - by Chester Lam) means it’s well-suited for long-running AI inference jobs where accuracy is paramount. A single-bit error in a model parameter could alter outputs unpredictably; ECC memory catches and corrects such errors (AMD EPYC 9654 Specs | TechPowerUp CPU Database). This is a subtle but important advantage of server CPUs (and why one might choose EPYC for hosting a multi-day inference job on a large model, rather than a consumer CPU without ECC).

In summary, AMD EPYC provides an unparalleled memory subsystem among CPUs: Genoa and Turin offer the highest memory bandwidth per socket in the industry (12× DDR5) and massive memory capacity. This is a key enabler for local LLM inference, as it reduces the latency and delay in supplying the model data to the compute engines. The architecture is designed to keep many cores fed with data, which is exactly what large-model inference demands.

Performance Benchmarks for AI Workloads

Measuring LLM inference performance on CPUs involves metrics like throughput (tokens generated per second) and latency (time to first token, and per-token time thereafter). The AMD EPYC 9000 series, with its high core counts and advanced SIMD, excels at throughput for multi-threaded inference, though first-token latency for very large models will still be on the order of seconds (billions of weight multiplications must be performed). Here we compile performance data and benchmarks relevant to LLMs and similar AI models:

In terms of latency, a single token on a very large model (e.g. a 175B GPT-3-class model) on CPU can take several seconds. GPUs typically outclass CPUs in raw latency on huge models due to their massively parallel compute. For moderate-sized models (≤13B), however, EPYC can deliver sub-second token generation, especially with quantization. For example, an 8B-parameter model in int8 might generate a token in tens of milliseconds on EPYC – so a 50-token response arrives in a few seconds or less. The AMD vs Intel comparison also reported Time-To-First-Token (TTFT), where EPYC was ahead of or on par with Xeon (Leadership Natural Language AI Performance brief). As core counts rise (192C), batching multiple requests can also reduce the average latency per token through parallelism.

To wrap up, EPYC Genoa/Turin CPUs can handle LLM inference with surprisingly high throughput, especially when models are optimized (quantized, using efficient libraries). They hold their own against similarly priced Intel chips – for example, comparing the 64-core EPYC 9555 with the 64-core Xeon 8592+, the EPYC showed ~40% better performance per watt in SPECint_rate, indicating a general efficiency edge (5th Gen AMD EPYC Processor Architecture) that likely carries over to AI tasks. For pure performance, a dual 192-core Turin system (384 cores total) is an absolute brute-force approach to LLM inference, and early testing suggests it is the fastest x86 CPU setup for such workloads (with the caveat of very high power consumption). We discuss power and thermals next.

Thermal and Power Efficiency

Running LLM inference is one of the most demanding workloads for a CPU, often utilizing all cores and wide vectors continuously. This pushes the CPUs toward their TDP limits. AMD EPYC 9000 chips are rated for high TDPs (280–400 W depending on the model, with some frequency-optimized "F" models reaching 400 W) (Zen 5 EPYC "Turin" announcement). The EPYC 9654 (96-core Genoa) has a 360 W TDP (TechPowerUp). In practice, under an all-core AI load (AVX-512, all cores active), it will draw near that limit. One advantage of AMD's AVX-512 approach in Zen 4 is that it avoids dramatic power spikes or thermal runaway. Intel's early AVX-512 implementations notoriously caused chips to downclock heavily to stay within power limits (TechPowerUp Forums). AMD's double-pumped strategy kept power density lower. Indeed, AMD reported that clock rates "won't drop dramatically" and temperatures won't skyrocket under AVX-512 on Genoa (Tom's Hardware), implying more consistent performance and easier cooling. This makes EPYC a stable performer for long inference runs – you won't see frequency zig-zagging as much.

With Zen 5 (Turin), AMD moved to full-width 512-bit execution, which potentially increases power per core under heavy SIMD. To mitigate this, AMD provides the option to revert to 256-bit mode for efficiency (5th Gen AMD EPYC Processor Architecture). The Zen 5 designs have likely also been optimized for power, possibly with finer-grained clock gating on the wider units. AMD aimed to keep similar top-end TDPs for socket compatibility – existing SP5 platforms support up to 400 W, and AMD indicated Turin stays in that envelope (Zen 5 EPYC "Turin" announcement). However, some high-core Turin Dense SKUs (such as a 192-core part at high clocks) may be configured up to ~500 W in certain OEM systems (AMD EPYC 'Turin' 9005 Series benchmark), an immense power draw for a single package. Proper cooling (liquid, or large heatsinks with high airflow) is required in those cases. Data center users often design for under ~300 W per CPU for longevity, but short excursions higher are possible.

Efficiency (Perf/Watt): AMD has been laser-focused on performance per watt, as it is a key metric for servers. Thanks to TSMC 5 nm and architectural gains, Genoa delivered big improvements here – for example, in SPECpower tests, an EPYC 9754 (128-core Bergamo, Zen 4c) had 2.7× the performance per system-watt of an Ampere Altra Max (128-core Arm) (AMD EPYC™ 4th Gen 9004 & 8004 Series Server Processors – Details). Against Intel, AMD cites that a 64-core EPYC 9555 (Zen 5) offers ~1.4× the performance per watt of Intel's top 64-core Xeon 8592+ (5th Gen AMD EPYC Processor Architecture). This means that for a given throughput (say, X tokens/sec of an LLM), AMD's chip likely consumes less power than Intel's. Part of this is core efficiency and part is the larger caches reducing external memory accesses (memory access is relatively power-expensive). Additionally, AMD's higher memory bandwidth can improve efficiency: if cores spend less time stalled, they finish work sooner and can enter idle states earlier.

Thermal considerations: EPYC 9004 packages are physically large (SP5 is a 6096-pin LGA). The heat is spread over multiple chiplets under a large heatspreader, which helps avoid hot-spots. Users running LLM inferences on EPYC have noted it’s critical to have proper cooling – under max load, package temperatures can climb quickly if the cooler is insufficient, potentially leading to thermal throttling (which would slow down inference). Server chassis typically provision for these high TDPs with powerful fans or liquid cooling. In a workstation setting (if someone built a Threadripper Pro or EPYC workstation for AI), one needs a top-tier cooler (many Threadripper air coolers can handle ~280W; for 360W sustained, liquid AIO or custom loop is often needed). AMD’s spec allows junction temperatures up to ~95°C before throttling, similar to their desktop chips, but ideally one keeps it below that for stability.

Interestingly, because LLM inference is mostly vector integer/FP math and memory traffic, it tends to fully utilize both the CPU cores and memory subsystem. This is a “worst-case” for power in some sense, as both the CPU and memory DIMMs draw significant power. EPYC’s ability to maintain performance under these conditions is proven by enterprise deployments. At the same time, if power consumption is a concern (say, running a smaller model or in an environment with limited cooling), one can cap the socket power or clock via BIOS. Running EPYC at, for example, a 240W limit will reduce clocks somewhat but still give a large multi-core throughput, often more efficient (perf/W) at the lower power. The data shows EPYC scales well downwards: e.g., the SPECpower_ssj2008 scores indicate high efficiency even at reduced power states (AMD EPYC™ 4th Gen 9004 & 8004 Series Server Processors – Details).

In summary, EPYC’s thermal/power profile for AI workloads is robust: it delivers strong performance at high power, and better efficiency than competing x86 solutions. It does require serious cooling for sustained inference on big models. But when properly cooled, it can run flat-out on AI tasks without the wild throttling behavior seen in some earlier platforms. This consistency is valuable – it means latency and throughput are predictable over time. For “local inference” in a non-data-center setting, one must ensure the environment (power supply, cooling) can handle possibly 300–400W per CPU. Many enthusiasts running LLMs on desktop CPUs at 125W TDP hit memory bottlenecks long before power limits. EPYC flips that scenario: plenty of memory bandwidth, and power will be the limiter if all cores are crunching.

Comparative Analysis with Other CPUs for LLM Workloads

When evaluating AMD EPYC Genoa/Turin for LLM inference, it’s useful to compare them to alternatives:

  • Intel Xeon Scalable (Sapphire Rapids, Emerald Rapids): Intel's 4th Gen Xeon (Sapphire Rapids, 2023) offers up to 60 cores per socket and introduces the AMX tile accelerator for AI. AMX can provide very high throughput for int8 and bfloat16 matrix operations – a single Sapphire Rapids core can perform on the order of 1024 int8 MACs per cycle using its 1 KB tile registers, versus 128 int8 MACs per cycle on a Zen 4 core with 512-bit VNNI (Tom's Hardware) (5th Gen AMD EPYC Processor Architecture). However, Intel has far fewer cores to deploy, and only 8 memory channels (vs 12). The result: for smaller models or high-batch inference where compute intensity is extremely high (and the working set fits in cache), Intel's AMX may shine. But for large LLMs that are memory-bound, AMD's raw core count and bandwidth often win out. AMD demonstrated this in a head-to-head on a real model (Llama 2-7B): a 2P 96-core EPYC system outperformed a 2P 64-core Xeon 8592+ system (Leadership Natural Language AI Performance brief). Another factor is precision: Intel supports FP16 in AMX, which AMD does not support natively, but most LLM pipelines prefer BF16 or int8 anyway (where both vendors have solutions). Intel's newer 5th Gen Xeon (Emerald Rapids, late 2023) raised core counts slightly (up to 64) and frequency, but kept 8-channel memory – so AMD's 4th/5th Gen EPYC typically offers higher throughput on memory-heavy AI tasks. Intel might retain an edge in single-thread latency on certain small models thanks to strong per-core IPC for scalar code, but LLM inference is vectorized and parallelized by nature.

  • ARM-based CPUs (Ampere, AWS Graviton): Ampere’s Altra and AmpereOne processors provide 128+ Arm cores with strong efficiency, but they lack SIMD width and advanced AI instructions. Ampere Altra Max (128 cores, ARM Neoverse N1) has only NEON 128-bit vectors (no BF16, no int8 acceleration beyond basic SIMD). In a CPU-only LLM inference, Ampere would be significantly slower per core. AMD EPYC 9754 (128 Zen4c cores) was shown to outperform Altra Max by 2.7× in perf/watt (AMD EPYC™ 4th Gen 9004 & 8004 Series Server Processors – Details), and likely similar or greater in raw throughput, especially with AVX-512 and VNNI accelerating int8. Newer ARM server CPUs (like NVIDIA Grace CPU) support FP16 and BF16 in hardware and SVE 512-bit, which could be competitive; however those are not widely available for local use yet. AWS Graviton3 (64-core) has some BF16 support, but again, core count and memory bandwidth are lower than EPYC. For an enterprise or researcher wanting max CPU inference performance, EPYC Genoa/Turin currently holds a lead over ARM options for LLMs (though ARM chips excel in efficiency for other cloud workloads).

  • Apple Silicon (M2/M3): Apple's chips combine a matrix coprocessor in the CPU complex with a dedicated Neural Engine (ANE), plus very high unified-memory bandwidth (up to ~800 GB/s on M2 Ultra). While not directly comparable (you cannot put an M2 Ultra in a server with 384 GB+ of RAM), it is worth noting that an M2 Ultra (24-core CPU) can run smaller LLMs quite efficiently, aided by its 32-core Neural Engine and memory bandwidth. But for local inference of large models, EPYC's ability to scale to hundreds of GB of RAM and many threads is a distinct advantage. Apple's maximum RAM is 192 GB (M2 Ultra), which constrains how large a model can be run at higher precisions. So for truly large models, x86 servers remain the go-to.

  • GPUs: While not CPUs, GPUs are the main alternative for local LLM inference. A single high-end GPU (like NVIDIA A100 or RTX 4090) can outperform EPYC on throughput for medium-to-large models, due to thousands of CUDA cores and Tensor Cores for int8/BF16. For example, two RTX 3090s (48 GB total, ~1 TB/s bandwidth each) were suggested as a better bang-for-buck than an EPYC for a quantized 65B model by one user (That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5ghz cores and 2... | Hacker News). GPUs have specialized matrix units (Tensor Cores) that are extremely fast for dense math. Thus, if the question is absolute performance, a multi-GPU setup will generate tokens faster and at lower latency than a single or dual CPU, once the model fits in VRAM. However, GPUs are limited by memory size – a 175B model might need multiple GPUs in parallel with sharding, which complicates a “local” setup. EPYC, on the other hand, can brute-force run such a model given enough DRAM (just slowly). Moreover, GPU inference typically requires writing GPU-specific code or using frameworks like PyTorch with CUDA, whereas CPU inference can sometimes be simpler to set up (just use the optimized BLAS or libraries, no CUDA dependency). In contexts where using a CPU is acceptable (maybe slower but simpler), EPYC provides a strong platform. It often comes down to availability and cost: one might already have a CPU server and want to use it rather than buying expensive GPUs.

Bottom line: AMD EPYC Genoa and Turin currently offer the highest CPU-side performance for LLM inference overall, especially when models are large and memory-bound. They surpass Intel’s offerings in core count, memory throughput, and often perf/watt, making them very attractive for CPU-based AI inference servers (Leadership Natural Language AI Performance: Outperforming 5th Gen Intel® Xeon® with AMX) (5th Gen AMD EPYC Processor Architecture). While specialized accelerators (GPUs, NPUs, etc.) can beat them in raw speed, EPYC CPUs provide flexibility (no need to rewrite code for GPU), large memory, and strong multi-instance throughput. Many users might combine EPYC CPUs with GPUs (CPU to handle model orchestration and part of inference, GPU to accelerate heavy layers), but even CPU-only, EPYC can handle surprising scales. The choice depends on the specific model and usage: for instance, fine-tuning or running many small models might lean towards CPU, whereas very large batch inference leans GPU. EPYC gives the CPU side a fighting chance by integrating key AI instructions and scaling out resources needed for AI.

Software Optimization and Framework Compatibility

To extract maximum LLM performance from EPYC processors, software optimizations are essential. AMD recognized that “hardware is only one piece of the puzzle; software is crucial for effectively taking advantage of the hardware.” (Enabling Optimal Inference Performance on AMD EPYC™ Processors with the ZenDNN Library — The TensorFlow Blog) They have invested in libraries and framework integrations to help developers utilize AVX-512, VNNI, and the cache/memory system efficiently:

  • AMD ZenDNN Library: AMD provides the Zen Deep Neural Network (ZenDNN) library, an open-source DNN inference library optimized for Zen-based CPUs (TensorFlow Blog). ZenDNN contains highly tuned primitives (convolutions, matrix multiplies, activations, etc.) that use AMD's vector instructions and threading optimally. It is analogous to Intel's oneDNN (MKL-DNN) but tuned specifically for EPYC (accounting for the cache sizes, the dual 256-bit issue on Zen 4, and so on). ZenDNN targets inference in domains including NLP, vision, and recommender systems (TensorFlow Blog) – LLMs would mainly exercise its GEMM (matrix multiply) routines under the hood. AMD has integrated ZenDNN into popular frameworks: for example, a TensorFlow-ZenDNN plugin was released for TensorFlow 2.12+ (TensorFlow Blog). With this plugin installed, TensorFlow automatically uses ZenDNN's optimized ops when running on EPYC CPUs, yielding significant inference speedups (the TensorFlow blog's diagram shows the ZenDNN v4.0 package plugging into TensorFlow). Similarly, PyTorch can use ZenDNN through a plugin (amd/ZenDNN-pytorch) or via oneDNN if compiled to target AMD. AMD has been working to upstream ZenDNN optimizations so that framework binaries include them directly.

  • oneDNN / Intel oneAPI compatibility: Intel’s oneAPI and oneDNN library are hardware-agnostic to a degree. oneDNN (DNNL) will detect the CPU capabilities at runtime. On AMD EPYC, oneDNN sees AVX2, AVX-512, etc., and will utilize them (it may not be fully tuned for AMD’s microarchitecture, but still benefits from wide vectors). Many frameworks (PyTorch, ONNX Runtime, HuggingFace Transformers) rely on oneDNN or similar under the hood for CPU ops. This means out of the box, EPYC can get decent performance on these frameworks, though not as optimal as with ZenDNN (which might, for instance, better schedule threads across NUMA domains or use AMD-specific cache hints). AMD’s strategy includes contributing to open libraries to improve performance on their CPUs without needing separate forks.

  • ONNX Runtime and others: ONNX Runtime (ORT) is often used for deploying optimized transformer models on CPU. It has execution providers that can target different backends; for CPU it uses oneDNN plus its own optimized kernels. Dell's MLPerf Inference v2.1 submission used ONNX Runtime on dual EPYC to run BERT, combined with an optimized model from Deci AI (Dell Technologies Info Hub). This demonstrates that EPYC works well with ORT – developers can export a Hugging Face transformer to ONNX and run it with ORT on EPYC, taking advantage of all threads and SIMD (a minimal session-setup sketch appears at the end of this section). Microsoft has also been adding features such as OpenMP parallelism and AVX-512 kernels in ORT, which benefit EPYC just as they do Intel.

  • Compiler and math libraries: For lower-level control, AMD provides AOCL (AMD Optimizing CPU Libraries), which includes BLIS and other math routines tuned for EPYC. Large matrix multiplications in an LLM can be executed by BLIS (AMD's BLAS implementation), which is optimized for the Zen architecture (taking into account details like the 1 MB L2 and 32 MB L3 per CCD when tiling matrices). If you are writing custom inference code (e.g. with a library like ggml), linking against AOCL's BLAS can accelerate the linear algebra; Intel MKL also runs on AMD, though it has historically dispatched slower code paths on non-Intel CPUs, and AOCL is a competitive alternative. AMD's performance brief additionally cites "TPP v0.0.1" (Tensor Processing Primitives – not related to Intel AMX) and "IPEX 2.3.0" (Intel Extension for PyTorch) in its Llama 2 benchmark (Leadership Natural Language AI Performance brief). This implies AMD used IPEX (which contains transformer-layer optimizations built on oneDNN/VNNI) together with TPP-based kernels, showing that AMD is leveraging existing software stacks – even Intel's – to get the best performance on EPYC.

  • Hugging Face and Python ecosystem: Many users will run LLMs via Hugging Face Transformers, PyTorch, or llama.cpp. These are all compatible with EPYC CPUs. Hugging Face Transformers can use PyTorch or TensorFlow as a backend – with ZenDNN or oneDNN these will use AVX-512. There is also the option of int8 quantization through libraries like Hugging Face's bitsandbytes, PyTorch's dynamic quantization, or ONNX Runtime's quantization tooling; those int8 kernels benefit directly from EPYC's VNNI. For example, an int8-quantized BERT can run significantly faster than FP32 BERT on EPYC (Tom's Hardware). The key is enabling those code paths: PyTorch's oneDNN (mkldnn) CPU backend is active by default in official builds, and matmul precision can be relaxed with torch.set_float32_matmul_precision() where appropriate (see the combined sketch after this list).

  • Multi-threading and parallelism: To use all 96 or 192 threads, the software must be well parallelized. Libraries like oneDNN and BLIS do parallelize GEMMs across threads. In Python, a transformers pipeline() may use only a few threads unless intra-op parallelism is configured explicitly. Users should set environment variables such as OMP_NUM_THREADS or MKL_NUM_THREADS to match the core count, and pinning threads to cores can further boost performance (avoiding context-switch and migration overhead). EPYC, with its large L3 per CCD, can benefit from pinning 8 threads per CCD (on Genoa) so each CCD's threads primarily use that CCD's L3. These are advanced tunings – frameworks or AMD's tools may handle some of it automatically (ZenDNN likely does smart thread placement) – and the sketch after this list shows the basic knobs.

  • Future software (vector FP16): One gap is FP16. If frameworks try to use FP16 on AMD CPUs, they may either fall back to software emulation or substitute BF16. For instance, PyTorch autocast can choose BF16 on CPU (since BF16 is supported), which usually requires no user intervention. AMD's documentation explicitly states they implement all AVX-512 instructions used in 4th Gen Intel Xeon except the FP16 types (5th Gen AMD EPYC Processor Architecture) – so Intel-targeted software that attempts FP16 could hit an illegal instruction on AMD. Good libraries detect BF16 support and use it instead (for inference, BF16 and FP16 behave similarly in most cases). AMD may also introduce FP8 or block floating-point support via software. An AMD technical deep dive mentioned Block FP16 (a block floating-point format, not to be confused with bfloat16) for its XDNA AI Engine (TechPowerUp Zen 5 deep dive) – that applies to the NPU, not directly to the CPU. On the CPU, int8 remains the most common quantized format.
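The sketch referenced in the last few bullets, pulling the common knobs together. It assumes PyTorch 2.x and Transformers on a Linux EPYC box; the model name and thread counts are placeholders to adapt, and the two low-precision paths are alternatives rather than something to stack:

```python
# Sketch: CPU inference setup on EPYC – thread counts, plus two alternative
# low-precision paths (dynamic int8, or BF16 autocast).
import os
os.environ.setdefault("OMP_NUM_THREADS", "96")   # set before heavy imports

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(96)          # intra-op threads (GEMMs etc.)
torch.set_num_interop_threads(2)   # inter-op parallelism

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).eval()
inputs = tok("The quick brown fox", return_tensors="pt")

# Option A: dynamic int8 quantization of Linear layers (VNNI-backed kernels).
qmodel = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
with torch.inference_mode():
    out = qmodel.generate(**inputs, max_new_tokens=32)

# Option B: keep weights as-is and run the forward pass under BF16 autocast.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=32)

print(tok.decode(out[0], skip_special_tokens=True))
```

Combining this with the NUMA launcher shown earlier (one process per socket or per group of CCDs) is usually how all cores get put to work.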

Overall, the software ecosystem for AI on AMD EPYC is strong and improving. Key frameworks like TensorFlow and PyTorch are being optimized through ZenDNN and other contributions (Enabling Optimal Inference Performance on AMD EPYC™ Processors with the ZenDNN Library — The TensorFlow Blog) (Enabling Optimal Inference Performance on AMD EPYC™ Processors with the ZenDNN Library — The TensorFlow Blog). ONNX Runtime and classical BLAS libraries support EPYC well. One can run popular tools (Transformers, llama.cpp, etc.) on EPYC without issue – often it’s as simple as compiling with appropriate flags. For maximum performance, using AMD’s inference-optimized libraries or enabling oneDNN execution in frameworks will unlock the CPU’s potential. AMD’s collaboration with software (as evidenced by blog posts and releases in 2023) (Enabling Optimal Inference Performance on AMD EPYC™ Processors with the ZenDNN Library — The TensorFlow Blog) means the gap between theoretical hardware capability and realized performance is narrowing.
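For the ONNX Runtime path mentioned above, the main EPYC-relevant settings are the intra-op thread count and, optionally, offline int8 quantization so the VNNI kernels are used. A minimal sketch; "model.onnx" and the input names/shapes are placeholders for your own exported transformer:

```python
# Sketch: running an exported transformer on the CPU execution provider,
# sized for a high-core-count EPYC. "model.onnx" is a placeholder path.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Optional: produce an int8 model offline so VNNI-backed kernels can be used.
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

opts = ort.SessionOptions()
opts.intra_op_num_threads = 96                      # threads per operator (GEMMs)
opts.inter_op_num_threads = 1
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = ort.InferenceSession("model.int8.onnx", opts, providers=["CPUExecutionProvider"])

# Dummy BERT-style inputs; adapt names and shapes to your exported graph.
batch = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}
outputs = sess.run(None, batch)
print([o.shape for o in outputs])
```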

Limitations and Considerations

Despite the impressive capabilities of AMD’s Genoa and Turin CPUs, there are some bottlenecks and limitations to be mindful of when using them for local LLM inference:

  • Memory Bandwidth Saturation: As discussed, large models can saturate even EPYC's memory bandwidth. Adding more cores/threads beyond a point yields diminishing returns (llama.cpp issue #6434). This is inherent to the workload – if each token generation requires streaming tens of GB of weights from DRAM, throughput is capped by memory speed. EPYC mitigates this with 12 channels and huge caches, but it doesn't eliminate it. For a single instance of a 65B model, you might find that, say, 64 cores are fully utilized while the other 32 are starved. In such cases, one must run multiple instances in parallel (or use model parallelism) or accept that not all cores will be 100% busy. It also means performance scaling from a 32-core CPU to a 96-core is not 3× for one large model – it could be considerably less. Users should consider this when sizing systems (e.g., two 64-core CPUs vs one 128-core may not double single-instance speed, but it does allow two instances, etc.).

  • Latency vs Throughput: CPU inference, especially with int8, can achieve high aggregate throughput, but each token still passes through all layers sequentially (unless the model is pipeline-parallelized across cores, which is complex). The latency to generate one token might be, say, 50 ms on CPU versus 5 ms on a GPU, an order of magnitude difference. For interactive use with large models this can be a limitation (generation will feel slower). Techniques like prompt batching or multi-threaded execution within a single layer help somewhat, but GPUs fundamentally excel at low latency due to massive parallelism. EPYC can serve multiple queries in parallel with high throughput, but single-query latency will remain relatively high for very big models.

  • Power and Noise: Running a 360 W CPU at full tilt consumes a lot of electricity and produces a lot of heat. For a "local" (e.g., office or home) setup that can be an issue: a dual-socket system with cooling can easily draw over 700 W, which is not trivial for a standard outlet in some regions. The noise from server-grade coolers and fans ramping up under load is also a factor. So while it is possible to run GPT-J or LLaMA-65B on dual EPYCs in your office, it may not be pleasant without proper infrastructure. This is not a limitation of the CPU's compute capability, but of practical deployment.

  • No specialized tensor cores: EPYC relies on general-purpose cores and vector units. It therefore does not get the 10× or greater speedups that GPUs achieve on certain operations (like dense matrix multiplies). Even with AVX-512, a single Zen 4 core performs far fewer matrix-math operations per cycle than an A100 SM or a TPU core (AMD Zen 5 Execution Engine Leaked, Features True 512-bit FPU | Page 2 | TechPowerUp Forums). AMD chose not to include an AMX-style matrix engine on the CPU die (likely due to area and power trade-offs). The consequence is that purely CPU-based inference will usually be slower and less energy-efficient than running the same model on a GPU accelerator. AMD addresses high-end AI with its Instinct MI300 GPUs instead, leaving EPYC as the host or a moderate performer. Users should set realistic expectations: CPUs are great for smaller models and can handle larger ones in a pinch, but they will not break performance records against GPUs designed for this workload.

  • Parallelization Complexity: To truly maximize EPYC, one may need to get into the weeds: NUMA pinning, libraries like ZenDNN, model quantization, and so on. Out of the box, a naïve implementation may not use all cores effectively; for example, running a PyTorch Transformer without configuring intra-op parallelism may use only a few threads and underutilize the CPU. So there is some tuning overhead. Tools like Hugging Face Accelerate or DeepSpeed can help manage threads on CPU, though they are more often used for training. In inference-serving scenarios, something like TorchServe, or FastAPI with multiple Uvicorn workers, may be needed to utilize all cores (by running several worker processes).

  • Model Size vs Memory Capacity: EPYC can address huge amounts of memory, but cost and practicality limit how much one will have in a "local" environment. 512 GB or 1 TB of RAM is expensive (though not unheard of in high-end workstations or servers). If your model is 2 TB in size (such as some massive mixture-of-experts models), you simply cannot run it on a single EPYC node without paging, which would be impossibly slow; that is where distributed inference or model compression come in. CPUs have the advantage of being able to use cheap NVMe as overflow (virtual memory), but performance drops sharply if the working set does not fit in RAM. In short, EPYC lets you go much larger in model size than any single GPU (which tops out at around 80 GB of HBM or 48 GB of GDDR per card), but you still need sufficient RAM for the model and its runtime data; the KV cache in particular grows with sequence length and batch size (a rough sizing sketch follows this list).

  • Software Maturity on new features: Zen 4/Zen 5 are newer to the market than Intel's long-standing ML optimizations. Some libraries or plugins may still be catching up on fully exploiting AVX-512 VNNI on AMD; for instance, a library with hard-coded Intel code paths might need an update to recognize that AVX-512 is available on AMD as well. The good news is that many have done so by now. But edge cases remain: an Intel-optimized path using AMX has no equivalent on AMD and may fall back to slower code unless patched to use VNNI. AMD is actively working with the ecosystem, so this gap is closing, but it is something to watch. Using the latest versions of inference libraries is important to pick up these optimizations.

  • Hybrid core trade-offs: EPYC does not mix big and little cores within a single server CPU (a given Turin SKU is either all Zen 5 or all Zen 5c; mixed configurations appear only in some client APUs), but the Zen 5c cores in a Turin Dense part have lower single-thread performance than standard Zen 5. If you opt for the 192-core variant, know that each core is somewhat weaker (lower clocks, less cache per core) than a standard Zen 4 or Zen 5 core, so for latency-critical single-thread tasks a lower-core-count, higher-clock EPYC may actually be faster. On mobile chips AMD steers background tasks to the compact cores, but on a server SKU like the 9965 all cores are Zen 5c, so behavior is uniform. It is simply a trade-off: 192 slightly slower cores versus 128 faster ones. For embarrassingly parallel throughput (multiple instances or batch inference), 192 cores win; for a single instance that cannot use that many threads, the extra cores may not help and the lower per-core speed can hurt a bit. The choice of SKU should therefore consider inference concurrency and model size.

  • Cost: Although not a technical limitation, it is worth noting that EPYC 9000-series parts are expensive (the 9654 has an MSRP around $11,000 (AMD EPYC 9654 Specs | TechPowerUp CPU Database), and 5th-gen parts are in that ballpark or higher). Building a local server with one or two of these plus hundreds of GB of RAM is a significant investment. If budget is a constraint, one might opt for a smaller EPYC or a Threadripper Pro (essentially a single-socket EPYC platform with 8-channel memory). For instance, the Threadripper Pro 7995WX (96-core Zen 4, 8-channel DDR5) can be a cheaper alternative, albeit with lower memory bandwidth than a 12-channel EPYC. It still offers a lot of cores and AVX-512, likely at a lower cost per core; it is a trade-off of maximum performance versus price/performance.
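
To make the bandwidth ceiling concrete, the following back-of-the-envelope sketch estimates the best-case token rate for a single batch-1 instance; all numbers are assumptions (a 65B int8 model, roughly 460 GB/s theoretical 12-channel DDR5-4800 bandwidth per socket, and about 70% of that sustained).

```python
# Back-of-the-envelope: memory-bandwidth ceiling on single-instance token generation.
# All numbers are illustrative assumptions, not measurements.

model_params = 65e9          # a 65B-parameter model
bytes_per_w  = 1             # int8 weights (use 2 for BF16/FP16)
weight_bytes = model_params * bytes_per_w

peak_bw        = 460e9       # ~12 channels of DDR5-4800, theoretical bytes/s per socket
sustained_frac = 0.7         # assumed achievable fraction of peak
sustained_bw   = peak_bw * sustained_frac

# With batch size 1 and a model far larger than cache, each generated token
# streams roughly the full weight set from DRAM.
tokens_per_sec = sustained_bw / weight_bytes
print(f"~{tokens_per_sec:.1f} tokens/s upper bound")   # ~5 tokens/s under these assumptions
```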
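
And for rough capacity planning, the sketch below adds up weights and KV cache for a LLaMA-65B-class model; the layer count, head geometry, sequence length, and batch size are illustrative assumptions rather than exact specifications.

```python
# Rough memory sizing: int8 weights plus BF16 KV cache for a LLaMA-65B-class model.
# Architecture numbers below are assumptions for illustration.

n_layers     = 80
n_heads      = 64            # assumes KV heads == attention heads (no GQA)
head_dim     = 128
bytes_per_el = 2             # BF16 KV-cache entries
seq_len      = 4096
batch        = 4

# K and V per layer: batch * seq_len * n_heads * head_dim elements each.
kv_cache_bytes = 2 * n_layers * batch * seq_len * n_heads * head_dim * bytes_per_el
weight_bytes   = 65e9 * 1    # int8 weights

print(f"KV cache: {kv_cache_bytes / 2**30:.1f} GiB")   # ~40 GiB for these assumptions
print(f"Weights : {weight_bytes / 2**30:.1f} GiB")     # ~60 GiB
```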

In conclusion, while AMD EPYC Genoa/Turin CPUs are powerhouses for LLM inference, users should be aware of the memory-bound nature of these workloads, the need for careful threading and quantization to hit peak performance, and the practical considerations of running such high-end hardware. They fill a niche where GPU memory is a limiting factor or CPU flexibility is desired, but they won’t outperform optimized GPU solutions on a pure speed basis for large models. Still, for many scenarios (especially up to medium model sizes or multi-user inference of larger models), EPYC offers an extremely capable and scalable platform. With the right optimizations, it can serve as the backbone for local AI inference engines, all while maintaining the reliability and security features expected in server-grade processors (AMD EPYC™ 4th Gen 9004 & 8004 Series Server Processors – Details).


Sources