| GPU Name | Manufacturer | Architecture | Process Node (nm) | Stream Processors (SP) | AI Accelerators | Base Clock (MHz) | Boost Clock (MHz) | Memory Type | Memory Size (GB) | Memory Bandwidth (GB/s) | Memory Bus Width | Mixed Precision FP16/BF16 | INT8 Performance | INT4 Performance | TDP (W) | PCIe Generation / Lanes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Radeon Instinct MI60 | AMD | GCN 5.1 (“Vega 20”) | 7 nm (TSMC) | 4096 (64 CUs) | None (standard CUs; INT8/INT4 via packed dot-product instructions) | 1200 | 1800 | HBM2 (ECC) | 32 | 1024 (1 TB/s) | 4096-bit | 29.5 TFLOPS FP16 (BF16 via software, ~14.7 TFLOPS) | 59 TOPS INT8 | ~118 TOPS INT4 (theoretical) | 300 | PCIe 4.0 x16 |
Table: Key specifications of the AMD Radeon Instinct MI60 relevant to LLM inference. Figures are from AMD’s MI60/MI50 launch materials as reported by PC Perspective (“Meet the AMD Radeon Instinct MI60 and MI50 accelerators”).
Overview: The MI60 is built on AMD’s GCN 5.1 architecture, code-named Vega 20, which is a refined 7 nm iteration of Vega optimized for compute workloads. It features 64 Compute Units (CUs), each containing 64 stream processors, for a total of 4096 shader/compute cores (Meet the AMD Radeon Instinct MI60 and MI50 accelerators - PC Perspective). The GPU die (331 mm²) integrates 13.23 billion transistors and is partitioned into four shader engines, each with its own geometry processor and rasterizer (though graphics features are less relevant for compute tasks) (AMD Vega 20 GPU Specs | TechPowerUp GPU Database).
Execution Units: Each CU in Vega 20 is an enhanced Next-Generation Compute Unit (NCU) supporting high clock speeds and Rapid Packed Math, which allows FP16 operations to execute at double rate (two FP16 operations packed into each 32-bit ALU lane per clock) (Meet the AMD Radeon Instinct MI60 and MI50 accelerators - PC Perspective). This is a key feature for AI workloads, effectively doubling throughput for half-precision math. The MI60 does not have dedicated “tensor cores” or matrix engines as found in some other GPUs; instead, it relies on these NCUs and new instructions to accelerate matrix operations in lower precision (INT8/INT4). While later AMD CDNA GPUs introduced specialized matrix cores, the MI60’s INT8/INT4 throughput is achieved via packed SIMD instructions on the existing compute units.
Caches and Memory Hierarchy: Vega 20 includes a 4 MB L2 cache on-die to buffer data between the HBM2 memory and the shader cores (AMD Vega 20 GPU Specs | TechPowerUp GPU Database). Each Compute Unit also contains its own L1 cache and a local data share (LDS) for fast thread-group communication (each CU’s LDS is 64 KB in the Vega architecture). The High Bandwidth Cache Controller (HBCC), an architectural feature of Vega, manages HBM2 memory and can theoretically allow the GPU to treat system memory as an extended memory pool. In the MI60’s context, the HBCC combined with HBM2 is designed to handle very large data sets for HPC and deep learning (Vega-Whitepaper-061317_FINAL_V2). Generational improvements over earlier Vega (GCN 5.0) include the move to 7 nm (allowing higher clocks and efficiency) and significantly improved double-precision performance (FP64 now at half the FP32 rate, versus the 1/16 rate of the consumer-oriented Vega 10). The MI60 was also the first AMD GPU in years to offer full ECC memory support across the entire memory path, which is critical for enterprise and scientific computing.
Compute-Focused Features: AMD added new INT8 and INT4 dot product instructions in Vega 20, recognizing the needs of AI inference where lower precision is sufficient. These allow the MI60 to achieve up to 4× the throughput of FP16 when using INT4 data (on paper). However, there are no separate “AI cores” – the standard vector units execute these operations. The MI60 also incorporates two Infinity Fabric Links on-card for peer-to-peer GPU communication, each link providing up to 100 GB/s bandwidth (bidirectional), enabling a high-speed GPU cluster (“hive”) of up to 4 GPUs in a ring topology for large parallel workloads. Hardware virtualization support (SR-IOV based “MxGPU”) is present as well, allowing the MI60 to be partitioned for multiple users or VMs, a feature useful in cloud environments.
Supported Precisions: The Radeon Instinct MI60 supports a range of numeric formats commonly used in deep learning. It offers full-rate single precision FP32 performance of 14.7 TFLOPS and half-rate double precision FP64 at 7.4 TFLOPS (Meet the AMD Radeon Instinct MI60 and MI50 accelerators - PC Perspective). Its architecture is optimized for lower precision: FP16 (half-precision) operations run at 2× the rate of FP32, yielding up to ~29.5 TFLOPS (Meet the AMD Radeon Instinct MI60 and MI50 accelerators - PC Perspective). While BFloat16 (BF16) was not originally a native data type in the Vega 20 hardware (BF16 gained popularity after MI60’s launch), later software updates in ROCm enabled BF16 support by mapping it to existing hardware capabilities. In practice, BF16 on MI60 achieves similar throughput as FP16 (or falls back to FP32 rate if not fully accelerated).
Integer and Tensor Ops: The MI60 introduced support for INT8 and INT4 precision arithmetic aimed at AI inference. Peak INT8 throughput is rated at 59 TOPS (trillions of 8-bit operations per second) (Meet the AMD Radeon Instinct MI60 and MI50 accelerators - PC Perspective), and INT4 up to a theoretical ~118 TOPS (4-bit ops) when utilizing the special dot-product instructions. These lower-precision modes allow significantly higher throughput and reduced memory footprint, beneficial for running quantized LLMs. It’s important to note that harnessing this performance requires software that can leverage these data types (e.g., optimized libraries or inference runtimes that use INT8 kernels).
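These peak figures follow directly from the shader count and boost clock; the short sketch below reproduces them, assuming the usual FMA-based counting of two FP32 operations per stream processor per clock, with 2×/4×/8× packing for FP16/INT8/INT4.

```python
# Back-of-the-envelope peak throughput for the MI60 (theoretical, FMA-based).
# Assumes 2 FP32 ops/clock per stream processor, with 2x/4x/8x packing for
# FP16/INT8/INT4 via Rapid Packed Math and the dot-product instructions.
STREAM_PROCESSORS = 4096
BOOST_CLOCK_GHZ = 1.8

OPS_PER_CLOCK = {"FP32": 2, "FP16": 4, "INT8": 8, "INT4": 16}

for fmt, ops in OPS_PER_CLOCK.items():
    peak = STREAM_PROCESSORS * ops * BOOST_CLOCK_GHZ / 1e3  # tera-ops per second
    unit = "TFLOPS" if fmt.startswith("FP") else "TOPS"
    print(f"{fmt}: ~{peak:.1f} {unit}")
# Expected: FP32 ~14.7, FP16 ~29.5, INT8 ~59.0, INT4 ~118.0
```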
Tensor Operations: Unlike NVIDIA’s tensor cores or AMD’s own later CDNA GPUs (which have Matrix Cores), the MI60 does not have fixed-function matrix multiply units. Instead, matrix and tensor operations are handled by the shader cores. AMD’s ROCm software stack (libraries like MIOpen, rocBLAS, etc.) is optimized to use MI60’s NCUs for GEMM (general matrix multiply) and convolution operations. The peak throughput numbers (FP16, INT8, etc.) are typically achieved on dense matrix multiplication workloads that fully utilize the compute units. For example, batched matrix multiplications or large GEMMs in transformer feed-forward layers can approach those theoretical FLOPS on MI60.
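As a rough illustration (not an AMD-published benchmark), one could probe how close a large FP16 GEMM gets to that peak from ROCm PyTorch, which dispatches torch.matmul to rocBLAS and exposes the MI60 through the familiar "cuda" device API:

```python
import time
import torch

# Rough FP16 GEMM throughput probe on an MI60. On ROCm builds of PyTorch the
# AMD GPU is addressed as "cuda", and matmul is backed by rocBLAS.
n = 8192
a = torch.randn(n, n, dtype=torch.float16, device="cuda")
b = torch.randn(n, n, dtype=torch.float16, device="cuda")

torch.cuda.synchronize()
t0 = time.time()
iters = 20
for _ in range(iters):
    c = a @ b
torch.cuda.synchronize()
elapsed = time.time() - t0

flops = 2 * n**3 * iters  # count each multiply-add as two operations
print(f"Effective FP16 throughput: {flops / elapsed / 1e12:.1f} TFLOPS "
      f"(theoretical peak ~29.5 TFLOPS)")
```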
Sparsity Support: Hardware support for sparsity (such as accelerating 2:4 structured sparsity) was not present in the GCN-based MI60. Unlike some newer architectures that can skip zero weights for additional speed, MI60 executes all operations at face value, meaning any speedup from sparsity must come from software-level optimizations. Unstructured sparsity in models (pruned weights) can still yield speedups on MI60 by reducing memory load and avoiding unnecessary computes, but there is no dedicated hardware to automatically double throughput on sparse patterns. Thus, performance gains from sparsity will be workload-dependent and typically less pronounced than on GPUs with explicit sparsity engines.
HBM2 Memory and Bandwidth: A standout feature of the MI60 is its 32 GB of HBM2 VRAM on a very wide 4096-bit memory bus, delivering 1 TB/s of memory bandwidth. The memory is arranged as four HBM2 stacks (each 8 GB) connected via four memory controllers. This extreme bandwidth is critical for large language model inference, as transformer models are often memory-bandwidth bound (retrieving large weight matrices and key/value caches). In fact, the MI60’s bandwidth is roughly two to three times that of contemporary GDDR6-based gaming GPUs, helping feed the compute units with data at a fast rate. The HBM2 memory on the MI60 also supports ECC (error-correcting code), ensuring reliability for large models running for long durations.
Memory Hierarchy: The GPU’s on-die L2 cache is 4096 KB (4 MB) (AMD Vega 20 GPU Specs | TechPowerUp GPU Database), which is modest by today’s standards but was significant in 2018. The L2 cache helps by keeping recently used weights, activations, or attention key/value data close to the cores for reuse, reducing the need to always go out to HBM2. Each Compute Unit further has a small L1 cache (instruction and data) and a 64 KB shared memory (LDS) that can be used as scratchpad and for accelerating local reductions. For LLM inference, the large HBM2 capacity means that models up to tens of billions of parameters can be loaded entirely into GPU memory (especially if using 8-bit or 4-bit quantization). In practical terms, the MI60 can run models that exceed the VRAM of many consumer GPUs without offloading layers to the CPU. However, if a model’s size does exceed 32 GB, the MI60 has to rely on either model parallelism (splitting across GPUs) or host-memory paging (which is undesirable due to PCIe latency).
Model Size Limitations: In practice, a 32 GB VRAM allows loading roughly a 13B-30B parameter transformer in half precision, or even a 70B parameter model if aggressively quantized to 4-bit. For example, a 13B parameter LLaMA model in 16-bit weights is about ~26 GB, which fits comfortably in 32 GB with room for activations. A 70B model quantized to 4-bit (~35 GB) slightly exceeds a single MI60’s VRAM, requiring either compression, layer streaming, or multi-GPU split. Thus, memory capacity is often the bottleneck determining the maximum model size for local inference on MI60. The massive bandwidth of HBM2 helps maintain throughput even for large context windows (long sequences) since attention mechanisms perform numerous memory lookups. However, if the sequence length is very long (thousands of tokens), the 4 MB L2 cache may become less effective, and performance could become bandwidth-limited by frequent HBM2 accesses.
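The capacity arithmetic above can be made concrete with a small estimator like the sketch below; it counts weights only and ignores framework overhead and the KV cache, which grow with batch size and context length.

```python
# Rough VRAM estimate for transformer weights at different precisions.
# Illustrative only: ignores framework overhead, activations, and the KV cache.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
MI60_VRAM_GB = 32

def weight_footprint_gb(params_billion: float, precision: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for params in (13, 30, 70):
    for prec in ("fp16", "int8", "int4"):
        gb = weight_footprint_gb(params, prec)
        verdict = "fits" if gb <= MI60_VRAM_GB else "exceeds 32 GB"
        print(f"{params}B @ {prec}: ~{gb:.1f} GB ({verdict})")
# 13B fp16 ~26 GB fits on one MI60; 70B int4 ~35 GB slightly exceeds a single card.
```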
Memory Compression: The Vega architecture employs memory compression (such as delta color compression) in graphics contexts, but for general compute/LLM inference, such compression is not particularly relevant (there’s no “activations compression” akin to texture compression). One notable Vega feature, the High Bandwidth Cache Controller (HBCC), can allow the GPU to use system memory as an extended VRAM. In theory, the HBCC could enable working with model data larger than 32 GB by streaming from host memory, but this would incur a severe speed penalty due to PCIe latency/bandwidth limitations. In practice, for LLM inference one would avoid exceeding GPU VRAM or use multiple GPUs rather than rely on swapping over PCIe.
Inference Throughput on LLaMA/GPT: In real-world LLM inference tests, the MI60 demonstrates solid performance, though often constrained by software optimization. For instance, using a GPTQ-quantized LLaMA-2 13B model, a single MI60 can generate text at about 15.8 tokens per second (batch size 1, sequence generation scenario). In this test, 200 tokens were produced in 12.6 seconds using one MI60, indicating its capability on medium-size models. When two MI60 cards were used together on that same 13B model (model split across GPUs), throughput reached ~15.4 tokens/s – about the same, showing minimal scaling benefit for that model (likely because 13B already fits in one GPU and added overhead outweighed benefits). For a much larger model, LLaMA-2 70B (quantized), two MI60s achieved 3.4–4.5 tokens per second in generation throughput. This is a lower rate, reflecting the heavy compute and memory demands of a 70B model, but it demonstrates that multi-GPU MI60 setups can handle models of that scale (with 70B spread across 2×32 GB cards).
Batch Size and Sequence Length: The above measurements were at batch size 1 (single-query generation). The MI60, with its high compute and bandwidth, can handle larger batch sizes for inference when throughput (tokens/sec or sequences/sec) is the goal rather than single-stream latency. In batch processing of shorter sequences (as in BERT-like QA or classification), the MI60 can achieve high throughput. While specific public benchmarks are sparse, anecdotal reports suggest performance on par with high-end GPUs of its era for transformer workloads. For example, one community member noted the MI60’s performance is roughly in line with an NVIDIA V100 for LLM inference when software is well-optimized (2x AMD MI60 inference speed. MLC-LLM is a fast backend for AMD GPUs. : r/LocalLLaMA) (noting that the MI60 lacks some of the INT8 optimizations that V100’s tensor cores provide). In another case, running a smaller 8B parameter model in 4-bit mode on two MI60s reached about 80 tokens/s generation, which was compared to roughly what an RTX 3090 might achieve on an 8-bit quantized model (2x AMD MI60 inference speed. MLC-LLM is a fast backend for AMD GPUs. : r/LocalLLaMA). These community figures should be taken with caution, but they highlight that the MI60 can attain high token throughput if using lower precision and optimized kernels.
Latency and Generation Speed: For interactive LLM usage (single-stream), the MI60’s latency is primarily governed by its single-thread performance (clock speed) and memory latency. At ~1.8 GHz boost, its core clock is decent, but not as high as some gaming GPUs. Nonetheless, thanks to massive parallelism, it can generate tokens relatively quickly once the model is loaded. In the earlier 13B example, ~16 tokens/sec means each token took ~62 ms to produce on average. This includes all model layers for that token. Latency can increase with longer context windows (because self-attention grows in complexity) – e.g., at 2048 token context, attention computations and memory reads intensify. At very large context (such as 8K or more, if the model supports it), the MI60’s advantage in memory bandwidth helps, but the 4MB L2 cache might become a limiting factor to keep all those key/value vectors readily accessible. Still, the MI60 should handle typical 2048-length contexts for GPT-type models with only a modest slowdown compared to shorter sequences.
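Because batch-1 decoding must stream essentially the full weight set from HBM2 for every token, a roofline-style lower bound on per-token latency is easy to sketch (illustrative figures only; real latency is higher due to attention, KV-cache traffic, and kernel overhead):

```python
# Roofline-style lower bound on per-token decode latency at batch size 1:
# each token must read (at least) the full weight set from HBM2.
MI60_BW_GBPS = 1024  # ~1 TB/s

def min_token_latency_ms(weight_gb: float) -> float:
    return weight_gb / MI60_BW_GBPS * 1e3

cases = [
    ("13B @ 4-bit", 6.5),
    ("13B @ fp16", 26.0),
    # Assumes each of two GPUs streams its half of a ~35 GB 4-bit 70B model in parallel.
    ("70B @ 4-bit, half per GPU", 17.5),
]
for name, gb in cases:
    ms = min_token_latency_ms(gb)
    print(f"{name}: >= {ms:.1f} ms/token ({1e3 / ms:.0f} tok/s upper bound)")
# The measured ~62 ms/token for a quantized 13B model sits well above this
# bandwidth floor, i.e. real decoding is also limited by kernels and overhead.
```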
BERT and Other Models: For encoder-type models like BERT, which are often measured in sequences per second for inference, MI60’s strong FP16 capability means it can process many sequences in parallel. No official figure is published by AMD for BERT inference on MI60, but given the 14.7 TFLOPS FP32 (29.5 TFLOPS FP16) available, one could expect the MI60 to comfortably handle hundreds of inference sequences per second on BERT-base when using batch processing and FP16 acceleration. In HPC-oriented evaluations, however, the MI60 sometimes lagged behind contemporary GPUs in optimized workloads – e.g., in one academic comparison of GPU performance, the MI60 delivered the lowest performance among tested GPUs for certain inference benchmarks, underscoring that software optimizations (or lack thereof) heavily influence results on this hardware (Preliminary Results of the MLPerf BERT Inference Benchmark on AMD Instinct GPUs | Request PDF). Ensuring the use of AMD’s optimized libraries (like using ONNX Runtime or PyTorch with ROCm backends tuned for MI60) is key to unlocking its inference performance.
Power Consumption Under Load: The MI60 has a Thermal Design Power (TDP) of 300 W, typical for a high-end datacenter accelerator. During intensive LLM inference, especially in FP16 compute-bound scenarios, the card can draw near its rated 300 W. In community testing, running a large language model on an MI50 (the 16 GB variant of the Vega 20 GPU) showed power usage peaking around ~236 W during inference (Deciding Not To Buy A Radeon Instinct Mi50 With The Help Of Vast.ai!). The MI60, with additional cores active and larger memory, can be expected to draw close to its full 300 W when fully engaged by a transformer model. At idle, MI60s still draw a notable amount (reports indicate roughly 20 W per card when idle in a system) (2x AMD MI60 inference speed. MLC-LLM is a fast backend ... - Reddit), due to the nature of data center GPUs, which prioritize reliability over aggressive power gating.
Thermal Management: The reference MI60 is a passively cooled, dual-slot card (no on-board fan). It relies on server chassis airflow or external fans. Under heavy load, the GPU can run hot if not adequately cooled. Users who repurpose MI60s in desktop environments often attach fans or liquid-cooling solutions. The card’s thermal limits will throttle clocks if temperatures approach critical levels. However, AMD designed the MI60 for sustained compute in server racks, so with proper airflow (e.g., large fans pushing air through the fins), it can maintain high performance without significant throttling. Operating temperatures under load are typically in the 70–80 °C range, which is expected for a 300 W processor. The MI60 includes thermal sensors and will downclock if it exceeds safe temperatures, but in most well-cooled setups it will hold its 1.8 GHz boost fairly consistently during inference workloads.
Performance-per-Watt: In terms of efficiency, the MI60 delivered about 0.025 TFLOPS/W for FP64, 0.049 TFLOPS/W for FP32, and ~0.098 TFLOPS/W for FP16 at peak theoretical rates (using the 300 W TDP as baseline). For INT8 operations, it is around 0.2 TOPS/W theoretical. In practice, real LLM workloads do not hit those peak FLOPS due to memory and control overheads, so the effective performance-per-watt is lower. When comparing energy efficiency in generating tokens, one user observed that MI60-class cards needed more software optimization to reach the same efficiency as competitor GPUs (2x AMD MI60 inference speed. MLC-LLM is a fast backend for AMD GPUs. : r/LocalLLaMA). Nonetheless, thanks to the 7 nm process and HBM2’s energy efficiency (HBM2 offers high bandwidth at lower power per bit transferred than GDDR), the MI60 was a large step up in perf/W from AMD’s previous 14 nm Instinct MI25, and it was competitive in its time for HPC workloads. For modern LLM inference it may not match the perf/W of newer accelerators designed specifically for AI, but it remains reasonable, especially given that many MI60 units can be found on secondary markets, offering a lot of performance for the power draw.
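These efficiency figures are simply the peak throughput numbers divided by the 300 W board power, as the short check below shows:

```python
# Peak performance-per-watt at the 300 W board power (theoretical).
TDP_W = 300
peaks = {"FP64 (TFLOPS)": 7.4, "FP32 (TFLOPS)": 14.7,
         "FP16 (TFLOPS)": 29.5, "INT8 (TOPS)": 59.0}
for fmt, peak in peaks.items():
    print(f"{fmt}: {peak / TDP_W:.3f} per watt")
# FP64 ~0.025, FP32 ~0.049, FP16 ~0.098 TFLOPS/W, INT8 ~0.197 TOPS/W
```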
Thermal Throttling Behavior: If the MI60’s cooling solution is inadequate, the card will start to reduce clocks to stay within safe thermal envelope. Because LLM inference tends to use both the GPU cores and memory heavily, it generates significant heat on both the compute die and HBM stacks. Ensuring proper thermal paste, heatsinks on memory, and robust airflow is important for sustained performance. Fortunately, the MI60’s passive cooler is a hefty metal heatsink designed for high airflow scenarios, so as long as that airflow is provided (either via server chassis or custom fan setups), the card can run at full power indefinitely. Users integrating MI60s into workstations often 3D-print ducts or mounts for fans to replicate the server cooling conditions. In summary, MI60 requires datacenter-class cooling considerations: with those in place, it will deliver consistent throughput; without them, it may throttle and reduce token generation speed to stay within thermal limits.
ROCm and Driver Support: MI60 is fully supported by AMD’s ROCm software stack (compute runtime and driver), at least up to ROCm 5.x. It uses the GFX906 ISA (for MI60/MI50 GPUs) within ROCm. ROCm provides optimized libraries such as rocBLAS (for GEMM), MIOpen (for deep learning ops), and rocFFT/rocSparse, which can be leveraged by frameworks. PyTorch and TensorFlow both have ROCm builds that include support for MI60 (though one should use a ROCm version that still includes gfx906 support; newer ROCm releases are starting to focus on CDNA GPUs, and support for MI50/MI60 is deprecated in the latest releases). For inference, users commonly employ PyTorch (with ROCm) or Hugging Face Transformers on MI60. AMD’s own inference-serving stack (like ONNX Runtime with MIGraphX or the new vLLM optimizations AMD publishes) can also target the MI60, but some of the latest tools are optimized mainly for CDNA GPUs.
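A quick sanity check that a ROCm PyTorch build actually sees the MI60 might look like the sketch below; on ROCm builds torch.version.hip is populated and the AMD GPU is exposed through the regular torch.cuda API.

```python
import torch

# Sanity check for a ROCm PyTorch installation on an MI60 (gfx906 target).
print("HIP version:", torch.version.hip)          # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))  # a Vega 20 / Instinct string
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    print("FP16 matmul OK:", (x @ x).shape)
```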
Framework Compatibility: Popular AI frameworks can run on MI60 through ROCm. PyTorch (>=1.12 ROCm version) can utilize MI60 for transformer model inference, as can TensorFlow (ROCm patch) and JAX (via ROCm/HIP backend). ONNX Runtime has a backend called MIGraphX which was tested on MI60 and MI100 – it provides a path to run ONNX models on the GPU with graph optimizations. Community forums have noted that some out-of-the-box AI software may expect CUDA and not work on ROCm without modification (Deciding Not To Buy A Radeon Instinct Mi50 With The Help Of Vast.ai! - Patshead.com Blog). However, LLM inference frameworks are becoming more inclusive: for example, llama.cpp has OpenCL and Vulkan backends that reportedly work on older GCN cards (like MI60) even if ROCm isn’t used. For best performance though, the ROCm path is preferred, as it can use MI60’s full capabilities.
Optimization & Kernels: To get optimal LLM performance, certain optimizations can be applied on MI60:
- Optimized libraries: the GEMM kernels in rocBLAS (and the deep learning primitives in MIOpen) are highly tuned for matrix ops on GCN, so routing work through them (as the ROCm framework builds do) is the baseline optimization.
- Fused kernels: fusing operations (e.g., fused attention mechanisms, fused MLP layers) minimizes memory reads and better utilizes the ALUs. AMD’s transformer libraries may include such fused operations.
- Software tools: AMD provides ROCm profilers and debuggers to fine-tune kernels. For LLMs one might not go to that low level, but they make it possible to identify bottlenecks (for example, GEMMs that underutilize the hardware because of their shape, or memory copies that dominate runtime). Additionally, AMD’s RCCL library (the equivalent of NCCL) enables multi-GPU communication over Infinity Fabric or PCIe, which is useful for model parallelism or for serving multiple requests.
Summary of Compatibility: Overall, the MI60 can run modern LLM inference with frameworks such as PyTorch (Transformers), TensorFlow, ONNX Runtime, and specialized libraries like Hugging Face’s text-generation-inference (with ROCm support). The key is to use versions and forks that include support for the gfx906 target. Some bleeding-edge optimizations (such as FlashAttention, or the kind of transformer acceleration found in NVIDIA’s TensorRT) may require community patches or ROCm-specific equivalents. The ecosystem for AMD GPUs in AI has historically lagged behind NVIDIA’s, but it is rapidly improving, and the MI60 benefits from many of those software advancements.
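As an illustration of the PyTorch/Transformers path described above, a minimal generation script on a ROCm system could look like the following sketch (the model name is a placeholder; any causal LM whose FP16 weights fit in 32 GB applies):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal generation sketch on a ROCm PyTorch build (the MI60 appears as "cuda").
model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; requires access/download
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16  # ~26 GB of weights fits in 32 GB HBM2
).to("cuda")

inputs = tok("The MI60 is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```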
Multi-GPU Scaling (Data Parallel): The MI60 is designed to scale out in multi-GPU servers. With PCIe 4.0 x16 connectivity, host systems can feed multiple MI60s with ample bandwidth, and with dual Infinity Fabric Links between cards, the GPUs can communicate with each other at up to 200 GB/s peer-to-peer. In an 8-GPU server, AMD allows a “hive ring” topology (two hives of four GPUs each) where each MI60 links to two neighbors via Infinity Fabric. For data-parallel inference (serving different requests or batches on each GPU), scaling is nearly linear – e.g., two MI60s can handle roughly 2× the number of inference requests as a single card, provided the CPU can supply data and the software uses asynchronous queueing.
Model Parallel for Larger Models: For models that exceed 32 GB, one can split the model across multiple GPUs (model parallelism). The high-bandwidth IF links help here: layers or attention blocks can span GPUs with less penalty. In practice, using two MI60s to host a 70B model (with each GPU holding part of the weights) has been demonstrated. The result was ~3.4 tokens/s for generation as noted above, versus the model not fitting on a single GPU at all. The efficiency of multi-GPU scaling depends on the parallelization scheme. If using pipeline parallelism (different layers on different GPUs) or tensor parallelism (splitting weight matrices), synchronization overhead comes into play. The MI60’s GPU-to-GPU latency over Infinity Fabric (sub-microsecond, on the order of 60–70 ns according to one report) (AMD presents more Vega 20 details (Instinct MI60/MI50) after Next ...) is much lower than going over PCIe, making it well-suited for tight coupling of GPUs. Using AMD’s RCCL, all-reduce operations for synchronization can achieve high throughput as well. Still, some overhead exists; for example, the 13B model saw virtually no speed gain with 2 GPUs because the workload was small enough for one, while at 70B, two GPUs delivered a result where one would fail due to memory constraints. In that case, scaling essentially enabled the workload rather than speeding it up dramatically.
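One common way to realize such a split in practice is Hugging Face Accelerate’s device_map="auto", which places layers across the visible GPUs (a pipeline-style layer split rather than true tensor parallelism). A hedged sketch for a two-MI60 system, with a placeholder model name, is shown below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: spread a model's layers across two MI60s with device_map="auto"
# (pipeline-style layer placement via Accelerate, not tensor parallelism).
# For a 70B model this only makes sense with a quantized checkpoint, since
# fp16 weights (~140 GB) far exceed 2 x 32 GB; the placement mechanism is the same.
model_id = "meta-llama/Llama-2-13b-hf"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                     # layers land on cuda:0 and cuda:1
    max_memory={0: "30GiB", 1: "30GiB"},   # leave headroom for activations / KV cache
)

inputs = tok("Hello", return_tensors="pt").to("cuda:0")
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```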
CPU-GPU Transfers: During inference, most data (model weights, KV cache, activations) resides on the GPU. CPU-GPU transfer mainly occurs for input data (token IDs into the model) and output logits or tokens back to CPU, which are negligible in size compared to the model. Thus, PCIe 4.0’s ~31.5 GB/s bandwidth is more than enough to handle those small data transfers without bottleneck. The potential bottleneck appears if streaming of layers from CPU memory is required (which one avoids if possible) or if the batch size is extremely large and input data is huge. Another consideration is startup latency: loading a 30 GB model from disk into VRAM can take time, but that’s a one-time cost. If running multiple sequences, one can keep the model loaded persistently.
Scaling in Practice: To maximize MI60 usage in multi-GPU setups, one can use frameworks like PyTorch Distributed or Megatron-LM (for model parallelism) configured for ROCm. Ensuring efficient overlap of computation and communication is key – e.g., overlapping the all-reduce of attention outputs with computation of the next layers. AMD’s hardware supports fine-grained synchronization and direct GPU-to-GPU transfers via IF, which aids this. In summary, the MI60 scales reasonably well for larger models using model parallelism, though it may not reach perfect linear scaling due to overhead. For data parallelism (throughput scaling), multiple MI60s can linearly increase tokens/sec served, up to the point where the CPU or another system component becomes the limiter (which, in modern multi-core servers, is usually not the first limiting factor).
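For the data-parallel case, PyTorch’s distributed API works on ROCm with RCCL standing in behind the usual "nccl" backend name; a minimal sketch, launched with torchrun with one process per MI60, might look like this:

```python
import torch
import torch.distributed as dist

# Minimal data-parallel sketch: one process per MI60, launched with e.g.
#   torchrun --nproc_per_node=2 this_script.py
# On ROCm the "nccl" backend name is backed by RCCL, which can use the
# Infinity Fabric links (or PCIe) for GPU-to-GPU traffic.
def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each GPU would run its own replica / its own batch of requests here;
    # the all-reduce stands in for any cross-GPU synchronization step.
    t = torch.ones(1, device="cuda") * rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```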
Bottlenecks for LLMs: While the MI60 is a capable accelerator, there are some bottlenecks to be aware of when using it for large language models:
- Memory capacity: 32 GB per card caps model size; anything larger requires quantization, a multi-GPU split, or (undesirably) host-memory paging over PCIe.
- No dedicated matrix hardware: lacking tensor/matrix cores, FP16 and INT8 throughput comes from the general-purpose compute units, so the MI60 cannot match accelerators with fixed-function matrix engines or structured-sparsity support.
- Software maturity: peak performance depends on ROCm-optimized kernels, and gfx906 support is deprecated in the newest ROCm releases, so users must pin compatible versions and sometimes patch bleeding-edge features.
- Cache and long contexts: the 4 MB L2 cache is small by modern standards, so very long contexts push the card toward being purely HBM2-bandwidth-limited.
- Cooling and power: the passively cooled 300 W card needs server-class airflow, or it will throttle and lose token throughput.
Maximizing MI60 for LLMs: Given these limitations, users often:
- quantize models to INT8 or 4-bit (e.g., GPTQ) so that larger models fit in 32 GB and the INT8/INT4 dot-product instructions can be exploited;
- pair two or more MI60s over Infinity Fabric (with RCCL) for models beyond a single card’s VRAM;
- stick to ROCm builds of PyTorch/ONNX Runtime or backends such as MLC-LLM and llama.cpp that are known to work on the gfx906 target;
- keep the model resident in VRAM to avoid PCIe streaming, and provide strong airflow so the card holds its boost clock.
This report has referenced authoritative sources, including official specifications, technical analyses, and community benchmarks, to provide an accurate picture of the MI60’s capabilities for LLM inference. Key references include AMD’s product announcements and whitepapers, third-party reviews such as PC Perspective’s launch coverage and TechPowerUp’s GPU database, and user-conducted performance tests on LLM tasks, as cited throughout the text above.