Summary of NVIDIA A100 Series GPUs (for Local LLM Inference)
*Peak figures with structural sparsity (2:4) enabled (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). SXM4 modules use NVIDIA NVLink (3.0) for inter-GPU communication (600 GB/s) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison). PCIe versions can be bridged in pairs via NVLink (up to 3 links per pair) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). Manufacturer TDP values are listed; typical power draw under load is often lower (e.g. ~250–300 W in practice for the 80 GB SXM) depending on workload and cooling (NVIDIA A100 PCIe vs SXM4 Comparison and Use Cases in 2024).
Detailed Technical Analysis
Architecture Deep Dive (Ampere GA100 GPU)
The NVIDIA A100 series is built on the Ampere architecture (GA100 GPU), representing a leap in throughput and efficiency over the prior Volta generation (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). The GA100 chip contains 54.2 billion transistors on TSMC’s 7 nm process, with a die size of 826 mm². The A100 GPU has up to 108 Streaming Multiprocessors (SMs) active (out of 128 maximum on GA100) (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). Each SM in Ampere features a redesigned data path and larger caches for higher AI performance:
- FP32 Cores – Each A100 SM contains 64 FP32 CUDA cores (16 per SM partition), giving 6,912 FP32 CUDA cores across the 108 active SMs (Ampere (microarchitecture) - Wikipedia). The per-SM FP32 count matches Volta, but with more SMs and higher clocks the A100 delivers substantially more aggregate FP32 throughput. (The doubling of FP32 datapaths per SM applies to the consumer GA10x Ampere chips, not to GA100.)
- Third-Generation Tensor Cores – Each SM includes 4 Tensor Cores (third-gen), for a total of 432 Tensor Cores on A100. These specialized units perform matrix math critical for deep learning. Ampere Tensor Cores support new data types and 4× the FP16/FP32 FMA throughput per core (256 FMA ops/clock per TC) compared to previous-gen (64/clock) (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). The Tensor Cores can execute FP16, bfloat16 (BF16), INT8, and INT4 operations, and introduce TensorFloat-32 (TF32), a 19-bit precision mode that accelerates FP32 computations by using tensor cores with minimal accuracy loss (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). For HPC, the third-gen Tensor Cores even support IEEE-compliant FP64 matrix operations, doubling FP64 throughput (via Tensor Core FP64: 19.5 TFLOPS vs 9.7 TFLOPS standard) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press).
- SM Structure and Caches – The A100 SM is divided into four processing blocks, each with its own warp scheduler and execution units, but now all four blocks share a combined L1 data cache / shared memory up to 192 KB per SM (NVIDIA A100 PCIe 40 GB Specs | TechPowerUp GPU Database). This unified design (versus 128 KB on Volta) improves on-chip data reuse for AI workloads. The L2 cache is dramatically enlarged to 40 MB (partitioned for bandwidth), which is ~7× larger than V100’s L2 (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog) (NVIDIA A100 SXM4 40 GB Specs | TechPowerUp GPU Database). This huge L2 acts as a staging area for neural network weights/activations, reducing frequent off-chip memory accesses. A new crossbar and cache design gives 2.3× L2 read bandwidth vs Volta (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog), ensuring the Tensor Cores and CUDA cores stay fed with data.
- Structural Sparsity – Ampere introduces hardware support for fine-grained structured sparsity in neural network weights. The A100’s architecture can exploit a 2:4 sparsity pattern (two nonzeros out of every four values) in weight matrices, essentially skipping zero computations (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). When models are pruned to this pattern (with minimal accuracy loss), the Tensor Cores can double the effective throughput for supported operations (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). This yields up to 2× higher TF32/FP16/INT8 performance (marked with "*" in peak specs). For example, an A100 can reach 624 TFLOPS of FP16/BF16 with sparsity (vs 312 TFLOPS dense) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). This feature boosts LLM inference if the model has been sparsity-optimized.
- Multi-Instance GPU (MIG) – A unique architectural feature of A100 is the ability to partition the GPU into up to 7 smaller instances (at the hardware level). Each MIG instance has dedicated SMs, memory slices, and cache slices, isolated from others (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). For example, a 40GB A100 can be split into 7×5 GB GPU instances. MIG allows better utilization in multi-tenant or multi-model scenarios by running several smaller models concurrently on one A100 with guaranteed QoS and memory isolation (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog). While MIG is less directly useful for running one large LLM, it’s very relevant in serving many smaller models or microservices on a single GPU.
Overall, the Ampere architecture in A100 provides major generational improvements for AI: NVIDIA reported up to 20× higher deep learning performance versus Volta for certain workloads when leveraging new features like TF32 and sparsity (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). The combination of more CUDA cores, faster Tensor Cores, larger caches, and new data formats makes A100 a formidable engine for LLM inference and training.
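As a rough sanity check on the headline numbers above, the 312 TFLOPS dense FP16 Tensor Core figure can be reproduced from the SM count, Tensor Cores per SM, FMA rate per Tensor Core, and the ~1.41 GHz boost clock. The short calculation below is only an illustration of where the spec comes from, not an official derivation.

```python
# Back-of-the-envelope check of A100's dense FP16 Tensor Core throughput,
# using the architecture figures quoted in the text above.
sms = 108                 # active SMs on A100
tensor_cores_per_sm = 4   # third-gen Tensor Cores per SM
fma_per_tc_per_clk = 256  # FP16 FMAs per Tensor Core per clock
flops_per_fma = 2         # one multiply + one add
boost_clock_hz = 1.41e9   # ~1410 MHz boost clock

peak_fp16_tflops = (sms * tensor_cores_per_sm * fma_per_tc_per_clk
                    * flops_per_fma * boost_clock_hz) / 1e12
print(f"Estimated dense FP16 Tensor TFLOPS: {peak_fp16_tflops:.0f}")  # ~312
```

The same arithmetic, doubled for 2:4 sparsity, gives the ~624 TFLOPS sparse figure quoted in the spec sheets.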
Compute Capabilities and Precision Support
The A100 supports a wide range of numeric precisions, which is critical for optimizing LLM inference (trading off speed vs accuracy):
- FP32 (Single Precision): A100 delivers up to 19.5 TFLOPS of FP32 compute (per GPU) using its CUDA cores (Ampere (microarchitecture) - Wikipedia). FP32 workloads can also be routed through the Tensor Cores via the TF32 format (covered below) for much higher throughput. However, for large models FP32 is often overkill; A100’s real advantage lies in lower precision.
- FP16 / BF16 (Half Precision): Using Tensor Cores, A100 achieves 312 TFLOPS of half-precision (FP16) or BF16 performance (without sparsity) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) – a massive speedup for matrix-heavy operations. BF16 (bfloat16) has the same dynamic range as FP32 and is advantageous for training or inference where FP32-level range is needed but not full precision. A100’s Tensor Cores process BF16 at the same rate as FP16 (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog), making BF16 inference just as fast. In practice, frameworks can use FP16 or BF16 autocasting on A100 to greatly accelerate LLM throughput with negligible loss in output quality.
- TensorFloat-32 (TF32): TF32 is a specialized 19-bit precision (10-bit mantissa) format introduced with A100. It allows networks trained in FP32 to run on Tensor Cores with minimal changes. TF32 delivers 156 TFLOPS on A100 (312 TFLOPS with sparsity) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press), effectively providing 8× the throughput of standard FP32 while preserving FP32-level range and significantly reducing training/inference time for LLMs (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog).
- INT8 and INT4: A100’s Tensor Cores also support low-precision integer math for inference. It can perform 624 TOPS of INT8 (Tensor operations, no sparsity) and 1,248 TOPS of INT4 (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). With structured sparsity, these rates double (up to ~1.25 PetaOPS of INT8 and ~2.5 PetaOPS of INT4 in theory) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). INT8 is commonly used for quantized inference of transformers – many LLMs can be quantized to 8-bit weights/activations with minimal accuracy drop, cutting memory use by 4× versus FP32 (2× versus FP16) and roughly doubling Tensor Core throughput versus FP16. A100’s INT8 capability is therefore extremely relevant for deploying large models that otherwise wouldn’t fit in GPU memory. INT4 is more experimental for LLMs (potentially another 2× speedup over INT8, but with more quantization error). However, research and some frameworks are exploring 4-bit weight quantization; A100 hardware is ready for such ultra-low precision inference.
- FP64 (Double) and FP64 Tensor: For completeness, A100 supports double precision at 9.7 TFLOPS (FP64 FMA) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) for HPC workloads. Unique to GA100, it also supports an FP64 Tensor Core mode (via the DMMA matrix instructions) that reaches 19.5 TFLOPS (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). This is primarily for scientific computing; LLM inference generally doesn’t use FP64. It highlights Ampere’s focus on versatility – one GPU for both AI and high-precision HPC tasks.
Sparsity and Mixed Precision: The combination of these capabilities means that on A100, large language models can run several times faster than a naive FP32 implementation. For instance, one can store model weights in INT8 (4× smaller than FP32, 2× smaller than FP16) and run matrix multiplies in mixed precision – e.g. accumulate in FP16 or BF16 – using A100’s Tensor Cores. Structured sparsity (if the model has been pruned accordingly) can further double the throughput for those layers (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). All major deep learning frameworks (PyTorch, TensorFlow, JAX) and inference runtimes (NVIDIA TensorRT, ONNX Runtime, etc.) can leverage these tensor core precisions on A100. This yields higher tokens/sec and enables serving larger models within given hardware constraints.
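As a minimal illustration of how these precisions are typically used in practice, the sketch below runs a causal LM under BF16 autocast on an A100 through the Hugging Face `transformers` API. The model id and generation settings are placeholders, and exact behavior depends on your `transformers`/PyTorch versions; treat it as a sketch rather than a reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id; substitute any causal LM you have locally.
model_id = "my-org/my-7b-model"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

inputs = tok("The A100's Tensor Cores are", return_tensors="pt").to("cuda")

# BF16 keeps FP32-like dynamic range while using the Tensor Cores;
# on Ampere it runs at the same rate as FP16 and needs no loss scaling.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=64)

print(tok.decode(out[0], skip_special_tokens=True))
```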
Memory Subsystem Analysis (HBM2, Caches, and Model Size)
Large Language Models are memory-hungry, and the A100’s memory subsystem is designed to keep huge models fed with data:
- High-Bandwidth Memory (HBM2/HBM2e): A100 GPUs use stacked HBM memory to achieve extraordinary throughput. The 40GB A100 uses 5 stacks of HBM2 (at 2.4 Gbps per pin), delivering about 1.6 TB/s memory bandwidth (Ampere (microarchitecture) - Wikipedia). The 80GB models use HBM2e with higher pin speed (up to 3.2 Gbps) for roughly 2.0 TB/s bandwidth (NVIDIA A100 40GB vs 80 GB GPU Comparison in 2024 - DataCrunch) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison) – in fact, the 80GB SXM4 version reaches ~2.04 TB/s (NVIDIA A100 SXM4 80 GB Specs | TechPowerUp GPU Database), the highest of any GPU at launch. This immense bandwidth is crucial for transformer models, which involve reading large weight matrices and writing activations. By comparison, a typical GPU with GDDR6 memory might have ~0.5 TB/s; A100’s memory subsystem offers roughly 3–4× higher bandwidth. The memory bus is 5,120-bit (ten 512-bit controllers enabled on A100, out of twelve on the full GA100). In practical terms, this means an A100 can stream model parameters from HBM at an extremely high rate, reducing stalls when processing long context or large batches of tokens.
- Memory Capacity: With 40 GB or 80 GB of HBM, A100 GPUs can accommodate very large models locally. An A100 40GB can hold a model on the order of 20–30 billion parameters in 16-bit precision (since 40GB ≈ 20B FP16 params, plus overhead). The 80GB model doubles that – for example, a 70B parameter LLM can fit on one 80GB A100 if quantized to 8-bit (70B * 1 byte ≈ 70GB). This is crucial for local inference, as it avoids the latency of model-parallel offloading. Models like GPT-3 (175B) still exceed a single A100’s memory, but can be split across multiple GPUs (discussed later). For most open LLMs (e.g. Llama-7B, 13B, 30B, 65B), the 80GB A100 provides enough VRAM to load the model fully (possibly with minor quantization for 65B). The impact on model size is straightforward: more GPU memory allows running larger models or using higher precision. A100’s large HBM enables running high-parameter-count models locally without resorting to CPU paging (which would dramatically slow down inference). A rough sizing calculation is sketched after this list.
- Caching and Memory Hierarchy: As noted, A100 has a 40 MB L2 cache and 192 KB L1/SM. This hierarchy is designed to stage working data (model weights, intermediate activations, attention masks, etc.). The large L2 can hold chunks of the model – for instance, recently used layers or key-value caches for attention – which reduces repeated HBM traffic. NVIDIA indicated the enlarged L2 cache can significantly speed up workloads with smaller batch sizes or partial reuse, which is often the case in inference (processing one or few tokens at a time). When generating one token at a time (auto-regressive generation), the reuse of weights per token is limited, but attention key/value reuse grows with sequence length – caches like L2 help in those scenarios. A100 also supports new L2 residency controls (via CUDA APIs) that let software pin certain data in L2 cache, which advanced inference frameworks can use to keep frequently accessed data (e.g. layer norms or prompt embeddings) close to the SMs.
- Memory Compression: The Ampere architecture introduced a feature called Compute Data Compression in the memory subsystem. When data patterns are amenable (such as unstructured sparsity or repeated values), the A100 can compress data on the fly in L2 or on DRAM transfers. This can provide up to 4× effective bandwidth and 2× effective L2 capacity in ideal scenarios. For instance, if an activation matrix has many zeros (common after ReLU or in sparse attention), the GPU might transfer it in compressed form. For LLMs, this could help in attention score matrices or other sparse patterns. While it’s hard to quantify in general, this feature means A100’s 1555–2039 GB/s raw bandwidth can stretch further for many inference workloads, mitigating memory bottlenecks.
- Error Correction and Reliability: All A100 memory (HBM and caches) supports ECC for reliability (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). This is mostly a data center requirement, but it ensures that large models (which use virtually all the memory) have protection against memory bit flips. This is important in long-running inference tasks for correctness.
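To make the capacity arithmetic above concrete, here is a rough sizing helper. The 20% overhead factor and the per-token KV-cache formula (2 × layers × hidden size × bytes per value) are simplifying assumptions, not measured values, and the "70B, 80 layers, 8192 hidden" example shape is merely Llama-like for illustration.

```python
def model_vram_gb(params_b: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    """Rough VRAM needed for the weights alone, plus a fudge factor for
    activations/workspace. params_b is the parameter count in billions."""
    return params_b * 1e9 * bytes_per_param * (1 + overhead) / 1e9

def kv_cache_gb(n_layers: int, hidden: int, seq_len: int, bytes_per_val: int = 2) -> float:
    """Approximate KV-cache size for one sequence: 2 (K and V) x layers x hidden
    values per token, stored in FP16 by default."""
    return 2 * n_layers * hidden * seq_len * bytes_per_val / 1e9

# Example: a hypothetical 70B model with a Llama-like shape (80 layers, 8192 hidden).
print(model_vram_gb(70, 2))          # FP16 weights: ~168 GB -> needs multiple GPUs
print(model_vram_gb(70, 1))          # INT8 weights: ~84 GB with 20% headroom -> tight on one 80 GB A100
print(model_vram_gb(70, 0.5))        # 4-bit weights: ~42 GB -> fits comfortably on one 80 GB A100
print(kv_cache_gb(80, 8192, 4096))   # ~10.7 GB of KV cache at 4K context (per sequence)
```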
In summary, A100’s memory subsystem – massive HBM2(e) capacity + bandwidth, large caches, and compression – is built to handle gigabyte-scale models with high throughput. For local inference, this translates to less time spent waiting on data and more time computing, keeping those tensor cores busy. Users running local LLMs benefit by being able to load whole models into GPU memory and stream tokens without saturating the memory interface.
Performance Benchmarks on LLM Workloads
The true measure of these GPUs is how they perform on real large-model inference tasks. The A100 series, as a flagship data-center GPU line, excels at transformer-based models. Key observations from available benchmarks:
- Transformer Inference Throughput: The A100 can achieve very high token generation rates, especially when using lower precision and batching. For example, with a relatively small 6-7B model (Mistral 7B, Llama-2 7B, etc.), a single A100 can generate about 20–70 tokens per second depending on precision. A user reported ~23 tokens/s using FP32 on a 7B model (Inference for a 7B model on A100 takes too long? - Beginners), which jumps to ~75 tokens/s with optimized settings (using GPU memory efficiently) (A100 GPU generate 11.30 tokens/s with llama.cpp #2641 - GitHub). Dell’s engineers demonstrated that with optimized batching, a single A100 can reach ~139 tokens/sec on Llama-2 7B while maintaining low latency (Conclusion | Llama 2: Inferencing on a Single GPU | Dell Technologies Info Hub). This shows how throughput can scale with careful optimization. In batch mode (serving many queries at once), the numbers go much higher: one report found that 1,771 tokens/s total could be generated on an A100 when using a batch of 32 prompts (amortizing overhead) (Unlocking the full power of NVIDIA H100 GPUs for ML inference with ...). Essentially, the A100 can either generate one sequence very fast or many sequences in parallel extremely fast, due to its enormous compute.
- Latency vs Throughput: For single-stream generation (one prompt at a time), the A100 40GB and 80GB offer low latency per token. Measured end-to-end, an A100 can often output the next token in under 20 milliseconds, i.e. interactive response times of a few tens of milliseconds per word. If each user only needs about 10 tokens/sec, even one A100 can serve dozens of users concurrently (Benchmarking NVIDIA GPU Throughput for LLMs and Understanding GPU Configuration Choices in the AI Space | Dell Technologies Info Hub). However, maximizing throughput (tokens/sec) often involves increasing batch size or concurrency, which the A100 handles well thanks to MIG and its scheduling – e.g. it could serve 50+ concurrent requests at 40 tokens/sec each in a multi-user chatbot setup (Benchmarking NVIDIA GPU Throughput for LLMs and Understanding GPU Configuration Choices in the AI Space | Dell Technologies Info Hub). A rough bandwidth-based estimate of single-stream token rates is sketched after this list.
- Comparative Performance (vs Previous Gen): The A100 offers a big jump over the prior V100 (Volta). NVIDIA reported up to 6× higher BERT-Large training throughput and 7× higher BERT inference throughput with A100 ([PDF] DEEP LEARNING PERFORMANCE GUIDE - M Computers s.r.o.). Specifically for inference, using MIG, one A100 could handle 7 simultaneous BERT-Large inference streams, yielding ~7000 sentences/sec, versus ~1000/sec on a V100 ([PDF] DEEP LEARNING PERFORMANCE GUIDE - M Computers s.r.o.). Even without MIG, A100’s raw speed on transformer layers is about 2–3× V100 at FP16, and even larger gains (4–7×) when using INT8 or sparsity. Compared to CPU-only inference, the gap is even wider: on conversational AI models like BERT, an A100 can be ~249× faster than a high-end CPU in throughput (NVIDIA A100 Tensor Core GPU). This kind of speedup is what enables deploying large models that would be impractically slow on CPU.
- Specific LLM Examples: While vendor benchmarks often use BERT or GPT-2, community tests give insight into larger LLMs:
- GPT-3 class models (175B): These models cannot fit on one A100, but in multi-GPU configs the A100 is the workhorse for GPT-3. For instance, GPT-3 175B inference typically uses an 8× A100 (8×80GB = 640GB) server to serve responses. In such a configuration (like NVIDIA’s DGX A100), GPT-3 can generate dozens of tokens per second. A single A100 80GB can hold only a slice of the model – roughly a quarter of it in FP16, or most of it with 4-bit weights (175B × 0.5 bytes ≈ 88GB). So while single-GPU GPT-3 is not practical, A100s are used in clusters to host these massive models.
- GPT-Neo/GPT-J (6B to 20B): These open models fit in 40GB easily. Reported speeds on A100 for GPT-J 6B are in the tens of tokens/sec range at FP16. With 8-bit quantization, even a 20B model can run on A100 40GB with >10 tokens/sec generation speed. Quantization and TensorRT can further boost these numbers, sometimes doubling throughput.
- Llama-65B: A 65B model is roughly 130GB in FP16, so it won’t fit on one 80GB GPU without compression. However, many users run Llama-65B on two A100 80GB (splitting layers), or by using 4-bit quantization on a single 80GB (fits ~65B * 0.5 bytes/param ≈ 32.5GB). In such cases, generation speeds of a few tokens/sec are achieved. Not blazing fast, but remarkable for such a large model on one card. Multi-GPU scaling (e.g. 2–4 A100s) can bring 65B up to ~10+ tokens/sec with proper parallelism.
- Throughput vs Sequence Length: It’s worth noting that as sequence length grows (e.g. generating long outputs or using long prompts), the self-attention operation scales quadratically and can become a bottleneck. A100’s strong tensor core performance helps here, and the large memory allows caching all past key/value tensors. Still, very long sequences (e.g. 2K+ tokens) will see some slowdown on any GPU. Techniques like sparse attention or retrieval can help, but those are algorithmic. A100’s role is simply to execute whatever the model demands, as fast as possible.
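As referenced above, single-stream decoding is usually memory-bandwidth bound: every new token requires reading essentially all of the weights once. The sketch below gives a crude upper bound under that assumption, ignoring KV-cache reads, dequantization cost, and kernel overheads; it is an illustrative estimate, not a benchmark.

```python
def max_tokens_per_s(params_b: float, bytes_per_param: float, mem_bw_gb_s: float) -> float:
    """Upper bound on single-stream decode rate if every token must stream
    all weights from HBM once (ignores KV cache, overlap, and overheads)."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return mem_bw_gb_s * 1e9 / weight_bytes

# A100 80GB SXM: ~2039 GB/s; A100 40GB: ~1555 GB/s
print(max_tokens_per_s(7, 2, 2039))    # 7B FP16:  ~146 tokens/s upper bound
print(max_tokens_per_s(13, 2, 2039))   # 13B FP16: ~78 tokens/s
print(max_tokens_per_s(70, 0.5, 2039)) # 70B 4-bit: ~58 tokens/s (dequant/compute lower this)
```

Measured figures (such as the ~75 and ~139 tokens/s numbers quoted above) sit below these bounds, as expected; batching raises aggregate throughput because each weight read is reused across many sequences.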
In summary, for typical LLM inference tasks (few hundred tokens generated per query, moderate batch sizes), an A100 can deliver high throughput and low latency. It is currently a go-to GPU for deploying large models: even the popular ChatGPT (based on GPT-3.5) in its earlier versions was reportedly served on clusters of A100s. Whether it’s serving many small requests (where MIG shines) or hammering out a giant paragraph from a single prompt, the A100’s mixture of compute and memory makes it excel in the LLM inference domain.
Thermal and Power Efficiency Under LLM Loads
Running large models pushes GPUs to their limits, so power and thermal characteristics are important for sustained performance:
- Power Draw: The A100 40GB PCIe has a TDP of 250 W, the 80GB PCIe is 300 W, and the SXM4 modules are officially rated at 400 W TDP (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). In practice, during LLM inference, the power usage will depend on how heavy the compute is. For example, generating one token at a time (low batch, small matrix multiplies) might not fully load the GPU, and power draw could be lower (~150–200W). Conversely, batched inference or multi-stream usage can utilize the GPU near 100%, approaching that TDP. Tests have shown A100 PCIe cards often draw ~200–250W in mixed INT8/FP16 inference scenarios. The SXM4 80GB, with higher clocks, can draw ~300–400W when fully utilized by a large batch of transformer operations. Notably, the SXM form factor allows higher power budgets (and thus higher base clocks) (NVIDIA A100 SXM4 80 GB Specs | TechPowerUp GPU Database). Some data center systems even permit the A100 80GB SXM to run at 450–500W with aggressive cooling for extra performance headroom (NVIDIA A100 PCIe vs SXM4 Comparison and Use Cases in 2024). This means the 80GB SXM can sustain a 1275 MHz base clock (vs ~1100 MHz on the 40GB) and keep memory at 3.2 Gbps, boosting throughput ~5–10% at the cost of ~100 W more power.
- Thermal Management: All A100s are passively cooled devices (no on-board fan), relying on server chassis airflow or water cooling. The PCIe cards have large heat sinks and require high airflow to dissipate 250–300W. Under continuous LLM load, they will run hot – typically ~70–80°C GPU core temperature – but are designed for it. The SXM modules are often in HGX boards with either massive air cooling or direct cold plate (liquid) cooling. These keep the thermals in check even at 400–500W. Throttling is rarely an issue as long as the system is built for the GPU’s thermal design point. In a workstation scenario (less common for A100), one must ensure adequate cooling, as these cards will dump a lot of heat when running LLM inference nonstop.
- Efficiency (Perf per Watt): The A100, despite high absolute power, is actually quite power-efficient given the work it does. Measured in terms of inference throughput per watt, A100 is far ahead of previous GPUs and CPUs. For instance, against V100, A100 delivered about 3× the inference throughput at similar or slightly higher power, yielding ~2–2.5× better performance/W. When using INT8, the efficiency is even more pronounced. However, newer GPUs like the H100 improve on this further (several times faster than A100 on transformer inference within a moderately higher power envelope). Still, for the Ampere generation, A100 has excellent perf/W for large models – it was top of its class until Hopper arrived. In MLPerf Inference results, A100 achieves around 15–20 sequences/sec per watt in BERT, which is several times better than T4 or CPU alternatives.
- Dynamic Power and Clocks: The A100 GPUs have power management that can adjust clocks based on load and temperature. In practice, for consistent LLM inference (which is a fairly steady, compute-bound workload), the A100 tends to run at a steady high clock. The “Boost clock” of ~1410 MHz is often sustained if cooling and power allow. If a model doesn’t fully utilize all SMs (e.g., a very small batch on a huge GPU), the GPU may not hit max power and will have thermal headroom – it might then boost memory clocks or hold a high core clock at low SM utilization, not using its full power budget but drawing relatively little overall. To maximize efficiency, users can also lock the SM frequency or use MIG to partition the GPU so that each slice runs nearer to full utilization. A small monitoring sketch follows this list.
- Performance-per-Watt vs Alternatives: In a local inference context, one might consider using multiple smaller GPUs vs one A100. Often, one A100 80GB can replace multiple consumer GPUs because of its memory size. This can actually be more power-efficient: e.g., running a 40B model on one A100 80GB (300W) versus across two 24 GB RTX 3090s (~350W each, 700W total) – the single A100 is far more efficient in both power and performance (and simpler to manage). This efficiency stems from the tensor cores and memory advantage.
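To see what the card actually draws under your workload (rather than relying on the TDP figures above), a small NVML polling loop can log power, temperature, SM clock, and utilization while inference runs. The sketch below assumes the `nvidia-ml-py` (`pynvml`) package and that the A100 is GPU index 0.

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the A100 is GPU 0

try:
    for _ in range(10):  # sample for ~10 seconds while inference runs elsewhere
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0      # reported in milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        print(f"{power_w:6.1f} W  {temp_c:3d} C  {sm_mhz:4d} MHz SM  {util:3d}% util")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```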
Overall, the A100’s high power draw is the price for its performance. In data centers, the performance per watt is critical, and A100s have proven their worth by delivering unmatched throughput for their power envelope during the Ampere generation. For a local setup, an A100 will require robust PSUs and cooling, but it will also deliver results that would otherwise need several lower-end GPUs, often still at lower total wattage.
Comparative Analysis Within the A100 Series
Within the A100 lineup, the primary differences are the memory size, form factor, and power envelopes – the GPU chip (GA100) is the same. Here’s how the variants compare:
- 40GB vs 80GB (Memory and Bandwidth): The 80GB A100 doubles the HBM capacity, which is a direct benefit for large models. If your model fits in 40GB, the extra memory might not improve performance (though it could allow larger batches or longer sequences before memory runs out). However, many LLM users find 40GB limiting for >30B models, whereas 80GB comfortably handles up to ~65B (or more with 8-bit). The 80GB also has faster memory (HBM2e) – ~25% higher bandwidth (NVIDIA A100 40GB vs 80 GB GPU Comparison in 2024 - DataCrunch). In practice, memory bandwidth can be a bottleneck for very large matrix multiplies and attention layers, so the 80GB can be measurably faster even on identical models simply because it feeds the cores better. For example, the 80GB version has been measured ~10% faster on certain large-network inference due to this bandwidth boost (NVIDIA A100 PCIe vs SXM4 Comparison and Use Cases in 2024). If the model is small (a few billion params), both versions perform almost the same (compute-bound).
- PCIe vs SXM4 (Form Factor): The SXM4 (mezzanine module) A100s, used in HGX A100 server boards, have advantages in power and GPU interconnect. They allow up to 400W (and in some cases 500W) of power, letting the GPU sustain higher base clocks (1275 MHz base on the 80GB SXM vs 1065 MHz on the 80GB PCIe) (NVIDIA A100 SXM4 80 GB Specs | TechPowerUp GPU Database) (NVIDIA A100 PCIe 80 GB Specs | TechPowerUp GPU Database). This means an SXM A100 can be ~5–15% faster in raw compute than the PCIe card – both are rated at 312 TFLOPS dense FP16, but the higher-power SXM part sustains its boost clocks more consistently under load. More concretely, NVIDIA and third parties note the SXM A100 outperforms the PCIe A100 in every metric (memory bandwidth, interconnect, and slightly in compute) at the cost of higher power draw (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison). The PCIe version, however, is more accessible and can be used in standard servers or workstations.
- NVLink and Multi-GPU: Another key difference – the PCIe A100s have NVLink connectors that allow pairing two cards (with 3 bridges giving 600 GB/s GPU–GPU) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). But to scale beyond 2 GPUs, PCIe A100s rely on standard PCIe communication (which is much slower, 64 GB/s) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). The SXM A100s, by contrast, are typically installed 4 or 8 on a board with an NVSwitch fabric connecting them all at full NVLink speed (each GPU has 12 NVLink 3 links, 600 GB/s aggregate) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison). For multi-GPU training or inference of one model, the SXM configuration (like in a DGX A100 with 8 GPUs fully interconnected) is vastly superior. For local inference of extremely large models (splitting layers across GPUs), an SXM system with NVSwitch will outperform a PCIe system where GPUs are only weakly connected. Within our scope of single-GPU or small-scale local use, this mostly means: if you plan to use multiple A100s together, the SXM form (in an HGX server) will scale more efficiently with less communication bottleneck.
- Thermals and Clock Consistency: The SXM modules often run cooler due to advanced cooling solutions (e.g. water-cooled cold plates), which helps sustain max clocks under heavy load. The PCIe cards may see slight clock throttling if running with less optimized airflow. However, both are data-center grade and designed for full-throttle operation. Lenovo’s specifications show the 80GB SXM offers ~5% higher memory bandwidth (2,039 vs 1,935 GB/s) and a much higher power limit (up to 500W vs 300W) than the PCIe card (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). In practice, a well-cooled PCIe A100 and an SXM A100 will perform within ~5% of each other for a given model (with the SXM usually edging ahead).
- Cost and Accessibility: Not a performance metric, but worth noting: A100s are very expensive (~$10k-$15k+ each for 40GB, and more for 80GB). The PCIe models can sometimes be found on secondary markets (from decommissioned servers) and can be put into a desktop machine with a suitable PSU and cooling, making them attractive to researchers or enthusiasts aiming to run big LLMs locally. The SXM modules, however, require a compatible server motherboard (like the HGX carrier board) – they are not standalone usable in a normal PC. Thus, for “local” (non-datacenter) use, the PCIe A100 is typically the only viable choice. It delivers essentially the same performance as the SXM aside from the noted differences.
Within the A100 series itself, all models have the same core GA100 capabilities. So an 80GB doesn’t have more CUDA cores than a 40GB; a PCIe doesn’t have fewer SMs than an SXM. The choice comes down to memory (40 vs 80) and use case (standalone vs multi-GPU scaling). For strictly local single-GPU inference, an A100 80GB PCIe would be the top choice (max memory, easy to deploy). If one has a server with multiple A100s, the SXM variant unlocks the full potential of scaling to serve the largest models or highest loads. But any A100 variant will substantially outperform smaller GPUs or older ones on LLM tasks. Even the “smallest” A100 (40GB PCIe) is a compute monster with more memory and AI horsepower than almost any client GPU.
Software Optimization and Compatibility
To extract maximum LLM performance from A100 GPUs, software and framework support is critical – and fortunately A100 is well-supported across the board:
- CUDA and Libraries: A100 uses NVIDIA’s CUDA platform (Compute Capability 8.0). It is fully supported by CUDA 11.x+ and all associated libraries. Deep learning frameworks like PyTorch and TensorFlow have optimized kernels that detect Ampere GPUs and use tensor cores automatically (e.g. `autocast` in PyTorch will use FP16/TF32 on A100). NVIDIA’s cuDNN library is tuned for Ampere, enabling fast transformer ops (GEMMs, layernorms, etc.). The GPU’s compute features (BF16, TF32, etc.) are exposed via CUDA APIs for developers who want fine control.
- TensorRT and ONNX Runtime: For deployment, NVIDIA TensorRT can take trained models and optimize them specifically for A100 (fusing layers, scheduling execution to tensor cores, etc.). TensorRT supports FP16 and INT8 calibration on A100, which can dramatically speed up inference. ONNX Runtime also has an execution provider for TensorRT and one for CUDA that leverage A100’s features. This means many open LLMs (if exported to ONNX or loaded via ORT) can run with near-max efficiency on A100 by using these runtimes. For example, using TensorRT INT8 on GPT-2 can more than double throughput relative to FP16.
- Software for Quantization: A100’s support for INT8/INT4 is only useful if models can be quantized to those precisions. There are tools like NVIDIA’s PTQ and QAT (post-training quantization and quantization-aware training) in TensorRT and Open Neural Network Exchange (ONNX) workflows that allow converting models to INT8 with minimal accuracy drop. Open-source libraries (e.g. Hugging Face’s `bitsandbytes`, `transformers` with quantization, or Intel’s Neural Compressor, which also works on NVIDIA via ONNX) enable 8-bit and even 4-bit quantization for transformers. These can run on A100 hardware. Notably, research projects like GPTQ (for 4-bit quantization) have been used on A100 to run 30B+ models in 4-bit. The key is that the hardware supports it natively – so any software that can produce quantized weights finds an ideal execution environment in A100 (a hedged loading example is sketched after this list).
- Framework Ecosystem: Virtually all major ML frameworks have been tested on A100: PyTorch, TensorFlow, JAX, MXNet, etc. all run out of the box and utilize the GPU’s capabilities. PyTorch with `torch.cuda.amp` (automatic mixed precision) is commonly used to accelerate training and inference on A100 by using FP16 where appropriate. Even higher-level tools like Hugging Face Transformers have integrations to ensure operations run on GPU efficiently (for instance, the Accelerate and bitsandbytes libraries will automatically use FP16 and INT8 on supported GPUs like A100).
- Multi-GPU Software: If using multiple A100s, NVIDIA’s NVLink and the NCCL library allow efficient data transfers and all-reduce operations. For large models that are sharded, frameworks use NCCL to synchronize. A100’s NVSwitch (in HGX systems) is fully leveraged by NCCL for near-linear scaling. From an inference perspective, libraries like DeepSpeed and Megatron-LM can split model layers across GPUs, and A100’s high-bandwidth links will ensure minimal degradation.
- Compatibility and Drivers: A100 being a data center GPU requires NVIDIA’s data center drivers. It doesn’t output a display, but that’s irrelevant for compute. It supports PCIe Gen4, which most modern CPUs/chipsets do as well – so installation is straightforward. One should use recent NVIDIA drivers (450+ series) and CUDA 11 or newer to ensure Ampere support. As of now, even CUDA 12 supports A100 fully.
- MIG Utilization: Software can also explicitly use MIG on A100. For example, Kubernetes or virtualization stacks (like NVIDIA GPU Operator or MIG-aware schedulers) can carve an A100 into multiple instances such that different containers or VMs each get a MIG slice. This is more relevant in cloud or multi-user environments. For local LLM work, MIG might be used if one wants to run, say, 2 different models simultaneously on one A100 (e.g. one 5GB instance running a smaller model and the rest running another task).
- Future Proofing: The A100 supports features like CUDA Graphs (for reducing launch overhead) and has extensive monitoring support (through DCGM – Data Center GPU Manager – one can track SM utilization, tensor core utilization, etc.). These can be used to fine-tune performance. Additionally, frameworks like Triton Inference Server from NVIDIA can serve models on A100 with dynamic batching – a popular approach to maximize throughput by combining incoming requests. All these software pieces treat A100 as a first-class citizen, given it was NVIDIA’s flagship for the 2020–2022 period.
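As referenced in the quantization bullet above, a common way to exploit A100’s INT8 path from Python is 8-bit weight loading via `bitsandbytes` through the `transformers` integration. Exact argument names vary by library version and the model id below is a placeholder, so treat this as a sketch of the workflow rather than a definitive recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "my-org/my-13b-model"  # placeholder model id

# 8-bit weight quantization via bitsandbytes; roughly halves memory vs FP16
# and lets larger models fit on a single A100. (4-bit is also available in
# newer library versions.)
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",   # places weights on the visible GPU(s) automatically
)

prompt = tok("Quantized inference on an A100:", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**prompt, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```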
In essence, anyone doing LLM inference on A100 will find a mature software ecosystem: from low-level CUDA libraries to high-level model servers, everything has been optimized to make use of A100’s capabilities. The user primarily needs to choose the right precision (FP16/INT8, etc.), possibly use NVIDIA’s profiling tools to identify bottlenecks, and perhaps use TensorRT or other optimizers for maximum speed. The heavy lifting (ensuring the code uses tensor cores, etc.) is largely handled by the libraries.
Scaling and Multi-GPU Considerations for LLMs
While this report focuses on single-node performance, large language models often benefit from multi-GPU scaling – and the A100 is designed to scale:
- NVLink/NVSwitch: As mentioned, A100 SXM4 GPUs in the same server are connected via NVSwitch (in 8-GPU HGX A100 systems). This provides an all-to-all bandwidth of 600 GB/s between any pair of GPUs (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison). For distributed inference (model parallelism), this high-speed interconnect is crucial. For example, if a 175B model is split across 8 GPUs, during each forward pass the GPUs need to exchange activation tensors; NVSwitch allows this with relatively low latency. PCIe A100s without NVSwitch rely on PCIe 4.0 (32 GB/s each direction per link, or ~64 GB/s total) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). This is an order of magnitude slower than NVLink, meaning multi-GPU model splits on PCIe can bottleneck on communication. One mitigation on PCIe is to use fewer, larger GPU splits (e.g. 2 GPUs with NVLink bridge for a 175B model in 2-way tensor parallelism – but that still may require CPU memory offloading for the rest).
- Multi-Node (Networking): The A100 supports NVIDIA’s GPUDirect RDMA and works with InfiniBand networking to scale across nodes. For enormous models or deployments (like multi-node model serving), A100s can use RDMA to directly send data between GPUs in different servers, bypassing the CPU. In an HPC or enterprise context, one might use clustering software (like NVIDIA NCCL with InfiniBand) to use A100s in tandem for serving an LLM that exceeds one machine’s capacity. For local setups, this is less common, but it’s how ultra-large models (e.g. Megatron 530B) would be deployed – spread over dozens of A100s.
- CPU–GPU Transfer Bottlenecks: In inference serving, data often needs to move from CPU (which receives the request) to GPU (for processing) and back. The A100’s PCIe 4.0 interface (16 lanes) provides up to ~64 GB/s of host bandwidth (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press), which is generally sufficient for copying input tokens and output tokens (which are small compared to model weights). However, if one tries to stream model weights from CPU RAM to a GPU (because the model doesn’t fit entirely on GPU), that PCIe bandwidth becomes a huge bottleneck. For local LLM inference, this means fitting the model in GPU memory is critical – otherwise, the PCIe bus will severely slow down inference (due to constant paging of layers). Tools like NVIDIA’s Unified Memory can technically page memory in/out, but performance is orders of magnitude worse. Thus, effective scaling is usually scale-out with more GPUs, not relying on CPU memory. The A100 80GB’s ability to hold models in GPU RAM is a big advantage here.
- MIG in Multi-Instance Serving: In a scenario where you have one physical A100 but want to serve multiple models (each smaller), MIG allows dividing it. This is a form of scaling “out” logically. Each MIG instance can run an inference in parallel, up to 7 in total. Communication between MIG instances is isolated (they behave like separate GPUs to the software). So MIG isn’t used to accelerate one inference, but to handle several in parallel. For example, a user could run 7 separate language models (each using ~5 GB and 1/7 of the GPU’s SMs) concurrently on one A100 40GB (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press). This would maximize hardware utilization if each model is lightly loaded. There’s some overhead (MIG partitions are static and cannot share resources), but it’s a great way to scale throughput for many small models or users.
- Parallelism Strategies: For huge LLMs, one can use Tensor Parallelism (split each matrix across GPUs) and Pipeline Parallelism (split different layers to different GPUs). A100’s NVLink helps in both – tensor parallelism needs fast all-reduce for gradients or outputs, which NVLink/NVSwitch provides; pipeline parallelism needs to send activations forward and backward between GPUs (also benefitting from NVLink speed). Many large-model training frameworks (Megatron, DeepSpeed) were designed around V100/A100 clusters. For inference, pipeline parallelism can be used to split a model like GPT-3 between two A100 80GBs (first half of the layers on GPU1, second half on GPU2). The sequence data (activations) will pass between the two at each pipeline stage. The available bandwidth (~600 GB/s over an NVLink bridge, or ~64 GB/s over PCIe) determines how seamless this is. In practice, 2 A100 80GB on PCIe can serve a 175B model but with significant slow-down due to limited PCIe throughput – whereas 8 A100 SXM with NVSwitch can do it much more smoothly. A minimal sharding sketch follows this list.
- Scaling Limits: While A100s scale well, there are limits. If a model’s layers must be split into too many pieces, communication overhead can erode gains. For instance, trying to run a 13B model split over 4 GPUs might actually run slower than on one GPU, if each piece is too small and communication dominates. Generally, A100 multi-GPU inference is most beneficial for models that cannot fit on one GPU at required precision. If it fits on one, keeping it there avoids inter-GPU comm altogether. That’s why the 80GB model is so valuable: it reduces the need to split models. If splitting is needed, using fewer GPUs (2 or 4) is usually optimal for inference to minimize latency hops.
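For the parallelism strategies just described, the lowest-friction route in the Hugging Face stack is layer-wise sharding with `device_map="auto"` (via Accelerate), which splits a model that exceeds one A100’s memory across whatever GPUs are visible. The sketch below assumes two A100 80GB cards and a placeholder checkpoint; dedicated engines (TensorRT-LLM, DeepSpeed-Inference, Megatron) implement true tensor parallelism and usually scale better, but need more setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "my-org/my-70b-model"   # placeholder; ~140 GB of FP16 weights assumed

tok = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" shards layers across all visible GPUs (e.g. 2x A100 80GB).
# This is pipeline-style placement: activations hop between GPUs over
# NVLink (fast) or PCIe (slower), as discussed above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB"},  # leave headroom for the KV cache
)

inputs = tok("Sharded across two A100s:", return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```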
In summary, the A100 series provides excellent scaling options, especially in SXM form with NVLink. For local use, one might not have an NVSwitch fabric, but even PCIe A100s can be linked in pairs with NVLink bridges to double memory and moderately improve bandwidth. Multi-GPU setups require more complex software (model parallelism, etc.), but A100 has been the backbone of many such deployments. It’s safe to say that if your model or workload grows beyond one A100, you can add another and there is a clear path (software-wise and hardware-wise) to make use of it – something that cannot be said as easily for certain other hardware. A100’s design as part of NVIDIA’s HGX platform means scaling was a core goal.
Limitations and Considerations
Despite its capabilities, there are some practical limitations and points to consider when using A100 GPUs for LLM inference locally:
- Memory Constraints for Giant Models: Even 80GB can be insufficient for the absolute largest models. For example, GPT-3 175B in 16-bit would require ~350GB of GPU memory – far beyond a single A100. Techniques like model parallelism or offloading are required, which add complexity and latency. Users must often quantize (8-bit or 4-bit) to fit a model on a single A100, which can impact accuracy (though usually only slightly). This is a fundamental limit – if the model is too big to fit, distributed inference or a newer GPU with more memory may be needed. Always check model size against GPU memory and plan for compression or splitting if needed.
- Bandwidth Bottlenecks on Multi-GPU PCIe Systems: As discussed, if you don’t have an NVLink/NVSwitch setup, trying to run one model across multiple PCIe A100s will go over PCIe bus. That 64 GB/s can bottleneck during attention or model syncing, leading to sub-linear scaling. In worst cases, it can bottleneck generation to the speed of a single GPU or worse. So, while multiple A100s can team up, one should temper expectations or ensure the model is partitioned in a way that minimizes cross-GPU communication (for instance, running different batch requests on different GPUs rather than splitting single requests).
- Latency vs Throughput Trade-off: A100 excels at throughput, especially with larger batch sizes. However, for interactive use (one prompt from one user at a time), you may not utilize the GPU fully. It might generate tokens faster than you need, leaving cycles idle. This is why techniques like prompt batching or concurrency are used – but that introduces some latency as you accumulate work. There’s a balance: maximizing tokens/s on A100 might mean waiting to batch queries (not ideal for real-time). Conversely, serving real-time means low batch, thus lower utilization. A100’s MIG can help here by partitioning the GPU to handle multiple streams in parallel in hardware. But still, a single large model’s single stream won’t use 100% of an A100 all the time (especially if the model is smaller than the GPU). This isn’t a flaw per se, just an operational consideration.
- Energy and Heat: If you’re running an A100 locally (say in an office or home lab), note that it can consume a lot of power and output a lot of heat. Under continuous load it can easily use 200–300W (PCIe) or more (SXM), which will spike electricity usage and require serious cooling. It’s akin to running a high-end gaming PC at full tilt 24/7. Ensure your environment can handle the thermal load (a single A100 can heat up a small room) or have proper server cooling setup. Additionally, if multiple GPUs are used, consider the power supply and circuit requirements – 2×300W GPUs + CPU etc. can approach ~800W draw, which in some locales might need a dedicated circuit.
- Cost and Availability: A100s are expensive and not as readily available as consumer GPUs. While this doesn’t affect technical performance, it’s a limitation for local users. It might be hard to justify or obtain an A100 for personal projects. Many individuals instead use an RTX 4090 or similar with 24GB of VRAM – far less than 80GB, but much cheaper. That said, A100s sometimes appear on used markets or via cloud rental. If renting, an A100 80GB instance can cost anywhere from roughly $1.50 to several dollars per hour depending on the provider ($1.42/hr A100 80GBs | Cheap, On-Demand Cloud ... - TensorDock), so efficiency matters.
- Compute Capability Limits: A100 is extremely capable, but it lacks some of the newer features of Hopper (H100) – like FP8 support or the Transformer Engine optimizations. It also doesn’t have hardware-accelerated ray tracing or graphics-specific units (irrelevant to LLMs). The point is that A100 is focused on general matrix math. There might be future models or techniques (for example, if FP8 quantization becomes popular for transformers) that A100 can’t accelerate as well as newer GPUs. However, as of now A100 covers the key precisions used in LLM inference.
- Software Setup: Running LLMs at full speed on A100 requires using the right software versions (e.g. enabling TF32 or FP16). If one uses an older framework or doesn’t enable the Tensor Cores, they might see much lower performance. It’s important to use things like `torch.cuda.amp` in PyTorch, or optimized model implementations (like Hugging Face Accelerate big-model inference, or the FasterTransformer library for GPT). Without these, the model might inadvertently run in FP32 on regular cores, not using A100’s strengths. This is more a user-side caveat: the hardware provides the tools, but you must use them. Thankfully, documentation and community resources for A100 optimization are plentiful (a minimal setup sketch follows this list).
- Concurrent GPU Usage: If you plan to use the A100 for both training and inference or multiple tasks, note that a single large LLM inference can occupy most of the memory, leaving little room for other tasks. NVIDIA’s Multi-Process Service (MPS) can allow multiple processes to share the GPU, but with a giant model loaded, that’s moot. MIG can split memory, but each MIG slice is then smaller – you couldn’t run the same large model on two MIG instances at once, for instance. So, effectively one A100 = one big model at a time (unless running many smaller ones). This is fine for dedicated inference servers, but in a research environment you might need to offload or unload models to switch tasks (which can take time when loading 80GB!). Having multiple GPUs or using checkpointing becomes useful if rapid switching between models is needed.
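Following the software-setup point above, a few lines are usually enough to confirm you are on an Ampere part and that the Tensor Core paths are enabled. The flags below are real PyTorch switches; the particular settings chosen are just one reasonable default, not the only correct configuration.

```python
import torch

# Compute capability 8.0 identifies GA100 (A100).
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"(sm_{major}{minor})")
assert (major, minor) >= (8, 0), "Tensor Core features below assume Ampere or newer"

# Route FP32 matmuls/convolutions through the TF32 Tensor Core path
# (default behavior varies by PyTorch version, so being explicit is safer).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# In recent PyTorch releases this is the preferred single switch for matmuls:
# "high" allows TF32, "highest" forces full FP32.
torch.set_float32_matmul_precision("high")
```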
In conclusion, the NVIDIA A100 series remains one of the most powerful platforms for local LLM inference as of its time. It provides a balanced mix of compute, memory, and throughput that aligns perfectly with the demands of large Transformers. With proper optimization, an A100 can serve models that would otherwise be infeasible to run in real-time. Users just need to be mindful of the few limitations – mainly around memory capacity for the largest models and ensuring they leverage the GPU’s features fully. Looking forward, the next-generation GPUs (H100, etc.) build on this foundation, but the A100 will likely continue to be a workhorse for large-scale AI for years, especially as many are deployed in data centers and available for fractional use (cloud, etc.). For anyone focusing on local inference of LLMs, understanding and utilizing the A100’s strengths is a key advantage.
Sources:
- NVIDIA A100 product overview and performance blog – NVIDIA Ampere Architecture In-Depth, R. Krashinsky et al., May 14 2020 (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) (NVIDIA Ampere Architecture In-Depth | NVIDIA Technical Blog).
- Official NVIDIA A100 specification (NVIDIA Ampere GA100 Architecture Whitepaper).
- Lenovo Press – ThinkSystem NVIDIA A100 GPU Specs, which details A100 40GB vs 80GB, PCIe vs SXM (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press) (ThinkSystem NVIDIA A100 PCIe 4.0 GPU Product Guide (withdrawn product) > Lenovo Press).
- TechPowerUp GPU Database – detailed specs for A100 40GB/80GB (PCIe and SXM) (NVIDIA A100 PCIe 40 GB Specs | TechPowerUp GPU Database) (NVIDIA A100 SXM4 80 GB Specs | TechPowerUp GPU Database).
- Hyperstack Tech Blog – A100 PCIe vs SXM4 Comparison, Jul 2023 (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison) (NVIDIA A100 PCIe vs NVIDIA A100 SXM: A Comprehensive Comparison).
- DataCrunch.io – A100 40GB vs 80GB Comparison, Oct 2024 (NVIDIA A100 40GB vs 80 GB GPU Comparison in 2024 - DataCrunch) (NVIDIA A100 PCIe vs SXM4 Comparison and Use Cases in 2024).
- Dell Technologies Info Hub – Llama2 7B on A100 (Inferencing on Single GPU), Oct 2023 (Conclusion | Llama 2: Inferencing on a Single GPU | Dell Technologies Info Hub).
- Dell Technologies – Benchmarking NVIDIA GPU Throughput for LLMs, Dec 2023 (Benchmarking NVIDIA GPU Throughput for LLMs and Understanding GPU Configuration Choices in the AI Space | Dell Technologies Info Hub).
- Reddit r/LocalLLaMA – user reports on A100 performance (Mistral 7B) (Inference for a 7B model on A100 takes too long? - Beginners).
- Baseten blog – Unlocking H100 for ML inference (A100 batch throughput) (Unlocking the full power of NVIDIA H100 GPUs for ML inference with ...).
- NVIDIA Developer Forums and Neuchips.ai – comments on MLPerf results (249× CPU) (NVIDIA A100 Tensor Core GPU).
- NVIDIA M. Computers PDF – Deep Learning Performance Guide (BERT inference 7×) ([PDF] DEEP LEARNING PERFORMANCE GUIDE - M Computers s.r.o.).