Summary of NVIDIA H100 Series Specifications
The table below summarizes key specifications of NVIDIA’s H100 series data-center GPUs, highlighting the models relevant for local large-language model (LLM) inference. This includes the PCIe card (80GB), the SXM module (80GB), and the dual-GPU NVL variant (2×94GB) designed for LLM serving.
Specification sources: NVIDIA's official product briefs and the Lenovo reference guide (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). FP16/BF16 and INT8 figures marked with * indicate peak throughput with 2:4 structured sparsity enabled (otherwise roughly 50% lower) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). The H100 NVL consists of two H100 GPUs bridged with NVLink, effectively doubling the compute and memory of a single H100 (94 GB per GPU, 188 GB total) (NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models).
Architecture Deep Dive
Hopper Architecture Overview: The H100 is built on NVIDIA’s Hopper architecture (GH100 GPU), fabricated on a customized TSMC 4N 4 nm process with a massive 80 billion transistors (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The full GH100 chip contains up to 144 Streaming Multiprocessors (SMs) arranged in 8 GPU Processing Clusters (GPCs) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). In shipping products, the H100 SXM has 132 SMs enabled, while the PCIe version has 114 SMs (some units are disabled for yield and power reasons) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). Each SM in Hopper contains 128 CUDA cores for FP32 (double the per-SM count of Ampere’s 64) and 4 fourth-generation Tensor Cores (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). Key on-chip memory structures include a large 50 MB L2 cache (in 80GB models) – a 25% increase over A100’s 40 MB – and 256 KB of L1/Shared Memory per SM (33% larger than A100) (Nvidia’s H100: Funny L2, and Tons of Bandwidth) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The H100 integrates 5 or 6 HBM memory stacks on a 5120-bit or 6144-bit bus (depending on model) for extremely high bandwidth (Nvidia’s H100: Funny L2, and Tons of Bandwidth) (Nvidia’s H100: Funny L2, and Tons of Bandwidth). Overall, Hopper’s design prioritizes throughput for matrix-heavy AI workloads and implements new features like asynchronous data transfer engines and enhanced GPU partitioning (MIG) capabilities (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press).
Generational Improvements: Hopper delivers significant upgrades over its predecessor (Ampere A100). The H100 SM is more powerful and efficient – doubling the per-SM throughput for FP32, Tensor Core FP16/BF16, and FP64, and introducing new FP8 capability that yields up to 4× the per-SM throughput for matrix operations (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). In total, the top H100 configuration (132 SMs at higher clocks) provides roughly 3× the raw compute of A100 at the same precision – e.g. ~67 TFLOPS FP32 vs 19.5 TFLOPS, and ~990 TFLOPS FP16 Tensor vs ~312 TFLOPS (no sparsity) – and up to ~2 PFLOPS with the new FP8 precision (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). The Transformer Engine in Hopper introduces mixed FP8/FP16 training and inference for transformers, enabling up to 9× faster training and 30× faster inference on large language models compared to Ampere (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). New DPX instructions accelerate dynamic programming algorithms (useful for genomics, route optimization, etc.) by 7× over A100 (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), although these are less directly relevant to LLM inference. Hopper also features 4th-gen NVLink with higher bandwidth (900 GB/s on SXM) for scaling, second-gen MIG for improved GPU partitioning (up to 7 instances) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), and hardware-accelerated confidential computing features (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). In summary, H100's architecture is tailored to maximize parallel throughput for AI, with more compute units, faster clocks, larger caches, and specialized circuits (Tensor Cores, Transformer Engine) all contributing to a leap in performance and efficiency for LLM tasks.
Figure 1: NVIDIA GH100 "Hopper" full-GPU block diagram (144 SMs, 60 MB L2, 6 HBM stacks). The H100 PCIe/SXM products have 114–132 SMs and 50 MB of L2 enabled (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The design comprises multiple GPCs (green blocks of SMs), a large L2 cache (blue), and HBM memory controllers (gray) interconnected via a high-speed crossbar. This architecture is optimized for parallel compute and high memory bandwidth, both critical for accelerating large AI models.
SM and Tensor Cores: Each Hopper SM is a highly parallel compute unit containing 128 FP32 cores, 64 FP64 units, and 4 fourth-generation Tensor Cores (matrix-multiply units) that handle FP16/BF16, TF32, FP8, INT8, and FP64 matrix operations (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The SM is divided into four processing blocks, each with its own warp scheduler and instruction dispatch unit (issuing 32 threads per clock), to keep the many execution units fed (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). Hopper's 4th-gen Tensor Cores in each SM deliver 2× the matrix math throughput of Ampere's on all previously supported data types, and add support for 8-bit floating point (the E5M2 and E4M3 FP8 formats), which further doubles throughput versus 16-bit on supported operations (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). With FP8, each Tensor Core can perform 4× more FMA operations per cycle than with FP16 on A100, yielding the 4× per-SM increase mentioned (combined with the SM count and clock boosts, this translates to up to 30× inference speedups at the system level on giant transformers) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). The Tensor Cores also support fine-grained structured sparsity (pruning 2 out of every 4 weights) to effectively double throughput when models are pruned accordingly (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). Other SM improvements include the Tensor Memory Accelerator (TMA) for faster asynchronous data movement (allowing data transfers to overlap with computation) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog), and a new Thread Block Cluster feature that lets kernels exploit data locality across neighboring SMs for better cooperation on large blocks of work (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). All these enhancements make the H100 SM a compute powerhouse for matrix-heavy workloads like transformer-based LLMs.
Compute Capabilities and Precision Modes
Supported Precisions: NVIDIA H100 supports a wide range of numeric precisions, from standard FP64 down to 8-bit (with 4-bit weight formats handled in software), to balance speed and accuracy in AI inference. For general-purpose compute and HPC, H100 offers up to 34 TFLOPS of FP64 (double-precision) or 67 TFLOPS with FP64 Tensor Cores using the Hopper-specific FP64 MMA units (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). For AI training and inference, the focus is on lower precisions: FP32 (single-precision) runs at up to ~67 TFLOPS, and the TF32 Tensor Core mode reaches ~495 TFLOPS (989 TFLOPS* with sparsity) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), though in deep learning full FP32 is often bypassed in favor of mixed precision. H100 excels at FP16 and BF16 (half-precision and brain-float) – each Tensor Core can perform matrix ops on FP16/BF16 inputs with FP32 accumulation. The H100 SXM peaks around 990 TFLOPS FP16/BF16 (Tensor Core throughput, no sparsity) or 1.98 PFLOPS with sparsity (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). The PCIe variant hits ~756 TFLOPS FP16 (no sparsity) given its lower SM count and clocks (NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models).
Importantly, Hopper introduces FP8 tensor precision. Each Tensor Core supports FP8 (with FP16 accumulation) at twice the rate of FP16 (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). This yields an incredible ~2 PFLOPS of FP8 compute (no sparsity) on one H100 SXM or ~4 PFLOPS with sparsity (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) – effectively six times the A100’s throughput on matrix ops (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). Using FP8 can dramatically speed up LLM inference if the model can maintain accuracy at that precision (NVIDIA’s Transformer Engine dynamically chooses FP8 for select layers to balance accuracy and performance).
For lower-bit integer operations, H100 supports INT8 through its Tensor Cores. At INT8, an H100 SXM delivers up to 3,958 TOPS (trillions of 8-bit ops per second) with sparsity (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), or around 2,000 TOPS without. This matches the FP8 rate, since both operate on 8-bit data. INT4 Tensor Core math, which Turing and Ampere exposed, is not listed among Hopper's supported Tensor Core precisions, so 4-bit data does not earn a further 2× math speedup on H100. In practice, 4-bit weight quantization is still very attractive for LLM inference: techniques like GPTQ and AWQ (with error compensation) store weights in 4 bits and dequantize them to FP16/INT8 on the fly, roughly halving the memory footprint and bandwidth versus 8-bit and speeding up memory-bound generation on H100-class hardware (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation).
Tensor Operations and Throughput: The Tensor Cores are the primary drivers of H100's AI performance. Per clock, each Hopper SM's four Tensor Cores deliver twice the dense FP16/BF16 matrix math of an A100 SM, and twice that again at FP8/INT8 (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). With 528 Tensor Cores on an H100 SXM running near 1.9 GHz, the raw math throughput is enormous. For example, the H100 SXM achieves about 990 TFLOPS in FP16/BF16 Tensor ops, which jumps to ~1.98 PFLOPS with sparsity (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). For FP8/INT8, it is about 1.98 PFLOPS (no sparsity) or ~3.96 PFLOPS with sparsity (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). These figures represent ideal peak throughput; real-world inference approaches them when batch sizes and model sizes are large enough to fully utilize the GPU. In LLM inference, using FP8 with the Transformer Engine can substantially boost throughput – NVIDIA reported up to 4.6× higher throughput on H100 vs A100 when using FP8 for GPT-style models in TensorRT-LLM, with H100 hitting ~10,000 tokens/s while staying within 100 ms first-token latency (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). Even without FP8, H100's larger core count and faster clocks give it a strong advantage in FP16/BF16 inference over the previous generation.
Sparsity and Structured Pruning: As with A100, the H100 supports 2:4 structured sparsity in weight matrices. This feature, when enabled and if the model’s weight matrices have been pruned to zero-out 50% of entries in the supported pattern, allows the Tensor Cores to skip the zeros and double the effective throughput for those layers (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The performance specs with asterisks in the table (e.g. FP16 1.98 PFLOPS*) assume this sparsity. In practice, not all transformer layers can be sparsified without fine-tuning, but in cases where it’s applied, one can see up to ~1.5–2× speedups. Hopper’s architecture maintains support for this feature across FP16, BF16, TF32 and INT8 Tensor Core operations (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). For unstructured sparsity or dynamic sparsity patterns, H100 relies on general compute and doesn’t get a 2× boost, so this is specifically about the structured pattern learned during training. Future large models may leverage this to fit larger architectures into the same compute budget by skipping zeros.
Memory Subsystem Analysis
HBM Memory and Bandwidth: All H100 models use high-speed stacked memory (HBM) to feed the compute cores. The H100 PCIe and SXM5 80GB models have 5 active HBM stacks of 16 GB each, totaling 80 GB on a 5120-bit bus (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). The SXM uses HBM3 running at ~5.2 Gbps per pin, yielding about 3.35 TB/s of memory bandwidth (NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models) (Nvidia H100 SXM5 GPU). The PCIe card uses the older HBM2e at ~3.2 Gbps, for a still-impressive 2.0 TB/s. In both cases, H100 delivers substantially higher memory bandwidth than the A100 (1.6–2.0 TB/s on HBM2e, depending on variant) (Nvidia Reveals Hopper H100 GPU With 80 Billion Transistors), reducing memory bottlenecks for large models. The special H100 NVL equips 6 HBM3 stacks per GPU (94 GB), enabling the full 6144-bit bus and pushing bandwidth to 3.9 TB/s per GPU. This enormous memory bandwidth is critical for LLM inference, where models can be tens or hundreds of billions of parameters – the HBM can stream model weights and activations at extremely high rates, keeping the compute units utilized. For instance, an 80GB H100 can sustain 2–3+ TB/s of reads, which means it could theoretically read the entire model memory roughly 25–40 times per second. In practice, not all memory accesses are perfectly sequential, but having this headroom keeps the GPU from starving for data.
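As a rough illustration of the bandwidth-bound ceiling on single-stream decoding, the sketch below simply divides memory bandwidth by model size; the simplifying assumption is that each generated token streams every weight exactly once, ignoring caches and activations.

```python
# Rough bandwidth-bound estimate: how many full weight passes per second can the GPU sustain?
def weight_passes_per_sec(bandwidth_tb_s: float, model_size_gb: float) -> float:
    return (bandwidth_tb_s * 1e12) / (model_size_gb * 1e9)

for gpu, bw in [("H100 PCIe (2.0 TB/s)", 2.0), ("H100 SXM (3.35 TB/s)", 3.35)]:
    # 80 GB of FP16 weights (~40B params) vs ~35 GB of 4-bit weights (~70B params)
    for model, size_gb in [("40B @ FP16, 80 GB", 80), ("70B @ 4-bit, ~35 GB", 35)]:
        rate = weight_passes_per_sec(bw, size_gb)
        print(f"{gpu}, {model}: ~{rate:.0f} tokens/s upper bound at batch 1")
```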
Memory Hierarchy and Caches: On top of the raw HBM bandwidth, H100’s on-chip caches significantly accelerate data reuse. The GPU features a large L2 cache (50 MB on 80GB models) that sits between the SMs and HBM memory (Nvidia’s H100: Funny L2, and Tons of Bandwidth) (Nvidia’s H100: Funny L2, and Tons of Bandwidth). This L2 acts as a giant buffer for weights, activations, and context cache (KV cache) in transformer inference. Frequently accessed data can be served from L2 at a much higher bandwidth and lower latency than going out to HBM. Hopper increased L2 from A100’s 40 MB to 50 MB (and the full GH100 design even allows 60 MB, though not used in 80GB variants) (Nvidia’s H100: Funny L2, and Tons of Bandwidth) (Nvidia’s H100: Funny L2, and Tons of Bandwidth). Additionally, each SM has 256 KB of L1 cache/shared memory (unified), up from 192 KB in Ampere, to buffer thread-local data and reduce L2/HBM traffic (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). These caches are especially beneficial for batch inference or multi-stream inference where different sequences may reuse the same model weights – once a weight matrix is fetched from HBM for one token’s computation, it may reside in cache for the next token or next sequence, greatly speeding up subsequent accesses.
Memory Capacity and Model Size: With up to 80 GB (or 94 GB) of VRAM per H100, these GPUs can accommodate very large models locally. For 16-bit precision (FP16/BF16), 80 GB can hold roughly up to a 40 billion parameter model in memory (since 40B * 2 bytes = 80 GB, excluding activation buffers). Using 8-bit weights (via INT8 or FP8 quantization), it can fit models of 70B+ parameters entirely. Indeed, NVIDIA advertises that a single H100 NVL (188 GB total across two GPUs) can serve models up to ~175B parameters (GPT-3 class) without partitioning (Nvidia just took two H100 cards and glued them together • The Register) (Nvidia just took two H100 cards and glued them together • The Register). For local inference use, this means many popular open LLMs (LLaMA-65B, Falcon-40B, etc.) can run on one H100 (possibly with minor quantization). If models exceed GPU memory, offloading can be used, but at a cost to latency. The ample HBM capacity and bandwidth of H100 minimize the need for layer-by-layer CPU offload for models in the 10B–70B range, enabling these to be served entirely from GPU memory for maximum throughput.
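To make those capacity figures concrete, here is a small sketch that checks whether a model's weights fit in a given amount of VRAM at different precisions; the ~10% overhead factor is an assumption, and activation/KV-cache memory is extra.

```python
# Estimate weight footprint at several precisions and check fit against available VRAM.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8/fp8": 1.0, "int4": 0.5}

def fits(params_billions: float, vram_gb: float, overhead: float = 1.10) -> None:
    """overhead ~10% roughly covers CUDA context, workspaces and fragmentation (assumed)."""
    for prec, bytes_per_param in BYTES_PER_PARAM.items():
        weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB
        verdict = "fits" if weights_gb * overhead <= vram_gb else "does NOT fit"
        print(f"{params_billions:>5.0f}B @ {prec:9s}: {weights_gb:6.1f} GB weights -> {verdict} in {vram_gb} GB")

fits(70, 80)    # Llama-class 70B on one 80 GB H100
fits(175, 188)  # GPT-3-class 175B on an H100 NVL pair (188 GB)
```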
Memory Compression & Efficiency: NVIDIA GPUs traditionally employ lossless memory compression for framebuffer data (color/Z compression in graphics), but for compute workloads like LLMs, there isn’t an explicit analogous compression for model weights. Instead, precision reduction (FP16/BF16 vs FP32, or FP8/int8 quantization) is the primary way to compress model data to use less memory. Hopper’s introduction of FP8 effectively acts as a form of compression – storing weights in 8-bit floating point halves the memory footprint versus 16-bit, allowing larger models or faster memory movement. There’s also software-level optimizations like quantization-aware training and sparsity which reduce the effective memory footprint of models. For example, if 50% structured sparsity is applied, the model storage can be effectively reduced (though hardware still allocates full arrays, the zeros compress well in data transfers). Another Hopper feature, NVLink RDMA and GPUDirect Storage, allows moving data in and out of GPU memory more efficiently (bypassing CPU). This helps stream very large models from system memory or NVMe if they don’t fully fit, but performance will degrade if working set doesn’t reside mostly in HBM. In summary, the H100’s memory subsystem – combining huge HBM capacity, extreme bandwidth, and large caches – is engineered to handle the enormous memory demands of modern LLMs, keeping the GPUs fed with data to avoid stalls.
Performance Benchmarks for LLM Inference
LLM Inference Throughput: The NVIDIA H100 sets new milestones in LLM inference performance. Leveraging its Transformer Engine and raw hardware muscle, a single H100 can generate text with large models several times faster than the previous generation. For example, using optimized INT8/FP8 paths, one H100 was measured at ~21,800 tokens per second on a Llama-2 70B model, whereas an A100 achieved around 5,000 tok/s – a 4.3× speedup (NVIDIA GPUs H200 vs. H100 - A detailed comparison guide | TRG Datacenters). In another test with TensorRT-LLM, the H100 (FP8 precision) delivered about 10,000 tokens/s with 100 ms first-token latency for a GPT-J family model (with 64 concurrent requests), which was 4.6× the throughput of A100 under the same conditions (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). Even at single-stream, low-latency settings, H100 shows big gains: less than 10 ms first-token latency is achievable on moderate-size models (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation), enabling snappier interactive inference.
Across model sizes, H100 consistently outperforms older GPUs. On a 13B parameter model (Llama-13B), one H100 can generate roughly 2–3× more tokens per second than an A100 at similar accuracy settings, and maintain sub-10ms token latencies with appropriate batching (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). On the giant GPT-3 (175B), the dual-GPU H100 NVL can serve throughput around 5× higher than an A100 node, thanks to having 188 GB of fast memory and FP8 speedups (NVIDIA H100 NVL | Data Center GPU | pny.com). NVIDIA reported that mainstream servers with H100 NVL outperform A100 servers by up to 5× on Llama-2 70B inference throughput (NVIDIA H100 NVL | Data Center GPU | pny.com). In absolute terms, an 8× H100 SXM HGX system (the DGX H100) achieved over 52,000 tokens/sec on a 530B parameter Megatron GPT-3 model in one internal test (NVIDIA Data Center Deep Learning Product Performance AI Inference) – showcasing the scalability for even the largest models.
BERT and Smaller Models: For smaller transformer models used in NLU tasks like BERT, H100 also breaks records. In MLPerf Inference benchmarks, a single H100 processed over 73,000 question-answering queries per second on BERT-Large (sequence length 384) in offline mode (NVIDIA H100 GPU: The World's Most Advanced AI Inference Accelerator | Zeet.co) (NVIDIA H100 GPU: The World's Most Advanced AI Inference Accelerator | Zeet.co) – about 2× the throughput of A100. This translates to enormous batch processing capability for tasks like sentence classification or embeddings generation. Even in real-time scenarios, H100 can handle thousands of inferences/sec on models like BERT or smaller GPT-2 variants, easily meeting high QPS requirements. The combination of high clock speed and large caches means even latency-critical, memory-light models get a boost.
Batch Size and Sequence Length Effects: The H100 tends to reach peak utilization at moderately large batch sizes or with multiple concurrent sequences, especially for large LLMs. At very small batch sizes (e.g. 1 prompt at a time), it may not fully saturate all SMs due to the sequential nature of transformer decoders, but techniques like overlapping compute of multiple sequences or using higher-throughput sampling algorithms (e.g. speculative decoding) can improve utilization. As batch size increases, throughput in tokens/sec scales up near-linearly until reaching a plateau when the GPU is fully utilized. For instance, one analysis showed that on Mistral 7B (a smaller LLM), moving from A100 to H100 doubled the throughput at the same batch size, and allowed either halving the latency or doubling batch without performance loss (Confidential Computing on nVIDIA H100 GPU - arXiv) (Double inference speed and throughput with NVIDIA H100 GPUs). At extremely large batch sizes (512 or 1024 concurrent sequences), H100 can maintain very high throughput – one NVIDIA demo showed ~21,000 tokens/s at batch 1024 on 7B models with FP8 in TensorRT-LLM (Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor ...) (Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor ...). However, not all applications can use large batches due to latency constraints.
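For a rough feel of the batch-size effect, the sketch below times `generate()` at a few batch sizes with a Hugging Face causal LM; the model name is a placeholder and this simple loop is illustrative rather than a tuned serving stack (a dedicated engine such as TensorRT-LLM will reach much higher numbers).

```python
# Minimal throughput probe: generated tokens/s vs batch size for a causal LM (illustrative only).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # needed because we request padding below
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompt = "Explain the Hopper architecture in one paragraph."
new_tokens = 128
for batch in (1, 8, 32):
    inputs = tok([prompt] * batch, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    t0 = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    dt = time.time() - t0
    print(f"batch {batch:3d}: {batch * new_tokens / dt:8.1f} tokens/s")
```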
Latency and End-to-End Generation Speed: In terms of end-to-end time to generate a complete response, H100 again excels. For example, generating a 1000-token response from a 30B model that might take ~10 seconds on an A100 could complete in around ~3 seconds on H100 (numbers illustrative, demonstrating ~3× speedup). The first-token latency (which matters for interactivity) can be under 20 milliseconds on models up to ~10B parameters on H100, and under 50–100 ms even for 70B models when using FP8 mode – which is exceptional given the complexity of these networks (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). These low latencies enable more real-time applications with larger models than was possible before. When comparing performance, it’s also useful to consider performance-per-dollar: While H100 is more expensive, its ability to serve many more tokens per second means fewer GPUs (or cloud instances) are needed for a given workload. We’ll discuss cost/performance more in a later section.
Note on Accuracy vs Throughput: Achieving the highest throughput on H100 often involves using lower precision (FP8/INT8) or quantized models. NVIDIA’s Transformer Engine and TensorRT will automatically downshift certain matrix multiplications to FP8 on H100, usually with minimal impact on model accuracy after fine-tuning (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). For instance, users have found they can get up to 28× faster inference on a 7B model by simply enabling INT8 on H100 (1200 tokens/s vs 40 tokens/s) with little quality loss (1,200 tokens per second for Llama 2 7B on H100! : r/LocalLLaMA). However, for models that are very sensitive or where INT8 quantization hasn’t been applied, one might run in FP16/BF16 on H100 – still getting a solid 2–3× speedup over A100. In summary, H100 provides flexibility: maximum speed with 8-bit quantization, or still excellent performance at higher precision. Either way, it currently stands as the premier inference GPU for LLMs, handling large sequences and models with ease.
Thermal and Power Efficiency
Power Consumption under Load: The H100 is a high-power processor, but it delivers unprecedented performance-per-watt for AI. The PCIe H100 has a typical board power of 350 W (its default maximum power limit). Under heavy LLM inference loads (which exercise both the compute and memory subsystems), the H100 PCIe card often pushes near this limit. Measurements on a 350 W H100 show that during memory-bandwidth-intensive microbenchmarks the GPU draws close to the full 350 W and occasionally drops clocks slightly to avoid exceeding the limit (Nvidia’s H100: Funny L2, and Tons of Bandwidth). Despite this, data center cooling can usually keep the H100 die at comfortable temperatures (often under 70°C) even at 300 W+ draw, avoiding thermal throttling (Nvidia’s H100: Funny L2, and Tons of Bandwidth). The SXM5 module form of H100 is rated up to 700 W (in a well-cooled server with ample power delivery) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). In practice, in a DGX H100, each GPU can run at ~450–500 W during sustained training loads. For inference, power usage may be lower if the workload is memory-limited; still, one can expect ~400 W per SXM under heavy FP8 inference. The H100 NVL (two NVLink-bridged PCIe cards) has a configurable limit of roughly 350–400 W per GPU, i.e. 700–800 W for the pair (NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models). That envelope lets the pair fit into standard PCIe server power delivery and cooling while still delivering near-SXM clocks, trading denser packaging for the ability to serve large models without an HGX baseboard.
Cooling Solutions: Proper cooling is essential to sustain H100 performance. The PCIe H100 80GB is a dual-slot passive card – it relies on the server chassis fans to push airflow through its heatsink. Data center servers that support the H100 PCIe provide high airflow (often >20 CFM per card) to keep it within its 350 W thermal design. If cooling is insufficient, the card will hit its GPU temperature limit (~85°C) or the HBM sensor limit and down-throttle clocks to stay safe. The SXM5 H100 modules use either high-performance air or liquid cooling. Many server designs (e.g. HGX platforms upgraded from A100 to H100) moved to liquid cooling for 700 W H100s (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), as removing 700 W per GPU (roughly 5.6 kW for an 8-GPU system) with air alone is challenging. Liquid-cooled cold plates or direct warm-water cooling on SXM modules keep them in their optimal thermal range and can even allow higher sustained clocks. NVIDIA notes that with proper cooling the H100 can run unconstrained at its maximum frequency to deliver consistent performance. Users integrating an H100 locally must account for its heat output – a single H100 can overwhelm a typical desktop PC’s cooling. Hence “local” in this context usually means a workstation or server with robust cooling (e.g. a 4U chassis with high-static-pressure fans, or liquid cooling for the GPU).
Performance-per-Watt: One of Hopper’s achievements is improving performance/Watt despite the high absolute power. Thanks to the 4N process and architectural gains, the H100 is significantly more energy efficient than A100 for the same work. For instance, NVIDIA’s benchmarks showed up to 30× more inference throughput on large models for H100 vs A100 (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) – even if power roughly doubled from 300 W to 600 W, that is an order-of-magnitude jump in perf/W. In cloud cost terms, an H100 at 2× the price of A100 but giving 4–5× the throughput means better efficiency and lower total energy for a given number of tokens generated (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums). Another data point: the H100 PCIe delivers about 2.2 TFLOPS/W of dense FP16 Tensor throughput (~756 TFLOPS at 350 W), whereas the A100 80GB PCIe delivered roughly 1.0 TFLOPS/W (312 TFLOPS at 300 W) – about 2× higher raw efficiency. In real workloads like GPT-3 inference, one H100 might replace several A100s, saving cumulative power draw for the same throughput. Running models at lower precision on H100 (FP8) boosts not only speed but efficiency – more work per joule. Conversely, to maximize efficiency, users can also undervolt or power-cap the H100; the PCIe card supports power limits configurable down to 200 W, albeit with some performance loss. Overall, H100 provides much better performance-per-watt than the previous generation, which is crucial as LLM deployments scale up and energy costs become significant.
Thermal Throttling Behavior: Under normal conditions with adequate cooling, H100 will sustain high clocks (e.g. 1.8–1.9 GHz on SXM). If the card hits either the power limit or temperature limit, it will start reducing core frequency to stay within spec. The H100’s power management tends to be the first limiter (especially on the 350W cards): as loads approach 350W, the GPU will not boost further and may oscillate slightly below max clocks to avoid crossing the boundary (Nvidia’s H100: Funny L2, and Tons of Bandwidth). This is a gentle form of throttling – for example, chipsandcheese observed clocks dropping to ~1395 MHz (80% of max) in worst-case on a PCIe H100 when fully pegged on memory bandwidth (Nvidia’s H100: Funny L2, and Tons of Bandwidth) (Nvidia’s H100: Funny L2, and Tons of Bandwidth). Thermal throttling (hitting Tmax) is rarer if cooling is sufficient; however, if a card is in a poorly cooled environment, it could hit the ~85°C GPU temperature limit and then you’d see more aggressive clock reductions. The memory (HBM) has its own thermal sensors; if HBM junction goes beyond safe levels (~95°C typically), the card will also throttle memory speeds. Using H100 within recommended thermal conditions (good case airflow or liquid cooling for SXM) prevents these scenarios and ensures consistent inference latency and throughput.
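To see whether a card is bumping into its power or thermal limits during inference, a short NVML polling loop like the one below (using the `pynvml` / `nvidia-ml-py` bindings, assumed installed) logs power draw against the enforced limit, temperature, SM clock, and utilization.

```python
# Poll power, temperature, SM clock and utilization once per second via NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(10):
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # milliwatts -> watts
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    print(f"{power_w:6.1f} W / {limit_w:.0f} W cap | {temp_c:3d} C | {sm_mhz:4d} MHz SM | {util:3d}% util")
    time.sleep(1)

pynvml.nvmlShutdown()
```

If the SM clock sags while power sits pinned at the cap, the card is power-limited; if temperature is at its limit with power headroom left, it is thermally limited.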
Fan Noise and Environment: Though not a direct technical spec, it’s worth noting for a local setup: if you have an actively cooled H100 (NVIDIA did not release a blower-cooled version as of this writing, most are passive), the server fans ramped to cool 350W can be very loud. This is only relevant to someone running an H100 in a workstation or lab; data centers will have acoustic isolation. But it’s one consideration if an enthusiast is comparing using a 300W RTX card (which might be somewhat quieter with its cooler) versus an H100 that demands server-grade cooling. In exchange for dealing with the heat, you get unparalleled performance.
In summary, the H100 manages to deliver extreme performance at high but manageable power draw, and its efficiency gains mean that even though it pulls 350–700W, it’s doing the work of many older GPUs, often yielding net power savings at the cluster scale for a given workload.
Comparative Analysis with Other GPUs
Versus NVIDIA A100 (Ampere): The A100 (80GB) was the previous champion for AI, and the H100 surpasses it on all fronts. In raw specs, H100 has ~1.2× more SMs (132 vs 108), runs at roughly 1.4× higher clocks, and doubles per-SM throughput, giving roughly 3× the FP32 and FP16 Tensor Core throughput per GPU (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). Add in the new FP8 capability (which A100 lacks entirely) and faster memory, and the advantage grows further. Real-world large-model inference shows 3–5× higher throughput on H100 vs A100, as noted with the Llama-70B and GPT-3 benchmarks (NVIDIA GPUs H200 vs. H100 - A detailed comparison guide | TRG Datacenters) (NVIDIA H100 NVL | Data Center GPU | pny.com). Even for training, H100 is ~2–3× faster, which often more than justifies its higher price (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums). Memory-wise, both offer up to 80 GB, but H100’s bandwidth is up to ~1.6× higher. Also, H100’s Transformer Engine mixes precisions seamlessly, whereas on A100 one would use mostly FP16 or BF16. Another distinction is NVLink: both PCIe generations support a 600 GB/s NVLink bridge between two cards (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press), while the SXM H100 raises NVLink/NVSwitch bandwidth to 900 GB/s vs 600 GB/s on A100 SXM4 – helpful for multi-GPU scaling. Essentially, H100 is faster but also more expensive: early cloud pricing had H100 at about 2.2× the cost of A100 per hour (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums), but since it can be 3–4× faster, price-per-performance can favor H100. In scenarios with smaller models or low batch sizes where the A100 was already underutilized, the gap narrows, but for the current trend of very large models, H100 is the clear choice if available. Many “cost per inference” analyses show H100 yielding lower cost per token when fully utilized, despite the higher upfront cost (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums).
Versus Consumer GPUs (e.g. RTX 4090): Some enthusiasts compare the H100 with high-end consumer cards for local AI. The RTX 4090 (Ada) has ample FP16/INT8 Tensor Core capability (4th-gen Tensor Cores) and runs at high clocks, so for smaller models it is quite powerful. However, it is limited to 24 GB of GDDR6X memory – only a fraction of the H100’s 80 GB – meaning it cannot hold models above ~13B parameters fully in memory at FP16. Its Tensor Cores do support FP8 in hardware, but the Transformer Engine software path has focused on Hopper, and its ~1 TB/s memory bandwidth is one-half to one-third of the H100’s HBM bandwidth. The 4090 also lacks the H100’s data-center features such as NVLink and MIG. In throughput terms, a 4090 might achieve perhaps 20–30% of an H100’s performance on a medium-sized model (some community tests show ~200 tokens/s on a 30B model for the 4090 vs ~600 tokens/s for the H100, using int8). That said, consumer GPUs have much better price/performance for models that fit their VRAM: a 4090 (~$1,600) versus an H100 (~$30,000 on resale) is not a fair fight on cost – the 4090 wins on raw $/TFLOPS. But if the task is to serve a 65B model, the 4090 simply cannot handle it without running out of memory or offloading half the model to CPU (which kills performance). In professional environments, the H100’s cost is justified by its ability to handle huge models and multi-user loads reliably. For hobbyist local inference of smaller models (up to ~13B), a 4090 or similar may suffice with some quantization, though it will be slower and lacks the enterprise features.
Versus AMD MI250/MI300: AMD’s data center GPUs (Instinct MI200 series and the upcoming MI300) are competitors in HPC and AI. The MI250X offers 128 GB HBM2e (two dies × 64 GB) and ~380 FP16 TFLOPS (theoretical) across its dual GPU dies. In practice, however, MI250 has not demonstrated LLM inference performance close to H100 – partly due to less mature software for transformers (no equivalent of NVIDIA’s TensorRT and Transformer Engine) and no FP8 support. MI250 does support INT4/INT8 matrix ops via its Matrix Cores (AMD’s CDNA2 architecture), but initial MLPerf results for AMD on BERT and others showed them lagging NVIDIA’s same-generation offerings. The upcoming MI300X is AMD’s 192 GB GPU aimed at AI inference, which will support FP8 and FP16 on its CDNA 3 Matrix Cores, potentially narrowing the gap. But as of now, H100 holds a strong lead in LLM workloads, as evidenced by nearly all top MLPerf inference submissions running on NVIDIA hardware. Another factor is NVLink vs AMD’s Infinity Fabric – scaling MI250 beyond 2 dies is harder, whereas H100 can leverage NVSwitch to scale to 8 or more GPUs with high bandwidth. On cost, AMD often prices its GPUs lower, but the ecosystem and performance difference means many are still choosing H100 despite the price. For those with existing AMD installations or workloads that favor AMD, MI250X could serve large models (the 128 GB memory is an advantage), but one might need two MI250s to approach one H100’s throughput. Until MI300X arrives, NVIDIA maintains a comfortable advantage in generative AI tasks.
Multi-GPU Considerations: When comparing multi-GPU solutions, an interesting question is “one H100 vs two A100s vs four smaller GPUs,” etc. Two A100 80GB cards together have 160 GB of memory and ~624 TFLOPS of FP16 Tensor throughput – which approaches one H100’s ~990 FP16 TFLOPS. In some cases, buying two used A100s might be cheaper than one H100 and yield similar raw throughput, but splitting a model across two GPUs introduces communication overhead and complexity (model parallelism). The H100’s single-GPU performance (and 80 GB unified memory) avoids that complexity. Furthermore, two A100s at ~300 W each draw ~600 W total, comparable to one H100 SXM at 700 W – yet the H100 likely still outpaces them thanks to FP8 and newer optimizations. Thus, many choose a single H100 over multiple older cards for simplicity and better peak performance. In terms of cost-performance, if initial cost is the main factor, last-gen or consumer GPUs can do the job at lower cost albeit at lower speed. But if total throughput or TCO (total cost of ownership) is key, H100 often gives better value by cutting inference time (allowing higher utilization or fewer total nodes). A report from MosaicML highlighted that although H100 was ~2.2× the cloud cost of A100, it trained models 2.5–3.3× faster, yielding a lower cost-to-train overall (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums). The same logic extends to inference: faster results mean fewer GPU-hours to serve X queries, offsetting the higher hourly price.
In summary, H100 currently sits at the top of the GPU hierarchy for LLM tasks, with few equals. Its closest competition will come from the next-gen NVIDIA H200 (which adds more memory and bandwidth) and from upcoming AMD/Intel accelerators. But for now, if one needs the absolute best performance (and has the budget), H100 is the go-to. For those on tighter budgets or working with smaller models, alternatives like A100 (used or rental), RTX 4090/6000 Ada, or cloud instances with other accelerators can be considered, accepting slower performance.
Optimization Techniques and Software Compatibility
Software Framework Support: NVIDIA H100 GPUs are fully supported by all major AI frameworks and libraries, thanks to CUDA 11/12 and associated libraries. PyTorch and TensorFlow both can automatically utilize H100’s capabilities through GPU kernels – using CUDA 12.x ensures that Hopper architecture kernels (sm_90) are targeted. Out of the box, frameworks will use FP16 or BF16 on H100 similarly to A100. To tap into H100’s special sauce (FP8 Tensor Cores and Transformer Engine), users should leverage NVIDIA’s optimized libraries:
- NVIDIA Transformer Engine (TE): This is a library (with native PyTorch and JAX integrations) that automatically manages FP8 training and inference. It runs selected layers (feed-forward GEMMs and attention projections) in FP8 on the fly on H100, keeps results in FP16/BF16, and applies per-tensor scaling to maintain accuracy (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). By adding a few lines of code or swapping in TE modules, one can often get significant speedups on H100 without manual intervention (a sketch follows this list).
- TensorRT and TensorRT-LLM: For deployment, NVIDIA’s TensorRT now has specialized support for transformers on H100. The TensorRT-LLM toolkit includes optimized implementations of attention, dense, and softmax that exploit FP8 and multi-stream concurrency (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). It can take a HuggingFace transformer model and compile it to an engine that runs significantly faster on H100 (the 4.6× speedup blog we cited was using TensorRT-LLM) (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation). ONNX Runtime also integrates TensorRT, making it easy to serve models with these optimizations.
- cuBLAS and CUTLASS: Lower-level, NVIDIA’s linear algebra libraries (cuBLASLt) and the CUTLASS GEMM library have added support for FP8 and improved BF16 on Hopper. This means any custom kernels or ML frameworks that rely on these (JAX, MXNet, etc.) can also use H100’s capabilities if built against recent CUDA libraries.
- Python libraries & ecosystems: Tools like Hugging Face Accelerate, FasterTransformer, and others often have H100-targeted paths. For example, Hugging Face’s `transformers` library can integrate with bitsandbytes for 8-bit inference – H100 executes those 8-bit matrix multiplies on its Tensor Cores very efficiently (INT8 or FP8); a sketch follows below. The `optimum` library from Hugging Face has an NVIDIA extension that automatically uses TensorRT on H100, reportedly giving huge speedups (that “1,200 tokens/s on a 7B model” tip was from Hugging Face using TensorRT on H100) (1,200 tokens per second for Llama 2 7B on H100! : r/LocalLLaMA). So the ecosystem is ready to make use of H100 without requiring users to write low-level code.
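Picking up the Transformer Engine item above, here is a minimal hedged sketch of the FP8 path; it assumes the `transformer_engine` package is installed alongside PyTorch and uses a single toy layer rather than a real model.

```python
# Run one Transformer Engine Linear layer under FP8 autocast on H100 (toy example).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()  # drop-in nn.Linear replacement
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 for gradients
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM executes in FP8 on Hopper Tensor Cores
print(y.dtype, y.shape)
```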
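And the bitsandbytes 8-bit path mentioned above, via `transformers`; the checkpoint name is a placeholder and the `bitsandbytes` package is assumed to be installed.

```python
# Load a causal LM with 8-bit weights (bitsandbytes) and generate a short completion.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 weight-only quantization
    device_map="auto",
)

inputs = tok("The H100's Transformer Engine", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```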
CUDA and Driver Requirements: H100, being a new architecture, requires CUDA 11.8+ (and really CUDA 12+ for full support) (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog). It uses the SM_90 target for compilation. Users need NVIDIA driver R520 or later for H100 (Linux R520+ as noted in the specs). In practical terms, anyone using H100 should be on the latest NVIDIA driver and a recent framework version to ensure all kernels are optimized. Older software might run in fallback modes and not see the benefit.
Mixed Precision and Automatic Casting: Best practice on H100 is to use mixed precision everywhere possible. This includes:
- FP16/BF16 for storage, FP32 for accumulation: standard mixed precision, easily enabled via frameworks (e.g. `torch.cuda.amp` / `torch.autocast`). H100 handles this natively and also benefits from BF16’s ease of use (since BF16 doesn’t overflow as easily, one can often skip loss scaling); see the sketch after this list.
- FP8 for certain layers: Using the Transformer Engine, one can let the software decide which layers to run in FP8. Typically, the linear layers in the MLP and the attention projections are good candidates for FP8, whereas normalizations and residual accumulations stay in higher precision (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). TE chooses the E4M3 or E5M2 FP8 format per tensor depending on range requirements. The result is near-FP16 accuracy with up to 2× speed on those layers. This is largely automated in NVIDIA’s software – for example, in PyTorch one swaps in `transformer_engine.pytorch` modules (such as `te.Linear`) and runs them under TE’s FP8 autocast context.
- INT8 quantization: Besides FP8, traditional INT8 calibration and quantization can be used. NVIDIA provides calibration tooling (e.g. TensorRT’s INT8 calibrators and the pytorch-quantization toolkit) to quantize a model’s weights to INT8. On H100, those INT8 ops run on Tensor Cores (unlike some older GPUs where INT8 ran via DP4A on the SM cores). For instance, you could quantize all linear layers of a GPT model to INT8 and execute it with ONNX Runtime using the TensorRT execution provider. Accuracy may drop a bit if this is not done carefully, but quantization-aware training or recent post-training algorithms (SmoothQuant, ZeroQuant) help maintain accuracy. H100’s integer support is centered on INT8; mixed INT8/INT4 schemes (4-bit weights with 8-bit activations) exist in software but remain experimental.
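Picking up the first bullet above, here is a minimal plain-PyTorch autocast sketch; the small Sequential model is a stand-in for a real network, and nothing here is H100-specific beyond the Tensor Cores executing the BF16 GEMMs.

```python
# BF16 autocast inference: weights stay FP32, matmuls run in BF16 on Tensor Cores with FP32 accumulation.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().eval()

x = torch.randn(8, 4096, device="cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)  # GEMMs dispatched as BF16 Tensor Core kernels
print(y.dtype)    # torch.bfloat16
```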
Framework Plugins and Updates: It’s worth mentioning specific software support:
- PyTorch: As of PyTorch 2.0+, H100 is supported. PyTorch with CUDA 12 will use Hopper kernels. The Apex/TransformerEngine integration is key for FP8. PyTorch also can offload the KV cache to CPU if needed and reuse it – but on 80GB H100 this is less needed unless extremely long sequences.
- TensorFlow: Similar to PyTorch, with proper XLA or oneDNN backends updated for Hopper, TF can use BF16/FP8. NVIDIA released an FP8 training example in TensorFlow as well. The XLA compiler can fuse ops on H100, and when using JAX or TF-XLA, you might benefit from that large L2 by fusing more of the transformer layer into one kernel.
- Memory optimization: Large models on H100 can still benefit from optimizations like quantization (to fit model in memory and increase effective bandwidth) and model parallelism (if one H100 is not enough memory, using two via tensor or pipeline parallel). Libraries like DeepSpeed, Megatron-LM support sharding a model across multiple H100s with communication handled by NCCL over NVLink/NVSwitch – those have all been updated to be Hopper-aware, taking advantage of faster NVLink.
- MIG (Multi-Instance GPU): H100 supports up to 7 MIG slices. This is useful if running many smaller models or serving many lightweight inference jobs. Each MIG instance has isolated compute and memory (10 GB each for a 7-way split on the PCIe card) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). For LLMs, MIG is less useful unless you are deploying, say, many 6B-parameter models on one GPU, but it is there for partitioning resources (e.g., half the GPU running an LLM, the other half another task, securely isolated).
Quantization and Pruning Tools: The open-source community has created tools like bitsandbytes (8-bit quantization and optimizers) and GPTQ (4-bit weight quantization) that allow running models at lower precision with minimal accuracy loss. On H100, bitsandbytes INT8 mode is fully supported and runs very fast (it uses Tensor Core INT8). GPTQ 4-bit models also run on H100, though GPTQ relies on custom CUDA kernels – those kernels may not yet be tuned for Hopper’s architecture, but they still run, and typically faster than on Ampere. On H100, 4-bit weights are generally dequantized to FP16/INT8 on the fly before reaching the Tensor Cores, so the win comes from the halved memory footprint and bandwidth rather than extra math throughput. Native 4-bit hardware formats (FP4/INT4) are expected with the follow-on Blackwell generation rather than H100/H200, but the quantization techniques being developed today carry forward to that hardware.
Compatibility Summary: Bottom line – any code that ran on A100 will run on H100, typically faster. But to really unlock H100, one should:
- Use CUDA 12 and latest libraries.
- Utilize the Transformer Engine for FP8.
- Use TensorRT or other inference optimizers for deployment.
- Experiment with 8-bit quantization for large models so that more of the 80 GB is available for context (e.g. keep weights in 8-bit and activations in BF16).
- Make sure to leverage NVLink for multi-GPU (enable NCCL P2P, etc., which is usually automatic).
- Monitor GPU utilization; if not saturated, increase concurrency (e.g., serve multiple requests in parallel using batching or multiple streams).
By following these, an H100 can be pushed to its limits, achieving the best possible token throughput and latency.
Scaling to Multi-GPU and Multi-Node
While a single H100 GPU can handle moderately large models, the largest models and highest throughput deployments often involve multiple H100s working in parallel. The Hopper architecture and NVIDIA’s platform provide several ways to scale up:
NVLink and Multi-GPU Scaling: H100 GPUs can be interconnected via NVLink 4 to form high-bandwidth multi-GPU systems. In an 8× H100 SXM server (e.g. DGX H100), all GPUs are connected through NVSwitch with up to 900 GB/s bandwidth between any pair (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). This ultra-fast interconnect allows distributed model computation with relatively low communication overhead. For example, one could split a giant model’s layers across 2 or 4 GPUs (pipeline parallelism) or split each matrix across GPUs (tensor parallelism) and the NVSwitch will transfer activations/gradients quickly between GPUs each step. Compared to previous gen (600 GB/s on A100 NVSwitch), H100’s NVLink is 1.5× faster, which reduces the penalty of scaling.
In more common multi-GPU setups (like 2× or 4× H100 in a server without NVSwitch), H100 PCIe cards support 3-way NVLink Bridge connectors. Two PCIe H100s can be bridged with 3 NVLink bridges providing 600 GB/s peer-to-peer bandwidth (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). This effectively treats the pair almost like an SXM module pair. The H100 NVL already comes as such a pair with NVLink built-in. When you have 4 H100 PCIe in a system, you can link them as two pairs. However, you cannot NVLink across those pairs (since PCIe cards only have connectors for 2-GPU linkage). So for >2 PCIe H100s, communication between pairs falls back to PCIe (or goes via CPU memory). That is a limitation for scaling beyond 2 with the PCIe form factor; it means large models that don’t fit on 2 GPUs might be better on SXM form factor with NVSwitch.
Multi-Node (Cluster) Scaling: To go beyond one server, H100 supports clusters using InfiniBand or Ethernet with RDMA. NVIDIA’s solution is the NVLink Switch System, which can extend NVLink connectivity across nodes, theoretically allowing up to 256 H100s in one shared memory space with 9× higher bandwidth than InfiniBand HDR (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press). That is a very specialized network (NVLink Switch). More commonly, H100 nodes connect with Quantum-2 InfiniBand at 400 Gbps. The highest-end example: the EOS supercomputer (NVIDIA internal) uses many H100s connected with NVLink Switch and InfiniBand to train massive models collectively. For inference, multi-node is usually about increasing throughput (different requests on different nodes) rather than splitting one model, except for truly enormous models (multi-hundred-billion or trillion parameters) that need the aggregate memory of multiple GPUs.
Parallelism Strategies: For LLM inference specifically:
- If one model is too large for a single H100 (e.g. a 175B GPT-3 in 16-bit would be ~350 GB – beyond an 80GB card or even a 2×94GB NVL pair), you can shard the model across multiple GPUs (model parallelism). Megatron-LM and FasterTransformer support splitting weight matrices across 2, 4, or 8 GPUs (tensor parallelism). With H100’s fast NVLink, the overhead is modest – typically a few percent of added latency from communication. For instance, splitting a 175B model across 8 GPUs (an HGX board, or four NVL pairs) lets it run in FP16 with each GPU hosting ~22B parameters (~44 GB of weights). On an NVSwitch-connected system, the 900 GB/s links mean the layers compute almost as if on one GPU, with just an all-reduce at the end of each layer; with PCIe NVL pairs, traffic that has to leave a bridged pair falls back to PCIe and is correspondingly slower. A minimal tensor-parallel sketch follows this list.
- If the model fits in one GPU but you want more throughput, you can run multiple inference streams on different GPUs (data parallel inference). E.g., serve 4 different requests simultaneously on 4 GPUs, each gets the full model copy. This is trivial to scale linearly, provided you have a copy of the model on each GPU. The bottleneck is loading the model into each GPU’s memory (which can be done in parallel at startup) and then feeding each GPU with data. H100 has features like Multi-Stream execution and MIG if needed to isolate loads.
- Pipeline parallelism for inference (cutting the model into stages across GPUs): This is less common because it introduces latency (each token must pass through GPU1 then GPU2, etc.), but it’s sometimes used for very large models to reduce memory per GPU. H100’s latency-optimized interconnect helps, but still if one can avoid pipeline for inference, it’s better to use tensor parallel or just replicate model if memory allows.
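As a concrete sketch of the tensor-parallel option, here is how a 70B-class model could be sharded across four H100s with vLLM; vLLM is not discussed above, so treat it as one illustrative tool choice, and the checkpoint name and GPU count are placeholders.

```python
# Shard a large model across 4 GPUs with tensor parallelism and run batched generation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder checkpoint
    tensor_parallel_size=4,             # split each weight matrix across 4 GPUs (NVLink/NVSwitch traffic)
    dtype="bfloat16",
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize the Hopper architecture."], params)
print(outputs[0].outputs[0].text)
```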
CPU-GPU Communication: A potential bottleneck in any deployment is transferring input/output data between CPU and GPU. H100 sits on PCIe Gen5 x16, offering roughly 64 GB/s per direction (128 GB/s bidirectional) of host transfer bandwidth. Large batch inference (reading a batch of prompts or writing generated tokens) pushes at most a few GB through, so PCIe 5.0 is more than adequate for text data (which is tiny compared to the model). The bigger concern is model load time – copying 80 GB of model weights from CPU memory or disk to the GPU takes a couple of seconds over PCIe even at full speed, and longer from slow storage. Techniques to alleviate this include a coherent CPU–GPU link (NVLink-C2C on Grace Hopper systems, not applicable to standard x86 hosts) or GPUDirect Storage, which lets the GPU DMA data directly from NVMe storage, bypassing CPU overhead. H100 supports GPUDirect Storage, so in principle a model can be loaded from an NVMe SSD to HBM via DMA at high speed.
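A back-of-the-envelope check of those load times, assuming host-to-device copies sustain about 80% of the link's peak (an assumption, not a measured figure):

```python
# Rough model-load-time estimate over a given link (weights only, steady-state copy).
def load_seconds(model_gb: float, link_gb_s: float, efficiency: float = 0.8) -> float:
    return model_gb / (link_gb_s * efficiency)

print(f"80 GB over PCIe Gen5 x16 (~64 GB/s): {load_seconds(80, 64):.1f} s")
print(f"80 GB over PCIe Gen4 x16 (~32 GB/s): {load_seconds(80, 32):.1f} s")
print(f"80 GB from a ~7 GB/s NVMe SSD:       {load_seconds(80, 7):.1f} s")
```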
During inference runtime, if a model is sharded across GPUs, the GPUs exchange activations over NVLink/NVSwitch without involving CPU. The CPU mainly handles the initial text tokenization and final detokenization. These steps are usually minor (<5% overhead), but with very high throughput (say thousands of queries per second), CPU can become the bottleneck if not enough CPU threads are allocated for pre/post-processing. Solutions include using faster tokenizers (huggingface tokenizers in Rust), or even offloading tokenization to GPU as some research projects do, although not common.
Scaling Efficiency: With strong interconnects, multiple H100s can achieve near-linear scaling in throughput for inference when distributing requests. For example, if one H100 does X tokens/s, two H100s with the model loaded on each can do ~2X tokens/s (assuming the system can feed both). When splitting a single inference across GPUs (model parallel), there is some overhead. Empirically, many have found that two GPUs can achieve ~1.9× throughput vs one on large models (95% scaling efficiency) if using tensor parallel, thanks to NVLink. Four GPUs might achieve ~3.5–3.8× (some diminishing returns due to more communication). The NVSwitch in 8-GPU systems keeps efficiency high for up to 8-way splits.
H100 NVL for Multi-GPU in one slot: The H100 NVL is essentially an “SLI” of two GPUs on one board. It is meant to make multi-GPU more accessible on standard servers (which might only have one or two PCIe slots). The NVL’s two GPUs share memory via NVLink, enabling them to act almost like a single 188GB GPU for inference. However, note that NVLink (even 600 GB/s) is still slower than local HBM (3+ TB/s). So when a model spans the two NVL GPUs, each GPU accessing the other’s memory will incur some overhead. In practice, one should distribute the model such that each GPU mostly works on its local half, and only minimal data (activations) are exchanged. If done properly, the NVL can approach the performance of a single SXM (since it basically is two SXMs). NVIDIA specifically targets NVL for models up to 175B, suggesting that splitting the model across the 2 GPUs is a good use-case.
Multi-GPU Summary: H100 offers strong scaling options: from 2-GPU setups with NVLink bridges, to 8-GPU HGX boards with NVSwitch, to multi-node deployments over NVLink Switch or InfiniBand. For local inference with multiple H100s, it is usually best either to serve more requests in parallel (data parallel) or to shard a model that is too big for one card. The high-bandwidth links between GPUs ensure that the usual culprit – communication – is less of an issue, but maximizing multi-GPU efficiency still requires careful software handling, whether through libraries such as DeepSpeed or Megatron that are optimized for it, or frameworks like TensorRT-LLM that can distribute the model automatically. Users should also watch for NUMA effects: if GPUs hang off different CPU NUMA domains, pin host threads to the matching CPUs to avoid extra latency on the CPU side.
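For the NUMA point, a rough sketch of how one might pin the serving process to the CPUs nearest a given GPU on Linux. It assumes the `nvidia-ml-py` (`pynvml`) package is installed and that the standard sysfs paths are present; details of the PCI bus ID formatting can vary, so treat this as illustrative rather than a drop-in utility.

```python
# Sketch (Linux only): pin this process to the CPUs of the NUMA node that
# hosts a given GPU, using NVML for the PCI bus ID and sysfs for topology.
import os
import pynvml

def pin_to_gpu_numa_node(gpu_index: int = 0) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId   # e.g. "00000000:41:00.0"
    if isinstance(bus_id, bytes):
        bus_id = bus_id.decode()
    bus_id = bus_id.lower()[-12:]                        # sysfs form: "0000:41:00.0"
    with open(f"/sys/bus/pci/devices/{bus_id}/numa_node") as f:
        node = int(f.read().strip())
    if node < 0:                                         # -1: no NUMA affinity reported
        return
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()                       # e.g. "0-31,64-95"
    cpus = set()
    for part in cpulist.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    os.sched_setaffinity(0, cpus)                        # pin this process to those CPUs

pin_to_gpu_numa_node(0)
```

Tools like `numactl` or the scheduling options in serving frameworks accomplish the same thing without custom code.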
In essence, the H100 scales both up and out robustly, making it suitable not just for single-card inference but also for building AI servers and clusters that serve massive models and heavy concurrent user loads.
Limitations and Considerations
Despite its cutting-edge capabilities, the NVIDIA H100 is not a magic bullet – there are practical limits and factors to consider when using it for large language models:
Memory Capacity Constraints: While 80 GB is enormous by GPU standards, LLM parameter counts keep growing. Models like GPT-3 175B and PaLM 540B simply cannot fit in a single H100 in FP16. This forces either model parallelism (splitting across GPUs) or compression (INT8, 4-bit, etc.). On an 80 GB H100 PCIe, the largest model loadable in 16-bit precision is roughly 40B parameters; to go larger, you either accept the speed/complexity trade-offs of 8-bit quantization (which allows roughly 80B parameters in 80 GB) or use multiple GPUs. The H100 NVL with 2×94 GB addresses this to an extent – 188 GB can hold ~94B parameters in FP16, or perhaps a 175B model in 8-bit. But models approaching 0.5–1 trillion parameters are on the horizon, and those require a cluster of H100s, so memory remains the bottleneck at the bleeding edge. Even when the weights fit, sequence length can create huge activation memory usage: long prompts or contexts (e.g. 8K or 32K tokens) consume a lot of GPU RAM for key/value caches, and it is entirely possible to run out of memory because of the KV cache rather than the weights. H100's large memory helps but is not infinite – plan for headroom (or use an inference framework that can offload the KV cache to CPU) when serving long-context LLMs.
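A rough sizing sketch for the figures above. The formulas are simplified (no activation workspace, no fragmentation), and the KV-cache example assumes a Llama-2-70B-like shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128):

```python
# Rough memory sizing: model weights plus KV cache.
def weight_gb(n_params_billion: float, bytes_per_param: float) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: float = 2.0) -> float:
    # 2 tensors (K and V) per layer, per token, per sequence
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"70B weights -> FP16: {weight_gb(70, 2):.0f} GB, "
      f"INT8: {weight_gb(70, 1):.0f} GB, 4-bit: {weight_gb(70, 0.5):.0f} GB")
# Llama-2-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128
print(f"KV cache at 32K context, batch 8: {kv_cache_gb(80, 8, 128, 32_768, 8):.0f} GB")
```

The KV-cache line alone comes out to tens of gigabytes, which is why long-context serving can exhaust memory even when the weights themselves fit comfortably.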
Bandwidth vs Compute Balance: H100 dramatically boosts both memory bandwidth and compute, but not every workload benefits equally. Some layers are memory-bound (e.g. embedding lookups, or large fully-connected layers executed at tiny batch sizes, where arithmetic intensity is low); in those cases the compute units sit idle waiting for data and the theoretical FLOPS are never reached. Autoregressive generation makes this acute: producing one token at a time means the GPU does a short burst of compute, then waits for the next token's input. This is a limitation of the workload, not the GPU, but in practice you will see far less than peak utilization unless you batch or parallelize. Achieving peak performance therefore requires structuring inference to keep the GPU busy (e.g. processing multiple sequences concurrently). If you only ever have one request in flight, an H100 will be underused; in that case time-slicing or MIG can be used to run other tasks on the spare capacity.
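A simple way to see why single-stream decoding is memory-bound: each generated token must stream (roughly) every weight from HBM once, so an upper bound on single-sequence decode speed is bandwidth divided by model size. The numbers below are illustrative, using approximate H100 bandwidth figures and an 8-bit 70B model (~70 GB of weights):

```python
# Upper bound on batch-1 decode speed: HBM bandwidth / bytes of weights read per token.
HBM_BANDWIDTH_GBS = {"H100 PCIe": 2000, "H100 SXM": 3350}
MODEL_WEIGHT_GB = 70   # e.g. a 70B-parameter model quantized to 8 bits

for name, bw in HBM_BANDWIDTH_GBS.items():
    print(f"{name}: ~{bw / MODEL_WEIGHT_GB:.0f} tok/s upper bound at batch size 1")

# Batching N sequences reuses each weight read across N tokens, so aggregate
# throughput grows roughly with batch size until compute or KV-cache traffic dominates.
```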
Software and Implementation Maturity: Because H100 and FP8 are relatively new, some deep learning frameworks or models are not yet fully optimized to exploit them. NVIDIA's own software stack is mature, but if you rely on custom or less common tools, you may need to update them or wait for Hopper support – for example, some framework versions have had FP8 issues, and some libraries need to be recompiled with SM_90 kernels. Debugging numerical issues is also trickier with FP8: the scaling factors must be calibrated correctly, and if a model shows degraded accuracy on H100 the cause is often an overly aggressive use of low precision. The fix may be to adjust tolerances or to run sensitive layers at higher precision. In short, getting both maximum performance and the desired accuracy can require an iterative approach.
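Before assuming Hopper-specific features will actually be used, a few environment sanity checks are worthwhile. A minimal sketch using standard PyTorch APIs (the FP8 check only confirms the dtype exists in the installed build, not that any particular kernel uses it):

```python
# Environment sanity checks for Hopper (SM_90) and FP8 support in PyTorch.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")          # Hopper reports sm_90
print(f"Built-for architectures: {torch.cuda.get_arch_list()}")  # should include sm_90
print(f"torch {torch.__version__}, CUDA {torch.version.cuda}")

# FP8 tensor dtypes exist in recent PyTorch builds; their presence is a
# necessary (not sufficient) sign that the stack can represent FP8 data.
print(f"FP8 dtypes available: {hasattr(torch, 'float8_e4m3fn')}")
```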
Cost and Availability: At the time of writing, H100 GPUs are extremely expensive and in high demand – a practical limitation for "local" use, since not everyone can obtain one easily. Lead times can be long, and prices can exceed $30k per card on secondary markets (What is the price difference between NVIDIA H100 and NVIDIA A100?). Many who want H100-level performance therefore opt for cloud instances, which carry their own cost but avoid the upfront investment. The cost-performance ratio is favorable at scale, but the capital outlay is still a real consideration – two A100s may be cheaper and "good enough" on a tight budget, as discussed earlier. H100s also draw a lot of power (350W–700W each); a local setup needs power provisioning and cooling to match, which can mean upgrading power supplies and building out data-center-grade airflow or a liquid cooling loop. These are non-trivial constraints for an individual or a small lab.
Integration and Compatibility: The H100 PCIe card is a double-width, full-height card about 10.5 inches long, and it requires a 16-pin power connector delivering 350 W or more. Not every workstation can accommodate this physically or electrically. The motherboard must also be able to map the 80 GB BAR (Resizable BAR support, common on modern platforms but worth verifying). The SXM form factor is only usable in specific servers built around HGX carrier boards, so it is not an option for most personal setups. In short, confirm that your system is compatible, or buy a purpose-built server/workstation for the H100. There have been reports of users running H100s in consumer boards and hitting boot or cooling problems – so planning, and possibly vendor support, is recommended.
Bottlenecks in LLM Pipeline: Aside from the GPU, one must consider the end-to-end pipeline of an LLM service. The GPU might churn out tokens quickly, but if your application cannot feed prompts or process outputs at that rate, the overall speed will be limited. For example, generating 10,000 tokens/sec is great, but if you then run a CPU-heavy post-processing on those outputs or if you have network overhead sending results to users, the realized throughput might drop. Thus, deploying H100 for LLMs often requires balance across the system: fast disk or network to load models, enough CPU for prepping data, and possibly multiple threads reading/writing to the GPU concurrently to keep it busy.
Precision vs Quality Trade-offs: Using the H100 to its fullest usually means using lower precision. FP8 and INT8 come very close to FP16 accuracy for many generative models, but there is a risk of subtle generation differences or instabilities, and some models (mixture-of-experts architectures, or certain fine-tuned behaviors) do not quantize as cleanly. Validate output quality when deploying on H100 with aggressive optimization; if problems appear, the fallback is FP16, which sacrifices some of the speed gain. Essentially, harnessing the headline "30× inference speedup" relies on quantization that can, in rare cases, affect outputs. For most applications this is acceptable, but critical applications may prefer to run somewhat slower to maintain confidence in the results. H100 gives you the choice – it is up to the practitioner to pick the appropriate mode.
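A lightweight validation pass can be as simple as comparing greedy generations from a full-precision copy and a quantized copy of the same model on a handful of held-out prompts. The sketch below is only illustrative: the model name is a placeholder, and the 8-bit loading path shown (bitsandbytes via `load_in_8bit`) is one possible choice – substitute whatever quantization scheme you actually deploy.

```python
# Compare greedy generations between FP16 and an 8-bit-quantized copy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder
prompts = ["The capital of France is", "def fibonacci(n):"]

tok = AutoTokenizer.from_pretrained(MODEL)
ref = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16,
                                           device_map={"": 0})
quant = AutoModelForCausalLM.from_pretrained(MODEL, load_in_8bit=True,
                                             device_map={"": 0})

for p in prompts:
    ids = tok(p, return_tensors="pt").to("cuda:0")
    with torch.no_grad():
        a = tok.decode(ref.generate(**ids, max_new_tokens=32, do_sample=False)[0])
        b = tok.decode(quant.generate(**ids, max_new_tokens=32, do_sample=False)[0])
    print("MATCH" if a == b else "DIFFERS", "|", p)
```

For serious deployments, a small perplexity or task-accuracy evaluation on representative data is a better signal than exact-match comparisons, since harmless token-level differences are expected.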
Future-proofing and Competition: A forward-looking "limitation" is that the technology evolves quickly. NVIDIA's own next generation (H200, and the Blackwell architecture beyond it) will improve capabilities further, possibly making FP8 ubiquitous or adding even more memory, and competing accelerators (Google's TPUs, etc.) may excel in specific cases. An expensive investment in H100 should be weighed against this; that said, for the foreseeable future the H100 remains extremely capable, and its 80 GB of memory will keep it relevant even as models continue to grow (not every task will immediately need more memory, and partitioning can always extend its usefulness).
In conclusion, when using H100 for local LLM inference, plan for its power and cooling needs, use the proper software stack, and be mindful of model sizes and precision choices. The H100 is a sophisticated tool – getting the best out of it may require equally sophisticated handling. But if one navigates these considerations, H100 can deliver unmatched results.
Sources and Citations
The information in this report was gathered from authoritative sources including NVIDIA’s official architecture whitepapers, product datasheets, technical blog posts, as well as independent analyses and benchmarking results. Key references are listed below:
- NVIDIA H100 Product Datasheet and PCIe Card Product Brief – Specifications for clocks, memory, power, and features of the H100 80GB PCIe.
- NVIDIA Hopper Architecture Whitepaper (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog) – Details on SM counts, FP32/Tensor Core counts, and new architectural features of the Hopper GH100.
- NVIDIA Developer Blog: Hopper Architecture In-Depth (NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog) – Explanation of SM improvements, the Transformer Engine, FP8 precision, and the SM diagram.
- Chips and Cheese analysis of H100 (Nvidia’s H100: Funny L2, and Tons of Bandwidth) – Insights into transistor count, die size, L2 cache, memory controllers, and power behavior under load.
- Lenovo Press: ThinkSystem H100 GPU Guide (ThinkSystem NVIDIA H100 PCIe Gen5 GPUs Product Guide > Lenovo Press) – A comparison table of H100 variants (PCIe, SXM, NVL) with performance specs and configuration details.
- Tom’s Hardware / AnandTech coverage of H100 and H100 NVL (NVIDIA Announces H100 NVL - Max Memory Server Card for Large Language Models) – Launch analysis including memory bandwidth (HBM3, 3+ TB/s), FP8 and FP16 TFLOPS, and the NVL dual-GPU design.
- NVIDIA TensorRT-LLM blog (H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token — tensorrt_llm documentation) – Benchmark showing H100 vs A100 throughput (4.6×) and 10k tokens/s at low latency with FP8.
- TRG Datacenters comparison of H100 vs H200 (NVIDIA GPUs H200 vs. H100 - A detailed comparison guide | TRG Datacenters) – Noting 21,806 tokens/s on Llama2-70B for H100 and relative performance gains.
- PNY and NVIDIA marketing for H100 NVL (NVIDIA H100 NVL | Data Center GPU | pny.com) – Claim of up to 5× speedup over A100 on Llama2 and emphasis on 188GB memory for up to 70B–200B models.
- IT Creations overview of H100 SXM (Nvidia H100 SXM5 GPU) – Mention of 30× inference speedup on LLMs and specifics such as 3.35 TB/s vs 2 TB/s bandwidth.
- Zeet.co blog on the H100 inference accelerator (NVIDIA H100 GPU: The World's Most Advanced AI Inference Accelerator | Zeet.co) – MLPerf results citing 73k inferences/s on BERT, 9× higher than previous records.
- TechPowerUp forum (via MosaicML) (NVIDIA H100 Compared to A100 for Training GPT Large Language Models | TechPowerUp Forums) – Data on cost per hour vs speedup (H100 at 2.2× the cost, 2.5–3.3× faster than A100), underlining cost-efficiency.
- NVIDIA official site: H100 product page and performance briefs (H100 Tensor Core GPU | NVIDIA) – Claims on bandwidth, NVLink Switch, and HPC speedups (7× DPX, etc.).
- Reddit and Hugging Face discussions (1,200 tokens per second for Llama 2 7B on H100! : r/LocalLLaMA) – Community findings on enabling int8 for 7B models, yielding large speedups on H100 (~1,200 tok/s).
- Academic preprint (arXiv): Dissecting the NVIDIA Hopper Architecture – Referenced for deeper architecture understanding, not directly cited in the text.