Jetson GPU Hardware Summary (for LLM Inference)
Notes: ¹Maxwell/Pascal GPUs (TX1/TX2) have no dedicated matrix cores; Pascal (TX2) supports limited INT8 via DP4A (8-bit dot-product ALU instructions) at 4× the rate of FP32 on each CUDA core (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database) – INT8 TOPS for these parts are estimated. ²FP16 and BF16 throughput on Volta/Ampere Tensor Cores is 2× the FP32 rate (except TK1–TX2, which lack Tensor Cores). Ampere (3rd-gen Tensor Cores) also supports BF16 at the same rate as FP16 (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). ³INT8 TOPS for Volta/Ampere include Tensor Core acceleration (dense INT8). “Sparse” INT8 throughput (in parentheses) is ~2× higher using structured sparsity on Ampere (supported on Orin) (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). INT4 throughput on Ampere Tensor Cores is 2× INT8 (e.g. Orin Nano 8GB up to 80 TOPS INT4). ⁴Jetson Orin modules support multiple PCIe interfaces (e.g. AGX Orin supports up to 16 lanes total split across controllers) ([PDF] NVIDIA Jetson AGX Orin Series) (NVIDIA Jetson AGX Orin Developer Kit - ASRock Industrial); here we list a typical maximal configuration per module.
1. Architecture Deep Dive
GPU Architecture & SM design: Each Jetson’s GPU corresponds to a major NVIDIA architecture generation, with differences in core design and specialized units that impact LLM inference. Early Jetsons (TK1) used Kepler, with 192 CUDA cores in a single SMX (streaming multiprocessor) and a 192KB L2 cache (NVIDIA Jetson TK1 Specs | TechPowerUp GPU Database). Jetson TX1 introduced Maxwell, doubling the core count to 256 and improving efficiency per core (NVIDIA Jetson TX1 Specs | TechPowerUp GPU Database). Maxwell SMs (Compute Capability 5.3) featured larger shared memory and an improved scheduler, but no dedicated matrix units. Jetson TX2 uses Pascal (GP10B GPU) with a similar core count (256) but on 16nm and at higher clocks (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database). Pascal SMs (CC 6.2) improved unified memory support and added INT8 dot-product (DP4A) instructions for deep learning inference (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database). The Volta architecture (Jetson Xavier series) was a leap forward: its SMs have 64 FP32 cores each, but also include 8 Tensor Cores per SM (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database). For example, Jetson AGX Xavier’s 512-core Volta GPU has 8 SMs × 64 CUDA cores, and 64 Tensor Cores total (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database). These first-gen Tensor Cores perform 4×4 matrix multiplies on FP16 inputs (or INT8) with FP32 accumulation, accelerating AI math by an order of magnitude. Finally, the Ampere architecture in the Jetson Orin series further expands SM capabilities: each SM has 128 CUDA cores plus 4 third-gen Tensor Cores (Ampere uses fewer Tensor Cores per SM than Volta, but each is far more powerful) (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). Ampere Tensor Cores support new data types (BF16, INT4) and sparsity. The flagship AGX Orin has 2048 CUDA cores (16 SMs × 128) and 64 Tensor Cores (NVIDIA Jetson AGX Orin: 275 TOPS, 2048 NVIDIA® CUDA® cores ...) (NVIDIA® Jetson AGX Orin™ Products - Connect Tech Inc.). Ampere also adds enhanced caches and RT Cores (for ray tracing), which generally don’t matter for LLM inference. Across generations, on-chip caches grew – TX2’s Pascal had 48KB of L1 per SM and 512KB of L2 (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database), Xavier’s Volta has 128KB of L1 per SM and 512KB of L2 (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database), and Orin’s Ampere GPU expands the shared L2 to several megabytes. Larger caches help feed the cores in memory-heavy tasks like transformer inference.
Generational Improvements: Each newer architecture brought significant per-clock and per-watt improvements. Maxwell (TX1) introduced a more efficient SM design than Kepler, roughly doubling performance per watt (NVIDIA Jetson TX1 Specs | TechPowerUp GPU Database). Pascal (TX2) on 16nm further improved clock speeds and added compute preemption and concurrent-execution enhancements beneficial to multitasking inference. The jump to Volta (Xavier NX/AGX) added Tensor Cores – specialized units that dramatically accelerate the matrix math used in neural network layers (NVIDIA Jetson Xavier NX 8 GB Specs | TechPowerUp GPU Database) – Jetson Xavier’s 64 Tensor Cores provide up to ~8× the deep-learning throughput of TX2’s GPU at similar power (NVIDIA Jetson Xavier NX 8 GB Specs | TechPowerUp GPU Database) (Jetson Xavier Series | NVIDIA). Ampere (Orin) improved tensor throughput per core and introduced structured sparsity: the SM can skip zero weights, effectively doubling throughput for sparse matrices (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). This is valuable for pruned LLMs – if a model is pruned 50% in Ampere’s 2:4 sparsity pattern, the Tensor Cores on Orin can process ~2× more operations per cycle. Additionally, Ampere SMs support fine-grained scheduling and cooperative groups, which can improve utilization for the irregular memory access patterns in transformer blocks.
AI Accelerators (DLA cores): Besides the GPU, some Jetsons include independent Deep Learning Accelerators (NVDLA), which are fixed-function inference engines. Jetson Xavier modules integrate dual first-generation NVDLA engines, and Jetson Orin modules integrate dual second-generation DLAs (NVIDIA® Jetson AGX Orin™ Products - Connect Tech Inc.). These DLAs are tailored for vision CNNs (convolutions) and are far less flexible for transformer models, so LLM inference typically runs on the CUDA cores/Tensor Cores rather than the DLAs. Thus, in this report we focus on the GPU’s tensor capabilities as the primary inference engine for LLMs.
2. Compute Capabilities for Low-Precision Inference
Supported Precisions: Modern LLMs benefit from lower precisions (FP16, BF16, INT8, INT4) for faster inference, and the Jetson GPUs vary in which precisions are hardware-accelerated. All Jetson GPUs natively support FP32 arithmetic (single precision) on their CUDA cores. Kepler (TK1) gains nothing from half precision, but the Maxwell GPU in TX1 and the Pascal GPU in TX2 can execute packed FP16 at 2× the FP32 rate (each core processes two FP16 values per instruction) (NVIDIA Jetson TX1 Specs | TechPowerUp GPU Database) (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database). Indeed, TX2 achieves ~1.33 TFLOPS FP16 vs 0.665 TFLOPS FP32 (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database). INT8 inference on TX2 is possible via Pascal’s 8-bit dot-product (DP4A) instruction: each CUDA core can perform four 8-bit multiplies and an accumulate in one cycle (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database). This gives a theoretical 4× speedup over FP32; in practice TX2’s GPU can hit ~2.6 INT8 TOPS (trillions of ops/sec) if fully utilizing DP4A. Maxwell (TX1/Nano) lacks DP4A, so INT8 must be emulated, making it unattractive for LLMs.
Volta and Ampere architectures fundamentally changed low-precision support with Tensor Cores. Jetson Xavier’s Volta GPU has first-gen Tensor Cores that support FP16×FP16→FP32 matrix FMA as well as INT8×INT8→INT32 accumulation (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database). Using Tensor Cores, Xavier achieves 30 INT8 TOPS (trillion ops/s) (Jetson Xavier Series | NVIDIA) – far beyond TX2’s ~2.6 TOPS – albeit restricted to matrix ops. The Tensor Cores also accelerate FP16: Xavier 16GB reaches ~2.8 TFLOPS of general-purpose FP16 (2× its FP32 rate) (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database), with substantially higher FP16 throughput on Tensor Core GEMMs, while the smaller Xavier NX manages only ~0.84 TFLOPS FP32 on its standard CUDA cores (NVIDIA Jetson Xavier NX 8 GB Specs | TechPowerUp GPU Database). Ampere (Orin) has third-gen Tensor Cores that expand data-type support to BF16, INT8, and INT4, with FP32 accumulation for the floating-point types (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). BF16 (bfloat16) is especially useful for LLMs as it offers FP32 range with 16-bit compute – Orin’s Tensor Cores execute BF16 at the same rate as FP16 (up to ~6.5 TFLOPS on AGX Orin) (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). Ampere’s INT8 Tensor throughput is well beyond Volta’s: e.g. Orin NX 16GB is rated at up to 100 INT8 TOPS (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd) – a headline figure that already counts structured sparsity³, with dense INT8 at roughly half that. Ampere Tensor Cores also accept INT4 inputs for ultra-low precision: Orin’s INT4 throughput is 2× INT8 (e.g. Orin Nano 8GB: 40 INT8 TOPS or 80 INT4 TOPS) (NVIDIA Jetson Orin Nano - SoM Overview). This can enable highly quantized LLMs (down to 4-bit) to run faster – provided the model accuracy holds up.
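To make the peak-rate relationships above concrete, the back-of-envelope sketch below recomputes a few of the quoted figures from core counts and clocks. The clock value is an assumption for illustration; real sustained clocks depend on the selected power mode.

```python
# Rough peak-throughput arithmetic for plain CUDA-core math (no Tensor Cores).
# Assumes: FMA = 2 ops/clock/core, packed FP16 = 2x FP32, DP4A INT8 = 4x FP32.

def cuda_core_tflops(cores: int, clock_ghz: float) -> float:
    """Peak TFLOPS on the CUDA cores (an FMA counted as 2 ops)."""
    return cores * clock_ghz * 2 / 1e3

# Illustrative: a TX2-class GPU, 256 Pascal cores at an assumed ~1.3 GHz.
fp32 = cuda_core_tflops(256, 1.3)   # ~0.67 TFLOPS FP32
fp16 = fp32 * 2                     # ~1.33 TFLOPS packed FP16
int8 = fp32 * 4                     # ~2.66 TOPS INT8 via DP4A

print(f"FP32 ≈ {fp32:.2f} TFLOPS, FP16 ≈ {fp16:.2f} TFLOPS, INT8 ≈ {int8:.1f} TOPS")
```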
Throughput and Utilization: The raw compute figures translate to token generation speed depending on utilization. LLM inference involves matrix multiplications (for attention and feed-forward layers) that map well to Tensor Cores, but also non-matrix ops (softmax, layernorm) that run on regular CUDA cores. For Jetsons with Tensor Cores (Xavier, Orin), maximizing their use is key: frameworks like TensorRT-LLM or NVIDIA’s FasterTransformer will use FP16/BF16 Tensor Core kernels for dense GEMMs, achieving near-theoretical throughput. For example, NVIDIA reported Jetson AGX Orin can generate ~4.4 tokens/sec with a 70B parameter LLaMA-2 model using int8 optimizations (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit) – a feat impossible on earlier Jetsons. By contrast, Jetson TX2 or Nano (no Tensor Cores) must run GEMMs in CUDA cores at FP16 or INT8, which is much slower (as we’ll see in benchmarks, a TX2 might only reach ~0.2–0.3 tokens/sec on a 7B model). The Ampere generation also introduces sparsity support: if an LLM model is pruned (with 50% zeros in weight matrices in the required 2:4 pattern), Orin’s hardware can automatically double Tensor Core throughput (NVIDIA Jetson Orin NX 8 GB Specs | TechPowerUp GPU Database). This yields up to ~1.5× actual speedup in supported layers (not quite 2× due to overhead and layers that can’t be pruned) – still a significant boost unique to Ampere GPUs.
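The 2:4 pattern itself is simple to reason about: in every group of four consecutive weights along a row, at most two may be non-zero. Below is a minimal NumPy sketch of checking and producing that pattern (illustrative only – real pruning flows use NVIDIA’s sparsity tooling in TensorRT/cuSPARSELt).

```python
import numpy as np

def satisfies_2_4_sparsity(weights: np.ndarray) -> bool:
    """True if every group of 4 consecutive values along a row has <= 2 non-zeros,
    i.e. the 2:4 structured-sparsity pattern Ampere Tensor Cores accelerate."""
    rows, cols = weights.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 groups"
    groups = weights.reshape(rows, cols // 4, 4)
    nonzeros_per_group = (groups != 0).sum(axis=-1)
    return bool((nonzeros_per_group <= 2).all())

# Toy pruning: zero the 2 smallest-magnitude weights in each group of 4.
w = np.random.randn(8, 16).astype(np.float32)
g = w.reshape(8, 4, 4)                               # view into w, groups of 4
smallest = np.argsort(np.abs(g), axis=-1)[..., :2]   # indices of 2 smallest
np.put_along_axis(g, smallest, 0.0, axis=-1)         # prune in place
print(satisfies_2_4_sparsity(w))                      # True
```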
3. Memory Subsystem Analysis
Memory Hierarchy: LLM inference is memory-bound as much as compute-bound – large models require moving millions of parameters and activations. Each Jetson’s GPU shares system memory with the CPU (unified memory architecture), which simplifies deployment but means GPU VRAM is limited to the module’s RAM. Jetson TK1 had just 2 GB DDR3, whereas modern Orin AGX has up to 64 GB LPDDR5 (Jetson AGX Orin for Next-Gen Robotics | NVIDIA). Bandwidth has similarly jumped from ~15 GB/s on TK1 (FFT is slower on Jetson TK1? - cuda - Stack Overflow) to 204.8 GB/s on AGX Orin (Jetson AGX Orin for Next-Gen Robotics | NVIDIA). This bandwidth directly impacts LLM throughput: e.g. Jetson Xavier NX (59 GB/s) cannot feed its GPU as fast as AGX Orin can, so on large matrix multiplies Orin spends less time waiting on memory.
Jetson GPUs employ a two-level cache: a small L1 per SM and a shared L2. Newer architectures increased cache sizes and improved memory compression, with Orin’s Ampere GPU carrying a much larger L2 (on the order of 4 MB) than the 512KB of earlier Jetsons (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database) (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database). NVIDIA GPUs use lossless compression for framebuffer data; this may not apply to LLM tensors, but on-chip reuse via L1/L2 is crucial. For instance, when a batch of tokens is processed, the same weight-matrix tiles can be reused from L2 for multiple tokens, increasing effective bandwidth. Ampere’s larger L2 helps keep frequently used weights (like attention projection matrices) on chip during compute.
Memory Capacity and Model Size: The memory size of each Jetson dictates the largest LLM that can be inferred fully on the GPU (without offloading). As a rule of thumb, a dense FP16 model needs ~2 bytes per parameter for the weights alone, plus headroom for activations and the KV cache. Jetson Nano/TX1 with 4 GB can only host ~1–2 billion-parameter models at best (and even that may require 8-bit weights). In contrast, Jetson AGX Orin 64 GB can hold a 65B-parameter model in 8-bit quantization (70B LLaMA-2 in INT8 is ~70 GB of weights, which only fits with additional compression such as weight sparsity) – indeed NVIDIA demonstrated 70B on Orin with int8 sparsity (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit). For mid-size models (6B, 7B, 13B), devices like Xavier NX 8GB often need 4-bit quantization to fit the model entirely in memory. If the model exceeds physical RAM, layers must be streamed in from storage or swap; note that Jetson’s unified memory means there is no separate host RAM to offload to – the GPU already addresses the same DRAM as the CPU (no PCIe transfers) – so swapping layers in and out still incurs a large performance hit and is avoided for real-time inference.
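A quick way to sanity-check whether a given model fits a given module is to compute the weight footprint at each precision; the sketch below implements the rule of thumb above (weights only – activations and the KV cache need additional headroom).

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of the weights alone; activations, KV cache and runtime
    overhead typically add a further ~10-30% on top."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (7, 13, 30, 70):
    for bits in (16, 8, 4):
        print(f"{params:>3}B @ {bits:>2}-bit ≈ {estimate_weight_gb(params, bits):6.1f} GB")
```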
Memory Bandwidth vs Token Generation: Large language models, especially with long sequences, are bandwidth-hungry. Each transformer layer reads its weights (multiple GB for big models) again for every generated token, so memory throughput often limits tokens/sec once the GPU compute is beefy enough. For example, on a 7B model quantized to 4-bit, Jetson Xavier NX (59 GB/s) might reach ~1.5 tokens/sec (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot), whereas Orin NX (102 GB/s) might do ~3 tokens/sec given more bandwidth to stream weights. Ampere’s support for memory-efficiency techniques like quantization (which increases effective bandwidth by using fewer bytes per weight) and sparsity (fewer weights to load) is crucial for LLMs on memory-limited devices. By using 4-bit or 8-bit weights, we cut memory-bandwidth needs by 2–4×, often with minor loss in model quality – this is effectively mandatory on Jetsons. NVIDIA provides tools (TensorRT, PyTorch quantization-aware training) to quantize models to INT8/INT4 compatible with Tensor Cores, ensuring models like GPT-2, BERT, and LLaMA-class LLMs can run within Jetson memory budgets.
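Because every generated token has to stream (roughly) the whole weight set during memory-bound decoding, the module’s DRAM bandwidth puts a hard ceiling on tokens/sec. The sketch below computes that ceiling; real numbers on these devices land far below it because compute, attention over the KV cache, and framework overhead all add time.

```python
def bandwidth_bound_tokens_per_sec(mem_bw_gb_s: float,
                                   params_billion: float,
                                   bits_per_weight: int) -> float:
    """Upper bound on autoregressive decode speed when each token must stream
    the full weight set from DRAM (ignores caches, KV reads and compute)."""
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

# Illustrative: 7B model in 4-bit on Xavier NX (~59 GB/s) vs Orin NX (~102 GB/s)
print(bandwidth_bound_tokens_per_sec(59, 7, 4))    # ~17 tokens/s ceiling
print(bandwidth_bound_tokens_per_sec(102, 7, 4))   # ~29 tokens/s ceiling
```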
4. Performance Benchmarks – LLM Inference
Throughput (tokens per second): The true test of these specs is actual LLM generation speed. Smaller models (1.3B–6B parameters) can run on older Jetsons, albeit slowly. For instance, a 6B GPT-J on Jetson TX2 might generate under 0.2 tokens/sec based on community experiments (and only with heavy quantization – it is limited by 8 GB of shared RAM and the lack of Tensor Cores). Jetson Xavier NX, with 21 TOPS of AI compute, achieves around 0.5–1 tokens/sec on a 7B LLaMA in 4-bit quantization using GPU acceleration (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot). The 2-bit quantization tests on LLaMA-7B in one report showed up to 1.8 tokens/sec on AGX Orin 64GB (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot), indicating Orin’s advantage in low-precision throughput. In NVIDIA’s own demo, Jetson AGX Orin (64GB) ran LLaMA2-13B (GPTQ 4-bit) at ~1.6 tokens/sec (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot), and even sustained ~1.5 tokens/sec on 7B with more aggressive (2-bit) quantization across longer responses (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot).
For the largest models, NVIDIA’s optimized frameworks (TensorRT-LLM, NeMo) can get surprisingly usable speeds on Orin. As mentioned, Llama2-70B (int8 sparsity optimized) reached ~4.4 tokens/sec on AGX Orin (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit) – roughly 20× faster than Xavier NX could do if it even fit the model. This shows how Ampere’s features (large memory, INT8, sparsity) combine to enable what was previously impossible on edge devices. Another datapoint: users have compared Jetson Orin with desktop GPUs – one Reddit user noted Orin 64GB got ~4 tokens/sec on 70B vs ~20 tokens/sec on an RTX 3090 for the same model (3090 has more compute and bandwidth, but uses 300W vs Orin’s 50W) (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit). For medium models (13B), AGX Orin can comfortably reach 2–3 tokens/sec with int8 or 4-bit, whereas Xavier AGX (~32 TOPS) might only manage ~1 token/sec or less on the same model due to lower memory and lack of BF16.
Latency: For single-token steps (batch size 1, generating one token at a time), Jetson devices have higher latency than server GPUs because of lower clocks and memory throughput, but for generating a long sequence, throughput (tokens/sec) is the more relevant metric. Jetson Orin’s high TOPS help keep per-token latency reasonable. Reports indicate Orin AGX can produce the first token in a few seconds for a 7B model and then stream subsequent tokens at ~0.5–2 per second depending on quantization (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot). Xavier NX might take ~10s for the first token on a 7B model and then ~1–2s per token after. These latencies are acceptable for some interactive applications (a response within a few tens of seconds), but far from real-time for very large models. Techniques like batching multiple tokens or sequences can improve GPU utilization – Jetson GPUs do support batching, though limited memory keeps batch sizes small. In tests, batching to 2 or 4 can improve Orin’s aggregate tokens/sec slightly (until memory bandwidth becomes the bottleneck).
Real-World Example: A Jetson Orin Nano (the smallest Orin) running the 7B Alpaca model (an instruction-tuned LLaMA) quantized to 4-bit (INT4) can achieve about 0.5–0.6 tokens/sec at batch size 1 (per anecdotal community results). This is slow but workable for short answers. On the other hand, the same model on Jetson AGX Orin might do ~2 tokens/sec at int8, providing a much smoother experience. It’s important to note that software optimizations (using TensorRT engines, quantization, and efficient attention kernels) play a huge role – a naïve PyTorch implementation can be >5× slower than an optimized one on the same hardware (LLMs token/sec - Jetson AGX Orin - NVIDIA Developer Forums). NVIDIA’s Jetson AI Lab provides containers (e.g. the MLC (TVM) backend (LLMs token/sec - Jetson AGX Orin - NVIDIA Developer Forums) and TensorRT) that are tuned for LLM inference on Jetson, achieving significantly better throughput than generic CPU-bound methods like llama.cpp (which may only use a few ARM CPU cores). For example, one NVIDIA developer confirmed that using Ollama (llama.cpp) on Orin yields roughly half the performance of the optimized MLC backend (LLMs token/sec - Jetson AGX Orin - NVIDIA Developer Forums).
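For reference, a minimal way to measure tokens/sec yourself is a timed generate loop; the sketch below uses a plain Hugging Face FP16 baseline (the model id is a placeholder – substitute whatever fits your module’s memory; optimized backends such as MLC or TensorRT-LLM will be substantially faster than this).

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- pick something that fits the module's RAM.
model_id = "meta-llama/Llama-2-7b-hf"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda").eval()

inputs = tok("Explain what a Jetson module is.", return_tensors="pt").to("cuda")

with torch.inference_mode():
    model.generate(**inputs, max_new_tokens=8)      # warm-up pass
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / (time.time() - start):.2f} tokens/sec")
```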
5. Thermal and Power Efficiency
Power Consumption under Load: Jetson modules are designed with specific power envelopes (e.g. Nano ~5–10 W, NX ~15 W, AGX Orin up to 50–60 W) (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd) (Jetson AGX Orin for Next-Gen Robotics | NVIDIA). LLM inference tends to drive the GPU to near-full utilization (especially when using Tensor Cores). In sustained tests, Jetson Nano and TX2 will throttle if they exceed their thermal design – for instance, Nano’s module often hovers around 8 W when running a model, and its small heatsink can reach ~80°C, at which point clocks may scale down. The larger Jetsons have active cooling (fans) and higher TDP headroom. Jetson AGX Xavier can draw ~30 W when running a heavy CNN or transformer load at max clocks (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd). Jetson AGX Orin, when configured to 60 W mode, will use the full budget for intense LLM tasks: measurements show ~50–55 W sustained during 70B int8 inference (with the GPU ~98% busy) (Jetson AGX Orin for Next-Gen Robotics | NVIDIA). The efficiency is notable – ~4 tokens/sec on 70B at 55 W is still far better in joules per token than many older GPUs. For comparison, Xavier 32GB (30 W) might manage only ~0.3 tokens/sec on 70B (if it could run it at all), leading to much higher energy per token.
Thermal Management and Throttling: All Jetsons use dynamic frequency scaling to stay within thermal limits. If a workload is heavy but below the TDP, the module will ramp clocks to maximum; if temperature approaches the limit (usually ~80–85°C for Jetson SoCs), the Jetson will throttle core clocks down. In practice, one must ensure adequate cooling (heatsinks, fans) when running LLM inference continuously. Jetson AGX Orin Dev Kits come with a large heatsink-fan assembly and can maintain 60 W operation out of the box. The smaller Orin NX and Nano modules rely on the carrier board’s cooling solution; they may throttle in confined spaces. Empirical data shows Jetson Xavier NX running an LLM at 15 W can hit ~75°C; it will start throttling around 80°C, reducing clocks (and thus inference speed). It’s recommended to use the Jetson power modes (nvpmodel) to cap clocks if needed – e.g. running Orin NX in 15 W mode instead of 20 W can keep it from overheating in fanless setups.
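A simple way to watch for the throttle point during a long generation run is to poll the standard Linux thermal zones exposed under sysfs (present on Jetson; zone names and indices vary by module, so this is a generic sketch – stop it with Ctrl-C).

```python
import glob
import time

def read_temps_c() -> dict:
    """Read all Linux thermal zones; match on the 'type' file because
    zone numbering differs between Jetson modules and JetPack versions."""
    temps = {}
    for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
        try:
            name = open(f"{zone}/type").read().strip()
            millic = int(open(f"{zone}/temp").read().strip())
            temps[name] = millic / 1000.0
        except (OSError, ValueError):
            continue
    return temps

while True:
    temps = read_temps_c()
    hottest = max(temps.values(), default=float("nan"))
    print(f"hottest zone: {hottest:.1f} °C  {temps}")
    if hottest > 80.0:   # typical Jetson throttle region, per the text above
        print("warning: approaching throttle temperature")
    time.sleep(5)
```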
Performance-per-Watt: Jetson GPUs are designed for efficiency. While their absolute performance lags discrete GPUs, their perf/W is often excellent for AI inference. For instance, Xavier NX (21 TOPS at 15 W) gives ~1.4 TOPS/W (INT8), and Orin NX (100 TOPS at 25 W) is ~4 TOPS/W – in line with or better than many desktop GPUs of their time (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd). In LLM tasks, Orin’s efficiency shines: at ~50 W it matches a desktop RTX 2080 (225 W) on some INT8 transformer benchmarks, a >4× efficiency gain. This matters in edge deployments where power is limited (drones, robots). It’s also noteworthy that running LLMs at lower precision greatly increases perf/W – e.g. shifting from FP16 to INT8 can nearly double throughput with only a modest extra power draw (Tensor Cores are very power-efficient). On TX2, using INT8 (DP4A) instead of FP32 can show ~1.5–1.8× more tokens/sec for roughly the same power, meaning less energy per token. Overall, Ampere Jetsons provide the best perf/W for LLMs thanks to aggressive silicon optimizations and the 8 nm process – Orin Nano achieves up to ~5.7 TOPS per watt (40 TOPS at 7 W) in INT8 workloads (NVIDIA Jetson Orin Nano - SoM Overview), though real LLM inference won’t always hit that peak.
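Energy per token is just power divided by throughput; plugging in the approximate figures quoted earlier in this report gives a feel for the comparison (illustrative numbers, not measurements).

```python
def joules_per_token(power_watts: float, tokens_per_sec: float) -> float:
    """Average energy spent per generated token."""
    return power_watts / tokens_per_sec

# Approximate figures from earlier sections (70B-class model):
print(joules_per_token(55, 4.0))    # AGX Orin @ ~55 W, ~4 tok/s   -> ~14 J/token
print(joules_per_token(300, 20.0))  # RTX 3090 @ ~300 W, ~20 tok/s -> ~15 J/token
```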
6. Comparative Analysis with Other GPUs
When evaluating Jetson GPUs for LLM inference, it’s helpful to compare them to other edge accelerators and even some server GPUs:
- Versus Desktop NVIDIA GPUs: In raw performance, even AGX Orin (2048 cores) is closer to a mid-range discrete GPU (e.g. RTX 3050 Ti) than to a server A100. A desktop RTX 3060 (3584 CUDA cores, 12 GB GDDR6) will handily outperform Jetson Orin in throughput – for example, an RTX 3060 can generate ~8–10 tokens/sec on a 13B model where Orin might do ~3–4 tokens/sec – but the desktop card uses 170 W and needs a PC chassis. Jetsons focus on efficiency and integration (CPU+GPU+ML accelerators in one). If pure performance is needed (and power/cost are lesser concerns), a desktop GPU or cloud instance will outrun any Jetson for LLMs. However, Jetsons offer competitive performance/$ for the edge. At the high end, Jetson AGX Orin ($2000 with 64GB) can be more cost-effective than an equivalently memory-rich discrete GPU system (GPUs with 48–80GB like the RTX A6000 or A100 cost far more). Lower-end Jetsons (Nano at <$150) vastly outperform CPUs or Raspberry Pi-class devices on AI tasks in the same price range.
- Versus Other Edge AI Chips: Competing edge AI accelerators (like Google Coral EdgeTPU, Intel Movidius, even Apple’s Neural Engine) are generally geared towards CNNs and offer limited or no support for custom LLMs. Jetson’s GPU approach is more flexible – essentially providing a mini CUDA-capable GPU that can run standard transformer frameworks (PyTorch, ONNX Runtime, etc.) with GPU acceleration. For example, an 8 W Coral EdgeTPU can do 4 TOPS on quantized CNNs but cannot run a Transformer-XL language model easily due to memory and programmability constraints. In contrast, Jetson Nano (10 W) can run a small Transformer in PyTorch (slowly, but it works), and Orin NX can run fairly large transformers with TensorRT. Compared to laptop GPUs (which could be considered “edge”), Jetsons hold up well in perf/watt, but laptop GPUs often have higher absolute performance. It’s fair to say Jetson AGX Orin is among the most powerful self-contained edge AI modules for LLM inference, only really rivaled by much larger systems or upcoming specialized LLM accelerators.
- Cost-Performance: Jetson modules vary from <$200 (Nano) to a few thousand dollars (AGX Orin). For LLM tasks, one should target at least the Xavier NX or Orin NX class for meaningful performance. Jetson Nano, while cheap, struggles with anything but the smallest models (its 4GB RAM is a severe limiter). TX2 is EOL and its cost-performance is overtaken by Nano and Xavier NX. Xavier NX (approx. $400) offers a good balance – it can run 6B models reasonably and 13B with heavy quantization. Orin NX ($600–$800) and Orin Nano ($300) further improve performance; they effectively make previous Jetsons obsolete at similar price points (e.g. Orin Nano 8GB outperforms TX2 while costing less). If one’s budget allows, AGX Orin is unmatched for on-device LLMs due to its 32–64GB memory – but it is expensive. In scenarios where inference can be done on a cloud GPU for a fraction of the cost, one must weigh recurring cloud costs vs. a one-time Jetson hardware cost. Many robotic and embedded applications value local processing (for privacy and latency), making Jetson’s cost worthwhile.
In summary, Jetson GPUs occupy a unique niche: they won’t beat a datacenter GPU in speed, but they deliver adequate LLM performance at a fraction of the power, and with the integration and ruggedness needed for edge deployment (something a PCIe GPU + CPU combo can’t easily provide in the field).
7. Optimization Techniques & Software Support
CUDA and TensorRT: All Jetson GPUs run NVIDIA’s CUDA toolkit, which means popular ML frameworks (PyTorch, TensorFlow) have full GPU acceleration support on Jetson (with the appropriate builds). For LLM inference, frameworks can leverage Tensor Cores via libraries like NVIDIA TensorRT and cuBLAS. In fact, NVIDIA provides TensorRT-LLM (an extension of TensorRT optimized for transformer blocks) aimed at squeezing maximum inference speed on Ampere and newer GPUs (NVIDIA/TensorRT-LLM - GitHub). This can automatically apply tactics like layer fusion, quantization, and kernel auto-tuning specifically for large transformers. Users have demonstrated converting LLaMA and other models to TensorRT engines on Jetson Orin, yielding significant speedups (2–3× faster than raw PyTorch FP16 in some cases). On the software side, Jetson’s L4T (Linux for Tegra) comes with NVIDIA’s drivers enabling cuBLASLt (with tensor op support), cuDNN (for some RNN/LSTM support), and CUDA Graphs (which can reduce overhead for repetitive inference calls).
Framework Compatibility: Virtually all major AI frameworks are available for Jetson (often using the same CUDA kernels as desktop, just compiled for ARM64). PyTorch on Jetson supports FP16 GPU operations (and, with recent JetPack releases, optimized kernels such as FlashAttention to better handle long-sequence attention on GPU). For deploying models, options include ONNX Runtime with the CUDA execution provider, TensorFlow with XLA, or specialized runtimes like NVIDIA NeMo which can optimize LLMs for Jetson. Importantly, Jetsons support ONNX export and TensorRT FP16/INT8 conversion – one common workflow is to train or fine-tune a model on a powerful GPU, export it to ONNX, then build a TensorRT engine on the Jetson for inference (possibly with INT8 calibration using representative data). NVIDIA also offers the TAO Toolkit for optimizing models on Jetson, though it is more CV-focused. Community projects like Nvidia-Jetson-LLaMa provide scripts to quantize and run LLaMA on Jetson using INT8 and FP16 with good results.
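The mechanics of that export-then-deploy workflow look roughly like the sketch below (a toy single layer stands in for a real model; full LLM pipelines also need dynamic shapes and KV-cache handling, which tools like TensorRT-LLM or Optimum manage for you, and the CUDA execution provider requires a Jetson-compatible onnxruntime-gpu build).

```python
import torch
import onnxruntime as ort

# Toy stand-in for a real model layer (placeholder dimensions).
model = torch.nn.Linear(4096, 4096).eval()
example = torch.randn(1, 4096)

# 1) Export to ONNX (on the training machine or on the Jetson itself).
torch.onnx.export(model, example, "block.onnx",
                  input_names=["x"], output_names=["y"],
                  dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}})

# 2) On the Jetson, run it with the CUDA execution provider
#    (falls back to CPU if the CUDA EP is not available in this build).
sess = ort.InferenceSession("block.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
out = sess.run(["y"], {"x": example.numpy()})[0]
print(out.shape)
```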
Quantization & Pruning: As noted, quantization is perhaps the single most effective optimization for LLMs on Jetson. Running models in 8-bit or 4-bit can be done post-training (with minimal loss using techniques like GPTQ or AWQ). Jetson GPUs (Pascal and later) handle INT8 well – Pascal via DP4A and Volta/Ampere via Tensor Cores. One caveat: Pascal (TX2) does not natively support INT8 in TensorRT because it has no Tensor Cores, but you can still use DP4A through custom kernels or Torch INT8 quantization APIs (with lower efficiency). Volta and Ampere Jetsons fully support INT8 in TensorRT. For 4-bit, there is no “native” support outside Ampere’s Tensor Cores, which execute INT4 at twice the INT8 rate. Frameworks like nanoGPT and FasterTransformer can be modified to use 8-bit on Jetson Orin – indeed, FasterTransformer includes INT8 support that runs on Jetson (Volta and later). Additionally, employing mixed precision (FP16 for large matmuls, FP32 for residual accumulations) is straightforward with NVIDIA’s Automatic Mixed Precision (AMP) tools, and Jetson’s GPUs support FP16 multiply with FP32 accumulate, which helps maintain precision.
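Mixed precision with AMP is essentially a one-line change at inference time; a minimal sketch with a toy feed-forward block (real LLM layers behave the same way under autocast):

```python
import torch

# Toy transformer-style feed-forward block; dimensions are illustrative.
layer = torch.nn.Sequential(
    torch.nn.Linear(2048, 8192),
    torch.nn.GELU(),
    torch.nn.Linear(8192, 2048),
).cuda().eval()

x = torch.randn(1, 16, 2048, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    # The large matmuls run in FP16 (on Tensor Cores where present); PyTorch
    # keeps numerically sensitive ops in FP32 automatically under autocast.
    y = layer(x)

print(y.dtype)   # torch.float16
```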
Software optimizations: Jetson environments allow many PC-like optimizations: using CUDA streams to overlap data transfers with compute, using batched GEMM APIs in cuBLAS/cuBLASLt to group small matrix multiplies (important when serving multiple smaller queries), and leveraging page-locked (“pinned”) memory for faster CPU–GPU transfers when needed. Another technique relevant to LLMs is kernel fusion (merging elementwise ops with GEMMs to save memory bandwidth); tools like TorchScript or ONNX GraphSurgeon can fuse activations where possible. NVIDIA’s Transformer Engine library (used on Hopper GPUs) isn’t officially available on Jetson, but its headline feature (FP8) isn’t applicable to Ampere anyway. Jetson developers typically use nvpmodel, jetson_clocks, and tegrastats to monitor utilization and pin clocks/DVFS so the GPU stays at its optimal frequency for sustained inference.
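A minimal sketch of two of those techniques – a pinned host buffer plus a side CUDA stream – showing how a host-to-device copy can overlap with compute on the default stream (PyTorch API; tensor sizes are arbitrary):

```python
import torch

assert torch.cuda.is_available()

copy_stream = torch.cuda.Stream()

# Page-locked ("pinned") host buffer enables a truly asynchronous H2D copy.
host_batch = torch.randn(16, 4096).pin_memory()
device_batch = torch.empty_like(host_batch, device="cuda")

weight = torch.randn(4096, 4096, device="cuda")
current = torch.randn(16, 4096, device="cuda")

with torch.cuda.stream(copy_stream):
    # Stage the *next* batch on a side stream while the default stream computes.
    device_batch.copy_(host_batch, non_blocking=True)

# Compute on the default stream overlaps with the copy above.
out = current @ weight

# Make the default stream wait for the copy before consuming device_batch.
torch.cuda.current_stream().wait_stream(copy_stream)
next_out = device_batch @ weight
print(out.shape, next_out.shape)
```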
In terms of model-specific tricks: one can reduce sequence length to save memory (generate shorter batches more frequently), use the KV cache in transformers to avoid recomputation (NVIDIA’s code examples for GPT-2 on Jetson demonstrate using the attention cache effectively), and even offload parts of the model to the CPU if needed (though this is a last resort – the GPU is much faster). Some have explored distilling LLMs into smaller models that Jetson can handle better (e.g. a 70B model distilled down to 7B). These are higher-level optimizations beyond hardware, but worthwhile when targeting Jetson deployment.
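The KV-cache trick amounts to appending each new token’s key/value projections to a running cache instead of recomputing them for the whole prefix every step; Hugging Face’s generate() does this automatically via use_cache/past_key_values, but the bare idea fits in a few lines (single head, random projections, purely illustrative):

```python
import torch

torch.manual_seed(0)
d = 64  # head dimension

def attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / d**0.5
    return torch.softmax(scores, dim=-1) @ v

k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):
    # New token's projections (stand-ins for Wk/Wv/Wq applied to the new hidden state).
    k_new, v_new, q_new = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)

    # Append to the cache instead of recomputing K/V for the entire prefix.
    k_cache = torch.cat([k_cache, k_new], dim=0)
    v_cache = torch.cat([v_cache, v_new], dim=0)

    out = attention(q_new, k_cache, v_cache)   # attend over all cached positions
    print(step, out.shape)                     # (1, 64) each step
```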
8. Scaling and Multi-GPU Considerations
Jetson modules are typically used stand-alone, but it’s possible to scale out or use multiple in one system. The Jetson AGX Orin developer board, for example, has one module; to scale, one could network multiple Jetson boards or use multiple Orin modules on a custom carrier with multiple sockets (though NVIDIA doesn’t offer multi-socket Jetson boards). Multi-GPU for LLMs on Jetson is therefore uncommon but conceivable by clustering. Using distributed inference across two Jetson AGX Xaviers (each handling half the layers of a model) would be limited by the interconnect: each module has only PCIe or Ethernet to communicate. PCIe bandwidth on Jetson (e.g. 4.0 x4 ~ 8 GB/s) is much lower than NVLink or the bandwidth inside a multi-GPU server, so partitioning a model between Jetsons would incur significant latency passing activations over PCIe or network. It’s generally more efficient to use a single higher-memory Jetson (like AGX Orin 64GB) than two smaller ones splitting a model.
However, for pipeline parallelism or ensemble scenarios, you can use multiple Jetsons: e.g. one Jetson generating candidates, another doing refinement, etc., or splitting batch requests among devices. The NVIDIA Triton Inference Server can run on Jetson and manage multiple models on multiple devices (with caveats for ARM builds). So an edge server with, say, 4× Orin NX modules could use Triton to assign different LLM requests to different modules.
CPU–GPU Bottlenecks: On Jetson, the CPU and GPU share memory, which avoids the PCIe copy bottleneck present on PCs (copying data from host to GPU). This unified memory is a big advantage when the CPU needs to supply input data (tokens, embedding tables) to the GPU. The Jetson CPUs (ARM A57 in TX1, Carmel in Xavier, Cortex-A78AE in Orin) can sometimes bottleneck if they prepare data for the GPU too slowly. For example, tokenization or sampling new tokens on the CPU might lag behind the GPU’s ability to produce logits. Usually one core is enough for token post-processing, but heavy CPU-side decoding algorithms (like beam search with complex scoring) can make the CPU the bottleneck. It’s often advisable to use simpler decoding (greedy or a small beam) or to offload sampling to the GPU if possible (there are GPU kernels for top-k sampling, for instance); the shared memory helps here too – the GPU can read input embeddings directly from the common DRAM. In practice, CPU utilization on Jetson during LLM inference is relatively low (the GPU does most of the work), but if running concurrent CPU tasks or multiple models, one must ensure the ARM cores have enough headroom.
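Keeping sampling on the GPU is straightforward with standard tensor ops – a minimal sketch of top-k sampling in which only the chosen token id (not the full logits vector) ever crosses back to the CPU:

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 0.8) -> int:
    """Top-k sampling computed entirely on the GPU; only the selected
    token id is transferred back to the CPU for the next decode step."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)        # stays on device
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

logits = torch.randn(32000, device="cuda")              # fake vocabulary logits
print(sample_top_k(logits))
```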
Memory Distribution: Unlike a PC, the Jetson’s RAM is shared, so if the OS or other processes use a lot of memory, less is available for the LLM. In production it’s wise to run the Jetson headless (no desktop GUI eating RAM) and to minimize background processes. It’s not uncommon to see out-of-memory errors when trying to load a model that exactly matches the module’s RAM, because some portion is always used by the system. Tools like zram swap can extend memory (by compressing inactive pages), but swapping to eMMC will drastically slow inference if it occurs during critical sections.
9. Limitations and Considerations
Memory Constraints: The most obvious limitation on Jetsons for large LLMs is memory size. While Orin 64GB is generous, anything larger (like GPT-3 175B, which is ~350 GB in 16-bit) is far beyond a single Jetson. Even a 30B model in 16-bit would need ~60 GB just for weights – only the 64GB Orin can approach that, and even then you’d need int8 or offloading. This means for truly giant models, Jetson will have to run a reduced or quantized version. Techniques like model offloading (keeping some layers on CPU) can extend capability – e.g. running lower transformer layers on GPU and final layers on CPU if GPU RAM is full – but with a big latency hit.
Compute vs Model Size Balance: Jetson Nano or TX2 might handle a 1B-param model in FP16, but their GPU compute is so low that inference will be very slow (minutes per output). Conversely, Jetson Orin has plenty of compute, but if it’s loaded with a model so large that it must rely on slow dequantization paths or on streaming weights from storage, the compute sits under-utilized waiting on memory. So it’s important to right-size the model to the Jetson. Empirically, Xavier NX (8GB) runs 6B models okay, Orin NX (16GB) runs 13B okay, and Orin AGX (32–64GB) can push to ~30B (with int8) comfortably. Trying a 70B on a 32GB Orin would involve paging unless using disk offload with something like HuggingFace’s accelerate, which would be extremely slow (likely <0.1 tokens/sec).
Thermal Throttling and Sustained Performance: We touched on thermal throttling – in high ambient temperatures or enclosed deployments, Jetsons might not sustain their advertised performance. The industrial versions (TX2i, Xavier Industrial, Orin Industrial) are rated for higher temperatures but often run at slightly lower clocks to stay within the thermal envelope (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd). When planning to deploy an LLM on a Jetson, one must budget for a good heatsink and fan. The developer kits are reference designs; in custom deployments – especially with an Orin Nano that may be fanless – running near maximum performance continuously can overheat the module without adequate airflow.
Precision vs Accuracy: Aggressive quantization (like 4-bit) is often needed on Jetson for larger models, but it can impact the quality of LLM outputs; a 13B model in 4-bit might lose some fluency or correctness. Users should test, and if possible fine-tune, quantized models (quantization-aware training) to regain accuracy. Jetsons support fine-tuning as well – one could fine-tune a smaller LLM on the Jetson itself, though training even a 6B model on Xavier NX, while possible, is very slow. More practically, fine-tuning is done offline on bigger GPUs, and the Jetson only does inference.
Compatibility: Another limitation is software compatibility and updates. Jetson Linux (JetPack) often lags behind the latest CUDA versions. For example, as of 2023, Jetson Orin supports CUDA 11/12 but some newer libraries might not be immediately available or optimized. Users sometimes have to build PyTorch from source for ARM to get the latest features (which can be non-trivial). Ensuring the LLM code (often written/tested on x86) runs on aarch64 can involve some trial – many Python wheels are available, but not all. However, the gap has been closing, and NVIDIA’s container ecosystem (NGC containers for Jetson) provides ready-to-run environments for many AI tasks, reducing this friction.
I/O and Integration: Jetson modules integrate not just GPU but also CSI camera inputs, ISP, etc. While not directly related to LLMs, if an application is multi-modal (say using camera input to inform a language model), the Jetson can handle both – for example, running image recognition on the GPU and an LLM on the same GPU. This combined use can strain resources (memory bandwidth especially). It’s important to profile and perhaps serialize tasks (run vision then language sequentially if they contend for GPU). Some Jetsons have multiple engines (GPU + DLA) so one could offload a vision CNN to DLA while GPU runs the LLM. In short, sharing the GPU with other tasks is a consideration – an LLM generating text might slow down if, say, a camera pipeline suddenly uses half the GPU for object detection.
Future Outlook: Newer architectures (e.g. NVIDIA Hopper, Ada) promise even better transformer performance (FP8 precision, larger memory). While those aren’t in Jetson yet, an eventual “Jetson Orin successor” with Lovelace or Hopper GPU could further boost local LLM capability. That said, the current Jetson Orin already enables use-cases that were science fiction a few years ago – e.g. a battery-powered robot with on-board 30B param language model for dialogue and reasoning. Users must operate within the constraints (optimize models heavily, monitor thermals), but the combination of Jetson hardware and optimized software (TensorRT, quantization, sparsity) makes on-device LLMs not only possible, but practical for many applications.
Conclusion: NVIDIA Jetson GPUs, from the early TK1 to the powerful Orin, showcase the progression of mobile GPU technology towards AI inference. For LLM inference, each generation unlocked new possibilities: TX2 brought basic INT8, Xavier introduced Tensor Core acceleration, and Orin delivers the compute and memory to handle fairly large models. While not a replacement for datacenter GPUs in raw capability, Jetsons enable self-contained, real-time AI at the edge. By carefully selecting models and leveraging mixed precision and optimization toolchains, developers have successfully deployed chatbots, summarizers, and language-driven robotics applications on Jetson devices. The analyses of architecture, memory, and benchmarks in this report should guide choosing the right Jetson and approach for a given LLM workload. In sum, running an advanced LLM locally requires squeezing maximum performance from the Jetson’s GPU (through quantization, efficient code, and so on) and staying within its power and memory limits – but with the latest Jetson Orin, we now have the hardware to do so for models in the tens-of-billions range, truly bringing AI language models to the edge.
Sources
- NVIDIA Jetson TX1/TX2 Official Module Specs – NVIDIA (2016) (NVIDIA Jetson TX1 System-On-Module) (NVIDIA Jetson TX2 Specs | TechPowerUp GPU Database)
- NVIDIA Jetson Xavier Series Datasheet/Website – NVIDIA (2018) (Jetson Xavier Series | NVIDIA) (Jetson Xavier Series | NVIDIA)
- NVIDIA Jetson Orin Series Technical Brief – NVIDIA (2022) (Jetson AGX Orin for Next-Gen Robotics | NVIDIA) (Jetson AGX Orin for Next-Gen Robotics | NVIDIA)
- TechPowerUp GPU Database – Jetson TK1/TX1/TX2/Xavier/Orin Specs (2015–2023) (NVIDIA Jetson TK1 Specs | TechPowerUp GPU Database) (NVIDIA Jetson AGX Xavier 16 GB Specs | TechPowerUp GPU Database)
- Ridgerun Developer Blog – Jetson Orin Nano SoM Overview (Oct 2022) (NVIDIA Jetson Orin Nano - SoM Overview) (NVIDIA Jetson Orin Nano - SoM Overview)
- Seeed Studio Tech Blog – Jetson Orin Nano/NX and Xavier NX comparison (2023) (NVIDIA® Jetson Comparison: Nano, TX2 NX, Xavier NX, AGX, Orin - Seeed Studio Product Catalog) (NVIDIA® Jetson Comparison: Nano, TX2 NX, Xavier NX, AGX, Orin - Seeed Studio Product Catalog)
- DFRobot Test Report – Jetson AGX Orin running LLaMA-2 7B/13B (Aug 2023) (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot) (NVIDIA Jetson AGX Orin Large Language Model LLaMA2-7b and LLaMA2-13b Performance Test Report - DFRobot)
- NVIDIA Developer Forums – LLM tokens/sec on Jetson Orin discussion (Apr 2024) (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit) (LLMs token/sec - Jetson AGX Orin - NVIDIA Developer Forums)
- BVM Embedded – NVIDIA Jetson Comparison (Nano, TX2, Xavier, Orin) (2023) (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd) (NVIDIA Jetson Comparison and FAQ: Orin, Xavier, TX2 and Nano - BVM Ltd)
- Reddit r/LocalLLaMA – Jetson AGX Orin 70B LLaMA2 performance thread (2023) (Is the Nvidia Jetson AGX Orin any good? : r/LocalLLaMA - Reddit)