NVIDIA TensorRT-LLM
An open-source library that compiles and optimizes large language models for fast inference on NVIDIA GPUs.

NVIDIA TensorRT-LLM is an open-source library that optimizes large language model inference on NVIDIA GPUs. It takes a trained model and applies GPU-specific techniques - custom kernels, quantization, in-flight batching, and a paged KV cache - so the model serves more requests per second at lower cost. It solves a common problem: a model that runs correctly in a research notebook is often too slow and too expensive to serve in production without hardware-level tuning.
The library is built for NVIDIA hardware only. It targets data-center GPUs such as H100, H200, and B200, and it supports single-GPU, multi-GPU, and multi-node deployments through tensor, pipeline, and expert parallelism. Released under the Apache 2.0 license, it is the optimization layer behind NVIDIA’s higher-level serving products, including NVIDIA NIM microservices and the Triton Inference Server.
Where it sits in the stack
TensorRT-LLM is not a serving endpoint on its own. It sits between the trained model and the server that clients call, compiling the model into an optimized engine and providing the runtime that executes it.
What it does to a model
TensorRT-LLM applies several optimizations that work together. Understanding them helps you decide whether the extra compilation step is worth it.
- Custom kernels. Hand-tuned GPU code for attention, GEMM (matrix multiply), and mixture-of-experts operations replaces the generic implementations a model ships with.
- Quantization. The library supports FP8, FP4, INT8, and INT4 formats, including AWQ and SmoothQuant. Lower-precision weights use less memory and run faster, with a small accuracy trade-off you control.
- In-flight batching. Also called continuous batching, this groups incoming requests dynamically and releases each result as soon as it finishes, instead of waiting for a whole batch. It raises throughput and GPU utilization.
- Paged KV cache. The KV cache holds attention state for each active request. Paging it into fixed blocks, plus cache reuse and quantization, lets more requests share GPU memory at once.
- Speculative decoding. A smaller draft model proposes several tokens ahead, and the main model verifies them in one pass. When drafts are accepted, the model produces multiple tokens per step.
How to use it and how it fits
You install the Python library on a Linux host with a supported NVIDIA GPU and CUDA toolkit, then build an engine from a model checkpoint and serve it. Install the package with pip:
pip3 install --ignore-installed pip setuptools wheel
pip3 install tensorrt_llmThe library exposes a Python API. This minimal example loads a Hugging Face checkpoint, builds an optimized engine, and generates text:
from tensorrt_llm import LLM, SamplingParams
# Build an optimized engine from a checkpoint
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
["Explain in-flight batching in one paragraph."],
sampling,
)
for output in outputs:
print(output.outputs[0].text)In production you rarely call the library directly. You wrap the engine in Triton Inference Server for scaling and request routing, or you deploy a pre-packaged NVIDIA NIM microservice that already bundles a TensorRT-LLM engine behind an OpenAI-compatible API. See NVIDIA AI for the wider platform, and From zero to production for the deployment path around it.
How it compares
TensorRT-LLM competes with other inference engines. The main difference is that it targets NVIDIA hardware exclusively and leans on ahead-of-time compilation, while some alternatives run across more hardware with less setup.
| TensorRT-LLM | vLLM | TGI | SGLang | |
|---|---|---|---|---|
| Origin | NVIDIA | UC Berkeley, community | Hugging Face | Community |
| Hardware | NVIDIA GPUs only | NVIDIA, AMD, others | NVIDIA, AMD, others | NVIDIA, AMD |
| Setup | Compile engine per GPU | Load model, run | Load model, run | Load model, run |
| Continuous batching | Yes | Yes | Yes | Yes |
| Paged KV cache | Yes | Yes (PagedAttention) | Yes | Yes (RadixAttention) |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Best for | Peak NVIDIA throughput | Portable, easy start | Hugging Face stack | Structured, multi-turn |
For a portable option tied to the Hugging Face ecosystem, see Text Generation Inference (TGI) . For managed inference where you skip serving entirely, compare Groq , Fireworks AI , and Together AI .
When not to use it
TensorRT-LLM rewards heavy, sustained traffic on NVIDIA hardware. It is the wrong choice in several cases.
- You do not run NVIDIA GPUs. The library only targets NVIDIA hardware. On AMD or other accelerators, use vLLM, TGI, or SGLang instead.
- You want to skip infrastructure. If you would rather call an API than build and serve engines, a managed provider like Together AI or Fireworks AI removes the whole serving layer.
- You are prototyping or serving low traffic. The per-GPU compilation step and tuning add operational overhead that a small or bursty workload will not repay.
- You need instant model swaps. Because an engine is compiled for a specific model, precision, and GPU, changing any of those means rebuilding. Engines that load weights at runtime react faster to frequent model changes.
Further reading
- What is inference? : what happens when a trained model answers a request.
- What is a KV cache? : the attention state that paged KV cache manages.
- NVIDIA AI : the wider NVIDIA platform that packages TensorRT-LLM.
- Text Generation Inference (TGI) : Hugging Face’s portable serving alternative.
- TensorRT-LLM documentation : official guides, feature reference, and support matrix.
- TensorRT-LLM on GitHub : source code, examples, and release notes.
Sources
- TensorRT-LLM documentation home : feature list, in-flight batching, paged KV cache, quantization, and speculative decoding.
- TensorRT-LLM installation guide : pip package name and Linux prerequisites.
- TensorRT-LLM on GitHub : open-source library scope, kernels, parallelism, and Apache 2.0 license.
- NVIDIA TensorRT developer page : TensorRT ecosystem, optimization techniques, and Triton serving relationship.
- NVIDIA TensorRT-LLM developer page : supported optimizations and GPU inference positioning.