Designed to orchestrate and coordinate AI inference requests across large fleets of GPUs, Dynamo is intended to ensure that AI factories run at the lowest possible cost while maximising token revenue generation.
As AI reasoning goes mainstream, every AI model will generate tens of thousands of tokens to “think” through each prompt, making it essential to keep increasing inference performance while continually lowering its cost.
NVIDIA Dynamo, the company’s successor to the NVIDIA Triton Inference Server, is AI inference-serving software designed to maximise token revenue generation for AI factories deploying reasoning AI models. It orchestrates and accelerates inference communication across thousands of GPUs and uses disaggregated serving to separate the prompt-processing and token-generation phases of large language models (LLMs) onto different GPUs. This allows each phase to be optimised independently for its specific needs and ensures maximum GPU resource utilisation.
“Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time,” said Jensen Huang, founder and CEO of NVIDIA. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”
Using the same number of GPUs, Dynamo doubles the performance and revenue of AI factories serving Llama models on NVIDIA’s Hopper platform. When running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo’s intelligent inference optimisations also boost the number of tokens generated by over 30x per GPU.
To achieve these inference performance improvements, NVIDIA Dynamo incorporates features that enable it to increase throughput and reduce costs. For example, it can dynamically add, remove and reallocate GPUs in response to fluctuating request volumes and types, and it can pinpoint the specific GPUs in large clusters best placed to minimise response computations and route queries to them. It can also offload inference data to more affordable memory and storage devices and quickly retrieve it when needed, minimising inference costs.
Fully open source, NVIDIA Dynamo supports PyTorch, SGLang, NVIDIA TensorRT-LLM and vLLM to allow users to develop and optimise ways to serve AI models across disaggregated inference environments. It will enable users to accelerate the adoption of AI inference, including at AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI and VAST.
NVIDIA Dynamo maps the knowledge that inference systems hold in memory from serving prior requests (the KV cache) across potentially thousands of GPUs. It then routes new inference requests to the GPUs that have the best knowledge match, avoiding costly recomputations and freeing up GPUs to respond to new incoming requests.
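To make the idea concrete, here is a minimal, purely illustrative Python sketch of KV-cache-aware routing: each request is sent to the worker whose cached prompt prefixes overlap most with the incoming prompt, so prefill work already done is not repeated. The `Worker` class, the token IDs and the scoring logic are hypothetical simplifications, not Dynamo's actual interface.

```python
# Illustrative sketch only: a toy KV-cache-aware router.
# Worker names, token IDs and the prefix-overlap scoring are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cached_prefixes: set[tuple[int, ...]] = field(default_factory=set)

def overlap(prompt_tokens: list[int], prefix: tuple[int, ...]) -> int:
    """Length of the shared leading run between a prompt and a cached prefix."""
    n = 0
    for a, b in zip(prompt_tokens, prefix):
        if a != b:
            break
        n += 1
    return n

def route(prompt_tokens: list[int], workers: list[Worker]) -> Worker:
    """Pick the worker whose cached KV blocks cover the longest prompt prefix."""
    def best_score(w: Worker) -> int:
        return max((overlap(prompt_tokens, p) for p in w.cached_prefixes), default=0)
    return max(workers, key=best_score)

# Usage: the follow-up request shares a long prefix with the first request,
# so it is routed to the worker that already holds that prefix in its KV cache.
w1, w2 = Worker("gpu-0"), Worker("gpu-1")
first = [101, 7, 7, 9, 4, 2]
w1.cached_prefixes.add(tuple(first))
follow_up = [101, 7, 7, 9, 4, 2, 55, 13]
print(route(follow_up, [w1, w2]).name)  # -> gpu-0
```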
The NVIDIA Dynamo inference platform also supports disaggregated serving, which assigns the different computational phases of LLMs - including building an understanding of the user query and then generating the best response - to different GPUs. This approach is suitable for reasoning models like the new NVIDIA Llama Nemotron model family, which uses advanced inference techniques for improved contextual understanding and response generation. Disaggregated serving allows each phase to be fine-tuned and resourced independently, improving throughput and delivering faster responses to users.
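The following Python sketch illustrates the general pattern of disaggregated serving, under the assumption of two hypothetical worker pools: a compute-bound prefill pool that processes the prompt once, and a bandwidth-bound decode pool that generates tokens from the resulting KV cache. The pool names, the `kv_handle` handoff and the function signatures are invented for illustration and do not reflect Dynamo's API.

```python
# Illustrative sketch only: prefill (prompt processing) and decode (token
# generation) handled by separate worker pools that can be sized and tuned
# independently. Pool sizes and the kv_handle handoff are hypothetical.
PREFILL_POOL = ["prefill-gpu-0", "prefill-gpu-1"]               # compute-bound stage
DECODE_POOL = ["decode-gpu-0", "decode-gpu-1", "decode-gpu-2"]  # bandwidth-bound stage

def prefill(request_id: int, prompt: str, gpu: str) -> dict:
    """Process the whole prompt once and return a handle to its KV cache."""
    print(f"{gpu}: prefilled request {request_id} ({len(prompt)} chars)")
    return {"request_id": request_id, "kv_handle": f"kv-{request_id}"}

def decode(kv: dict, gpu: str, max_tokens: int = 3) -> list[str]:
    """Generate tokens one at a time, reading from the transferred KV cache."""
    print(f"{gpu}: decoding request {kv['request_id']} from {kv['kv_handle']}")
    return [f"token{i}" for i in range(max_tokens)]

# A request flows through the two pools; each pool can be scaled to its own
# bottleneck (compute for prefill, memory bandwidth for decode).
kv = prefill(1, "Why separate prefill and decode?", PREFILL_POOL[0])
print(decode(kv, DECODE_POOL[0]))
```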
NVIDIA Dynamo includes four key innovations that reduce inference serving costs and improve user experience:
GPU Planner: A planning engine that dynamically adds and removes GPUs to adjust to fluctuating user demand, avoiding GPU over- or under-provisioning.
Smart Router: An LLM-aware router that directs requests across large GPU fleets to minimise costly GPU recomputations of repeat or overlapping requests - freeing up GPUs to respond to new incoming requests.
Low-Latency Communication Library: An inference-optimised library that supports state-of-the-art GPU-to-GPU communication and abstracts the complexity of data exchange across heterogeneous devices, accelerating data transfer.
Memory Manager: An engine that intelligently offloads and reloads inference data to and from lower-cost memory and storage devices without impacting user experience, as illustrated in the sketch after this list.
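As a rough illustration of the offload-and-reload pattern the Memory Manager relies on, here is a toy two-tier cache in Python that evicts least-recently-used KV blocks from a simulated GPU tier to a cheaper host tier and pulls them back on access. The capacities, block identifiers and the LRU policy are assumptions made for the sake of the example, not Dynamo's actual behaviour.

```python
# Illustrative sketch only: a toy two-tier KV-cache store that offloads the
# least-recently-used blocks from a (simulated) GPU tier to cheaper host
# memory and reloads them on access. Capacities and block names are hypothetical.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_capacity: int = 2):
        self.gpu = OrderedDict()   # hot tier: fast, limited capacity
        self.host = {}             # cold tier: larger, cheaper, slower
        self.gpu_capacity = gpu_capacity

    def put(self, block_id: str, kv_block: bytes) -> None:
        self.gpu[block_id] = kv_block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # offload LRU block
            self.host[victim] = data

    def get(self, block_id: str) -> bytes:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.host.pop(block_id)                   # reload on demand
        self.put(block_id, data)
        return data

cache = TieredKVCache()
for i in range(4):
    cache.put(f"blk-{i}", b"...")
print(sorted(cache.gpu), sorted(cache.host))  # oldest blocks offloaded to host
cache.get("blk-0")                            # reloading brings blk-0 back to the GPU tier
print(sorted(cache.gpu), sorted(cache.host))
```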
NVIDIA Dynamo will be made available in NVIDIA NIM microservices and supported in a future release by the NVIDIA AI Enterprise software platform with production-grade security, support and stability.