TheVortiq
Artificial Intelligence

NVIDIA cuts AI token cost up to 5x with Blackwell and optimized software

NVIDIA's inference stack optimization promises to revolutionize AI economics by drastically reducing token cost, enabling larger and more accessible models.

June 30, 2026 · 3 min read


TL;DR: NVIDIA has cut inference token cost by up to 5x on Blackwell through software optimization, making inference on large models such as DeepSeek V4 cheaper and more efficient.

What happened?

NVIDIA has announced a reduction of up to 5x in inference token cost on its Blackwell platform, which it attributes to an optimized software stack that includes TensorRT-LLM and the Dynamo framework. In tests with the DeepSeek V4 model, token cost fell by up to 5x within a single month, according to data from SemiAnalysis. Companies such as Baseten, Cognition, and Deep Infra are already using these optimizations to deliver higher performance at lower cost.

The announcement comes as the AI industry shifts from prototypes to production "AI factories," according to NVIDIA's official blog. The key metric has moved from raw chip specs to token cost: how many useful tokens can be delivered per dollar and per watt while staying within the required latency targets. NVIDIA says its full-stack inference software, co-designed with GPUs, CPUs, networks, and systems and reinforced by an open-source ecosystem, keeps improving the performance of hardware that is already deployed.
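
To make the metric concrete, here is a minimal Python sketch of how token cost can be derived from an hourly hardware price and a measured throughput. The function names and every number in it (price per hour, throughput, power draw, latency target) are illustrative assumptions, not figures from NVIDIA or SemiAnalysis.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: one watt is one joule per second."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def meets_latency_target(observed_latency_ms: float, target_ms: float) -> bool:
    """Throughput only counts as useful if the latency target is respected."""
    return observed_latency_ms <= target_ms


if __name__ == "__main__":
    # Hypothetical system: same hourly price before and after a software update.
    hourly_cost_usd = 4.0
    before = cost_per_million_tokens(hourly_cost_usd, tokens_per_second=2_000)
    after = cost_per_million_tokens(hourly_cost_usd, tokens_per_second=10_000)
    print(f"before: ${before:.3f} per 1M tokens")
    print(f"after:  ${after:.3f} per 1M tokens")
    print(f"reduction: {before / after:.1f}x")
    print(f"efficiency: {tokens_per_joule(10_000, power_watts=1_000):.1f} tokens/J")
    print(f"within target: {meets_latency_target(180.0, target_ms=200.0)}")
```

Under these assumptions the hardware price is held constant, so any software-driven gain in throughput translates directly into a proportional drop in cost per token.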

Why is this important?

Token cost is a key metric in AI economics, determining the viability of deploying large models in production. With cost reductions, more companies can access cutting-edge models without massive infrastructure investment. Additionally, energy efficiency improves, reducing environmental impact and operational costs.

Historically, inference for large models has been an economic bottleneck. For example, in 2023, running GPT-3 on conventional hardware cost around $0.02 per 1,000 tokens, and software optimizations such as TensorRT-LLM have since been lowering that cost. The 5x reduction on Blackwell represents a leap comparable to the transition from the Volta to the Ampere architecture, which offered 2-3x improvements in inference efficiency. This opens up access to state-of-the-art models for startups and mid-sized companies that previously could not afford the operating costs.

What consequences will it have?

This optimization will accelerate AI adoption in sectors like healthcare, finance, and logistics, where inference costs were a barrier. It will also foster competition among cloud and hardware providers, potentially leading to even lower prices. However, it could increase dependence on NVIDIA in the AI ecosystem, raising concerns about technological monopoly.

According to SemiAnalysis, the token cost reduction on Blackwell could lower the total cost of ownership (TCO) of AI workloads by 40-60% compared to the previous Hopper generation. This will pressure competitors such as AMD (with its MI300X platform) and startups such as Cerebras to accelerate their own software optimizations. Cloud providers such as AWS, Azure, and Google Cloud could also pass these savings on to customers, intensifying the price war in inference-as-a-service. However, reliance on the NVIDIA ecosystem (CUDA, TensorRT) could hinder model portability and create vendor lock-in risks. Regulators such as the European Commission have already expressed concern about market concentration in AI chips.

What should readers know?

Developers should consider using TensorRT-LLM and Dynamo to optimize their models. Companies should evaluate total cost of ownership (TCO) when choosing infrastructure, prioritizing token cost over raw specs. It is also worth monitoring AMD and Intel, which are developing competing alternatives.

Specifically, TensorRT-LLM enables optimizations such as kernel fusion, paged attention, and FP8 quantization, which have already shown performance improvements of up to 2x on models like Llama 2. Dynamo, meanwhile, is an orchestration framework that manages model execution across clusters, reducing communication latency. For companies, the key metric is no longer TFLOPS but tokens per second per dollar. For example, Baseten reported a 3x increase in DeepSeek V4 throughput after implementing these optimizations. Readers should watch for independent MLPerf Inference benchmarks, which will allow these figures to be compared with competitors' results. It is also worth keeping an eye on AMD's ROCm software and Intel's oneAPI optimizations, which aim to close the gap with NVIDIA.
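
As a rough illustration of why tokens per second per dollar can rank options differently from raw throughput, the Python sketch below compares three hypothetical deployments. The `Deployment` class, the deployment names, and all throughput and price figures are assumptions made up for this example; "dollar" here means dollars per hour of hardware cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    name: str
    tokens_per_second: float
    hourly_cost_usd: float

    @property
    def tokens_per_second_per_dollar(self) -> float:
        """Throughput normalized by the hourly hardware price."""
        return self.tokens_per_second / self.hourly_cost_usd


def rank_by_value(deployments: list[Deployment]) -> list[Deployment]:
    """Sort deployments from best to worst throughput per dollar."""
    return sorted(
        deployments,
        key=lambda d: d.tokens_per_second_per_dollar,
        reverse=True,
    )


# Hypothetical options: "optimized" models a 3x throughput gain on the same
# hardware price, while "bigger-gpu" is faster in absolute terms but costs more.
baseline = Deployment("baseline", tokens_per_second=1_000.0, hourly_cost_usd=4.0)
optimized = Deployment("optimized", tokens_per_second=3_000.0, hourly_cost_usd=4.0)
bigger_gpu = Deployment("bigger-gpu", tokens_per_second=3_500.0, hourly_cost_usd=6.0)

if __name__ == "__main__":
    for d in rank_by_value([baseline, optimized, bigger_gpu]):
        print(f"{d.name}: {d.tokens_per_second_per_dollar:.1f} tokens/s per $/hour")
```

In this made-up comparison the deployment with the highest raw throughput is not the best value, which is the kind of trade-off the metric is meant to expose.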
