
| Introduction
As large language models scale in complexity, so too do the demands on the infrastructure running them. NVIDIA's latest Blackwell B200 GPU promises major gains in AI inference, but how does it actually perform against its predecessor, the H200, especially for serving Llama 4 Maverick with vLLM?
We ran benchmarks on both GPUs using the Llama 4 Maverick model served with the vLLM inference engine. In this blog post, we compare the key metrics (output token throughput, time to first token, and inter-token latency) to show how the architectural shift from H200 to B200 impacts real-world LLM inference.
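For context, here is a minimal sketch of serving a model like this with vLLM's offline Python API. The Hugging Face model ID and engine arguments are illustrative assumptions, not necessarily the exact configuration used in our benchmark runs.

```python
# Minimal sketch of serving Llama 4 Maverick with vLLM's offline Python API.
# The model ID and engine arguments below are illustrative assumptions,
# not necessarily the exact configuration used in our benchmarks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model ID
    tensor_parallel_size=8,  # shard the model across all 8 GPUs in the node
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain the difference between HBM3e and HBM3."], sampling)
print(outputs[0].outputs[0].text)
```

For online benchmarking, the same model is typically exposed through vLLM's OpenAI-compatible server rather than the offline API, so that concurrent client requests can be measured end to end.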
| Hardware Comparison
| Specification | B200 | H200 |
|---|---|---|
| Architecture | Blackwell | Hopper |
| HBM3e Memory | 192 GB | 141 GB |
| Memory Bandwidth | 8 TB/s | 4.8 TB/s |
| TDP | 1000W | 700W |
| FP8 Tensor Performance | 9 petaFLOPS (per GPU) | 3.9 petaFLOPS (per GPU) |
| Key Metrics and Definitions
- Time to First Token (TTFT): This is the time taken for the model to generate the first token after a prompt is sent. Lower TTFT improves user experience in real-time applications.
- Inter-Token Latency (ITL): The time between generating consecutive tokens. Lower ITL means faster generation of longer responses.
- Output Token Throughput: The number of output tokens generated per second across all concurrent requests. Higher throughput indicates better overall system efficiency. A short sketch of how these metrics can be computed follows this list.
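The sketch below is illustrative only (it is not our benchmark harness); the function names are our own, and it simply derives the three metrics from per-token arrival timestamps.

```python
# Illustrative sketch of deriving TTFT, ITL, and output token throughput
# from per-token arrival timestamps. Not the harness used in our benchmarks.

def request_metrics(send_time: float, token_times: list[float]) -> dict:
    """Compute per-request latency metrics.

    token_times: wall-clock timestamps at which each output token arrived.
    """
    ttft = token_times[0] - send_time  # Time to First Token
    # Inter-token latency: mean gap between consecutive output tokens.
    gaps = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl}


def output_token_throughput(total_output_tokens: int, wall_clock_s: float) -> float:
    """Aggregate output tokens per second across all concurrent requests."""
    return total_output_tokens / wall_clock_s


# Example: a request sent at t=0.0 whose tokens arrived at 0.30s, 0.32s, 0.34s
print(request_metrics(0.0, [0.30, 0.32, 0.34]))  # TTFT ~0.30s, mean ITL ~0.02s
```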
| Output Token Throughput

With an average input prompt of 256 tokens and an output length of 256 tokens, the B200 consistently delivers higher output token throughput across all concurrency levels. At peak concurrency (1024 requests), the B200 achieves approximately 47% higher throughput—processing 9,870 tokens per second compared to the H200's 6,694 tokens per second.
| Time to First Token (TTFT)

Lower TTFT is crucial for user-facing applications like chatbots and interactive assistants. The B200 consistently delivers faster TTFT at all concurrency levels. At 1024 concurrent requests, the B200's mean TTFT is ~35% lower than the H200, resulting in noticeably quicker initial responses even under load.
| Inter-Token Latency (ITL)

For longer outputs, ITL determines how smoothly tokens are streamed. The B200's mean ITL is ~29% lower than the H200's at 1024 concurrent requests, so tokens stream more smoothly and long generations finish sooner.
| Summary
The NVIDIA B200 marks a significant leap over the H200 for LLM inference workloads. Across output throughput, time to first token, and inter-token latency, the Blackwell architecture delivers meaningful improvements—especially at high concurrency.
For organizations planning to deploy large-scale LLM inference pipelines, upgrading to Blackwell GPUs offers clear performance advantages.
| Test Configuration
- Model: Llama 4 Maverick
- Inference Engine: vLLM v0.8.4
- GPUs: NVIDIA H200 (x8 via NVLink) vs. NVIDIA B200 (x8 via NVLink)
- Concurrency Levels: 1, 64, 128, 256, 512, 1024
- Input Tokens: 256 | Output Tokens: 256
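As a rough illustration of how a concurrency sweep like this can be driven, the sketch below issues concurrent streaming requests against a vLLM OpenAI-compatible endpoint and reports mean TTFT. The endpoint URL, model ID, and prompt are placeholder assumptions, streamed chunks are used as a proxy for tokens, and this is not the harness behind the numbers reported above.

```python
# Hedged sketch of a concurrent load generator against a vLLM OpenAI-compatible
# endpoint. URL, model ID, and prompt are placeholder assumptions; streamed
# chunks are a rough proxy for tokens.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM server
    api_key="EMPTY",                      # vLLM serves without auth by default
)

async def one_request(prompt: str) -> tuple[float, list[float]]:
    """Send one streaming request; return (send_time, per-chunk arrival times)."""
    send_time = time.perf_counter()
    arrivals: list[float] = []
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed model ID
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter())
    return send_time, arrivals

async def sweep(concurrency: int) -> None:
    prompts = ["Summarize the history of GPU computing."] * concurrency
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    ttfts = [arrivals[0] - sent for sent, arrivals in results if arrivals]
    print(f"concurrency={concurrency}  mean TTFT={sum(ttfts) / len(ttfts):.3f}s")

if __name__ == "__main__":
    asyncio.run(sweep(64))
```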
Copyright © 2025 Metrum AI, Inc. All Rights Reserved. NVIDIA, H200, B200, and Blackwell are trademarks or registered trademarks of NVIDIA Corporation. All other product names are the trademarks of their respective owners.
DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.