| Introduction
At Metrum AI, we were thrilled to deploy the Llama 3.3 70B and Llama 4 Scout 17B models on Dell’s next-generation PowerEdge servers—the XE9680 and XE9680L—equipped with NVIDIA H200 and B200 Tensor Core GPUs. These servers represent a paradigm shift in AI infrastructure, offering unprecedented throughput and concurrency capabilities essential for running large language models (LLMs) at scale. To analyze performance gains, we put both systems to the test by deploying these latest Llama models in a variety of real-world AI scenarios.
To accelerate the AI performance testing process, we leveraged Metrum Insights, our performance evaluation tool for real-world AI workloads. Metrum Insights automated critical steps of the end-to-end AI benchmarking pipeline, from server provisioning to validating performance results. To reflect real-world usage, it also provided prompt randomization and simulated concurrency at up to thousands of simultaneous requests. By leveraging Metrum Insights, we reduced our evaluation time by more than 90%.
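For context, the sketch below shows what this kind of load generation looks like in its simplest form: a small asynchronous client that sends randomized prompts at a fixed concurrency level to an OpenAI-compatible vLLM endpoint and reports output token throughput. It is a minimal illustration, not Metrum Insights itself; the endpoint URL, model identifier, and prompt-randomization scheme are assumptions made for the sketch.

```python
# Illustrative only: a minimal async load generator in the spirit of the
# concurrency simulation described above. Endpoint, model name, and token
# lengths are assumptions, not our actual test harness.
import asyncio
import random
import string
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed vLLM OpenAI-compatible server
MODEL = "meta-llama/Llama-3.3-70B-Instruct"          # assumed model identifier
CONCURRENCY = 1024                                   # one of the tested concurrency levels
MAX_TOKENS = 128                                     # target output token length


def random_prompt(approx_words: int = 128) -> str:
    # Crude prompt randomization so requests do not all share an identical prefix.
    words = ["".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(approx_words)]
    return " ".join(words)


async def one_request(session: aiohttp.ClientSession) -> int:
    payload = {"model": MODEL, "prompt": random_prompt(), "max_tokens": MAX_TOKENS}
    async with session.post(ENDPOINT, json=payload) as resp:
        body = await resp.json()
        # vLLM's OpenAI-compatible server reports completion_tokens in the usage block.
        return body.get("usage", {}).get("completion_tokens", 0)


async def main() -> None:
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)  # allow the full concurrency level
    start = time.perf_counter()
    async with aiohttp.ClientSession(connector=connector) as session:
        results = await asyncio.gather(*(one_request(session) for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"Output token throughput: {sum(results) / elapsed:.1f} tokens/s")


if __name__ == "__main__":
    asyncio.run(main())
```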
Here, we share key performance highlights:
- XE9680L with 8x B200 outperforms XE9680 with 8x H200 by up to 4.3× for short-response workloads.
- Increasing tensor parallelism to 2 provides up to 3.3× performance gains on XE9680 with 8x H200, optimizing throughput under limited replica counts.
- For long-generation tasks like summarization or video captioning, B200 remains superior, offering 1.6× gains with Llama 4 Scout.
| Llama 4 Scout: Precision Meets Efficiency
The Llama series by Meta has pioneered the open weight model landscape. Llama 3.3 70B represents a major milestone with its 70 billion parameters and impressive performance across reasoning, coding, and multilingual tasks. It offers a 128K context window and achieves exceptional results on benchmarks like MMLU and HumanEval. Building on this foundation, Llama 4 Scout introduces a mixture-of-experts architecture with 109 billion total parameters divided among 16 experts, using only 17 billion active parameters during inference. Like its predecessor, Scout excels in coding and reasoning while adding enhanced long-context understanding and image processing capabilities. Its industry-leading 10 million token context window dramatically expands possibilities for tasks like multi-document summarization and reasoning over extensive codebases—all while both models maintain competitive latency and cost profiles [1].
| Our Methodology
To test how well these servers perform on real-world AI workloads with varying concurrent user requirements, we leveraged Metrum Insights to deploy Llama 3.3 70B and Llama 4 Scout 17B across concurrency levels of 1024, 2048, and 4096. We expect model serving performance on the XE9680L and its B200 GPUs to continue improving as these platforms mature, consistent with previous releases of state-of-the-art AI hardware.
Because the latest stable release of PyTorch does not yet support B200 GPUs, we served the models with vLLM using a PyTorch nightly build with CUDA 12.8. For each scenario, we used the following configurations (a minimal replica-launch sketch follows this list):
- Model Server: vLLM, built from source at main-branch commit 93e5f3c5fb4a4bbd49610efb96aad30df95fca66
- Randomized Prompts:
- Average Input Token Length: 128, 2048
- Output Token Length: 128, 2048
- Tensor Parallelism: 1, 2
- Number of Model Replicas: 4, 8 (total number of replicas deployed on entire system)
- Metrics:
- Output token throughput (measured in tokens per second)
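To make the tensor parallelism and replica settings concrete, here is a minimal launch sketch showing how replicas could be pinned to GPU groups on a single 8-GPU server, assuming vLLM's OpenAI-compatible `vllm serve` entry point and GPU pinning via CUDA_VISIBLE_DEVICES. This is illustrative only and not our provisioning automation; the model identifier and port numbering are placeholders.

```python
# Illustrative sketch: launch one vLLM replica per GPU group on an 8-GPU server.
# With TENSOR_PARALLEL = 1 this yields 8 replicas; with TENSOR_PARALLEL = 2, 4 replicas.
import os
import subprocess

MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # assumed model identifier
NUM_GPUS = 8
TENSOR_PARALLEL = 2                          # tested values: 1 and 2
BASE_PORT = 8000                             # assumed port numbering

procs = []
for replica, first_gpu in enumerate(range(0, NUM_GPUS, TENSOR_PARALLEL)):
    gpu_group = ",".join(str(g) for g in range(first_gpu, first_gpu + TENSOR_PARALLEL))
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_group)  # pin this replica to its GPUs
    cmd = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(TENSOR_PARALLEL),
        "--port", str(BASE_PORT + replica),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

# Keep the launcher alive while the replicas serve traffic.
for p in procs:
    p.wait()
```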
| Throughput Performance Results
Figure 1. Output token throughput performance results for model serving with Llama 3.3 70B on Dell PowerEdge Servers with NVIDIA H200 and B200 GPUs, using an input token length of 128, output token length of 128, and tensor parallelism set to 1.
As shown in the chart above, we evaluated Llama 3.3 70B with 128 input and 128 output tokens, comparing the Dell PowerEdge XE9680 equipped with 8x NVIDIA H200 GPUs against the XE9680L equipped with 8x NVIDIA B200 GPUs. With tensor parallelism set to 1 and eight model replicas deployed, throughput on the B200 system consistently exceeded that of the H200 system across all tested concurrency levels. At 1024 concurrent requests, the B200 configuration delivered 13,576 tokens per second, compared to 5,486 on H200. As concurrency scaled to 2048 and 4096 requests, B200 throughput reached 20,613 and 25,241 tokens per second respectively—representing 3.7× and 4.3× higher throughput than H200. These results indicate that the B200 offers significantly improved scalability for high-throughput inference scenarios.
Figure 2. Output token throughput performance results for model serving with Llama 3.3 70B on Dell PowerEdge XE9680 with NVIDIA H200, using an input token length of 128, output token length of 128, and comparing tensor parallelism values of 1 and 2.
We also examined the impact of increasing tensor parallelism on throughput using the same Llama 3.3 70B model and H200-based system. In this case, we compared a setup with tensor parallelism of 1 and 8 model replicas to one with tensor parallelism of 2 and 4 model replicas. The configuration using TP=2 achieved notably higher throughput across all concurrency levels. At 1024 requests, throughput improved from 5,486 to 16,460 tokens per second. Similar improvements were observed at 2048 and 4096 requests, with throughput increasing to 18,462 and 18,690 tokens per second respectively. This suggests that increasing tensor parallelism can be an effective strategy for improving performance on H200 GPUs, particularly when managing limited GPU resources.
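Because both layouts occupy all eight GPUs (eight replicas at TP=1 versus four replicas at TP=2), the totals above can be compared directly without per-GPU normalization. A quick check using the figures quoted above:

```python
# Speedup from TP=1 with 8 replicas to TP=2 with 4 replicas at 1024 concurrent requests,
# using the throughput figures quoted in the text (tokens per second).
tp1_throughput = 5_486
tp2_throughput = 16_460
print(f"TP=2 speedup at 1024 requests: {tp2_throughput / tp1_throughput:.1f}x")  # ~3.0x
```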
Figure 3. Output token throughput performance results for model serving with Llama 4 Scout on Dell PowerEdge Servers with NVIDIA H200 and B200 GPUs, using an input token length of 128, output token length of 2048, and tensor parallelism set to 2.
The final set of tests focused on the Llama 4 Scout model configured for longer text generation workloads, using 128 input and 2048 output tokens with tensor parallelism of 2 and 4 model replicas. Performance was again compared between H200 and B200 systems. On the H200 server, throughput was measured at 12,416 and 13,712 tokens per second for 1024 and 2048 concurrent requests, respectively. The B200 configuration delivered higher throughput—17,344 tokens per second at 1024 requests and 21,349 at 2048—yielding throughput improvements of approximately 1.4× to 1.6× over the H200. These results highlight the B200’s ability to maintain higher performance levels under workloads with larger output token lengths.
| Summary
Our performance results highlight the impact of both GPU selection and model deployment configuration when serving large language models at scale. Dell PowerEdge XE9680 and XE9680L servers, when equipped with NVIDIA H200 and B200 GPUs respectively, provide a flexible foundation for enterprise AI workloads.
In our tests, the B200 consistently outperformed the H200, delivering up to 4.3× higher throughput for Llama 3.3 70B under short-response workloads. Increasing tensor parallelism from 1 to 2 on H200 also yielded significant gains—up to 3.3× improvement in throughput—highlighting the value of tuning parallelism configurations within a node. For longer output tasks, the Llama 4 Scout model achieved up to 1.6× higher throughput on B200 compared to H200, demonstrating the B200's ability to maintain performance under more demanding generation scenarios.
These insights are particularly relevant for teams deploying proprietary LLMs in production environments, where maximizing throughput and hardware efficiency directly affects responsiveness, cost, and scalability. As enterprise AI applications continue to evolve, infrastructure choices like GPU tiering and parallelism tuning will play an increasingly critical role in optimizing operational performance.
For guidance on configuring these systems for your specific workloads or to explore how this hardware stack fits within your AI strategy, contact the Metrum AI team at contact@metrum.ai.
| Why Dell PowerEdge Servers?
The Dell PowerEdge XE9680 and XE9680L servers are purpose-built for demanding AI and LLM workloads, offering industry-leading performance and scalability.
- XE9680: A 6U server supporting up to eight NVIDIA H100, H200, or AMD MI300X GPUs and dual Intel Xeon processors. It delivers high memory bandwidth and robust PCIe Gen5 connectivity, making it ideal for large-scale AI training, inferencing, and enterprise applications requiring high concurrency.
- XE9680L: A 4U, direct liquid-cooled server featuring eight next-gen NVIDIA Blackwell GPUs (HGX B200) for maximum GPU density and efficiency. It offers expanded PCIe Gen5 slots and is delivered as a turn-key, rack-integrated solution, enabling rapid deployment and scaling of advanced AI infrastructure.
Together, these servers provide the flexibility and power needed to efficiently run large language models like Llama 4 Scout and Maverick across a wide range of enterprise AI use cases.
| Server Configuration
| Server | Dell PowerEdge XE9680L | Dell PowerEdge XE9680 |
|---|---|---|
| Accelerators | 8x NVIDIA B200 GPUs | 8x NVIDIA H200 Tensor Core GPUs |
| CPU | 2x Intel Xeon Platinum 8562Y+ | 2x Intel Xeon Platinum 8568Y+ |
| Memory | 2 TB | 2 TB |
| Accelerator Count | 8 | 8 |
| OS | Ubuntu 22.04.5 LTS | Ubuntu 22.04.5 LTS |
| CUDA Version | 12.8 | 12.8 |
| References
Dell images: Dell.com
[1] MacDonald, C. (2025, April 5). Meta introduces Llama 4 with two new AI models available now, and two more on the way. Engadget. https://www.engadget.com/ai/meta-introduces-llama-4-with-two-new-models-available-now-and-two-more-on-the-way-214524295.html
Copyright © 2025 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. NVIDIA, NVIDIA H100 and NVIDIA H200, and combinations thereof are trademarks of NVIDIA. All other product names are the trademarks of their respective owners.
DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.