The Free Upgrade: How ROCm 7 Unlocks Up to 2.2x More Throughput on Existing Dell PowerEdge™ XE9785L with AMD Instinct™ MI300X Accelerators

March 2026

| Executive Summary

Most infrastructure upgrades require new hardware. ROCm 7 breaks that pattern. By upgrading the software stack on existing AMD Instinct MI300X accelerators, organizations unlock substantial throughput gains without purchasing, racking, or cooling a single new GPU.

In head-to-head benchmarks running Qwen3-30B-A3B on Dell PowerEdge XE9785 servers, ROCm 7 (vLLM 0.14.0) delivers up to 1.54x higher throughput at production concurrency and up to 2.2x at high concurrency compared to ROCm 6.3.1 (vLLM 0.8.5) on the same MI300X hardware. The gains are zero-cost: no new GPUs, no additional power draw, no rack space changes.

More importantly, ROCm 7 raises the scaling ceiling. ROCm 6 throughput plateaus at 2,048 to 4,096 concurrent requests across most workloads. ROCm 7 continues scaling to 8,192 concurrent requests, widening the performance gap precisely where production deployments need headroom most.

Key Results at a Glance

Up to 1.54x throughput at production concurrency

Short Q&A: 20,115 to 31,024 TPS at 1,024 concurrent requests

Up to 2.2x throughput at high concurrency

Report generation: 28,426 to 62,674 TPS at 8,192 concurrent requests

Zero hardware changes required

Same MI300X GPUs, same Dell PowerEdge servers, same rack footprint

Continuous Scaling on ROCm 7 while ROCm 6 Plateaus

ROCm 6 stalls at 2,048–4,096 concurrency; ROCm 7 continues to 8,192

| Why Software-Defined Performance Matters Now

Hardware refresh cycles in enterprise data centers typically run three to five years.[1] Organizations that deployed MI300X infrastructure in 2024 or 2025 are one to two years into that cycle. For these teams, the question is not whether to buy new hardware. It is how to extract more value from the hardware already installed.

The utilization gap. Most MI300X deployments were sized for expected peak demand. As AI inference workloads grow, concurrency requirements increase. A software upgrade that extends throughput at high concurrency directly delays or eliminates the need for hardware expansion.

The scaling wall. Inference serving engines and GPU runtime stacks have improved rapidly. ROCm 6 was optimized for the workloads and concurrency patterns of early 2025. ROCm 7 includes kernel-level optimizations, improved memory management, and more efficient scheduling that enable the same hardware to handle significantly higher concurrent load.

Deployment simplicity. A software upgrade carries none of the capital, power, or facility costs of a hardware expansion. It requires no new procurement approvals, no rack reconfiguration, and no cooling capacity changes. Teams can validate the upgrade in a staging environment and roll it out during a standard maintenance window.

| Benchmark Configuration

All benchmarks compare two software configurations on identical MI300X hardware. This isolates the software contribution from any hardware differences.

| Parameter | Value |
| --- | --- |
| Model | Qwen3-30B-A3B-Thinking-2507 |
| GPU | 8x AMD Instinct MI300X (192 GB HBM3 per GPU) |
| System | Dell PowerEdge XE9785 |
| ROCm 6 Config | ROCm 6.3.1, vLLM 0.8.5 |
| ROCm 7 Config | ROCm 7.0, vLLM 0.14.0 |
| Concurrency Sweep | 1 to 8,192 concurrent requests |
| Metrics | Tokens/second (TPS), system GPU power (W), error rate |

Four scenarios map to common enterprise inference patterns:

| Scenario | Input / Output Tokens | Enterprise Use Case |
| --- | --- | --- |
| Short Q&A | 128 in / 128 out | Real-time scoring, classification, chatbot |
| Report Generation | 128 in / 2,048 out | Report generation, regulatory summaries |
| RAG / Document Q&A | 2,048 in / 128 out | Contract analysis, earnings Q&A, RAG |
| Deep Research | 2,048 in / 2,048 out | Long-form synthesis, deep research |
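The sweep above can be reproduced with a simple async load generator: for each concurrency level, issue that many simultaneous requests and divide total generated tokens by wall-clock time. The sketch below simulates the pattern end to end; `fake_request` is a stand-in for a real call to the serving endpoint, and all names are illustrative rather than the actual harness used for these benchmarks.

```python
import asyncio
import time

async def fake_request(in_tokens: int, out_tokens: int) -> int:
    # Stand-in for a real inference call (e.g., to a vLLM endpoint);
    # sleeps briefly, then reports how many tokens were generated.
    await asyncio.sleep(0.001)
    return out_tokens

async def measure_tps(concurrency: int, in_tokens: int,
                      out_tokens: int, batches: int = 4) -> float:
    """Issue `concurrency` simultaneous requests per batch and return
    aggregate output tokens per second across all batches."""
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(batches):
        results = await asyncio.gather(
            *(fake_request(in_tokens, out_tokens)
              for _ in range(concurrency)))
        total_tokens += sum(results)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Short Q&A shape (128 in / 128 out) at a modest concurrency level.
tps = asyncio.run(measure_tps(concurrency=64, in_tokens=128, out_tokens=128))
print(f"{tps:,.0f} simulated TPS")
```

Sweeping `concurrency` over 1 through 8,192 and recording TPS at each level yields curves of the shape shown in the results that follow.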

| Performance Results

Throughput at Production Concurrency

At 1,024 concurrent requests, a concurrency level representative of departmental and enterprise deployments, ROCm 7 delivers consistent gains across all four workload profiles.

| Scenario | ROCm 6 TPS | ROCm 7 TPS | Throughput Gain |
| --- | --- | --- | --- |
| 128 in / 128 out | 20,115 | 31,024 | 1.54x |
| 128 in / 2,048 out | 21,066 | 27,230 | 1.29x |
| 2,048 in / 128 out | 9,684 | 13,270 | 1.37x |
| 2,048 in / 2,048 out | 18,016 | 20,970 | 1.16x |

Table 1 | Throughput at 1,024 Concurrent Requests. All results zero-error on both stacks.

The short Q&A workload shows the strongest gain at 1.54x. This workload is latency-sensitive and compute-bound, where ROCm 7's kernel optimizations have the most direct impact. The long-context, long-output scenario (2,048/2,048) shows a more modest 1.16x, reflecting workloads that are more memory-bandwidth-bound.

Averaged across all four scenarios at 1,024 concurrency, ROCm 7 delivers 1.34x the throughput of ROCm 6 on the same hardware. That is a 34% throughput increase from a software upgrade alone.
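As an arithmetic check, the per-scenario gains and the 1.34x cross-scenario average fall directly out of Table 1 (the values below are copied from the table):

```python
# Throughput at 1,024 concurrent requests, from Table 1.
rocm6 = {"128/128": 20115, "128/2048": 21066,
         "2048/128": 9684, "2048/2048": 18016}
rocm7 = {"128/128": 31024, "128/2048": 27230,
         "2048/128": 13270, "2048/2048": 20970}

gains = {k: rocm7[k] / rocm6[k] for k in rocm6}
avg_gain = sum(gains.values()) / len(gains)
for k, g in gains.items():
    print(f"{k}: {g:.2f}x")
print(f"average: {avg_gain:.2f}x")
```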

Figure 1 | Throughput at 1,024 Concurrent Requests Across All Workload Profiles


Where ROCm 7 Pulls Away: The Scaling Ceiling

The throughput comparison at 1,024 concurrency tells only part of the story. The more consequential difference emerges at higher concurrency, where ROCm 6 stops scaling and ROCm 7 continues.

| Scenario | Concurrency | ROCm 6 TPS | ROCm 7 TPS | Gain |
| --- | --- | --- | --- | --- |
| 128 in / 128 out | 2,048 | 28,841 | 45,424 | 1.58x |
| 128 in / 128 out | 4,096 | 31,712 | 53,653 | 1.69x |
| 128 in / 2,048 out | 2,048 | 29,193 | 42,222 | 1.45x |
| 128 in / 2,048 out | 4,096 | 28,902* | 50,474 | 1.75x |
| 128 in / 2,048 out | 8,192 | 28,426* | 62,674 | 2.20x |
| 2,048 in / 128 out | 4,096 | 11,554 | 18,251 | 1.58x |
| 2,048 in / 2,048 out | 2,048 | 23,498 | 29,763 | 1.27x |
| 2,048 in / 2,048 out | 4,096 | 23,500 | 36,038 | 1.53x |

Table 2 | Throughput at High Concurrency. *ROCm 6 throughput declining (plateau reached). All results zero-error except where noted in Table 4.

The pattern is consistent across workloads. ROCm 6 throughput growth slows sharply above 2,048 concurrent requests and effectively stalls by 4,096. In the report generation scenario (128/2048), ROCm 6 throughput actually declines from 29,193 TPS at 2,048 concurrency to 28,426 TPS at 8,192, a regression under load.

ROCm 7, by contrast, continues scaling. At 4,096 concurrency, the average gain across all scenarios reaches 1.64x. For the report generation workload at 8,192 concurrent requests, ROCm 7 delivers 62,674 TPS where ROCm 6 manages only 28,426, a 2.2x improvement.

Figure 2 | Report Generation Throughput Scaling. ROCm 6 plateaus at 2,048 concurrency; ROCm 7 continues to 62,674 TPS at 8,192.


For capacity planners, this is the critical finding. ROCm 7 does not just run faster at today's concurrency levels. It extends the usable concurrency range of the MI300X, allowing each node to absorb more traffic before requiring additional hardware.

Full Concurrency Sweep: Short Q&A (128 in / 128 out)

The short Q&A workload provides the clearest view of the scaling difference. This table shows throughput at every tested concurrency level.

| Concurrency | ROCm 6 TPS | ROCm 7 TPS | Gain | ROCm 6 Power (W) | ROCm 7 Power (W) |
| --- | --- | --- | --- | --- | --- |
| 1 | 81 | 122 | 1.50x | 2,809 | 2,940 |
| 32 | 1,959 | 3,482 | 1.78x | 5,124 | 4,704 |
| 64 | 3,310 | 5,491 | 1.66x | 5,749 | 6,664 |
| 128 | 5,608 | 8,643 | 1.54x | 6,174 | 5,831 |
| 256 | 9,335 | 13,597 | 1.46x | 5,464 | 7,497 |
| 512 | 13,838 | 21,839 | 1.58x | 6,703 | 6,533 |
| 1,024 | 20,115 | 31,024 | 1.54x | 6,631 | 6,370 |
| 2,048 | 28,841 | 45,424 | 1.58x | 6,096 | 7,624 |
| 4,096 | 31,712 | 53,653 | 1.69x | 7,650 | 6,773 |

Table 3 | Full Concurrency Sweep: 128 in / 128 out (FP8). 8,192 concurrency omitted (both stacks show non-zero error rates at that level for this scenario).

Figure 3 | Short Q&A Throughput Scaling: ROCm 7 leads at every concurrency level with widening advantage.


Two patterns stand out. First, ROCm 7 leads at every concurrency level, from single-request inference (1.50x) through production scale (1.54x at 1,024) to high concurrency (1.69x at 4,096). Second, between 4,096 and 8,192 concurrent requests (the level omitted from the table above), ROCm 6 throughput growth drops to under 2%, while ROCm 7 still shows 31% growth before encountering errors.

The following chart illustrates this pattern across all four workloads. The throughput ratio between ROCm 7 and ROCm 6 dips at mid-range concurrency (where both stacks scale well) and then rises sharply above 2,048 concurrent requests as ROCm 6 hits its ceiling.

Figure 4 | The Scaling Advantage: ROCm 7 throughput gain ratio increases at high concurrency across all workloads.


Reliability at Scale

Both software stacks maintain zero error rates through most of the concurrency range. Errors appear only at the upper end of the sweep (4,096 and 8,192 concurrent requests) and remain below 1% for both stacks across all scenarios.

| Scenario | Concurrency | ROCm 6 Errors | ROCm 7 Errors |
| --- | --- | --- | --- |
| 128 / 128 | 8,192 | 0.24% | 0.73% |
| 128 / 2,048 | 8,192 | 0% | 0% |
| 2,048 / 128 | 8,192 | 0% | 0.49% |
| 2,048 / 2,048 | 4,096 | 0.49% | 0.49% |
| 2,048 / 2,048 | 8,192 | 0% | 0% |

Table 4 | Error Rates at Extreme Concurrency. All scenario and concurrency combinations not shown were zero-error on both stacks.

The error profile is comparable between the two stacks. ROCm 7 does not sacrifice reliability for throughput. Both stacks deliver clean, zero-error operation through 4,096 concurrent requests on all workloads except the most demanding (2,048/2,048), where both show identical 0.49% error rates.

A Note on Power Consumption

ROCm 7 generally draws moderately higher system power than ROCm 6 at the same concurrency level. At 1,024 concurrent requests, ROCm 7 power ranges from 6,370W to 9,797W across scenarios, compared to 6,631W to 7,799W on ROCm 6.

This increase reflects ROCm 7's more aggressive utilization of available compute resources. The GPU is doing more useful work per unit time, which directly increases power draw. The throughput gains outpace the power increase in most scenarios, but the efficiency story here is about throughput per node, not throughput per watt. Organizations running MI300X hardware already have their power and cooling infrastructure sized. The relevant question is how much more output they can extract within that existing envelope.

| Scenario | ROCm 6 TPS/W | ROCm 7 TPS/W | Efficiency Delta |
| --- | --- | --- | --- |
| 128 in / 128 out | 3.03 | 4.87 | +61% |
| 128 in / 2,048 out | 2.79 | 2.80 | +0.5% |
| 2,048 in / 128 out | 1.27 | 1.71 | +35% |
| 2,048 in / 2,048 out | 2.31 | 2.14 | -7% |

Table 5 | Performance-per-Watt at 1,024 Concurrency. TPS/W = tokens per second / system GPU power (watts).

For compute-bound workloads (short Q&A, RAG), ROCm 7 improves both throughput and efficiency. For memory-bandwidth-bound workloads (2,048/2,048), the efficiency trade-off is roughly neutral. In all cases, the total tokens-per-node improvement is the primary value driver.
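Table 5's ratios are simple divisions. Using the short Q&A row as a worked example, with the 1,024-concurrency throughput and power figures from Table 3:

```python
# Short Q&A (128 in / 128 out) at 1,024 concurrency, from Table 3.
tps6, watts6 = 20115, 6631   # ROCm 6 throughput and system GPU power
tps7, watts7 = 31024, 6370   # ROCm 7 throughput and system GPU power

eff6 = tps6 / watts6         # tokens per second per watt, ROCm 6
eff7 = tps7 / watts7         # tokens per second per watt, ROCm 7
delta = eff7 / eff6 - 1
print(f"ROCm 6: {eff6:.2f} TPS/W, ROCm 7: {eff7:.2f} TPS/W, "
      f"delta: {delta:+.0%}")
```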

| What This Means for Your MI300X Fleet

The benchmark data supports three concrete infrastructure planning conclusions.

Defer hardware expansion. If your MI300X deployment is approaching its throughput ceiling at current concurrency levels, ROCm 7 extends that ceiling by 1.3x to 2.2x depending on workload. For many organizations, this deferral represents six to twelve months of avoided capital expenditure on additional nodes, networking, and facility costs.

Absorb growth without new hardware. AI inference traffic at most enterprises is growing rapidly. A 34% average throughput increase at production concurrency (1,024 requests) means ROCm 7 can absorb a corresponding increase in user traffic on the same hardware. For organizations experiencing 20-40% annual growth in inference demand, the upgrade buys roughly a year of headroom.[2]

Extend high-concurrency headroom. The scaling ceiling improvement is particularly valuable for shared inference services that consolidate multiple applications onto a single GPU cluster. These environments routinely see concurrency spikes above 2,048 during peak hours. ROCm 7's ability to maintain throughput growth at 4,096+ concurrent requests means fewer dropped requests, shorter queues, and more predictable SLA performance during peak demand.
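The "roughly a year of headroom" figure follows from compounding: a one-time 1.34x gain is exhausted once traffic has grown by the same factor. A quick back-of-envelope calculation (the growth rates below are the assumed 20-40% range, not measured data):

```python
import math

# Months until compounding traffic growth consumes a one-time
# 1.34x throughput gain: solve (1 + g)**(m / 12) == GAIN for m.
GAIN = 1.34
for annual_growth in (0.20, 0.30, 0.40):
    months = 12 * math.log(GAIN) / math.log(1 + annual_growth)
    print(f"{annual_growth:.0%} growth -> {months:.0f} months of headroom")
```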

The Upgrade Path

ROCm 7 supports the same MI300X hardware that ran ROCm 6. The upgrade path does not require hardware changes, BIOS updates specific to ROCm 7, or re-architecture of inference serving configurations.

Step 1: Validate in staging. Deploy the ROCm 7 container (rocm/vllm:rocm7.0 or equivalent) on a non-production MI300X node. Run your production workload mix at representative concurrency levels to confirm throughput improvements and validate model output quality.

Step 2: Monitor with Dell iDRAC. Use Dell iDRAC telemetry to compare power draw, GPU temperature, and memory utilization between ROCm 6 and ROCm 7 under load. The data in this paper shows moderately higher power consumption on ROCm 7. Confirm that your cooling infrastructure accommodates the increase at your specific concurrency levels.

Step 3: Roll out during maintenance. Swap the serving container image from the ROCm 6 build to ROCm 7 during a standard maintenance window. The model weights, serving configuration, and client interfaces remain unchanged.
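As a concrete sketch of Steps 1 and 3, the commands below pull the container named in Step 1 and start a vLLM server on one node. Treat this as an illustrative starting point, not a validated production configuration: the device flags are the standard way to expose AMD GPUs to Docker, but the exact image tag, model ID, and serving flags should match whatever your staging validation used.

```shell
# Pull the ROCm 7 vLLM image (tag from Step 1; substitute the exact
# build you validated in staging).
docker pull rocm/vllm:rocm7.0

# Start the server. /dev/kfd and /dev/dri expose the AMD GPUs to the
# container; --tensor-parallel-size 8 spans the node's eight MI300X
# accelerators, matching the benchmark configuration in this paper.
docker run --rm --ipc=host --group-add video \
  --device=/dev/kfd --device=/dev/dri \
  -p 8000:8000 \
  rocm/vllm:rocm7.0 \
  vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8
```

Because the client-facing OpenAI-compatible API is unchanged between the two container builds, the rollback path is equally simple: swap the image tag back.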

Organizations using Dell Enterprise Hub on Hugging Face can pull pre-validated ROCm 7 containers tested on Dell PowerEdge platforms with AMD Instinct accelerators, reducing the validation effort further.

Looking Ahead: ROCm 7 on MI355X

The gains documented in this paper apply to existing MI300X hardware. For organizations planning new deployments or hardware refreshes, ROCm 7 is also the foundation software stack for the AMD Instinct MI355X (CDNA 4). The MI355X delivers generational improvements in memory capacity (288 GB HBM3e per GPU), bandwidth (8 TB/s), and native FP4 compute. Combined with ROCm 7's software optimizations, the MI355X achieves up to 9.5x better performance-per-watt than the MI300X at production concurrency, as detailed in previous Metrum AI briefs.

ROCm 7 ensures that organizations investing in MI300X today build on a software foundation that carries forward. The kernels, memory management improvements, and scheduling optimizations developed for ROCm 7 on MI300X directly benefit MI355X deployments, providing software continuity across hardware generations.

| Conclusion: More Tokens from Every Dollar Already Spent

ROCm 7 delivers a measurable, immediate performance upgrade for every MI300X deployment. The gains are not theoretical. At production concurrency, throughput increases by an average of 34%. At high concurrency, the improvement reaches 2.2x as ROCm 7 continues scaling where ROCm 6 plateaus.

Few infrastructure improvements require no new capital, no additional power, and no facility changes. Organizations upgrade their software stack and immediately serve more users, process more tokens, and absorb more growth on hardware they have already purchased, racked, and cooled.

For infrastructure teams managing MI300X fleets, the ROCm 7 upgrade is the highest-ROI action available today.

Learn more about AMD Instinct MI355X at amd.com/instinct


| References

[1] Uptime Institute, "Uptime Institute Global Data Center Survey 2024," Jul. 2024. [Online]. Available: https://uptimeinstitute.com/resources/research-and-reports/uptime-institute-global-data-center-survey-results-2024

[2] S. Bokhari et al., "The Rapid Growth of AI Inference Demand," in AI Index Report 2025, Stanford Institute for Human-Centered AI, Apr. 2025. [Online]. Available: https://aiindex.stanford.edu/report/


Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge, Dell iDRAC and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.

***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.

