Executive Summary

As AI inference scales from pilot to production, the defining metric shifts from raw throughput to total cost of ownership: how much compute, power, and rack space each generated token consumes over a three-year lifecycle.

In head-to-head benchmarks running Qwen3-235B-A22B on Dell PowerEdge servers, the AMD Instinct MI355X (CDNA 4) delivers up to 3x higher throughput at production concurrency and, with MXFP4, up to 2x better performance-per-watt than the MI300X (CDNA 3). Across all four tested workload patterns, the MI355X delivers 1.5x to 3.0x the tokens per second at matched concurrency, while MXFP4 precision unlocks an additional 35% to 130% efficiency gain at high concurrency.

Same rack. Same facility. Up to 3x the throughput and 2x the tokens per watt. That translates into fewer nodes to meet the same SLA, denser rack deployments, and faster payback on infrastructure investments.

Key Results at a Glance

1.5x to 3.0x Higher Gen-on-Gen Throughput: MI355X vs. MI300X at 1,024 concurrency (FP8)

Up to 4.1x Peak Throughput: 31,753 vs. 7,799 TPS at 128 in / 128 out (FP8, 8,192 concurrency)

Up to 2.0x Performance-per-Watt at Production Concurrency: MI355X MXFP4 vs. MI300X FP8

+35% to +130% TPS/W over FP8: MXFP4 efficiency bonus on the MI355X

Why Throughput and Efficiency Define AI Economics

Throughput and power efficiency have become the defining accelerator metrics for production AI inference.

The throughput problem. Enterprise AI deployments must serve hundreds to thousands of concurrent users. An accelerator that delivers strong single-request performance but plateaus at moderate concurrency forces teams to deploy more nodes, adding cost and complexity. The ability to scale throughput linearly with concurrency directly reduces node count and operational overhead.

Data center capacity is constrained. Most enterprise data centers were built for 8-15 kW per rack[1]. A single 8-GPU AI inference node can draw 5,000 to 10,000W of GPU power at production concurrency, approaching or exceeding an entire legacy rack's power envelope on a single node. When the accelerator delivers more tokens per watt, operators can serve the same workload at lower aggregate power or pack more serving capacity into the same power budget.

TCO compounds over time. The primary TCO lever in a generational upgrade is node consolidation, not per-node power reduction. If the newer accelerator delivers 2x to 3x the throughput per node, the same workload can run on fewer nodes. Eliminating a single node saves not just its electricity draw (running 24/7 at $0.10/kWh[2]), but also its share of cooling overhead (typically 0.3-0.5x)[3], rack space, hardware procurement, and operational management.
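To make that compounding concrete, the short sketch below (Python, with illustrative inputs) estimates the annual electricity-plus-cooling cost that eliminating one node avoids. The function name and defaults are ours; the $0.10/kWh rate and 0.3-0.5x cooling overhead are the figures cited above, and the 8,000W example draw falls inside the 5,000-10,000W range described earlier.

```python
# Hypothetical per-node operating cost, using the rates cited above.
HOURS_PER_YEAR = 24 * 365

def annual_node_cost(gpu_power_w: float,
                     price_per_kwh: float = 0.10,
                     cooling_overhead: float = 0.4) -> float:
    """Annual electricity cost for one node, including cooling overhead.

    gpu_power_w      -- aggregate GPU power draw of the node, in watts
    price_per_kwh    -- electricity rate (document baseline: $0.10/kWh)
    cooling_overhead -- cooling power as a fraction of IT load
                        (document range: 0.3-0.5x)
    """
    total_kw = gpu_power_w / 1000 * (1 + cooling_overhead)
    return total_kw * HOURS_PER_YEAR * price_per_kwh

# Example: an 8-GPU node drawing 8,000 W of GPU power around the clock
print(f"${annual_node_cost(8000):,.0f} per year")  # -> $9,811 per year
```

At those rates, one eliminated node avoids roughly $10,000 per year in electricity and cooling alone, before counting procurement, rack space, and management.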

The Hardware: CDNA 4 vs. CDNA 3

The MI355X introduces three key efficiency innovations that directly impact performance-per-watt.

Specification | MI300X (CDNA 3) | MI355X (CDNA 4) | Impact
HBM Capacity | 192 GB HBM3 | 288 GB HBM3E | Single-node deployment for 235B-class models
HBM Bandwidth | 5.3 TB/s | 8 TB/s | Faster decode, fewer memory-bound stalls
Infinity Cache | 256 MB | 256 MB | On-die weight buffering reduces HBM power draw
Native FP4 | No | Yes (MXFP4) | 2x theoretical compute density over FP8
TDP (per GPU) | 750W | 1,400W | Higher peak power enables higher peak compute
Process Node | 5/6 nm | 3 nm | Improved transistor efficiency

Table 1 - Architecture Comparison. TDP represents maximum board power per accelerator under peak sustained load. Runtime power varies with workload characteristics; see Table 3 for measured values.

The Infinity Cache advantage. Both generations carry a 256 MB on-die Infinity Cache that keeps frequently used model weights on-chip; in transformer-based models, the same weights see repeated access across attention heads and MoE expert routing, and the cache intercepts those repeated reads before they reach HBM. On the MI355X, the cache fronts a 1.5x-faster HBM3E subsystem, reducing memory subsystem energy per token. Combined with the 3 nm process node, this improves energy efficiency per floating-point operation even though the MI355X delivers substantially higher throughput.

Higher TDP, higher ceiling. The MI355X has a 1,400W TDP per GPU compared to 750W on the MI300X. That higher power envelope enables significantly greater compute throughput. Under MoE inference workloads, actual runtime power depends on model sparsity, concurrency, and sequence length. Both accelerators operate well below TDP under lighter loads but scale power upward as concurrency and compute demand increase.

Benchmark Methodology

All benchmarks use the following configuration:

Parameter | Value
Model | Qwen3-235B-A22B (MoE, 22B active parameters per token)
Precision | FP8 on both MI300X and MI355X; MXFP4 on MI355X noted separately
Serving Engine | vLLM on ROCm 7.0
System | 8-GPU nodes (single-node deployment)
Concurrency Sweep | 1 to 8,192 concurrent requests
Metrics | Tokens/second (TPS), aggregate GPU power (W, measured via amd-smi), tokens-per-watt (TPS/W)

The two systems under test:

Parameter | MI300X System | MI355X System
GPU Accelerators | 8x AMD Instinct MI300X | 8x AMD Instinct MI355X
GPU Architecture | CDNA 3 | CDNA 4
HBM per GPU | 192 GB HBM3 | 288 GB HBM3E
Server | Dell PowerEdge XE9680 | Dell PowerEdge XE9785L
Software Stack | ROCm™ 7.0, Ubuntu 24.04 | ROCm 7.0, Ubuntu 24.04

Concurrency represents the number of user requests the node serves at the same time. Four scenarios map to common enterprise inference patterns; a sketch of the measurement loop follows the table:

Scenario | Input / Output Tokens | Enterprise Use Case
Short Q&A | 128 in / 128 out | Real-time scoring, classification, chatbot
Short Prompt, Long Report | 128 in / 2,048 out | Report generation, regulatory summaries
Long Context, Quick Answer | 2,048 in / 128 out | Contract analysis, earnings Q&A, RAG
Long Context, Long Report | 2,048 in / 2,048 out | Deep research, long-form synthesis
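For concreteness, here is a minimal sketch of one concurrency point in the sweep: fire N simultaneous fixed-length requests at a vLLM OpenAI-compatible endpoint, sample aggregate GPU power mid-flight, and reduce to TPS and TPS/W. It is illustrative only; the endpoint URL, served model name, prompt construction, and the exact amd-smi invocation are assumptions, not the harness behind the tables that follow.

```python
"""Illustrative sweep point, not the production harness used for this paper."""
import json
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8000/v1/completions"  # assumption: local vLLM server
MODEL = "Qwen/Qwen3-235B-A22B"                     # assumption: served model name
PROMPT = "benchmark " * 128                        # rough stand-in for a 128-token prompt

def one_request(max_tokens: int) -> int:
    """Send one completion request; return the number of generated tokens."""
    resp = requests.post(BASE_URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": max_tokens,
        "ignore_eos": True,  # vLLM extension: force a fixed-length decode
    }, timeout=600)
    return resp.json()["usage"]["completion_tokens"]

def sample_gpu_power_w() -> float:
    """Sum instantaneous socket power across all GPUs via amd-smi.
    NOTE: flag names and JSON layout vary across ROCm releases; verify
    against your installed amd-smi build before trusting the numbers."""
    out = subprocess.run(["amd-smi", "metric", "--power", "--json"],
                         capture_output=True, text=True, check=True).stdout
    return sum(float(gpu["power"]["socket_power"]["value"])
               for gpu in json.loads(out))

def run_point(concurrency: int, max_tokens: int = 128) -> tuple[float, float]:
    """Run one concurrency level; return (aggregate TPS, TPS per watt)."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(one_request, max_tokens) for _ in range(concurrency)]
        time.sleep(5)                   # let the batch reach steady state
        power_w = sample_gpu_power_w()  # sample while requests are in flight
        tokens = sum(f.result() for f in futures)
    tps = tokens / (time.time() - start)
    return tps, tps / power_w

# A real harness would use async I/O above a few hundred concurrent requests.
for c in (1, 32, 128, 512, 1024):
    tps, tpsw = run_point(c)
    print(f"concurrency={c:5d}  TPS={tps:9.1f}  TPS/W={tpsw:.3f}")
```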

Performance Results

Across all four scenarios, the MI355X delivers meaningfully higher peak throughput than the MI300X at FP8 precision, with gains that scale as concurrency increases.

Throughput at Production Concurrency

At 1,024 concurrent requests, a common production operating point, the MI355X delivers 1.5x to 3.0x higher throughput than the MI300X across all scenarios.

Scenario | MI300X FP8 TPS | MI355X FP8 TPS | Throughput Gain
128 in / 128 out | 6,268 | 14,043 | 2.2x
128 in / 2,048 out | 8,278 | 12,588 | 1.5x
2,048 in / 128 out | 1,632 | 4,905 | 3.0x
2,048 in / 2,048 out | 4,002 | 7,139 | 1.8x

Table 2 - Throughput at 1,024 Concurrency (FP8 vs. FP8). All values at 1,024 concurrent requests. The MI355X continues scaling beyond 1,024; at 8,192 concurrency it reaches up to 31,753 TPS (128 in / 128 out).

The throughput advantage is most pronounced in the long-context, short-output scenario (2,048 in / 128 out), where the MI355X delivers 3.0x the output of the MI300X. This workload pattern, common in RAG pipelines and document Q&A, benefits from the MI355X's higher HBM bandwidth and Infinity Cache, which accelerate the prefill phase.

At peak concurrency (8,192 requests), the MI355X reaches 31,753 TPS on the short Q&A scenario compared to 7,799 TPS on the MI300X, a 4.1x throughput gain. The MI355X scales throughput nearly linearly from 1,024 to 8,192 concurrency, while the MI300X plateaus earlier.

System Power Consumption

Both accelerators draw power proportional to their compute activity. The MI355X, with its higher TDP and greater compute throughput, draws more aggregate GPU power than the MI300X at matched concurrency. The relevant question is not absolute power, but how much useful work each watt produces.

Scenario | Concurrency | MI300X GPU Power (W) | MI355X GPU Power (W)
128 in / 128 out | 512 | 5,166 | 7,651
128 in / 128 out | 1,024 | 4,679 | 8,214
128 in / 2,048 out | 512 | 5,489 | 7,542
128 in / 2,048 out | 1,024 | 5,502 | 8,011
2,048 in / 128 out | 512 | 4,888 | 10,002
2,048 in / 128 out | 1,024 | 4,758 | 10,051
2,048 in / 2,048 out | 512 | 5,601 | 9,042
2,048 in / 2,048 out | 1,024 | 5,607 | 9,262

Table 3 - Aggregate GPU Power at Production Concurrency (FP8). Power values represent runtime GPU power measured via amd-smi, summed across all 8 accelerators. These values do not include CPU, NIC, or other system component power. MI300X per-GPU runtime power ranges from 585W to 701W (TDP: 750W). MI355X per-GPU runtime power ranges from 943W to 1,256W (TDP: 1,400W).

The MI355X draws 1.4x to 2.1x the aggregate GPU power of the MI300X at production concurrency. This reflects the higher compute throughput the MI355X sustains. Under the long-context scenarios (2,048 input tokens), MI355X GPUs approach 1,250W per accelerator as the prefill phase demands sustained memory bandwidth and compute.

Performance-per-Watt

Performance-per-watt (tokens per second per watt, or TPS/W) combines throughput and power into a single efficiency metric. Higher TPS/W means more tokens generated for each watt of GPU power, and therefore a lower electricity cost per token.

At production concurrency, the MI355X delivers 0.9x to 1.4x the TPS/W of the MI300X at FP8, with the strongest gains appearing in long-context workloads, where the faster HBM3E memory subsystem offsets the increased power draw.

Scenario | Concurrency | MI300X TPS/W | MI355X TPS/W | Efficiency Gain
128 in / 128 out | 512 | 0.98 | 1.09 | 1.1x
128 in / 128 out | 1,024 | 1.34 | 1.71 | 1.3x
128 in / 2,048 out | 512 | 1.06 | 0.98 | 0.9x
128 in / 2,048 out | 1,024 | 1.50 | 1.57 | 1.0x
2,048 in / 128 out | 512 | 0.29 | 0.42 | 1.4x
2,048 in / 128 out | 1,024 | 0.34 | 0.49 | 1.4x
2,048 in / 2,048 out | 512 | 0.57 | 0.65 | 1.1x
2,048 in / 2,048 out | 1,024 | 0.71 | 0.77 | 1.1x

Table 4 - Performance-per-Watt at Production Concurrency (FP8). TPS/W = total output tokens per second / aggregate GPU power (watts).
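As a spot check, the 128 in / 128 out row at 1,024 concurrency follows directly from the throughput in Table 2 and the power in Table 3:

```python
# Spot-check of Table 4 (128 in / 128 out, 1,024 concurrency)
# from Table 2 throughput and Table 3 power.
mi300x_tpsw = 6268 / 4679      # 1.34 TPS/W
mi355x_tpsw = 14043 / 8214     # 1.71 TPS/W
print(f"{mi355x_tpsw / mi300x_tpsw:.2f}x")  # 1.28x, reported as 1.3x
```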

The FP8-to-FP8 efficiency story is nuanced. In the short-prompt, long-output scenario (128 in / 2,048 out) at 512 concurrency, the MI300X and MI355X are at near-parity on TPS/W. The MI355X's throughput advantage is offset by its higher power draw. At 1,024 concurrency and above, the MI355X pulls ahead as its throughput continues to scale while the MI300X plateaus.

The strongest efficiency gains appear in the long-context, short-output scenario (2,048 in / 128 out), reaching 1.4x better TPS/W. This is the workload pattern for RAG pipelines, contract analysis, and document Q&A.

The MXFP4 Advantage: Where MI355X Efficiency Takes Off

MXFP4 (microscaling 4-bit floating point) is a more compact numerical format that, of the two accelerators tested, only the MI355X supports natively. It delivers higher compute density at equal or lower power draw, and at high concurrency it yields substantial efficiency gains over FP8 on the same hardware.

Scenario | Concurrency | FP8 TPS/W | MXFP4 TPS/W | MXFP4 Efficiency Gain
128 in / 128 out | 512 | 1.09 | 1.59 | +46%
128 in / 2,048 out | 1,024 | 1.57 | 2.12 | +35%
128 in / 2,048 out | 8,192 | 2.35 | 5.42 | +130%
2,048 in / 128 out | 1,024 | 0.49 | 0.68 | +40%
2,048 in / 2,048 out | 2,048 | 0.81 | 1.64 | +103%
2,048 in / 2,048 out | 8,192 | 0.86 | 1.98 | +130%

Table 5 - MI355X MXFP4 vs. FP8 Efficiency (Selected Scenarios). All values on MI355X hardware. MXFP4 quantization via AMD Quark.

At high concurrency, MXFP4 doubles or more than doubles tokens-per-watt on the MI355X. MXFP4 power draw is typically 5 to 15% lower than FP8 at matched concurrency (for example, 7,022W vs. 8,214W at 128 in / 128 out, 1,024 concurrency), while throughput increases 16% to 130%. Higher throughput at lower power compounds into the efficiency gains shown in Table 5.

Full Generational Picture: MI300X FP8 to MI355X MXFP4

For organizations evaluating a hardware upgrade, the complete picture spans both the architecture change and the precision format. Here is the full progression from MI300X FP8 to MI355X MXFP4 at 1,024 concurrency:

Scenario | MI300X FP8 TPS/W | MI355X FP8 TPS/W | MI355X MXFP4 TPS/W | Total Gain (MXFP4 vs. MI300X FP8)
128 in / 128 out | 1.34 | 1.71 | 2.32 | 1.7x
128 in / 2,048 out | 1.50 | 1.57 | 2.12 | 1.4x
2,048 in / 128 out | 0.34 | 0.49 | 0.68 | 2.0x
2,048 in / 2,048 out | 0.71 | 0.77 | 1.25 | 1.8x

Table 6 - Generational Efficiency Gain at 1,024 Concurrency (TPS/W). All values at 1,024 concurrent requests.

Organizations currently running MI300X infrastructure can expect a 1.4x to 2.0x improvement in tokens-per-watt by upgrading to MI355X with MXFP4 at production concurrency. Combined with the 1.5x to 3.0x throughput increase, the upgrade reduces the number of nodes required to meet a given throughput SLA, which drives the primary TCO benefit: fewer nodes, less rack space, less procurement cost, and less operational overhead.

What This Means in Dollars per Token

Efficiency ratios become concrete when translated into electricity cost per token. Here is a worked example at 1,024 concurrent requests for the short Q&A scenario (128 in / 128 out):

Metric | MI300X FP8 | MI355X FP8 | MI355X MXFP4
Throughput | 6,268 TPS | 14,043 TPS | 16,324 TPS
System GPU Power | 4,679W | 8,214W | 7,022W
Hourly Electricity Cost | $0.468/hr | $0.821/hr | $0.702/hr
Cost per 1M Tokens | $0.0207 | $0.0162 | $0.0120
Cost Reduction vs. MI300X Baseline | (baseline) | 22% | 42%

Table 7 - Electricity Cost per Token at 1,024 Concurrency (128 in / 128 out). Assumes $0.10/kWh. Cost per 1M tokens = (power in kW × $/kWh) / (TPS × 3,600) × 1,000,000.

The MI355X at MXFP4 reduces the marginal electricity cost per million tokens by 42% compared to the MI300X at FP8. While the MI355X draws more power per node, its higher throughput means each token costs less to produce.
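The table's arithmetic is straightforward to reproduce. Below is a short Python rendering of the caption's formula; the function name is ours, and the inputs are the measured values from Table 7.

```python
def cost_per_million_tokens(power_w: float, tps: float,
                            price_per_kwh: float = 0.10) -> float:
    """Electricity cost to generate one million tokens.

    Implements the Table 7 formula:
    (power in kW x $/kWh) / (TPS x 3,600) x 1,000,000
    """
    hourly_cost = power_w / 1000 * price_per_kwh  # $ per hour
    tokens_per_hour = tps * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

print(f"MI300X FP8:   ${cost_per_million_tokens(4679, 6268):.4f}")   # $0.0207
print(f"MI355X FP8:   ${cost_per_million_tokens(8214, 14043):.4f}")  # $0.0162
print(f"MI355X MXFP4: ${cost_per_million_tokens(7022, 16324):.4f}")  # $0.0119
# (Table 7 rounds the MXFP4 figure to $0.0120.)
```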

The larger savings come from node consolidation. If the MI355X produces 2.6x the throughput of the MI300X at matched concurrency (16,324 TPS at MXFP4 vs. 6,268 TPS at FP8), a deployment that required 10 MI300X nodes can be served by 4 MI355X nodes, cutting hardware procurement, rack space, cooling infrastructure, and operational complexity. In general, N MI300X nodes can often be replaced by roughly N/3 to N/2 MI355X nodes at equivalent throughput and SLA.
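Node-count sizing under that assumption reduces to a ceiling division. A minimal illustrative helper (not a capacity planner; it ignores redundancy and headroom):

```python
import math

def nodes_required(target_tps: float, per_node_tps: float) -> int:
    """Smallest node count whose aggregate throughput meets the target."""
    return math.ceil(target_tps / per_node_tps)

# Fleet sized to the throughput of 10 MI300X FP8 nodes (Table 7 operating point)
target = 10 * 6268                    # 62,680 TPS
print(nodes_required(target, 6268))   # 10 MI300X nodes
print(nodes_required(target, 16324))  # 4 MI355X MXFP4 nodes
```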

Latency at Scale: Meeting SLA Targets

Latency headroom has direct cost implications. If the MI355X meets SLA targets at higher concurrency than the MI300X, fewer nodes are required to serve the same traffic volume.

Concurrency | MI300X Latency (ms) | MI355X Latency (ms)
32 | 134.6 | 106.8
64 | 89.1 | 74.6
128 | 55.9 | 43.2
256 | 37.2 | 25.9
512 | 25.3 | 15.3
1,024 | 20.4 | 9.1

Table 8 - Derived Latency at Scale (128 in / 128 out, FP8). Derived latency (ms) = max output tokens / aggregate TPS × 1,000.

At 1,024 concurrent requests, the MI355X delivers 9.1ms latency versus 20.4ms for the MI300X. That headroom means the MI355X can absorb 2x the concurrent load before approaching the same latency threshold, reducing the node count required to maintain SLAs under peak traffic.
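For reference, the Table 8 derivation in code form, using the Table 2 throughput values at 1,024 concurrency (the function name is ours):

```python
def derived_latency_ms(max_output_tokens: int, aggregate_tps: float) -> float:
    """Table 8 derivation: time for the node to emit one request's worth
    of output tokens at its aggregate generation rate."""
    return max_output_tokens / aggregate_tps * 1000

print(f"MI300X: {derived_latency_ms(128, 6268):.1f} ms")   # 20.4 ms
print(f"MI355X: {derived_latency_ms(128, 14043):.1f} ms")  # 9.1 ms
```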

What This Means for Your Data Center

The gains above are measured per node. Infrastructure decisions are made per rack, per budget, and per deployment cycle. Here is how those numbers translate into three common planning scenarios.

Scenario A: Node Consolidation. At 1,024 concurrency in the 128in/128out scenario, one MI355X node with MXFP4 produces 16,324 TPS. An MI300X node produces 6,268 TPS. A deployment currently using 5 MI300X nodes (31,340 TPS) can consolidate to 2 MI355X nodes (32,648 TPS), eliminating 3 nodes' worth of hardware, rack space, and operational costs.

Scenario B: Scale-Up. Within the same node count, deliver 1.5x to 3.0x more throughput. Replace MI300X nodes one-for-one with MI355X and immediately increase serving capacity. For organizations whose current bottleneck is throughput rather than node count, this extends the useful life of each rack position.

Scenario C: Efficiency Optimization. Move to MI355X with MXFP4 and achieve up to 2.0x better tokens-per-watt. For teams that validate quantized accuracy against their specific use case, MXFP4 provides meaningful electricity cost reduction alongside the throughput increase.

Workload Pattern | Recommended Config
Real-time scoring, fraud detection, chatbot | MI355X MXFP4 (highest throughput scaling)
Report generation, regulatory summaries | MI355X MXFP4 (strong MXFP4 efficiency gains at scale)
Contract analysis, RAG, document Q&A | MI355X FP8 or MXFP4 (strongest per-watt gains at FP8)
Deep research, long-form synthesis | MI355X FP8 (consistent throughput and efficiency)

Table 9 - Recommended Configuration by Workload Pattern

Conclusion: Efficiency is the New Performance

The MI355X delivers a generational improvement in AI inference capability. Three numbers anchor the value proposition:

1.5x to 3.0x higher throughput at production concurrency (FP8), enabling node consolidation and higher per-node serving capacity.

Up to 2.0x better performance-per-watt when combining the MI355X architecture with MXFP4 precision, reducing electricity cost per token by up to 42%.

2x the concurrency headroom before hitting latency thresholds, reducing node count for SLA-constrained deployments.

The primary TCO lever is node consolidation. By delivering 2x to 3x the throughput per node, the MI355X allows organizations to serve the same workload with fewer nodes, reducing hardware, rack space, cooling, and operational costs. Combined with MXFP4's efficiency gains, the MI355X provides a clear path to lower total cost of ownership for production AI inference.

For infrastructure planners evaluating a hardware refresh, the question is: "How many tokens per node, per watt, per dollar?" On all three metrics, MI355X represents a meaningful step forward.

Learn more about AMD Instinct MI355X at amd.com/instinct

References

[1] U.S. Energy Information Administration, "Electric Power Monthly," Table 5.6.A, data for December 2025, released Feb. 24, 2026. [Online]. Available: https://www.eia.gov/electricity/monthly/epm_table_grapher.php?t=epmt_5_6_a

[2] Uptime Institute, "Uptime Institute Global Data Center Survey 2024," Jul. 2024. [Online]. Available: https://uptimeinstitute.com/resources/research-and-reports/uptime-institute-global-data-center-survey-results-2024

[3] U.S. Department of Energy, National Renewable Energy Laboratory, "Best Practices Guide for Energy-Efficient Data Center Design," revised Jul. 2024. [Online]. Available: https://www.energy.gov/sites/default/files/2024-07/best-practice-guide-data-center-design_0.pdf

Notes

[1] Average enterprise rack densities remain below 8 kW according to the Uptime Institute Global Data Center Survey [2], though AI-optimized facilities increasingly deploy racks above 30 kW.

[2] $0.10/kWh is used as a conservative baseline. The U.S. average industrial electricity rate was $0.0894/kWh in December 2025; the commercial rate was $0.1373/kWh [1]. Actual data center rates vary by region, utility contract, and load factor.

[3] This corresponds to a power usage effectiveness (PUE) of 1.3-1.5. The 2024 industry average PUE is 1.56 [2]; newer builds consistently achieve 1.3 or better [3].


Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge, Dell iDRAC and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.

DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.