The TCO Titan: MI355X vs. MI300X on Dell PowerEdge™ XE9785L with AMD Instinct™ Accelerators
March 2026
| Executive Summary
As AI inference scales from pilot to production, the economics shift. Raw throughput no longer determines which accelerator wins a deployment decision. The defining metric is cost per token: how much power, cooling, and rack space each generated token consumes over a three-year lifecycle.
In head-to-head benchmarks running Qwen3-235B-A22B at FP8 precision on Dell PowerEdge XE9785, the AMD Instinct MI355X (CDNA 4) delivers up to 9.5x better performance-per-watt than its predecessor, the MI300X (CDNA 3). The MI355X produces equal or higher throughput while drawing 65-72% less system power across all tested scenarios.
When FP4 precision is enabled on the MI355X, the total efficiency improvement over MI300X FP8 reaches 11x. These gains translate directly into lower electricity costs, denser rack deployments, reduced cooling requirements, and faster payback on hardware investments.
Key Results at a Glance
Up to 9.5x performance-per-watt: MI355X FP8 vs. MI300X FP8 at production concurrency
65-72% lower system power: MI355X draws ~2,200W vs. 6,500-7,600W for MI300X
Up to 11x total efficiency gain: MI300X FP8 to MI355X FP4 (hardware + precision)
$164K+ three-year power savings: per 10-node deployment at $0.10/kWh, cooling included
| Why Performance-per-Watt Defines AI Economics
Power efficiency has become the defining accelerator metric for production AI inference. Here is why.
The power problem. Large language model inference is a sustained workload. Unlike training, which runs for days or weeks and then stops, inference runs 24/7 for the life of the deployment. Every watt your accelerator consumes translates into electricity, cooling, rack space, and UPS capacity costs that compound over the deployment lifecycle.
Data center capacity is constrained. Most enterprise data centers were built for 8-15 kW per rack[1]. A single 8-GPU MI300X node running Qwen3-235B at high concurrency draws 6,500-7,600W of GPU power. That approaches an entire rack's power envelope on a single node. An 8-GPU MI355X node performing the same workload draws 2,200-2,500W, freeing 4,200-5,400W per node for denser deployments and lower facility costs.
TCO compounds over time. A 4,800W average reduction per node, running 24/7 at $0.10/kWh[2], saves approximately $4,200 per node per year in electricity alone. Factor in cooling overhead (typically 0.3-0.5x[3] of IT power in an enterprise facility), and that number reaches $5,500-$6,300 annually. For a 10-node deployment, the three-year power savings alone reach $164,000-$189,000.
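The savings arithmetic above can be sketched in a few lines. This is a minimal model using only the figures quoted in the text; the electricity rate and cooling-overhead fractions are the assumptions stated above, not measured values.

```python
# Sketch of the three-year power-savings arithmetic from the text.
# Inputs are the figures quoted above; rate and cooling overhead are assumptions.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of sustained inference

def annual_power_savings(watt_delta, rate_per_kwh=0.10, cooling_overhead=0.0):
    """Annual electricity savings for a per-node power reduction (watts).

    cooling_overhead: extra cooling power as a fraction of IT power
    (0.3-0.5 is typical for an enterprise facility, i.e. PUE 1.3-1.5).
    """
    kw = watt_delta / 1000 * (1 + cooling_overhead)
    return kw * HOURS_PER_YEAR * rate_per_kwh

electricity_only = annual_power_savings(4800)                         # ≈ $4,205/node/yr
with_cooling_low = annual_power_savings(4800, cooling_overhead=0.3)   # ≈ $5,466
with_cooling_high = annual_power_savings(4800, cooling_overhead=0.5)  # ≈ $6,307

# 10 nodes over a 3-year depreciation cycle, cooling included:
fleet_savings = (with_cooling_low * 10 * 3, with_cooling_high * 10 * 3)
# ≈ ($164,000, $189,000)
```

Running the numbers this way reproduces the $164,000-$189,000 three-year range quoted above.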
| The Hardware: CDNA 4 vs. CDNA 3
The MI355X introduces three key efficiency innovations that directly impact performance-per-watt.
Specification | MI300X (CDNA 3) | MI355X (CDNA 4) | Impact |
HBM Capacity | 192 GB HBM3 | 288 GB HBM3e | Single-node deployment for 235B-class models |
HBM Bandwidth | 5.3 TB/s | 8 TB/s | Faster decode, fewer memory-bound stalls |
Infinity Cache | None | 256 MB | On-die weight buffering reduces HBM power draw |
Native FP4 | No | Yes (MXFP4) | 2x theoretical compute density over FP8 |
TDP (per GPU) | 750W | 500W | Lower thermal envelope per accelerator |
Process Node | 5/6 nm | 3 nm | Improved transistor efficiency |
Table 1 | Architecture Comparison
The Infinity Cache advantage. The 256MB on-die Infinity Cache on the MI355X deserves special attention. In transformer-based models, the same weight tensors see repeated access across attention heads and MoE expert routing. On the MI300X, every weight access hits HBM, which is both power-hungry and bandwidth-limited. The MI355X's Infinity Cache intercepts repeated accesses before they reach HBM, performing more useful work per watt of HBM power consumed. This architectural advantage shows up clearly in the power data below.
| Benchmark Methodology
All benchmarks use the following configuration:
Parameter | Value |
Model | Qwen3-235B-A22B (MoE, 22B active parameters per token) |
Precision | FP8 on both MI300X and MI355X; FP4 on MI355X noted separately |
Serving Engine | vLLM on ROCm 7.0 |
System | 8-GPU nodes (single-node deployment) |
Concurrency Sweep | 1 to 8,192 concurrent requests |
Metrics | Tokens/second (TPS), system GPU power (W), tokens-per-watt (TPS/W) |
Four scenarios map to common enterprise inference patterns:
Scenario | Input / Output Tokens | Enterprise Use Case |
Short Q&A | 128 in / 128 out | Real-time scoring, classification, chatbot |
Short Prompt, Long Report | 128 in / 2,048 out | Report generation, regulatory summaries |
Long Context, Quick Answer | 2,048 in / 128 out | Contract analysis, earnings Q&A, RAG |
Long Context, Long Report | 2,048 in / 2,048 out | Deep research, long-form synthesis |
| Performance Results
Across all four scenarios, the MI355X delivers meaningfully higher peak throughput than the MI300X at FP8 precision.
Throughput in Tokens per Second
Scenario | MI300X FP8 Peak TPS | MI355X FP8 Peak TPS | Throughput Gain |
128 in / 128 out | 6,268 | 31,753 | 5.1x |
128 in / 2,048 out | 8,278 | 20,540 | 2.5x |
2,048 in / 128 out | 1,777 | 5,108 | 2.9x |
2,048 in / 2,048 out | 4,096 | 7,999 | 2.0x |
Table 2 | Peak Throughput Comparison (FP8 vs. FP8)
MI300X peaks at 1,024 concurrency; MI355X scales to 8,192. At matched 1,024 concurrency, the 128in/128out gain is 2.2x (14,043 vs. 6,268 TPS).
These gains range from 2x to over 5x depending on workload profile. But throughput alone does not capture the full picture. The MI355X achieves these numbers at a fraction of the MI300X's power draw. That is where the economic story begins.
System Power Consumption
At production-relevant concurrency levels (256-1,024 concurrent requests), the MI300X node draws 6,500-7,600W while the MI355X node stays in the 2,200-2,500W range. This pattern holds across every scenario tested.
Scenario | Concurrency | MI300X Power (W) | MI355X Power (W) | Power Reduction |
128 in / 128 out | 512 | 7,635 | 2,172 | 72% |
128 in / 128 out | 1,024 | 6,463 | 2,240 | 65% |
128 in / 2,048 out | 512 | 7,561 | 2,187 | 71% |
128 in / 2,048 out | 1,024 | 6,746 | 2,197 | 67% |
2,048 in / 128 out | 512 | 6,700 | 2,172 | 68% |
2,048 in / 128 out | 1,024 | 6,943 | 2,196 | 68% |
2,048 in / 2,048 out | 512 | 7,290 | 2,195 | 70% |
2,048 in / 2,048 out | 1,024 | 7,475 | 2,199 | 71% |
Table 3 | System Power at Key Concurrency Levels (FP8)
The MI355X draws roughly 2,200W while the MI300X requires 6,500-7,600W, despite the MI355X delivering equal or higher throughput. That 65-72% power reduction is not a cherry-picked result. It holds across short and long contexts, low and high output lengths, and low to high concurrency.
Performance-per-Watt
Performance-per-watt (tokens per second per watt, or TPS/W) combines throughput and power into a single efficiency metric. Higher TPS/W means more tokens generated for every dollar spent on electricity.
Scenario | Concurrency | MI300X TPS/W | MI355X TPS/W | Efficiency Gain |
128 in / 128 out | 512 | 0.66 | 3.85 | 5.8x |
128 in / 128 out | 1,024 | 0.97 | 6.27 | 6.5x |
128 in / 2,048 out | 512 | 0.77 | 3.38 | 4.4x |
128 in / 2,048 out | 1,024 | 1.23 | 5.73 | 4.7x |
2,048 in / 128 out | 512 | 0.21 | 1.93 | 9.1x |
2,048 in / 128 out | 1,024 | 0.24 | 2.23 | 9.5x |
2,048 in / 2,048 out | 512 | 0.44 | 2.66 | 6.0x |
2,048 in / 2,048 out | 1,024 | 0.54 | 3.25 | 6.1x |
Table 4 | Performance-Per-Watt at Production Concurrency (FP8)
TPS/W = total output tokens per second / system GPU power (watts).
The MI355X delivers between 4.4x and 9.5x better performance-per-watt than the MI300X across all tested scenarios at production concurrency. The efficiency advantage grows with concurrency because MI355X scales throughput efficiently while power draw remains nearly flat (2,172-2,534W). The MI300X sees power climb from 5,500W at low concurrency to over 7,600W at peak load.
The strongest efficiency gains appear in the long-context, short-output scenario (2,048 in / 128 out), achieving up to 9.5x better TPS/W. This is the workload pattern for RAG pipelines, contract analysis, and document Q&A where long documents are ingested but responses are concise.
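The metric itself is a simple ratio. As a sketch, here is the TPS/W calculation for the short Q&A row (128 in / 128 out) at 1,024 concurrency, using the throughput and power figures reported in the tables above:

```python
# Tokens-per-watt as defined in the text: output TPS / system GPU power.
# Figures are the 128in/128out @ 1,024-concurrency values from Tables 3-5.

def tokens_per_watt(tps, system_power_w):
    """Output tokens per second per watt of system GPU power."""
    return tps / system_power_w

mi300x_tpsw = tokens_per_watt(6_268, 6_463)   # ≈ 0.97 TPS/W
mi355x_tpsw = tokens_per_watt(14_043, 2_240)  # ≈ 6.27 TPS/W
gain = mi355x_tpsw / mi300x_tpsw              # ≈ 6.5x
```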
What This Means in Dollars per Token
Abstract efficiency ratios become concrete when translated into electricity cost per token. Here is a worked example at 1,024 concurrent requests for the short Q&A scenario (128 in / 128 out):
Metric | MI300X FP8 | MI355X FP8 |
Throughput | 6,268 TPS | 14,043 TPS |
System GPU Power | 6,463W | 2,240W |
Hourly Electricity Cost | $0.646/hr | $0.224/hr |
Cost per 1M Tokens | $0.0286 | $0.0044 |
Electricity Cost Reduction | — | 85% |
Table 5 | Electricity Cost per Token at 1,024 Concurrency (FP8). Assumes $0.10/kWh. Cost per 1M tokens = (power in kW * $/kWh) / (TPS * 3,600) * 1,000,000.
At matched concurrency, the MI355X reduces the marginal electricity cost per token by 85%. For workloads processing billions of tokens monthly, this difference compounds rapidly.
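The Table 5 formula is easy to reproduce. A minimal sketch, using the stated $0.10/kWh assumption and the measured throughput and power values:

```python
# Electricity cost per million output tokens, per the Table 5 formula.

def cost_per_million_tokens(power_w, tps, rate_per_kwh=0.10):
    """Electricity cost ($) to generate one million output tokens."""
    hourly_cost = power_w / 1000 * rate_per_kwh  # kW * $/kWh = $/hr
    tokens_per_hour = tps * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

mi300x = cost_per_million_tokens(6_463, 6_268)   # ≈ $0.0286
mi355x = cost_per_million_tokens(2_240, 14_043)  # ≈ $0.0044
reduction = 1 - mi355x / mi300x                  # ≈ 0.85, i.e. 85%
```

Note this covers electricity only; cooling overhead (PUE) would scale both figures by the same factor and leave the 85% reduction unchanged.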
The FP4 Bonus: Even Higher Efficiency on MI355X
Moving from FP8 to FP4 on the MI355X increases throughput while power remains nearly unchanged, delivering near-zero-cost efficiency gains. These gains grow with concurrency, because FP4's higher compute density keeps the GPU fed at scale.
Scenario | Concurrency | FP8 TPS/W | FP4 TPS/W | FP4 Efficiency Gain |
128 in / 128 out | 512 | 3.85 | 4.46 | +16% |
128 in / 2,048 out | 1,024 | 5.73 | 6.51 | +14% |
128 in / 2,048 out | 8,192 | 8.99 | 17.74 | +97% |
2,048 in / 128 out | 1,024 | 2.23 | 2.63 | +18% |
2,048 in / 2,048 out | 2,048 | 3.32 | 5.94 | +79% |
2,048 in / 2,048 out | 8,192 | 3.62 | 7.70 | +113% |
Table 6 | MI355X FP4 vs. FP8 Efficiency (Selected Scenarios)
At high concurrency, FP4 nearly doubles tokens-per-watt on the MI355X. The power delta between FP4 and FP8 is typically under 200W (~2,350W vs. ~2,170W), while throughput can increase by 50-130%.
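The gain column in Table 6 is the relative TPS/W improvement. As a quick sketch, recomputing three representative rows from the reported TPS/W values:

```python
# The FP4-vs-FP8 gain column of Table 6, recomputed from the TPS/W values.

rows = [  # (scenario, concurrency, fp8_tpsw, fp4_tpsw)
    ("128 in / 128 out",       512, 3.85,  4.46),
    ("128 in / 2,048 out",    8192, 8.99, 17.74),
    ("2,048 in / 2,048 out",  8192, 3.62,  7.70),
]

gains = {name: round((fp4 / fp8 - 1) * 100) for name, _, fp8, fp4 in rows}
# {'128 in / 128 out': 16, '128 in / 2,048 out': 97, '2,048 in / 2,048 out': 113}
```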
Three-Way Comparison: MI300X FP8 to MI355X FP4
For the complete generational picture, here is the full progression from MI300X FP8 to MI355X FP4. This represents the total efficiency improvement available when upgrading both hardware and precision.
Scenario | MI300X FP8 | MI355X FP8 | MI355X FP4 | Total Gain |
128 in / 128 out | 0.97 | 6.27 | 6.94 | 7.2x |
128 in / 2,048 out | 1.23 | 5.73 | 6.51 | 5.3x |
2,048 in / 128 out | 0.24 | 2.23 | 2.63 | 11.0x |
2,048 in / 2,048 out | 0.54 | 3.25 | 5.94* | 11.0x* |
Table 7 | Generational Efficiency Gain at 1,024 Concurrency (TPS/W). All values measured at 1,024 concurrent requests for consistent comparison. *2,048in/2,048out FP4 value is at 2,048 concurrency (FP4 optimal for this scenario).
Organizations currently running MI300X infrastructure can expect a 5x-11x improvement in tokens-per-watt by upgrading to MI355X with FP4 at production concurrency. Even staying at FP8, the generational improvement is 4x-10x.
Latency at Scale: Meeting SLA Targets
Latency headroom has direct cost implications. If the MI355X meets SLA targets at higher concurrency than the MI300X, you serve the same traffic with fewer nodes.
Concurrency | MI300X Latency (ms) | MI355X Latency (ms) |
32 | 134.6 | 106.8 |
64 | 89.1 | 74.6 |
128 | 55.9 | 43.2 |
256 | 37.2 | 25.9 |
512 | 25.3 | 15.3 |
1,024 | 20.4 | 9.1 |
Table 8 | 100ms Latency Feasibility (128 in / 128 out, FP8). Latency = max output tokens / node TPS × 1,000 ms.
At 1,024 concurrent requests, the MI355X delivers 9.1ms latency versus 20.4ms for the MI300X. That headroom means the MI355X can absorb 2x the concurrent load before approaching the same latency threshold, reducing the node count required to maintain SLAs under peak traffic.
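The latency model behind Table 8 is stated in its caption. A minimal sketch, reproducing the 1,024-concurrency row from the throughput figures above:

```python
# Table 8 latency model: per-request latency approximated as
# max output tokens divided by aggregate node throughput.

def request_latency_ms(max_output_tokens, node_tps):
    """Approximate per-request latency in milliseconds."""
    return max_output_tokens / node_tps * 1000

mi300x = request_latency_ms(128, 6_268)   # ≈ 20.4 ms
mi355x = request_latency_ms(128, 14_043)  # ≈ 9.1 ms
```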
| What This Means for Your Data Center
The efficiency gains above are measured per node. Infrastructure decisions are made per rack, per budget, and per deployment cycle. Here is how the numbers translate into three common planning scenarios.
Scenario A: Consolidation. If your current MI300X deployment delivers the throughput you need, the MI355X can match it with fewer nodes. At 1,024 concurrency, one MI355X node produces 14,043 TPS versus 6,268 TPS on the MI300X. A two-node MI300X deployment consolidates to a single MI355X node at comparable throughput, reducing GPU power from roughly 13kW to 2.2kW, an 83% reduction.
Scenario B: Scale-Up. If you are power-constrained, the MI355X puts more inference throughput into the same power envelope. Two MI300X nodes draw approximately 13kW and produce 12,500 TPS combined. That same 13kW supports five MI355X nodes producing over 70,000 TPS, a 5.6x increase in aggregate throughput within the same power budget.
Scenario C: TCO Reduction. For a straightforward hardware refresh at the same node count, the MI355X delivers 2.2x higher throughput and approximately 4,800W lower power draw per node. With cooling overhead factored in, each node saves $5,500-$6,300 annually. Over a three-year depreciation cycle across a 10-node cluster, that represents $164,000-$189,000 in power and cooling savings, while also doubling throughput capacity.
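The consolidation and scale-up arithmetic in Scenarios A and B can be sketched as a small capacity-planning calculation, using the per-node 128in/128out figures at 1,024 concurrency from the tables above:

```python
# Capacity-planning sketch behind Scenarios A-B, using per-node figures
# from the 128in/128out results at 1,024 concurrency (Tables 3 and 5).
import math

MI300X = {"tps": 6_268, "power_w": 6_463}
MI355X = {"tps": 14_043, "power_w": 2_240}

def nodes_for_throughput(node, target_tps):
    """Nodes needed to hit a target aggregate throughput."""
    return math.ceil(target_tps / node["tps"])

def nodes_for_power_budget(node, budget_w):
    """Whole nodes that fit inside a power envelope."""
    return budget_w // node["power_w"]

# Scenario A: consolidate two MI300X nodes' worth of traffic.
target = 2 * MI300X["tps"]                     # ≈ 12,536 TPS
needed = nodes_for_throughput(MI355X, target)  # 1 MI355X node

# Scenario B: fill the ~13 kW two-node MI300X envelope with MI355X nodes.
fit = nodes_for_power_budget(MI355X, 2 * MI300X["power_w"])  # 5 nodes
aggregate = fit * MI355X["tps"]                # ≈ 70,215 TPS, ~5.6x
```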
Workload Pattern | Recommended Config |
Real-time scoring, fraud detection, chatbot | MI355X FP4 |
Report generation, regulatory summaries | MI355X FP4 |
Contract analysis, RAG, document Q&A | MI355X FP8 or FP4 |
Deep research, long-form synthesis | MI355X FP8 |
Table 9 | Recommended Config for Differing Workloads
| Conclusion: Efficiency is the New Performance
The MI355X is faster than the MI300X. That matters, but it is not the headline. The headline is that the MI355X delivers equal or greater throughput at 65-72% lower power, yielding 4x-10x better performance-per-watt at FP8, and up to 11x with FP4.
For organizations planning AI infrastructure investments, this efficiency advantage translates directly into lower electricity costs, denser rack deployments, reduced cooling requirements, and faster payback on hardware investments. With FP4 support on MI355X, teams that validate quantized accuracy unlock additional efficiency gains at near-zero power cost.
The question for infrastructure planners is no longer "how fast is it?" It is "how many tokens can I generate per dollar, per watt, per rack unit?" On that metric, the MI355X represents a generational shift.
Learn more about AMD Instinct MI355X at amd.com/instinct
| References
[1] U.S. Energy Information Administration, "Electric Power Monthly," Table 5.6.A, data for December 2025, released Feb. 24, 2026. [Online]. Available: https://www.eia.gov/electricity/monthly/epm_table_grapher.php?t=epmt_5_6_a
[2] Uptime Institute, "Uptime Institute Global Data Center Survey 2024," Jul. 2024. [Online]. Available: https://uptimeinstitute.com/resources/research-and-reports/uptime-institute-global-data-center-survey-results-2024
[3] U.S. Department of Energy, National Renewable Energy Laboratory, "Best Practices Guide for Energy-Efficient Data Center Design," revised Jul. 2024. [Online]. Available: https://www.energy.gov/sites/default/files/2024-07/best-practice-guide-data-center-design_0.pdf
Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge, Dell iDRAC and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.
***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.
[1] Average enterprise rack densities remain below 8 kW according to the Uptime Institute Global Data Center Survey [2], though AI-optimized facilities increasingly deploy racks above 30 kW.
[2] $0.10/kWh is used as a conservative baseline. The U.S. average industrial electricity rate was $0.0894/kWh in December 2025; the commercial rate was $0.1373/kWh [1]. Actual data center rates vary by region, utility contract, and load factor.
[3] This corresponds to a power usage effectiveness (PUE) of 1.3-1.5. The 2024 industry average PUE is 1.56 [2]; newer builds consistently achieve 1.3 or better [3].