The FP4 Breakthrough: How MXFP4 Quantization Delivers Up to 6.1x Inference Throughput Without Sacrificing Accuracy on Dell PowerEdge™ XE9785L with AMD Instinct™ MI355X Accelerators
How OCP Microscaling Formats and AMD Quark Deliver 4-Bit Quantization at Near Zero Error Rate
February 2026
| Executive Summary
Inference cost now exceeds training cost as the primary infrastructure challenge for production AI. Serving a 235-billion parameter model at enterprise concurrency can require multiple accelerator nodes, with power and hardware expenditures scaling proportionally. For IT leaders under pressure to expand AI capacity without expanding budgets, precision format selection offers one of the most effective levers available.
This paper presents benchmark results for MXFP4 (4-bit floating-point) quantization on the AMD Instinct MI355X accelerator, deployed in a Dell PowerEdge XE9785L server with eight GPUs. The results demonstrate substantial generational throughput gains over the prior-generation MI300X, improved power-normalized efficiency, and expanded concurrent-user capacity at latency-sensitive SLAs. The deployment path requires minimal operational change: Quark-quantized MXFP4 models export directly to standard Hugging Face safetensors format and deploy through vLLM or SGLang with no code changes.
Key Results at a Glance
Up to 2.3x throughput over FP8 | MXFP4 vs. FP8 on the same MI355X node at high concurrency |
Up to 6.1x throughput over MI300X | MI355X MXFP4 vs. MI300X FP8 at production concurrency |
Up to 10x power efficiency gain | Tokens per second per watt, MI355X MXFP4 vs. MI300X FP8 |
8x more concurrent users | Within a 100ms per-token SLA on short-form workloads |
| The Quantization Challenge
Every major enterprise now quantizes its production models. The question is no longer whether to quantize, but how far precision can drop before accuracy degrades and users notice. Traditional INT4 quantization forces a painful tradeoff. Its uniform precision grid clips outlier values or sacrifices resolution across the entire weight distribution. This results in measurable accuracy loss that pushes teams to over-provision hardware at higher precision, erasing the cost savings quantization was supposed to deliver.
Research consistently supports this effect: naive INT4 post-training quantization can produce perplexity increases and accuracy drops that make models unsuitable for production deployment[1]. Advanced algorithms such as GPTQ and AWQ partially address these limitations through sophisticated calibration, but they cannot overcome the fundamental constraint of a uniform representation.
The AMD Instinct MI355X introduces hardware-native support for MXFP4, a floating-point format designed specifically for the distribution patterns observed in deep learning workloads. Combined with the AMD Quark quantization toolkit and ROCm 7 runtime optimizations, this architecture delivers 4-bit inference throughput with accuracy preservation that approaches higher-precision baselines.
| Understanding Microscaling Formats
The Open Compute Project (OCP) published the Microscaling Formats (MX) specification with contributions from AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm[2]. MX formats define a new approach to low-precision data representation. Rather than treating each value independently, MX formats share a scaling factor across blocks of elements, enabling efficient representation with reduced storage overhead.
MXFP4 encodes each block of 32 elements using 4-bit floating-point values (E2M1 format: 2 exponent bits, 1 mantissa bit) with a shared 8-bit scale factor (E8M0). This structure requires only 136 bits per block (17 bytes), compared to 1,024 bits (128 bytes) for the equivalent FP32 representation. The 7.5x reduction in storage translates directly to reduced memory bandwidth requirements and improved cache utilization during inference.
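The per-block accounting above can be checked with a few lines of arithmetic; a minimal sketch:

```python
# Per-block storage accounting for MXFP4, per the OCP MX specification:
# 32 elements at 4 bits each (E2M1) plus one shared 8-bit scale (E8M0).
BLOCK_SIZE = 32

mxfp4_bits = BLOCK_SIZE * 4 + 8          # 136 bits = 17 bytes per block
fp32_bits = BLOCK_SIZE * 32              # 1,024 bits = 128 bytes per block

print(mxfp4_bits // 8)                   # 17 bytes
print(round(fp32_bits / mxfp4_bits, 1))  # 7.5x storage reduction
print(mxfp4_bits / BLOCK_SIZE)           # 4.25 effective bits per element
```

Amortized over the block, the shared scale costs only 0.25 bits per element, which is why MX formats capture most of the 4-bit storage savings while retaining block-level dynamic range.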
The Floating-Point Advantage
The critical difference between MXFP4 and INT4 lies in how values distribute across the representable range. Floating-point formats space their quantization levels logarithmically, providing higher resolution near zero and lower resolution at extremes. This non-uniform grid aligns naturally with neural network parameters, which typically follow approximately Gaussian distributions centered near zero.
When an outlier activation occurs in INT4, the uniform grid forces either severe clipping or scale expansion that quantizes most values to zero. In MXFP4, the exponent bits provide dynamic range while the block-level scale factor adjusts to local magnitude. The outlier receives appropriate representation, and neighboring values maintain precision. This architectural difference explains why MXFP4 consistently demonstrates superior signal-to-noise ratio for deep learning tensors compared to INT4 at equivalent bit-widths[3].
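This behavior can be made concrete with a toy quantizer. The sketch below rounds a small block onto the E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} using a shared power-of-two scale (as E8M0 prescribes), and onto a symmetric uniform INT4 grid spanning the same range. It is an illustrative model of the rounding behavior only, not AMD's kernel implementation:

```python
import math

# Non-negative magnitudes representable by FP4 E2M1 (sign handled separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4(block):
    """Round a block onto E2M1 with a shared power-of-two scale (E8M0-style)."""
    amax = max(abs(x) for x in block)
    # Smallest power-of-two scale that brings the max magnitude within 6.0.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax > 0 else 1.0
    quantized = []
    for x in block:
        mag = min(E2M1, key=lambda v: abs(v - abs(x) / scale))
        quantized.append(math.copysign(mag * scale, x))
    return quantized

def quantize_int4(block):
    """Symmetric uniform INT4: 15 evenly spaced levels across [-amax, amax]."""
    amax = max(abs(x) for x in block)
    step = amax / 7 if amax > 0 else 1.0
    return [max(-7, min(7, round(x / step))) * step for x in block]

# Values clustered near 1-3 with a single outlier at 8, as described above.
block = [0.6, 0.9, 1.1, 1.4, 2.2, 2.8, 5.5, 8.0]
mx, i4 = quantize_mxfp4(block), quantize_int4(block)

sse = lambda q: sum((a - b) ** 2 for a, b in zip(block, q))
print(sse(mx) < sse(i4))   # True: the log-spaced grid fits this block better
print(mx[-1])              # 8.0: the outlier is represented exactly
```

On Gaussian-like blocks with outliers, the floating-point grid concentrates its levels where values cluster while the block scale absorbs the outlier; INT4's uniform step must stretch to cover the outlier, coarsening resolution for everything else.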
The MI355X implements native MXFP4 matrix multiplication through its Matrix Fused Multiply Add (MFMA) scale instructions. This hardware acceleration eliminates the dequantization overhead that software-emulated approaches require, delivering up to 20 petaFLOPS of FP4 compute performance per accelerator.
| AMD Quark: Production-Ready Quantization
Converting a model from FP16 or BF16 to MXFP4 requires careful calibration to minimize accuracy loss. AMD Quark provides a comprehensive toolkit for this transformation, integrating established algorithms with MI355X-specific optimizations[4].
Quark supports multiple quantization strategies, each offering different tradeoffs between calibration time, accuracy preservation, and computational overhead. For production deployment of large models, AutoSmoothQuant provides the optimal balance. This algorithm adaptively smooths outliers on a per-layer basis, reducing the dynamic range that any single block must represent. Unlike manual SmoothQuant tuning, AutoSmoothQuant requires no hyperparameter search, making it practical for enterprise deployment pipelines.
Quantization Workflow
A typical Quark workflow for MXFP4 quantization proceeds through three stages. First, the model loads from a standard Hugging Face checkpoint or safetensors format. Second, Quark applies the selected quantization scheme with calibration data. Third, the quantized model exports in a format compatible with vLLM, SGLang, or other serving frameworks.
Quark completes the quantization process in hours, even for models exceeding 200 billion parameters. It parallelizes calibration across available GPUs, and the exported model deploys immediately on MI355X infrastructure without additional conversion steps.
Mixed-Precision Strategies
Not all model layers respond equally to aggressive quantization. Embedding layers and certain attention projections often benefit from higher precision, while MLP layers tolerate lower precision with minimal impact. Quark supports layer-wise mixed-precision configuration, enabling MXFP4 for bulk compute with FP8 for sensitive operations.
This granular control allows organizations to tune the precision-accuracy tradeoff for their specific workloads. A model serving factual question-answering may tolerate more aggressive quantization than one performing complex reasoning tasks. Quark's configuration system exposes these choices without requiring deep expertise in quantization theory.
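Conceptually, such a layer-wise policy reduces to a precision map like the sketch below. The key names, wildcard syntax, and scheme labels are hypothetical placeholders for illustration, not Quark's actual configuration schema; consult the Quark documentation for the real format:

```python
# Hypothetical layer-wise precision map (illustrative only; the key names
# and wildcard patterns are NOT Quark's actual configuration schema).
mixed_precision_config = {
    "default": "mxfp4",                  # bulk MLP compute in MXFP4
    "overrides": {
        "model.embed_tokens": "bf16",    # embeddings: keep higher precision
        "*.self_attn.o_proj": "fp8",     # sensitive attention projections
        "lm_head": "bf16",
    },
    "calibration": {
        "algorithm": "autosmoothquant",  # per-layer adaptive outlier smoothing
        "num_samples": 512,              # calibration set size (assumed value)
    },
}
```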
| Performance Analysis: Qwen3-235B
To quantify the throughput advantage of MXFP4 over FP8, benchmark testing compared identical workloads on the same MI355X hardware. All tests ran on a Dell PowerEdge XE9785L server equipped with eight AMD Instinct MI355X accelerators, ROCm 7, and vLLM as the serving framework. Qwen3-235B was selected as the benchmark model because its 235-billion parameter Mixture of Experts architecture represents the class of models most commonly deployed in production. MXFP4 quantization via Quark applies broadly to transformer-based architectures; the results presented here are representative of the throughput characteristics organizations can expect across similar large-scale models.
Runs with greater than 10% error rate were excluded from all reported results. All retained configurations produced zero inference errors across their full test duration.
Peak Throughput by Workload
Each AI workload has a distinct input/output token signature that stresses hardware differently. Short prompts with short replies characterize chatbots and API-driven Q&A. Long inputs with short outputs define document classification, RAG, and compliance review. The MI355X with MXFP4 leads across all four profiles tested, with gains increasing at higher concurrency.
Workload | MI355X MXFP4 Peak (tok/s) | MI300X FP8 Peak (tok/s) | Gain | Example Uses |
128 in / 128 out | 33,106 | 7,799 | 4.2x | Chat, Q&A, autocomplete |
128 in / 2,048 out | 43,916 | 8,278 | 5.3x | Content gen, code gen |
2,048 in / 128 out | 10,806 | 1,777 | 6.1x | RAG, classify, compliance |
2,048 in / 2,048 out | 18,395 | 4,096 | 4.5x | Research, translation |
Table 1 | Peak throughput across all tested concurrency levels. MI355X MXFP4 vs. MI300X FP8, Qwen3-235B. Runs with >10% error rate excluded.
The largest absolute advantage appears on document-processing workloads (2,048 input / 128 output), where the MI355X with MXFP4 delivers 6.1x the peak throughput of the MI300X with FP8. Long-generation workloads (128 input / 2,048 output) reach 5.3x. Even short-form chat workloads (128/128) show a 4.2x generational improvement.
Throughput Comparison
The following table presents output throughput (tokens per second) across matched configurations. Each row compares identical concurrency levels, input lengths, and maximum output lengths between MXFP4 and FP8 precision.
Concurrency | Input / Output | MXFP4 (tok/s) | FP8 (tok/s) | Improvement |
32 | 128 / 128 | 1,599 | 1,198 | 1.33x |
128 | 128 / 128 | 4,086 | 2,964 | 1.38x |
512 | 128 / 128 | 10,503 | 8,359 | 1.26x |
1,024 | 128 / 128 | 16,324 | 14,043 | 1.16x |
4,096 | 128 / 128 | 27,728 | 28,047 | 0.99x |
64 | 128 / 2048 | 2,070 | 1,428 | 1.45x |
256 | 128 / 2048 | 5,641 | 4,007 | 1.41x |
1,024 | 128 / 2048 | 15,318 | 12,588 | 1.22x |
8,192 | 128 / 2048 | 43,916 | 20,540 | 2.14x |
128 | 2048 / 128 | 2,601 | 2,132 | 1.22x |
512 | 2048 / 128 | 4,862 | 4,191 | 1.16x |
2,048 | 2048 / 128 | 7,857 | 4,923 | 1.60x |
64 | 2048 / 2048 | 2,097 | 1,462 | 1.43x |
512 | 2048 / 2048 | 7,000 | 5,838 | 1.20x |
2,048 | 2048 / 2048 | 14,032 | 7,448 | 1.88x |
4,096 | 2048 / 2048 | 16,647 | 7,702 | 2.16x |
8,192 | 2048 / 2048 | 18,395 | 7,999 | 2.30x |
Table 2 | Qwen3-235B Throughput: MXFP4 vs FP8 on a Single 8-GPU MI355X Node
Figure 1 | MXFP4 vs. FP8 throughput scaling with concurrency (128/2048 workload)
At moderate concurrency (32 to 1,024 requests), MXFP4 delivers 1.15x to 1.45x more throughput than FP8 across all workload profiles. The advantage grows with scale. For long-generation workloads (2048/2048) at 8,192 concurrent requests, MXFP4 produces 2.30x more output tokens per second than FP8, as FP8 throughput plateaus while MXFP4 continues to scale.
Consider a production deployment handling 10 million tokens daily at 512 concurrent requests with 128-token outputs. MXFP4 would generate an additional 2.6 million tokens per day on the same hardware. That is equivalent to adding roughly 26% more capacity without purchasing a single additional node.
Figure 2 | MXFP4 vs. FP8 across workload profiles at 512 concurrent requests
Latency Characteristics
Throughput measures aggregate capacity. Latency governs user experience. Two metrics matter for interactive inference: Time to First Token (TTFT), which controls perceived responsiveness, and Time Per Output Token (TPOT), which controls perceived generation speed. The benchmark suite captured both across all configurations.
For short-input decode (128-token prompts), MXFP4 reduces average TPOT by 20% to 35% compared to FP8 on the same MI355X hardware. At 128 concurrent requests, MXFP4 delivers 26.6 ms per token versus 37.9 ms for FP8 (1.43x). The advantage holds across generation lengths and moderate concurrency levels.
The more significant finding concerns TTFT at high concurrency with long contexts. At 1,024 concurrent requests with 2048/2048 tokens, MXFP4 delivers its first token in 1.2 seconds while FP8 requires 24.4 seconds (a 20.9x improvement). On the 2048/128 profile at the same concurrency, MXFP4 achieves 1.4 seconds versus 6.6 seconds for FP8. This reflects FP8's inability to sustain prefill once the KV cache saturates available memory bandwidth.
Conc. | In / Out | MXFP4 TPOT | FP8 TPOT | TPOT Gain | MXFP4 TTFT | FP8 TTFT | TTFT Gain |
32 | 128/128 | 18.4 ms | 24.9 ms | 1.35x | 107 ms | 109 ms | ~1.0x |
128 | 128/128 | 26.6 ms | 37.9 ms | 1.43x | 193 ms | 161 ms | ~1.0x |
512 | 128/128 | 41.5 ms | 52.3 ms | 1.26x | 312 ms | 320 ms | ~1.0x |
128 | 2048/128 | 40.9 ms | 48.9 ms | 1.20x | 587 ms | 708 ms | 1.21x |
512 | 2048/128 | 92.3 ms | 104 ms | 1.13x | 959 ms | 1,407 ms | 1.47x |
1,024 | 2048/128 | 148 ms | 141 ms | 0.95x | 1,427 ms | 6,636 ms | 4.65x |
128 | 2048/2048 | 31.8 ms | 41.7 ms | 1.31x | 451 ms | 483 ms | 1.07x |
512 | 2048/2048 | 61.4 ms | 69.9 ms | 1.14x | 763 ms | 1,069 ms | 1.40x |
1,024 | 2048/2048 | 89.5 ms | 78.4 ms | 0.88x | 1,169 ms | 24,400 ms | 20.9x |
Table 3 | Latency: MXFP4 vs FP8 on MI355X (Qwen3-235B, average values). TPOT in ms; TTFT in ms.
Figure 3 | Time to First Token: MXFP4 vs. FP8 at high concurrency (2048/2048 workload)
(Shorter is better)
At very high concurrency on long-context workloads (1,024+ requests, 2048-token inputs), FP8 occasionally shows lower per-token latency because the workload becomes compute-bound rather than memory-bound. However, the throughput tables confirm that MXFP4 still delivers more aggregate output tokens per second in these regimes.
For capacity planning, if an SLA requires sub-second TTFT for 2048-token prompts, FP8 supports roughly 256 concurrent requests before crossing that threshold. MXFP4 extends the limit beyond 512 concurrent requests, effectively doubling the concurrent user capacity within the same SLA envelope.
Latency SLA Compliance: The 100ms Threshold
An inter-token latency (ITL) of 100ms is a practical threshold for responsive AI generation. At this target, a 128-token response completes in roughly 13 seconds, acceptable for most asynchronous and batch-adjacent applications. For real-time streaming, lower targets apply.
The MI355X with MXFP4 sustains sub-100ms per-token latency at up to 8x more concurrent users than the MI300X with FP8, while delivering 4x to 7x more aggregate throughput within the same SLA window.
Workload | MI300X FP8 Max Concurrent | MI355X MXFP4 Max Concurrent | Throughput at SLA | Concurrency Gain |
128 / 128 | 256 | 2,048 | 23,564 tok/s | 8x |
128 / 2,048 | 512 | 2,048 | 23,824 tok/s | 4x |
2,048 / 128 | 64 | 256 | 3,575 tok/s | 4x |
2,048 / 2,048 | 128 | 512 | 7,000 tok/s | 4x |
Table 4 | Maximum concurrency within 100ms per-token SLA. Derived ITL = 1,000 / (tok/s / concurrent requests).
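The caption's derivation, applied to Table 4's 128/128 row, confirms the entry sits inside the SLA:

```python
def derived_itl_ms(tokens_per_second, concurrent_requests):
    """Average inter-token latency implied by aggregate throughput:
    each request sees (tok/s / concurrency) tokens per second, so the
    per-token gap is the reciprocal, converted to milliseconds."""
    per_request_tps = tokens_per_second / concurrent_requests
    return 1000.0 / per_request_tps

# Table 4, 128/128 row: 23,564 tok/s at 2,048 concurrent requests.
itl = derived_itl_ms(23_564, 2_048)
print(round(itl, 1))   # 86.9 ms, inside the 100ms SLA
```

The same check holds for the other rows; for example, the 2,048/2,048 entry (7,000 tok/s at 512 concurrent) yields roughly 73 ms per token.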
Figure 4 | Maximum concurrency within 100ms per-token SLA. Derived ITL = 1,000 / (tok/s / concurrent requests).
For any platform running inference at scale, whether a developer API, enterprise AI gateway, or consumer product, the 100ms per-token threshold defines whether the experience feels responsive or sluggish. The MI355X with MXFP4 pushes that SLA boundary 4x to 8x further than the MI300X with FP8. That translates directly to more users served at the same quality of experience, without adding hardware.
Scaling Behavior at High Concurrency
The data reveals an important inflection above 1,024 concurrent requests. At moderate concurrency, both precision formats scale in rough proportion, producing improvement ratios of 1.15x to 1.45x. Above 2,048 concurrent requests, FP8 throughput begins to plateau while MXFP4 continues to scale.
This divergence is most visible in long-generation workloads (2,048/2,048), where FP8 throughput effectively caps near 8,000 tokens per second regardless of additional concurrency. MXFP4 reaches 18,395 tokens per second at 8,192 concurrent requests on the same workload. On long-input workloads (2,048/128), MI300X FP8 plateaus at approximately 1,750 tokens per second from 2,048 concurrent onward, while MI355X MXFP4 continues scaling to 10,806 tokens per second.
Two architectural factors drive this behavior. First, MXFP4 model weights occupy half the HBM footprint of FP8, freeing memory bandwidth for KV cache operations that become the bottleneck at high concurrency. Second, the MI355X’s native MFMA scale instructions process MXFP4 matrix operations without the dequantization overhead that constrains FP8 throughput as compute utilization approaches saturation.
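The first factor is easy to quantify. Amortizing the shared scale, MXFP4 stores 4.25 bits per weight (136 bits / 32 elements) versus 8 bits for FP8. A back-of-envelope estimate for a 235B-parameter model, ignoring per-tensor metadata and any layers kept at higher precision:

```python
PARAMS = 235e9  # Qwen3-235B parameter count

mxfp4_bits_per_elem = (32 * 4 + 8) / 32   # 4.25 bits incl. shared scale
fp8_bits_per_elem = 8

mxfp4_gb = PARAMS * mxfp4_bits_per_elem / 8 / 1e9
fp8_gb = PARAMS * fp8_bits_per_elem / 8 / 1e9

print(round(mxfp4_gb))   # ~125 GB of MXFP4 weights
print(round(fp8_gb))     # ~235 GB of FP8 weights
```

The roughly 110 GB of HBM freed relative to FP8 is capacity the serving stack can devote to KV cache, which is precisely the resource that becomes scarce above 2,048 concurrent requests.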
For capacity planning, the implication is direct: the return on investment for MXFP4 increases with deployment scale. Organizations operating inference nodes at production concurrency levels (512 and above) realize substantially more benefit than moderate-concurrency benchmarks alone suggest.
Generational Improvement: MI355X MXFP4 vs. MI300X FP8
The performance comparison between MXFP4 on MI355X and FP8 on the prior-generation MI300X quantifies the combined benefit of new hardware and new precision formats. Both configurations ran Qwen3-235B on 8-GPU nodes: the MI355X on a Dell PowerEdge XE9785L, the MI300X on a Dell PowerEdge XE9680.
Concurrency | Input / Output | MI355X MXFP4 (tok/s) | MI300X FP8 (tok/s) | Improvement |
128 | 128 / 128 | 4,086 | 2,292 | 1.78x |
512 | 128 / 128 | 10,503 | 5,053 | 2.08x |
1,024 | 128 / 128 | 16,324 | 6,268 | 2.60x |
4,096 | 128 / 128 | 27,728 | 7,787 | 3.56x |
8,192 | 128 / 128 | 33,106 | 7,799 | 4.25x |
256 | 128 / 2,048 | 5,641 | 3,722 | 1.52x |
1,024 | 128 / 2,048 | 15,318 | 8,278 | 1.85x |
2,048 | 128 / 2,048 | 23,824 | 7,253 | 3.28x |
8,192 | 128 / 2,048 | 43,916 | 8,097 | 5.42x |
256 | 2,048 / 128 | 3,575 | 1,224 | 2.92x |
1,024 | 2,048 / 128 | 6,188 | 1,632 | 3.79x |
2,048 | 2,048 / 128 | 7,857 | 1,750 | 4.49x |
4,096 | 2,048 / 128 | 10,646 | 1,748 | 6.09x |
8,192 | 2,048 / 128 | 10,806 | 1,777 | 6.08x |
256 | 2,048 / 2,048 | 4,513 | 2,453 | 1.84x |
1,024 | 2,048 / 2,048 | 9,942 | 4,002 | 2.48x |
4,096 | 2,048 / 2,048 | 16,647 | 4,083 | 4.08x |
8,192 | 2,048 / 2,048 | 18,395 | 4,089 | 4.50x |
Table 5 | Qwen3-235B Output Throughput: MI355X MXFP4 vs MI300X FP8 (Zero-Error Configurations)
Figure 5 | MI355X MXFP4 vs. MI300X FP8 throughput scaling (2048/2048 workload)
At moderate concurrency (128 to 256 requests), the MI355X with MXFP4 delivers 1.52x to 2.92x the throughput of the MI300X with FP8, depending on workload profile. The advantage is most pronounced for long-input, short-output workloads (2,048/128), where the MI300X’s narrower memory bandwidth and lack of native FP4 compute constrain throughput scaling.
At 1,024 concurrent requests, the generational gap widens to 2.48x for long-context generation and 3.79x for long-input summarization tasks. At 8,192 concurrent, document-processing workloads (2,048/128) reach 6.1x. For organizations currently operating MI300X infrastructure, these results establish a concrete performance baseline for migration planning.
Enterprise Use Case: Compliance and Document Intelligence
Compliance checks, regulatory document review, financial risk analysis, and contract classification all share a common token profile: long inputs (2,048+ tokens of document context) with short outputs (128 tokens of classification, risk score, or structured response). This is the workload where the MI355X with MXFP4 delivers its largest absolute and proportional advantage.
At 1,024 concurrent requests, a practical operational load for enterprise platforms, MXFP4 delivers 6,188 tokens per second versus 1,632 tokens per second for MI300X FP8: 3.8x more capacity. At peak concurrency, the ratio reaches 6.1x. Document throughput translates directly to review capacity: 6.1x throughput means 6.1x more documents reviewed per hour on equivalent hardware.
Regulatory workloads often face batch deadlines: end-of-day portfolio reviews, overnight compliance sweeps, and real-time transaction monitoring thresholds. For real-time monitoring scenarios where the 100ms per-token SLA applies, MI300X FP8 supports only 64 concurrent document streams. MI355X MXFP4 sustains 256 concurrent streams, processing 4x more documents under active review simultaneously.
Accuracy Preservation
Throughput gains matter only if accuracy remains acceptable. The benchmark suite maintained near zero error rates across all tested configurations, confirming that MXFP4 quantization via Quark produces reliable inference results under production load.
Independent MLPerf Inference v5.1 results provide additional validation. AMD's MXFP4 submission on MI355X for Llama 2-70B achieved performance levels comparable to the closed division while maintaining accuracy targets[5]. These results represent the first MXFP4 submissions to an industry-standard benchmark, establishing a reproducible baseline for 4-bit floating-point inference performance.
| Implementation Considerations
Adopting MXFP4 quantization involves minimal operational complexity when compared to other precision reduction strategies. The quantization workflow integrates with existing model management practices, and the quantized models deploy through standard serving frameworks.
Infrastructure Requirements
MXFP4 inference requires MI355X accelerators with ROCm 7.0 or later. Earlier AMD Instinct generations (MI300X, MI325X) can emulate MXFP4 through software dequantization, providing compatibility for model evaluation but without the full throughput benefits of native hardware support.
The quantization process itself can run on any ROCm-compatible hardware. Organizations can quantize models on existing MI300X infrastructure, then deploy the quantized checkpoints to MI355X systems for production inference. This flexibility simplifies the transition path for established AMD deployments.
Serving Framework Integration
vLLM provides native support for Quark-quantized MXFP4 models. Loading a quantized checkpoint requires no code changes beyond specifying the model path. vLLM automatically detects the quantization format and configures the appropriate kernels for MI355X execution.
SGLang similarly supports MXFP4 models with automatic detection. For organizations using custom serving infrastructure, Quark exports to Hugging Face-compatible safetensors format with embedded quantization metadata, enabling integration with any framework that can load quantized models.
Dell PowerEdge Platform Integration
The Dell PowerEdge XE9785L provides the validated server platform for MI355X deployment. Dell iDRAC (Integrated Dell Remote Access Controller) delivers hardware-level management capabilities that are critical for production AI infrastructure: real-time GPU telemetry, thermal monitoring, remote firmware updates, and automated alerting. These capabilities enable infrastructure teams to monitor power draw, memory utilization, and accelerator health across fleet deployments without custom tooling. Pre-validated configurations available through Dell reduce the time from hardware delivery to production inference. Organizations deploying at scale benefit from Dell's lifecycle management tools, which streamline firmware updates, configuration drift detection, and capacity planning across multi-node installations.
| Total Cost of Ownership Impact
The throughput improvement from MXFP4 translates directly to infrastructure economics. The following analysis uses system-level power measurements, which include host CPU, memory, networking, and cooling overhead in addition to GPU power draw.
Workload (In/Out) | MI355X MXFP4 tok/s | MI355X Power (W) | MI300X FP8 tok/s | MI300X Power (W) | tok/s/W (MI355X) | tok/s/W (MI300X) | Efficiency Gain |
128 / 128 | 10,503 | 2,353 | 5,053 | 7,635 | 4.46 | 0.66 | 6.7x |
128 / 2048 | 9,621 | 2,333 | 5,806 | 7,561 | 4.12 | 0.77 | 5.4x |
2048 / 128 | 4,862 | 2,353 | 1,420 | 6,700 | 2.07 | 0.21 | 9.7x |
2048 / 2048 | 7,000 | 2,353 | 3,204 | 7,290 | 2.97 | 0.44 | 6.8x |
Table 6 | Power-Normalized Throughput: MI355X MXFP4 vs MI300X FP8 (512 Concurrent Requests)
Figure 6 | Power-normalized throughput: MI355X MXFP4 vs. MI300X FP8 at 512 concurrent requests
System-level power consumption confirms the efficiency advantage. The MI355X node draws approximately 2,333W to 2,353W under sustained MXFP4 inference load at 512 concurrent requests, compared to 6,700W to 7,635W for the MI300X node running FP8 under comparable workloads. Normalizing for throughput, the MI355X with MXFP4 yields 2.1 to 4.5 tokens per second per watt, while the MI300X with FP8 yields 0.21 to 0.77. That represents a 5.4x to 9.7x improvement in power-normalized throughput.
To illustrate: at $0.10/kWh, the MI300X node running FP8 at 512 concurrency on a 2048/128 workload costs approximately $489/month in power alone, producing 1,420 tokens/second. The MI355X with MXFP4 produces 4,862 tokens/second at $172/month in power, a 3.4x throughput increase at 65% lower power cost. Organizations replacing three to four MI300X nodes with a single MI355X node can recover the hardware investment within months through operational savings alone.
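These monthly figures follow directly from the measured system draw at the stated $0.10/kWh rate, assuming a 730-hour average month (the month length is inferred from the text's numbers, not stated explicitly):

```python
RATE_PER_KWH = 0.10      # $/kWh, as assumed in the text
HOURS_PER_MONTH = 730    # average month (assumed; reproduces the text's figures)

def monthly_power_cost(watts):
    """Monthly electricity cost for a sustained system-level draw."""
    return watts / 1000 * HOURS_PER_MONTH * RATE_PER_KWH

mi300x_cost = monthly_power_cost(6_700)   # MI300X FP8 node, 2048/128 workload
mi355x_cost = monthly_power_cost(2_353)   # MI355X MXFP4 node, same workload

print(round(mi300x_cost))                       # ~$489 per month
print(round(mi355x_cost))                       # ~$172 per month
print(round(4_862 / 1_420, 1))                  # 3.4x throughput
print(round(1 - mi355x_cost / mi300x_cost, 2))  # ~0.65, i.e. 65% lower power cost
```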
The memory footprint reduction compounds these benefits. MXFP4 model weights occupy half the space of FP8, freeing HBM capacity for larger KV caches. Longer context windows support more complex conversations without requiring model sharding. More concurrent users fit within a single node's memory budget. These efficiency gains accumulate across the deployment lifecycle, reducing both capital expenditure and ongoing operational costs.
| Conclusion: Production-Ready 4-Bit Inference
The combination of OCP MXFP4 standardization, AMD Quark tooling, MI355X hardware support, and Dell PowerEdge platform validation establishes a production-ready path to 4-bit inference. MXFP4 removes the traditional tradeoff between throughput and accuracy.
The benchmark results on Qwen3-235B demonstrate up to 6.1x throughput improvement over the prior-generation MI300X with FP8, with the largest gains on document-processing and long-input workloads critical to regulated industries. On the same MI355X hardware, MXFP4 delivers 1.15x to 2.30x throughput over FP8, with the advantage growing at higher concurrency. Power-normalized efficiency gains of 5x to 10x translate to immediate reductions in operating cost.
Latency testing confirms that MXFP4 reduces time per output token by 1.1x to 1.4x at moderate concurrency, while sustaining sub-1.5 second time to first token at concurrency levels where FP8 exceeds 6 seconds on long-context workloads. At the 100ms per-token SLA, MXFP4 supports 4x to 8x more concurrent users, directly expanding the operational envelope for latency-sensitive applications.
For infrastructure teams planning their next hardware refresh, MXFP4 on the AMD Instinct MI355X redefines what a single 8-GPU node can deliver. The 6.1x throughput advantage over prior-generation hardware means fewer nodes for the same workload. The 5x to 10x power efficiency gain means lower operating costs from day one. And zero-error accuracy validation means production deployment without compromise.
Organizations can begin today: quantize an existing model with AMD Quark on current MI300X infrastructure, validate accuracy, and deploy to MI355X when hardware arrives. No framework changes required. No retraining necessary.
Learn more about AMD Instinct MI355X at amd.com/instinct
| References
[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
[2] Open Compute Project, "OCP Microscaling Formats (MX) Specification v1.0," September 2023. [Online]. Available: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
[3] B. D. Rouhani et al., "Microscaling Data Formats for Deep Learning," arXiv preprint arXiv:2310.10537, 2023.
[4] AMD, "AMD Quark Documentation," 2025. [Online]. Available: https://quark.docs.amd.com/
[5] AMD, "Technical Dive into AMD's MLPerf Inference v5.1 Submission," ROCm Blogs, September 2025. [Online]. Available: https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/
[6] AMD, "High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs," ROCm Blogs, October 2025. [Online]. Available: https://rocm.blogs.amd.com/software-tools-optimization/mxfp4-mxfp6-quantization/
Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge, Dell iDRAC and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.
***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.