The FP4 Breakthrough: How MXFP4 Quantization Delivers Up to 6.1x Inference Throughput Without Sacrificing Accuracy on Dell PowerEdge™ XE9785L with AMD Instinct™ MI355X Accelerators

How OCP Microscaling Formats and AMD Quark Deliver 4-Bit Quantization at Near Zero Error Rate

February 2026

| Executive Summary

Inference cost now exceeds training cost as the primary infrastructure challenge for production AI. Serving a 235-billion parameter model at enterprise concurrency can require multiple accelerator nodes, with power and hardware expenditures scaling proportionally. For IT leaders under pressure to expand AI capacity without expanding budgets, precision format selection offers one of the most effective levers available.

This paper presents benchmark results for MXFP4 (4-bit floating-point) quantization on the AMD Instinct MI355X accelerator, deployed in a Dell PowerEdge XE9785L server with eight GPUs. The results demonstrate substantial generational throughput gains over the prior-generation MI300X, improved power-normalized efficiency, and expanded concurrent-user capacity at latency-sensitive SLAs. The deployment path requires minimal operational change: Quark-quantized MXFP4 models export directly to standard Hugging Face safetensors format and deploy through vLLM or SGLang with no code changes.

Key Results at a Glance

Up to 2.3x throughput over FP8

MXFP4 vs. FP8 on the same MI355X node at high concurrency

Up to 6.1x throughput over MI300X

MI355X MXFP4 vs. MI300X FP8 at production concurrency

Up to 10x power efficiency gain

Tokens per second per watt, MI355X MXFP4 vs. MI300X FP8

8x more concurrent users

Within a 100ms per-token SLA on short-form workloads

| The Quantization Challenge

Every major enterprise now quantizes its production models. The question is no longer whether to quantize, but how far precision can drop before accuracy degrades and users notice. Traditional INT4 quantization forces a painful tradeoff. Its uniform precision grid clips outlier values or sacrifices resolution across the entire weight distribution. This results in measurable accuracy loss that pushes teams to over-provision hardware at higher precision, erasing the cost savings quantization was supposed to deliver.

Research consistently shows that naive INT4 post-training quantization can produce perplexity increases and accuracy drops severe enough to make models unsuitable for production deployment[1]. Advanced algorithms like GPTQ and AWQ partially address these limitations through sophisticated calibration, but they cannot overcome the fundamental constraint of uniform representation.

The AMD Instinct MI355X introduces hardware-native support for MXFP4, a floating-point format designed specifically for the distribution patterns observed in deep learning workloads. Combined with the AMD Quark quantization toolkit and ROCm 7 runtime optimizations, this architecture delivers 4-bit inference throughput with accuracy preservation that approaches higher-precision baselines.

| Understanding Microscaling Formats

The Open Compute Project (OCP) published the Microscaling Formats (MX) specification with contributions from AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm[2]. MX formats define a new approach to low-precision data representation. Rather than treating each value independently, MX formats share a scaling factor across blocks of elements, enabling efficient representation with reduced storage overhead.

MXFP4 encodes each block of 32 elements using 4-bit floating-point values (E2M1 format: 2 exponent bits, 1 mantissa bit) with a shared 8-bit scale factor (E8M0). This structure requires only 136 bits per block (17 bytes), compared to 1,024 bits (128 bytes) for the equivalent FP32 representation. The 7.5x reduction in storage translates directly to reduced memory bandwidth requirements and improved cache utilization during inference.
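The storage arithmetic can be verified directly:

```python
# Storage per 32-element MXFP4 block: 32 four-bit E2M1 values plus one shared
# 8-bit E8M0 scale, compared against the same block held in FP32.
block_elems = 32
mxfp4_bits = block_elems * 4 + 8      # 136 bits = 17 bytes per block
fp32_bits = block_elems * 32          # 1,024 bits = 128 bytes per block
print(mxfp4_bits, fp32_bits, round(fp32_bits / mxfp4_bits, 1))  # 136 1024 7.5
```

The shared scale amortizes to a quarter-bit per element, so effective precision is 4.25 bits per weight rather than 4.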

The Floating-Point Advantage

The critical difference between MXFP4 and INT4 lies in how values distribute across the representable range. Floating-point formats space their quantization levels logarithmically, providing higher resolution near zero and lower resolution at extremes. This non-uniform grid aligns naturally with neural network parameters, which typically follow approximately Gaussian distributions centered near zero.

When an outlier activation occurs in INT4, the uniform grid forces either severe clipping or scale expansion that quantizes most values to zero. In MXFP4, the exponent bits provide dynamic range while the block-level scale factor adjusts to local magnitude. The outlier receives appropriate representation, and neighboring values maintain precision. This architectural difference explains why MXFP4 consistently demonstrates superior signal-to-noise ratio for deep learning tensors compared to INT4 at equivalent bit-widths[3].
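This behavior can be illustrated with a small simulation. The sketch below is a simplification, not the hardware's exact rounding rules: it quantizes one 32-element block containing an outlier onto a scaled E2M1 grid and onto a uniform INT4 grid, then compares mean squared error.

```python
import random

# Representable FP4 E2M1 magnitudes per the OCP MX spec: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_LEVELS = sorted({s * v for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
                     for s in (-1, 1)})
INT4_LEVELS = list(range(-8, 8))  # uniform signed 4-bit grid

def quantize(block, levels, level_max):
    # Scale the block so its largest magnitude lands on the grid's top level,
    # snap each element to the nearest representable level, then rescale back.
    scale = max(abs(x) for x in block) / level_max
    return [min(levels, key=lambda q: abs(x / scale - q)) * scale for x in block]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
block = [random.gauss(0, 1) for _ in range(31)] + [12.0]  # Gaussian weights + outlier
err_fp4 = mse(block, quantize(block, FP4_LEVELS, 6.0))
err_int4 = mse(block, quantize(block, INT4_LEVELS, 7.0))
print(err_fp4 < err_int4)  # the logarithmic FP4 grid loses less precision near zero
```

The outlier stretches both grids, but the uniform INT4 step grows with it across the whole range, while the E2M1 grid keeps finer spacing near zero where most weights sit.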

The MI355X implements native MXFP4 matrix multiplication through its Matrix Fused Multiply Add (MFMA) scale instructions. This hardware acceleration eliminates the dequantization overhead that software-emulated approaches require, delivering up to 20 petaFLOPS of FP4 compute performance per accelerator.

| AMD Quark: Production-Ready Quantization

Converting a model from FP16 or BF16 to MXFP4 requires careful calibration to minimize accuracy loss. AMD Quark provides a comprehensive toolkit for this transformation, integrating established algorithms with MI355X-specific optimizations[4].

Quark supports multiple quantization strategies, each offering different tradeoffs between calibration time, accuracy preservation, and computational overhead. For production deployment of large models, AutoSmoothQuant provides the optimal balance. This algorithm adaptively smooths outliers on a per-layer basis, reducing the dynamic range that any single block must represent. Unlike manual SmoothQuant tuning, AutoSmoothQuant requires no hyperparameter search, making it practical for enterprise deployment pipelines.

Quantization Workflow

A typical Quark workflow for MXFP4 quantization proceeds through three stages. First, the model loads from a standard Hugging Face checkpoint or safetensors format. Second, Quark applies the selected quantization scheme with calibration data. Third, the quantized model exports in a format compatible with vLLM, SGLang, or other serving frameworks.

Quark completes the quantization process in hours, even for models exceeding 200 billion parameters. It parallelizes calibration across available GPUs, and the exported model deploys immediately on MI355X infrastructure without additional conversion steps.
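The three stages might be sketched as follows; the function names and arguments here are illustrative pseudocode, not Quark's actual API:

```
# Stage 1: load a standard Hugging Face checkpoint
model = load_pretrained("Qwen/Qwen3-235B")

# Stage 2: calibrate and quantize to MXFP4 with AutoSmoothQuant
config = quant_config(scheme="mxfp4", algorithm="autosmoothquant",
                      calibration_data=calib_loader)
qmodel = quantize(model, config)

# Stage 3: export safetensors with embedded quantization metadata
# for direct loading in vLLM or SGLang
export_safetensors(qmodel, "qwen3-235b-mxfp4/")
```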

Mixed-Precision Strategies

Not all model layers respond equally to aggressive quantization. Embedding layers and certain attention projections often benefit from higher precision, while MLP layers tolerate lower precision with minimal impact. Quark supports layer-wise mixed-precision configuration, enabling MXFP4 for bulk compute with FP8 for sensitive operations.

This granular control allows organizations to tune the precision-accuracy tradeoff for their specific workloads. A model serving factual question-answering may tolerate more aggressive quantization than one performing complex reasoning tasks. Quark's configuration system exposes these choices without requiring deep expertise in quantization theory.

| Performance Analysis: Qwen3-235B

To quantify the throughput advantage of MXFP4 over FP8, benchmark testing compared identical workloads on the same MI355X hardware. All tests ran on a Dell PowerEdge XE9785L server equipped with eight AMD Instinct MI355X accelerators, ROCm 7, and vLLM as the serving framework. Qwen3-235B was selected as the benchmark model because its 235-billion parameter Mixture of Experts architecture represents the class of models most commonly deployed in production. MXFP4 quantization via Quark applies broadly to transformer-based architectures; the results presented here are representative of the throughput characteristics organizations can expect across similar large-scale models.

Runs with greater than 10% error rate were excluded from all reported results. All retained configurations produced zero inference errors across their full test duration.

Peak Throughput by Workload

Each AI workload has a distinct input/output token signature that stresses hardware differently. Short prompts with short replies characterize chatbots and API-driven Q&A. Long inputs with short outputs define document classification, RAG, and compliance review. The MI355X with MXFP4 leads across all four profiles tested, with gains increasing at higher concurrency.

| Workload | MI355X MXFP4 Peak (tok/s) | MI300X FP8 Peak (tok/s) | Gain | Example Uses |
|---|---|---|---|---|
| 128 in / 128 out | 33,106 | 7,799 | 4.2x | Chat, Q&A, autocomplete |
| 128 in / 2,048 out | 43,916 | 8,278 | 5.3x | Content gen, code gen |
| 2,048 in / 128 out | 10,806 | 1,777 | 6.1x | RAG, classify, compliance |
| 2,048 in / 2,048 out | 18,395 | 4,096 | 4.5x | Research, translation |

Table 1 | Peak throughput across all tested concurrency levels. MI355X MXFP4 vs. MI300X FP8, Qwen3-235B. Runs with >10% error rate excluded.

The largest absolute advantage appears on document-processing workloads (2,048 input / 128 output), where the MI355X with MXFP4 delivers 6.1x the peak throughput of the MI300X with FP8. Long-generation workloads (128 input / 2,048 output) reach 5.3x. Even short-form chat workloads (128/128) show a 4.2x generational improvement.

Throughput Comparison

The following table presents output throughput (tokens per second) across matched configurations. Each row compares identical concurrency levels, input lengths, and maximum output lengths between MXFP4 and FP8 precision.

| Concurrency | Input / Output | MXFP4 (tok/s) | FP8 (tok/s) | Improvement |
|---|---|---|---|---|
| 32 | 128 / 128 | 1,599 | 1,198 | 1.33x |
| 128 | 128 / 128 | 4,086 | 2,964 | 1.38x |
| 512 | 128 / 128 | 10,503 | 8,359 | 1.26x |
| 1,024 | 128 / 128 | 16,324 | 14,043 | 1.16x |
| 4,096 | 128 / 128 | 27,728 | 28,047 | 0.99x |
| 64 | 128 / 2048 | 2,070 | 1,428 | 1.45x |
| 256 | 128 / 2048 | 5,641 | 4,007 | 1.41x |
| 1,024 | 128 / 2048 | 15,318 | 12,588 | 1.22x |
| 8,192 | 128 / 2048 | 43,916 | 20,540 | 2.14x |
| 128 | 2048 / 128 | 2,601 | 2,132 | 1.22x |
| 512 | 2048 / 128 | 4,862 | 4,191 | 1.16x |
| 2,048 | 2048 / 128 | 7,857 | 4,923 | 1.60x |
| 64 | 2048 / 2048 | 2,097 | 1,462 | 1.43x |
| 512 | 2048 / 2048 | 7,000 | 5,838 | 1.20x |
| 2,048 | 2048 / 2048 | 14,032 | 7,448 | 1.88x |
| 4,096 | 2048 / 2048 | 16,647 | 7,702 | 2.16x |
| 8,192 | 2048 / 2048 | 18,395 | 7,999 | 2.30x |

Table 2 | Qwen3-235B Throughput: MXFP4 vs FP8 on a Single 8-GPU MI355X Node

Figure 1 | MXFP4 vs. FP8 throughput scaling with concurrency (128/2048 workload)

At moderate concurrency (32 to 1,024 requests), MXFP4 delivers 1.15x to 1.45x more throughput than FP8 across all workload profiles. The advantage grows with scale. For long-generation workloads (2048/2048) at 8,192 concurrent requests, MXFP4 produces 2.30x more output tokens per second than FP8, as FP8 throughput plateaus while MXFP4 continues to scale.

Consider a production deployment handling 10 million tokens daily at 512 concurrent requests with 128-token outputs. On the same hardware, MXFP4 would generate roughly 2.6 million additional tokens per day, equivalent to adding 26% more capacity without purchasing a single additional node.
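The capacity arithmetic follows directly from the measured throughput ratio (Table 2, 512 concurrent, 128/128 workload):

```python
# Scale a daily token budget by the measured MXFP4/FP8 throughput ratio.
mxfp4_tps, fp8_tps = 10_503, 8_359   # tok/s from Table 2
daily_tokens_fp8 = 10_000_000
extra = daily_tokens_fp8 * (mxfp4_tps / fp8_tps - 1)
print(round(extra))  # additional tokens/day on the same hardware, ~2.6 million
```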

Figure 2 | MXFP4 vs. FP8 across workload profiles at 512 concurrent requests

Latency Characteristics

Throughput measures aggregate capacity. Latency governs user experience. Two metrics matter for interactive inference: Time to First Token (TTFT), which controls perceived responsiveness, and Time Per Output Token (TPOT), which controls perceived generation speed. The benchmark suite captured both across all configurations.

For short-input decode (128-token prompts), MXFP4 reduces average TPOT by 20% to 35% compared to FP8 on the same MI355X hardware. At 128 concurrent requests, MXFP4 delivers 26.6 ms per token versus 37.9 ms for FP8 (1.43x). The advantage holds across generation lengths and moderate concurrency levels.

The more significant finding is in TTFT at high concurrency with long contexts. At 1,024 concurrent requests with 2048/2048 tokens, MXFP4 delivers the first token in 1.2 seconds while FP8 requires 24.4 seconds (20.9x improvement). On the 2048/128 profile at the same concurrency, MXFP4 achieves 1.4 seconds versus 6.6 seconds for FP8. This gap reflects FP8's inability to sustain prefill once the KV cache saturates available memory bandwidth.

| Conc. | In / Out | MXFP4 TPOT | FP8 TPOT | TPOT Gain | MXFP4 TTFT | FP8 TTFT | TTFT Gain |
|---|---|---|---|---|---|---|---|
| 32 | 128/128 | 18.4 ms | 24.9 ms | 1.35x | 107 ms | 109 ms | ~1.0x |
| 128 | 128/128 | 26.6 ms | 37.9 ms | 1.43x | 193 ms | 161 ms | ~1.0x |
| 512 | 128/128 | 41.5 ms | 52.3 ms | 1.26x | 312 ms | 320 ms | ~1.0x |
| 128 | 2048/128 | 40.9 ms | 48.9 ms | 1.20x | 587 ms | 708 ms | 1.21x |
| 512 | 2048/128 | 92.3 ms | 104 ms | 1.13x | 959 ms | 1,407 ms | 1.47x |
| 1,024 | 2048/128 | 148 ms | 141 ms | 0.95x | 1,427 ms | 6,636 ms | 4.65x |
| 128 | 2048/2048 | 31.8 ms | 41.7 ms | 1.31x | 451 ms | 483 ms | 1.07x |
| 512 | 2048/2048 | 61.4 ms | 69.9 ms | 1.14x | 763 ms | 1,069 ms | 1.40x |
| 1,024 | 2048/2048 | 89.5 ms | 78.4 ms | 0.88x | 1,169 ms | 24,400 ms | 20.9x |

Table 3 | Latency: MXFP4 vs FP8 on MI355X (Qwen3-235B, average values). TPOT in ms; TTFT in ms.

Figure 3 | Time to First Token: MXFP4 vs. FP8 at high concurrency (2048/2048 workload) 

(Shorter is better)

At very high concurrency on long-context workloads (1,024+ requests, 2048-token inputs), FP8 occasionally shows lower per-token latency because the workload becomes compute-bound rather than memory-bound. However, the throughput tables confirm that MXFP4 still delivers more aggregate output tokens per second in these regimes.

For capacity planning, if an SLA requires sub-second TTFT for 2048-token prompts, FP8 supports roughly 256 concurrent requests before crossing that threshold. MXFP4 extends the limit beyond 512 concurrent requests, effectively doubling the concurrent user capacity within the same SLA envelope.

Latency SLA Compliance: The 100ms Threshold

A 100ms inter-token latency (ITL) is a practical threshold for responsive AI generation. At this target, a 128-token response completes in roughly 13 seconds, acceptable for most asynchronous and batch-adjacent applications. For real-time streaming, lower targets apply.

The MI355X with MXFP4 sustains sub-100ms per-token latency at up to 8x more concurrent users than the MI300X with FP8, while delivering 4x to 7x more aggregate throughput within the same SLA window.

| Workload | MI300X FP8 Max Concurrent | MI355X MXFP4 Max Concurrent | MI355X Throughput at SLA | Concurrency Gain |
|---|---|---|---|---|
| 128 / 128 | 256 | 2,048 | 23,564 tok/s | 8x |
| 128 / 2,048 | 512 | 2,048 | 23,824 tok/s | 4x |
| 2,048 / 128 | 64 | 256 | 3,575 tok/s | 4x |
| 2,048 / 2,048 | 128 | 512 | 7,000 tok/s | 4x |

Table 4 | Maximum concurrency within 100ms per-token SLA. Derived ITL = 1,000 / (tok/s / concurrent requests).

Figure 4 | Maximum concurrency within 100ms per-token SLA. Derived ITL = 1,000 / (tok/s / concurrent requests).

For any platform running inference at scale, whether a developer API, enterprise AI gateway, or consumer product, the 100ms per-token threshold defines whether the experience feels responsive or sluggish. The MI355X with MXFP4 pushes that SLA boundary 4x to 8x further than the MI300X with FP8. That translates directly to more users served at the same quality of experience, without adding hardware.
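The derived ITL formula from Table 4 can be checked directly; a minimal sketch using the 128/128 figures from that table:

```python
def itl_ms(throughput_tps, concurrency):
    # Derived inter-token latency: per-stream token rate inverted, in milliseconds.
    return 1000.0 / (throughput_tps / concurrency)

# MI355X MXFP4 at 2,048 concurrent users on the 128/128 workload (Table 4).
print(round(itl_ms(23_564, 2048), 1))  # ~86.9 ms, inside the 100 ms SLA
```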

Scaling Behavior at High Concurrency

The data reveals an important inflection above 1,024 concurrent requests. At moderate concurrency, both precision formats scale in rough proportion, producing improvement ratios of 1.15x to 1.45x. Above 2,048 concurrent requests, FP8 throughput begins to plateau while MXFP4 continues to scale.

This divergence is most visible in long-generation workloads (2,048/2,048), where FP8 throughput effectively caps near 8,000 tokens per second regardless of additional concurrency. MXFP4 reaches 18,395 tokens per second at 8,192 concurrent requests on the same workload. On long-input workloads (2,048/128), MI300X FP8 plateaus at approximately 1,750 tokens per second from 2,048 concurrent onward, while MI355X MXFP4 continues scaling to 10,806 tokens per second.

Two architectural factors drive this behavior. First, MXFP4 model weights occupy half the HBM footprint of FP8, freeing memory bandwidth for KV cache operations that become the bottleneck at high concurrency. Second, the MI355X’s native MFMA scale instructions process MXFP4 matrix operations without the dequantization overhead that constrains FP8 throughput as compute utilization approaches saturation.
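The weight-footprint claim can be made concrete with rough arithmetic. The estimate below counts only quantized weight storage (4-bit values plus one shared 8-bit scale per 32 elements) and ignores any layers kept at higher precision, so treat both figures as lower bounds:

```python
# Rough weight-memory estimate for a 235B-parameter model.
params = 235e9
fp8_gb = params * 8 / 8 / 1e9                 # 8 bits per weight
mxfp4_gb = params * (4 + 8 / 32) / 8 / 1e9    # 4.25 effective bits per weight
print(round(fp8_gb), round(mxfp4_gb))         # ~235 GB vs ~125 GB of HBM
```

The roughly 110 GB freed per model copy is what becomes available for KV cache at high concurrency.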

For capacity planning, the implication is direct: the return on investment for MXFP4 increases with deployment scale. Organizations operating inference nodes at production concurrency levels (512 and above) realize substantially more benefit than moderate-concurrency benchmarks alone suggest.

Generational Improvement: MI355X MXFP4 vs. MI300X FP8

The performance comparison between MXFP4 on MI355X and FP8 on the prior-generation MI300X quantifies the combined benefit of new hardware and new precision formats. Both configurations ran Qwen3-235B on 8-GPU nodes: the MI355X on a Dell PowerEdge XE9785L, the MI300X on a Dell PowerEdge XE9680.

| Concurrency | Input / Output | MI355X MXFP4 (tok/s) | MI300X FP8 (tok/s) | Improvement |
|---|---|---|---|---|
| 128 | 128 / 128 | 4,086 | 2,292 | 1.78x |
| 512 | 128 / 128 | 10,503 | 5,053 | 2.08x |
| 1,024 | 128 / 128 | 16,324 | 6,268 | 2.60x |
| 4,096 | 128 / 128 | 27,728 | 7,787 | 3.56x |
| 8,192 | 128 / 128 | 33,106 | 7,799 | 4.25x |
| 256 | 128 / 2,048 | 5,641 | 3,722 | 1.52x |
| 1,024 | 128 / 2,048 | 15,318 | 8,278 | 1.85x |
| 2,048 | 128 / 2,048 | 23,824 | 7,253 | 3.28x |
| 8,192 | 128 / 2,048 | 43,916 | 8,097 | 5.42x |
| 256 | 2,048 / 128 | 3,575 | 1,224 | 2.92x |
| 1,024 | 2,048 / 128 | 6,188 | 1,632 | 3.79x |
| 2,048 | 2,048 / 128 | 7,857 | 1,750 | 4.49x |
| 4,096 | 2,048 / 128 | 10,646 | 1,748 | 6.09x |
| 8,192 | 2,048 / 128 | 10,806 | 1,777 | 6.08x |
| 256 | 2,048 / 2,048 | 4,513 | 2,453 | 1.84x |
| 1,024 | 2,048 / 2,048 | 9,942 | 4,002 | 2.48x |
| 4,096 | 2,048 / 2,048 | 16,647 | 4,083 | 4.08x |
| 8,192 | 2,048 / 2,048 | 18,395 | 4,089 | 4.50x |

Table 5 | Qwen3-235B Output Throughput: MI355X MXFP4 vs MI300X FP8 (Zero-Error Configurations)

Figure 5 | MI355X MXFP4 vs. MI300X FP8 throughput scaling (2048/2048 workload)

At moderate concurrency (128 to 256 requests), the MI355X with MXFP4 delivers 1.5x to 2.9x the throughput of the MI300X with FP8, depending on workload profile. The advantage is most pronounced for long-input, short-output workloads (2,048/128), where the MI300X's narrower memory bandwidth and lack of native FP4 compute constrain throughput scaling.

At 1,024 concurrent requests, the generational gap widens to 2.48x for long-context generation and 3.79x for long-input summarization tasks. At 8,192 concurrent, document-processing workloads (2,048/128) reach 6.1x. For organizations currently operating MI300X infrastructure, these results establish a concrete performance baseline for migration planning.

Enterprise Use Case: Compliance and Document Intelligence

Compliance checks, regulatory document review, financial risk analysis, and contract classification all share a common token profile: long inputs (2,048+ tokens of document context) with short outputs (128 tokens of classification, risk score, or structured response). This is the workload where the MI355X with MXFP4 delivers its largest absolute and proportional advantage.

At 1,024 concurrent requests, a practical operational load for enterprise platforms, MXFP4 delivers 6,188 tokens per second versus 1,632 tokens per second for MI300X FP8: 3.8x more capacity. At peak concurrency, the ratio reaches 6.1x. Document throughput translates directly to review capacity: 6.1x throughput means 6.1x more documents reviewed per hour on equivalent hardware.

Regulatory workloads often face batch deadlines: end-of-day portfolio reviews, overnight compliance sweeps, and real-time transaction monitoring thresholds. For real-time monitoring scenarios where the 100ms per-token SLA applies, MI300X FP8 supports only 64 concurrent document streams. MI355X MXFP4 sustains 256 concurrent streams, processing 4x more documents under active review simultaneously.

Accuracy Preservation

Throughput gains matter only if accuracy remains acceptable. The benchmark suite maintained near zero error rates across all tested configurations, confirming that MXFP4 quantization via Quark produces reliable inference results under production load.

Independent MLPerf Inference v5.1 results provide additional validation. AMD's MXFP4 submission on MI355X for Llama 2-70B achieved performance levels comparable to the closed division while maintaining accuracy targets[5]. These results represent the first MXFP4 submissions to an industry-standard benchmark, establishing a reproducible baseline for 4-bit floating-point inference performance.

| Implementation Considerations

Adopting MXFP4 quantization involves minimal operational complexity when compared to other precision reduction strategies. The quantization workflow integrates with existing model management practices, and the quantized models deploy through standard serving frameworks.

Infrastructure Requirements

MXFP4 inference requires MI355X accelerators with ROCm 7.0 or later. Earlier AMD Instinct generations (MI300X, MI325X) can emulate MXFP4 through software dequantization, providing compatibility for model evaluation but without the full throughput benefits of native hardware support.

The quantization process itself can run on any ROCm-compatible hardware. Organizations can quantize models on existing MI300X infrastructure, then deploy the quantized checkpoints to MI355X systems for production inference. This flexibility simplifies the transition path for established AMD deployments.

Serving Framework Integration

vLLM provides native support for Quark-quantized MXFP4 models. Loading a quantized checkpoint requires no code changes beyond specifying the model path. vLLM automatically detects the quantization format and configures the appropriate kernels for MI355X execution.
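As an illustrative deployment sketch (the checkpoint path is a placeholder, not a validated configuration):

```shell
# Serve a Quark-exported MXFP4 checkpoint across the node's eight GPUs.
# vLLM reads the quantization metadata embedded in the checkpoint, so no
# quantization-specific flags are required.
vllm serve /models/qwen3-235b-mxfp4 --tensor-parallel-size 8
```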

SGLang similarly supports MXFP4 models with automatic detection. For organizations using custom serving infrastructure, Quark exports to Hugging Face-compatible safetensors format with embedded quantization metadata, enabling integration with any framework that can load quantized models.

Dell PowerEdge Platform Integration

The Dell PowerEdge XE9785L provides the validated server platform for MI355X deployment. Dell iDRAC (Integrated Dell Remote Access Controller) delivers hardware-level management capabilities that are critical for production AI infrastructure: real-time GPU telemetry, thermal monitoring, remote firmware updates, and automated alerting. These capabilities enable infrastructure teams to monitor power draw, memory utilization, and accelerator health across fleet deployments without custom tooling. Pre-validated configurations available through Dell reduce the time from hardware delivery to production inference. Organizations deploying at scale benefit from Dell's lifecycle management tools, which streamline firmware updates, configuration drift detection, and capacity planning across multi-node installations.

| Total Cost of Ownership Impact

The throughput improvement from MXFP4 translates directly to infrastructure economics. The following analysis uses system-level power measurements, which include host CPU, memory, networking, and cooling overhead in addition to GPU power draw.

| Workload (In/Out) | MI355X MXFP4 tok/s | MI355X Power (W) | MI300X FP8 tok/s | MI300X Power (W) | tok/s/W (MI355X) | tok/s/W (MI300X) | Efficiency Gain |
|---|---|---|---|---|---|---|---|
| 128 / 128 | 10,503 | 2,353 | 5,053 | 7,635 | 4.46 | 0.66 | 6.7x |
| 128 / 2048 | 9,621 | 2,333 | 5,806 | 7,561 | 4.12 | 0.77 | 5.4x |
| 2048 / 128 | 4,862 | 2,353 | 1,420 | 6,700 | 2.07 | 0.21 | 9.7x |
| 2048 / 2048 | 7,000 | 2,353 | 3,204 | 7,290 | 2.97 | 0.44 | 6.8x |

Table 6 | Power-Normalized Throughput: MI355X MXFP4 vs MI300X FP8 (512 Concurrent Requests)

Figure 6 | Power-normalized throughput: MI355X MXFP4 vs. MI300X FP8 at 512 concurrent requests

System-level power consumption confirms the efficiency advantage. The MI355X node draws approximately 2,330W to 2,353W under sustained MXFP4 inference load at 512 concurrent requests, compared to 6,700W to 7,635W for the MI300X node running FP8 under comparable workloads. Normalizing for throughput, the MI355X with MXFP4 yields 2.1 to 4.5 tokens per second per watt, while the MI300X with FP8 yields 0.21 to 0.77: a 5.4x to 9.7x improvement in power-normalized throughput.

To illustrate: at $0.10/kWh, the MI300X node running FP8 at 512 concurrency on a 2048/128 workload costs approximately $489/month in power alone, producing 1,420 tokens/second. The MI355X with MXFP4 produces 4,862 tokens/second at $172/month in power, a 3.4x throughput increase at 65% lower power cost. Organizations replacing three to four MI300X nodes with a single MI355X node can recover the hardware investment within months through operational savings alone.
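The power-cost figures can be reproduced from the Table 6 measurements, assuming 730 hours per month:

```python
# Monthly system power cost at $0.10/kWh for the 2048/128 workload at
# 512 concurrent requests (power and throughput figures from Table 6).
rate, hours = 0.10, 730
mi300x_cost = 6_700 / 1000 * hours * rate   # FP8 node: ~$489/month
mi355x_cost = 2_353 / 1000 * hours * rate   # MXFP4 node: ~$172/month
speedup = 4_862 / 1_420                     # throughput ratio: ~3.4x
print(round(mi300x_cost), round(mi355x_cost), round(speedup, 1))
```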

The memory footprint reduction compounds these benefits. MXFP4 model weights occupy half the space of FP8, freeing HBM capacity for larger KV caches. Longer context windows support more complex conversations without requiring model sharding. More concurrent users fit within a single node's memory budget. These efficiency gains accumulate across the deployment lifecycle, reducing both capital expenditure and ongoing operational costs.

| Conclusion: Production-Ready 4-Bit Inference

The combination of OCP MXFP4 standardization, AMD Quark tooling, MI355X hardware support, and Dell PowerEdge platform validation establishes a production-ready path to 4-bit inference. MXFP4 removes the traditional tradeoff between throughput and accuracy.

The benchmark results on Qwen3-235B demonstrate up to 6.1x throughput improvement over the prior-generation MI300X with FP8, with the largest gains on document-processing and long-input workloads critical to regulated industries. On the same MI355X hardware, MXFP4 delivers 1.15x to 2.30x throughput over FP8, with the advantage growing at higher concurrency. Power-normalized efficiency gains of 5x to 10x translate to immediate reductions in operating cost.

Latency testing confirms that MXFP4 reduces time per output token by 1.1x to 1.4x at moderate concurrency, while sustaining sub-1.5 second time to first token at concurrency levels where FP8 exceeds 6 seconds on long-context workloads. At the 100ms per-token SLA, MXFP4 supports 4x to 8x more concurrent users, directly expanding the operational envelope for latency-sensitive applications.

For infrastructure teams planning their next hardware refresh, MXFP4 on the AMD Instinct MI355X redefines what a single 8-GPU node can deliver. The 6.1x throughput advantage over prior-generation hardware means fewer nodes for the same workload. The 5x to 10x power efficiency gain means lower operating costs from day one. And zero-error accuracy validation means production deployment without compromise.

Organizations can begin today: quantize an existing model with AMD Quark on current MI300X infrastructure, validate accuracy, and deploy to MI355X when hardware arrives. No framework changes required. No retraining necessary.

Learn more about AMD Instinct MI355X at amd.com/instinct


| References

[1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.

[2] Open Compute Project, "OCP Microscaling Formats (MX) Specification v1.0," September 2023. [Online]. Available: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

[3] B. D. Rouhani et al., "Microscaling Data Formats for Deep Learning," arXiv preprint arXiv:2310.10537, 2023.

[4] AMD, "AMD Quark Documentation," 2025. [Online]. Available: https://quark.docs.amd.com/

[5] AMD, "Technical Dive into AMD's MLPerf Inference v5.1 Submission," ROCm Blogs, September 2025. [Online]. Available: https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/

[6] AMD, "High-Accuracy MXFP4, MXFP6, and Mixed-Precision Models on AMD GPUs," ROCm Blogs, October 2025. [Online]. Available: https://rocm.blogs.amd.com/software-tools-optimization/mxfp4-mxfp6-quantization/


Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge, Dell iDRAC and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.

***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.

