Scaling Vision Language Model Inference on Dell PowerEdge R770 with Intel Xeon 6
January 2026
Executive Summary
Vision language model deployments are accelerating across manufacturing, healthcare, financial services, and logistics, yet most organizations lack reliable methods to determine how much infrastructure they actually need. Vendor datasheets provide theoretical maximums, and small-scale pilots fail to replicate production load. Metrum AI’s Vision Language Model Demo eliminates this uncertainty. Running directly on Dell PowerEdge R770 servers equipped with Intel Xeon 6 processors (6780P), the demo benchmarks real-world inference using InternVL 2 1B under production-representative conditions, enforcing a strict two-second latency threshold per request to ensure results reflect operational requirements.
Benchmark results on the PowerEdge R770 demonstrate a clear performance progression across three acceleration paths. CPU baseline mode delivers 94.8 tokens per second at single-request concurrency. Enabling Intel Advanced Matrix Extensions (AMX) increases throughput to 105.5 tokens per second, a 1.11x improvement requiring no additional hardware investment. Adding NVIDIA RTX Pro 6000 Blackwell Server Edition GPU acceleration transforms the capacity profile entirely, supporting 32 concurrent requests at 4,131.1 tokens per second for a 43.6x throughput increase over baseline. The 14th Generation Dell PowerEdge R740 could not sustain even a single request within the two-second quality threshold, achieving only 15 tokens per second, a gap of approximately 275x versus the GPU-accelerated R770. These results give IT directors and infrastructure architects the validated data they need to right-size deployments, justify acceleration investments, and build a defensible infrastructure roadmap for multimodal AI at scale.
The Challenge
The global multimodal AI market reached $1.73 billion in 2024 and is projected to grow to $10.89 billion by 2030, expanding at nearly 37% annually[1]. This growth reflects a fundamental shift in how organizations extract intelligence from visual data. Unlike traditional computer vision that identifies objects, vision language models understand context, interpret scenes, and generate natural language responses from images. Manufacturing plants analyze quality defects with detailed explanations. Healthcare systems assist radiologists with diagnostic imaging and written assessments. Financial services extract and summarize information from complex documents. Each use case shares a common requirement: infrastructure capable of processing multimodal requests without compromising response latency.
For IT teams, deploying vision language models introduces a new class of infrastructure challenges. These models integrate image encoders with large language models, producing mixed workloads that require both parallel computation for visual processing and sequential execution for text generation. Organizations must determine how to provision systems that support production-scale inference without overspending on accelerator capacity or accepting performance levels that degrade user experience.
To address these requirements, Metrum AI developed the Vision Language Model Demo. The solution provides a practical framework for evaluating infrastructure performance using InternVL 2 1B, a production-grade vision language model. The demo measures system behavior across three acceleration paths: CPU baseline, Intel AMX optimization, and NVIDIA GPU acceleration, while enforcing a strict two-second latency threshold per request. This quality gate ensures that benchmark results align with real-world deployment expectations, where interactive applications depend on responsive visual understanding.
The benchmark results quantify the infrastructure decision. On Dell PowerEdge R770 servers, CPU baseline mode achieves 94.8 tokens per second with single-request concurrency. Enabling Intel AMX optimization increases throughput to 105.5 tokens per second, delivering a 1.11x improvement without hardware changes. GPU acceleration transforms performance entirely, supporting 32 concurrent requests at 4,131.1 tokens per second for a 43.6x throughput increase over baseline. These metrics give IT directors and infrastructure architects the data required to right-size deployments, justify acceleration investments, and ensure vision language model initiatives deliver expected business outcomes.
The demo runs on Dell PowerEdge R770 servers configured with Intel Xeon 6780P processors featuring AMX support, DDR5 memory, and PCIe Gen5 expansion for GPU acceleration. This configuration enables organizations to evaluate vision language model workloads on modern server infrastructure alongside the software workflows used during evaluation.
Solution Overview
The infrastructure challenges outlined above demand a different approach to capacity planning. Rather than relying on vendor specifications or pilot testing that may not reflect production conditions, organizations need validated performance data from real hardware running actual workloads. Metrum AI developed the Vision Language Model Demo to address this requirement. The solution embeds a complete benchmarking environment directly onto Dell PowerEdge R770 servers, enabling IT teams to measure throughput, latency, and concurrency limits before committing to deployment decisions.
Figure 1 | Solution Flow
The demo follows a structured process that guides users from configuration through performance comparison. During setup, operators select the visual task and input prompt, define the latency threshold, choose which hardware scenarios to evaluate, and specify the desired concurrency level. The platform supports four core reasoning workloads: optical character recognition for text extraction, image summarization for scene interpretation, object counting and detection for inventory and safety use cases, and complex reasoning to analyze relationships between visual elements. These tasks provide representative coverage of common production applications.
Once configured, the execution engine initializes the system through a warm-up phase that loads model weights into memory and stabilizes hardware state. The benchmark then proceeds with an incremental stress test, gradually increasing request concurrency while monitoring performance against the defined two-second latency threshold. Throughout execution, the system records tokens-per-second throughput alongside average and P95 latency, capturing both aggregate capacity and tail performance characteristics.
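The ramp procedure described above can be sketched in Python. This is an illustrative sketch only, not the demo's implementation: the warm-up phase and the actual inference call are elided, the `send_request` helper is a placeholder, and the geometric ramp is an assumed scaling policy.

```python
"""Sketch of an incremental concurrency stress test gated on a
per-request latency threshold (illustrative, not the demo's code)."""
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_THRESHOLD_S = 2.0  # quality gate enforced on every request


def send_request(prompt: str) -> float:
    """Placeholder: issue one inference request, return its latency.

    A real implementation would call the inference endpoint here.
    """
    start = time.perf_counter()
    # ... send prompt + image to the model server and wait ...
    return time.perf_counter() - start


def find_max_concurrency(prompt: str, limit: int = 64) -> int:
    """Raise concurrency until any request breaches the threshold."""
    best = 0
    concurrency = 1
    while concurrency <= limit:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(send_request, [prompt] * concurrency))
        if max(latencies) > LATENCY_THRESHOLD_S:
            break  # quality floor breached: previous level is capacity
        best = concurrency
        concurrency *= 2  # assumed geometric ramp; a real test may step finer
    return best
```

The key design point is that capacity is defined by the last concurrency level at which *every* request stayed under the threshold, not by peak aggregate throughput.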
Running on PowerEdge R770 infrastructure, this workflow reflects real deployment conditions by simultaneously exercising CPU cores for prompt processing, memory bandwidth for model access, and accelerator resources for tensor operations. The incremental scaling methodology mirrors how production environments evolve over time, allowing IT teams to observe system behavior as request density increases.
Figure 2 | Unified Command Interface
During execution, the live monitoring interface provides continuous visibility into system performance. Operators track tokens-per-second metrics, throughput trends, and hardware telemetry including CPU and GPU utilization, temperature, and memory consumption. After testing completes, the performance comparison view presents results from each scenario side by side, translating raw measurements into actionable infrastructure guidance.
Solution Architecture
Figure 3 | Solution Architecture
The solution is organized into layered components that separate user interaction, workload orchestration, inference execution, and observability. At the application layer, a web-based dashboard provides configuration controls, real-time benchmark visualization, and comparative performance views. PostgreSQL stores benchmark results and configuration data, enabling historical analysis across multiple test runs.
The processing layer coordinates workload distribution across available compute resources. Celery manages distributed job scheduling, assigning inference tasks to CPU and GPU backends as appropriate. Valkey serves as the job queue, buffering incoming requests and routing them to available workers. This design supports horizontal scalability by allowing additional worker nodes to be added without changes to application logic.
Inference is handled by optimized runtimes selected according to the chosen hardware scenario. For CPU-based execution, Intel OpenVINO Model Server (OVMS) enables inference with optional AMX acceleration using BF16 precision to improve tensor efficiency. GPU-accelerated inference is powered by vLLM operating in FP16 precision, delivering optimized attention mechanisms and memory management for large language model workloads. Requests are automatically directed to the appropriate backend based on configuration.
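The scenario-to-backend mapping can be illustrated with a small routing table. The service names, URLs, and scenario keys below are hypothetical placeholders, not the demo's actual configuration.

```python
# Illustrative sketch of scenario-based backend routing; names and
# URLs are assumptions, not the demo's real service endpoints.
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    engine: str
    precision: str
    url: str


BACKENDS = {
    "cpu_baseline": Backend("cpu_baseline", "OpenVINO Model Server",
                            "BF16", "http://ovms:9000"),
    "cpu_amx":      Backend("cpu_amx", "OpenVINO Model Server",
                            "BF16", "http://ovms:9000"),
    "gpu":          Backend("gpu", "vLLM", "FP16", "http://vllm:8000"),
}


def route(scenario: str) -> Backend:
    """Return the inference backend for the configured scenario."""
    try:
        return BACKENDS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario}") from None
```

Keeping routing declarative like this lets new hardware scenarios be added without touching application logic, mirroring the horizontal-scalability goal described above.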
On the PowerEdge R770 platform, Intel AMX accelerates BF16 tensor operations directly on CPU cores, improving transformer inference performance. PCIe Gen5 connectivity provides the bandwidth required to support GPU-based workflows, enabling efficient data movement between host memory and accelerator memory.
System observability is maintained through continuous telemetry collection. Prometheus aggregates metrics from all components, while custom exporters surface detailed hardware utilization data, including CPU load, memory bandwidth, GPU activity, and thermal status. This instrumentation allows operators to correlate benchmark results with underlying resource consumption.
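A custom exporter of the kind described can be sketched as follows. This hand-rolled version serves the Prometheus text exposition format over HTTP using only the standard library; real deployments would more likely use the official `prometheus_client` package, and the metric name and placeholder reading here are illustrative.

```python
# Minimal hand-rolled Prometheus exporter sketch (illustrative only;
# the demo's real exporters and metric names may differ).
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_cpu_utilization() -> float:
    """Placeholder for a real probe (e.g. parsing /proc/stat)."""
    return 42.0


def render_metrics() -> str:
    """Render current readings in Prometheus text exposition format."""
    lines = [
        "# HELP node_cpu_utilization_percent CPU utilization",
        "# TYPE node_cpu_utilization_percent gauge",
        f"node_cpu_utilization_percent {read_cpu_utilization()}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```

Prometheus then scrapes the `/metrics` endpoint on its configured interval, which is what allows benchmark results to be correlated with hardware utilization after the fact.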
Ubuntu 24.04 LTS serves as the operating system foundation, providing a stable and secure environment for production workloads.
Infrastructure Foundation
Vision language model workloads place sustained demands on server infrastructure, requiring high memory bandwidth for model weight access, substantial memory capacity for weight storage, and parallel compute capacity for transformer inference. Dell PowerEdge R770 addresses these requirements through its 17th Generation architecture, aligned with AI inference and data-intensive application requirements.
Equipped with dual Intel Xeon 6780P processors delivering 128 total cores, the R770 distributes image encoding, prompt processing, and attention computation without resource contention. One terabyte of DDR5 memory keeps model weights resident in memory, eliminating weight loading latency during sustained operation. High-capacity NVMe storage supports model checkpoints and logging, preventing I/O bottlenecks as request volumes scale. Together, these capabilities support sustained operation under production workloads rather than short-lived peak measurements.
For organizations requiring maximum throughput density, two NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs deliver dedicated inference acceleration. This configuration separates CPU resources for orchestration and request handling from GPU resources dedicated to model execution. The design enables scalable performance as request volumes increase. This balanced CPU-GPU architecture allows organizations to scale from CPU-only deployments to GPU-accelerated environments on the same R770 platform, supporting consistent operations as performance requirements evolve.
| Component | Dell PowerEdge R770 | Dell PowerEdge R740 |
| --- | --- | --- |
| Generation | 17th Generation | 14th Generation |
| CPU | Intel Xeon 6780P | Intel Xeon Gold 6126 |
| Memory | 1 TB DDR5 | 96 GB |
| Storage | 9.2 TB NVMe SSD | 500 GB |
| GPU | 2x NVIDIA RTX Pro 6000 Blackwell Server Edition | None |
Table 1 | Hardware Configuration
The benchmark includes a Dell PowerEdge R740 to illustrate the generational performance gap. This 14th Generation system lacks AMX support and GPU acceleration, limiting its capacity to CPU baseline scenarios. The comparison quantifies the operational constraints organizations face when running vision language model workloads on legacy infrastructure.
Performance Benchmark
The benchmark measures system behavior against four key metrics that translate hardware capabilities into operational capacity. Tokens per second captures throughput for text generation, indicating aggregate processing power available for multimodal reasoning workloads. Inference latency reflects the responsiveness experienced by each request, ensuring response times remain within acceptable bounds. Max concurrency identifies the parallelism ceiling before quality degradation occurs. Average tokens per response measures output complexity across the evaluated visual reasoning tasks.
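For reference, the P95 figure reported alongside average latency is a tail-latency percentile over per-request timings. A minimal nearest-rank sketch in Python (an illustrative calculation, not the demo's actual code):

```python
# Nearest-rank P95 over per-request latencies (illustrative sketch).
import math


def p95_latency(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of per-request latencies (seconds)."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# With 100 samples, P95 is the 95th-smallest value, so a handful of
# slow requests is enough to push P95 past a 2-second threshold even
# when the average stays comfortably below it.
```

This is why the benchmark tracks both average and P95: aggregate throughput can look healthy while tail latency already violates the quality gate.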
The demo enforces a strict two-second latency threshold per request throughout testing. This threshold reflects production requirements where interactive applications depend on responsive visual understanding. A document processing system exceeding two-second latency frustrates users and reduces throughput. A quality inspection application operating above threshold introduces production delays. The benchmark progressively increases concurrent requests until any request exceeds this quality floor, establishing true capacity rather than theoretical maximums.
| Scenario | Max Concurrency (@ 2 s latency) | Peak Tokens per Second | Speedup vs. Baseline |
| --- | --- | --- | --- |
| CPU Baseline | 1 | 94.8 | 1.0x |
| AMX Optimized | 1 | 105.5 | 1.11x |
| GPU Accelerated | 32 | 4,131.1 | 43.6x |
Table 2 | Dell PowerEdge R770 Benchmarks
These results reflect the combined impact of R770 platform capabilities, including high core density for concurrent request processing, DDR5 memory bandwidth for model weight access, and optimized AMX execution for CPU-based inference. When paired with GPU acceleration, PCIe Gen5 connectivity supports efficient data movement between host and accelerators, enabling increased request density as workloads scale.
The modest 1.11x improvement from AMX optimization reflects the memory-bandwidth-intensive nature of vision language model inference. Unlike compute-bound workloads where AMX can deliver larger gains, VLM inference performance is often constrained by the rate at which model weights load from memory rather than the speed of tensor computation. This finding helps organizations set realistic expectations when evaluating CPU-only deployments.
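A back-of-envelope roofline calculation makes this concrete. The parameter count, precision, and bandwidth figures below are illustrative assumptions for a roughly 1B-parameter model, not measured values from this benchmark.

```python
# Illustrative roofline estimate for memory-bandwidth-bound decoding.
# All inputs are assumptions, not measurements.
def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput: each generated
    token must stream the full weight set from memory at least once."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

# A ~1B-parameter model in BF16 (2 bytes/param) occupies ~2 GB of
# weights; at an assumed 300 GB/s of effective memory bandwidth the
# ceiling is ~150 tokens/s -- the same order of magnitude as the
# measured CPU results, which is why faster matrix math alone yields
# only an incremental gain.
ceiling = decode_ceiling_tokens_per_s(1.0, 2, 300.0)
```

Under these assumptions, throughput is capped by how fast weights can be streamed, so AMX's faster tensor arithmetic moves the needle only modestly until memory bandwidth improves.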
| Scenario | Max Concurrency (@ 2 s latency) | Peak Tokens per Second | Speedup vs. Baseline |
| --- | --- | --- | --- |
| CPU Baseline | Did not meet threshold[2] | 15 | 1.0x |
| AMX Optimized | N/A | N/A | N/A |
| GPU Accelerated | N/A | N/A | N/A |
Table 3 | Dell PowerEdge R740 Benchmarks
The results highlight three practical considerations for infrastructure planning on PowerEdge R770 platforms.
First, AMX optimization delivers an 11% throughput increase with no GPU required. For organizations not yet ready to add accelerators, AMX provides measurable improvement through software-level enablement with no additional hardware spend.
Second, GPU acceleration transforms the capacity profile entirely. Supporting 32 concurrent requests with a 43.6x throughput increase justifies the investment when workload requirements demand high parallelism and sustained throughput density.
Third, the generational comparison reveals a critical operational gap. The 14th Generation R740 could not sustain even a single concurrent request within the two-second quality threshold. At 15 tokens per second on legacy infrastructure versus 4,131.1 on the GPU-accelerated PowerEdge R770, the difference translates to approximately 275x faster multimodal processing at production scale. This quantifies the operational constraints of legacy infrastructure for vision language model workloads.
Business Impact
Benchmark numbers quantify infrastructure capability. The real value lies in how those capabilities translate into business outcomes for the organizations deploying them.
Accelerator Investment Justification
GPU acceleration delivers a 43.6x throughput increase and supports 32 concurrent requests on a single server. For organizations processing high volumes of multimodal requests, such as document understanding pipelines, visual quality inspection systems, or diagnostic imaging workflows, this performance density consolidates workloads that would otherwise require dozens of CPU-only servers. The benchmark data provides the defensible evidence finance and procurement teams require to approve accelerator investments with clear return-on-investment projections.
Risk Reduction Through Validated Capacity Planning
The most expensive infrastructure mistake is deploying hardware that cannot sustain production workloads. When a document processing system exceeds acceptable latency, user productivity drops and processing backlogs accumulate. When a visual inspection application fails to keep pace with production line throughput, quality defects reach customers. Emergency capacity additions, unplanned downtime, and degraded application performance carry costs that far exceed the price of proper planning. By validating capacity limits before deployment, organizations reduce the risk of production-day failures and the emergency procurement cycles they trigger.
CPU-Only Deployment Path for Early Adoption
Not every organization is ready for GPU investment on day one. AMX optimization delivers an 11% throughput improvement through software-level enablement, requiring no additional hardware spend. For teams running initial VLM pilots, proof-of-concept deployments, or low-concurrency production workloads, the AMX-optimized CPU path provides a viable starting point. Organizations gain immediate value from their existing PowerEdge R770 investment while preserving a clear upgrade path to GPU acceleration as workload demands grow.
Flexible Upgrade Path on a Single Platform
The R770 platform supports a modular approach to scaling. Organizations can deploy with CPU-only configurations for initial rollouts, enable AMX optimization through software configuration for incremental gains, and add GPU accelerators when concurrency and throughput requirements increase. This flexibility protects the initial server investment and extends the useful life of the platform as multimodal AI workloads expand across the organization.
Legacy Infrastructure Modernization
The generational comparison delivers a clear signal to organizations still operating on older server platforms. At 15 tokens per second, the R740 cannot sustain production-quality vision language model inference at any meaningful concurrency level. Organizations planning multimodal AI initiatives on legacy infrastructure face a fundamental constraint that no software optimization can overcome. The 275x performance gap between the R740 and the GPU-accelerated R770 quantifies the operational cost of deferring infrastructure modernization, giving IT leaders concrete data to support refresh planning and capital budget requests.
Mapping Streams to Real-World Deployments

- 1 concurrent request (CPU baseline): Single-user document analysis, individual diagnostic image review, or ad hoc visual question answering. Suitable for pilot programs and proof-of-concept evaluations where response quality matters more than throughput volume.
- 1 concurrent request with AMX (optimized CPU): Same single-user scenarios with 11% faster response times. Provides measurable improvement for latency-sensitive applications without additional hardware cost.
- 32 concurrent requests (GPU accelerated): Production-scale document processing pipelines, multi-station quality inspection systems, or enterprise visual search applications serving dozens of simultaneous users. Supports the throughput density required for line-of-business applications where multiple teams or automated workflows submit requests continuously.
Conclusion
Vision language model deployments depend fundamentally on infrastructure readiness. Organizations must process multimodal requests with consistently low latency to support interactive applications where users expect immediate visual understanding. Traditional capacity planning approaches, such as relying on vendor specifications or limited pilot testing, often fail to reflect sustained production behavior. As a result, IT teams face a familiar tradeoff: overprovision accelerator resources to reduce risk, or underprovision systems and accept degraded user experience. What is needed instead is workload-driven insight that directly connects infrastructure choices to operational outcomes.
The Vision Language Model Demo addresses this requirement by providing a practical benchmarking framework on Dell PowerEdge platforms. Rather than focusing on isolated peak measurements, the solution evaluates real hardware under incremental load using InternVL 2 1B and a strict two-second latency threshold. A unified interface guides teams from configuration through execution and side-by-side comparison, while the layered architecture clearly separates orchestration, inference, and observability. By supporting CPU baseline, Intel AMX optimization, and GPU-accelerated scenarios within a consistent workflow, the demo enables infrastructure architects to compare acceleration strategies under identical test conditions. This approach transforms benchmarking from a purely technical exercise into a repeatable decision process that aligns performance, cost, and scalability with deployment requirements.
Measured performance on Dell PowerEdge R770 servers demonstrates how modern platforms translate directly into operational capacity. CPU baseline inference delivers 94.8 tokens per second at production quality. Enabling Intel AMX increases throughput to 105.5 tokens per second, providing incremental gains for single-request workloads without additional hardware investment. GPU acceleration extends performance substantially, reaching 4,131.1 tokens per second with 32 concurrent requests, a 43.6x improvement over baseline for environments that demand high throughput density.
The generational comparison further quantifies the limitations of legacy systems. The R740 fails to meet production latency thresholds even at minimal concurrency, while the GPU-accelerated R770 delivers approximately 275x faster multimodal processing. Together, these results give IT leaders and infrastructure architects clear, defensible data to right-size deployments, justify acceleration investments, and make informed deployment decisions.
With validated performance measurements, integrated observability, and a repeatable operational model, the Vision Language Model Demo provides a practical framework for evaluating multimodal AI workloads under production-representative conditions.
Addendum
Operating System Information
| Type | Details |
| --- | --- |
| Operating System | Ubuntu 24.04.3 LTS |
| Kernel | 6.8.0-90-generic |
| Driver Status | NVIDIA CUDA 13 |
Experiment Configuration
| Configuration | Details |
| --- | --- |
| Vision Language Model | InternVL 2 1B |
| Model Precision | BF16 (CPU) and FP16 (GPU) |
| Inference Engine | Intel OpenVINO (CPU) / vLLM (GPU) |
| Input Dataset | Image, 1844x892 px resolution |
| Latency Threshold | 2.0 seconds (target) |
| Scaling Mode | Auto |
Hardware Scenarios Tested
| Backend | Inference Engine | Version |
| --- | --- | --- |
| CPU Unoptimized | Intel OpenVINO with AMX disabled | 2025.4.1 |
| CPU Optimized with AMX | Intel OpenVINO with AMX enabled | 2025.4.1 |
| GPU Accelerated | vLLM | v0.14.1 |
References
Grand View Research, "Multimodal AI Market Size, Share & Trends Analysis Report By Component, By Data Modality, By End-use, By Enterprise Size, By Region, And Segment Forecasts, 2025 - 2030," Grand View Research, Inc., San Francisco, CA, USA, 2024. [Online]. Available: https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report
Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries.
DISCLAIMER: Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.
[1] Grand View Research. "Multimodal AI Market Size, Share & Trends Analysis Report, 2030." https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report
[2] The R740 was unable to meet the two-second latency threshold even at single-concurrency. The 14th Generation system lacks AMX support and was not configured with GPU acceleration.