Scaling Vision Language Model Inference on Dell PowerEdge R770 with Intel Xeon 6
January 2026
Executive Summary
Vision language model deployments are accelerating across manufacturing, healthcare, financial services, and logistics, yet most organizations lack reliable methods to determine how much infrastructure they actually need. Vendor datasheets provide theoretical maximums, and small-scale pilots fail to replicate production load. Metrum AI’s Vision Language Model Demo eliminates this uncertainty. Running directly on Dell PowerEdge R770 servers equipped with Intel Xeon 6 processors (6780P), the demo benchmarks real-world inference using InternVL 2 1B under production-representative conditions, enforcing a strict two-second latency threshold per request to ensure results reflect operational requirements.
Benchmark results on the PowerEdge R770 demonstrate a clear performance progression across three acceleration paths. CPU baseline mode delivers 94.8 tokens per second at single-request concurrency. Enabling Intel Advanced Matrix Extensions (AMX) increases throughput to 105.5 tokens per second, a 1.11x improvement requiring no additional hardware investment. Adding NVIDIA RTX Pro 6000 Blackwell Server Edition GPU acceleration transforms the capacity profile entirely, supporting 32 concurrent requests at 4,131.1 tokens per second for a 43.6x throughput increase over baseline. The 14th Generation Dell PowerEdge R740 could not sustain even a single request within the two-second quality threshold, achieving only 15 tokens per second, a gap of approximately 275x versus the GPU-accelerated R770. These results give IT directors and infrastructure architects the validated data they need to right-size deployments, justify acceleration investments, and build a defensible infrastructure roadmap for multimodal AI at scale.
The Challenge
The global multimodal AI market reached $1.73 billion in 2024 and is projected to grow to $10.89 billion by 2030, expanding at nearly 37% annually[1]. This growth reflects a fundamental shift in how organizations extract intelligence from visual data. Unlike traditional computer vision that identifies objects, vision language models understand context, interpret scenes, and generate natural language responses from images. Manufacturing plants analyze quality defects with detailed explanations. Healthcare systems assist radiologists with diagnostic imaging and written assessments. Financial services extract and summarize information from complex documents. Each use case shares a common requirement: infrastructure capable of processing multimodal requests without compromising response latency.
For IT teams, deploying vision language models introduces a new class of infrastructure challenges. These models integrate image encoders with large language models, producing mixed workloads that require both parallel computation for visual processing and sequential execution for text generation. Organizations must determine how to provision systems that support production-scale inference without overspending on accelerator capacity or accepting performance levels that degrade user experience.
To address these requirements, Metrum AI developed the Vision Language Model Demo. The solution provides a practical framework for evaluating infrastructure performance using InternVL 2 1B, a production-grade vision language model. The demo measures system behavior across three acceleration paths: CPU baseline, Intel AMX optimization, and NVIDIA GPU acceleration, while enforcing a strict two-second latency threshold per request. This quality gate ensures that benchmark results align with real-world deployment expectations, where interactive applications depend on responsive visual understanding.
The benchmark results quantify the infrastructure decision. On Dell PowerEdge R770 servers, CPU baseline mode achieves 94.8 tokens per second with single-request concurrency. Enabling Intel AMX optimization increases throughput to 105.5 tokens per second, delivering a 1.11x improvement without hardware changes. GPU acceleration transforms performance entirely, supporting 32 concurrent requests at 4,131.1 tokens per second for a 43.6x throughput increase over baseline. These metrics give IT directors and infrastructure architects the data required to right-size deployments, justify acceleration investments, and ensure vision language model initiatives deliver expected business outcomes.
The demo runs on Dell PowerEdge R770 servers configured with Intel Xeon 6780P processors featuring AMX support, DDR5 memory, and PCIe Gen5 expansion for GPU acceleration. This configuration enables organizations to evaluate vision language model workloads on modern server infrastructure alongside the software workflows used during evaluation.
Solution Overview
The infrastructure challenges outlined above demand a different approach to capacity planning. Rather than relying on vendor specifications or pilot testing that may not reflect production conditions, organizations need validated performance data from real hardware running actual workloads. Metrum AI developed the Vision Language Model Demo to address this requirement. The solution embeds a complete benchmarking environment directly onto Dell PowerEdge R770 servers, enabling IT teams to measure throughput, latency, and concurrency limits before committing to deployment decisions.
Figure 1 | Solution Flow
The demo follows a structured process that guides users from configuration through performance comparison. During setup, operators select the visual task and input prompt, define the latency threshold, choose which hardware scenarios to evaluate, and specify the desired concurrency level. The platform supports four core reasoning workloads: optical character recognition for text extraction, image summarization for scene interpretation, object counting and detection for inventory and safety use cases, and complex reasoning to analyze relationships between visual elements. These tasks provide representative coverage of common production applications.
Once configured, the execution engine initializes the system through a warm-up phase that loads model weights into memory and stabilizes hardware state. The benchmark then proceeds with an incremental stress test, gradually increasing request concurrency while monitoring performance against the defined two-second latency threshold. Throughout execution, the system records tokens-per-second throughput alongside average and P95 latency, capturing both aggregate capacity and tail performance characteristics.
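The ramp procedure described above can be sketched in Python. This is an illustrative sketch only, not the demo's implementation: the warm-up phase and the actual inference call are elided, the `send_request` helper is a placeholder, and the geometric ramp is an assumed scaling policy.

```python
"""Sketch of an incremental concurrency stress test gated on a
per-request latency threshold (illustrative, not the demo's code)."""
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_THRESHOLD_S = 2.0  # quality gate enforced on every request


def send_request(prompt: str) -> float:
    """Placeholder: issue one inference request, return its latency.

    A real implementation would call the inference endpoint here.
    """
    start = time.perf_counter()
    # ... send prompt + image to the model server and wait ...
    return time.perf_counter() - start


def find_max_concurrency(prompt: str, limit: int = 64) -> int:
    """Raise concurrency until any request breaches the threshold."""
    best = 0
    concurrency = 1
    while concurrency <= limit:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(send_request, [prompt] * concurrency))
        if max(latencies) > LATENCY_THRESHOLD_S:
            break  # quality floor breached: previous level is capacity
        best = concurrency
        concurrency *= 2  # assumed geometric ramp; a real test may step finer
    return best
```

The key design point is that capacity is defined by the last concurrency level at which *every* request stayed under the threshold, not by peak aggregate throughput.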
Running on PowerEdge R770 infrastructure, this workflow reflects real deployment conditions by simultaneously exercising CPU cores for prompt processing, memory bandwidth for model access, and accelerator resources for tensor operations. The incremental scaling methodology mirrors how production environments evolve over time, allowing IT teams to observe system behavior as request density increases.
Figure 2 | Unified Command Interface
During execution, the live monitoring interface provides continuous visibility into system performance. Operators track tokens-per-second metrics, throughput trends, and hardware telemetry including CPU and GPU utilization, temperature, and memory consumption. After testing completes, the performance comparison view presents results from each scenario side by side, translating raw measurements into actionable infrastructure guidance.
Solution Architecture
Figure 3 | Solution Architecture
The solution is organized into layered components that separate user interaction, workload orchestration, inference execution, and observability. At the application layer, a web-based dashboard provides configuration controls, real-time benchmark visualization, and comparative performance views. PostgreSQL stores benchmark results and configuration data, enabling historical analysis across multiple test runs.
The processing layer coordinates workload distribution across available compute resources. Celery manages distributed job scheduling, assigning inference tasks to CPU and GPU backends as appropriate. Valkey serves as the job queue, buffering incoming requests and routing them to available workers. This design supports horizontal scalability by allowing additional worker nodes to be added without changes to application logic.
Inference is handled by optimized runtimes selected according to the chosen hardware scenario. For CPU-based execution, Intel OpenVINO Model Server (OVMS) enables inference with optional AMX acceleration using BF16 precision to improve tensor efficiency. GPU-accelerated inference is powered by vLLM operating in FP16 precision, delivering optimized attention mechanisms and memory management for large language model workloads. Requests are automatically directed to the appropriate backend based on configuration.
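The scenario-to-backend mapping can be illustrated with a small routing table. The service names, URLs, and scenario keys below are hypothetical placeholders, not the demo's actual configuration.

```python
# Illustrative sketch of scenario-based backend routing; names and
# URLs are assumptions, not the demo's real service endpoints.
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    engine: str
    precision: str
    url: str


BACKENDS = {
    "cpu_baseline": Backend("cpu_baseline", "OpenVINO Model Server",
                            "BF16", "http://ovms:9000"),
    "cpu_amx":      Backend("cpu_amx", "OpenVINO Model Server",
                            "BF16", "http://ovms:9000"),
    "gpu":          Backend("gpu", "vLLM", "FP16", "http://vllm:8000"),
}


def route(scenario: str) -> Backend:
    """Return the inference backend for the configured scenario."""
    try:
        return BACKENDS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario}") from None
```

Keeping routing declarative like this lets new hardware scenarios be added without touching application logic, mirroring the horizontal-scalability goal described above.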
On the PowerEdge R770 platform, Intel AMX accelerates BF16 tensor operations directly on CPU cores, improving transformer inference performance. PCIe Gen5 connectivity provides the bandwidth required to support GPU-based workflows, enabling efficient data movement between host memory and accelerator memory.
System observability is maintained through continuous telemetry collection. Prometheus aggregates metrics from all components, while custom exporters surface detailed hardware utilization data, including CPU load, memory bandwidth, GPU activity, and thermal status. This instrumentation allows operators to correlate benchmark results with underlying resource consumption.
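A custom exporter of the kind described can be sketched as follows. This hand-rolled version serves the Prometheus text exposition format over HTTP using only the standard library; real deployments would more likely use the official `prometheus_client` package, and the metric name and placeholder reading here are illustrative.

```python
# Minimal hand-rolled Prometheus exporter sketch (illustrative only;
# the demo's real exporters and metric names may differ).
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_cpu_utilization() -> float:
    """Placeholder for a real probe (e.g. parsing /proc/stat)."""
    return 42.0


def render_metrics() -> str:
    """Render current readings in Prometheus text exposition format."""
    lines = [
        "# HELP node_cpu_utilization_percent CPU utilization",
        "# TYPE node_cpu_utilization_percent gauge",
        f"node_cpu_utilization_percent {read_cpu_utilization()}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```

Prometheus then scrapes the `/metrics` endpoint on its configured interval, which is what allows benchmark results to be correlated with hardware utilization after the fact.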
Ubuntu 24.04 LTS serves as the operating system foundation, providing a stable and secure environment for production workloads.
Infrastructure Foundation
Vision language model workloads place sustained demands on server infrastructure, requiring high memory bandwidth for model weight access, substantial memory capacity for weight storage, and parallel compute capacity for transformer inference. Dell PowerEdge R770 addresses these requirements through its 17th Generation architecture, aligned with AI inference and data-intensive application requirements.
Equipped with dual Intel Xeon 6780P processors delivering 128 total cores, the R770 distributes image encoding, prompt processing, and attention computation without resource contention. One terabyte of DDR5 memory keeps model weights resident in memory, eliminating weight loading latency during sustained operation. High-capacity NVMe storage supports model checkpoints and logging, preventing I/O bottlenecks as request volumes scale. Together, these capabilities support sustained operation under production workloads rather than short-lived peak measurements.
For organizations requiring maximum throughput density, two NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs deliver dedicated inference acceleration. This configuration separates CPU resources for orchestration and request handling from GPU resources dedicated to model execution. The design enables scalable performance as request volumes increase. This balanced CPU-GPU architecture allows organizations to scale from CPU-only deployments to GPU-accelerated environments on the same R770 platform, supporting consistent operations as performance requirements evolve.
| Component | Dell PowerEdge R770 | Dell PowerEdge R740 |
| --- | --- | --- |
| Generation | 17th Generation | 14th Generation |
| CPU | Intel Xeon 6780P | Intel Xeon Gold 6126 |
| Memory | 1 TB DDR5 | 96 GB |
| Storage | 9.2 TB NVMe SSD | 500 GB |
| GPU | 2x NVIDIA RTX Pro 6000 Blackwell Server Edition | None |
Table 1 | Hardware Configuration
The benchmark includes a Dell PowerEdge R740 to illustrate the generational performance gap. This 14th Generation system lacks AMX support and GPU acceleration, limiting its capacity to CPU baseline scenarios. The comparison quantifies the operational constraints organizations face when running vision language model workloads on legacy infrastructure.
Performance Benchmark
The benchmark measures system behavior against four key metrics that translate hardware capabilities into operational capacity. Tokens per second captures throughput for text generation, indicating aggregate processing power available for multimodal reasoning workloads. Inference latency reflects the responsiveness experienced by each request, ensuring response times remain within acceptable bounds. Max concurrency identifies the parallelism ceiling before quality degradation occurs. Average tokens per response measures output complexity across the evaluated visual reasoning tasks.
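For reference, the P95 figure reported alongside average latency is a tail-latency percentile over per-request timings. A minimal nearest-rank sketch in Python (an illustrative calculation, not the demo's actual code):

```python
# Nearest-rank P95 over per-request latencies (illustrative sketch).
import math


def p95_latency(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of per-request latencies (seconds)."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# With 100 samples, P95 is the 95th-smallest value, so a handful of
# slow requests is enough to push P95 past a 2-second threshold even
# when the average stays comfortably below it.
```

This is why the benchmark tracks both average and P95: aggregate throughput can look healthy while tail latency already violates the quality gate.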
The demo enforces a strict two-second latency threshold per request throughout testing. This threshold reflects production requirements where interactive applications depend on responsive visual understanding. A document processing system exceeding two-second latency frustrates users and reduces throughput. A quality inspection application operating above threshold introduces production delays. The benchmark progressively increases concurrent requests until any request exceeds this quality floor, establishing true capacity rather than theoretical maximums.
| Scenario | Max Concurrency (@ 2 s latency) | Peak Tokens per Second | Speedup vs. Baseline |
| --- | --- | --- | --- |
| CPU Baseline | 1 | 94.8 | 1.0x |
| AMX Optimized | 1 | 105.5 | 1.11x |
| GPU Accelerated | 32 | 4,131.1 | 43.6x |
Table 2 | Dell PowerEdge R770 Benchmarks
These results reflect the combined impact of R770 platform capabilities, including high core density for concurrent request processing, DDR5 memory bandwidth for model weight access, and optimized AMX execution for CPU-based inference. When paired with GPU acceleration, PCIe Gen5 connectivity supports efficient data movement between host and accelerators, enabling increased request density as workloads scale.
The modest 1.11x improvement from AMX optimization reflects the memory-bandwidth-intensive nature of vision language model inference. Unlike compute-bound workloads where AMX can deliver larger gains, VLM inference performance is often constrained by the rate at which model weights load from memory rather than the speed of tensor computation. This finding helps organizations set realistic expectations when evaluating CPU-only deployments.
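A back-of-envelope roofline calculation makes this concrete. The parameter count, precision, and bandwidth figures below are illustrative assumptions for a roughly 1B-parameter model, not measured values from this benchmark.

```python
# Illustrative roofline estimate for memory-bandwidth-bound decoding.
# All inputs are assumptions, not measurements.
def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput: each generated
    token must stream the full weight set from memory at least once."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb

# A ~1B-parameter model in BF16 (2 bytes/param) occupies ~2 GB of
# weights; at an assumed 300 GB/s of effective memory bandwidth the
# ceiling is ~150 tokens/s -- the same order of magnitude as the
# measured CPU results, which is why faster matrix math alone yields
# only an incremental gain.
ceiling = decode_ceiling_tokens_per_s(1.0, 2, 300.0)
```

Under these assumptions, throughput is capped by how fast weights can be streamed, so AMX's faster tensor arithmetic moves the needle only modestly until memory bandwidth improves.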
| Scenario | Max Concurrency (@ 2 s latency) | Peak Tokens per Second | Speedup vs. Baseline |
| --- | --- | --- | --- |
| CPU Baseline | Did not meet threshold[2] | 15 | 1.0x |
| AMX Optimized | N/A | N/A | N/A |
| GPU Accelerated | N/A | N/A | N/A |
Table 3 | Dell PowerEdge R740 Benchmarks
The results highlight three practical considerations for infrastructure planning on PowerEdge R770 platforms.
First, AMX optimization delivers an 11% throughput increase with no GPU required. For organizations not yet ready to add accelerators, AMX provides measurable improvement through software-level enablement with no additional hardware spend.
Second, GPU acceleration transforms the capacity profile entirely. Supporting 32 concurrent requests with a 43.6x throughput increase justifies the investment when workload requirements demand high parallelism and sustained throughput density.
Third, the generational comparison reveals a critical operational gap. The 14th Generation R740 could not sustain even a single concurrent request within the two-second quality threshold. At 15 tokens per second on legacy infrastructure versus 4,131.1 on the GPU-accelerated PowerEdge R770, the difference translates to approximately 275x faster multimodal processing at production scale. This quantifies the operational constraints of legacy infrastructure for vision language model workloads.
Business Impact
Benchmark numbers quantify infrastructure capability. The real value lies in how those capabilities translate into business outcomes for the organizations deploying them.
Accelerator Investment Justification
GPU acceleration delivers a 43.6x throughput increase and supports 32 concurrent requests on a single server. For organizations processing high volumes of multimodal requests, such as document understanding pipelines, visual quality inspection systems, or diagnostic imaging workflows, this performance density consolidates workloads that would otherwise require dozens of CPU-only servers. The benchmark data provides the defensible evidence finance and procurement teams require to approve accelerator investments with clear return-on-investment projections.
Risk Reduction Through Validated Capacity Planning
The most expensive infrastructure mistake is deploying hardware that cannot sustain production workloads. When a document processing system exceeds acceptable latency, user productivity drops and processing backlogs accumulate. When a visual inspection application fails to keep pace with production line throughput, quality defects reach customers. Emergency capacity additions, unplanned downtime, and degraded application performance carry costs that far exceed the price of proper planning. By validating capacity limits before deployment, organizations reduce the risk of production-day failures and the emergency procurement cycles they trigger.
CPU-Only Deployment Path for Early Adoption
Not every organization is ready for GPU investment on day one. AMX optimization delivers an 11% throughput improvement through software-level enablement, requiring no additional hardware spend. For teams running initial VLM pilots, proof-of-concept deployments, or low-concurrency production workloads, the AMX-optimized CPU path provides a viable starting point. Organizations gain immediate value from their existing PowerEdge R770 investment while preserving a clear upgrade path to GPU acceleration as workload demands grow.
Flexible Upgrade Path on a Single Platform
The R770 platform supports a modular approach to scaling. Organizations can deploy with CPU-only configurations for initial rollouts, enable AMX optimization through software configuration for incremental gains, and add GPU accelerators when concurrency and throughput requirements increase. This flexibility protects the initial server investment and extends the useful life of the platform as multimodal AI workloads expand across the organization.
Legacy Infrastructure Modernization
The generational comparison delivers a clear signal to organizations still operating on older server platforms. At 15 tokens per second, the R740 cannot sustain production-quality vision language model inference at any meaningful concurrency level. Organizations planning multimodal AI initiatives on legacy infrastructure face a fundamental constraint that no software optimization can overcome. The 275x performance gap between the R740 and the GPU-accelerated R770 quantifies the operational cost of deferring infrastructure modernization, giving IT leaders concrete data to support refresh planning and capital budget requests.
Mapping Streams to Real-World Deployments

- 1 concurrent request (CPU baseline): Single-user document analysis, individual diagnostic image review, or ad hoc visual question answering. Suitable for pilot programs and proof-of-concept evaluations where response quality matters more than throughput volume.
- 1 concurrent request with AMX (optimized CPU): Same single-user scenarios with 11% faster response times. Provides measurable improvement for latency-sensitive applications without additional hardware cost.
- 32 concurrent requests (GPU accelerated): Production-scale document processing pipelines, multi-station quality inspection systems, or enterprise visual search applications serving dozens of simultaneous users. Supports the throughput density required for line-of-business applications where multiple teams or automated workflows submit requests continuously.
Conclusion
Vision language model deployments depend fundamentally on infrastructure readiness. Organizations must process multimodal requests with consistently low latency to support interactive applications where users expect immediate visual understanding. Traditional capacity planning approaches, such as relying on vendor specifications or limited pilot testing, often fail to reflect sustained production behavior. As a result, IT teams face a familiar tradeoff: overprovision accelerator resources to reduce risk, or underprovision systems and accept degraded user experience. What is needed instead is workload-driven insight that directly connects infrastructure choices to operational outcomes.
The Vision Language Model Demo addresses this requirement by providing a practical benchmarking framework on Dell PowerEdge platforms. Rather than focusing on isolated peak measurements, the solution evaluates real hardware under incremental load using InternVL 2 1B and a strict two-second latency threshold. A unified interface guides teams from configuration through execution and side-by-side comparison, while the layered architecture clearly separates orchestration, inference, and observability. By supporting CPU baseline, Intel AMX optimization, and GPU-accelerated scenarios within a consistent workflow, the demo enables infrastructure architects to compare acceleration strategies under identical test conditions. This approach transforms benchmarking from a purely technical exercise into a repeatable decision process that aligns performance, cost, and scalability with deployment requirements.
Measured performance on Dell PowerEdge R770 servers demonstrates how modern platforms translate directly into operational capacity. CPU baseline inference delivers 94.8 tokens per second at production quality. Enabling Intel AMX increases throughput to 105.5 tokens per second, providing incremental gains for single-request workloads without additional hardware investment. GPU acceleration extends performance substantially, reaching 4,131.1 tokens per second with 32 concurrent requests, a 43.6x improvement over baseline for environments that demand high throughput density.
The generational comparison further quantifies the limitations of legacy systems. The R740 fails to meet production latency thresholds even at minimal concurrency, while the GPU-accelerated R770 delivers approximately 275x faster multimodal processing. Together, these results give IT leaders and infrastructure architects clear, defensible data to right-size deployments, justify acceleration investments, and make informed deployment decisions.
With validated performance measurements, integrated observability, and a repeatable operational model, the Vision Language Model Demo provides a practical framework for evaluating multimodal AI workloads under production-representative conditions.
Addendum
Operating System Information
| Type | Details |
| --- | --- |
| Operating System | Ubuntu 24.04.3 LTS |
| Kernel | 6.8.0-90-generic |
| Driver Status | NVIDIA CUDA 13 |
Experiment Configuration
| Configuration | Details |
| --- | --- |
| Vision Language Model | InternVL 2 1B |
| Model Precision | BF16 (CPU) and FP16 (GPU) |
| Inference Engine | Intel OpenVINO (CPU) / vLLM (GPU) |
| Input Dataset | Image, 1844x892 px resolution |
| Latency Threshold | 2.0 seconds (target) |
| Scaling Mode | Auto |
Hardware Scenarios Tested
| Backend | Inference Engine | Version |
| --- | --- | --- |
| CPU Unoptimized | Intel OpenVINO with AMX disabled | 2025.4.1 |
| CPU Optimized with AMX | Intel OpenVINO with AMX enabled | 2025.4.1 |
| GPU Accelerated | vLLM | v0.14.1 |
References
Grand View Research, "Multimodal AI Market Size, Share & Trends Analysis Report By Component, By Data Modality, By End-use, By Enterprise Size, By Region, And Segment Forecasts, 2025 - 2030," Grand View Research, Inc., San Francisco, CA, USA, 2024. [Online]. Available: https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report
Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries.
DISCLAIMER: Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.
[1] Grand View Research. "Multimodal AI Market Size, Share & Trends Analysis Report, 2030." https://www.grandviewresearch.com/industry-analysis/multimodal-artificial-intelligence-ai-market-report
[2] The R740 was unable to meet the two-second latency threshold even at single-concurrency. The 14th Generation system lacks AMX support and was not configured with GPU acceleration.