Self-Healing 5G Core Networks: Multi-Agent AI on Dell PowerEdge XE7740 Rack Servers Powered by Intel® Gaudi® 3 AI Accelerator PCIe Cards

January 2026

Executive Summary

Network downtime now costs more than $300,000 per hour for 90% of mid-size and large enterprises[1], with telecommunications providers facing losses of up to $11,000 per minute per server[2]. The distributed architecture that makes 5G powerful also creates operational blind spots that prolong resolution times. Traditional Network Operations Centers (NOCs) rely on manual correlation of events across dozens of independent network functions, a process that can take hours and during which a single incident can cascade and affect thousands of subscribers. The median time to resolution for high-business-impact outages remains 51 minutes even with current monitoring tools, and 39% of organizations report resolution times exceeding one hour[3].

To address this, Dell Technologies, Intel, and Metrum AI have developed and deployed a multi-agent AI solution using Dell PowerEdge XE7740 rack servers with Intel Gaudi 3 PCIe AI accelerators. The solution enables multiple AI agents to collaborate in real time: monitoring signals, diagnosing issues, predicting failures, and coordinating corrective actions automatically.

The multi-agent solution delivers measurable improvements across key operational metrics. Detection latency drops below 5 seconds, with full remediation completing within 10 to 15 seconds. The system processes network data locally rather than routing it to distant cloud resources, allowing Intel Gaudi 3 accelerators to provide the compute required for real-time LLM inference. Every remediation action is documented for compliance review, giving NOC teams full visibility into agent reasoning and actions.

Benchmark testing validated the solution's scalability across expanding 5G network deployments:

  1. Telemetry Scale: The system processed 1,314 network events per minute when monitoring 10 network function clusters, demonstrating predictable scaling as 5G Core coverage expands.
  2. LLM Throughput: The Gaudi 3-accelerated multi-agent system delivered 5,522 tokens per second at 10 clusters, enabling concurrent root-cause investigations across larger network footprints.
  3. Predictable Overhead: Per-event LLM inference workload remained within a 95 to 284 token range across all configurations, allowing infrastructure teams to model deployment behavior with confidence.

For a network experiencing four high-severity incidents per month, this acceleration translates into potential annual savings exceeding $2 million in avoided downtime costs alone, excluding SLA penalties and customer churn. This solution enables a shift from reactive troubleshooting to proactive network management.

Figure 1 | Key Takeaway Summary
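One way to sanity-check the savings figure is to multiply incidents per year by the downtime hours each one avoids. The incident rate, the 51-minute median MTTR, and the $300,000/hour cost come from the statistics cited above; the assumption that nearly all avoided downtime converts directly into avoided cost is ours, which is why the paper's $2 million figure reads as a conservative floor.

```python
# Illustrative downtime-savings estimate. Inputs mirror the paper's cited
# statistics; the all-downtime-converts-to-cost assumption is ours.

def annual_downtime_savings(incidents_per_month: int,
                            mttr_minutes: float,
                            automated_mttr_minutes: float,
                            hourly_cost: float) -> float:
    """Downtime cost avoided per year when MTTR drops to automated levels."""
    incidents_per_year = incidents_per_month * 12
    hours_saved_per_incident = (mttr_minutes - automated_mttr_minutes) / 60
    return incidents_per_year * hours_saved_per_incident * hourly_cost

# Four high-severity incidents/month, 51-minute median MTTR,
# ~15-second automated remediation, $300,000/hour downtime cost:
savings = annual_downtime_savings(4, 51, 0.25, 300_000)
print(f"${savings:,.0f}")  # well above the paper's conservative $2M floor
```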

Table of Contents

Executive Summary

Introduction

The Operational Challenge

Solution Overview

Multi-Agent Solution Flow

Solution Architecture

Infrastructure Foundation

Benchmark Validation

Business Benefits and Impact

Summary

Addendum

Server Configuration

References

Introduction

Unlike previous generations, 5G employs a service-based architecture (SBA) composed of multiple independent network functions that communicate through standardized interfaces. Each network function serves a distinct purpose:

Network Function | Description
AMF | Access and Mobility Management Function - Handles registration, connection, and mobility
SMF | Session Management Function - Manages data sessions and quality of service
UPF | User Plane Function - Routes and forwards user data packets
AUSF | Authentication Server Function - Authenticates subscribers
UDM | Unified Data Management - Stores subscriber data and profiles
PCF | Policy Control Function - Manages policy rules and quality of service
NRF | Network Repository Function - Maintains service registry for network functions

Table 1 | Network Functions

The Operational Challenge

For operators, 5G network incidents translate to lost revenue, damaged customer relationships, and SLA penalties. The distributed architecture that makes 5G powerful also creates operational blind spots.

Consider a common scenario: a subscriber cannot connect. The issue could originate anywhere in the authentication chain, session management, or user plane. NOC engineers must manually correlate events across systems, query subscriber databases, inspect configurations, and trace signaling flows. This investigation often takes two to four hours, requiring coordination across three or more engineering teams. During that time, a single incident may cascade across network slices, potentially affecting tens of thousands of subscribers.

Traditional monitoring tools reveal symptoms at the individual network function level, such as elevated CPU usage or authentication errors, but cannot correlate events across network functions or identify root causes spanning multiple components. Even with current observability investments, this operational model cannot scale. As 5G networks grow more complex with network slicing, edge computing, and increasing subscriber density, the gap between incident detection and resolution will only widen without a fundamentally different approach.

Solution Overview

The operational challenges outlined above demand a different approach to network management. In response, we have developed a multi-agent AI solution that replaces manual event correlation with autonomous, real-time intelligence. The solution embeds specialized AI agents directly into the network control plane, enabling continuous monitoring, rapid diagnosis, and automated remediation across all 5G Core network functions.

Multi-Agent Solution Flow

Figure 2 | Multi-Agent Solution Flow

This solution uses a hierarchical multi-agent architecture supported by a unified data ingestion and analytics pipeline. Telemetry from the 5G core simulator, including logs, metrics, and time-series signals, is continuously ingested, embedded, and stored across vector and time-series databases, alongside structured and unstructured operational documentation. This pipeline normalizes real-time network state and historical knowledge into a common semantic layer, enabling correlation across control-plane events, performance metrics, and prior incidents. The agent functionalities are as follows:

Agent | Function
Operations Manager Agent | Serves as the central orchestrator, receiving user queries and delegating tasks to specialized agents. Maintains awareness of all ongoing investigations and coordinates responses.
NOC Analyst Agent | Continuously monitors logs and metrics from all network functions (AMF, SMF, UPF, AUSF, UDM, PCF). Detects anomalies and performs initial triage.
Core Network Engineer Agent | Specializes in investigating control plane issues. Analyzes the authentication chain, validates subscriber data integrity, and identifies root causes for signaling failures.
Capacity Planning Agent | Monitors resource utilization across UPF and SMF instances. Detects CPU and memory bottlenecks, analyzes traffic patterns, and implements remediation such as scaling resources.
Reporting Agent | Generates comprehensive incident reports, documenting investigation findings, actions taken, and recommendations. Facilitates integration with ticketing systems.

Table 2 | Agent Functionalities
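The delegation pattern behind these roles can be sketched in a few lines. This is not the actual Microsoft Agent Framework implementation; the agent names mirror the table above, but the routing rules and anomaly labels are illustrative.

```python
# Minimal sketch of the orchestrator's delegation pattern (illustrative
# routing rules, not the production agent framework).
from dataclasses import dataclass, field

@dataclass
class Task:
    anomaly: str                    # e.g. "auth_failure", "cpu_saturation"
    handled_by: list = field(default_factory=list)

ROUTING = {
    "auth_failure": "Core Network Engineer Agent",
    "cpu_saturation": "Capacity Planning Agent",
    "unknown": "NOC Analyst Agent",  # triage anything unclassified
}

def operations_manager(task: Task) -> Task:
    """Central orchestrator: delegate to a specialist, then always report."""
    specialist = ROUTING.get(task.anomaly, ROUTING["unknown"])
    task.handled_by.append(specialist)
    task.handled_by.append("Reporting Agent")  # every incident is documented
    return task

t = operations_manager(Task("cpu_saturation"))
print(t.handled_by)
```

Routing to the Reporting Agent unconditionally mirrors the paper's point that every remediation action is documented for compliance review.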

Solution Architecture

The solution is built on a distributed multi-agent architecture. Instead of relying on centralized monitoring or manual intervention, intelligence is embedded directly within the network operations layer.

Figure 3 | Solution Components

The architecture follows a layered design, separating concerns across the access network infrastructure, operations support systems, data processing, AI orchestration, and user interfaces.

Access Network Infrastructure Layer

Physical and virtual radio access network components generate telemetry data that flows into the operations support systems. This layer represents the live 5G network being monitored and managed.

OSS & Core Network Layer

The Operations Support System (OSS) and 5G Core network functions (AMF, SMF, UPF, AUSF, UDM, PCF, NRF) generate logs, metrics, and events. In development and testing environments, this layer is simulated with Open5GS, a full-featured open-source 5G Core implementation.

Data Pipeline Layer

Raw network function logs and metrics flow through a pipeline that extracts, transforms, and stores information in queryable formats.

  • Vector Data Pipeline collects events from each network function container and transforms them into structured data.
  • GreptimeDB Timeseries Database stores metrics and events with SQL queryability for efficient analysis.
  • Supabase PostgreSQL Database maintains structured data for metadata, agent state management, and session data.
  • PgVector extension over Supabase PostgreSQL Database enables semantic search across historical incidents and documentation.

This multi-database architecture ensures low-latency query responses during active investigations. Agents retrieve historical incidents, current metrics, and configuration data without waiting for cross-system joins or data transformations.
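The two query styles agents combine during an investigation can be sketched as follows. The table and column names (`nf_metrics`, `incidents`, `embedding`) are assumptions for illustration, not the solution's actual schema; executing them would require a database driver such as psycopg.

```python
# Sketch of the two query styles agents combine during an investigation.
# Schema names are illustrative assumptions.

def metrics_query(nf: str, minutes: int) -> str:
    """GreptimeDB time-series SQL: recent metrics for one network function."""
    return (
        "SELECT ts, cpu_pct, mem_pct FROM nf_metrics "
        f"WHERE nf = '{nf}' AND ts > now() - INTERVAL '{minutes} minutes' "
        "ORDER BY ts"
    )

def similar_incidents_query(k: int = 5) -> str:
    """pgvector semantic search: nearest historical incidents by embedding.
    '<->' is pgvector's L2-distance operator; %s is bound to the query
    embedding produced by the bge-large model."""
    return (
        "SELECT id, summary FROM incidents "
        f"ORDER BY embedding <-> %s LIMIT {k}"
    )

print(metrics_query("UPF", 15))
print(similar_incidents_query())
```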

AI Orchestration Layer

This layer implements the multi-agent framework with full observability:

  • Microsoft's Agent Framework provides the infrastructure for agent coordination and communication.
  • The Model Context Protocol (MCP) enables agents to access tools and services through a standardized interface.
  • The A2A Protocol facilitates inter-agent communication and task delegation.
  • Agent Services (microservices) provide specialized capabilities to individual agents.

LLM Inference Layer

The solution leverages advanced large language models running on Dell PowerEdge XE7740 servers with Intel Gaudi 3 AI accelerator PCIe cards. vLLM on Intel Gaudi 3 provides optimized model serving with high throughput. Qwen3-30B-A3B-Thinking serves as the reasoning model enabling agents to perform complex analyses. The bge-large embedding model generates vector embeddings for semantic search.
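Because vLLM exposes an OpenAI-compatible HTTP API, an agent's reasoning step reduces to a chat-completion request. The sketch below builds such a request; the endpoint URL and served model name are assumptions for a local deployment, and the network call itself is left commented out.

```python
# Sketch of one agent reasoning call against a vLLM OpenAI-compatible
# endpoint. URL and model name are deployment assumptions.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "Qwen3-30B-A3B-Thinking"                        # assumed served name

def build_request(system: str, user: str, max_tokens: int = 512) -> dict:
    """Payload for one agent reasoning step."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # keep diagnostic output near-deterministic
    }

payload = build_request(
    "You are a 5G Core NOC analyst.",
    "AMF reports a spike in registration rejects. Likely root causes?",
)

# Uncomment against a running vLLM server:
# req = urllib.request.Request(VLLM_URL, json.dumps(payload).encode(),
#                              {"Content-Type": "application/json"})
# print(json.loads(urllib.request.urlopen(req).read()))
print(json.dumps(payload)[:60])
```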

Hardware Optimization Stack

The entire solution runs on a highly optimized software stack. Intel Gaudi 3 Software Stack v1.22.0 provides the necessary drivers, runtime libraries, and tools to enable the Gaudi backend for high-performance AI inference. vLLM (Fork) v0.9.0.1 with Gaudi 1.22.0 is a specialized release of vLLM optimized for Intel Gaudi 3 PCIe AI accelerators. It includes HPU-specific enhancements such as custom operator implementations, tensor and pipeline parallel inference, HPU graph execution, and multi-node distributed inference capabilities. Ubuntu 22.04.5 LTS delivers a stable, secure operating system foundation.

Solution Control Plane

User-facing interfaces provide comprehensive visibility and control. The dashboard Web UI offers real-time visualization of network status and agent activities. The API Server enables programmatic access and integration with existing NOC tools.

These interfaces maintain transparency, showing agent reasoning, data queries, and decision-making processes in real-time.

Figure 4 | Interactive Web Dashboard

Real-Time Autonomous Response

The solution operates continuously, processing network telemetry in real-time. When issues are detected, the multi-agent solution executes a comprehensive response workflow:

  1. Detection: The NOC Analyst identifies anomalies through log analysis and metric monitoring.
  2. Investigation: Specialized agents perform deep analysis, querying databases and examining configurations.
  3. Root Cause Analysis: Agents correlate findings across network functions to identify underlying causes.
  4. Remediation: The appropriate agent implements fixes, such as scaling resources or correcting configurations.
  5. Documentation: The solution generates a complete incident report with root-cause analysis.

This entire process occurs autonomously in seconds, eliminating the need for manual intervention in routine incidents while keeping human operators informed through transparent, explainable AI-driven decision-making.
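The five-stage loop above can be sketched as a simple pipeline in which each stage is a plain function, so an auditable trail of stage-to-finding mappings falls out naturally. The findings below are illustrative stand-ins, not output from the real agents.

```python
# The five-stage response workflow as a sketch; findings are illustrative.

def detect(event):      return f"anomaly in {event['nf']}: {event['symptom']}"
def investigate(event): return "queried GreptimeDB metrics and recent config"
def root_cause(event):  return "correlated AMF/AUSF logs -> credential drift"
def remediate(event):   return "triggered UDM/AUSF re-synchronization"
def document(event):    return "incident report generated"

PIPELINE = [("Detection", detect), ("Investigation", investigate),
            ("Root Cause Analysis", root_cause),
            ("Remediation", remediate), ("Documentation", document)]

def respond(event: dict) -> dict:
    """Run all stages in order and keep an auditable stage -> finding trail."""
    return {stage: fn(event) for stage, fn in PIPELINE}

trail = respond({"nf": "AMF", "symptom": "registration rejects"})
for stage, finding in trail.items():
    print(f"{stage}: {finding}")
```

Keeping the trail as explicit data is what lets the dashboard show agent reasoning transparently rather than only final outcomes.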

Real-Time Remediation Workflows

The following scenarios validate the multi-agent framework's autonomous capabilities against representative telecommunications challenges. Each demonstrates how specialized agents collaborate to resolve issues faster than human operators can respond.

Scenario 1: UPF Performance Degradation

Figure 5 | Autonomous UPF Remediation Workflow

Resource constraints in User Plane Functions traditionally manifest as subscriber complaints: slow data speeds, video buffering, and connection timeouts. By the time traditional monitoring tools detect elevated CPU utilization, thousands of users may already be experiencing degraded service. A single UPF bottleneck can degrade service for tens of thousands of subscribers within minutes.

In simulation, the multi-agent solution transforms this scenario through predictive intervention. AI agents continuously analyze resource utilization trends, session load patterns, and historical performance baselines. The NOC Analyst agent detects CPU utilization trending toward 80% capacity based on session load velocity. The Core Network Engineer agent correlates session growth data with UPF capacity, identifying that the current replica count would become insufficient within 10 minutes. The Capacity Planning agent executes horizontal scaling, expanding UPF replicas from two to four, completing the action before any simulated performance degradation occurs.
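The predictive step in this scenario amounts to extrapolating the CPU trend and scaling out before the threshold is crossed. The sketch below uses simple linear extrapolation; the 80% threshold, 10-minute horizon, and two-to-four replica change mirror the scenario, while the sample values are illustrative.

```python
# Sketch of Scenario 1's predictive scaling decision. Thresholds mirror the
# scenario text; the CPU samples are illustrative.

def minutes_until(samples, threshold=80.0):
    """Linear extrapolation over (minute, cpu_pct) samples, using the first
    and last points; returns None if utilization is flat or falling."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    slope = (c1 - c0) / (t1 - t0)  # percentage points per minute
    if c1 >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - c1) / slope

def plan_replicas(samples, current=2, horizon_min=10):
    """Double UPF replicas if the threshold will be hit within the horizon."""
    eta = minutes_until(samples)
    return current * 2 if eta is not None and eta <= horizon_min else current

# CPU climbing ~1.2 points/min: 62% -> 74% over ten minutes
samples = [(0, 62.0), (5, 68.0), (10, 74.0)]
print(minutes_until(samples))  # 5.0 minutes until 80%
print(plan_replicas(samples))  # scales 2 -> 4
```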

Scenario 2: User Equipment Registration Failure 

Figure 6 | Autonomous Authentication Chain Remediation Workflow

Registration failures represent some of the most damaging incidents in mobile network operations. Subscribers immediately see "No Service" messages or experience repeated connection drops despite strong signal strength. Each minute of widespread failure generates support calls, damages customer satisfaction scores, and triggers SLA penalty clauses. Traditional investigation requires manual correlation across AMF, AUSF, and UDM logs, typically requiring two to four hours and coordination among multiple engineering teams.

The multi-agent solution delivers rapid response when active failures affect subscribers. Within seconds of the first failed registration attempt, the NOC Analyst agent detects abnormal failure rates across a subscriber segment. The Core Network Engineer agent initiates concurrent log analysis across AMF, AUSF, and UDM, isolating credential synchronization errors as the root cause. The agent immediately triggers re-synchronization between UDM and AUSF, restoring subscriber authentication network-wide. Subscribers who saw "No Service" messages regain connectivity without manual intervention or device restarts.

Infrastructure Foundation

The Dell PowerEdge XE7740 rack server delivers the compute density and reliability that telecommunications operators require for AI-driven network management. The platform supports up to four Intel Gaudi 3 PCIe accelerators in a single chassis. The platform delivers high-bandwidth connectivity through eight PCIe Gen 5 x16 slots with a 1:1 accelerator-to-NIC ratio, reducing I/O constraints during telemetry processing. Hot-swap redundant power supplies (N+N configuration across CPU and GPU zones) and Smart Cooling technology with high-performance fans support continuous operation in demanding data center environments. This configuration processes network telemetry locally rather than routing sensitive data to external cloud resources, giving infrastructure teams full control over data residency and security.

Intel Gaudi 3 PCIe accelerators provide the inference capacity for multi-agent reasoning at network scale. The 600W thermal design power enables deployment in standard air-cooled environments without requiring infrastructure retrofits. Operators can start with a single accelerator for pilot deployments and add capacity as network coverage expands, scaling investment alongside actual network growth and operational needs.

Benchmark Validation

The Dell PowerEdge XE7740 platform with Intel Gaudi 3 PCIe AI accelerators delivers measurable performance benefits. In simulation, we achieved:

  • Detection Latency: Sub-5-second identification of anomalies from log and metric streams.
  • Investigation Time: 15 to 20-second deep analysis including database queries and configuration checks.
  • Remediation Speed: Automated fixes implemented within 10 to 15 seconds of detection.
  • Concurrent Incidents: Parallel processing of multiple, unrelated issues simultaneously.
  • Telemetry Processing: Over 1,300 network events per minute at 10-cluster scale, demonstrating predictable scalability with network expansion.
  • LLM Inference Capacity: Over 5,500 tokens per second, enabling concurrent root-cause investigations across larger network footprints.

These results, obtained by monitoring each cluster configuration for an average of 30 minutes, confirm that telemetry processing and LLM-driven automation on the Dell PowerEdge XE7740 rack server with Intel Gaudi 3 scale predictably with network size. Telemetry ingestion grows with the number of monitored network-function clusters, and LLM throughput increases at a rate that supports more concurrent root-cause investigations. At the same time, the per-event reasoning cost remains stable, and the per-function monitoring load decreases as telemetry is spread across a larger set of network functions. Capacity can be extended further by adding Intel Gaudi 3 PCIe accelerators, supporting networks from pilot deployments up to the regional scale tested in this benchmark.

Benchmark Results: Scalability Across 5G Network Deployments

Figure 7 | Telemetry ingestion scales predictably as 5G Core coverage expands. The solution processed 393 events per minute with a single cluster and 1,314 events per minute with 10 clusters, a 3.3x throughput increase across a 10x network expansion.

At baseline (one network-function cluster with four NFs), the solution ingested 393 events per minute. Expanding to five clusters (20 NFs) increased throughput to 805.5 events per minute. At a maximum tested scale of 10 clusters (40 NFs), the solution processed 1,314 events per minute. This 3.3x throughput increase relative to a 10x network expansion demonstrates efficient resource utilization. A single Dell PowerEdge XE7740 deployment can support network growth from pilot programs through regional rollouts without requiring architecture changes or additional infrastructure investment.

Figure 8 | LLM throughput grows proportionally with network scope. The Gaudi 3-accelerated multi-agent solution delivered 622 tokens per second at baseline and 5,522 tokens per second at 10 clusters, enabling concurrent root-cause investigations across larger network footprints.

Real-time network management requires AI reasoning capacity that matches monitoring scope. At the baseline configuration, the solution delivered 622.82 tokens per second of LLM inference throughput. This capacity expanded to 3,815.74 tokens per second at five clusters and reached 5,521.86 tokens per second at the maximum tested scale. The 8.9x throughput increase enables the Operations Manager, NOC Analyst, Core Network Engineer, and Capacity Planning agents to conduct parallel investigations without queuing delays. NOC teams gain simultaneous incident response across multiple network segments rather than relying on sequential processing that extends mean time to resolution.

Figure 9 | Per-function monitoring load decreases as network functions increase. The solution distributed 98 events per minute per NF at baseline, reducing to 33 events per minute per NF at 10 clusters, demonstrating efficient load balancing across larger deployments.

Effective monitoring systems must distribute analytical workload efficiently as networks expand. At baseline (4 NFs), each function generated approximately 98 events per minute for the AI agents to process. At five clusters (20 NFs), this load dropped to 40 events per minute per function. At maximum scale (40 NFs), the per-function load decreased further to 33 events per minute. This declining ratio indicates that larger deployments achieve better efficiency as the fixed overhead of agent coordination distributes across more network functions. Infrastructure Architects can leverage this characteristic when planning phased rollouts, knowing that each expansion phase improves overall system efficiency rather than degrading it.
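The declining per-function figures follow directly from dividing the reported aggregate throughput by the NF count, a check infrastructure teams can reproduce when sizing phased rollouts:

```python
# Per-NF monitoring load reproduced from the reported aggregates.
events_per_min = {4: 393.0, 20: 805.5, 40: 1314.0}  # NF count -> events/min

for nfs, total in events_per_min.items():
    print(f"{nfs} NFs: {round(total / nfs)} events/min per NF")
```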

Business Benefits and Impact

The combination of Dell PowerEdge infrastructure, Intel Gaudi 3 PCIe AI accelerators, and multi-agent AI delivers benefits that directly address the challenges of modern telecommunications operations:

Operational Efficiency

The autonomous multi-agent solution transforms how NOC teams spend their time. End-to-end workflows complete in seconds under benchmark conditions, representing an order-of-magnitude reduction compared to traditional manual processes. The solution continuously ingests telemetry and executes investigation workflows without requiring human intervention for routine incidents. By automating detection, investigation, and initial remediation, the solution reduces the volume of repetitive tasks handled by human operators, enabling staff to focus on network optimization and strategic initiatives rather than on troubleshooting. Automated, policy-driven execution ensures consistent incident handling and eliminates variability associated with manual analysis and operator fatigue.

Figure 10 | Conceptual comparison of incident lifecycle stages for a traditional NOC workflow and an autonomous multi-agent solution on Gaudi 3. Autonomous response totals are informed by observed end-to-end workflow completion times, while per-stage durations and traditional NOC timings are illustrative and reflect typical operational practices rather than direct measurements. A logarithmic y-axis highlights the order-of-magnitude difference in Mean Time to Resolution.

Service Quality and Subscriber Experience

Faster incident resolution translates directly into improved subscriber experience. Issues are resolved before most subscribers notice service degradation. Proactive detection prevents small anomalies from cascading into widespread outages. Automated capacity management maintains optimal resource allocation during traffic fluctuations. The cumulative effect is reduced service interruptions, improved network reliability, and higher subscriber satisfaction scores.

Cost Reduction

The solution reduces costs across multiple areas:

  • Automation handles routine incidents without manual intervention, reducing NOC staffing requirements for repetitive tasks.
  • Faster resolution minimizes SLA violations and associated service credits.
  • AI-driven capacity planning right-sizes infrastructure, reducing over-provisioning costs.
  • Better service quality improves customer retention and reduces churn-related revenue loss.

Figure 11 | Business Benefits Summary Table 

Summary

Traditional 5G network operations cannot keep pace with the complexity of distributed, service-based architectures. Manual correlation of events across dozens of independent network functions extends resolution times to hours, during which a single incident can cascade and affect thousands of subscribers. This operational model does not scale as networks grow more complex with network slicing, edge computing, and increasing subscriber density.

The multi-agent AI solution presented in this paper offers a fundamentally different approach. Specialized AI agents collaborate to detect anomalies, investigate root causes across network functions, and implement corrective actions autonomously. Dell PowerEdge XE7740 servers with Intel Gaudi 3 AI accelerators deliver the inference capacity for real-time LLM reasoning, reducing Mean Time to Resolution from hours to seconds. NOC engineers shift from reactive firefighting to strategic network optimization.

Benchmark testing validates that this approach scales predictably as 5G coverage expands. At 10 network function clusters, the solution processed over 1,300 events per minute while delivering 5,500 tokens per second of LLM inference throughput. Per-event LLM workload remained within a 95 to 284 token range across all configurations, giving infrastructure teams confidence to model deployment behavior accurately.

For operators seeking to improve service quality while managing rapid subscriber growth, AI-driven root cause remediation offers a proven path forward. Benchmark results confirm the solution scales predictably, enabling telecommunications teams to expand autonomous network management from pilot deployments through regional rollouts without architectural changes.

To learn more, please request access to our reference implementation by contacting us at contact@metrum.ai.

Addendum

Server Configuration

Server | Dell PowerEdge XE7740 Rack Server
Accelerators | 4x Intel Gaudi 3 AI Accelerator PCIe Card (HL-338)
CPU | 2x Intel Xeon 6787P Processor
Memory | 2 TB
Accelerators Count | 4
OS | Ubuntu 22.04.5 LTS
Embedded NIC | Broadcom BCM5720 2P Dual-Port Gigabit Ethernet Adapter

References

[1] Information Technology Intelligence Consulting (ITIC), "2024 Hourly Cost of Downtime Survey," March 2024. Available: https://itic-corp.com/itic-2024-hourly-cost-of-downtime-report/

[2] Information Technology Intelligence Consulting (ITIC), "2024 Hourly Cost of Downtime Survey Part 2," March 2024. Available: https://itic-corp.com/itic-2024-hourly-cost-of-downtime-part-2/

[3] New Relic, "2024 Observability Forecast: Outages, Downtime and Cost," 2024. Available: https://newrelic.com/resources/report/observability-forecast/2024/state-of-observability/outages-downtime-cost

[4] Uptime Institute, "2024 Data Center Resiliency Survey," 2024. Referenced in: The Network Installers, "Cost of IT Downtime Statistics, Data & Trends," January 2026. Available: https://thenetworkinstallers.com/blog/cost-of-it-downtime-statistics/

Dell Images: © Dell Technologies Inc. - Dell PowerEdge XE7740 Rack Server. Image source: Dell DAM via Dell.com

Intel Images: © Intel Corporation - Intel Gaudi 3 AI Accelerator PCIe Card. Image source: Intel.com


Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, Gaudi, Xeon and related marks are trademarks of Intel Corporation. Broadcom is a trademark of Broadcom Inc. All other product names mentioned are the trademarks of their respective owners.

***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.

