AI-Powered Multi-Agent Infrastructure Monitoring for Telecom Networks on Dell PowerEdge™ XE9785L with AMD Instinct™ MI355X GPUs
February 2026
| Executive Summary
Mobile network data traffic is projected to reach several hundred exabytes per month by 2030, driven by an estimated 30+ billion connected devices generating both human and machine data.[1] As this growth accelerates, telecom operators face a widening gap between the speed of equipment failures and the pace of manual troubleshooting. Cascading faults across tens of thousands of geographically dispersed Remote Radio Heads (RRHs) can escalate to widespread outages before operations teams identify a root cause. This paper presents a Multi-Agent Infrastructure Monitoring solution, built on Dell PowerEdge™ XE9785L servers with AMD Instinct™ MI355X accelerators, that closes that gap through autonomous, on-premises AI.
The platform ingests system logs from Baseband Units (BBUs), RRHs, and Operations Support Systems (OSS) in real time. A coordinated suite of specialized AI agents, powered by the Qwen3-235B reasoning model, continuously detects anomalies, determines root causes, and initiates remediation, replacing multi-hour manual troubleshooting cycles with sub-two-minute automated resolution.
Key Results at a Glance
Metric | Result |
Single-Incident Remediation | Detection to resolution in under 90 seconds, compared to multi-hour manual cycles |
Faster Event Processing | 1.5x to 2x, MI355X vs. MI300X across tested configurations (3 to 30 BBUs) |
Combined GPU Memory | 2.3 TB on a single 8-GPU node, no multi-node sharding |
Generational Throughput Gain | Up to 1.6x, MI355X vs. MI300X inference tokens per second at matched concurrency |
Peak Inference Throughput | 7,773 tokens/sec at the maximum tested configuration (30 BBUs / 90 RRHs) |
Event Processing at Scale | 2,253 events/min sustained at the maximum tested configuration |
Table of Contents
The Telecom Infrastructure Challenge
Open Ecosystem Foundation: AMD ROCm
Model Provisioning and Lifecycle Management
Server Health and Uptime Management
Example Scenario: Fronthaul Link Failure
ROCm and the vLLM Inference Pipeline
Operational Ecosystem: From Deployment to Production
Per GPU Breakdown at Peak Load
Generational Comparison: MI355X vs. MI300X
Limitations and Considerations
Addendum: Key Concepts for IT Decision Makers
What is RAG, and why is it critical for enterprises?
Why is Dell PowerEdge XE9785L with AMD Instinct MI355X well-suited for RAG solutions?
What are Multi-Specialist Agents and Multi-Agent Frameworks?
| The Telecom Infrastructure Challenge
Figure 1 | The Telecom Challenge
Telecom operators manage increasingly complex infrastructure under intensifying pressure as 5G deployments accelerate and network scope expands. In a typical Cloud Radio Access Network (C-RAN) architecture, RRHs handle radio frequency processing at cell sites while BBUs provide centralized baseband signal processing. OSS platforms integrate data from both components to monitor network health and maintain Quality of Service (QoS) requirements. When equipment fails or performance degrades, operations teams must rapidly identify affected components, determine root causes, and execute corrective actions before service quality suffers. Industry analyses estimate that unplanned network downtime costs Tier 1 operators upward of $100,000 per hour in lost revenue and customer churn.[2]
C-RAN Network Primer RRH (Remote Radio Head): Distributed across multiple cell sites to perform radio signal transmission and reception. Handles radio frequency (RF) processing at the network edge. BBU (Baseband Unit): Consolidated within a centralized processing pool to deliver high-performance baseband computation. Handles digital signal processing. OSS (Operations Support Systems): Software tools that analyze and manage the telecommunications network, including network monitoring, fault management, and performance optimization. In a typical network, RRHs connect to the BBU pool through fronthaul links. BBUs can be dynamically assigned to serve clusters of RRHs in a many-to-one configuration. |
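The many-to-one RRH-to-BBU relationship described in the primer can be sketched as a simple data model. This is an illustrative stand-in (the class and field names are hypothetical, not part of the solution's codebase), showing the 1:3 BBU-to-RRH ratio used throughout the paper's test configurations:

```python
from dataclasses import dataclass, field

@dataclass
class RRH:
    """Remote Radio Head: RF processing at a cell site."""
    rrh_id: str
    site: str

@dataclass
class BBU:
    """Baseband Unit: serves a cluster of RRHs over fronthaul links."""
    bbu_id: str
    rrhs: list = field(default_factory=list)

    def attach(self, rrh: RRH) -> None:
        # Many-to-one: several RRHs share one BBU in the centralized pool.
        self.rrhs.append(rrh)

# A 1-BBU / 3-RRH slice of the 30-BBU / 90-RRH maximum test topology.
bbu = BBU("BBU-12")
for i in (46, 47, 48):
    bbu.attach(RRH(f"RRH-{i}", site=f"site-{i}"))

print(len(bbu.rrhs))  # 3 RRHs per BBU, matching the tested 1:3 ratio
```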
Legacy monitoring approaches often fail to keep pace with the scale and velocity of modern networks. Manual log analysis cannot process the sheer volume of telemetry data generated by thousands of distributed network elements. Batch processing systems introduce delays that allow issues to escalate before detection. Siloed monitoring tools make it difficult to correlate events across BBUs, RRHs, and OSS platforms. These operational gaps carry measurable consequences: extended mean time to resolution, increased service disruptions, and elevated operational costs associated with dispatching field technicians for issues that could be diagnosed remotely.
The table below highlights the primary operational challenges that prevent network operations teams from achieving continuous situational awareness.
Challenge | Current State | Business Impact |
Delayed Detection | Batch systems and manual log reviews run on multi-hour cycles | Equipment failures escalate to widespread outages before identification |
Cascading Failures | Single-point failures propagate across interconnected base stations | Localized issues become network-wide service disruptions |
Manual Root Cause Analysis | Engineers manually correlate logs across BBU, RRH, and OSS systems | Extended mean time to resolution increases customer impact |
Signal Degradation | Environmental factors and misconfigurations cause gradual performance decline | QoS deteriorates before thresholds trigger alerts |
Geographic Distribution | Tens of thousands of RRHs deployed across dispersed locations | Dispatching field technicians for remote diagnostics adds cost and delays resolution. |
Table 1 | Operational Challenges and Business Impact
Addressing these challenges requires a platform purpose-built for continuous, automated network intelligence. The following sections describe how the Multi-Agent Infrastructure Monitoring solution transforms fragmented troubleshooting workflows into a unified, real-time operations capability.
| Solution Overview
At the core of this solution, a coordinated team of specialized AI agents autonomously monitors network health, detects anomalies, and executes remediation actions. When new telemetry data arrives from BBUs, RRHs, or OSS platforms, the system uses large language models to analyze log patterns, classify severity levels, and identify affected components. Specialized agents then collaborate to determine root causes and initiate or recommend corrective actions aligned with the specific issue type.
Agent | Function | Operation Mode |
Operations Manager Agent | Delegates tasks to domain-specific agents based on issue classification | Continuous |
NOC Analyst Agent | Monitors BBU and RRH logs for anomalies and forwards issues to Operations Manager | Continuous |
Communication Link Monitor | Resolves communication link failures between BBU and RRH components | Event-triggered |
Synchronization Specialist | Resolves timing and synchronization issues in telecom infrastructure | Event-triggered |
Hardware Health Agent | Resolves hardware failure issues affecting BBU and RRH equipment | Event-triggered |
Reporting Agent | Generates incident reports and executive summaries for operations review | User-initiated |
Table 2 | Multi-Agent Functions
Each agent maps to a specific operational gap identified in Table 1, ensuring that no category of network incident goes unaddressed.
Solution Flow
Figure 2 | Solution Flow
The platform operates through four integrated stages, each running continuously within the telecom operator's infrastructure. In the first stage, the Vector Data Pipeline ingests telemetry from multiple source types: system logs from BBUs and RRHs; event logs and QoS metrics from OSS platforms; and operational documents containing configuration and procedural information. A Kafka-based event streamer normalizes these inputs and routes them to the processing layer.
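The normalization step in the first stage can be sketched as follows. The log format below is a hypothetical example (real BBU/RRH log schemas are vendor-specific), and the function is an illustrative stand-in for the transformation the Kafka-based streamer applies before routing events onward:

```python
import json
import re

# Hypothetical raw-log shape; production formats vary by equipment vendor.
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<source>BBU-\d+|RRH-\d+|OSS)\s+"
    r"(?P<severity>INFO|WARNING|ERROR)\s+(?P<message>.*)"
)

def normalize(raw: str) -> dict:
    """Normalize a raw log line into the event schema routed to the processing layer."""
    m = LOG_PATTERN.match(raw)
    if m is None:
        # Unparseable lines are preserved rather than dropped, so no telemetry is lost.
        return {"severity": "UNPARSED", "raw": raw}
    return m.groupdict()

event = normalize("2026-02-14T02:14:03Z BBU-12 ERROR fronthaul link to RRH-47 down")
print(json.dumps(event))
```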
The second stage transforms raw data into searchable knowledge. The bge-large-en text embedding model converts log entries and documents into vector representations stored in PgVector (a vector similarity search extension for PostgreSQL) for semantic search. Time-series telemetry flows to GreptimeDB (an open-source distributed time-series database) for temporal analysis. This dual-database approach enables agents to query both semantic similarity (finding related past incidents) and temporal patterns (identifying performance trends).
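In production, the semantic-similarity lookup runs inside PostgreSQL via PgVector's distance operators over bge-large-en's 1024-dimensional embeddings. The toy sketch below uses 3-dimensional stand-in vectors and a plain cosine similarity to show what such a query computes; the incident names and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d stand-ins for bge-large-en's 1024-d embeddings.
incident_index = {
    "INC-101 firmware mismatch on RRH": [0.9, 0.1, 0.0],
    "INC-214 GPS sync loss on BBU":     [0.1, 0.9, 0.1],
    "INC-330 PSU degraded on BBU":      [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # embedding of the new ERROR log
best = max(incident_index, key=lambda k: cosine(query, incident_index[k]))
print(best)  # → INC-101 firmware mismatch on RRH
```

The dual-database split matters here: this kind of nearest-neighbor lookup answers "what does this incident resemble?", while GreptimeDB's time-series queries answer "how has this metric trended?".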
The third stage executes the agent workflows. When the NOC Analyst Agent detects a warning or error condition in incoming logs, it forwards the issue to the Operations Manager Agent. The Operations Manager Agent classifies the problem type and delegates to the appropriate specialist. The Communication Link Monitor handles connectivity failures, the Synchronization Specialist addresses timing issues, and the Hardware Health Agent manages equipment malfunctions. Each specialist agent queries the vector database for similar past incidents and uses the Qwen3-235B reasoning model to determine root cause and corresponding corrective actions.
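The delegation logic in this stage can be sketched as a routing table mirroring Table 2. In the deployed system, classification is performed by the Qwen3-235B reasoning model rather than keyword rules; the keyword matching below is a simplified stand-in, and the function names are hypothetical:

```python
# Routing table mirroring Table 2's event-triggered specialists.
SPECIALISTS = {
    "connectivity":    "Communication Link Monitor",
    "synchronization": "Synchronization Specialist",
    "hardware":        "Hardware Health Agent",
}

def delegate(event: dict) -> str:
    """Operations Manager: classify the issue type, then hand off to the specialist."""
    msg = event["message"].lower()
    if "link" in msg or "fronthaul" in msg:
        issue = "connectivity"
    elif "sync" in msg or "timing" in msg:
        issue = "synchronization"
    else:
        issue = "hardware"
    return SPECIALISTS[issue]

print(delegate({"message": "fronthaul link to RRH-47 down"}))
# → Communication Link Monitor
```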
Dashboard Capabilities
Figure 3 | Unified Command Center Dashboard
The final stage delivers actionable intelligence through a unified dashboard. The command center interface provides interactive Leaflet maps with geographic network visualization, real-time event streams, an agent chat interface, and GPU cluster monitoring within a single dashboard. Key components include:
- Base Station Status: Network operations staff see color-coded health indicators at a glance, eliminating the need to parse raw log files during active incidents. Green indicates normal operations, orange signals early warnings where agents are monitoring for potential degradation, and red indicates active issues that trigger automated resolution workflows.
- Global QoS Metrics: Operations teams monitor live performance without switching between multiple tools. The dashboard streams block rate, data drop, and call drop percentages alongside downlink and uplink throughput levels. Agents automatically correlate anomalies with underlying network or hardware events.
- Event Streams and Active Agents: Supervisors track workload distribution across the agent team in real time. The panel displays event counts from BBU, RRH, and OSS logs alongside active agent workflows, processed incidents, and monitored devices.
- Issue Alerts Panel: Engineers drill down from high-level alerts to root cause analysis without context switching. The interactive incident list maps affected RRHs, BBUs, and base stations with visual indicators for ongoing, resolved, or escalated issues.
- Report Generation: Compliance and operations teams export documentation for audits and post-incident reviews. Specific Incident Reports provide detailed root cause analysis (RCA) with corrective actions and affected components. Executive Summary Reports aggregate insights across all base stations. Both formats export as JSON or PDF.
Example Scenario: Fronthaul Link Failure
To illustrate how these components work together, consider a fronthaul link failure scenario. At 2:14 AM, the communication link between BBU-12 and RRH-47 fails due to a firmware mismatch introduced during a routine update. Within seconds, the NOC Analyst Agent detects the ERROR log in the incoming telemetry stream and forwards it to the Operations Manager Agent. The Operations Manager classifies the issue as a connectivity failure and delegates to the Communication Link Monitor.
The Communication Link Monitor queries the vector database for similar past incidents and identifies three prior cases where firmware version mismatches caused identical symptoms. Using the Qwen3-235B reasoning model, the agent confirms the root cause, determines that RRH-47 requires a firmware rollback, and initiates an automated link restart sequence. The entire detection-to-remediation cycle completes in less than 90 seconds.
This 90-second cycle represents a single-incident response under moderate load. Under production conditions with concurrent monitoring of 30 BBUs and 90 RRHs, average workflow durations range from 102 to 463 seconds depending on issue complexity, as detailed in the Performance Benchmarking section.
Without the multi-agent system, this incident would typically wait for the next batch processing cycle, potentially leaving customers without service for hours. Field technicians might need to be dispatched, adding cost and further extending resolution time. With the multi-agent system, the autonomous workflow transforms a multi-hour outage into a sub-two-minute remediation event. The Reporting Agent logs the complete evidence chain, and the dashboard updates to show RRH-47 returning to green status.
| Solution Architecture
The architectural decisions behind the Multi-Agent Infrastructure Monitoring solution reflect a fundamental requirement: every component must deliver the throughput needed for real-time network intelligence while operating within the telecom operator's secure infrastructure. The platform combines optimized inference runtimes and a modular software stack designed for continuous operation at telecom-grade reliability.
Figure 4 | Solution Architecture
Software Stack
The software architecture layers optimized runtimes atop the AMD Radeon Open Compute™ platform (ROCm™) 7.0. vLLM provides the inference runtime, delivering high-throughput token generation with continuous batching and PagedAttention memory management. This combination enables the solution to serve multiple concurrent agent requests while maintaining consistent, predictable latency for time-sensitive incident response.
Model Provisioning and Lifecycle Management
Production AI model deployment in telecom environments demands a streamlined path from selection through validation to runtime serving. This solution provisions the Qwen3-235B reasoning model and bge-large-en embedding model through Dell Enterprise Hub, integrated with the Hugging Face model repository. Dell Enterprise Hub provides pre-validated configurations optimized for Dell PowerEdge servers with AMD Instinct accelerators, ensuring compatibility across the hardware and software stack. The operational benefits of this approach are detailed in the Operational Ecosystem section.
For telecom operators managing multiple NOC sites, this centralized model management approach ensures consistency across deployments: every site runs the same validated model version with the same serving configuration, reducing the risk of inconsistent agent behavior across the network.
Layer | Component | Function |
Hardware Optimization | AMD ROCm 7.0 | GPU compute and memory management |
Inference Runtime | vLLM v0.10.1 | High-throughput model serving with continuous batching |
Agent Framework | AutoGen (Microsoft) | Multi-agent orchestration and asynchronous task execution |
Agent Communication | MCP + A2A Protocol | Context sharing and inter-agent coordination |
Reasoning Model | Qwen3-235B-A22B-Thinking | Root cause analysis and corrective action determination |
Embedding Model | bge-large-en | Text embeddings for semantic search and similarity matching |
Vector Database | PgVector + PostgreSQL | Vector storage, similarity search, and metadata management |
Time-Series Database | GreptimeDB | High-frequency telemetry and event data storage |
Event Streaming | Vector Data Pipeline | Real-time ingestion, transformation, and routing |
Table 3 | Software Stack Components
Agent Orchestration
The AutoGen framework (Microsoft's open-source multi-agent orchestration framework) coordinates the specialized agents that execute the monitoring and remediation workflows. The Model Context Protocol (MCP) acts as a standardized interface for context sharing across agents, while the Agent-to-Agent (A2A) protocol ensures secure, structured communication between microservices. Together, these components form the control plane that handles request routing, inter-agent orchestration, and operational telemetry for distributed AI workflows with complete traceability and observability.
Each agent accesses shared resources through well-defined interfaces: the vector database for semantic search across historical incidents, GreptimeDB for time-series telemetry analysis, and the model inference endpoints for AI-powered reasoning. This modular architecture enables independent scaling of individual components, such as adding embedding model replicas during peak event volumes, without disrupting active monitoring workflows. On a single server, the architecture processes thousands of network messages and log streams per day from hundreds of network elements, while supporting horizontal scaling for larger deployments.
The orchestration layer also handles agent failure recovery. If a specialist agent encounters an error during root cause analysis, the Operations Manager Agent reassigns the task to a backup agent or escalates to the reporting layer with a partial analysis. This fault-tolerant design ensures that a single agent failure does not leave a network incident unaddressed. All inter-agent messages and task handoffs are logged, providing full traceability for post-incident audit and compliance review.
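In the deployed system, AutoGen provides this orchestration; the stdlib sketch below illustrates only the dispatch-with-fallback pattern the text describes, with hypothetical agent functions standing in for real specialists, and every handoff logged for traceability:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orchestrator")

class SpecialistError(RuntimeError):
    """Raised when a specialist agent fails mid-analysis."""

def flaky_primary(event):
    raise SpecialistError("RCA model call timed out")

def backup(event):
    return f"partial RCA for {event['id']}: firmware mismatch suspected"

def dispatch(event, primary, fallback):
    """Log every handoff; reassign to a backup specialist if the primary fails."""
    log.info("handoff: OperationsManager -> primary (%s)", event["id"])
    try:
        return primary(event)
    except SpecialistError as exc:
        # Fault tolerance: a single agent failure never leaves an incident unaddressed.
        log.info("primary failed (%s); reassigning to backup", exc)
        return fallback(event)

result = dispatch({"id": "INC-442"}, flaky_primary, backup)
print(result)
```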
Server Health and Uptime Management
A platform responsible for continuous network monitoring must itself maintain continuous availability. The Integrated Dell Remote Access Controller (iDRAC) provides out-of-band server management that operates independently of the host operating system and application stack. This independence is critical: if a software fault or GPU driver issue affects the monitoring application, iDRAC remains accessible for diagnostics and recovery.
iDRAC monitors the physical health of the Dell PowerEdge XE9785L server, including CPU and GPU temperatures, power supply status, fan speeds, memory integrity, and storage health. For the Multi-Agent Infrastructure Monitoring solution, this hardware-level visibility serves two functions.
First, iDRAC provides proactive alerting for server-side issues that could degrade monitoring performance. If a GPU accelerator begins exhibiting elevated temperatures or a power supply enters a degraded state, iDRAC generates alerts through standard protocols (SNMP, Redfish, email) before performance or agent responsiveness is affected. Operations teams can schedule maintenance during planned windows rather than responding to unplanned outages.
Second, iDRAC enables remote management for geographically distributed deployments. Telecom operators deploying the monitoring solution across multiple data centers or central offices can use iDRAC's remote console, firmware update, and power management capabilities without dispatching technicians. iDRAC's Redfish API also enables programmatic integration with existing IT service management (ITSM) platforms, providing a unified view of both the telecom infrastructure being monitored and the AI infrastructure performing the monitoring.
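In practice, ITSM integration consumes this telemetry from iDRAC's Redfish API, for example by polling a Thermal resource such as `/redfish/v1/Chassis/<id>/Thermal`. The sketch below parses a sample payload shaped like that response; the property names follow the DMTF Redfish schema, though the exact sensor set and thresholds vary by platform and firmware:

```python
# Sample payload shaped like a Redfish Thermal resource (illustrative values).
thermal = {
    "Temperatures": [
        {"Name": "GPU1 Temp", "ReadingCelsius": 78, "UpperThresholdCritical": 95},
        {"Name": "GPU2 Temp", "ReadingCelsius": 93, "UpperThresholdCritical": 95},
    ]
}

def near_critical(payload, margin=5):
    """Flag sensors within `margin` °C of their critical threshold,
    enabling proactive maintenance before an unplanned outage."""
    return [
        t["Name"]
        for t in payload["Temperatures"]
        if t["UpperThresholdCritical"] - t["ReadingCelsius"] <= margin
    ]

print(near_critical(thermal))  # → ['GPU2 Temp']
```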
Safety and Human Oversight
While the multi-agent system operates autonomously for common failure modes with well-established remediation procedures, the architecture includes configurable approval gates for high-impact actions. Operators can configure the system to require human confirmation before executing actions that affect multiple base stations, modify network configurations, or trigger firmware changes across device groups. All automated actions are logged with full evidence chains, enabling post-incident review and continuous policy refinement.
This graduated autonomy model recognizes that telecom operators maintain safety-critical responsibilities. Routine issue resolution (single-link restarts, individual device resets) proceeds automatically. Actions with a broader blast radius (multi-device firmware rollbacks, network-wide configuration changes) pause for operator approval through the dashboard interface. Operations teams can adjust these thresholds as they gain confidence in the system's recommendations over time.
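The graduated autonomy model can be expressed as a small policy check. The action names and thresholds below are hypothetical placeholders for the operator-configurable settings the text describes:

```python
# Hypothetical blast-radius policy; operators tune these values over time.
AUTO_APPROVE = {"link_restart", "device_reset"}
MAX_AUTO_DEVICES = 1

def requires_approval(action: str, affected_devices: int) -> bool:
    """Graduated autonomy: single-device routine actions run automatically;
    anything with a broader blast radius pauses for operator confirmation."""
    if action not in AUTO_APPROVE:
        return True
    return affected_devices > MAX_AUTO_DEVICES

print(requires_approval("link_restart", 1))        # False: runs automatically
print(requires_approval("firmware_rollback", 12))  # True: operator must approve
```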
| Infrastructure Foundation
The agent workflows, inference throughput, and sub-two-minute remediation cycles described above place specific demands on the underlying hardware. The platform requires sustained GPU memory bandwidth for large-model inference, sufficient accelerator memory to host 235 billion parameters without multi-server distribution, and enough compute headroom for concurrent agent workloads alongside continuous log ingestion. The Dell PowerEdge XE9785L server equipped with AMD Instinct MI355X accelerators meets these requirements within a single-server footprint.
Component | Specification |
Server Platform | Dell PowerEdge XE9785L Server |
Form Factor | 8U Rack Server |
GPU Accelerators | 8x AMD Instinct MI355X Accelerators |
GPU Memory | 2.3 TB aggregate HBM3e (288 GB per accelerator) |
CPU | AMD EPYC™ Processor |
Operating System | Ubuntu 22.04.5 LTS |
Table 4 | Dell PowerEdge XE9785L Hardware Configuration
The AMD Instinct MI355X accelerator provides 288 GB of HBM3e memory, a 50 percent increase over the 192 GB available on the previous-generation MI300X. This expanded capacity directly enables efficient deployment of the Qwen3-235B reasoning model: two MI355X GPUs can comfortably host the entire 235-billion-parameter model without aggressive quantization or distribution across multiple servers, which would introduce latency and operational complexity.
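The two-GPU claim can be checked with back-of-envelope arithmetic, assuming 16-bit weights (an assumption for illustration; FP8 would halve the figure):

```python
PARAMS = 235e9          # Qwen3-235B parameter count
BYTES_PER_PARAM = 2     # bf16/fp16 weights (assumption)
HBM_PER_GPU_GB = 288    # MI355X HBM3e capacity

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
two_gpu_gb = 2 * HBM_PER_GPU_GB

print(f"weights: {weights_gb:.0f} GB, two MI355X: {two_gpu_gb} GB")
# 470 GB of weights fit in 576 GB, leaving roughly 100 GB of headroom for
# the KV cache and activations -- which is why two accelerators suffice
# without sharding across servers.
```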
With 2.3 TB of aggregate GPU memory across eight accelerators, the XE9785L hosts both the reasoning model and embedding model simultaneously while providing capacity for concurrent agent workloads. Memory-intensive operations such as serving multiple model instances, fine-tuning, and executing concurrent reasoning agents are feasible on a single hardware system. This consolidation simplifies procurement, reduces data center footprint, and eliminates inter-server communication latency that would degrade real-time incident response performance.
The AMD EPYC processor handles CPU-bound preprocessing: event streaming through the Vector Data Pipeline, log transformation, and database operations. During peak ingestion periods when network events generate high volumes of telemetry data, the high-core-count processor prevents CPU bottlenecks from limiting pipeline throughput.
ROCm and the vLLM Inference Pipeline
The inference performance demonstrated in this paper depends on the tight integration between AMD ROCm and the vLLM serving framework. ROCm 7.0 provides the GPU kernel libraries, memory management primitives, and inter-GPU communication layers that vLLM uses to implement continuous batching and PagedAttention. For the Qwen3-235B model deployed across two MI355X accelerators, ROCm manages tensor parallel inference with minimal inter-GPU communication overhead, delivering the 7,773 tokens-per-second throughput measured at peak load.
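Agents reach this pipeline through vLLM's OpenAI-compatible HTTP endpoint. The sketch below builds such a request body with the standard library; it assumes the server was launched with a command like `vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 2`, and the model identifier, prompt, and sampling parameters shown are illustrative, not the solution's actual configuration:

```python
import json

def rca_request(log_excerpt: str) -> bytes:
    """Build an OpenAI-compatible chat-completions body for a root-cause query.

    POSTed to vLLM's /v1/chat/completions endpoint in a real deployment.
    """
    body = {
        "model": "Qwen/Qwen3-235B-A22B-Thinking-2507",  # assumed HF identifier
        "messages": [
            {"role": "system",
             "content": "You are a telecom RCA specialist. Identify the root cause."},
            {"role": "user", "content": log_excerpt},
        ],
        "temperature": 0.2,   # keep diagnoses conservative and repeatable
        "max_tokens": 512,
    }
    return json.dumps(body).encode()

payload = rca_request("BBU-12 ERROR fronthaul link to RRH-47 down")
print(len(payload))
```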
ROCm's compatibility with the Hugging Face model format means that new models published to the Hugging Face Hub (and curated through Dell Enterprise Hub) can be deployed on the MI355X accelerators without format conversion or custom compilation steps. When the next generation of reasoning models becomes available, operators can evaluate them on existing hardware by updating the model weights in vLLM, with ROCm handling the low-level GPU resource management automatically. This upgrade path protects the hardware investment and ensures the monitoring solution can evolve as AI model capabilities advance.
Operational Ecosystem: From Deployment to Production
The Dell PowerEdge XE9785L server's value extends beyond raw compute performance. Three ecosystem capabilities elevate the server from a standalone inference platform into a managed, production-grade AI infrastructure component.
Dell Enterprise Hub, integrated with the Hugging Face model repository, provides a curated path from model selection to validated deployment. Operations teams select models from a catalog pre-validated against the XE9785L's hardware configuration, including MI355X memory capacity, ROCm version compatibility, and vLLM serving parameters. The platform generates deployment configurations and tracks model versions across the fleet, ensuring uniformity across multi-site telecom deployments and providing the change management audit trail that regulatory compliance requires.
Dell iDRAC delivers out-of-band server management that operates independently of the AI application stack. iDRAC continuously monitors GPU temperatures, power supply health, storage integrity, and fan performance, issuing proactive alerts before hardware issues impact monitoring capability. iDRAC's Redfish API enables integration with existing ITSM platforms, providing a unified view of both the telecom network being monitored and the AI infrastructure performing the monitoring. Its remote console and firmware management capabilities reduce the need for on-site technician visits, which is particularly valuable for deployments at edge locations or central offices with limited physical access.
AMD ROCm's open-source platform ensures that the entire inference stack remains auditable, portable, and free from proprietary lock-in. Models and pipelines built on standard frameworks run on ROCm without requiring framework-level modifications. Telecom security teams can inspect the GPU runtime codebase as part of their infrastructure certification process. When next-generation AMD Instinct accelerators become available, existing model deployments and serving configurations can be migrated forward without application-level changes, protecting the operator's multi-year infrastructure investment.
Approach Comparison
Capability | Manual NOC Operations | Cloud-Based AIOps | Multi-Agent on Dell PowerEdge |
Detection Latency | Batch cycle (minutes to hours) | Near real-time (cloud dependent) | Continuous monitoring, sub-minute detection |
Root Cause Analysis | Manual log correlation | AI-assisted, requires data upload | Autonomous, RAG-powered |
Remediation | Manual execution | Recommendation with manual execution | Automated with audit trail |
Data Sovereignty | On-premises | Data leaves perimeter | On-premises, fully controlled |
Scalability | Linear staff increase | Cloud-elastic, variable cost | Single-server, deterministic cost |
Table 5 | Monitoring Approach Comparison
| Performance Benchmarking
Performance validation demonstrates that the solution scales effectively across increasing network complexity while maintaining the throughput required for real-time incident detection and remediation. Testing measured inference throughput, event processing capacity, and GPU resource utilization under production-representative workloads.
Test Configuration
The team conducted benchmarks on a Dell PowerEdge XE9785L server equipped with eight AMD Instinct MI355X accelerators. The solution deployed the Qwen3-235B-A22B reasoning model using vLLM v0.10.1 optimized for AMD ROCm 7.0. Tests simulated realistic telecom monitoring scenarios by progressively increasing the number of monitored BBUs and associated RRHs to evaluate end-to-end system performance.
Testing targeted representative configurations for small (3 to 6 BBUs), medium (15 BBUs), and large (30 BBUs) deployments. Each configuration ran under sustained load to capture steady-state throughput and resource utilization characteristics.
Scalability Results
Event processing capacity scales near-linearly as monitoring scope increases. At the maximum tested configuration of 30 BBUs and 90 RRHs, the system sustained a throughput of 2,253 events per minute while generating 7,773 tokens per second of inference throughput.
BBUs Monitored | RRHs Monitored | Events Processed/min | Throughput (tokens/sec) |
3 | 9 | 539 | 2,231 |
15 | 45 | 1,515 | 6,823 |
30 | 90 | 2,253 | 7,773 |
Table 6 | Scalability Performance on Dell PowerEdge XE9785L with AMD Instinct MI355X
To put these numbers in operational context, the system resolves most incidents within minutes of detection, with average workflow durations of 102 to 463 seconds depending on complexity. Against the industry-estimated $100,000-per-hour cost of unplanned downtime cited earlier, even modest reductions in mean time to resolution translate directly into avoided revenue loss and reduced customer churn.
Generational Comparison: MI355X vs. MI300X
To quantify the generational improvement, the same Qwen3-235B-A22B-Thinking model was benchmarked on both MI355X and MI300X accelerators under identical workload conditions.
Figure 5 | Events/Minute Scalability
Figure 6 | Tokens/Second Throughput
At the maximum tested configuration of 30 BBUs, the MI355X achieves 42 percent higher inference throughput (7,773 vs. 5,476 tokens per second) and 53 percent greater event processing capacity (2,253 vs. 1,470 events per minute) compared to the MI300X. The gap is even wider at the mid-range configuration: at 15 BBUs, the MI355X processes 57 percent more inference tokens per second (6,823 vs. 4,342) and nearly double the events per minute (1,515 vs. 766).
This generational improvement is attributable to architectural enhancements, most notably the MI355X's expanded 288 GB HBM3e per accelerator (versus 192 GB on the MI300X), which reduces the memory management overhead that constrains inference speed at scale. With more memory per GPU, tensor parallel inference across two accelerators operates with lower resource contention, enabling higher sustained throughput under concurrent agent workloads. For telecom operators evaluating infrastructure investments, these gains translate into expanded monitoring coverage and faster incident response within the same single-server footprint.
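The generational gains at the 30-BBU configuration follow directly from the benchmark figures in Table 6 and the MI300X comparison above:

```python
# Throughput figures from the generational benchmark (30-BBU configuration).
mi355x = {"tokens_per_sec": 7773, "events_per_min": 2253}
mi300x = {"tokens_per_sec": 5476, "events_per_min": 1470}

for metric in mi355x:
    gain = mi355x[metric] / mi300x[metric] - 1
    print(f"{metric}: +{gain:.0%}")
# tokens_per_sec: +42%
# events_per_min: +53%
```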
Latency Profile
End-to-end latency measurements confirm that the solution satisfies real-time operational requirements. Simple issue classification (such as confirming an INFO-level log requires no action) completes in approximately 45 seconds. Complex multi-component root cause analysis with automated remediation requires up to 463 seconds. The following metrics capture the range across all tested scenarios:
- Average agent workflow duration: 102 to 463 seconds, depending on complexity
- Minimum workflow completion: 45 seconds for straightforward issue classification
- Time to first token (TTFT): 200 to 245ms for inference requests
- P95 inference latency: 43 to 390 seconds, depending on query complexity and concurrent load
These latency figures are consistent with the sub-two-minute remediation cycle demonstrated in the single-incident scenario, while reflecting the longer workflows that complex, concurrent incidents require.
Limitations and Considerations
These benchmarking results reflect controlled test scenarios using simulated telecom log data. Production deployments may experience different throughput characteristics depending on log volume, event complexity, and the number of concurrent agent workflows.
The current implementation supports English-language log formats. Networks generating logs in other languages or non-standard formats may require additional parsing configuration.
Automated remediation actions demonstrated here (firmware rollback, link restart) represent common, well-understood failure modes. Complex multi-vendor interoperability issues may still require human escalation. The configurable approval gates described in the Safety and Human Oversight section give operators control over which actions proceed autonomously.
Finally, the 30-BBU configuration represents a large single-site deployment. At that scale, the inference engine queued 257 requests at peak, indicating that operators approaching this capacity should evaluate additional accelerator resources or model optimization strategies. Substantially larger networks should plan for multi-server scaling.
| Conclusion
Telecom operators face a fundamental choice: continue scaling manual processes that cannot keep pace with network complexity, or deploy autonomous systems that detect and resolve incidents faster than human teams can respond. The benchmarks and architecture presented in this paper demonstrate that the second path is now practical for production environments.
A coordinated team of specialized AI agents transforms network operations from reactive troubleshooting into continuous, proactive infrastructure management. These agents monitor 24/7 for link failures, synchronization issues, hardware malfunctions, and performance degradation without human initiation. When incidents occur, the Qwen3-235B reasoning model correlates current events with historical patterns retrieved from the vector database, delivering accurate root cause diagnoses in seconds. Common failure modes resolve in under two minutes, reducing mean time to resolution from hours to minutes.
Beyond incident response, the platform provides unified visibility through a single command center dashboard: geographic network visualization, live QoS metrics, event streams, and agent workflow status across all distributed base station infrastructure. Every automated action generates a complete audit trail, supporting compliance requirements and enabling continuous improvement through post-incident review. Dell iDRAC ensures the monitoring platform itself maintains the uptime that telecom operations demand, with out-of-band health management, proactive alerting, and remote administration.
The Dell PowerEdge XE9785L server with AMD Instinct MI355X accelerators provides the memory and compute density to run these workloads entirely on premises. Organizations can deploy a frontier-scale reasoning model alongside embedding models and concurrent agent workflows on a single server. This on-premises architecture eliminates cloud dependencies and external API calls that would introduce latency and data sovereignty concerns.
As network traffic grows and 5G deployments expand, the gap between manual monitoring capabilities and operational demands continues to widen. Organizations that invest in autonomous monitoring infrastructure now position themselves for higher service quality, lower operational costs, and faster response to network events as that gap widens further.
To learn more about implementing this solution, contact Dell Technologies or request access to reference code at contact@metrum.ai.
| Addendum: Key Concepts
What is RAG, and why is it critical for enterprises?
Retrieval-Augmented Generation (RAG) is a method in natural language processing that enhances the generation of responses by incorporating external knowledge retrieved from a large corpus or database. This approach combines the strengths of retrieval-based models and generative models to deliver more accurate, informative, and contextually relevant outputs.
The key advantage of RAG is its ability to dynamically leverage external knowledge, allowing the model to generate responses informed not only by its training data but also by up-to-date and detailed information from the retrieval phase. This makes RAG particularly valuable in applications where factual accuracy and comprehensive details are essential, such as in network operations, incident management, and other fields that require precise information. RAG provides enterprises with a powerful tool for improving the accuracy, relevance, and efficiency of their information systems.
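To make the retrieve-then-generate flow concrete, the sketch below implements a toy retrieval step over historical incident summaries. The bag-of-words `embed` function is a deliberately simple stand-in for a learned embedding model such as bge-large-en, and the corpus entries are invented examples; a production pipeline would store vectors in PgVector rather than in memory.

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words term counts.
# A production deployment would use a learned model such as bge-large-en.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Knowledge base of historical incident summaries (the retrieval corpus).
corpus = [
    "RRH-7 link failure resolved by restarting the fronthaul link",
    "BBU-2 synchronization drift fixed by resyncing the timing source",
    "RRH-3 hardware fault cleared after firmware rollback",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

query = "fronthaul link failure on RRH-7"
context = retrieve(query)
# The retrieved context is prepended to the LLM prompt for generation.
prompt = f"Context: {context[0]}\nQuestion: {query}"
print(prompt)
```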
Why is Dell PowerEdge XE9785L with AMD Instinct MI355X well-suited for RAG solutions?
Designed especially for AI tasks, the Dell PowerEdge XE9785L server is a powerful data-processing server equipped with high-density GPU accelerator support (such as eight MI355X GPUs) and high-performance system architecture, making it well-suited for AI workloads involving training, fine-tuning, and conducting inference with large language models.
Effectively implementing RAG solutions requires robust hardware infrastructure that can handle both the retrieval and generation components. Key hardware features for RAG solutions include high-performance accelerator units and large memory and storage capacity. With 288 GB of HBM3e memory per GPU, a single AMD Instinct MI355X accelerator can host very large LLMs and their associated working memory. Optimized for generative AI, the MI355X accelerator delivers leadership AI/HPC performance and provides the memory bandwidth and compute density needed to drive high-throughput inference and generation in RAG pipelines.
What are Multi-Specialist Agents and Multi-Agent Frameworks?
Multi-Specialist Agents are domain-focused AI agents designed with specialized expertise to address distinct aspects of complex operational workflows. Each agent operates autonomously within its area of specialization, such as network diagnostics, hardware health, communication link analysis, or report generation, while coordinating with other agents to achieve a shared operational goal. These agents use reasoning models, contextual data retrieval, and adaptive decision-making to analyze issues, execute corrective actions, and generate insights in real time.
A Multi-Agent Framework refers to a coordinated system where multiple specialist agents collaborate dynamically to solve interrelated problems across different domains. In this framework, agents communicate, delegate tasks, and share context through structured workflows, ensuring that each task is handled by the most capable specialist. For example, in the telecom C-RAN monitoring solution, the Operations Manager Agent delegates tasks to domain-specific agents such as the NOC Analyst, Communication Link Monitor, Hardware Health Agent, and Reporting Agent.
By combining the intelligence of multiple specialized agents, the Multi-Agent Framework enables autonomous detection, analysis, and resolution of incidents across large-scale infrastructures. It ensures faster root-cause identification, reduced downtime, and comprehensive reporting through continuous collaboration and reasoning between agents. This architecture represents a key advancement toward self-governing AI systems capable of managing complex, real-time operational environments.
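The delegation pattern described above can be sketched as a simple dispatcher. The agent names mirror the solution, but the keyword-based triage below is a simplified stand-in for the reasoning-model classification the real agents perform.

```python
# Illustrative dispatcher, not the actual AutoGen implementation: the
# Operations Manager routes an incident to a specialist by failure type.
SPECIALISTS = {
    "link_failure": "Communication Link Monitor Agent",
    "sync_issue": "Synchronization Specialist",
    "hardware_fault": "Hardware Health Agent",
}

def classify(log_line):
    """Crude keyword triage standing in for the NOC Analyst's analysis."""
    text = log_line.lower()
    if "link" in text:
        return "link_failure"
    if "sync" in text or "timing" in text:
        return "sync_issue"
    if "hardware" in text or "fan" in text or "temperature" in text:
        return "hardware_fault"
    return "unknown"

def delegate(log_line):
    """Operations Manager: pick the specialist for the classified issue."""
    issue = classify(log_line)
    return SPECIALISTS.get(issue, "human escalation")

print(delegate("[ERROR] BBU-4: fronthaul link down to RRH-12"))
# -> Communication Link Monitor Agent
```

Unrecognized failure modes fall through to human escalation, mirroring the approval-gate behavior described earlier in the paper.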
System Under Test
Component | Detail |
Server Platform | Dell PowerEdge XE9785L Server |
GPU Accelerators | 8x AMD Instinct MI355X Accelerator (288 GB HBM3e each) |
CPU | AMD EPYC Processor (high core count) |
Operating System | Ubuntu 22.04.5 LTS |
Hardware Optimization | AMD ROCm 7.0 |
Inference Runtime | vLLM v0.10.1 |
Reasoning Model | Qwen3-235B-A22B-Thinking |
Embedding Model | bge-large-en |
Vector Database | PgVector + PostgreSQL |
Time-Series Database | GreptimeDB |
Agent Framework | AutoGen (Microsoft) |
Table 8 | System Under Test Configuration
Glossary of Technical Terms
Term | Definition |
A2A | Agent-to-Agent protocol for secure, structured communication between AI agent microservices |
AutoGen | Microsoft's open-source multi-agent orchestration framework for coordinating AI agent workflows |
BBU | Baseband Unit; centralized equipment for baseband signal processing in C-RAN architectures |
bge-large-en | An open-source text embedding model used for semantic search and similarity matching |
C-RAN | Cloud Radio Access Network; architecture that centralizes baseband processing while distributing radio units |
GreptimeDB | An open-source distributed time-series database optimized for high-frequency telemetry data |
HBM3e | High Bandwidth Memory 3e; latest generation high-bandwidth memory for GPU accelerators |
iDRAC | Integrated Dell Remote Access Controller; out-of-band server management platform |
MCP | Model Context Protocol; standardized interface for context sharing across AI agents |
NVMe | Non-Volatile Memory Express; high-speed storage interface protocol |
OSS | Operations Support Systems; software tools for network monitoring, fault management, and performance optimization |
PagedAttention | Memory management technique for efficient GPU memory allocation during LLM inference |
PgVector | A vector similarity search extension for PostgreSQL databases |
QoS | Quality of Service; performance metrics ensuring network meets service level requirements |
RAG | Retrieval-Augmented Generation; method combining document retrieval with AI text generation |
ROCm | Radeon Open Compute platform; AMD's open-source GPU computing software platform |
RRH | Remote Radio Head; distributed equipment handling RF processing at cell sites |
TTFT | Time to First Token; latency measure for the initial response from an LLM inference request |
vLLM | Open-source high-throughput inference engine for serving large language models |
Table 9 | Glossary
| References
[1] Ericsson, "Ericsson Mobility Report, November 2024," Ericsson AB, Stockholm, Sweden, Nov. 2024. [Online]. Available: https://www.ericsson.com/en/reports-and-papers/mobility-report
[2] TM Forum, "Network Performance Benchmarking Report," TM Forum, 2024. See also: Analysys Mason, "Telecoms Network Downtime: Cost and Impact Analysis," Analysys Mason Ltd., London, U.K., 2023.
Image Sources
Dell Images: © Dell Technologies Inc. Dell PowerEdge XE9785L Server. Image source: Dell DAM via Dell.com
AMD Images: © AMD Inc. AMD Instinct MI355X Accelerator. Image source: AMD Media Library (https://library.amd.com)
Copyright © 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, AMD Instinct™, AMD ROCm™, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.
***DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance.
Enhancing Telecom Quality of Service with Gen AI–Based Multi-Agent Infrastructure Monitoring
This blog presents a telecom Multi-Agent Infrastructure Monitoring solution powered by AMD Instinct MI355X accelerators on Dell PowerEdge XE9785L servers.
October 2025
The introduction of the AMD Instinct MI355X accelerator, now integrated with Dell’s flagship PowerEdge XE9785L server, provides a robust platform for high-performance AI applications. Leveraging this powerful combination, we have developed a Multi-Agent Infrastructure Monitoring solution to demonstrate the value of Generative AI in optimizing network operations for telecom companies and their enterprise clients. This solution is crucial for minimizing network downtime, ensuring consistent service quality, and enabling telecom operators to make informed decisions about network investments and improvements.
In this blog, we offer insights into the solution architecture built with industry-leading software and hardware components, showcasing the following:
- How to utilize an LLM-based Multi-Agent framework to build a telecom C-RAN monitoring solution
- How to deploy a cutting-edge reasoning-capable model, embedding model, and vector database on a Dell PowerEdge XE9785L server equipped with eight AMD Instinct MI355X accelerators
- How to navigate an up-to-date and intelligent dashboard that tracks and recommends actions for telecom issues, with a focus on base station performance
Understanding the Telecom Landscape
The expected data volume over cellular networks is projected to exceed hundreds of exabytes per month, driven by human and machine data from tens of billions of devices. This explosive growth challenges operators to keep networks robust and resilient while handling increased demand for data throughput and low-latency services, and it complicates both maintaining high-quality service and securing critical information. Telecom infrastructure providers must ensure the integrity and resilience of their networks to prevent costly disruptions. For example, network downtime caused by equipment failure at a Remote Radio Head (RRH) can lead to widespread service outages, particularly in densely populated areas. Even worse, cascading failures can occur when a single failure triggers a chain reaction, affecting multiple base stations across the network. With tens of thousands of RRHs deployed across geographically dispersed areas, addressing these issues manually—such as sending out repair teams—is time-consuming and extremely costly.
Other common challenges include signal degradation due to environmental factors, misconfigurations in the Baseband Unit (BBU), and congestion from unexpected traffic spikes. These problems not only affect the quality of service but also increase the risk of security vulnerabilities, making real-time monitoring and automated issue resolution critical for minimizing service disruptions and maintaining operational efficiency.
To better understand the impact of this solution, it is important to gain some foundational knowledge on the telecom landscape (as represented in the simplified diagram above):
- RRH (Remote Radio Head):
- RRHs are deployed across multiple sites to perform basic signal transmission and reception functions, and handle radio frequency (RF) processing at cell sites.
- BBU (Baseband Unit):
- BBUs are aggregated within a centralized BBU pool to provide robust computing capabilities for baseband signal processing. They handle the digital processing of signals.
- OSS (Operations Support Systems):
- OSS comprises software tools that analyze and manage the telecom network, including network monitoring, fault management, and performance optimization.
In a typical telecom network, RRHs are connected to the BBU pool through fronthaul links. BBUs can be dynamically assigned to serve clusters of RRHs in a many-to-one configuration. The OSS provides a comprehensive network view by integrating data from RRHs and BBUs, ensuring that the telecom network meets Quality of Service (QoS) requirements.
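The many-to-one RRH-to-BBU relationship described above can be illustrated with a toy scheduler. The greedy least-loaded policy and the capacity limit are assumptions for illustration, not the actual BBU pool logic.

```python
# Hypothetical sketch of dynamic many-to-one RRH-to-BBU assignment:
# each BBU in the pool serves a cluster of RRHs up to a capacity limit.
def assign_rrhs(rrh_ids, bbu_pool, capacity):
    """Greedily assign each RRH to the least-loaded BBU with spare capacity."""
    assignment = {bbu: [] for bbu in bbu_pool}
    for rrh in rrh_ids:
        bbu = min(bbu_pool, key=lambda b: len(assignment[b]))
        if len(assignment[bbu]) >= capacity:
            raise RuntimeError("BBU pool exhausted; add capacity")
        assignment[bbu].append(rrh)
    return assignment

pool = ["BBU-1", "BBU-2"]
print(assign_rrhs([f"RRH-{i}" for i in range(1, 7)], pool, capacity=3))
```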
In this implementation, we simulate a telecom network with several RRHs, a centralized BBU pool, and an OSS using synthetic data streams, specifically system logs from various telecom devices and OSS event logs. The following section provides a detailed walkthrough of the solution architecture and user interface. In doing so, we demonstrate how telecommunications service providers can use agentic RAG solutions to analyze and address quality of service and security challenges in near real-time, increasing the uptime of their networks and improving the quality of service.
Solution Architecture
To power this solution, we selected the Dell PowerEdge XE9785L server equipped with AMD Instinct MI355X accelerators for its exceptional performance and memory capacity, which are crucial for handling the latest high-parameter-count large language models. With 288 GB of HBM3e memory per accelerator, we can comfortably run the entire Qwen3-235B reasoning model on two accelerators. Memory- and compute-intensive workloads, such as serving multiple model instances, fine-tuning, and executing concurrent reasoning agents, are also possible on a single system with eight accelerators.
To deliver an industry-specific solution, we paired a large language reasoning model with critical software components, as shown in the architecture below. The memory and performance capabilities of the Dell PowerEdge XE9785L with AMD Instinct MI355X accelerators make it possible to run concurrent reasoning-agent workloads without bottlenecks, which is crucial for telecom-grade, low-latency environments.
This solution leverages the following technologies:
- Multi-Agent Reasoning Framework: Powered by the AutoGen Agent ecosystem, the solution supports autonomous collaboration between specialized agents. These agents use the Model Context Protocol (MCP) and A2A protocol for seamless communication and orchestration across microservices, enabling dynamic reasoning and adaptive task management.
- Real-Time Data Processing and Ingestion: A Vector Data Pipeline ingests and transforms data from telecom event sources. The transformed data is routed to GreptimeDB, a high-performance time-series database.
- Optimized AI Model Stack:
- bge-large-en for high-quality text embeddings and similarity search
- Qwen3-235B for generation and reasoning tasks
- vLLM (v0.10.1) with AMD ROCm 7.0 as a highly efficient inference runtime
- Integrated Databases and Observability: Postgres DB handles structured metadata and agent state management, while an integrated Dashboard provides system observability, agent telemetry, and performance analytics.
- Extensible and Scalable Architecture: The architecture is designed to scale efficiently, processing thousands of network messages and log streams per day across hundreds of nodes on a single server, while supporting horizontal scaling for expanded deployments.
To enable these features, the software stack includes the following key components:
- vLLM Inference Runtime (v0.10.1 with ROCm 7.0), an industry-standard library for optimized open-source large language model (LLM) serving.
- AutoGen-based Multi-Agent Framework, which serves as the orchestration layer for managing multi-agent collaboration and coordination, enabling asynchronous, event-driven task execution.
- Model Context Protocol (MCP) and A2A Agent Services. The Model Context Protocol (MCP) acts as a standardized interface for context sharing and control across multiple reasoning and tool-using agents. Paired with the A2A (Agent-to-Agent) protocol, this component ensures secure, scalable, and structured communication between microservices and agents. Together, they form the control plane of the solution, handling request routing, inter-agent orchestration, and operational telemetry, which is crucial for running distributed AI workflows with traceability and observability.
- Reasoning and Generation Model (Qwen3-235B Thinking), an industry-leading open-weight language model with 235 billion parameters, served using vLLM with AMD ROCm optimizations.
- bge-large-en embedding model, one of the top-ranked text embedding models, served via Hugging Face APIs.
- Database Layer: the PgVector extension for PostgreSQL provides vector storage and similarity search capabilities, enabling embedding-based retrieval and semantic querying. It works in conjunction with PostgreSQL, which serves as a metadata layer managing agent state, session data, and system configurations.
- Vector Data Pipeline, which handles real-time ingestion and transformation of telemetry and network logs before routing enriched data to GreptimeDB.
- GreptimeDB, a distributed time-series database optimized for handling high-frequency telemetry and event data.
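As one illustration of the ingestion step the Vector Data Pipeline performs, the sketch below parses raw device logs into structured events and flags the WARNING/ERROR entries that would trigger an agent workflow. The log format and field names here are assumptions, not the production schema.

```python
import re

# Assumed log format: "[LEVEL] DEVICE: message"
LOG_PATTERN = re.compile(
    r"\[(?P<level>INFO|WARNING|ERROR)\]\s+(?P<device>\S+):\s+(?P<message>.+)"
)

def parse_event(line):
    """Turn a raw log line into a structured event dict, or None if unparsable."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    event = m.groupdict()
    # Only WARNING/ERROR events trigger the incident-resolution workflow.
    event["triggers_workflow"] = event["level"] in ("WARNING", "ERROR")
    return event

for line in [
    "[INFO] RRH-5: heartbeat OK",
    "[ERROR] BBU-2: fronthaul link timeout to RRH-9",
]:
    print(parse_event(line))
```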
Solution Overview
This solution is centered on real-time monitoring and management of telecom services by leveraging a multi-agent framework composed of specialized AI agents capable of autonomously detecting, analyzing, and resolving telecom network incidents in real time. The image below depicts the user interface of the Multi-Agent C-RAN Infrastructure Monitoring solution, through which users can monitor live QoS metrics, visualize incident resolutions executed by agents across base stations, and generate both detailed incident-specific reports and executive summary reports that capture overall network health.
- Base Station Simulated Logs
- Logs are generated from each base station’s RRH (Remote Radio Head) and BBU (Baseband Unit). The color indicators represent operational status:
- Green: All [INFO] logs — RRHs are operating normally.
- Orange: Early warning detected — agents monitoring for potential degradation.
- Red: Presence of [WARNING] or [ERROR] logs — triggers automated incident detection and resolution workflow.
- QoS Metrics (Top Section)
- Live Global QoS metrics are continuously streamed and visualized, reflecting:
- Block Rate, Data Drop, and Call Drop percentages
- Downlink and Uplink Throughput levels
- Real-time updates aggregated from all active base stations
- Agents monitor deviations and correlate anomalies with underlying network or hardware events.
- Event Streams & Active Agents (Middle Section)
- This section shows:
- Real-time event count from BBU, RRH, and OSS logs
- Active agent workflows currently diagnosing or resolving issues
- The number of processed incidents and devices monitored
- Each workflow includes specialized agents such as:
- Operations Manager Agent – Delegates the task to domain-specific agents.
- NOC Analyst Agent – Monitors the BBU and RRH for issues and forwards them to the Operations Manager
- Communication Link Monitor Agent – Specialized in resolving communication link failures between BBU and RRH
- Synchronization Specialist – Specialized in resolving synchronization timing issues in telecom infrastructure.
- Hardware Health Agent – Specialized in resolving hardware failure issues between BBU and RRH
- Reporting Agent – Specialized in generating comprehensive reports for telecom infrastructure operations.
- Automated Incident Detection and Resolution
- When [WARNING] or [ERROR] logs are detected:
- The Operations Manager Agent delegates the task to appropriate agents.
- Agents collaboratively identify the root cause (e.g., link failure, hardware malfunction, or synchronization issue).
- Corrective actions such as restarting communication links or reconfiguring network interfaces are executed autonomously.
- The system updates the incident state from “Detected” → “Resolved” in real time.
- Issue Alerts Panel
- Displays an interactive list of incidents:
- Mapping between affected RRHs, BBUs, and base stations
- Visual indicators showing ongoing, resolved, or escalated issues
- Direct drill-down into root cause analysis reports for each event
- Report Generation
- The system supports two modes of report generation:
- Specific Incident Report: Generated per event with detailed RCA (Root Cause Analysis), corrective actions taken, and affected components.
- Executive Summary Report: Provides aggregated insights across all base stations, summarizing trends, issue types, and overall service impact.
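The incident lifecycle driven by these workflows can be modeled as a small state machine. The intermediate "In Progress" and "Escalated" states below are assumptions added for illustration; the dashboard itself surfaces the "Detected" and "Resolved" states.

```python
# Assumed incident lifecycle: Detected -> In Progress -> Resolved,
# with Escalated as the path for issues requiring human intervention.
TRANSITIONS = {
    "Detected": {"In Progress"},
    "In Progress": {"Resolved", "Escalated"},
    "Resolved": set(),
    "Escalated": set(),
}

class Incident:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.state = "Detected"
        self.history = ["Detected"]

    def advance(self, new_state):
        """Move to a new state, rejecting transitions the lifecycle forbids."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

inc = Incident("INC-0042")
inc.advance("In Progress")
inc.advance("Resolved")
print(inc.history)
```

Recording every transition in `history` is what makes the per-incident audit trail and RCA reports described above possible.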
The image above illustrates each segment of the workflow, and details how the RAG-based agentic workload and vector database interact with the simulated telecom network data.
As demonstrated in our implementation, telecom operators can now leverage Generative AI to enhance the monitoring and management of telecom services, while ensuring the privacy of their proprietary data and workflows. Dell’s flagship PowerEdge XE9785L server, equipped with eight AMD Instinct MI355X accelerators, provides the necessary memory footprint to support these rich multimodal data and model-intensive use cases.
In this blog, we demonstrated how enterprises deploying applied AI can leverage their proprietary data to harness multimodal RAG capabilities in the context of a telecom issue management tool. We explored the capabilities of the Dell PowerEdge XE9785L server equipped with AMD Instinct MI355X accelerators, achieving the following milestones:
- Developed a Multi-Agent C-RAN Infrastructure Monitoring solution.
- Deployed a cutting-edge language model, embedding model, and vector database on a Dell PowerEdge XE9785L server with eight AMD Instinct MI355X accelerators.
- Integrated an intelligent, real-time dashboard that monitors and executes recommended actions for telecom issues, with a focus on base station performance.
To learn more, please request access to our reference code by contacting us at contact@metrum.ai.
Additional Criteria for IT Decision Makers
| What is RAG, and why is it critical for enterprises?
Retrieval-Augmented Generation (RAG) is a method in natural language processing (NLP) that enhances the generation of responses or information by incorporating external knowledge retrieved from a large corpus or database. This approach combines the strengths of retrieval-based models and generative models to deliver more accurate, informative, and contextually relevant outputs.
The key advantage of RAG is its ability to dynamically leverage a large amount of external knowledge, allowing the model to generate responses that are informed not only based on its training data but also by up-to-date and detailed information from the retrieval phase. This makes RAG particularly valuable in applications where factual accuracy and comprehensive details are essential, such as in customer support, academic research, and other fields that require precise information.
Ultimately, RAG provides enterprises with a powerful tool for improving the accuracy, relevance, and efficiency of their information systems, leading to better customer service, cost savings, and competitive advantages.
| Why is the Dell PowerEdge XE9785L Server with AMD Instinct MI355X Accelerators well-suited for RAG Solutions?
Designed especially for AI tasks, the Dell PowerEdge XE9785L server is a powerful data-processing server equipped with high-density GPU accelerator support (such as eight MI350-series GPUs) and high-performance system architecture, making it well-suited for AI workloads, especially those involving training, fine-tuning, and conducting inference with large language models.
Effectively implementing Retrieval-Augmented Generation (RAG) solutions requires a robust hardware infrastructure that can handle both the retrieval and generation components. Key hardware features for RAG solutions include high-performance accelerator units and large memory and storage capacity. With 288 GB of HBM3e memory per GPU, a single AMD Instinct MI355X accelerator can host very large LLMs and their associated working memory. Optimized for generative AI, the MI355X accelerator delivers leadership AI/HPC performance and provides the kind of memory bandwidth and compute density needed to drive high-throughput inference and generation in RAG pipelines. Paired in a server like the Dell PowerEdge XE9785L (with support for eight MI355X GPUs, fast NVMe storage, and high-speed networking), this creates an ideal platform to power both the retrieval of knowledge (embeddings, vector search) and the generation of context-aware responses in real time.
| What are Multi-Specialist Agents, and what is a Multi-Agent Framework?
Multi-Specialist Agents are domain-focused AI agents designed with specialized expertise to address distinct aspects of complex operational workflows. Each agent operates autonomously within its area of specialization—such as network diagnostics, hardware health, communication link analysis, or report generation—while coordinating with other agents to achieve a shared operational goal. These agents use reasoning models, contextual data retrieval, and adaptive decision-making to analyze issues, execute corrective actions, and generate insights in real time.
A Multi-Agent Framework refers to a coordinated system where multiple specialist agents collaborate dynamically to solve interrelated problems across different domains. In this framework, agents communicate, delegate tasks, and share context through structured workflows, ensuring that each task is handled by the most capable specialist. For example, in the telecom C-RAN monitoring solution, the Operations Manager Agent delegates tasks to domain-specific agents such as the NOC Analyst, Communication Link Monitor, Hardware Health Agent, and Reporting Agent.
By combining the intelligence of multiple specialized agents, the Multi-Agent Framework enables autonomous detection, analysis, and resolution of incidents across large-scale infrastructures. It ensures faster root-cause identification, reduced downtime, and comprehensive reporting through continuous collaboration and reasoning between agents. This architecture represents a key advancement toward self-governing AI systems capable of managing complex, real-time operational environments.