Technical Whitepaper
March 2026

Multi-Agent Risk Analysis and Compliance Monitoring on Dell PowerEdge XE9785L Servers Powered By AMD Instinct MI355X Accelerators

Abstract

Financial institutions spend over $206 billion annually[1] on compliance, yet risk teams still operate with multi-hour blind spots during market events, fragmented data pipelines, and limited capacity to process the 80 percent[2] of institutional data that remains unstructured.


Executive Summary

Financial institutions spend over $206 billion annually[1] on compliance, yet risk teams still operate with multi-hour blind spots during market events, fragmented data pipelines, and limited capacity to process the 80 percent[2] of institutional data that remains unstructured. Legacy batch systems and cloud-first AI tools cannot simultaneously address real-time market velocity and on-premises data sovereignty requirements.

The Institutional Portfolio Risk Agents (IPRA) platform closes this gap. IPRA is a single-server, on-premises portfolio risk agent platform running on a Dell PowerEdge XE9785L with eight AMD Instinct MI355X accelerators. The platform continuously links filings, news, macroeconomic indicators, and regulatory updates to every portfolio position and compliance mandate through a coordinated suite of AI agents. The results that follow demonstrate near-linear throughput scaling to 850 concurrent portfolios on one server, with generational performance gains validated at the raw inference level.

Key Results at a Glance

  • 1,415 holdings analyzed/min: Near-linear scaling from 300 to 850 concurrent portfolios on a single server
  • 109 compliance checks/min: Each assessment covers 20 to 30 rule evaluations per holding with full audit trail
  • 850 concurrent portfolios: 22,100 individual asset positions monitored on a single Dell PowerEdge XE9785L
  • Up to 4.3x gen-on-gen throughput: MI355X vs MI300X on 128/128 workload at 8,192 concurrent requests
  • 2.3 TB combined GPU memory: Single 8-GPU node, no multi-node sharding required
  • 64x compliance concurrency within SLA: MI355X holds sub-100ms TPOT P95 at 8,192 concurrent sessions vs 128 on MI300X
  • Sub-2s median TTFT: At all concurrent request levels tested
  • Up to 2.2x tokens per GPU watt: MI355X efficiency advantage at 2,048+ concurrent requests

The Five Operational Challenges

Five operational challenges consistently surface in conversations with risk and compliance leaders at financial institutions. Table 1 summarizes each challenge, its current state and its quantified business impact, establishing the baseline for improvement by any effective risk intelligence platform.

Challenge | Current State | Business Impact
Delayed Risk Awareness | Batch systems and manual reviews typically operate on multi-hour cycles | Critical exposure changes may go undetected during fast-moving market events
Unstructured Data Overload | ~80% of relevant data resides in filings, transcripts, and news | Analysts struggle to process document volumes at market speed
Compliance Gaps | Mandate checks typically occur only after risk analysis completes | Breaches often surface late, increasing regulatory and audit exposure
Siloed Insights | Market, fundamental, macroeconomic, and regulatory signals often remain disconnected | Correlated risks across portfolios may remain hidden
Data Sovereignty Constraints | Sensitive trading and client data cannot leave the institutional perimeter | Cloud-first AI tools are often constrained by regulatory requirements

Table 1 | Operational Challenges and Business Impact

Addressing these challenges requires a platform purpose-built for continuous, on-premises intelligence. The following sections explain how IPRA unifies fragmented workflows into a continuous, real-time risk and compliance capability on the Dell PowerEdge XE9785L server with eight AMD Instinct MI355X accelerators.

Solution Overview

IPRA's architecture centers on a continuously updated knowledge graph implemented in Neo4j that captures relationships between market events, issuers, sectors, portfolio positions, and compliance mandates. When new information arrives, including a 10-Q filing, breaking news headline, or regulatory notice, the system uses large language and vision models to extract entities and relationships, then maps them to the graph structure using Graphiti. Specialized agents query this shared knowledge graph to determine how incoming events affect portfolio holdings, enabling the platform to trace the impact path from a single news item to all affected positions and mandates.
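The impact-path traversal described above can be illustrated with a minimal in-memory sketch. The production system stores these relationships in Neo4j via Graphiti; the node names, edge labels, and adjacency structure below are illustrative assumptions, not IPRA's actual schema.

```python
from collections import deque

# Illustrative knowledge-graph edges: (source, relation, target).
# In production these live in Neo4j; all names here are hypothetical.
EDGES = [
    ("news:bank_collapse", "MENTIONS", "issuer:BANK_A"),
    ("issuer:BANK_A", "BELONGS_TO", "sector:regional_banks"),
    ("sector:regional_banks", "EXPOSES", "position:PORT1:BANK_B"),
    ("issuer:BANK_A", "HELD_AS", "position:PORT1:BANK_A"),
    ("position:PORT1:BANK_A", "GOVERNED_BY", "mandate:concentration_limit"),
]

def trace_impact(event: str) -> set[str]:
    """Breadth-first traversal from a news event to every
    reachable position and mandate node."""
    adjacency: dict[str, list[str]] = {}
    for src, _rel, dst in EDGES:
        adjacency.setdefault(src, []).append(dst)
    seen, queue, affected = {event}, deque([event]), set()
    while queue:
        node = queue.popleft()
        if node.startswith(("position:", "mandate:")):
            affected.add(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return affected

affected = trace_impact("news:bank_collapse")
```

A single news node reaches both directly held positions and sector-correlated positions, which is the property the agents rely on when assessing cross-portfolio contagion.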

Agent | Function | Operation Mode
Portfolio Impact Agent | Calculates exposure changes when new signals affect portfolio holdings | Continuous
Compliance Monitor Agent | Checks positions against institutional mandates and regulatory rules | Continuous
Regulatory Audit Agent | Maintains audit trails with traceable source links for supervisory review | Continuous
Scenario Stress Agent | Runs stress tests using natural language queries | User-initiated

Table 2 | IPRA Agent Functions

Solution Flow


Figure 1 | Solution Flow

The platform operates through five integrated stages, each running continuously within the institution's secure perimeter. In the first stage, IPRA ingests data from multiple sources: SEC filings, news feeds, fundamental financial data, insider trading disclosures, macroeconomic indicators, and regulatory notices. A Kafka-based event streamer normalizes these inputs and routes them to the processing layer, where GPU-accelerated language models extract structured signals from unstructured content. Vision language models handle documents containing charts, tables, and mixed media. Text models perform summarization, sentiment scoring, and entity extraction.
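The normalize-and-route step can be sketched as a pure function. The real pipeline runs on Apache Kafka; the event types, topic names, and priority values below are illustrative assumptions, not the production configuration.

```python
# Hypothetical routing table: event type -> (processing topic, priority).
# Document-heavy inputs go to the VLM queue; text goes to the NLP queue.
ROUTING = {
    "sec_filing": ("docs.vlm", 1),
    "regulatory": ("docs.vlm", 1),
    "news":       ("text.nlp", 2),
    "macro":      ("text.nlp", 3),
}

def route(raw: dict) -> dict:
    """Normalize a raw feed event and select its processing queue."""
    kind = raw.get("type", "news")
    topic, priority = ROUTING.get(kind, ("text.nlp", 3))
    return {
        "topic": topic,
        "priority": priority,
        "source": raw.get("source", "unknown"),
        "payload": raw.get("body", ""),
    }

msg = route({"type": "sec_filing", "source": "EDGAR", "body": "10-Q ..."})
```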

The extracted signals flow into the knowledge graph, where Graphiti maps relationships between entities such as identifying which issuer published which filing, which sectors face exposure to which regulatory changes, and which portfolio positions connect to which risk factors. This contextualized graph becomes the foundation for the Portfolio Risk Intelligence Layer, where agents continuously evaluate how new information affects holdings, mandates, and compliance status. The Scenario and Impact Evaluation Layer then applies predictive modeling and stress testing to quantify forward-looking risk under varying market conditions.

The final stage delivers decision-ready intelligence to risk officers through a unified dashboard. Users see a real-time risk heatmap organized by issuer and risk category, a "What Changed?" panel that highlights the most significant events and their quantified portfolio impact, compliance status indicators with pass/warn/fail flags, and narrative briefings that explain findings in plain language. Every recommendation links back to its source data, providing the required audit trail for supervisory and regulatory review.


Figure 2 | Example Unified Dashboard

Example Scenarios

To illustrate how these stages operate in practice, consider three scenarios based on real market events.

In a regulatory change scenario, federal regulators publish a final rule recalibrating the enhanced supplementary leverage ratio (eSLR) for systemically important banks. Within minutes of the rule's publication in the Federal Register, IPRA's ingestion layer captures the notice, extracts the specific constraint changes, and maps affected entities to portfolio holdings. The Compliance Monitor Agent automatically generates a side-by-side comparison of prior and updated capital requirements and flags any Global Systemically Important Bank (GSIB) positions that may require rebalancing. Risk officers receive a concise summary of exposure impacts within minutes of publication, often before markets have fully priced in the change.

In a governance shock scenario, a major agricultural company delays its earnings release and announces an accounting investigation, triggering a 24 percent decline in share price at market open. IPRA detects the escalation pattern from the company's 8-K filing and associated news coverage, adjusts the issuer's governance risk score, and flags any portfolio mandates tied to disclosure quality or ESG governance minimums. The Regulatory Audit Agent logs the complete evidence chain for compliance review. The dashboard displays updated risk scores alongside an events timeline, enabling risk officers to trace the full sequence from initial disclosure to portfolio impact.

In a liquidity crisis scenario, a macroeconomic shock triggers a rapid flight to liquidity, commonly referred to as a "dash for cash", where correlations spike across asset classes and even traditional safe havens decline. IPRA's Scenario Stress Agent recognizes the correlation breakdown. Upon the user's acknowledgment, it calculates portfolio-wide NAV drawdown estimates and recommends emergency liquidity protocol activation. The dashboard then displays a portfolio-wide drawdown estimate alongside a correlation matrix illustrating the breakdown of traditional diversification assumptions. Risk officers receive prioritized recommendations for liquidity buffer activation, with each recommendation linked to the source signals and model outputs that informed it.

Solution Architecture

The architectural decisions behind IPRA reflect a fundamental requirement: every component must operate within the institution's secure perimeter while delivering the high-performance computational throughput needed for real-time risk intelligence. The platform combines optimized inference runtimes and a modular software stack designed for continuous operation under regulatory scrutiny.


Figure 3 | Solution Architecture

Software Stack

The software architecture layers optimized runtimes atop AMD's Radeon Open Compute (ROCm) 7.x platform. ROCm combines a hardware-optimized foundation with an open ecosystem approach, enabling institutions to deploy models from any source without vendor lock-in. The platform supports standard frameworks and tools, allowing risk teams to incorporate new models as they become available without rewriting application code. vLLM provides the high-throughput inference runtime, delivering token generation with continuous batching and PagedAttention memory management. This combination enables IPRA to serve multiple concurrent analysis requests while maintaining predictable latency for time-sensitive compliance checks.
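A deployment of the reasoning model under vLLM might look like the following launch command. This is a sketch only: the tensor-parallel degree, sequence limit, and memory utilization values are illustrative assumptions, not the benchmarked configuration.

```shell
# Illustrative only: serve the reasoning model with tensor parallelism
# across a subset of the node's GPUs; tune all values per deployment.
vllm serve amd/Qwen3-235B-A22B-Thinking-2507-ptpc \
    --tensor-parallel-size 4 \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.90
```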

Layer | Component | Function
Hardware Optimization | AMD ROCm 7.x | GPU compute and memory management
Inference Runtime | vLLM | High-throughput model serving with continuous batching
Event Streaming | Apache Kafka | Real-time data ingestion and routing
Document Processing | Docling | PDF and document parsing with vision-language model (VLM) support
Knowledge Graph | Graphiti + Neo4j | Entity relationship mapping and graph storage
Agent Framework | Microsoft Agent Framework | Multi-agent orchestration and coordination
Relational Database | PostgreSQL | Portfolio data, compliance rules, and audit logs
Authentication | Valkey | Session management and access control
Observability | Prometheus | System metrics and performance monitoring

Table 3 | Software Stack Components

AI Model Deployment

IPRA deploys four specialized models across the GPU array, each optimized for specific tasks within the risk intelligence pipeline. This multi-model approach matches computational requirements to task complexity: lightweight summarization runs on smaller models with multiple replicas for throughput, while document understanding and regulatory reasoning leverage large-scale vision-language and reasoning architectures. Additionally, using Dell Enterprise Hub's integration with Hugging Face simplifies model deployment by providing pre-validated, enterprise-ready model containers. Infrastructure teams can deploy new models through a streamlined workflow rather than building custom inference pipelines from scratch.

Model | Function
Qwen3-VL-235B-A22B-Thinking-FP8 | Vision-language processing for documents containing charts, tables, and mixed media
Qwen3-235B-A22B-Thinking-2507-ptpc | Multi-step reasoning for compliance assessment and complex risk analysis
GPT-OSS-120B[3] | Entity extraction and relationship mapping for knowledge graph construction
Magistral-Small-2507 | Sentiment analysis and summarization for news and filing content

Table 4 | Model Deployment Configuration

The Magistral-Small-2507 model runs with four replicas to handle high-volume summarization and sentiment scoring workloads. GPT-OSS-120B similarly runs multiple replicas for entity extraction tasks. The 235-billion-parameter reasoning and vision-language models each require substantial GPU memory but deliver the analytical depth needed for regulatory compliance decisions.

Data Pipeline Architecture: The data pipeline transforms raw market signals into structured risk insights through four processing stages. Kafka receives streaming data from external sources including SEC EDGAR, news APIs, macroeconomic feeds, and regulatory notice systems. The event streamer normalizes incoming data formats and routes messages to appropriate processing queues based on content type and priority.

Docling handles document parsing, using vision-language models to extract structured content from PDFs containing charts, tables, and mixed layouts. Unlike traditional optical character recognition approaches, the VLM-based pipeline captures document semantics, enabling more accurate extraction of financial data even from complex multi-column filings. The CPU handles initial document preprocessing, while the GPU executes the vision-language inference.

Graphiti constructs the knowledge graph by identifying entities, relationships, and temporal connections within processed documents. The resulting graph structure, stored in Neo4j, captures how issuers connect to sectors, how regulatory rules apply to asset classes, and how portfolio positions link to risk factors. This contextualized representation enables agents to trace the impact of any market event across the full scope of portfolio holdings and compliance mandates.

Agent Orchestration: The Microsoft Agent Framework coordinates the specialized agents that execute IPRA's analytical workflows. An intelligent orchestration engine routes incoming requests and events to the appropriate agent based on task type and current system load. The Portfolio Impact Agent and Compliance Monitor Agent operate continuously, processing each incoming signal against relevant portfolio positions and mandates. The Regulatory Audit Agent maintains audit trails, while the Scenario Stress Agent responds to user-initiated requests for stress testing and sensitivity analysis.

Each agent accesses shared resources through well-defined interfaces: the knowledge graph for contextual relationships, PostgreSQL for portfolio and compliance data, and the model inference endpoints for AI-powered analysis. This separation of concerns enables independent scaling and updates to individual components without disrupting overall system operation.

Accuracy Controls and Design Safeguards

IPRA mitigates hallucination risk through a multi-layered approach. Every LLM-generated assessment is grounded in source data retrieved from the knowledge graph and PostgreSQL. The Compliance Monitor Agent applies deterministic threshold checks before invoking LLM-based qualitative assessments, ensuring that quantitative compliance limits are enforced. The Regulatory Audit Agent logs the complete evidence chain for each finding, enabling human reviewers to verify any AI-generated recommendation against its source material.
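The deterministic-first ordering can be sketched as a small gate function. The rule names, thresholds, and escalation criterion below are hypothetical, not IPRA's actual mandate set.

```python
# Sketch of the deterministic-first compliance gate. Rule names,
# thresholds, and the LLM-escalation criterion are assumptions.
RULES = {
    "max_single_issuer_weight": 0.10,  # 10% concentration cap
    "min_cash_buffer": 0.02,           # 2% liquidity floor
}

def check_position(weight: float, cash_ratio: float) -> dict:
    """Apply hard quantitative limits first; only borderline cases
    are escalated to an LLM-based qualitative assessment."""
    findings = []
    if weight > RULES["max_single_issuer_weight"]:
        findings.append(("fail", "concentration limit breached"))
    if cash_ratio < RULES["min_cash_buffer"]:
        findings.append(("fail", "liquidity buffer below floor"))
    status = "fail" if findings else "pass"
    # Positions within 10% of the concentration cap get LLM review.
    needs_llm = (status == "pass"
                 and weight > 0.9 * RULES["max_single_issuer_weight"])
    return {"status": status, "findings": findings,
            "escalate_to_llm": needs_llm}

result = check_position(weight=0.12, cash_ratio=0.03)
```

Because the hard limits fire before any model call, a quantitative breach can never be masked by an LLM's qualitative judgment, which is the property the audit trail depends on.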

Security and Observability: Valkey, a Redis-compatible key-value store, manages authentication tokens and session state. All user interactions require authenticated sessions, and the system logs every query, analysis result, and recommendation for regulatory audit purposes. Prometheus collects system metrics from all components, enabling operations teams to monitor GPU utilization, inference latency, and pipeline throughput in real time.

Dell's Integrated Dell Remote Access Controller (iDRAC) provides out-of-band management for the underlying server infrastructure. Operations teams can monitor hardware health, thermal status, and power consumption independent of the operating system. iDRAC's secure remote management capabilities enable infrastructure teams to perform firmware updates, diagnose issues, and restore service without physical access to the data center, a critical requirement for institutions with distributed operations or limited on-site staff.

The entire architecture operates within the institution's network perimeter. No data leaves the secure environment, and all model inference executes locally on the dedicated hardware. This sovereign deployment model satisfies regulatory requirements that prohibit transmission of sensitive portfolio and client data to external cloud services.

Infrastructure Foundation

Sovereign AI infrastructure for regulated financial services must satisfy three requirements on a single platform: run large-scale models without aggressive compression, keep all data within a single physical boundary for simplified audit controls, and sustain peak throughput during extended market stress events. The Dell PowerEdge XE9785L server with AMD Instinct MI355X accelerators addresses all three, delivering the compute density and memory capacity that IPRA's multi-agent architecture requires for continuous inference, document ingestion, and compliance monitoring.

Component | Specification
Server | Dell PowerEdge XE9785L
CPU | AMD EPYC 9965 (192 cores, 384 threads)
System Memory | 2.95 TB DDR5
GPU Accelerators | 8x AMD Instinct MI355X Accelerators
GPU Memory | 2.3 TB aggregate HBM3e (288 GB per accelerator)
Cooling | Direct Liquid Cooling

Table 5 | Dell PowerEdge XE9785L Hardware Configuration

Direct liquid cooling enables the Dell PowerEdge XE9785L to sustain full GPU power draw across all eight AMD Instinct MI355X accelerators without thermal throttling, even during extended market stress events that demand continuous high-concurrency inference. For regulated institutions running 24/7 risk monitoring, this thermal headroom translates directly into predictable, uninterrupted performance at peak loads.

The AMD Instinct MI355X accelerator provides 288 GB of HBM3e memory per accelerator, a 50 percent increase over the 192 GB available on the previous-generation MI300X. This expanded capacity gives institutions a strategic choice: deploy larger, more capable models for deeper analysis, or run more replicas of production models for higher throughput. IPRA leverages both options simultaneously. The 235-billion-parameter reasoning and vision-language models consume significant memory but deliver the analytical depth required for complex regulatory decisions. Meanwhile, multiple Magistral-Small replicas handle high-volume summarization tasks in parallel. On the MI300X platform, this combination would require aggressive quantization or distribution across multiple servers, adding latency and operational complexity that regulated institutions seek to avoid.

With 2.3 TB of aggregate GPU memory, the XE9785L hosts all four production models simultaneously: Qwen3-VL-235B for document understanding, Qwen3-235B for compliance reasoning, GPT-OSS-120B for entity extraction, and multiple Magistral-Small-2507 replicas for summarization. This consolidation onto a single server streamlines procurement, reduces data center footprint, and eliminates inter-server communication latency. All data remains within one physical boundary, reducing the attack surface and simplifying the access controls that compliance teams must audit.

The AMD EPYC 9965 processor handles CPU-bound preprocessing tasks, including Kafka event streaming, Docling document parsing, and knowledge graph updates. During peak ingestion periods, when regulatory agencies publish multiple notices or earnings-season filings arrive in concentrated bursts, the 192-core processor prevents CPU bottlenecks from constraining pipeline throughput. Direct liquid cooling maintains optimal temperatures during extended operation, enabling consistent performance during prolonged market stress events.


Figure 4 | 8x AMD Instinct MI355X Accelerator Model Map

How IPRA Compares

The following table summarizes how IPRA's architecture addresses the limitations of legacy batch systems and cloud-based AI platforms in regulated financial environments.

Capability | Legacy Batch Systems | Cloud AI Platforms | IPRA on Dell PowerEdge
Processing Latency | Multi-hour batch cycles | Near real-time, subject to network latency | Continuous, on-premises, sub-minute signal processing
Data Sovereignty | On-premises, fully controlled | Introduces data residency considerations | On-premises, all data stays within institutional boundaries
Compliance Monitoring | Periodic, post-analysis checks | Varies by provider; audit trail gaps common | Continuous, integrated with every analysis cycle
Audit Trail | Manual assembly from multiple systems | Provider-dependent, may limit traceability | Automated, source-linked, regulator-ready
Unstructured Data Processing | Limited or manual | Strong, but subject to data residency constraints | GPU-accelerated NLP on-premises with large-scale models
Scalability | Requires multi-system expansion | Elastic, but introduces sovereignty risk | Linear scaling on a single server; horizontal expansion available
Infrastructure Footprint | Multiple servers, disparate software stacks | Cloud tenancy, shared infrastructure | Single Dell PowerEdge XE9785L server with consolidated model deployment

Table 6 | Capability Comparison Across Risk Intelligence Approaches

Performance Benchmarking

The benchmarking program validates IPRA's ability to deliver continuous risk intelligence at institutional scale on a single Dell PowerEdge XE9785L server. Testing focused on the two key throughput metrics that matter most to risk operations: how many holdings the system can analyze per minute and how many compliance checks it can execute concurrently.

Methodology

The benchmark simulates a realistic stress scenario based on the March 2023 Silicon Valley Bank collapse, a contagion event that spread rapidly across technology stocks, regional banks, and stablecoins. The benchmarking team selected this scenario because it exercises every component of the pipeline: filings trigger credit reassessment, news drives sentiment shifts, and cross-sector contagion forces the system to evaluate correlated risks across multiple portfolio positions simultaneously.

Each test portfolio contains approximately 26 holdings drawn from 10 base institutional portfolio templates spanning equities, corporate debt, commodities, and forex positions. The system scales load by replicating these portfolios and processing them concurrently. Tests sweep across concurrency levels from 300 to 850 simultaneous portfolios, with each level running for 45 minutes while metrics are sampled every 30 seconds. A drain and cooldown phase separates each run to ensure clean measurement.

Two primary metrics capture IPRA's operational throughput:

Holdings analyzed per minute measures the count of individual ticker positions that receive a complete four-dimensional risk assessment covering credit, market, liquidity, and regulatory risk. Each holding analysis draws on data from PostgreSQL (portfolio weights, financial fundamentals, historical prices) and the knowledge graph (scenario-specific news, regulatory filings). The system computes quantitative risk scores using Altman Z-Score, Value at Risk, Conditional VaR, GARCH volatility modeling, and the Amihud illiquidity ratio, then enriches results with knowledge-graph-sourced news sentiment.
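Two of the metrics named above, historical Value at Risk and Conditional VaR (expected shortfall), can be sketched in plain Python. The return series and confidence level below are illustrative, not benchmark data; production IPRA computes these alongside Altman Z-Score, GARCH volatility, and the Amihud ratio.

```python
def historical_var_cvar(returns: list[float], confidence: float = 0.95):
    """Historical-simulation VaR and CVaR at the given confidence
    level. Losses are returned as positive numbers."""
    losses = sorted(-r for r in returns)      # losses, ascending
    cut = min(int(confidence * len(losses)), len(losses) - 1)
    var = losses[cut]                         # tail-boundary loss
    tail = losses[cut:]
    cvar = sum(tail) / len(tail)              # mean loss in the tail
    return var, cvar

# Illustrative daily returns for one holding; not benchmark data.
rets = [0.01, -0.02, 0.005, -0.035, 0.012, -0.01, 0.003, -0.05,
        0.02, -0.004, 0.007, -0.015, 0.001, -0.025, 0.009, -0.002,
        0.004, -0.006, 0.011, -0.008]
var90, cvar90 = historical_var_cvar(rets, 0.90)
```

CVaR is always at least as large as VaR at the same confidence level, since it averages over the losses beyond the VaR boundary.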

Compliance checks per minute measures the count of individual holdings (tickers) that complete the full compliance evaluation pipeline. Each completed check encompasses 20 to 30 individual rule evaluations per holding, including deterministic threshold assessments and LLM-based qualitative evaluations. The compliance pipeline retrieves evidence from the knowledge graph, applies deterministic threshold checks and LLM-based qualitative assessments, and resolves conflicts before recording each result with full audit metadata.

Dell's Integrated Dell Remote Access Controller (iDRAC) provided out-of-band hardware monitoring throughout all test runs, enabling the benchmarking team to verify thermal stability, GPU power draw, and system health independent of the operating system during each 45-minute stress cycle.

Configuration | Details
Test Scenario | Tech and Banking Contagion (SVB Collapse)
Base Portfolios | 10 institutional portfolio templates
Holdings per Portfolio | ~26 assets
Concurrency Sweep | 300 to 850 (incrementing by 50)
Test Duration per Level | 45 minutes (90 samples at 30-second intervals)
Reasoning Model | amd/Qwen3-235B-A22B-Thinking-2507-ptpc
Scenario Playback | Timestamped news events at 100x speed via Kafka

Table 7 | Benchmarking Configuration

Results

On the Dell PowerEdge XE9785L server with eight AMD Instinct MI355X accelerators, IPRA demonstrated consistent, near-linear throughput scaling from 300 to 850 concurrent portfolios. Each concurrency level ran for 45 minutes with metrics sampled every 30 seconds.


Figure 5 | Holdings Analyzed per Minute by Concurrent Portfolio Count


Figure 6 | Compliance Checks per Minute by Concurrent Portfolio Count

The data reveals three important operational characteristics of the MI355X platform under continuous FSI workloads:

  • Linear throughput scaling through the operational range. Holdings throughput scales from 558 holdings/min at 300 concurrent portfolios to 1,415 holdings/min at 850 concurrent portfolios, representing a 2.54x increase as load grows 2.83x. Compliance throughput follows a similar trajectory, scaling from 53.35 to 108.90 checks/min. This near-linear scaling behavior confirms that the MI355X platform maintains consistent inference performance without degradation as concurrent workload increases.
  • Stable generation rate across load conditions. The Max Generation Tokens/sec column remains between 1,339 and 1,599 tokens/sec regardless of concurrency level. This stability reflects the input-dominated nature of FSI risk workloads: each holding evaluation requires extensive context ingestion (news, filings, portfolio data, compliance rules) before generating a comparatively short structured output. The generation phase is not the throughput bottleneck.
  • Continued headroom at 850 concurrent portfolios. The throughput curve does not show signs of saturation at the highest tested concurrency level. Holdings throughput increased by 9.8% between the 800 and 850 data points (1,288.75 to 1,415.39), suggesting the MI355X node retains additional capacity beyond the tested range. This headroom provides operational margin for production deployments that experience periodic load spikes during market events.

An important distinction exists between raw inference capacity and application-level throughput. When comparing the MI355X and MI300X platforms under the same IPRA workload configuration, application-level metrics such as holdings analyzed per minute and compliance checks per minute show only marginal differences, typically 1 to 3 percent. This result reflects the current pipeline architecture, not GPU capability. IPRA's orchestration layer dispatches a fixed batch of tasks per cycle and waits for the full batch to complete before initiating the next. Both GPU platforms therefore receive work at the same controlled rate, and the faster MI355X simply completes its assigned inference sooner. The raw inference benchmarks presented in the Generational Performance section confirm the MI355X delivers 3.3x to 4.3x more tokens per second at the model level. This gap represents built-in headroom to add more agents, increase dispatch rates, or deploy larger models without adding servers.
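The dispatch-limited behavior described above can be modeled with a toy calculation: when the orchestrator issues a fixed batch and waits out a fixed cycle before the next dispatch, a faster GPU only shortens its idle time within the cycle, leaving application-level throughput unchanged. All numbers below are illustrative assumptions, not measured values.

```python
def app_throughput(batch_size: int, dispatch_cycle_s: float,
                   gpu_tokens_per_s: float, tokens_per_task: float) -> float:
    """Tasks per minute when the orchestrator waits out a fixed
    dispatch cycle; GPU speed beyond the cycle becomes idle time."""
    inference_s = batch_size * tokens_per_task / gpu_tokens_per_s
    cycle = max(dispatch_cycle_s, inference_s)
    return 60 * batch_size / cycle

# Hypothetical: identical 10s dispatch cycle, two GPU generations.
slow_gpu = app_throughput(100, 10.0, gpu_tokens_per_s=1500, tokens_per_task=60)
fast_gpu = app_throughput(100, 10.0, gpu_tokens_per_s=6000, tokens_per_task=60)
```

In this model both GPUs deliver the same tasks per minute; the faster GPU's advantage only surfaces once the dispatch rate, agent count, or batch size is raised, which is the headroom argument made above.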

To contextualize these figures for portfolio management operations: at 850 concurrent portfolios with approximately 26 holdings each, the system monitors 22,100 individual asset positions simultaneously. At 1,415 holdings evaluated per minute, the system completes a full risk assessment pass across all monitored positions approximately every 15.6 minutes. For a firm managing 500 institutional portfolios, this delivers multiple complete risk cycles per hour, replacing the once-daily batch assessment that legacy systems provide. At cloud market-average rates, this translates to approximately $37 per portfolio per month for always-on, multi-dimensional risk monitoring, comparable to many batch-only platforms that provide far less frequent coverage.

All results were achieved on a single Dell PowerEdge XE9785L server, confirming that the platform delivers institutional-scale risk intelligence without multi-server expansion.

Cloud Infrastructure Cost Context

The following pricing analysis provides directional context for infrastructure planning, not a complete total cost of ownership (TCO) model. The estimates focus on GPU compute costs, which represent the dominant variable cost component for AI inference workloads. A production TCO assessment would additionally account for storage, networking, data-feed licensing, egress charges, support contracts, staffing, and utilization variability.

MI355X 8-GPU Node Pricing

Cloud GPU pricing for MI355X 8-GPU nodes varies based on provider, commitment term, and availability tier. The table below summarizes current market pricing for an 8-GPU MI355X node operating continuously (8,760 hours per year). Pricing reflects publicly listed rates from named providers as of February 2026.

Pricing Tier | Per-GPU/hr | 8-GPU Node/hr | Annual Cost (24/7)
Reserved (48-month) | $2.29[4] | $18.32 | ~$160,000
On-Demand (Lowest Listed) | $2.95[5] | $23.60 | ~$207,000
Market Average (4 Providers) | $5.45[6] | $43.60 | ~$382,000
Full On-Demand (OCI BM.GPU.MI355X.8) | $8.60[7] | $68.80 | ~$603,000

Table 8 | MI355X Cloud GPU Pricing (8-GPU Node, Annual 24/7 Operation)

For most organizations, a 1-year enterprise commitment falls in the $250,000 to $400,000 per year range. These rates reflect the current cloud GPU market for MI355X accelerators and are subject to change as additional providers bring MI355X capacity online. Four providers currently offer MI355X nodes: Vultr, Oracle Cloud, TensorWave (custom pricing), and Crusoe (reserved capacity). Organizations should request current quotes from preferred providers at the time of procurement.

Cost per Portfolio

A meaningful way to evaluate infrastructure cost is on a per-portfolio basis. This allows direct comparison against existing risk platform licensing fees and translates cloud GPU economics into terms that align with how financial institutions budget for risk technology.

Calculation methodology: Per-portfolio cost divides the annual node cost by the peak tested capacity of 850 concurrent portfolios. For example, at the market average rate of $382,000 per year: $382,000 / 850 portfolios = $449 per portfolio per year, or approximately $37 per portfolio per month. This calculation assumes continuous 24/7 operation with the node running at peak portfolio capacity. Actual per-portfolio cost will vary based on the number of portfolios monitored, utilization patterns, and the pricing tier selected.
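The per-portfolio arithmetic above can be reproduced in a few lines for verification:

```python
def per_portfolio(annual_node_cost: float, portfolios: int = 850):
    """Annual and monthly cost per monitored portfolio, assuming
    continuous 24/7 operation at peak tested capacity."""
    annual = annual_node_cost / portfolios
    return annual, annual / 12

# Market-average tier from Table 8: ~$382,000/year for the 8-GPU node.
annual_cost, monthly_cost = per_portfolio(382_000)
```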

Pricing Tier | Annual Cost per Portfolio | Monthly Cost per Portfolio
Reserved (48-month) | ~$188 | ~$16
Enterprise (1-Year) | $294 to $470 | $25 to $39
Full On-Demand | ~$709 | ~$59

Table 9 | Per-Portfolio Cost at 850 Concurrent Portfolios

These per-portfolio costs cover continuous, 24/7 risk monitoring with multi-dimensional analysis across credit, market, liquidity, and regulatory risk. Compared to traditional FSI risk analytics platforms that license per-seat or per-portfolio and typically deliver only batch-mode analysis, the GPU-accelerated approach provides a fundamentally different value proposition: always-on monitoring at a comparable or lower per-unit cost.

Scaling Considerations

Organizations that monitor more than 850 portfolios can deploy additional MI355X nodes with near-linear cost scaling. Two 8-GPU nodes would support approximately 1,700 concurrent portfolios at double the infrastructure cost. The Dell PowerEdge XE9785L's rack-dense form factor enables efficient scaling within existing data center footprints, and the AMD ROCm software stack supports multi-node orchestration without proprietary licensing overhead.

For organizations considering on-premises deployment to meet data residency requirements, the capital expenditure model changes but the per-portfolio economics remain favorable. A 3-year amortization of on-premises MI355X infrastructure typically aligns with or improves upon the reserved cloud pricing tier, with the additional benefit of eliminating recurring cloud egress and storage fees.

Generational Performance Gains: MI355X vs. MI300X

The generational comparison quantifies what upgrading from MI300X to MI355X accelerators delivers, benchmarked on Dell PowerEdge platforms (MI355X on the XE9785L, MI300X on the XE9680). Three results define the upgrade value: 4.3x peak throughput on short-form alert workloads at 8,192 concurrent requests, 64x more concurrent compliance sessions within the 100ms TPOT SLA (8,192 vs. 128 on MI300X), and up to 2.2x more tokens per GPU watt at high concurrency. The detailed tables and charts that follow provide the supporting evidence across workload types and concurrency levels.

Testing used the Qwen3-235B-A22B-Thinking-2507-ptpc reasoning model, the same model IPRA deploys for compliance assessment and risk analysis, across a sweep of input/output token lengths and concurrency levels ranging from 1 to 8,192 simultaneous requests. All runs with error rates exceeding 10 percent were excluded. GPU total power is computed as the sum of all eight individual GPU power sensor readings, representing the authoritative power metric for efficiency calculations.

The following table holds concurrency fixed at 4,096 simultaneous requests and varies the input/output token configuration to isolate how workload type affects the generational advantage.

| Config (In/Out) | MI355X tok/s | MI300X tok/s | MI355X GPU Power (W) | MI300X GPU Power (W) | Throughput Gain | FSI Use Cases |
| --- | --- | --- | --- | --- | --- | --- |
| 128/128 | 39,930 | 9,568 | 9,146 | 4,573 | 4.2x | Trade alerts, chatbots |
| 2,048/128 | 7,438 | 2,295 | 10,260 | 4,737 | 3.2x | Compliance, KYC, AML |
| 128/2,048 | 19,414 | 9,041 | 8,507* | 5,583* | 2.1x | Research notes, summaries |
| 2,048/2,048 | 8,810 | 6,302 | 8,693* | 5,695* | 1.4x | Risk reports, stress tests |

Table 10 | Qwen3-235B Raw Inference: MI355X vs. MI300X at 4,096 Concurrent Requests

* GPU total power values for the 128/2,048 and 2,048/2,048 configurations are measured at 1,024 concurrent requests (the highest available data point for these workloads), while throughput figures reflect 4,096 concurrent requests. Because power draw typically increases with concurrency, the actual GPU power at 4,096 concurrent would likely be higher than shown, meaning the tokens-per-watt figures for these two rows may overstate efficiency.


Figure 8 | Raw Inference Throughput Scaling: MI355X vs. MI300X

Figure 8 illustrates the concurrency scaling behavior across 1 to 4,096 simultaneous requests. While both platforms scale with increasing concurrency, the MI355X sustains materially higher token throughput, particularly in short-context, high-parallel workloads characteristic of real-time risk scoring. The divergence at higher concurrency levels reflects improved HBM3e bandwidth and compute efficiency.

Peak Throughput: Scaling Under Concurrent Load

On the Dell PowerEdge XE9785L, the most consequential difference between accelerator generations is sustained throughput under concurrent load. The MI300X saturates at approximately 1,024 concurrent requests on short-form workloads and delivers no additional throughput beyond that point, while the MI355X continues scaling to 8,192 concurrent requests and beyond.

| Token Config | MI355X Peak tok/s | At Concurrent | MI300X Peak tok/s | At Concurrent | Gain | FSI Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| 128/128 | 43,801 | 8,192 | 10,143 | 1,024 | 4.3x | Alerts, chatbots |
| 128/2,048 | 19,916 | 8,192 | 10,490 | 1,024 | 1.9x | Research notes |
| 2,048/128 | 7,709 | 8,192 | 2,352 | 8,192 | 3.3x | Compliance, KYC |
| 2,048/2,048 | 8,811 | 4,096 | 6,566 | 1,024 | 1.3x | Risk reports |

Table 11 | Peak Throughput by Workload: MI355X vs. MI300X

The concurrency scaling behavior on the short-form 128/128 workload illustrates this dynamic clearly. The MI300X reaches its throughput ceiling of approximately 10,143 tokens per second at 1,024 concurrent requests and delivers no meaningful increase beyond that point. The MI355X continues scaling through 2,048, 4,096, and 8,192 concurrent sessions, reaching 43,801 tokens per second at peak. In a real-time FSI environment, trade alerts, client notification engines, and fraud signals generate bursty concurrent demand. This scaling headroom determines whether the system responds in real time or queues requests during peak market hours.


Figure 9 | Concurrency Scaling: 128/128 Short-Form Workload

At the standard 1,024 concurrent comparison point, the MI355X already delivers meaningful throughput advantages across all workload types.

| Config | MI355X tok/s @ 1,024 | MI300X tok/s @ 1,024 | Throughput Gain | FSI Use Case |
| --- | --- | --- | --- | --- |
| 128/128 | 16,097 | 10,143 | 1.6x | Trade alerts, signals |
| 128/2,048 | 15,012 | 10,490 | 1.4x | Research summaries |
| 2,048/128 | 7,328 | 2,235 | 3.3x | Compliance, KYC, AML |
| 2,048/2,048 | 8,559 | 6,566 | 1.3x | Risk reports |

Table 12 | Throughput at 1,024 Concurrent: Like-for-Like Comparison

Real-Time Latency: Meeting the 100ms SLA Under Load

For production FSI systems deployed on the Dell PowerEdge XE9785L, throughput alone does not determine deployment readiness. Per-token latency under concurrent load is the metric that separates a system that can serve a compliance dashboard from one that cannot. Two metrics capture this behavior.

TPOT (Time Per Output Token) P95 measures the 95th-percentile milliseconds between consecutive generated tokens. This is the production-grade SLA measure. If TPOT P95 remains below 100 milliseconds, 95 percent of all users at that concurrency level experience real-time-feeling responses.

TTFT (Time to First Token) P95 measures how long users wait before the response begins. This metric is dominated by input processing (prefill latency) and is critical for user-facing applications where perceived responsiveness determines adoption.

A TPOT P95 below 100 milliseconds is the standard threshold for real-time AI in financial services, covering trade alert generation, pre-trade risk checks, and live compliance screening.
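As a minimal sketch of how a TPOT P95 check against the 100ms SLA could be computed from per-token emission timestamps: the function name and sample data below are illustrative assumptions, not part of the benchmark harness used in this paper.

```python
# Sketch: TPOT P95 from a request's per-token emission timestamps (seconds).
import statistics

SLA_MS = 100.0


def tpot_p95_ms(token_timestamps_s):
    """P95 of inter-token gaps in ms, excluding the wait for the first token (TTFT)."""
    gaps_ms = [
        (later - earlier) * 1000.0
        for earlier, later in zip(token_timestamps_s, token_timestamps_s[1:])
    ]
    return statistics.quantiles(gaps_ms, n=20)[-1]  # 19th cut point = 95th percentile


# A request emitting one token every 80 ms stays comfortably within the SLA.
timestamps = [i * 0.080 for i in range(50)]
print(f"TPOT P95: {tpot_p95_ms(timestamps):.1f} ms (SLA: {SLA_MS:.0f} ms)")
```

In production, the P95 would be taken across all inter-token gaps observed at a given concurrency level, which is how the per-workload limits in the table below are determined.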

| Workload | MI300X Max Conc. within 100ms | MI300X TPOT at Limit | MI355X Max Conc. within 100ms | MI355X TPOT at Limit | Capacity Advantage |
| --- | --- | --- | --- | --- | --- |
| 128/128 (Alerts) | 1,024 | 97.8ms | 2,048 | 83.8ms | 2x more concurrent |
| 128/2,048 (Research) | 1,024 | 93.8ms | 1,024 | 63.1ms | Same concurrency, 32% faster TPOT |
| 2,048/128 (Compliance) | 128 | 82.3ms | 8,192 | 99.6ms | 64x more concurrent |
| 2,048/2,048 (Risk Reports) | 512 | 99.6ms | 4,096 | 67.6ms | 8x more concurrent |

Table 13 | Maximum Concurrent Sessions Within 100ms TPOT P95 SLA

The latency metrics in this section measure raw model inference responsiveness (TTFT and TPOT), not end-to-end workflow completion time. In production, a single IPRA risk assessment involves multiple sequential inference calls, knowledge graph queries, and data retrievals. End-to-end workflow completion ranges from approximately 100 to 460 seconds depending on portfolio complexity and concurrency level.

TPOT P95 Detail: 2,048/128 Compliance Screening

The compliance workload (2,048 input / 128 output) presents the most decisive contrast. This token profile mirrors production compliance systems: the model reads a long transaction record, contract, or regulatory filing and produces a short classification or verdict. The MI300X breaches the 100ms TPOT threshold at just 256 concurrent sessions and exceeds the SLA by 6x at higher concurrency levels. The MI355X holds under 100ms all the way through 8,192 concurrent sessions, delivering 64x more capacity within the same latency guarantee.

| Concurrent Sessions | MI300X TPOT P95 | MI355X TPOT P95 | Status |
| --- | --- | --- | --- |
| 128 | 82.3ms | 41.3ms | Both within SLA |
| 256 | 124.4ms (BREACH) | 52.7ms | MI300X breaches SLA |
| 512 | 237.7ms (BREACH) | 75.5ms | MI300X 2.4x over SLA |
| 1,024 | 455.6ms (BREACH) | 99.2ms | MI300X 4.6x over SLA |
| 2,048 | 607.5ms (BREACH) | 99.5ms | MI300X 6.1x over SLA |
| 4,096 | 608.0ms (BREACH) | 99.5ms | MI300X 6.1x over SLA |
| 8,192 | 605.7ms (BREACH) | 99.6ms | MI300X 6.1x over SLA |

Table 14 | TPOT P95 Detail: 2,048/128 Compliance Screening Workload

The operational implication is direct. Before the trading floor has fully loaded its morning queue, the MI300X is already at its latency limit on compliance workloads. The MI355X holds the same SLA at 8,192 concurrent sessions, serving the full trading day's compliance volume without queuing or degradation.

TPOT P95 Detail: 128/128 Trade Alerts and Short-Form

| Concurrent Sessions | MI300X TPOT P95 | MI355X TPOT P95 | Note |
| --- | --- | --- | --- |
| 512 | 69.6ms | 47.7ms | MI355X 32% faster |
| 1,024 | 97.8ms (at edge) | 59.4ms | MI300X at limit; MI355X has 40ms headroom |
| 2,048 | 299.9ms (BREACH) | 83.8ms | MI355X still within SLA |
| 4,096 | 406.7ms (BREACH) | 100.2ms | MI355X at limit |
| 8,192 | 416.1ms (BREACH) | 208.1ms (BREACH) | Both exceed SLA at 8,192 concurrent |

Table 15 | TPOT P95 Detail: 128/128 Trade Alert Workload

Time to First Token: 2,048/128 Compliance Workload

In addition to per-token latency, the MI355X delivers responses significantly faster at the prefill stage. Time to First Token determines how quickly a compliance analyst sees the system begin responding. This metric is critical for user-perceived responsiveness in interactive review interfaces where analysts process hundreds of documents per shift.


Figure 10 | TTFT P95: 2,048/128 Compliance Workload (Shorter is Better)

At 512 concurrent sessions, a compliance analyst on the MI300X waits over 17 seconds before seeing the first token of a response. On the MI355X, the same analyst receives the first token in under 4 seconds. For interactive compliance review workflows where analysts process hundreds of documents per shift, this difference compounds into hours of recovered productivity across a team.
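A back-of-envelope calculation makes the productivity claim concrete. The 17-second and 4-second first-token waits come from the TTFT comparison at 512 concurrent sessions above; the 300-documents-per-shift workload is our own illustrative assumption (standing in for "hundreds of documents per shift"), not a measured figure.

```python
# Illustrative estimate of first-token wait time recovered per analyst per shift.
docs_per_shift = 300               # assumed analyst workload (illustrative)
ttft_mi300x_s = 17.0               # first-token wait at 512 concurrent, MI300X
ttft_mi355x_s = 4.0                # first-token wait at 512 concurrent, MI355X

saved_minutes = docs_per_shift * (ttft_mi300x_s - ttft_mi355x_s) / 60
print(f"~{saved_minutes:.0f} minutes of waiting recovered per analyst per shift")
# → ~65 minutes; across a team of analysts this compounds into hours per day.
```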

For any FSI production system requiring both high concurrency and guaranteed per-token latency, the combination of TPOT and TTFT data, measured on a single Dell PowerEdge XE9785L, is unambiguous. The MI355X holds the 100ms TPOT SLA at 64x more concurrent sessions while also delivering first-token responses 3.6x to 4.7x faster.

GPU Efficiency: Tokens per Watt at Scale

On the Dell PowerEdge XE9785L, GPU efficiency at low to moderate concurrency is broadly comparable between the MI355X and MI300X. The decisive difference emerges at high concurrency, where MI300X throughput plateaus but its GPU power consumption remains elevated. The MI355X continues scaling throughput while power grows modestly, delivering 2.2x more tokens per GPU watt at 8,192 concurrent requests on the 128/128 workload.


Figure 11 | Tokens-Per-GPU-Watt Scaling: 128/128 Short-Form Workload

The pattern is consistent: the MI355X draws more absolute GPU power than the MI300X at every concurrency level. The efficiency advantage emerges because throughput scales faster than power consumption. At 1,024 concurrent, both platforms deliver approximately the same tokens per watt. At 2,048 concurrent and above, the MI300X's throughput plateaus while its power draw remains largely unchanged, creating a widening efficiency gap that reaches 2.2x at peak concurrency.

| Config | MI355X tok/s | MI355X GPU W | MI355X tok/W | MI300X tok/s | MI300X GPU W | MI300X tok/W | Efficiency (MI355X vs. MI300X) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128/128 | 16,097 | 8,564 | 1.88 | 10,143 | 5,460 | 1.86 | ~same (1.01x) |
| 128/2,048 | 15,012 | 8,507 | 1.77 | 10,490 | 5,583 | 1.88 | ~same (0.94x) |
| 2,048/128 | 7,328 | 10,266 | 0.714 | 2,235 | 4,701 | 0.475 | 1.50x MI355X |
| 2,048/2,048 | 8,559 | 8,693 | 0.985 | 6,566 | 5,695 | 1.153 | ~same (0.85x) |

Table 16 | Tokens-Per-GPU-Watt at 1,024 Concurrent: All Configurations

FSI data centers face power density constraints, rack space limits, and sustainability reporting requirements. The MI355X efficiency story centers on what happens beyond 1,024 concurrent, the inflection point where the MI300X stops delivering additional value but continues consuming similar GPU power. From 2,048 concurrent upward, the MI355X extracts 1.6x to 2.2x more inference work from every watt of GPU power.
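The tokens-per-GPU-watt figures follow directly from the measured throughput and power numbers. A minimal sketch, using two rows of values copied from Table 16 (the dictionary layout is ours):

```python
# Reproduce Table 16's tokens-per-GPU-watt figures from throughput / power.
measurements = {  # config: (mi355x_tok_s, mi355x_watts, mi300x_tok_s, mi300x_watts)
    "128/128": (16_097, 8_564, 10_143, 5_460),
    "2,048/128": (7_328, 10_266, 2_235, 4_701),
}

for config, (t355, w355, t300, w300) in measurements.items():
    eff355, eff300 = t355 / w355, t300 / w300
    print(f"{config}: MI355X {eff355:.2f} tok/W, MI300X {eff300:.2f} tok/W, "
          f"ratio {eff355 / eff300:.2f}x")
# The 2,048/128 row reproduces the ~1.50x efficiency advantage reported above.
```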

Compliance and Risk Analysis: The 2,048/128 Workload

The 2,048 input / 128 output token profile mirrors how AI operates in production FSI compliance systems on the Dell PowerEdge XE9785L: read a long transaction record, contract, or regulatory filing and produce a short classification, flag, or verdict. This workload exercises the prefill-heavy compute path that distinguishes compliance inference from general-purpose chatbot workloads. The MI355X delivers 3.3x more throughput on this profile, while the MI300X reaches a hard ceiling of approximately 2,352 tokens per second from 512 concurrent requests onward.


Figure 12 | Concurrency Scaling - Throughput (2,048/128 Compliance Workload)


Figure 13 | Concurrency Scaling - Tokens per GPU Watt (2,048/128 Compliance Workload)

The MI300X's throughput curve on this workload is effectively flat from 512 concurrent onward, fluctuating between 2,075 and 2,352 tokens per second regardless of how many additional requests arrive. The MI355X continues scaling from 3,050 tokens per second at 128 concurrent to 7,709 at 8,192, reaching a plateau only at the highest tested concurrency levels. The throughput advantage holds steady at 3.1x to 3.3x across the production-relevant concurrency range.

Operational Capacity: Daily Document Volume

Translating the Dell PowerEdge XE9785L's peak throughput into operational terms clarifies the infrastructure planning implications. The projections below assume 2,000 tokens per document and continuous operation, representative of production compliance screening pipelines.


Figure 14 | Estimated Document Processing Capacity at Peak Throughput

Based on MI355X peak of 7,709 tok/s and MI300X peak of 2,352 tok/s on the 2,048/128 workload. Assumes 2,000 tokens per document and continuous operation.
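The projection reduces to simple arithmetic. A sketch using the stated peak throughputs and the 2,000-tokens-per-document assumption (the function name is ours):

```python
# Sketch of the document-capacity projection behind Figure 14.
TOKENS_PER_DOC = 2_000     # stated assumption for compliance documents
SECONDS_PER_DAY = 86_400   # continuous 24/7 operation


def docs_per_day(tokens_per_second: float) -> int:
    """Documents processed per day at sustained throughput."""
    return int(tokens_per_second * SECONDS_PER_DAY / TOKENS_PER_DOC)


print(docs_per_day(7_709))  # MI355X peak on 2,048/128 → ~333,000 docs/day
print(docs_per_day(2_352))  # MI300X peak on 2,048/128 → ~101,600 docs/day
```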

The compliance workload advantage extends beyond raw throughput. Combining the data from this section with the latency analysis: the MI355X delivers 3.3x more compliance checks per second, produces first-token responses 4.0x to 4.7x faster, and maintains TPOT P95 under 100ms at 64x more concurrent sessions. For end-of-day regulatory batch runs, this means the system processes the full document queue without degradation. For interactive compliance review, analysts see responses begin in under 4 seconds instead of over 17. The MI300X's hard ceiling of approximately 2,352 tokens per second means compliance teams face a capacity wall that can only be addressed by adding more nodes. The MI355X processes 3.3x more compliance decisions at a single node, with 50 percent better tokens per GPU watt on this workload.

Conclusion

Financial institutions face a widening gap between the speed of market events and the capacity of legacy systems to detect, analyze, and act on the risks those events create. Batch processing, fragmented data sources, and bolt-on compliance checks leave risk officers hours behind during volatile conditions. At the same time, regulatory expectations for auditability, data sovereignty, and continuous monitoring continue to intensify.

The Institutional Portfolio Risk Agent closes this gap by combining continuous data ingestion, GPU-accelerated AI analysis, and automated compliance monitoring on a single on-premises server. Purpose-built for the requirements of regulated financial environments, IPRA delivers four measurable advances over legacy approaches.

  • Real-time risk visibility: IPRA eliminates hours-long detection delays. The platform continuously processes SEC filings, breaking news, macroeconomic indicators, and regulatory notices, linking each signal to portfolio positions through a knowledge graph that captures entity relationships in real time. Risk officers see exposure changes as they develop, not hours after the fact.
  • Continuous, embedded compliance: IPRA transforms mandate checks from a periodic checkpoint into a continuous process. The Compliance Monitor Agent evaluates every holding against institutional mandates and regulatory rules as new data arrives, maintaining a complete audit trail with source evidence, timestamps, and decision logic. Benchmarking confirmed the system scales to 850 concurrent portfolios, sustaining 109 compliance assessments per minute at peak load while preserving full audit integrity.
  • Sovereign, single-server deployment: IPRA keeps sensitive data within institutional boundaries. The entire platform, including four large-scale AI models, runs on a single Dell PowerEdge XE9785L server with 8x AMD Instinct MI355X accelerators. The 2.3 TB of aggregate GPU memory enables simultaneous deployment of 235-billion parameter reasoning and vision-language models without distributing workloads across multiple servers or relying on external cloud infrastructure.
  • Generational headroom with MI355X: Infrastructure testing confirms the platform is built for growth, not just for today's workload. At the raw inference level, the MI355X accelerator delivered up to 4.3x higher peak token throughput, maintained approximately 100ms per-token latency at 64x more concurrent sessions on compliance workloads, and achieved up to 2.2x better tokens-per-GPU-watt efficiency compared to the MI300X. This additional inference capacity represents forward-looking headroom: institutions can increase dispatch concurrency, deploy larger models, or add analytical agents within the same single-server footprint as requirements evolve.

The infrastructure economics reinforce the performance story. At peak capacity of 850 concurrent portfolios, GPU compute costs alone translate to approximately $25 to $39 per portfolio per month under typical enterprise cloud commitments. The MI355X further improves this equation at scale: from 2,048 concurrent requests upward, the accelerator extracts 1.6x to 2.2x more inference work per GPU watt than the MI300X, delivering more AI capacity per rack unit without additional power infrastructure.

For IT directors, CTOs, and infrastructure architects evaluating sovereign AI for financial services, the Dell PowerEdge XE9785L with AMD Instinct MI355X accelerators provides a validated, single-server foundation for continuous portfolio risk intelligence, deployable today and architected for tomorrow's workloads.

Addendum

System Under Test

| Type | Details |
| --- | --- |
| Model | Dell PowerEdge XE9785L server |
| No. of servers | 1 |
| CPU | AMD EPYC 9965 192-Core Processor |
| Memory | 12-channel DDR5, 128 GB DIMMs at 6400 MT/s, 2.95 TB total system memory (~614 GB/s theoretical peak memory bandwidth) |
| Storage | 3x Micron_7450_MTFDKCE3T8TFR |
| Operating System | Ubuntu 24.04.3 LTS |
| Kernel | 6.8.0-90-generic |
| GPU | 8x AMD Instinct MI355X Accelerators |

Gen-on-Gen Benchmarking Configuration

| Parameter | Details |
| --- | --- |
| Platforms Compared | 8x AMD Instinct MI355X on Dell PowerEdge XE9785L vs. 8x AMD Instinct MI300X on Dell PowerEdge XE9680 |
| Raw Inference Model | amd/Qwen3-235B-A22B-Thinking-2507-ptpc |
| Raw Inference Concurrency | 1 to 8,192 concurrent requests |
| Raw Inference Configurations | Input/output token lengths: 128/128, 2,048/128, 128/2,048, 2,048/2,048 |
| Solution Benchmark Scenario | Tech and Banking Contagion (SVB Collapse), identical to primary benchmark |
| Solution Concurrency Sweep | 300 to 850 simultaneous portfolios (incrementing by 50) |
| Reasoning Model Deployment | Tensor parallel size 2 (2 instances) on both MI355X and MI300X |
| Test Duration per Level | 45 minutes (90 samples at 30-second intervals) |

Key Performance Indicators (KPIs)

| Metric | Description |
| --- | --- |
| Holdings Throughput | Measures risk analysis capacity as individual asset positions evaluated per minute. |
| Compliance Throughput | Measures regulatory assessment capacity as the number of holdings that complete the full compliance evaluation pipeline per minute. Each assessment encompasses 20-30 individual rule evaluations. |
| Max Generation Tokens/sec | Measures peak reasoning-model token generation rate observed during active IPRA workload. Captures sustained GPU inference output while all pipeline components operate concurrently. |

References

[1] LexisNexis Risk Solutions, True Cost of Financial Crime Compliance Study: United States and Canada (Atlanta: LexisNexis, February 21, 2024), https://risk.lexisnexis.com/about-us/press-room/press-release/20240221-true-cost-of-compliance-us-ca

[2] Ascent RegTech, "The Not So Hidden Costs of Compliance," Ascent Blog, March 27, 2025; Shashank Guda, "Unstructured Data Management in Finance," Medium, November 4, 2025.

[3] GPT-OSS-120B is an open-source 120-billion parameter language model.

[4] Vultr, "Cloud GPU Pricing," Vultr.com, 2026. Reserved rate requires 48-month prepaid commitment.

[5] gpus.io, "AMD Instinct MI355X GPU Price Comparison," gpus.io, 2026. Reflects lowest on-demand rate across tracked providers.

[6] GetDeploying.com, "AMD MI355X: Price, Specs and Cloud Providers," GetDeploying.com, 2026. Average across 4 tracked MI355X cloud providers.

[7] Oracle Cloud Infrastructure, "Compute Pricing: GPU Instances," Oracle.com, 2026. BM.GPU.MI355X.8 bare metal instance on-demand rate.

Image Sources

Dell Images: Dell Technologies Inc. Dell PowerEdge XE9785L Server. Image source: Dell DAM via Dell.com

AMD Images: AMD Inc. AMD Instinct MI355X Accelerator. Image source: AMD Media Library (https://library.amd.com)


Copyright 2026 Metrum AI, Inc. All Rights Reserved. This project was commissioned by Dell Technologies. Dell, Dell PowerEdge and other trademarks are trademarks of Dell Inc. or its subsidiaries. AMD, Instinct, ROCm, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other product names are the trademarks of their respective owners.

DISCLAIMER - Performance varies by hardware and software configurations, including testing conditions, system settings, application complexity, the quantity of data, batch sizes, software versions, libraries used, and other factors. The results of performance testing provided are intended for informational purposes only and should not be considered as a guarantee of actual performance. Gen-on-gen comparisons reflect different server platforms (Dell PowerEdge XE9785L for MI355X, Dell PowerEdge XE9680 for MI300X). Differences in server architecture, memory configuration, and cooling may contribute to observed performance variations beyond GPU-level improvements.