MetalSoft for AI Factories

MetalSoft provides an automated bare-metal and network orchestration platform designed specifically for Enterprise AI Factories and GPU-as-a-Service (GPUaaS) providers.

Better Security

Hardware-level switch and DPU micro-segmentation prevents lateral movement without the overhead or latency of third-party overlays.

Lower Complexity, Easier to Manage

No overlay networks and no complex CRDs to learn. MetalSoft runs without Kubernetes-based complexity.

Higher Performance

Maximize AI training throughput on bare metal while preserving direct access to NUMA nodes, NVLink topologies, and DPU capabilities.

Faster Infrastructure Deployment

Provision, reconfigure, and scale GPU infrastructure in minutes instead of days with unified automation across compute, networking, and storage.

Build on Top of a Solid Foundation

Let MetalSoft handle the underlying hardware complexity through native integrations, automation hooks, and infrastructure services, so your teams can focus on building differentiated platforms and AI services on top.

Supported Vendors

Dell Technologies
Hewlett Packard Enterprise
Cisco
NVIDIA
Lenovo
Supermicro
Juniper Networks
Huawei

Why MetalSoft for your AI factory

MetalSoft is the only vendor-neutral platform managing the entire AI infrastructure stack: bare metal, networking, storage, VMs, and containers.

Network Fabric Integration

  • Current architectural challenge: Manual orchestration. Manual or bolted-on orchestration of complex east-west backend networks (NVIDIA HGX/Quantum InfiniBand/RoCEv2).
  • MetalSoft native solution: Automated segmentation. Automated Fabric Manager dynamically segmenting traffic while also aligning to NVIDIA RA guidelines.

Hardware Health & Remediation

  • Current architectural challenge: Reactive operations. High GPU/InfiniBand failure rates (~10% average) causing training job interruptions.
  • MetalSoft native solution: AI-driven remediation. Cross-medium health checks across servers, GPUs, switch ports, and storage with automated, AI-driven infrastructure isolation and remediation.

Multi-Tenant Security

  • Current architectural challenge: Software isolation. Multi-tenant isolation at the scale of bare-metal GPU clusters without performance degradation.
  • MetalSoft native solution: Hardware-level isolation. DPU and hardware switch segmentation enforcing zero-trust isolation directly at the silicon level.

Infrastructure-as-Code

  • Current architectural challenge: Fragmented tooling. Fragmented tooling across GPU compute, networking, and storage infrastructure.
  • MetalSoft native solution: Unified orchestration. Unified Terraform Provider and API endpoint orchestrating the entire bare-metal server, switch, and storage stack.

AI Ops

  • Current architectural challenge: Generic AI agents. Generic AI agents lack infrastructure awareness and cannot safely operate physical systems.
  • MetalSoft native solution: Infrastructure-aware AI. Infrastructure-aware AI operations with guarded execution, diagnostics, and human-in-the-loop remediation.
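
To make "guarded execution" and "human-in-the-loop remediation" concrete, here is a minimal Python sketch of the idea: diagnostics find failed components, and nothing is isolated until an approver agrees. It is an illustration only, not MetalSoft's API. The `Fabric` class, the `remediate` function, the `approve` callback, and the component names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Toy inventory (hypothetical): component name -> health flag."""
    healthy: dict[str, bool]
    isolated: set[str] = field(default_factory=set)


def failed_components(fabric: Fabric) -> list[str]:
    """Diagnostics step: list unhealthy components that are not yet isolated."""
    return sorted(
        name
        for name, ok in fabric.healthy.items()
        if not ok and name not in fabric.isolated
    )


def remediate(fabric: Fabric, approve) -> list[str]:
    """Isolate each failed component, but only when the approver says yes."""
    done = []
    for name in failed_components(fabric):
        if approve(name):  # human-in-the-loop gate before any change
            fabric.isolated.add(name)
            done.append(name)
    return done


fabric = Fabric(healthy={"gpu-0": True, "gpu-1": False, "switch-port-7": False})
# Stand-in approver for the example; a real operator prompt would go here.
isolated = remediate(fabric, approve=lambda name: name.startswith("gpu"))
print(isolated)  # ['gpu-1']
```

The design point is that the diagnostic step and the change step are separate, so the change can be gated, logged, or refused without touching the health checks.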

Direct infrastructure access for maximum AI performance

MetalSoft eliminates unnecessary Kubernetes networking layers, allowing AI workloads to run closer to the physical infrastructure for higher throughput, lower complexity, and predictable performance.

Traditional Kubernetes approach

  • Relies on overlay networks, CRDs, and host-based SDN layers
  • Adds abstraction between workloads and physical infrastructure
  • Increases network and topology complexity
  • Makes high-performance traffic harder to optimize
  • Can limit line-rate networking performance

The MetalSoft approach

  • Provisions workloads directly onto bare metal
  • Uses the physical network fabric for high-performance traffic
  • Enables line-rate east-west, north-south, and storage traffic
  • Supports tenant isolation through EVPN and DPU technologies
  • Reduces unnecessary networking complexity

Multi-Tenant Fabric Orchestration: Balancing NVIDIA RA with Client Flexibility

Manually configuring a high-performance network fabric that adapts as multi-tenant environments scale up or down is virtually impossible without causing configuration drift or security vulnerabilities.

Two-layer network orchestration

MetalSoft's Fabric Manager automates this orchestration by segmenting the network into two distinct layers:

  • Backend (East-West) Traffic: Strict automated adherence to NVIDIA Reference Architecture (RA) guidelines for high-throughput, low-latency GPU clustering (via NVIDIA Spectrum-X, InfiniBand, or RoCEv2).

  • Frontend (North-South) Traffic: Dynamic, isolated routing paths that securely connect bare metal, VMs, or container stacks directly to diverse client enterprise environments.
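
The two-layer split above can be sketched as a small data model in which every tenant receives one backend segment and one frontend segment, and no segment is ever shared. This is a simplified illustration under stated assumptions, not how Fabric Manager is implemented. The ID ranges and the names `SegmentPlan`, `plan_segments`, and `is_isolated` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ID ranges, chosen only for illustration.
BACKEND_BASE = 10_000   # east-west GPU fabric segments
FRONTEND_BASE = 20_000  # north-south client-facing segments


@dataclass(frozen=True)
class SegmentPlan:
    tenant: str
    backend_segment: int   # isolates GPU-to-GPU (east-west) traffic
    frontend_segment: int  # isolates client-facing (north-south) traffic


def plan_segments(tenants: list[str]) -> list[SegmentPlan]:
    """Give every tenant one backend and one frontend segment, never shared."""
    if len(set(tenants)) != len(tenants):
        raise ValueError("tenant names must be unique")
    return [
        SegmentPlan(name, BACKEND_BASE + i, FRONTEND_BASE + i)
        for i, name in enumerate(tenants)
    ]


def is_isolated(plans: list[SegmentPlan]) -> bool:
    """True when no segment ID is used by more than one tenant or layer."""
    ids = [p.backend_segment for p in plans] + [p.frontend_segment for p in plans]
    return len(ids) == len(set(ids))


plans = plan_segments(["tenant-a", "tenant-b"])
for p in plans:
    print(p.tenant, p.backend_segment, p.frontend_segment)
```

Deriving the plan from the tenant list, rather than editing switch configuration by hand, is what keeps the plan reproducible as tenants are added or removed.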

The MetalSoft for AI Factories Workflow

Orchestrate the full lifecycle of AI infrastructure, from discovery and provisioning to remediation and secure de-provisioning.
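
One way to picture this lifecycle is as a small state machine. The sketch below uses the stages named above. The allowed transitions between them are an assumption made for illustration, and the `Stage` and `advance` names are hypothetical rather than part of any MetalSoft interface.

```python
from enum import Enum


class Stage(Enum):
    DISCOVERED = "discovered"
    PROVISIONED = "provisioned"
    REMEDIATING = "remediating"
    DEPROVISIONED = "de-provisioned"


# Assumed transitions: the text names the stages but not the allowed moves.
ALLOWED = {
    Stage.DISCOVERED: {Stage.PROVISIONED},
    Stage.PROVISIONED: {Stage.REMEDIATING, Stage.DEPROVISIONED},
    Stage.REMEDIATING: {Stage.PROVISIONED, Stage.DEPROVISIONED},
    Stage.DEPROVISIONED: {Stage.PROVISIONED},
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage if the move is allowed, otherwise raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


# Walk one server through a full cycle, including a remediation detour.
stage = Stage.DISCOVERED
for target in (Stage.PROVISIONED, Stage.REMEDIATING, Stage.PROVISIONED, Stage.DEPROVISIONED):
    stage = advance(stage, target)
print(stage.value)  # de-provisioned
```

Rejecting disallowed moves explicitly means a server cannot, for example, jump from discovery straight to remediation without first being provisioned.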

Let's talk infrastructure

See how MetalSoft automates your stack from API to switch port.