MetalSoft for AI Factories

MetalSoft provides an automated bare-metal and network orchestration platform designed specifically for Enterprise AI Factories and GPU-as-a-Service (GPUaaS) providers.

Better Security

Hardware-level switch and DPU micro-segmentation prevents lateral movement without the overhead or latency of third-party overlays.

Lower Complexity, Easier to Manage

No overlay networks and no complex CRDs to learn. MetalSoft stays simple and powerful without requiring Kubernetes-based complexity.

Higher Performance

Maximize AI training throughput on bare metal while preserving direct access to NUMA nodes, NVLink topologies, and DPU capabilities.

Faster infrastructure deployment

Provision, reconfigure, and scale GPU infrastructure in minutes instead of days with unified automation across compute, networking, and storage.

Better multi-tenant support

Native, deeply integrated switch and DPU orchestration, rather than a bolted-on third-party option, improves security and reliability without sacrificing performance.

Higher uptime with auto-remediation

Monitor node health and automatically replace failed nodes, keeping jobs running and reducing downtime for tenants and operators.

More flexibility

Give clients freedom to deploy whatever stack they prefer, from Jupyter and SLURM on bare metal to RunAI and Spark, without locking them into an orchestration model.

Higher performance through kernel-level tuning

Apply kernel-level optimizations and specialized drivers for maximum performance, a level of tuning that is only possible on bare metal.

Build on top of a solid foundation

Let MetalSoft handle the underlying hardware complexity through native integrations, automation hooks, and infrastructure services, so your teams can focus on building differentiated platforms and AI services on top.

Why MetalSoft for your AI factory

MetalSoft is the only vendor-neutral platform managing the entire AI infrastructure stack: bare metal, networking, storage, VMs, and containers.

Technical Pillar	Current Architectural Challenge	MetalSoft Native Solution
Network Fabric Integration	Manual orchestration: Manual or bolted-on orchestration of complex east-west backend networks (NVIDIA HGX/Quantum InfiniBand/RoCEv2).	Automated segmentation: Automated Fabric Manager dynamically segmenting traffic while also aligning to NVIDIA RA guidelines.
Hardware Health & Remediation	Reactive operations: High GPU/InfiniBand failure rates (~10% average) causing training job interruptions.	AI-driven remediation: Cross-medium health checks across servers, GPUs, switch ports, and storage with automated, AI-driven infrastructure isolation and remediation.
Multi-Tenant Security	Software isolation: Multi-tenant isolation at the scale of bare-metal GPU clusters without performance degradation.	Hardware-level isolation: DPU and hardware switch segmentation enforcing zero-trust isolation directly at the silicon level.
Infrastructure-as-Code	Fragmented tooling: Fragmented tooling across GPU compute, networking, and storage infrastructure.	Unified orchestration: Unified Terraform Provider and API endpoint orchestrating the entire bare-metal server, switch, and storage stack.
AI Ops	Generic AI agents: Generic AI agents lack infrastructure awareness and cannot safely operate physical systems.	Infrastructure-aware AI: Infrastructure-aware AI operations with guarded execution, diagnostics, and human-in-the-loop remediation.
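
To make the Hardware Health & Remediation row more concrete, here is a minimal Python sketch of the isolate-and-replace idea. It is an illustration only and does not use the MetalSoft API: the `Node` type, the `remediate` function, and the spare-pool model are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of a GPU server; not a MetalSoft type."""
    name: str
    healthy: bool = True


def remediate(active, spares):
    """Isolate unhealthy nodes and backfill each one from a spare pool.

    Returns (new_active, isolated, remaining_spares). A failed node is
    dropped without replacement if the spare pool is empty.
    """
    remaining = list(spares)  # copy so the caller's list is not mutated
    new_active, isolated = [], []
    for node in active:
        if node.healthy:
            new_active.append(node)
            continue
        isolated.append(node)  # take the failed node out of service
        if remaining:
            new_active.append(remaining.pop(0))  # swap in the next spare
    return new_active, isolated, remaining


cluster = [Node("gpu-01"), Node("gpu-02", healthy=False), Node("gpu-03")]
pool = [Node("spare-01")]
cluster_after, quarantined, pool_after = remediate(cluster, pool)
```

In this sketch the replacement takes the failed node's position in the cluster, so a job scheduler that depends on node ordering would see a like-for-like swap. A real system would also need to drain workloads and verify the spare before admitting it.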

Direct infrastructure access for maximum AI performance

MetalSoft eliminates unnecessary Kubernetes networking layers, allowing AI workloads to run closer to the physical infrastructure for higher throughput, lower complexity, and predictable performance.

Traditional Kubernetes approach

Relies on overlay networks, CRDs, and host-based SDN layers
Adds abstraction between workloads and physical infrastructure
Increases network and topology complexity
Makes high-performance traffic harder to optimize
Can limit line-rate networking performance

The MetalSoft approach

Provisions workloads directly onto bare metal
Uses the physical network fabric for high-performance traffic
Enables line-rate east-west, north-south, and storage traffic
Supports tenant isolation through EVPN and DPU technologies
Reduces unnecessary networking complexity

Multi-Tenant Fabric Orchestration: Balancing NVIDIA RA with Client Flexibility

Manually configuring a high-performance network fabric that adapts as multi-tenant environments scale up or down is virtually impossible without causing configuration drift or security vulnerabilities.

Two-layer network orchestration

MetalSoft's Fabric Manager automates this orchestration by segmenting the network into two distinct layers:

Backend (East-West) Traffic: Strict automated adherence to NVIDIA Reference Architecture (RA) guidelines for high-throughput, low-latency GPU clustering (via NVIDIA Spectrum-X, InfiniBand, or RoCEv2).
Frontend (North-South) Traffic: Dynamic, isolated routing paths that securely connect bare metal, VMs, or container stacks directly to diverse client enterprise environments.
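
The two-layer split above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Fabric Manager's implementation: the `Layer`, `Segment`, and `FabricPlan` names and the segment ID ranges are hypothetical, chosen only to show each tenant receiving one isolated segment per layer.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    BACKEND = "east-west"      # GPU-to-GPU cluster traffic
    FRONTEND = "north-south"   # traffic to client environments


# Hypothetical, non-overlapping ID ranges, one per layer.
SEGMENT_BASE = {Layer.BACKEND: 10_000, Layer.FRONTEND: 20_000}


@dataclass(frozen=True)
class Segment:
    tenant: str
    layer: Layer
    segment_id: int


class FabricPlan:
    """Assigns every tenant one isolated segment in each layer."""

    def __init__(self):
        self._segments = {}  # tenant name -> {Layer: Segment}
        self._next_index = 0

    def add_tenant(self, tenant):
        if tenant in self._segments:
            raise ValueError(f"tenant {tenant!r} already has segments")
        index = self._next_index
        self._next_index += 1  # indices are never reused in this sketch
        self._segments[tenant] = {
            layer: Segment(tenant, layer, base + index)
            for layer, base in SEGMENT_BASE.items()
        }
        return self._segments[tenant]

    def remove_tenant(self, tenant):
        # Releases both layers together so no half-configured tenant remains.
        return self._segments.pop(tenant)


plan = FabricPlan()
acme = plan.add_tenant("acme")
globex = plan.add_tenant("globex")
```

Allocating both layers in a single call is the design point: a tenant is never left with a backend segment but no frontend path, which is one way to avoid the configuration drift described above.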

The MetalSoft for AI Factories Workflow

Orchestrate the full lifecycle of AI infrastructure, from discovery and provisioning to remediation and secure de-provisioning.

Automatically detect and map physical infrastructure across servers, GPUs, switches, and DPUs in real time.
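
One way to picture the lifecycle described above is as a small state machine. The Python sketch below is an assumption-laden illustration: the stage names come from the workflow description, but the `Stage` enum, the `ALLOWED` transition table, and the `advance` function are hypothetical and are not part of any MetalSoft interface.

```python
from enum import Enum, auto


class Stage(Enum):
    DISCOVERED = auto()
    PROVISIONED = auto()
    REMEDIATING = auto()
    DEPROVISIONED = auto()


# Assumed transition rules for illustration only.
ALLOWED = {
    Stage.DISCOVERED: {Stage.PROVISIONED},
    Stage.PROVISIONED: {Stage.REMEDIATING, Stage.DEPROVISIONED},
    Stage.REMEDIATING: {Stage.PROVISIONED, Stage.DEPROVISIONED},
    Stage.DEPROVISIONED: {Stage.DISCOVERED},  # hardware returns to the pool
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


stage = Stage.DISCOVERED
stage = advance(stage, Stage.PROVISIONED)
stage = advance(stage, Stage.REMEDIATING)
stage = advance(stage, Stage.PROVISIONED)
```

Rejecting undefined transitions, such as de-provisioning hardware that was never provisioned, is what keeps an automated workflow from skipping a step like secure wipe.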

Let's talk infrastructure

See how MetalSoft automates your stack from API to switch port.
