Clos networks are now universally deployed in data centers, and there is broad industry consensus on the leaf-spine architecture, as it provides a simple, stable, and scalable fabric with consistent performance. The architecture is also open, standards-based, interoperable, and hardware-agnostic, and it works with a variety of routing protocols, which makes it easy to automate.
Despite these benefits, many enterprises and service providers still question whether existing data center networks can meet the demands of AI networking.
To answer this, we first need to understand the nature of an AI application.
AI Application construct
AI applications use various neural networks, such as feed-forward, convolutional, perceptron, and recurrent networks. These networks process input data by passing it through layers of neurons that apply weights to their inputs; each neuron emits a weighted sum of its inputs passed through an activation function such as the sigmoid or an exponential function.
This process is repeated with more data, often across billions of parameters, to achieve higher training accuracy. These parameters must be shared across nodes before the next training step can begin. This puts pressure on the network to share data quickly (gradient aggregation) so that errors and weights can be adjusted across the model replicas.
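As a concrete, minimal sketch of what gradient aggregation means for the network, the Python snippet below simulates a naive all-reduce across four workers. The worker count and gradient sizes are illustrative, and no real network transfer takes place.

```python
import numpy as np

def allreduce_mean(worker_grads):
    """Naive gradient aggregation: average every worker's gradient and
    hand the result back to all of them. In a real cluster the gather and
    broadcast phases cross the network (e.g. as a ring all-reduce), so
    every training step is gated on the slowest participant."""
    mean_grad = np.stack(worker_grads).mean(axis=0)   # gather + reduce
    return [mean_grad.copy() for _ in worker_grads]   # broadcast

# Four simulated workers, each holding a 1M-parameter fp32 gradient (~4 MB).
grads = [np.random.randn(1_000_000).astype(np.float32) for _ in range(4)]
synced = allreduce_mean(grads)
assert all(np.array_equal(g, synced[0]) for g in synced)
```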
Key requirements for networks running AI workloads (a quick sizing sketch follows this list):
High bandwidth, as large messages are exchanged (each message is a set of parameters, typically 1-32 MB)
Efficient handling of low-entropy traffic, as only a few large flows are in flight at any time
Minimal tail latency, as stragglers slow down the whole job and block progression to the next training phase
Low jitter, as AI workloads tolerate little variance in delivery time
Efficient handling of switch and link failures, which can otherwise have a dramatic impact on training
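For a rough sense of scale, the back-of-envelope sketch below computes the wire time of a 32 MB parameter message at a few common link speeds; the figures are illustrative and ignore protocol overhead and congestion.

```python
# Back-of-envelope: serialization time for one gradient message.
MSG_BYTES = 32 * 2**20  # 32 MB, the upper end of the 1-32 MB range above

for gbps in (25, 100, 400):
    ms = MSG_BYTES * 8 / (gbps * 1e9) * 1e3
    print(f"{gbps:>4} Gb/s link: {ms:5.2f} ms per 32 MB message")
# One congested link delays every worker: the next training step cannot
# start until the slowest gradient exchange completes (the straggler effect).
```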
The InfiniBand switching system is considered a potential networking solution for AI workloads because it meets the above requirements. However, InfiniBand is relatively difficult to configure, maintain, and scale. Its control plane is centralized in a single subnet manager, which works for small clusters but is challenging to scale to networks of 32K or more GPUs. Moreover, an IB network requires specialized hardware, such as host channel adapters and InfiniBand cables, which makes expansion more costly than with Ethernet.
Is Ethernet a viable solution for AI Networking?
The two main concerns with Ethernet networks for AI are load balancing and congestion control.
In terms of load balancing, routing protocols such as BGP use equal-cost multipath routing (ECMP) to distribute packets over multiple paths of equal “cost” to the destination. When packets arrive at a switch that has multiple equal-cost paths to the destination, the switch uses a hash function to select a path. AI workloads generate low-entropy traffic: there are too few flows for the hash function to spread them effectively, so hash collisions occur.
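The toy simulation below illustrates the effect: with only as many flows as equal-cost paths, a hash routinely lands two or more flows on the same path while other paths carry nothing. The hash function and addresses are stand-ins for whatever a real switch ASIC computes.

```python
import random, zlib
from collections import Counter

def ecmp_pick(flow, n_paths):
    """Toy ECMP: deterministic hash of the 5-tuple, modulo the path count."""
    return zlib.crc32(repr(flow).encode()) % n_paths

N_PATHS, N_FLOWS = 8, 8   # low entropy: only as many flows as paths
random.seed(7)
flows = [("10.0.0.%d" % random.randint(1, 254),    # source IP
          "10.0.1.%d" % random.randint(1, 254),    # destination IP
          random.randint(49152, 65535), 4791, 17)  # sport, dport, proto
         for _ in range(N_FLOWS)]

usage = Counter(ecmp_pick(f, N_PATHS) for f in flows)
print(sorted(usage.items()))  # expect collisions: some paths carry 2+
                              # elephant flows while others sit idle
```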
The second problem is the lack of global congestion information. AI workloads are particularly affected by path asymmetry in the event of link failures. Unlike general-purpose DC networks, which are designed for loosely coupled applications and tolerate high jitter, AI workloads require low latency and consistent performance.
Making Ethernet work for AI Workloads
Several strategies can address the load balancing issue. One approach is to reserve a slight excess of bandwidth or to implement adaptive load balancing, allowing switches to redirect packets of new flows to alternate ports during congestion; many switches already support this capability. Additionally, RoCEv2 packet-level load balancing distributes packets evenly across all available links to keep them balanced, and advanced hashing with flexible key selection provides better packet entropy.
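As a sketch of why added entropy helps, the snippet below reuses the same toy ECMP hash: a fixed 5-tuple pins an elephant flow to a single path, whereas rotating the UDP source port (a common RoCEv2 entropy knob) spreads slices of the same logical flow across all equal-cost paths. Real switches hash in hardware with configurable key fields; this is only a model.

```python
import zlib
from collections import Counter

def ecmp_pick(flow, n_paths):
    """Toy ECMP hash over a 5-tuple."""
    return zlib.crc32(repr(flow).encode()) % n_paths

N_PATHS = 8
SRC, DST = "10.0.0.1", "10.0.1.1"   # one GPU pair, one elephant flow

# Fixed 5-tuple: every packet hashes to the same path; 7 links sit idle.
fixed = {ecmp_pick((SRC, DST, 50000, 4791, 17), N_PATHS)}

# Rotating the UDP source port slices the flow across all paths.
rotated = Counter(ecmp_pick((SRC, DST, 50000 + i, 4791, 17), N_PATHS)
                  for i in range(64))

print("fixed source port   ->", fixed)
print("rotated source port ->", sorted(rotated.items()))
```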
Ethernet supports RDMA (remote direct memory access) through RoCEv2 (RDMA over Converged Ethernet), in which RDMA frames are encapsulated in IP/UDP. When RoCEv2 packets arrive at the network adapter (NIC) in a GPU server, the NIC can transfer the RDMA data directly into the GPU’s memory, without CPU intervention.
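To make the encapsulation concrete, the sketch below packs a minimal InfiniBand Base Transport Header (BTH), the header that rides inside the UDP payload of a RoCEv2 packet on destination port 4791. The field packing is simplified for illustration and omits the payload and ICRC.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def build_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """Pack a minimal 12-byte InfiniBand Base Transport Header (BTH).

    On the wire, RoCEv2 looks like:
      Ethernet / IP / UDP(dport=4791) / BTH / RDMA payload / ICRC
    Field packing here is simplified for illustration.
    """
    return struct.pack(
        "!BBHII",
        opcode,               # e.g. 0x04 = Reliable Connection SEND-only
        0x00,                 # SE / MigReq / pad count / transport version
        pkey,                 # partition key
        dest_qp & 0xFFFFFF,   # reserved byte + 24-bit destination queue pair
        psn & 0xFFFFFF,       # ack-request byte + 24-bit packet sequence number
    )

bth = build_bth(opcode=0x04, dest_qp=0x12, psn=1)
print(len(bth), bth.hex())    # 12 bytes that sit right after the UDP header
```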
Ethernet can provide a lossless transmission service through priority flow control (PFC). PFC supports eight service classes, each of which can be flow-controlled independently on a per-priority (per-queue) basis.
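The sketch below assembles a simplified PFC pause frame to show the per-priority structure: a class-enable vector plus one pause timer per traffic class. Field values are illustrative and the FCS is omitted.

```python
import struct

def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Assemble a simplified IEEE 802.1Qbb PFC pause frame (no FCS).

    pause_quanta maps priority (0-7) -> pause time in 512-bit-time quanta;
    0xFFFF means "pause as long as allowed", 0 resumes the queue.
    """
    dst = bytes.fromhex("0180c2000001")          # reserved MAC-control address
    header = dst + src_mac + struct.pack("!HH", 0x8808, 0x0101)  # ethertype, PFC opcode
    enable_vector = 0
    timers = b""
    for prio in range(8):                        # one timer per traffic class
        if prio in pause_quanta:
            enable_vector |= 1 << prio
        timers += struct.pack("!H", pause_quanta.get(prio, 0))
    frame = header + struct.pack("!H", enable_vector) + timers
    return frame.ljust(60, b"\x00")              # pad toward the minimum frame size

# Pause only priority 3 -- a typical lossless class carrying RoCEv2 traffic.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
print(len(frame), frame.hex())
```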
End-to-end congestion control schemes such as DCQCN can be deployed to reduce end-to-end congestion and the packet loss that RDMA is sensitive to. Mechanisms like packet spraying, which spreads the packets of a flow across all available parallel paths, can also be used; however, NICs must then support out-of-order packet arrival. Another mechanism is congestion notification, in which network switches inform the endpoints about congestion and the endpoints modulate their traffic based on RTT and the notifications received.
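The snippet below models the core DCQCN idea in a few lines: a multiplicative rate cut when a Congestion Notification Packet (CNP) arrives, and gradual recovery otherwise. Real DCQCN adds byte counters, timers, and staged recovery with tuned parameters; this is only a simplified sketch.

```python
# Simplified DCQCN-style sender, for illustration only.
LINE_RATE = 400.0  # Gb/s

class DcqcnSender:
    def __init__(self):
        self.rate = LINE_RATE    # current sending rate
        self.target = LINE_RATE  # rate to recover toward
        self.alpha = 1.0         # running estimate of congestion severity
        self.g = 1 / 16          # EWMA gain for alpha

    def on_cnp(self):
        """A CNP arrived (receiver saw ECN marks): remember the current
        rate as the recovery target and cut the rate multiplicatively."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2

    def on_quiet_period(self):
        """No CNPs for a while: decay alpha and climb back toward target."""
        self.alpha *= 1 - self.g
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender()
s.on_cnp()
print(f"after CNP:      {s.rate:6.1f} Gb/s")   # 200.0
for _ in range(5):
    s.on_quiet_period()
print(f"after recovery: {s.rate:6.1f} Gb/s")   # ~393.8
```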
MetalSoft support for AI Networking
Leveraging Ethernet-based network topologies, the MetalSoft platform provides the flexibility to manage not only the GPU hosts/compute but also the network topologies AI requires, such as frontend and backend networks. MetalSoft supports eBGP unnumbered (RFC 5549) for the underlay and EVPN-VxLAN for the overlay in frontend networking.
Similarly, the backend network uses BGP unnumbered (RFC 5549) for IP routing.
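As an illustration of what such an underlay can look like, the sketch below renders an FRR-style eBGP unnumbered leaf configuration from Python. The ASN and interface names are hypothetical, and this is not MetalSoft's actual template.

```python
# Sketch: render an FRR-style eBGP unnumbered (RFC 5549) leaf config.
def frr_bgp_unnumbered(local_asn: int, uplinks: list) -> str:
    lines = [f"router bgp {local_asn}",
             " bgp bestpath as-path multipath-relax"]  # ECMP across spines
    for intf in uplinks:
        # Peer over the interface's IPv6 link-local address, advertising
        # IPv4 routes with an IPv6 next hop (RFC 5549).
        lines.append(f" neighbor {intf} interface remote-as external")
    lines += [" address-family ipv4 unicast",
              "  redistribute connected",
              " exit-address-family"]
    return "\n".join(lines)

print(frr_bgp_unnumbered(65101, ["swp1", "swp2", "swp3", "swp4"]))
```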
AI networking is usually divided into “scale-up” and “scale-out” domains, as shown in the figure above.
The “scale-up” network provides GPU-to-GPU communication within an AI accelerator node. Today it is mostly implemented in proprietary ways, such as NVLink (NVIDIA), Infinity Fabric (AMD), and custom RoCE (Intel Gaudi). This network can be used for tensor parallelism, where a matrix multiplication is spread across multiple GPUs within the server.
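Here is a minimal NumPy sketch of column-wise tensor parallelism: the weight matrix is sharded across two simulated GPUs, each computes a partial result, and the pieces are concatenated, which is the step that would traverse the scale-up fabric. The shapes and two-way split are illustrative.

```python
import numpy as np

x = np.random.randn(8, 512).astype(np.float32)     # activations
w = np.random.randn(512, 1024).astype(np.float32)  # full weight matrix

w_gpu0, w_gpu1 = np.split(w, 2, axis=1)            # shard columns across "GPUs"
y_gpu0 = x @ w_gpu0                                # partial result on GPU 0
y_gpu1 = x @ w_gpu1                                # partial result on GPU 1
y = np.concatenate([y_gpu0, y_gpu1], axis=1)       # all-gather over the fabric

assert np.allclose(y, x @ w)                       # matches the single-GPU result
```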
The “scale-out” network provides GPU-to-GPU communication between AI accelerator nodes. Here Ethernet can be used as the preferred technology, although proprietary implementations exist, such as custom RoCE over Ethernet (NVIDIA, Intel) and InfiniBand (NVIDIA).
For pipeline parallelism, a rails-based design can be leveraged, in which the model is partitioned across GPUs at the boundaries between groups of layers. Scale-out networks also support data parallelism, where many copies of the model are trained in parallel on mini-batches.
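The sketch below illustrates pipeline parallelism with two stages and four micro-batches. In a real deployment each stage would live on a different accelerator, and the activations passed between stages would cross the scale-out network; the stage functions and shapes are illustrative.

```python
import numpy as np

# Two pipeline stages, each standing in for a group of layers on one GPU.
def stage0(x):
    return np.tanh(x @ W0)      # layers 1..N/2 on "GPU 0"

def stage1(h):
    return h @ W1               # layers N/2+1..N on "GPU 1"

rng = np.random.default_rng(0)
W0 = rng.standard_normal((64, 64)).astype(np.float32)
W1 = rng.standard_normal((64, 10)).astype(np.float32)

# Split a batch into micro-batches that flow through the pipeline in turn;
# each stage0 -> stage1 handoff is the traffic that crosses the network.
micro_batches = np.split(rng.standard_normal((32, 64)).astype(np.float32), 4)
outputs = [stage1(stage0(mb)) for mb in micro_batches]
print(len(outputs), outputs[0].shape)   # 4 micro-batches, each (8, 10)
```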
SONiC as an alternative for Backend AI network fabric
SONiC provides a flat, highly scalable leaf-spine IP-Clos data center fabric. It tunnels Layer 2 traffic with Virtual Extensible LAN (VxLAN) in the data plane over a Layer 3 Border Gateway Protocol Ethernet VPN (BGP-EVPN) control plane, and it is widely deployed in frontend networks.
SONiC can also be leveraged for backend networks: it supports RoCEv2 and has built-in QoS and load balancing technologies, including DSCP-based QoS profiles, ECN and PFC, flexible buffer allocation, a PFC watchdog, and per-queue telemetry.
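For illustration, the fragment below mirrors the shape of a SONiC CONFIG_DB entry that enables PFC on the RoCE priority and attaches an ECN-enabled WRED profile. Table and field names follow SONiC conventions, but the specific values here are hypothetical.

```python
import json

# Illustrative CONFIG_DB fragment for a RoCEv2-ready SONiC port: lossless
# priority 3 with PFC, plus a WRED profile that marks (ECN) instead of drops.
config_db_fragment = {
    "PORT_QOS_MAP": {
        "Ethernet0": {
            "dscp_to_tc_map": "AZURE",   # DSCP -> traffic class mapping
            "pfc_enable": "3",           # run PFC on priority 3 (RoCE queue)
        }
    },
    "WRED_PROFILE": {
        "AZURE_LOSSLESS": {
            "ecn": "ecn_all",            # mark packets rather than drop them
            "green_min_threshold": "250000",
            "green_max_threshold": "1048576",
        }
    },
}
print(json.dumps(config_db_fragment, indent=2))
```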
MetalSoft fully supports the provisioning and configuration of the Enterprise SONiC distribution by Dell.
Conclusion:
AI applications are becoming truly important to many enterprises, and AI clusters are therefore becoming more critical. Ethernet/IP fabrics have the features needed to become the dominant fabric for large AI clusters.
Ethernet is also widely used across applications, from data centers to backbone networks, with speeds ranging from 1 Gbps to 800 Gbps and expected to reach 1.6 Tbps in the future.
High-end Ethernet switches and network cards have powerful congestion control, load balancing, and RDMA support, and they can scale to larger designs.
MetalSoft makes it easy to provision both frontend and backend networks, with zero-touch provisioning support for switches.