AMD Looks to Infinity for AI Interconnects

With the formal launch of the MI300 GPU, AMD revealed new plans for scaling the multi-GPU interconnects vital to AI-training performance. The company's approach relies on a partner ecosystem, which stands in stark contrast with NVIDIA's end-to-end solutions. The plans revolve around AMD's proprietary Infinity Fabric and its underlying XGMI interconnect.

Infinity Fabric Adopts Switching

As with its prior generation, AMD uses XGMI to connect multiple MI300 GPUs in what it calls a hive. The hive shares a homogeneous memory space formed by the HBM attached to each GPU. In current designs, the GPUs connect directly using XGMI in a mesh or ring topology. Each MI300X GPU has up to seven Infinity Fabric links, each with 16 lanes. The 4th-gen Infinity Fabric supports up to 32Gbps per lane, yielding 128GB/s of bidirectional bandwidth per link.
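As a sanity check, those figures follow directly from the lane math. The sketch below reproduces the per-link number and also totals the fabric bandwidth per GPU, assuming all seven links carry XGMI (a platform-dependent assumption):

```python
# XGMI link-bandwidth arithmetic using the figures cited above.
LANE_RATE_GBPS = 32   # 4th-gen Infinity Fabric, per lane, per direction
LANES_PER_LINK = 16   # XGMI x16 link
LINKS_PER_GPU = 7     # MI300X Infinity Fabric links

link_gbps = LANE_RATE_GBPS * LANES_PER_LINK   # 512 Gbps per direction
link_gbs = link_gbps / 8 * 2                  # 128 GB/s bidirectional
gpu_gbs = link_gbs * LINKS_PER_GPU            # 896 GB/s if all links carry XGMI

print(f"Per x16 link: {link_gbs:.0f} GB/s bidirectional")
print(f"Per GPU (all {LINKS_PER_GPU} links): {gpu_gbs:.0f} GB/s aggregate")
```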

At the MI300 launch, Broadcom announced that its next-generation PCI Express (PCIe) switch chip, Atlas 3, will add support for XGMI. At last October's OCP Global Summit, the company also disclosed that the PCIe Gen6 chip would include support for CXL 3.1. Broadcom expects to sample Atlas 3 in December, which should enable production shipments by 2H25.

With 144 lanes, a single Atlas 3 chip will connect no more than nine MI300 GPUs using XGMI x16 links. Rather than increasing the scale of GPU servers, however, the switch may prove more important in enabling XGMI expansion both within and between chassis. In fact, Broadcom claims its SerDes can drive a 36dB channel, which could include passive-copper (DAC) cables between chassis. AMD and Broadcom have yet to disclose what topologies Infinity Fabric will support, but we expect the fabric features will be similar to those of CXL 3.1.
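The nine-GPU ceiling is simple division: 144 lanes split into x16 links yields nine ports. The snippet below also shows the fan-out at narrower link widths, which are purely hypothetical; only x16 links are confirmed:

```python
# Fan-out of Broadcom's 144-lane Atlas 3 at various link widths.
# Only x16 is confirmed for XGMI; narrower widths are hypothetical.
SWITCH_LANES = 144
for width in (16, 8, 4):
    print(f"x{width} links: {SWITCH_LANES // width} ports")
# x16 -> 9 ports, hence no more than nine directly attached GPUs
```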

Ultra Ethernet Provides Back End

Whereas Infinity Fabric provides the coherent interconnect between GPUs, AMD is promoting Ethernet as its preferred GPU-to-GPU (or back-end) network. The company is a cofounder of the Ultra Ethernet Consortium (UEC) and currently chairs the steering committee. Fundamentally, the UEC seeks to modernize Ethernet for data-center workloads, including AI, by replacing or augmenting existing techniques such as ECMP load balancing and the RoCEv2 transport. It plans to ratify its first set of specifications during 2024.

AMD plans to support the forthcoming Ultra Ethernet Transport (UET) protocol in future P4-programmable NICs, which build on technology from its Pensando acquisition. Smart NICs provide the intelligence needed for UET endpoints to participate in congestion management as well as packet spraying and load balancing. Last year, AMD disclosed plans for its next-generation Pensando DPU, which is due for production in 2025. Code-named Salina, the new chip will offer dual 400G Ethernet ports and is likely to add UEC-specific features.
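Because the UET specification is unpublished, the following is only a conceptual sketch of why endpoint intelligence matters: classic ECMP hashes a flow onto one path, whereas packet spraying spreads a single flow across all paths and relies on the NIC to repair reordering. The path count, hash, and packet fields are illustrative, not drawn from any UEC document.

```python
import hashlib
import itertools

PATHS = 4  # illustrative number of equal-cost paths between two endpoints

def ecmp_path(flow: tuple) -> int:
    """Classic ECMP: hash the 5-tuple, so every packet of a flow
    follows the same path and a single elephant flow can hotspot it."""
    return hashlib.sha256(repr(flow).encode()).digest()[0] % PATHS

_pkt = itertools.count()

def sprayed_path(flow: tuple) -> int:
    """Packet spraying: rotate successive packets across all paths;
    a UET-style endpoint must then tolerate and repair reordering."""
    return next(_pkt) % PATHS

flow = ("10.0.0.1", "10.0.0.2", 4791, 4791, "UDP")
print("ECMP pins every packet to path", ecmp_path(flow))
print("Spraying uses paths", [sprayed_path(flow) for _ in range(8)])
```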

In support of the MI300 launch, AMD articulated its strategy for AI networking, which combines in-house DPUs with Ethernet switch chips from partners such as Broadcom. Unsurprisingly, the strategy includes Infinity Fabric for memory pooling across GPUs, while DPUs handle AI-cluster scale-out. The revelation, however, was AMD's plan to integrate XGMI in its so-called AI NIC. The company declined to comment on whether Salina and the AI NIC are one and the same, but we doubt it would fork the DPU roadmap into multiple architectures.

AI NIC with XGMI (Source: AMD)

Given that AI leader NVIDIA has yet to integrate NVLink into its BlueField DPUs, the obvious question is what benefits XGMI can bring compared with PCIe. The figure above illustrates the first answer: 112Gbps of per-lane bandwidth compared with 64Gbps for PCIe Gen6. Additionally, a coherent interconnect could improve the performance of in-network compute, such as offloading AllReduce operations to the NIC.
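Over a full x16 link, the per-lane gap cited above doubles the raw bandwidth; a rough comparison, ignoring encoding and protocol overhead (real payload throughput will be lower for both):

```python
# Raw x16-link bandwidth: XGMI per AMD's figure vs. PCIe Gen6.
LANES = 16
for name, lane_gbps in (("XGMI (AI NIC)", 112), ("PCIe Gen6", 64)):
    link_gbs = lane_gbps * LANES / 8   # GB/s per direction, overhead ignored
    print(f"{name}: {lane_gbps} Gbps/lane -> {link_gbs:.0f} GB/s per direction")
# XGMI: 224 GB/s vs. PCIe Gen6: 128 GB/s per direction
```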

SuperNIC Rebuttal

Last November, NVIDIA branded its BlueField-3 DPU bundled with Spectrum-X software as a "SuperNIC." It positions the SuperNIC for GPU-to-GPU networking, whereas the standard DPU handles front-end-network duties. Spectrum-X is NVIDIA's umbrella brand for its end-to-end Ethernet AI networking solution, which also includes SN5000 switch systems based on the company's Spectrum-4 ASIC.

Although AMD withheld the schedule for its AI NIC, the chip promises an alternative to NVIDIA's SuperNIC. End-to-end solutions, however, will comprise AMD DPUs plus Ethernet switch silicon from partners, both of which will require new software to enable the UET protocol. Developing and validating new standards takes time, giving NVIDIA about a two-year lead on direct competitors. Importantly, AMD's strategy acknowledges the role of both coherent and network interconnects in scaling AI performance. Over time, this approach should boost interest in the company's GPUs for a wider range of AI workloads.

