Posts

Showing posts with the label RDMA

AMD Looks to Infinity for AI Interconnects

With the formal launch of the MI300 GPU, AMD revealed new plans for scaling the multi-GPU interconnects vital to AI-training performance. The company's approach relies on a partner ecosystem, which stands in stark contrast with NVIDIA's end-to-end solutions. The plans revolve around AMD's proprietary Infinity Fabric and its underlying XGMI interconnect.

Infinity Fabric Adopts Switching

As with its prior generation, AMD uses XGMI to connect multiple MI300 GPUs in what it calls a hive. The hive shares a homogeneous memory space formed by the HBM attached to each GPU. In current designs, the GPUs connect directly using XGMI in a mesh or ring topology. Each MI300X GPU has up to seven Infinity Fabric links, each with 16 lanes. The 4th-gen Infinity Fabric supports up to 32Gbps per lane, yielding 128GB/s of bidirectional bandwidth per link. At the MI300 launch, Broadcom announced that its next-generation PCI Express (PCIe) switch chip will add support for XGMI. At last October...
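To make the link-bandwidth math concrete, here is a back-of-the-envelope sketch in C using only the figures quoted above (16 lanes per link, 32Gbps per lane, up to seven links per GPU). The per-GPU aggregate is our own extrapolation from those numbers, not an AMD-quoted figure.

```c
#include <stdio.h>

int main(void) {
    /* Figures from the article: each MI300X Infinity Fabric link has
       16 lanes, and 4th-gen Infinity Fabric runs up to 32 Gbps per lane. */
    const int lanes_per_link   = 16;
    const double gbps_per_lane = 32.0;
    const int links_per_gpu    = 7;   /* up to seven links per MI300X */

    double link_gbps   = lanes_per_link * gbps_per_lane; /* 512 Gbps raw */
    double link_gbytes = link_gbps / 8.0;                /* 64 GB/s per direction */
    double link_bidir  = link_gbytes * 2.0;              /* 128 GB/s bidirectional */

    printf("Per link: %.0f Gbps raw, %.0f GB/s each way, %.0f GB/s bidirectional\n",
           link_gbps, link_gbytes, link_bidir);
    /* Extrapolated aggregate, assuming all seven links are populated. */
    printf("Per GPU (all %d links): %.0f GB/s bidirectional\n",
           links_per_gpu, link_bidir * links_per_gpu);
    return 0;
}
```

The per-link result matches the 128GB/s bidirectional figure cited above.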

Ultra Ethernet Promises New RDMA Protocol

This week saw the formal launch of the Ultra Ethernet Consortium (UEC), which aims to reinvent Ethernet fabrics for massive-scale AI and HPC deployments. An impressive list of founding members backs this ambitious effort: hyperscalers Meta and Microsoft; chip vendors AMD, Broadcom, and Intel; OEMs Arista, Atos, and HPE; and Cisco, which straddles the chip and OEM camps. Absent this backing, we could easily write off the consortium as doomed to failure. Our skepticism is rooted not in the obvious need the UEC looks to serve but rather in the challenges of standardizing and implementing a full-stack approach. The effort plans to replace existing transport protocols as well as user-space APIs. Specifically, the Ultra Ethernet Transport (UET) protocol will be a new RDMA protocol to replace RoCE, and new APIs will replace the Verbs API inherited from InfiniBand. UET will provide an alternative to RoCEv2 and Amazon's SRD, both of which are deployed in hyperscale data centers. (Source: Ul...
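For context on what the UEC proposes to replace: the Verbs API mentioned above is today's user-space entry point to RDMA. The minimal libibverbs sketch below (error handling mostly omitted) opens a device, allocates a protection domain, and registers a buffer for DMA; the UEC's new APIs would sit in place of this interface.

```c
/* Build: gcc verbs_sketch.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    printf("using device %s\n", ibv_get_device_name(devs[0]));

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */

    /* Register a buffer so the NIC can DMA into it -- the step that
       pins memory and yields the keys used in RDMA operations. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered 4 KB buffer, rkey=0x%x\n", mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```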

White Paper: The Evolution of Memory Tiering at Scale

With first-generation chips now available, the early hype around CXL is giving way to realistic performance expectations. At the same time, software support for memory tiering is advancing, building on prior work around NUMA and persistent memory. Finally, operators have deployed RDMA to enable storage disaggregation and high-performance workloads. Thanks to these advancements, main-memory disaggregation is now within reach. Enfabrica sponsored the creation of this white paper, but the opinions and analysis are those of the author. Download the full white paper for free; no registration required.
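As a small illustration of the NUMA-based tiering the white paper builds on: on Linux, far memory such as CXL-attached DRAM typically appears as a CPU-less NUMA node, so software can place hot and cold data on different tiers with libnuma. The sketch below assumes node 0 is local DRAM and the highest-numbered node is the far tier; actual node IDs are platform-specific.

```c
/* Build: gcc tiering_sketch.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int near = 0;                /* local DRAM node (assumption) */
    int far  = numa_max_node();  /* often where CXL/PMem memory lands */
    size_t sz = 1 << 20;         /* 1 MB per tier */

    char *hot  = numa_alloc_onnode(sz, near);  /* hot data: local DRAM */
    char *cold = numa_alloc_onnode(sz, far);   /* cold data: far tier */
    if (!hot || !cold) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(hot, 0xAA, sz);   /* touch pages so they are actually placed */
    memset(cold, 0x55, sz);
    printf("hot tier on node %d, cold tier on node %d\n", near, far);

    numa_free(hot, sz);
    numa_free(cold, sz);
    return 0;
}
```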