The Byrne-Wheeler Report Episode 4 discusses the RISC-V Summit NA, BlueField-4, SambaNova, and AWS Rainier. You can skip to 12:12 if you're uninterested in RISC-V.
With the formal launch of the MI300 GPU, AMD revealed new plans for scaling the multi-GPU interconnects vital to AI-training performance. The company's approach relies on a partner ecosystem, which stands in stark contrast with NVIDIA's end-to-end solutions. The plans revolve around AMD's proprietary Infinity Fabric and its underlying XGMI interconnect.

Infinity Fabric Adopts Switching

As with its prior generation, AMD uses XGMI to connect multiple MI300 GPUs in what it calls a hive. The hive shares a homogeneous memory space formed by the HBM attached to each GPU. In current designs, the GPUs connect directly using XGMI in a mesh or ring topology. Each MI300X GPU has up to seven Infinity Fabric links, each with 16 lanes. The 4th-gen Infinity Fabric supports up to 32Gbps per lane, yielding 128GB/s of bidirectional bandwidth per link.

At the MI300 launch, Broadcom announced that its next-generation PCI Express (PCIe) switch chip will add support for XGMI. At last October...
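The bandwidth figures above follow directly from the lane count and signaling rate. A minimal sketch of the arithmetic, using only the numbers quoted in the article (16 lanes per link, 32Gbps per lane, up to seven links per MI300X):

```python
# Infinity Fabric / XGMI bandwidth arithmetic from the figures quoted above.
LANES_PER_LINK = 16
GBPS_PER_LANE = 32      # 4th-gen Infinity Fabric signaling rate
LINKS_PER_GPU = 7       # maximum XGMI links on an MI300X

per_link_gbps = LANES_PER_LINK * GBPS_PER_LANE       # 512 Gbps each direction
per_link_gbs_bidir = per_link_gbps / 8 * 2           # convert to GB/s, count both directions
per_gpu_gbs_bidir = per_link_gbs_bidir * LINKS_PER_GPU

print(per_link_gbs_bidir)   # 128.0 GB/s per link, matching the article
print(per_gpu_gbs_bidir)    # 896.0 GB/s aggregate if all seven links carry XGMI
```

Note the 896GB/s aggregate assumes all seven links are used for GPU-to-GPU traffic; in practice some links may be assigned to host I/O instead.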
At GTC DC last month, Jensen Huang showed off components of the Vera Rubin NVL144 platform. First, here's the latest roadmap, which now includes BlueField-4 and BlueField-5. For more on that, see BWR Episode 4.

Source: Nvidia

Below is the Vera Rubin compute tray, which includes four Rubin GPUs. By GPU, we mean package, not die. Note that the Blackwell NVL72 and Rubin NVL144 both have 72 GPU packages; the NVL144 moniker reflects Nvidia's new math of counting dies rather than packages. The company didn't rename the Blackwell configuration, even though that GPU also has two dies. Each compute tray has two Vera CPUs, which are 88-core Arm processors. Two GPUs connect with one CPU using NVLink-C2C, a coherent variant of NVLink.

Although the roadmap above shows CX9 as 1600G, each ConnectX-9 is actually 800Gbps, requiring eight chips to deliver the aggregate 800GB/s quoted for the tray. That means each GPU has a pair of 800G Ethernet/InfiniBand NICs for scale-out networking. Finally, a single BlueField...
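The NIC math above can be checked the same way. A short sketch using the article's figures (four GPU packages per tray, two 800Gbps ConnectX-9 NICs per GPU):

```python
# ConnectX-9 scale-out arithmetic for the Vera Rubin compute tray described above.
GPUS_PER_TRAY = 4     # Rubin GPU packages per compute tray
NICS_PER_GPU = 2      # pair of ConnectX-9 NICs per GPU
NIC_GBPS = 800        # each ConnectX-9 runs at 800 Gbps

tray_nics = GPUS_PER_TRAY * NICS_PER_GPU     # 8 chips per tray
tray_gbs = tray_nics * NIC_GBPS / 8          # 800 GB/s aggregate, as quoted
per_gpu_gbps = NICS_PER_GPU * NIC_GBPS       # 1600G, matching the roadmap's CX9 label

print(tray_nics, tray_gbs, per_gpu_gbps)
```

This reconciles the apparent discrepancy: the roadmap's "1600G" is per-GPU bandwidth across two chips, not the speed of a single ConnectX-9.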
Yes, $11B in Blackwell revenue is impressive. Yes, Nvidia's data-center revenue grew 93% year over year. Under the surface, however, there's trouble in networking. In the January quarter (Q4 FY25), networking revenue declined 9% year over year and 3% sequentially.

In its earnings call, CFO Colette Kress said that Nvidia's networking attach rate was "robust" at more than 75%. Her very next sentence, however, hinted at what's happening underneath that supposed robustness. "We are transitioning from small NVLink8 with InfiniBand to large NVLink72 with Spectrum-X," said Kress.

About one year ago, Nvidia positioned InfiniBand for "AI factories" and Spectrum-X for multi-tenant clouds. That positioning collapsed when the company revealed xAI selected Spectrum-X for what is clearly an AI factory. InfiniBand appears to be retreating to its legacy HPC market while Ethernet comes to the fore.

Nvidia Data-Center Revenue

So how do we square 93% DC grow...