NVIDIA Reveals DGX GH200 System Architecture

We industry analysts sometimes get out over our skis when trying to project details of new products. Following NVIDIA's DGX GH200 announcement at Computex, we noticed industry press making the same mistake. Rather than correct our previous NVLink Network post, we'll explain here what we've since learned. Our revelation came when NVIDIA published a white paper titled NVIDIA Grace Hopper Superchip Architecture. The figure below from that paper shows the interconnects at the HGX-module level. The GPU (Hopper) side uses NVLink as a coherent interconnect, whereas the CPU (Grace) side uses InfiniBand, in this case connected through a BlueField-3 DPU. From a networking perspective, the NVLink and InfiniBand domains are independent: there is no bridging between the two protocols.

HGX Grace Hopper Superchip System With NVLink Switch (Source: NVIDIA)

The new DGX GH200 builds a SuperPOD based on this underlying module-level architecture. You've probably seen the headline numbers elsewhere: up to 256 Grace Hopper Superchips with one exaFLOPS of FP8 performance. The GPU side of all 256 chips is connected using a two-level NVLink network. The first level consists of 96 NVLink switch chips integrated into the HGX chassis, and the second level consists of 32 NVLink switches that reside in switch-system chassis. Combined with Grace Hopper's internal coherent interconnect (NVLink-C2C), the NVLink network enables 256 GPUs to access up to 144TB of memory. NVIDIA positions the DGX GH200 for training the large language models (LLMs) that benefit most from this massive shared memory. At least initially, NVIDIA will sell the NVLink Switch System only as a part of DGX GH200 pods.
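The headline numbers can be sanity-checked with simple arithmetic. The sketch below relies on per-superchip memory capacities that do not appear in this post: 480GB of LPDDR5X on the Grace side and 96GB of HBM3 on the Hopper side, as NVIDIA listed for the GH200 at launch. It also assumes the 144TB total is quoted with 1TB = 1,024GB. The function names are ours, chosen only for illustration.

```python
# Back-of-the-envelope check of the DGX GH200 headline numbers.
# ASSUMPTIONS (not stated in this post): per-superchip capacities of
# 480GB LPDDR5X (Grace) and 96GB HBM3 (Hopper), and 1TB = 1,024GB.
GRACE_LPDDR5X_GB = 480
HOPPER_HBM3_GB = 96
SUPERCHIPS = 256          # from the post
POD_FP8_PFLOPS = 1000     # one exaFLOPS, from the post


def pod_memory_tb(superchips=SUPERCHIPS):
    """Total memory reachable over NVLink, in TB (binary)."""
    per_chip_gb = GRACE_LPDDR5X_GB + HOPPER_HBM3_GB
    return superchips * per_chip_gb / 1024


def implied_fp8_pflops_per_gpu(superchips=SUPERCHIPS):
    """FP8 throughput per GPU implied by the pod-level figure."""
    return POD_FP8_PFLOPS / superchips


print(pod_memory_tb())               # 144.0
print(implied_fp8_pflops_per_gpu())  # 3.90625
```

Under those assumptions, 256 superchips at 576GB each come to exactly 144TB, and the one-exaFLOPS figure implies roughly 3.9 petaFLOPS of FP8 per GPU.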

On the CPU side, each Grace Hopper connects with one ConnectX-7 VPI adapter and a BlueField-3 DPU. The ConnectX-7 provides one OSFP port for 400Gbps InfiniBand, whereas the DPU provides a pair of 200Gbps Ethernet ports. At the pod level, 24 QM9700 InfiniBand switch systems, based on the 25.6Tbps Quantum-2 switch IC, provide the InfiniBand fabric. The Ethernet network consists of 22 SN3700 switches, each of which provides 32x200G ports using a Spectrum-2 IC. Another 20 SN2201 switches provide an out-of-band management network.
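To put those switch counts in context, here is a rough port budget for a full 256-node pod. It uses only the figures quoted above, plus one assumption of ours: that every Quantum-2 port runs at 400Gbps, which turns 25.6Tbps into 64 ports per QM9700. NVIDIA has not described the fabric topology here, so the sketch counts ports and says nothing about how the switches are tiered.

```python
# Rough port budget for the CPU-side networks of a 256-node pod.
# ASSUMPTION: all Quantum-2 ports run at 400Gbps; the fabric topology
# is not described in the post and is not modeled here.
NODES = 256
IB_PORTS_PER_NODE, IB_GBPS = 1, 400      # ConnectX-7
ETH_PORTS_PER_NODE, ETH_GBPS = 2, 200    # BlueField-3

QM9700_COUNT, QUANTUM2_GBPS = 24, 25_600
SN3700_COUNT, SN3700_PORTS = 22, 32


def port_budget(nodes=NODES):
    """Return endpoint demand versus total switch ports per network."""
    ib_ports_per_switch = QUANTUM2_GBPS // IB_GBPS
    ib_endpoints = nodes * IB_PORTS_PER_NODE
    eth_endpoints = nodes * ETH_PORTS_PER_NODE
    return {
        "ib_endpoint_ports": ib_endpoints,
        "ib_switch_ports": QM9700_COUNT * ib_ports_per_switch,
        "ib_injection_tbps": ib_endpoints * IB_GBPS / 1000,
        "eth_endpoint_ports": eth_endpoints,
        "eth_switch_ports": SN3700_COUNT * SN3700_PORTS,
        "eth_injection_tbps": eth_endpoints * ETH_GBPS / 1000,
    }


for name, value in port_budget().items():
    print(f"{name}: {value}")
```

By this count, both networks carry the same 102.4Tbps of aggregate endpoint bandwidth. The InfiniBand fabric has 1,536 switch ports for 256 endpoints, and the Ethernet network has 704 switch ports for 512 endpoints; the remainder is presumably available for inter-switch links, storage, and other attachments.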

In our previous NVLink Network post, we erroneously stated that NVLink would replace InfiniBand as the first-level interconnect in DGX SuperPODs. Instead, NVLink Network adds a coherent pod-level interconnect while InfiniBand remains for network and storage traffic. Logically, it's more of an overlay than a replacement. It's up to GPU software to decide which network to use for a given operation.

In reality, the DGX GH200 is likely to serve as a reference design rather than an off-the-shelf system. NVIDIA is building an internal system dubbed Helios, which comprises four DGX GH200 SuperPODs connected using 400Gbps InfiniBand and is due to come online by year end. Customer systems will use the same building blocks but may differ in their exact configurations. Regardless, the NVLink Switch System enables a massive shared memory that will both simplify and accelerate training of the largest AI models. The DGX GH200 is a powerful demonstration of NVIDIA's full-stack approach to both hardware and software.

