Named after computer scientist Grace Hopper, Hopper is Nvidia's next GPU architecture, designed to tackle new data center challenges. A new Transformer Engine accelerates the corresponding deep learning architecture. H100 is the first Hopper GPU and aims to make ExaFLOPS performance the standard.
At the start of today's GTC 2022, Nvidia follows Volta and Ampere with the new Hopper architecture, which will appear in its first data center products in the third quarter of this year. Hopper was presented as a pure enterprise solution; new consumer graphics cards, whose architecture is expected to be called "Ada (Lovelace)", are traditionally left out of GTC. The keynote introduced Hopper with its main features and most important innovations.
H100 GPU as the first product based on GH100
Nvidia calls Hopper nothing short of the most advanced chip in the world. A distinction must be made between the full GH100 chip and H100, the first published product based on it. As with the Ampere announcement two years ago, when the full GA100 and the A100 Tensor Core GPU as its first product were unveiled, H100 does not use all the units of GH100. Full specifications of both configurations will only come with the Hopper white paper, which has not yet been published.
H100 Die
80 billion transistors, TSMC N4, HBM3, 700 watts
H100 is a chip with 80 billion transistors, manufactured at TSMC on a custom N4 process adapted for Nvidia. That is almost 48 percent more transistors than in the A100. Details on die size, SMs, CUDA cores, tensor cores, clock speeds and other features are still pending. However, the use of 80 GB of HBM3 has already been confirmed, after Ampere initially used HBM2 and, in a later configuration, HBM2e. With HBM3, memory bandwidth rises to 3 TB/s. Hopper also gets faster interconnects: 4th-generation NVLink as a direct connection between multiple GPUs now runs at 900 GB/s instead of 600 GB/s, and the step from PCIe 4.0 to PCIe 5.0 doubles that interface's bandwidth from 64 GB/s to 128 GB/s.
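These gains can be checked quickly (a minimal sketch; the A100's transistor count of 54.2 billion is taken from Nvidia's earlier Ampere specifications, not from this announcement, and the variable names are ours):

```python
# Transistor count: H100 as announced, A100 as previously published (billions).
h100, a100 = 80.0, 54.2
print(f"{(h100 / a100 - 1) * 100:.1f}% more transistors")  # 47.6%, i.e. "almost 48 percent"

# Interconnect steps: NVLink gains 50 percent, PCIe doubles.
print(900 / 600)  # NVLink gen 4 over gen 3: 1.5
print(128 / 64)   # PCIe 5.0 over 4.0: 2.0
```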
4,000 TFLOPS for FP8
Nvidia also makes initial statements about Hopper's computing power. The headline figure is 4,000 TFLOPS for FP8 calculations, an increase by a factor of six. The Transformer Engine mentioned at the beginning plays a key role here; more on that later in the article. With FP16, H100 achieves 2,000 TFLOPS, for TF32 it is 1,000 TFLOPS, and for FP64 Nvidia cites 60 TFLOPS.
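The tensor figures follow the usual halving per step up in precision, as a quick sanity check shows (a minimal sketch; the dictionary and the implied A100 baseline derived from the stated factor of six are our reading, not official figures):

```python
# Peak throughput per H100 as stated by Nvidia, in TFLOPS.
rates = {"FP8": 4000, "FP16": 2000, "TF32": 1000, "FP64": 60}

# Each doubling of precision halves tensor throughput; FP64 runs on
# separate units and does not follow the pattern.
print(rates["FP8"] / rates["FP16"])   # 2.0
print(rates["FP16"] / rates["TF32"])  # 2.0

# The stated factor of six implies a baseline of roughly 667 TFLOPS, close
# to the A100's 624 TFLOPS sparse FP16 tensor rate; treat it as
# marketing-level precision.
print(rates["FP8"] / 6)               # ~667
```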
Efficiency increases despite increased consumption
The higher computing power goes hand in hand with an even higher TDP: Nvidia specifies 700 watts for an SXM module with H100. As a reminder, the A100 Tensor Core GPU is rated at 400 watts, so this corresponds to an increase of 75 percent. Relative to the performance on offer, however, efficiency is still significantly better: measured against the stated factor of six for FP8 throughput, performance per watt improves by more than a factor of three.
H100 on SXM module
Transformer Engine accelerates huge AI models
Hopper's Transformer Engine was built specifically for the Transformer deep learning architecture, which has recently become the dominant building block of neural networks. Transformers can be trained on large amounts of example data without requiring labels for that data, which drastically expands the pool of usable training data because tedious labeling is eliminated entirely. Transformers are used, for example, to translate text from one language into another, but also in the medical field, for instance in protein sequencing, and in various areas of computer vision.
Ampere reaches its limits with transformers
With transformers, however, Ampere is already reaching the limits of the architecture, Nvidia explained at GTC. While non-Transformer models have grown by a factor of 8 in size and complexity over the last two years, Transformer models have grown by a factor of 275 in the same period, a truly enormous gap. Training these highly complex models to ever greater accuracy on unlabeled data requires enormous computing power: even a supercomputer like Nvidia's own Selene needs a computation time of one and a half months for the Megatron-Turing NLG 530B.
Nvidia splits calculations into FP8 and FP16
The Transformer Engine speeds up these calculations by splitting them between FP8 and FP16. Tensor core operations in FP8 naturally have twice the data throughput of those in FP16. The challenge for Nvidia, however, is to combine the accuracy of the larger numeric format with the performance gain of the smaller, faster one. A heuristic developed by Nvidia is intended to manage the dynamic switching between FP8 and FP16, and thus the precision, for each layer of the model. Details can be expected in a deep dive into the Hopper architecture later at GTC.
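Nvidia has not disclosed how the heuristic works, but a per-layer precision choice based on dynamic range could look roughly like the following sketch (purely illustrative: the threshold, the scale handling and the function `choose_precision` are our assumptions, not the Transformer Engine's actual logic):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest normal value in the FP8 E4M3 format
MIN_RATIO = 2 ** -9   # assumed cut-off below which scaled values underflow

def choose_precision(layer_output: np.ndarray) -> str:
    """Pick FP8 when the scaled values fit its dynamic range, else FP16.

    Assumes a tensor with at least one non-zero element.
    """
    nonzero = np.abs(layer_output[layer_output != 0])
    # A per-tensor scale factor maps the peak value onto the FP8 maximum;
    # if that scale would crush the smallest values toward zero, fall back.
    scale = FP8_E4M3_MAX / nonzero.max()
    return "FP8" if nonzero.min() * scale > MIN_RATIO else "FP16"

# Values with a narrow dynamic range fit into FP8 ...
print(choose_precision(np.array([0.02, -0.5, 1.3, -3.1])))  # FP8
# ... while values spanning ten orders of magnitude force FP16.
print(choose_precision(np.array([1e-6] * 8 + [1e4] * 8)))   # FP16
```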
Hopper eliminates two bottlenecks
In addition to ever longer computation times even on the most modern hardware and software, another problem is that performance gains no longer scale as desired with the number of GPUs added. Hopper is supposed to address both problem areas: the Transformer Engine removes the bottleneck directly on the GPU, while the new NVLink generation removes the bottleneck between the individual nodes in the data center. Within a node, GPUs communicate via NVLink at 900 GB/s instead of just 128 GB/s over PCIe 5.0; beyond the node, NVLink switches can combine up to 256 GPUs in a SuperPOD with the new DGX H100, whose bandwidth, according to Nvidia, is nine times that of a SuperPOD of DGX A100 systems still connected via Quantum-1 InfiniBand.
DGX H100 uses eight H100
Following the H100 GPU, the DGX H100 is the first Nvidia system to use the Hopper architecture. The DGX H100 is the direct successor to the DGX A100 and houses eight H100 GPUs that deliver 32 PFLOPS of AI performance (FP8) and 0.5 PFLOPS for FP64. Such a node comes with 640 GB of HBM3.
Nvidia DGX H100
1 ExaFLOPS FP8 with 256 H100 GPUs
Based on this, the new SuperPOD can be built from 32 DGX H100 systems with a total of 256 H100 GPUs, which achieve a computing power of 1 ExaFLOPS for FP8 thanks to the new Transformer Engine. The 256 GPUs communicate via NVLink within a node and via NVLink Switch between nodes, and SuperPODs can be connected to one another via Quantum-2 InfiniBand. One of the new Hopper-based SuperPODs contains 20 TB of HBM3.
Nvidia Eos supercomputer
The new SuperPOD design can be expanded in increments of 32 additional DGX H100 nodes to build a supercomputer. Nvidia itself is the first operator of such a system, which succeeds the previous Selene under the name Eos. Nvidia Eos is composed of 18 of the new SuperPODs, so 576 DGX H100 with a total of 4,608 H100 GPUs deliver 18 ExaFLOPS of FP8 performance, 9 ExaFLOPS for FP16 and 275 PFLOPS for FP64. Eos is a supercomputer that Nvidia operates itself, but its design is intended to serve as a blueprint for comparable systems from OEMs and cloud partners.
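The stated totals follow directly from the per-GPU figures, as this back-of-envelope check shows (a minimal sketch; the constant names are ours and the per-GPU rates are those from the announcement):

```python
# System hierarchy as described: GPU -> DGX H100 -> SuperPOD -> Eos.
GPUS_PER_DGX = 8
DGX_PER_SUPERPOD = 32
SUPERPODS_IN_EOS = 18

superpod_gpus = GPUS_PER_DGX * DGX_PER_SUPERPOD  # 256 GPUs per SuperPOD
eos_gpus = superpod_gpus * SUPERPODS_IN_EOS      # 4,608 GPUs in Eos
print(superpod_gpus, eos_gpus)

# Per-GPU peak rates from the announcement, in TFLOPS.
FP8, FP16, FP64 = 4000, 2000, 60

print(superpod_gpus * FP8 / 1e6, "ExaFLOPS FP8 per SuperPOD")  # ~1.02
print(eos_gpus * FP8 / 1e6, "ExaFLOPS FP8 in Eos")             # ~18.4
print(eos_gpus * FP16 / 1e6, "ExaFLOPS FP16 in Eos")           # ~9.2
print(eos_gpus * FP64 / 1e3, "PFLOPS FP64 in Eos")             # ~276, quoted as 275

# Memory: 80 GB of HBM3 per GPU yields the quoted capacities.
print(GPUS_PER_DGX * 80, "GB per DGX H100")          # 640
print(superpod_gpus * 80 / 1024, "TB per SuperPOD")  # 20
```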
Hopper comes on PCIe 5.0 cards
H100 CNX
A whole Hopper family starts in the third quarter
With today's presentation, Hopper is not just a new architecture whose products will only follow later; Nvidia wants to launch on a broad front in the third quarter. The H100 GPU forms the basis of all solutions, from the DGX H100 through the DGX H100 SuperPOD to the Eos supercomputer. With the H100 CNX and a PCIe variant of the H100, for which details are still missing, Nvidia is also addressing customers who do not want to purchase a complete system. And like its predecessor, the DGX H100 will also be available as the HGX H100, which provides just the internals for server vendors to build into their own chassis.
H100 family
Nvidia has already won numerous partners for Hopper. The cloud providers Alibaba, AWS, Baidu, Google, Microsoft, Oracle and Tencent plan to offer corresponding systems, while the server vendors include Atos, Dell, Fujitsu, Gigabyte, H3C, Hewlett Packard, Inspur, Lenovo, Nettrix and Supermicro.
ComputerBase has received information about this item from Nvidia under NDA. The only requirement was the earliest possible publication date.