More than two years have passed since Nvidia launched the Ampere GPU architecture and the A100, the first data center GPU built on it. Since then, a range of Ampere-based GPU products has appeared, covering data centers, professional graphics, home entertainment, edge computing, and other application fields. At the GTC spring conference held in March this year, Nvidia finally announced its next-generation GPU architecture, Hopper, together with the H100, the first data center GPU product to use it.
This GPU accelerator has a built-in Transformer Engine compute acceleration engine and the latest fourth-generation NVLink interconnect technology, enabling it to support giant AI language models, deep recommendation systems, genomics workloads, and complex digital twins.
In terms of manufacturing, the H100 adopts TSMC’s 4N process, contains 80 billion transistors, and delivers powerful AI and HPC acceleration. It is also the first GPU to use a PCIe 5.0 I/O interface and 80 GB of HBM3 memory, providing 3 TB/s of memory bandwidth in the SXM version and 2 TB/s in the PCIe version.
In contrast, the A100 currently promoted by Nvidia uses TSMC’s 7nm N7 process, contains 54.2 billion transistors, uses a PCIe 4.0 I/O interface, and offers 40 GB of HBM2 or 80 GB of HBM2e memory, providing 1,555 GB/s, 1,935 GB/s, or 2,039 GB/s of memory bandwidth.
In terms of computing performance, Nvidia revealed some test results at the initial launch. For example, when paired with an InfiniBand interconnect network, the H100 can deliver up to 30 times the AI and HPC performance of the A100: running Nvidia’s large-scale language model Megatron-Turing NLG 530B (Megatron 530B) to power chatbots in real-time conversations, where AI latency must stay under one second, it can provide 30 times the throughput.
When researchers and developers train large models, the H100 can dramatically reduce the time required from weeks to days; when processing a Mixture of Experts (MoE) model with 395 billion parameters, it can be up to 9 times faster than the A100.
Nearly half a year later, new H100 performance information has finally arrived. According to the latest AI inference benchmark results released by MLCommons on September 8, MLPerf Inference v2.1, Nvidia submitted H100 test data for the first time. Compared with the A100, this latest data center GPU can provide up to 4.5 times the performance: in the natural language processing test using the BERT model, the A100 processed 1,756.84 samples per second, while the H100 processed 7,921.10 samples per second.
Nvidia says these inference benchmarks are the first public demonstration of the H100, and it predicts that the product will be available later this year and will participate in MLPerf training tests in the future.
Performance improvement hits a new record, but power consumption also rises
During the GTC 2022 spring conference, when Nvidia CEO Jensen Huang announced the H100, he emphasized that this new generation of data center GPU brings a number of breakthroughs in computing performance. For example, for data types such as FP16, FP32, FP64, and TF32, the H100 can reach 3 times the processing performance of the A100. With the newly supported FP8 data type, the H100 can deliver 4,000 TFLOPS, a 6-fold increase over the A100 performing the same processing with FP16, the data type it currently supports.
Regarding thermal design power (TDP), Jensen Huang also mentioned that the H100 is designed for both air-cooled and liquid-cooled systems and is the first GPU to raise power consumption to 700 watts in pursuit of performance. Looking at the H100’s technical specifications, the maximum TDP of the SXM version is indeed 700 watts, while the PCIe version is 350 watts. In contrast, for the current A100, the SXM version is 400 watts, and the PCIe version has two configurations, 250 watts and 300 watts (with 40 GB and 80 GB of GPU memory, respectively).
A power consumption of 700 watts looks astonishing and seems to pose a big challenge to the entire GPU server ecosystem. Since the H100 was released in March, apart from the DGX H100 AI appliance Nvidia announced at the time, which is confirmed to carry this GPU, we have so far not seen any server vendor announce a server model that can accommodate the H100 SXM version.
It is worth noting that Nvidia states in the H100 datasheet and technical architecture document that there are two ways to deploy the SXM version in servers: one is the DGX H100 with 8 H100s; the other is server motherboards built by manufacturers cooperating on HGX H100, which can carry 4 or 8 H100s. The 4-GPU configuration includes NVLink links to support GPU-to-GPU and CPU-to-GPU connections; the 8-GPU configuration includes NVSwitch chips and provides full NVLink bandwidth between GPUs. Jensen Huang mentioned in his GTC 2022 spring conference keynote that on the HGX system motherboard, the 8 SXM versions of the H100 are connected through four NVSwitch chips, each of which provides 3.6 TFLOPS of SHARP in-network computing power.
Regarding connectivity, the NVSwitch chip used here is third-generation technology: each switch provides 64 fourth-generation NVLink ports, with throughput reaching 13.6 Tbps (versus 7.2 Tbps for the second-generation NVSwitch). In addition, the third-generation NVSwitch chip has built-in hardware acceleration for collective operations, such as multicast and SHARP in-network computing reductions.
Design breakthroughs in the streaming multiprocessors
In Nvidia’s successive GPU architectures, the internal configuration of the streaming multiprocessor (SM) and newly added computing capabilities are often the key to technological innovation. Here we first look at the Tensor Cores and the new DPX instruction set.
The current data center GPU, the A100, contains 108 SMs. The new-generation H100, which debuted this year, contains 132 SMs in the SXM5 version and 114 SMs in the PCIe version. In terms of total core counts, the A100 has 6,912 FP32 (CUDA) cores and 432 Tensor Cores; for the H100, the SXM5 version has 16,896 FP32 cores and 528 Tensor Cores, while the PCIe version has 14,592 FP32 cores and 456 Tensor Cores. Clearly, the H100 is equipped with more SMs, FP32 cores, and Tensor Cores than the previous generation of data center GPUs.
In addition, the H100 also revamps part of the computing core design. Compared with the third-generation Tensor Cores used by the A100, the H100 uses fourth-generation Tensor Cores, which can raise computing speed by up to 6 times. For the same data type, the H100 provides twice the A100’s throughput for matrix multiply-accumulate (MMA) operations; with the newly supported FP8 data type, computing power can rise to 4 times that of the A100 using FP16.
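As a rough illustration of what a matrix multiply-accumulate operation computes, here is a minimal NumPy sketch (our own example, not Nvidia code) of the D = A×B + C pattern that Tensor Cores execute on small tiles, with half-precision inputs accumulated in single precision; the matrix sizes are arbitrary choices for illustration.

```python
import numpy as np

# Tensor Cores evaluate D = A x B + C on small tiles; this emulates only the
# numeric pattern: FP16 input operands, FP32 accumulation (sizes are arbitrary).
M, N, K = 16, 16, 16
A = np.random.rand(M, K).astype(np.float16)   # FP16 input operand
B = np.random.rand(K, N).astype(np.float16)   # FP16 input operand
C = np.zeros((M, N), dtype=np.float32)        # FP32 accumulator

# Multiply the FP16 operands, accumulating the partial products in FP32.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape, D.dtype)   # (16, 16) float32
```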
Another new computing feature that distinguishes the H100 from previous data center GPUs is the DPX instruction set, which accelerates dynamic programming. Jensen Huang explained that this programming technique breaks complex problems down into simpler sub-problems that can be solved recursively, reducing the complexity and time required to polynomial scale. He said that with DPX, the H100 can speed up such algorithms by up to 40 times compared with running them on a CPU alone, and by 7 times compared with the previous-generation GPU.
Nvidia said DPX can be widely applied to algorithms in fields such as route optimization, genomics, and graph optimization. For example, in a dynamic warehouse environment, the Floyd-Warshall algorithm can help robots find the best route when navigating automatically; for DNA and protein classification, sequencing, and folding, the Smith-Waterman algorithm can perform feature and sequence alignment.
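To show the kind of recursive sub-problem structure that dynamic programming relies on (the pattern DPX is meant to accelerate, not the DPX instructions themselves), here is a minimal Floyd-Warshall all-pairs shortest-path sketch in plain Python; the small example graph is made up purely for illustration.

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest paths via dynamic programming.

    dist[i][j] is the direct edge weight (math.inf if there is no edge).
    Each step k reuses the sub-problem 'shortest paths using only
    intermediate nodes 0..k-1', giving O(n^3) polynomial complexity.
    """
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical 4-node warehouse graph (edge weights chosen for illustration).
INF = math.inf
graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
print(floyd_warshall(graph))
```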
A new acceleration engine for the Transformer models behind today's AI language models
The H100’s new Transformer Engine is designed specifically for the Transformer, the standard deep learning model for natural language processing and the architecture underlying well-known language AI models such as BERT and GPT-3. Compared with the previous generation of Nvidia products, the H100 equipped with this engine can speed up these neural networks by as much as 6 times without losing accuracy.
Fundamentally, the Transformer Engine combines dedicated Hopper Tensor Core technology with software that dynamically processes the layers of a Transformer network, accelerating both training and inference of Transformer models. Jensen Huang said the time required to train a Transformer model can be reduced from several weeks to a few days. The use of 16-bit precision together with the newly added 8-bit floating-point format (FP8), coupled with advanced software algorithms, further accelerates AI performance and throughput.
The engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between the two precisions in each layer. As a result, when facing large language models, the H100 can deliver up to 9 times faster AI training and up to 30 times faster AI inference than the A100.
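As a rough illustration of the dynamic scaling idea (a minimal conceptual sketch, not Nvidia's actual Transformer Engine implementation), the snippet below keeps a short history of observed maximum values and derives a scale factor that maps a tensor into the representable range of the FP8 E4M3 format, whose largest finite value is 448; the class name, history length, and sample tensor are all our own assumptions for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

class DelayedScaler:
    """Toy per-tensor scaling: track recent amax values and pick a scale
    so the tensor fits into FP8's narrow dynamic range before casting."""

    def __init__(self, history_len=16):
        self.amax_history = []
        self.history_len = history_len

    def update(self, tensor):
        self.amax_history.append(float(np.abs(tensor).max()))
        self.amax_history = self.amax_history[-self.history_len:]

    def scale(self):
        # Use the largest recent amax so occasional spikes do not overflow.
        amax = max(self.amax_history) if self.amax_history else 1.0
        return FP8_E4M3_MAX / amax if amax > 0 else 1.0

scaler = DelayedScaler()
x = np.random.randn(1024).astype(np.float32) * 5.0   # made-up activations
scaler.update(x)
x_scaled = np.clip(x * scaler.scale(), -FP8_E4M3_MAX, FP8_E4M3_MAX)
# x_scaled would then be cast to FP8 for the matrix multiply, and the
# result rescaled back to higher precision afterwards.
print(scaler.scale(), x_scaled.max())
```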
Multi-Instance GPU technology enters its second generation
To make fuller use of data center GPUs, Nvidia introduced a hardware-level GPU partitioning technology with the A100 data center GPU, called Multi-Instance GPU (MIG), which can split a single GPU accelerator into as many as 7 smaller, fully isolated GPU instances.
The newly launched H100 introduces the second generation of MIG, which extends MIG capabilities by up to 7 times over the previous generation by offering secure multi-tenant configurations in cloud environments across each GPU instance.
In this regard, Jensen Huang said that the Hopper architecture adds complete isolation for each GPU instance, as well as I/O virtualization capabilities, to support multi-tenant use in cloud environments. The H100, for example, can host seven cloud service tenants at the same time, whereas the A100 can host only one.
Measured per GPU instance, with the A100’s first-generation MIG as the baseline, the H100’s second-generation MIG can provide nearly three times the computing capacity and nearly three times the memory bandwidth. In terms of performance alone, Jensen Huang believes a single H100 GPU instance can deliver performance equivalent to two Nvidia T4 GPUs.
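For readers who want to see what MIG partitions look like from software, here is a minimal sketch of enumerating MIG instances on a MIG-enabled GPU, assuming the pynvml package (Python bindings for NVML) is installed and device index 0 is the GPU of interest; it is an illustration of the NVML MIG queries, not Nvidia's recommended tooling.

```python
# List the MIG instances exposed by GPU 0 via NVML (requires pynvml and a
# MIG-capable GPU such as the A100 or H100 with MIG mode enabled).
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("MIG mode (current, pending):", pynvml.nvmlDeviceGetMigMode(gpu))

    # Walk the possible MIG slots and print the UUID of each active instance.
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue   # no MIG device configured at this index
        print("MIG instance", i, pynvml.nvmlDeviceGetUUID(mig))
finally:
    pynvml.nvmlShutdown()
```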
Support for confidential computing is also one of the H100’s biggest selling points. Jensen Huang emphasized that this type of data protection was previously limited to CPU-based systems, and Hopper is the first to provide a GPU confidential computing solution, protecting the confidentiality and integrity of users’ AI models and algorithms. This allows developers and service providers to safely distribute and deploy valuable proprietary AI models on shared or remote IT infrastructure, addressing both intellectual property protection and new business models.
Nvidia claims this is the first accelerator with confidential computing, protecting AI models and customer data while they are being processed. Enterprises can also use it to apply confidential computing to federated learning AI applications in privacy-sensitive industries such as healthcare and financial services, as well as on shared cloud infrastructure.
In fact, the H100 supports confidential computing not only for the whole GPU but also at the MIG level, implementing a trusted execution environment (TEE) per instance: all 7 internal GPU instances support it, and each can be configured with its own NVDEC and NVJPG decoder units as well as independent performance monitors that work with Nvidia’s developer tools.
A new generation of NVLink and NVLink Switch to increase I/O bandwidth
For the chip interconnect, the H100 introduces fourth-generation NVLink, which can be combined with third-generation NVSwitch chips and external NVLink switches to extend NVLink beyond a single server and build a scale-up network, the so-called NVLink Switch System. It can connect up to 256 H100s at the same time and provides 9 times the bandwidth of the previous generation connected through an HDR Quantum InfiniBand network.
Nvidia says the NVLink Switch System connects large numbers of GPUs in a 2:1 tapered fat-tree topology, with all-to-all bandwidth across nodes of up to 57.6 TB/s, enough to support 1 EFLOPS of FP8 AI computing, while also providing isolation and protection mechanisms.
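Using only the figures quoted earlier in this article (4,000 TFLOPS of FP8 per H100 and up to 256 H100s in an NVLink Switch System), a quick back-of-the-envelope check shows how the 1 EFLOPS claim adds up; this is simple arithmetic, not a measured result.

```python
# Back-of-the-envelope check using the figures quoted above.
fp8_tflops_per_gpu = 4000           # FP8 throughput per H100 (TFLOPS)
gpus_in_nvlink_switch_system = 256  # maximum H100s connected at once

total_tflops = fp8_tflops_per_gpu * gpus_in_nvlink_switch_system
print(total_tflops / 1_000_000, "EFLOPS")   # 1.024 EFLOPS, i.e. roughly 1 EFLOPS
```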
Product Information
Nvidia H100
●Original factory: Nvidia
●Suggested selling price: not provided by the manufacturer
●Processor process: TSMC 4N
●I/O interface: PCIe 5.0
●Form factor: SXM5, or PCIe dual-slot add-in card (air-cooled)
●GPU architecture: Nvidia Hopper
●GPU cores: SXM5 version has 16,896 CUDA cores and 528 Tensor Cores; PCIe version has 14,592 CUDA cores and 456 Tensor Cores
●GPU memory: 80 GB (SXM5 version with HBM3, PCIe version with HBM2e)
●Memory bandwidth: 3 TB/s for the SXM5 version, 2 TB/s for the PCIe version
●Computing performance: double precision (FP64) is 30 TFLOPS for the SXM5 version and 24 TFLOPS for the PCIe version
●GPU interconnect interface: 4th generation NVLink, 900 GB/s for the SXM5 version, 600 GB/s for the PCIe version
●Power consumption: 700 watts for the SXM5 version, 350 watts for the PCIe version
[Note: Specifications and prices are provided by the manufacturer and may change at any time; please contact the manufacturer for the latest information]