What are tensor cores

GeForce RTX 3080: Tensor cores compared to Turing and A100

With the Ampere architecture, Nvidia has also revised the tensor cores first introduced with Volta. After it was initially unclear how the tensor cores of the GeForce RTX 3080 (review) differ from those of the professional A100 GPU, an Nvidia whitepaper now provides details.

What tensor cores are good for

Tensor cores are compute units designed specifically to accelerate matrix multiplications. The name comes from the mathematical tensors used in neural networks, and that is also where their purpose lies: both the training and the inference of a neural network make extensive use of matrix multiplications, which tensor cores can accelerate massively.
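As a minimal sketch of the workload involved (assuming PyTorch and a CUDA-capable GPU; the matrix sizes are arbitrary examples, not from the whitepaper): a single FP16 matrix multiplication of the kind the tensor cores accelerate.

    # Minimal sketch: an FP16 matrix multiplication on the GPU.
    # Assumes PyTorch and a CUDA-capable GPU are available.
    import torch

    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)  # arbitrary example size
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    # On Volta, Turing and Ampere GPUs, cuBLAS typically routes FP16 matmuls
    # like this one to the tensor cores.
    c = a @ b
    print(c.shape, c.dtype)  # torch.Size([4096, 4096]) torch.float16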

More relevant data types

The first generation of tensor cores, introduced with Volta (Titan V and Quadro GV100), could only handle FP16 matrices. In most cases, however, neural networks use FP32, and switching to FP16 can have a negative impact on the accuracy of the results. With Turing, Nvidia revised the tensor cores and added support for the data types INT8, INT4 and INT1. Since all of these data types cover only very small value ranges and the input has to be quantized, their uses remained limited.
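A small illustration of FP16's limited precision (using NumPy; the example is ours, not from the whitepaper): with only 10 mantissa bits, integers above 2048 can no longer all be represented exactly, which is one reason a naive switch to FP16 can cost accuracy.

    # FP16 precision limits, illustrated with NumPy.
    import numpy as np

    print(np.float16(2048))      # 2048.0 -- still exact
    print(np.float16(2049))      # 2048.0 -- rounded, the exact value cannot be stored
    print(np.finfo(np.float16))  # machine limits: roughly 3 decimal digits, max ~65504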

The third generation of tensor cores, which found its way into Ampere, can handle significantly more data types. The tensor cores of the A100 GPU introduced in May additionally handle FP64, TF32 and bfloat16. With TF32 and bfloat16, the two more relevant data types have also made it to the GA10x GPUs found in the GeForce RTX 3070, 3080 and 3090. TF32 is a middle ground between FP32 and FP16, although the name is somewhat misleading: like FP32, TF32 uses 8 bits for the exponent, and like FP16 it uses 10 bits for the mantissa, plus one sign bit. The data type therefore consists of 19 bits and is significantly smaller than the 32-bit FP32. According to Nvidia, using TF32 causes no (significant) loss of accuracy in neural networks, while the tensor cores work with TF32 about 2.7 times as fast as with FP32. With bfloat16, the exponent has been expanded from 5 to 8 bits compared to FP16, while the mantissa has been shortened from 10 to 7 bits in return.
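The shortened mantissa can be emulated roughly in software. The following sketch (an assumption for illustration only; it truncates rather than rounds, unlike the hardware presumably does) keeps the sign bit and the 8 exponent bits of an FP32 value and clears the 13 low mantissa bits that TF32 drops.

    # Rough emulation of the TF32 bit layout (illustration, not Nvidia's actual rounding):
    # keep FP32's sign and 8 exponent bits, truncate the 23-bit mantissa to TF32's 10 bits.
    import numpy as np

    def truncate_to_tf32(x: np.ndarray) -> np.ndarray:
        bits = x.astype(np.float32).view(np.uint32)   # work on a copy of the raw bits
        bits &= np.uint32(0xFFFFE000)                  # zero the lowest 13 mantissa bits (23 - 10)
        return bits.view(np.float32)

    x = np.array([3.14159265, 1e-8, 12345.678], dtype=np.float32)
    print(truncate_to_tf32(x))  # same value range as FP32, but only about 3 decimal digits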

Since TF32 uses the same exponent width as FP32 (only with a shorter mantissa), the tensor cores can take FP32 inputs and also output FP32 again. The switch from FP32 to TF32 is therefore trivial for developers and is handled entirely by Nvidia's CUDA/cuDNN.
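In PyTorch, for example (assuming a current version; the default values of these flags have changed across releases), the switch boils down to two flags, while the tensors themselves remain plain FP32:

    # Enabling TF32 in PyTorch: the tensors stay FP32, only the internal math path changes.
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True   # let cuBLAS use TF32 tensor cores for FP32 matmuls
    torch.backends.cudnn.allow_tf32 = True         # same for cuDNN convolutions

    a = torch.randn(1024, 1024, device="cuda")     # plain FP32 tensors ...
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b                                      # ... may now run on the tensor cores via TF32
    print(c.dtype)                                 # torch.float32 -- inputs and outputs stay FP32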

Fewer but faster tensor cores

While Volta and Turing had 8 tensor cores per SM, Ampere has only 4, but each tensor core itself works twice as fast. Like Turing, Ampere thus achieves 512 FP16 FMA operations per SM; Nvidia's A100, in comparison, reaches 1,024 FP16 FMA operations per SM. Another feature Nvidia has introduced with the new tensor cores is the automatic acceleration of sparsely populated neural networks (sparsity). For this, the tensor cores can skip up to 2 zero values within every group of 4 elements when calculating matrix operations, which speeds up the operations by a factor of 2. For the time being, however, this method can only be used for inference, i.e. when a trained neural network is applied.
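The pattern behind this is commonly called 2:4 structured sparsity. A minimal sketch of what such pruning looks like (illustration only, not Nvidia's own tooling): in every group of 4 consecutive weights, the 2 values with the smallest magnitude are set to zero.

    # Minimal sketch of 2:4 structured pruning with NumPy (illustration only).
    import numpy as np

    def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
        w = weights.reshape(-1, 4).copy()
        # indices of the two smallest-magnitude entries in each group of four
        drop = np.argsort(np.abs(w), axis=1)[:, :2]
        np.put_along_axis(w, drop, 0.0, axis=1)
        return w.reshape(weights.shape)

    w = np.random.randn(2, 8).astype(np.float32)
    print(prune_2_of_4(w))  # each group of 4 now contains exactly 2 zeros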

High AI performance

In theory, Ampere in the form of the GeForce RTX 3080 offers not only more flexibility for AI training and inference than Turing and Volta, but also significantly higher performance. Compared to Turing in the form of the GeForce RTX 2080 (Super), FP32 throughput increases by a factor of 2.66 and FP16 throughput by a factor of 1.40 (2.80 with sparsity). Using TF32, the jump over FP32 on the GeForce RTX 2080 Super is a factor of 2.66 (5.22 with sparsity). To use the new features, CUDA 11 and cuDNN 8 as well as an adapted version of the respective deep learning framework are required.
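Whether a setup meets these requirements can be checked quickly, for example from PyTorch (the attribute names below are PyTorch's; other frameworks expose this information differently):

    # Quick check of the toolchain requirements mentioned above (PyTorch attribute names).
    import torch

    print(torch.version.cuda)              # should report 11.x for the new Ampere features
    print(torch.backends.cudnn.version())  # should report >= 8000 (cuDNN 8)
    print(torch.cuda.get_device_name(0))   # e.g. "GeForce RTX 3080"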

  • Robert McHardy develops software and is completing his master's degree in machine learning at University College London.