Anyone who has tried to get good performance by training neural networks on large amounts of data has struggled with how long deep neural network models take to train. Deep learning models are also hungry for hardware resources. Graphics Processing Units (GPUs) are generally preferred over Central Processing Units (CPUs) because they train deep neural networks faster. Another available option is the Tensor Processing Unit (TPU), which competes with the GPU. This study aims to clarify the differences between GPUs and TPUs.
Graphics Processing Unit
A GPU is a specialized processor with dedicated memory that is used for graphical and mathematical computations. GPUs are specialized for one kind of task, so they are designed around a Single Instruction, Multiple Data (SIMD) architecture: they run the same type of computation on many data elements in parallel.
Because deep neural networks involve millions of parameters, GPUs can play a significant role: they employ a large number of logical cores (arithmetic logic units (ALUs), control units, and memory cache).
This large number of cores allows many parallel computations to be carried out efficiently.
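The SIMD idea described above can be illustrated with a purely conceptual sketch. Plain Python executes this sequentially, so it only models the grouping of work: each "step" stands for one instruction applied to several data elements at once. The function name `simd_add` and the `lanes` parameter are hypothetical and chosen only for illustration.

```python
def simd_add(a, b, lanes=4):
    """Add two equal-length lists, `lanes` elements per conceptual step.

    Conceptual model only: a real GPU would process all lanes of a step
    simultaneously, whereas Python runs them one after another.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        # One "instruction" (addition) applied to up to `lanes` data elements.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return out, steps


result, steps = simd_add(list(range(8)), [10] * 8, lanes=4)
```

With 8 elements and 4 lanes, the addition needs 2 conceptual steps instead of 8 scalar ones, which is the source of the speedup on hardware with many cores.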
The Tensor Processing Unit
Google announced the TPU in May 2016 at Google I/O (an annual developer conference held by Google), stating that TPUs had already been in use inside its data centers for over a year. The TPU is specifically designed for neural network and machine learning tasks, and it has been available for third-party use since 2018.
Google has used TPUs for Google Street View text processing, where they were able to find all of the text within the Street View database in less than five days. In Google Photos, a single TPU can process over 100 million photos in a single day. Google also uses TPUs in RankBrain, its machine-learning-based search algorithm, to provide search results.
The TPU is a co-processor designed to accelerate deep learning tasks. Figure 1 illustrates the TPU block diagram.
To reduce the chances of delaying deployment, the TPU was designed as a coprocessor on the PCIe (PCI Express, a high-speed serial computer expansion bus standard) I/O bus instead of being tightly integrated with a CPU. This allows it to plug into existing servers just as a GPU does.
Furthermore, to simplify hardware design and debugging, the host server sends instructions to the TPU for execution rather than the TPU fetching them itself. The TPU is therefore closer in spirit to an FPU (floating-point unit) coprocessor than to a GPU (Jouppi, et al., 2017).
As shown in Figure 1, the yellow Matrix Multiply unit in the upper right-hand corner is the main computational part of the architecture.
In general, CPUs can handle tens of operations per cycle, GPUs can handle tens of thousands of operations per cycle, and TPUs can handle up to 128,000 operations per cycle.
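One plausible way to relate the TPU figure to the Matrix Multiply unit is to count multiply-accumulate (MAC) operations. The sketch below counts the MACs in a naive matrix product and then applies the same counting to a 256 x 256 MAC array, which is the size Jouppi et al. (2017) report for the first-generation TPU. Treating each MAC as two operations (one multiply, one add) is an assumption made here for illustration, and the function name `matmul_count_macs` is hypothetical.

```python
def matmul_count_macs(a, b):
    """Naive matrix product that also counts multiply-accumulate steps."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return out, macs


product, macs = matmul_count_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]])

# Array size reported by Jouppi et al. (2017) for the first-generation TPU.
ARRAY_DIM = 256
macs_per_cycle = ARRAY_DIM * ARRAY_DIM
# Assumption: one MAC counts as two operations (a multiply and an add).
ops_per_cycle = 2 * macs_per_cycle
```

Under this assumption the array performs 131,072 operations per cycle, that is, 128 x 1024, which is consistent with the "up to 128,000" figure quoted above.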
In the present work, a GeForce RTX 2080 Ti with 11 GB of GDDR6 memory was used as the GPU and a Cloud TPU v3 with 128 GB of memory was used as the TPU, in order to compare the two on natural language processing tasks.
| Device | Model | Memory |
|---|---|---|
| GPU | GeForce RTX 2080 Ti | 11 GB GDDR6 |
| TPU | Cloud TPU v3 | 128 GB |
Two tasks were used for the comparison:
- pre-training BERT (Bidirectional Encoder Representations from Transformers) on the Japanese Wikipedia dataset, following the method of the Inui Laboratory, Tohoku University
- fine-tuning the pre-trained BERT for a simple Japanese text classification problem using the livedoor news corpus (livedoor ニュースコーパス) from Yohei Kikuta's GitHub page
Task 1: Pretraining the BERT
A base BERT model was pre-trained on Japanese Wikipedia data. Because the selected GPU and TPU have different amounts of memory, the mini-batch size was set to 4 for the GPU and 256 for the TPU. Figure 2 displays the losses over 1M steps of pre-training on the TPU.
Figure 3 illustrates the losses of pre-training BERT on the GPU.
On the GPU, 1M steps took 80 hours, with a mini-batch of size 4 processed at each step. The GPU therefore needs 2E-5 hours per training example. On the TPU, 1M steps took 145 hours, with a mini-batch of size 256 processed at each step, which corresponds to about 5.66E-7 hours per training example. Measured per training example, the TPU is thus about 35 times faster than the GPU for pre-training BERT.
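The per-example arithmetic above can be checked with a short sketch. It uses only the figures reported in this section (hours, steps, and mini-batch sizes); the function name `hours_per_example` is hypothetical.

```python
def hours_per_example(total_hours, steps, batch_size):
    """Wall-clock hours spent per training example."""
    return total_hours / (steps * batch_size)


STEPS = 1_000_000

# Figures reported above for pre-training BERT.
gpu_hours = hours_per_example(total_hours=80, steps=STEPS, batch_size=4)
tpu_hours = hours_per_example(total_hours=145, steps=STEPS, batch_size=256)

# How many times faster the TPU processes one training example.
speedup = gpu_hours / tpu_hours

print(f"GPU: {gpu_hours:.2e} h/example")
print(f"TPU: {tpu_hours:.2e} h/example")
print(f"TPU speedup: {speedup:.1f}x")
```

Note that this compares time per example, not time per step: the TPU run took longer in total, but it processed 64 times more examples at each step.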
Task 2: Fine-tuning the BERT
Fine-tuning BERT took 52 minutes on the GPU, while it took only 5 minutes on the TPU. The mini-batch sizes were the same as in Task 1.
Based on the results, the TPU is considerably faster than the GPU in our tasks. According to leadergpu.com, the rental price of our GPU is about 0.7 USD per hour, while the TPU price is about 8 USD per hour, roughly 11 times higher. Because the TPU processed each pre-training example about 35 times faster, it is not only faster but also cheaper per training example, especially for training big models such as BERT.
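The cost comparison can be made explicit with a small sketch that combines the hourly prices quoted above with the timings from Task 1 and Task 2. The prices are approximate, and the function names `usd_per_example` and `run_cost` are hypothetical.

```python
# Approximate hourly prices quoted above.
GPU_USD_PER_HOUR = 0.7
TPU_USD_PER_HOUR = 8.0


def usd_per_example(price_per_hour, total_hours, steps, batch_size):
    """Rental cost per training example for a pre-training run."""
    return price_per_hour * total_hours / (steps * batch_size)


def run_cost(price_per_hour, minutes):
    """Rental cost of a whole run that lasts `minutes` minutes."""
    return price_per_hour * minutes / 60


# Task 1: pre-training, 1M steps on each device.
gpu_pretrain = usd_per_example(GPU_USD_PER_HOUR, 80, 1_000_000, 4)
tpu_pretrain = usd_per_example(TPU_USD_PER_HOUR, 145, 1_000_000, 256)
pretrain_cost_ratio = gpu_pretrain / tpu_pretrain

# Task 2: fine-tuning, 52 minutes on the GPU and 5 minutes on the TPU.
gpu_finetune = run_cost(GPU_USD_PER_HOUR, 52)
tpu_finetune = run_cost(TPU_USD_PER_HOUR, 5)

print(f"Pre-training cost ratio (GPU/TPU): {pretrain_cost_ratio:.2f}")
print(f"Fine-tuning cost: GPU {gpu_finetune:.2f} USD, TPU {tpu_finetune:.2f} USD")
```

Under these prices, the pre-training cost per example comes out roughly 3 times lower on the TPU, while the two fine-tuning runs cost about the same (roughly 0.6 to 0.7 USD each). The cost advantage therefore shows up mainly in the large pre-training workload.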
This comparison study was supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Jouppi, Norman P., et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. 16 Apr. 2017, http://arxiv.org/abs/1704.04760.