Jittor-GAN: A fast-training generative adversarial network model zoo based on Jittor

With the emergence of CNNs and massive datasets, performance on many computer vision tasks, such as object detection, instance segmentation, and image generation, has greatly improved. Image generation has many novel applications, including image-to-image translation, image inpainting, and image super-resolution, and can produce authentic and creative images. GAN [1] is the current mainstream model for image generation. It usually consists of an encoder, a generator, and a discriminator, which are constructed from CNN layers. The encoder maps images to a latent space. The generator generates images from latent vectors, using one or multiple images. The discriminator distinguishes generated images from real images. Through joint adversarial training of the generator and the discriminator, the generator is continuously improved, so that it produces increasingly realistic images. However, training GAN models is time consuming. We have therefore implemented a GAN model zoo based on Jittor, a fully just-in-time (JIT) compiled deep learning framework developed by Tsinghua University [2]. This model zoo is a collection of 27 mainstream GAN models published from 2014 to 2019, listed in Table 1. These models have an average of 3070 citations each, are highly influential, and are widely used in both academia and industry. Our model zoo covers 4 kinds of tasks: image generation (G), image-to-image translation (T), super-resolution (S), and image inpainting (I). Table 1 lists the models provided for each task. Further details of these 27 GAN models can be found in Ref. [3].

Why can Jittor-GAN accelerate model training?
Training Jittor models is on average 2.26 times faster than training equivalent PyTorch models. There are three main reasons. Firstly, Jittor's unique operator fusion mechanism saves much memory access time. Secondly, Jittor's optimizations make better use of GPU computing resources. Thirdly, Jittor's precise backpropagation algorithm avoids computing derivatives of parameters that do not need to be updated.

Operator fusion
Jittor proposed the concept of meta-operators, which cover three operator categories: reindex, reindex-reduce, and element-wise. Most common operators, e.g., convolution and matrix multiplication, can be built by fusing meta-operators. Jittor also has a unique meta-operator fusion mechanism, which can fuse adjacent operators together. After fusion, intermediate results do not need to be stored in memory, saving memory read and write time. In addition, Jittor uses lazy execution, which allows it to fuse more meta-operators and so optimize further. Lazy execution separates construction of the calculation graph from calculation: calculation is performed only when a result is required or the calculation graph reaches a certain scale.
In contrast, PyTorch uses eager execution, so results are calculated as soon as each part of the calculation graph is constructed. For example, when computing convolutions, the calculation is performed immediately after the image data is input, and the results are stored in memory. This has the advantage that the network structure can be particularly flexible, and the network can include conditional or loop statements. The disadvantage is that it limits the potential for optimization. If two meta-operators belong to different operators but could be merged, eager execution must save the result of the first meta-operator in memory, so it cannot be merged with the second. Lazy execution instead determines the largest possible calculation graph and optimizes it with maximal meta-operator fusion.
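The difference between the two execution models can be illustrated with a minimal sketch. It is a toy example in plain Python, not Jittor's implementation: the class LazyVar and its methods map, eager_eval, and fused_eval are hypothetical names introduced here. Operations are only recorded when the graph is built; eager_eval then runs one pass per operator and materializes every intermediate list, while fused_eval applies the whole chain of element-wise operations in a single pass.

```python
# Illustrative sketch only: a toy lazy graph, not Jittor's implementation.


class LazyVar:
    """A node in a toy computation graph; nothing is computed at construction."""

    def __init__(self, data=None, fn=None, parent=None):
        self.data = data      # input values (leaf nodes only)
        self.fn = fn          # element-wise function (non-leaf nodes only)
        self.parent = parent  # upstream node

    def map(self, fn):
        # Record the operation instead of executing it (lazy execution).
        return LazyVar(fn=fn, parent=self)

    def eager_eval(self):
        """Run one pass per operator; every operator allocates a new list."""
        if self.parent is None:
            return list(self.data), 0
        values, buffers = self.parent.eager_eval()
        return [self.fn(v) for v in values], buffers + 1

    def fused_eval(self):
        """Collect the whole chain of operators, then apply it in one pass."""
        chain, node = [], self
        while node.parent is not None:
            chain.append(node.fn)
            node = node.parent
        chain.reverse()

        def fused(v):
            for fn in chain:
                v = fn(v)
            return v

        # Only the final output list is allocated.
        return [fused(v) for v in node.data], 1


x = LazyVar(data=[1.0, 2.0, 3.0])
y = x.map(lambda v: v * 2.0).map(lambda v: v + 1.0).map(lambda v: max(v, 0.0))

eager_out, eager_buffers = y.eager_eval()
fused_out, fused_buffers = y.fused_eval()
```

Both evaluations give the same values, but in this sketch the unfused version allocates three lists (two intermediates and the output) whereas the fused version allocates only the output. This is the memory traffic that fusion is meant to avoid.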

GPU utilization
When training a network, greater GPU utilization means that computing resources are more fully used and computation is faster. Jittor improves GPU utilization through its Fetch Sync interface and by reducing the number of kernel launches.

Async fetch
Fetch Sync is a unique asynchronous interface in Jittor.
When training a network, users often output the loss of each round for observation. To output network results, PyTorch forces GPU and CPU data synchronization using the CUDA synchronize function. This blocks execution until the required output has been computed and copied to the CPU, which empties the pipeline and lowers GPU utilization. Fetch Sync instead supports asynchronous acquisition of network results: the corresponding function is called for output once the network result has been calculated and transmitted.
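The contrast between blocking and callback-based retrieval can be sketched with Python's standard library. This is an analogy only and does not use Jittor's Fetch Sync interface or any PyTorch call: a single-worker thread pool stands in for the GPU queue, and fake_step, train_blocking, and train_async are hypothetical names introduced for illustration.

```python
# Analogy only: a thread pool stands in for the GPU work queue.
from concurrent.futures import ThreadPoolExecutor


def fake_step(i):
    """Stand-in for one training step on the device; returns a 'loss'."""
    return 1.0 / (i + 1)


def train_blocking(steps):
    """Wait for every loss before submitting the next step."""
    losses = []
    with ThreadPoolExecutor(max_workers=1) as device:
        for i in range(steps):
            future = device.submit(fake_step, i)
            # Blocks the submitting thread, so the queue drains at every step.
            losses.append(future.result())
    return losses


def train_async(steps):
    """Queue all steps; each loss is handed to a callback when it is ready."""
    losses = []
    with ThreadPoolExecutor(max_workers=1) as device:
        for i in range(steps):
            future = device.submit(fake_step, i)
            # Does not block: the callback runs once the result is available.
            future.add_done_callback(lambda f: losses.append(f.result()))
    # Leaving the 'with' block waits for all queued work to finish.
    return losses
```

Both functions collect the same losses. The difference is that train_async never stalls the submitting thread, so the queue of pending work stays full.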

Kernel launches
Small amounts of data (for example, training WGAN on the MNIST dataset uses inputs of size 64 × 1 × 28 × 28) are processed quickly by the GPU, so the GPU frequently has to wait, resulting in low GPU utilization in PyTorch. Jittor's operator fusion reduces the number of kernel launches, thereby reducing CPU-GPU communication and improving GPU scheduling.
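A simple cost model shows why the number of launches matters when each kernel does little work. It is a sketch under stated assumptions: the function step_time_us is hypothetical, and the numbers in the usage lines are illustrative placeholders, not measurements of Jittor or PyTorch.

```python
def step_time_us(num_launches, launch_overhead_us, total_compute_us):
    """Toy model: every launch pays a fixed overhead; compute time is unchanged."""
    return num_launches * launch_overhead_us + total_compute_us


# Placeholder values chosen only to illustrate the trend.
unfused = step_time_us(num_launches=30, launch_overhead_us=10, total_compute_us=100)
fused = step_time_us(num_launches=10, launch_overhead_us=10, total_compute_us=100)
```

In this model the compute time is the same in both cases, so the whole saving comes from launching fewer kernels. The smaller the per-kernel workload, the larger the share of step time that the fixed launch overhead accounts for.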

Precise backward algorithm
A GAN model has a generator and a discriminator. The discriminator determines whether an image is real or generated by the generator. The task of the generator is to generate images that are difficult for the discriminator to distinguish from real ones, which provides the adversarial training signal.
When calculating parameters' gradients, PyTorch uses loss.backward() to propagate the gradient to all related parameters, while Jittor uses jt.grad(loss, parameters) to avoid unnecessary gradient calculations. For example, gradients of the generator do not need to be calculated when training the discriminator. PyTorch calculates the gradients of both the generator and the discriminator if the variables fed to the discriminator are not detached, while Jittor calculates only the gradients of the discriminator. Jittor's gradient calculation is therefore point-to-point, and requires less computation than PyTorch's.
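The idea of computing gradients only for the requested parameters can be sketched with a toy reverse-mode autodiff on scalars. This is not the implementation of jt.grad or of loss.backward(): the class Node and the function grad below are hypothetical, and the "discriminator loss" is a deliberately simplified expression. The function grad propagates only along edges that lead to a requested target and returns the number of edges it visited.

```python
# Toy reverse-mode autodiff on scalars; not either framework's implementation.


class Node:
    """A scalar value together with the local derivatives to its parents."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # upstream nodes
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))


def grad(loss, targets):
    """Return (gradients for targets, number of graph edges propagated)."""
    target_ids = {id(t) for t in targets}
    needed = {}  # id(node) -> True if the node is, or depends on, a target

    def depends_on_target(node):
        if id(node) not in needed:
            needed[id(node)] = id(node) in target_ids or any(
                depends_on_target(p) for p in node.parents)
        return needed[id(node)]

    order, seen = [], set()

    def visit(node):
        # Post-order traversal restricted to nodes that can reach a target.
        if id(node) in seen or not depends_on_target(node):
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(loss)
    grads, edges = {id(loss): 1.0}, 0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            if depends_on_target(parent):
                grads[id(parent)] = (grads.get(id(parent), 0.0)
                                     + grads[id(node)] * local)
                edges += 1
    return [grads.get(id(t), 0.0) for t in targets], edges


g, d = Node(2.0), Node(0.5)   # generator and discriminator "parameters"
z, x = Node(3.0), Node(4.0)   # latent input and real sample
fake = g * z                  # generator output
loss_d = d * x + d * fake     # simplified discriminator loss

only_d, edges_only_d = grad(loss_d, [d])
both, edges_both = grad(loss_d, [d, g])
```

Requesting only the discriminator parameter skips the branch through the generator. In this example that visits 4 edges instead of the 6 needed when gradients for both parameters are propagated.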

Experiments
We first compare training speed for Jittor and PyTorch on the GAN model zoo. We then show generated results for the 4 tasks mentioned in Table 1. The Jittor-GAN model zoo is available on GitHub.

Model training speed
The biggest advantage of our model zoo is that model training is very fast due to the targeted optimizations of the Jittor framework. To demonstrate our speed advantage, we compare with the currently popular deep learning framework PyTorch (version 1.3.0).
To ensure fairness, the network architecture, input images, and network parameters were identical for both frameworks. All models were tested on an NVIDIA Titan RTX graphics card with an E5-2678 v3 CPU. We ran 100 iterations to warm up each model, and then ran 1000 iterations to measure training speed.
Results are shown in Table 1, which gives the number of training iterations per second for the Jittor and PyTorch frameworks. All Jittor models train faster than their PyTorch equivalents, by a factor ranging from 1.27 to 3.83, with an average of 2.26. Using our model zoo can therefore greatly speed up model training during development.
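The measurement protocol described above (warm-up iterations followed by timed iterations, reported as iterations per second) can be sketched as follows. The functions iterations_per_second and speedup are hypothetical helpers, not the authors' benchmark script, and dummy_step is a placeholder for one training iteration. On a GPU, the timer would additionally need to wait for all queued work to finish before the clock is read.

```python
import time


def iterations_per_second(step, warmup=100, iters=1000):
    """Run `warmup` untimed calls of `step`, then time `iters` calls."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def speedup(ips_fast, ips_slow):
    """Ratio of two iterations-per-second figures."""
    return ips_fast / ips_slow


def dummy_step():
    # Placeholder workload standing in for one training iteration.
    sum(i * i for i in range(1000))


ips = iterations_per_second(dummy_step)
```

Applied to the same model in two frameworks, speedup of the two measured rates gives the kind of ratio reported in Table 1.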

Applications
We now consider our models' performance on different datasets for 4 tasks.
Image generation was one of the first applications of GANs and remains one of the most popular. We provide 20 GAN models for image generation, including GAN, CGAN, DCGAN, and WGAN. Outputs for some common and important models using the MNIST dataset are shown in Fig. 1.
Image-to-image translation aims to convert an image to another image domain while keeping the image content consistent. We provide 5 GAN models: CycleGAN, Pix2Pix, BicycleGAN, UNIT, and StarGAN. Example results using the map dataset are shown in Fig. 2.
The super-resolution task aims to generate high-resolution images from lower-resolution ones. Results from ESRGAN on the CelebA dataset are shown in Fig. 3.
Image inpainting aims to fill in missing image blocks in an image. Results from ContextEncoder on the CelebA dataset are shown in Fig. 4.

Fig. 3 Generated results of ESRGAN on the CelebA dataset. The left image of each group is a low-resolution image, and the right image is a high-resolution image output by ESRGAN.

Fig. 4 Generated results of ContextEncoder on the CelebA dataset. The first row is the input image, the second row is the image predicted by ContextEncoder, and the third row is the real image.