Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning
In this paper, we propose a highly flexible and controllable image synthesis method based on a simple contour and a text description. The contour determines the object's basic shape, and the text describes its specific content. The method is validated on the Caltech-UCSD Birds (CUB) and Oxford-102 flower datasets, and the experimental results demonstrate its effectiveness and superiority. Moreover, our method synthesizes high-quality images from hand-drawn contours and free-form text descriptions, which further demonstrates its flexibility and customizability.
Keywords: Computer vision · Deep learning · Customizable synthesis · Generative adversarial networks
To make the image synthesis process more controllable, high-level control information must be provided. Current research approaches this from two directions: controlling the shape of the synthesized object through contour information, and controlling its specific content through text information. Progress has been made in shape control via contours, but such methods control only the shape, not the specific content. Text-based control started with image attributes or class labels, which determine the category of the synthesized content but cannot specify finer details. Reed et al. later proposed synthesizing images from a text description, which makes the whole synthesis process more flexible and gives better control over the synthesized content. However, text-based synthesis cannot control the shape of the synthesized object. Although many improved text-to-image synthesis methods [5, 6, 7] have since been proposed and achieve impressive results, the inability to control object shape remains unsolved.
To address this problem, Reed et al. proposed the Generative Adversarial What-Where Network (GAWWN), which achieved controllable image generation for the first time by combining location information with a text description. Although this method offers a degree of control, its results are not satisfactory, and the bounding-box and key-point information it relies on is coarse and cannot provide fine-grained control over object shape.
To achieve better control and synthesize more realistic results, we propose a customizable synthesis method based on a simple contour and a text description. As shown in Fig. 1, the simple contour determines the shape of the object, and the text description specifies its content. Using this method, high-quality images are obtained from hand-drawn contours and free-form text descriptions. It not only realizes fine-grained control but also generates realistic images.
2 Customizable GAN
2.1 Network Architecture
The architecture of our approach is shown in Figs. 2 and 3. Figure 2 shows the network structure of the generator. In the generator, the simple contour and the text description are encoded separately and then combined; the corresponding result is synthesized by de-convolution.
To combine text embeddings with contour features effectively, spatial replication is performed to expand the text embeddings to the spatial dimensions of the contour features. The fused contour features and text embeddings then pass through two residual units composed of residual blocks, after which the image is synthesized by two up-sampling operations.
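The following PyTorch snippet is a minimal sketch of this fusion step. The channel width (512), text embedding dimension (128), kernel sizes, and the use of transposed convolutions for up-sampling are assumptions for illustration; they are not specified in the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual unit in the style of He et al.: two 3x3 convs with BN and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ContourTextFusion(nn.Module):
    """Sketch of the generator core: fuse contour features with a spatially
    replicated text embedding, apply two residual units, then up-sample twice."""
    def __init__(self, contour_channels=512, text_dim=128):  # widths are assumed
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(contour_channels + text_dim, contour_channels, 3, padding=1),
            nn.BatchNorm2d(contour_channels),
            nn.ReLU(inplace=True),
        )
        self.res_units = nn.Sequential(
            ResidualBlock(contour_channels),
            ResidualBlock(contour_channels),
        )
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(contour_channels, contour_channels // 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(contour_channels // 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(contour_channels // 2, 3, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, contour_feat, text_emb):
        # contour_feat: (B, C, H, W); text_emb: (B, text_dim)
        b, _, h, w = contour_feat.shape
        # spatial replication: tile the text embedding over the feature map
        text_map = text_emb.view(b, -1, 1, 1).expand(b, text_emb.size(1), h, w)
        x = torch.cat([contour_feat, text_map], dim=1)
        return self.upsample(self.res_units(self.reduce(x)))
```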
The discriminator has two tasks: judging whether the input image is real or fake, and judging whether the input image matches the text. Figure 3 shows the network structure of the discriminator. The feature vector of the input image is obtained through down-sampling, which takes two forms: the first extracts features through two convolution layers and uses them to judge whether the image is real or fake; the second extracts image features through five convolutions and combines them with the spatially replicated text vector to judge whether the image and the text match. For the image discrimination loss, the first convolution layer is followed by BN and leaky-ReLU, and a sigmoid directly follows the second layer. For the joint discrimination loss, the combined image and text features pass through two convolution layers to compute the loss, with BN and leaky-ReLU following the first convolution layer.
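A minimal sketch of the two discriminator branches is given below. The channel widths, the 128x128 input resolution, the 1x1 kernel on the first joint convolution, and the BN/leaky-ReLU placement inside the five-conv encoder are assumptions made for illustration; only the overall two-branch structure follows the description above.

```python
import torch
import torch.nn as nn

def conv_bn_lrelu(in_c, out_c):
    """Strided conv followed by BN and leaky-ReLU (assumed building block)."""
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_c),
        nn.LeakyReLU(0.2, inplace=True),
    )

class TwoBranchDiscriminator(nn.Module):
    """Sketch: an unconditional real/fake branch and a joint image-text matching branch."""
    def __init__(self, text_dim=128):
        super().__init__()
        # real/fake branch: BN + leaky-ReLU after the first conv, sigmoid after the second
        self.image_branch = nn.Sequential(
            conv_bn_lrelu(3, 64),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )
        # matching branch: five convolutions to extract image features (128x128 -> 4x4 assumed)
        self.image_encoder = nn.Sequential(
            conv_bn_lrelu(3, 64),
            conv_bn_lrelu(64, 128),
            conv_bn_lrelu(128, 256),
            conv_bn_lrelu(256, 512),
            conv_bn_lrelu(512, 512),
        )
        # two convs on the fused image/text features: BN + leaky-ReLU after the first
        self.joint = nn.Sequential(
            nn.Conv2d(512 + text_dim, 512, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4),
            nn.Sigmoid(),
        )

    def forward(self, image, text_emb):
        real_fake = self.image_branch(image).view(image.size(0), -1).mean(dim=1)
        feat = self.image_encoder(image)
        b, _, h, w = feat.shape
        # replicate the text embedding and concatenate with the image features
        text_map = text_emb.view(b, -1, 1, 1).expand(b, text_emb.size(1), h, w)
        match = self.joint(torch.cat([feat, text_map], dim=1))
        return real_fake, match.view(b, -1).mean(dim=1)
```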
2.2 Adversarial Learning
2.3 Training Details
In the training process, the initial learning rate is set to 0.0002 and decays to half of its value every 100 epochs. Adam optimization with a momentum of 0.5 is used to update the parameters. The network is trained for 600 epochs with a batch size of 64.
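A sketch of this optimization setup in PyTorch is shown below. The learning rate, momentum (beta1), decay schedule, epoch count, and batch size follow the paragraph above; the second Adam beta (0.999) and the placeholder modules are assumptions, and the inner training step is elided.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholder modules stand in for the generator and discriminator.
generator = nn.Sequential(nn.Linear(100, 128))
discriminator = nn.Sequential(nn.Linear(128, 1))

# Adam with an initial learning rate of 0.0002 and momentum (beta1) of 0.5
g_opt = Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_opt = Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# the learning rate is halved every 100 epochs
g_sched = StepLR(g_opt, step_size=100, gamma=0.5)
d_sched = StepLR(d_opt, step_size=100, gamma=0.5)

num_epochs, batch_size = 600, 64
for epoch in range(num_epochs):
    # ... one pass over the training set with the given batch size ...
    g_sched.step()
    d_sched.step()
```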
3.1 Dataset and Data Preprocessing
We validate our method on the CUB and Oxford-102 flower datasets. The CUB dataset contains 11,788 images from 200 classes; the Oxford-102 dataset contains 8,189 images from 102 classes. Each image has ten corresponding text descriptions. Following Reed et al., we split CUB into 150 training classes and 50 test classes, and Oxford-102 into 82 training classes and 20 test classes.
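A minimal sketch of the class-disjoint split is shown below. The assumption that the first class IDs form the training set is for illustration only; in practice the split files published by Reed et al. should be used.

```python
def split_classes(class_ids, num_train):
    """Split a list of class IDs into class-disjoint train/test sets."""
    class_ids = sorted(set(class_ids))
    return class_ids[:num_train], class_ids[num_train:]

cub_train, cub_test = split_classes(range(1, 201), num_train=150)       # 150 / 50 classes
flower_train, flower_test = split_classes(range(1, 103), num_train=82)  # 82 / 20 classes
```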
3.2 Qualitative Results
Table 1. Quantitative comparison results on the CUB dataset.
3.3 Quantitative Results
For the evaluation of the controllable image synthesis model, Human Rank (HR) is used to compare the models. We employed 10 subjects to rank the quality of the images synthesized by the different methods. The text descriptions and contours corresponding to these results all come from the test set and are divided into 10 groups, one per subject. The subjects rank the results along three criteria: consistency, text, and authenticity. “Consistency” indicates whether the result is consistent with the control information (contour, bounding box, or key points). “Text” denotes whether the result matches the text description. “Authenticity” represents how realistic the result is.
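The sketch below shows one plausible way to aggregate such rankings into a Human Rank score; the exact aggregation (here, a plain average of ranks over subjects and criteria, lower is better) is an assumption and the numbers are illustrative only.

```python
from statistics import mean

def human_rank(rankings):
    """Average rank per method over all (subject, criterion) pairs.

    `rankings` maps a (subject, criterion) pair to a dict of
    method -> rank (1 = best); a lower average HR is better.
    """
    methods = next(iter(rankings.values())).keys()
    return {m: mean(r[m] for r in rankings.values()) for m in methods}

# toy example with two subjects and one criterion (hypothetical values)
scores = {
    ("s1", "authenticity"): {"ours": 1, "GAWWN": 2},
    ("s2", "authenticity"): {"ours": 1, "GAWWN": 2},
}
print(human_rank(scores))  # {'ours': 1.0, 'GAWWN': 2.0}
```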
Table 1 shows the comparison with GAWWN on the CUB dataset. The results show that our method is rated best in all respects: our results are more realistic and conform better to both the text and the control information.
3.4 Controllable Image Synthesis
The most important feature of our work is fine-grained controllable image synthesis from hand-drawn contours and free-form text descriptions. The relevant results are shown in Fig. 5. The contours in the figure are drawn by hand, and the text descriptions are written by users. The results not only faithfully reflect the hand-drawn contours and the text descriptions but also have high authenticity. This demonstrates, on the one hand, the effectiveness of our method in synthesizing high-quality images and, on the other hand, its high flexibility and controllability, since all inputs can be specified by the user.
In this paper, we propose a fine-grained customizable image synthesis method based on a simple contour and a text description. The synthesis results demonstrate that our method not only maintains the basic shape of the contour but also conforms to the text description. We evaluate the model on the Caltech-UCSD Birds and Oxford-102 flower datasets, and the experimental results show the effectiveness of our method. In addition, high-quality synthesis results based on hand-drawn contours and user-written descriptions illustrate that our method is highly controllable and flexible.
This research is supported by the Sichuan Science and Technology Program (No. 2020YFS0307), the National Natural Science Foundation of China (No. 61907009), and the Science and Technology Planning Project of Guangdong Province (No. 2019B010150002).
- 1. Goodfellow, I.J., et al.: Generative adversarial nets. In: Neural Information Processing Systems, vol. 27, pp. 2672–2680 (2014)
- 2. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Computer Vision and Pattern Recognition, pp. 5967–5976 (2017)
- 3. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
- 4. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069 (2016)
- 5. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: International Conference on Computer Vision, pp. 5908–5916 (2017)
- 6. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
- 7. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
- 8. Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. In: Neural Information Processing Systems, vol. 29, pp. 885–895 (2016)
- 9. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical report CNS-TR-2011-001, California Institute of Technology (2011)
- 10. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729 (2008)
- 12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
- 13. Reed, S.E., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Computer Vision and Pattern Recognition, pp. 49–58 (2016)
- 14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
- 15. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853 (2015)
- 16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)