Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1332)


In this paper, we propose a highly flexible and controllable image synthesis method based on a simple contour and a text description. The contour determines the object's basic shape, and the text describes its specific content. The method is validated on the Caltech-UCSD Birds (CUB) and Oxford-102 flower datasets, and the experimental results demonstrate its effectiveness and superiority. Moreover, our method can synthesize high-quality images from hand-drawn contours and free-form text descriptions, which further demonstrates its flexibility and customizability.


Computer vision · Deep learning · Customizable synthesis · Generative adversarial networks

1 Introduction

In computer vision, image synthesis has always been an essential but challenging research topic. In recent years, with the development of deep learning, and especially the introduction of generative adversarial networks (GAN) [1], image synthesis has made significant breakthroughs. However, the input of the original GAN is a noise vector drawn from a Gaussian distribution, so the synthesis process cannot be controlled manually.
Fig. 1.

The figure displays the results for the corresponding birds and flowers under different texts and contours. The contours in the top rows are obtained by pre-processing the original datasets; the contours in the bottom rows are drawn by hand.

To make the image synthesis process more controllable, it is necessary to provide high-level control information. Current research approaches this from two directions: controlling the shape of the synthesized object through contour information, and controlling its specific content through text information. Some progress [2] has been made on shape control via contours, but its main limitation is that it controls only the shape, not the specific content. Text-based control started with image attributes or class labels [3], which can determine the category of the synthesized content but cannot specify finer details. Reed et al. [4] then proposed a method for synthesizing images from text descriptions, which makes the synthesis process more flexible and gives better control over the specific content. However, text-based synthesis cannot control the shape of the synthesized object. Although many improved text-to-image synthesis methods [5, 6, 7] have since been proposed and have achieved impressive results, the inability to control object shape remains unsolved.

To address this problem, Reed et al. [8] proposed the Generative Adversarial What-Where Network (GAWWN) and, by combining location information with the text description, realized controllable image generation for the first time. Although this method achieves good control, its results are not satisfactory, and the bounding-box and key-point annotations it uses are coarse and cannot provide fine-grained control over object shape.

To achieve better control and synthesize more realistic results, we propose a customizable synthesis method based on a simple contour and a text description. As shown in Fig. 1, the simple contour determines the specific shape of the object, and the text description then generates its specific content. With this method, high-quality images can be obtained from hand-drawn contours and artificial text descriptions. It not only realizes fine-grained control but also generates realistic images.

The main contributions of this paper are as follows: (1) an effective customizable image synthesis method that achieves fine-grained control and high-quality image generation; (2) a synthesis process that can be controlled manually from end to end, which makes our method highly flexible; (3) experiments on the Caltech-UCSD Birds (CUB) [9] and Oxford-102 flower [10] datasets that demonstrate the effectiveness of the method.
Fig. 2.

The generator structure of the model. The generator synthesizes the corresponding image based on the text description and contour. The blue and brown cubes in the figure represent the corresponding extracted contour and text features. (Color figure online)

The rest of this paper is arranged as follows. Our method details are discussed in Sect. 2 and validated in Sect. 3 with promising experimental results. Section 4 concludes our work.

2 Customizable GAN

2.1 Network Architecture

The architecture of our approach is shown in Figs. 2 and 3. Figure 2 shows the network structure of the generator. In the generator, the simple contour and the text description are encoded separately and then combined. The corresponding result is then synthesized by de-convolution [11].

Specifically, in the generator, the contour is processed by a three-layer convolutional neural network (CNN), with each layer followed by a ReLU activation. Except for the first layer, each ReLU is preceded by a Batch Normalization (BN) operation [12]. The text description is first encoded into a text vector by a pre-trained text encoder [13] and then projected to a higher dimension by a fully connected layer. Finally, the number of text embeddings is increased by the conditioning augmentation (CA) [7] technique.
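The text-conditioning path above can be sketched as follows. This is a minimal PyTorch sketch of conditioning augmentation in the style of [7]: a fully connected layer maps the pre-trained text embedding to a Gaussian, from which the conditioning vector is sampled. The dimensions (1024-d text embedding, 128-d condition) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Map a pre-trained text embedding to a Gaussian and sample from it
    (conditioning-augmentation sketch; dimensions are illustrative)."""
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        # One fully connected layer produces mean and log-variance jointly.
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_embedding):
        stats = self.fc(text_embedding)
        mu, logvar = stats.chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)          # reparameterization trick
        return mu + eps * std, mu, logvar

cond = ConditioningAugmentation()
c, mu, logvar = cond(torch.randn(4, 1024))   # batch of 4 text embeddings
```

Sampling rather than using the raw embedding smooths the text-conditioning manifold, which is why CA increases the effective number of text embeddings.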
Fig. 3.

The discriminator structure of the model. The discriminator judges whether the received image is real or fake and how well the image matches the text.

To combine the text embeddings with the contour features effectively, spatial replication is performed to expand the dimensions of the text embeddings. After the contour features and text embeddings are fused, they pass through two residual units composed of residual blocks [14]. The corresponding image is then synthesized by two up-sampling operations.
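The fusion step can be sketched as follows: the text code is tiled across the spatial grid of the contour feature map, concatenated channel-wise, and passed through a residual block [14]. The channel widths and the 16x16 spatial size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def spatially_replicate(text_code, height, width):
    """Tile a (B, C) text code to (B, C, H, W) so it can be concatenated
    with a convolutional contour feature map."""
    b, c = text_code.shape
    return text_code.view(b, c, 1, 1).expand(b, c, height, width)

class ResidualBlock(nn.Module):
    """Standard pre-activation-free residual block: two 3x3 convs with BN,
    added back to the input."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

# Illustrative shapes: 256-channel contour features at 16x16, 128-d text code.
contour_feat = torch.randn(4, 256, 16, 16)
text_code = torch.randn(4, 128)
fused = torch.cat([contour_feat, spatially_replicate(text_code, 16, 16)], dim=1)
out = ResidualBlock(384)(fused)   # 256 + 128 = 384 channels
```

Spatial replication lets every spatial location of the contour feature map see the same text code, so the residual units can modulate local shape features with global content information.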

The discriminator has two tasks: judging whether the input image is real or fake, and judging whether the input image and text match. Figure 3 shows its network structure. In the discriminator, the feature vector of the input image is obtained through down-sampling, which takes two forms: one branch extracts features through two convolutional layers and uses them to judge whether the image is real or fake; the other extracts image features through five convolutions and combines them with the dimension-expanded text vector to judge whether the image and text match. In the image discrimination branch, the first convolutional layer is followed by BN and leaky-ReLU [15], and a sigmoid function directly follows the second layer. In the joint discrimination branch, the combined image and text features pass through two convolutional layers to compute the loss, with BN and leaky-ReLU following the first.
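A two-headed discriminator of this kind can be sketched as below. Layer counts and channel widths are illustrative assumptions (the paper's Fig. 3 fixes the exact configuration), and the input is assumed to be a 64x64 RGB image.

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Sketch of a discriminator with an image-realism head and a joint
    image-text matching head; sizes are illustrative, not the paper's."""
    def __init__(self, cond_dim=128):
        super().__init__()
        def down(cin, cout):
            return [nn.Conv2d(cin, cout, 4, 2, 1),
                    nn.BatchNorm2d(cout),
                    nn.LeakyReLU(0.2, inplace=True)]
        # Real/fake head: two convolutions, sigmoid after the second.
        self.real_fake = nn.Sequential(
            *down(3, 64), nn.Conv2d(64, 1, 4, 2, 1), nn.Sigmoid())
        # Matching tower: repeated downsampling to a compact feature map.
        self.tower = nn.Sequential(
            *down(3, 64), *down(64, 128), *down(128, 256), *down(256, 512))
        # Joint head: fuse tower features with the replicated text vector.
        self.match = nn.Sequential(
            nn.Conv2d(512 + cond_dim, 512, 3, 1, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4),   # 4x4 feature map -> single score
            nn.Sigmoid())

    def forward(self, img, text_code):
        b = img.size(0)
        d0 = self.real_fake(img).view(b, -1).mean(dim=1)   # image realism
        feat = self.tower(img)                             # (B, 512, 4, 4)
        tiled = text_code.view(b, -1, 1, 1).expand(b, text_code.size(1), 4, 4)
        d1 = self.match(torch.cat([feat, tiled], dim=1)).view(b)
        return d0, d1

disc = TwoHeadDiscriminator()
d0, d1 = disc(torch.randn(4, 3, 64, 64), torch.randn(4, 128))
```

The two sigmoid outputs correspond to \(D_0\) (authenticity) and \(D_1\) (image-text matching) used in the losses of Sect. 2.2.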

2.2 Adversarial Learning

Three types of text input are used during adversarial training: the matching text \(T\), the mismatching text \(T_{mis}\), and the relevant text \(T_{rel}\). During training, the generator synthesizes the corresponding image from the simple contour and text description, and the generated result is then sent to the discriminator. The discriminator must distinguish three situations: a real image with matching text, a fake image with relevant text, and a real image with mismatching text. In each case, the discriminator judges both the authenticity of the image and the consistency between image and text. The specific loss functions are as follows:
$$\begin{aligned} L_G={\textstyle \sum _{(I,T)\sim p_{data}}}\log D_0(I_{fake},T_{rel}) +\log D_1(I_{fake},T_{rel}) \end{aligned}$$
$$\begin{aligned} \begin{aligned} L_D={\textstyle \sum _{(I,T)\sim p_{data}}}\{\log D_0(I_{real},T) + [\log (1-D_0 (I_{real},T_{mis}))\\ + \log (1-D_0(I_{fake},T_{rel}))]/2\} \\ +\ \{\log D_1(I_{real},T) + [\log (1-D_1(I_{real},T_{mis}))\\ + \log (1-D_1(I_{fake},T_{rel}))]/2\} \end{aligned} \end{aligned}$$
where \(D_0\) denotes the discriminator's image-authenticity output and \(D_1\) its image-text matching output.
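Written as minimizable losses, the two objectives can be sketched as below. The generator maximizes both discriminator outputs on fakes paired with relevant text; the discriminator treats mismatched real-text pairs and fakes as negatives for both heads, in the standard matching-aware style. The inputs are the sigmoid scores \(D_0\) and \(D_1\) for each batch element; `eps` guards the logarithm.

```python
import torch

def generator_loss(d0_fake_rel, d1_fake_rel, eps=1e-8):
    """Negative of L_G: push both discriminator scores up on
    (fake image, relevant text) pairs."""
    return -(torch.log(d0_fake_rel + eps)
             + torch.log(d1_fake_rel + eps)).mean()

def discriminator_loss(d0_real_match, d0_real_mis, d0_fake_rel,
                       d1_real_match, d1_real_mis, d1_fake_rel, eps=1e-8):
    """Negative of L_D: real+matching pairs are positives; mismatched
    and fake pairs are negatives, each weighted by 1/2."""
    l0 = torch.log(d0_real_match + eps) + 0.5 * (
        torch.log(1 - d0_real_mis + eps) + torch.log(1 - d0_fake_rel + eps))
    l1 = torch.log(d1_real_match + eps) + 0.5 * (
        torch.log(1 - d1_real_mis + eps) + torch.log(1 - d1_fake_rel + eps))
    return -(l0 + l1).mean()

# Toy scores for a single sample (illustrative values only).
g = generator_loss(torch.tensor([0.9]), torch.tensor([0.8]))
d = discriminator_loss(torch.tensor([0.9]), torch.tensor([0.1]),
                       torch.tensor([0.2]), torch.tensor([0.9]),
                       torch.tensor([0.1]), torch.tensor([0.2]))
```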

2.3 Training Details

During training, the initial learning rate is set to 0.0002 and decays to half its value every 100 epochs. Adam optimization [16] with a momentum of 0.5 is used to update the parameters. The network is trained for a total of 600 epochs with a batch size of 64.
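In PyTorch, this schedule can be set up as follows; the model here is a tiny placeholder standing in for the actual generator or discriminator, and the per-epoch batch is a dummy.

```python
import torch

# Stated hyperparameters: Adam with beta1 = 0.5, initial LR 2e-4 halved
# every 100 epochs, 600 epochs total, batch size 64.
model = torch.nn.Linear(16, 1)   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(600):
    optimizer.zero_grad()
    loss = model(torch.randn(64, 16)).pow(2).mean()  # placeholder batch
    loss.backward()
    optimizer.step()
    scheduler.step()  # halves the learning rate every 100 epochs

final_lr = optimizer.param_groups[0]["lr"]
```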

3 Experiments

3.1 Dataset and Data Preprocessing

We validated our method on the CUB and Oxford-102 flower datasets. The CUB dataset contains 11,788 images in 200 classes; the Oxford-102 dataset contains 8,189 images in 102 classes. Each image has ten corresponding text descriptions. Following Reed et al. [4], we split CUB into 150 training classes and 50 test classes, and Oxford-102 into 82 training classes and 20 test classes.

To perform customizable synthesis experiments, the contours must be pre-processed. For the bird dataset, we first download the corresponding binary segmentation images from the official website, then turn the black background white and retain only the outermost contour lines. For the flower dataset, the official website provides foreground maps on a blue background; a pure foreground map is obtained by turning the blue pixels white, and the contour is then extracted with the Canny operator.
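The blue-to-white step can be sketched with plain NumPy. The background test used here (blue dominating red and green by a ratio) is an illustrative assumption, not the paper's exact rule; Canny edge extraction would follow on the cleaned map.

```python
import numpy as np

def blue_background_to_white(rgb, blue_thresh=1.5):
    """Turn the blue background of a flower foreground map white, leaving
    flower pixels untouched. `blue_thresh` is an illustrative ratio: a
    pixel counts as background when blue dominates red and green."""
    img = rgb.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    background = b > blue_thresh * np.maximum(r, g) + 1e-6
    out = rgb.copy()
    out[background] = 255   # paint background pixels white
    return out

# Tiny synthetic example: one blue (background) pixel, one red (flower) pixel.
demo = np.array([[[0, 0, 200], [200, 10, 10]]], dtype=np.uint8)
cleaned = blue_background_to_white(demo)
```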
Fig. 4.

The comparison between our method and GAWWN (including two results based on bounding box and key points).

3.2 Qualitative Results

We compare our method with the existing controllable image synthesis method (GAWWN), as shown in Fig. 4. GAWWN uses two kinds of annotation: bounding boxes and key points. In the figure, GAWWN_bb denotes the GAWWN result based on the bounding box, and GAWWN_kp the result based on key points. The results synthesized from bounding boxes and key points generally look less authentic. By contrast, the results synthesized by our method are more authentic overall, and in details such as smoothness and texture our results are also better than GAWWN's. Moreover, the shapes generated by GAWWN are rough and cannot be controlled accurately, whereas our method controls the specific shape precisely because it takes contour information as input.
Table 1.

The quantitative comparison results in the CUB dataset.

3.3 Quantitative Results

To evaluate the controllable image synthesis models, Human Rank (HR) is used for quantitative comparison. We recruited 10 subjects to rank the quality of the images synthesized by the different methods. The text descriptions and contours corresponding to these results all come from the test set and are divided into 10 groups, one per subject. The subjects rank the results along three criteria: consistency, text, and authenticity. “Consistency” indicates whether the result is consistent with the control information (contour, bounding box, or key points); “Text” denotes whether the result matches the text description; “Authenticity” represents how authentic the result looks.

For the bird results, we set up one quantitative comparison containing three methods: 1) GAWWN_bb, 2) GAWWN_kp, 3) ours. The subjects were not told which method produced which result; they saw only the text description and the contour, bounding box, or key points corresponding to the current result. The subjects were asked to rank the results (1 is best, 3 is worst).
Fig. 5.

The text descriptions on the left are all artificial descriptions that do not exist in the dataset. The contours are also drawn manually.

Table 1 shows the comparison with GAWWN on the CUB dataset. The results clearly show that our method is judged the best in all respects: our results are more authentic and conform better to the text and control information.

3.4 Controllable Image Synthesis

The most important feature of our work is fine-grained controllable image synthesis based on hand-drawn contours and artificial text descriptions. The relevant results are shown in Fig. 5. The contours in the figure are drawn by hand, and the text descriptions are written by people. The results reflect the hand-drawn contours and the content of the artificial descriptions well, while also looking highly authentic. This demonstrates, on the one hand, the effectiveness of our method in synthesizing high-quality images and, on the other hand, its high flexibility and controllability, since all inputs can be specified manually.

4 Conclusion

In this paper, we propose a fine-grained, customizable image synthesis method based on a simple contour and a text description. The synthesis results demonstrate that our method not only maintains the basic shape of the contour but also conforms to the text description. We evaluated the model on the Caltech-UCSD Birds and Oxford-102 flower datasets, and the experimental results show the effectiveness of our method. In addition, high-quality synthesis results based on hand-drawn contours and artificial descriptions illustrate that our method is highly controllable and flexible.



This research is supported by Sichuan Science and Technology Program (No. 2020YFS0307), National Natural Science Foundation of China (No. 61907009), Science and Technology Planning Project of Guangdong Province (No. 2019B010150002).


References

1. Goodfellow, I.J., et al.: Generative adversarial nets. In: Neural Information Processing Systems, vol. 27, pp. 2672–2680 (2014)
2. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Computer Vision and Pattern Recognition, pp. 5967–5976 (2017)
3. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
4. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069 (2016)
5. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: International Conference on Computer Vision, pp. 5908–5916 (2017)
6. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
7. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
8. Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. In: Neural Information Processing Systems, vol. 29, pp. 885–895 (2016)
9. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical report CNS-TR-2011-001, California Institute of Technology (2011)
10. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729 (2008)
11. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014)
12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
13. Reed, S.E., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Computer Vision and Pattern Recognition, pp. 49–58 (2016)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
15. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853 (2015)
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Southwest University of Science and Technology, Mianyang, China
2. Hosei University, Tokyo, Japan
3. Xidian University, Xi’an, China
4. Guangdong University of Technology, Guangzhou, China
