Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


Autoregressive models have recently achieved results comparable to state-of-the-art Generative Adversarial Networks (GANs) with the help of Vector Quantized Variational AutoEncoders (VQ-VAE). However, autoregressive models have several limitations: they suffer from exposure bias, and their training objective does not guarantee visual fidelity. To address these limitations, we propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models. Applying RAL makes the training process resemble the testing process, which addresses the exposure bias issue. In addition, visual fidelity is further optimized with an adversarial loss inspired by the models' strong counterparts, GANs. Because autoregressive models sample slowly, we propose to use partial generation for faster training. RAL also enables collaboration between the different modules of the VQ-VAE framework. To the best of our knowledge, the proposed method is the first to enable adversarial learning in autoregressive models for image generation. Experiments on synthetic and real-world datasets show improvements over MLE-trained models. The proposed method improves both negative log-likelihood (NLL) and Fréchet Inception Distance (FID), which indicates improvements in visual quality and diversity. The proposed method achieves state-of-the-art results on CelebA at 64×64 image resolution, showing promise for large-scale image generation.
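To make the policy gradient idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy bigram policy over K discrete codes standing in for an autoregressive prior over VQ-VAE indices, and a hypothetical `toy_reward` standing in for a discriminator score. The names `K`, `T`, `sample`, `toy_reward`, `reinforce_step`, and `train` are illustrative only. The short sequence length T loosely echoes the idea of scoring a partial generation rather than a full image.

```python
import numpy as np

K, T = 4, 8  # hypothetical codebook size and (partial) sequence length

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(W, rng):
    """Sample T codes from a bigram policy; W[prev] holds next-code logits."""
    prev, steps = K, []  # row K is the start-of-sequence context
    for _ in range(T):
        p = softmax(W[prev])
        x = int(rng.choice(K, p=p))
        steps.append((prev, x, p))
        prev = x  # the model conditions on its own sample, as at test time
    return steps

def toy_reward(steps):
    """Stand-in for a discriminator score in [0, 1]: fraction of codes equal to 0."""
    return sum(x == 0 for _, x, _ in steps) / T

def reinforce_step(W, rng, batch=32, lr=1.0):
    """One REINFORCE update with a mean-reward baseline; returns the batch mean reward."""
    rollouts = [sample(W, rng) for _ in range(batch)]
    rewards = np.array([toy_reward(s) for s in rollouts])
    baseline = rewards.mean()
    grad = np.zeros_like(W)
    for steps, r in zip(rollouts, rewards):
        for prev, x, p in steps:
            g = -p.copy()
            g[x] += 1.0  # gradient of log softmax w.r.t. logits: onehot(x) - p
            grad[prev] += (r - baseline) * g
    W += lr * grad / batch  # gradient ascent on expected reward
    return baseline

def train(steps=300, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((K + 1, K))  # uniform policy at initialization
    history = [reinforce_step(W, rng) for _ in range(steps)]
    return W, history
```

Calling `train()` returns the learned logits and the per-step mean reward. Because every rollout is sampled from the model itself, the same procedure is used during training and sampling, which is the property the abstract relates to exposure bias. In a real system the reward would come from a learned discriminator rather than a fixed rule.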


Keywords: Autoregressive models; Reinforcement learning; Vector quantized variational autoencoders; Generative adversarial networks




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute for Infocomm Research, A*STAR, Singapore, Singapore
  2. Adobe Research, San Jose, USA
