
PMGAN: pretrained model-based generative adversarial network for text-to-image generation

  • Original article
  • Published in The Visual Computer

Abstract

Text-to-image generation is a challenging task. Although diffusion models can generate high-quality images of complex scenes, they sometimes lack realism. Additionally, images generated from different texts with the same semantics often vary widely, and fine details are sometimes rendered insufficiently. Generative adversarial networks, by contrast, can generate realistic images that are consistent with the text descriptions, and they can generate content-consistent images. In this paper, we argue that generating images that are more consistent with the text descriptions is more important than generating higher-quality images. Therefore, this paper proposes the pretrained model-based generative adversarial network (PMGAN). PMGAN utilizes multiple pretrained models in both the generator and the discriminator. Specifically, in the generator, the deep attentional multimodal similarity model text encoder extracts word and sentence embeddings from the input text, and the contrastive language-image pre-training (CLIP) text encoder extracts initial image features from the input text. In the discriminator, a pretrained CLIP image encoder extracts image features from the input image. The CLIP encoders can map text and images into a common semantic space, which is beneficial for generating high-quality images. Experimental results show that, compared to state-of-the-art methods, PMGAN achieves better scores on both inception score and Fréchet inception distance and can produce higher-quality images while maintaining greater consistency with the text descriptions.
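To make the pipeline described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a generator conditioned on pretrained CLIP text features and a discriminator fed by pretrained CLIP image features. The Hugging Face CLIP wrappers, the checkpoint name, the network sizes, and the 64x64 output resolution are assumptions chosen for brevity; this is not the authors' PMGAN implementation, which additionally uses a DAMSM text encoder for word and sentence embeddings.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel, CLIPVisionModel

CLIP_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint; the paper does not name one here
tokenizer = CLIPTokenizer.from_pretrained(CLIP_NAME)
text_enc = CLIPTextModel.from_pretrained(CLIP_NAME).eval()     # frozen pretrained text encoder
image_enc = CLIPVisionModel.from_pretrained(CLIP_NAME).eval()  # frozen pretrained image encoder


class Generator(nn.Module):
    """Noise vector + CLIP sentence feature -> 64x64 RGB image (toy architecture)."""

    def __init__(self, z_dim=100, t_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + t_dim, 256 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.Upsample(scale_factor=4), nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4), nn.Conv2d(128, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, z, text_feat):
        return self.net(torch.cat([z, text_feat], dim=1))


class Discriminator(nn.Module):
    """CLIP image feature + CLIP sentence feature -> realism/matching logit."""

    def __init__(self, i_dim=768, t_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(i_dim + t_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1)
        )

    def forward(self, img_feat, text_feat):
        return self.net(torch.cat([img_feat, text_feat], dim=1))


# One forward pass with a toy caption (CLIP pixel normalization is omitted for brevity).
tokens = tokenizer(["a small yellow bird with black wings"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_feat = text_enc(**tokens).pooler_output            # (1, 512) sentence embedding

G, D = Generator(), Discriminator()
fake = G(torch.randn(1, 100), text_feat)                     # (1, 3, 64, 64) generated image
fake_224 = nn.functional.interpolate(fake, size=224)         # ViT-B/32 expects 224x224 input
with torch.no_grad():
    img_feat = image_enc(pixel_values=fake_224).pooler_output  # (1, 768) image feature
score = D(img_feat, text_feat)                               # adversarial/matching logit
```

Because both encoders are pretrained and frozen in this sketch, the adversarial game is played in CLIP's joint text-image space rather than in raw pixel space, which is the intuition behind conditioning both networks on CLIP features.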


Availability of data and materials

The datasets generated and analyzed during the current study are available in the CUB-200-2011 repository, http://www.vision.caltech.edu/datasets/cub_200_2011/, and the COCO 2014 repository, https://cocodataset.org/#download.
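As a convenience, here is a short, hypothetical example of loading the COCO 2014 caption annotations with torchvision once the files above have been downloaded; the directory names are placeholders and pycocotools must be installed. CUB-200-2011 ships as plain image folders with separate caption files, so loading it is dataset-specific and not shown.

```python
from torchvision import datasets, transforms

# Placeholder paths: point these at the files downloaded from https://cocodataset.org/#download
coco_train = datasets.CocoCaptions(
    root="coco2014/train2014",                               # folder of training images
    annFile="coco2014/annotations/captions_train2014.json",  # caption annotation file
    transform=transforms.ToTensor(),
)

image, captions = coco_train[0]   # image tensor and its list of human-written captions
print(len(coco_train), captions[0])
```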



Acknowledgements

This work was supported by the National Natural Science Foundation of China [Grant Number 61807002].


Author information


Corresponding author

Correspondence to Yue Yu.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yu, Y., Yang, Y. & Xing, J. PMGAN: pretrained model-based generative adversarial network for text-to-image generation. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03326-1

