
Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

  • Original article
  • Published in: The Visual Computer

Abstract

Although text-to-image models aim to generate realistic images that correspond to a text description, producing high-quality and accurate images remains a significant challenge. Most existing text-to-image methods use a two-stage stacked architecture: an initial image with a basic outline is generated first and then refined into a high-resolution image. The quality of the initial image limits this approach, because it directly determines the quality of the final high-resolution output and constrains its randomness. If the initial image is of low quality or lacks detail, the model struggles to generate a high-quality, realistic final image; if it is too rigid or lacks randomness, the final image lacks diversity and appears artificial. To overcome the limitations of the stacked structure, we propose a new generative adversarial network that generates high-resolution images directly from text descriptions, providing a more efficient and effective way to synthesize realistic images from text. Multi-head channel attention and masked cross-attention mechanisms weigh relevance from different perspectives, enhancing features associated with the text description and suppressing features unrelated to the textual information. Image and text information are fused at a fine-grained level, while the masking mechanism reduces computational cost and shortens image generation time. Furthermore, a discriminator-based semantic consistency loss strengthens the visual coherence between text and images, guiding the generator toward images that are more realistic and more closely aligned with their text descriptions. The proposed model improves the semantic consistency between text and images, leading to higher-quality generated images. Extensive experiments confirm its superiority over ControlGAN: the IS score increases from 4.58 to 4.96 on the CUB dataset and from 24.06 to 33.56 on the COCO dataset. Code is available at https://github.com/Leeziying0307/Github.git.
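As a rough illustration of the two mechanisms named in the abstract, the sketch below implements a masked cross-attention block (image regions attend to words, with low-similarity region-word pairs masked out) and a multi-head channel attention block (sentence-conditioned gating of feature-map channels) in PyTorch. Module names, tensor shapes, the quantile-based masking rule, and the sigmoid gating are assumptions made for illustration only; they are not taken from the authors' released code (see the linked repository for that).

```python
# Minimal sketch of the two attention blocks described in the abstract.
# Shapes, thresholds, and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedCrossAttention(nn.Module):
    """Fuse word features into image features, masking low-relevance pairs."""

    def __init__(self, img_dim: int, word_dim: int, mask_ratio: float = 0.5):
        super().__init__()
        self.query = nn.Linear(img_dim, img_dim)   # image regions -> queries
        self.key = nn.Linear(word_dim, img_dim)    # words -> keys
        self.value = nn.Linear(word_dim, img_dim)  # words -> values
        self.mask_ratio = mask_ratio               # fraction of pairs to drop (assumed rule)

    def forward(self, img_feat, word_feat):
        # img_feat: (B, N, C) flattened spatial regions; word_feat: (B, T, D)
        q, k, v = self.query(img_feat), self.key(word_feat), self.value(word_feat)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5  # (B, N, T)
        # Mask region-word pairs whose similarity falls below a per-region
        # quantile, so attention concentrates on words relevant to the text
        # description and the masked pairs add no further computation downstream.
        thresh = torch.quantile(scores, self.mask_ratio, dim=-1, keepdim=True)
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return img_feat + attn @ v  # residual fusion of text into image features


class MultiHeadChannelAttention(nn.Module):
    """Re-weight feature-map channels from the sentence embedding, per head."""

    def __init__(self, channels: int, sent_dim: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.gate = nn.Linear(sent_dim, channels)  # one gate value per channel

    def forward(self, feat_map, sent_emb):
        # feat_map: (B, C, H, W); sent_emb: (B, sent_dim)
        b, c, h, w = feat_map.shape
        gates = torch.sigmoid(self.gate(sent_emb))                # (B, C)
        gates = gates.view(b, self.heads, c // self.heads, 1, 1)  # split into heads
        feat = feat_map.view(b, self.heads, c // self.heads, h, w)
        # Channels judged irrelevant to the sentence are attenuated toward zero.
        return (feat * gates).view(b, c, h, w)
```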


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. The code is available from the corresponding author upon reasonable request.

References

  1. Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020)

  2. Tang, Y., Han, K., Xu, C., et al.: Augmented shortcuts for vision transformers. Adv. Neural Inf. Process. Syst. 34, 15316–15327 (2021)

  3. Deldjoo, Y., Noia, T.D., Merra, F.A.: A survey on adversarial recommender systems: from attack/defense strategies to generative adversarial networks. ACM Comput. Surv. 54(2), 1–38 (2021)

  4. Kim, H., Kim, J., Yang, H.: A GAN-based face rotation for artistic portraits. Mathematics 10(20), 3860 (2022)

  5. Ramesh, A., Pavlov, M., Goh, G., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning (ICML), PMLR, pp. 8821–8831 (2021)

  6. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint (2022)

  7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014)

  8. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)

  9. Zhang, H., Xu, T., Li, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. arXiv preprint (2018)

  10. Tan, H., Yin, B., Wei, K., Liu, X., Li, X.: ALR-GAN: adaptive layout refinement for text-to-image synthesis. IEEE Trans. Multimed. 25, 8620–8631 (2023)

  11. Zhu, J., Li, Z., Wei, J., et al.: PBGN: phased bidirectional generation network in text-to-image synthesis. Neural Process. Lett. 54(6), 5371–5391 (2022)

  12. Agarwal, V., Sharma, S., Aurelia, S., et al.: Deep learning techniques to improve radio resource management in vehicular communication network. In: Biswas, S.K. (ed.) Sustainable Advanced Computing, pp. 161–171. Springer, Singapore (2022)

  13. Agarwal, V., Sharma, S.: EMVD: efficient multitype vehicle detection algorithm using deep learning approach in vehicular communication network for radio resource management. Int. J. Image Graph. Signal Process. 14(2), 25–37 (2022)

  14. Karras, T., Laine, S., Aittala, M., et al.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

  15. Wang, Y., Qiu, H., Qin, C.: Conditional deformable image registration with spatially-variant and adaptive regularization. arXiv preprint (2023)

  16. Chen, L., Lu, X., Zhang, J., et al.: HINet: half instance normalization network for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)

  17. Xu, T., Zhang, P., Huang, Q., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint (2017)

  18. Zhang, H., Xu, T., Li, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017)

  19. Karras, T., Aila, T., Laine, S., et al.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint (2018)

  20. Agarwal, V., Sharma, S.: DQN algorithm for network resource management in vehicular communication network. Int. J. Inf. Technol. 15(6), 3371–3379 (2023)

  21. Huang, S., Chen, Y.: Generative adversarial networks with adaptive semantic normalization for text-to-image synthesis. Digit. Signal Process. 120, 103267 (2022)

  22. Liao, W., Hu, K., Yang, M.Y., et al.: Text to image generation with semantic-spatial aware GAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  23. Zhang, Y., Han, S., Zhang, Z., et al.: CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis. Vis. Comput. 39(4), 1283–1293 (2022)

  24. Peng, D., Yang, W., Liu, C., et al.: SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw. 138, 57–67 (2021)

  25. Li, B., Qi, X., Lukasiewicz, T., et al.: Controllable text-to-image generation. arXiv preprint (2019)

  26. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)

  27. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  28. Vinker, Y., Pajouheshgar, E., Bo, J.Y., et al.: CLIPasso: semantically-aware object sketching. arXiv preprint arXiv:2202.05822 (2022)

  29. Lin, T.Y., Maire, M., Belongie, S., et al.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision (2014)

  30. Szegedy, C., Vanhoucke, V., Ioffe, S.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)


Funding

This work was supported by the National Natural Science Foundation of China (62072150), the Shaanxi Provincial Key Research and Development Program (2023-YBGY-148), the Henan Provincial Science and Technology Plan Project (222102210240), the Henan Provincial Higher Education Key Scientific Research Project (22B520012, 22A510017), and the Shaanxi Provincial Social Science Fund Project (2022M007).

Author information


Contributions

SH was involved in supervision and writing—review and editing; ZL contributed to conceptualization, methodology, and writing—original draft; KW was involved in writing—review and editing; YZ contributed to formal analysis, supervision, and writing—review and editing; and HL was involved in writing—review and editing.

Corresponding authors

Correspondence to Yinggang Zhao or Hui Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Consent to participate

All authors agreed to participate in this paper.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hou, S., Li, Z., Wu, K. et al. Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03260-2


  • DOI: https://doi.org/10.1007/s00371-024-03260-2
