
Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

  • Original article
  • Published in: The Visual Computer

Abstract

Although text-to-image models aim to generate realistic images that correspond to a text description, producing high-quality and accurate images remains a significant challenge. Most existing text-to-image methods use a two-stage stacked architecture: an initial image with a basic outline is generated first and then refined into a high-resolution image. The quality of the initial image limits this approach, because it directly determines the quality of the final high-resolution output and constrains its randomness. If the initial image is of low quality or lacks detail, the model struggles to generate a high-quality, realistic final image; if it is too rigid or lacks randomness, the final image lacks diversity and appears artificial. To overcome the limitations of the stacked structure, we propose a new generative adversarial network that generates high-resolution images directly from text descriptions, providing a more efficient and effective way to synthesize realistic images from text. Multi-head channel attention and masked cross-attention mechanisms weigh relevance from different perspectives, enhancing features associated with the text description and suppressing features unrelated to the textual information. Image and text information are fused at a fine-grained level, while the masking mechanism reduces computational cost and shortens image generation time. Furthermore, a discriminator-based semantic consistency loss strengthens the visual coherence between text and images, guiding the generator toward images that are more realistic and more closely aligned with their text descriptions. The proposed model improves the semantic consistency between text and images, leading to higher-quality generated images. Extensive experiments confirm its superiority over ControlGAN: the IS score increases from 4.58 to 4.96 on the CUB dataset and from 24.06 to 33.56 on the COCO dataset. Code is available at https://github.com/Leeziying0307/Github.git.
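As a rough illustration of the two mechanisms named in the abstract, the sketch below implements a masked cross-attention block (image regions attend to words, with low-similarity region-word pairs masked out) and a multi-head channel attention block (sentence-conditioned gating of feature-map channels) in PyTorch. Module names, tensor shapes, the quantile-based masking rule, and the sigmoid gating are assumptions made for illustration only; they are not taken from the authors' released code (see the linked repository for that).

```python
# Minimal sketch of the two attention blocks described in the abstract.
# Shapes, thresholds, and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedCrossAttention(nn.Module):
    """Fuse word features into image features, masking low-relevance pairs."""

    def __init__(self, img_dim: int, word_dim: int, mask_ratio: float = 0.5):
        super().__init__()
        self.query = nn.Linear(img_dim, img_dim)   # image regions -> queries
        self.key = nn.Linear(word_dim, img_dim)    # words -> keys
        self.value = nn.Linear(word_dim, img_dim)  # words -> values
        self.mask_ratio = mask_ratio               # fraction of pairs to drop (assumed rule)

    def forward(self, img_feat, word_feat):
        # img_feat: (B, N, C) flattened spatial regions; word_feat: (B, T, D)
        q, k, v = self.query(img_feat), self.key(word_feat), self.value(word_feat)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5  # (B, N, T)
        # Mask region-word pairs whose similarity falls below a per-region
        # quantile, so attention concentrates on words relevant to the text
        # description and the masked pairs add no further computation downstream.
        thresh = torch.quantile(scores, self.mask_ratio, dim=-1, keepdim=True)
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return img_feat + attn @ v  # residual fusion of text into image features


class MultiHeadChannelAttention(nn.Module):
    """Re-weight feature-map channels from the sentence embedding, per head."""

    def __init__(self, channels: int, sent_dim: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.gate = nn.Linear(sent_dim, channels)  # one gate value per channel

    def forward(self, feat_map, sent_emb):
        # feat_map: (B, C, H, W); sent_emb: (B, sent_dim)
        b, c, h, w = feat_map.shape
        gates = torch.sigmoid(self.gate(sent_emb))                # (B, C)
        gates = gates.view(b, self.heads, c // self.heads, 1, 1)  # split into heads
        feat = feat_map.view(b, self.heads, c // self.heads, h, w)
        # Channels judged irrelevant to the sentence are attenuated toward zero.
        return (feat * gates).view(b, c, h, w)
```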


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. The code is available from the corresponding author upon reasonable request.

References

  1. Lewis, P., Perez, E., Piktus, A., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020)

  2. Tang, Y., Han, K., Xu, C., et al.: Augmented shortcuts for vision transformers. Adv. Neural Inf. Process. Syst. 34, 15316–15327 (2021)

  3. Deldjoo, Y., Noia, T.D., Merra, F.A.: A survey on adversarial recommender systems: from attack/defense strategies to generative adversarial networks. ACM Comput. Surv. 54(2), 1–38 (2021)

  4. Kim, H., Kim, J., Yang, H.: A GAN-based face rotation for artistic portraits. Mathematics 10(20), 3860 (2022)

  5. Ramesh, A., Pavlov, M., Goh, G., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning (ICML), PMLR, pp. 8821–8831 (2021)

  6. Ramesh, A., Dhariwal, P., Nichol, A., et al.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint (2022)

  7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014)

  8. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)

  9. Zhang, H., Xu, T., Li, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. arXiv preprint (2018)

  10. Tan, H., Yin, B., Wei, K., Liu, X., Li, X.: ALR-GAN: adaptive layout refinement for text-to-image synthesis. IEEE Trans. Multimed. 25, 8620–8631 (2023)

  11. Zhu, J., Li, Z., Wei, J., et al.: PBGN: phased bidirectional generation network in text-to-image synthesis. Neural Process. Lett. 54(6), 5371–5391 (2022)

  12. Agarwal, V., Sharma, S., Aurelia, S., et al.: Deep learning techniques to improve radio resource management in vehicular communication network. In: Biswas, S.K. (ed.) Sustainable Advanced Computing, pp. 161–171. Springer, Singapore (2022)

  13. Agarwal, V., Sharma, S.: EMVD: efficient multitype vehicle detection algorithm using deep learning approach in vehicular communication network for radio resource management. Int. J. Image Graph. Signal Process. 14(2), 25–37 (2022)

  14. Karras, T., Laine, S., Aittala, M., et al.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

  15. Wang, Y., Qiu, H., Qin, C.: Conditional deformable image registration with spatially-variant and adaptive regularization. arXiv preprint (2023)

  16. Chen, L., Lu, X., Zhang, J., et al.: HINet: half instance normalization network for image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)

  17. Xu, T., Zhang, P., Huang, Q., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint (2017)

  18. Zhang, H., Xu, T., Li, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017)

  19. Karras, T., Aila, T., Laine, S., et al.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint (2018)

  20. Agarwal, V., Sharma, S.: DQN algorithm for network resource management in vehicular communication network. Int. J. Inf. Technol. 15(6), 3371–3379 (2023)

  21. Huang, S., Chen, Y.: Generative adversarial networks with adaptive semantic normalization for text-to-image synthesis. Digit. Signal Process. 120, 103267 (2022)

  22. Liao, W., Hu, K., Yang, M.Y., et al.: Text to image generation with semantic-spatial aware GAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

  23. Zhang, Y., Han, S., Zhang, Z., et al.: CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis. Vis. Comput. 39(4), 1283–1293 (2022)

  24. Peng, D., Yang, W., Liu, C., et al.: SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw. 138, 57–67 (2021)

  25. Li, B., Qi, X., Lukasiewicz, T., et al.: Controllable text-to-image generation. arXiv preprint (2019)

  26. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)

  27. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  28. Vinker, Y., Pajouheshgar, E., Bo, J.Y., et al.: CLIPasso: semantically-aware object sketching. arXiv preprint arXiv:2202.05822 (2022)

  29. Lin, T.Y., Maire, M., Belongie, S., et al.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision (2014)

  30. Szegedy, C., Vanhoucke, V., Ioffe, S.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)


Funding

This work was supported by the National Natural Science Foundation of China (62072150), the Shaanxi Provincial Key Research and Development Program (2023-YBGY-148), the Henan Provincial Science and Technology Plan Project (222102210240), the Henan Provincial Higher Education Key Scientific Research Project (22B520012, 22A510017), and the Shaanxi Provincial Social Science Fund Project (2022M007).

Author information


Contributions

SH was involved in supervision and writing—review and editing; ZL contributed to conceptualization, methodology, and writing—original draft; KW was involved in writing—review and editing; YZ contributed to formal analysis, supervision, and writing—review and editing; and HL was involved in writing—review and editing.

Corresponding authors

Correspondence to Yinggang Zhao or Hui Li.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Consent to participate

All authors agreed to participate in this paper.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hou, S., Li, Z., Wu, K. et al. Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03260-2


  • DOI: https://doi.org/10.1007/s00371-024-03260-2
