Abstract
Although text-to-image generation has made significant progress in producing visually realistic images, the generated images are often not fully consistent with their text descriptions. In this paper, a novel generative adversarial network based on semantic consistency is proposed to generate semantically consistent and realistic images from text descriptions. The proposed method explores the semantic consistency between text and image to achieve efficient cross-modal generation that combines image generation with semantic correlation. A generation network with a hybrid attention mechanism is used to generate images at multiple resolutions, which improves the realism of the generated images. In addition, a semantic comparison module is presented that maps the texts and the generated images into the same semantic space and compares them through consistency refinement and information classification. Extensive experiments on public benchmark datasets demonstrate that the proposed method outperforms competing methods.
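To make the semantic comparison module concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: image and text features are projected into a shared semantic space, a consistency term pulls matched pairs together (consistency refinement), and a classifier head supplies class-level supervision (information classification). The module name `SemanticComparison`, the feature dimensions, and the exact loss formulation are illustrative assumptions, not the paper's published implementation.

```python
# A minimal sketch of a semantic comparison module; all names, dimensions,
# and loss terms are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticComparison(nn.Module):
    """Maps image and text features into one semantic space, scores their
    consistency, and classifies both embeddings for extra supervision."""
    def __init__(self, img_dim=2048, txt_dim=256, sem_dim=256, n_classes=200):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, sem_dim)   # image -> semantic space
        self.txt_proj = nn.Linear(txt_dim, sem_dim)   # text  -> semantic space
        self.classifier = nn.Linear(sem_dim, n_classes)

    def forward(self, img_feat, txt_feat, labels):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # Consistency refinement: maximize cosine similarity of matched pairs.
        consistency_loss = (1.0 - (z_img * z_txt).sum(dim=-1)).mean()
        # Information classification: both embeddings must predict the class.
        logits = self.classifier(torch.cat([z_img, z_txt], dim=0))
        cls_loss = F.cross_entropy(logits, torch.cat([labels, labels], dim=0))
        return consistency_loss + cls_loss
```

At training time, a loss of this form would be added to the adversarial objective so that the generator is penalized when an image drifts semantically from its caption.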
Data availability
The datasets generated and analysed during this study are available in the following repositories: http://cocodataset.org and http://www.vision.caltech.edu/visipedia/CUB-200-2011.html. All other data are available from the authors upon reasonable request.
Cite this article
Ma, Y., Liu, L., Zhang, H. et al. Generative adversarial network based on semantic consistency for text-to-image generation. Appl Intell 53, 4703–4716 (2023). https://doi.org/10.1007/s10489-022-03660-8