
A review of multimodal learning for text to images

Multimedia Tools and Applications

Abstract

Information exists in many forms in the real world, and the effective interaction and fusion of multimodal information plays a key role in computer vision and deep learning research. Generating an image that matches a given text description is a multimodal task that demands both a strong generative model and cross-modal understanding. This paper provides a comprehensive analysis of recent advances in text-to-image generation, together with a taxonomy based on model architecture and characteristics. We classify text-to-image methods by their underlying framework: generative adversarial networks (GANs), transformers, and diffusion models. For each category, we describe the network structure with its advantages and disadvantages, present the benchmark datasets and corresponding evaluation metrics, and summarize application progress and experimental results. Finally, we discuss current research challenges and possible future research directions and applications.
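To make the diffusion branch of this taxonomy concrete, the sketch below generates an image from a caption with a pretrained latent diffusion model. This is an illustrative example only, not code from the surveyed paper; it assumes the open-source Hugging Face diffusers library and the publicly released Stable Diffusion v1.5 checkpoint.

```python
# Illustrative sketch only: assumes the Hugging Face `diffusers` library and
# the public "runwayml/stable-diffusion-v1-5" checkpoint, neither of which is
# prescribed by the surveyed paper.
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion pipeline
# (CLIP text encoder + denoising U-Net + VAE decoder).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The caption conditions the U-Net through cross-attention; guidance_scale
# sets the strength of classifier-free guidance (higher = closer adherence
# to the text, at some cost in sample diversity).
result = pipe(
    "a small bird with red wings perched on a branch",
    num_inference_steps=50,
    guidance_scale=7.5,
)
result.images[0].save("bird.png")
```

On the evaluation side, the two metrics most commonly reported in this literature are the Inception Score (IS) and the Fréchet Inception Distance (FID). Their standard definitions, using an Inception-v3 network to compute the class posterior p(y|x) and feature statistics (mean, covariance) for real (r) and generated (g) images, are:

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \Big),
\qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big).
```

Higher IS and lower FID indicate better sample quality; FID additionally penalizes distributional mismatch between generated and real images.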


Data availability

No data was used for the research described in the article.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant numbers 52274160, 52074305, and 51874300) and the National Natural Science Foundation-Shanxi Provincial People's Government Coal-based Low-carbon Joint Fund (grant number U1510115).

Author information


Corresponding author

Correspondence to Wei Chen.

Ethics declarations

Competing interests

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, W., Yang, Y., Tian, Z. et al. A review of multimodal learning for text to images. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19117-8


