
A review of multimodal learning for text to images

Multimedia Tools and Applications

Abstract

Information exists in many forms in the real world, and the effective interaction and fusion of multimodal information plays a key role in computer vision and deep learning research. Generating an image that matches a given text description is a multimodal task that demands both a strong generative model and cross-modal understanding. This paper provides a comprehensive analysis of recent advances in text-to-image generation, together with a taxonomy based on model architecture and characteristics. We classify text-to-image methods by their underlying framework: generative adversarial networks (GANs), transformers, and diffusion models. For each category, we describe the network structure with its advantages and disadvantages, present the benchmark datasets and corresponding evaluation metrics, and summarize application progress and experimental results. Finally, we discuss current research challenges and possible future research directions and applications.
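To make the diffusion branch of this taxonomy concrete, the sketch below generates an image from a caption with a pretrained latent diffusion model. This is an illustrative example only, not code from the surveyed paper; it assumes the open-source Hugging Face diffusers library and the publicly released Stable Diffusion v1.5 checkpoint.

```python
# Illustrative sketch only: assumes the Hugging Face `diffusers` library and
# the public "runwayml/stable-diffusion-v1-5" checkpoint, neither of which is
# prescribed by the surveyed paper.
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion pipeline
# (CLIP text encoder + denoising U-Net + VAE decoder).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The caption conditions the U-Net through cross-attention; guidance_scale
# sets the strength of classifier-free guidance (higher = closer adherence
# to the text, at some cost in sample diversity).
result = pipe(
    "a small bird with red wings perched on a branch",
    num_inference_steps=50,
    guidance_scale=7.5,
)
result.images[0].save("bird.png")
```

On the evaluation side, the two metrics most commonly reported in this literature are the Inception Score (IS) and the Fréchet Inception Distance (FID). Their standard definitions, using an Inception-v3 network to compute the class posterior p(y|x) and feature statistics (mean, covariance) for real (r) and generated (g) images, are:

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \Big),
\qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big).
```

Higher IS and lower FID indicate better sample quality; FID additionally penalizes distributional mismatch between generated and real images.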


Data availability

No data was used for the research described in the article.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant numbers 52274160, 52074305, and 51874300) and the National Natural Science Foundation-Shanxi Provincial People's Government Coal-based Low-carbon Joint Fund (grant number U1510115).

Author information


Corresponding author

Correspondence to Wei Chen.

Ethics declarations

Competing interests

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. We confirm that this work is original and has not been published elsewhere, nor is it currently under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, W., Yang, Y., Tian, Z. et al. A review of multimodal learning for text to images. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19117-8


