
Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook

Published in: Artificial Intelligence Review

Abstract

Image synthesis is the process of converting an input source, such as text, a sketch, or another image or mask, into an image. It is an important problem in computer vision that has attracted considerable research effort toward generating photorealistic images, and a variety of techniques and strategies have been employed to this end. The aim of this paper is therefore to provide a comprehensive review of image synthesis models covering several aspects. First, the concept of image synthesis is introduced. We then review image synthesis methods in three categories: image generation from text, from sketches, and from other inputs. Each sub-category is placed under the appropriate category according to its general framework, providing a broad view of existing image synthesis methods. Next, the benchmark datasets used in image synthesis are briefly described, along with the models that leverage them. Regarding evaluation, we summarize the metrics used to assess image synthesis models and provide a detailed, metric-based analysis of the reported results. Finally, we discuss existing challenges and suggest possible directions for future research.
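As a concrete illustration of one widely used evaluation metric for image synthesis, the Fréchet Inception Distance (FID) compares Gaussian statistics (mean and covariance) of feature activations computed from real and generated images. Below is a minimal NumPy sketch of the underlying Fréchet distance between two Gaussians; the function names are our own, and a real FID pipeline would first extract Inception-v3 features from the two image sets to obtain these statistics:

```python
import numpy as np

def _psd_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Fréchet distance between N(mu1, cov1) and N(mu2, cov2).

    d^2 = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2)).
    Uses the identity Tr((cov1 cov2)^(1/2)) = Tr((s2 cov1 s2)^(1/2))
    with s2 = cov2^(1/2), so only symmetric square roots are needed.
    """
    s2 = _psd_sqrt(cov2)
    covmean = _psd_sqrt(s2 @ cov1 @ s2)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```

Identical statistics give a distance of zero; lower values indicate that the generated-image feature distribution is closer to the real one.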



Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The first author would like to thank Umm Al-Qura University, Saudi Arabia, for its continuous support. This work has been supported in part by the University of Dayton Office for Graduate Academic Affairs through the Graduate Student Summer Fellowship Program, and in part by the National Science Foundation under Grant NSF 2025234.

Author information


Corresponding author

Correspondence to Samah Saeed Baraheem.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Baraheem, S.S., Le, T.N. & Nguyen, T.V. Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook. Artif Intell Rev 56, 10813–10865 (2023). https://doi.org/10.1007/s10462-023-10434-2
