
Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook

Published in: Artificial Intelligence Review

Abstract

Image synthesis is the process of converting an input source, such as text, a sketch, or another image or mask, into an image. It is an important problem in computer vision that has attracted considerable research effort toward generating photorealistic images, and a variety of techniques and strategies have been employed to this end. The aim of this paper is therefore to provide a comprehensive review of image synthesis models covering several aspects. First, the concept of image synthesis is introduced. We then review image synthesis methods in three categories: image generation from text, from sketches, and from other inputs. Each sub-category is placed under the appropriate category according to its general framework, providing a broad view of existing image synthesis methods. Next, the benchmark datasets used in image synthesis are briefly described, along with the models that leverage them. Regarding evaluation, we summarize the metrics used to assess image synthesis models and provide a detailed, metric-based analysis of the reported results. Finally, we discuss existing challenges and suggest possible directions for future research.
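As a concrete illustration of one widely used evaluation metric for image synthesis, the Fréchet Inception Distance (FID) compares Gaussian statistics (mean and covariance) of feature activations computed from real and generated images. Below is a minimal NumPy sketch of the underlying Fréchet distance between two Gaussians; the function names are our own, and a real FID pipeline would first extract Inception-v3 features from the two image sets to obtain these statistics:

```python
import numpy as np

def _psd_sqrt(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Fréchet distance between N(mu1, cov1) and N(mu2, cov2).

    d^2 = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2)).
    Uses the identity Tr((cov1 cov2)^(1/2)) = Tr((s2 cov1 s2)^(1/2))
    with s2 = cov2^(1/2), so only symmetric square roots are needed.
    """
    s2 = _psd_sqrt(cov2)
    covmean = _psd_sqrt(s2 @ cov1 @ s2)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```

Identical statistics give a distance of zero; lower values indicate that the generated-image feature distribution is closer to the real one.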



Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The first author would like to thank Umm Al-Qura University, Saudi Arabia, for its continuous support. This work has been supported in part by the University of Dayton Office for Graduate Academic Affairs through the Graduate Student Summer Fellowship Program, and in part by the National Science Foundation under Grant NSF 2025234.

Author information


Corresponding author

Correspondence to Samah Saeed Baraheem.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Baraheem, S.S., Le, T.N. & Nguyen, T.V. Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook. Artif Intell Rev 56, 10813–10865 (2023). https://doi.org/10.1007/s10462-023-10434-2
