
Towards photorealistic face generation using text-guided Semantic-Spatial FaceGAN

Multimedia Tools and Applications

Abstract

In this paper, we propose a simple yet effective Text-To-Face (T2F) generative adversarial network, Semantic-Spatial FaceGAN (SS-FaceGAN), which generates facial images from natural language descriptions. Natural language is inherently abstract, whereas images are concrete. This discrepancy makes accurate generation difficult, especially when multiple descriptions are used as input. SS-FaceGAN addresses this issue and is capable of generating precise features from multiple descriptions. It incorporates a novel Focus Spatial (FS) module that predicts masks based on text semantics to refine the image feature mapping. It also introduces an attention mechanism, the Word Attention Reuse (WAR) module, which leverages the potential distribution of each word in the description to compute word-level attention. Our experiments demonstrate the effectiveness of this approach.
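
The two ideas named in the abstract, a text-predicted spatial mask and word-level attention, can be illustrated with a small NumPy sketch. This is not the authors' FS or WAR implementation, whose details are not given in the abstract. It assumes a generic scaled dot-product attention between image locations and word embeddings, and a sigmoid mask that gates a sentence-conditioned scale and shift. All function names, weight matrices, and tensor sizes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_attention(feat, words):
    """Generic word-level attention (an assumption, not the paper's WAR module).

    feat:  (C, H, W) image feature map
    words: (T, C) word embeddings projected to C channels
    Returns attention weights (H*W, T) and a word-context map (C, H, W).
    """
    C, H, W = feat.shape
    queries = feat.reshape(C, H * W).T            # one query per location
    scores = queries @ words.T / np.sqrt(C)       # (H*W, T)
    attn = softmax(scores, axis=1)                # each location attends over words
    context = (attn @ words).T.reshape(C, H, W)   # weighted sum of word vectors
    return attn, context

def masked_text_modulation(feat, sent, w_mask, w_gamma, w_beta):
    """Mask-gated, text-conditioned modulation (an assumption, not the paper's FS module).

    feat: (C, H, W); sent: (D,) sentence embedding
    w_mask, w_gamma, w_beta: (C, D) hypothetical learned projections
    """
    C = feat.shape[0]
    text_query = w_mask @ sent                                # (C,)
    logits = np.tensordot(text_query, feat, axes=1) / np.sqrt(C)  # (H, W)
    mask = 1.0 / (1.0 + np.exp(-logits))                      # values in (0, 1)
    gamma = (w_gamma @ sent)[:, None, None]                   # per-channel scale
    beta = (w_beta @ sent)[:, None, None]                     # per-channel shift
    out = feat * (1.0 + mask[None] * gamma) + mask[None] * beta
    return out, mask

# Toy inputs with made-up sizes: 8 channels, 4x4 map, 5 words, 16-dim sentence.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
words = rng.standard_normal((5, 8))
sent = rng.standard_normal(16)
w_mask, w_gamma, w_beta = (rng.standard_normal((8, 16)) * 0.1 for _ in range(3))

attn, context = word_attention(feat, words)
out, mask = masked_text_modulation(feat, sent, w_mask, w_gamma, w_beta)
```

In this sketch the mask restricts where the text changes the features: locations with a mask value near zero pass through almost unchanged, while locations near one receive the full text-conditioned scale and shift.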

Data Availability

All data generated or analysed during this study are included in this article.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant 62176062.

Author information

Contributions

Qi Guo: Conceptualization of this study, Methodology, Software, Writing original draft. Xiaodong Gu: Supervision, Conceptualization and methodology, Writing original draft, Project administration.

Corresponding author

Correspondence to Xiaodong Gu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, Q., Gu, X. Towards photorealistic face generation using text-guided Semantic-Spatial FaceGAN. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19320-7

  • DOI: https://doi.org/10.1007/s11042-024-19320-7
