
Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

Applied Intelligence

Abstract

Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in audio-visual learning. AVSS transforms one speaker’s speech into another speaker’s audio-visual stream while retaining the linguistic content. The approach considered here extends existing AVSS methods: vocal features are first converted from the source to the target speaker, akin to voice conversion (VC), and the target speaker’s audio-visual stream is then synthesized, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based autoencoders (AEs), trained with an ensemble of cycle-consistency and reconstruction loss functions to enhance synthesis quality. Leveraging the ViT attention mechanism, the method effectively captures spectral and temporal features of the input speech, while the combined cycle-consistency and reconstruction losses improve synthesis quality and help preserve essential information. The proposed framework is trained and evaluated on benchmark datasets and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate that the proposed approach outperforms existing SOTA models in quality and intelligibility for AVSS, indicating its potential for real-world applications.
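
To make the loss ensemble described above concrete, the following is a minimal PyTorch sketch of combining a reconstruction term with a cycle-consistency term for two speaker-specific decoders sharing one encoder. It is an illustrative assumption only, not the authors' implementation: the encoder/decoder interfaces, the L1 distance, and the weighting factors are hypothetical (the actual code is in the linked ViTAE-AVSS repository).

```python
import torch
import torch.nn as nn

def ensemble_loss(enc, dec_src, dec_tgt, x_src, x_tgt,
                  lambda_rec=1.0, lambda_cyc=10.0):
    """Illustrative reconstruction + cycle-consistency loss.

    enc      : shared encoder (e.g. a ViT-based feature extractor)
    dec_src  : decoder reconstructing source-speaker features
    dec_tgt  : decoder reconstructing target-speaker features
    x_src/x_tgt : batches of source/target spectral features
    lambda_* : hypothetical weighting factors for the two terms
    """
    l1 = nn.L1Loss()

    # Reconstruction: each autoencoder should reproduce its own speaker's input.
    rec = l1(dec_src(enc(x_src)), x_src) + l1(dec_tgt(enc(x_tgt)), x_tgt)

    # Cycle consistency: convert source -> target -> back to source (and vice
    # versa); the round trip should recover the original features.
    x_fake_tgt = dec_tgt(enc(x_src))
    x_cyc_src = dec_src(enc(x_fake_tgt))
    x_fake_src = dec_src(enc(x_tgt))
    x_cyc_tgt = dec_tgt(enc(x_fake_src))
    cyc = l1(x_cyc_src, x_src) + l1(x_cyc_tgt, x_tgt)

    # Weighted ensemble of the two loss terms.
    return lambda_rec * rec + lambda_cyc * cyc
```

In this sketch, minimizing the weighted sum encourages the model both to preserve each speaker's content (reconstruction) and to keep conversions invertible (cycle consistency), which is the intuition behind the ensemble of losses named in the title.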


Data availability and access

The source code of the proposed ViTAE-AVSS is publicly available at https://github.com/Subhayu-ghosh/ViTAE-AVSS. This work uses the VoxCeleb2 and LRS3-TED datasets, available at https://rb.gy/o7xs74 and https://rb.gy/5shq0j, respectively.


Author information

Authors and Affiliations

Authors

Contributions

Subhayu Ghosh and Nanda Dulal Jana contributed to the study conception and design. The coding, implementation, and analysis were performed by Subhayu Ghosh, Snehashis Sarkar, and Sovan Ghosh. The first draft of the manuscript was written by Subhayu Ghosh. Frank Zalkow participated in the revision of the paper and provided many pertinent suggestions. Nanda Dulal Jana guided the research throughout and supervised every aspect of this work.

Corresponding author

Correspondence to Subhayu Ghosh.

Ethics declarations

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ghosh, S., Sarkar, S., Ghosh, S. et al. Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05380-7

