
Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

Applied Intelligence

Abstract

Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in audio-visual learning. AVSS transforms one speaker’s speech into another speaker’s audio-visual stream while retaining the linguistic content. The approach considered here extends existing AVSS methods: vocal features are first converted from the source to the target speaker, akin to voice conversion (VC), and the target speaker’s audio-visual stream is then synthesized, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based autoencoders (AEs), trained with an ensemble of cycle-consistency and reconstruction loss functions to enhance synthesis quality. Leveraging the ViT attention mechanism, the method effectively captures spectral and temporal features of the input speech, while the combined cycle-consistency and reconstruction losses improve synthesis quality and help preserve essential information. The proposed framework is trained and evaluated on benchmark datasets and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate that the proposed approach outperforms existing SOTA models in quality and intelligibility for AVSS, indicating its potential for real-world applications.
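
To make the loss ensemble described above concrete, the following is a minimal PyTorch sketch of combining a reconstruction term with a cycle-consistency term for two speaker-specific decoders sharing one encoder. It is an illustrative assumption only, not the authors' implementation: the encoder/decoder interfaces, the L1 distance, and the weighting factors are hypothetical (the actual code is in the linked ViTAE-AVSS repository).

```python
import torch
import torch.nn as nn

def ensemble_loss(enc, dec_src, dec_tgt, x_src, x_tgt,
                  lambda_rec=1.0, lambda_cyc=10.0):
    """Illustrative reconstruction + cycle-consistency loss.

    enc      : shared encoder (e.g. a ViT-based feature extractor)
    dec_src  : decoder reconstructing source-speaker features
    dec_tgt  : decoder reconstructing target-speaker features
    x_src/x_tgt : batches of source/target spectral features
    lambda_* : hypothetical weighting factors for the two terms
    """
    l1 = nn.L1Loss()

    # Reconstruction: each autoencoder should reproduce its own speaker's input.
    rec = l1(dec_src(enc(x_src)), x_src) + l1(dec_tgt(enc(x_tgt)), x_tgt)

    # Cycle consistency: convert source -> target -> back to source (and vice
    # versa); the round trip should recover the original features.
    x_fake_tgt = dec_tgt(enc(x_src))
    x_cyc_src = dec_src(enc(x_fake_tgt))
    x_fake_src = dec_src(enc(x_tgt))
    x_cyc_tgt = dec_tgt(enc(x_fake_src))
    cyc = l1(x_cyc_src, x_src) + l1(x_cyc_tgt, x_tgt)

    # Weighted ensemble of the two loss terms.
    return lambda_rec * rec + lambda_cyc * cyc
```

In this sketch, minimizing the weighted sum encourages the model both to preserve each speaker's content (reconstruction) and to keep conversions invertible (cycle consistency), which is the intuition behind the ensemble of losses named in the title.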


Data availability and access

The source code of the proposed ViTAE-AVSS is publicly available at https://github.com/Subhayu-ghosh/ViTAE-AVSS. This work uses the VoxCeleb2 and LRS3-TED datasets, available at https://rb.gy/o7xs74 and https://rb.gy/5shq0j, respectively.


Author information

Authors and Affiliations

Authors

Contributions

Subhayu Ghosh and Nanda Dulal Jana contributed to the study conception and design. The coding, implementation, and analysis were performed by Subhayu Ghosh, Snehashis Sarkar, and Sovan Ghosh. The first draft of the manuscript was written by Subhayu Ghosh. Frank Zalkow participated in the revision of the paper and provided many pertinent suggestions. Nanda Dulal Jana guided the research throughout and supervised every aspect of this work.

Corresponding author

Correspondence to Subhayu Ghosh.

Ethics declarations

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ghosh, S., Sarkar, S., Ghosh, S. et al. Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05380-7

