Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis

Sharma, Ankit; Kumar, Puneet; Maddukuri, Vikas; Madamshetti, Nagasai; Kishore, K. G.; Kavuru, Sahit Sai Sriram; Raman, Balasubramanian; Roy, Partha Pratim

doi:10.1007/s11042-020-09321-7

Published: 15 August 2020

Volume 79, pages 30205–30233, (2020)

Multimedia Tools and Applications

Ankit Sharma¹,
Puneet Kumar¹,
Vikas Maddukuri ORCID: orcid.org/0000-0001-5130-8249²,
Nagasai Madamshetti²,
K. G. Kishore²,
Sahit Sai Sriram Kavuru²,
Balasubramanian Raman¹ &
…
Partha Pratim Roy¹


Abstract

The performance of text-to-speech (TTS) systems depends heavily on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time this phase takes is known as the synthesis delay. This paper proposes an approach to reduce the speech synthesis delay, with the aim of making TTS systems better suited to real-time applications such as digital assistants, mobile phones and embedded devices. The proposed approach uses the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim Algorithm (GLA) as the vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but FGLA converges faster than GLA. The approach is tested on the LJSpeech, Blizzard and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay has been observed. The quality of the output speech has also improved, as supported by higher Mean Opinion Scores (MOS) and faster convergence with FGLA than with GLA.
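To make the difference between the two vocoders concrete, the following is a minimal NumPy sketch of the iteration as it is commonly described in the FGLA literature. It is not the authors' implementation. The helper names (`stft`, `istft`, `reconstruct`, `spectral_error`), the window and hop sizes, the two-tone test signal and the momentum value `alpha = 0.99` are all illustrative assumptions, not values taken from this article. Both methods alternate between enforcing the target magnitude and projecting onto the set of spectrograms that correspond to a real signal; FGLA adds a momentum term to each update, and setting `alpha = 0` gives plain GLA.

```python
import numpy as np

N_FFT, HOP = 256, 64          # assumed analysis parameters (illustrative)
WIN = np.hanning(N_FFT)

def stft(x):
    """Windowed short-time Fourier transform, one row per frame."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(c, length):
    """Least-squares inverse STFT (weighted overlap-add)."""
    frames = np.fft.irfft(c, n=N_FFT, axis=1) * WIN
    x, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return x / np.maximum(norm, 1e-8)

def reconstruct(mag, length, n_iter=50, alpha=0.0, seed=0):
    """Estimate a waveform from a magnitude spectrogram (needs n_iter >= 1).

    alpha = 0 is plain GLA; alpha > 0 adds the FGLA momentum term.
    """
    rng = np.random.default_rng(seed)
    c = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    t_prev = None
    for _ in range(n_iter):
        fixed_mag = mag * np.exp(1j * np.angle(c))   # impose target magnitude
        t = stft(istft(fixed_mag, length))           # project onto consistent set
        c = t if t_prev is None else t + alpha * (t - t_prev)
        t_prev = t
    return istft(t_prev, length)

def spectral_error(x, mag):
    """Relative mismatch between the magnitude of STFT(x) and the target."""
    return np.linalg.norm(np.abs(stft(x)) - mag) / np.linalg.norm(mag)

# Hypothetical test signal: one second of two sinusoids at 8 kHz.
time_axis = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 220 * time_axis) + 0.5 * np.sin(2 * np.pi * 660 * time_axis)
target = np.abs(stft(signal))

for name, alpha in [("GLA", 0.0), ("FGLA", 0.99)]:
    estimate = reconstruct(target, len(signal), n_iter=50, alpha=alpha)
    print(name, spectral_error(estimate, target))
```

Running the loop prints the remaining spectral mismatch for each method after the same number of iterations. A faster-converging method reaches a given mismatch in fewer iterations, which is the mechanism behind a shorter synthesis delay; the exact numbers depend on the signal and on the assumed parameters above.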

Author information

Authors and Affiliations

Computer Science and Engineering Department, Indian Institute of Technology, Roorkee, 247667, India
Ankit Sharma, Puneet Kumar, Balasubramanian Raman & Partha Pratim Roy
Electronics and Communication Engineering Department, Indian Institute of Technology, Roorkee, 247667, India
Vikas Maddukuri, Nagasai Madamshetti, K. G. Kishore & Sahit Sai Sriram Kavuru

Corresponding author

Correspondence to Vikas Maddukuri.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, A., Kumar, P., Maddukuri, V. et al. Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis. Multimed Tools Appl 79, 30205–30233 (2020). https://doi.org/10.1007/s11042-020-09321-7

Received: 22 October 2019
Revised: 23 June 2020
Accepted: 09 July 2020
Published: 15 August 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11042-020-09321-7
