
Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis

Published in Multimedia Tools and Applications

Abstract

The performance of text-to-speech (TTS) systems depends heavily on spectrogram-to-waveform generation, also known as the speech reconstruction phase; the time this phase requires is called the synthesis delay. This paper proposes an approach to reduce synthesis delay, aiming to make TTS systems more suitable for real-time applications such as digital assistants, mobile phones, and embedded devices. The proposed approach uses the Fast Griffin-Lim algorithm (FGLA) instead of the Griffin-Lim algorithm (GLA) as the vocoder in the speech synthesis phase. Both algorithms are iterative, but FGLA converges faster than GLA. The approach is tested on the LJSpeech, Blizzard and Tatoeba datasets, and the FGLA results are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in synthesis delay is observed, and the quality of the output speech improves, as indicated by higher mean opinion scores (MOS) and faster convergence with FGLA than with GLA.
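To illustrate the difference between the two vocoders, the sketch below shows a minimal GLA/FGLA phase-reconstruction loop in Python. This is not the paper's implementation: numpy and librosa are assumed, and the function name and defaults (fast_griffin_lim, n_iter, momentum, n_fft, hop_length) are illustrative. FGLA augments each GLA projection with a momentum-style extrapolation; setting the momentum to zero recovers plain GLA.

```python
# Minimal sketch of GLA vs. FGLA waveform reconstruction from a magnitude
# spectrogram. Assumes numpy and librosa; the function name and parameter
# defaults are illustrative, not taken from the paper.
import numpy as np
import librosa

def fast_griffin_lim(magnitude, n_iter=32, momentum=0.99,
                     n_fft=1024, hop_length=256):
    """Estimate a waveform whose STFT magnitude matches `magnitude`.

    momentum = 0.0 reduces this to plain GLA; momentum near 1.0 gives
    the FGLA acceleration, which typically converges in fewer iterations.
    """
    # Start from a random phase estimate.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    t_prev = np.zeros_like(magnitude, dtype=np.complex64)

    for _ in range(n_iter):
        # One GLA projection: impose the target magnitude, go to the time
        # domain, and come back to obtain a consistent complex spectrogram.
        inverse = librosa.istft(magnitude * angles, hop_length=hop_length)
        rebuilt = librosa.stft(inverse, n_fft=n_fft, hop_length=hop_length)

        # FGLA step: extrapolate along the direction of change before
        # extracting the phase (momentum = 0 disables the acceleration).
        accel = rebuilt + momentum * (rebuilt - t_prev)
        t_prev = rebuilt
        angles = accel / np.maximum(1e-8, np.abs(accel))

    return librosa.istft(magnitude * angles, hop_length=hop_length)
```

For example, wav = fast_griffin_lim(np.abs(librosa.stft(y, n_fft=1024, hop_length=256))) reconstructs a waveform from the magnitude spectrogram of a signal y. Recent versions of librosa also provide a built-in librosa.griffinlim with a momentum argument implementing the same acceleration, so a hand-rolled loop like this is mainly useful for instrumenting per-iteration timing and convergence.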



Author information


Corresponding author

Correspondence to Vikas Maddukuri.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(PDF 593 KB)


About this article


Cite this article

Sharma, A., Kumar, P., Maddukuri, V. et al. Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis. Multimed Tools Appl 79, 30205–30233 (2020). https://doi.org/10.1007/s11042-020-09321-7

