Abstract
The performance of text-to-speech (TTS) systems depends heavily on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time this phase takes is known as synthesis delay. This paper proposes an approach to reduce speech synthesis delay, with the aim of improving TTS systems for real-time applications such as digital assistants, mobile phones, and embedded devices. The proposed approach uses the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim algorithm (GLA) as the vocoder in the speech synthesis phase. Both GLA and FGLA are iterative, but FGLA converges faster than GLA. The approach is tested on the LJSpeech, Blizzard, and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural generative adversarial network (GAN) based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay is observed. The quality of the output speech also improves, as supported by higher mean opinion scores (MOS) and faster convergence with FGLA than with GLA.
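To make the difference between the two vocoders concrete, the following is a minimal NumPy sketch of Griffin Lim style phase recovery with an optional extrapolation step of the kind usually associated with FGLA. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`stft`, `istft`, `fast_griffin_lim`), the Hann window, the frame size, the hop size, and the default `alpha=0.99` are all choices made here for the example. Setting `alpha=0.0` reduces the loop to plain GLA.

```python
import numpy as np


def stft(signal, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window (illustrative settings)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT: windowed overlap-add divided by summed window energy."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = spec.shape[0]
    length = (n_frames - 1) * hop + n_fft
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i]
        norm[start:start + n_fft] += window ** 2
    # Avoid dividing by (near) zero where the window contributes no energy.
    return signal / np.where(norm > 1e-10, norm, 1.0)


def fast_griffin_lim(mag, n_iter=50, alpha=0.99, n_fft=256, hop=64, seed=0):
    """Estimate a waveform from a magnitude spectrogram `mag`.

    alpha = 0.0 gives plain GLA; alpha > 0 adds the extrapolation step.
    Returns the waveform and the relative magnitude error of its STFT.
    """
    rng = np.random.default_rng(seed)
    # Assumption for this sketch: start from uniformly random phase.
    c = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    # Projection onto spectrograms that correspond to a real signal.
    t_prev = stft(istft(c, n_fft, hop), n_fft, hop)
    c = t_prev
    for _ in range(n_iter):
        # Impose the target magnitude while keeping the current phase.
        constrained = mag * np.exp(1j * np.angle(c))
        # Project back onto consistent spectrograms.
        t = stft(istft(constrained, n_fft, hop), n_fft, hop)
        # Extrapolate along the direction of the latest change.
        c = t + alpha * (t - t_prev)
        t_prev = t
    waveform = istft(t_prev, n_fft, hop)
    error = np.linalg.norm(np.abs(t_prev) - mag) / np.linalg.norm(mag)
    return waveform, float(error)


if __name__ == "__main__":
    # Synthetic test tone; 4096 samples fit the framing above exactly.
    x = np.sin(2 * np.pi * 440 * np.arange(4096) / 8000)
    target = np.abs(stft(x))
    for a in (0.0, 0.99):
        _, err = fast_griffin_lim(target, n_iter=30, alpha=a)
        print(f"alpha={a}: relative magnitude error {err:.4f}")
```

Running the script prints the relative magnitude error reached after the same number of iterations for both settings, which is one simple way to compare convergence on a given input. Synthesis delay itself would have to be measured separately, for example by timing the call on real spectrograms.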
Cite this article
Sharma, A., Kumar, P., Maddukuri, V. et al. Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis. Multimed Tools Appl 79, 30205–30233 (2020). https://doi.org/10.1007/s11042-020-09321-7
DOI: https://doi.org/10.1007/s11042-020-09321-7