Conventional and contemporary approaches used in text to speech synthesis: a review

Artificial Intelligence Review

Abstract

Speech synthesis, or text to speech (TTS), is the ability of a system to produce human-like, natural-sounding speech from written text, and it is gaining popularity in the field of speech processing. For any TTS system, intelligibility and naturalness are the two main measures of the quality of the synthesized sound, which depends heavily on the prosody modeling performed by the synthesizer's acoustic model. This review first studies and analyzes the approaches used for acoustic modeling, both traditional (articulatory synthesis, formant synthesis, concatenative speech synthesis, and statistical parametric techniques based on the hidden Markov model) and recent (statistical parametric techniques based on deep learning), together with their pros and cons. Deep-learning-based acoustic models have contributed significantly to the advancement of TTS because they can model the complex context dependencies in the input data. The article also reviews TTS approaches for generating speech with different voices and emotions, which make TTS more realistic to use, and it addresses the subjective and objective metrics used to measure the quality of the synthesized voice. Well-known speech synthesis systems based on autoregressive and non-autoregressive models, such as Tacotron, Deep Voice, WaveNet, Parallel WaveNet, Parallel Tacotron, and FastSpeech, from global tech giants such as Google, Facebook, and Microsoft, employ deep learning architectures for end-to-end speech waveform generation and have attained remarkable mean opinion scores (MOS).
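
As a rough illustration of the subjective metric named in the abstract, the sketch below computes a mean opinion score from listener ratings. It assumes the common 1 to 5 absolute category rating scale and a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the sample ratings are hypothetical and are not taken from the article.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for listener ratings on an assumed 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    # MOS is the arithmetic mean of the individual opinion scores.
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval: 1.96 * sample std / sqrt(n).
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one synthesized utterance.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, listening tests usually pool ratings over many utterances and listeners, so the interval here only indicates how uncertain a small sample is.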

Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Acknowledgements

This research work is supported by IK Gujral Punjab Technical University, Kapurthala, Punjab, India.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Corresponding author

Correspondence to Navdeep Kaur.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kaur, N., Singh, P. Conventional and contemporary approaches used in text to speech synthesis: a review. Artif Intell Rev 56, 5837–5880 (2023). https://doi.org/10.1007/s10462-022-10315-0

  • DOI: https://doi.org/10.1007/s10462-022-10315-0
