Review on Unit Selection-Based Concatenation Approach in Text to Speech Synthesis System

  • Conference paper
Cybernetics, Cognition and Machine Learning Applications

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Speech synthesis is the technique of artificially generating human speech. A text-to-speech (TTS) system converts ordinary (raw) text of any language for which it is designed into a speech signal. Today, people increasingly prefer to listen to content in their native language, so there is a need for a TTS system that automatically transforms raw text into speech that sounds like a native speaker of the language reading that text. So far, research has not succeeded in generating speech signals entirely from scratch; we have only been able to extract parameters from recorded speech and synthesize a signal from them. Hence, the term speech synthesis is more appropriate than speech generation. Synthesized speech can be produced by a variety of concatenation algorithms. The goal of this paper is to give a concise overview of the different engineering approaches to talking machines that concatenate sequences of recorded units, which provides the flexibility for arbitrary vocabularies required in many applications, with the ultimate aim of converting written text into a speech signal.
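To make the unit-selection concatenation idea concrete, below is a minimal Python sketch in the style of standard unit selection: each target phone is matched against candidate units from a recorded database, and a Viterbi-style dynamic-programming search picks the sequence that minimises the sum of target costs and concatenation (join) costs. The Unit class, the single pitch feature, the cost functions, and the toy database values are illustrative assumptions for this sketch, not the specific formulation of any system reviewed in the paper.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Unit:
    phone: str       # phonetic label of the recorded database unit
    pitch_hz: float  # average F0 of the unit (single toy feature)


def target_cost(target_pitch: float, unit: Unit) -> float:
    # How closely a candidate unit matches the target specification.
    return abs(target_pitch - unit.pitch_hz) / 100.0


def concat_cost(prev: Unit, cur: Unit) -> float:
    # Mismatch at the join between consecutive units (pitch discontinuity proxy).
    return abs(prev.pitch_hz - cur.pitch_hz) / 100.0


def select_units(targets: List[Tuple[str, float]],
                 database: Dict[str, List[Unit]]) -> List[Unit]:
    # Viterbi-style dynamic programming over candidate units: minimise the
    # summed target cost plus concatenation cost along the whole utterance.
    candidates = [database[phone] for phone, _ in targets]
    cost = [target_cost(targets[0][1], u) for u in candidates[0]]
    back: List[List[int]] = []
    for i in range(1, len(targets)):
        new_cost, back_i = [], []
        for u in candidates[i]:
            prev = candidates[i - 1]
            best_j = min(range(len(prev)),
                         key=lambda j: cost[j] + concat_cost(prev[j], u))
            new_cost.append(cost[best_j] + concat_cost(prev[best_j], u)
                            + target_cost(targets[i][1], u))
            back_i.append(best_j)
        cost, back = new_cost, back + [back_i]
    # Trace back the cheapest path of unit indices.
    j = min(range(len(cost)), key=lambda k: cost[k])
    path = [j]
    for back_i in reversed(back):
        j = back_i[j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(targets))]


if __name__ == "__main__":
    # Toy database: two recorded candidates per phone (illustrative values only).
    db = {
        "k":  [Unit("k", 118.0), Unit("k", 132.0)],
        "ae": [Unit("ae", 120.0), Unit("ae", 140.0)],
        "t":  [Unit("t", 125.0), Unit("t", 150.0)],
    }
    spec = [("k", 120.0), ("ae", 122.0), ("t", 124.0)]  # target phones + desired F0
    for u in select_units(spec, db):
        print(u.phone, u.pitch_hz)

In practical unit-selection synthesizers the costs combine many weighted sub-costs (spectral, duration, energy, linguistic context) and the candidate lists are pruned before the search; the single pitch feature here only illustrates the structure of the optimisation.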



Author information

Correspondence to Priyanka Gujarathi.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Gujarathi, P., Patil, S.R. (2021). Review on Unit Selection-Based Concatenation Approach in Text to Speech Synthesis System. In: Gunjan, V.K., Suganthan, P.N., Haase, J., Kumar, A. (eds) Cybernetics, Cognition and Machine Learning Applications. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-33-6691-6_22
