Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

  • Suraj Pandurang Patil
  • Swapnil Laxman Lahudkar
Article

Abstract

Statistical parametric speech synthesis (SPSS) systems based on hidden Markov models (HMMs) and deep neural networks have gained significant attention from researchers because of their flexibility in generating speech waveforms in diverse voice qualities and styles. This paper describes an HMM-based SPSS system for the Marathi language. In the proposed method, the speech parameter trajectories used for synthesis are generated from trained HMMs. We recorded a database of 5300 phonetically balanced Marathi sentences to train context-dependent HMMs with five, seven, and nine hidden states. The subjective quality measures (MOS and PWP) show that HMMs with seven hidden states yield better quality of synthesized speech than five-state HMMs, with lower time complexity than nine-state HMMs. The contextual features used for experimentation include the position of an observed phoneme in its syllable, word, and sentence.
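The paper compares HMMs with different numbers of hidden states by the quality of the speech they produce, which rests on evaluating how well a trained model explains an observation sequence. As an illustrative sketch only (not the authors' implementation, which uses Gaussian emissions over Mel-cepstral features via the HTS toolkit rather than the discrete emissions assumed here), the core likelihood computation is the scaled forward algorithm:

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.

    pi:  (N,)   initial state probabilities
    A:   (N, N) transition matrix, A[i, j] = P(next state j | state i)
    B:   (N, M) emission matrix,   B[i, k] = P(symbol k | state i)
    obs: sequence of observation symbol indices
    """
    # Initialize with the first observation, then scale to avoid underflow.
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    # Recursively propagate forward probabilities through the chain.
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik
```

With several candidate models trained on the same data (e.g. five-, seven-, and nine-state topologies), the model assigning the higher held-out log-likelihood fits better; the paper's five/seven/nine comparison additionally weighs subjective quality (MOS, PWP) against training and synthesis time.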

Keywords

Speech synthesis · Hidden Markov model · Context-dependent HMM · HMM Toolkit

Acknowledgements

The authors would like to thank Dr. K. Samudravijaya for useful discussions on HMM-based speech synthesis systems and his guidance in preparing and validating the database. The authors are also thankful to the members of the HTS working group for their software development efforts.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. JSPM's Rajarshi Shahu College of Engineering, Pune, India
  2. JSPM's Imperial College of Engineering and Research, Pune, India