
International Journal of Speech Technology, Volume 21, Issue 4, pp 1045–1055

HMM speech synthesis based on MDCT representation

  • Giorgio Biagetti
  • Paolo Crippa
  • Laura Falaschetti
  • Claudio Turchetti

Abstract

Hidden Markov model (HMM) based text-to-speech (TTS) has become one of the most promising approaches to speech synthesis, as it provides a particularly flexible and robust framework for generating synthetic speech. However, several factors, such as the mel-cepstral vocoder and over-smoothing, degrade the quality of the synthesized signal. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation to address these two issues. To this end, we use an analysis/synthesis technique based on the MDCT that guarantees perfect reconstruction of the signal frame from the feature vectors and allows a 50% overlap between frames without increasing the size of the data vector, in contrast to conventional mel-cepstral spectral parameters, which do not allow direct reconstruction of the speech waveform. Experimental results, evaluated with both objective and subjective tests, show that the proposed technique produces good-quality synthetic speech.
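
To illustrate the property the abstract relies on (N coefficients per 2N-sample frame despite 50% overlap, with perfect reconstruction via time-domain aliasing cancellation), the following is a minimal NumPy sketch of an MDCT analysis/synthesis round trip. It is not the authors' implementation: the frame size, the sine window, the direct matrix evaluation of the transform, and the random test signal are illustrative assumptions.

```python
import numpy as np

def mdct_basis(N):
    """Cosine basis shared by the MDCT/IMDCT pair (frame length 2N, N coefficients)."""
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def mdct(frame, window):
    """2N time-domain samples -> N MDCT coefficients (no data-rate increase)."""
    return (window * frame) @ mdct_basis(len(frame) // 2)

def imdct(coeffs, window):
    """N MDCT coefficients -> 2N windowed samples; each half is aliased,
    and the aliasing cancels when adjacent frames are overlap-added."""
    N = len(coeffs)
    return window * (2.0 / N) * (mdct_basis(N) @ coeffs)

def analysis_synthesis(x, N=256):
    """Round trip: frame x with 50% overlap (hop N, frame 2N), MDCT each
    frame, then IMDCT and overlap-and-add to reconstruct the signal."""
    # Sine window satisfying the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1
    window = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
    pad = (-len(x)) % N
    # Zero-pad both ends so every real sample is covered by two frames
    xp = np.concatenate([np.zeros(N), x, np.zeros(N + pad)])
    out = np.zeros_like(xp)
    for start in range(0, len(xp) - 2 * N + 1, N):
        frame = xp[start:start + 2 * N]
        out[start:start + 2 * N] += imdct(mdct(frame, window), window)
    return out[N:N + len(x)]  # strip the padding

# Round-trip check: the error is at numerical precision (perfect reconstruction)
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
y = analysis_synthesis(x, N=128)
print(np.max(np.abs(x - y)))  # ~1e-13
```

The 2/N factor in the inverse transform matches the unnormalized forward transform used here; other scaling conventions work equally well provided the analysis/synthesis round trip has unity gain.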

Keywords

Speech synthesis · HMM · MDCT · Overlap-and-add · Mel-cepstral analysis

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. DII – Dipartimento di Ingegneria dell’Informazione, Università Politecnica delle Marche, Ancona, Italy