International Journal of Speech Technology, Volume 13, Issue 3, pp 175–188

Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech

  • Alexandros Lazaridis
  • Todor Ganchev
  • Theodoros Kostoulas
  • Iosif Mporas
  • Nikos Fakotakis

Abstract

Accurate modeling of prosody is a prerequisite for producing high-quality synthetic speech. Phone duration, one of the key prosodic parameters, plays an important role in generating natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques and subsequently evaluate ten models based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, all of which have been used successfully in various modeling tasks over the past decades. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, to a large set of numerical and nominal linguistic features extracted from text (phonetic, phonological and morphosyntactic features) that have been reported to be successful in phone and syllable duration modeling. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database, which consists of five categories of emotional speech: anger, fear, joy, neutral and sadness. The experimental results demonstrated that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used for phone duration modeling. Specifically, in four out of the five categories of emotional speech, feature selection improved phone duration modeling compared to the case without feature selection. The phone duration model based on M5p trees achieved the best prediction accuracy in terms of root mean square error (RMSE) and mean absolute error (MAE).
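The two accuracy measures named in the abstract, RMSE and MAE, can be illustrated with a minimal Python sketch. The function names and the duration values below are hypothetical and chosen only for illustration; they are not taken from the article or its database.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and reference durations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, actual):
    """Mean absolute error between predicted and reference durations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute) / len(absolute)


# Hypothetical phone durations in milliseconds (illustrative values only).
actual_ms = [80.0, 120.0, 60.0, 100.0]
predicted_ms = [90.0, 110.0, 60.0, 120.0]

print(f"RMSE: {rmse(predicted_ms, actual_ms):.2f} ms")
print(f"MAE:  {mae(predicted_ms, actual_ms):.2f} ms")
```

RMSE squares each error before averaging, so it penalizes large individual mispredictions more heavily than MAE does. This is one reason the two measures are often reported together.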

Keywords

Phone duration modeling; Statistical modeling; Feature selection; Emotional speech; Text-to-speech synthesis


Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Alexandros Lazaridis (1)
  • Todor Ganchev (1)
  • Theodoros Kostoulas (1)
  • Iosif Mporas (1)
  • Nikos Fakotakis (1)
  1. Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, Greece
