International Journal of Speech Technology, Volume 13, Issue 3, pp 175–188

Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech

  • Alexandros Lazaridis
  • Todor Ganchev
  • Theodoros Kostoulas
  • Iosif Mporas
  • Nikos Fakotakis
Article

DOI: 10.1007/s10772-010-9077-x

Cite this article as:
Lazaridis, A., Ganchev, T., Kostoulas, T. et al. Int J Speech Technol (2010) 13: 175. doi:10.1007/s10772-010-9077-x

Abstract

Accurate modeling of prosody is a prerequisite for the production of high-quality synthetic speech. Phone duration, as one of the key prosodic parameters, plays an important role in generating natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques and subsequently evaluate ten models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, which over the past decades have been successfully used in various modeling tasks. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, to a large set of numerical and nominal linguistic features extracted from text (phonetic, phonological and morphosyntactic), which have been reported as successful in phone and syllable duration modeling. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database, which consists of five categories of emotional speech: anger, fear, joy, neutral and sadness. The experimental results demonstrated that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used for phone duration modeling. Specifically, in four out of the five categories of emotional speech, feature selection improved phone duration modeling compared to the case without feature selection. The M5p-tree-based phone duration model achieved the best phone duration prediction accuracy in terms of root mean square error (RMSE) and mean absolute error (MAE).
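As a brief illustration of the two accuracy measures named in the abstract, the following minimal sketch computes RMSE and MAE between reference and predicted phone durations. The function names and the duration values (in milliseconds) are hypothetical and chosen only for illustration; they are not taken from the article or its database.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical phone durations in milliseconds (illustrative values only).
actual_ms = [80.0, 120.0, 60.0, 100.0]
predicted_ms = [90.0, 110.0, 60.0, 120.0]

print("RMSE (ms):", rmse(actual_ms, predicted_ms))
print("MAE (ms):", mae(actual_ms, predicted_ms))
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason the two measures are often reported together.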

Keywords

  • Phone duration modeling
  • Statistical modeling
  • Feature selection
  • Emotional speech
  • Text-to-speech synthesis

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Alexandros Lazaridis
    • 1
  • Todor Ganchev
    • 1
  • Theodoros Kostoulas
    • 1
  • Iosif Mporas
    • 1
  • Nikos Fakotakis
    • 1
  1. Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, Greece