Time-Scale Feature Extractions for Emotional Speech Characterization Applied to Human Centered Interaction Analysis

Abstract

Emotional speech characterization is an important issue for understanding interaction. This article discusses the time-scale analysis problem in feature extraction for emotional speech processing. We describe a computational framework for combining segmental and supra-segmental features for emotional speech detection. The statistical fusion is based on the estimation of local a posteriori class probabilities, and the overall decision employs weighting factors directly related to the duration of the individual speech segments. This strategy is applied to a real-world application: detection of Italian motherese in authentic, longitudinal parent–infant interaction at home. The results suggest that short- and long-term information, represented respectively by the short-term spectrum and the prosodic parameters (fundamental frequency and energy), provides a robust and efficient time-scale analysis. A similar fusion methodology is also investigated using a phoneme-specific characterization process. This strategy is motivated by the fact that emotional states vary at the phoneme level. A time scale based on both vowels and consonants is proposed, and it provides a relevant and discriminant feature space for acted emotion recognition. Experimental results on two databases, Berlin (German) and Aholab (Basque), show that the best performance is obtained by our phoneme-dependent approach. These findings demonstrate the relevance of taking phoneme dependency (vowels/consonants) into account for emotional speech characterization.
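
To make the fusion rule concrete, the sketch below illustrates a duration-weighted combination of local class posteriors from a segmental and a supra-segmental stream. It is a minimal Python illustration, not the authors' implementation: the function name duration_weighted_fusion, the mixing weight alpha, and the toy posterior values are hypothetical; only the general idea (weighting each segment's local posterior by its relative duration before the overall decision) follows the description above.

import numpy as np

def duration_weighted_fusion(post_segmental, post_suprasegmental, durations, alpha=0.5):
    """Combine per-segment class posteriors from two feature streams.

    post_segmental, post_suprasegmental : (n_segments, n_classes) arrays of
        local a posteriori class probabilities (e.g. from two classifiers).
    durations : (n_segments,) array of segment durations in seconds.
    alpha : illustrative mixing weight for the segmental stream (assumption).
    """
    post_segmental = np.asarray(post_segmental, dtype=float)
    post_suprasegmental = np.asarray(post_suprasegmental, dtype=float)
    durations = np.asarray(durations, dtype=float)

    # Linear combination of the two streams' local posteriors.
    local = alpha * post_segmental + (1.0 - alpha) * post_suprasegmental

    # Overall decision: each segment's posterior is weighted by its relative
    # duration, so longer segments contribute more to the utterance score.
    weights = durations / durations.sum()
    scores = (weights[:, None] * local).sum(axis=0)
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    # Toy example: 3 segments, 2 classes (e.g. motherese vs. other speech).
    seg = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
    sup = [[0.8, 0.2], [0.5, 0.5], [0.4, 0.6]]
    dur = [0.4, 1.2, 0.3]
    label, scores = duration_weighted_fusion(seg, sup, dur)
    print(label, scores)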

Author information

Corresponding author

Correspondence to Mohamed Chetouani.


Cite this article

Chetouani, M., Mahdhaoui, A. & Ringeval, F. Time-Scale Feature Extractions for Emotional Speech Characterization. Cogn Comput 1, 194–201 (2009). https://doi.org/10.1007/s12559-009-9016-9
