Abstract
This paper proposes a two-stage speech emotion recognition approach based on speaking rate. The emotions considered in this study are anger, disgust, fear, happy, neutral, sadness, sarcastic and surprise. In the first stage, the eight emotions are categorized by speaking rate into three broad groups, namely active (fast), normal and passive (slow). In the second stage, these three broad groups are further classified into individual emotions using vocal tract characteristics. Gaussian mixture models (GMMs) are used for developing the emotion models. Emotion classification performance at the broader level, based on speaking rate, is found to be around 99% for the speaker- and text-dependent cases. Overall emotion classification performance is observed to improve with the proposed two-stage approach. Along with spectral features, formant features are explored in the second stage to achieve robust emotion recognition in the speaker-, gender- and text-independent cases.
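The two-stage flow described above can be sketched in code. The following is a minimal illustration, assuming scikit-learn's GaussianMixture and hypothetical feature inputs (extract-your-own speaking-rate and spectral/formant feature matrices); the group-to-emotion assignments shown are placeholders for illustration, not the paper's published grouping.

```python
# Minimal sketch of the two-stage GMM classification scheme from the abstract.
# Feature extraction, the RATE_GROUPS assignments, and all function names are
# illustrative assumptions; the paper's stage-2 features are spectral and
# formant based ("vocal tract characteristics").
from sklearn.mixture import GaussianMixture

# Stage-1 groups by speaking rate; member emotions per group are assumed.
RATE_GROUPS = {
    "active":  ["anger", "fear", "happy", "surprise"],   # assumed grouping
    "normal":  ["neutral", "sarcastic"],                 # assumed grouping
    "passive": ["disgust", "sadness"],                   # assumed grouping
}

def train_stage(feature_sets, n_components=8):
    """Fit one diagonal-covariance GMM per class on its training features.

    feature_sets: dict mapping class label -> (n_frames, n_dims) array.
    """
    return {label: GaussianMixture(n_components=n_components,
                                   covariance_type="diag").fit(X)
            for label, X in feature_sets.items()}

def classify(models, X):
    """Pick the class whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda label: models[label].score(X))

def recognize(rate_feats, spectral_feats, group_models, emotion_models):
    # Stage 1: broad group (active/normal/passive) from speaking-rate features.
    group = classify(group_models, rate_feats)
    # Stage 2: individual emotion within that group, from spectral + formant
    # features, restricting the search to the group's member emotions.
    candidates = {e: emotion_models[e] for e in RATE_GROUPS[group]}
    return classify(candidates, spectral_feats)
```

Restricting stage 2 to the emotions within the predicted rate group is what lets the second-stage models discriminate among fewer, acoustically closer classes, which is the motivation for the hierarchy.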
Cite this article
Koolagudi, S.G., Krothapalli, R.S. Two stage emotion recognition based on speaking rate. Int J Speech Technol 14, 35–48 (2011). https://doi.org/10.1007/s10772-010-9085-x