Emotion recognition from speech using source, system, and prosodic features

Abstract

In this work, source, system, and prosodic features of speech are explored for characterizing and classifying the underlying emotions. Because these features are complementary in nature, each contributes to the expression of emotion in a different way. Linear prediction residual samples chosen around glottal closure regions, together with glottal pulse parameters, represent the excitation source information. Linear prediction cepstral coefficients extracted through simple block processing and through pitch-synchronous analysis represent the vocal tract (system) information. Global and local prosodic features, derived from the gross statistics and the temporal dynamics of the duration, pitch, and energy sequences, represent the prosodic information. Emotion recognition models are developed using the above features separately and in combination. The simulated Telugu emotion database IITKGP-SESC is used to evaluate the proposed features, and the recognition results are compared with those obtained on the internationally known Berlin emotional speech database (Emo-DB). Autoassociative neural networks, Gaussian mixture models, and support vector machines are used to build the recognition systems for the source, system, and prosodic features, respectively. A weighted combination of evidence is used to fuse the outputs of the systems developed with the different features. The results show that each of the proposed speech features contributes toward emotion recognition, and that combining the features improves recognition performance, confirming their complementary nature.
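Two of the steps summarized above lend themselves to a brief illustration: the global prosodic features computed as gross statistics of the pitch, energy, and duration sequences, and the weighted combination of evidence from the feature-specific recognizers. The sketch below is not the authors' implementation; the emotion labels, scores, weights, contour values, and function names (global_prosodic_features, weighted_fusion) are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch (not the paper's code) of global prosodic statistics and
# weighted score-level fusion of feature-specific recognizers.
# All numbers, weights, and labels are illustrative assumptions.
import numpy as np

def global_prosodic_features(f0, energy, durations):
    """Gross statistics of the pitch, energy, and duration sequences."""
    feats = []
    for contour in (np.asarray(f0), np.asarray(energy), np.asarray(durations)):
        feats.extend([contour.mean(), contour.std(), contour.min(), contour.max()])
    return np.array(feats)

def weighted_fusion(system_scores, weights):
    """Combine per-emotion evidence from several recognizers.

    system_scores: (n_systems, n_emotions) normalized scores, e.g. from
        the AANN (source), GMM (spectral), and SVM (prosodic) models.
    weights: n_systems non-negative weights summing to 1.
    """
    combined = np.asarray(weights) @ np.asarray(system_scores)
    return int(np.argmax(combined)), combined

# Hypothetical usage with made-up values
f0 = [210.0, 215.5, 220.1, 218.3]           # Hz, per voiced frame
energy = [0.52, 0.61, 0.58, 0.49]           # frame energies
durations = [0.09, 0.12, 0.08]              # syllable durations in seconds
print(global_prosodic_features(f0, energy, durations).shape)   # (12,)

emotions = ["anger", "fear", "happiness", "neutral", "sadness"]
scores = [
    [0.30, 0.10, 0.25, 0.20, 0.15],   # source-feature system
    [0.20, 0.15, 0.35, 0.15, 0.15],   # spectral-feature system
    [0.25, 0.10, 0.30, 0.20, 0.15],   # prosodic-feature system
]
label_idx, _ = weighted_fusion(scores, [0.3, 0.3, 0.4])
print(emotions[label_idx])            # -> "happiness" for these scores
```

In practice, the three score vectors would come from the AANN, GMM, and SVM models trained on the source, spectral, and prosodic features, with the fusion weights tuned on held-out data.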

Author information

Corresponding author

Correspondence to K. Sreenivasa Rao.

About this article

Cite this article

Koolagudi, S.G., Rao, K.S. Emotion recognition from speech using source, system, and prosodic features. Int J Speech Technol 15, 265–289 (2012). https://doi.org/10.1007/s10772-012-9139-3
