
Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features


Abstract

In this work, spectral features extracted from sub-syllabic regions and from pitch synchronous analysis are proposed for speech emotion recognition. Linear prediction cepstral coefficients, mel frequency cepstral coefficients and features extracted from high-amplitude regions of the spectrum are used to represent emotion-specific spectral information. These features are extracted from the consonant, vowel and transition regions of each syllable to study the contribution of these regions toward the recognition of emotions. The consonant, vowel and transition regions are determined using vowel onset points. Spectral features extracted from each pitch cycle are also used to recognize the emotions present in speech. The emotions used in this study are: anger, fear, happy, neutral and sad. The emotion recognition performance obtained using sub-syllabic speech segments is compared with the results of the conventional block processing approach, in which the entire speech signal is processed frame by frame. The proposed emotion-specific features are evaluated using a simulated emotion speech corpus, IITKGP-SESC (Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus). The emotion recognition results obtained on IITKGP-SESC are compared with results on the Berlin emotion speech corpus. Emotion recognition systems are developed using Gaussian mixture models and auto-associative neural networks. The purpose of this study is to explore sub-syllabic regions for identifying the emotions embedded in a speech signal and, if possible, to avoid processing the entire speech signal for emotion recognition without a serious compromise in performance.
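To make the baseline block-processing pipeline described above concrete, the sketch below extracts frame-wise MFCCs from an utterance and scores them against one Gaussian mixture model per emotion class. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the librosa and scikit-learn calls, the 25 ms/10 ms framing, the number of mixture components and the file-layout helpers are assumptions introduced here. Pitch-synchronous and sub-syllabic variants would replace the fixed framing with segment boundaries derived from epochs or vowel onset points.

# Minimal sketch (not the authors' code): frame-wise MFCC extraction and
# per-emotion GMM scoring, assuming librosa and scikit-learn are installed.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

EMOTIONS = ["anger", "fear", "happy", "neutral", "sad"]

def mfcc_features(wav_path, n_mfcc=13):
    """Return a (frames x n_mfcc) matrix of MFCCs for one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    # 25 ms window, 10 ms hop: a common block-processing setup (assumed here,
    # not taken from the paper).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T

def train_emotion_gmms(train_files, n_components=8):
    """Fit one GMM per emotion on pooled frame-level features.

    train_files: dict mapping emotion name -> list of wav paths
    (hypothetical data layout).
    """
    models = {}
    for emo in EMOTIONS:
        feats = np.vstack([mfcc_features(f) for f in train_files[emo]])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        models[emo] = gmm.fit(feats)
    return models

def classify(wav_path, models):
    """Label an utterance with the emotion whose GMM gives the highest
    average frame log-likelihood."""
    feats = mfcc_features(wav_path)
    scores = {emo: gmm.score(feats) for emo, gmm in models.items()}
    return max(scores, key=scores.get)

In use, one would call train_emotion_gmms on a dictionary of labelled training utterances and then classify each test utterance; replacing mfcc_features with LPCC or sub-syllabic region features changes only the front end, which is the comparison the paper is concerned with.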




Author information


Correspondence to Sreenivasa Rao Krothapalli.


About this article

Cite this article

Koolagudi, S.G., Krothapalli, S.R. Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features. Int J Speech Technol 15, 495–511 (2012). https://doi.org/10.1007/s10772-012-9150-8
