End-to-End Emotion Recognition From Speech With Deep Frame Embeddings And Neutral Speech Handling

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 70)

Abstract

In this paper we present a novel approach to improving machine-learning techniques for emotion recognition from speech. The core idea rests on the observation that not all parts of an utterance convey emotional information. We therefore propose to separate each utterance into emotional and neutral parts and clean up the database to make it less ambiguous. We then estimate embeddings of short speech intervals using a speaker-recognition convolutional neural network trained on the VoxCeleb2 dataset with the triplet loss. Sequences of these features are processed by a recurrent neural network to assign an emotion label to the utterance. This stage consists of two sub-stages: first, we train a model to recognize neutral frames within a given utterance; next, we separate the corpus into emotional and neutral parts and train an improved model. Our experiments on the IEMOCAP corpus show that the final model achieves 66% unweighted accuracy (UA) on four emotions, outperforming other known approaches such as out-of-the-box Connectionist Temporal Classification (CTC) and local attention by more than 4%.
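
To make the pipeline concrete, the sketch below shows how the utterance-level stage might be wired up in PyTorch: a GRU consumes a sequence of precomputed frame embeddings and emits one emotion label per utterance, and a stage-one neutrality score is used to filter frames before training. All module names, dimensions, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the two-stage idea from the abstract, assuming frame
# embeddings were already extracted by the speaker-recognition CNN.
# Dimensions, names, and the threshold are illustrative, not the paper's.
import torch
import torch.nn as nn

class UtteranceEmotionRNN(nn.Module):
    """GRU over a sequence of frame embeddings -> utterance-level logits."""
    def __init__(self, embed_dim=512, hidden_dim=128, num_classes=4):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):            # frames: (batch, time, embed_dim)
        _, h = self.rnn(frames)           # h: (1, batch, hidden_dim)
        return self.head(h.squeeze(0))    # (batch, num_classes)

def drop_neutral_frames(frames, neutral_prob, threshold=0.5):
    """Stage-two data cleaning for one utterance: keep only frames the
    stage-one model considers emotional (low probability of being neutral).

    frames:       (time, embed_dim) tensor of frame embeddings
    neutral_prob: (time,) tensor of stage-one neutrality scores
    """
    return frames[neutral_prob < threshold]

# Usage on random stand-in data:
model = UtteranceEmotionRNN()
utterance = torch.randn(300, 512)              # 300 frames of embeddings
scores = torch.rand(300)                       # stage-one neutral scores
emotional = drop_neutral_frames(utterance, scores)
logits = model(emotional.unsqueeze(0))         # add batch dimension
```

For the embedding network itself, a triplet-margin objective (e.g. torch.nn.TripletMarginLoss) could stand in for the triplet loss used with the speaker-recognition CNN.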

References

  1. Pitaloka, D.A., Wulandari, A., Basaruddin, T., Liliana, D.Y.: Enhancing CNN with preprocessing stage in automatic emotion recognition. Procedia Comput. Sci. 116, 523–529 (2017)

  2. Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52(7–8), 613–625 (2010)

  3. Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)

  4. Chernykh, V., Sterling, G., Prihodko, P.: Emotion recognition from speech with recurrent neural networks (2017)

  5. Chung, J.S., Nagrani, A., Zisserman, A.: VoxCeleb2: deep speaker recognition. In: INTERSPEECH (2018)

  6. Ghosh, S., Laksana, E., Morency, L.P., Scherer, S.: Learning representations of affect from speech, pp. 1–10 (2015)

  7. Graves, A., Fernández, S., Gomez, F.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the International Conference on Machine Learning, ICML 2006, pp. 369–376 (2006)

  8. Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32(2), 236–243 (1984)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015)

  10. Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Lecture Notes in Computer Science, vol. 9370, pp. 84–92 (2015)

  11. Lee, J., Tashev, I.: High-level feature representation using recurrent neural network for speech emotion recognition. In: INTERSPEECH, pp. 1537–1540 (2015)

  12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)

  13. Satt, A., Rozenberg, S., Hoory, R.: Efficient emotion recognition from speech using deep learning on spectrograms. In: INTERSPEECH, pp. 1089–1093 (2017)

  14. Schuller, B., Rigoll, G.: Timing levels in segment-based speech emotion recognition. In: Proceedings of INTERSPEECH 2006 - ICSLP, pp. 1818–1821 (2006)

  15. Zhang, C., Mirsamadi, S., Barsoum, E.: Automatic speech emotion recognition using recurrent neural networks with local attention. In: Proceedings of the 42nd IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017, pp. 2227–2231 (2017)

  16. Tripathi, S., Beigi, H.: Multi-modal emotion recognition on IEMOCAP dataset using deep learning

  17. Wang, Z.-Q., Tashev, I.: Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5150–5154 (2017)

  18. Xia, R., Liu, Y.: A multi-task learning framework for emotion recognition using 2D continuous space. IEEE Trans. Affect. Comput. 8(1), 3–14 (2017)

  19. Zhang, C., Koishida, K.: End-to-end text-independent speaker verification with triplet loss on short utterances. In: INTERSPEECH, pp. 1487–1491 (2017)

Acknowledgements

This work is part of the Emotion Recognition Project at Neurodata Lab.

Author information

Corresponding author

Correspondence to Eva Kazimirova.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Sterling, G., Kazimirova, E. (2020). End-to-End Emotion Recognition From Speech With Deep Frame Embeddings And Neutral Speech Handling. In: Arai, K., Bhatia, R. (eds) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol 70. Springer, Cham. https://doi.org/10.1007/978-3-030-12385-7_76
