
Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks

  • Conference paper
Advances in Computational Intelligence (MICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)


Abstract

Understanding the emotions a person expresses through speech is fundamental to better interaction between humans and machines. Many algorithms have been developed to address this problem, and they have been tested on a variety of datasets: some were recorded by actors under ideal recording conditions, while others were collected from people’s opinions on video streaming platforms. Deep learning has shown very positive results in recent years, and the model presented here follows this approach. We propose using Fourier transforms as the input to a convolutional neural network and Mel-frequency cepstral coefficients as the input to an LSTM neural network. Finally, we concatenate the outputs of both models to obtain a final classification over five emotions. The model is trained on the MOSEI dataset. We also perform data augmentation through time variations and pitch changes. Our model shows significant improvements over state-of-the-art algorithms.
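The abstract outlines a two-branch pipeline: a Fourier-transform spectrogram feeds a convolutional network, MFCCs feed an LSTM, the two branch outputs are concatenated, and a five-way emotion classifier sits on top, with time- and pitch-based augmentation applied to the raw audio. The paper’s layer sizes and hyperparameters are not given on this page, so the following is a minimal sketch of that design in Keras and librosa; every concrete value (augmentation factors, n_fft, n_mfcc, layer widths) is an illustrative assumption, not the authors’ configuration.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def augment(y, sr):
    """Time- and pitch-based augmentation (factors are assumptions)."""
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # time variation
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch change
    return [y, stretched, shifted]

def features(y, sr, n_fft=512, hop=256, n_mfcc=40):
    """Magnitude spectrogram for the CNN branch, MFCCs for the LSTM branch."""
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))  # (freq, time)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (time, n_mfcc)
    return spec[..., np.newaxis], mfcc                           # add channel dim

def build_model(n_freq=257, n_mfcc=40, n_classes=5):
    # CNN branch over the spectrogram (time dimension left variable)
    spec_in = layers.Input(shape=(n_freq, None, 1))
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(spec_in)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # LSTM branch over the MFCC sequence
    mfcc_in = layers.Input(shape=(None, n_mfcc))
    h = layers.LSTM(128)(mfcc_in)

    # Concatenate both branches and classify five emotions
    z = layers.concatenate([x, h])
    z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(n_classes, activation="softmax")(z)

    model = models.Model([spec_in, mfcc_in], out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

In practice each (spectrogram, MFCC) pair would be padded or cropped to a fixed length before calling model.fit; the variable-length time axes above just keep the sketch general.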



Author information


Corresponding author

Correspondence to Cristyan R. Gil Morales.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Gil Morales, C.R., Shinde, S. (2018). Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science (LNAI), vol. 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_18


  • DOI: https://doi.org/10.1007/978-3-030-04497-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04496-1

  • Online ISBN: 978-3-030-04497-8

  • eBook Packages: Computer Science, Computer Science (R0)
