
Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks

  • Conference paper
Advances in Computational Intelligence (MICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)


Abstract

Understanding the emotions a person expresses through speech is fundamental to better interaction between humans and machines. Many algorithms have been developed to address this problem, and they have been tested on a variety of datasets: some were recorded by actors under ideal recording conditions, while others were collected from people’s opinions on video streaming platforms. Deep learning has shown very positive results in recent years, and the model presented here follows this approach. We propose using Fourier transforms as the input to a convolutional neural network and Mel-frequency cepstral coefficients as the input to an LSTM neural network. Finally, we concatenate the outputs of both models to obtain a final classification over five emotions. The model is trained on the MOSEI dataset. We also perform data augmentation through time variations and pitch changes. Our model shows significant improvements over state-of-the-art algorithms.
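The abstract outlines a two-branch pipeline: a Fourier-transform spectrogram feeds a convolutional network, MFCCs feed an LSTM, the two branch outputs are concatenated, and a five-way emotion classifier sits on top, with time- and pitch-based augmentation applied to the raw audio. The paper’s layer sizes and hyperparameters are not given on this page, so the following is a minimal sketch of that design in Keras and librosa; every concrete value (augmentation factors, n_fft, n_mfcc, layer widths) is an illustrative assumption, not the authors’ configuration.

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def augment(y, sr):
    """Time- and pitch-based augmentation (factors are assumptions)."""
    stretched = librosa.effects.time_stretch(y, rate=1.1)       # time variation
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch change
    return [y, stretched, shifted]

def features(y, sr, n_fft=512, hop=256, n_mfcc=40):
    """Magnitude spectrogram for the CNN branch, MFCCs for the LSTM branch."""
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))  # (freq, time)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (time, n_mfcc)
    return spec[..., np.newaxis], mfcc                           # add channel dim

def build_model(n_freq=257, n_mfcc=40, n_classes=5):
    # CNN branch over the spectrogram (time dimension left variable)
    spec_in = layers.Input(shape=(n_freq, None, 1))
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(spec_in)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)

    # LSTM branch over the MFCC sequence
    mfcc_in = layers.Input(shape=(None, n_mfcc))
    h = layers.LSTM(128)(mfcc_in)

    # Concatenate both branches and classify five emotions
    z = layers.concatenate([x, h])
    z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(n_classes, activation="softmax")(z)

    model = models.Model([spec_in, mfcc_in], out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

In practice each (spectrogram, MFCC) pair would be padded or cropped to a fixed length before calling model.fit; the variable-length time axes above just keep the sketch general.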



Author information


Corresponding author

Correspondence to Cristyan R. Gil Morales.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Gil Morales, C.R., Shinde, S. (2018). Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science (LNAI), vol. 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_18


  • DOI: https://doi.org/10.1007/978-3-030-04497-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04496-1

  • Online ISBN: 978-3-030-04497-8

  • eBook Packages: Computer Science, Computer Science (R0)
