
Multimodal Emotion Analysis Based on Acoustic and Linguistic Features of the Voice

Part of the Lecture Notes in Computer Science book series (LNISA, volume 12774)

Abstract

Automatic speech analysis can be used to detect non-verbal communication cues and reveal the current emotional state of a person. The inability to appropriately recognize emotions can inevitably lessen the quality of social interaction. A better understanding of speech can be achieved by analyzing additional characteristics such as tone, pitch, rate, intensity, and meaning. In a multimodal approach, individual sensing modalities can be used to alter the behavior of the system and provide adaptation to inconsistencies of the real world. A change detected by a single modality can generate different system behavior at the global level.
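The abstract does not specify how characteristics such as intensity or rate are computed; a minimal sketch of extracting simple acoustic descriptors from a mono signal might look as follows (RMS energy as an intensity proxy and zero-crossing rate as a rough pitch/noisiness proxy are common choices, not the paper's stated features):

```python
import numpy as np

def acoustic_features(signal: np.ndarray, sr: int) -> dict:
    """Compute simple acoustic descriptors of a mono speech signal.

    RMS energy approximates intensity; the zero-crossing rate serves
    as a rough proxy for pitch/noisiness content.
    """
    rms = float(np.sqrt(np.mean(signal ** 2)))           # intensity proxy
    crossings = np.sum(np.abs(np.diff(np.sign(signal)))) / 2
    zcr = float(crossings / len(signal))                 # crossings per sample
    duration = len(signal) / sr                          # length in seconds
    return {"rms": rms, "zcr": zcr, "duration": duration}

# Example: 1 s of a 220 Hz sine tone at 16 kHz sampling rate
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
feats = acoustic_features(0.5 * np.sin(2 * np.pi * 220 * t), sr)
```

In practice such hand-crafted descriptors would be replaced or supplemented by learned features, as the neural-network approach in the paper suggests.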

In this paper, we present a method for emotion recognition based on acoustic and linguistic features of speech. The presented voice modality is part of a larger multimodal computational architecture implemented on a real affective robot as a control mechanism for reasoning about the emotional state of the person in the interaction. The acoustic sub-modality processes the audio signal, while the linguistic sub-modality analyzes transcribed text messages using a dedicated NLP model. Both methods are based on neural networks trained on available open-source databases. These sub-modalities are then merged into a single voice modality through an algorithm for multimodal information fusion. The overall system is tested on recordings available through Internet services.
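The fusion algorithm itself is not given in this excerpt; a minimal late-fusion sketch, assuming each sub-modality outputs a probability distribution over a hypothetical shared label set and using an assumed modality weight, could look like this:

```python
import numpy as np

# Assumed emotion label set; the paper's actual classes may differ.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse(p_acoustic, p_linguistic, w_acoustic=0.5):
    """Late fusion: weighted average of per-modality class probabilities,
    renormalized so the result is again a valid distribution."""
    p = w_acoustic * np.asarray(p_acoustic, dtype=float) \
        + (1.0 - w_acoustic) * np.asarray(p_linguistic, dtype=float)
    return p / p.sum()

# Acoustic model leans toward anger, linguistic model toward happiness;
# with a 0.6 acoustic weight the fused decision is anger.
p = fuse([0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], w_acoustic=0.6)
label = EMOTIONS[int(np.argmax(p))]
```

More elaborate schemes (e.g., learned fusion weights or feature-level fusion) are possible; this sketch only illustrates the decision-level variant.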

Keywords

  • Emotion recognition
  • Affective robotics
  • Multimodal information fusion
  • Voice analysis
  • Speech recognition
  • Learning
  • Reasoning



Acknowledgements

This work has been supported in part by Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition (UIP-2020–02-7184)”.

Author information

Corresponding author: Tomislav Stipancic.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Koren, L., Stipancic, T. (2021). Multimodal Emotion Analysis Based on Acoustic and Linguistic Features of the Voice. In: Meiselwitz, G. (ed.) Social Computing and Social Media: Experience Design and Social Network Analysis. HCII 2021. Lecture Notes in Computer Science, vol. 12774. Springer, Cham. https://doi.org/10.1007/978-3-030-77626-8_20

  • DOI: https://doi.org/10.1007/978-3-030-77626-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77625-1

  • Online ISBN: 978-3-030-77626-8