Utilizing Psychoacoustic Modeling to Improve Speech-Based Emotion Recognition

  • Ingo Siegert
  • Alicia Flores Lotz
  • Olga Egorow
  • Susann Wolff
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)

Abstract

Compression methods are usually avoided in emotion recognition, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition. By contrast, we assume that the psychoacoustic modeling used for transparent music compression could actually improve speech-based emotion recognition: it removes those parts of the acoustic signal that are masked for human listeners and thus considered “unnecessary”, while the remaining signal still carries the full emotional information.
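
To make the idea concrete, the following minimal sketch (not the authors' exact pipeline) runs an utterance through a lossy codec round-trip before feature extraction, so that the encoder's psychoacoustic model filters the signal. It assumes ffmpeg with the libmp3lame encoder is installed; the file names and bitrate are placeholders.

```python
# Minimal sketch of psychoacoustic filtering via a codec round-trip.
# Assumes ffmpeg (built with libmp3lame) is on PATH; paths are placeholders.
import subprocess


def psychoacoustic_roundtrip(wav_in: str, wav_out: str, bitrate: str = "128k") -> None:
    """Encode a WAV file to MP3 and decode it back to WAV.

    The encoder's psychoacoustic model discards signal components that
    are masked for human listeners; the decoded WAV is then used in
    place of the original for acoustic feature extraction.
    """
    mp3_tmp = wav_out + ".tmp.mp3"
    # Encode: the psychoacoustic model is applied at this step.
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_in, "-codec:a", "libmp3lame", "-b:a", bitrate, mp3_tmp],
        check=True,
    )
    # Decode back to PCM so standard feature extractors can read it.
    subprocess.run(["ffmpeg", "-y", "-i", mp3_tmp, wav_out], check=True)


psychoacoustic_roundtrip("utterance.wav", "utterance_filtered.wav")
```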

To test this assumption, we conducted several recognition experiments on different datasets to verify its generalizability. Depending on the dataset, we achieved absolute performance gains between 0.94% and 4.86%. Furthermore, we identified the features that are modified by the psychoacoustic modeling and confirmed through additional recognition experiments that the modification of these features is responsible for the observed performance increase. Although the feature influence is dataset-specific, a small group of four low-level feature descriptors is shared amongst all three datasets.
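
As a rough illustration of such an experiment (the paper itself uses openSMILE features and standard classifiers; here a scikit-learn SVM and random placeholder data stand in), one could compare cross-validated accuracy on features extracted from the original and the psychoacoustically filtered audio:

```python
# Hypothetical experiment skeleton: compare recognition accuracy on
# features from original vs. codec-filtered audio. X_orig / X_filt are
# placeholder (n_samples x n_features) matrices, e.g. openSMILE
# functionals; y holds the emotion labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_orig = rng.normal(size=(200, 88))                       # stand-in feature matrix
X_filt = X_orig + rng.normal(scale=0.1, size=X_orig.shape)  # stand-in filtered features
y = rng.integers(0, 4, size=200)                          # stand-in emotion labels

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
acc_orig = cross_val_score(clf, X_orig, y, cv=5).mean()
acc_filt = cross_val_score(clf, X_filt, y, cv=5).mean()
print(f"original: {acc_orig:.3f}  filtered: {acc_filt:.3f}  "
      f"gain: {100 * (acc_filt - acc_orig):+.2f}% absolute")
```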

Keywords

Automatic emotion recognition · Speech compression · Psychoacoustic modeling

Acknowledgements

This work has been sponsored by the German Federal Ministry of Education and Research within the programme Zwanzig20 – Partnership for Innovation as part of the research alliance 3Dsensation. One of us (A.F. Lotz) wishes to acknowledge funding from the European Union’s Horizon 2020 research and innovation programme in the project “ADAS&Me” under grant agreement No. 688900.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Ingo Siegert (1)
  • Alicia Flores Lotz (1)
  • Olga Egorow (1)
  • Susann Wolff (2)
  1. Cognitive Systems Group, Otto von Guericke University, Magdeburg, Germany
  2. Special Lab Non-Invasive Brain Imaging, Leibniz Institute for Neurobiology, Magdeburg, Germany
