Improving Speech-Based Emotion Recognition by Using Psychoacoustic Modeling and Analysis-by-Synthesis

  • Ingo Siegert
  • Alicia Flores Lotz
  • Olga Egorow
  • Andreas Wendemuth
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)

Abstract

Most technical communication systems use speech compression codecs to save transmission bandwidth. Much development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: Analysis-by-Synthesis, psychoacoustic modeling, and a hybrid mode combining both. Our first assumption is that the hybrid mode improves speech intelligibility. However, enabling a natural spoken conversation also requires the affective, namely emotional, information contained in spoken language to be transmitted intelligibly. Compression methods are usually avoided in emotion recognition, as it is feared that compression degrades the acoustic characteristics needed for accurate recognition [1]. By contrast, our second assumption states that the combination of psychoacoustic modeling and Analysis-by-Synthesis codecs can actually improve speech-based emotion recognition by removing those parts of the acoustic signal that are considered “unnecessary” while still retaining the full emotional information. To test both assumptions, we conducted ITU-recommended POLQA measurements as well as several emotion recognition experiments, employing two different datasets to verify the generality of our findings. We compared the results of the hybrid mode with those of Analysis-by-Synthesis-only and psychoacoustic-modeling-only codecs. The hybrid mode shows no remarkable difference in speech intelligibility, but it outperforms all other compression settings in the multi-class emotion recognition experiments and even achieves an approximately 3.3% absolute higher performance than the uncompressed samples.
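As background for the compression settings compared above: the Opus codec [26, 27] implements all three approaches in one format, with a SILK layer (Analysis-by-Synthesis), a CELT layer (psychoacoustic transform coding), and a hybrid mode, where the encoder selects the operating mode internally based on bitrate and audio bandwidth. The following minimal sketch shows one plausible way to create codec-degraded variants of a speech corpus by round-tripping WAV files through Opus (via ffmpeg's libopus encoder) and then extracting acoustic features with openSMILE's SMILExtract tool [8]. The bitrates, file names, and the openSMILE configuration are illustrative assumptions, not the exact settings of the study.

```python
# Sketch: round-trip Opus compression and feature extraction.
# Assumptions (not from the paper): ffmpeg with libopus and the
# SMILExtract binary are on PATH; the bitrates below merely nudge
# the Opus encoder towards SILK-only, hybrid, or CELT-only operation,
# since the mode decision is internal to the codec.
import subprocess
from pathlib import Path

# Illustrative bitrate regimes for the three coding approaches.
CONDITIONS = {
    "silk_like": "12k",   # low-rate speech -> typically SILK (Analysis-by-Synthesis)
    "hybrid": "28k",      # mid-rate super-wideband -> typically hybrid mode
    "celt_like": "96k",   # high-rate -> typically CELT (psychoacoustic transform)
}


def opus_roundtrip(wav_in: Path, out_dir: Path) -> dict[str, Path]:
    """Encode a WAV file to Opus at several bitrates and decode back to WAV."""
    out_dir.mkdir(parents=True, exist_ok=True)
    decoded = {}
    for name, bitrate in CONDITIONS.items():
        opus_file = out_dir / f"{wav_in.stem}_{name}.opus"
        wav_out = out_dir / f"{wav_in.stem}_{name}.wav"
        # Encode with libopus at the chosen bitrate.
        subprocess.run(["ffmpeg", "-y", "-i", str(wav_in),
                        "-c:a", "libopus", "-b:a", bitrate, str(opus_file)],
                       check=True)
        # Decode back to WAV so the degraded signal can be analyzed.
        subprocess.run(["ffmpeg", "-y", "-i", str(opus_file), str(wav_out)],
                       check=True)
        decoded[name] = wav_out
    return decoded


def extract_features(wav: Path, arff_out: Path,
                     config: str = "emobase.conf") -> None:
    """Extract an openSMILE feature vector; the config path is an assumption."""
    subprocess.run(["SMILExtract", "-C", config,
                    "-I", str(wav), "-O", str(arff_out)], check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name, wav in opus_roundtrip(Path("sample.wav"), Path("compressed")).items():
        extract_features(wav, Path(f"features_{name}.arff"))
```

The resulting ARFF files can be loaded directly into WEKA [10] for classifier training, which, judging from the reference list, mirrors the openSMILE/WEKA tool chain used in such experiments.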

Keywords

Automatic emotion recognition · Speech compression · Intelligibility of affective speech

Acknowledgments

The authors are grateful for the continued support of the SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” (www.sfb-trr-62.de), funded by the German Research Foundation (DFG). This work was further sponsored by the Federal Ministry of Education and Research within the program Zwanzig20 – Partnership for Innovation, as part of the research alliance 3Dsensation (www.3d-sensation.de). We would also like to thank SwissQual AG (a Rohde & Schwarz company), in particular Jens Berger, for supplying the POLQA testbed.

References

  1. Albahri, A., Lech, M., Cheng, E.: Effect of speech compression on the automatic recognition of emotions. IJSPS 4(1), 55–61 (2016)
  2. Biundo, S., Wendemuth, A.: Companion-technology for cognitive technical systems. KI - Künstliche Intelligenz 30(1), 71–75 (2016)
  3. Brandenburg, K.: MP3 and AAC explained. In: 17th AES International Conference: High-Quality Audio Coding, Florence, Italy, September 1999
  4. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. In: Proceedings of the INTERSPEECH-2005, pp. 1517–1520, Lisbon, Portugal (2005)
  5. Byrne, C., Foulkes, P.: The ‘mobile phone effect’ on vowel formants. Int. J. Speech Lang. Law 11(1), 83–102 (2004)
  6. Dhall, A., Goecke, R., Gedeon, T., Sebe, N.: Emotion recognition in the wild. J. Multimodal User Interfaces 10, 95–97 (2016)
  7. Engberg, I.S., Hansen, A.V.: Documentation of the Danish Emotional Speech database (DES). Tech. rep., Aalborg University, Denmark (1996)
  8. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE - the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the ACM MM-2010, Firenze, Italy (2010)
  9. García, N., Vásquez-Correa, J.C., Arias-Londoño, J.D., Vargas-Bonilla, J.F., Orozco-Arroyave, J.R.: Automatic emotion recognition in compressed speech using acoustic and non-linear features. In: Proceedings of the STSIVA 2015, pp. 1–7 (2015)
  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
  11. Hoene, C., Valin, J.M., Vos, K., Skoglund, J.: Summary of Opus listening test results, draft-valin-codec-results-03. Internet-Draft, IETF (2013)
  12. IBM Corporation and Microsoft Corporation: Multimedia programming interface and data specifications 1.0. Tech. rep., August 1991
  13. ITU-T: Methods for subjective determination of transmission quality. REC P.800 (1996). https://www.itu.int/rec/T-REC-P.800-199608-I/en
  14. ITU-T: Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB). REC G.722.2 (2003). https://www.itu.int/rec/T-REC-G.722.2-200307-I/en
  15. ITU-T: Methods for objective and subjective assessment of speech quality (POLQA): Perceptual Objective Listening Quality Assessment. REC P.863, September 2014. http://www.itu.int/rec/T-REC-P.863-201409-I/en
  16. Jokisch, O., Maruschke, M., Meszaros, M., Iaroshenko, V.: Audio and speech quality survey of the Opus codec in web real-time communication. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, pp. 254–262, Leipzig, Germany (2016)
  17. Lotz, A.F., Siegert, I., Maruschke, M., Wendemuth, A.: Audio compression and its impact on emotion recognition in affective computing. In: Elektronische Sprachsignalverarbeitung 2017, vol. 86, pp. 1–8, Saarbrücken, Germany (2017)
  18. Paulsen, S.: QoS/QoE-Modelle für den Dienst Voice over IP (VoIP). Ph.D. thesis, Universität Hamburg (2015)
  19. Pfister, T., Robinson, P.: Speech emotion classification and public speaking skill assessment. In: Salah, A.A., Gevers, T., Sebe, N., Vinciarelli, A. (eds.) HBU 2010. LNCS, vol. 6219, pp. 151–162. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14715-9_15
  20. Rämö, A., Toukomaa, H.: Voice quality characterization of IETF Opus codec. In: Proceedings of the INTERSPEECH-2011, pp. 2541–2544, Florence, Italy (2011)
  21. Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., Wendemuth, A.: Acoustic emotion recognition: a benchmark comparison of performances. In: Proceedings of the IEEE ASRU-2009, pp. 552–557, Merano, Italy (2009)
  22. Siegert, I., Lotz, A.F., Duong, L., Wendemuth, A.: Measuring the impact of audio compression on the spectral quality of speech data. In: Elektronische Sprachsignalverarbeitung 2016, vol. 81, pp. 229–236, Leipzig, Germany (2016)
  23. Siegert, I., Lotz, A.F., Maruschke, M., Jokisch, O., Wendemuth, A.: Emotion intelligibility within codec-compressed and reduced bandwidth speech. In: ITG-Fb. 267: Speech Communication: 12. ITG-Fachtagung Sprachkommunikation, 5–7 October 2016, Paderborn, pp. 215–219. VDE Verlag (2016)
  24. Steininger, S., Schiel, F., Dioubina, O., Raubold, S.: Development of user-state conventions for the multimodal corpus in SmartKom. In: Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, pp. 33–37 (2002)
  25. Tickle, A., Raghu, S., Elshaw, M.: Emotional recognition from the speech signal for a virtual education agent. J. Phys.: Conf. Ser. 450, 012053 (2013)
  26. Valin, J.M., Vos, K., Terriberry, T.: Definition of the Opus audio codec. RFC 6716 (2012). http://tools.ietf.org/html/rfc6716
  27. Valin, J.M., Maxwell, G., Terriberry, T.B., Vos, K.: The Opus codec. In: 135th AES International Convention, New York, USA, October 2013
  28. Ververidis, D., Kotropoulos, C.: Emotional speech recognition: resources, features, and methods. Speech Commun. 48, 1162–1181 (2006)
  29. Vásquez-Correa, J.C., García, N., Vargas-Bonilla, J.F., Orozco-Arroyave, J.R., Arias-Londoño, J.D., Quintero, M.O.L.: Evaluation of wavelet measures on automatic detection of emotion in noisy and telephony speech signals. In: International Carnahan Conference on Security Technology, pp. 1–6 (2014)
  30. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39–58 (2009)
  31. Zhang, Z., Weninger, F., Wöllmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: Proceedings of the IEEE ASRU-2011, pp. 523–528, Waikoloa, USA (2011)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Ingo Siegert (1)
  • Alicia Flores Lotz (1)
  • Olga Egorow (1)
  • Andreas Wendemuth (1, 2)
  1. Cognitive Systems Group, Institute of Information and Communication Engineering, Otto von Guericke University, Magdeburg, Germany
  2. Center for Behavioral Brain Sciences, Magdeburg, Germany