Abstract
Textual escalation detection has been widely applied in the customer service systems of e-commerce companies to give early warning of potential conflicts and help prevent them. Acoustic escalation detection systems are similarly useful for enhancing passenger safety and maintaining public order in public areas such as airports and train stations, where many impersonal conversations frequently occur. To this end, we introduce a multimodal system based on acoustic-linguistic features to detect escalation levels in human speech. Voice activity detection (VAD) and label smoothing are adopted to further improve performance on this task. Given the difficulty and high cost of data collection in open scenarios, the datasets used in this task are subject to severe low-resource constraints. To address this problem, we introduce transfer learning through a multi-corpus framework that draws on emotion detection datasets such as RAVDESS and CREMA-D, integrating emotion features into the representation learning of escalation signals. On the development set, our proposed system achieves 81.5% unweighted average recall (UAR), significantly outperforming the 72.2% baseline.
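As a concrete reference for two techniques named in the abstract, the following minimal NumPy sketch (not the authors' implementation) computes unweighted average recall and a label-smoothed cross-entropy loss; the smoothing factor `eps`, the example labels, and all function names are illustrative assumptions.

```python
# Minimal sketch, assuming standard definitions of UAR and label smoothing.
# Not the authors' code; class counts and eps are illustrative choices.
import numpy as np

def uar(y_true, y_pred, num_classes):
    """Unweighted average recall: the mean of the per-class recalls."""
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return sum(recalls) / len(recalls)

def label_smoothed_ce(logits, targets, eps=0.1):
    """Cross-entropy against targets smoothed toward the uniform distribution."""
    n, k = logits.shape
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Each target gets (1 - eps) mass; eps is spread uniformly over all classes.
    smooth = np.full((n, k), eps / k)
    smooth[np.arange(n), targets] += 1.0 - eps
    return float(-(smooth * log_probs).sum(axis=1).mean())

if __name__ == "__main__":
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 1, 1, 1, 2, 0])
    print(f"UAR = {uar(y_true, y_pred, 3):.3f}")  # (0.5 + 1.0 + 0.5) / 3 = 0.667
```

Unlike accuracy, UAR weights every class equally, which is why it is the preferred metric when escalation classes are imbalanced.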
References
WebRTC VAD (2017). https://webrtc.org/
Abdelwahab, M., Busso, C.: Supervised domain adaptation for emotion recognition from speech. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5058–5062. IEEE (2015)
Aurelio, Y.S., de Almeida, G.M., de Castro, C.L., Braga, A.P.: Learning from imbalanced data sets with weighted cross-entropy function. Neural Process. Lett. 50(2), 1937–1949 (2019)
Brain, D., Webb, G.I.: On the effect of data set size on bias and variance in classification learning. In: Proceedings of the Fourth Australian Knowledge Acquisition Workshop, University of New South Wales, pp. 117–128 (1999)
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
Busso, C., et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 205–211 (2004)
Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A., Verma, R.: CREMA-D: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5(4), 377–390 (2014)
Caraty, M.-J., Montacié, C.: Detecting speech interruptions for automatic conflict detection. In: D’Errico, F., Poggi, I., Vinciarelli, A., Vincze, L. (eds.) Conflict and Multimodal Communication. CSS, pp. 377–401. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-14081-0_18
Dupuis, K., Pichora-Fuller, M.K.: Toronto Emotional Speech Set (TESS) (2010)
Evci, U., Dumoulin, V., Larochelle, H., Mozer, M.C.: Head2Toe: utilizing intermediate representations for better transfer learning. In: International Conference on Machine Learning, pp. 6009–6033. PMLR (2022)
Fayek, H.M., Lech, M., Cavedon, L.: Towards real-time speech emotion recognition using deep neural networks. In: 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1–5. IEEE (2015)
Gideon, J., Khorram, S., Aldeneh, Z., Dimitriadis, D., Provost, E.M.: Progressive neural networks for transfer learning in emotion recognition. In: Proceedings Interspeech 2017, pp. 1098–1102 (2017). https://doi.org/10.21437/Interspeech.2017-1637
Grèzes, F., Richards, J., Rosenberg, A.: Let me finish: automatic conflict detection using speaker overlap. In: Proceedings Interspeech 2013, pp. 200–204 (2013). https://doi.org/10.21437/Interspeech.2013-67
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Huang, C., Song, B., Zhao, L.: Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering. Int. J. Speech Technol. 19(4), 805–816 (2016). https://doi.org/10.1007/s10772-016-9371-3
Kim, S., Valente, F., Vinciarelli, A.: Annotation and detection of conflict escalation in political debates. In: Proceedings Interspeech 2013, pp. 1409–1413 (2013). https://doi.org/10.21437/Interspeech.2013-369
Kim, S., Yella, S.H., Valente, F.: Automatic detection of conflict escalation in spoken conversations. In: Proceedings Interspeech 2012, pp. 1167–1170 (2012). https://doi.org/10.21437/Interspeech.2012-121
Kishore, K.K., Satish, P.K.: Emotion recognition in speech using MFCC and wavelet features. In: 2013 3rd IEEE International Advance Computing Conference (IACC), pp. 842–847. IEEE (2013)
Ko, J.H., Fromm, J., Philipose, M., Tashev, I., Zarar, S.: Limiting numerical precision of neural networks to achieve real-time voice activity detection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2236–2240. IEEE (2018)
Lalitha, S., Geyasruti, D., Narayanan, R., M, S.: Emotion detection using MFCC and cepstrum features. Procedia Comput. Sci. 70, 29–35 (2015). https://doi.org/10.1016/j.procs.2015.10.020. Proceedings of the 4th International Conference on Eco-friendly Computing and Communication Systems
Lefter, I., Burghouts, G.J., Rothkrantz, L.J.: An audio-visual dataset of human-human interactions in stressful situations. J. Multimodal User Interfaces 8(1), 29–41 (2014)
Lefter, I., Rothkrantz, L.J., Burghouts, G.J.: A comparative study on automatic audio-visual fusion for aggression detection using meta-information. Pattern Recogn. Lett. 34(15), 1953–1963 (2013)
Letcher, A., Trišović, J., Cademartori, C., Chen, X., Xu, J.: Automatic conflict detection in police body-worn audio. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2636–2640. IEEE (2018)
Likitha, M.S., Gupta, S.R.R., Hasitha, K., Raju, A.U.: Speech based human emotion recognition using MFCC. In: 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 2257–2260 (2017). https://doi.org/10.1109/WiSPNET.2017.8300161
Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
Livingstone, S.R., Russo, F.A.: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5), e0196391 (2018)
McFee, B., et al.: librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18–25 (2015)
Mehta, P., et al.: A high-bias, low-variance introduction to machine learning for physicists. Phys. Rep. 810, 1–124 (2019)
Müller, R., Kornblith, S., Hinton, G.E.: When does label smoothing help? In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/f1748d6b0fd9d439f71450117eba2725-Paper.pdf
Ng, H.W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 443–449 (2015)
Peng, M., Wu, Z., Zhang, Z., Chen, T.: From macro to micro expression recognition: Deep learning on small datasets using transfer learning. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 657–661. IEEE (2018)
Polzehl, T., Sundaram, S., Ketabdar, H., Wagner, M., Metze, F.: Emotion classification in children's speech using fusion of acoustic and linguistic features. In: Proceedings Interspeech 2009, pp. 340–343 (2009). https://doi.org/10.21437/Interspeech.2009-110
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: MELD: a multimodal multi-party dataset for emotion recognition in conversations. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 527–536 (2019)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992 (2019)
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4512–4525 (2020)
Schuller, B.W., et al.: The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. In: Proceedings INTERSPEECH 2021, 22nd Annual Conference of the International Speech Communication Association. ISCA, Brno, Czechia (2021)
Tang, Y.: Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239 (2013). https://arxiv.org/abs/1306.0239
Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 11(8), 1301–1309 (2017). https://doi.org/10.1109/JSTSP.2017.2764438
van den Oord, A., Dieleman, S., Schrauwen, B.: Transfer learning by supervised pre-training for audio-based music classification. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (2014)
Wu, C., Huang, C., Chen, H.: Text-independent speech emotion recognition using frequency adaptive features. Multimedia Tools Appl. 77(18), 24353–24363 (2018). https://doi.org/10.1007/s11042-018-5742-x
Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 87–94 (2020)
Zhao, W.: Research on the deep learning of the small sample data based on transfer learning. In: AIP Conference Proceedings, vol. 1864, p. 020018. AIP Publishing LLC (2017)
Acknowledgements
This research was funded in part by the Synear and Wang-Cai donation lab at Duke Kunshan University. We thank the Advanced Computing East China Sub-Center for providing computational resources.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhou, Z., Xu, Y., Li, M. (2023). Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Linguistic Information Fusion. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_14
DOI: https://doi.org/10.1007/978-981-99-2401-1_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1