Abstract
Speech emotion recognition (SER) is the task of automatically identifying human emotions from spoken utterances. In practical applications, the task is often affected by subsidiary information, such as speaker or phoneme identity. Traditional domain adaptation approaches are applied to remove this unwanted domain-specific knowledge, but in doing so they often discard useful categorical information as well. In this paper, we propose a time-frequency attention mechanism based on multi-task learning (MTL): the feature map's own content provides self-attention weights along the time and channel dimensions, while domain information extracted by the MTL branch provides the attention weights along the frequency dimension. We conduct extensive evaluations on the IEMOCAP benchmark to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 73.24% weighted accuracy (WA) and 73.18% unweighted accuracy (UA) over four emotions, outperforming the baseline by about 4%.
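To make the mechanism concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract, not the authors' published architecture: channel and time weights are derived from the feature map itself via squeeze-and-excitation-style gating, while frequency weights are predicted from a domain embedding produced by an auxiliary multi-task branch (e.g. a speaker classifier). The module name `TimeFreqAttention`, all layer sizes, and the `domain_dim` parameter are illustrative assumptions.

```python
# Minimal sketch of a time-frequency attention block with subsidiary
# (domain) information. Assumptions: input features are (B, C, F, T)
# spectrogram maps; `domain_emb` comes from an auxiliary MTL branch.
import torch
import torch.nn as nn


class TimeFreqAttention(nn.Module):
    def __init__(self, channels: int, freq_bins: int, domain_dim: int):
        super().__init__()
        # Channel gate: squeeze over (F, T), excite per channel (SE-style).
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Time gate: 1-D convolution over the frequency-pooled map.
        self.time_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )
        # Frequency gate conditioned on the MTL domain embedding.
        self.freq_fc = nn.Sequential(
            nn.Linear(domain_dim, freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, domain_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T); domain_emb: (B, domain_dim)
        b, c, f, t = x.shape
        # Channel attention from the map's own content (global pooling).
        w_c = self.channel_fc(x.mean(dim=(2, 3)))   # (B, C)
        x = x * w_c.view(b, c, 1, 1)
        # Time attention, also from the map's own content.
        w_t = self.time_conv(x.mean(dim=2))         # (B, 1, T)
        x = x * w_t.view(b, 1, 1, t)
        # Frequency attention from the subsidiary domain information.
        w_f = self.freq_fc(domain_emb)              # (B, F)
        return x * w_f.view(b, 1, f, 1)


if __name__ == "__main__":
    block = TimeFreqAttention(channels=32, freq_bins=64, domain_dim=128)
    feats = torch.randn(4, 32, 64, 400)   # batch of log-Mel feature maps
    dom = torch.randn(4, 128)             # embedding from the MTL branch
    print(block(feats, dom).shape)        # torch.Size([4, 32, 64, 400])
```

Gating each axis separately keeps the block lightweight, and routing only the frequency weights through the domain embedding confines the subsidiary information to the axis where speaker and phoneme characteristics are most prominent in a spectrogram.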
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xi, YX., Song, Y., Dai, LR., Liu, L. (2023). A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1
eBook Packages: Computer Science, Computer Science (R0)