Abstract
Speech emotion recognition (SER) is the task of automatically identifying human emotions from spoken utterances. In practical applications, the task is often affected by subsidiary information, such as speaker or phoneme identity. Traditional domain adaptation approaches are applied to remove this unwanted domain-specific knowledge, but in doing so they often discard useful categorical information as well. In this paper, we propose a time-frequency attention mechanism based on multi-task learning (MTL): the feature map's own content provides self-attention weights along the time and channel dimensions, while domain information extracted by the MTL branch provides the attention weights along the frequency dimension. We conduct extensive evaluations on the IEMOCAP benchmark to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 73.24% weighted accuracy (WA) and 73.18% unweighted accuracy (UA) over four emotions, outperforming the baseline by about 4%.
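To make the mechanism concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract, not the authors' published architecture: channel and time weights are derived from the feature map itself via squeeze-and-excitation-style gating, while frequency weights are predicted from a domain embedding produced by an auxiliary multi-task branch (e.g. a speaker classifier). The module name `TimeFreqAttention`, all layer sizes, and the `domain_dim` parameter are illustrative assumptions.

```python
# Minimal sketch of a time-frequency attention block with subsidiary
# (domain) information. Assumptions: input features are (B, C, F, T)
# spectrogram maps; `domain_emb` comes from an auxiliary MTL branch.
import torch
import torch.nn as nn


class TimeFreqAttention(nn.Module):
    def __init__(self, channels: int, freq_bins: int, domain_dim: int):
        super().__init__()
        # Channel gate: squeeze over (F, T), excite per channel (SE-style).
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Time gate: 1-D convolution over the frequency-pooled map.
        self.time_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )
        # Frequency gate conditioned on the MTL domain embedding.
        self.freq_fc = nn.Sequential(
            nn.Linear(domain_dim, freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, domain_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T); domain_emb: (B, domain_dim)
        b, c, f, t = x.shape
        # Channel attention from the map's own content (global pooling).
        w_c = self.channel_fc(x.mean(dim=(2, 3)))   # (B, C)
        x = x * w_c.view(b, c, 1, 1)
        # Time attention, also from the map's own content.
        w_t = self.time_conv(x.mean(dim=2))         # (B, 1, T)
        x = x * w_t.view(b, 1, 1, t)
        # Frequency attention from the subsidiary domain information.
        w_f = self.freq_fc(domain_emb)              # (B, F)
        return x * w_f.view(b, 1, f, 1)


if __name__ == "__main__":
    block = TimeFreqAttention(channels=32, freq_bins=64, domain_dim=128)
    feats = torch.randn(4, 32, 64, 400)   # batch of log-Mel feature maps
    dom = torch.randn(4, 128)             # embedding from the MTL branch
    print(block(feats, dom).shape)        # torch.Size([4, 32, 64, 400])
```

Gating each axis separately keeps the block lightweight, and routing only the frequency weights through the domain embedding confines the subsidiary information to the axis where speaker and phoneme characteristics are most prominent in a spectrogram.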
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xi, YX., Song, Y., Dai, LR., Liu, L. (2023). A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1
eBook Packages: Computer Science, Computer Science (R0)