
A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition

  • Conference paper
  • First Online:
Man-Machine Speech Communication (NCMMSC 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1765)


Abstract

Speech emotion recognition (SER) is the task of automatically identifying human emotions from the analysis of utterances. In practical applications, the task is often affected by subsidiary information, such as speaker or phoneme information. Traditional domain adaptation approaches are commonly applied to remove such unwanted domain-specific knowledge, but they unavoidably discard useful categorical information as well. In this paper, we propose a time-frequency attention mechanism based on multi-task learning (MTL). The mechanism derives self-attention over the time and channel dimensions from the content of the input itself, and derives weights over the frequency dimension from domain information extracted via MTL. We conduct extensive evaluations on the IEMOCAP benchmark to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 73.24% weighted accuracy (WA) and 73.18% unweighted accuracy (UA) over four emotions, outperforming the baseline by about 4%.
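The attention design summarised in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch of one way such a block could look, assuming a spectrogram-like feature map of shape (batch, channels, frequency, time) and an auxiliary domain embedding (e.g. derived from a speaker or phoneme branch of the MTL model). All class, function, and parameter names are hypothetical; this is not the authors' implementation, only an illustration of the idea.

```python
# Minimal sketch of a time-frequency attention block with subsidiary information.
# Assumption: time and channel weights come from the feature map itself, while
# frequency weights are conditioned on a domain embedding from the MTL branch.
import torch
import torch.nn as nn


class TimeFreqAttention(nn.Module):
    def __init__(self, channels: int, freq_bins: int, domain_dim: int):
        super().__init__()
        # Channel attention (squeeze-and-excitation style) from global pooling.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Time attention from per-frame statistics.
        self.time_conv = nn.Sequential(
            nn.Conv1d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # Frequency weights predicted from the subsidiary (domain) embedding.
        self.freq_fc = nn.Sequential(
            nn.Linear(domain_dim, freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, domain_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time); domain_emb: (batch, domain_dim)
        b, c, f, t = x.shape

        # Channel attention: squeeze over frequency/time, excite per channel.
        ch_w = self.channel_fc(x.mean(dim=(2, 3)))   # (b, c)
        x = x * ch_w.view(b, c, 1, 1)

        # Time attention: pool over frequency, score each frame.
        t_w = self.time_conv(x.mean(dim=2))          # (b, 1, t)
        x = x * t_w.view(b, 1, 1, t)

        # Frequency attention conditioned on the MTL domain embedding.
        f_w = self.freq_fc(domain_emb)               # (b, f)
        return x * f_w.view(b, 1, f, 1)


# Usage sketch: 64-channel features over 40 mel bins and 300 frames, plus a
# 128-dimensional embedding from a hypothetical auxiliary classification head.
if __name__ == "__main__":
    att = TimeFreqAttention(channels=64, freq_bins=40, domain_dim=128)
    feats = torch.randn(8, 64, 40, 300)
    dom = torch.randn(8, 128)
    print(att(feats, dom).shape)  # torch.Size([8, 64, 40, 300])
```

In this sketch the time and channel weights depend only on the feature map itself, while the frequency weights are conditioned on the subsidiary information, mirroring the division of roles described in the abstract.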



Author information


Corresponding author

Correspondence to Yan Song.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Xi, YX., Song, Y., Dai, LR., Liu, L. (2023). A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_16

  • DOI: https://doi.org/10.1007/978-981-99-2401-1_16

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-2400-4

  • Online ISBN: 978-981-99-2401-1

  • eBook Packages: Computer Science, Computer Science (R0)
