Abstract
The development of multimodal emotion recognition is severely limited by the high cost of annotation. In this paper, we focus on multimodal emotion recognition in the cross-corpus setting, where a trained model must be adapted to an unlabeled target corpus. Inspired by recent progress in pre-trained models, we adopt a multimodal emotion pre-trained model to provide a stronger representation learning foundation for the task. However, applying a pre-trained model to a cross-corpus downstream task exposes two domain gaps: the scenario gap between the pre-training and downstream corpora, and the distribution gap between the source and target downstream sets. To bridge these two gaps, we propose a two-stage adaptation method. Specifically, we first adapt the pre-trained model to the task-related scenario through task-adaptive pre-training. We then fine-tune the model with a cluster-based loss that aligns the distributions of the two downstream sets in a class-conditional manner. Additionally, we propose a ranking-based pseudo-label filtering strategy that selects more balanced, higher-quality samples from the target sets for computing the cluster-based loss. Extensive experiments on two emotion datasets, IEMOCAP and MSP-IMPROV, demonstrate the effectiveness of the proposed two-stage adaptation method and pseudo-label filtering strategy in cross-corpus settings.
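To make the second stage more concrete, below is a minimal PyTorch sketch of the two ingredients described above: a cluster-based loss that aligns source and target features per emotion class, and a ranking-based filter that keeps only the most confident pseudo-labeled target samples within each predicted class. All function names, the squared-distance centroid matching, and the `keep_ratio` parameter are our illustrative assumptions, not the authors' exact formulation.

```python
import torch

def cluster_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Class-conditional alignment (a sketch): pull the per-class centroids
    of source features and pseudo-labeled target features together."""
    loss = src_feats.new_tensor(0.0)
    matched = 0
    for c in range(num_classes):
        src_c = src_feats[src_labels == c]
        tgt_c = tgt_feats[tgt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue  # class absent from one side of this batch
        # squared distance between the two class centroids
        loss = loss + (src_c.mean(0) - tgt_c.mean(0)).pow(2).sum()
        matched += 1
    return loss / max(matched, 1)

def rank_filter_pseudo_labels(probs, keep_ratio=0.5):
    """Ranking-based filtering (a sketch): within each predicted class,
    keep the top-`keep_ratio` most confident target samples, which yields
    a more class-balanced selection than one global confidence threshold."""
    conf, preds = probs.max(dim=1)
    keep = torch.zeros_like(preds, dtype=torch.bool)
    for c in preds.unique():
        idx = (preds == c).nonzero(as_tuple=True)[0]
        k = max(1, int(keep_ratio * idx.numel()))
        top = conf[idx].topk(k).indices
        keep[idx[top]] = True
    return keep, preds

# Example: hypothetical 4-way emotion setup with random features
feats_s, y_s = torch.randn(32, 128), torch.randint(0, 4, (32,))
feats_t = torch.randn(32, 128)
probs_t = torch.softmax(torch.randn(32, 4), dim=1)
keep, pseudo = rank_filter_pseudo_labels(probs_t, keep_ratio=0.5)
loss = cluster_alignment_loss(feats_s, y_s, feats_t[keep], pseudo[keep], num_classes=4)
```

Ranking within each predicted class, rather than applying a single global confidence threshold, is what keeps the retained pseudo-labels balanced across emotion categories.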
References
Neumann, M., et al.: Cross-lingual and multilingual speech emotion recognition on English and French. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5769–5773. IEEE (2018)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Li, T., Chen, X., Zhang, S., Dong, Z., Keutzer, K.: Cross-domain sentiment classification with contrastive learning and mutual information maximization. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8203–8207. IEEE (2021)
Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Bao, H., et al.: VLMo: unified vision-language pre-training with mixture-of-modality-experts. In: Advances in Neural Information Processing Systems, vol. 35, pp. 32897–32912 (2022)
Zhao, J., Li, R., Jin, Q., Wang, X., Li, H.: MEmoBERT: pre-training model with prompt-based learning for multimodal emotion recognition. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4703–4707. IEEE (2022)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27 (2015)
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
Busso, C., Parthasarathy, S., Burmania, A., Abdelwahab, M., Sadoughi, N., Provost, E.M.: MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2016)
Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360 (2020)
Deng, Z., Luo, Y., Zhu, J.: Cluster alignment with a teacher for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9944–9953 (2019)
Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) CCL 2019. LNCS (LNAI), vol. 11856, pp. 194–206. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_16
Karouzos, C., Paraskevopoulos, G., Potamianos, A.: UDALM: unsupervised domain adaptation through language modeling. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2579–2590 (2021)
Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. (TIST) 11(5), 1–46 (2020)
Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. PMLR (2015)
Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_35
Wang, M., Deng, W.: Deep face recognition with clustering based domain adaptation. Neurocomputing 393, 1–14 (2020)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189. PMLR (2015)
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176 (2017)
Abdelwahab, M., Busso, C.: Domain adversarial for acoustic emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 26(12), 2423–2435 (2018)
Milner, R., Jalal, M.A., Ng, R.W., Hain, T.: A cross-corpus study on speech emotion recognition. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 304–311. IEEE (2019)
Yin, Y., Huang, B., Wu, Y., Soleymani, M.: Speaker-invariant adversarial domain adaptation for emotion recognition. In: Proceedings of the 2020 International Conference on Multimodal Interaction, pp. 481–490 (2020)
Gao, Y., Okada, S., Wang, L., Liu, J., Dang, J.: Domain-invariant feature learning for cross corpus speech emotion recognition. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6427–6431. IEEE (2022)
Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
Luo, Y., Zhu, J., Li, M., Ren, Y., Zhang, B.: Smooth neighbors on teacher graphs for semi-supervised learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8896–8905 (2018)
Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: International Workshop on Artificial Intelligence and Statistics, pp. 57–64. PMLR (2005)
Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283 (2016)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Acknowledgements
This work was partially supported by the National Key R&D Program of China (No. 2020AAA0108600) and the National Natural Science Foundation of China (No. 62072462).