Abstract
The development of multimodal emotion recognition is severely limited by the high cost of annotation. In this paper, we focus on multimodal emotion recognition in the cross-corpus setting, where a trained model must be adapted to an unlabeled target corpus. Inspired by recent progress in pre-trained models, we adopt a multimodal emotion pre-trained model to provide a stronger representation learning foundation for the task. However, applying a pre-trained model to a cross-corpus downstream task exposes two domain gaps: the scenario gap between the pre-training and downstream corpora, and the distribution gap between the source and target downstream sets. To bridge these two gaps, we propose a two-stage adaptation method. Specifically, we first adapt the pre-trained model to the task-related scenario through task-adaptive pre-training. We then fine-tune the model with a cluster-based loss that aligns the distributions of the two downstream sets in a class-conditional manner. Additionally, we propose a ranking-based pseudo-label filtering strategy that selects more balanced, higher-quality samples from the target sets for computing the cluster-based loss. Extensive experiments on two emotion datasets, IEMOCAP and MSP-IMPROV, demonstrate the effectiveness of the proposed two-stage adaptation method and pseudo-label filtering strategy in cross-corpus settings.
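To make the second stage more concrete, below is a minimal PyTorch sketch of the two ingredients described above: a cluster-based loss that aligns source and target features per emotion class, and a ranking-based filter that keeps only the most confident pseudo-labeled target samples within each predicted class. All function names, the squared-distance centroid matching, and the `keep_ratio` parameter are our illustrative assumptions, not the authors' exact formulation.

```python
import torch

def cluster_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Class-conditional alignment (a sketch): pull the per-class centroids
    of source features and pseudo-labeled target features together."""
    loss = src_feats.new_tensor(0.0)
    matched = 0
    for c in range(num_classes):
        src_c = src_feats[src_labels == c]
        tgt_c = tgt_feats[tgt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue  # class absent from one side of this batch
        # squared distance between the two class centroids
        loss = loss + (src_c.mean(0) - tgt_c.mean(0)).pow(2).sum()
        matched += 1
    return loss / max(matched, 1)

def rank_filter_pseudo_labels(probs, keep_ratio=0.5):
    """Ranking-based filtering (a sketch): within each predicted class,
    keep the top-`keep_ratio` most confident target samples, which yields
    a more class-balanced selection than one global confidence threshold."""
    conf, preds = probs.max(dim=1)
    keep = torch.zeros_like(preds, dtype=torch.bool)
    for c in preds.unique():
        idx = (preds == c).nonzero(as_tuple=True)[0]
        k = max(1, int(keep_ratio * idx.numel()))
        top = conf[idx].topk(k).indices
        keep[idx[top]] = True
    return keep, preds

# Example: hypothetical 4-way emotion setup with random features
feats_s, y_s = torch.randn(32, 128), torch.randint(0, 4, (32,))
feats_t = torch.randn(32, 128)
probs_t = torch.softmax(torch.randn(32, 4), dim=1)
keep, pseudo = rank_filter_pseudo_labels(probs_t, keep_ratio=0.5)
loss = cluster_alignment_loss(feats_s, y_s, feats_t[keep], pseudo[keep], num_classes=4)
```

Ranking within each predicted class, rather than applying a single global confidence threshold, is what keeps the retained pseudo-labels balanced across emotion categories.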
References
Neumann, M., et al.: Cross-lingual and multilingual speech emotion recognition on English and French. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5769–5773. IEEE (2018)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Li, T., Chen, X., Zhang, S., Dong, Z., Keutzer, K.: Cross-domain sentiment classification with contrastive learning and mutual information maximization. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8203–8207. IEEE (2021)
Chen, Y.-C., et al.: UNITER: UNiversal Image-TExt representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 104–120. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_7
Bao, H., et al.: VLMo: unified vision-language pre-training with mixture-of-modality-experts. In: Advances in Neural Information Processing Systems, vol. 35, pp. 32897–32912 (2022)
Zhao, J., Li, R., Jin, Q., Wang, X., Li, H.: MEmoBERT: pre-training model with prompt-based learning for multimodal emotion recognition. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4703–4707. IEEE (2022)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27 (2015)
Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
Busso, C., Parthasarathy, S., Burmania, A., Abdelwahab, M., Sadoughi, N., Provost, E.M.: MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 67–80 (2016)
Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360 (2020)
Deng, Z., Luo, Y., Zhu, J.: Cluster alignment with a teacher for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9944–9953 (2019)
Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) CCL 2019. LNCS (LNAI), vol. 11856, pp. 194–206. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_16
Karouzos, C., Paraskevopoulos, G., Potamianos, A.: UDALM: unsupervised domain adaptation through language modeling. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2579–2590 (2021)
Wilson, G., Cook, D.J.: A survey of unsupervised deep domain adaptation. ACM Trans. Intell. Syst. Technol. (TIST) 11(5), 1–46 (2020)
Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: International Conference on Machine Learning, pp. 97–105. PMLR (2015)
Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_35
Wang, M., Deng, W.: Deep face recognition with clustering based domain adaptation. Neurocomputing 393, 1–14 (2020)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189. PMLR (2015)
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176 (2017)
Abdelwahab, M., Busso, C.: Domain adversarial for acoustic emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 26(12), 2423–2435 (2018)
Milner, R., Jalal, M.A., Ng, R.W., Hain, T.: A cross-corpus study on speech emotion recognition. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 304–311. IEEE (2019)
Yin, Y., Huang, B., Wu, Y., Soleymani, M.: Speaker-invariant adversarial domain adaptation for emotion recognition. In: Proceedings of the 2020 International Conference on Multimodal Interaction, pp. 481–490 (2020)
Gao, Y., Okada, S., Wang, L., Liu, J., Dang, J.: Domain-invariant feature learning for cross corpus speech emotion recognition. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6427–6431. IEEE (2022)
Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
Luo, Y., Zhu, J., Li, M., Ren, Y., Zhang, B.: Smooth neighbors on teacher graphs for semi-supervised learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8896–8905 (2018)
Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: International Workshop on Artificial Intelligence and Statistics, pp. 57–64. PMLR (2005)
Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283 (2016)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Acknowledgements
This work was partially supported by the National Key R&D Program of China (No. 2020AAA0108600) and the National Natural Science Foundation of China (No. 62072462).