Multimodal fusion-based emotion recognition has attracted increasing attention in affective computing because different modalities provide complementary information. A central challenge in designing a reliable and effective model is defining and extracting appropriate emotional features from each modality. In this paper, we present a novel multimodal emotion recognition framework that estimates categorical emotions from visual and audio input. The model learns the neutral appearance and selects key emotion frames using a statistical geometric method, which acts as a pre-processor that saves computation. Discriminative emotion features from the visual and audio modalities are selected through evolutionary optimization and then fed to optimized extreme learning machine (ELM) classifiers for unimodal emotion recognition. Finally, a decision-level fusion strategy integrates the emotions predicted by the different classifiers to enhance overall performance. The effectiveness of the proposed method is demonstrated on three public datasets: the acted CK+ and Enterface05 datasets and the spontaneous BAUM-1s dataset. Average recognition rates of 93.53% on CK+, 91.62% on Enterface05, and 60.77% on BAUM-1s are obtained. The recognition results acquired by fusing the visual and audio predictions are superior to both unimodal recognition and the concatenation of individual features.
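The core of the pipeline above, unimodal ELM classifiers followed by decision-level fusion of their class probabilities, can be sketched as follows. This is a minimal illustration, not the paper's actual configuration: the feature matrices, hidden-layer size, six-class label set, and equal fusion weight are all assumptions for the sketch, and the evolutionary feature selection and key-frame pre-processing steps are omitted.

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine: a random, untrained hidden layer
    whose output weights are solved in closed form by least squares."""
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activations of the fixed random hidden layer
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y, n_classes):
        n_features = X.shape[1]
        self.W = self.rng.normal(size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        T = np.eye(n_classes)[y]            # one-hot targets
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T   # closed-form output weights
        return self

    def predict_proba(self, X):
        scores = self._hidden(X) @ self.beta
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)  # softmax, so outputs can be fused

def fuse(p_visual, p_audio, w_visual=0.5):
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    return w_visual * p_visual + (1.0 - w_visual) * p_audio

# Toy usage with random stand-ins for visual and audio feature matrices
rng = np.random.default_rng(1)
Xv, Xa = rng.normal(size=(60, 20)), rng.normal(size=(60, 12))
y = rng.integers(0, 6, size=60)             # six hypothetical emotion labels
clf_v = ELM().fit(Xv, y, n_classes=6)
clf_a = ELM().fit(Xa, y, n_classes=6)
fused = fuse(clf_v.predict_proba(Xv), clf_a.predict_proba(Xa))
pred = fused.argmax(axis=1)                 # final fused emotion per sample
```

Because the hidden layer is never trained, fitting reduces to one pseudo-inverse, which is why ELMs are attractive when unimodal classifiers must be retrained repeatedly during evolutionary feature selection.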
This work was supported by the Open Foundation of Beijing Engineering Research Center of Smart Mechanical Innovation Design Service under Grant No. KF2019302, the General Projects of Science and Technology Plan of Beijing Municipal Commission of Education under Grant No. KM202011417005, and the National Talents Foundation under Grant No. WQ20141100198.
Pan, B., Hirota, K., Jia, Z. et al. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips. J Ambient Intell Human Comput (2021). https://doi.org/10.1007/s12652-021-03407-2
Keywords
- Emotion recognition
- Multimodal fusion
- Evolutionary optimization
- Feature selection
- Extreme learning machine