1 Introduction

Facial expression recognition (FER) has gained increasing interest in recent years due to the high demand for applications for automatic human behavior analysis [3, 7, 13] and novel technologies for human-machine communication and multimedia retrieval [32, 50]. One challenge is that the same expression can vary across individuals according to ethnicity, age or gender [8, 9, 31]. Although Ekman [8] found that some expressions appear differently across cultures, he also identified seven universal emotions: anger, disgust, fear, sadness, happiness, surprise and contempt. Age also plays an important role in the representation of emotions. For example, elderly people tend to appear sad or angry in their neutral expression due to the loss of facial muscle tone caused by aging [31]. Gender can have an effect as well, since women are generally more expressive than men [9]. In addition to these human-related factors, other factors also affect facial expression recognition. On the one hand, the expression of a particular person may appear differently depending on lighting, background, or posture. On the other hand, image-related factors such as image quality, color intensity or resolution depend on the capture process and environment. These different capture conditions can affect the classification accuracy, especially in cross-dataset evaluations. In most of the published literature, FER is simplified by focusing on optimizing results using the same method or combined methods on a single dataset, or on several datasets separately but with the training and testing sets belonging to the same dataset [16, 17, 28, 41, 52]. Therefore, these approaches lack generality when applied to new images or in in-the-wild contexts. This problem can be addressed by combining different datasets for training, but it is difficult to standardize images from different datasets (regarding image format or capture conditions).

The main aim of this work is to evaluate to what extent merging information from diverse datasets helps in the training task. Therefore, we present a method to combine multiple datasets into a large-scale dataset, and we conduct an exhaustive evaluation of a proposed CNN-based system to analyze its performance using single- and cross-dataset approaches. Finally, we compare the results both with recent architectures and with human recognition.

The main contributions presented throughout this work are: (1) we define a protocol to select and work with different datasets and create a homogenized dataset with data augmentation to be used as a source for a single learning step; (2) we present an extensive evaluation of a proposed CNN using four datasets widely employed in the literature (BU-4DFE, CK+, JAFFE, WSEFEP) and two new datasets (FEGA and FE-Test), using both single- and cross-dataset approaches; and (3) we compare the performance of the CNN with state-of-the-art models and with human perception. The results show that our approach accurately classifies various facial expressions, performing better than or at the same level as other state-of-the-art methods, and correlates with the human classification [23].

The work is structured as follows: in the following section, we review the literature to identify the most relevant works related to the topic. Section 3 presents the protocol to create new training datasets from diverse existing datasets and lists the datasets used to train and test the CNN; it also details the image pre-processing and data augmentation steps and describes the proposed CNN for FER. Sections 4, 5 and 6 present the exhaustive evaluation of the pre-processing step, the performance of the system, the comparison with humans, and a discussion of the results. Finally, the last Section concludes the work and summarizes the main contributions.

2 Related work

This Section reviews works on: (1) automatic FER, (2) datasets used for this research area, (3) cross-dataset evaluation and (4) works comparing human results with automatic recognition.

Automatic FER is currently a major area of interest across different fields such as computer science, medicine, or psychology. Research in the area has a long tradition in the Human-Computer Interaction (HCI) discipline and, more recently, in Human-Robot Interaction (HRI). In recent decades, several techniques have been proposed for FER. Sebe et al. [42] evaluated several promising machine learning algorithms for emotion detection, including Bayesian networks, Support Vector Machines (SVMs) and decision trees. SVMs were also used by Trujillo et al. [48] for facial expression classification. In [38], the authors studied Gauss–Laguerre (GL) wavelets, which have rich frequency extraction capabilities, to extract texture information of various facial expressions. For each input image, the face area was localized first; then, features were extracted with GL filters and, finally, a k-nearest neighbor (KNN) classifier was used for expression recognition. Siddiqi et al. [44] used Principal Component Analysis (PCA) and Independent Component Analysis (ICA) for global and local feature extraction, and a hierarchical classifier based on Hidden Markov Models (HMMs) to recognize the facial expression. In [37], Gabor feature extraction techniques were employed to extract thousands of facial features. An AdaBoost-based hypothesis was used to select a few hundred of the extracted features to speed up classification, and these were fed into a 3-layer neural network classifier trained by a back-propagation algorithm.

More recently, deep learning methods have contributed to improving many research areas [1, 2, 43], and FER is no exception [6, 16, 17, 20, 27, 28, 41, 45, 52]. Burkert et al. [6] proposed a CNN architecture for FER using the CK+ and MMI datasets for both training and testing. In [17], the authors proposed a model based on a single Deep Convolutional Neural Network (DNN), which contained convolution layers and deep residual blocks. Khorrami et al. [20] proposed a CNN for FER; they used the CK+ and TFD datasets and introduced an approach to decipher which portions of the face influenced the CNN’s predictions. A combination of a CNN and a specific image pre-processing step for the task of emotion detection was proposed in [28], and a hybrid convolution-recurrent neural network method for FER in images was presented in [16]. Sajjanhar et al. [41] evaluated the Inception and VGG architectures, which are pre-trained for object recognition, and compared their performance with VGG-Face, which is pre-trained for face recognition. In [45], the authors developed a real-time FER system on a smartphone using the CK+, SAIT, SAIT2 and Internet datasets; the Internet dataset was created by the authors, who downloaded face images from the Internet and manually labelled them with five facial expressions. In [52], an ensemble of CNNs with probability-based fusion was presented for FER, where the architecture of each CNN was adapted by using a convolutional rectified linear layer as the first layer and multiple hidden maxout layers. Liu et al. [27] proposed a FER model based on a CNN fused with a double-regularized linear support vector machine (L2-SVM).

Mining the literature, we find diverse datasets used for FER, such as CK+ [29], MMI [49], AffectNet [34] or JAFFE [30], with CK+ being one of the most popular. In Table 1, we summarize the accuracy results reported for some of the architectures developed in the last decade that use the CK+ dataset in their evaluation. Most architectures [6, 20, 27, 28, 33, 41, 45] use k-fold cross-validation [5] to obtain the accuracy results reported in Table 1, except for Jain et al. [17], who performed tests using 98% of the data for training and only 2% for testing. Sajjanhar et al. [41] used the pre-trained VGG-Face model and Liu et al. [27] used the VGG-11 architecture to perform the feature extraction of human facial expressions. Mollahosseini et al. [33] designed a complex architecture using convolutional layers in parallel and combined them to obtain the final result. The works [20, 28, 45] presented better results using simpler architectures than [17, 27, 33, 41]. Although Burkert et al. [6] presented results similar to those reported by Song et al. [45], they used a more complex architecture. Lopes et al. [28] obtained 96.76% accuracy, but the authors tested with only 1 subject for each partition of the cross-validation set and ran the experiment 10 times to select the best result; when using the standard k-fold cross-validation method, their approach (which also includes a pre-processing step) reported an accuracy of 89.7%.

Table 1 Accuracy results of recent models in the literature. These models were trained and tested with the CK+ dataset to classify 6 basic expressions, except for the models presented in [45] and [27] that were trained to classify 5 and 7 facial expressions, respectively

Endeavours addressing cross-dataset evaluation can be found in [33, 54]. Mollahosseini et al. [33] proposed a deep neural network architecture to address the FER problem across multiple well-known standard face datasets. The authors evaluated the accuracy of the proposed architecture in two different experiments, subject-independent and cross-dataset evaluation, using six datasets. The CK+ dataset was one of the datasets with the most accurate results in both experiments: 93.2% and 64.2%, respectively. In the work by Zavarez et al. [54], the influence of fine-tuning on performance in the cross-dataset setting was investigated. To perform this study, the VGGFace Deep Convolutional Network model (pre-trained for face recognition) was fine-tuned to recognize facial expressions. The cross-dataset experiments were organized so that one of the datasets was separated as test set and the others formed the training set, and each experiment was performed multiple times to ensure the robustness of the results. The authors trained with six datasets and tested with five of them, achieving accuracies of 88.58%, 67.03%, 85.97%, 48.67% and 72.55% on the CK+, MMI, RaFD, JAFFE and KDEF datasets, respectively.

Tables 2 and 3 summarize the accuracy results obtained with recent CNN models for both single- and cross-dataset approaches. Both tables show the architectures and datasets used, and the results obtained with each test dataset. Most works focus on proposing a CNN and using a single-dataset approach (Table 2), and only a few works delve into a cross-dataset approach (Table 3). In the cross-dataset case, we find an approach similar to our work: Zavarez et al. [54] mixed six well-known datasets into one training dataset to test with another well-known dataset.

Table 2 Accuracy results of recent models in the literature that use CNN and the single-dataset approach. These models were used to classify 6 basic expressions, except for the models presented in [45] and [16, 27, 52, 54] that were trained to classify 5 and 7 facial expressions, respectively. The single-dataset method consists of training and testing on the same dataset, where one part is reserved for the testing data and the other for the training data
Table 3 Accuracy results of recent models in the literature that use CNN and the cross-dataset approach. These models were used to classify 6 basic expressions, except for the models presented in [45] and [16, 27, 52, 54] that were trained to classify 5 and 7 facial expressions, respectively. The cross-dataset method consists of training with one dataset and testing on another dataset. Acc. one vs one: accuracy obtained by training with one dataset and testing with a different dataset. Acc. average vs one: accuracy obtained by training several datasets separately, taking the average and testing on another dataset. Acc. mixed vs one: accuracy obtained by mixing several datasets in a one dataset for the training and testing it on another dataset

As Table 3 shows, few works used a cross-dataset approach for their evaluation and merged several databases. None delved into the merged data or designed a filtering and data pre-processing protocol to merge them in a homogenized way. Moreover, none evaluated their system with images “in the wild” or compared results with human perception.

Therefore, it is interesting to study the differences between the perception of a machine and that of humans. Works such as [46, 47] focused on comparing the performance of machines and humans. On the one hand, Eskil and Benli [47] proposed a set of muscle activity-based features for FER and demonstrated the representative power of the 18 proposed features on three classifiers (Naive Bayes, SVM and AdaBoost), in addition to presenting a comparison between the recognition rates of humans (72%) and their algorithm (77.8%, 87% and 89%, respectively). On the other hand, Susskind et al. [46] designed an experiment with 23 participants to obtain the human performance and compare it with that of an SVM-based system recognizing the six basic facial expressions; they obtained accuracies of 89.2% and 79.2%, respectively.

For all the above-mentioned reasons, we present an extensive study using a CNN in a cross-dataset approach in order to recognize facial expressions in the wild. First, we define a protocol to select and homogenize the different datasets that will feed the CNN. In this step, it is important to ensure that the images are correctly labelled, especially in web-scraped datasets. Second, we study the impact of the facial image pre-processing step on performance, since the majority of papers pay little attention to this step [6, 16, 17, 20, 33, 41, 45, 52, 54]; one of the works that performed an exhaustive study of this step was presented by Lopes et al. [28]. Third, following the research line of studies such as [33, 54], we analyse the performance of the proposed CNN with respect to other CNNs and analyse to what extent the use of multiple sources in the CNN’s training phase helps during the test phase. Finally, following works such as [46, 47], we compare the performance of our proposal with human perception and study the similarities and differences between both. Unlike [46, 47], we design an experiment with 253 participants and compare the results with the outcomes of the CNN.

3 Materials and methods

In this section, we first present a protocol to create new training datasets from diverse existing datasets. Second, we describe the datasets used in the training and testing of the Convolutional Neural Network (CNN), including the two new datasets: FEGA and FE-Test. Third, we present the proposed FER system as two main modules (see Fig. 1) based on the proposed protocol: a pre-processing step and a CNN.

Fig. 1

The proposed FER system scheme. In the first step, the input image is processed: face detection and alignment, conversion to grayscale and cropping. If the input image belongs to the test set, the image is processed by the trained CNN and classified into one of the 6 basic expressions plus the neutral one. If the input image belongs to the training set, data augmentation is performed and the result is then processed by the CNN to obtain the trained model

3.1 Protocol

We can use two methods to build a facial expression dataset: using actors or labelling images [10]. The former consists of instructing an actor or actress to pose a facial emotional expression. This method requires multiple actors, and capturing the images is a very time-consuming process. Datasets built this way are considered to contain reliable data; they are smaller due to the work involved, but there is the same number of samples for each expression. The latter method, labelling images, implies obtaining the images via web browsers and having experts classify them. This method depends on the subjective classification of the experts and can therefore introduce bias [14].

Regardless of how datasets were built, when working with multiple datasets, researchers need to select them, analyze homogeneity criteria across them and assess the image quality. Therefore, we define a protocol to aid the process of datasets selection and data homogenization. We propose a four-step protocol:

Step 1: Dataset filtering. Based on a set of candidate datasets, we need to filter them considering:

  a. Labelling concordance: the labels must concur across the datasets.

  b. Similar capture conditions: the capture conditions must be similar or within a range of considered restrictions.

  c. Lack of duplicity: images must be singular in a dataset and across datasets. The problem of duplicity is prone to occur when datasets are built by web scraping, which can include duplicate or incorrect images.

  d. Image quality: images must have a minimum resolution and quality.

Step 2: Dataset homogenization. Once the datasets comply with the four requirements, we need to homogenize the images regarding resolution, color space and face alignment. In this way, the steps are:

  e. Face extraction.

  f. Face alignment and scale: align and crop the face; scale images to the same resolution.

  g. Convert the image to a common color space.

Step 3: Data augmentation. To augment the data in the dataset, we apply the following techniques found in the literature:

  h. Gamma correction: apply different illumination changes.

  i. Face translation: apply small translations of the face position.

  j. Mirroring: as faces are not symmetric, apply mirroring.

Step 4: Train and test the CNN with the homogenized and augmented dataset.

3.2 Datasets

In this subsection, we list the eight datasets used at some stage of the study (see Fig. 2). The first six are standard datasets widely used in facial expression studies: the Extended Cohn-Kanade (CK+) dataset [29], the BU-4DFE dataset [53], the JAFFE dataset [30], the WSEFEP dataset [36], the FER+ dataset [4] and the AffectNet dataset [34]. Then, we built two new datasets: FEGA and FE-Test. FEGA is, to our knowledge, the first dataset labelled simultaneously with Facial Expression, Gender and Age (FEGA). FE-Test is a new dataset created with images from the Internet, which is used to evaluate FER on images captured “in the wild”.

Fig. 2

Different datasets used in this work: (a) Some images of the four popular standard datasets in facial expression (CK+, BU-4DFE, JAFFE and WSEFEP). (b) Excerpt of the FEGA dataset where each row corresponds to a gender and age. (c) Some images of datasets which are built using web scraping. (d) Excerpt of the FE-Test dataset

3.2.1 Datasets built with actors commonly used for facial expression studies

The Extended Cohn-Kanade (CK+) dataset [29] (see Fig. 2(a), first row) contains 593 sequences from 123 subjects ranging from 18 to 30 years old. These sequences were labelled based on the subject’s expression of each of the 7 basic emotion categories: anger, contempt, disgust, fear, happiness, sadness, and surprise.

The BU-4DFE dataset [53] (see Fig. 2(a), second row) contains 606 3D facial expression sequences captured from 101 subjects, 58 females and 43 males. For each subject, there are six sequences showing the six prototypic facial expressions (anger, disgust, fear, happiness, sadness and surprise). The Japanese dataset, JAFFE [30] (see Fig. 2(a), third row), contains 213 images of 7 facial expressions (the 6 basic facial expressions plus neutral) posed by 10 female actresses; each image is labelled with one of these expressions. The last dataset is WSEFEP [36] (see Fig. 2(a), fourth row), which contains 210 high-quality pictures of 30 individuals (14 men and 16 women) posing the basic emotions: happiness, surprise, fear, sadness, anger, disgust and neutral. The images were carefully selected to fit the criteria of basic emotions and then evaluated by independent judges.

3.2.2 Datasets built via web scraping used for facial expression studies

Initially, the FER+ [4] and AffectNet [34] datasets (see Fig. 2(c)) were considered to be included in the study. Both datasets are built using web scraping.

FER+ is an extension of the FER dataset, in which the images were re-labelled with 8 facial expressions. This dataset was discarded in the first step of the protocol because it presented a set of limitations. First, the original FER included images with groups of people or even images that did not correspond to a human face; in addition, the classes were not balanced and each image had only one label. Therefore, FER+ was built by cleaning the dataset, removing those images, and having each image re-labelled by 10 crowd-sourced taggers, which enables researchers to estimate an emotion probability distribution per face or to select the predominant emotion by majority vote. However, this new version also has limitations for a cross-dataset evaluation: the image resolution is low (48 × 48 pixels) and, as the images are obtained by web scraping, there are many duplicates. To analyze this problem, we performed a human-assisted study of similarities. First, we carried out an all-to-all comparison using the SSIM algorithm [51]: for each pair of images we calculated its SSIM value, and when the correspondence between two images had a factor greater than 0.7, a human evaluated them to determine whether they were the same image. Out of 35,730 images in the database, 30,756 are unique (86.08%) and 4974 are repeated. As an example, one image (id = 1338) appears 31 times (Fig. 3). To analyze the severity of the duplicates, we calculated the probability (pr) of finding at least two repetitions in a sample of N images randomly selected from the dataset. For N = 32, 64, 128, 256 and 512 we obtain pr = 0.8786%, 3.5053%, 13.4397%, 43.1406% and 89.0754%, respectively. This means that if we train the network with 512 images, approximately 90% (89.0754%) of the time there will be repetitions. Applying k-fold cross-validation with k = 5, the probability of not having a duplicate shared between the test and training sets is practically zero; therefore, the problem of duplicates may appear both in the training stage and in the testing stage (with images duplicated across the test and training sets). Due to the poor resolution and the duplicates problem, this dataset was discarded.
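As a sketch of how such a screening can be reproduced, the snippet below performs the all-to-all SSIM comparison and flags pairs above the 0.7 threshold for manual review; it assumes the images can be loaded as equally sized grayscale arrays, and the helper name is ours, not part of any released code.

```python
import itertools
import cv2
from skimage.metrics import structural_similarity as ssim

def candidate_duplicates(image_paths, threshold=0.7, size=(48, 48)):
    """All-to-all SSIM comparison; pairs above the threshold are only flagged
    for manual inspection, not automatically declared duplicates."""
    images = {p: cv2.resize(cv2.imread(p, cv2.IMREAD_GRAYSCALE), size)
              for p in image_paths}
    flagged = []
    for p1, p2 in itertools.combinations(image_paths, 2):
        score = ssim(images[p1], images[p2])
        if score > threshold:
            flagged.append((p1, p2, score))
    return flagged
```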

Fig. 3

Image repeated 31 times in FER+, with ids: 1338, 3586, 4501, 5009, 6114, 6122, 9019, 10242, 10623, 12419, 13190, 13735, 14164, 14240, 15345, 16319, 18779, 19603, 22155, 22226, 25811, 27804, 28660, 29726, 30548, 32880, 33401, 33448, 33622, 33968 and 34700

AffectNet is also built using web scraping, and its images were labelled by 12 experts. To calculate the agreement level between the experts, 36,000 images were annotated by only two human labellers, who agreed on 60.7% of the images. Furthermore, the dataset presents duplicates, and a set of its images also appears in the FER+ dataset. As AffectNet is ten times bigger than FER+, we did not analyze the duplicates manually. However, we studied the exact duplicates by applying a hash function (MD5) to group the images and looking for exact concordances in each group. The results show that there were 15,244 duplicates, which is 3.63%. Due to these results, this dataset was also discarded.
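The exact-duplicate check can be reproduced with a standard hash grouping, for example as in the sketch below; byte-identical files are grouped by their MD5 digest, and the function name is illustrative.

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(image_paths):
    """Group images by the MD5 of their raw bytes; any group with more than
    one member contains exact duplicates."""
    groups = defaultdict(list)
    for path in image_paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        groups[digest].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# Number of redundant copies (duplicates beyond the first occurrence):
# sum(len(paths) - 1 for paths in exact_duplicate_groups(all_paths).values())
```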

3.2.3 New datasets (FEGA and FE-test)

This paper introduces the FEGA (Facial Expression, Gender and Age) dataset, which was built with 51 subjects, 21 females and 30 males, between 21 and 66 years old. Each subject posed the six basic facial expressions [8] (anger, disgust, fear, happiness, sadness, and surprise) plus the neutral face. For each expression and subject, we captured eight RGB images with a resolution of 640 × 480 pixels, and each subject was labelled with his/her age and gender. The subject performed the seven basic facial expressions, repeating each expression eight times, and one snapshot was taken each time; the images are similar, but not identical, because they were captured at different times. Each image was carefully evaluated by three experts to fit the criteria of the seven basic facial expressions, and outlier images that did not conform to the quality required for a clear perception of the expression were removed. As a result, we present a dataset with 1668 images labelled simultaneously with facial expression, gender and age (see Fig. 2(b)). We also built FE-Test, which contains 210 frontal images of facial expressions labelled by Google and revised by the research team (see Fig. 2(d)). We randomly chose 30 images from the Internet for each expression (anger, disgust, fear, happiness, sadness, surprise and neutral) with different illuminations, backgrounds and image resolutions, in addition to faces of different ages and ethnicities. This dataset was labelled by three experts and all images had the agreement of the experts. It was employed to test our algorithm with images “in the wild” obtained from the Internet.

3.3 Image pre-processing and data augmentation

When CNNs are adopted for any task, one of the most neglected steps is pre-processing. In fact, the general claim is that a deep model can manage any data variations thanks to its huge number of parameters (e.g., weights). The basic assumption is the large availability of data: in our case, facial images labelled with the related expression according to Paul Ekman’s model. However, contrary to the general claim, in this section we highlight the importance of the pre-processing step, which can significantly affect the final network’s performance.

The first step is to detect the face using the method proposed in [26]. Then, we align the images to eliminate rotations and achieve uniformity between them. Eye positions are found using the 68 facial landmarks proposed in [40], which established the first standardized benchmark for facial landmark localization. To estimate the face’s landmark positions we use the Dlib library, which uses the ensemble of regression trees proposed in [19]. From these landmarks, we calculate the geometric centroid of each eye and the distance between them. We then draw a straight line (see Fig. 4) to obtain the angle by which to rotate the image. The rotation of the axis that crosses both eyes is then compensated to align the eyes horizontally and, finally, the face is cropped (see Fig. 4). All images are converted to grayscale within the range 0 to 255 and resized to 150 × 150 pixels.

Fig. 4

Left image: face detection and eyes detection. Center image: angle to rotate the image. Right image: face alignment and image cropping
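A condensed sketch of this alignment step is given below, using the Dlib frontal face detector, the standard 68-landmark predictor file and OpenCV. The eye landmark indices follow the usual 68-point convention, and the cropping simply reuses the detector’s box on the rotated image, which is a simplification of our actual cropping.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(bgr_image, output_size=150):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    face = detector(gray, 1)[0]                       # first detected face
    pts = predictor(gray, face)
    # Geometric centroid of each eye (68-point convention: 36-41 and 42-47).
    eye_a = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(36, 42)], axis=0)
    eye_b = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(42, 48)], axis=0)
    # Angle of the line crossing both eye centroids.
    angle = np.degrees(np.arctan2(eye_b[1] - eye_a[1], eye_b[0] - eye_a[0]))
    center = (float((eye_a[0] + eye_b[0]) / 2), float((eye_a[1] + eye_b[1]) / 2))
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(gray, rot, (gray.shape[1], gray.shape[0]))
    # Crop around the detected face box (simplification) and resize.
    crop = aligned[max(face.top(), 0):face.bottom(), max(face.left(), 0):face.right()]
    return cv2.resize(crop, (output_size, output_size))
```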

A second important step is to meet the assumption of having a large enough number of training samples containing significant facial variations. At first sight, the aim of this step conflicts with the pre-processing step, which seeks to normalize the face images, that is, to reduce the variations as much as possible. However, it is worth noting that our pre-processing algorithm aims at reducing pose and scale variations, not lighting or appearance ones. Therefore, in the augmentation step we seek to maintain the basic variations of the input data and to add new ones in terms of lighting and appearance. With regard to the lighting conditions, we use the gamma correction technique. Eq. (1) is used to adjust the value of gamma,

$$ y={\left(\frac{x}{255}\right)}^{\frac{1}{\gamma }}\cdotp 255 $$
(1)

where x is the original image, y is the new image and γ is the value modified to vary the illumination. We use γ = 0.5, γ = 1.5 and γ = 2.0 to obtain a perceptible variation of the original image. For values of γ outside this interval [0.5, 2.0], the face cannot be distinguished. In this way, we quadruplicate the data. Logically, γ = 1 is not applied, because it does not modify the image (see Fig. 5(a)).

Fig. 5

a Images with different illuminations using the gamma correction technique. b Images with different geometric changes. Top row: translation of 4 pixels. Bottom row: horizontal flip

Finally, we introduce some geometric variations aimed at compensating for small errors in the position of the eyes during eye localization. The variations are achieved by translating the image four pixels in both axes and cropping it (so that the face is always present with two eyes, nose and mouth), and by introducing a small appearance variation by duplicating the images through a horizontal flip (see Fig. 5(b)).
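The augmentation described above (Eq. (1) plus translations and the horizontal flip) can be sketched as follows; the exact set of translation offsets shown here is illustrative.

```python
import cv2
import numpy as np

def gamma_correct(img, gamma):
    """Eq. (1): y = (x / 255)^(1/gamma) * 255, applied pixel-wise."""
    return np.clip((img / 255.0) ** (1.0 / gamma) * 255.0, 0, 255).astype(np.uint8)

def augment(img, shift=4):
    # Quadruplicate the data with the three gamma values used in the text.
    out = [img] + [gamma_correct(img, g) for g in (0.5, 1.5, 2.0)]
    h, w = img.shape[:2]
    # Small translations of the face position (illustrative offsets).
    for dx, dy in [(shift, 0), (-shift, 0), (0, shift), (0, -shift)]:
        m = np.float32([[1, 0, dx], [0, 1, dy]])
        out.append(cv2.warpAffine(img, m, (w, h)))
    # Horizontal mirroring of every generated image.
    out += [cv2.flip(x, 1) for x in out]
    return out
```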

3.4 The proposed CNN

Hinton et al. [15] claim that a network with three hidden layers forms a very good generative model of the joint distribution of the images to be classified and their labels. Starting from this basis and analysing the different architectures proposed in [6, 20, 22, 45], we implement a model with more than three hidden layers to recognize facial expressions. We tune this architecture and its parameters empirically to improve on the results presented in other works [6, 20, 22, 45] (see Table 17).

Table 4 shows the model with three, four, five, and six convolutional layers, using each dataset (BU-4DFE, CK+, JAFFE, FEGA and WSEFEP) separately by means of k-fold cross-validation. This consists of splitting the dataset into k groups and using (k−1) groups as training set and the remaining one as testing set. We perform k-fold cross-validation with k = 5, since this value has been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance [18].
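This protocol can be written, for instance, with scikit-learn’s KFold; train_and_score is a hypothetical wrapper around our training and evaluation code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, labels, train_and_score, k=5, seed=0):
    """Mean accuracy over k folds; (k-1) folds train, the remaining one tests."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = [train_and_score(images[tr], labels[tr], images[te], labels[te])
              for tr, te in kf.split(images)]
    return np.mean(scores), np.std(scores)
```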

Table 4 Test with different numbers of convolutional layers to classify 6 classes (one per facial expression). The best result (highest mean) is shown in bold

The mean of the accuracies obtained with each dataset after each step of the cross-validation is also shown in Table 4. The highest mean accuracy is obtained with 5 convolutional layers. The addition of more convolutional layers does not improve the results.

The final CNN model is depicted in Fig. 6. Our network receives as input a 150 × 150 grayscale image and classifies it into one of six classes: anger, disgust, fear, happiness, sadness and surprise. The CNN architecture consists of 5 convolutional layers, 3 pooling layers and two fully connected layers. The first layer of the CNN is a convolutional layer that applies a kernel size of 11 × 11 and generates 32 feature maps of 140 × 140 pixels. This layer is followed by a pooling layer that uses max-pooling, with a kernel size of 2 × 2 and stride 2, to reduce the image to half of its size. Subsequently, another two convolutional layers are applied, each with a 7 × 7 kernel and 32 filters. This is followed by another pooling layer, with a kernel size of 2 × 2 and stride 1, two more convolutional layers, each with a 5 × 5 kernel and 64 filters, and two fully connected layers of 512 neurons each. The first fully connected layer also uses dropout [12] to avoid overfitting during training. Finally, the network has one output node for each expression, connected to the previous layer; the output node with the maximum value determines the expression assigned to the image.

Fig. 6

Architecture of the proposed CNN with 5 convolutional layers, 3 pooling layers and 2 fully connected layers
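For illustration, the architecture can be sketched in PyTorch (a re-implementation sketch, not the original code). The text above explicitly locates two of the three pooling layers, so only those two are placed here, and the flattened size (64 × 49 × 49) follows from the resulting feature-map dimensions.

```python
import torch.nn as nn

class FerCNN(nn.Module):
    """Sketch of the proposed CNN: 5 conv layers, pooling layers and 2 FC layers."""
    def __init__(self, num_classes=6, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=11), nn.ReLU(),   # 150 -> 140
            nn.MaxPool2d(2, stride=2),                     # 140 -> 70
            nn.Conv2d(32, 32, kernel_size=7), nn.ReLU(),   # 70 -> 64
            nn.Conv2d(32, 32, kernel_size=7), nn.ReLU(),   # 64 -> 58
            nn.MaxPool2d(2, stride=1),                     # 58 -> 57
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(),   # 57 -> 53
            nn.Conv2d(64, 64, kernel_size=5), nn.ReLU(),   # 53 -> 49
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 49 * 49, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes),                   # one output node per expression
        )

    def forward(self, x):                                  # x: (batch, 1, 150, 150)
        return self.classifier(self.features(x))
```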

Weight initialization is an important step in neural networks, as a careful initialization of the network can speed up the learning process and provide better accuracy after a fixed number of iterations. Therefore, we carried out a study of the weight initialization techniques most used in CNNs. Table 5 shows accuracy results with different weight initializations, which consist of combinations of the Xavier [12], MSRA [25] and Gaussian [11] methods; the Gaussian method uses a standard deviation of 0.01. We trained our CNN using k-fold cross-validation (with k = 5) with each initialization method. As shown in Table 5, the Xavier method and the combination of the Gaussian and MSRA methods result in the highest average accuracy values (marked in bold).

Table 5 Test with different initializations for each dataset. The mean and standard deviation are shown for each initialization

Moreover, the Gaussian+MSRA initialization obtains a lower standard deviation than the Xavier initialization, meaning that all the accuracy values are close to the average. For these reasons, we decided to use the Gaussian+MSRA initialization in all our experiments (i.e., a Gaussian filler for the convolutional layers and an MSRA filler for the fully connected layers). The loss is calculated using a logistic function of the softmax output, as in several related works [35, 41, 52]; the activation function of the neurons is the ReLU, which generally learns much faster in deep architectures; and the method used to update the weights between neurons is the Adam method [21], since it shows better convergence than other methods.
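In PyTorch terms, these choices map roughly onto the following sketch (reusing the FerCNN class above); the Gaussian and MSRA fillers correspond to normal and Kaiming initializers, and default Adam hyper-parameters are assumed since none are specified here.

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Gaussian filler (std = 0.01) for convolutional layers,
    # MSRA/He initialization for fully connected layers.
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight)
        nn.init.zeros_(module.bias)

model = FerCNN(num_classes=6)
model.apply(init_weights)
criterion = nn.CrossEntropyLoss()                    # softmax + logistic (log) loss
optimizer = torch.optim.Adam(model.parameters())     # Adam update rule [21]
```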

3.5 Comparison between our CNN architecture and other recent architectures in the literature

We have compared our architecture with recent proposals in the literature (see Table 6). Note that the architectures in [6, 17, 33] are more complex than the others. In [17], the authors use 6 convolutional layers and 2 residual blocks, each consisting of 4 convolutional layers. The architecture presented in [6] uses 1 convolutional layer and 2 blocks, where each block consists of two parallel paths: the first path uses 2 convolutional layers, and the second path uses 1 pooling layer and 1 convolutional layer. In [33], they use 2 convolutional layers and 3 modules, each consisting of 4 parallel convolutional layers.

Table 6 Results of recent models in the literature. These models have been trained and tested with the CK+ dataset to classify facial expressions. *The authors used a more complex architecture

4 The pre-processing step evaluation

In this Section we present an evaluation of the image pre-processing step, showing that it is relevant for improving performance despite the intrinsic complexity of a CNN. We show how the different pre-processing steps contribute to the facial expression classification accuracy by adding them one at a time. Figure 7(a) summarizes the experiment performed in this step, identifying the datasets and CNN used.

Fig. 7

Summary of the set of experiments performed in this work: (a) pre-processing step evaluation; (b) CNN’s evaluation; (c) Evaluation of our system with human perception

4.1 Procedure

In this experiment we use our CNN model with each dataset (BU-4DFE, CK+, JAFFE, FEGA and WSEFEP) separately, using 80% of the images as training set and the other 20% as testing set. The training was performed using k-fold cross-validation with k = 5 and only 60 epochs each time. The combinations of pre-processing steps evaluated are: (a) original images without image pre-processing, (b) face alignment and crop, (c) face alignment, crop, and illumination variations using the gamma correction technique, and (d) face alignment, crop, illumination variations, and geometric changes. Each pre-processing combination has been used with all datasets and the six basic expressions (see Table 7). Finally, we compare our results with those presented in [28], which employs a similar image pre-processing and the CK+ dataset (see Table 8).

Table 7 Test with different image pre-processing with CK+ dataset. Pre-processing steps: (a) no pre-processing, (b) face alignment and crop, (c) face alignment, crop and illumination variations, and (d) face alignment and crop, illumination variations and geometric changes
Table 8 Comparison of results with a similar work in the literature [28] to classify the six basic facial expressions: Anger (AN), Disgust (DI), Fear (FE), Happiness (HA), Sadness (SA) and Surprise (SU). Pre-processing steps: (d) face alignment, crop, illumination variation and geometric changes, and (h) face alignment, crop, down-sampling, normalization and generation of 30 samples more

4.2 Results and discussion

The results are shown in Tables 7 and 8. Table 7 shows that image alignment alone already adds a great improvement to the classification accuracy in all cases. However, the best results are achieved with the last option (which incorporates all the pre-processing steps): each step improves the results by an average of 8.1% with respect to the previous one, with an average improvement of 32.4% between the first step (a) and the last (d).

In Table 8 we compare our results with those presented in [28], which used the CK+ dataset for their experiment. In [28], the image pre-processing used was face alignment, cropping (only the face, without hair), down-sampling of the face image to 32 × 32 pixels, normalization of the image intensity, and generation of 30 more samples (see Table 8, step (h)). Although both image pre-processing pipelines are quite similar, the main difference is that we apply the horizontal flip and vary the illumination in order to obtain more data diversity, instead of applying down-sampling and normalizing the image intensity [28]. The results show that our proposed pre-processing improves the results by an average of 3.45% with respect to those reported in [28].

5 The CNN’s evaluation

In this Section we present five experiments to evaluate the accuracy of our system (our proposed CNN with image pre-processing) and we also evaluate the accuracy with the new dataset FEGA. Figure 7(b) summarizes the set of experiments performed in this work with the CNN.

5.1 Experiment 1. Subject-independent evaluation

The aim of this experiment is to present a comparative study between our results classifying the six basic expressions and recent studies in the literature. The results obtained with each dataset are compared with studies that used the same dataset.

5.1.1 Procedure

In this experiment, our CNN model is evaluated with each dataset (BU-4DFE, CK+, JAFFE, FEGA and WSEFEP) separately by means of k-fold cross-validation using k = 5 and 60 epochs each time. The pre-processing step (explained in subsection 3.3) is the same for all datasets. Finally, we compare our results with related works to classify 6 expressions (see Table 9(a)) in terms of accuracy.

Table 9 (a) Comparison of subject-independent results with related works to classify 6 expressions. The best results are shown in bold text. (b) Comparison between experiment 1 and 2. Experiment 1: Subject-independent evaluation. Experiment 2: Cross-datasets evaluation (one dataset as training set against four datasets as testing set). Both experiments classified 6 expressions. The best results are shown in bold text

5.1.2 Results and discussion

The results are shown in experiment (a) of Table 9. Although the results with the JAFFE dataset can be improved, we demonstrate empirically that our results are competitive with other recent and successful works. The JAFFE dataset contains only 10 actresses and needs more data augmentation to achieve good results. However, as shown in experiment (a) of Table 9, we obtain better results for the BU-4DFE and WSEFEP datasets and, for the CK+ dataset, our results are close to the ones published in [17, 33]. Evidently, the FEGA dataset cannot be compared with other works because it is a novel dataset presented in this research work.

5.2 Experiment 2. Cross-datasets evaluation

The aim of this experiment is to show how merging information captured under different conditions significantly helps in the network’s training. We evaluate the FEGA dataset and our model using a cross-datasets approach. Additionally, we compare the results between the first and the second experiments.

5.2.1 Procedure

Unlike experiment 1 (see Table 9, experiment (a)), where good classification results are obtained using the same dataset to train and test the system, this no longer holds when the training and test datasets are different (cross-datasets approach). In this experiment, our CNN model is evaluated on each dataset (BU-4DFE, CK+, JAFFE, FEGA and WSEFEP) in a cross-datasets approach using 60 epochs each time. The pre-processing that we use (see subsection 3.3) is the same for all datasets. We divide this experiment into three steps: (1) compare the results between the first and the second experiment to classify 6 expressions (see Table 9), (2) explore whether the novel FEGA dataset can be used in a FER system, and (3) study whether each dataset adds relevant information to the training in order to ensure diversity in images under different illuminations and backgrounds (see Table 10).

Table 10 Results of classification of the different datasets. We use the five combined datasets (BU-4DFE, FEGA, JAFFE, WSEFEP and CK+) as training set and we test it with each dataset applying k-fold cross-validation

5.2.2 Results and discussion

The results are shown in Tables 9 and 10. Table 9 shows a comparison between both experiments, using the CNN and the image pre-processing. As we can see, when we apply the cross-datasets approach the results are, in general, worse than with the subject-independent approach. The reason is that each dataset contains images of people with different ethnicities and ages, and with different illuminations and backgrounds. For example, the BU-4DFE dataset contains Asian, Black, Latin and White people. The CK+ dataset contains mostly Euro-American people and, in a minority, Afro-American people and other ethnic groups. Both the FEGA and WSEFEP datasets contain mostly white people. Finally, the JAFFE dataset contains only Japanese females. Therefore, it is expected that when the training sets are the FEGA and WSEFEP datasets, we obtain better results when testing against the CK+ dataset, because it mostly contains Euro-American people. However, CK+ and BU-4DFE obtain the best results when tested against the WSEFEP dataset. This suggests that CK+ and BU-4DFE contain an adequate number of white people to be tested with other datasets of white people, such as the WSEFEP dataset. On the other hand, the JAFFE dataset is very small for training, and the accuracy results when testing against any other dataset are very low.

We also analyze whether our new FEGA dataset is a good dataset to train a facial expression recognition system. In Table 9, in the cross-datasets experiments (b), the CNN trained with FEGA achieves results of 73.19% and 78.19% against the WSEFEP and CK+ datasets, because FEGA contains mostly Caucasian people. In the case of JAFFE and BU-4DFE, we obtain worse results (37.70% and 44.67%) because both datasets contain Asian people and, in the case of BU-4DFE, also Afro-American people. Therefore, FEGA can be considered a good dataset to train facial expressions, since it produces good results in a cross-datasets approach when tested with white people, as in the WSEFEP and CK+ datasets.

This suggests that a solution to obtain successful results is a good combination of different datasets to train the system, covering all types of ethnicity, age and gender with different illuminations and backgrounds. With this hypothesis in mind, Table 10 shows the combination of the five datasets as training set, in order to verify whether each dataset adds important information to the training and ensures diversity in the images regarding ethnicity, age and gender under different illuminations and backgrounds. We test it with each dataset applying k-fold cross-validation with k = 5. The high accuracy results in Table 10 show that the CNN classifies the different test datasets well when it is trained with data of sufficient diversity. Hence, we can claim that each dataset adds important value to the training. This may not only be due to the diversity in the population, but also to the different capture conditions of each dataset. Therefore, in experiment 3, we evaluate different combinations of datasets for the training set and use the other datasets as testing sets.

5.3 Experiment 3. Different combinations of datasets

In this experiment, we evaluate exhaustively the contributions to learning of different combinations of datasets.

5.3.1 Procedure

Once the test set is selected, we need to identify the best datasets for training. One may expect that using more datasets leads to better learning results. However, the network needs new information to learn; therefore, the inclusion of a new dataset will be beneficial when it adds new cases with new information. Consequently, we need to determine which datasets contribute the most for the selected test set.

To analyze the accuracy when combining different datasets, we create all subset combinations of 5 datasets to train the CNN. Then, we compare the results to study the importance of including different datasets.

5.3.2 Results

The results for different combinations of datasets are presented in Tables 11, 12 and 13. Table 11 shows the accuracy when testing with BU-4DFE and FEGA. Table 12 presents the results testing with JAFFE and WSEFEP and finally, Table 13 shows the results for the CK+ dataset. We show the best results in bold.

Table 11 Results with different combinations of datasets for the testing set of BU-4DFE (left) and FEGA (right)
Table 12 Results with different combinations of datasets for the testing set of JAFFE (left) and WSEFEP (right)
Table 13 Results with different combinations of datasets for the testing set of CK+

In general, combining four datasets achieves better results in most cases. In Table 14, we show results with the following combinations of datasets for training: (1) FEGA, CK+, BU-4DFE and WSEFEP, (2) FEGA, CK+, BU-4DFE and JAFFE, (3) FEGA, BU-4DFE, JAFFE and WSEFEP, (4) FEGA, CK+, JAFFE and WSEFEP, and (5) JAFFE, CK+, BU-4DFE and WSEFEP. These results are compared with other related works [28, 33, 54] that use CNNs with six classes (one for each facial expression). It can be seen that a good combination of training datasets improves the results. Our results are better in most cases; only the results presented in [54] are better when testing with CK+. To our knowledge, only [54] combines several datasets. In [33], they train with the MultiPIE, MMI, DISFA, FERA, SFEW, and FER2013 datasets separately, use CK+ as testing set and average the accuracy results over this testing set (Table 14 (f)). In [54], they combine the JAFFE, MMI, RaFD, KDEF, BU3DFE and ARFace datasets to test with CK+ (Table 14 (g)), and combine the CK+, MMI, RaFD, KDEF, BU3DFE and ARFace datasets to test with JAFFE (Table 14 (h)). Unfortunately, we have not found works that test with the WSEFEP dataset in a cross-datasets evaluation scenario, and obviously the FEGA dataset is tested here for the first time.

Table 14 Comparison of cross-datasets results with related works to classify 6 expressions. Combinations of datasets: (a) FEGA, CK+, BU-4DFE and WSEFEP, (b) FEGA, CK+, BU-4DFE and JAFFE, (c) FEGA, BU-4DFE, JAFFE and WSEFEP, (d) FEGA, CK+, JAFFE and WSEFEP, (e) JAFFE, CK+, BU-4DFE and WSEFEP, (f) MultiPIE, MMI, DISFA, FERA, SFEW, and FER2013, (g) JAFFE, MMI, RaFD, KDEF, BU3DFE and ARFace, and (h) CK+, MMI, RaFD, KDEF, BU3DFE and ARFace

Therefore, in summary, the combination of several datasets to train the system improves the results according to Tables 11 and 14. Based on these findings, we detail the results obtained when we train with the four datasets of case (c) of Table 14, and when we train with the five datasets (see Table 15). We performed k-fold cross-validation with k = 5 to classify both six and seven expressions using our CNN. That is, we separate these datasets (4 combined DBs and 5 combined DBs) into 5 blocks, used both for the training set and for the test set. For example, we train with blocks 1, 2, 3 and 4 (with data augmentation) and we test with block 5 (without data augmentation). Each block consists of a fifth of all the combined datasets (BU-4DFE, CK+, JAFFE, FEGA and WSEFEP).

Table 15 Comparison between results with four (4 dB) and five (5 dB) combined datasets to classify 6 and 7 expressions

5.3.3 Discussion

To maximize the success of a neural network model Ʀ using N datasets, we define the set of datasets used for learning as D = {D1, D2, ···, DN}, where N corresponds to the number of available datasets. To select the best combination of datasets, we need to test all possible combinations for each subset of D (see Tables 11, 12 and 13), except for the empty set ∅. Each table is divided into four groups of combinations (without combining datasets, 2 combined datasets, 3 combined datasets and 4 combined datasets). The number of combinations to test is card(Ƥ(D)) − 1 = 2^N − 1. For example, with four datasets, we need to train and evaluate the network 2^4 − 1 = 15 times. If possible, we would like to reduce the number of combinations to those that contribute the most, since it would be impracticable, for example, to use 30 datasets.

We define the function f(Ʀ, T), where T is the test dataset and Ʀ is a neural network model; f returns the subset of D that achieves the best accuracy. We also define the functions fk(Ʀ, T) for k in [1, N], where fk returns the subset of D with k elements that achieves the best accuracy. For example, f3(Our Model, JAFFE) would return {WSEFEP, BU-4DFE, FEGA}. Note that f(Ʀ, T) can be computed from all fk(Ʀ, T) by comparison. As we can see in Tables 11, 12 and 13, for each test set we train 15 different combinations, which means a high computational cost when the number of datasets increases.

In each group of each table (Tables 11, 12 and 13), the best result is highlighted in bold. Note that the best result of each group contains the dataset of the previous combination. This gives a hint on how to reduce the number of combinations of datasets that need to be checked in order to obtain the optimum result.

We propose an iterative procedure in which k datasets are used at each step. If we denote by Bk the optimum set of datasets used at step k (Bk = fk(Ʀ, T)), then Bk is defined as Bk−1 ∪ Dk*, where Dk* is the dataset in D − Bk−1 that maximizes the accuracy of the combined dataset Bk−1 ∪ Dj, for Dj ∈ D − Bk−1. By definition, B0 = ∅. Under this premise, we can obtain the best training with a computational cost of order N². An example for N = 4 is shown in (2),

$$ {B}_0=\varnothing, \quad {B}_1=\left\{{D}_3\right\},\quad {B}_2=\left\{{D}_3,{D}_1\right\},\quad {B}_3=\left\{{D}_3,{D}_1,{D}_4\right\},\quad {B}_4=\left\{{D}_3,{D}_1,{D}_4,{D}_2\right\}=D $$
(2)

where Dj is the dataset added at each step that maximizes the accuracy of the model Ʀ on the test set T; when k = N, BN = D. With this procedure, we reduce the number of trainings needed to obtain BN to:

$$ \sum_{i=1}^{N} i=\frac{N\left(N+1\right)}{2} $$
(3)

For example, if N = 100, instead of training the neural network Ʀ with 2^100 − 1 combinations of datasets, we would reduce the number of trainings to 5050 combinations (4).

$$ \sum_{i=1}^{100} i=\frac{100\cdot \left(100+1\right)}{2}=5050 $$
(4)

One aspect to keep in mind is that adding a new dataset does not always improve the results. Therefore, we must determine the value of k for which the accuracy of the model trained on Bk and evaluated on T is maximum.
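The procedure can be summarized with the following sketch, where train_and_eval is a hypothetical callable that trains the model Ʀ on a list of datasets and returns its accuracy on the test set T.

```python
def greedy_dataset_selection(datasets, train_and_eval):
    """Forward selection: B_k = B_{k-1} ∪ D_k*, where D_k* maximizes accuracy.
    Requires N*(N+1)/2 trainings instead of 2^N - 1."""
    remaining = list(datasets)
    selected, history = [], []
    while remaining:
        scores = {d: train_and_eval(selected + [d]) for d in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), scores[best]))
    # Adding datasets does not always help, so return the best B_k found.
    return max(history, key=lambda item: item[1])
```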

Finally, in order to know the test accuracy on each dataset, we separate the images of this test block according to their source dataset. Results improve in all cases when training with five datasets, except when testing with BU-4DFE, which obtains similar results in the case of seven expressions. On the whole, we improve the test accuracy to 80.1% when classifying seven expressions and to 81.12% when classifying six expressions. In addition, we also improve on the accuracy achieved when only one dataset is used for training and testing (see Table 16).

Table 16 Comparison between results with five combined datasets (5 dB) and the results of the subject-independent experiment (1 dB) for the classification of 6 expressions

5.4 Experiment 4. Comparison of our system with other architectures

The aim of this experiment is to compare the performance of our system regarding other existing architectures.

5.4.1 Procedure

To verify the proper functionality of the CNN of our system, we have compared it with several CNNs [6, 20, 22, 45], using the five combined datasets for training and applying our image pre-processing. We implemented the architectures of [6, 20, 45] following the descriptions in the corresponding papers; these models were specifically created for the FER task. Moreover, we also tested the performance of the well-known AlexNet [22] architecture, which is available in the Caffe framework.

5.4.2 Results and discussion

The results of each CNN are shown in Table 17. Our CNN shows the best results in most cases, except for the BU-4DFE test dataset, where the network of Song et al. [45] achieves better results. Therefore, we can claim that our CNN is competitive with respect to other existing CNNs and that it works well for FER.

Table 17 Results of the five (5 dB) combined datasets to classify 7 expressions using different architectures and the same image pre-processing

5.5 Experiment 5. Evaluation of an unknown test dataset

The aim of this experiment is to analyze the accuracy of our system recognizing 6 facial expressions using an unknown dataset (FE-test).

5.5.1 Procedure

We employ our CNN and image pre-processing steps to classify the expressions of the FE-Test dataset. We first study the performance of the system in discriminating between 6 expressions. The system is trained in two ways: with each of the five datasets from the previous Sections separately (WSEFEP, BU-4DFE, CK+, JAFFE, FEGA), and with the five datasets combined (5 dB). Each training set is run 10 times and the results show the mean of these 10 runs. Second, we select the best training set, train different CNNs [6, 20, 22, 45] with it, and compare their results with that of our CNN.

5.5.2 Results and discussion

Table 18 shows that the best results are obtained with the combination of the five datasets. As we can see, the results improve by up to 41.49 points (the worst result, 31.56%, is achieved with JAFFE as training dataset, while using the combination of DBs for training results in an accuracy of 73.05%), and by 11.27 points with respect to the best result obtained with only one dataset.

Table 18 Results of the test FE-Test (6 expressions) using different datasets as training set

Besides our model, we use different models to perform the experiment with the training set that has obtained the best result in Table 18 (5 dB). Since the combination of the five datasets obtains a considerable improvement in the results, we compare this result (73.05%) with the other models described in the Experiment 4. This comparison is shown in Table 19. For each model, we show the mean in the accuracy and the standard deviation. For this set of architectures, the best result is obtained with our model, with which we achieve the highest accuracy and the lowest standard deviation.

Table 19 Results of the test FE-Test (6 expressions) using different CNN

To confirm the improvement when using the cross-dataset approach, we carry out a statistical analysis [55] to assess if the use of five datasets improves the results compared with a single dataset for training.

We check statistically whether the use of 5 databases improves the results compared to training with a single database. The experiment trains the proposed CNN using the 5 combined databases and each of CK+, BU-4DFE, FEGA, JAFFE and WSEFEP individually (6 populations), and measures the accuracy obtained on an independent database: FE-Test. For each case, we train 10 times, giving a total of 60 samples (10 per training dataset).

To test the hypothesis, we perform an ANOVA test, where the null hypothesis is the equality of the means and the alternative hypothesis is that at least two populations have different means. To apply ANOVA, the populations must follow a normal distribution and must fulfil homoscedasticity.

We perform the Shapiro-Wilk test to verify normality and the Bartlett test to verify homoscedasticity. The Shapiro-Wilk test yields the following values of the statistic W: 0.91962, 0.92177, 0.89428, 0.93899, 0.96621 and 0.90559 (with p values 0.332, 0.372, 0.1894, 0.5418, 0.8537 and 0.252), so we assume that the six populations come from normal distributions. In Bartlett’s test, the statistic follows a chi-square distribution with 5 degrees of freedom (having 6 populations), resulting in X² = 10.764 and a p value of 0.05626, so homoscedasticity can be assumed.

The ANOVA test clearly rejects the null hypothesis (at least two means are different), with a p value < 2·10⁻¹⁶. The pairwise comparison using Tukey HSD shows that the only pair whose means cannot be distinguished is WSEFEP-FEGA, with a p value of 0.8958134; for all remaining pairs the p values are lower than 0.005.
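This statistical pipeline can be reproduced with standard Python libraries (SciPy and statsmodels). The sketch below uses illustrative placeholder accuracy values, arranged as six groups of ten, in place of the measured ones.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative placeholder values: in the paper each group holds the 10 FE-Test
# accuracies measured after the 10 training runs of one training configuration.
rng = np.random.default_rng(0)
names = ["CK+", "BU-4DFE", "FEGA", "JAFFE", "WSEFEP", "5dB"]
groups = {n: rng.normal(loc=50 + 4 * i, scale=1.5, size=10) for i, n in enumerate(names)}

# 1) Shapiro-Wilk test for normality of each population
for name, vals in groups.items():
    w, p = stats.shapiro(vals)
    print(f"Shapiro-Wilk {name}: W={w:.4f}, p={p:.4f}")

# 2) Bartlett test for homoscedasticity across the six populations
chi2, p_bartlett = stats.bartlett(*groups.values())
print(f"Bartlett: X2={chi2:.3f}, p={p_bartlett:.4f}")

# 3) One-way ANOVA: null hypothesis of equal means
f_stat, p_anova = stats.f_oneway(*groups.values())
print(f"ANOVA: F={f_stat:.3f}, p={p_anova:.2e}")

# 4) Tukey HSD pairwise comparison of the six means
values = np.concatenate(list(groups.values()))
labels = np.concatenate([[n] * 10 for n in groups])
print(pairwise_tukeyhsd(values, labels))
```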

Statistical analysis confirms that using all five datasets improves performance over using only one dataset. Figure 8 shows the box plot for the 6 populations.

Fig. 8 Box plot of the accuracy results obtained when training the network with the five datasets combined and with each dataset individually. Accuracy is computed on FE-Test. The network is trained 10 times for each population (training configuration), giving a total of 60 samples

6 Human performance evaluation

In this Section we first analyze and reflect on the subjectivity inherent to human nature when labelling facial expression images. Then, we present an experiment to study how close the results obtained by our system are to those obtained by human perception. Figure 7(c) summarizes the experiments performed in this step. We carry out two evaluations with the FE-Test dataset: in the first, humans classify the images, while in the second, we use the CNN. The task is to classify the images into the six basic emotions (anger, disgust, fear, happiness, sadness and surprise) plus the neutral one. As already mentioned, the FE-Test dataset contains facial expression images "in the wild" collected from the Internet and validated by the research team, with a wide range of backgrounds and lighting conditions. Therefore, recognition by the CNN is expected to be more difficult under these conditions. Further, since FE-Test has not been previously used in other experiments, we also observe how the system performs on new data.

6.1 How subjective is facial expression recognition by human experts?

To analyze the agreement level in image labelling, we evaluate four experts labelling images obtained via web scraping, specifically 1000 randomly selected images, with no duplicates, from the AffectNet dataset. To build the final sample, we generate 100 candidate samples and select the one whose class proportions are most similar to those of the full dataset according to a chi-square similarity measure. The number of images selected for each class and the number of images provided by AffectNet were: Neutral 182/75374, Happy 320/134915, Sad 57/25959, Surprise 33/14590, Fear 18/6878, Disgust 11/4303, Anger 59/25382, Contempt 10/4250, None 78/33588, Uncertain 23/12145 and Non-Face 209/82915. Since every class contains more than five images, a chi-square contrast can be applied without correction.
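A possible implementation of this selection step is sketched below under simplifying assumptions: it works on class labels only (drawing the actual images without duplicates is omitted) and uses the AffectNet class counts listed above as the reference proportions.

```python
import numpy as np
from scipy.stats import chisquare

# Class counts of the full dataset (AffectNet figures listed above).
full_counts = {"Neutral": 75374, "Happy": 134915, "Sad": 25959, "Surprise": 14590,
               "Fear": 6878, "Disgust": 4303, "Anger": 25382, "Contempt": 4250,
               "None": 33588, "Uncertain": 12145, "Non-Face": 82915}
classes = list(full_counts)
probs = np.array([full_counts[c] for c in classes], dtype=float)
probs /= probs.sum()

SAMPLE_SIZE, N_CANDIDATES = 1000, 100
rng = np.random.default_rng(0)

best_stat, best_counts = np.inf, None
for _ in range(N_CANDIDATES):
    # Simulate drawing a candidate sample of 1000 labels; in practice the images
    # themselves are drawn from the dataset without duplicates.
    labels = rng.choice(classes, size=SAMPLE_SIZE, p=probs)
    observed = np.array([(labels == c).sum() for c in classes])
    expected = probs * SAMPLE_SIZE          # counts expected under the full-dataset proportions
    stat, _ = chisquare(observed, f_exp=expected)
    if stat < best_stat:                    # keep the candidate closest to the reference proportions
        best_stat, best_counts = stat, observed

print("Best candidate chi-square statistic:", round(best_stat, 2))
print(dict(zip(classes, best_counts.tolist())))
```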

Four experts participated: the expert or experts that labelled the AffectNet dataset (EA) and the three experts that achieved the most accurate results in the experiment presented in [39], namely 76.7% (E10), 75.1% (E4) and 73.2% (E6), respectively. In this work these experts are referred to as E1, E2 and E3. The aim of the study is to determine the agreement level among experts, so no ground truth is assumed. The level of agreement is summarized in Table 20.

Table 20 Experts agreement level with a sample of 1000 images from AffectNet

The agreement level is measured with Cohen's kappa index (see Table 21). A reliable agreement between two experts requires an index above 0.60 in the two-class case, although this threshold can be lowered when the number of classes is high (11 expressions here) [24]. In all cases, we observe a moderate agreement (0.41 ≤ k ≤ 0.60).

Table 21 Cohen's kappa index for the experts. The highest agreement is between experts E2 and E3. The lowest agreement is between experts EA (AffectNet) and E1
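Pairwise agreement of this kind can be computed with scikit-learn's cohen_kappa_score; the sketch below uses a few illustrative labels in place of the 1000 real annotations.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Illustrative labels only; in the study each expert assigns one of the 11
# classes to each of the 1000 images.
labels = {
    "EA": ["Happy", "Neutral", "Sad",     "Happy",    "Anger"],
    "E1": ["Happy", "Neutral", "Neutral", "Happy",    "Anger"],
    "E2": ["Happy", "None",    "Sad",     "Happy",    "Disgust"],
    "E3": ["Happy", "Neutral", "Sad",     "Surprise", "Anger"],
}

# Cohen's kappa for every pair of experts (analogous to the pairwise values in Table 21).
for a, b in combinations(labels, 2):
    kappa = cohen_kappa_score(labels[a], labels[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```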

When humans label expressions, there is a level of subjectivity inherent to human nature. Table 22 shows that Happiness is one of the easiest expressions to classify, judging by the agreement level. We highlight that the four experts agree on only 36.2% of the images, which can be attributed to the subtlety of some expressions, many of which are ambiguous. In descending order of agreement, the classes are: Happiness, Non-face, Sadness, Neutral, Surprise, Anger, Fear, Disgust, None, Uncertain and Contempt. Table 22 summarizes the results: the first four columns (EA, E1, E2 and E3) count the number of images assigned to each class by each expert, column C shows the number of cases in which the four experts agree on the label, and the agreement level is calculated relative to the average number of labels for an expression. Labelling expressions is highly subjective for a human, and using such labels as ground truth reproduces the subjectivity of the human expert. In contrast, datasets where professional actors pose a facial expression (even if the person is not actually feeling that emotion) provide a more reliable data source.

Table 22 Number of images of each class labelled by each expert (EA, E1, E2 and E3). Column C is the number of images on which the four experts agree, and the last column is the percentage of agreement relative to the minimum among the four evaluators

6.2 Facial expression recognition by humans

The aim of this experiment is to study how humans recognize facial expressions using the FE-Test dataset, taking into account the subjectivity analyzed in the previous subsection.

6.2.1 Participants

A total of 253 unpaid volunteers (27.27% women), with ages ranging from 18 to 66 years, participated in the study. Participants were recruited via social media.

6.2.2 Task

The task was to classify 10 random images into one of the seven emotions: AN (Anger), DI (Disgust), FE (Fear), HA (Happiness), SA (Sadness), SU (Surprise) and NE (Neutral).

6.2.3 Procedure

Participants were informed about the goal of the research and the task to carry out. They received a link to a webpage where 10 random images from the FE-Test dataset were shown, each with a drop-down list containing the seven emotions (see Fig. 9). We also gathered demographic data regarding age and gender.

Fig. 9 The web page created for the human FER experiment

6.2.4 Results and discussion

Table 23 presents the results obtained by the 253 participants in the form of a confusion matrix, with an average accuracy of 83.53% across all classes. On the one hand, some expressions are confused with others: Sadness and Fear are often confused with Neutral and Surprise, respectively. On the other hand, Happiness is the easiest to distinguish, and most participants recognize it without difficulty.

Table 23 Confusion Matrix from human assessment (7 expressions). Results of the FE-Test dataset using a cross-datasets approach
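A confusion matrix like the one in Table 23 can be built directly from the recorded answers; a minimal sketch, assuming each answer is stored as a (true label, chosen label) pair, is shown below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

EXPRESSIONS = ["AN", "DI", "FE", "HA", "SA", "SU", "NE"]

# Hypothetical records: one (true label, label chosen by the participant) pair
# per image shown; the real data come from the 253 participants' answers.
answers = [("HA", "HA"), ("SA", "NE"), ("FE", "SU"), ("AN", "AN"),
           ("SU", "SU"), ("DI", "AN"), ("NE", "NE"), ("SA", "SA")]
y_true = [t for t, _ in answers]
y_pred = [c for _, c in answers]

# Row-normalized confusion matrix: each row gives the proportion of answers per true class.
cm = confusion_matrix(y_true, y_pred, labels=EXPRESSIONS).astype(float)
row_sums = cm.sum(axis=1, keepdims=True)
cm = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

print(np.round(cm, 2))
```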

6.3 Facial expression recognition by our system

The aim of this experiment is to analyze the accuracy of our system in recognizing 7 facial expressions in FE-Test, an unknown dataset, and to compare these results with those obtained by humans.

6.3.1 Procedure

We use our CNN and image pre-processing steps to classify the 7 expressions of the FE-Test dataset, in order to compare the results with those obtained by humans. The training is run 10 times and the reported results are the mean of these 10 runs.

6.3.2 Results and discussion

Based on the results of the previous experiments, we use the combination of the five datasets as the training set, together with our CNN and pre-processing step, to evaluate the recognition of 7 expressions.

We use FE-Test as the testing set; thus, this is also a cross-dataset evaluation. The confusion matrix is shown in Table 24, where we obtain an average accuracy of 68.86%. The highest accuracy is achieved for the Happiness and Surprise expressions, with the CNN performing better than humans. Both experiments obtain the worst results with Sadness and Fear, although humans perform better than the CNN; humans also obtain higher results in recognizing the Neutral, Anger and Disgust emotions. Finally, we can see a correlation between the two experiments, especially in the recognition of Anger, Disgust and Fear, which are usually confused with Disgust, Anger and Surprise, respectively. Interestingly, these mistakes are made by both humans and the machine, that is, both make similar misclassifications.

Table 24 Confusion Matrix from our system (7 expressions). Results of the FE-Test dataset using the cross-datasets approach
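One simple way to quantify this correspondence is to correlate the per-class accuracies (the diagonals of Tables 23 and 24); the sketch below uses placeholder values, not the published figures.

```python
import numpy as np
from scipy.stats import pearsonr

EXPRESSIONS = ["AN", "DI", "FE", "HA", "SA", "SU", "NE"]

# Placeholder per-class accuracies (diagonals of the two confusion matrices);
# the real values are those reported in Tables 23 and 24.
human_acc = np.array([0.85, 0.80, 0.70, 0.97, 0.75, 0.88, 0.92])
cnn_acc   = np.array([0.70, 0.62, 0.50, 0.98, 0.48, 0.92, 0.60])

r, p = pearsonr(human_acc, cnn_acc)
print(f"Pearson correlation between human and CNN per-class accuracies: r={r:.2f}, p={p:.3f}")
```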

7 Conclusions

FER is a difficult task both for humans and for technology. In the case of automatic FER using CNNs, aspects such as the model and the datasets used for training and testing are fundamental to achieve accurate results. This work presented an extensive evaluation of a CNN using both single and cross-dataset approaches, considering initial steps such as the selection of datasets, image pre-processing and even a comparison with human perception. To the best of our knowledge, this is the most extensive experimental study to date on cross-dataset FER using CNNs, since most previous studies in the literature employ only one dataset for testing.

We first defined a protocol to select and work with different datasets and to create a homogenized dataset with data augmentation. As far as we know, there is no systematic protocol for carrying out this task. We listed a set of steps to follow, highlighting restrictions and considerations, and mentioned methods to use in the different steps of the protocol in subsection 3.3: Image Pre-Processing and Data Augmentation. Regarding dataset selection, we highlighted the problems of facial expression datasets built via web scraping, such as duplicated or non-face images.

Then, we presented empirical results of an exhaustive evaluation (using single and cross-dataset approaches), analyzing the relevance of a pre-processing step to improve the performance despite the intrinsic complexity of a CNN, and how merging information from diverse datasets significantly helps the network's training. Our study shows that each dataset adds important value to the training, because each one was captured under different conditions and contains people of different ethnicities and ages. Therefore, not only the quantity of data matters when training a CNN, but also its variety. Thus, combining these datasets into a single training set, unified by our image pre-processing steps, significantly improves the results with respect to using only one dataset for training, and moves automatic FER closer to in-the-wild application. Our experiments also show that our system outperforms other solutions proposed in the literature (see Table 17) and achieves good accuracy in real-world situations. We highlight the accuracy of around 70% obtained with the cross-dataset approach when the test set comes from a never-seen-before dataset (evaluations carried out with FE-Test). Finally, we compared human classification against our CNN, observing similar results and similar misclassification errors.

As future work, we intend to refine our system with more datasets and to study the pre-processing step for color images. We also plan to extend this study to build a metric that scores the learning influence of each dataset and its bias.