Unveiling the Human-like Similarities of Automatic Facial Expression Recognition: An Empirical Exploration through Explainable AI

Facial expression recognition is vital for human behavior analysis, and deep learning has enabled models that can outperform humans. However, it is unclear how closely they mimic human processing. This study aims to explore the similarity between deep neural networks and human perception by comparing twelve different networks, including both general object classifiers and FER-specific models. We employ an innovative global explainable AI method to generate heatmaps, revealing crucial facial regions for the twelve networks trained on six facial expressions. We assess these results both quantitatively and qualitatively, comparing them to ground-truth masks based on Friesen and Ekman's description, as well as among themselves. We use Intersection over Union (IoU) and normalized correlation coefficients for comparisons. We generate 72 heatmaps to highlight critical regions for each expression and architecture. Qualitatively, models with pre-trained weights show more similarity in heatmaps compared to those without pre-training. Specifically, eye and nose areas influence certain facial expressions, while the mouth is consistently important across all models and expressions. Quantitatively, we find low average IoU values (avg. 0.2702) across all expressions and architectures. The best-performing architecture averages 0.3269, while the worst-performing one averages 0.2066. Dendrograms, built with the normalized correlation coefficient, reveal two main clusters for most expressions: models with pre-training and models without pre-training. Findings suggest limited alignment between human and AI facial expression recognition, with network architectures influencing the similarity, as similar architectures prioritize similar facial regions.


Introduction
Facial expressions are a rich and salient source of information for human communication. While facial expressions are not emotions themselves, they serve as a visual representation of an individual's emotional state, offering observable cues that can convey emotions effectively. Their universality has been a topic of debate [1]: some argue that certain facial expressions may vary across cultures or contexts, suggesting that their interpretation may not be entirely universal. However, a significant body of research builds upon the work of Ekman, who identified six fundamental and universally recognized facial expressions: anger, happiness, surprise, disgust, sadness, and fear [2]. This set of facial expressions does not cover the wide range of expressions humans can produce, but it has been the basis for understanding and interpreting emotional states in research.
The affective computing market is forecast to grow at a rate of 28.2% during 2024-2032, reaching US$ 682.2 billion by the end of the period [3]. Therefore, research on automatic Facial Expression Recognition (FER) is a very active field due to its practical application in areas such as medical diagnosis and treatment [4], human behavior analysis [5,6], or human-computer interaction [7,8]. The development of FER systems has been a topic of interest for computer vision experts, who have looked to human vision research for insights into the visual perception process. In the early stages of FER research, emphasis was placed on extracting facial features or analyzing facial appearance [9,10]. However, in recent years, significant progress has been made by leveraging deep learning techniques [11,12].
Deep learning approaches, such as Convolutional Neural Networks (CNNs), have shown remarkable results in FER tasks and achieved state-of-the-art performance. Their success has raised questions as to whether these models work like human vision, with numerous researchers pointing out the similarities [13][14][15].
Regarding these systems, humans often exhibit cognitive anthropomorphism [16], wherein they tend to project human-like qualities and expectations onto AI systems, assuming that AI possesses characteristics similar to human intelligence. Since 2020, the literature has shown a growing research interest in anthropomorphism, highlighting its significant influence on the perception, adoption, and continued use of AI-enabled technology [17]. However, it is important to note that while a deep learning approach may yield high accuracy in recognizing facial expressions, it does not necessarily imply that the underlying decision-making process resembles that of humans [16,18]. Therefore, we find studies investigating similarities between deep learning approaches and human vision in the FER field and other visual human tasks [19][20][21][22][23], but thorough comparisons of human and AI behavior are still relatively rare. In fact, Bowers et al. [24] state that to develop biologically plausible models of human vision, attention needs to be directed to explaining psychological findings. Further, research also advocates for the exchange of knowledge between neuroscience and AI to find the most suitable computational models that resemble the information processing of the human brain [25].
Mueller [16] examined known properties of human classification (e.g. representation, processing) in comparison with image classifier systems and described design strategies to cope with the differences: change the AI, change the human (e.g. instructing and training humans' expectations), or change the interaction between human and AI (e.g. showing explanations). In the case of changing the AI, mimicking human behavior may help avoid errors and inconsistencies in performance when compared with humans, and improve the deep networks by incorporating critical information for recognition [26,27].
In the case of human perception, the Facial Action Coding System (FACS) [28] anatomically describes all visually discernible facial movement by defining Action Units (AUs), which are the actions of individual muscles or groups of muscles. By observing and coding a selection of AUs (EMotion FACS, EMFACS), humans can identify prototypical facial expressions that have been found to suggest certain emotions. Mining the literature, we find deep learning models trained with salient regions for human vision [29] or AUs [30,31] which achieve similar results as human coders with frontal faces and controlled conditions, but the accuracy decreases to below 83% when conditions are not constrained [32].
In this particular study, the focus lies on deep networks trained with facial expression images, primarily because these networks exhibit the ability to handle the challenging conditions of recognition in in-the-wild scenarios [11,33,34]. Furthermore, the study specifically deals with still images, rather than dynamic sequences or videos.
We seek to answer whether deep networks observe facial AUs regions as defined in EMFACS and how human-like, in terms of visual perception of physical cues, is their processing.
To gain a deeper understanding of the inner workings of deep learning techniques, the integration of eXplainable Artificial Intelligence (XAI) techniques has emerged as a valuable approach [35,36]. XAI techniques aim to provide explanations on how deep learning models make decisions and generate outputs, enabling human users to comprehend and interpret the results effectively [37]. XAI emphasizes developing AI models that are interpretable and transparent [35,38]. Interpretability can be achieved either by developing directly interpretable models (e.g. a decision tree) while maintaining high performance levels, or by using post-hoc explanations applied to trained models (especially black-box models). In this context, by applying post-hoc XAI techniques, we will analyze the human-like perception of deep learning models and explore their similarities with the FACS.
The aim of this work is twofold. On the one hand, we investigate the similarities of deep networks with facial AUs using explainability techniques to assess how human-like the deep networks are regarding visually perceivable features. On the other hand, we compare different CNNs to study whether they focus on the same regions. Comparisons among the 12 CNNs would clarify whether FER relies on specific zones or on the architecture of the network.
Section 2 describes related works that compare human perception and deep learning techniques regarding FER. Section 3 explains the development of the FER system, describing the models, the training datasets and the training process. Section 4 describes the global XAI technique proposed for explanation purposes, the ground-truth masks based on Friesen and Ekman's work [39], the metrics used for comparison, and the procedure followed. Section 5 presents the results, and they are discussed in Section 6. Finally, the last Section concludes the work.

FER comparison between human perception and deep learning techniques
There are few works that thoroughly compare human perception and deep learning techniques regarding FER. Mining the literature, we find works focused on applying XAI techniques to the FER domain [40][41][42][43][44][45][46], but it is out of their scope to explore in depth the similarities or differences regarding human perception.
Regarding works analyzing how human-like the learned features of deep learning techniques are when not trained specifically with them (e.g. using AUs to train the model), Khorrami et al. [47] already considered whether neural networks learn facial AUs in FER. They proposed an approach to interpret which portions of the face influenced the CNN's predictions and applied it to the extended Cohn-Kanade (CK+) dataset [48] and the Toronto Face Dataset (TFD) [49]. They visualized the spatial patterns that maximally excite different filters in the convolutional layers of the network (10 selected filters in the conv3 layer). Then, they checked whether the observed facial AUs aligned with facial movements by using the KL divergence on the FAU labels given in the CK+ dataset. Their results showed that CNNs are able to model high-level features that correspond to facial AUs.
Prajod et al. [50] also related automatic FER with AUs. They presented a study on the effects of transfer learning for automatic FER from emotions to pain using the VGG16 architecture and two datasets: AffectNet for emotions and UNBC-McMaster for pain [51]. They applied Layer-wise Relevance Propagation (LRP) saliency maps to visually compare and understand the most influential regions for the classification, both for emotion recognition and for pain recognition, and mapped those regions to AUs.
The results showed that specific AUs related to the facial expressions of contempt and surprise were not relevant for pain recognition.
Deramgozin et al. [52] presented a hybrid framework composed of a main functional pipeline with a CNN to classify input images and an explainable pipeline including a LIME visualization part. It also included a facial AU extraction part to identify the AUs and a Multi-Layer Perceptron (MLP) classifier which reused the results of the facial AU extractor to reinforce the results obtained by the main functional pipeline. The AUs used in this work were: AU01, AU02, AU05, AU06, AU07, AU09, AU12, AU15, AU17, AU20, AU23 and AU26. They evaluated their CNN and the FAU+MLP system with the CK+ dataset, and they did a visual comparison of the LIME explanations and the extracted AUs. The AUs extracted were consistent with Ekman's basic facial expressions, although they found some differences due to the dataset and the compound emotional categories.
Gund et al. [53] analyzed superficially how different spatial regions and landmark points changed in position over time. They classified expressions using two convolutional networks trained with the CK+ dataset and tested with the SAMM dataset [54]. They highlighted the regions of interest with Class Activation Maps (CAM) and visually compared the AUs of each emotion with the CAM for landmarks for some images.
Zhou et al. [55] did not use XAI techniques, but they found that the pre-trained VGG-Face could spontaneously generate expression-selective units. The accuracy of the system was far lower than humans', but it exhibited hallmarks of human expression recognition (e.g., facial expression confusion and categorical perception). However, the expression-selective units did not exhibit reliable human-like characteristics of facial expression perception.
None of the works reviewed carry out an in-depth comparison between the learning of the model and visually perceptible features observed by humans. Since there are insights indicating both the similarity and the differences between the model learning and human perception, in this work we will carry out a thorough quantitative comparison and a qualitative exploration.

Datasets
We list four standard datasets widely used in facial expression studies: the Extended Cohn-Kanade (CK+) dataset [48], the BU-4DFE dataset [56], the JAFFE dataset [57] and the WSEFEP dataset [58]. In addition, we include the FEGA dataset introduced in [59]. Some samples from the different datasets can be observed in Figure 1.
The Extended Cohn-Kanade (CK+) dataset [48] contains 593 facial expression sequences from 123 subjects ranging from 18 to 30 years old. These sequences are labelled with one of 7 emotions (anger, contempt, disgust, fear, happiness, sadness, and surprise) based on the expression of each subject.
The JAFFE [57] dataset contains 213 images from 10 female Japanese actresses posing six basic facial expressions (happiness, surprise, fear, sadness, anger, disgust) and the neutral facial expression.
The WSEFEP [58] dataset contains 210 high-quality images of 30 subjects (14 men and 16 women) posing the six basic facial expressions plus the neutral face.
Finally, the Facial Expression, Gender and Age (FEGA) dataset [59] contains 51 subjects (21 females and 30 males) ranging from 21 to 66 years old. Each subject posed the six basic facial expressions plus the neutral face. Each subject was also labelled with age and gender.

Pre-processing and data augmentation
To prepare the input of the datasets used for training, we need to homogenize the images regarding resolution, color space and face alignment [59]. On the one hand, when training and testing, we prepare the images applying these three steps:
1. Detect the face in the image using the method proposed by Lisani et al. [60].
2. Align the face using the 68 facial landmarks proposed by Kazemi and Sullivan [61] to locate the eye positions and calculate the distance between them. This method was selected for its speed, accuracy and robustness; further, it is a very well-known technique [62]. We obtain the angle to rotate the image by drawing a straight line between the eyes, and then we crop the aligned face.
3. Convert the image to gray-scale and resize it to fit the input of the CNN.
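The rotation angle in the alignment step can be sketched as follows. This is a minimal numpy-only illustration (the landmark detector and the actual image rotation are assumed to come from external libraries; the function names are ours):

```python
import numpy as np

def alignment_angle(left_eye, right_eye):
    """Angle in degrees to rotate the image so the eye line becomes horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return float(np.degrees(np.arctan2(dy, dx)))

def inter_eye_distance(left_eye, right_eye):
    """Distance between eye centres, used to scale the face crop."""
    return float(np.hypot(right_eye[0] - left_eye[0],
                          right_eye[1] - left_eye[1]))
```

The returned angle would then be passed to a rotation routine (e.g., an affine warp) before cropping the aligned face.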
On the other hand, when training, we also augment the number of images applying these two steps:
1. Add variations in terms of lighting, using the gamma correction technique [63].
2. Add variations in terms of appearance, which aim to cover small errors in the eyes' location detection. The variations are: translation (in both axes), crop (always preserving the eyes, nose and mouth), and mirroring.
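The lighting augmentation in step 1 can be sketched as a simple power-law transform (a minimal sketch, assuming images normalized to [0, 1]):

```python
import numpy as np

def gamma_correct(image, gamma):
    """Lighting augmentation via gamma correction.
    image: float array in [0, 1]; gamma < 1 brightens, gamma > 1 darkens."""
    return np.clip(image, 0.0, 1.0) ** gamma
```

Applying several gamma values to each training image yields the lighting variations described above.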

Model selection
We built twelve models used in the literature to classify Ekman's six basic facial expressions: nine well-known models widely used in the computer vision and neuroscience literature (AlexNet [64], VGG16 [65], VGG19 [65], ResNet50 [66], ResNet101V2 [66], InceptionV3 [67], Xception [68], MobileNetV3 [69], and EfficientNetV2 [70]) and three models built specifically for FER (Ramis et al. [59], Song et al. [71] and Wei et al. [72]). We name these three works SilNet, SongNet and WeiNet, respectively. All these CNNs were trained and tested on the datasets described in subsection 3.1, which were pre-processed and augmented following the steps listed in subsection 3.2. AlexNet, WeiNet, SongNet, and SilNet exhibit simpler architectures compared to the other models, comprising three to five convolutional layers followed by max-pooling layers and concluding with varying numbers of dense layers. VGG16 and VGG19 are variants sharing a base architecture, distinguished by their use of compact 3x3 convolutional filters across the network and concluding with three fully connected layers, resulting in a total of 16 and 19 layers, respectively. ResNet50 and ResNet101V2 are part of the residual network family [73], which introduced residual learning through skip connections, facilitating the creation of deeper architectures. While ResNet50 comprises 50 layers, ResNet101V2 integrates optimizations such as bottleneck layers and improved skip connections, totaling 101 layers. InceptionV3 and Xception represent advancements over the original Inception [74], employing Inception modules with multiple parallel convolutional operations. Notably, Xception replaces standard Inception modules with depthwise separable convolutions to capture cross-channel and spatial correlations separately, thereby enhancing efficiency and performance. MobileNet [75] is a family of lightweight neural network architectures designed for efficient deployment on mobile and edge devices. Characterized by depthwise separable
convolutions, MobileNet architectures strike a balance between accuracy and computational efficiency, proving suitable for resource-constrained environments. MobileNetV3 incorporates features like inverted residuals, linear bottlenecks, and squeeze-and-excitation modules to further enhance efficiency and accuracy. EfficientNetV2 represents an improvement over the original EfficientNet [76], which introduced compound scaling for simultaneous optimization of model depth, width, and resolution, resulting in improved performance and efficiency. The second version introduces refinements in scaling factors, potentially enhancing overall model performance with fewer parameters.
Table 1 provides a comprehensive overview of each utilized model. By incorporating these diverse architectures and comparing their performance, we aim to gain valuable insights into the impact of architectural choices on FER tasks.
The following models utilized pre-trained weights from the ImageNet dataset: VGG16, VGG19, ResNet50, ResNet101V2, InceptionV3, Xception, MobileNetV3, and EfficientNetV2. For these models, we removed the final fully connected layers and replaced them with an average 2D pooling layer and a new fully connected layer, tailored to accommodate the specific number of facial expression classes (anger, fear, disgust, sadness, surprise, and happiness) as defined by Ekman and Friesen [28]. Conversely, the models AlexNet, WeiNet, SongNet, and SilNet were trained from scratch exclusively using facial expression images. Notably, WeiNet, SongNet, and SilNet were purpose-built explicitly for the task of facial expression recognition. The implementation of the models was carried out in Python, leveraging the Keras library. For the models AlexNet, WeiNet, SongNet, and SilNet, we implemented each layer step by step within the Keras framework. The remaining models were readily accessible through the Keras API, complete with their pre-trained weights from the ImageNet dataset.
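In numpy terms, the replacement head computes a global average pooling over the convolutional feature maps followed by a dense softmax over the six expression classes. This is an illustrative sketch of what the new layers compute, not the Keras code itself (the function and argument names are ours):

```python
import numpy as np

def new_head(feature_maps, weights, bias):
    """Sketch of the replacement top: global average pooling 2D followed by a
    dense softmax layer over the six expression classes.
    feature_maps: (H, W, C) activations from the convolutional base;
    weights: (C, 6); bias: (6,)."""
    pooled = feature_maps.mean(axis=(0, 1))   # global average pooling 2D
    logits = pooled @ weights + bias          # new dense layer
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()
```

In Keras, this corresponds to instantiating the backbone without its top layers and appending the pooling and dense layers before fine-tuning.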

Procedure
We performed a k-fold cross validation with k = 5 to train the twelve networks on the five datasets defined in subsection 3.1. To do so, we first grouped the images by user for each dataset, and then randomly split the users of each dataset into five groups, resulting in five partitions, each containing approximately the same amount of images from each sub-dataset. Finally, following the standard k-fold cross validation procedure, we made all five possible combinations, using four of the splits for training and leaving the remaining one for testing.
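The subject-wise grouping above, which guarantees that no user's images leak between training and test partitions, can be sketched as follows (a minimal illustration; function name is ours):

```python
import random

def subject_kfold(subject_ids, k=5, seed=42):
    """Split unique subjects into k disjoint groups so that no subject's
    images appear in both the training and the test partition."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    return [subjects[i::k] for i in range(k)]
```

Each of the five cross-validation runs then takes four of the returned groups for training and the remaining one for testing.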
Before starting with the cross validation, we conducted preliminary trainings for each network to properly determine the number of epochs for each one. We determined that 10 epochs were enough to train SilNet [59], AlexNet [64] and WeiNet [72], as well as MobileNetV3 [69] and EfficientNetV2 [70], while SongNet [71] required up to 25 epochs to get proper results. The remaining networks (i.e. VGG16, VGG19, ResNet50, ResNet101V2, InceptionV3, and Xception) only needed 2 epochs to get a good performance.
Summing up, a total of 60 trainings were carried out: 5 for each of the 12 networks, using different splits of the datasets.


Measuring what the deep networks learn and metrics for comparison

LIME explanations
There are numerous techniques for providing visual explanations of the predictions made by deep learning (DL) models, such as perturbation-based, feature-based, and propagation-based methods, each with its own advantages and disadvantages [77].
From this broad ecosystem of XAI techniques, we selected Local Interpretable Model-agnostic Explanations (LIME) [78], one of the most popular model-agnostic methods [79]. LIME can be applied to any classifier and offers locally faithful explanations for the instance being explained. When applied to images, LIME highlights the regions that most influence the prediction for a given class.
Considering that it is impractical to evaluate each pixel independently for explanation purposes, clustering techniques are generally employed to create superpixels, which group similar pixels together. For this, we used the SLIC (Simple Linear Iterative Clustering) algorithm [80] to compute superpixels, which were then explained using LIME.
Given a neural network architecture R_j (where j indexes the twelve networks described in subsection 3.3), a training iteration k, and an image X_i, we obtained a classification c_ijk corresponding to a facial expression (anger, happiness, sadness, disgust, fear, or surprise). By applying LIME, we identified the image regions of most interest to the network. The outcome was an image L_ijk, with values in the range [0, 1], representing minimum and maximum importance, respectively. This image was stored in grayscale to be used in further steps.
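The perturbation idea behind these explanations can be illustrated with a much-simplified occlusion variant: score each superpixel by how much the target-class confidence drops when the region is painted black (the occlusion color used in our setup). This is only a sketch of the principle; LIME proper samples many random perturbations and fits a linear surrogate model:

```python
import numpy as np

def region_importance(image, segments, predict, target_class):
    """Score each superpixel (integer label in `segments`) by the drop in the
    target-class score when the region is occluded with black."""
    base = predict(image)[target_class]
    scores = {}
    for label in np.unique(segments):
        occluded = image.copy()
        occluded[segments == label] = 0.0   # paint the superpixel black
        scores[label] = base - predict(occluded)[target_class]
    return scores
```

Here `predict` stands for any classifier returning per-class scores; in our pipeline the segments come from SLIC and the scoring is done by LIME.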

Standardization of the explanations
Since the dataset images were taken from different points of view and the face proportions can vary from one user to another, the exact coordinates of facial keypoints (e.g., eyes, mouth, forehead, or nose) can also vary. Therefore, we standardized all images, processing each image so that all regions of interest correspond to the same image coordinates, and applied the same transformation to the explanation grayscale images.
The standardization process involved transforming the images using the 68 landmarks estimated with the method described by Kazemi and Sullivan [61], along with 17 landmarks added to the top border and 4 at each corner of the image, totaling 89 landmarks. Next, we applied triangulation to each image X_i using the Delaunay algorithm. The standard position of each landmark and triangle is shown in Fig. 2. Given an input image X_i from the dataset, L_ijk as the explanation grayscale image for the confidence of a model j on a class k for image X_i, and S as the standard image, we find a transformation T that moves the landmarks from X_i to the same coordinates as in S. The transformation T is the set of affine transformations that map each triangle in the input image into the corresponding triangle in the standardized image space. T is then used to copy pixels from the input image to the standardized image. Since X_i and L_ijk share a common space (all landmarks are located at the same coordinates), the same transformation T can be applied to standardize L_ijk, which we denote by S_ijk. Thus, we find T knowing S and X_i such that S = T(X_i), and then we apply S_ijk = T(L_ijk). The entire explanation and standardization process is depicted in Fig. 3, which shows, for an example image of each expression, the superpixels computed using SLIC segmentation, the LIME explanation, the input image transformed using the normalized landmark coordinates, and the transformed LIME relevance for each region in gray scale (for further heatmap computation), using the same landmarks.
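The per-triangle piece of T can be sketched as solving for the affine map that carries the three vertices of a source triangle onto the corresponding standardized triangle (the pixel copying itself, i.e. rasterizing each triangle, is omitted; function names are ours):

```python
import numpy as np

def triangle_affine(src, dst):
    """Affine transformation mapping the 3 src vertices onto the 3 dst vertices.
    src, dst: (3, 2) point arrays. Returns M (2x3) with dst_i = M @ [x_i, y_i, 1]."""
    A = np.hstack([np.asarray(src, float), np.ones((3, 1))])   # (3, 3)
    M = np.linalg.solve(A, np.asarray(dst, float))             # (3, 2)
    return M.T

def apply_affine(M, points):
    """Apply the 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.hstack([np.asarray(points, float), np.ones((len(points), 1))])
    return pts @ M.T
```

Composing one such map per Delaunay triangle gives the piecewise-affine warp that moves every landmark of X_i onto its standard position.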

Importance heatmaps
Once the LIME explanations were standardized, further insights could be extracted. Given a coordinate in any LIME transformed image S_x, the coordinate will correspond to the same face region in all images. Hence, given a network j, a training k, and a set of images A, we built a heatmap by calculating the average of the LIME transformed images as in Eq. 1:

H_A = (1 / card(A)) · Σ_{S_x ∈ A} S_x    (1)
where card(A) is the cardinality of the set A. Following this method, we calculated heatmaps for each network, training and facial expression, so A_jkc represents the set of LIME images built with the network j and training set k, and whose classification is c, with c ∈ {Anger, Disgust, Fear, Happiness, Sadness, Surprise}.
Since we conducted five trainings for each of the twelve networks, different groupings can be done to build the heatmaps, as shown in Eq. 2. These groupings correspond to a summary of important regions according to the network, training set, and class:

H_jc = (1/5) · Σ_{k=1}^{5} H_{A_jkc},    H_c = (1/12) · Σ_{j=1}^{12} H_jc    (2)
where the sum of two images is done pixel-wise. All heatmaps are images with pixel values in the range [0, 1].
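The pixel-wise averaging of Eq. 1 can be sketched in a few lines of numpy (function name is ours):

```python
import numpy as np

def heatmap(standardized_explanations):
    """Pixel-wise average of a set of standardized LIME explanations:
    the per-network, per-training, per-class heatmap of Eq. 1."""
    stack = np.stack(standardized_explanations).astype(float)
    return stack.sum(axis=0) / len(standardized_explanations)
```

Since every input lies in [0, 1], the average does too, matching the range stated above; the coarser groupings of Eq. 2 are just further averages of these heatmaps.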

Ekman masks construction
We build Ekman masks following the specifications defined by Friesen and Ekman [39], to compare the influential face regions considered by the networks with anatomically, visually discernible features related to emotions (i.e., EMFACS). EMFACS defines facial AUs related to facial expressions (see Table 2), and each AU relates to the movement of two or more landmarks. Based on this, we build an AU-landmark-related Ekman mask for each facial expression.
To determine the landmarks involved, we use the relations described in Perveen and Mohan's work [81]. Based on the landmarks, we visually mark the face areas involved in the expression, to obtain an Ekman mask for every expression following Friesen and Ekman's [39] description (see Fig. 4).
Given a facial expression, the AUs are taken into account to determine the important region. For a given AU, the corresponding landmark is identified, or alternatively, the triangle of landmarks that includes it. If it matches a landmark, that landmark and its vicinity are marked as an important region. If the AU falls within a triangle (or on an edge), the triangle(s) that include it are marked. Subsequently, the relationship between AUs is evaluated to merge related regions.

Metrics for heatmaps and masks comparison
To assess the obtained results and explore the computed heatmaps for the different networks, we use five different metrics: Intersection over Union (IoU), F1 score, precision, recall, and normalized correlation coefficient.
On the one hand, we use the IoU, F1 score, precision, and recall to assess the difference between two masks, where only binary data is present. This is the case when comparing thresholded heatmaps with the Ekman masks described in subsection 4.4. Since precision, recall, and F1 score are metrics designed to evaluate the performance of a predictor against ground truth, we consider the Ekman masks as the ground truth and the thresholded heatmaps as the predicted values. Given two binary images I_gt (ground truth) and I_p (predicted values), with equal dimensions W × H containing only values in {0, 1} (representing the two classes: not important and important, respectively), and denoting by TP, FP, and FN the number of pixels where both images are 1, where only I_p is 1, and where only I_gt is 1, respectively, the metrics are computed as in Eqs. 3-6:

IoU = TP / (TP + FP + FN)    (3)
precision = TP / (TP + FP)    (4)
recall = TP / (TP + FN)    (5)
F1 = 2 · precision · recall / (precision + recall)    (6)
Referring to pixels with ones as white and regions with zeros as black for simplicity, precision assesses the accuracy of predicted white regions; recall evaluates how many ground truth white regions were correctly predicted; and the F1 score, the harmonic mean of precision and recall, considers both measures simultaneously, which is crucial for unbalanced problems.IoU, specifically designed for mask comparison, also provides a fair estimation in unbalanced problems.
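The four mask metrics can be computed together from the pixel-wise overlap counts (a minimal sketch; function name is ours):

```python
import numpy as np

def mask_metrics(gt, pred):
    """IoU, precision, recall and F1 between two binary masks (Eqs. 3-6).
    gt plays the role of the Ekman mask, pred of the thresholded heatmap."""
    gt, pred = np.asarray(gt, bool), np.asarray(pred, bool)
    tp = np.logical_and(gt, pred).sum()    # white in both
    fp = np.logical_and(~gt, pred).sum()   # predicted white, ground truth black
    fn = np.logical_and(gt, ~pred).sum()   # ground truth white, missed
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1
```

Note that IoU and F1 are monotonically related for fixed counts, which is consistent with the high correlation between them reported in the results.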
On the other hand, to evaluate how similar the face regions influencing the classification of facial expressions are among network architectures, we use the Normalized Correlation Coefficient, which better addresses the comparison of gray-scale images. Given two heatmaps, H_1(x, y) and H_2(x, y), we calculate the coefficient with Eq. 7:

corr(H_1, H_2) = ⟨H_1, H_2⟩ / (√⟨H_1, H_1⟩ · √⟨H_2, H_2⟩)    (7)
where ⟨a, b⟩ denotes the dot product. This metric is 1 when the images are identical (maximum similarity) and 0 when one heatmap is the inverse of the other. This value is converted to a distance as 1 − corr.
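Read as a cosine similarity between flattened heatmaps, the distance used for clustering can be sketched as follows (a sketch assuming the cosine form; the exact normalization in the paper's Eq. 7 may differ):

```python
import numpy as np

def correlation_distance(h1, h2):
    """1 - normalized correlation between two heatmaps, i.e. the distance
    fed to the hierarchical clustering behind the dendrograms."""
    a, b = np.ravel(h1).astype(float), np.ravel(h2).astype(float)
    corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - corr
```

The pairwise distances between the twelve per-network heatmaps of an expression form the condensed matrix passed to a hierarchical-clustering routine to draw the dendrogram.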

Procedure
Bearing in mind that we have five trainings for each of the twelve networks, we start by taking a random sample of 100 positives per class, for a total of 600 images, which will be the images to explain. We use the positives (i.e. the class predicted by the model), and not directly images taken from the class being explained, because in the case that the model's prediction and the true label do not match, we would not be explaining the model's decision.
To follow up, we segment each image into approximately 30 regions using SLIC [80] and find the importance of each region using LIME, setting the number of samples to 1,000 and the background (or occlusion color) to black. This results in an explanation in the form of a gray-scale image, where brighter colors represent more relevant regions for the class being explained (see subsection 4.1).
The next step is to standardize the gray-scale explanation images, obtaining explanations in a normalized space where the coordinates of the landmarks have a fixed location (see subsection 4.2). With all standardized explanations, we obtain heatmaps at different levels by grouping explanations of the same class, network and training; of the same class and network (joining trainings); or only of the same class (joining trainings and networks), as seen in subsection 4.3.
Finally, we make use of the computed heatmaps in two ways. The first is to assess the similarity between the heatmaps and the Ekman masks described in subsection 4.4. To do so, it is necessary to first binarize the heatmaps, which we do using a threshold selected with Otsu's method [82], and then we calculate the intersection over union. The second is to analyze the difference among heatmaps of different networks for the same facial expression. For this, we use the normalized correlation coefficient (converted to a distance) and construct different dendrograms, to visually show which networks bestow importance upon regions in a more similar way.
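The binarization step can be sketched with a compact numpy version of Otsu's method, which picks the threshold maximizing the between-class variance of the gray-level histogram (a sketch assuming heatmap values in [0, 1]; in practice a library implementation would normally be used):

```python
import numpy as np

def otsu_threshold(gray, bins=256):
    """Otsu's method: threshold maximizing the between-class variance of the
    histogram of a [0, 1] gray-scale image."""
    hist, edges = np.histogram(np.ravel(gray), bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                          # weight of the dark class per cut
    w1 = 1.0 - w0                              # weight of the bright class
    cum_mean = np.cumsum(p * centers)
    mu0 = cum_mean / np.clip(w0, 1e-12, None)
    mu1 = (cum_mean[-1] - cum_mean) / np.clip(w1, 1e-12, None)
    between = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance per cut
    # return the upper edge of the last dark bin, so gray > t is the bright class
    return edges[np.argmax(between) + 1]
```

Thresholding a heatmap with the returned value yields the binary mask compared against the Ekman masks via IoU.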

Results
In this section, we present a comprehensive analysis of our experimental findings. Firstly, we provide the cross-validation outcomes for the distinct neural networks employed in the study. Subsequently, we conduct an in-depth qualitative examination of the heatmaps generated by these networks. Additionally, we quantitatively assess the Intersection over Union (IoU) between the heatmaps and the Ekman masks, shedding light on the degree of similarity between the facial expression recognition capabilities of AI models and humans. Finally, we leverage dendrogram construction as an essential tool for exploring the likeness among heatmaps produced by different networks when presented with the same facial expression.

Cross Validation
To test the performance of each net, we calculated the average accuracy across the cross-validation test sets. The results are shown in Figure 5. As seen, the obtained results for all networks fall within the range of 80% to 84% accuracy, except for ResNet50, which achieved an accuracy of 74%. Since the aim of this study is not to maximize these results, we deem the achieved accuracy levels satisfactory for proceeding with the subsequent stages of the experiment: an exploration of the significant facial regions identified by each network and a comparative analysis with the regions that humans perceive as important.

Heatmaps
To reflect the importance of the different regions of the face, we have employed the color map shown in Figure 6. The resulting heatmaps of the whole explanation, standardization and summary process described in Section 4 are shown in Figures 7, 8 and 9. Figure 7 shows the heatmaps summarizing the important regions among all networks for each expression. From these images, we can appreciate the following aspects for the different expressions:
• Anger: the relevance is scattered across the whole face, with the mouth and its surroundings being the most important part. The forehead, nose, cheeks and chin can also be relevant for the models.
• Disgust: the nose and mouth are especially relevant for the classification, and so can be the space in between.
• Fear: the mouth and the chin are especially important, and the eyes can also be.
• Happiness: the most important part is the mouth, although the surrounding regions can also be.
• Sadness: the relevance is rather scattered, covering the mouth, nose and the space in between.
• Surprise: the important parts are the eyes and the surrounding regions of the mouth.
The mentioned important regions for each facial expression make sense in general compared with humans' perception: for example, the mouth should be utterly important for the happiness expression (when the person is smiling), the nose should be important for the disgust expression (wrinkle of the nose), or both the eyes and mouth should be relevant for the surprise expression (they should be wide open). Figures 8 and 9 display the heatmaps for a specific network and facial expression. The first one displays only low-depth networks not using pre-trained weights (i.e. SilNet, WeiNet, AlexNet and SongNet), while the second one displays deeper networks using pre-trained weights on ImageNet (i.e. VGG16, VGG19, ResNet50, ResNet101V2, InceptionV3, Xception, MobileNetV3, and EfficientNetV2). At first glance, it seems that the heatmaps in Figure 8 have more scattered hot regions than those in Figure 9, and that there is more similarity between heatmaps of models with pre-trained weights than between models without pre-training, which is later explored in depth in subsection 5.4.
An especially important region appears to be the mouth, which is highlighted in almost all 48 heatmaps of the deeper networks shown in Figure 9. The eyes are also of special importance in the recognition of fear and surprise for all networks. The nose, on the other hand, seems important for the anger, disgust and sadness expressions, although not for all networks. In the case of VGG16 and VGG19, the heatmaps are very similar for all expressions.

Difference between network heatmaps and Ekman masks
The IoU, F1 score, precision, and recall results between the Ekman masks and the thresholded heatmaps of the different networks, for each expression, are shown in Tables 3 and 4.
The IoU and F1 score demonstrate a high Pearson correlation (99%), with F1 score values being slightly higher than IoU values. As seen in Table 3, the network with the highest value for one metric also achieves the highest value for the other. Among all the trained networks, InceptionV3 exhibits the best performance across all facial expressions, with IoU and F1 scores of approximately 0.31 and 0.46, respectively. It is followed closely by ResNet50, ResNet101V2, and EfficientNetV2.
When averaging by facial expressions, "Disgust" yields the most similar heatmaps to the Ekman mask, with mean values of 0.42 for IoU and 0.57 for F1 score, which are roughly two tenths higher than the values for other expressions. In particular, InceptionV3 achieves scores of 0.59 for IoU and 0.74 for F1 for this expression, the highest values among all networks and expressions.
Despite these findings, the average values for the IoU and F1 score remain relatively low across all networks (below 0.32 for IoU and 0.47 for F1 score). This indicates a significant discrepancy between the relevant regions identified by the trained networks to classify facial expressions and those defined by the constructed Ekman masks. Table 4 shows higher average precision values when compared with recall for some networks and expressions. This is the case for the AlexNet (10% higher precision), VGG16 (9%), VGG19 (7%), EfficientNetV2 (7%), and ResNet101V2 (6%) networks, and for the "sadness" (11%) and "fear" (8%) expressions. The remaining networks and expressions show an average difference between precision and recall below 5%, and the total average precision is 3% higher than the total average recall. Therefore, the regions identified as important by the models tend to be correct, but the models seem to miss a large proportion of the regions humans consider important. In other words, according to the results, models seem to focus on smaller regions than humans do.
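To make these metrics concrete, here is a minimal plain-Python sketch of binarizing a heatmap and comparing it against a ground-truth mask. The toy masks and the 0.5 threshold are illustrative assumptions, not the paper's actual data or settings.

```python
# Sketch of the mask-overlap metrics in Tables 3 and 4: a continuous heatmap is
# binarized at a threshold and compared against a binary Ekman mask. The toy
# values and the 0.5 threshold are hypothetical.

def binarize(heatmap, threshold=0.5):
    """Threshold a flat heatmap into a 0/1 mask."""
    return [1 if v >= threshold else 0 for v in heatmap]

def overlap_metrics(pred, truth):
    """Return (iou, precision, recall, f1) for two equal-length binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return iou, precision, recall, f1

heatmap = [0.9, 0.7, 0.2, 0.1, 0.8, 0.3]   # hypothetical explanation heatmap
ekman   = [1, 0, 0, 1, 1, 0]               # hypothetical Ekman ground-truth mask
iou, prec, rec, f1 = overlap_metrics(binarize(heatmap), ekman)
```

Note that for binary masks F1 (the Dice coefficient) and IoU (Jaccard) are monotonically related, F1 = 2·IoU/(1 + IoU), which is consistent with the near-perfect correlation between the two metrics reported above.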
A detailed comparison between the best and worst performing models for each facial expression can be found in Table 5.

Similarity between heatmaps
Figure 10 shows the dendrograms constructed using the normalized correlation coefficient to assess the difference between heatmaps for a network and an expression. As already seen in Figures 8 and 9, two main clusters appear for the majority of the expressions: deeper models with pre-training and low-depth models with no pre-training. This seems to indicate that either the pre-training on ImageNet or the inherent similarities of the deep architectures used by VGG16, VGG19, ResNet50, ResNet101V2, InceptionV3, Xception, MobileNetV3, and EfficientNetV2 lead them to converge to similar solutions for the facial expression classification problem. In addition, the difference between heatmaps within this cluster tends to be smaller than that within the cluster of lower-depth models not using pre-trained weights (i.e. SilNet, WeiNet, AlexNet and SongNet): 0.2 vs. 0.35 in disgust, 0.05 vs. 0.2 in happiness, 0.15 vs. 0.25 in sadness, and 0.2 vs. 0.5 in surprise, respectively. A high similarity between some of the networks can also be observed. For example, SilNet and WeiNet's heatmaps are very similar for the majority of facial expressions, as is the case for VGG16 and VGG19. This indicates that similar architectures tend to generate similar heatmaps.

Table 3: IoU and F1 score between the computed heatmaps for each network and expression and the Ekman masks. The last column shows the average by network, and the last row shows the average by expression. The best result for each column (and for the last row) is highlighted in bold.
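The similarity measure behind these dendrograms can be sketched as follows. The 4-pixel heatmaps are hypothetical toys; in practice, the pairwise distances would be fed to a hierarchical-clustering routine such as `scipy.cluster.hierarchy.linkage` to build the dendrograms.

```python
import math

# Sketch of the heatmap-similarity measure behind Figure 10: the normalized
# correlation coefficient between two flattened heatmaps, turned into a
# distance (1 - correlation) suitable for dendrogram construction.
# The toy 4-pixel heatmaps below are hypothetical.

def normalized_correlation(a, b):
    """Pearson-style normalized correlation between two flat heatmaps."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def correlation_distance(a, b):
    """Distance used for clustering: identical heatmaps give 0."""
    return 1.0 - normalized_correlation(a, b)

h1 = [0.9, 0.8, 0.1, 0.0]   # hypothetical heatmap, "mouth" pixels hot
h2 = [0.8, 0.9, 0.0, 0.1]   # similar model -> small distance to h1
h3 = [0.1, 0.0, 0.9, 0.8]   # dissimilar model -> large distance to h1
```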

Discussion
Next, we discuss the findings of the study, divided into several sections for better clarity. First, we examine the impact of the models' performance on the results. Next, we compare the similarities between different models. Then, we compare the explanation heatmaps with the Ekman masks to draw the main conclusions of the study. Finally, we discuss the implications of the results for future work and explore potential research directions.

Performance of the models
Results showed consistently high accuracy across all tested models, ranging from 80% to 84%, with the exception of ResNet50 (74%). We selected a broad number of networks introduced to the community between 2012 and 2022, varying in pre-training, architecture, parameters, and purpose of design, as three of them were specifically built for facial expression recognition. As already commented, improving the performance was out of the scope of this work, but the results show that despite the diversity in design, complexity and depth, all networks demonstrated comparable accuracy levels, even ResNet50, allowing the next steps of the study.

Fig. 10: Dendrograms grouping the heatmaps of the different networks by facial expression, using the normalized correlation coefficient.
The same ensemble of datasets was used in [83] to assess how different kinds of explanations affected users' trust in the system. The experiment involved 109 users, who achieved an accuracy of approximately 73% on a subsample of 60 images from the dataset. Therefore, considering this accuracy to be representative, all models trained in this study outperformed humans in the facial expression classification task. This indicates the suitability of the models and the training process for this task, underscoring the importance of further investigation into their inner workings and their similarity to human reasoning.

Model comparison
When analyzing the heatmaps of the deep pre-trained networks and the lower-depth non-pre-trained ones (see Figures 8 and 9), the former focus on localized regions, while the latter consider more dispersed regions within the face. This qualitative observation is consistent with the computed dendrograms (see Figure 10), where lower-depth networks cluster together in four out of the six expressions (disgust, happiness, surprise, and sadness). The evident similarities within these two groups could be attributed to the presence of pre-training, the depth of the models, or both. Since the pre-training was done on the same dataset, it is possible that the models learned similar patterns that were preserved even after fine-tuning them for the new task. The lower variability observed in the group of pre-trained models compared to the non-pre-trained ones, as shown in the dendrograms, might indicate this phenomenon. However, this clustering could also be related to the depth of the networks, as SilNet, WeiNet, SongNet, and AlexNet are relatively shallow compared to the others.
Diving deeper into the similarities between networks, we observe groupings among those that share a common base architecture, such as SilNet and WeiNet, VGG16 and VGG19, and InceptionV3 and Xception. This similarity is directly evident in the computed heatmaps for most facial expressions. This correlation is likely due to their similar architecture, which causes the models to converge in a similar manner and attribute importance to the same regions of the face.

Human-like similarities of the models
Regarding the important regions used by the networks for classification, we first conducted a qualitative comparison of the heatmaps. Considering all networks globally (see Figure 7), we observed several patterns in how different expressions are processed. Some expressions are analyzed by the networks in a similar manner to humans, while others focus on a subset of important regions, and still others include regions not typically considered by human perception based on Ekman's descriptions [84]. For example, the happiness expression exhibits a human-like behavior by concentrating on the lower part of the face, likely focusing on the smile. In the case of the disgust expression, the networks highlight regions important to humans, such as the nose and mouth, although the brows-forehead section is less prominent. Conversely, for expressions like anger, while the networks do pay attention to the lower part of the face, they also consider many other facial regions that are not related to AUs.
Analyzing the quantitative results from the comparison between the models' heatmaps and the computed masks based on Ekman's descriptions (see Tables 3 and 4), the findings indicate significant differences, regardless of the model architecture. This suggests that the models tend to focus on different facial regions than humans do. One major difference is the consistent emphasis on certain features across all expressions, particularly the mouth, for all deep pre-trained models. This indicates that the mouth is regarded as an important feature regardless of the expression, which does not align with Ekman's descriptions. Another notable discrepancy is the models' tendency to excessively focus their attention on specific regions of the face, disregarding many regions considered as important for human perception.

Implications of the findings
Neural networks have surpassed human performance in various recognition tasks. This success has led to questions about whether these models function similarly to human vision, with researchers noting their similarities [24] and proposing strategies to understand the alignment between neural networks and humans [85,86]. The FER context is not an exception, as evidenced by the existing studies described in Section 2 [47,50,52,53,55].
Therefore, the insights of this work have several implications. We present results showing a poor correlation between human perception and the models' rationale when classifying facial expressions. These results are supported by different metrics and studied across diverse CNN architectures, considered among the best models of human visual object processing [13,14]. These insights can encourage the development of more human-centric neural networks to improve their performance, improve transfer, help generate human-like errors, or explain their behaviour to humans [24-27, 87].
Building on these last points, humans often display cognitive anthropomorphism; that is, they tend to think AI operates in a similar way to humans. Therefore, when the FER system presents misclassifications or inexplicable decisions that seem illogical to humans, this often results in distrust towards the system, especially in high-stakes domains.
Consequently, one of Mueller's strategies can be used to cope with the unexpected behavior and build trust [16]: change the AI to function in a more human-like manner, change the human (e.g. training humans' expectations), or change the interaction between human and AI (e.g. showing explanations). The strategy to use will depend on the domain and the identified requirements for the system.
Deep learning based solutions have demonstrated strong performance in FER, and although their complexity can pose challenges for user trust, it is precisely this complexity that may enable them to tackle diverse and complex challenges effectively. Consequently, efforts to enhance user trust in DL models could be achieved by changing the human or the interaction between human and AI. On the one hand, by changing the human, the system would focus on demonstrating its effectiveness and robustness, instead of making it more 'human-like'. Indeed, some studies on user trust towards DL models have shown that the accuracy of the models is more influential than the presence of explanations in building trust for tasks such as facial expression recognition [83]. Therefore, proving the reliability and high performance of the models could be more crucial for building users' trust than aligning them with human interpretability.
On the other hand, in critical applications like medical diagnosis, understanding how models function is crucial for their deployment in real systems. Therefore, by changing the interaction between human and AI, we would offer explanations to users on how the AI reaches specific outcomes. These explanations are essential for grasping the rationale behind the AI's decisions. Providing clear explanations helps users comprehend the AI's decision-making process, fostering confidence and enabling more informed use of these systems [88].
Finally, changing the AI by aligning it with the human visual system would enable it to better meet human expectations: the outputs and decisions made by the FER system would be more understandable to humans and aligned with their mental model. This enhances transparency and builds trust by allowing users to understand why the system made a particular decision, and it becomes easier to identify and correct errors, which is crucial for improving the reliability and accuracy of AI systems over time.
The results of our work contribute to understanding the internal workings of CNN-based FER systems. This knowledge will help to further advance the field and guide decisions on how to cope with the disparity between human perception and the system, based on the problem's needs.

Conclusion
In this study, we investigated the similarities between deep learning models and human perception in identifying the six basic facial expressions. To achieve this, twelve different networks were trained on an ensemble of five datasets for the facial expression recognition task. An explanation and standardization process was applied to uncover the important facial regions for each network and expression, represented as heatmaps. Subsequently, the obtained results were evaluated both qualitatively and quantitatively, comparing them to ground truth masks representing human perception of the expressions.
The results reveal a significant disparity between the networks and humans in recognizing facial expressions, with values under 0.32 IoU and 0.47 F1 score in all cases. Qualitative analysis of the heatmaps, coupled with higher precision than recall for some networks and expressions, indicates a tendency of the models to overly focus on certain facial regions while neglecting others that are considered important by human perception. The consistent attribution of importance to the same regions (usually the mouth) regardless of the expression is another notable discrepancy.
The comparison of similarity between models highlights two main clusters: one comprising pre-trained deep models and another consisting of shallower models without pre-training. Inter-cluster differences are evident in the computed heatmaps, underscoring the impact of pre-training, which may have preserved similar patterns even after fine-tuning, and of the depth of the models. Additionally, architectural similarities among models contributed to smaller differences in their heatmaps, as seen with the VGG16 and VGG19 models.
Given the negative impact that the lack of correlation between AI models and human perception has on user trust, we propose two research directions: (I) exploring how to build user trust in the models despite this discrepancy, either by changing the human or the interaction, and (II) adapting the models (changing the AI) to resemble human perception while maintaining their performance.
Nonetheless, we believe that research exploring and identifying similarities between CNNs and human perception, along with their systematic comparison, offers a valuable foundation for developing human-inspired computational models.

Fig. 1 :
Fig. 1: Samples for each class (by columns) available from each of the datasets (by rows) used in this study.

Fig. 3 :
Fig. 3: Images from the different steps involved in the explanation of an image. By rows, an example of each class: Anger, Disgust, Fear, Happiness, Sadness and Surprise. By columns: a) image being explained, b) detected face landmarks, c) superpixels computed using SLIC segmentation, d) LIME explanation, e) transformed input image using the normalized landmarks coordinates, and f) transformed LIME relevance for each region in gray scale (for further heatmap computation), using the same landmarks.
[0, 1], indicating the pixels' probability of being used by the network to classify the image in the facial expression c. Values close to 1 have more influence over the classification of the expression, while values closer to 0 are less important for the classification. H_jkc are heatmaps summarizing the explanations of a single training (approximately 100 images per class); H_jc group different trainings for the same network, showing the usually important regions for a model (100 × 5 = 500 images per class); finally, H_c also group the results obtained by different networks, showing the most relevant regions globally among all networks (100 × 5 × 12 = 6000 images per class).
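A minimal sketch of this aggregation hierarchy, assuming toy 4-pixel flat heatmaps and two training runs per network in place of the real data:

```python
# Sketch of the heatmap aggregation hierarchy: per-training heatmaps H_jkc are
# averaged pixel-wise into per-network heatmaps H_jc, which are averaged in
# turn into the global per-expression heatmap H_c. Toy 4-pixel heatmaps and
# two runs per network stand in for the real data.

def mean_heatmap(heatmaps):
    """Pixel-wise mean of a list of equally sized flat heatmaps."""
    n = len(heatmaps)
    return [sum(h[i] for h in heatmaps) / n for i in range(len(heatmaps[0]))]

# H_jkc: one heatmap per training run k, for network j and expression c
runs_net_a = [[1.0, 0.0, 0.5, 0.5], [0.8, 0.2, 0.5, 0.5]]
runs_net_b = [[0.0, 1.0, 0.5, 0.5], [0.2, 0.8, 0.5, 0.5]]

h_a = mean_heatmap(runs_net_a)          # H_jc for network a
h_b = mean_heatmap(runs_net_b)          # H_jc for network b
h_global = mean_heatmap([h_a, h_b])     # H_c across networks
```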

Fig. 5 :
Fig. 5: Mean accuracy of the cross validation for each trained network.

Fig. 6 :
Fig. 6: Color map used to represent the importance of each region in the different heatmaps.

Fig. 7 :
Fig. 7: Resulting per-class heatmaps of the average explanations between all trained networks.

Fig. 8 :
Fig. 8: Resulting heatmaps of the four low depth networks not using pre-trained weights.By columns, heatmaps of the same network for the different expressions.By rows, heatmaps of the different networks for the same expression.


Fig. 9 :
Fig. 9: Resulting heatmaps of the eight deeper networks using pre-trained weights.By columns, heatmaps of the same network for the different expressions.By rows, heatmaps of the different networks for the same expression.

Table 1 :
Networks used in this study to classify facial expressions. The number of parameters shown corresponds to the obtained model after replacing the final fully-connected layers.

Table 4 :
Precision and recall between the computed heatmaps for each network and expression and the Ekman masks. The last column shows the average by network, and the last row shows the average by expression. The best result for each column (and for the last row) is highlighted in bold.

Table 5 :
Comparison of the best and worst performing models for each facial expression. Performance is evaluated using IoU and F1 to compare the binarized explanation heatmaps with the Ekman masks. The table lists, for each expression, the top three best and worst models along with a brief explanation of the observed differences in their performance.