1 Introduction

Facial emotion is an important cue to human behavior and intention. Its recognition can be applied in many scenarios, such as security monitoring [1], behavior understanding [2, 3], and medical diagnosis [4]. However, facial emotion recognition in the wild from facial images remains challenging, due to irrelevant information such as background, occlusion, and illumination variation, and due to the expression ambiguity stemming from intra-class diversity and inter-class similarity. To address these challenges, many works have utilized attention mechanisms in deep models as an effective way to emphasize interesting features while suppressing irrelevant ones [5,6,7,8]. At the same time, many studies have explored probability distributions in emotion recognition tasks [9, 10]. A probability distribution is a soft label that describes the confidence level for each category and can better characterize intra-class diversity and inter-class similarity across facial expression images. Hence, it allows deep models to learn the ambiguity of expressions, better guiding them to learn the diversity within the same expression and the similarity between different expressions.

Recently, mutual learning has been introduced into deep models. Its core idea is imitation learning between deep models at the feature or output level, which can usually be achieved with attention mechanisms and probability distributions. However, few methods in the expression recognition field focus on exploring mutual learning. Moreover, most existing mutual learning methods in other fields constrain the models toward consistent learning while generally neglecting to maintain each model's self-learning ability, which may ruin the dynamics of mutual learning [11, 12]. Therefore, we propose harmonious mutual learning for emotion recognition. We construct a framework consisting of two parallel networks and perform progressive mutual teaching in the feature layers and the output layers. Specifically, we design a self-mutual attention learning (SMAL) module in the backbone to transfer feature information, deploying different attention components for the two networks to increase the dynamics of mutual learning. We then design a probability distribution distillation learning (PDDL) module in the classification head for bidirectional class semantic interaction and mutual calibration, which improves emotion recognition performance. The main contributions of our work are summarized as follows:

  • We propose a novel harmonious progressive mutual learning framework containing two parallel networks, which jointly utilize attention mechanisms and probability distributions.

  • We construct a self-mutual attention module by using distinct attention components for two networks, which facilitates mutual learning of enhanced and supplementary features while preserving self-learning capabilities.

  • In the classification head, we introduce bidirectional probability distribution distillation learning through KL loss, with an objective of mutual learning of class ambiguity and the calibration of the two networks.

  • We demonstrate the effectiveness of our framework on three publicly available datasets, where our method achieves state-of-the-art performance.

2 Related Works

In this section, we mainly review the most relevant works about mutual learning, including feature-based approaches and probability distribution-based approaches.

2.1 Feature-Based Approaches

Mutual learning has received extensive research attention, and attention mechanisms have been widely applied in it. For example, Ma et al. [13] fused global and local representations and applied a softmax function for attention learning on the fused features, achieving implicit mutual teaching between local semantics and global long-term dependencies. Zhang et al. [14] employed an L2 loss to quantify the disparity between cross-modality attention features across RGB images, IR images, and mixed-modality images. Liu et al. [15] viewed the shallow-to-deep layers of a CNN as "experts" with different perspectives; they constructed attention images for each expert and fed them to the other experts to achieve cross-layer mutual learning. However, methods focusing on mutual learning are still lacking in the field of facial expression recognition. Some metric learning methods [16,17,18,19,20], which typically use feature metric functions to construct imitation losses, can be considered forms of mutual learning. However, these methods often rely on additional information such as facial landmarks, action units (AUs), and head pose. Accurately acquiring this information is itself a significant challenge, and its improper use may harm the effectiveness of emotion recognition.

2.2 Probability Distribution-Based Approaches

A probability distribution, here referring to the class posterior probabilities output by a deep model, describes the class confidence levels for the input. By aligning probability distributions, mutual learning can be performed as mutual teaching from the perspective of class semantics. Zhang et al. [21] constructed multiple peer networks that use the KL loss between probability distributions to enforce consistency among the models, so that they train in a mutual learning manner. Bian et al. [11] constructed a handwritten mathematical expression recognition model consisting of a shared encoder and two parallel inverse decoders, where the two decoding branches align their output probability distributions at each time step through a KL loss for mutual learning of complementary information. Xu et al. [22] used probability distribution alignment to achieve mutual learning and adaptation across multi-source domain data. Qiao et al. [12] obtained cross-modal attention representations based on the softmax function and then aligned the probability distributions for mutual learning between global and local semantics. Wang et al. [23] designed a mutual learning network between overall and occluded images, realized by aligning the probability distributions of the two types of images. However, the above methods primarily focus on mutual imitation or consistency learning, i.e., constraining the models to generate the same outputs, while ignoring each model's distinct self-learning ability. This can homogenize the knowledge among different models, which in turn limits their mutual influence and disrupts the dynamics of mutual learning.

Inspired by existing works, this paper proposes a harmonious mutual learning method for facial emotion recognition that combines attention mechanisms with the distillation learning of probability distributions. Unlike previous works, we design different attention components for the models in mutual learning so that they retain the ability to learn diverse features. This not only promotes mutual knowledge transfer between the models, but also maintains diverse learning abilities and preserves the dynamics of mutual learning. At the same time, we design a bidirectional probability distribution alignment to facilitate the mutual transfer of class semantics, which is beneficial for handling expression ambiguity and improving recognition performance.

3 Methodology

In this section, we present the proposed mutual learning method based on attention mechanisms and probability distributions. We capture the expression information of interest, as opposed to irrelevant facial information, through self-mutual attention learning. Further, by combining it with probability distribution distillation learning, we can calibrate the classification to increase emotion recognition performance.

3.1 The Overall Architecture

The proposed architecture contains two networks, Net1 and Net2, and two interactive modules, as shown in Fig. 1. We adopt resnet50 as the basic architecture of each network, yielding a backbone with five convolutional blocks and a classification head with one fully-connected layer. The two interactive modules are the self-mutual attention learning (SMAL) module and the probability distribution distillation learning (PDDL) module. The former, integrated in the backbone, implements self-attention within each network and mutual attention between the networks by embedding two submodules: the spatial attention module (SAM) and the convolutional block attention module (CBAM) [24]. The latter implements bidirectional class semantic interaction using a KL loss. More details are described in the rest of this section.

Fig. 1 The proposed framework. Our framework consists of two networks and conducts progressive mutual learning. In the backbone, the SMAL module conducts enhanced and complementary learning of interesting features to capture more discriminative patterns, using different attention components: SAM and CBAM. In the classification head, the PDDL module uses the probability distributions output by the peers to conduct class semantic interaction for correcting each other's learning

3.2 Self-Mutual Attention Learning

In the mutual learning of the backbone, we aim to capture the importance of facial features while combining enhanced and supplementary learning of interesting features to counteract irrelevant facial information. Therefore, the SMAL module mainly includes self-attention branches and mutual-attention branches, as shown in Fig. 2. The self-attention captures the saliency of the feature maps, and the mutual attention captures saliency with enhanced and complementary properties. Meanwhile, we want to increase the diversity of the two networks in order to increase the dynamics of mutual learning. Therefore, we deploy two different attention components for the two networks, namely the SAM and CBAM modules. We integrate the SMAL module into the fourth and fifth blocks of the backbone.

SAM and CBAM are used for the self-attention learning of Net1 and Net2, respectively, and the information interaction is realized through the attention maps of SAM and CBAM. Formally, let the feature map output by a convolutional block of Net1 be \({\textbf{F}}^{(1)}\in {\mathbb {R}}^{C\times H \times W}\), where C is the number of channels and H and W are the height and width of the feature map. For Net1, self-mutual attention learning can be expressed as:

$$\begin{aligned} \begin{aligned} \hat{\textbf{F}}^{(1)} = (1+{\textbf{M}}_s)\otimes {\textbf{F}}^{(1)}+detach({{\textbf{M}}}_c)\otimes dropout2d({\textbf{F}}^{(1)}) \end{aligned} \end{aligned}$$
(1)

where \({\textbf{M}}_s\) and \({\textbf{M}}_c\) respectively denote the self-attention weight map learned by the SAM module and the mutual-attention weight map learned by the CBAM module, \(\otimes \) denotes element-wise multiplication, dropout2d randomly zeroes out entire channels, and the detach operator prevents gradients from backpropagating toward Net2. However, the attention information from Net2 is not always applicable or beneficial. Therefore, we insert a dropout2d operator for random suppression, which encourages the network to learn adaptively from Net2. For Net2, self-mutual attention learning is implemented in the same way and can be expressed as:

$$\begin{aligned} \begin{aligned} \hat{\textbf{F}}^{(2)} = (1+{\textbf{M}}_c)\otimes {\textbf{F}}^{(2)}+detach({{\textbf{M}}}_s)\otimes dropout2d({\textbf{F}}^{(2)}) \end{aligned} \end{aligned}$$
(2)

where \({\textbf{F}}^{(2)}\) is the feature map output by the convolution block of Net2.
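As an illustration, the following PyTorch sketch applies Eqs. (1) and (2) given precomputed attention maps. The function name, the dropout probability, and whether dropout2d stays active at inference are assumptions, since the text only specifies the equations themselves.

```python
import torch.nn.functional as F

def smal_fuse(f1, f2, m_s, m_c, p_drop=0.1, training=True):
    """Sketch of Eqs. (1)-(2): each network keeps gradients for its own attention
    map and receives a detached map from its peer, with dropout2d as the random
    suppression of the mutual branch.

    f1, f2 : feature maps of Net1/Net2, shape (N, C, H, W)
    m_s    : SAM attention map computed from f1, shape (N, 1, H, W)
    m_c    : CBAM spatial attention map computed from f2, shape (N, 1, H, W)
    p_drop : dropout2d probability (assumed value; not given in the paper)
    """
    # Eq. (1): self-attention via M_s plus detached mutual attention via M_c.
    f1_hat = (1.0 + m_s) * f1 + m_c.detach() * F.dropout2d(f1, p=p_drop, training=training)
    # Eq. (2): the symmetric fusion for Net2.
    f2_hat = (1.0 + m_c) * f2 + m_s.detach() * F.dropout2d(f2, p=p_drop, training=training)
    return f1_hat, f2_hat
```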

Fig. 2 The SMAL module

SAM takes the feature map \({\textbf{F}}^{(1)}\) as input. It first uses two parallel operators, average pooling and max pooling, to compress the channel information and generate two 2D maps: \({\textbf{F}}_{avg}^s\in {\mathbb {R}}^{H \times W}\) and \({\textbf{F}}_{max}^s\in {\mathbb {R}}^{H \times W}\). Then, a concat layer stacks the two maps along the channel dimension. Next, a convolutional layer followed by a sigmoid activation serves as a fusion operator to produce the comprehensive spatial attention map \({\textbf{M}}_s\). Mathematically, spatial attention can be expressed as follows:

$$\begin{aligned} \begin{aligned} {\textbf{M}}_{s}&= \sigma (f^{3\times 3}([avgpool({\textbf{F}}^{(1)});maxpool({\textbf{F}}^{(1)})]))\\&=\sigma (f^{3\times 3}([{\textbf{F}}_{avg}^s;{\textbf{F}}_{max}^s])) \end{aligned} \end{aligned}$$
(3)

where \(\sigma \) denotes the sigmoid function, \([\cdot ]\) denotes concatenating the features along the channel dimension, and \(f^{3\times 3}\) denotes a convolution operation with a filter size of 3\(\times \)3.
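A minimal PyTorch sketch of SAM following the description above; padding of 1 for the 3\(\times \)3 convolution is assumed so that the attention map keeps the spatial size.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Spatial attention module (Eq. (3)): channel-wise average/max pooling,
    concatenation, a 3x3 convolution and a sigmoid."""

    def __init__(self):
        super().__init__()
        # Two input channels (avg map and max map) -> one attention map; padding keeps H x W.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                               # x: (N, C, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)    # F_avg^s, shape (N, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # F_max^s, shape (N, 1, H, W)
        return self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))  # M_s
```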

CBAM is the combination of channel attention and spatial attention, as shown in Fig. 3. CBAM first conducts max pooling and average pooling over the spatial dimensions of \({\textbf{F}}^{(2)}\), obtaining the channel descriptors \({\textbf{v}}_{max}\in {\mathbb {R}}^{C}\) and \({\textbf{v}}_{avg}\in {\mathbb {R}}^{C}\). These pass through a shared multilayer perceptron (MLP) with two fully-connected layers. Finally, the channel attention map \({\textbf{v}}^c\in {\mathbb {R}}^{C}\) is obtained by a sigmoid function, as shown below:

$$\begin{aligned} \begin{aligned} {\textbf{v}}^c&=\sigma (MLP(avgpool({\textbf{F}}^{(2)}))+MLP(maxpool({\textbf{F}}^{(2)}))) \\&= \sigma (W_1(W_0({\textbf{v}}_{avg}))+W_1(W_0({\textbf{v}}_{max}))) \end{aligned} \end{aligned}$$
(4)

where \(W_0\) and \(W_1\) are the parameters of the MLP. Then, the channel-attended features are obtained as \({\textbf{F}}^c ={\textbf{F}}^{(2)}\odot {\textbf{v}}^c \in {\mathbb {R}}^{C\times H \times W}\), where \(\odot \) denotes channel-wise multiplication. \({\textbf{F}}^c \) is taken as the input of spatial attention, so the attention map of the CBAM module, \({\textbf{M}}_{c}\), can be computed by the following formula:

$$\begin{aligned} \begin{aligned} {\textbf{M}}_{c}&= \sigma (f^{3\times 3}([avgpool({\textbf{F}}^c );maxpool({\textbf{F}}^c )]))\\&=\sigma (f^{3\times 3}([{\textbf{F}}_{avg}^c;{\textbf{F}}_{max}^c])) \end{aligned} \end{aligned}$$
(5)
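For completeness, a sketch of the CBAM attention-map computation (Eqs. (4) and (5)); the channel reduction ratio of the shared MLP and the ReLU between its two layers follow the original CBAM design and are assumptions here, as the text does not state them.

```python
import torch
import torch.nn as nn

class CBAMAttention(nn.Module):
    """CBAM-style attention for Net2: channel attention (Eq. (4)) followed by
    spatial attention (Eq. (5)); the forward pass returns the spatial map M_c."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared two-layer MLP (W_0, W_1) for channel attention.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W_1
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: (N, C, H, W)
        n, c, _, _ = x.shape
        v_avg = self.mlp(x.mean(dim=(2, 3)))                 # MLP(avgpool(F^(2)))
        v_max = self.mlp(x.amax(dim=(2, 3)))                 # MLP(maxpool(F^(2)))
        v_c = self.sigmoid(v_avg + v_max).view(n, c, 1, 1)   # channel attention map, Eq. (4)
        f_c = x * v_c                                        # channel-attended features F^c
        avg_map = f_c.mean(dim=1, keepdim=True)
        max_map, _ = f_c.max(dim=1, keepdim=True)
        return self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))  # M_c, Eq. (5)
```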

In our architecture, the learnable parameters of SAM and CBAM are independent. Therefore, the two networks can both learn self-attention and conduct mutual attention, enabling the learning of interesting features with enhanced and supplementary properties, which can be understood intuitively in Sect. 4.6.

Fig. 3 The CBAM module

3.3 Probability Distribution Distillation Learning

Further, probability distribution distillation learning is integrated into the classification head, as shown in Fig. 4. Assume that in the forward propagation of a network, the output class scores for a sample \({\textbf{x}}_i\) are \([s_{i1}, s_{i2},\ldots , s_{iK}]\), where K is the number of emotion classes. Then, in the probability distribution learned by the network conditioned on \({\textbf{x}}_i\), the probability \(d_{ij}\) that the sample belongs to the j-th class (we omit the superscript (1) or (2) for simplicity) is obtained by a softmax function with temperature parameter T:

$$\begin{aligned} \begin{aligned} d_{ij} = p(y_{i}=j \mid {\textbf{x}}_i;{\textbf{W}}) = \frac{exp(s_{ij}/T)}{\sum _{r=1}^K exp(s_{ir}/T)} \end{aligned} \end{aligned}$$
(6)

where \({\textbf{W}}\) represents the learnable parameters. The larger the value of T, the smoother the probability distribution, and vice versa. Using formula (6), we obtain the learned probability distributions of the two networks and denote them as \({\textbf{d}}^{(1)}\) and \({\textbf{d}}^{(2)}\), respectively.
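As a small sketch, the temperature-scaled softmax of formula (6) in PyTorch; T = 1.2 follows the setting reported in Sect. 4.2.

```python
import torch

def t_softmax(scores, T=1.2):
    """Temperature-scaled softmax (formula (6)); scores has shape (N, K)."""
    return torch.softmax(scores / T, dim=1)
```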

Fig. 4 Bidirectional probability distribution distillation learning. T-softmax denotes the softmax function with temperature parameter T

In this stage, each network takes the probability distribution output by the other as a latent ground-truth probability distribution. The optimization objective is to minimize the error between the two probability distributions. Therefore, a bidirectional KL loss is adopted to measure the error of mutual class knowledge distillation, as shown in the following:

$$\begin{aligned} \begin{aligned} L_{bi-kl}&=KL({\textbf{d}}^{(1)}\Vert {\textbf{d}}^{(2)})+KL({\textbf{d}}^{(2)}\Vert {\textbf{d}}^{(1)}) \\&=\frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^K (d_{ij}^{(1)} \log d_{ij}^{(1)}+d_{ij}^{(2)} \log d_{ij}^{(2)}-d_{ij}^{(1)} \log d_{ij}^{(2)}-d_{ij}^{(2)} \log d_{ij}^{(1)}) \end{aligned} \end{aligned}$$
(7)

where N is the batch size. In emotion recognition systems, one-hot (hard) labels are usually used for error measurement during training. A hard label can be regarded as supervision information that assigns 100% confidence to a single class, which can easily lead to overfitting, especially in FER systems, because different emotions share great similarities and a facial image can carry different emotional intensities. Hard labels therefore force the network to fit one class with 100% confidence, which damages generalization. By contrast, a probability distribution is a class description vector whose elements lie in [0,1] and describe the class strength and the similarity between classes. Thus, a probability distribution can serve as a soft target and provide more transferable knowledge during supervised learning. With the bidirectional probability distribution transfer, each of the two networks learns the latent ground-truth probability distribution of its peer and uses it as a soft target to correct the supervised learning guided by the hard labels, so that mutual calibration is achieved.
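The bidirectional KL term of formula (7) can be sketched as follows; whether each network treats its peer's distribution as a fixed (detached) target is an implementation detail the text does not specify, so it is exposed as an option.

```python
import torch
import torch.nn.functional as F

def bi_kl_loss(scores1, scores2, T=1.2, detach_targets=False):
    """Bidirectional KL loss of formula (7) between the class scores of Net1 and Net2.

    scores1, scores2 : raw class scores, shape (N, K)
    detach_targets   : if True, each direction uses the peer's distribution as a
                       fixed target (an assumption; not stated in the paper).
    """
    d1 = torch.softmax(scores1 / T, dim=1)
    d2 = torch.softmax(scores2 / T, dim=1)
    t1 = d1.detach() if detach_targets else d1
    t2 = d2.detach() if detach_targets else d2
    # KL(d1 || d2) + KL(d2 || d1), averaged over the batch as in formula (7).
    kl_12 = F.kl_div(torch.log(d2 + 1e-12), t1, reduction="batchmean")
    kl_21 = F.kl_div(torch.log(d1 + 1e-12), t2, reduction="batchmean")
    return kl_12 + kl_21
```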

3.4 Emotion Recognition Loss

The cross entropy function is used as the loss function of emotion classification. For sample \({\textbf{x}}_i\), assuming its one-hot label is \({\textbf{y}}_i=[y_{i1},y_{i2},\ldots ,y_{iK}]\), the loss function for a batch can be expressed by:

$$\begin{aligned} \begin{aligned} L_c = -\frac{1}{N}\sum _{i=1}^N \sum _{j=1}^K \ y_{ij} \log p_{ij} \end{aligned} \end{aligned}$$
(8)

where \(p_{ij}\) is the output of the softmax function. The value of \(y_{ij}\) is 0 or 1, indicating whether \({\textbf{x}}_i\) belongs to the j-th class:

$$\begin{aligned} y_{ij} = {\left\{ \begin{array}{ll} 1, {\textbf{x}}_i \ \text {belongs to class} \ j \\ 0, \text {otherwise } \end{array}\right. } \end{aligned}$$
(9)

3.5 Multi-loss Strategy and Inference

We train the proposed architecture through a multi-loss strategy, including emotion recognition losses for Net1 and Net2, as well as the bi-directional KL loss. The final optimization objective is as follows:

$$\begin{aligned} \begin{aligned} L=L_{bi-kl}+L_{c-Net1}+L_{c-Net2} \end{aligned} \end{aligned}$$
(10)

The multi-loss strategy allows the two networks to train jointly. In this way, each network performs supervised learning with hard labels and accepts the soft target from the other to achieve compromise training, thereby boosting generalization.
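A condensed joint training step under the objective of formula (10), reusing the bi_kl_loss sketch above; net1 and net2 are placeholders assumed to return raw class scores of shape (N, K), and a single optimizer over both networks is assumed here for brevity (Sect. 4.2 uses two optimizers on SFEW).

```python
import torch.nn.functional as F

def train_step(net1, net2, optimizer, images, labels, T=1.2):
    """One joint update with L = L_bi-kl + L_c-Net1 + L_c-Net2 (formula (10))."""
    s1 = net1(images)
    s2 = net2(images)
    loss = (
        bi_kl_loss(s1, s2, T=T)          # mutual distillation term, formula (7)
        + F.cross_entropy(s1, labels)    # L_c-Net1, formula (8)
        + F.cross_entropy(s2, labels)    # L_c-Net2, formula (8)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```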

In the inference phase, we use the fusion of Net1 and Net2 as the prediction, which can be formulated as:

$$\begin{aligned} \begin{aligned} y= \arg \max _{j} \frac{1}{2} \left( s^{(1)}_j+ s^{(2)}_j \right) \end{aligned} \end{aligned}$$
(11)

where \(s_j^{(1)}\) and \(s_j^{(2)}\) are the class scores of Net1 and Net2, respectively.
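Inference then simply averages the class scores of the two networks and takes the argmax, as in the following sketch.

```python
import torch

@torch.no_grad()
def predict(net1, net2, images):
    """Fusion prediction of formula (11): average the class scores, then argmax."""
    scores = 0.5 * (net1(images) + net2(images))   # (N, K)
    return scores.argmax(dim=1)                    # predicted class indices
```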

4 Experiment

In this section, we first describe the three public datasets, the data preprocessing methods, and the experimental configuration. Then, we compare with state-of-the-art methods to show the superiority of the proposed method. Finally, we show visualizations and perform an ablation study to further demonstrate the effectiveness of the proposed framework.

4.1 Dataset

RAF-DB [25]: It is a popular emotion recognition dataset with 7 emotion classes: surprise, fear, disgust, happiness, sadness, anger and neutral. It contains 15,339 real-world facial images collected from the Internet. The dataset provides a protocol for model optimization and performance evaluation, in which 12,271 images are used for training and 3,068 for testing; our experiments follow this protocol.

FER2013 [26]: It is a large-scale dataset collected via the Google search engine. It contains 7 emotion classes and a total of 35,888 images, with 28,709 for training, 3,589 for validation and 3,589 for testing. The image size in this dataset is 48\(\times \)48 pixels.

SFEW [27]: It is a dataset extracted from videos. It contains 1,766 images divided into three parts: 958 for training, 436 for validation and 372 for testing. Since the labels of the test set are used for a competition and remain private, we train on the training set and report recognition accuracy on the validation set, following the previous methods compared in our experimental analysis.

In our experiments, facial images in the RAF-DB and SFEW datasets are aligned using the landmarks provided with the dataset and the landmarks found by the detector in [28], respectively, and face regions are extracted and resized to 256\(\times \)256 pixels. For the FER2013 dataset, this preprocessing is omitted since the provided images are already well processed. Data augmentation, such as random cropping and random rotation, is applied to extend the datasets. Example images are shown in Fig. 5; all images are resized to 224\(\times \)224 pixels to fit the input size of the network.
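A possible torchvision preprocessing pipeline matching this description; the rotation range, the horizontal flip, and the absence of normalization are assumptions, since only the augmentation types and the 256/224 sizes are stated.

```python
from torchvision import transforms

# Training-time preprocessing: aligned 256x256 faces -> augmented 224x224 inputs.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomRotation(10),        # assumed rotation range
    transforms.RandomCrop(224),           # random cropping to the network input size
    transforms.RandomHorizontalFlip(),    # common extra augmentation (assumption)
    transforms.ToTensor(),
])

# Test-time preprocessing: deterministic resize only.
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```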

Fig. 5 Some samples from the RAF-DB, FER2013 and SFEW datasets

4.2 Experimental Setting

Our framework is implemented in PyTorch and deployed on a Titan RTX GPU. The momentum, weight decay and batch size are set to 0.9, 0.0001 and 96, respectively. We use the Adam optimizer with an initial learning rate of 0.0002. On the RAF-DB and FER2013 datasets, the learning rate is decayed by a factor of 0.1 every 30 epochs, and the total number of training epochs is 90. The number of images in the SFEW dataset is relatively small, and the training fluctuation is large; we therefore apply two Adam optimizers, one for Net1 and one for Net2. The optimizer for Net1 decays the learning rate every 14 epochs, the optimizer for Net2 decays it every 7 epochs, and the total number of training epochs is 20. In the loss \(L_{bi-kl}\), T is set to 1.2. On the RAF-DB and FER2013 datasets, the two networks are initialized with MS-Celeb-1M [29] and ImageNet [30] pre-trained models, respectively. On the SFEW dataset, we use the MS-Celeb-1M model and a model pre-trained on the RAF-DB dataset to initialize the two networks, respectively.
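A sketch of the SFEW optimizer and schedule setup; the decay factor on SFEW and the Adam betas are not stated in the text, so a factor of 0.1 and library defaults are assumed.

```python
import torch

def build_sfew_optimizers(net1, net2):
    """Two Adam optimizers with the same initial learning rate but different
    step schedules, as described for the SFEW setting."""
    opt1 = torch.optim.Adam(net1.parameters(), lr=2e-4, weight_decay=1e-4)
    opt2 = torch.optim.Adam(net2.parameters(), lr=2e-4, weight_decay=1e-4)
    sched1 = torch.optim.lr_scheduler.StepLR(opt1, step_size=14, gamma=0.1)  # Net1: decay every 14 epochs
    sched2 = torch.optim.lr_scheduler.StepLR(opt2, step_size=7, gamma=0.1)   # Net2: decay every 7 epochs
    return opt1, opt2, sched1, sched2
```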

4.3 Results on RAF-DB Dataset

Table 1 presents the FER performance of the proposed method on the RAF-DB dataset, where we also compare with approaches that utilize attention mechanisms and probability distributions. The methods in [6, 8, 31,32,33] perform attention learning on facial images in an attempt to enhance emotion-related information and suppress the rest. Specifically, Wen et al. [6] embedded spatial attention and channel attention in three parallel networks and enhanced the attention through a fusion operation, achieving an accuracy of 89.70% on the RAF-DB dataset. Indolia et al. [8] established a self-attention module in the convolutional block of resnet using softmax activation to handle the intra-class variation and inter-class similarity of expressions, with an accuracy of 81.06%. Li et al. [31] and Wang et al. [32] proposed region attention networks, which aim to explore the key facial patches for emotion recognition. To improve FER performance, Li et al. [33] constructed a local binary attention module for specific regions to capture the real emotion hidden in the face. Cai et al. [34] proposed a probabilistic attribute tree convolutional neural network to deal with the influence of identity-related attributes and achieved an accuracy of 88.43%. Xi et al. [35] proposed a novel weighted contrastive objective function that measures positive and negative samples labeled by pseudo labels, in order to reduce intra-class variation while enlarging the distance among different instances; this method obtains an accuracy of 86.96%. RAF-DB is an in-the-wild dataset with large variations in background and illumination and many emotion-irrelevant facial attributes. Our method performs self-mutual attention, which is beneficial for the enhanced and complementary learning of interesting features, and obtains an accuracy of 90.71%, surpassing the above methods by clear margins. Besides, our method also shows better recognition performance than the other methods listed in Table 1 that utilize probability distributions [9, 10, 36].

Table 1 Accuracy (%) of different methods on RAF-DB dataset

4.4 Results on FER2013 Dataset

The evaluation on the FER2013 dataset is reported in Table 2 for comparison with state-of-the-art methods. As can be seen, our method obtains an accuracy of 74.06%, which is superior to [8, 34,35,36] and consistent with the comparison on the RAF-DB dataset. The methods in [7, 33, 38], which aim to relieve the influence of irrelevant facial information, also explore attention mechanisms but obtain inferior performance to ours. We also compare with some methods that use multiple networks for emotion recognition. Among them, Li et al. [39] proposed a redundancy reduction method for 35 deep models and obtained an accuracy of 70.66% by fusing multiple networks. Wen et al. presented a probability-based method for the prediction fusion of deep models, with a corresponding accuracy of 69.96% [40]. Our method performs mutual learning of the interesting features and mutual calibration between the two networks, resulting in a better fusion result. Moreover, our architecture is superior to some methods that implicitly utilize the ability of mutual learning through multi-task frameworks or diversified representations. Concretely, Xiang et al. designed a two-stream framework to exert the auxiliary ability of face detection [41], and Shao et al. used handcrafted features to exploit the collaborative learning of different emotion features from an extra extractor [42]. By comparison, our method is 13.36% and 17.42% higher than these two methods, respectively.

Table 2 Accuracy (%) of different methods on FER2013 dataset

4.5 Results on SFEW Dataset

Table 3 summarizes the results on the SFEW dataset. Meng et al. presented a multi-task framework to jointly recognize emotion and identity, in an attempt to relieve the effect of identity on FER [16]. With the same purpose of suppressing identity-related facial information, Liu et al. developed an expression-sensitive (N+M)-tuplet cluster loss and obtained an accuracy of 54.19% [43]. Besides, some methods introduce attention learning [6, 7, 31, 32] and obtain accuracies of 53.18%, 50.00%, 51.47% and 56.40%, respectively. Our method obtains an accuracy of 60.18%. These comparisons show that our method achieves an obviously better result. At the same time, the results indicate its effectiveness compared to methods [9, 10, 36] that introduce probability distributions. Furthermore, our method is superior to methods [33, 34, 36] that report emotion recognition performance on both the RAF-DB and FER2013 datasets.

Table 3 Accuracy (%) of different methods on SFEW dataset

4.6 Visualization

Figure 6 shows heatmap visualizations of the outputs of different methods on the RAF-DB, FER2013 and SFEW datasets, to give a clear and visual understanding of the proposed framework. Resnet+SAM and resnet+CBAM are the architectures of Net1 and Net2 respectively, so we regard them as baseline methods and visualize their heatmaps. As can be seen, for the same image, these two methods always show attentions that have both similarities and differences. For example, both have relatively large response values in key areas around the nose, mouth and eyes, while there are also obvious differences between them. This observation inspires us to deploy the two networks with SAM and CBAM respectively and perform mutual attention learning. In our method, the attention information from Net1 or Net2 selectively passes to the other network, which benefits the learning of interesting features with enhanced and complementary properties and encodes discriminative features for performance improvement.

Fig. 6 The heatmap visualizations for different methods on the RAF-DB, FER2013 and SFEW datasets (three image blocks from top to bottom). The images from the leftmost column to the rightmost column correspond to surprise, fear, disgust, happiness, sadness, anger and neutral, respectively. For each image block, the images from top to bottom correspond to the original image, and the heatmap visualizations of resnet+SAM, resnet+CBAM and ours, respectively

4.7 Ablation Study

Finally, an ablation study is performed to provide further insight into the key modules, and the experimental results are listed in Table 4. We report the performance of single networks with different settings as baselines, i.e., resnet+initialization1, resnet+initialization2, resnet+SAM and resnet+CBAM. Specifically, resnet+initialization1 and resnet+initialization2 adopt resnet50 initialized with the pre-trained MS-Celeb-1M and ImageNet models respectively, corresponding to Net1 and Net2 without any additional modules. Resnet+SAM and resnet+CBAM add the SAM and CBAM modules to these two baselines respectively, meaning that the models only perform self-attention without mutual attention. Besides, we summarize the contribution of the key modules of the proposed framework through the resnet+SMAL method, which adds the SMAL module between Net1 and Net2. As can be seen, on the SFEW dataset the effectiveness of the attention mechanisms is relatively obvious, since the accuracies with and without an attention learning component are 52.25% vs. 50.23% and 56.42% vs. 54.76%. On the RAF-DB and FER2013 datasets, the attention mechanisms tend to play a smaller role. On the three datasets, the Net1 of resnet+SMAL shows consistent superiority over resnet+SAM, with accuracy increases of 0.39%, 1.54% and 2.43%, respectively. Performance gains can also be found in Net2. These results demonstrate the contribution of the SMAL module, which enables mutual attention learning with enhanced and complementary properties. In addition, the PDDL module is also essential, since a further performance gain is observed when comparing our full method with resnet+SMAL. The results on the three datasets demonstrate that our key modules, SMAL and PDDL, are effective in improving emotion recognition performance.

Table 4 Ablation study on the RAF-DB, FER2013 and SFEW datasets

5 Conclusion

In this paper, we present a novel mutual learning method for emotion recognition, which tends to be harmonious because it increases the dynamics of mutual teaching. Specifically, we construct a framework with two emotion recognition networks and perform progressive mutual learning in the backbone and the classification head by utilizing attention mechanisms and probability distributions. The self-mutual attention learning module is integrated into the convolutional blocks of the backbone, allowing discriminative facial features to be encoded through the learning of interesting information with enhanced and complementary properties. In this module, we introduce the SAM and CBAM submodules for the two networks, which preserves their self-learning capability and promotes mutual teaching. Further, we conduct mutual distillation learning in the classification head, enabling mutual calibration for emotion recognition. Experimental results on three public datasets show that the proposed method achieves state-of-the-art performance.