Global-Local Attention for Emotion Recognition

Human emotion recognition is an active research area in artificial intelligence and has made substantial progress over the past few years. Many recent works mainly focus on facial regions to infer human affect, while the surrounding context information is not effectively utilized. In this paper, we propose a new deep network to effectively recognize human emotions using a novel global-local attention mechanism. Our network is designed to extract features from both facial and context regions independently, then learn them together using the attention module. In this way, both the facial and contextual information is used to infer human emotions, therefore enhancing the discrimination of the classifier. Extensive experiments show that our method surpasses the current state-of-the-art methods on recent emotion datasets by a fair margin. Qualitatively, our global-local attention module can extract more meaningful attention maps than previous methods. The source code and trained model of our network are available at https://github.com/minhnhatvt/glamor-net

Many recent works assume that the facial region is the most informative representation of human emotion; they therefore ignore the surrounding context, which has been shown to play an important role in understanding the perceived emotion, especially when the emotions on the face are weakly expressed or indistinguishable [32].
Recently, researchers have been focusing on incorporating background information such as people's pose, gaits, etc., into the model to improve the performance [39] [43]. In this work, we follow the same direction. However, unlike other works that learn the facial and context information independently [39], we propose to jointly learn both facial and context information using our new Global-Local Attention mechanism. We hypothesize that the local information (i.e., facial region) and global information (i.e., context background) have a correlative relationship, and by simultaneously learning the attention using both of them, the accuracy of the network can be improved. This is based on the fact that the emotion of one person can be indicated by not only the face's emotion (i.e., local information) but also other context information such as the gesture, pose, or emotion/pose of a nearby person. To verify the effectiveness of our approach, we benchmark on the CAER-S dataset [32], a large-scale dataset for context-aware emotion recognition. We achieved 77.90% top-1 accuracy on the test set, which is an improvement of 4.38% over the recent state-of-the-art method [32]. Furthermore, with the integrated ResNet-18 [25] as the backbone network, we obtained state-of-the-art performance on the CAER-S dataset with 89.88% classification accuracy. We also present a novel way to create a new static-image dataset from videos of the CAER dataset [32]. The experiments on this new dataset also confirm that our proposed method consistently achieves better performance than previous state-of-the-art approaches.
In summary, our contributions are as follows:
- We propose a new deep network, namely, Global-Local Attention for Emotion Recognition Network (GLAMOR-Net), that surpasses the state-of-the-art methods in the emotion recognition task.
- In GLAMOR-Net, we propose the Global-Local Attention module, which successfully encodes both local features from facial regions and global features from the surrounding background to improve the human emotion classification accuracy.
- We perform extensive experiments to validate the effectiveness of our proposed method and the contribution of each module on recent challenging datasets.
The paper is organized as follows: We review the related work in Section 2. We then describe our methodology in detail in Section 3. In Section 4, we present extensive experimental results on challenging datasets and analyze the contribution of each module in GLAMOR-Net. Finally, we conclude the paper and discuss future work in Section 5.

Human Emotion
In the late twentieth century, Ekman and Friesen discovered six basic universal emotions including anger, disgust, fear, happiness, sadness, and surprise [18]. Several years later, contempt was added and considered as one of the basic emotions [37]. However, our affective displays in reality are much more complicated and subtle than these universal emotions suggest. To represent the complexity of the emotional spectrum, many approaches were proposed, such as the Facial Action Coding System [8], where all facial actions are described in terms of Action Units (AUs), or dimensional models [46], where affect is quantified by values chosen over continuous emotional scales like valence and arousal. Nevertheless, models that use discrete emotion categories are the most popular in automatic emotion recognition tasks because they are easier to interpret and more intuitive to humans.

Emotion Recognition
In automatic human emotion recognition, many approaches mainly focus on analyzing facial expression. Thus, a standard emotion recognition system usually consists of three main stages: face detection, feature extraction and expression classification [58] [23] [38] [6]. Traditional methods relied on handcrafted features (LBP [51], HOG [5]) to extract meaningful features from input images, and classifiers (such as SVM or random forest) to classify human emotions based on the extracted features. With the rise of deep learning, CNN-based methods have made significant progress in the task of emotion recognition [34]. Apart from the input image, other works focus on categorizing emotions by utilizing extra information such as speech [26] [19], human pose [50], body movements and gaits [47] [55]. However, these works rely on the information coming from a single modality, hence they have limited ability to fully exploit all usable information about human emotions.
To overcome this limitation, many studies have investigated the use of multiple modalities. Primarily, these works tried to fuse multiple channels of information from each modality to predict emotion. Castellano et al. [4] used extracted features from three different modalities (facial expressions, body gestures and speech expressions), and then fused those modalities at two different levels (i.e. feature level and decision level). Their results showed that the fusion performed at the feature level provided better results than the one performed at the decision level. Sikka et al. [52] extracted different visual features such as SIFT-Bag of Words [53], LPQ-TOP [45], HOG [12], PHOG [3], and GIST [44] and fused them with audio features by building a kernel from each set of features, then combined them using an SVM classifier. Likewise, the authors in [56] used the same multi-modality approach but with deep learning techniques. In [39], three interpretations of context information are fused together by a deep neural network to classify human emotions in an end-to-end manner.
Fig. 2 The architecture of our proposed network. The whole process includes three steps. First, we extract the facial information (local) and context information (global) using two Encoding Modules. Second, we feed the extracted face and context features into the Global-Local Attention (GLA) module to perform attention inference on the global context. Lastly, we fuse both features from the facial region and output features from GLA into a neural network to make the final emotion classification.
Recently, many works have focused on exploring context-aware information for emotion recognition. Kosti et al. [29] and Lee et al. [32] proposed two architectures based on deep neural networks for learning context information. Both of them have two separate branches for extracting different kinds of information. One branch focuses on human features (i.e. face for [32] and body for [29]) and the other concentrates on surrounding context. When considering multiple modalities, which have a large amount of information, deep learning-based methods like [39] [32] [29] [16] are more suitable and effective than traditional approaches. These multi-modal approaches often yield better classification performance than uni-modal methods.

Attention Model
Attention was first introduced in machine translation [2], allowing the translation model to search for words in the input sentence that are more relevant to the predicted words. Since then, attention models have become an important concept and an essential component of neural network architectures. They have made significant impacts in many application domains, including natural language processing [21], computer vision [57], graphs [33] and speech processing [7].
In emotion recognition, attention models were mainly used to discover the attentive areas of the face that need to be focused on [6]. Recently, CAER-Net-S [32] used attention to force the model to attend to the most discriminative regions of the background. However, this previous work only used the background encoding to learn the context saliency map and did not take advantage of the facial representation to assist the process. Therefore, we propose the Global-Local Attention mechanism, which takes both facial and context encoding as inputs, to utilize facial information more efficiently to guide the context saliency map learning procedure.

Overview
In this work, we assume that emotions can be recognized by understanding the context components of the scene together with the facial expression. Our method aims to perform emotion recognition in the wild by incorporating both the facial information of a person and the contextual information surrounding that person. Our model consists of three components: Encoding Module, Global-Local Attention (GLA) Module, and Fusion Module. Our main contribution is the novel GLA module, which utilizes facial features as the local information to attend better to salient locations in the global context. Fig. 2 shows an overview of our method.

Encoding Module
To detect human emotion, many works first process the image by cropping out the human faces from the scene, and then feed them into a convolutional network to extract facially-expressive features. Our Encoding Module comprises the Facial Encoding Module to learn the face features, and the Context Encoding Module to learn the context features.
Fig. 3 Our proposed encoder network as the feature extractor for both face and context branches. The network contains five convolutional layers with ReLU non-linearity; each convolution except the last one is followed by a max pooling layer to reduce the spatial dimensions of the input.
Facial Encoding Module. This module aims to learn meaningful features from the facial region of the input image. The facial embedding information can be denoted as $F_f = \mathcal{C}(I_f; \theta_f)$, where $\mathcal{C}$ is the convolutional operation parameterized by $\theta_f$, and $I_f$ is the input facial region. In practice, we use a sub-network (Fig. 3) as the feature extractor for the Facial Encoding Module. The proposed sub-network has five convolutional layers. In particular, each convolutional layer has a set of 3×3 filters with strides of 1×1, followed by a Batch Normalization layer and a ReLU activation function. The number of filters starts with 32 in the first layer and increases by a factor of 2 at each subsequent layer except the last one, so the network ends up with 256 output channels. We also pad the input of each convolutional layer to keep the output spatial dimensions the same as the input. The output of each convolutional layer is pooled using a max-pooling layer with strides of 2×2. The encoding module outputs a 256-channel feature map, which is the embedded representation of the input image.
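To make the layer description concrete, the following minimal sketch traces the feature-map shape through the encoder. It is not the authors' implementation: the helper name `encoder_output_shape` and the `pool_last` switch are hypothetical, the input sizes 96×96 and 112×112 are taken from the experimental setup, and pooling is assumed to round down. Because the text and the caption of Fig. 3 differ on whether the last convolution is pooled, both variants are supported, with the Fig. 3 variant as the default.

```python
def encoder_output_shape(height, width,
                         filters=(32, 64, 128, 256, 256),
                         pool_last=False):
    """Trace (height, width, channels) through the five-layer encoder.

    Each 3x3 convolution uses stride 1 with padding, so it only changes
    the channel count. Each 2x2 max pooling with stride 2 halves the
    spatial size (assumption: sizes are rounded down).
    """
    channels = None
    for index, channels in enumerate(filters):
        is_last = index == len(filters) - 1
        # Fig. 3 states that the last convolution is not followed by pooling.
        if pool_last or not is_last:
            height, width = height // 2, width // 2
    return height, width, channels


face_shape = encoder_output_shape(96, 96)        # cropped face input
context_shape = encoder_output_shape(112, 112)   # cropped context input
print(face_shape, context_shape)
```

Under these assumptions the face branch yields a 6×6×256 map and the context branch a 7×7×256 map; with `pool_last=True` both shrink to 3×3×256.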
Context Encoding Module. This module is used to exploit background knowledge to support the emotion prediction process. Similar to the Facial Encoding Module, we follow the same procedure to extract the context information contained in the scene with a different set of parameters: $F_c = \mathcal{C}(I_c; \theta_c)$, where $\mathcal{C}$ is the convolutional operation parameterized by $\theta_c$, and $I_c$ is the input context. As in the Facial Encoding Module, we use the sub-network (Fig. 3) to extract deep features from the background context region. After getting these two feature maps, we feed them into the Global-Local Attention Module to calculate the attention scores for regions in the context. However, if we extract the context information from the raw image, where the faces are visible, the network will also encode the facial information. This can make the attention module produce trivial outputs because the network may focus only on the facial region and omit the context information in other parts of the image. To address this problem, we first detect the face and then hide it in the raw input by setting all the values in the facial region to zero.
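The face-hiding step can be illustrated with a short sketch. It assumes the image is a nested list indexed as `image[row][col][channel]` and that the face detector returns a box `(top, left, bottom, right)` with exclusive bottom and right edges; the function name `mask_face` and this box format are illustrative choices, not the interface of any particular detector.

```python
def mask_face(image, box):
    """Return a copy of `image` with every pixel inside `box` set to zero.

    image: nested list indexed as image[row][col][channel]
    box:   (top, left, bottom, right), bottom/right exclusive (assumed format)
    """
    top, left, bottom, right = box
    # Copy first so the original image can still be used for the face crop.
    masked = [[list(pixel) for pixel in row] for row in image]
    for r in range(max(top, 0), min(bottom, len(masked))):
        for c in range(max(left, 0), min(right, len(masked[r]))):
            masked[r][c] = [0] * len(masked[r][c])
    return masked


# Toy 4x6 RGB image with a hypothetical face box covering rows 1-2, cols 2-4.
image = [[[255, 128, 64] for _ in range(6)] for _ in range(4)]
context = mask_face(image, (1, 2, 3, 5))
```

The masked copy is what would be passed to the context branch, while the untouched original supplies the face crop.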

Global-Local Attention Module
Inspired by the attention mechanism [7] [41], to model the associative relationship of the local information (i.e., the facial region in our work) and global information (i.e., the surrounding context background), we propose the Global-Local Attention Module to guide the network to focus on meaningful regions (Fig. 4). Specifically, our attention mechanism models the hidden correlation between the face and different regions in the context by capturing their similarity using deep learning techniques. Our attention module takes the extracted face feature map $F_f$ and the context feature map $F_c$ from the two encoding modules as input, and then outputs a normalized saliency map that has the same spatial dimension as $F_c$.
In practice, we first reduce the facial feature map $F_f$ into a vector representation using the Global Pooling operator, denoted as $v_f$. Note that the context feature map $F_c$ is a 3D tensor, $F_c \in \mathbb{R}^{H_c \times W_c \times D_c}$, where $H_c$, $W_c$, and $D_c$ are the height, width, and channel dimension, respectively. We view the context feature map $F_c$ as a set of $W_c \times H_c$ vectors with $D_c$ dimensions; the vector in each cell $(i, j)$ represents the embedded features at that location, which can be projected back to the corresponding patch in the input image:
$$F_c = \{ v_{i,j} \in \mathbb{R}^{D_c} \mid 1 \le i \le H_c,\ 1 \le j \le W_c \}$$
At each location $(i, j)$ in the context feature map, we have $F_c^{(i,j)} = v_{i,j}$. We concatenate $[v_f; v_{i,j}]$ into a big vector $\bar{v}_{i,j}$, which contains information about both the face and a small region of the scene. We then employ a feed-forward neural network to compute the score corresponding to that region by feeding $\bar{v}_{i,j}$ into the network. After repeating the same process for all regions, each region $(i, j)$ outputs a raw score value $s_{i,j}$. We then spatially apply the Softmax function to produce the attention map:
$$a_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k}\sum_{l} \exp(s_{k,l})}$$
To obtain the final context representation vector, we collapse the feature map by taking the average over all the regions weighted by $a_{i,j}$ as follows:
$$v_c = \sum_{i}\sum_{j} (a_{i,j} \odot v_{i,j})$$
where $v_c \in \mathbb{R}^{D_c}$ is the final single vector encoding the context information, and $\odot$ is the scalar multiplication operation. Additionally, $v_c$ mainly contains information from regions that have high attention, while other unimportant parts of the context are mostly ignored. With this design, our attention module can guide the network to focus on important areas based on both facial information and context information of the image.
[Figure caption fragment] The input vector of each branch is then scaled by its corresponding weight and combined together into the final representation vector v. We use this vector v to estimate the emotion category by feeding it into another sub-network (see Fig. 2).
Note that, in practice, we only need to extract the context information once and can then use different encoded face representations to make the system look at different regions with respect to each person.
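The attention computation described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the released implementation: feature maps are nested lists indexed as `[row][col][channel]`, the Global Pooling operator is taken to be average pooling, and `toy_score` is a hypothetical stand-in for the learned feed-forward scoring network.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x D feature map over its spatial cells."""
    cells = [cell for row in feature_map for cell in row]
    depth = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(depth)]


def global_local_attention(face_map, context_map, score_fn):
    """Return (attention map, context vector) for one face."""
    v_f = global_average_pool(face_map)
    # One raw score per context cell, computed from the concatenation [v_f; v_ij].
    scores = [[score_fn(v_f + v_ij) for v_ij in row] for row in context_map]
    # Spatial softmax over all cells (shifted by the maximum for stability).
    peak = max(s for row in scores for s in row)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    attention = [[e / total for e in row] for row in exps]
    # Attention-weighted sum of the context cell vectors.
    depth = len(context_map[0][0])
    v_c = [0.0] * depth
    for a_row, f_row in zip(attention, context_map):
        for a_ij, v_ij in zip(a_row, f_row):
            for d in range(depth):
                v_c[d] += a_ij * v_ij[d]
    return attention, v_c


def toy_score(vector):
    """Hypothetical scorer; the paper uses a learned feed-forward network."""
    return sum(vector)


face_map = [[[1.0, 0.0]]]                      # 1 x 1 x 2 face features
context_map = [[[0.0, 0.0], [1.0, 0.0]],
               [[0.0, 1.0], [2.0, 2.0]]]       # 2 x 2 x 2 context features
attention, v_c = global_local_attention(face_map, context_map, toy_score)
```

Because the context map is an input that does not depend on the face, the same `context_map` can be reused with a different `face_map` for each person in the scene, which matches the remark about extracting the context only once.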

Fusion Module
The Fusion Module is used to incorporate the facial and context information more effectively when predicting human emotions. The Fusion Module takes $v_f$ and $v_c$ as the input, then the face score $s_f$ and context score $s_c$ are computed independently by two neural networks:
$$s_f = \mathcal{F}(v_f; \phi_f), \quad s_c = \mathcal{F}(v_c; \phi_c)$$
where $\phi_f$ and $\phi_c$ are the network parameters of the face branch and context branch, respectively. Next, we normalize those scores by the Softmax function to produce the weights $w_f$ and $w_c$ for the face and context branches so that these weights sum up to 1:
$$w_f = \frac{\exp(s_f)}{\exp(s_f) + \exp(s_c)}, \quad w_c = \frac{\exp(s_c)}{\exp(s_f) + \exp(s_c)}$$
Notice that the face weight and the context weight are independently computed by their corresponding networks and represent the importance of these branches. We let the two networks competitively determine which branch is more useful than the other. Then we amplify the more useful branch and lower the effect of the other by multiplying the extracted features with the corresponding weight:
$$v_f \leftarrow v_f \odot w_f, \quad v_c \leftarrow v_c \odot w_c$$
where $w_f$ and $w_c$ denote the face weight and the context weight. Finally, we use these vectors to estimate the emotion category. Specifically, in our experiments, after multiplying both $v_f$ and $v_c$ by their corresponding weights, we concatenate them together as the input for a network to make final predictions. Fig. 5 shows our fusion procedure in detail.
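A minimal sketch of this fusion step is given here, assuming the two branch networks are supplied as plain Python callables that map a feature vector to a scalar score. The function name `fuse` and the constant scorers in the example are hypothetical, and the final classification network is left out.

```python
import math


def fuse(v_f, v_c, face_score_fn, context_score_fn):
    """Weight the face and context vectors and concatenate them.

    Returns the fused vector and the pair (w_f, w_c) of branch weights.
    """
    s_f = face_score_fn(v_f)
    s_c = context_score_fn(v_c)
    # Two-way softmax over the branch scores (shifted for stability).
    peak = max(s_f, s_c)
    e_f, e_c = math.exp(s_f - peak), math.exp(s_c - peak)
    w_f, w_c = e_f / (e_f + e_c), e_c / (e_f + e_c)
    # Scale each branch by its weight, then concatenate.
    fused = [w_f * x for x in v_f] + [w_c * x for x in v_c]
    return fused, (w_f, w_c)


# Equal scores give equal weights, so both branches are halved.
fused, weights = fuse([2.0, 4.0], [6.0],
                      face_score_fn=lambda v: 0.0,
                      context_score_fn=lambda v: 0.0)
```

When one score is much larger than the other, the corresponding weight approaches 1 and the other branch is effectively suppressed, which is the instability the training procedure later has to work around.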

Datasets
CAER-S. In this work, we only focus on static images with background context as our input. Therefore, we choose the static CAER (CAER-S) dataset [32] to validate our method. The CAER-S dataset contains 70K static images extracted from a total of 13201 video clips of 79 TV shows. Each image is labeled with one of seven universal emotions: anger, fear, disgust, happiness, neutral, sadness and surprise. We follow the standard split proposed by [32] for training, validation and testing.
Novel CAER-S (NCAER-S). While experimenting with the CAER-S dataset, we observe that there is a correlation between images in the training and the test sets, which can make the model less robust to changes in data and may prevent it from generalizing well to unseen samples. More specifically, many images in the training and the test set of the CAER-S dataset are extracted from the same video, hence making them look very similar to each other. To cope with this issue, we propose a novel way to extract static frames from the CAER video clips to create a new static image dataset called Novel CAER-S (NCAER-S). In particular, frames extracted from the training, validation, and test sets of the CAER dataset are separately put into the corresponding training, validation, and test sets of the new NCAER-S dataset. For each video in the original CAER dataset, we split the video into multiple parts, each approximately 2s long, and then randomly select one frame from each part to include in the new NCAER-S dataset. Any original video that provides frames for the training set is excluded from the test set. This process ensures that the new dataset is novel and that the training frames and testing frames never come from the same original video. With our selection method, we ensure that images in the validation and test sets are independent of those in the training set.
We also make sure that the numbers of extracted frames of each emotion category are approximately equal to tackle the imbalance problem of the CAER dataset and prevent bias towards prominent emotions.
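The per-video frame selection can be sketched as follows. The sketch assumes that the frame rate of a clip is known and that frames are addressed by integer index; the function name `sample_frames` and its parameters are hypothetical, and the class-balancing step is not covered.

```python
import random


def sample_frames(num_frames, fps, part_seconds=2.0, rng=None):
    """Pick one random frame index from each ~2 s part of a clip."""
    rng = rng or random.Random()
    # Number of frames in one part (at least one frame).
    part_len = max(1, int(round(fps * part_seconds)))
    picks = []
    for start in range(0, num_frames, part_len):
        end = min(start + part_len, num_frames)   # last part may be shorter
        picks.append(rng.randrange(start, end))
    return picks


# A hypothetical 6 s clip at 25 fps yields three parts of 50 frames each.
picks = sample_frames(num_frames=150, fps=25, rng=random.Random(0))
```

Applying such a routine separately to the training, validation, and test videos keeps frames of one video inside a single split.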
The statistics of the original CAER and the new NCAER-S training sets are shown in Fig. 6 and Table 1.
The new split NCAER-S dataset can be downloaded at https://bit.ly/NCAERS_dataset.

Experimental Setup
Evaluation Metric. Classification accuracy is the standard evaluation metric that is widely used to measure the reliability of automated emotion recognition systems in the literature [34] [32] [36] [13] [40]. To compare our results with previous approaches quantitatively, as in [34] [32] we use the overall classification accuracy as the evaluation metric:
$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$$
where $\mathbb{1}$ is the indicator function, $N$ is the total number of samples in the dataset, and $\hat{y}_i$ and $y_i$ are the network prediction and ground-truth category of the i-th example, respectively.
Baselines. We compare the results of our proposed Global-Local Attention for Emotion Recognition network (GLAMOR-Net) with the following methods as baselines: AlexNet [31], VGGNet [54], ResNet [25], CAER-Net-S [32]. The results of AlexNet, VGGNet, and ResNet on the CAER-S dataset are reported in two cases: using the ImageNet pre-trained model, and fine-tuning these networks on this dataset. Note that these results are taken from [32]. On CAER-S, we also compare our method to several recent state-of-the-art approaches. GR-ERN [22] utilized a multi-layer Graph Convolutional Network (GCN) to exploit the relationship among different regions in the context. EfficientFace [60] proposed an efficient lightweight network and utilized the label distribution to handle the ambiguity of real-world emotions. MA-Net [59] designed a highly complicated architecture based on ensemble learning of multiple regions to handle the occlusion and pose variation problems. We report the results of our GLAMOR-Net with two different backbones: the original encoding module introduced in Section 3.2.1 and ResNet-18 [25].
Implementation Details. Our networks are implemented using the TensorFlow 2.0 framework [1].
For optimization, we use the SGD optimization algorithm and the standard cross-entropy loss function:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i^{(y_i)}$$
where $p_i^{(y_i)}$ is the predicted probability for the true emotion category $y_i$ of the i-th sample and $N$ is the total number of samples in the dataset.
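As a small worked example of the accuracy metric and the cross-entropy loss, the sketch below computes both for a toy batch. Labels are assumed to be integer class indices and the loss is averaged over the samples, which is an assumption about the normalization; the function names are illustrative.

```python
import math


def accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(1 for p, y in zip(predictions, labels) if p == y) / len(labels)


def cross_entropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)


# Toy batch: two samples, two classes.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = cross_entropy(probs, labels)
acc = accuracy([0, 1, 1], [0, 1, 0])
```

Here `loss` equals the mean of `-log(0.5)` and `-log(0.75)`, and `acc` is 2/3 because two of the three predictions match their labels.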
Given an input image, we first use the CNN-based face detector in the dlib library [28] to detect the face coordinates. The detected face is then cropped and resized to 96 × 96 and fed to the Facial Encoding Module. To create the input for the Context Encoding Module, we mask the facial region in the original image and resize it to 128 × 171, then we apply a random crop during the training phase and a center crop during the inference phase to the final size of 112 × 112. We use a dropout layer before the final layer with a dropout rate of 0.5 to reduce the effect of overfitting. During training, we observe that the fusion network is very unstable and easily affected by random factors. Specifically, the weights of the face branch or the context branch in the Fusion Module can easily take a value near 0 or 1, which means the model completely ignores information extracted from one of the branches. To tackle this problem, we first train the Facial Encoding Module and the Context Encoding Module separately, then jointly train both modules and the fusion network in an end-to-end manner.
Table 2 summarizes the results of our network and other recent state-of-the-art methods on the CAER-S dataset [32]. This table shows that integrating our GLA module significantly improves the accuracy of the recent CAER-Net. In particular, our GLAMOR-Net (original) achieves 77.90% accuracy, which is a +4.38% improvement over CAER-Net-S. Compared with other recent state-of-the-art approaches, our GLAMOR-Net (ResNet-18) outperforms all those methods and achieves a new state-of-the-art performance with an accuracy of 89.88%. This result confirms that our global-local attention mechanism can effectively encode both facial information and context information to improve the human emotion classification results.

Results on the NCAER-S dataset
On the NCAER-S dataset, we compare our results with three recent methods: VGG16 [54], ResNet50 [25], and CAER-Net-S [32]. The results from the VGG16 and ResNet50 models are reproduced as baseline methods. We fine-tune the VGG16 and the ResNet50 from the models pre-trained on VGG-Face and ImageNet, respectively. Our GLAMOR-Net (original) and CAER-Net-S are trained from scratch for a fair comparison. Table 3 reports the comparative results of our GLAMOR-Net and other recent methods. This table shows that the GLAMOR-Net architecture outperforms all other architectures and achieves the highest performance. In particular, our network increases classification accuracy by 2.77% compared to the second-highest model, CAER-Net-S. These results also validate the effectiveness of our proposed global-local attention mechanism integrated into the GLAMOR-Net. We note that the result of VGG16 pre-trained on VGG-Face is surprisingly better than the result of ResNet50 pre-trained on the ImageNet dataset. This is explainable, as the weights pre-trained on VGG-Face carry more meaningful information for this task than the weights pre-trained on ImageNet, which includes many non-face images.
Also from Table 3, we can see that the classification accuracies of the models are much lower than those in Table 2. The reason is that the new NCAER-S is more challenging than the original CAER-S dataset. As mentioned earlier, to construct the NCAER-S dataset, we eliminate the correlation between the train and the test samples as much as we can. Specifically, we separately resample image frames from clips of the train and test sets of the CAER dataset to mitigate the train and test dependency. Moreover, note that the size of the new dataset is less than one-third of the original one, which also limits the amount of information that the models can exploit. However, our GLAMOR-Net still consistently outperforms other state-of-the-art methods despite the challenges of the NCAER-S dataset and shows competitive results.
The confusion matrix of our GLAMOR-Net evaluated on the NCAER-S dataset is given in Fig. 8. The two categories with the highest accuracy are happy and neutral, while the disgust emotion has the lowest accuracy of 0.28. It can also be inferred from the confusion matrix that our model mostly confuses neutral with other emotion categories, since most of the misclassified examples of the six other categories (angry, disgust, fear, happy, sad and surprise) fall into the neutral class. This may be because the facial emotion in the NCAER-S dataset is weakly expressed, which makes it more difficult to identify and distinguish other emotions from the neutral class.
In summary, we can conclude that our method consistently improves the results on both the original CAER-S and the challenging NCAER-S datasets. Note that although we follow the same procedure as in [32], our proposed Global-Local Attention Module is the key difference that helps enhance the accuracy of the emotion recognition task. The results reported in Table 2 and Table 3 verify that with the assistance of our attention strategy, the classification accuracy is significantly improved. We believe that if a more sophisticated neural architecture is adopted, the performance will be further boosted.

Analysis
To further analyze the contribution of each component in our proposed method, we experiment with 4 different input settings, denoted (i) to (iv). Note that to compute the saliency map with the proposed GLA in settings (ii) and (iii), we extract facial features using the Facial Encoding Module; however, these features are only used as the input of the GLA module to guide the context attention map learning process and not as the input of the Fusion Network to predict the emotion category. The performances of these settings are summarized in Table 4.
Table 5 p-values of the Stuart-Maxwell test for each pair of methods that are used in setting (iv) of Table 4. Each element on the main diagonal is the test result of the agreement between the model prediction and the observed data (ground-truth label).
The results show that our GLA consistently helps improve performance in all settings. Specifically, in setting (ii), using our GLA achieves an improvement of 1.06% over the method without attention and 0.97% over the standard attention module in CAER-Net-S [32]. It is also noteworthy that when the context with visible faces is utilized as in setting (iii), using the attention module in CAER-Net-S achieves 41.94% accuracy, lower than the one using only the cropped face in setting (i) by 0.64%, while using our GLA module achieves higher accuracy (42.66% vs. 42.58%). Our GLA also improves the performance of the model when both facial and context information is used to predict emotion. Specifically, our model with GLA achieves the best result with an accuracy of 46.91%, which is 3.72% higher than the method with no attention and 2.77% higher than the standard attention module in [32]. The results from Table 4 show the effectiveness of our Global-Local Attention module for the task of emotion recognition. They also verify that the use of both the local face region and global context information is essential for improving emotion recognition accuracy.
In order to emphasize the contribution of the attention module to the final results, we conduct the Stuart-Maxwell test for each pair of methods that are used in setting (iv) of Table 4. The Stuart-Maxwell test is the generalized version of the McNemar test [14], which is generally used for testing the significant difference of multi-class classification models. The resulting p-values of the tests are shown in Table 5. Note that a lower p-value indicates stronger statistical disagreement between the two compared methods. Overall, we can see that all of the models have significantly different error rates. Furthermore, a higher value on the main diagonal implies stronger agreement between the model prediction and the observed data, which means the performance is better. In conjunction with the results in Table 4, we can statistically confirm that our GLA module performs better than other attention mechanisms.

Fusion methods comparison
To study the effectiveness of the information obtained from multiple modalities via different fusion strategies, we conduct an experiment by replacing the Fusion Module with different fusion operators while keeping the other components of the system unchanged. Specifically, Element-wise Addition (Fusion Add), Element-wise Maximum (Fusion Max) and our Fusion Net are studied in our experiment. Furthermore, we also compare our method with recent work by Dubey et al. [17]. Table 6 summarizes the results from our experiment. As shown in this table, the performance of our network using Fusion Net is superior to other fusion strategies. However, we notice that the results from other fusion techniques are also very competitive. This shows that the fusion strategy is an important module in the emotion recognition task, although the final result is also affected by the features produced by the feature extraction and attention modules.
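For reference, the two simple fusion operators named in this comparison can be written in a few lines. The sketch assumes the face and context vectors have the same length, which element-wise operators require; the function names are illustrative.

```python
def fusion_add(v_f, v_c):
    """Element-wise addition of the face and context vectors."""
    return [a + b for a, b in zip(v_f, v_c)]


def fusion_max(v_f, v_c):
    """Element-wise maximum of the face and context vectors."""
    return [max(a, b) for a, b in zip(v_f, v_c)]


v_f = [0.2, 0.9, 0.0]
v_c = [0.5, 0.1, 0.0]
added = fusion_add(v_f, v_c)
maxed = fusion_max(v_f, v_c)
```

Unlike the Fusion Net, neither operator has learnable parameters, so the relative importance of the two branches is fixed rather than predicted per input.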

Backbone architectures
We further study the effect of different Encoding network architectures. Specifically, MobileNetV2 [48] and ResNet-18 [25] are adopted as the backbone network to extract features for both face and context branches in our study. We use the output of the last convolutional layer as the feature maps. These feature maps are then fed into the GLA module and processed as in Section 3. We summarize the total number of network parameters and the classification results on CAER-S and NCAER-S in Table 7. We observe that ResNet-18 significantly outperforms the shallower architectures (Original and MobileNetV2) and yields the best performance with 89.88% and 48.40% accuracy on CAER-S and NCAER-S, respectively. However, using such a complex model results in a larger memory footprint as well as a higher computational cost. Additionally, MobileNetV2 can balance the trade-off between accuracy and speed, which makes it a reasonable option for deployment in environments with limited resources such as mobile devices.

Visualization
Fig. 9 shows the qualitative visualization with learned attention maps obtained by our method GLAMOR-Net in comparison with CAER-Net-S. It can be seen that our Global-Local attention mechanism produces better saliency maps than CAER-Net-S [32] and helps the model attend to the right discriminative regions in the surrounding background. As we can see, our model is able to focus on the gesture of the person (Fig. 9f) and also the face of surrounding people (Fig. 9c, Fig. 9d) to infer the emotion accurately.
Fig. 10 shows some emotion recognition results of different approaches on the NCAER-S dataset. More specifically, the first two rows (i) and (ii) contain predictions of CAER-Net-S while the last two rows (iii) and (iv) show the results of our GLAMOR-Net. In some cases, our model was able to exploit the context effectively to perform inference accurately. For instance, with the same sad image input (shown on rows (i) and (iii)), CAER-Net-S misclassified it as neutral while GLAMOR-Net correctly recognized the true emotion category. This might be because our model was able to identify that the man was hugging and appeasing the woman and inferred that they were sad. Another example is shown on rows (i) and (iii) of the fear column. Our model classified the input accurately, while CAER-Net-S might have been confused between the facial expression and the wedding surroundings, and thus incorrectly predicted the emotion as happy.
On the other hand, we can also see on row (iv) of Fig. 10 that GLAMOR-Net misclassified the disgust and the surprise images as happy and the neutral image as sad. The reason might be that these images look quite confusing even to humans. Our model also failed to recognize emotions in the anger, fear, happy and sad images on row (iv) and predicted them as neutral instead. This may be because the facial expression in these images does not manifest clearly enough, which makes it difficult to distinguish between the neutral class and these emotion categories. This uncertainty was previously shown in the confusion matrix in Fig. 8.
Fig. 10 Predictions on the NCAER-S test set. The first two rows (i) and (ii) show the results of the CAER-Net-S while the last two rows (iii) and (iv) demonstrate predictions of our GLAMOR-Net. The columns' names from (a) to (g) denote the ground-truth emotion of the images.
Fig. 11 Human emotion detection results in the wild setting.

Emotion Recognition in The Wild
As both the CAER-S dataset and its new split NCAER-S dataset contain only images from movie settings, they have a very limited number of people in a constrained environment. Therefore, models trained on these datasets may not work well in real-world image settings. Despite this challenge, Fig. 11 shows that our GLAMOR-Net can successfully detect and recognize human emotion in these challenging settings. Note that the input images in this setup do not overlap with the movie settings in the training set. This again supports the generalization ability of our proposed method.

Conclusions and Future Work
In this work, we presented a novel method to exploit context information more efficiently by using the proposed global-local attention model. We have shown that our approach can considerably improve the emotion classification accuracy compared to the current state-of-the-art result in the context-aware emotion recognition task. The results on the CAER-S and the NCAER-S dataset consistently demonstrate the effectiveness and robustness of our method.
Our approach currently only takes static images as input, which limits the amount of knowledge that can be exploited. We are planning to utilize temporal information in dynamic videos and other modalities such as audio in order to further improve the performance. We are also considering releasing a more challenging emotion recognition dataset that contains rich background contexts with multiple faces in the same frame, and taking advantage of our attention model to extract the context saliency map for each face in a more effective manner. We hope that our work will pave the way for future work in which predicting the emotions of different people simultaneously is tackled.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Availability of data and material The NCAER-S dataset can be downloaded at https://bit.ly/NCAERS_dataset.
Code availability The source code and trained model of our network are available at https://github.com/minhnhatvt/glamor-net.