Adaptively Enhancing Facial Expression Crucial Regions via Local Non-Local Joint Network

Facial expression recognition (FER) is still one challenging research due to the small inter-class discrepancy in the facial expression data. In view of the significance of facial crucial regions for FER, many existing researches utilize the prior information from some annotated crucial points to improve the performance of FER. However, it is complicated and time-consuming to manually annotate facial crucial points, especially for vast wild expression images. Based on this, a local non-local joint network is proposed to adaptively light up the facial crucial regions in feature learning of FER in this paper. In the proposed method, two parts are constructed based on facial local and non-local information respectively, where an ensemble of multiple local networks are proposed to extract local features corresponding to multiple facial local regions and a non-local attention network is addressed to explore the significance of each local region. Especially, the attention weights obtained by the non-local network is fed into the local part to achieve the interactive feedback between the facial global and local information. Interestingly, the non-local weights corresponding to local regions are gradually updated and higher weights are given to more crucial regions. Moreover, U-Net is employed to extract the integrated features of deep semantic information and low hierarchical detail information of expression images. Finally, experimental results illustrate that the proposed method achieves more competitive performance compared with several state-of-the art methods on five benchmark datasets. Noticeably, the analyses of the non-local weights corresponding to local regions demonstrate that the proposed method can automatically enhance some crucial regions in the process of feature learning without any facial landmark information.

Abstract-Facial expression recognition (FER) is still one challenging research due to the small inter-class discrepancy in the facial expression data.In view of the significance of facial crucial regions for FER, many existing researches utilize the prior information from some annotated crucial points to improve the performance of FER.However, it is complicated and time-consuming to manually annotate facial crucial points, especially for vast wild expression images.Based on this, a local non-local joint network is proposed to adaptively light up the facial crucial regions in feature learning of FER in this paper.In the proposed method, two parts are constructed based on facial local and non-local information respectively, where an ensemble of multiple local networks are proposed to extract local features corresponding to multiple facial local regions and a nonlocal attention network is addressed to explore the significance of each local region.Especially, the attention weights obtained by the non-local network is fed into the local part to achieve the interactive feedback between the facial global and local information.Interestingly, the non-local weights corresponding to local regions are gradually updated and higher weights are given to more crucial regions.Moreover, U-Net is employed to extract the integrated features of deep semantic information and low hierarchical detail information of expression images.Finally, experimental results illustrate that the proposed method achieves more competitive performance compared with several state-ofthe-art methods on five benchmark datasets.Noticeably, the analyses of the non-local weights corresponding to local regions demonstrate that the proposed method can automatically enhance some crucial regions in the process of feature learning without any facial landmark information.
Index Terms-Facial Expression Recognition, Deep Neural Network, Multiple Networks Ensemble, Attention Network.

I. INTRODUCTION
E MOTION is a complex state that integrates people's feel- ings, thoughts and behaviors [1], and facial expression is one of the most direct signals to communicate their innermost thoughts.Therefore, facial expression recognition (FER) [2], [3], [4], [5], [6] has attracted the attention of many researchers due to its important role in many practical application fields, such as human-computer interaction, recommendation system, patient monitoring, et al..In general, facial expression is encoded into facial action units through facial action coding system [7], [8], [9], and any expressions can be described through a set of facial action units.As we know, some facial action units are crucial for FER [10], such as the one located in regions around eyes and the mouth, since they are of more obvious actions compared with other facial regions (such as cheek and forehead).In the following parts, we regard these crucial facial action units as facial crucial regions, shortened by FCRs.Fig. 1 illustrates facial crucial regions of two facial images (ID1 and ID2) from six expressions, respectively.From Fig. 1, it is found that the FCRs are more discriminative to determine the expression category of a facial image [11].
In view of the significance of FCRs, many studies [12], [13], [14], [15] have been proposed based on applying the information of facial local regions, where the facial landmarks are employed as the prior information of facial crucial regions, whereas the landmarks are given by manually annotating for facial expression images.Early, most of FER researches [16], [17], [18] focused on lab-collected expression datasets, such as CK+ [19], MMI [20], JAFFE [21], Oulu-CASIA [22].For labcollected datasets, facial expressions images were collected from several or dozens of individuals under similar conditions (such as illumination, angle, posture, et al.), generally with a few uncontrollable factors.Thus, it is easily achieved to manually annotate the landmark of FCRs for lab-collected datasets.
However, compared with the lab-controlled datasets, the wild expression datasets [23] are collected under more complex and uncontrollable conditions, such as RAF-DB [24], AffectNet [25], EmotionNet [26], et al.For the wild expression datasets, especially including a vast of images, it is very com- On the other hand, there exists a problem that some FCRs from different expression categories are similar, whereas some FCRs from one same category are very different.From Fig. 1, it is obviously seen that the FCRs (including mouths) of ID1 from six expressions are similar with opening the mouth, which is absolutely different from ID2 with closing the mouth.Similarly, for the crucial regions including eyes, ID1 and ID2 from the category (Fear) are different, whereas ID1 from the category (Surprise) and ID2 from the category (Anger) are similar.It illustrates that FCRs of expression images belonging to the same category may be very different but FCRs from different categories are similar.Distinctly, it is insufficient that only local information of facial expressions is utilized to construct one effective model for FER, especially for the wild dataset.Hence, it is still important to utilize the global information of the facial expression while FCRs are enhanced in deep facial expression recognition.
Based on the above analyses, we propose a new method of facial expression recognition in this paper, which constructs a local non-local joint network to adaptively enhance the facial crucial regions in the process of deep feature learning, shortened for LNLAttenNet.In LNLAttenNet, the local and the non-local information of facial expressions are simultaneously considered to construct two parts of the network respectively: a local multi-network ensemble and a non-local attention network, and then the generated local and nonlocal feature vectors are integrated and jointly optimized in feature learning.Specially, the attention weights obtained by  the non-local part is regarded as the significance of facial local regions and fed into the local multi-network ensemble system to combine multiple local networks.Interestingly, we find that some facial crucial regions can be automatically enhanced in the process of deep feature learning by the proposed method.Moreover, U-Net is employed to generate feature maps where each pixel has large receptive field and the local region also contains the global information.Fig. 3 shows a simple view of LNLAttenNet.From Fig. 3, it is obvious that some crucial regions is given higher weights by LNLAttenNet, such as the 5th patch around the left eye (0.1123), the 10th, 11t and 14th patches around the mouth (0.0887, 0.1073 and 0.1298), which illustrates that some crucial regions are effectively enhanced by LNLAttenNet.Note that w i is the non-local attention weight corresponding to the i th local region and the initial weights are equal.More detailed descriptions will be introduced in the following parts.
Compared with stat-of-the art methods, our contributions are mainly three points: Section II firstly introduces related works about deep facial expression recognition.Secondly, Section III introduces the detail of the proposed method.Then, experimental results and analyses are demonstrated to validate the performance of the proposed method in Section IV.Finally, Section V provides the conclusion as well as the prospects on future works.

II. RELATED WORKS
Due to the excellent performance of deep learning, various deep networks have been applied in FER [23], such as VggNet [27], InceptionNet [28], ResNet [29], et al.Based on this, many deep FER methods have been proposed to address different problems.In [30], Hu et al. firstly extended the idea of deep supervision to deal with FER in the wild.The training of deep CNNs was softer and easier through the supervision not only to deep layers but also to intermediate layers and shallow layers, and a fusion structure was constructed where the feature ahead was used for the second-level supervision.In [31], Acharya et al. thought that the second-order statistic (such as covariance) were more suitable to catch the feature of the twisted facial expression.In their framework, a mainfold structure was constructed for covariance pooling to obtain a competitive performance for FER.In [32], Li et al. proposed a new deep manifold strategy for multi-label expressions, and their proposed network focused on the ambiguity expressions and could learn the discriminative feature that was suitable for cross-database FER.
Considering that facial expression is determined by key regions, Fan et al. [12] utilized the information of facial landmark points to select three sub-images around the eyes, mouth and nose.Then, three sub-images were encoded by three sub-networks, and the last pooling layer in each subnetwork was concatenated with each other, which obtained better recognition performance compared with others.In [33], [34], the information of facial landmark is used to extract features and generate masks from specific locations to remove the pose variation.
In [35], it was taken into account that there are inevitably labeling errors and deviations between different databases due to the subjectivity of labeling facial expressions.Therefore, when existing methods make use of multiple databases to expand the training set, their performance cannot be continuously improved.In order to solve this inconsistency between different databases, an Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) framework is proposed to train a model from multiple inconsistent databases and large scale unlabeled images.The IPA2LT essentially constructs the ensemble at label level.Each image in the model has the same number of labels as the number of data sources, in which only one label is original and others are pseudo.Existing methods for FER have been almost satisfying on analyzing the frontal faces but fail to attain a good performance on partially occluded faces collected in the wild.Some facial expressions are ambiguous and have multi-labels.In [36], Gan et al. proposed a new framework based on CNN with the supervision of soft labels, where hard labels are used to construct soft labels with a novel label-level perturbation.In this framework, soft labels were obtained to eliminate the similarity between faces of different emotions, and multiple basic classifiers were trained and then combined.Moreover, some GAN-based methods have been proposed to generate expressional images for FER [37], [38], [39] or usually focus only on generating new facial expression images [40], [41], [42], [43].In [37], a novel approach is proposed to learn facial expressions by extracting the expressive component through a de-expression procedure where the corresponding neutral expression is generated by the trained generative model by given a facial image with arbitrary expressions.In [40], a user-controllable approach is proposed so as to generate video clips of various lengths from a single face image and the lengths and types of the expressions are controlled by users.
In [13], Li et al. proposed a CNN with attention mechanism (ACNN) to detect the occlusion of facial regions and paid attention to the most discriminative regions, where ACNN used the information of 24 facial landmark points to select the key regions at the feature level.In [44], Barros et al. investigated the emotion-driven attention mechanisms from the view of videos.In [45], Wang et al. proposed two-level attention mechanism to extract emotion-related features, which was based on global information, not involving the local regions.Similarly to [44], [45], [13], the attention mechanism is also involved in this work, whereas the essence of algorithms is very different.Here, our purpose is to adaptively enhance the significance of facial crucial regions based on the attention weights in feature learning obtained by the non-local attention network from the view of multiple local regions, where the attention weights corresponding to each local regions are obtained by the non-local attention network.

III. LOCAL NON-LOCAL JOINT NETWORK FOR FACIAL EXPRESSION RECOGNITION In this paper, we propose a Local Non-Local Attention Joint
Network for FER to adaptively light up more crucial local regions of facial expression, named by LNLAttenNet.The overall framework of LNLAttenNet is visually shown in Fig. 4. In Fig. 4, one facial expression image is used as the initial input instance of the proposed network, and its size is 144×144 as same as our implemented experiments.In LNLAttenNet, U-Net is firstly employed to extract the feature maps integrating the deep semantic information and the low hierarchical detail information of facial expression images.For the facial expression dataset, when the regional integration is carried out [12], the inter-class discrepancy is smaller and the intra-class discrepancy is larger, as shown in Fig. 1.The structure of U-Net [46], [47], [48], the top-down architecture with lateral connections for introducing details into high-level semantic feature maps, has been proved that local regions in last few layers are of the large receptive field and the global information, which is important and useful for ambiguous objects recognition [49], [50].Therefore, U-Net is beneficial to alleviate the negative impact of the regional integration, but it does not mean that the proposed method is restricted to U-Net.Actually, one model with the similar structure to U-Net can be employed in our proposed method, such as FPN [49].

U-Net
As shown as Fig. 4, facial expression images are inputted to the proposed model.By U-Net, two different feature maps are generated for the initial input image, located in the last layer (Conv9-2) and the intermediate layer (Conv5-2) of U-Net, respectively.In the following parts, we use F 5 and F 9 to express the feature maps from Conv5-2 and Conv9-2 of U-Net, respectively.Then, the generated feature maps

A. Non-Local Attention Network
For facial expression recognition, there is small inter-class discrepancy and large intra-class discrepancy on expression images, as shown in Fig. 1.Therefore, facial crucial regions are regarded as more discriminative regions which determine the categories of facial expression, such as regions around the mouth (eyes) rather than the cheek.However, it is tough to estimate which regions are more crucial without the assistance from manually annotated crucial points.Based on this, we construct the Non-Local Attention Network to automatically mine more discriminative regions from the whole facial expression, visually shown in the box with orange dot lines of Fig. 4.
In Fig. 4, the feature map F 5 (Conv5-2) is generated by U-Net as the global information of the facial image to construct the non-local attention network.The Conv5-2 is with the minimum resolution and the maximum receptive field, which means that F 5 is not affected by each local patch but contains the relationship between local patches implicitly.It is useful to mine more crucial regions based on the global information from the whole face.
1) Global Attention: Inspired by [51], [52], we construct a non-local attention model based on three branches, shown as in Fig. 5. First, the input is the map F 5 containing the global information of facial expression in Fig. 5. Based on F 5 , three feature maps Q, K and V are generated by one convolution layer and one pooling layer, respectively.Note that three maps are with a special resolution 1 with n * n in this model, where M = n2 and M is the number of cropped local regions.Then, the maps Q and K are reshaped as Q * and K * , shown as in Fig. 5, and a multiplication operation is followed to get a matrix R which reflects the correlation among local regions.Compared with [51], [52], the relevance of each region (patch) in LNLAttenNet is not as strong as each frame in video or each word in sentence, and thus L 1 normalization is adopted Furthermore, the map V is reshaped as V * , and the feature vector s is obtained by multiplying V * by the correlation matrix R, which is the self-attention form in [51], [52].In order to make the matrix R reflect the correlation among local regions, s is flattened and added to the non-local vector g (shown in Fig. 4).Meanwhile, a function is given to trade off two vectors g and s, shown as where g * expresses the new non-local vector and α is the hyper-parameter to adjust the ratio of s.In experiments, we will give an analysis for the parameter α.

B. Local Multi-Networks Ensemble
The feature map (F 9 ) is employed as the input to construct the part: Local Multi-Networks Ensemble, shown as in Fig. 4. The reason of using the map F 9 is that each pixel is of the large receptive field and the rich sementic information in Conv9-2, where F 9 is with the same resolution as the initial input image.In the part of Local Multi-Networks Ensemble, the feature map F 9 is firstly divided into M patches (including different local regions) with the same dimension (set as 48*48*64 in our experiments).Then, M patches are trained by Simple Network 2 to generate M individual networks {IN 1 , ..., IN M }, respectively.Specially, for each individual network, the local attention mechanism is added to enhance the feature vector of each local region.Finally, M local feature vectors are combining with the non-local attention weights obtained by Non-Local Attention Network.
1) Local Attention: In practice, it is found that the useful information is decreased when partial regions in one patch are missed or obscured.It means that less attention should be given to them.In view of this, a local attention mechanism is adopted in each individual network to weaken the significance of useless regions.The local attention model is encoded by four convolution layers and two fully connected layers, and its structure is shown in Fig. 6.Note that two convolution layers are not padded in order to reduce the computational complexity.In the local attention model, its input is the output of the last pooling layer in Simple-Net, and its output is one value between 0 and 1 obtained via the sigmoid function, regarded as the local attention weight w l i of each individual network, which represents the amount of information in each local patch can flow to the next level.If the facial local region is obscured or missed, the information that it contains for expression recognition will be reduced, and then the weight value of the local attention is also reduced to alleviate the effect of patches including the obscured region.Furthermore, the weights will be multiplied by the corresponding local vector as the output feature of each local network.More visual illustrations can be found in the part of experiments.
2) Combination of Multiple Local Networks: According to the non-local attention weights w g and the local attention weights w l , the local feature vectors given by M individual networks {IN 1 , ..., IN M } are aggregated by the formula where f en expresses the ensemble feature vector, f i expresses the feature vector given by IN i corresponding to the i th local region, w g i is the non-local attention weight of the i th local region, and w l i expresses the local attention weight of the i th local region.In experiments, we will give an analysis for the number M of local patches.

C. Joint Optimization of LNLAttenNet
In Fig. 4, the non-local feature vector g * is produced by the non-local attention network, and the local vector f en is obtained by the local multi-network ensemble.Inspired by [53], we think that the global information of an input image is essential, and each local patch can get large receptive field and the global information by embedding U-Net, which makes it easier to classify the similar patch of facial expression of different categories.Moreover, Conv5-2 is encoded to a global vector with 8192 dimension by two convolution layers and one pooling layer.Then, the non-local vector g * is concatenated with the local vector f en to obtain the total vector as the feature of the first fully connected layer and is jointly optimized, and the dimension of the integrated feature vector is 17408 shown as in Fig. 4. In LNLAttenNet, three full connect layers are implemented, and the loss function is formulated as where loss entropy expresses the cross entropy loss, loss l2 is the l2 regularization loss, and γ is the hyper-parameter controlling the balance between two losses.The cross entropy is calculated as: where C is the number of categories, N is the number of the input image, and L is the function that determines whether the input is correct.p i n is the i th component of the output of the last softmax layer of the n th image, and l n is the label of the n th input image.The l2 regularization loss is computed by where W is the parameters of our model and λ is set as 0.0001 in the following experiments.

IV. EXPERIMENTS AND ANALYSES
In this section, we will validate the performance of the proposed method from several items: 1) the performance comparison with state-of-the-art methods on benchmark datasets, 2) the analyses of Non-Local Attention, 3) the visualization of Local Attention, 4) the change of the parameter α, 5) the performance of LNLAttenNet with different M , and 6) the analyses for overlapped pixels between local regions, respectively.
• RAF-DB contains 29672 facial images downloaded from the Internet.For the RAF-DB dataset, the facial landmarks are manually annotated via the crowdsourcing method with basic or compound expressions.In experiments, we use the basic database including 12,271 training and 3,068 testing images.309 sequences have been annotated with six basic emotions.The emotion in each sequence goes from neutral to peak and then to neutral again.In view of this, we select the first frame of each sequence with the label of neutral and the peak frame of each sequence with the target label to generate 618 experimental images.• MMI is recorded from 30 objects with rich details of annotations, and 398 images are generated by selecting the first frame of each sequence with the label of neutral and one peak frame of each sequence.For RAF-DB and SFEW datasets, their training sets are directly used to train the model and testing sets are used to evaluate the performance.For AffectNet dataset, its training set is used to train the model, and its validation set is used as the testing set, since the testing set of AffectNet is not given the annotated labels [25].For CK+ and MMI datasets, we adopt the five-fold cross-validation scheme to evaluate the recognition performance, in order to make a fair comparison with other methods.Additionally, in order to fairly compare with the state-of-the-art methods of FER, we initialized the parameters of U-Net by Xavier initializer [55]  region) overlaps about 16 pixels with its adjacent patches, and the parameter α is set as 0.7 in Eq.( 1).The size of the epoch is set to 24, the initial learning rate is 0.0003, and the weight decay is set as 0.95 each epoch.
In Tables.I, III and II, we give the structures of the nonlocal attention network, the local attention and the simple net, respectively.For the non-local attention network, we only show the convolution layer and the pooling layer, and the operations such as reshaping and matrix multiplication are not shown.All experiments are implemented on the framework of Tensorflow and GTX 2080Ti with 11G memory.
• DLP-CNN [24] decomposes the image structurally rather than spatially into regions (parts) which are discriminative for matching.According to the representations over the regions, it aggregates discriminative features for classification.• Soft-CNN [36] fuses the latent label probability distribution predicted by the trained model to obtain soft labels with a novel label-level perturbation strategy.
• CenterLoss [57] minimizes the center loss calculated by the distance between each data and its corresponding class center to reduce the intra-class discrepancy.
• gACNN [13] uses 24 facial landmarks as the attention mechanism to conduct multi-region ensemble at the feature level.• LDL-ALSG [58] considers the subjectivity of human annotators and the ambiguous expression labels and then leverages the topological information of the labels from related but more distinct tasks, such as AU recognition and facial landmark detection, to explore the label distribution of facial expressions.• IPA2LT [35] employs an inconsistent pseudo annotations framework to solve the inconsistent annotations between different facial expression databases.Noticeablely, IPA2LT [35] applies both RAF and AffectNet as the training set, differently from our method (LNLAttenNet) and other compared methods where only the training set of one dataset is employed to train a model.In LNLAttenNet, both non-local attention and local attention mechanisms are utilized.Thus, we also make a comparison with three special cases of our model: the model without both local and nonlocal attention (Model-S), the model only with local attention (Model-Local), and the model only with non-local attention (Model-NonLocal).Table IV shows the experimental results of 12 models, where the highest accuracy is bold for each dataset.All results are the average of the last 10 epochs.
From Table IV, it is obviously seen that the performance of the proposed method (LNLAttenNet) is superior to all compared methods except LDL-ALSG and IPA2LT on Af-fectNet, RAF-DB, CK+, MMI and SFEW.Differently to LNLAttenNet, IPA2LT [35] utilizes two big datasets (RAF and AffectNet) as the training set, which results in its obtaining better performance.But, LNLAttenNet still achieves a competitive performance on two datasets (RAF-DB and SFEW) and outperforms on three datasets (AffectNet, CK+ and MMI) compared with IPA2LT.Compared with LDL-ALSG [58], LNLAttenNet outperforms on RAF-DB, SFEW and CK+, ties on AffectNet and loss on MMI.In the last column of Table IV, we also show the average of accuracies for five datasets given by each method in the last.It is found that LNLAttenNet obtains the highest average of accuracies: 74.03%, which illustrates LNLAttenNet can obtain a more competitive performance of FER on all of five datasets than eight compared methods.
Furthermore, it is found that Model-S is inferior to all of Model-Local, Model-NonLocal and LNLAttenNet, which demonstrates that the attention mechanism is meaningful for improving the performance of FER in our model.Meanwhile, Model-NonLocal is slightly better than Model-Local but obviously inferior to LNLAttenNet, which also demonstrates our model jointly utilizing local and non-local information of facial expression is more effective.In short, the experimental results illustrate that adaptively enhancing the facial crucial regions in feature learning by LNLAttenNet is effective for improving the performance of FER.
Considering that RAF and AffectNet datasets have a large amount of images, we also shows the confusion matrices for them in Fig. 8 and Fig. 9, respectively.According to the confusion matrices, it is observed that the categories (fear and surprise) are easily distinguishable for RAF-DB (shown in Fig. 8) and the categories (disgust and anger) are easily distinguishable for AffectNet (shown in Fig. 9).

C. Analyses of Non-Local Attention
In LNLAttenNet, it is achieved to adaptively enhance the feature learning of facial crucial regions by jointly optimizing for local and non-local parts, where the non-local attention network is constructed to obtain the global weights w g of multiple local regions.Actually, one purpose of our work is to explore how to automatically enhance the significance of local crucial regions in deep FER, while any landmarks are not given as the prior information of facial crucial regions.Thus, in order to validate it, we make an analysis for the weights of 16 local regions obtained by our non-local attention for RAF-DB dataset.
First, the visualization results from 16 persons are shown in Fig. 10.In Fig. 10, the first and third rows show the original facial expression images, and the second and fourth rows exhibit the matrix (4×4) of the final global weights w g (16×1) corresponding to 16 local regions.For each matrix, the darker the color is, the higher the weight is.From Fig. 10, it is obvious that some crucial regions obtain higher weights and non-crucial regions get smaller weights for each facial expression.For examples, the areas including or around eyes are given higher weights for the first person in the first row, where the maximum is given the local region located at the coordinate (2,2) including eyes.For the sixth person in the first row, four local regions (located at (3,2), (3,3), (4,2), (4,3)) including his mouth are boosted and given higher weights.In the third and fourth rows, the local regions located around eyes and the mouth are boosted for the second person, and the whole regions including eyes are given higher weights for the last person.Visually, these enhanced local regions are more discriminative and significant for FER.
From Fig. 10, it is also observed that the location of crucial regions is different for different facial images.But, our network still automatically tracks down more discriminative regions for each different face, without the supervision of any annotated crucial points.Based on this, secondely, we make an experiment to pursue the change of weights corresponding to each local region in the process of training our model.Fig. 11 shows the change of non-local weights in the training process.In Fig. 11, the first row shows the original image and its final global weights obtained by our model, the second and third rows show the given global weights of 16 local regions in the initial, 250 th , 500 th , 750 th , 1000 th and 1250 th iterations, respectively, and the last row shows the final weights.From Fig. 11, it is seen that the non-local weight of each local patch is same at the beginning of training, which implies that each local region is initially regarded as the equal importance.With the training of our network, each local region is given different weights, and the higher weights are given some more discriminative regions, such as the patches (located at (4,2) and (4,3)) including the mouth shown in Fig. 10(a), the patches (located at (3,2), (3,3), (4,2) and (4,4)) in Fig. 10(b), et al..It illustrates that some more crucial local regions can be adaptively enhanced in the training of our network without any landmarks.
In order to better observe the change of weights, we also show the change of weights corresponding to 16 local regions in all iterations in Fig. 12. From Fig. 12, it is seen that the weight value fluctuates at the beginning of network training and it is gradually stabilized until the end of the training.Some patches that are visually more discriminative are lightened  with higher weights and some patches located at the non crucial regions cut down with smaller weights.In summary, the analyses for non-local weights demonstrate that the proposed method can effectively automatically enhance the significance of facial crucial regions in deep feature learning, without any given prior information of facial crucial regions.

D. Visualization of Local Attentions
In the proposed method, the local attention is designed to deal with the problem that local regions is missed or obscured.In this part, the visualization of local attentions will be shown to validate the robustness of the proposed method for faces with missing regions, experimented on RAF-DB database.Note that the sigmoid function is employed to select the information flowing into the next layer in our local attention model.Fig. 13 shows visual results of local attentions obtained by our method.
In Fig. 13, the 1th and 3rd rows show one original facial image and six obscured images (from 2nd to 7th columns), and the 2nd and 4th rows show the weights of 16 patches of each facial image obtained by our method.Compared with the result of the original images (shown in the first column of Fig. 13), it is found that the weight is weakened while one patch is obscured and the weights of other patches are unchanged.Note that the weights of some adjacent patches are also decreased with the central patch, due to overlap pixels between two adjacent patches.Practically, the local vector encoded based on one obscured patch is given a small weight, which effectively diminishes the influence of that obscured patch for facial expression recognition.In short, the experimental results illustrate that the proposed method equipped with the local attention is more robust for complex facial expression databases in practice.

E. Analyses for the parameter α
In the non-local attention network, we formulate Eq.( 1) to obtain the non-local feature vector g * based on the global information of facial expression, where the parameter α is used to traff off the feature vectors g and s.In the previous experiments, we set α = 0.7.Therefore, we make an analysis to observe the performance of the proposed method with different values of α in this part.In this experiment, the experimental setups are same as the above experiments except α, and α is set as {0, 0.1, 0.2, ...,0.9, 1}, respectively.Table V shows the accuracy under different α for five datasets.
From Table V, it is seen that the accuracy is firstly increased and then decreased with a change in trend while increasing the value of α.According to Eq.( 1), we get g * = g if α = 0 and g * = s if α = 1.Combining the network optimization, it is known that the back propagation in LNLAttenNet has no constraint on s when α = 0, which implies that the same effect (or feedback) is given the non-local attention and each component of the non-local weights w g should be random in theory.On the contrary, α = 1 means that the back propagation has no constraint on the global vector g, which means the back propagation in LNLAttenNet has no global information and may result in an extreme result.Actually, as shown in Fig. 14, we also find that the obtained weights (w g ) tend to be random under a small α and equal under a large α, which effectively verifies the effect of α as same as the above analysis.
where γ is around 1/3, n 2 = M and P size is the size of each patch.Note that the parameters of our network except M is set as same as previous experiments.
From Table VI, it is observed that the performance with more local regions is superior to with less local regions.It implies that the size of each local region is too large to attain multiple diverse local information when M is set as a small value.Whereas, it is also notice that the computational complexity will be increased when M is set as a high value, and thus we finally set M = 16 to implement most experiments.

G. Analyses for Overlapped Pixels between Local Regions
In the previous experiments, 1/3 of whole pixels in each patch are applied as the overlapping pixels between two neighbor patches, which is a more appropriate value, since the number of pixels overlapping between the middle patch and both sides is only 2/3 , and the information of 1/3 of the pixels at the center of patch is still retained.If a larger number of overlapping pixels is employed, such as 1/2, the middle patch will completely overlap with the patches on both sides.If a smaller number is used, such as 1/4, the number of pixels in the overlapping region will be too small to solve the problem of regional connectivity.In order to analyze the influence of overlapping pixels between two patches, an experiment that other experimental settings are same to before is implemented based on RAF-DB dataset, and the result is shown in Table VII.In Table VII, it shows accuracies obtained by the proposed method based on different number (N ) of overlapping pixels.From the results, it is seen that the performance on the test set increases slowly to plateau as the number of overlapping pixels increases.It illustrates that the more the overlapping pixels are, the larger the number of network parameters are.According to our analyses, the main reason is that it is easier to introduce redundant information between adjacent patches when the number of overlapping pixels is larger.

V. CONCLUSION
In this paper, we propose the LNLAttenNet method to effectively explore the significance of facial crucial regions in feature learning for FER, without any landmark information.In LNLAttenNet, the global information of the facial expression is utilized to construct the non-local attention network, and meanwhile the local information is utilized to supervise selfinformation.By the joint optimization of facial non-local and local feature vectors, LNLAttenNet can adaptively enhance more crucial regions in the process of deep feature learning.Specifically, an ensemble of multiple networks corresponding to local regions is constructed to integrate the local feature with the non-local weights, which achieves the interactive guidance between the facial global and local information.Experimental results also demonstrate that some local crucial regions can be effectively enhanced in feature learning by LNLAttenNet while there are not any given information of landmarks in the training model.Moreover, the proposed method focuses on enhancing facial crucial regions in FER without any landmark information based on multiple patches, and thus we will explore it from the view of pixels for facial expressions in the further works.

Fig. 1 .
Fig. 1.An illustration of facial crucial regions from six expressions, where two facial images (ID1 and ID2) are shown for each expression.The regions around eyes and mouths are cropped as examples of FCRs in the purple box and the green box, respectively.

Fig. 2 .
Fig. 2. Schematic diagram of the pixel deviations at image level when posture changing.To demonstrate this change, we measured the movement of 68 landmark points on the faces with different postures and the same identity.In figure (a) and (b), 68 landmark points are marked with a green cross, and figure (c) shows the movement of 68 landmark points.

Fig. 3 .
Fig. 3.A simple view of the proposed model (LNLAttenNet).The part in the green dotted box shows the global weights corresponding to 16 local regions (from Patch 1 to Patch 16) obtained by LNLAttenNet, and the part under the green dotted box is a simple framework of LNLAttenNet.

Fig. 4 .
Fig. 4. The framework of the proposed model (LNLAttenNet).LNLAttenNet uses U-Net to generate feature map with the same resolution as the input image.Then, its feature map (Conv9-2) is cropped into M local patches to construct the local multi-networks ensemble model, where each patch is used to generate an individual network based on the structure of Simple Net.The feature map (Conv5-2) is used to construct the global attention network.Finally, the global and local features are integrated based on the global weights, and then three fully connected layers are followed.

Fig. 7 .
Fig. 7.The structure of Simple Network

Fig. 10 .
Fig. 10.Non-Local weights of 16 local regions of one face in RAF-DB obtained by the proposed model.The first and third lines show the facial images, and the second and forth lines show the Non-Local weights of 16 local regions corresponding to images.

Fig. 11 .
Fig. 11.16 non-local weights of two input images.In the first row, the input image and the non-local weights corresponding to each patches is shown.In the second and third rows, the six figures show the non-local weights of the input images at different training stages,respectively.The last row shows the final non-local weights obtained by our model.

F
. Analyses for different M In our method, multiple individual networks are generated based on facial local regions, and the previous experiments are implemented with the number of local patches M = 16.Therefore, we also make an analysis for the number (M ) of local patches on five datasets.In this experiment, M is set as 4, 9, 16, 25 and 36, respectively.Table VI shows the accuracy rates with different M .In this experiment, the size of the input image is 144*144 and the size of overlapping pixels between adjacent patches is around a third of the size of each patch, which is computed by n * P size − (n − 1) * γ * P size = 144,

Fig. 12 .
Fig. 12.The change of weights w g corresponding to 16 local regions in the training process of LNLAttenNet.The abscissa represents the number of iterations in the training process and the ordinate represents the magnitude of the weight corresponding to each iteration.

Fig. 14 .
Fig. 14.The change in the non-local weight at different α • SFEW contains the statistic images selected from the movie clips with spontaneous expressions, where the labels of training set and validation set are given.Therefore, 958 training images are used as the training set and 436 validation images are as the testing set in experiments.•AffectNet contains 450,000 images with 10 categories, where each image is annotated by one volunteer.In experiments, we use 287,401 images with neutral and six basic emotions, where 283,901 images are selected as the training set and 3,500 images are selected from the validation set as the testing set.
• CK+ contains 593 sequences from 123 volunteers, where rather than pretraining.In experiments, the original images are resized to 144×144, and the training images are augmented by standard approaches, such as image flips and random cropping.The number M of local regions is set as 16, and each patch (local

TABLE IV ACCURACY
(%) OF THE PROPOSED METHOD (LNLATTENNET) COMPARED WITH STATE-OF-THE-ARTS METHODS.

TABLE V ACCURACY
RATES (%) GIVEN BY THE PROPOSED METHOD WITH DIFFERENT α.