Learning inter-class optical flow difference using generative adversarial networks for facial expression recognition

Facial expression recognition is a fine-grained task because different emotions involve only subtle facial movements. This paper proposes to learn inter-class optical flow difference using generative adversarial networks (GANs) for facial expression recognition. First, the proposed method employs a GAN to produce inter-class optical flow images from the difference between the static fully expressive samples and neutral expression samples. Such inter-class optical flow difference highlights the displacement of facial parts between the neutral facial images and fully expressive facial images, which avoids the disadvantage that the optical flow change between adjacent frames of the same expression video is not obvious. Then, the proposed method designs four-channel convolutional neural networks (CNNs) to learn high-level optical flow features from the produced inter-class optical flow images and high-level static appearance features from the fully expressive facial images, respectively. Finally, a decision-level fusion strategy is adopted to implement facial expression classification. The proposed method is validated on three public facial expression databases, BAUM_1a, SAMM and AFEW5.0, demonstrating its promising performance.


Introduction
With the development of human-computer interaction (HCI), facial expression recognition is becoming increasingly important because it makes human-computer interaction more natural [16]. Facial feature extraction is a key step in a facial expression recognition system [31]. So far, the representative facial features are the static geometry appearance [10,20,22,29]. The static geometry appearance is popularly employed to recognize facial expressions since it contains valuable information for expression detection and can effectively characterize expressions [18]. Extracting dynamic optical flow images for facial expression recognition, as a supplement to the static geometry appearance, is a challenging problem.
To improve facial expression recognition, the optical flow features of facial images are informative because they describe the regional difference between two different facial expression images. The inter-class optical flow images, computed from the difference between the static fully expressive samples and neutral samples, emphasize the dynamic expression variation (displacement) of facial parts between the neutral facial images and the fully expressive facial images. A comparison of the neutral expression with other expressions is shown in Fig. 1.
Motivated by the good properties of inter-class optical flow images, this paper proposes to employ generative adversarial networks (GANs) [9] to produce inter-class dynamic optical flow images, as a supplement to the static appearance images, so as to further enhance the discriminative power of feature representations. Our method can effectively distinguish the changes between the static fully expressive samples and neutral samples, thereby significantly improving the performance of facial expression recognition. The inter-class optical flow images and the static appearance images are separately fed into the designed convolutional neural networks (CNNs) to learn high-level feature representations. Then, the classification results based on them are fused at the decision level for facial expression recognition.
[Fig. 1: comparison of the neutral expression with other expressions; panels: Joy, Neutral, Sad, Anger, Disgust, Fear]
The main contributions of this paper are as follows: (1). We propose to employ a GAN to produce inter-class optical flow images so as to highlight the dynamic changes of facial parts from the neutral facial image to the fully expressive facial image. To the best of our knowledge, this work is the first to employ a GAN to learn inter-class optical flow difference for facial expression recognition. (2). We design four-channel convolutional neural networks (CNNs) to learn high-level optical flow features from the produced optical flow images and the static appearance features from the fully expressive facial images, respectively. (3). The proposed method is verified on three public databases, BAUM_1a, SAMM and AFEW5.0, showing its effectiveness.

Related work
Facial expression recognition aims to perceive users' emotional changes, which is useful in many application scenarios, such as monitoring a patient's status [1], enhancing spatial perception [2], and making decisions faster in emergency situations [24]. The main methods of facial expression recognition usually utilize the static appearance and optical flow information from facial images. Static appearance images are popularly employed to recognize facial expressions and achieve good recognition performance [7]. The static appearance of the mouth and eyes is employed to recognize smiling [3]. The static appearance of the nose is employed to recognize the facial expression of disgust [28]. The spatial emotional region is employed to recognize facial expressions by a new action-unit attention mechanism [21]. Dhall A et al. [8] employ local binary pattern-three orthogonal planes (LBP-TOP), pyramid of histogram of gradients (PHOG) and local phase quantisation (LPQ) to recognize facial expressions. Kayaoglu et al. [11] present a multimodal categorical emotion recognition method based on key frame selection from video. Yao A et al. [37] present a pair-wise learning strategy to learn the AU-aware features. Zhalehpour S et al. [38] employ the static appearance images to recognize facial expressions. Knor et al. [12] present a lightweight Dual-Stream Shallow Network (DSSN) to learn deep micro-expression features. Pan H et al. [26] employ low-level descriptors of static appearance images to recognize facial expressions. Li et al. [23] adopt two methods (Branches [19] and ApexME [14]) with MiE-X pre-training to improve the facial expression recognition accuracy. The gray-level co-occurrence matrix is adopted to represent facial images and achieves the expected recognition results [13].
Although static appearance images can achieve promising recognition results, fusing the optical flow information can mine more contextual and deeper information for facial expression recognition. At present, numerous studies have focused on fusing optical flow images with other features to improve recognition performance [33]. Pan X et al. [27] employ the static appearance images and intra-class optical flow images of adjacent frames to identify facial expressions. Zhang S et al. [39] propose a hybrid deep learning model for facial expression recognition. Action units (AUs) are widely employed to represent the main facial appearance region changes in the mouth and eyes. It is necessary to guide our method to focus on optical flow information for facial expression recognition. Weighted spatial and optical flow features are employed to classify facial expressions and outperform state-of-the-art methods on CASME II and SMIC-HS [34]. The spatial and optical flow features are learned simultaneously and play a significant role in facial expression recognition [36]. The apex frame of the frequency domain is detected and employed to recognize micro-expressions [16]. The frequency domain is adopted to yield high-level features by the learnable multiplication kernel for facial expression recognition [32]. The optical flow features are extracted and combined with facial landmarks to recognize facial expressions. In this way, these methods achieve competitive facial expression recognition performance [15].
From the above-mentioned previous works, we can see that it is feasible to fuse optical flow information with other features to improve facial expression recognition performance. However, identifying subtle facial movements between different facial expressions is still a challenging problem. Many researchers adopt the intra-class optical flow information between the fully expressive (peak) frame and the neutral frame from the same sample for facial expression recognition [4,40]. Nevertheless, few researchers focus on employing inter-class optical flow information that reflects the difference between the static fully expressive samples and neutral samples to improve facial expression classification. Therefore, this paper proposes to employ a generative adversarial network (GAN) to learn the inter-class optical flow difference between the static fully expressive samples and neutral samples, and then the learned optical flow difference is combined with the learned static appearance features from the fully expressive facial images to recognize facial expressions at the decision level.

The proposed method
The framework of the proposed method is shown in Fig. 2. It has three key steps: (1). Generating inter-class optical flow images with a GAN; (2). Designing four-channel CNNs to learn high-level optical flow features from the produced optical flow images and the static appearance features from the fully expressive facial images, respectively; (3). Implementing decision-level fusion for final expression classification.
All facial images in the training datasets have facial expression labels, so we can produce the inter-class optical flow images between the static fully expressive samples and neutral samples in the training datasets. In real-world application scenarios, however, the labels of facial expression images in the testing datasets are not available; only the labels of the training images are known. Therefore, we cannot directly obtain inter-class optical flow images between the static fully expressive samples and neutral samples in the testing datasets. To solve this problem, generative adversarial networks (GANs) [9] are designed to generate inter-class optical flow images between the static fully expressive samples and neutral samples. The inter-class optical flow images between a fully expressive image (such as joy) and the neutral facial images carry more displacement information, reflecting the dynamic expression variation (displacement) of facial parts between the neutral facial images and the fully expressive facial images. A four-channel CNN structure is designed to recognize the facial expression categories of inter-class optical flow images and static appearance images simultaneously. Considering the complementarity between the dynamic inter-class optical flow images and the static appearance facial images for facial expression classification, a decision-level fusion strategy is designed to integrate the four-channel classification results for final facial expression recognition.

Generating inter-class optical flow images with a GAN
The facial action coding system (FACS) shows that the movements of key regions are the most important clues for facial expression recognition [5]. Thus, the neutral facial expression can be used as the reference point for a comparison with other fully expressive categories. The displacement variation between neutral expression images and other fully expressive images is employed to reflect the dynamic expression variation. This can be embodied by producing inter-class optical flow images, which are used for facial expression classification in a similar way to intra-class optical flow images. Because the input image may contain noise, it is necessary to perform noise reduction when extracting inter-class optical flow images. To reduce the impact of noise, FaceNet is employed to preprocess and crop face images [30].
To generate inter-class optical flow images between neutral expression images and other fully expressive facial expression images, GANs [9] provide a possible solution. GANs can directly generate inter-class optical flow images from the target facial expression images and neutral facial expression images. Because every expression is compared with the neutral expression, the proposed GAN enlarges the differences among the optical flow features of different facial expressions. It also avoids the disadvantage that the optical flow change between adjacent frames of the same expression video is not obvious. The design of the GAN is shown in Fig. 3. In detail, a facial image x (a static appearance image) is fed into a network K (generator) to generate an inter-class optical flow image y relative to the neutral expression image. Similarly, y is fed into the network R (generator) to generate the facial image x. Given the training samples $\{x_i\}_{i=1}^{N} \in X$ and $\{y_i\}_{i=1}^{N} \in Y$, in this work we use the cycle-GAN [42] to learn two mapping functions: $K: X \to Y$ and $R: Y \to X$. $D_X$ and $D_Y$ are two adversarial discriminators. Here, $D_X$ aims to distinguish between the facial image x and the translated image R(y). Similarly, $D_Y$ aims to distinguish between the image y and the translated image K(x). For the mapping function $K: X \to Y$ and its discriminator $D_Y$, the adversarial loss, following the cycle-GAN formulation [42], is defined as:
$$\mathcal{L}_{GAN}(K, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(K(x)))] \tag{1}$$
where $\mathbb{E}$ represents expectation. Likewise, for the mapping function $R: Y \to X$ and its discriminator $D_X$, the adversarial loss is defined as:
$$\mathcal{L}_{GAN}(R, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(R(y)))] \tag{2}$$
Supposing that the learned mapping functions are cycle-consistent, the cycle consistency loss is defined as:
$$\mathcal{L}_{cyc}(K, R) = \mathbb{E}_{x \sim p_{data}(x)}[\| R(K(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| K(R(y)) - y \|_1] \tag{3}$$
Then, the full objective function is defined as:
$$\mathcal{L}(K, R, D_X, D_Y) = \mathcal{L}_{GAN}(K, D_Y, X, Y) + \mathcal{L}_{GAN}(R, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(K, R) \tag{4}$$
where λ controls the relative importance of the adversarial and cycle consistency objectives. The network parameters are obtained by optimizing formula (4) on the facial expression training dataset.
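The loss terms described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the generators and discriminators are passed in as plain Python callables, the L1 term is taken as a mean absolute error (the L1 norm up to a constant factor), and the default value of `lam` is only a placeholder because the paper does not report λ.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """E[log D(real)] + E[log(1 - D(fake))]; discriminator outputs lie in (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def cycle_consistency_loss(x, x_rec, y, y_rec):
    """Mean absolute error of both reconstructions: |R(K(x)) - x| + |K(R(y)) - y|."""
    return float(np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y)))

def full_objective(K, R, D_X, D_Y, x, y, lam=10.0):
    """Two adversarial terms plus the weighted cycle term (lam is an assumed value)."""
    fake_y = K(x)  # facial image -> optical flow image
    fake_x = R(y)  # optical flow image -> facial image
    loss_k = adversarial_loss(D_Y(y), D_Y(fake_y))
    loss_r = adversarial_loss(D_X(x), D_X(fake_x))
    loss_cyc = cycle_consistency_loss(x, R(fake_y), y, K(fake_x))
    return loss_k + loss_r + lam * loss_cyc

# Toy check with stand-in networks: identity generators and a discriminator
# that always answers 0.5 (hypothetical placeholders, not trained models).
rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))
y = rng.random((4, 8, 8))
identity = lambda img: img
undecided = lambda img: np.full(img.shape[0], 0.5)
total = full_objective(identity, identity, undecided, undecided, x, y)
print(total)  # close to 4 * log(0.5): every D output is 0.5 and the cycle error is zero
```

In actual training the generators would minimize this quantity while the discriminators maximize it, which the sketch does not implement.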
Therefore, given an input facial image, the GAN can compute the inter-class optical flow image relative to the neutral expression image, as shown in Fig. 3. The input facial image size is 299 × 299 × 3. In the GAN training process, the template of inter-class optical flow images relative to the neutral expression images is built on the BAUM_1a facial expression dataset [38], because the facial expression images of the BAUM_1a dataset are very standardized and comprehensive. In detail, the template of inter-class optical flow images relative to the neutral expression images is computed by formula (5). Here, $M_p^l$ represents the inter-class optical flow image relative to the neutral expression image, where l indexes the type of facial expression, such as joy. $S^{(l)}_{c,j}$ denotes the l-th type of facial image, $N$ represents the neutral expression image, and $S^{(l)}_{d,j}$ represents the number of facial images belonging to the l-th facial expression.
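Formula (5) itself is not reproduced in the text above, so the following is only one plausible reading of the description: the template for an expression class is the average of the flow fields computed between the neutral image and each image of that class. The function names are hypothetical, and `toy_flow` is a stand-in (a plain intensity difference) used only to make the sketch runnable; a real pipeline would plug in an actual optical flow estimator.

```python
import numpy as np

def flow_template(expressive_images, neutral_image, flow_fn):
    """Average flow_fn(neutral, image) over all images of one expression class."""
    flows = [flow_fn(neutral_image, img) for img in expressive_images]
    return np.mean(np.stack(flows, axis=0), axis=0)

def toy_flow(reference, target):
    # Stand-in only: an intensity difference, NOT a real optical flow estimator.
    return target.astype(float) - reference.astype(float)

# Two hypothetical "joy" images compared against one neutral image.
neutral = np.zeros((4, 4))
joy_images = [np.ones((4, 4)), 3.0 * np.ones((4, 4))]
template = flow_template(joy_images, neutral, toy_flow)
print(template[0, 0])  # 2.0, the mean of the two per-image differences
```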

Designing four-channel CNNs for high-level feature learning
Different softmax losses used with CNNs may lead to different classification performance [35]. It is noted that the basic L1-constrained softmax loss makes the learned features prone to be separable, rather than discriminative for classification. By contrast, the L2-constrained softmax loss is able to increase the discriminative power of the learned features, thereby increasing the classification performance. Motivated by the diverse properties of the L1- and L2-constrained softmax losses on classification tasks, we design four-channel CNNs (CNN-1, CNN-2, CNN-3, CNN-4) for high-level feature learning, as shown in Fig. 2. Every channel of the proposed framework has the same CNN structure, which has 17 convolution layers, 5 max-pooling layers, and 1 fully connected (FC) layer. Among the four-channel CNNs, CNN-1 and CNN-3 employ the basic L1-constrained softmax loss, whereas CNN-2 and CNN-4 adopt the L2-constrained softmax loss. In Fig. 2, CNN-1 and CNN-2 are used to learn high-level appearance features from the static fully expressive facial images. By contrast, CNN-3 and CNN-4 are leveraged to learn the high-level inter-class optical flow difference from the produced inter-class optical flow images. In this case, the different features learned by the designed four-channel CNNs (CNN-1, CNN-2, CNN-3, CNN-4) may be complementary on facial expression classification tasks. The key step of a CNN is to implement the convolution operation with convolution layers. The goal of convolution layers is to learn more representative emotional features. The convolution layer consists of several convolution kernels. The process of convolution calculation is shown in formula (6).
where f refers to the input image, and ∅ denotes the convolution kernel. The size of the convolution kernel is 3 × 3. The goal of max-pooling layers is to reduce the emotional feature dimension. The size of maximum pooling is 2 × 2. The FC layer is employed to classify emotions. The input image size is 48 × 48 × 1, and the output of each channel CNN is a 1 × 7 vector. The proposed framework employs two independent CNNs to classify face images and two other independent CNNs to classify inter-class optical flow images simultaneously. In this way, four classification results for the input image are obtained. The training processes of the different CNNs on BAUM_1a, SAMM and AFEW5.0 are illustrated in Figs. 4, 5 and 6, respectively. The horizontal axis is the number of training epochs and the vertical axis is the loss value. Acceptable convergence is reached within 1000 steps during training on all three datasets.
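The two layer types described above can be illustrated with a small NumPy sketch. It assumes stride 1 and no padding for the 3 × 3 convolution, which the text does not specify, and it computes the sliding-window product without flipping the kernel, as most CNN frameworks do. It shows a single kernel and a single pooling step, not the 17-layer network.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2 x 2 block."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

# A 48 x 48 single-channel input, matching the input size stated in the text.
image = np.ones((48, 48))
kernel = np.ones((3, 3)) / 9.0      # illustrative averaging kernel, not a learned one
features = conv2d_valid(image, kernel)
pooled = max_pool_2x2(features)
print(features.shape, pooled.shape)  # (46, 46) (23, 23)
```

With "same" padding instead, the convolution would keep the 48 × 48 size and only the pooling layers would shrink the feature maps.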

Decision-level fusion
Considering the complementarity among the features learned by the designed four-channel CNNs (CNN-1, CNN-2, CNN-3, CNN-4) on facial expression classification tasks, integrating the results of the four channels may provide better performance [17]. To this end, a decision-level fusion strategy is adopted to improve the performance of facial expression recognition. This is shown in formula (7).
where $A_{optical}$ and $B_{optical}$ represent the classification scores of CNN-3 and CNN-4, respectively, from the produced inter-class optical flow images. $C_{static}$ and $D_{static}$ represent the classification scores of CNN-1 and CNN-2, respectively, from the static appearance images. $w_1$, $w_2$, $w_3$, $w_4$ are the weights of the decision makers, which satisfy formula (8). We search every w in the range [0, 1] with an interval of 0.1 subject to formula (8), and find that $w_1 = 0.1$, $w_2 = 0.1$, $w_3 = 0.4$ and $w_4 = 0.4$ give the best performance on facial expression recognition tasks. Once the optimal weight values of $w_1$, $w_2$, $w_3$, $w_4$ are fixed, we take the class with the maximum total fused classification score as the final class.
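The weight search described above can be sketched as follows. Formulas (7) and (8) are not reproduced in the text, so this sketch assumes that the fused score is a weighted sum of the four channel scores and that the weights sum to one, which is consistent with the reported values 0.1, 0.1, 0.4 and 0.4. The function names are hypothetical, and the data on which the authors ran the search is not stated.

```python
import itertools
import numpy as np

def fuse(scores, weights):
    """Weighted sum of channel scores; scores has shape (4, n_samples, n_classes)."""
    return np.tensordot(np.asarray(weights, dtype=float),
                        np.asarray(scores, dtype=float), axes=1)

def candidate_weights(step=0.1, n=4):
    """All weight tuples on a grid of width `step` in [0, 1] that sum to one."""
    ticks = int(round(1.0 / step))
    for combo in itertools.product(range(ticks + 1), repeat=n):
        if sum(combo) == ticks:  # integer test avoids floating-point rounding issues
            yield tuple(c / ticks for c in combo)

def search_weights(scores, labels):
    """Return the grid point with the highest accuracy of the fused arg-max prediction."""
    best_w, best_acc = None, -1.0
    for w in candidate_weights():
        pred = fuse(scores, w).argmax(axis=1)
        acc = float(np.mean(pred == labels))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy scores for one sample and two classes, ordered (A_optical, B_optical, C_static, D_static).
scores = np.array([[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 1.0]]])
fused = fuse(scores, (0.1, 0.1, 0.4, 0.4))
print(fused, fused.argmax(axis=1))
```

Running the search on held-out validation data rather than on the test folds keeps the reported accuracy free of weight-selection bias.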

Experimental studies
The proposed method is evaluated on three public datasets. The experimental implementation details and analysis are shown in this section.

Datasets
BAUM_1a [38]: There are 31 persons acting in 1222 video clips in the BAUM_1a dataset. It includes videos of contempt, joy, fear, anger, sadness, surprise, disgust and other facial expressions. In this work, the middle frame of each video from the BAUM_1a dataset is employed as a facial image sample for facial expression recognition. In addition, we focus on six basic expressions: surprise, sadness, joy, disgust, fear, and anger, thereby producing 521 samples for experiments. Some facial images of BAUM_1a are shown in Fig. 7.
SAMM [6]: There are 32 persons acting in 159 video clips in the SAMM dataset. It has videos of joy, fear, anger, sadness, surprise, disgust, contempt, and other facial expressions. In this work, 11,816 images from SAMM are used to evaluate the proposed method, including six basic expressions: surprise, sadness, joy, disgust, fear, and anger. Some facial images of SAMM are shown in Fig. 8.
AFEW5.0 [8]: It is divided into three parts: training, validation and testing. The training part has 723 samples, the validation part has 383 samples, and the testing part has 539 samples. It includes videos of neutral, joy, fear, anger, sadness, surprise and disgust. Some facial images of AFEW5.0 are shown in Fig. 9.

Experimental results and analysis
The well-known leave-one-subject-out (LOSO) cross-validation strategy is employed for subject-independent experiments. The respective recognition accuracies of the inter-class optical flow images and static appearance images are shown in Table 1. Table 1 shows that static appearance images play a major role in facial expression recognition, whereas the role of inter-class optical flow images in facial expression recognition

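Evaluation metrics, namely Precision, Recall, Accuracy and F1-score, are adopted in this paper. They are defined as follows:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. From Tables 2, 3 and 4, we can see that "fear" is more difficult to classify than other expressions on BAUM_1a, SAMM and AFEW5.0. This may be because fear is easily confused with other expressions, such as sadness or anger, on these datasets.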
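The evaluation metrics named above can be computed per class from a confusion matrix, as in the following sketch. It uses the standard one-vs-rest definitions; whether the paper averages the per-class values (and how) is not stated, so no averaging is applied here, and the labels in the example are made up for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t, p] counts samples of true class t predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_metrics(cm):
    """Per-class precision, recall and F1, plus the overall accuracy."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp  # belonging to the class but predicted as another
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
precision, recall, f1, accuracy = per_class_metrics(cm)
print(precision, recall, f1, accuracy)
```

A class such as "fear" that is often confused with others shows up here as a low recall for its row together with inflated false positives in the columns of the classes it is mistaken for.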

Performance comparison with state-of-the-art methods
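To show the effectiveness of the proposed method, we compare our results with state-of-the-art methods. As shown in Tables 5, 6 and 7, the proposed method improves accuracy by 3.38% over Pan X et al. [27] on the BAUM_1a dataset, by 1.74% over Pan H et al. [26] on the SAMM dataset, and by 0.25% over Kayaoglu et al. [11] on the AFEW5.0 dataset. Against the strongest listed baselines on BAUM_1a (Zhang S et al. [39]) and SAMM (ApexME with MiE-X pre-training [23]), the margins are 1.13% and 1.00%, respectively.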
Note that Knor et al. [12] present a lightweight Dual-Stream Shallow Network (DSSN) to learn deep micro-expression features. Pan H et al. [26] employed low-level descriptors of static appearance images to recognize facial expressions in SAMM. Li et al. [23] adopted two methods [14,19] with MiE-X pre-training to improve the accuracy on SAMM. A comparison of results is shown in Table 5.
Zhalehpour S et al. [38] also employed the static appearance images to recognize facial expressions in BAUM_1a. Pan X et al. [27] employed the static appearance images and intra-class optical flow images of adjacent frames to identify facial expressions. A comparison of results is shown in Table 6. AFEW5.0 is one of the databases for the video and image based emotion recognition in the wild challenges of 2015. Dhall A et al. [8] describe the baselines, data and protocols for the challenges. Because the testing part of AFEW5.0 has no labels, we compare on the validation part. Kayaoglu et al. [11] employed the Minimum Sparse Reconstruction of key static appearance images to identify facial expressions in AFEW5.0. Yao A et al. [37] presented a pair-wise learning strategy to learn the AU-aware features in AFEW5.0. A comparison of results is shown in Table 7.

Conclusions and future work
This paper proposes a new facial expression recognition method, which learns inter-class optical flow difference with a designed GAN. Such inter-class optical flow difference aims to emphasize the displacement of facial parts between the neutral facial images and fully expressive facial images. The GAN produces inter-class optical flow images so as to highlight the dynamic changes of facial parts. The designed four-channel convolutional neural networks (CNNs) can learn high-level optical flow features from the produced optical flow images and the static appearance features from the fully expressive facial images, respectively. From the above-mentioned experiments on the SAMM, BAUM_1a and AFEW5.0 datasets, we conclude that integrating the inter-class optical flow images and the static appearance images yields promising performance on facial expression recognition tasks. In the future, it is meaningful to explore the performance of advanced and effective shallow CNNs [25] on facial expression recognition owing to the limited samples. In addition, it is a research direction to investigate the performance of our method on cross-database micro-expression recognition tasks [41]. It will also be interesting to design a more comprehensive deep learning framework to further promote facial expression recognition performance.

Table 5 Comparison against state-of-the-art methods for the SAMM dataset

| Dataset | Methods | Accuracy |
| --- | --- | --- |
| SAMM | Pan H et al. [26] | 58.46% |
| SAMM | Branches [19] | 53.30% |
| SAMM | Branches + MiE-X pre-train [23] | 55.40% |
| SAMM | ApexME [14] | 56.80% |
| SAMM | ApexME + MiE-X pre-train [23] | 59.20% |
| SAMM | Ours | 60.20% |

Table 6 Comparison against state-of-the-art methods for the BAUM_1a dataset

| Dataset | Methods | Accuracy |
| --- | --- | --- |
| BAUM_1a | Zhalehpour S et al. [38] | 47.44% |
| BAUM_1a | Pan X et al. [27] | 53.60% |
| BAUM_1a | Zhang S et al. [39] | 55.85% |
| BAUM_1a | Ours | 56.98% |

Table 7 Comparison against state-of-the-art methods for the AFEW5.0 dataset

| Dataset | Methods | Accuracy |
| --- | --- | --- |
| AFEW5.0 | Kayaoglu et al. [11] | 44.47% |
| AFEW5.0 | Yao A et al. [37] | 44.04% |
| AFEW5.0 | Ours | 44.72% |