Multi-scale joint feature network for micro-expression recognition

Micro-expression recognition lies at the intersection of psychology and computer science, and it has a wide range of applications (e.g., psychological and clinical diagnosis, emotional analysis, and criminal investigation). However, the subtle and diverse changes in facial muscles make it difficult for existing methods to extract effective features, which limits improvements in recognition accuracy. Therefore, we propose a multi-scale joint feature network based on optical flow images for micro-expression recognition. First, we generate an optical flow image that reflects subtle facial motion information. The optical flow image is then fed into the multi-scale joint feature network for feature extraction and classification. The proposed joint feature module (JFM) integrates features from different layers, which helps to capture micro-expression features with different amplitudes. To improve the recognition ability of the model, we also adopt a strategy for fusing the prediction results of the three JFMs with that of the backbone network. Our experimental results show that our method is superior to state-of-the-art methods on three benchmark datasets (SMIC, CASME II, and SAMM) and a combined dataset (3DB).


Introduction
Micro-expressions are brief facial expressions that people unconsciously make when they try to hide a real emotion. They usually appear when people are in a critical situation. Initially, micro-expressions were defined as instant facial expressions during psychotherapy or micro-momentary facial expressions in nonverbal communication [1]. Later, Ekman and Friesen [2] analyzed a conversation video between a psychiatrist and a depressed patient, and observed that painful expressions occasionally appeared during the patient's smile, which they called micro-expressions. This was the first time that the term micro-expression was used. At the same time, related researchers also defined micro-expressions as facial expressions that spontaneously reveal people's real emotions and are difficult to disguise. Therefore, micro-expression recognition has a wide range of applications in psychological and clinical diagnosis, emotional analysis, criminal investigation, and national defense security.
Micro-expression recognition is difficult because (i) the duration of micro-expressions is very short, under 1/5 second, (ii) they only appear in specific areas of the face, and the intensity of the change is very weak, and (iii) existing micro-expression samples are scarce and the numbers of samples in different classes are not balanced. Therefore, fully capturing the features of micro-expressions and improving recognition accuracy are still very challenging problems.
Since the task of micro-expression recognition was proposed, many methods have been suggested. In earlier years, Ekman [3] developed a micro-expression training tool (METT) to improve people's perception of micro-expressions. However, even with the help of METT, the recognition accuracy of professionals is still very low. With the development of computer vision and image processing technology, some researchers have used feature descriptors to extract features from images [4-6]. However, it is difficult to accurately capture tiny facial motion information by directly extracting features from the original images. Therefore, the works in Refs. [7-9] extract micro-expression features by calculating facial optical flow.
In addition, early deep learning models combined convolutional neural networks (CNNs) and long short-term memory (LSTM) for micro-expression recognition. The complexity of such models can easily lead to overfitting during training when there are insufficient samples. Transfer learning and data augmentation techniques were introduced to overcome the difficulty of insufficient training samples [10,11], but they increase the cost of model training. At present, some research works [12,13] tend to use shallow multi-stream networks to improve the performance of the model on small datasets and in class-imbalanced situations.
In this paper, we avoid brute-force approaches that use different serial network combinations or multi-stream structures to improve recognition performance. Instead, we use features from different layers of a single convolutional neural network, and propose a multi-scale joint feature network based on optical flow images. Since the intensity of micro-expressions differs between samples, effective features of samples with weak intensity can be obtained in low layers but may be lost in high layers. However, for samples with relatively high intensity, more effective features can be extracted in the high layers of the network. Therefore, we hypothesize that features at different layers all contribute to the classification performed by the network. To make better use of features from different layers, we propose a joint feature module. Further, we design two fusion strategies and observe the impact of each strategy on classification performance through experiments. Finally, we compare the proposed method with state-of-the-art methods; the results show that the performance of our model is competitive on three benchmark datasets and a combined dataset.

Related work
In its early stages, the slow development of micro-expression recognition was mainly due to the lack of a well-established database. The samples in early micro-expression databases [5,14] were simulated, non-spontaneous expressions of subjects. With the establishment of spontaneous micro-expression databases, including SMIC [15], CASME [16], CASME II [17], CAS(ME)², and SAMM [18], relevant studies increased. At present, existing methods fall into two main categories: traditional methods and deep learning methods.

Traditional methods
Traditional methods generally combine handcrafted feature extraction with classical machine learning methods. Zhao and Pietikainen [6] used local binary patterns on three orthogonal planes (LBP-TOP) to enhance time dimension features. Subsequently, to enrich features in the time domain, completed local quantized patterns (CLQP) [19] and local binary patterns with six intersection points (LBP-SIP) [20] were successively proposed. Ben et al. [21] proposed three different methods using binary face descriptors, in which DCP-TOP based on dual-cross patterns and HWP-TOP based on hot wheel patterns encode features from micro-expression sequences. Wang et al. [22] considered the influence of color space on feature extraction and proposed a tensor independent color space (TICS) method to improve the performance of micro-expression recognition. Huang et al. [23] presented a method based on spatiotemporal local binary patterns and improved integral projection to extract facial feature information and distinguish feature information between different micro-expression classes.
To capture subtle facial motion information, optical flow [24,25] was introduced into the field of micro-expression recognition. Liu et al. [8] put forward the main directional mean optical-flow (MDMO) method based on regions of interest, which reduced the dimensionality of features and improved the robustness of micro-expression recognition. As MDMO can cause loss of the underlying manifold structure, Liu et al. [26] proposed sparse MDMO, which constructs an effective sparse representation of micro-expressions by incorporating a new metric into GraphSC. Xu et al. [9] proposed facial dynamic maps (FDM), which took the optical flow field as the basis for extracting features of micro-expressions at different granularity. Liong et al. [7] proposed a new feature extractor, bi-weighted oriented optical flow (Bi-WOOF), which weighted the optical flow direction histogram twice to highlight facial motion, while showing that the micro-expression apex frame can provide sufficiently meaningful feature expressions.

Deep learning methods
Deep learning can directly learn hierarchical visual features from images, and has a wide range of applications in image classification and recognition, such as face recognition [27] and facial expression recognition [28]. Direct feature extraction from an original micro-expression sequence was attempted early on. Kim et al. [29] adopted a CNN to encode spatial features, and then used LSTM to extract temporal features of micro-expressions from continuous spatial features. However, the serial combination of these two deep networks increases the complexity of the model, with a tendency to overfit on the small datasets that are typical for micro-expressions. To avoid the problem of insufficient data, Peng et al. [30] and Wang et al. [31] both used a macro-expression dataset to train the network to obtain a pre-trained model and then applied it to the micro-expression recognition task. In addition to a macro-expression dataset for pre-training, Wang et al. [32] also used 560 micro-expression video clips from three micro-expression datasets to expand the network training data. Xia et al. [11] expanded micro-expression data based on multi-scale data augmentation by Eulerian video magnification (EVM) [33]. Quang et al. [34] used both transfer learning and data augmentation techniques to reduce the risk of overfitting during model training.
Due to the short duration of micro-expressions and their weak intensity, using the original image as input is not an effective way for the network to extract useful features. Therefore, to emphasize the main features, optical flow, a vector field describing the motion of objects, is widely used in deep learning methods for micro-expression recognition. In such works [10,35], a CNN and LSTM are combined to extract micro-expression features from the original image and the optical flow image generated by the micro-expression motion. Inspired by Ref. [7], some subsequent studies only use the onset frame and the apex frame to represent a micro-expression sequence, reducing the computational cost of the network by avoiding redundant data. Liu et al. [36] also adopted adversarial training and EVM based on a pre-trained ResNet18 to improve the accuracy of micro-expression recognition. In addition, recent studies tend to design shallow networks and improve the performance of the model by increasing the number of branches of the network. For example, Refs. [12,13,37] designed dual-stream networks for micro-expression recognition, and Liong et al. [38] proposed a shallow triple-stream three-dimensional CNN (STSTNet) for feature extraction and classification.

Method
To establish effective micro-expression recognition features, a novel micro-expression recognition framework is designed in this paper (see Fig. 1). Given a micro-expression sequence, we first generate an RGB optical flow image (Section 3.1) that describes the facial changes in the micro-expression, and then feed the optical flow image to the multi-scale joint feature network (Section 3.2) for feature extraction and classification.

Optical flow image
The micro-expression samples in the datasets were collected by high-speed cameras (100-200 fps), and each micro-expression sequence can be decomposed into multiple frames. The short duration and subtle intensity of micro-expressions make the changes between consecutive frames in the video sequence indistinct, that is, each frame is very similar to its neighbors. If we input all frames of a video sequence directly into the network, it will not only increase the computational cost but also result in feature redundancy. Instead, inspired by Refs. [7,39], we only select the onset frame and the apex frame to represent each micro-expression sequence. Relative to the onset frame, the apex frame has the most obvious micro-expression intensity, but the change is extremely subtle compared to those of macro-expressions, and the number of samples in the micro-expression datasets is low. If the onset frame and apex frame are input to the network directly, it will be difficult for the network to learn effective features. Therefore, we use the onset frame and the apex frame to derive an optical flow image representing the dynamic changes in the micro-expression as the input of the network.
Optical flow, a two-dimensional vector field that describes the motion of pixels between two images over time, can capture subtle changes in the face. The calculation of optical flow rests on two basic assumptions: brightness constancy, and continuous (small) motion. For a micro-expression sequence, let the gray value of pixel $k$ at location $(x, y)$ in the onset frame be $G(x, y, t)$. Using the first assumption, we can obtain the gray value of pixel $k$ in the apex frame:

$$G(x + \delta x,\, y + \delta y,\, t + \delta t) = G(x, y, t) \tag{1}$$

where $\delta x$ and $\delta y$ represent the distance pixel $k$ has moved in the horizontal and vertical directions after a time $\delta t$.
Using the second assumption, the left-hand side of Eq. (1) can be expanded as a first-order Taylor series, giving:

$$G(x + \delta x,\, y + \delta y,\, t + \delta t) = G(x, y, t) + \frac{\partial G}{\partial x}\delta x + \frac{\partial G}{\partial y}\delta y + \frac{\partial G}{\partial t}\delta t + \varepsilon \tag{2}$$

where $\varepsilon$ represents a negligible higher-order term. Combining Eqs. (1) and (2), dropping $\varepsilon$, and dividing by $\delta t$ gives:

$$\frac{\partial G}{\partial x}u + \frac{\partial G}{\partial y}v + \frac{\partial G}{\partial t} = 0 \tag{3}$$

where $u = \delta x/\delta t$ and $v = \delta y/\delta t$ represent the horizontal and vertical components of optical flow, respectively. The optical flow $\mathbf{u}_k$ can then be expressed as

$$\mathbf{u}_k = [u, v]^{\top} \tag{4}$$

Further, we can obtain for pixel $k$ its optical flow magnitude $m_k = \sqrt{u^2 + v^2}$ and normalized direction $\theta_k = \arctan(v/u)/\pi$. In our work, we use TV-L1 [25] to calculate the optical flow between the onset frame and the apex frame of each sample. The pixel value at location $(x, y)$ for channel $c \in \{R, G, B\}$ in the optical flow image is then obtained from $m_k$ and $\theta_k$ using the color system $C$ [25], where $n_c = 55$ is the number of hues and $p$ is the total number of pixels in an image. In an optical flow image, different colors indicate different directions of pixel movement, and the intensity of the colors indicates the magnitude of the motion. Figure 2 shows some onset frames, apex frames, and optical flow images for different micro-expressions. It can be seen that the color change is more dramatic in areas where the micro-expression appears.
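The mapping from flow vectors to colors described above can be illustrated with a minimal sketch. It is not the color system used in the paper: as a stand-in for the 55-hue color wheel it uses a plain HSV mapping (hue from direction, saturation from magnitude normalized by the largest magnitude in the image), and it takes a hand-written toy flow field rather than TV-L1 output. The function names `flow_to_rgb` and `flow_field_to_image` are hypothetical.

```python
import colorsys
import math


def flow_to_rgb(u, v, max_magnitude):
    """Map one flow vector (u, v) to an (R, G, B) triple with values in [0, 1]."""
    magnitude = math.hypot(u, v)            # m_k = sqrt(u^2 + v^2)
    # atan2 is the quadrant-aware form of arctan(v/u); the result lies in [-1, 1].
    direction = math.atan2(v, u) / math.pi
    hue = ((direction + 1.0) / 2.0) % 1.0   # shift the direction into [0, 1) for HSV
    # Assumption: magnitudes are normalized by the largest magnitude in the image.
    saturation = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)


def flow_field_to_image(flow):
    """Convert a 2D list of (u, v) tuples into a 2D list of RGB triples."""
    max_magnitude = max(math.hypot(u, v) for row in flow for (u, v) in row)
    return [[flow_to_rgb(u, v, max_magnitude) for (u, v) in row] for row in flow]


# Toy 2 x 2 flow field: one static pixel and three pixels moving in different directions.
toy_flow = [[(0.0, 0.0), (1.0, 0.0)],
            [(0.0, 2.0), (-1.0, 0.0)]]
toy_image = flow_field_to_image(toy_flow)
```

In this sketch a static pixel maps to white, and pixels moving in different directions receive different hues, which mirrors the qualitative behavior described for Fig. 2.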

Multi-scale joint feature network
In this section, we introduce the details of the proposed network structure for micro-expression recognition. The structure consists of three parts: the backbone network, the joint feature module, and the fusion module. The backbone network and the joint feature module are used to extract features from optical flow images. The fusion module combines the outputs of the backbone network and the joint feature modules to obtain the final prediction according to the fusion strategy. Details are given in the following subsections.

Backbone network
The backbone network consists of five convolutional layers, three max-pooling layers, and three fully connected layers. Detailed parameters of the backbone network configuration are shown in Table 1. The convolutional layers perform multi-stage feature extraction from the optical flow image through convolution operations; different convolutional layers extract features at different levels of detail. The max-pooling layers not only extract the features with the largest response, but also reduce the dimensionality of the feature output before feeding it to the next stage of the network. At the same time, max-pooling can reduce the impact of unnecessary information on model training when the number of training samples is low. The fully connected layers integrate the features from previous layers and improve the nonlinear fitting ability of the model.

Joint feature module
The magnitudes of the micro-expressions made by different subjects differ. When extracting features from the optical flow image, the earlier convolutional layers better capture fine details of the changing features; as the network deepens, such detailed features may be lost. For samples with relatively subtle facial changes in the training set, we hypothesize that making full use of features from different layers is essential to recognizing micro-expressions with different amplitudes. Therefore, we propose the joint feature module (JFM), which captures micro-expression features of different magnitudes by integrating features from different layers. The structure of the JFM is shown in Fig. 3. Let $X^l = \{x_i^l\}_{i=1}^{c}$ be the activation maps of a selected layer $l \in \{1, \ldots, 5\}$, where $x_i^l$ is the output activation map of channel $i$ in layer $l$. In this work, the 2nd, 3rd, and 4th layers of the network are each selected for fusion with the 5th layer, so that when detailed features are lost during feature extraction, the fused features drawn from different layers can still be classified effectively. Before fusion, the two feature maps from different layers are linearly transformed using 1 × 1 convolutions to unify their dimensions. The transformed feature maps can be formulated as follows:

$$\hat{x}_j^l = \sum_{i=1}^{c} w_{j,i}^l\, x_i^l, \quad j = 1, \ldots, c'$$

where $c$ and $c'$ denote the number of channels before and after convolution, respectively, and $w_{j,i}^l$ are the weights of the 1 × 1 convolution. Then, the convolution results of $X^l$ and $X^5$ are fused to get the feature map $X^m \in \mathbb{R}^{H_5 \times W_5 \times C_5}$:

$$X^m = \hat{X}^l \oplus \hat{X}^5$$

where $\oplus$ denotes addition of corresponding channels.

Fig. 3 Joint feature module (JFM). By combining features from different layers to obtain the prediction $y^m$ of the module, lower-layer features in the network can effectively supplement those from the higher layers.

Next, the ReLU $\sigma(x^m_{c,p}) = \max(0, x^m_{c,p})$ is applied to the feature map to obtain $\tilde{X}^m = \{\tilde{x}_i^m\}_{i=1}^{C_5}$, which reduces interdependence between parameters and alleviates the problem of overfitting. Then, the activated feature map is input to the max-pooling layer to reduce the feature dimensionality and extract the more representative feature $\phi^m \in \mathbb{R}^{H_m \times W_m \times C_5}$, and the subsequent fully connected layer is used to increase the non-linear fitting ability of the JFM. Finally, the softmax function normalizes the output of the fully connected layer to obtain the prediction $y_j^m$ of the $j$th class, which can be expressed as follows:

$$y_j^m = \frac{\exp\left(W_j^{\top} f(\phi^m)\right)}{\sum_{k=1}^{N} \exp\left(W_k^{\top} f(\phi^m)\right)}$$

where $f(\phi^m)$ is the output of the fully connected layer, $W_j$ is the learnable weight parameter vector for the $j$th class, and $N$ is the total number of micro-expression classes.
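The forward pass of one JFM described above (1 × 1 convolutions, channel-wise addition, ReLU, max-pooling, a fully connected layer, and softmax) can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the two input feature maps already share the same spatial size, pooling is a non-overlapping 2 × 2 window, a single fully connected layer maps directly to class scores, and all weights are random. The names `conv1x1`, `max_pool2x2`, and `jfm_forward` are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def jfm_forward(x_l, x_5, w_l, w_5, w_fc):
    """Return class probabilities y^m for one joint feature module."""
    fused = conv1x1(x_l, w_l) + conv1x1(x_5, w_5)  # unify channels, then add per channel
    activated = np.maximum(fused, 0.0)             # ReLU
    pooled = max_pool2x2(activated)                # reduce spatial dimensionality
    scores = w_fc @ pooled.ravel()                 # single fully connected layer (assumption)
    return softmax(scores)


# Toy shapes (assumptions): layer l has 8 channels, layer 5 has 16, both 4 x 4; 3 classes.
rng = np.random.default_rng(0)
x_l = rng.standard_normal((8, 4, 4))
x_5 = rng.standard_normal((16, 4, 4))
w_l = rng.standard_normal((16, 8)) * 0.1
w_5 = rng.standard_normal((16, 16)) * 0.1
w_fc = rng.standard_normal((3, 16 * 2 * 2)) * 0.1
y_m = jfm_forward(x_l, x_5, w_l, w_5, w_fc)
```

The output `y_m` is a probability vector over the three classes; in a trained network the weights would be learned jointly with the backbone rather than drawn at random.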

Fusion strategy
The prediction results of the three JFMs and the backbone network are fused to improve the recognition ability of the model. In our work, we adopt two fusion strategies to deliver the prediction results of the model. The first strategy averages the outputs of the three JFMs ($y^m$) and the output of the backbone network ($y^b$) as the final prediction $y$:

$$y = \frac{1}{n + 1}\left(\sum_{m=1}^{n} y^m + y^b\right)$$

where $n$ is the number of JFMs.
The second strategy uses softmax to normalize $Z$ as the final prediction $y$. The probability of the $j$th class can be expressed as

$$y_j = \frac{\exp(Z_j)}{\sum_{k=1}^{N} \exp(Z_k)}$$

where $j$ and $N$ are the corresponding class index and the total number of micro-expression classes, respectively.
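The two fusion strategies can be compared with a small sketch. The averaging function follows the description above directly. For the second strategy, the text does not specify how $Z$ is formed, so the sketch assumes, purely for illustration, that $Z$ is the element-wise sum of the JFM outputs and the backbone output. The function names and the toy prediction vectors are hypothetical.

```python
import math


def average_fusion(jfm_outputs, backbone_output):
    """Strategy 1: average the n JFM predictions and the backbone prediction."""
    n = len(jfm_outputs)
    branches = jfm_outputs + [backbone_output]
    return [sum(column) / (n + 1) for column in zip(*branches)]


def softmax_fusion(jfm_outputs, backbone_output):
    """Strategy 2: softmax over Z.

    Assumption: Z is the element-wise sum of all branch outputs.
    """
    branches = jfm_outputs + [backbone_output]
    z = [sum(column) for column in zip(*branches)]
    z_max = max(z)                               # subtract the maximum for stability
    exps = [math.exp(value - z_max) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Toy per-class predictions (positive, negative, surprise) from three JFMs and the backbone.
jfm_outputs = [[0.2, 0.5, 0.3], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]
backbone_output = [0.2, 0.6, 0.2]

y_avg = average_fusion(jfm_outputs, backbone_output)
y_soft = softmax_fusion(jfm_outputs, backbone_output)
```

Under this assumption both strategies select the same class for the toy inputs; they differ in how sharply the fused distribution is peaked, which is one possible reason for the performance gap reported in the ablation study.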
During the training phase, the cross-entropy loss function is adopted:

$$L = -\sum_{k} \hat{y}_k \log y_k$$

where $\hat{y}_k$ and $y_k$ represent the ground-truth label and prediction result for the $k$th sample, respectively. Through backpropagation, the backbone network and the joint feature modules jointly learn to improve network performance.
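As a small worked example of the cross-entropy loss, the sketch below evaluates it for one-hot labels and predicted class probabilities. The clipping constant `eps` and the averaging over a batch are assumptions added for numerical safety and convenience; they are not stated in the text.

```python
import math


def cross_entropy(one_hot_label, prediction, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted probability vector."""
    # eps guards against log(0); it is an implementation detail assumed here.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_label, prediction))


def mean_cross_entropy(labels, predictions):
    """Average the per-sample cross-entropy over a batch (assumed reduction)."""
    losses = [cross_entropy(t, p) for t, p in zip(labels, predictions)]
    return sum(losses) / len(losses)


# Two toy samples over three classes (positive, negative, surprise).
labels = [[0, 1, 0], [1, 0, 0]]
predictions = [[0.2, 0.5, 0.3], [0.25, 0.5, 0.25]]
batch_loss = mean_cross_entropy(labels, predictions)
```

For the first sample the loss is $-\log 0.5$, and for the second it is $-\log 0.25$, so a confident wrong prediction is penalized more heavily than an uncertain correct one.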

Experiments
In this section, we first introduce the datasets adopted to evaluate the performance of the proposed approach. Secondly, we introduce the metrics and implementation details used in our experiments. Finally, we show the results of the experiments, and compare and discuss them.

Datasets
We utilized four representative datasets: SMIC, CASME II, SAMM, and 3DB, following MEGC [40]. Table 2 shows the number of samples used in each dataset in our experiments. Each dataset is briefly described below.
SMIC [15] collects 164 micro-expressions from 16 participants recorded by 100 fps cameras. All types of micro-expressions are divided into three classes: positive, negative, and surprise. These micro-expression samples are only encoded using the onset frame and offset frame, so no apex frame is annotated. Therefore, in the experiment, we calculate the average value of the optical flow between each frame and the onset frame, and the frame with the largest average value is regarded as the apex frame.
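The apex-frame selection rule for SMIC can be sketched as follows. The sketch assumes that a flow field between the onset frame and each candidate frame has already been computed (for example with TV-L1) and is given as a 2D list of (u, v) tuples, and that "average value of the optical flow" means the mean flow magnitude; both points are assumptions. The function names are hypothetical.

```python
import math


def mean_flow_magnitude(flow):
    """Mean of sqrt(u^2 + v^2) over all pixels of one flow field."""
    magnitudes = [math.hypot(u, v) for row in flow for (u, v) in row]
    return sum(magnitudes) / len(magnitudes)


def select_apex_index(flows_to_onset):
    """Return the index of the frame whose flow relative to the onset frame is largest.

    flows_to_onset[i] is the flow field between the onset frame and frame i.
    """
    scores = [mean_flow_magnitude(flow) for flow in flows_to_onset]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy sequence of three 1 x 2 flow fields with increasing, then decreasing, motion.
toy_flows = [
    [[(0.1, 0.0), (0.0, 0.1)]],
    [[(0.3, 0.4), (0.5, 0.0)]],
    [[(0.3, 0.0), (0.0, 0.3)]],
]
apex_index = select_apex_index(toy_flows)
```

In the toy sequence the second frame has the largest mean magnitude (0.5), so it is selected as the apex frame.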
CASME II [17] contains 255 spontaneous micro-expression samples recorded by 200 fps cameras from 26 subjects, and selected from nearly 2500 induced facial movements. These samples are encoded with onset, apex, and offset frames and labeled with AUs and emotion classes. Taking MEGC as a reference, we use 145 micro-expression samples in our work. The micro-expressions are divided into three classes: positive (happiness), negative (disgust, repression), and surprise.
SAMM [18] contains 159 micro-expression samples recorded by 200 fps cameras, from 32 participants of 13 different ethnicities. The average age of participants is 33.24 years, with an equal gender ratio. These samples are encoded with onset, apex, and offset frames and labeled with AUs and emotion classes. Taking MEGC as a reference, we use 133 micro-expression samples in our experiment, and divide them into three classes: positive (happiness), negative (anger, disgust, contempt, sadness, and fear), and surprise.
3DB [40] is a combined dataset, whose samples come from SMIC, CASME II, and SAMM. The combined dataset has 68 subjects (16 from SMIC, 24 from CASME II, 28 from SAMM), so the dataset contains subjects from different backgrounds (ethnicity, environment, and gender). The types of micro-expression are divided into three classes: positive, negative, and surprise.

Metrics
It can be seen from Table 2 that the number of samples in different classes in each dataset is imbalanced. For example, the ratio of the three classes in the 3DB dataset is 1.3 (positive) : 3 (negative) : 1 (surprise), so accuracy cannot fully measure network performance. Therefore, we use two balanced metrics (unweighted F1-score and unweighted average recall) [42] to evaluate the performance of our method.
The unweighted F1 score (UF1) first calculates the F1 score of each class, then sums the per-class F1 scores, and finally averages the sum over the number of classes, which can be formulated as follows:

$$\mathrm{UF1} = \frac{1}{N}\sum_{c=1}^{N} \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$$

where $TP_c$, $FP_c$, and $FN_c$ represent the number of true positives, false positives, and false negatives for the $c$th class, respectively, and $N$ is the number of classes.
The unweighted average recall (UAR) is obtained by summing the per-class accuracy (recall) of each class, and then averaging over the number of classes:

$$\mathrm{UAR} = \frac{1}{N}\sum_{c=1}^{N} \frac{TP_c}{M_c}$$

where $M_c$ is the number of samples in the $c$th class.
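To make the two metrics concrete, the following sketch computes UF1 and UAR from lists of integer class labels. The toy labels are invented for illustration and are not drawn from any of the datasets; the function names are hypothetical.

```python
def per_class_counts(y_true, y_pred, num_classes):
    """Count true positives, false positives, and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def uf1(y_true, y_pred, num_classes):
    """Unweighted F1: mean of the per-class F1 scores."""
    tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)
    scores = []
    for c in range(num_classes):
        denominator = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denominator if denominator else 0.0)
    return sum(scores) / num_classes


def uar(y_true, y_pred, num_classes):
    """Unweighted average recall: mean of TP_c / M_c over classes."""
    tp, _, fn = per_class_counts(y_true, y_pred, num_classes)
    recalls = []
    for c in range(num_classes):
        m_c = tp[c] + fn[c]  # number of samples whose true class is c
        recalls.append(tp[c] / m_c if m_c else 0.0)
    return sum(recalls) / num_classes


# Toy imbalanced example: class 1 has twice as many samples as classes 0 and 2.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]
toy_uf1 = uf1(y_true, y_pred, 3)
toy_uar = uar(y_true, y_pred, 3)
toy_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here plain accuracy is 0.75, while UAR is about 0.667 and UF1 about 0.711, because the errors fall on the two minority classes; this is the effect that motivates reporting balanced metrics.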
To avoid the problem of person dependence in the classification process, we adopt leave-one-subject-out (LOSO) cross-validation. All facial micro-expression data of one subject are set aside for testing, and the rest are used for training, which ensures that the training stage excludes any information about the test subject. Each subject in the dataset is held out for testing exactly once.
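The LOSO protocol can be sketched as a generator over subject identifiers. The subject labels below are hypothetical placeholders; in practice each sample would carry the subject ID provided with its dataset.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: six samples recorded from three subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(loso_splits(subject_ids))
```

Each fold trains on every sample except those of the held-out subject, so the number of folds equals the number of subjects (68 for 3DB, according to the dataset description above).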

Implementation details
To reduce the occurrence of overfitting, we applied dropout to the fully connected layers with a ratio of 0.5. In the training stage, we set the batch size to 12 and trained the network for 300 epochs. We initialized the learning rate to $10^{-4}$ and used the Adam optimizer to update the weights of the network. Adam is a gradient-based optimization algorithm with adaptive learning rates, which helps the network converge faster. All of our experiments were run on an Ubuntu 16.04 system with an Nvidia GeForce GTX 1060 GPU.

Ablation study
To verify the effectiveness of the JFM, we removed the JFMs while keeping other components in the network architecture unchanged, i.e., we only used the backbone network for feature extraction and classification. The performance of the backbone network on each dataset is shown as MJFN-BbN in Table 3. We can see that when all JFMs are removed, the performance of the network declines to different degrees on each dataset, demonstrating the contribution of the JFM to classification performance. A comparison with AlexNet with the same number of layers is also shown in Table 3; for fairness, the optical flow image is used as the network input in each case. The backbone network (MJFN-BbN) has superior performance on the 3DB, SMIC, and CASME II datasets. Further, we assessed the influence of the number of JFMs on network performance through experiments. As can be seen from Fig. 5, as the number of JFMs increases, the performance of the model improves to varying degrees on different datasets. When the number of JFMs reaches 3, the overall model performance is optimal. Therefore, this work uses three JFMs. In these experiments, JFMs are added starting from the fourth convolutional layer and moving toward the first.
We also assessed the impact of the two fusion strategies on network performance. MJFN-Avg uses the first fusion strategy, and MJFN uses the second. It can be seen from Table 3 that the two strategies have the same performance on CASME II, but on the remaining datasets the second strategy is better. In particular, on the SMIC dataset, the UF1 and UAR of the second strategy are better by 6.67% and 5.82%, respectively.
Fig. 5 Influence of the number of JFM on UF1 and UAR for different datasets.

Comparisons
We compared the proposed method with state-of-the-art methods; results are shown in Table 3. It can be seen that the proposed method (MJFN) achieves higher micro-expression recognition performance. In particular, compared with the classical method LBP-TOP, the UF1 and UAR of our method are better by 24.56% and 22.71%, respectively, on the 3DB dataset with extremely imbalanced samples. Compared with Bi-WOOF, which also uses apex frames, our method achieves an 18.29% improvement in UAR on 3DB, while UF1 improves by 20.42%. On the SMIC and CASME II datasets, the performance of our method is improved by 12% to 19% compared with the traditional optical-flow-based methods MDMO and Sparse MDMO. Furthermore, the proposed method also achieves 9.85% and 4.51% improvements in UF1 and UAR, respectively, compared with the better results among multi-stream networks (Off-Apex, STSTNet, ATNet, Dual-Inception).
The confusion matrices in Fig. 4 show the recognition accuracy of our proposed method on each class in each dataset. Our method has the best recognition accuracy for the negative class, which is the most heavily represented class in 3DB; this suggests that the number of samples in each class has a certain impact on the performance of the network.

Conclusions
This paper proposes a novel framework for micro-expression recognition. First, we use the onset frame and the apex frame of a micro-expression sequence to generate an RGB optical flow image describing the facial changes of the micro-expression. Then, the optical flow image is fed to a multi-scale joint feature network for feature extraction and classification. In addition, the proposed joint feature module integrates features from different layers, which helps the model to capture micro-expression features of different amplitudes. In our work, we use three JFMs to optimize the performance of the model. Furthermore, we adopt two different fusion strategies to fuse the prediction results of the three JFMs with that of the backbone network to improve the recognition ability of the model. Finally, we compare the proposed method with state-of-the-art micro-expression recognition methods; the results show that our method achieves superior results.
Although our current work obtains superior results in micro-expression recognition, the calculation of optical flow takes a long time, which makes it difficult to guarantee real-time performance. Real-time micro-expression recognition is still a major difficulty in this field. In future work, we will focus on micro-expression feature extraction methods that enable real-time micro-expression recognition.

Fig. 1 Framework of our multi-scale joint feature network. Given a micro-expression video sequence, we employ the onset frame and the apex frame to obtain an optical flow image and feed it into the multi-scale joint feature network for feature extraction and classification.

Fig. 2 Examples of the onset frame (above) and the apex frame (middle) in CASME II, and the corresponding optical flow image (below), for (a) happiness, (b) disgust, (c) surprise, (d) repression.

Fig. 4 Confusion matrices for the proposed method on different datasets.

Table 1 Parameters of the backbone network

Table 2 Samples in each dataset

Table 3 Comparative results for different methods and different datasets. "-" indicates that no results were reported