1 Introduction

Facial expressions convey much information regarding individuals’ affective states, statuses, attitudes, and their cooperative and competitive natures in social interactions [15]. Human facial expression recognition has been a widely studied topic in computer vision since the concept of affective computing was first proposed by Picard [6]. Numerous methods and algorithms for automatically recognizing emotions from human faces have been developed [7]. However, previous research has focused primarily on general facial expressions, usually called macro-expressions, which typically last for more than 1/2 s and up to 4 s [8] (although some researchers treat the duration of macro-expressions as between 1/2 s and 2 s [9]). Recently, another type of facial expression, namely, micro-expressions, which are characterized by their involuntary occurrence, short duration and typically low intensity, has drawn the attention of affective computing researchers and psychologists. Micro-expressions are rapid and brief expressions that appear when individuals attempt to conceal their genuine emotions, especially in high-stakes situations [10, 11]. A micro-expression is characterized by its duration and spatial locality [12, 13]. Ekman has even claimed that micro-expressions may be the most promising cues for lie detection [11]. The potential applications in clinical diagnosis, national security and interviewing that derive from the possibility that micro-expressions may reveal genuine feelings and aid in the detection of lies have encouraged researchers from various fields to enter this area. However, little work on micro-expressions has yet been performed in the field of computer vision.

As mentioned above, much work has been conducted regarding the automatic recognition of macro-expressions. This progress would not have been possible without the construction of well-established facial expression databases, which have greatly facilitated the development of facial expression recognition (FER) systems. To date, numerous facial expression databases have been developed, such as the Japanese Female Facial Expression Database (JAFFE) [14]; the CMU Pose, Illumination and Expression (PIE) database [15] and its successor, the Multi-PIE database [16]; and the Genki-4K database [17]. However, all of the databases mentioned above contain only still facial expression images that represent different emotional states.

It is clear that still facial expressions contain less information than do dynamic facial expression sequences. Therefore, researchers have shifted their attention to dynamic factors and have developed several databases that contain dynamic facial expressions, such as the RU-FACS database [18], the MMI facial expression database [19], and the Cohn-Kanade database (CK) [20]. These databases all contain facial expression image sequences, which are more efficient for training and testing than still images. However, all of the facial expression databases mentioned above collect only posed expressions (i.e., the participants were asked to present certain facial expressions, such as happy expressions, sad expressions and so on) rather than naturally expressed or spontaneous facial expressions. Previous literature reports have stated that posed expressions may differ in appearance and timing from spontaneously occurring expressions [21]. Therefore, a database containing spontaneous facial expressions would have more ecological validity.

To address the issue of facial expression spontaneity, Lucey et al. [22] developed the Extended Cohn-Kanade Dataset (CK+) by collecting a number of spontaneous facial expressions; however, this dataset included only happy expressions that spontaneously occurred between the participants’ facial expression posing tasks (84 participants smiled at the experimenter one or more times between tasks, not in response to a task request). Recently, McDuff et al. [23] presented the Affectiva-MIT Facial Expression Dataset (AM-FED), which contains naturalistic and spontaneous facial expressions collected online from volunteers recruited on the internet who agreed to be videotaped while watching amusing Super Bowl commercials. This database achieved further improved validity through the collection of spontaneous expression samples recorded in natural settings. However, only facial expressions understood to be related to a single emotional state (amusement) were recorded, and no self-reports on the facial movements were collected to exclude unemotional movements that might contaminate the database, such as blowing of the nose, swallowing of saliva or rolling of the eyes. Zhang et al. [24] recently published a new facial expression database of spontaneously elicited 3D facial expressions, coded according to the Facial Action Coding System (FACS).

Compared with research on macro-expression recognition, however, studies related to micro-expression recognition are rare. Polikovsky et al. [25] used a 3D-gradient descriptor for micro-expression recognition. Wang et al. [26] treated a gray-scale micro-expression video clip as a 3rd-order tensor and applied Discriminant Tensor Subspace Analysis (DTSA) and Extreme Learning Machine (ELM) approaches to recognize micro-expressions. However, the subtle movements involved in micro-expressions may be lost in the process of DTSA. Pfister et al. [27] used a temporal interpolation model (TIM) and LBP-TOP [28] to extract the dynamic textures of micro-expressions. Wang et al. [29] used an independent color space to improve upon this work, and they also used Robust PCA [30] to extract subtle motion information regarding micro-expressions. These algorithms require micro-expression data for model training. To our knowledge, only six micro-expression datasets exist, each with different advantages and disadvantages (see Table 1): USF-HD [9]; Polikovsky’s database [25]; SMIC [27] and its successor, an extended version of SMIC [31]; and CASME [32] and its successor, CASME II [33]. The quality of these databases can be assessed based on two major factors.

Table 1. Previous micro-expression databases

These micro-expression databases have greatly facilitated the development of automatic micro-expression recognition. However, all of them include only cropped micro-expression samples, which are not well suited for automatic micro-expression spotting. In addition, the emotion labeling methods are not consistent: emotions are usually labeled according to the FACS, the emotion type of the elicitation material, or both. These methods leave open the possibility that meaningless facial movements (such as blowing of the nose, swallowing of saliva or rolling of the eyes) are included in the database. In our database, in addition to the FACS coding and the emotion type of the elicitation material, we collected the participants’ self-reports on each of their facial movements, which helps ensure the purity of the database to the greatest extent possible.

Considering the issues mentioned above, we developed a new database, CAS(ME)2, for automatic micro-expression recognition training and evaluation. The main contributions of this database can be summarized as follows:

  • This database is the first to contain both macro-expressions and micro-expressions collected from the same participants and under the same experimental conditions. This will allow researchers to develop more efficient algorithms to extract features that are better able to discriminate between macro-expressions and micro-expressions and to compare the differences in feature vectors between them.

  • This database will allow the development of algorithms for detecting micro-expressions from video streams.

  • The differences in AUs between macro-expressions and micro-expressions can also be examined using algorithms tested on this database.

  • The method used to elicit both macro-expressions and micro-expressions for inclusion in this database has proven to be valid in previous work [33]. The participants were asked to neutralize their facial expressions while watching emotion-evoking videos. All expression samples are dynamic and ecologically valid.

  • The database provides the AUs for each sample. In addition, after the expression-inducing phase, the participants were asked to watch the videos of their recorded facial expressions and provide a self-report on each expression. This procedure allowed us to exclude almost all emotion-irrelevant facial movements, resulting in relatively pure expression samples. The main emotions associated with the emotion-evoking videos were also considered in the emotion labeling process.

2 The CAS(ME)2 Database Profile

The CAS(ME)2 database contains 303 expressions — 250 macro-expressions and 53 micro-expressions — filmed at a camera frame rate of 30 fps. The expression samples were selected from more than 600 elicited facial movements and were coded with the onset, apex, and offset frames, with AUs marked and emotions labeled [35]. Moreover, to enhance the reliability of the emotion labeling, we obtained an additional emotion label by asking the participants to review their facial movements and report the emotion associated with each.

Macro-expressions with durations of more than 500 ms and less than 4 s were selected for inclusion in this database [8]. Micro-expressions with durations of no more than 500 ms were also selected.
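For illustration, this selection rule can be written as a small Python sketch. The helper name `classify_by_duration` and the assumption of inclusive onset/offset frame indices at the 30 fps recording rate are ours; this is not part of the released database tools.

```python
def classify_by_duration(onset_frame, offset_frame, fps=30):
    """Label an expression sample by total duration (illustrative helper only).

    Assumes inclusive frame indices at the 30 fps recording rate; this is
    not code distributed with CAS(ME)2.
    """
    duration_ms = (offset_frame - onset_frame + 1) / fps * 1000.0
    if duration_ms <= 500:
        return "micro-expression"   # duration of no more than 500 ms
    if duration_ms < 4000:
        return "macro-expression"   # more than 500 ms and less than 4 s
    return "excluded"               # longer movements were not selected


print(classify_by_duration(onset_frame=1, offset_frame=10))   # ~333 ms -> micro-expression
print(classify_by_duration(onset_frame=1, offset_frame=40))   # ~1333 ms -> macro-expression
```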

The expression samples were recorded using a Logitech Pro C920 camera at 30 fps, and the resolution was set to 640 \( \times \) 480 pixels. The participants were recorded in a room with two LED lights to maintain a fixed level of illumination. The steps of the data acquisition and coding processes are presented in the following sections. Table 2 presents the descriptive statistics for the expression samples of different durations, which include 250 macro-expressions and 53 micro-expressions, defined in terms of total duration. Figure 1 shows examples of a micro-expression (a) and a macro-expression (b).

Table 2. Descriptive statistics for macro-expressions and micro-expressions
Fig. 1. Examples of a micro-expression (a) and a macro-expression (b). The apex frame occurs at approximately frame 5 for the micro-expression and frame 11 for the macro-expression, both of which represent the negative emotion anger. The AU involved in both expressions is AU4 (inner brow).

2.1 Participants and Elicitation Materials

Twenty-two participants (16 females), with a mean age of 22.59 years (standard deviation = 2.2), were recruited. All provided informed consent to the use of their video images for scientific research.

Seventeen video episodes used in previous studies [36] and three new video episodes downloaded from the internet were rated with regard to their ability to elicit facial expressions. Twenty participants rated the main emotions associated with these video clips and assigned them arousal intensity scores from 1 to 9 (1 represents the weakest intensity, and 9 represents the strongest intensity). Based on the ability to elicit micro-expressions observed in previous studies, only three types of emotion-evoking videos (those evoking disgust, anger, and happiness) were used in this study, and they ranged in length from 1 min to approximately two and a half minutes. Five happiness-evoking video clips were chosen for their relatively low ability to elicit micro-expressions (see Table 3). Each episode predominantly elicited one type of emotion.

Table 3. Participants’ ratings of the 9 video clips

2.2 Elicitation Procedure

Because the elicitation of micro-expressions requires rather strong motivation to conceal truly experienced emotions, motivation manipulation protocols were needed. Following a procedure used in previous studies [32], the participants were first instructed that the purpose of the experiment was to test their ability to control their emotions, which was stated to be strongly related to their social success. The participants were also told that their payment would be directly related to their performance. To better exclude noise arising from head movements, the participants were asked not to turn their eyes or heads away from the screen.

While watching the videos, the participants were asked to suppress their expressions to the best of their ability. They were informed that their monetary rewards would be reduced if they produced any noticeable expression. After watching all nine emotion-eliciting videos, the participants were asked to review the recorded videos of their faces and the originally presented videos to identify any facial movements and report the inner feelings that they had been experiencing when the expressions occurred. These self-reports of the inner feelings associated with each expression were collected and used as a separate tagging system, as described in the following section.

2.3 Coding Process

Two well-trained FACS coders coded each frame (duration = 1/30th of a second) of the videotaped clips for the presence and duration of emotional expressions in the upper and lower facial regions. This coding required classifying the emotion exhibited in each facial region in each frame; recording the onset time, apex time and offset time of each expression; and arbitrating any disagreement that occurred between the coders. When the coders could not agree on the exact frame of the onset, apex or offset of an expression, the average of the values specified by both coders was used. The two coders achieved a coding reliability (frame agreement) of 0.82 (from the initial frame to the end). The coders also coded the Action Units (AUs) of each expression sample. The reliability between the two coders was 0.8, which was calculated as follows:

$$ R = \frac{2 \times AU(C_{1} C_{2})}{AA} $$

where AU(C1C2) is the number of AUs on which Coder 1 and Coder 2 agreed and AA is the total number of AUs in the facial expressions scored by the two coders. The coders discussed and arbitrated any disagreements [33].
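As a concrete illustration of this reliability measure, the following Python sketch computes R from two coders' per-sample AU sets. The data and the function name are hypothetical; the actual coding records are not reproduced here.

```python
def coder_reliability(coder1_aus, coder2_aus):
    """R = 2 * (number of AUs both coders agree on) / (total AUs scored by both).

    Each argument is a list containing one set of AU labels per expression
    sample. Illustrative only.
    """
    agreed = sum(len(a & b) for a, b in zip(coder1_aus, coder2_aus))
    total = sum(len(a) + len(b) for a, b in zip(coder1_aus, coder2_aus))
    return 2 * agreed / total


# Hypothetical example: two samples coded by each coder.
c1 = [{"AU4"}, {"AU6", "AU12"}]
c2 = [{"AU4"}, {"AU12"}]
print(coder_reliability(c1, c2))  # 2 * 2 / (3 + 2) = 0.8
```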

2.4 Emotion Labeling

Previous studies of micro-expressions have typically employed two types of expression tagging criteria, i.e., the emotion types associated with the emotion-evoking videos and the FACS. In databases with tagging based on the emotion types associated with the emotion-evoking videos, facial expression samples are typically tagged with the emotions happiness, sadness, surprise, disgust and anger [32] or with more general terms such as positive, negative and surprise [31]. The FACS is also used to tag expression databases. The FACS encodes facial expressions based on a combination of 38 elementary components, known as 32 AUs and 6 action descriptors (ADs) [34]. Different combinations of AUs describe different facial expressions: for example, AU6 + AU12 describes an expression of happiness. Micro-expressions are special combinations of AUs with specific local facial movements indicated by each AU. In the CAS(ME)2 database, in addition to using the FACS and the emotion types associated with the emotion-evoking videos as tagging criteria, we also employed the self-reported emotions collected from the participants.
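To make the AU-combination idea concrete, a minimal lookup sketch is shown below. The single rule it contains is the AU6 + AU12 example mentioned above; the database's actual labeling criteria are those summarized in Table 4, and the function name is hypothetical.

```python
# Minimal sketch of labeling by AU combination (illustrative rule only).
AU_RULES = {
    frozenset({"AU6", "AU12"}): "happiness",  # AU6 + AU12 describes happiness
}


def label_from_aus(aus):
    """Return an emotion label for a set of AU codes, or 'other' if no rule matches."""
    return AU_RULES.get(frozenset(aus), "other")


print(label_from_aus({"AU6", "AU12"}))  # happiness
print(label_from_aus({"AU4"}))          # other
```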

When labeling the emotions associated with facial expressions, previous researchers have usually used the emotion types associated with the corresponding emotion-evoking videos as the ground truth [31]. However, an emotion-evoking video may consist of multiple emotion-evoking events. Therefore, the emotion types estimated according to the FACS and the emotion types of the evoking videos are not fully representative, and many facial movements such as blowing of the nose, eye blinks, and swallowing of saliva may also be included among the expression samples. In addition, micro-expressions differ from macro-expressions in that they may occur involuntarily, partially and in short durations; thus, the emotional labeling of micro-expressions based only on the FACS AUs and the emotion types of the evoking videos is incomplete. We must also consider the inner feelings that are reflected in the self-reported emotions of the participants when labeling micro-expressions. In this database, a combination of AUs, video-associated emotion types and self-reports was considered to enhance the validity of the emotion labels. Four emotion categories are used in this database: positive, negative, surprise and other (see Table 4). Table 4 predominantly presents the emotion labeling criteria based on the FACS coding results; however, the self-reported emotions and the emotion types associated with the emotion-evoking videos were also considered during the emotion labeling process and are included in the CAS(ME)2 database.

Table 4. Criteria for labeling each type of emotion and their frequencies in the database

3 Dataset Evaluation

To evaluate the database, we used Local Binary Pattern histograms from Three Orthogonal Planes (LBP-TOP) [28] to extract dynamic textures and used a Support Vector Machine (SVM) approach to classify these dynamic textures.

Among the 303 samples, the number of frames included in the shortest sample is 4, and the longest sample contains 118 frames. The frame numbers of all samples were normalized to 120 via linear interpolation. For the first frame of each clip, 68 feature landmarks were marked using the Discriminative Response Map Fitting (DRMF) method [37]. Based on these 68 feature landmarks, 36 Regions of Interest (ROIs) were drawn, as shown in Fig. 2. Here, we used Leave-One-Video-Out (LOVO) cross-validation, i.e., in each fold, one video clip was used as the test set and the others were used as the training set. After the analysis of 302 folds, each sample had been used as the test set once, and the final recognition accuracy was calculated based on all results.
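As an illustration of the length normalization step, the sketch below linearly interpolates a clip to 120 frames along the temporal axis. It is a generic implementation under our reading of the text, not the authors' code; landmark detection (DRMF) and ROI construction are omitted.

```python
import numpy as np


def normalize_clip_length(frames, target_len=120):
    """Linearly interpolate a grayscale clip of shape (T, H, W) to target_len frames.

    Generic sketch of the temporal length normalization described above;
    not the authors' implementation.
    """
    frames = np.asarray(frames, dtype=np.float32)
    t_old = np.linspace(0.0, 1.0, num=frames.shape[0])
    t_new = np.linspace(0.0, 1.0, num=target_len)
    flat = frames.reshape(frames.shape[0], -1)                 # (T, H*W)
    out = np.empty((target_len, flat.shape[1]), dtype=np.float32)
    for j in range(flat.shape[1]):                             # interpolate each pixel over time
        out[:, j] = np.interp(t_new, t_old, flat[:, j])
    return out.reshape(target_len, *frames.shape[1:])
```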

Fig. 2. Thirty-six Regions of Interest (ROIs).

We extracted LBP-TOP histograms to represent the dynamic texture features for each ROI. These histograms were then concatenated into a vector to serve as the input to the classifier. An SVM classifier was selected. For the SVM algorithm, we used LIBSVM with a polynomial kernel, \( {\mathcal{K}}(x_{i}, x_{j}) = (\gamma x_{i}^{T} x_{j} + coef)^{degree} \), with \( \gamma = 0.1 \), \( degree = 4 \) and \( coef = 1 \). For the LBP-TOP analysis, the radii along the X and Y axes (denoted by \( R_{x} \) and \( R_{y} \)) were set to 1, and the radius along the T axis (denoted by \( R_{t} \)) was assigned values from 2 to 4. The numbers of neighboring points (denoted by \( P \)) in the XY, XT and YT planes were all set to 4 or 8. Both the uniform pattern and the basic LBP were used in LBP coding. The results are listed in Table 5. As shown in the table, the best performance of 75.66 % was achieved using \( R_{x} = 1 \), \( R_{y} = 1 \), \( R_{t} = 3 \), \( P = 4 \), and the uniform pattern.
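For reference, a minimal evaluation sketch is given below. It assumes the 36 per-ROI LBP-TOP histograms have already been extracted with an existing LBP-TOP implementation, and it uses scikit-learn's SVC (which wraps LIBSVM) with the polynomial-kernel parameters stated above; the leave-one-out split over samples mirrors the leave-one-video-out protocol when each sample is one clip. The function names are hypothetical, and this is an approximation of the protocol rather than the authors' code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC


def build_feature_matrix(per_sample_roi_histograms):
    """Concatenate the 36 per-ROI LBP-TOP histograms of each sample into one vector."""
    return np.vstack([np.concatenate(rois) for rois in per_sample_roi_histograms])


def evaluate(features, labels):
    """Leave-one-sample-out accuracy with K(x_i, x_j) = (0.1 * x_i^T x_j + 1)^4."""
    clf = SVC(kernel="poly", gamma=0.1, degree=4, coef0=1)
    scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
    return scores.mean()   # fraction of samples classified correctly
```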

Table 5. Performance in the recognition of micro-expressions with LBP-TOP feature extraction

4 Discussion and Conclusion

In this paper, we describe a new facial expression database that includes macro-expression and micro-expression samples collected from the same individuals under the same experimental conditions. This database contains 303 expression samples, comprising 250 macro-expression samples and 53 micro-expression samples. This database may allow researchers to develop more efficient algorithms to extract features that are better able to discriminate between macro-expressions and micro-expressions.

Considering the unique features of micro-expressions, which typically occur involuntarily, rapidly, partially (on either the upper face or the lower face) and with low intensity, the emotional labeling of such facial expressions based only on the corresponding AUs and the emotion types associated with the videos that evoked them may not be sufficiently precise. To construct the presented database, we also collected self-reports of the subjects’ emotions for each expression sample and performed the emotion labeling of each sample based on a combination of the AUs, the emotion type associated with the emotion-evoking video and the self-reported emotion. This labeling method should considerably enhance the precision of the emotion category assignment. In addition, all three labels are independent of one another and will be provided when the database is published, allowing researchers to retrieve specific expressions from the database.

In the current version of the database, because of the difficulties encountered in micro-expression elicitation and the extremely time-consuming nature of manual coding, the size of the micro-expression sample pool may not be fully sufficient. We plan to enrich the sample pool by eliciting more micro-expression samples to provide researchers with sufficient testing and training data.