1 Introduction

Facial expression analysis is a substantial research area, as it plays an essential role in psychology and human interaction, including modern-day affective computing systems and processes designed to be more emotionally aware [18, 23]. Facial expressions are the facial changes driven by a person’s internal emotions and intentions [49]. The most commonly used descriptor of human facial expressions is the Facial Action Coding System (FACS), which has been used intensively in various aspects of facial expression analysis over the past 30 years [40].

Essentially, facial expressions comprise macro- and micro-expressions. Macro-expressions, commonly known as normal expressions, are voluntary facial expressions that last between 0.5s and 4s [17]. In contrast, micro-expressions are spontaneous, involuntary expressions that last less than 0.5s [56]. Facial expressions are not only the basis of facial expression recognition but can also be exploited in fields such as smart cities, human-computer and human-robot interaction, and medicine [6, 7, 55]. Studying facial expressions is therefore important for realizing the applications in these domains.

Over the years, micro-expression recognition has received increasing attention because micro-expressions carry vital cues for applications such as lie detection, clinical diagnosis and social interaction. This is primarily because micro-expressions are part of human reflexive behaviour [23] and therefore inadvertently reveal one’s true emotions [58]. Several works, such as [21, 43, 53], focus on using the full face for micro-expression recognition, although [13, 22] advocate looking at the regions most characteristic of each emotion to achieve higher recognition precision. In spite of this, region-based micro-expression recognition remains less common than techniques that use the entire face.

The action units (AUs) defined in the FACS describe the muscle-based facial actions triggered by facial expressions. These AUs are often provided in emotion datasets to support the detection and understanding of emotions. Therefore, the reliability of the AUs in FACS-coded datasets can significantly impact emotion analysis. Although AUs are commonly used for recognizing normal macro-expressions, they have not been fully analyzed for micro-expressions [23]. The main factor limiting the applicability of AUs to micro-expressions is the small number of video frames in micro-expression clips [23]. To label the AUs, the face video should be divided into onset (increasing AU intensities), apex (maximum intensities), offset (decreasing intensities) and neutral (minimum intensities) phases. Moreover, the AUs exhibited in micro-expressions have much lower intensity. Given the brevity and low intensity of micro-expressions, it is challenging to identify their AUs precisely. In spite of these challenges, the AUs provided by the FACS-coded datasets are used directly in micro-expression-related problems, such as micro-expression recognition and spotting. Consequently, recognition and spotting performance are affected by inaccurate AUs encoded in the micro-expression datasets.

Furthermore, all of the FACS-coded micro-expression datasets only report the average reliability across all AUs [47]. Without considering the reliability of individual AUs, this practice may mask the low reliability of certain AUs in these datasets [17] and correspondingly distort the actual recognition outcome.

Moreover, studies have shown inconsistencies in how humans are trained and validated to become certified FACS coders [8]. The reliability of certified FACS coders is also arguable, because even certified coders require additional training to code micro-expressions reliably. Research teams may also be regarded as having passed the FACS Final Test when only one or more of their coders has demonstrated validated reliability [8]. Besides, it is commonly accepted that certain emotions are reliably revealed when specific combinations of AUs suggested by the FACS occur [4]. However, this may not always hold, as the way people express themselves varies substantially across cultures and situations [4]. This is all the more reason to validate the effectiveness of the FACS-based AUs reported in each dataset’s ground truth before using them for micro-expression-related problems.

Therefore, in this work, we focus on analyzing the occurrence and impact of AUs in micro-expressions regardless of their intensities, as micro-expressions are known to have low AU intensities [1]. In more detail, we focus our discussion on the 24 main AUs and a few relevant gross behaviours that occur in the recent mainstream micro-expression datasets, namely the CASME II, SAMM and CAS(ME)\(^2\) datasets. These main AUs describe the muscle-based facial actions induced by facial expressions. This paper provides the following contributions in the realm of micro-expression recognition:

1. We first assign specific facial landmarks to each AU based on the FACS action descriptors. These facial landmarks, which represent the AUs, then serve as the central points of AU-based regions of interest (RoIs) used to perform the independent AU analysis.

2. The independent AU analysis then yields our proposed sets of AUs, which are shown to be more relevant to each dataset considered in this paper, i.e., CASME II, SAMM and CAS(ME)\(^2\), for micro-expression recognition.

3. We then revisit the existing AUs, including the ground truth AUs human-coded by the datasets’ designers. We assess the effectiveness of the existing AUs encoded in the widely used micro-expression datasets, i.e., CASME II, SAMM and CAS(ME)\(^2\), by comparing the micro-expression recognition performance in terms of accuracy and F1-score.

4. In addition, we suggest universal AUs applicable to specific emotions, based on our proposed emotion-specific AUs obtained from the evaluated datasets.

The findings from our analysis show that the proposed AUs better describe the micro-expressions in CASME II, SAMM and CAS(ME)\(^2\), resulting in higher recognition accuracy and F1-scores. To elaborate, the proposed RoIs achieve F1-scores of 0.6083, 0.4476 and 0.5037 for CASME II, SAMM and CAS(ME)\(^2\), respectively, when implemented with the state-of-the-art AU-based technique. Note that the analysis in this paper aims to recommend the most effective AU-based RoIs for each micro-expression dataset, which will be helpful in AU-based studies, particularly AU-based micro-expression recognition.

The rest of the paper is organized as follows. In Section 2, we study the importance of emotion recognition and the works related to FACS. The use of AUs in micro-expression-related problems is also reviewed. We also discuss the reliability of the AUs encoded by the dataset designers. In Section 3, we present our analysis approaches employed in this work. In Section 4, we present the experimental results obtained from independent AU analysis and then formulate the proposed emotion-specific AUs. The performance of the proposed AUs is evaluated in the standard multi-class micro-expression recognition. The robustness of the proposed AUs in the state-of-the-art methods is also discussed in this section. Concluding remarks are given in the last section.

2 Literature survey

2.1 Emotion recognition

Emotion recognition has been of broad interest due to its usefulness in the medical, security and automotive fields. Studies such as [3, 12, 29] have been conducted to analyze and better understand emotional states. Barra et al. [3] design an algorithm that recognizes emotions by analyzing facial landmark points through a virtual spider web on the face. To improve emotion recognition systems, Khattak et al. [29] propose an optimized convolutional neural network (CNN)-based model after experimenting with different machine learning and deep learning models.

Facial expressions can be used to express and detect emotions; therefore, facial expression recognition is often related to emotion recognition. For instance, the emotion recognition system suggested in [12] is evaluated on facial expression datasets. Yan et al. [55] demonstrate a hybrid neural-network-based facial expression recognition system for smart cities. Introducing facial expression recognition enables equipment in the smart city to capture a user’s instant facial changes, allowing the equipment to accommodate the user’s needs accordingly. Chen et al. [7] suggest a multi-modal emotion recognition algorithm involving human facial expressions and speech to aid human-robot interaction. Besides that, emotion recognition is also helpful in the healthcare industry: Bisogni et al. [6] propose a CNN-based facial expression recognition system to identify patients’ emotions in a real-time healthcare framework. Hence, the study of emotion and expression recognition systems is crucial to realizing the benefits in the fields mentioned above.

2.2 Facial action units

The FACS, a comprehensive and anatomically based measurement system, was introduced in 1978 [14, 15] and later updated in 2002 [16] by psychology researchers trained in understanding human emotions. Using the FACS, facial activity can be decomposed into facial action units (AUs), including 24 main AUs that must be considered when scoring. Besides the main AUs, the FACS also provides miscellaneous and optional AUs. The miscellaneous AUs describe movements of the lower face, but the FACS does not define specific behaviours for them. The optional AUs, which correspond to eye-blinking and winking movements, are usually excluded unless they reach a certain level of intensity. According to the FACS, each action unit has a numeric code, an action description, and the involved muscles.

The FACS could be used for emotion measurement, since a facial expression is caused by a single AU or a combination of multiple AUs [17]. A combination of multiple AUs may be additive or non-additive. Additive AU combinations maintain the movement of all involved AUs, whereas non-additive AU combinations modify each other’s appearance [9]. For instance, the combination of AU1 (i.e., inner brow raiser) and AU2 (i.e., outer brow raiser) is often shown in surprise. The inner and outer brow-raising movements of AU1 and AU2 remain the same regardless of whether the AUs appear separately or together; hence, the combination of AU1 and AU2 is additive. On the other hand, the combination of AU1 (i.e., inner brow raiser) and AU4 (i.e., brow lowerer) is non-additive, because when both AUs occur together, the upward movement of AU1 changes the downward action of AU4. Although the FACS has effectively described macro-expressions with AUs, the study of micro-expressions is not as developed, due to the diversity of micro-expression categories in different datasets [5].

Since manual FACS coding is performed by trained human raters, the ratings are subjective and prone to bias. Research has therefore been conducted to automate the FACS rating process [8, 24, 40]. However, this field is still underdeveloped, as many problems remain open [40]. One problem with automating FACS is the quality of the face video: since AUs only cause local appearance changes, slight occlusion of the face can lead to inaccurate results. It is also difficult to treat each AU combination as a single class, as there are more than 7000 possible combinations [40]. Because fully automatic, real-time FACS coding has yet to be adopted, the recent mainstream micro-expression datasets, notably CASME II, CAS(ME)\(^2\) and SAMM, base their ground truth AU information on manual FACS coding, which therefore differs across datasets.

2.3 AUs for micro-expressions

As AUs can be used for emotion measurement, several studies have incorporated them into micro-expression recognition systems [33, 42, 54, 60, 63]. The AUs serve as guidance on which areas to focus on, since micro-expressions consist of local facial movements [26]. The FACS proposes the AUs commonly triggered when certain expressions occur [40]. Meanwhile, Davison et al. [11] suggested a set of dataset-dependent AUs (the objective classes) that apply to the micro-expressions in both CASME II and SAMM, to eliminate the bias of human reporting in each micro-expression sample. Table 1 compares the AUs suggested by the FACS and by the objective classes for specific micro-expressions. However, these suggested AUs are generic, as the same AUs are applied across different datasets. Hence, more relevant AUs assigned specifically to each dataset are needed for better precision. Recently, the work in [62] extracted optical flow from specific AU regions for micro-expression recognition; the regions focus on the brows and mouth, as most micro-expressions occur there.

Table 1 Emotion-specific AUs in FACS [40] and Objective Classes [11]

Researchers have also designed different Regions of Interest (RoIs) to study the impact of specific regions on micro-expressions, and [37, 42, 50, 51, 52, 65] show that AU-based RoIs give more discriminative features than those from fixed facial blocks. Therefore, the RoIs need to be determined so that they correspond to the AUs involved. In more detail, Merghani and Yap [42] conduct region-based micro-expression recognition by extracting features from 14 AU-based regions. The associated AUs for these regions are selected by observing the most frequently occurring AUs in the two standard micro-expression datasets, CASME II and SAMM. Including the regions covering the most frequently occurring AUs ensures that only the features of the relevant movements are considered in the micro-expression classification. To the best of our knowledge, the work in [42] is the state of the art for region-based micro-expression recognition with hand-crafted techniques, and it forms one of our baselines in this paper.

Meanwhile, Zong et al. [65] designed a hierarchical division scheme that divides the face image into blocks covering all the critical AU regions associated with micro-expressions. The micro-expression video clips are first divided into non-overlapping, equal-sized blocks based on different grid sizes. The grid size then increases iteratively, allowing the number of blocks to grow and forming the AU-based blocks for micro-expression recognition.

Fig. 1 (a) The template face and the 16 RoIs defined in [50]. (b) The 36 RoIs defined using 66 facial landmark points [37]

Furthermore, Liu et al. [37] propose 36 AU-based RoIs for region-based micro-expression recognition. The 68 facial landmarks are detected using the DRMF method [2] on the first neutral frame of each micro-expression video clip, and only 66 of the 68 DRMF landmark points are used. The RoIs are determined according to two guidelines: (1) the RoI partitioning should be refined, avoiding many AUs located at the same or overlapping portion of the face; and (2) the partitioning should be sparse, with each RoI containing at least one AU. Figure 1(b) illustrates the 36 RoIs covering the facial AUs. These AU-based RoIs also serve as a baseline in this paper.

Wang et al. [50] performed region-based micro-expression recognition using textures extracted from 16 AU-based RoIs. A frontal neutral facial image is used when deciding the RoIs, which are designed with minimal overlap by grouping nearby AUs into the same RoI. The 16 independent RoIs are shown in Fig. 1(a). The RoIs of [50] are commonly used AU-based RoIs, adopted also in [51, 52]. Hence, these AU-based RoIs serve as one of the baselines in this paper.

Although the existing AU-based RoIs have achieved promising results in classifying micro-expressions, there is room for improvement. In particular, the impact of each single AU has not been explored in this context, even though different RoIs have been designed to facilitate AU-based micro-expression recognition. We address this gap by performing a single-AU analysis in our work. Fan et al. [20] associate facial landmarks with the AUs that occur in the BP4D [59] and DISFA [41] datasets to compute the correspondence between the AUs; this inspired us to assign specific facial landmarks to the AUs for our single-AU analysis.

AUs are commonly used as indicators of the facial regions to focus on for facial expression recognition. Recently, deep learning approaches such as graph convolutional networks (GCNs) have incorporated the AUs into graph structures for micro-expression recognition tasks [33, 34, 38, 54, 63]. Hence, it is crucial to inspect the effectiveness of the AUs assigned to the emotion classes, as incorrectly encoded AUs can significantly impact the recognition performance.

2.4 Reliability of AUs

As AU coding depends on the FACS manual, the reliability of the AUs in FACS-coded datasets determines their efficacy in AU-based micro-expression recognition. AU reliability is determined from inter-observer agreement: the reliability of AU occurrence and intensity is decided based on the mutual agreement between two or more human observers [9]. Thus, the reliability of the AUs may be subjective and differ across datasets. A review by Clark et al. [8] shows that FACS coding can be performed inconsistently across different studies. Since FACS ratings are prone to bias and the AUs depend heavily on the FACS, it is crucial to review the ground truth AUs labelled in each dataset before using them for micro-expression recognition.

In addition, reliable coding of spontaneous facial expressions is harder to achieve due to the low intensity of the AUs [9]. To the best of our knowledge, it cannot be explicitly identified which AUs have been accurately coded, as existing studies only report the average reliability over all AUs [9]. Therefore, it is of particular concern to analyze the effect of human-coded AUs in spontaneous micro-expression recognition research. In the following section, we present a comprehensive analysis of the effects of the human-coded AUs in each standard micro-expression dataset, namely CASME II, CAS(ME)\(^2\) and SAMM.

3 Proposed architecture

This section outlines the AU analysis conducted in this work. We first perform an independent AU analysis for each emotion in each dataset (i.e., CASME II, SAMM and CAS(ME)\(^2\)). The analysis yields an ordering of the AUs that give the best results for each emotion class in each dataset. We subsequently propose new sets of AUs that are more relevant to the emotion classes in each micro-expression dataset. The RoIs are formed by combining the k best-performing AUs of each emotion for each dataset; the selection of the AUs is further discussed in Section 4.2. The effectiveness of our proposed AU-based RoIs is then compared with existing AU-based RoIs in the literature. Figure 2 illustrates the block diagram of the proposed architecture.

Fig. 2 (a) Overview of the proposed architecture. (b) The Analysis Module that performs emotion-specific classification with the region formed by one FACS AU at a time

3.1 Temporal interpolation model (TIM)

The short and varied frame lengths in the micro-expression datasets make micro-expression recognition highly challenging and obstruct the discovery of AUs in them [23, 35]. The short frame length of micro-expression samples also limits the application of the spatiotemporal descriptor (i.e., Local Binary Pattern from Three Orthogonal Planes (LBP-TOP)) in this work, as the radius, r, can only be 1 if a sample has fewer than 6 frames. To overcome this, the temporal interpolation model (TIM) [64] is applied to the datasets.

The TIM learns the patterns of the frames through a sequence-specific mapping and generates a continuous and deterministic curve function of a variable t in the range [0, 1]. Since the curve function describes the temporal relations between the frames, it can predict the characteristics of unseen frames. Therefore, the desired number of frames can be generated through TIM by controlling the variable t of the curve function. In this work, the micro-expression samples are interpolated to 10 frames based on the finding in [35] that a frame length of 10 gives the most stable performance across different datasets.
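For illustration, the sketch below resamples a micro-expression clip to a fixed number of frames by simple linear interpolation of every pixel trajectory over t in [0, 1]. It is only a stand-in for the graph-embedding TIM of [64], not its implementation; the array shapes and function name are our own assumptions.

```python
# Minimal sketch, NOT the TIM of [64]: resample a clip to a fixed frame count
# by linear interpolation over the normalized time variable t in [0, 1].
import numpy as np
from scipy.interpolate import interp1d

def resample_clip(clip: np.ndarray, target_frames: int = 10) -> np.ndarray:
    """clip: array of shape (n_frames, H, W); returns (target_frames, H, W)."""
    t_src = np.linspace(0.0, 1.0, clip.shape[0])   # original frame positions
    t_dst = np.linspace(0.0, 1.0, target_frames)   # desired frame positions
    # Interpolate every pixel trajectory along the time axis.
    return interp1d(t_src, clip.astype(np.float64), axis=0)(t_dst)
```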

3.2 Emotion-specific AU-based RoIs

Inspired by [20], we assign facial landmark(s) to each AU to identify the locations of the AUs. This is crucial in analyzing the significance of each AU in certain micro-expressions. The facial landmark points of each AU are located using a pre-trained facial landmark detector [28, 30]. Apart from the facial landmarks adopted from [20] for some AUs, the facial landmarks for the remaining targeted AUs are derived by mapping the 68 facial landmark points to the action descriptor of FACS.
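As a concrete illustration of this mapping, the snippet below shows a hypothetical dictionary from a few AUs to indices of the 68-point landmark scheme, derived by reading each AU's action descriptor. The exact assignments used in this work are those in Table 3, so the indices here are assumptions for illustration only.

```python
# Hypothetical AU -> 68-point landmark indices, for illustration only;
# the assignments actually used in this work are listed in Table 3.
AU_TO_LANDMARKS = {
    "AU1":  [21, 22],   # inner brow raiser -> inner eyebrow points
    "AU2":  [17, 26],   # outer brow raiser -> outer eyebrow points
    "AU9":  [31, 35],   # nose wrinkler     -> outer nostril points
    "AU12": [48, 54],   # lip corner puller -> mouth corners
}
# Each listed landmark later serves as the centre of one AU-based RoI.
```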

It can be observed from Table 3 that the miscellaneous (i.e., AU8, AU38 and AU39) and optional AUs (AU43 and AU45) are also considered in the AU analysis for the respective datasets. Although they are not main AUs, they are included in this work because they are annotated in the ground truth of CASME II, SAMM and CAS(ME)\(^2\). This helps to study the AU co-occurrence in the targeted FACS-coded datasets. Table 3 also tabulates the AUs annotated in the ground truth of each dataset.

In addition, the numeric codes of the AUs considered in this work are provided in Table 3 along with their descriptions and the muscles involved. Note that all of the listed main AUs are considered in the AU analysis for CASME II, SAMM and CAS(ME)\(^2\), whereas the miscellaneous and optional AUs are only involved in the analysis for particular datasets. Since the emotion-specific AU-based RoIs are formed by assigning the facial landmarks to the AUs based on each AU’s description, the same RoIs can be formed using different facial landmark detectors. MediaPipe [39] is a comprehensive framework that provides up to 468 facial landmark points; the MediaPipe facial landmarks for each AU are provided in Appendix Table 19 to offer more flexibility in the choice of face detector.

To observe the texture changes caused by each AU, we determine an AU-based RoI with each predefined AU’s facial landmark acting as the central point. This is crucial, as the regions should be large enough to cover the facial texture changes caused by the AU movement but not so large that they cover facial features irrelevant to the emotions. Given a \(256 \times 256\) image, I, we experimented with RoIs of different sizes (\(8 \times 8\), \(16 \times 16\), \(24 \times 24\), \(32 \times 32\) and \(48 \times 48\)) centred on the facial landmarks listed in Table 3. The experiment stops at the region size of \(48 \times 48\), as further enlarging the AU-based patch would cover facial parts whose movements are governed by different AUs (i.e., eyes and eyebrows). Table 2 tabulates the average multi-emotion results of CASME II, SAMM and CAS(ME)\(^2\) when the features are extracted from AU-based regions of different sizes, in order to determine the optimum region size. Based on Table 2, the size of \(48 \times 48\) gives consistently the best performance across the datasets.

Table 2 Performance achieved with different RoI sizes

Given a \(256 \times 256\) image, I, the face image can be divided into \(16 \times 16\) blocks to obtain the \(48 \times 48\) RoI shown in Fig. 3. To ensure the same RoI ratio is maintained for different input sizes, the region size can be computed as follows:

$$\begin{aligned} Region_{h,w} = I_{M,N} \times \frac{48}{256} \end{aligned}$$
(1)

where h and w denote the height and width of the region; M and N denote the height and width of the face image.
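A minimal sketch of this scaling and cropping step is given below; the function name and the assumption that landmarks are given as (x, y) pixel coordinates are ours.

```python
# Sketch of Eq. (1): crop a square AU-based RoI centred on a landmark,
# keeping the 48/256 ratio for an arbitrary face image size.
import numpy as np

def au_roi(face: np.ndarray, center_xy, base_size: int = 48, base_dim: int = 256):
    M, N = face.shape[:2]                        # image height (M) and width (N)
    h = int(round(M * base_size / base_dim))     # Region_h = M * 48 / 256
    w = int(round(N * base_size / base_dim))     # Region_w = N * 48 / 256
    cx, cy = center_xy                           # landmark acting as RoI centre
    x0 = int(np.clip(cx - w // 2, 0, N - w))     # keep the patch inside the image
    y0 = int(np.clip(cy - h // 2, 0, M - h))
    return face[y0:y0 + h, x0:x0 + w]
```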

Fig. 3 Each cross represents an AU and the highlighted region is a \(48 \times 48\) AU-based RoI with an AU at the centre

Table 3 FACS and Ground Truth AUs of Each Dataset Together with Their Corresponding Facial Landmarks

In reality, some facial muscles exist symmetrically on both sides of the face [19]. Based on the facial landmark points assigned in [20], some AUs could have more than one side of the facial muscle activated; thus, the AU could be observed on either one or both sides. In this work, we use the terms defined in the FACS, i.e., unilateral and symmetric AUs, to refer to cases where AUs are observed on either or both sides. In more detail, a unilateral AU only has one side of the facial muscle activated, whereas a symmetric AU has both sides of the muscles activated. For instance, AU6 (cheek raiser) involves the muscles on both sides of the cheek. Thus, this AU6 comprises the unilateral AU\(6_1\) (left side) and unilateral AU\(6_2\) (right side); or is collectively known as the symmetric AU6, where both AU\(6_1\) and AU\(6_2\) are considered.

As AUs are formed from facial muscle movements, some degree of overlap occurs between certain AUs. A few AUs listed in Table 3 share the same facial landmark points and thus form the same AU-based RoI. The k AUs with the best F1-scores in our analysis are chosen as the proposed AUs; the determination of the value of k is further discussed in Section 4. AUs with common facial landmarks are activated if either of them is among the top-k AUs in our analysis.

Fig. 4 (a) Proposed Unilateral AUs, P\(_1\) (b) Proposed AUs, including their symmetric sides, P\(_2\)

3.3 Analysis module

The Analysis Module performs the independent AU analysis by considering each AU-based RoI in turn. Note that these RoIs are formed by using the facial landmarks explicitly assigned to each FACS AU listed in Table 3 as the central points. This module aims to obtain our proposed effective emotion-specific RoIs for each class of the CASME II, SAMM and CAS(ME)\(^2\) datasets.

As shown in Fig. 2(b), each AU-based RoI overlays the input images, forming independent emotion-specific RoI-activated images. The features of the activated images are extracted using LBP-TOP, for consistency with the benchmarking of the micro-expression dataset designers; the parameters of LBP-TOP are discussed in Section 3.4. The LBP-TOP features are used for emotion-specific classification. Subsequently, the relevance order of the AUs for each micro-expression is obtained from the analysis. The k AUs with the best F1-scores in each micro-expression class form the proposed emotion-specific AU-based RoIs, where k is determined from the results of the top 10%, 20% and 30% of the best AUs. The top-k AUs are also used to form the proposed unilateral AUs, \(P_1\), and the proposed symmetric AUs, \(P_2\), for each micro-expression class. Figure 4 shows examples of \(P_1\) and \(P_2\) for the class “disgust” in CASME II. The unilateral AUs only involve those listed in the top-k, whereas the symmetric AUs consider the top-k AUs and their corresponding symmetric sides. The emotion-specific RoIs formed from the \(P_1\) and \(P_2\) AUs are then evaluated in a micro-expression recognition system as shown in Fig. 2(a).
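The sketch below outlines this analysis loop under our own assumptions: per-AU feature matrices have already been extracted from the RoI-activated clips, the labels are binary (target emotion vs rest), subject identities are available for the leave-one-subject-out splits, and the C value is fixed for brevity. It is a minimal illustration, not the exact pipeline.

```python
# Sketch of the independent AU analysis: one-vs-rest classification per AU-based
# RoI under the LOSO protocol, then ranking the AUs by per-emotion F1-score.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score

def rank_aus(features, y, groups, top_percent=0.3, C=1.0):
    """features: dict AU -> (n_samples, d) LBP-TOP matrix; y: 1 = target emotion,
    0 = rest; groups: subject id per sample (for leave-one-subject-out)."""
    scores = {}
    for au, X in features.items():
        preds = np.empty_like(y)
        for tr, te in LeaveOneGroupOut().split(X, y, groups):   # LOSO splits
            preds[te] = LinearSVC(C=C).fit(X[tr], y[tr]).predict(X[te])
        scores[au] = f1_score(y, preds)                         # Eq. (6) per AU
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(top_percent * len(ranked)))
    return ranked[:k], scores          # top-k AUs for this emotion, plus all F1s
```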

3.4 Feature extraction

In this work, we focus on the Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [61] as it is the benchmarked feature extraction technique used by the designers of the datasets (i.e., CASME II, SAMM and CAS(ME)\(^2\)). Furthermore, LBP-TOP is also used in the baselines reported in this analysis [42, 50]. LBP-TOP is an extension of the original LBP [44] that describes the dynamic features in the spatial-temporal domain. The LBP-TOP views the video sequence from three aspects: (a) a stack of XY planes along the time (T) dimension; (b) a stack of XT planes along the Y dimension; (c) a stack of YT planes along the X dimension. The temporal planes (i.e., XT and YT planes) contain information about the space-time transition, whereas the XY plane contains the appearance information of the image frame. The LBP code of every pixel from the XY, XT and YT planes is computed as follows:

$$\begin{aligned} f_{j,P,R}(x,y,t) = \sum _{p=0}^{P-1}s(g_p-g_c)2^p \end{aligned}$$
(2)

where \(f_{j,P,R}(x,y,t)\) represents the LBP code of the centre pixel (x, y, t) in the \(j^{th}\) plane (XY plane when j = 0; XT plane when j = 1; YT plane when j = 2); P and R represent the number of neighbouring pixels and the radius, respectively; \(g_c\) represents the intensity of the centre pixel; \(g_p\) (\(p=0,\dots , P-1\)) represents the intensity of the \(p^{th}\) neighbouring pixel on the radius R; and \(2^{p}\) represents the weight corresponding to the neighbouring pixel location. The function s(x) is a piecewise function defined as follows:

$$\begin{aligned} s(x) = {\left\{ \begin{array}{ll} 1 &{} x\ge 0 \\ 0 &{} x< 0 \end{array}\right. } \end{aligned}$$
(3)

The histogram recording the LBP codes for each plane in LBP-TOP can be calculated as:

$$\begin{aligned} H_{i,j} = \sum _{x,y,t} I\{f_j(x,y,t)=i\}, i=0,...,n_j-1; j=0,1,2. \end{aligned}$$
(4)

where \(n_j\) represents the number of different labels generated by the LBP operator in the \(j^{th}\) plane (XY plane when j = 0; XT plane when j = 1; YT plane when j = 2). The function \(f_j(x,y,t)\) gives the LBP code of the central pixel (x, y, t) of each plane. The function \(I\{A\}\) is defined as:

$$\begin{aligned} I\{A\} = {\left\{ \begin{array}{ll} 1 &{} \text {if } A \text { is true,} \\ 0 &{} \text {if } A \text { is false.} \end{array}\right. } \end{aligned}$$
(5)

Figure 5 shows the LBP-TOP feature extraction process, where the histograms from each plane are concatenated and expressed as an LBP-TOP feature vector. In this work, the radii along x and y vary from 1 to 4, whereas the radius along t varies from 2 to 4, as these are the common ranges used for micro-expression recognition [37, 46, 57]. The block size of LBP-TOP is set to 5, and an 8-point neighbourhood is used for the comparison with neighbouring points. The best performance of each experiment over these parameter settings is reported in this work.
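As a rough illustration of the three-plane coding in Eqs. (2)-(4), the sketch below computes standard 2D LBP codes slice by slice on the XY, XT and YT stacks and concatenates the per-plane histograms. It ignores the block subdivision and the separate spatial and temporal radii used in the actual experiments, so it is an approximation rather than the implementation used here.

```python
# Simplified single-block LBP-TOP sketch: 2D LBP per slice of each orthogonal
# plane stack, one normalised histogram per plane, concatenated at the end.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(clip: np.ndarray, P: int = 8, R: int = 1) -> np.ndarray:
    """clip: 8-bit grayscale video of shape (T, H, W)."""
    plane_stacks = [
        [clip[t, :, :] for t in range(clip.shape[0])],   # XY slices along T
        [clip[:, y, :] for y in range(clip.shape[1])],   # XT slices along Y
        [clip[:, :, x] for x in range(clip.shape[2])],   # YT slices along X
    ]
    hists = []
    for slices in plane_stacks:
        codes = np.concatenate(
            [local_binary_pattern(s, P, R).ravel() for s in slices])
        h, _ = np.histogram(codes, bins=2 ** P, range=(0, 2 ** P))
        hists.append(h / h.sum())               # per-plane normalised histogram
    return np.concatenate(hists)                # final LBP-TOP feature vector
```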

Fig. 5 Three histograms are produced from the XY, XT and YT planes in LBP-TOP. The histograms are then concatenated into a single histogram as the final LBP-TOP feature vector

3.5 Classification

The linear Support Vector Machine (LSVM) is used as the classifier for micro-expression classification, as it has been demonstrated in several studies [25, 27, 32, 35, 37, 46] to be effective for this problem. For the C parameter of each experiment, the range \([10^{-1}, 1, 2, 10, \dots , 10^4]\) is used consistently across the datasets considered in this paper, and the value with the best performance is chosen.

In the independent AU analysis, emotion-specific classification is performed to obtain the relevance order of the AUs. Each micro-expression is classified using the one-vs-rest (OvR) method [32]. The F1-score is computed for the binary classification to avoid biased results due to class imbalance in the datasets [48]. More precisely, the F1-score, \(F1_c\), of each class is computed as follows:

$$\begin{aligned} F1_c = \frac{2P_cR_c}{P_c + R_c} \end{aligned}$$
(6)
$$\begin{aligned} P_c = \frac{TP_c}{TP_c + FP_c}, \quad R_c = \frac{TP_c}{TP_c+FN_c} \end{aligned}$$
(7)

where \(P_c\) and \(R_c\) are the precision and recall of each class; TP represents the true positive samples; FP represents the false positive samples and FN represents the false negative samples.

The output of the AU analysis forms our proposed \(P_1\) and \(P_2\) AUs, which are subsequently used for micro-expression recognition to compare their efficacy against other AUs in the literature. In the multi-emotion classification problem, recognition is performed simultaneously among the multiple emotion classes considered by the dataset, for which the average accuracy and F1-score are computed. As the number of samples in each dataset varies across the emotion classes, the F1-score is a better metric than accuracy for measuring the classification performance in this work. The Leave-One-Subject-Out (LOSO) protocol is implemented to prevent subject identity from interfering with micro-expression recognition [32].
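A sketch of this evaluation protocol is given below, under the assumption that the LBP-TOP features of the RoI-activated samples are stacked into a single matrix and the emotion labels are integer-encoded; the grid written out for C is only an expansion of the range quoted above.

```python
# Sketch of the multi-class recognition protocol: LOSO cross-validation with a
# linear SVM, selecting C by the best macro F1-score over the quoted grid.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score, f1_score

def loso_recognition(X, y, groups, C_grid=(0.1, 1, 2, 10, 100, 1000, 10000)):
    best = None                                   # (C, accuracy, macro F1)
    for C in C_grid:
        preds = np.empty_like(y)
        for tr, te in LeaveOneGroupOut().split(X, y, groups):
            preds[te] = LinearSVC(C=C).fit(X[tr], y[tr]).predict(X[te])
        acc = accuracy_score(y, preds)
        f1 = f1_score(y, preds, average='macro')  # unweighted mean over classes
        if best is None or f1 > best[2]:
            best = (C, acc, f1)
    return best
```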

4 Experimental results and analysis

This section first introduces the micro-expression datasets used in our experiments and then discusses the experimental results obtained on these datasets by applying our emotion-specific AU-based RoI analysis technique. The datasets (i.e., CASME II, SAMM and CAS(ME)\(^2\)) are standard FACS-coded datasets covering subjects of diverse heritages and cultures and are therefore used for this analysis. The performance of our proposed emotion-specific RoIs is then compared with the existing AU-based RoIs.

4.1 Dataset profile

This section introduces the standard FACS-coded micro-expression datasets evaluated in this work. The proposed architecture is applied individually to the CASME II, SAMM and CAS(ME)\(^2\) datasets.

CASME II [57]. The dataset consists of 249 spontaneous micro-expression samples collected from 26 subjects at a sampling rate of 200 fps. CASME II has micro-expression samples from seven classes, namely disgust, happiness, others, repression, surprise, fear, and sadness. As demonstrated in the database’s baseline experiment [57], only the first five of these classes have sufficient samples and are considered in our experiments. Two trained coders labelled the AUs of the micro-expressions based on the FACS investigator’s guide [19] and produced an average reliability score of 0.846. The average AU reliability, R, of the database is computed as:

$$\begin{aligned} R = \frac{2 \times AU(C_1, C_2)}{\#All\_AU} \end{aligned}$$
(8)

where \(AU(C_1, C_2)\) represents the number of AUs agreed by both coders and \(\#All\_AU\) represents the total number of AUs labelled by both coders across the micro-expression samples. In this work, for the FACS and objective-class AUs, we only consider the micro-expression classes that are also labelled in the CASME II ground truth. Hence, only the disgust, happiness and surprise emotion classes are considered for both the FACS and objective-class AUs.

SAMM [10]. The dataset consists of 159 spontaneous micro-expression samples collected from 29 subjects at a sampling rate of 200 fps. The dataset categorizes the micro-expressions into eight classes, namely anger, contempt, disgust, fear, happiness, sadness, surprise and others. As demonstrated in [31], only five classes (i.e., anger, contempt, happiness, others and surprise) with sufficient micro-expression samples are considered. Therefore, a total of 136 micro-expression samples are used in our experiments. Unlike CASME II, the AUs of the SAMM dataset were coded by three certified coders and achieved an overall AU reliability, R, of 0.82. The reliability is computed as:

$$\begin{aligned} R = \frac{3 \times AU(C_1, C_2, C_3)}{\#All\_AU} \end{aligned}$$
(9)

where \(AU(C_1, C_2, C_3)\) is the number of AUs agreed by all three coders and \(\#All\_AU\) is the total number of AUs coded by all coders. Similar to CASME II, for a consistent comparison, we only consider the micro-expression classes of the FACS and objective-class AUs that overlap with the SAMM ground truth classes. Hence, only anger, happiness and surprise are considered for both the FACS and objective classes in the SAMM dataset analysis.
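A small sketch of this agreement measure, assuming each coder's annotation for a sample is available as a set of AU codes, is shown below; the function name and input layout are our own.

```python
# Sketch of Eqs. (8)-(9): inter-coder AU reliability, with each coder giving one
# set of AU codes per micro-expression sample.
def au_reliability(coders):
    """coders: list (one entry per coder) of lists of AU sets, one set per sample."""
    n_coders = len(coders)
    agreed = sum(len(set.intersection(*sample_sets))       # AUs agreed by all coders
                 for sample_sets in zip(*coders))
    total = sum(len(s) for coder in coders for s in coder)  # all AUs labelled
    return n_coders * agreed / total                        # R = n x agreed / #All_AU

# Example with two coders and one sample: R = 2 * 2 / 5 = 0.8
# au_reliability([[{1, 2, 4}], [{1, 2}]])
```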

CAS(ME)\(^2\) [46]. The dataset consists of two parts (i.e., Parts A and B). Part A has 87 long videos containing macro- and micro-expressions that can be used for expression spotting, whereas Part B has a total of 357 videos consisting of macro-expression and micro-expression samples. As our work focuses on micro-expression recognition, this paper only uses the micro-expression samples from Part B; hence, 54 micro-expression samples are used in our experiments [36]. The micro-expressions are categorized into three classes: anger, happiness and disgust. The average reliability of the dataset is 0.8, computed by (8). Similar to CASME II, the AUs occurring in CAS(ME)\(^2\) are labelled by two well-trained coders based on the FACS.

4.2 Independent AU analysis to propose relevant AUs

The independent AU analysis is conducted by applying the module described in Section 3.3 to CASME II, SAMM and CAS(ME)\(^2\) to determine each AU’s relevance to the micro-expression classes of the datasets. In particular, the AU-based RoI formed by each FACS AU listed in Table 3 is considered individually in the emotion-specific classification, i.e., classifying a specific emotion against all other emotions. This helps to gather the most relevant AUs for each micro-expression class within the three datasets. As the performance of each class affects the results of the average multi-emotion classification, it is essential to form the RoIs based on the AUs relevant to each micro-expression. To the best of our knowledge, this is the first work that explores the micro-expressions of the common FACS-coded datasets (i.e., CASME II, SAMM and CAS(ME)\(^2\)) in a one-vs-rest classification setting. As a result of the analysis, each dataset yields an AU relevance order for each micro-expression by arranging the obtained F1-scores in descending order. The complete lists of F1-scores for each dataset can be found in the Appendix, Tables 20, 21 and 22.

The proposed AUs for each dataset are determined by performing the standard multi-class micro-expression recognition that involves different percentages of the AUs (i.e., 10%, 20% and 30%). Based on Table 4, the most relevant AUs for CASME II, SAMM and CAS(ME)\(^2\) are the top-15 (30%), top-5 (10%) and top-11 (20%), respectively. The selected AUs for each dataset are listed in Tables 5, 6 and 7. These AUs then form the Proposed \(P_1\) and \(P_2\) AU-based RoIs for each emotion class.

Table 4 Results with different number of AUs

4.3 Classification results and discussion

This section compares the regions formed by our proposed AUs with existing AU-based RoIs. The robustness of the proposed AUs is also evaluated in this section using benchmarking techniques.

Table 5 Top 30% (15 AUs) F1-score (F1) of each Class in CASME II
Table 6 Top 10% (5 AUs) F1-score (F1) of each Class in SAMM

4.3.1 Comparison with existing AU-based RoIs

To compare the performance of our proposed \(P_1\) and \(P_2\) AUs against other state-of-the-art AU-based RoIs in the literature, we apply all these AUs as an overlay for feature extraction and subsequent multi-emotion classification. The main focus of this work is to revisit the effectiveness of the ground truth AUs which had been human-coded for CASME II, SAMM and CAS(ME)\(^2\). Therefore, the ground truth AUs of those datasets will serve as our main baselines. Table 3 shows each dataset’s ground truth AUs. The proposed AUs are the union of the emotion-specific AUs obtained from our Analysis Module. Similarly, the general sets of AUs proposed in Table 1 by FACS and the objective classes are also gathered from all samples of the corresponding emotion. They are then evaluated and compared with the proposed \(P_1\) and \(P_2\) AUs.

Table 7 Top 20% (11 AUs) F1-score (F1) of each Class in CAS(ME)\(^2\)
Table 8 Differences of the existing AU-based RoIs

The FACS and objective-class AUs do not cover all the micro-expression classes in the CASME II, CAS(ME)\(^2\) and SAMM datasets. Hence, when evaluating the FACS and objective-class AUs on each dataset, we only consider the micro-expression classes that are reported for both. In addition, the ground truth of the micro-expression datasets explicitly reports the sides of the AUs involved in each micro-expression clip, whether unilateral or symmetric. However, we only consider the symmetric AUs for the dataset ground truth, because different sides of the same AU occur in different clips of the same micro-expression class. We subsequently analyse both unilateral and symmetric AUs to discover the most relevant ones for each micro-expression class.

Table 9 Multi-Emotion Results of the Existing AUs and Union of the Proposed AUs (Unilateral AUs, P\(_1\), and Symmetric AUs, P\(_2\)) on CASME II

In addition, the proposed AUs are compared with the existing AU-based RoIs suggested by [37, 42, 50]. Table 8 shows the differences between the existing AU-based RoIs in terms of regional shapes and the AUs considered for micro-expression recognition. It can be observed that several AUs are excluded when constructing the RoIs in the existing techniques. Interestingly, these excluded AUs achieved high F1-scores in our Analysis Module. The RoIs recommended by Merghani et al. [42] cover most of the relevant AUs. However, the size of RoIs for some AUs is small and might not capture all of the facial texture changes induced by the AU movement.

Table 10 Multi-Emotion Results of the Existing AUs and Union of the Proposed AUs (Unilateral AUs, P\(_1\), and Symmetric AUs, P\(_2\)) on SAMM

Table 9 shows the multi-emotion classification results for the CASME II dataset achieved by our proposed AUs, the ground truth AUs of the dataset designers, and other existing state-of-the-art AU-based RoIs [37, 42, 50]. These are the top performance rates achieved on a TIM-interpolated CASME II with LBP-TOP features, based on the same pre-processing, feature extraction and classification techniques as those used by the dataset designers in their benchmarking experiments. The best performance is achieved when the RoIs are formed by the proposed \(P_2\) AUs, surpassing the recognition rate of the benchmark techniques by at least \(2\%\). Interestingly, both proposed AU sets surpass the ground truth’s accuracy by at least \(4\%\). The better results achieved by our proposed RoIs indicate that some emotion-relevant AUs may have been unintentionally left out by the human coders when forming the ground truth of the CASME II dataset.

Table 10 shows the multi-emotion classification results for the SAMM dataset. The proposed unilateral \(P_1\) set of AUs improves the performance achieved by the existing AUs and achieves the best accuracy of \(31.62\%\) and F1-score of 0.2129 in comparison to the other techniques. Table 11 shows the results for the CAS(ME)\(^2\) dataset. The proposed \(P_1\) and \(P_2\) AUs significantly improve the performances obtained by the baselines of CAS(ME)\(^2\). This performance improvement indicates that the proposed AUs are more relevant to the micro-expression classes than the baseline AUs in the CAS(ME)\(^2\) dataset. Based on the performance improvement achieved across the three datasets, the additional AUs considered in our proposed AUs contain meaningful features that are helpful in the micro-expression recognition task.

4.3.2 Effect of proposed AUs on state-of-the-art methods

This section compares our proposed RoIs with other AU-based micro-expression recognition methods. Table 12 shows the results on CASME II, SAMM and CAS(ME)\(^2\) obtained by different AU-based approaches. The results from Leong et al. [34], Merghani et al. [42] and Zong et al. [65] are considered in the comparison. To our knowledge, [34] is the state-of-the-art technique involving AUs in micro-expression recognition. The method proposed in [34] models the AU relationships of the micro-expressions based on the AUs annotated in the ground truth. The AUs are then combined with features randomly extracted from the face region. Instead of using features from random face areas, we re-implemented the method with the features extracted from our proposed emotion-specific AU-based regions. To examine the effectiveness of the proposed AUs under different classifiers, the results are obtained using the classifiers of the benchmarking techniques (i.e., the Graph Neural Network (GNN) in [34] and Sequential Minimal Optimization (SMO) in [42]). However, the method of [34] does not apply to CAS(ME)\(^2\), as the technique requires at least 16 frames per micro-expression sample. As a result, we improve the F1-scores for CASME II (+1.33%) and SAMM (+0.66%) when the proposed \(P_2\) AUs are considered.

Table 11 Multi-Emotion Results of Existing AUs and Union of the Proposed AUs (Unilateral AUs, P\(_1\), and Symmetric AUs, P\(_2\)) on CAS(ME)\(^2\)
Table 12 Effectiveness of Proposed P\(_1\) and P\(_2\) AUs on State-of-the-art AU-based Micro-expression Recognition
Table 13 Emotion-specific AUs proposed for each micro-expression
Table 14 Facial regions involved in each micro-expression

The effect of the proposed AUs is also evaluated on the handcrafted technique suggested in [42]. More specifically, we apply the Gaussian smoothing operator to the images of the datasets as proposed in [42]. The optical flows of the smoothed input images are then computed and overlaid with the original input images. The features from the RoIs proposed in [42] are then extracted using the LBP-TOP technique. The LBP-TOP features obtained from the smoothed optical flow images are classified with SMO [45]. We compare the results achieved by extracting the features from our proposed AU-based RoIs with Merghani et al.’s technique. Although our proposed AUs do not further improve the performance of Merghani et al.’s technique on CASME II, they achieve an accuracy of \(78.07\%\), which is close to the \(80.64\%\) of the original work. On top of that, the proposed \(P_2\) AUs for SAMM improve the recognition rate from \(30.15\%\) to \(33.82\%\) and the F1-score from 0.1908 to 0.2198. The proposed \(P_2\) AUs for CAS(ME)\(^2\) significantly improve the recognition rate, by \(5\%\) over the \(61.11\%\) of Merghani et al.’s RoIs, and the F1-score from 0.4284 to 0.5037. Given the performance improvements obtained in the state-of-the-art methods once our proposed AUs are introduced, the robustness of the proposed AUs persists when more advanced techniques are implemented.

4.3.3 Generic AUs for specific emotions

This section suggests generic AUs for different micro-expressions based on our proposed emotion-specific AUs to ensure the usability of these AUs in future emotion studies. Since the emotion-specific AUs are obtained through the emotion-specific classification performed in the Analysis Module, these proposed AUs contain the features most relevant to the respective micro-expressions. Hence, the emotion-specific AUs of each dataset obtained through the Analysis Module are essential for deriving the universal AUs for each emotion. The AUs of the same micro-expression class across CASME II, SAMM and CAS(ME)\(^2\) are unified to obtain more universal AUs that apply to the corresponding micro-expressions of any dataset. Binary classification of a specific emotion is preferred over standard multi-class classification because the main metrics of the latter are averaged over all involved micro-expression classes; hence, the effectiveness of the proposed emotion-specific AUs is more prominent in binary classification settings. Table 13 shows the emotion-specific AUs proposed for the micro-expressions occurring in multiple datasets.

Table 15 F1-score comparison of the anger-specific classification results on cross-dataset validation between SAMM and CAS(ME)\(^2\), i.e., train dataset \(\rightarrow \) test dataset
Table 16 F1-score comparison of the surprise-specific classification results on cross-dataset validation between CASME II and SAMM, i.e., train dataset \(\rightarrow \) test dataset
Table 17 F1-score comparison of the disgust-specific classification results on cross-dataset validation between CASME II and CAS(ME)\(^2\), i.e., train dataset \(\rightarrow \) test dataset
Table 18 F1-score comparison of the happiness-specific classification results on CASME II, SAMM and CAS(ME)\(^2\), i.e., train dataset \(\rightarrow \) test dataset

Table 14 compares our proposed AUs with the generic AUs suggested by the FACS and the objective classes based on the facial regions associated with a specific emotion. The distribution of the AUs is categorized by the upper (i.e., eyes and brows), middle (i.e., cheeks and nose) and lower (i.e., mouth and chin) parts of the human face. It can be observed that our proposed emotion-specific AUs are in line with the regions of the FACS AUs. It is also worth mentioning that the AUs proposed for surprise and anger appear in the same areas as those suggested by the FACS and objective classes. Therefore, our proposed AUs exhibit the characteristics of universal AUs for disgust, happiness, surprise and anger.

The generic AUs listed in Table 13 are used to evaluate the emotion-specific classification on CASME II, SAMM and CAS(ME)\(^2\) in cross-dataset validation settings. The classification results for anger, surprise, disgust and happiness are tabulated in Tables 15, 16, 17 and 18. The F1-scores are reported for the binary emotion-specific classifications to give unbiased results regardless of class imbalance.

Based on the results, the AUs proposed for anger and surprise are robust for the respective micro-expressions in CASME II, SAMM and CAS(ME)\(^2\), as they achieve better F1-scores than the FACS and objective-class AUs. In contrast, the proposed AUs for disgust and happiness achieve slightly lower F1-scores than those from the FACS. Nevertheless, our proposed AUs are shown to be more effective than the AUs from the objective classes.

Fig. 6 (a) AUs used in the independent AU analysis. (b) AUs annotated in the ground truth. (c) Proposed unilateral \(P_1\) AUs. (d) Proposed symmetric \(P_2\) AUs. Facial landmark points for all AUs involved in the independent AU analysis are annotated with blue crosses; facial landmark points corresponding to the dataset ground truth AUs are annotated with red dots; facial landmark points of our proposed AUs are annotated with green dots

4.4 Discussion

Figure 6 illustrates the facial landmarks of the encoded AUs from the different categories across all datasets. Figure 6(a) annotates the AUs listed in Table 3, whereas Fig. 6(b) overlays the ground truth AUs of each dataset with red dots. This shows that some AUs are excluded from the datasets’ ground truth. Comparing Fig. 6(b) with our proposed AUs shown in Fig. 6(c) and (d), some AUs that are excluded from the ground truth appear among our \(P_1\) or \(P_2\) AUs. This indicates that the datasets’ ground truth has likely left out a few important AUs that are helpful for the classification. In addition, the proposed AUs omit some AUs encoded in the ground truth that were shown to be less beneficial for the classification. As such, the classification performance is improved by using more relevant AU regions, since the redundant features are excluded.

As discussed in Section 4.3.1, the RoIs formed by our proposed AUs provide more relevant features for CASME II, SAMM and CAS(ME)\(^2\) than the existing state-of-the-art AU-based RoIs in multi-emotion classification. The proposed RoIs are also evaluated with the state-of-the-art AU-based method and reported in Table 12 to ensure that they remain effective when applied to different techniques. As a result, the proposed AUs are shown to be robust, as they improve the recognition rates when the feature extraction technique is replaced with the state-of-the-art approach.

In Section 4.3.3, we also suggest the generic AUs for specific emotions based on the findings of the Analysis Module. The respective emotion-specific classifications are performed to validate the universality of the suggested AUs. Based on the results achieved in Tables 15, 16, 17 and 18, our suggested AUs for anger and surprise are applicable for the corresponding emotion-specific studies regardless of the dataset under consideration.

Table 19 FACS and Ground Truth AUs of Each Dataset Together with Their Corresponding Mediapipe Facial Landmarks

5 Conclusion

This work presents an independent AU analysis that revisits the human-encoded ground truth AUs of three established micro-expression datasets: CASME II, SAMM and CAS(ME)\(^2\). The regions associated with these AUs contain more important cues than other facial parts and should be focused on, especially when the full face is unavailable (e.g., when the face is partially occluded). Along the way, this paper also provides an AU ranking for each micro-expression class of the datasets. Based on this ranking, we propose emotion-specific AUs for more relevant emotion classification.

As the independent AU analysis results in the best AU-based RoIs for each emotion, we further evaluate the effectiveness of these emotion-specific AU-based RoIs in AU-based micro-expression recognition. The experimental results show that our proposed AU RoIs can perform better than existing ground truth or state-of-the-art AUs for the CASME II, SAMM and CAS(ME)\(^2\) datasets.

Table 20 F1-Scores (F1) of each Emotion Class in CASME II
Table 21 F1-Scores (F1) of each Emotion Class in SAMM
Table 22 F1-Scores (F1) of each Emotion Class in CAS(ME)\(^2\)
Table 23 Comparison of AUs Involved for each Micro-expression in CASME II Ground truth, FACS, Objective and Proposed AUs
Table 24 Comparison of AUs Involved for each Micro-expression in SAMM Ground truth, FACS, Objective and Proposed AUs
Table 25 Comparison of AUs Involved for each Micro-expression in CAS(ME)\(^2\) Ground truth, FACS and Proposed AUs

As a direction for future work, we hope to revisit the use of facial AUs in the psychological literature to analyze whether differences in results exist between micro-expressions recognized by human psychologists and those recognized by pattern recognition-based techniques. Subsequently, we aim to investigate more human-like recognition of micro-expressions, since emotions are ultimately experienced and perceived by human beings. Facial AUs are a good starting point, since humans have the innate ability to extract the main characteristics of any face they see, much like an artist drawing a caricature by exaggerating distinctive facial features.