1 Introduction

Facial expressions convey human emotion: emotion is reflected directly in facial regions and gives cues about a person's feelings at that moment. Expressions are often more effective than verbal dialogue for perceiving someone's feelings. Each expression represents a unique pattern on the face produced by muscular variations in different facial regions. With the rapid evolution of computer vision and machine learning techniques, many state-of-the-art facial expression recognition (FER) systems have been proposed in the literature. Automatic facial expression recognition has applications in fields such as depression analysis, e-learning, medical diagnosis, law enforcement, and gaming. In real-world scenarios, facial images are captured from multiple angles with spontaneous interactions in challenging environments involving illumination changes and noise. These factors make facial expression recognition a challenging task. Thus, there is a need for a robust FER system that can handle these challenging scenarios.

The concept of universal expressions was introduced by Paul Ekman [1], who classified expressions into six major categories: anger, disgust, fear, happiness, sadness, and surprise. Furthermore, Ekman and Friesen built the Facial Action Coding System [2], which standardized the recognition procedure by dividing the face into action units (AUs). These AUs are categorized according to the movement of a facial muscle or a set of muscles.

An automated FER system usually consists of three steps: preprocessing, feature extraction, and expression classification. In the preprocessing step, the facial region is extracted or cropped from the image. This is important because elements unrelated to expression (e.g., the image background) can create confusion during recognition. After preprocessing, features are extracted from the processed images. This is the most important step in any facial expression recognition approach: a good feature extractor strengthens the FER system and improves accuracy. Feature extraction techniques can be divided into two categories, predesigned and learning-based. In predesigned techniques, handcrafted filters are designed to extract the features. These can be further divided into geometric and appearance-based approaches. In geometry-based recognition, fiducial points are selected at active areas such as the eyebrows, eyes, nose, and lips. With the help of these points, the shape of and distances between active areas are computed and the final features are extracted. Many models based on the shape and size of active points are available in the literature. The overall performance of this type of model depends on selecting facial points with pinpoint accuracy, which is difficult because facial shape varies from expression to expression. In appearance-based approaches, features are extracted by applying a handcrafted filter over the complete image or a part of the image, called a patch. Many filters, such as Gabor wavelets [3], Elastic Bunch Graph Matching [4], Local Binary Patterns (LBP) [5,6,7], Local Directional Pattern (LDP) [8], and PHOG and LPQ [9], are available for this purpose. These descriptors encode texture and gradient magnitude information to represent the microstructure of facial appearance. However, LBP and its variants are not robust to noisy and multi-view pose conditions. After feature extraction, expression classification comes into play: it assigns the feature vector to one of the six expression classes. Various classification models, such as Support Vector Machines (SVM) [10], K-Nearest Neighbors (KNN) [11], Hidden Markov Models (HMM) [12], and AdaBoost, are available for expression classification. In learning-based approaches, features are learned during training; feature extraction and classification are usually performed jointly in a pipeline architecture.

In this paper, we propose the Cross-Centroid Ripple Pattern (CRIP) for robust facial expression recognition, which encodes intensity variation using centroid values of adjacent ripples. The centroids of each ripple are calculated by establishing an averaging relationship between the center pixel and its neighboring pixels located at radii \(r_{1}\) and \(r_{2}\), respectively. Thus, CRIP embeds both micro and macro structures to represent spatial variations and enhances discriminability between expression classes. It improves the performance of the descriptor under various pose conditions, as it extracts deep edge responses from the face structure. The averaging relationship between pixels provides more comprehensive texture information and adds robustness to noise and image blur. Furthermore, gradient information between ripple centroids extracts minute intensity variations and improves the descriptor's ability to deal with illumination variations. The workflow of the proposed FER system is shown in Fig. 1.

Fig. 1

Working block diagram of the proposed FER system: The facial region is cropped from the original image and features are extracted by applying the CRIP descriptor. Further, the response map is divided into blocks to compute the feature vector. The feature vector is fed to the classifier for emotion recognition

2 Related work

In the literature, FER techniques are generally divided into two categories: geometric and appearance-based. Zhang et al. [14] proposed a geometric approach that uses the positions of 34 facial points to extract features. Another commonly used facial expression classification method is the Facial Action Coding System (FACS), where expressions are recognized by extracting AUs. Valstar et al. [15] introduced an approach in which AUs are detected by extracting particular facial points.

Appearance-based features are extracted by applying a filter or filter bank to the whole face image or to local regions of the face, capturing physical changes in facial appearance. Many appearance-based techniques, such as Eigenfaces [16], LDA [17], ICA [18], PCA [19], IDA [20, 21], Gabor wavelets [3], and LBP [5], have been proposed in the literature to extract local features that represent the physical appearance of the face. The Local Binary Pattern (LBP) is the most widely acknowledged approach and has been used effectively in many domains. LBP uses the first-order derivative between the neighborhood and reference pixels to describe texture variation: the basic LBP descriptor maps each intensity value of an image to a new value by thresholding each neighboring pixel against the reference pixel. LBP has gained significant popularity as an illumination-tolerant and computationally simple texture measure. It performs well in extracting micro information about edges, spots, and other relevant features by analyzing intensity changes around gray levels, but it is not robust to random noise and rotation variations. To overcome these issues, Jabid et al. [7] proposed the LDP descriptor, which uses directional information instead of raw intensity values. To extract the directional numbers, LDP computes edge responses in eight directions using the Kirsch mask, keeps the three most prominent direction indices as ones, and sets the remaining bits to zero. The use of edge responses led to good facial expression recognition performance, as they are less sensitive to illumination changes and noise. However, LDP still struggles with non-monotonic illumination variation. To improve feature descriptors further, Ramirez et al. [22] used local directional numbers (LDN). The LDN descriptor assigns a six-bit code to each pixel by extracting the strongest positive and negative edge responses. Rivera et al. [23] improved this directional descriptor by adding texture information and proposed LDTexP, which exploits both the directional and intensity information of the reference neighborhood. Ryu et al. [24] proposed LDTerP, which encodes facial expressions by dividing the face into stable and active regions; this multi-level representation provides a fine description of uniform and non-uniform areas. LDTerP encodes texture information using directional information with a ternary pattern: the directional information reveals the edge pattern, while the ternary pattern describes the disparities between excited and smooth facial regions. LDTerP uses Robinson's symmetric mask to extract responses in four directions, identifies the two most dominant directions, and jointly calculates ternary patterns for both.
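To make the LBP encoding described above concrete, the following is a minimal sketch of the basic 3×3 LBP operator in Python (NumPy assumed). It illustrates only the threshold-and-pack idea; it is not the rotation-invariant or uniform variants used in some of the cited works.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 3x3 LBP: threshold the 8 neighbors of each pixel against the
    center pixel and pack the results into an 8-bit code (illustrative sketch)."""
    img = image.astype(np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Clockwise neighbor offsets starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes
```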

In recent years, deep learning techniques have performed very well in many applications such as face recognition, facial expression recognition, and object detection. Various state-of-the-art CNN architectures, such as VGG16 [25], VGG19 [25], and ResNet [26], have been proposed in the literature. These architectures can learn features themselves when provided with accurate labels. Several networks perform very well in FER, among them Deep Belief Networks (DBN) [27], where each layer extracts and learns features from a given image and predicts the expression class. Khorrami et al. [28] proposed a shallow CNN to extract AU-based features. Li et al. [29] introduced an enhancing- and cropping-based CNN that learns salient features from AU regions by detecting landmarks. Lopes et al. [30] created a large dataset by applying augmentation techniques and designed a CNN-based network to detect expressive features from facial images. Ding et al. [31] proposed a two-phase network: they first trained a network for facial features and then forwarded the updated neuron weights to a second network to extract micro-level expression features. Other CNN-based systems, such as DTAGN [32], DTAGN-Joint [33], Inception [34], Spatiotemporal [35], GCNet [36], encoder-decoder FER [37], MaNet [38], and AutoFERNet [39], have also driven rapid progress in facial expression recognition.

3 Proposed method

In this paper, we propose a new feature descriptor named Cross-Centroid Ripple Pattern (CRIP) for facial expression recognition. CRIP encodes cross-centroid information located at adjacent ripples to represent the spatial structure variation of facial expressions. These ripples are generated by dividing the local neighborhood region into subregions, represented by \(X_{\eta }\) and \(Y_{\eta }\)\(\left( {\eta \in \left\{ {0,1,...,7} \right\}} \right)\) as shown in Fig. 2. The X ripple is created by incorporating the center pixel with its neighboring pixels located at radius \(r_{1}\). Similarly, the Y ripple is elicited by encapsulating the inter-radial pixels located at radii \(r_{1}\) and \(r_{2}\). Centroid values are generated by computing the local mean of all magnitude values in a particular subregion. Since CRIP captures the averaging behavior of the subregion, the noise robustness of the proposed descriptor is enhanced. Moreover, CRIP efficiently encodes macro- and micro-structure by embedding extensive information from the pixels at radii \(r_{1}\) and \(r_{2}\), which makes the proposed descriptor invariant to view and pose changes in facial expressions. The final pattern is generated by establishing a cross-centroid relationship between ripples in the respective directions; these cross-centroid responses capture both intense and slight muscle distortions. Moreover, the CRIP pattern detects deep and narrow edge responses, which reveal the salient features of facial appearance and allow it to handle irregular illumination. The properties of CRIP are as follows: (1) the proposed descriptor categorically reveals information about dipped and embossed muscles on the face that are visible at the time of expression, which enables the identification of disparities between intensity changes; (2) CRIP extracts robust spatial information by embedding macro- and micro-structural information, which consequently yields better recognition accuracy; (3) the cross-centroid comparison between two adjacent ripples improves its capability to deal with irregular illumination changes, pose variations, and noisy conditions.

Fig. 2

Proposed CRIP descriptor: centroids are computed by partitioning the local region into sub-regions at radii \(r_{1}\) and \(r_{2}\). At each radius, centroids \(X(\eta )\) and \(Y(\eta )\) are generated, where \(\eta \in [0,7]\)

To calculate the CRIP patterns, we divide the local neighborhood region into subregions and then compute the centroids to establish an averaging relationship, as shown in Fig. 2. Let \(I(m,n)\) be a grayscale image of size \(Q \times N\), where \(m \in [1,Q]\) and \(n \in [1,N]\). If, at each location \(I_{C}\) in the image, p and q denote the number of neighboring pixels located at radii \(r_{1}\) and \(r_{2}\) respectively, then the pattern is computed using Eq. (1).

$$CRIP(c) = \sum\limits_{\eta = 0}^{p - 1} {f\left( {Y(\eta ) - X(\eta )} \right)} \times 2^{\eta }$$
(1)

The sign function \(f( \cdot )\) is computed using Eq. (2).

$$f\left(t\right)=\left\{\begin{array}{l}\begin{array}{cc}1&t\geq0\end{array}\\\begin{array}{cc}0&else\end{array}\end{array}\right.$$
(2)

The centroid values \(X_{\eta }\) and \(Y_{\eta }\)\(\left( {\eta \in \left\{ {0,1,...,7} \right\}} \right)\) for radii \(r_{1}\) and \(r_{2}\) are calculated using Eqs. (3)-(6).

$$X(\eta )=\frac{\left(\left(k(\eta )-1\right)\times {\sum }_{i=\eta -l(\eta )-2}^{\eta -l(\eta )-1}{I}_{{r}_{1},\mathit{mod}(i,p)}\right)+k(\eta )\times ({I}_{C}+{I}_{{r}_{1},\eta -l(\eta )})+\left({\sum }_{i=\eta -l(\eta )+1}^{\eta -l(\eta )+2}{I}_{{r}_{1},\mathit{mod}(i,p)}\right)}{4\times k(\eta )}$$
(3)
$$Y(\eta ) = \frac{{\sum\limits_{i = \eta + l(\eta ) - 1}^{\eta - l(\eta ) + 1} {I_{{r_{1} ,\bmod (i,p)}} + \sum\limits_{i = 2\eta - 1}^{2\eta + 1} {I_{{r_{2} ,\bmod (i,q)}} } } }}{{2\left( {2 \times k(\eta ) - \left\lfloor {k(\eta )/2} \right\rfloor } \right)}}$$
(4)
$$k(\eta ) = \bmod (\eta + 1,2) + 1$$
(5)
$$l(\eta ) = \bmod (\eta ,2)$$
(6)
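As an illustration of how Eqs. (1)-(6) could be realized, the sketch below computes a CRIP-style code for a single pixel. It is a simplified approximation, not the authors' implementation: the defaults \(r_{1}=1\), \(r_{2}=2\), the neighbor counts p = 8 and q = 16, the bilinear sampling, and the exact grouping of pixels into each centroid are assumptions made for illustration.

```python
import numpy as np

def _sample(img, y, x, radius, count):
    """Bilinearly sample `count` circular neighbors around (y, x)."""
    angles = 2 * np.pi * np.arange(count) / count
    vals = []
    for a in angles:
        yy, xx = y + radius * np.sin(a), x + radius * np.cos(a)
        y0, x0 = int(np.floor(yy)), int(np.floor(xx))
        fy, fx = yy - y0, xx - x0
        patch = img[y0:y0 + 2, x0:x0 + 2].astype(np.float64)
        w = np.array([[(1 - fy) * (1 - fx), (1 - fy) * fx],
                      [fy * (1 - fx), fy * fx]])
        vals.append(float((patch * w).sum()))
    return np.array(vals)

def crip_code(img, y, x, r1=1, r2=2, p=8, q=16):
    """Simplified CRIP code for one pixel: compare centroids of adjacent
    ripples (Eq. 1). The subregion averaging of Eqs. (3)-(6) is approximated
    here by local means around each direction; (y, x) must lie at least
    r2 + 1 pixels away from the image border."""
    c = float(img[y, x])
    n1 = _sample(img, y, x, r1, p)   # ripple X: radius-r1 neighbors
    n2 = _sample(img, y, x, r2, q)   # ripple Y: radius-r2 neighbors
    code = 0
    for eta in range(p):
        # X centroid: center pixel averaged with nearby radius-r1 neighbors.
        x_cent = np.mean([c, n1[(eta - 1) % p], n1[eta], n1[(eta + 1) % p]])
        # Y centroid: radius-r1 pixel averaged with adjacent radius-r2 pixels.
        y_cent = np.mean([n1[eta], n2[(2 * eta - 1) % q],
                          n2[2 * eta], n2[(2 * eta + 1) % q]])
        code |= int(y_cent - x_cent >= 0) << eta   # sign function f(.)
    return code
```

Applying crip_code to every non-border pixel would yield the response map that is block-partitioned in Sect. 3.1.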

3.1 Feature vector & classification

The feature response map is divided into non-overlapping blocks of identical size, and the final feature vector is generated by concatenating the block-level feature vectors. A response map of size \(u \times v\) is divided into \(n\) equal-sized \(\eta \times \eta\) non-overlapping blocks \((K_{1} ,K_{2} ,...,K_{n} )\). Thus, the numbers of blocks per row and per column are \(R_{\eta } = \left\lceil {u/\eta } \right\rceil\) and \(C_{\eta } = \left\lceil {v/\eta } \right\rceil\), respectively. Finally, the response map \(M_{R}\) is calculated using Eq. (7).

$$M_{R} = \left\{ {M(R,\lambda ,\varphi )} \right\}_{\varphi = 1}^{n}$$
(7)

where \(R\) is the generated feature map, \(\lambda \in [0,255]\) and \(M( \cdot , \cdot , \cdot )\) is computed using Eqs. (8)—(10).

$$M(R,\lambda ,\varphi ) = \sum\limits_{{i = \omega_{1} }}^{{\omega_{1} + B}} {\sum\limits_{{j = \omega_{2} }}^{{\omega_{2} + B}} {\sigma \left( {R(i,j) - \lambda } \right)} }$$
(8)
$$\omega_{1} = 1 + \left( {\left\lfloor {\varphi /R_{j} } \right\rfloor \times B} \right)$$
(9)
$$\omega_{2} = \left( {\bmod (\varphi - 1,R_{j} ) \times B} \right) + 1$$
(10)

where \(\omega_{1}\) and \(\omega_{2}\) are the row and column indices of each block and B is the block size. We perform classification using a support vector machine (SVM) [10]. SVM is a supervised machine learning technique that transforms a feature vector into a high-dimensional space using a linear or non-linear mapping. It then constructs the linear decision hyperplane with maximal margin and performs binary classification, separating the feature vectors into two classes in that space. Let \(T = \{ (A_{i} ,B_{i} ), i = 1,2,...,Z\}\) be a training set of labeled objects, where \(A_{i} \in \psi^{n}\) is a feature vector and \(B_{i} \in \{ + 1, - 1\}\). A new test object is classified using Eq. (11).

$$f(t) = sign\left( {\sum\limits_{i = 1}^{Z} {\lambda_{i} B_{i} \gamma (A_{i} ,t) + b} } \right)$$
(11)

where \(\lambda_{i}\) are the Lagrange multipliers of the dual optimization problem, \(\gamma\) is the kernel function, and b is the bias parameter. For multi-class classification, we adopt a one-against-one coding scheme and train \({}^{n}C_{2}\) binary SVM classifiers, where n is the total number of class labels. The final decision is taken by voting.
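A possible realization of the block-wise histogram features (Eqs. (7)-(10)) and the one-against-one SVM classification is sketched below. The block size, the number of histogram bins, and the RBF kernel with C = 10 are illustrative assumptions; scikit-learn's SVC is used because it trains the same \({}^{n}C_{2}\) one-against-one binary classifiers with voting.

```python
import numpy as np
from sklearn.svm import SVC

def block_histogram_features(code_map, block_size=16, n_bins=256):
    """Split a CRIP response map into non-overlapping blocks and concatenate
    the per-block histograms of code values (simplified form of Eqs. 7-10)."""
    u, v = code_map.shape
    feats = []
    for r0 in range(0, u, block_size):
        for c0 in range(0, v, block_size):
            block = code_map[r0:r0 + block_size, c0:c0 + block_size]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist)
    return np.concatenate(feats).astype(np.float32)

# SVC trains the one-against-one scheme (nC2 binary classifiers) with voting.
clf = SVC(kernel="rbf", C=10.0, gamma="scale", decision_function_shape="ovo")
# clf.fit(train_features, train_labels); clf.predict(test_features)
```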

4 Experimental results and discussion

4.1 Comparison with existing descriptor

In the literature, various state-of-the-art approaches have been proposed to extract salient features from facial images. Recently, directional edge-response-based approaches [7, 19,20,21] have achieved high classification accuracy. These methods use directional information calculated by applying an eight-directional compass mask (Sobel, Kirsch, or Robinson) and encode features from the directional index positions in the local neighborhood. Existing descriptors can describe the features of still, posed, frontal-view expressions. However, in real-life applications, people express their emotions spontaneously, and faces may be recorded from multiple views. Thus, a feature descriptor that can also handle non-frontal views will yield better results for side-view and spontaneous expression recognition. Therefore, to improve the performance of FER systems, we propose the CRIP descriptor, which can extract expression features in such situations. The comparison with existing FER techniques can be summarized as follows.

  1.

    The directional descriptors capture only extreme edge variations in the local neighborhood, which may lead to information loss in the facial region and thereby degrade the performance of the FER system. In contrast, CRIP utilizes gradient information between centroids located at different radii. Thus, the CRIP descriptor extracts efficient transitional pattern information to describe macro- and micro-structure variations, which strengthens its capability to discriminate expression classes in both side and frontal views, as shown in Figs. 3 and 4 respectively.

    Fig. 3

    Qualitative comparison between the feature maps generated by the proposed CRIP and other state-of-the-art descriptors on a side-view image belonging to the neutral and surprise emotion classes. Colored boxes highlight the expressive regions of the facial appearance, which play a major role in defining the disparities between different expression classes

    Fig. 4

    An input anger expression and response feature maps encoded by (a) LBP, (b) Two-Phase, (c) LDP, (d) LDN, (e) LDTexP, (f) LDTerP, and (g) CRIP respectively

  2.

    In the CRIP descriptor, the gradient information of inter-radial centroids elicited from different sub-regions captures dipped and embossed edges, which enhances robustness to inter-class and intra-class variations compared to LDTexP and LDTerP. The qualitative performance of the descriptors is depicted in Fig. 5, which shows three expression images: two from the anger class and one from the sad class. A particular area of each image is selected to analyze the behavior of the directional descriptors (LDTexP and LDTerP) against the proposed descriptor. We extracted the response images along with histograms of the active patches, and computed the Euclidean distance between the response feature maps to validate the inter- and intra-class performance of CRIP. From the computed feature maps and corresponding distances, it is clear that the CRIP descriptor groups the two anger images into the same class and separates the sad image into a different class.

    Fig. 5

    The encoded responses and feature vectors of three images from two emotion classes (anger and sad), computed using LDTexP, LDTerP and CRIP respectively. The Euclidean distance between (a) and (b) is smaller, and between (a) and (c) larger, for CRIP compared to the other descriptors

  3.

    The robustness of CRIP to illumination variations, compared with state-of-the-art techniques, is depicted in Fig. 6. The cross-centroid relationship allows CRIP to capture minute edge variations, which improves the feature map quality under various lighting conditions compared with other state-of-the-art approaches.

    Fig. 6

    The input images and coded responses generated by applying the state-of-the-art descriptors LDP, LDN, LDTexP, LDTerP, QUEST, and the proposed CRIP. The images are captured using (a) NIR and (b-d) VIS cameras in various lighting conditions: dark, strong, and weak respectively

  4.

    CRIP utilizes the averaging behavior of pixels in the local neighborhood to preserve edge variations in low-resolution (blurred) images. Figure 7 demonstrates the response maps of a happy expression at three different resolutions, computed using the LBP, LDP, LDN, LDTexP, LDTerP, QUEST, and CRIP descriptors. From Fig. 7, it is visible that almost all existing methods lose prominent features in low-resolution images compared to CRIP, which degrades the performance of the FER system.

    Fig. 7

    Spontaneous happy emotion images, captured at three different resolutions, and their coded responses using the LBP, LDP, LDN, LDTexP, LDTerP, QUEST and CRIP descriptors respectively

  5.

    CRIP embeds an extensive local neighborhood of radius \(r_{2}\) instead of an exclusive neighborhood of radius \(r_{1}\) as in QUEST, which elicits more pattern variations and improves the quality of the response features.

Recently, convolutional neural network (CNN) based models [25, 26] have achieved high accuracy in FER. Conventional CNN models consist of a series of convolution, ReLU, and pooling layers that together form a hierarchical network able to extract the desired expression features; the extracted features are fed into a soft-max layer for classification. Although recent deep learning models show remarkable performance in image classification, these networks also have limitations: (1) they require large datasets to achieve the desired results; (2) predefined hyperparameters play an important role in network learning, and a small change in a hyperparameter can affect the overall learning process and, indirectly, the performance. Parameter tuning is itself a difficult task and sometimes requires numerous experiments.

To evaluate the performance of the CRIP encoding scheme, we conducted seven experiments on the benchmark expression datasets MUG [40], MMI [41, 42], Oulu-CASIA [43], ISED [44], AFEW [45, 46], the FER challenge [47], and a combined dataset (MUG, MMI, ISED, AFEW). We tested the CRIP descriptor under different conditions: time-lapse, ethnicity, and illumination variations. We also validated the proposed approach with and without neutral images, creating seven- and six-class problems respectively. Existing approaches [22,23,24] generally detect faces manually and then crop them for further processing. In the proposed method, however, we use the Viola-Jones [48] detection algorithm to extract the faces and then normalize them to equal-sized images. This procedure reflects a real-life scenario in which the position and background of the images are not static, and it enhances the performance of the FER system.
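The face extraction step can be reproduced with OpenCV's pretrained Viola-Jones Haar cascade, as in the sketch below; the 128 × 128 normalization size and the largest-detection heuristic are assumptions, not values taken from the paper.

```python
import cv2

# Viola-Jones face detection with OpenCV's pretrained frontal-face Haar cascade.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_and_normalize(gray_image, size=(128, 128)):
    """Detect the face, crop it, and resize to a fixed size (sketch)."""
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest detection
    return cv2.resize(gray_image[y:y + h, x:x + w], size)
```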

Furthermore, the resulting feature maps are partitioned into equal-sized blocks. Using the same block size for all databases, instead of selecting different block sizes per database, reduces the overhead of parameter selection and makes the FER system more generally applicable. It also ensures a fair comparative analysis between the proposed and existing descriptors. In our experimental setup, we created our datasets by selecting three to five peak frames from each video sequence; most existing approaches [8, 22,23,24] follow the same setup. In the literature, two validation schemes are used to measure the performance of an FER system: person-dependent and person-independent cross-validation.

In person-dependent cross-validation, the dataset is randomly partitioned into N subsets, of which N-1 serve as the training set and one as the testing set. In person-independent cross-validation, image sets are grouped by subject: at a given iteration, all expressions of particular subjects are kept out of the training set and used as the testing set. In this paper, we validate the performance of the proposed approach using both schemes.

Person-dependent scheme

In the person-dependent (PD) scheme, we arrange the training and testing sets by splitting each dataset 80:20. Since the partition into training and testing images is random, we repeat the same experiment for 5 iterations and take the average as the final recognition rate, to make the analysis of the outcomes fair. The recognition rate is computed using the following equation.

$$Recognition\,Accuracy=\frac{Total\,no.\,of\,correctly\,predicted\,samples}{Total\,no.\,of\,samples}\times100$$
(12)
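A minimal sketch of this PD protocol, assuming a feature matrix X and label vector y as NumPy arrays; the stratified splits and the RBF-kernel SVM settings are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def person_dependent_accuracy(X, y, runs=5, test_size=0.2, seed=0):
    """Person-dependent protocol: random 80:20 splits repeated `runs` times;
    the mean accuracy is reported as the final recognition rate (Eq. 12)."""
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + r, stratify=y)
        clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
        scores.append(100.0 * np.mean(clf.predict(X_te) == y_te))
    return float(np.mean(scores))
```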

Person-independent scheme

In the person-independent (PI) scheme, we adopt two methods: tenfold subject-independent and leave-one-subject-out cross-validation. In the tenfold subject-independent method, subjects are divided into 10 folds according to their IDs. In our experimental setup, we use the tenfold subject-independent method for the MMI and ISED datasets; since not all expressions of every subject are available in these datasets, this is the more suitable scheme. In the leave-one-subject-out method, only one subject's expressions form the testing set and the remaining subjects' expressions are used for training. For the MUG and OULU-CASIA datasets, we follow the leave-one-subject-out scheme, as all expressions are provided for each subject.
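The PI protocols map directly onto scikit-learn's group-aware splitters; the sketch below assumes a subject-id array aligned with the samples and uses GroupKFold for the tenfold subject-independent scheme (LeaveOneGroupOut would give leave-one-subject-out).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut
from sklearn.svm import SVC

def person_independent_accuracy(X, y, subject_ids, n_folds=10):
    """Person-independent protocol: folds are formed by subject id, so no
    subject appears in both training and testing. n_folds=10 gives the
    tenfold subject-independent scheme; replace GroupKFold(n_folds) with
    LeaveOneGroupOut() for the leave-one-subject-out scheme."""
    splitter = GroupKFold(n_splits=n_folds)
    scores = []
    for tr, te in splitter.split(X, y, groups=subject_ids):
        clf = SVC(kernel="rbf", gamma="scale").fit(X[tr], y[tr])
        scores.append(100.0 * np.mean(clf.predict(X[te]) == y[te]))
    return float(np.mean(scores))
```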

4.2 Experiment on MUG dataset

The Multimedia Understanding Group built the MUG [40] dataset with high-resolution images. It consists of 86 subjects showing various expressions: 35 females and 51 males, all of Caucasian origin, aged between 20 and 35 years. We take 5 peak images from all the sequences of the available subjects. For person-dependent experiments, we arrange a total of 1761 images: 260 anger, 250 disgust, 241 fear, 255 happy, 245 sad, 255 surprise, and 255 neutral. For person-independent experiments, we selected 1540 facial expression images (anger: 220, disgust: 220, fear: 220, happy: 220, sad: 220, surprise: 220, neutral: 220). The recognition rates of our method and other methods on the MUG dataset are shown in Table 1, from which we can state that our method outperforms the state-of-the-art methods in both PD and PI setups. CRIP also outperforms VGG16, VGG19, and ResNet50 in the PD experiments. Specifically, our proposed method yields performance improvements of 1.5%, 0.8%, 0.3% and 1.0%, 0.4%, 0.2% over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively in PD experiments. Similarly, the proposed CRIP outperforms LDTexP, LDTerP, and QUEST by 1.5%, 3.4%, 0.7% and 0.2%, 0.8%, 1.3% for the 6-class and 7-class problems respectively in PI experiments.

Table 1 Recognition accuracy comparison on MUG and MMI dataset

4.3 Experiment on MMI dataset

The MMI [41, 42] expression dataset has more than 2900 samples, containing both videos and still images. Each sequence starts and ends with a neutral expression, with the peak of a particular expression in between. All images are captured in frontal view. There are multiple sessions per person, with and without glasses. MMI consists of 236 sessions recorded for 32 subjects. We selected the frame sequences of 30 subjects for our experiments. Each sequence is categorized into one of the expression classes: anger, disgust, fear, happy, sad, or surprise. We manually crop the images of the MMI dataset. For the person-dependent and person-independent experiments we follow a similar setup, arranging a total of 561 images: 69 anger, 71 disgust, 69 fear, 84 happy, 67 sad, 75 surprise, and 126 neutral. The recognition rates on the MMI dataset for both 6-class and 7-class problems are shown in Table 1, which makes it clear that the proposed descriptor outperforms other existing handcrafted approaches for facial expression recognition. In particular, the proposed method improves performance by 4.2%, 7.0%, 0.8% and 0.6%, 0.6%, 2.6% over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively in PD experiments. Similarly, in PI experiments, CRIP achieves 1.4%, 4.3%, 1.1% and 0.5%, 2.2%, 0.7% performance improvements over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively. To delineate the class-wise recognition accuracy, we further present the confusion matrices for the 6- and 7-class problems in the PD setup in Figs. 8 and 9 respectively.

Fig. 8

Confusion matrix of CRIP for 6-class expression classification in MMI dataset

Fig. 9

Confusion matrix of CRIP for 7-class expression classification in MMI dataset

4.4 Experiment on OULU-CASIA dataset

The OULU-CASIA [43] dataset consists of 80 subjects, aged between 23 and 58 years, representing the six basic expressions: anger, disgust, fear, happy, sad, and surprise. All images in the dataset are frontal. The expression images are captured using NIR and VIS cameras under 3 different lighting conditions: strong, dark, and weak. In the strong condition appropriate lighting is used, in the weak condition only the computer display light is used, and in the dark condition images are taken in a near-dark environment. For our experimental purposes, we select only 3 peak images from each sequence; the neutral expression is created by taking the initial frames of the sequences, before the subjects start showing any emotion.

A total of 2880 video sequences are present in the dataset, of which 480 image sequences belong to each illumination condition. We arrange images for all 3 illumination conditions. The numbers of images selected per expression are: anger – 240, disgust – 240, fear – 240, happy – 240, sad – 240, surprise – 240, and neutral – 240. The same applies to both NIR and VIS camera images. To validate the robustness of CRIP, we show a qualitative comparison between the proposed descriptor and other state-of-the-art feature descriptors in Fig. 6. From Fig. 6 we can see that almost all methods perform well under normal lighting conditions. However, in the presence of strong and weak illumination, the proposed method extracts discriminative features much more conclusively than the other descriptors.

The quantitative results of the proposed method and other approaches on the NIR dataset for the 6-class and 7-class problems are given in Tables 2, 3, 4, and 5. The evaluated results clearly show that the proposed descriptor outperforms other state-of-the-art approaches. More specifically, the proposed descriptor achieves performance gains of 0.8%, 0.9%, 0.6% and 1.0%, 0.5%, 0.3% for the 6-class and 7-class problems compared to LDTexP, LDTerP, and QUEST respectively in the PD setup. In the PI setup, CRIP secures 1.3%, 10.2%, 0.8% and 1.5%, 15.4%, 1.6% performance improvements over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively. The results on the VIS dataset are also given in Tables 2, 3, 4, and 5. In the PD setup, the proposed method outperforms LDTexP, LDTerP, and QUEST by 2.6%, 1.3%, 0.8% and 0.6%, 1.3%, 0.3% for the 6-class and 7-class problems respectively. Similarly, in the PI setup, CRIP achieves 7.1%, 9.1%, 1.5% and 6.2%, 10.2%, 0.6% performance gains compared to LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively.

Table 2 Recognition accuracy comparison on OULU-NIR dataset for 6-class expressions in both Person Dependent and Independent Setup
Table 3 Recognition accuracy comparison on OULU-NIR dataset for 7-class expressions in both Person Dependent and Independent Setup
Table 4 Recognition accuracy comparison on OULU-VIS dataset for 6-class expressions in both Person Dependent and Independent Setup
Table 5 Recognition accuracy comparison on OULU-VIS dataset for 7-class expressions in both Person Dependent and Independent Setup

4.5 Experiment on ISED dataset

The ISED [44] dataset includes 50 culturally diverse Indian subjects with 428 video sequences. Of the 50 subjects, 29 are male and 21 are female, all aged between 18 and 22 years. The dataset was created with spontaneous emotions, i.e., the subjects were not asked to pose particular expressions; all spontaneous expressions were captured without the subjects' awareness. Multiple sessions of the same emotion were recorded at a high resolution of 1920 × 1080. Expressions are categorized into 4 classes: disgust, happy, sad, and surprise. The neutral class was created from the onset frames of these sessions, before the subjects started showing any expression. All captures were taken under weak illumination conditions. Finally, we selected 730 total image frames: 234 disgust, 294 happy, 174 sad, 189 surprise, and 295 neutral. The comparative results of our proposed method and other approaches are given in Table 6, from which we can see that the proposed method achieves superior results and extracts discriminative features better than other state-of-the-art approaches. In particular, it surpasses the accuracy of LDTexP, LDTerP, and QUEST by 4.0%, 7.0%, 3.1% and 7.4%, 7.0%, 4.0% for the 4-class and 5-class expression recognition problems in the PD setup. Similarly, in the PI setup, the proposed descriptor attains 1.1%, 2.5%, 0.4% and 3.5%, 0.02%, 1.4% performance improvements compared to LDTexP, LDTerP, and QUEST for the 4-class and 5-class problems respectively. This shows that our method is effective even at recognizing spontaneous expressions.

Table 6 Recognition accuracy comparison on ISED dataset

4.6 Experiment on AFEW dataset

AFEW [45, 46] contains 957 video clips of in-the-wild emotions across all age groups. These videos are extracted from different movies to encapsulate real-world scenarios. The AFEW database comprises various real-world challenges: spontaneous facial expressions, head pose variations, occlusions, varying face resolutions, and illumination changes. In total, AFEW holds 10,809 images covering the seven basic expression classes: angry, disgust, fear, happy, sad, surprise, and neutral. In our setup, we selected the 10 to 15 most expressive images to validate the performance of the proposed method. The comparative results of CRIP and other state-of-the-art approaches on AFEW are tabulated in Table 7. Specifically, our proposed method yields 8.0%, 1.9%, 1.2% and 6%, 1.3%, 0.9% higher accuracy rates for the 6- and 7-class problems compared to LDTexP, LDTerP, and QUEST respectively.

Table 7 Recognition accuracy comparison on AFEW, COMPOSITE and FER13 dataset

4.7 Experiment on composite dataset

In this experiment, we create a larger dataset by combining the MUG, MMI, ISED, and AFEW datasets. This gives us a greater number of images to train the model and provides a diverse pool of subjects for designing and evaluating the FER system. The resultant dataset consists of 14,242 images: 1856 anger, 1700 disgust, 1497 fear, 2761 happy, 2265 sad, 1445 surprise, and 2718 neutral. Table 7 presents the evaluated results for the proposed descriptor and other approaches. The proposed method obtains 4.8%, 11.9%, 1.5% and 6.9%, 3.5%, 2.3% higher recognition accuracy compared to LDTexP, LDTerP, and QUEST for 6-class and 7-class expression recognition respectively.

4.8 Experiment on FER-2013 dataset

The FER challenge [47] was organized to promote the development of feature extraction and learning techniques for automated facial expression recognition. The dataset consists of grayscale images of 48 × 48 resolution, partitioned into a training set (28,709 images), a public test set (3,589 images), and a private test set (3,589 images). In our experiments, we computed the results for the 6-class and 7-class problems on both the public and private test sets; the comparative results are shown in Table 7. From Table 7, we can see that CRIP outperforms the existing state-of-the-art feature descriptors. More specifically, it attains 8.6%, 15.0%, 10.7% and 8.2%, 13.6%, 10.4% performance improvements compared to LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems on the public test set. Similarly, it outperforms LDTexP, LDTerP, and QUEST by 8.1%, 15.8%, 8.9% and 16%, 8.4% for the 6-class and 7-class problems on the private test set.

5 Conclusion

In this paper, we proposed a new feature descriptor, the Cross-Centroid Ripple Pattern, which encodes image features using inter-radial ripples. The ripple centroids are computed by establishing a relationship between neighboring subspaces in the local neighborhood. Cross-centroid thresholding extracts deep edge variations, which improves the capability to distinguish expressions belonging to different emotion classes. The cross-centroid relationship also enhances robustness to pose and illumination variation by highlighting deep edge features. Moreover, averaging pixel information in local neighborhoods improves the performance of CRIP in both frontal and side views. Experimental results show the effectiveness of the proposed method, as it achieves better recognition rates than other state-of-the-art approaches.