1 Introduction

Facial expressions convey human emotion: emotion is reflected directly in facial regions and gives cues about a person's feelings at that moment. Expressions are often more effective than verbal dialogue for perceiving someone's feelings. Each expression represents a unique pattern on the face produced by muscular variations in different facial regions. With the rapid evolution of computer vision and machine learning techniques, many state-of-the-art facial expression recognition (FER) systems have been proposed in the literature. Automatic facial expression recognition has applications in fields such as depression analysis, e-learning, medical diagnosis, law enforcement, and gaming. In real-world scenarios, facial images are captured from multiple angles with spontaneous interactions in challenging environments involving illumination changes and noise. These factors make facial expression recognition a challenging task. Thus, there is a need for a robust FER system that can handle these challenging scenarios.

The concept of universal expressions was introduced by Paul Ekman [1], who classified expressions into six major categories: anger, disgust, fear, happiness, sadness, and surprise. Furthermore, Ekman and Friesen built the Facial Action Coding System [2], which standardized the recognition procedure by dividing the face into action units (AUs). These AUs are categorized according to the movement of a facial muscle or a set of muscles.

An automated FER system usually consists of three steps: preprocessing, feature extraction, and expression classification. In the preprocessing step, the facial region is extracted or cropped from the image. This is important because elements unrelated to expression (e.g., the image background) can create confusion during recognition. After preprocessing, features are extracted from the processed images. This is the most important step in any facial expression recognition approach: a good feature extractor strengthens the FER system and improves accuracy. Feature extraction techniques can be divided into two categories, predesigned and learning-based. In predesigned techniques, handcrafted filters are designed to extract the features. These can be further divided into geometric and appearance-based approaches. In geometry-based recognition, fiducial points are selected at active areas such as the eyebrows, eyes, nose, and lips. With the help of these points, the shape of and distances between active areas are computed and the final features are extracted. Many models based on the shape and size of active points are available in the literature. The overall performance of this type of model depends on selecting facial points with pinpoint accuracy, which is difficult because facial shape varies from expression to expression. In appearance-based approaches, features are extracted by applying a handcrafted filter over the complete image or a part of the image, called a patch. Many filters, such as Gabor wavelets [3], Elastic Bunch Graph Matching [4], Local Binary Patterns (LBP) [5,6,7], Local Directional Pattern (LDP) [8], and PHOG and LPQ [9], are available for this purpose. These descriptors encode texture and gradient magnitude information to represent the microstructure of facial appearance. However, LBP and its variants are not robust to noisy and multi-view pose conditions. After feature extraction, expression classification comes into play: it assigns the feature vector to one of the six expression classes. Various classification models, such as Support Vector Machines (SVM) [10], K-Nearest Neighbors (KNN) [11], Hidden Markov Models (HMM) [12], and AdaBoost, are available for expression classification. In learning-based approaches, features are learned during training; feature extraction and classification are usually performed jointly in a pipeline architecture.

In this paper, we propose the Cross-Centroid Ripple Pattern (CRIP) for robust facial expression recognition, which encodes intensity variation using centroid values of adjacent ripples. The centroids of each ripple are calculated by establishing an averaging relationship between the center pixel and its neighboring pixels located at radii \(r_{1}\) and \(r_{2}\), respectively. Thus, CRIP embeds both micro and macro structures to represent spatial variations and enhances discriminability between expression classes. It improves the performance of the descriptor under various pose conditions, as it extracts deep edge responses from the face structure. The averaging relationship between pixels provides more comprehensive texture information and adds robustness to noise and image blur. Furthermore, gradient information between ripple centroids extracts minute intensity variations and improves the descriptor's ability to deal with illumination variations. The workflow of the proposed FER system is shown in Fig. 1.

Fig. 1

Working block diagram of the proposed FER system: The facial region is cropped from the original image and features are extracted by applying the CRIP descriptor. Further, the response map is divided into blocks to compute the feature vector. The feature vector is fed to the classifier for emotion recognition

2 Related work

In the literature, FER techniques are generally divided into two categories: geometric and appearance-based. Zhang et al. [14] proposed a geometric approach that uses the positions of 34 facial points to extract features. Another commonly used facial expression classification method is the Facial Action Coding System (FACS), where expressions are recognized by extracting AUs. Valstar et al. [15] introduced an approach in which AUs are detected by extracting particular facial points.

Appearance-based features are extracted by applying a filter or filter bank to the whole face image or to local regions of the face, capturing physical changes in facial appearance. Many appearance-based techniques, such as Eigenfaces [16], LDA [17], ICA [18], PCA [19], IDA [20, 21], Gabor wavelets [3], and LBP [5], have been proposed in the literature to extract local features that represent the physical appearance of the face. The Local Binary Pattern (LBP) is the most widely acknowledged approach and has been used effectively in many domains. LBP uses the first-order derivative between the neighborhood and reference pixels to describe texture variation: the basic LBP descriptor maps each intensity value of an image to a new value by thresholding each neighboring pixel against the reference pixel. LBP has gained significant popularity as an illumination-tolerant and computationally simple texture measure. It performs well in extracting micro information about edges, spots, and other relevant features by analyzing intensity changes around gray levels, but it is not robust to random noise and rotation variations. To overcome these issues, Jabid et al. [7] proposed the LDP descriptor, which uses directional information instead of raw intensity values. To extract the directional numbers, LDP computes edge responses in eight directions using the Kirsch mask, keeps the three most prominent direction indices as ones, and sets the remaining bits to zero. The use of edge responses led to good facial expression recognition performance, as they are less sensitive to illumination changes and noise. However, LDP still struggles with non-monotonic illumination variation. To improve feature descriptors further, Ramirez et al. [22] used local directional numbers (LDN). The LDN descriptor assigns a six-bit code to each pixel by extracting the strongest positive and negative edge responses. Rivera et al. [23] improved this directional descriptor by adding texture information and proposed LDTexP, which exploits both the directional and intensity information of the reference neighborhood. Ryu et al. [24] proposed LDTerP, which encodes facial expressions by dividing the face into stable and active regions; this multi-level representation provides a fine description of uniform and non-uniform areas. LDTerP encodes texture information using directional information with a ternary pattern: the directional information reveals the edge pattern, while the ternary pattern describes the disparities between excited and smooth facial regions. LDTerP uses Robinson's symmetric mask to extract responses in four directions, identifies the two most dominant directions, and jointly calculates ternary patterns for both.
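To make the LBP encoding described above concrete, the following is a minimal sketch of the basic 3×3 LBP operator in Python (NumPy assumed). It illustrates only the threshold-and-pack idea; it is not the rotation-invariant or uniform variants used in some of the cited works.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 3x3 LBP: threshold the 8 neighbors of each pixel against the
    center pixel and pack the results into an 8-bit code (illustrative sketch)."""
    img = image.astype(np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Clockwise neighbor offsets starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes
```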

In recent years, deep learning techniques have performed very well in many applications such as face recognition, facial expression recognition, and object detection. Various state-of-the-art CNN architectures, such as VGG16 [25], VGG19 [25], and ResNet [26], have been proposed in the literature. These architectures can learn features themselves when provided with accurate labels. Several networks perform very well in FER, among them Deep Belief Networks (DBN) [27], where each layer extracts and learns features from a given image and predicts the expression class. Khorrami et al. [28] proposed a shallow CNN to extract AU-based features. Li et al. [29] introduced an enhancing- and cropping-based CNN that learns salient features from AU regions by detecting landmarks. Lopes et al. [30] created a large dataset by applying augmentation techniques and designed a CNN-based network to detect expressive features from facial images. Ding et al. [31] proposed a two-phase network: they first trained a network for facial features and then forwarded the updated neuron weights to a second network to extract micro-level expression features. Other CNN-based systems, such as DTAGN [32], DTAGN-Joint [33], Inception [34], Spatiotemporal [35], GCNet [36], encoder-decoder FER [37], MaNet [38], and AutoFERNet [39], have also driven rapid progress in facial expression recognition.

3 Proposed method

In this paper, we propose a new feature descriptor named Cross-Centroid Ripple Pattern (CRIP) for facial expression recognition. CRIP encodes cross-centroid information located at adjacent ripples to represent the spatial structure variation of facial expressions. These ripples are generated by dividing the local neighborhood region into subregions, represented by \(X_{\eta }\) and \(Y_{\eta }\)\(\left( {\eta \in \left\{ {0,1,...,7} \right\}} \right)\) as shown in Fig. 2. The X ripple is created by incorporating the center pixel with its neighboring pixels located at radius \(r_{1}\). Similarly, the Y ripple is elicited by encapsulating the inter-radial pixels located at radii \(r_{1}\) and \(r_{2}\). Centroid values are generated by computing the local mean of all magnitude values in a particular subregion. Since CRIP captures the averaging behavior of the subregion, the noise robustness of the proposed descriptor is enhanced. Moreover, CRIP efficiently encodes macro- and micro-structure by embedding extensive information from the pixels at radii \(r_{1}\) and \(r_{2}\), which makes the proposed descriptor invariant to view and pose changes in facial expressions. The final pattern is generated by establishing a cross-centroid relationship between ripples in the respective directions; these cross-centroid responses capture both intense and slight muscle distortions. Moreover, the CRIP pattern detects deep and narrow edge responses, which reveal the salient features of facial appearance and allow it to handle irregular illumination. The properties of CRIP are as follows: (1) the proposed descriptor categorically reveals information about dipped and embossed muscles on the face that are visible at the time of expression, which enables the identification of disparities between intensity changes; (2) CRIP extracts robust spatial information by embedding macro- and micro-structural information, which consequently yields better recognition accuracy; (3) the cross-centroid comparison between two adjacent ripples improves its capability to deal with irregular illumination changes, pose variations, and noisy conditions.

Fig. 2

Proposed CRIP descriptor: centroids are computed by partitioning the local region into sub-regions at radii \(r_{1}\) and \(r_{2}\). At each radius, centroids \(X(\eta )\) and \(Y(\eta )\) are generated, where \(\eta \in [0,7]\)

To calculate the CRIP patterns, we divide the local neighborhood region into subregions and then compute the centroids to establish an averaging relationship, as shown in Fig. 2. Let \(I(m,n)\) be a grayscale image of size \(Q \times N\), where \(m \in [1,Q]\) and \(n \in [1,N]\). If, at each location \(I_{C}\) in the image, p and q denote the number of neighboring pixels located at radii \(r_{1}\) and \(r_{2}\) respectively, then the pattern is computed using Eq. (1).

$$CRIP(c) = \sum\limits_{\eta = 0}^{p - 1} {f\left( {Y(\eta ) - X(\eta )} \right)} \times 2^{\eta }$$
(1)

The sign function \(f( \cdot )\) is computed using Eq. (2).

$$f\left(t\right)=\left\{\begin{array}{l}\begin{array}{cc}1&t\geq0\end{array}\\\begin{array}{cc}0&else\end{array}\end{array}\right.$$
(2)

The centroid values \(X_{\eta }\) and \(Y_{\eta }\)\(\left( {\eta \in \left\{ {0,1,...,7} \right\}} \right)\) for radii \(r_{1}\) and \(r_{2}\) are calculated using Eqs. (3)-(6).

$$X(\eta )=\frac{\left(\left(k(\eta )-1\right)\times {\sum }_{i=\eta -l(\eta )-2}^{\eta -l(\eta )-1}{I}_{{r}_{1},\mathit{mod}(i,p)}\right)+k(\eta )\times ({I}_{C}+{I}_{{r}_{1},\eta -l(\eta )})+\left({\sum }_{i=\eta -l(\eta )+1}^{\eta -l(\eta )+2}{I}_{{r}_{1},\mathit{mod}(i,p)}\right)}{4\times k(\eta )}$$
(3)
$$Y(\eta ) = \frac{{\sum\limits_{i = \eta + l(\eta ) - 1}^{\eta - l(\eta ) + 1} {I_{{r_{1} ,\bmod (i,p)}} + \sum\limits_{i = 2\eta - 1}^{2\eta + 1} {I_{{r_{2} ,\bmod (i,q)}} } } }}{{2\left( {2 \times k(\eta ) - \left\lfloor {k(\eta )/2} \right\rfloor } \right)}}$$
(4)
$$k(\eta ) = \bmod (\eta + 1,2) + 1$$
(5)
$$l(\eta ) = \bmod (\eta ,2)$$
(6)
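As an illustration of how Eqs. (1)-(6) could be realized, the sketch below computes a CRIP-style code for a single pixel. It is a simplified approximation, not the authors' implementation: the defaults \(r_{1}=1\), \(r_{2}=2\), the neighbor counts p = 8 and q = 16, the bilinear sampling, and the exact grouping of pixels into each centroid are assumptions made for illustration.

```python
import numpy as np

def _sample(img, y, x, radius, count):
    """Bilinearly sample `count` circular neighbors around (y, x)."""
    angles = 2 * np.pi * np.arange(count) / count
    vals = []
    for a in angles:
        yy, xx = y + radius * np.sin(a), x + radius * np.cos(a)
        y0, x0 = int(np.floor(yy)), int(np.floor(xx))
        fy, fx = yy - y0, xx - x0
        patch = img[y0:y0 + 2, x0:x0 + 2].astype(np.float64)
        w = np.array([[(1 - fy) * (1 - fx), (1 - fy) * fx],
                      [fy * (1 - fx), fy * fx]])
        vals.append(float((patch * w).sum()))
    return np.array(vals)

def crip_code(img, y, x, r1=1, r2=2, p=8, q=16):
    """Simplified CRIP code for one pixel: compare centroids of adjacent
    ripples (Eq. 1). The subregion averaging of Eqs. (3)-(6) is approximated
    here by local means around each direction; (y, x) must lie at least
    r2 + 1 pixels away from the image border."""
    c = float(img[y, x])
    n1 = _sample(img, y, x, r1, p)   # ripple X: radius-r1 neighbors
    n2 = _sample(img, y, x, r2, q)   # ripple Y: radius-r2 neighbors
    code = 0
    for eta in range(p):
        # X centroid: center pixel averaged with nearby radius-r1 neighbors.
        x_cent = np.mean([c, n1[(eta - 1) % p], n1[eta], n1[(eta + 1) % p]])
        # Y centroid: radius-r1 pixel averaged with adjacent radius-r2 pixels.
        y_cent = np.mean([n1[eta], n2[(2 * eta - 1) % q],
                          n2[2 * eta], n2[(2 * eta + 1) % q]])
        code |= int(y_cent - x_cent >= 0) << eta   # sign function f(.)
    return code
```

Applying crip_code to every non-border pixel would yield the response map that is block-partitioned in Sect. 3.1.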

3.1 Feature vector & classification

The feature response map is divided into non-overlapping blocks of identical size, and the final feature vector is generated by concatenating the block-level feature vectors. A response map of size \(u \times v\) is divided into \(n\) equal-sized \(\eta \times \eta\) non-overlapping blocks \((K_{1} ,K_{2} ,...,K_{n} )\). Thus, the numbers of blocks per row and per column are \(R_{\eta } = \left\lceil {u/\eta } \right\rceil\) and \(C_{\eta } = \left\lceil {v/\eta } \right\rceil\), respectively. Finally, the response map \(M_{R}\) is calculated using Eq. (7).

$$M_{R} = \left\{ {M(R,\lambda ,\varphi )} \right\}_{\varphi = 1}^{n}$$
(7)

where \(R\) is the generated feature map, \(\lambda \in [0,255]\) and \(M( \cdot , \cdot , \cdot )\) is computed using Eqs. (8)—(10).

$$M(R,\lambda ,\varphi ) = \sum\limits_{{i = \omega_{1} }}^{{\omega_{1} + B}} {\sum\limits_{{j = \omega_{2} }}^{{\omega_{2} + B}} {\sigma \left( {R(i,j) - \lambda } \right)} }$$
(8)
$$\omega_{1} = 1 + \left( {\left\lfloor {\varphi /R_{j} } \right\rfloor \times B} \right)$$
(9)
$$\omega_{2} = \left( {\bmod (\varphi - 1,R_{j} ) \times B} \right) + 1$$
(10)

where \(\omega_{1}\) and \(\omega_{2}\) are the row and column indices of each block and B is the block size. We perform classification using a support vector machine (SVM) [10]. SVM is a supervised machine learning technique that transforms a feature vector into a high-dimensional space using a linear or non-linear mapping. It then constructs the linear decision hyperplane with maximal margin and performs binary classification, separating the feature vectors into two classes in that space. Let \(T = \{ (A_{i} ,B_{i} ), i = 1,2,...,Z\}\) be a training set of labeled objects, where \(A_{i} \in \psi^{n}\) is a feature vector and \(B_{i} \in \{ + 1, - 1\}\). A new test object is classified using Eq. (11).

$$f(t) = sign\left( {\sum\limits_{i = 1}^{Z} {\lambda_{i} B_{i} \gamma (A_{i} ,t) + b} } \right)$$
(11)

where \(\lambda_{i}\) are the Lagrange multipliers of the dual optimization problem, \(\gamma\) is the kernel function, and b is the bias parameter. For multi-class classification, we adopt a one-against-one coding scheme and train \({}^{n}C_{2}\) binary SVM classifiers, where n is the total number of class labels. The final decision is taken by voting.
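A possible realization of the block-wise histogram features (Eqs. (7)-(10)) and the one-against-one SVM classification is sketched below. The block size, the number of histogram bins, and the RBF kernel with C = 10 are illustrative assumptions; scikit-learn's SVC is used because it trains the same \({}^{n}C_{2}\) one-against-one binary classifiers with voting.

```python
import numpy as np
from sklearn.svm import SVC

def block_histogram_features(code_map, block_size=16, n_bins=256):
    """Split a CRIP response map into non-overlapping blocks and concatenate
    the per-block histograms of code values (simplified form of Eqs. 7-10)."""
    u, v = code_map.shape
    feats = []
    for r0 in range(0, u, block_size):
        for c0 in range(0, v, block_size):
            block = code_map[r0:r0 + block_size, c0:c0 + block_size]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist)
    return np.concatenate(feats).astype(np.float32)

# SVC trains the one-against-one scheme (nC2 binary classifiers) with voting.
clf = SVC(kernel="rbf", C=10.0, gamma="scale", decision_function_shape="ovo")
# clf.fit(train_features, train_labels); clf.predict(test_features)
```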

4 Experimental results and discussion

4.1 Comparison with existing descriptor

In the literature, various state-of-the-art approaches have been proposed to extract salient features from facial images. Recently, directional edge-response-based approaches [7, 19,20,21] have achieved high classification accuracy. These methods use directional information calculated by applying an eight-directional compass mask (Sobel, Kirsch, or Robinson) and encode features from the directional index positions in the local neighborhood. Existing descriptors can describe the features of still, posed, frontal-view expressions. However, in real-life applications, people express their emotions spontaneously, and faces may be recorded from multiple views. Thus, a feature descriptor that can also handle non-frontal views will yield better results for side-view and spontaneous expression recognition. Therefore, to improve the performance of FER systems, we propose the CRIP descriptor, which can extract expression features in such situations. The comparison with existing FER techniques can be summarized as follows.

  1.

    The directional descriptors capture only extreme edge variations in the local neighborhood, which may lead to information loss in the facial region and thereby degrade the performance of the FER system. In contrast, CRIP utilizes gradient information between centroids located at different radii. Thus, the CRIP descriptor extracts efficient transitional pattern information to describe macro- and micro-structure variations, which strengthens its capability to discriminate expression classes in both side and frontal views, as shown in Figs. 3 and 4 respectively.

    Fig. 3

    Qualitative comparison between the feature maps generated by the proposed CRIP and other state-of-the-art descriptors on a side-view image belonging to the neutral and surprise emotion classes. Colored boxes highlight the expressive regions of the facial appearance, which play a major role in defining the disparities between different expression classes

    Fig. 4

    An input anger expression and response feature maps encoded by (a) LBP, (b) Two-Phase, (c) LDP, (d) LDN, (e) LDTexP, (f) LDTerP, and (g) CRIP respectively

  2.

    In the CRIP descriptor, the gradient information of inter-radial centroids elicited from different sub-regions captures dipped and embossed edges, which enhances robustness to inter-class and intra-class variations compared to LDTexP and LDTerP. The qualitative performance of the descriptors is depicted in Fig. 5, which shows three expression images: two from the anger class and one from the sad class. A particular area of each image is selected to analyze the behavior of the directional descriptors (LDTexP and LDTerP) against the proposed descriptor. We extracted the response images along with histograms of the active patches, and computed the Euclidean distance between the response feature maps to validate the inter- and intra-class performance of CRIP. From the computed feature maps and corresponding distances, it is clear that the CRIP descriptor groups the two anger images into the same class and separates the sad image into a different class.

    Fig. 5

    The encoded responses and feature vectors of three images from two emotion classes (anger and sad), computed using LDTexP, LDTerP and CRIP respectively. The Euclidean distance between (a) and (b) is smaller, and between (a) and (c) larger, for CRIP compared to the other descriptors

  3.

    The robustness of CRIP to illumination variations, compared with state-of-the-art techniques, is depicted in Fig. 6. The cross-centroid relationship allows CRIP to capture minute edge variations, which improves the feature map quality under various lighting conditions compared with other state-of-the-art approaches.

    Fig. 6

    The input images and coded responses generated by applying the state-of-the-art descriptors LDP, LDN, LDTexP, LDTerP, QUEST, and the proposed CRIP. The images are captured using (a) NIR and (b-d) VIS cameras in various lighting conditions: dark, strong, and weak respectively

  4.

    CRIP utilizes the averaging behavior of pixels in the local neighborhood to preserve edge variations in low-resolution (blurred) images. Figure 7 demonstrates the response maps of a happy expression at three different resolutions, computed using the LBP, LDP, LDN, LDTexP, LDTerP, QUEST, and CRIP descriptors. From Fig. 7, it is visible that almost all existing methods lose prominent features in low-resolution images compared to CRIP, which degrades the performance of the FER system.

    Fig. 7

    Spontaneous happy emotion images, captured at three different resolutions, and their coded responses using the LBP, LDP, LDN, LDTexP, LDTerP, QUEST and CRIP descriptors respectively

  5.

    CRIP embeds an extensive local neighborhood of radius \(r_{2}\) instead of an exclusive neighborhood of radius \(r_{1}\) as in QUEST, which elicits more pattern variations and improves the quality of the response features.

Recently, convolutional neural network (CNN) based models [25, 26] have achieved high accuracy in FER. Conventional CNN models consist of a series of convolution, ReLU, and pooling layers that together form a hierarchical network able to extract the desired expression features; the extracted features are fed into a soft-max layer for classification. Although recent deep learning models show remarkable performance in image classification, these networks also have limitations: (1) they require large datasets to achieve the desired results; (2) predefined hyperparameters play an important role in network learning, and a small change in a hyperparameter can affect the overall learning process and, indirectly, the performance. Parameter tuning is itself a difficult task and sometimes requires numerous experiments.

To evaluate the performance of the CRIP encoding scheme, we conducted seven experiments on the benchmark expression datasets MUG [40], MMI [41, 42], Oulu-CASIA [43], ISED [44], AFEW [45, 46], the FER challenge [47], and a combined dataset (MUG, MMI, ISED, AFEW). We tested the CRIP descriptor under different conditions: time-lapse, ethnicity, and illumination variations. We also validated the proposed approach with and without neutral images, creating seven- and six-class problems respectively. Existing approaches [22,23,24] generally detect faces manually and then crop them for further processing. In the proposed method, however, we use the Viola-Jones [48] detection algorithm to extract the faces and then normalize them to equal-sized images. This procedure reflects a real-life scenario in which the position and background of the images are not static, and it enhances the performance of the FER system.
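The face extraction step can be reproduced with OpenCV's pretrained Viola-Jones Haar cascade, as in the sketch below; the 128 × 128 normalization size and the largest-detection heuristic are assumptions, not values taken from the paper.

```python
import cv2

# Viola-Jones face detection with OpenCV's pretrained frontal-face Haar cascade.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_and_normalize(gray_image, size=(128, 128)):
    """Detect the face, crop it, and resize to a fixed size (sketch)."""
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest detection
    return cv2.resize(gray_image[y:y + h, x:x + w], size)
```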

Furthermore, the resulting feature maps are partitioned into equal-sized blocks. Using the same block size for all databases, instead of selecting different block sizes per database, reduces the overhead of parameter selection and makes the FER system more generally applicable. It also ensures a fair comparative analysis between the proposed and existing descriptors. In our experimental setup, we created our datasets by selecting three to five peak frames from each video sequence; most existing approaches [8, 22,23,24] follow the same setup. In the literature, two validation schemes are used to measure the performance of an FER system: person-dependent and person-independent cross-validation.

In person-dependent cross-validation, the dataset is randomly partitioned into N subsets, of which N-1 serve as the training set and one as the testing set. In person-independent cross-validation, image sets are grouped by subject: at a given iteration, all expressions of particular subjects are kept out of the training set and used as the testing set. In this paper, we validate the performance of the proposed approach using both schemes.

Person-dependent scheme

In the person-dependent (PD) scheme, we arrange the training and testing sets by splitting each dataset 80:20. Since the partition into training and testing images is random, we repeat the same experiment for 5 iterations and take the average as the final recognition rate, to make the analysis of the outcomes fair. The recognition rate is computed using the following equation.

$$Recognition\,Accuracy=\frac{Total\,no.\,of\,correctly\,predicted\,samples}{Total\,no.\,of\,samples}\times100$$
(12)
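A minimal sketch of this PD protocol, assuming a feature matrix X and label vector y as NumPy arrays; the stratified splits and the RBF-kernel SVM settings are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def person_dependent_accuracy(X, y, runs=5, test_size=0.2, seed=0):
    """Person-dependent protocol: random 80:20 splits repeated `runs` times;
    the mean accuracy is reported as the final recognition rate (Eq. 12)."""
    scores = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + r, stratify=y)
        clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
        scores.append(100.0 * np.mean(clf.predict(X_te) == y_te))
    return float(np.mean(scores))
```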

Person-independent scheme

In the person-independent (PI) scheme, we adopt two methods: tenfold subject-independent and leave-one-subject-out cross-validation. In the tenfold subject-independent method, subjects are divided into 10 folds according to their IDs. In our experimental setup, we use the tenfold subject-independent method for the MMI and ISED datasets; since not all expressions of every subject are available in these datasets, this is the more suitable scheme. In the leave-one-subject-out method, only one subject's expressions form the testing set and the remaining subjects' expressions are used for training. For the MUG and OULU-CASIA datasets, we follow the leave-one-subject-out scheme, as all expressions are provided for each subject.
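The PI protocols map directly onto scikit-learn's group-aware splitters; the sketch below assumes a subject-id array aligned with the samples and uses GroupKFold for the tenfold subject-independent scheme (LeaveOneGroupOut would give leave-one-subject-out).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut
from sklearn.svm import SVC

def person_independent_accuracy(X, y, subject_ids, n_folds=10):
    """Person-independent protocol: folds are formed by subject id, so no
    subject appears in both training and testing. n_folds=10 gives the
    tenfold subject-independent scheme; replace GroupKFold(n_folds) with
    LeaveOneGroupOut() for the leave-one-subject-out scheme."""
    splitter = GroupKFold(n_splits=n_folds)
    scores = []
    for tr, te in splitter.split(X, y, groups=subject_ids):
        clf = SVC(kernel="rbf", gamma="scale").fit(X[tr], y[tr])
        scores.append(100.0 * np.mean(clf.predict(X[te]) == y[te]))
    return float(np.mean(scores))
```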

4.2 Experiment on MUG dataset

The Multimedia Understanding Group built the MUG [40] dataset with high-resolution images. It consists of 86 subjects showing various expressions: 35 females and 51 males, all of Caucasian origin, aged between 20 and 35 years. We take 5 peak images from all the sequences of the available subjects. For person-dependent experiments, we arrange a total of 1761 images: 260 anger, 250 disgust, 241 fear, 255 happy, 245 sad, 255 surprise, and 255 neutral. For person-independent experiments, we selected 1540 facial expression images (anger: 220, disgust: 220, fear: 220, happy: 220, sad: 220, surprise: 220, neutral: 220). The recognition rates of our method and other methods on the MUG dataset are shown in Table 1, from which we can state that our method outperforms the state-of-the-art methods in both PD and PI setups. CRIP also outperforms VGG16, VGG19, and ResNet50 in the PD experiments. Specifically, our proposed method yields performance improvements of 1.5%, 0.8%, 0.3% and 1.0%, 0.4%, 0.2% over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively in PD experiments. Similarly, the proposed CRIP outperforms LDTexP, LDTerP, and QUEST by 1.5%, 3.4%, 0.7% and 0.2%, 0.8%, 1.3% for the 6-class and 7-class problems respectively in PI experiments.

Table 1 Recognition accuracy comparison on MUG and MMI dataset

4.3 Experiment on MMI dataset

The MMI [41, 42] expression dataset has more than 2900 samples, containing both videos and still images. Each sequence starts and ends with a neutral expression, with the peak of a particular expression in between. All images are captured in frontal view. There are multiple sessions per person, with and without glasses. MMI consists of 236 sessions recorded for 32 subjects. We selected the frame sequences of 30 subjects for our experiments. Each sequence is categorized into one of the expression classes: anger, disgust, fear, happy, sad, or surprise. We manually crop the images of the MMI dataset. For the person-dependent and person-independent experiments we follow a similar setup, arranging a total of 561 images: 69 anger, 71 disgust, 69 fear, 84 happy, 67 sad, 75 surprise, and 126 neutral. The recognition rates on the MMI dataset for both 6-class and 7-class problems are shown in Table 1, which makes it clear that the proposed descriptor outperforms other existing handcrafted approaches for facial expression recognition. In particular, the proposed method improves performance by 4.2%, 7.0%, 0.8% and 0.6%, 0.6%, 2.6% over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively in PD experiments. Similarly, in PI experiments, CRIP achieves 1.4%, 4.3%, 1.1% and 0.5%, 2.2%, 0.7% performance improvements over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively. To delineate the class-wise recognition accuracy, we further present the confusion matrices for the 6- and 7-class problems in the PD setup in Figs. 8 and 9 respectively.

Fig. 8

Confusion matrix of CRIP for 6-class expression classification in MMI dataset

Fig. 9

Confusion matrix of CRIP for 7-class expression classification in MMI dataset

4.4 Experiment on OULU-CASIA dataset

The OULU-CASIA [43] dataset consists of 80 subjects, aged between 23 and 58 years, representing the six basic expressions: anger, disgust, fear, happy, sad, and surprise. All images in the dataset are frontal. The expression images are captured using NIR and VIS cameras under 3 different lighting conditions: strong, dark, and weak. In the strong condition appropriate lighting is used, in the weak condition only the computer display light is used, and in the dark condition images are taken in a near-dark environment. For our experimental purposes, we select only 3 peak images from each sequence; the neutral expression is created by taking the initial frames of the sequences, before the subjects start showing any emotion.

A total of 2880 video sequences are present in the dataset, of which 480 image sequences belong to each illumination condition. We arrange images for all 3 illumination conditions. The numbers of images selected per expression are: anger – 240, disgust – 240, fear – 240, happy – 240, sad – 240, surprise – 240, and neutral – 240. The same applies to both NIR and VIS camera images. To validate the robustness of CRIP, we show a qualitative comparison between the proposed descriptor and other state-of-the-art feature descriptors in Fig. 6. From Fig. 6 we can see that almost all methods perform well under normal lighting conditions. However, in the presence of strong and weak illumination, the proposed method extracts discriminative features much more conclusively than the other descriptors.

The quantitative results of the proposed method and other approaches on the NIR dataset for the 6-class and 7-class problems are given in Tables 2, 3, 4, and 5. The evaluated results clearly show that the proposed descriptor outperforms other state-of-the-art approaches. More specifically, the proposed descriptor achieves performance gains of 0.8%, 0.9%, 0.6% and 1.0%, 0.5%, 0.3% for the 6-class and 7-class problems compared to LDTexP, LDTerP, and QUEST respectively in the PD setup. In the PI setup, CRIP secures 1.3%, 10.2%, 0.8% and 1.5%, 15.4%, 1.6% performance improvements over LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively. The results on the VIS dataset are also given in Tables 2, 3, 4, and 5. In the PD setup, the proposed method outperforms LDTexP, LDTerP, and QUEST by 2.6%, 1.3%, 0.8% and 0.6%, 1.3%, 0.3% for the 6-class and 7-class problems respectively. Similarly, in the PI setup, CRIP achieves 7.1%, 9.1%, 1.5% and 6.2%, 10.2%, 0.6% performance gains compared to LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems respectively.

Table 2 Recognition accuracy comparison on OULU-NIR dataset for 6-class expressions in both Person Dependent and Independent Setup
Table 3 Recognition accuracy comparison on OULU-NIR dataset for 7-class expressions in both Person Dependent and Independent Setup
Table 4 Recognition accuracy comparison on OULU-VIS dataset for 6-class expressions in both Person Dependent and Independent Setup
Table 5 Recognition accuracy comparison on OULU-VIS dataset for 7-class expressions in both Person Dependent and Independent Setup

4.5 Experiment on ISED dataset

The ISED [44] dataset includes 50 culturally diverse Indian subjects with 428 video sequences. Of the 50 subjects, 29 are male and 21 are female, all aged between 18 and 22 years. The dataset was created with spontaneous emotions, i.e., the subjects were not asked to pose particular expressions; all spontaneous expressions were captured without the subjects' awareness. Multiple sessions of the same emotion were recorded at a high resolution of 1920 × 1080. Expressions are categorized into 4 classes: disgust, happy, sad, and surprise. The neutral class was created from the onset frames of these sessions, before the subjects started showing any expression. All captures were taken under weak illumination conditions. Finally, we selected 730 total image frames: 234 disgust, 294 happy, 174 sad, 189 surprise, and 295 neutral. The comparative results of our proposed method and other approaches are given in Table 6, from which we can see that the proposed method achieves superior results and extracts discriminative features better than other state-of-the-art approaches. In particular, it surpasses the accuracy of LDTexP, LDTerP, and QUEST by 4.0%, 7.0%, 3.1% and 7.4%, 7.0%, 4.0% for the 4-class and 5-class expression recognition problems in the PD setup. Similarly, in the PI setup, the proposed descriptor attains 1.1%, 2.5%, 0.4% and 3.5%, 0.02%, 1.4% performance improvements compared to LDTexP, LDTerP, and QUEST for the 4-class and 5-class problems respectively. This shows that our method is effective even at recognizing spontaneous expressions.

Table 6 Recognition accuracy comparison on ISED dataset

4.6 Experiment on AFEW dataset

AFEW [45, 46] contains 957 video clips of in-the-wild emotions across all age groups. These videos are extracted from different movies to encapsulate real-world scenarios. The AFEW database comprises various real-world challenges: spontaneous facial expressions, head pose variations, occlusions, varying face resolutions, and illumination changes. In total, AFEW holds 10,809 images covering the seven basic expression classes: angry, disgust, fear, happy, sad, surprise, and neutral. In our setup, we selected the 10 to 15 most expressive images to validate the performance of the proposed method. The comparative results of CRIP and other state-of-the-art approaches on AFEW are tabulated in Table 7. Specifically, our proposed method yields 8.0%, 1.9%, 1.2% and 6%, 1.3%, 0.9% higher accuracy rates for the 6- and 7-class problems compared to LDTexP, LDTerP, and QUEST respectively.

Table 7 Recognition accuracy comparison on AFEW, COMPOSITE and FER13 dataset

4.7 Experiment on composite dataset

In this experiment, we create a larger dataset by combining the MUG, MMI, ISED, and AFEW datasets. This gives us a greater number of images to train the model and provides a diverse pool of subjects for designing and evaluating the FER system. The resultant dataset consists of 14,242 images: 1856 anger, 1700 disgust, 1497 fear, 2761 happy, 2265 sad, 1445 surprise, and 2718 neutral. Table 7 presents the evaluated results for the proposed descriptor and other approaches. The proposed method obtains 4.8%, 11.9%, 1.5% and 6.9%, 3.5%, 2.3% higher recognition accuracy compared to LDTexP, LDTerP, and QUEST for 6-class and 7-class expression recognition respectively.

4.8 Experiment on FER-2013 dataset

The FER challenge [47] was organized to promote the development of feature extraction and learning techniques for automated facial expression recognition. The dataset consists of grayscale images of 48 × 48 resolution, partitioned into a training set (28,709 images), a public test set (3,589 images), and a private test set (3,589 images). In our experiments, we computed the results for the 6-class and 7-class problems on both the public and private test sets; the comparative results are shown in Table 7. From Table 7, we can see that CRIP outperforms the existing state-of-the-art feature descriptors. More specifically, it attains 8.6%, 15.0%, 10.7% and 8.2%, 13.6%, 10.4% performance improvements compared to LDTexP, LDTerP, and QUEST for the 6-class and 7-class problems on the public test set. Similarly, it outperforms LDTexP, LDTerP, and QUEST by 8.1%, 15.8%, 8.9% and 16%, 8.4% for the 6-class and 7-class problems on the private test set.

5 Conclusion

In this paper, we proposed a new feature descriptor, the Cross-Centroid Ripple Pattern, which encodes image features using inter-radial ripples. The ripple centroids are computed by establishing a relationship between neighboring subspaces in the local neighborhood. Cross-centroid thresholding extracts deep edge variations, which improves the capability to distinguish expressions belonging to different emotion classes. The cross-centroid relationship also enhances robustness to pose and illumination variation by highlighting deep edge features. Moreover, averaging pixel information in local neighborhoods improves the performance of CRIP in both frontal and side views. Experimental results show the effectiveness of the proposed method, as it achieves better recognition rates than other state-of-the-art approaches.