1 Introduction

Facial expressions are vital identifiers of human feelings because they correspond directly to emotions. In most cases (roughly 55% of the time) [1], the facial expression is a nonverbal means of emotional expression, and it can be considered concrete evidence for uncovering whether an individual is speaking the truth or not [2].

Current approaches primarily focus on facial analysis while keeping the background intact, and hence build up many unnecessary and misleading features that confuse the CNN training process. The current manuscript focuses on five essential facial expression classes: displeasure/anger, sad/unhappy, smiling/happy, feared, and surprised/astonished [3]. The FERC algorithm presented in this manuscript performs expressional examination and characterizes a given image into one of these five essential emotion classes.

Reported techniques for facial expression detection fall into two major approaches: the first distinguishes expressions [4] that are identified with an explicit classifier, and the second performs characterization based on extracted facial features [5]. In the facial action coding system (FACS) [6], action units (AUs) are used as expression markers; these AUs are discriminable by facial muscle changes.

2 Literature review

Facial expression is a universal signal that all humans use to convey mood. There have been many attempts to build automatic facial expression analysis tools [7], as they have applications in many fields such as robotics, medicine, driver assistance systems, and lie detection [8,9,10]. In the twentieth century, Ekman et al. [11] defined seven basic emotions that are independent of the culture in which a human grows up: anger, fear, happiness, sadness, contempt [12], disgust, and surprise. In a recent study on the facial recognition technology (FERET) dataset, Sajid et al. investigated the impact of facial asymmetry as a marker for age estimation [13], finding that right-face asymmetry is a better marker than left-face asymmetry. Face pose appearance remains a big issue in face detection. Ratyal et al. provided a solution for variability in facial pose appearance, using a three-dimensional pose-invariant approach with subject-specific descriptors [14, 15]. Many issues, such as excessive makeup [16], pose, and expression [17], have been solved using convolutional networks. Recently, researchers have made extraordinary accomplishments in facial expression detection [18,19,20], and related advances in neuroscience [21] and cognitive science [22] continue to drive research in the field of facial expression. Developments in computer vision [23] and machine learning [24] also make emotion identification more accurate and accessible to the general population. As a result, facial expression recognition is growing rapidly as a sub-field of image processing. Possible applications include human–computer interaction [25], psychiatric observation [26], drunk driver recognition [27], and, most importantly, lie detection [28].

3 Methodology

The convolutional neural network (CNN) is the most popular way of analyzing images. A CNN differs from a multi-layer perceptron (MLP) in that its hidden layers are convolutional layers. The proposed method is based on a two-level CNN framework: the first level performs background removal [29], as shown in Fig. 1, and the second level extracts the emotion from the resulting image. A conventional CNN network module is used to extract the primary expressional vector (EV). The EV is generated by tracking down relevant facial points of importance and is directly related to changes in expression; it is obtained by applying a basic perceptron unit to a background-removed face image. In the proposed FERC model, the last stage is a non-convolutional perceptron layer.

Each convolutional layer receives the input data (or image), transforms it, and outputs it to the next level; this transformation is the convolution operation, as shown in Fig. 2. All the convolutional layers used are capable of pattern detection, and within each convolutional layer, four filters were used. The input image fed to the first-part CNN (used for background removal) generally consists of shapes, edges, textures, and objects along with the face, so edge detector, circle detector, and corner detector filters are used at the start of its convolutional layer 1. Once the face has been detected, the second-part CNN filters catch facial features, such as eyes, ears, lips, nose, and cheeks. The edge detection filters used in this layer are shown in Fig. 3a. The second-part CNN consists of layers with a \(3\times 3\) kernel matrix, e.g., [0.25, 0.17, 0.9; 0.89, 0.36, 0.63; 0.7, 0.24, 0.82]. These values are initialized between 0 and 1 and then optimized for EV detection against the ground truth in the supervisory training dataset; here, minimum error decoding was used to optimize the filter values. Once a filter is tuned by supervisory learning, it is applied to the background-removed face (i.e., the output image of the first-part CNN) to detect the different facial parts (e.g., eyes, lips, nose, ears, etc.).
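To make the data flow concrete, the sketch below strings the stages together in Python. All names here (`bg_cnn`, `face_cnn`, `perceptron`, and their `predict` methods) are hypothetical placeholders for the trained components described above, not the authors' actual implementation.

```python
import numpy as np

def ferc_pipeline(image, bg_cnn, face_cnn, perceptron):
    """Sketch of the two-level FERC flow (all interfaces assumed)."""
    # Level 1: the first-part CNN removes the background,
    # leaving only the face region.
    face_only = bg_cnn.predict(image)

    # Level 2: the second-part CNN locates facial parts (eyes,
    # ears, lips, nose, cheeks) and yields the 24-value
    # expressional vector (EV).
    ev = face_cnn.predict(face_only)           # shape: (24,)

    # Final non-convolutional perceptron stage maps the EV to one
    # of the five emotion classes (label order illustrative).
    scores = perceptron.predict(ev)
    labels = ["anger", "sad", "happy", "fear", "surprise"]
    return labels[int(np.argmax(scores))]
```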

Fig. 1

a Block diagram of FERC. The input image is taken from a camera or extracted from the video. The input image is then passed to the first-part CNN for background removal. After background removal, the facial expressional vector (EV) is generated. Another CNN (the second-part CNN) is applied with the supervisory model obtained from the ground-truth database. Finally, the emotion in the current input image is detected. b Facial vectors marked on the background-removed face. Here, nose (N), lip (P), forehead (F), and eyes (Y) are marked using edge detection and nearest cluster mapping. The positions left, right, and center are represented by L, R, and C, respectively

Fig. 2

Convolution filter operation with the \(3 \times 3\) kernel. Each pixel from the input image and its eight neighboring pixels are multiplied with the corresponding value in the kernel matrix, and finally, all multiplied values are added together to achieve the final output value

Fig. 3

a Vertical and horizontal edge detector filter matrices used at layer 1 of the background removal CNN (first-part CNN). b Sample EV matrix showing all 24 pixel values (top) and the parameters measured (bottom). c Representation of a point in the image domain (top panel) and in the Hough transform domain (bottom panel)

To generate the EV matrix, 24 facial features are extracted in all. The EV feature vector consists of the normalized Euclidean distances between each pair of face parts, as shown in Fig. 3b.
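As an illustration of how such a vector can be assembled, the sketch below computes normalized Euclidean distances between named landmarks. The landmark labels follow Fig. 1b, but the exact 24 pairs and the normalization by inter-eye distance are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np
from itertools import combinations

def build_ev(landmarks):
    """Build an EV-style vector of normalized pairwise distances.

    landmarks: dict mapping part labels (e.g., 'LY' = left eye,
    'RY' = right eye, 'N' = nose, per Fig. 1b) to (x, y) pixels.
    """
    # Normalize by inter-eye distance (an illustrative choice).
    scale = np.linalg.norm(np.subtract(landmarks["LY"], landmarks["RY"]))
    ev = [np.linalg.norm(np.subtract(landmarks[a], landmarks[b])) / scale
          for a, b in combinations(sorted(landmarks), 2)]
    return np.array(ev)
```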

3.1 Key frame extraction from input video

FERC works with both image and video input. When the input to FERC is a video, the difference between consecutive frames is computed, and the maximally stable frames are those for which this intra-frame difference is zero. A Canny edge detector is then applied to each stable frame, and the aggregated sum of white (edge) pixels is calculated. After comparing the aggregated sums of all stable frames, the frame with the maximum sum is selected as the input to FERC, because it contains the maximum detail in terms of edges (more edges, more details). The logic behind this choice is that blurry frames have few or no edges.
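A minimal OpenCV sketch of this key frame selection is given below. The Canny thresholds are assumptions, and in practice a small tolerance, rather than an exactly zero difference, may be needed to declare a frame stable.

```python
import cv2
import numpy as np

def select_key_frame(video_path):
    """Return the stable frame with the most Canny edge pixels."""
    cap = cv2.VideoCapture(video_path)
    prev, best_frame, best_score = None, None, -1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Stable frame: frame-to-frame difference is zero.
        if prev is not None and not np.any(cv2.absdiff(gray, prev)):
            edges = cv2.Canny(gray, 100, 200)      # thresholds assumed
            score = int(np.count_nonzero(edges))   # white-pixel sum
            if score > best_score:
                best_frame, best_score = frame, score
        prev = gray
    cap.release()
    return best_frame
```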

3.2 Background removal

Once the input image is obtained, a skin tone detection algorithm [30] is applied to extract the human body parts from the image. This skin tone-detected output is a binary image and is used as the feature for the first layer of the background removal CNN (also referred to as the first-part CNN in this manuscript). The skin tone detection depends on the type of input image. If the image is a color image, a YCbCr color threshold can be used: for skin tone, the Y value should be greater than 80, Cb should range between 85 and 140, and Cr should be between 135 and 200. These values were chosen by trial and error and worked for almost all of the skin tones encountered. We found that the skin tone detection algorithm has very low accuracy on grayscale input images. To improve accuracy during background removal, the CNN also uses the circles-in-circle filter, whose operation uses Hough transform values for each circle detection. To maintain uniformity irrespective of the type of input image, the Hough transform (Fig. 3c) was always used as the second input feature to the background removal CNN. The formula used for the Hough transform is shown in Eq. 1:

$$\begin{aligned} H(\theta ,\rho )=\int _{-\infty }^{\infty }\int _{-\infty }^{\infty }A(x,y)\delta (\rho - x\cos \theta -y\sin \theta ){\mathrm{d}}x {\mathrm{d}}y \end{aligned}$$
(1)
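In discrete form, Eq. 1 becomes an accumulator over \((\theta, \rho)\) bins: every foreground pixel \(A(x,y)>0\) votes for each \(\rho = x\cos \theta + y\sin \theta\) consistent with it. A minimal numpy sketch, with assumed bin sizes of 1° and 1 pixel, is given below.

```python
import numpy as np

def hough_transform(binary_image, n_theta=180):
    """Discrete accumulator version of Eq. 1. The delta function
    selects, for each foreground pixel and each theta, the single
    rho bin to increment. Bin sizes are assumptions."""
    h, w = binary_image.shape
    diag = int(np.ceil(np.hypot(h, w)))        # max possible |rho|
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((n_theta, 2 * diag + 1), dtype=np.int64)
    ys, xs = np.nonzero(binary_image)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[np.arange(n_theta), rhos + diag] += 1
    return acc
```

The YCbCr skin thresholds quoted above can likewise be sketched with OpenCV. This is an illustrative reading of the thresholds, not the authors' code; note that OpenCV's conversion orders the channels as (Y, Cr, Cb).

```python
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Binary skin mask using the YCbCr thresholds from Sect. 3.2."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    mask = (y > 80) & (cb >= 85) & (cb <= 140) & (cr >= 135) & (cr <= 200)
    return mask.astype(np.uint8) * 255
```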

3.3 Convolution filter

As shown in Fig. 2, for each convolution operation the entire image is divided into overlapping \(3\times 3\) matrices, and the corresponding \(3\times 3\) filter is convolved over each \(3\times 3\) matrix obtained from the image. This operation of sliding the filter and taking dot products is called 'convolution', hence the name 'convolutional filter.' During the convolution, the dot product of the two \(3\times 3\) matrices is computed and stored at the corresponding location, e.g., (1,1) of the output, as shown in Fig. 2. Once the entire output matrix is calculated, it is passed to the next layer of the CNN for another round of convolution. The last layer of the face feature extracting CNN is a simple perceptron, which tries to optimize the values of the scale factor and exponent based on the deviation from the ground truth.
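The sliding dot product of Fig. 2 maps directly to code. A naive numpy version is shown below; 'valid' border handling (no padding) is an assumption for simplicity.

```python
import numpy as np

def convolve3x3(image, kernel):
    """Slide a 3x3 kernel over the image and take dot products."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            # Dot product of the 3x3 patch with the 3x3 kernel.
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# The example kernel quoted in Sect. 3:
kernel = np.array([[0.25, 0.17, 0.90],
                   [0.89, 0.36, 0.63],
                   [0.70, 0.24, 0.82]])
```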

3.4 Hardware and software details

All programs were executed on a Lenovo Yoga 530 laptop with an Intel i5 8th-generation CPU, 8 GB RAM, and a 512 GB SSD. The software used to run the experiments was Python (via the Thonny IDE), MATLAB 2018a, and ImageJ.

4 Results and discussions

To analyze the performance of the algorithm, the extended Cohn–Kanade expression dataset [31] was used initially. This dataset has only 486 sequences across 97 posers, which limited the maximum achievable accuracy to about 45%. To overcome this limitation, multiple datasets were downloaded from the Internet [32, 33], and the author's own pictures at different expressions were also included. As the number of images in the dataset increased, so did the accuracy. We kept 70% of the 10K dataset images for training and 30% for testing. In all, 25 iterations were carried out, each with a different randomly drawn 70% training set, and the error bar was computed as the standard deviation across iterations.

Figure 4a shows the optimization of the number of CNN layers. For simplicity, we kept the number of layers and the number of filters the same for the background removal CNN (first-part CNN) and the face feature extraction CNN (second-part CNN). We varied the number of layers from 1 to 8 and found that maximum accuracy was obtained with four layers. This was counter-intuitive, as one normally assumes that accuracy increases with the number of layers (at the cost of execution time). Because the maximum accuracy was obtained with four layers, we fixed the number of layers at 4. Execution time increased with the number of layers but added no significant insight to our study, and hence is not reported in the current manuscript. Figure 4b shows the optimization of the number of filters: again, 1-8 filters were tried for each of the four-layer CNN networks, and four filters per layer gave the best accuracy. Hence, FERC was designed with four layers and four filters. As future work, researchers could vary the number of layers for the two CNNs independently, or feed each layer a different number of filters; this could be automated using servers. Due to the author's computational power limitations, we did not carry out this study, but we would welcome other researchers improving on 4 layers and 4 filters and pushing the accuracy beyond the 96% we achieved.

Figure 4c and e shows regular front-facing cases with angry and surprise emotions, which the algorithm detected easily (Fig. 4d, f). The only challenging part in these images was skin tone detection, because of their grayscale nature. With color images, background removal via skin tone detection was straightforward, but with grayscale images we observed false face detection in many cases. The image in Fig. 4g was challenging because of its orientation; fortunately, with the 24-dimensional EV feature vector, FERC could correctly classify faces oriented up to 30°. We acknowledge that the method has limitations, such as the high computing power required during CNN tuning, and that facial hair causes many issues. Beyond these problems, however, the accuracy of our algorithm is high (96%), which is comparable to most of the reported studies (Table 2). A major limitation of this method arises when not all 24 features of the EV vector can be obtained, due to orientation or shadow on the face. The authors are trying to overcome the shadow limitation through automated gamma correction on images (manuscript under preparation). For orientation, we could not find any strong solution other than assuming facial symmetry.
Due to facial symmetry, we generate the missing feature parameters by copying the corresponding 12 values into the missing entries of the EV matrix (e.g., the distance between the left eye and the left ear (LY–LE) is assumed to be the same as that between the right eye and the right ear (RY–RE), etc.). The algorithm also failed when multiple faces were present in the same image at equal distances from the camera. For testing, the 30% of the dataset not used for training was used; for each of the 25 training folds, the full dataset was resampled to draw a fresh 70/30 split. To evaluate the performance of FERC on large datasets, the Caltech faces, CMU, and NIST databases were used (Table 1). Accuracy was found to drop as the number of images increased, because of over-fitting, and it also remained low when the number of training images was small. The ideal number of images for FERC was found to be in the range of 2000–10,000.
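The 25-fold repeated-split evaluation described above can be written compactly; the sketch below assumes a scikit-learn-style model interface (`fit`/`score`), which is an illustrative placeholder, not the authors' training code.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def evaluate(model_factory, X, y, n_folds=25, test_size=0.30):
    """Repeat random 70/30 splits; report mean accuracy and the
    standard deviation (used here as the error bar)."""
    accs = []
    for train_idx, test_idx in ShuffleSplit(
            n_splits=n_folds, test_size=test_size).split(X):
        model = model_factory()                  # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(accs), np.std(accs)
```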

Fig. 4

a Optimization for the number of CNN layers. Maximum accuracy was achieved for four-layer CNN. b Optimization for the number of filters. Four filters per layer gave maximum accuracy. c, e, g Different input images from the dataset. d, f, h The output of background removal with a final predicted output of emotion

Table 1 Results obtained with different databases

4.1 Comparison with other methods

As shown in Table 2, FERC is a unique method built from two 4-layer networks, achieving an accuracy of 96%, whereas others have taken a combined approach, solving background removal and facial expression detection in a single CNN network. Addressing the two issues separately reduces complexity as well as tuning time. Although we considered only five moods for classification, cases of the sixth and seventh moods were misclassified into these five, adding to the error. Zao et al. [37] achieved a maximum accuracy of 99.3%, but at the cost of a 22-layer neural network, and training such a large network is time-consuming. Compared with existing methods, only FERC uses a key frame extraction method, whereas others simply take the last frame. Jung et al. [38] worked with fixed frames, which makes their system less efficient with video input. The number of training folds in most other work was only ten, whereas we could go up to 25-fold training because of our small network size.

Table 2 Comparison table with similar methods reported in the literature

As shown in Table 3, FERC has a complexity similar to that of AlexNet. FERC is much faster than VGG, GoogleNet, and ResNet, and in terms of accuracy it outperforms these standard networks. However, in some cases we found that GoogleNet outperforms FERC, especially when GoogleNet is trained for around 5000 iterations or more.

Table 3 Comparison table of FERC with standard networks

Another unique contribution of FERC is the skin tone-based feature and the Hough transform for the circles-in-circle filter. Skin tone detection is a fast and robust method of pre-processing the input data. We expect that, with these new functionalities, FERC will become a preferred method for mood detection in the coming years.

5 Conclusions

FERC is a novel approach to facial emotion detection that combines the advantages of CNNs and supervised learning (feasible due to big data). The main advantage of the FERC algorithm is that it works across different face orientations (less than 30°) thanks to its unique 24-value EV feature vector. Background removal adds a great advantage in accurately determining emotions. FERC could be the starting step for many emotion-based applications, such as lie detectors and mood-based learning for students.