
1 Introduction

Since the beginning of this century, with the rapid development of deep learning [1], image recognition technology [2] has entered a golden age, and a succession of improved convolutional neural network models [3] have repeatedly set new accuracy records. Expression recognition covers both static and dynamic images: static image recognition works on a single picture, while dynamic image recognition operates on video sequences. For now, however, most research still focuses on the recognition of static images.

The development of facial expression recognition can be divided into three stages: first, hand-designed feature extractors (LBP [4], LBP-TOP [5]); then shallow learning methods (SVM [6], AdaBoost [7]); and now deep learning [8]. Each stage addresses the limitations of the previous one. Traditional hand-designed feature extractors depend heavily on manual design, so their generalization, robustness, and accuracy are somewhat limited. Shallow learning reduces the need for manual intervention but still falls short in accuracy. With the development of computer hardware, facial expression recognition based on deep learning has gradually overcome the accuracy limitations of shallow learning.

2 LWCNN

2.1 Related Work

Deep learning is now a relatively mature field, but to further improve image recognition accuracy, researchers have begun to refine neural networks from other directions: improving the activation function [9], adding attention mechanisms to the network [10], and adding autoencoder layers [11], all of which have produced significant progress. These ideas have improved not only general image classification but also the recognition rate in facial expression recognition. A side effect, however, is that stacking such components makes convolutional networks increasingly bloated; redundant parameters and complex calculations waste computing resources. Many scholars have tried to overcome these problems. The literature [12] summarizes the characteristics of past lightweight convolutional networks, dividing them into three categories: lightweight convolution structures, lightweight convolution modules, and lightweight convolution operations. A recent work [13] proposed a lightweight model combining an attention mechanism with a convolutional neural network; it integrates the first two lightweight characteristics, but its network contains multiple parallel computational branches, which increases the computational cost.

Therefore, the improvement in this paper is to remove the branch calculation channels of the neural network model, retain the main calculation channel, reduce the size of the convolution kernels, and add a detachable attention module, widely used at present, as an auxiliary feature extractor to assist the learning of the main calculation channel.

2.2 Improved LWCNN

The lightweight model in this paper keeps the approach of combining an attention mechanism with a convolutional neural network, but it strengthens the parallel extraction and fusion of features, adds the Non-Local attention mechanism (Non-Local Net) [14], and reduces the number of parameters in the main calculation channel. Briefly, the model consists of a main calculation channel and an attention calculation branch. The attention branch merges auxiliary feature information into the main channel while retaining the original main-channel features, an idea similar to the residual structure, as shown in Fig. 1.

Fig. 1.
figure 1

Improved lightweight model.

The SeNet [15] structure is used near the output layer, and the Non-Local Net structure is used near the input layer. Non-Local Net, applied near the input, establishes feature connections between related regions of the image; SeNet, applied before the output layer, fuses the features of different channels, after which the predicted value is computed.
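To make the Non-Local idea concrete, the following is a minimal numpy sketch of an embedded-Gaussian Non-Local block: every spatial position attends to every other position, and the aggregated response is added back as a residual. The random channel projections here merely stand in for the learned 1 * 1 convolutions of the real block; the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, rng):
    """Embedded-Gaussian Non-Local block on an (H, W, C) feature map.
    theta/phi/g are 1x1 convolutions in the real block; random channel
    projections stand in for them here."""
    H, W, C = x.shape
    Ci = max(C // 2, 1)                     # reduced inner channel count
    theta = rng.standard_normal((C, Ci))
    phi   = rng.standard_normal((C, Ci))
    g     = rng.standard_normal((C, Ci))
    w_out = rng.standard_normal((Ci, C))
    flat = x.reshape(H * W, C)              # N positions x C channels
    attn = softmax(flat @ theta @ (flat @ phi).T, axis=-1)  # N x N relations
    y = (attn @ (flat @ g)) @ w_out         # aggregate over all positions
    return x + y.reshape(H, W, C)           # residual connection
```

Because of the residual connection, the block can be dropped into an existing network without changing any feature-map shapes.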

The relevant calculation formula is:

$$H\left(x\right)={F}_{scale}\left[I\left(x\right),S\right]+I(x)$$
(1)

Among them, H(x) represents the network mapping after the summation, S represents the feature weight values of the different channels, Fscale represents the weighting calculation, and I(x) represents the input from the previous layer, which can be expressed as:

$$I\left(x\right)={f}_{1}\left(x\right)+{f}_{2}\left(x\right)$$
(2)

Here I(x) represents the total network mapping after summation, f1(x) the mapping computed by ordinary convolution on the main channel, and f2(x) the mapping computed by the Non-Local Net branch.
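Eq. (1) can be sketched in a few lines of numpy, assuming the channel weights S come from a standard squeeze-and-excitation pattern (global average pooling followed by two fully connected layers and a sigmoid); the weight matrices here are placeholders for the learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_residual(i_x, w1, w2):
    """Eq. (1): H(x) = F_scale[I(x), S] + I(x) for an (H, W, C) map I(x).
    S is produced by squeeze (global average pooling) and excitation
    (two fully connected layers), as in SeNet."""
    s = i_x.mean(axis=(0, 1))          # squeeze: one statistic per channel
    s = np.maximum(s @ w1, 0.0)        # excitation FC1 + ReLU
    s = sigmoid(s @ w2)                # excitation FC2 + sigmoid -> weights S
    return i_x * s + i_x               # F_scale (channel reweighting) + residual
```

With zero-initialized excitation weights, sigmoid(0) = 0.5, so the block initially scales every channel by 1.5 and then learns per-channel weights during training.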

The backbone calculation channel of the model uses the Xception [16] architecture, but the convolution kernel sizes are optimized and the number of parameters is reduced. The hierarchy of the entire model is shown in Fig. 2.
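Xception's parameter savings come from depthwise separable convolution, which replaces one dense k * k convolution with a per-channel spatial filter plus a 1 * 1 pointwise projection. The quick arithmetic below, with illustrative channel counts, shows the reduction:

```python
def conv_params(k, c_in, c_out):
    # standard convolution: k*k*c_in weights for each output channel
    return k * k * c_in * c_out

def sep_conv_params(k, c_in, c_out):
    # depthwise (k*k per input channel) + pointwise 1x1 projection
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 128)      # 147456 weights
sep = sep_conv_params(3, 128, 128)  # 1152 + 16384 = 17536 weights
```

For a 3 * 3 layer with 128 input and output channels, the separable version needs roughly 8 times fewer weights, which is the main reason the backbone stays lightweight.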

Fig. 2.
figure 2

Hierarchy diagram of improved model.

In Fig. 2, the model is divided into three modules. The first is the Entry flow. Two ordinary convolution operations are first applied to the image; the output of the second convolution is copied twice: one copy serves as a residual connection added after the MaxPooling layer, and the other is sent to NL-Net to establish feature correlations within the image and is added back before the MaxPooling layer. The main channel then passes through two separable convolutional layers with an activation layer between them. This series of operations is repeated four times, with the separable convolution kernels changing from (16, 3 * 3) to (32, 3 * 3), (64, 3 * 3), and (128, 3 * 3). Afterwards, SeNet fuses the features of the different channels and adjusts the output channel feature values, and processing enters the second module, the Middle flow.
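The fusion order in the Entry flow (add one branch before pooling, pool, then add the other branch) can be sketched as follows; `entry_flow_fusion` and its argument names are hypothetical labels for the step described above, not names from the paper.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) map (H, W even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def entry_flow_fusion(main, nl_branch, shortcut):
    """Sketch of the Entry-flow fusion order: the NL-Net branch is added
    before MaxPooling, and the residual copy is added after it."""
    fused = main + nl_branch            # NL-Net output added pre-pooling
    pooled = max_pool_2x2(fused)
    return pooled + shortcut            # residual copy added post-pooling
```

Note that the post-pooling residual copy must already be at the pooled resolution (in the real network a strided shortcut handles this), otherwise the shapes cannot be added.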

In the second module, the activation layer precedes the separable convolutional layer, following the design in the original paper. This calculation is repeated eight times before entering the third module, the Exit flow.

In the third module, before the activation layer, a copy of the input channel is sent to NL-Net for parallel processing. The activation layer is then followed by a separable convolutional layer with 128 convolution kernels, and the next separable convolutional layer increases the number of kernels to 256.

Before the MaxPooling layer, the output of the NL-Net branch is added to the output of the 256-kernel separable convolutional layer; the sum passes through the MaxPooling layer, and SeNet then fuses the features of the different channels and adjusts the feature values. Finally, the result is output through the remaining operations.

3 Experiment

3.1 Configuration

Hardware environment: the CPU is an AMD 5800X, the graphics card is an NVIDIA RTX 3060, and the memory is 32 GB DDR4 3200 MHz.

Software environment: the operating system is Windows 10, the programming software is PyCharm, the Python version is 3.6, the Keras version is 2.2.4, and the TensorFlow version is 1.13.1.

Model parameters: the batch size is set to 64, the number of epochs is 200, the image size for Fer2013 and CK+ is unified to 48 * 48, the initial learning rate is 0.0025, and the learning rate decay factor is 0.1. The loss function is the multi-class log loss, the activation function is uniformly ReLU, and data augmentation uses the ImageDataGenerator built into Keras.
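The "multi-class log loss" above is the categorical cross-entropy over softmax outputs; a minimal numpy version, together with the step-decay schedule implied by the 0.1 decay factor, might look like the following (function names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_log_loss(logits, labels):
    """Multi-class log loss: mean of -log p(correct class) over the batch."""
    p = softmax(logits)
    n = logits.shape[0]
    return -np.log(p[np.arange(n), labels]).mean()

def decayed_lr(initial=0.0025, factor=0.1, drops=0):
    """Learning rate after `drops` reductions by the decay factor."""
    return initial * factor ** drops
```

For a sample predicted with probability 0.7 on the correct class, the loss contribution is -log 0.7, roughly 0.357; one decay step takes the learning rate from 0.0025 to 0.00025.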

3.2 DataSet

The FER-2013 data set contains 28709 training samples, 3589 validation samples, and 3589 test samples; each sample image has a resolution of 48 * 48. It covers seven expression categories: angry, disgusted, fearful, happy, sad, surprised, and neutral. The data set contains incorrect labels, some images without faces, and some occluded faces, so human recognition accuracy on it is only about 65% (±5%). However, because Fer2013 is more complete than other current expression data sets and closer to daily-life scenarios, this experiment chose FER-2013. Table 1 shows enlarged examples of the various 48 * 48 pixel expression images.

Table 1. The example of Fer2013 expression.

The CK+ data set is an extension of the CK data set and is dedicated to facial expression recognition research. It includes 123 participants and 593 picture sequences, and the last frame of each sequence carries an expression label. It covers the common expressions, whose number is consistent with the FER-2013 data set; examples are shown in Table 2.

Table 2. The example of CK+ expression.

3.3 Result

The accuracy of the experimental results is shown in Fig. 3 and 4.

Fig. 3.
figure 3

Accuracy of different models on Fer2013 dataset

It can be seen from Fig. 3 that, on the Fer2013 data set, VGG16 [17] has the lowest recognition rate, with a peak accuracy of about 64%. LWCNN reaches accuracy similar to, and in certain periods higher than, the other three models. Over the last 30 epochs, the average accuracy of LWCNN was 74.5%, that of InceptionV3 [18] 73.3%, ResNet50 [19] 73.8%, and DenseNet121 [20] 74.1%. The LWCNN model therefore achieves higher accuracy on Fer2013 than the other classic models.

Fig. 4.
figure 4

Accuracy of different models on CK+ dataset

It can be seen from Fig. 4 that, on the CK+ data set, because the training data is cleaner than Fer2013, every model begins to plateau from about the 50th epoch. The average accuracies over the last 30 epochs are approximately: VGG16 90.4%, InceptionV3 92.2%, ResNet50 94.6%, DenseNet121 95.9%, and LWCNN 97.8%. The accuracy of the LWCNN model on CK+ also exceeds that of the classic models.

Finally, comparing the size and parameter count of each model in the experiment shows that the improved model not only significantly reduces model size and parameter count but also improves the recognition rate to a certain extent, as shown in Table 3.

Table 3. Comparison table of each model

4 Conclusion

This paper largely follows previous lightweight model designs but retains the single main calculation channel of the convolutional neural network, with no redundant parallel calculation branches. It focuses on optimizing the combination of the neural network with the attention mechanism, adding the attention mechanism to the main network as a component, an idea borrowed from the residual structure.

However, the current lightweight model integrates only two of the three design ideas. Integrating the third design idea into the model will require further research and learning, and future work will focus on this direction.