Abstract
Given the particular characteristics of micro-expressions, training and recognizing them with a large convolutional neural network model alone wastes resources and leaves parameters redundant. This paper therefore proposes a lightweight model for micro-expression recognition, aiming to reduce model size and parameter count while improving accuracy. The method uses mini-Xception as the backbone and adds Non-Local Net and SeNet as parallel auxiliary feature extractors to enhance feature extraction. Simulation experiments on the two public data sets FER-2013 and CK+ reach accuracies of 74.5% and 97.8% respectively after a certain number of training cycles, slightly exceeding commonly used classical models. This shows that the improved lightweight model achieves higher accuracy with fewer parameters and a smaller model size than large convolutional network models.
Keywords
- Facial expression recognition
- Deep learning
- Convolutional network
- Attention mechanism
- SeNet
- Non-local net
- Xception
1 Introduction
Since the beginning of this century, with the rapid development of deep learning [1], image recognition technology [2] has entered a golden age, and a succession of improved convolutional neural network models [3] has repeatedly set new accuracy records. Expression recognition covers both static and dynamic images: static image recognition works on a single picture, while dynamic image recognition operates on video sequences. For now, however, most research still focuses on static images.
The development of facial expression recognition can be divided into three stages: from early hand-designed feature extractors (LBP [4], LBP-TOP [5]), to shallow learning (SVM [6], AdaBoost [7]), to today's deep-learning-based methods [8]. Each stage addresses the limitations of the previous one. Traditional hand-designed feature extractors depend heavily on manual design, so their generalization, robustness, and accuracy are somewhat limited. Shallow learning removes much of the manual intervention but still falls short in accuracy. With the development of computer hardware, facial expression recognition based on deep learning has gradually overcome the accuracy limitations of shallow learning.
2 LWCNN
2.1 Related Work
Deep learning is now a relatively mature field, but to further improve image recognition accuracy, researchers have begun improving deep neural networks from other directions: improved activation functions [9], attention mechanisms added to the network [10], and auto-encoding layers [11] have all brought significant progress. These ideas have improved not only image classification but also facial expression recognition rates. A side effect, however, is that stacking more and more structures makes convolutional networks increasingly bloated; redundant parameters and complex calculations waste computing resources. Many scholars have sought lightweight designs to overcome this. The literature [12] summarizes past lightweight convolutional networks into three categories: lightweight convolution structures, lightweight convolution modules, and lightweight convolution operations. A recent work [13] proposed a lightweight model combining an attention mechanism with a convolutional neural network; it merges the first two lightweight features, but its network contains multiple parallel computation branches, which increases computation cost.
The improvement in this paper is therefore to cut off the branch computation channels of the neural network model, retain the main computation channel, reduce the convolution kernel sizes, and add currently used detachable attention modules as auxiliary feature extractors that assist the main channel's learning.
2.2 Improved LWCNN
The lightweight model in this paper continues the approach of combining an attention mechanism with a convolutional neural network, but it strengthens parallel feature extraction and fusion, adds the Non-Local attention mechanism (Non-Local Net) [14], and reduces the parameter count of the main computation channel. In short, the model consists of a main computation channel and an attention-mechanism branch. The branch merges auxiliary feature information into the main channel while retaining the original main-channel features, similar to the idea of a residual structure, as shown in Fig. 1.
The SeNet [15] structure is used near the output of the network, and the Non-Local Net structure near the input. Non-Local Net at the input establishes feature connections between related regions of the image; SeNet then merges the features of the different channels before the output layer, after which the predicted value is computed.
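To make the Non-Local operation concrete, the following is a minimal numpy sketch of the embedded-Gaussian form from the Non-Local Net paper [14], operating on a flattened feature map. The weight matrices are random placeholders for illustration, not the trained weights of this model, and the channel-halving of the inner dimension is an assumption taken from the original paper.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, rng):
    """Embedded-Gaussian non-local operation on a flattened feature map.

    x: (N, C) array, where N = H*W spatial positions and C channels.
    Every position attends to every other position, so long-range
    feature connections across image regions are established.
    """
    n, c = x.shape
    ci = c // 2  # reduced inner channel dimension, as in the original paper
    w_theta, w_phi, w_g = (rng.standard_normal((c, ci)) * 0.1 for _ in range(3))
    w_z = rng.standard_normal((ci, c)) * 0.1
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T, axis=-1)   # (N, N) pairwise similarities
    y = attn @ g                             # aggregate features from all positions
    return y @ w_z + x                       # residual connection back to the input

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))             # e.g. a 4x4 feature map with 8 channels
out = non_local_block(x, rng)
print(out.shape)                             # (16, 8)
```

The residual addition at the end is what lets the block be dropped into the main channel without disturbing the original feature flow.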
The relevant calculation formulas are:

H(x) = Fscale(S, I(x))

Among them, H(x) represents the network mapping after the weighted fusion, S represents the feature weight values of the different channels, Fscale represents the weighted (channel-wise scaling) calculation, and I(x) represents the input from the previous layer, which can in turn be expressed as:

I(x) = f1(x) + f2(x)

where I(x) represents the total network mapping after summation, f1(x) represents the mapping computed by the ordinary convolutions on the main channel, and f2(x) represents the mapping computed by the Non-Local Net branch.
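This fusion can be sketched in a few lines of numpy: the two branch outputs are summed into I(x), SeNet squeezes I(x) into per-channel weights S, and Fscale rescales each channel. The weight matrices and the SE reduction ratio here are random placeholders for illustration, not the paper's trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_fuse(f1, f2, w1, w2):
    """Sketch of H(x) = Fscale(S, I(x)) with I(x) = f1(x) + f2(x).

    f1: main-channel convolution output; f2: Non-Local branch output,
    both of shape (H, W, C). SeNet squeezes I(x) to per-channel
    weights S and rescales each channel of I(x) by its weight.
    """
    i_x = f1 + f2                              # I(x): sum of the two branches
    s = i_x.mean(axis=(0, 1))                  # squeeze: global average pool -> (C,)
    s = sigmoid(np.maximum(s @ w1, 0.0) @ w2)  # excitation: FC -> ReLU -> FC -> sigmoid
    return i_x * s                             # Fscale: channel-wise reweighting

rng = np.random.default_rng(1)
c, r = 8, 2                                    # channels and an assumed SE reduction ratio
f1 = rng.standard_normal((4, 4, c))
f2 = rng.standard_normal((4, 4, c))
w1 = rng.standard_normal((c, c // r)) * 0.1
w2 = rng.standard_normal((c // r, c)) * 0.1
h = se_fuse(f1, f2, w1, w2)
print(h.shape)                                 # (4, 4, 8)
```

Because each weight in S lies in (0, 1), Fscale can only attenuate channels, never amplify them, which is what makes it a soft channel-selection step.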
The backbone computation channel of the model uses the Xception [16] model, but the convolution kernel sizes are optimized and the number of parameters is reduced. The hierarchy of the entire model is shown in Fig. 2.
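The parameter savings from Xception's depthwise separable convolutions can be checked with simple arithmetic: a standard k x k convolution needs c_in * c_out * k * k weights, while a separable one needs only a k x k depthwise pass plus a 1 x 1 pointwise pass. The 64 -> 128 layer below is an illustrative example, not a specific layer of this model.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise separable convolution, as used in Xception:
    a k x k depthwise pass followed by a 1 x 1 pointwise pass."""
    return c_in * k * k + c_in * c_out

# Example: a 3x3 layer mapping 64 -> 128 channels.
print(conv_params(64, 128, 3))            # 73728
print(separable_conv_params(64, 128, 3))  # 8768
```

Roughly an 8x reduction for this layer, which is why replacing ordinary convolutions with separable ones shrinks the model so sharply.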
In Fig. 2, the model is divided into three modules. The first module is Entry flow. Two ordinary convolutions are first applied to the image, and the output of the second convolution is copied twice: one copy serves as a residual connection added back after the MaxPooling layer, and the other is sent to NL-Net to establish the image's feature correlations and is added back before the MaxPooling layer. The main channel then continues with two separable convolutional layers with an activation layer in between. This series of operations is repeated 4 times, with the separable convolution kernels growing from (16, 3 * 3) to (32, 3 * 3), (64, 3 * 3), and (128, 3 * 3). Afterwards, SeNet fuses the features of the different dimensional channels to adjust the output channel feature values, and processing enters the second module, Middle flow.
In the second module, the activation layer precedes the separable convolutional layer, following the arrangement in the original Xception paper; this calculation is repeated 8 times before entering the third module, Exit flow.
In the third module, before the activation layer a copy of the input is sent to NL-Net for parallel processing. The activation layer is followed by a separable convolutional layer with 128 convolution kernels, and the next separable convolutional layer increases the number of kernels to 256.
Before the MaxPooling layer, the output channel features of the NL-Net branch are added to those of the 256-kernel separable convolutional layer; the sum then passes through the MaxPooling layer, and SeNet again merges the features of the different dimensional channels to adjust the feature values. The remaining operations then produce the final output.
3 Experiment
3.1 Configuration
Hardware environment: the CPU is an AMD 5800X, the graphics card an NVIDIA RTX 3060, and the memory DDR4 3200 MHz 32 GB.
Software environment: the operating system is Windows 10, the programming software is PyCharm, the Python version is 3.6, the Keras version 2.2.4, and the TensorFlow version 1.13.1.
Model parameters: the batch size is 64, the number of training cycles is 200, images from FER-2013 and CK+ are resized uniformly to 48 * 48, the initial learning rate is 0.0025, and the learning-rate decay factor is 0.1. The loss function is the multi-class log loss, the activation function is uniformly ReLU, and data augmentation uses Keras's built-in ImageDataGenerator.
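The multi-class log loss named above is the categorical cross-entropy over the seven expression classes. A minimal numpy sketch follows; the class indices and probabilities are arbitrary illustrative values, not experimental outputs.

```python
import numpy as np

def multiclass_log_loss(y_true, y_pred, eps=1e-12):
    """Multi-class log loss (categorical cross-entropy) averaged over
    a batch. y_true is one-hot with shape (N, K); y_pred holds
    predicted probabilities of shape (N, K) summing to 1 per row."""
    p = np.clip(y_pred, eps, 1.0)              # guard against log(0)
    return float(-(y_true * np.log(p)).sum(axis=1).mean())

# Two samples over the seven expression classes (indices chosen arbitrarily).
y_true = np.eye(7)[[3, 6]]
y_pred = np.full((2, 7), 0.05)
y_pred[0, 3] = 0.7                             # 0.7 mass on the true class
y_pred[1, 6] = 0.7
loss = multiclass_log_loss(y_true, y_pred)
print(round(loss, 4))                          # 0.3567, i.e. -ln(0.7)
```

Only the probability assigned to the true class enters the loss, so a confident correct prediction drives it toward zero while a confident wrong one blows it up.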
3.2 DataSet
The FER-2013 data set contains 28709 training samples, 3589 validation samples, and 3589 test samples; each sample image has a resolution of 48 * 48. It covers seven expression categories: angry, disgusted, fearful, happy, sad, surprised, and neutral. Because the data set contains incorrect labels (some images have no face at all, and others show occluded faces), human recognition accuracy on it is only about 65% (±5%). Nevertheless, FER-2013 is more complete than other current expression data sets and matches daily-life scenarios, so this experiment uses it. Table 1 shows examples of the various expressions as enlarged 48 * 48-pixel JPG images.
The CK+ data set is an extension of the CK data set, built specifically for facial expression recognition research. It includes 123 participants and 593 image sequences, and the last frame of each sequence carries a label. It covers the same common expression categories as FER-2013; examples are shown in Table 2.
3.3 Result
The accuracy of the experimental results is shown in Fig. 3 and 4.
It can be seen from Fig. 3 that, in the FER-2013 experiment, VGG16 [17] has the lowest recognition rate, peaking at about 64%. LWCNN matches or even exceeds the other three models over certain periods. Over the last 30 cycles, the average accuracy of LWCNN was 74.5%, of InceptionV3 [18] 73.3%, of ResNet50 [19] 73.8%, and of DenseNet121 [20] 74.1%. LWCNN therefore achieves higher accuracy on FER-2013 than the other classic models.
It can be seen from Fig. 4 that, in the CK+ experiment, because the training data is cleaner than FER-2013's, every model levels off from about the 50th cycle. The average accuracies over the last 30 cycles are approximately: VGG16 90.4%, InceptionV3 92.2%, ResNet50 94.6%, DenseNet121 95.9%, and LWCNN 97.8%. LWCNN's accuracy on CK+ also exceeds that of the classic models.
Finally, comparing the size and parameter count of each model in the experiment shows clearly that the improved model not only significantly reduces model size and parameter count but also improves the recognition rate to a certain extent, as shown in Table 3.
4 Conclusion
This paper largely follows earlier lightweight model designs, but retains only the main computation channel of the convolutional neural network, with no other redundant parallel computation branches. The focus is on optimizing the neural network model in combination with the attention mechanism, adding the attention mechanism as a component of the main network; this part draws on the idea of the residual structure.
However, the current lightweight model integrates only two of the three design ideas. Integrating the third into the model will require further research and learning, and future work will focus on this direction.
References
Han, X.H., Xu, P., Han, S.: Theoretical overview of deep learning. Comput. Era. 06, 107–110 (2016)
Zheng, Y., Li, G., Li, Y.: Survey of application of deep learning in image recognition. Comput. Eng. Appl. 55(12), 20–36 (2019)
Li, B., Liu, K., Gu, J., Jiang, W.: Review of the researches on convolutional neural networks. Comput. Era 4, 8–12+17 (2021)
Li, L., Yu, W.: Research on the face recognition based on improved LBP algorithm. J. Mod. Comput. 30(17), 68–71 (2015)
Lu, G., Yang, C., Yang, W., Yan, J., Li, H.: Micro-expression recognition based on LBP-TOP features. J. Nanjing Univ. Posts Telecommun. (Nat. Sci. Ed.) 37(06), 1–7 (2017)
Yan, Q.: Survey of Support Vector Machine Algorithms. China Information Technology and Application Academic Forum, Chengdu, Sichuan, China (2008)
Yi, H., Song, X., Jiang, B., Wang, D.: Selection of Training Samples for SVM Based on AdaBoost Approach. In: National Virtual Instrument Conference, Guilin, Guangxi, China (2009)
Lu, J.H., Zhang, S.M., Zhao, J.L.: Static face image expression recognition method based on deep learning. Appl. Res. Comput. 37(4), 967–972 (2020)
Zhang, H., Zhang, Q., Yu, J.: Overview of the development of activation function and its nature analysis. J. Xihua Univ. (Nat. Sci. Ed.) 06, 1–10 (2021)
Zhu, Z., Rao, Y., Wu, Y., Qi, J., Zhang, Y.: Research progress of attention mechanism in deep learning. J. Chin. Inf. Process. 33(06), 1–11 (2019)
Yuan, F.-N., Zhang, L., Shi, J.-T., Xia, X., Li, G.: Theories and applications of auto-encoder neural networks: a literature survey. Chin. J. Comput. 42(01), 203–230 (2019)
Ma, J., Zhang, Y., Ma, Z., Mao, K.: Research progress in lightweight neural network convolution design. Comput. Sci. Explor. 1–21 (2021)
Li, H., Li, J., Li, W.: A visual model based on attention mechanism and convolutional neural network. Acta Metrol. 42(07), 840–845 (2021)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. arXiv:1711.07971 (2017)
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42(8) (2020)
Chollet, F.: Xception: deep learning with depth-wise separable convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Comput. Sci. (2014)
Szegedy, C., Vanhoucke, V., Ioffe, S., et al.: Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016)
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
Huang, G., Liu, Z., Laurens, V., et al.: Densely connected convolutional networks. IEEE Computer Society (2016)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Cite this paper
Luo, L., He, J., Cai, H. (2022). The Method for Micro Expression Recognition Based on Improved Light-Weight CNN. In: Qian, Z., Jabbar, M., Li, X. (eds) Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications. WCNA 2021. Lecture Notes in Electrical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-19-2456-9_76
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2455-2
Online ISBN: 978-981-19-2456-9