1 Introduction

In recent decades, facial expression recognition (FER) has attracted great interest in the computer vision community. Many researchers have also tried to introduce FER into education and teaching, because teachers' emotions affect teaching quality and play an important role in classroom activities. Improving both the detection accuracy and the detection time of FER is the key problem to be solved before this technology can be put into practice. Traditional machine learning methods have been used for FER, mainly including hidden Markov models (HMMs), fuzzy mathematics, and Bayesian classifiers. Filntisis et al. [1] combined an HMM with an active appearance model (AAM) and showed that HMM interpolation could effectively classify facial expressions. Halder et al. [2] used fuzzy mathematics to reach a classification accuracy of 87.8%. Sebe et al. [3] proposed a facial expression classification method based on a Cauchy Bayesian classifier, which replaced the Gaussian distribution with a Cauchy distribution and achieved better classification results.

With the development of deep learning, convolutional neural networks (CNNs) such as Faster R-CNN [4], R-FCN [5], SSD [6], YOLO [7], VGGNet [8], and GoogLeNet [9] have achieved fruitful results in object detection tasks, and many researchers have applied them to FER. In 2018, He et al. [10] recognized facial expressions with VGGNet and reached an accuracy of 73% on the Fer2013 dataset. In 2020, Khanzada et al. [11] used a ResNet50 network with an SGD optimizer for face recognition, achieving a detection time of 40 ms and an accuracy of 75.8%. In 2021, Li et al. [12] proposed a face detection method for natural scenes based on an improved Faster-RCNN and used a ResNet-50 CNN to extract face features; the average accuracy on the Wider Face dataset was 89.0%, and each image took at least 100 ms. Although the above models achieved high accuracy in FER, their detection time could not meet the requirements of real-time recognition of teachers in class.

In 2016, Redmon et al. presented a unified, real-time detector, YOLO (You Only Look Once) [7], which has the advantages of low computational cost and fast recognition and has been applied to real-time target detection tasks such as tomato detection, fine-grained object detection, and real-time growth stage detection [13]. In 2021, Lawal et al. [14] proposed an improved YOLOv3 network for tomato detection, which obtained an accuracy of 99.5% and a detection time of 52 ms. In [15], a high-performance real-time fine-grained object detection network based on YOLOv4 was presented, and the detection time was reduced to 14.29 ms. Hence, YOLO is well suited to FER in dynamic, changeable real-time scenes. In 2021, Aung et al. [16] combined the YOLO algorithm with a VGG16 CNN to propose a face detection system with a detection accuracy of 95% and a detection time of 29 ms, which greatly improved the face detection speed in real-time video. Over time, YOLO evolved from v1 to v7 [7, 17, 18, 19], further improving both accuracy and detection time. YOLOv5 offers four models that can be flexibly selected according to the requirements.

In addition, many FER studies have focused on video emotion recognition, which uses 3D data combined with time sequence information. In 2016, Fan et al. [20] combined a recursive neural network with a C3D convolutional neural network for dynamic video emotion recognition, and the accuracy reached 59.02% on the AFEW dataset. In 2021, Liu [21] proposed a Capsule-LSTM network to better extract the time sequence information of video sequences for video facial expression recognition, and the accuracy reached 40.16% on the AFEW dataset. In 2022, Zhang [22] proposed a double-stream network structure for emotion recognition in complex scenes. According to current research, the accuracy of video FER is lower than that of image FER.

In addition, attention mechanisms (AMs) can focus attention on important information with high weights, ignore irrelevant information with low weights, and continuously adjust the weights to select the important information in different situations. In 2020, Qin et al. [23] applied AMs to facial key point detection and solved the problem of balancing network depth and detection time. In 2020, Kang et al. [24] proposed a CNN expression recognition method based on AMs, which effectively alleviated over-fitting, enriched the learning of facial expression features, and shifted attention to unobstructed facial areas rich in information.

Following the hybrid-algorithm ideas proposed by the above scholars, this paper proposes a real-time FER network that mainly places AMs in the Backbone of YOLOv5. The RAF-DB dataset was re-screened and labeled to eliminate poor-quality images and balance the remaining images. First, different AMs were added after each CBS module in the CSP1_X module of the YOLOv5 Backbone (called CSPA), respectively. Then, CAs were incorporated after the Focus (called FA), the CBS modules (called CBSA), and the SPP (called SA) in the Backbone of YOLOv5, respectively. In addition, we studied diverse networks by integrating FA, SA, CBSA, and CSPA with each other. The results showed that the CSPA network based on coordinate attention (CA) achieved the best accuracy of 77.1%, an increase of 3.5% over YOLOv5, with a detection time of 25 ms. Meanwhile, compared with Faster-RCNN, R-FCN, ResNext-101, DETR, Swin-Transformer, YOLOv3, and YOLOX, its accuracy increased by 5.7%, 4.3%, 3.6%, 3.2%, 3.1%, and 3.6%, and its detection time decreased by 8 ms, 4 ms, 21 ms, 17 ms, 15 ms, and 2 ms, respectively. Then, teachers' expression pictures with different genders, face sizes, and facial postures were tested. Finally, we built a real-time FER system based on the CSPA network to detect and analyze teachers' facial expressions through a camera and teaching videos.

2 Proposed models

2.1 Framework overview

The whole framework of the YOLOv5 model is illustrated in Fig. 1. The structure can be divided into three parts: Backbone, Neck, and Head, as shown in Fig. 1a. First, the Backbone extracts rich feature information from the input image. For images with different input sizes, Mosaic data augmentation and adaptive anchor box computation are used at the input end: the original images are uniformly scaled to a standard size by random scaling, cropping, and arrangement, and then fed into the Backbone network. The Backbone contains the Focus, CBS, CSP1_X, and SPP structures. The Focus module uses a slicing operation to split a high-resolution feature map into several low-resolution feature maps, as shown in Fig. 1b. The CBS module consists of convolution, batch normalization (BN), and the SiLU activation, as shown in Fig. 1c. Figure 1d shows the CSP1_X module, where X denotes the number of residual modules. The input of the CSP1_X module is divided into two branches to further extract information from the image: one branch passes through a CBS and then through the residual network to obtain sub-feature map 1, and the other branch passes through another CBS to obtain sub-feature map 2. Finally, sub-feature maps 1 and 2 are concatenated and fed into another CBS. Figure 1e shows the SPP module, which extracts and fuses high-level features. During fusion, max pooling is applied several times to extract as many high-level features as possible, which enlarges the receptive field with almost no loss of speed and helps align the anchor boxes with the feature layers. Second, the Neck network further extracts and fuses the image features output by the Backbone. It contains the CSP2_X, CBS, Upsample, and Concat structures. CSP2_X differs from CSP1_X only in that the residual network is replaced by 2 × X CBS modules, and Concat lets the model learn more features. Finally, the Head has three detection heads, which classify and locate the feature information output from the Neck and output the classification probability, confidence, bounding box, and other information of the detected target.
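To make the Focus slicing and CBS operations concrete, the following is a minimal PyTorch sketch of these two blocks, mirroring the public YOLOv5 implementation; the channel sizes in the usage example are illustrative assumptions.

```python
# Minimal sketch of the CBS (Conv-BN-SiLU) and Focus blocks described above.
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice a high-resolution map into four lower-resolution maps, then fuse them with a CBS."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.cbs = CBS(c_in * 4, c_out, k)

    def forward(self, x):
        # Take every other pixel in each spatial direction: (N, C, H, W) -> (N, 4C, H/2, W/2)
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.cbs(torch.cat(patches, dim=1))

# Example (assumed sizes): a 640x640 RGB input becomes a 32-channel 320x320 feature map
feat = Focus(3, 32)(torch.randn(1, 3, 640, 640))  # torch.Size([1, 32, 320, 320])
```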

Fig. 1 The framework of the YOLOv5 model

2.2 Attention mechanisms

According to the structural characteristics of YOLOv5, we introduce AMs into the Backbone to improve the feature extraction capability and thus the performance of the model. This paper adopts squeeze-and-excitation (SE) [25], efficient channel attention (ECA) [26], the convolutional block attention module (CBAM) [27], and coordinate attention (CA) [28]. SE is a channel attention mechanism: the input feature map is first globally average-pooled and the channels are compressed directly; each channel is then given a weight, so each feature map receives a different weight and more useful features are attended to. ECA uses 1-D convolution to efficiently realize local cross-channel interaction and extract the dependencies between channels. CBAM allocates attention in both the channel and spatial dimensions, which is called a mixed attention mechanism, and can further improve the representation ability of the network; to integrate more spatial information, CBAM uses global average pooling (GAP) and global max pooling (GMP) to compress the spatial information of the feature map. CA also combines spatial and channel attention. Moreover, CA decomposes spatial attention into two one-dimensional features and aggregates features along the two spatial directions (X and Y), respectively. This allows long-range dependencies to be captured in one spatial direction while retaining accurate position information in the other. The aggregated features are then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the representation of the objects of interest.
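As an illustration of how CA works, the following is a minimal PyTorch sketch of a coordinate attention block following the structure in [28]; the reduction ratio and the SiLU activation are assumptions (the original paper uses a hard-swish nonlinearity).

```python
# Minimal sketch of a coordinate attention (CA) block.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()                       # assumption; the original paper uses hard-swish
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Aggregate features along the two spatial directions separately
        x_h = x.mean(dim=3, keepdim=True)                        # (n, c, h, 1): pooled along W
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (n, c, w, 1): pooled along H
        y = torch.cat([x_h, x_w], dim=2)                         # (n, c, h + w, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Encode into a pair of direction-aware, position-sensitive attention maps
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w
```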

2.3 YOLOv5 with different AMs added

To study the effects of different AMs on the model, SE, ECA, CBAM, and CA are added after each CBS module in the CSP1_X of the Backbone (called CSPA), respectively, as shown in Fig. 2.
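The following is a hedged sketch of a CSPA-style block, reusing the CBS and CoordinateAttention sketches above and appending an attention block after each CBS of the CSP1_X structure; the channel split, the number of residual units, and whether attention is also inserted inside the residual units are assumptions.

```python
# Sketch of a CSP1_X-style module with attention after each CBS (CSPA).
# Assumes the CBS and CoordinateAttention classes defined in the earlier sketches.
import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBS(c, c, 1), CBS(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CSPA(nn.Module):
    def __init__(self, c_in, c_out, n=1, attn=CoordinateAttention):
        super().__init__()
        c_mid = c_out // 2
        self.branch1 = nn.Sequential(
            CBS(c_in, c_mid), attn(c_mid),          # CBS followed by an attention block
            *[Residual(c_mid) for _ in range(n)],   # X residual units
        )
        self.branch2 = nn.Sequential(CBS(c_in, c_mid), attn(c_mid))
        self.fuse = nn.Sequential(CBS(2 * c_mid, c_out), attn(c_out))

    def forward(self, x):
        # Concatenate the two sub-feature maps, then fuse with the final CBS + attention
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))
```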

Fig. 2 The structure chart of CSPA

2.4 YOLOv5 with the same AM added at different positions

Then, we study the effect of the AMs on different modules. The AMs are added after (a) the Focus module (called FA), (b) the CBS module (called CBSA), and (c) the SPP module (called SA), as shown in Fig. 3. Meanwhile, we integrate FA, CBSA, SA, and CSPA with each other to form diverse networks.
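Conceptually, FA, CBSA, and SA simply append a CA block to the output of an existing Backbone module. A minimal sketch, reusing the earlier CoordinateAttention class; the module names and channel counts in the commented examples are assumptions.

```python
# Wrap an existing backbone block so that its output passes through a CA block.
import torch.nn as nn

def with_attention(module: nn.Module, channels: int) -> nn.Sequential:
    """Append a CoordinateAttention block after an existing module."""
    return nn.Sequential(module, CoordinateAttention(channels))

# Example placements (channel counts are assumptions; SPP is the YOLOv5 spatial pyramid pooling block):
# FA   = with_attention(Focus(3, 32), 32)
# CBSA = with_attention(CBS(32, 64, 3, 2), 64)
# SA   = with_attention(SPP(512, 512), 512)
```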

Fig. 3 The structure charts of a FA, b CBSA, and c SA

3 Experiments

3.1 Experimental dataset

RAF-DB [29] is a real-scene database containing 29,672 facial images downloaded from the Internet. The dataset is highly diverse, with facial expressions of different ages, genders, races, postures, and scenes. Such a rich dataset generalizes well and can greatly enhance the robustness of the network. It contains two sub-datasets: a subset with single-expression labels, including six basic expressions (surprise, fear, disgust, happiness, sadness, anger) and the neutral expression, and a subset with 12 kinds of compound expressions.

The dataset is rich in information but also contains many low-quality images. Therefore, we re-screened the RAF-DB dataset, keeping only high-quality images, and balanced the number of images in each category. Figure 4a shows the re-screened RAF-DB dataset, with about 320 images randomly selected for each category. The dataset was then converted into the VOC2007 format, and we used LabelImg to mark the labels, as shown in Fig. 4b.
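The following is a hedged sketch of this preparation step: randomly keeping about 320 screened images per class and writing VOC2007-style annotation files. The directory layout, file names, and the programmatic box writing are assumptions; in the actual workflow the face boxes were drawn manually with LabelImg.

```python
# Sketch of per-class sampling and VOC2007-style annotation writing (paths are assumptions).
import random
from pathlib import Path
from xml.etree import ElementTree as ET

CLASSES = ["surprise", "fear", "disgust", "happiness", "sadness", "anger", "neutral"]

def sample_per_class(image_dir: Path, per_class: int = 320) -> dict:
    """Randomly select roughly the same number of screened images for each class."""
    selected = {}
    for cls in CLASSES:
        images = sorted((image_dir / cls).glob("*.jpg"))
        selected[cls] = random.sample(images, min(per_class, len(images)))
    return selected

def write_voc_xml(img_path: Path, cls: str, box, size, out_dir: Path) -> None:
    """Write a minimal VOC2007 annotation for one image with one face box (xmin, ymin, xmax, ymax)."""
    w, h = size
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = img_path.name
    sz = ET.SubElement(ann, "size")
    ET.SubElement(sz, "width").text = str(w)
    ET.SubElement(sz, "height").text = str(h)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = cls
    bb = ET.SubElement(obj, "bndbox")
    for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bb, tag).text = str(val)
    out_dir.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(ann).write(out_dir / f"{img_path.stem}.xml")
```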

Fig. 4 a The category distribution of the re-screened RAF-DB dataset and b the LabelImg tool

3.2 Implementation details

We used Linux as the operating system, JupyterLab as the training environment, and PyTorch as the open-source deep learning framework. The Hengyuanyun server provided an NVIDIA GeForce RTX 3090 GPU, whose high computing power accelerated PyTorch and ensured efficient training of the YOLOv5 model on a large volume of data. The learning rate was 0.01, the number of epochs was 300, and the batch size was 16. The SGD optimizer, the default in YOLOv5, was adopted, and the learning rate was decayed dynamically with a cosine annealing strategy.
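For reference, a minimal sketch of this optimization setup in PyTorch is shown below; the momentum and weight-decay values are the usual YOLOv5 defaults and are assumptions here.

```python
# SGD with lr = 0.01 and cosine annealing over 300 epochs, batch size 16.
import torch

def build_optimizer(model, epochs: int = 300, lr: float = 0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.937, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Typical loop skeleton: step the scheduler once per epoch
# for epoch in range(300):
#     train_one_epoch(model, loader, optimizer)   # batch size 16
#     scheduler.step()
```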

3.3 Evaluation index

In the experiments, we judged the FER performance of a model by comparing its detection time, accuracy, and mAP@0.5. The detection time refers to the time needed to detect one frame.

The accuracy refers to the ratio of the number of correctly classified samples to the total number of samples, as shown in Eq. (1). The precision of a class refers to the ratio of the number of samples correctly predicted as this class to the number of samples predicted as this class, as shown in Eq. (2). The recall of a class refers to the ratio of the number of samples correctly predicted as this class to the total number of samples of this class, as shown in Eq. (3).

$$\text{Accuracy} = \frac{n_{\text{correct}}}{n_{\text{total}}} \quad (1)$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \quad (2)$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad (3)$$

where \(n_{\text{correct}}\) is the number of correctly classified samples and \(n_{\text{total}}\) is the total number of samples. True positives (TP) is the number of samples correctly identified as this class, false positives (FP) is the number of samples of other classes incorrectly identified as this class, and false negatives (FN) is the number of samples of this class incorrectly identified as other classes.
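A small illustrative sketch of Eqs. (1)-(3) computed from predicted and true class labels:

```python
# Accuracy, per-class precision, and per-class recall from label lists.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: accuracy(["happy", "sad"], ["happy", "happy"]) == 0.5
```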

IoU is the ratio of the intersection to the union of the predicted box and the ground-truth box. With the IoU threshold set to 0.5, we compute the average precision (AP) of each category over all pictures; AP is the area under the precision–recall curve. The mean average precision (mAP) is then the average of the APs over all categories.
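A short sketch of the IoU computation underlying the mAP@0.5 threshold; the (xmin, ymin, xmax, ymax) box convention is an assumption.

```python
# Intersection over union of two axis-aligned boxes.
def iou(box_a, box_b):
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction counts toward AP under mAP@0.5 only when iou(pred_box, gt_box) >= 0.5
```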

4 Results and discussion

4.1 Analysis of experimental results

Here, we mainly compared the performance of some mainstream networks, such as Faster-RCNN, R-FCN, ResNext-101, DETR, Swin-Transformer, YOLOv3, YOLOX, and YOLOv5. The results are shown in Table 1, including the model size, parameters, accuracy, mAP@0.5, and detection time.

Table 1 Experiments with different models on the RAF-DB dataset

The results showed that Swin-Transformer achieved the best accuracy among these networks, 74.0%. The accuracy of YOLOv5 was 0.4% lower than that of Swin-Transformer, but its detection time was the shortest, 25 ms faster than Swin-Transformer. Hence, to meet the requirement of real-time recognition in teachers' classrooms, YOLOv5 was used as the baseline model, and different attention mechanisms were introduced into its Backbone.

4.2 Experiments of different AMs

We added SE, ECA, CBAM, and CA after each CBS in the CSP1_X module of the Backbone to constitute the CSPA structure, as shown in Fig. 2. The performance of YOLOv5, CSPA-SE, CSPA-ECA, CSPA-CBAM, and CSPA-CA on the RAF-DB dataset is reported in Table 2. Compared with YOLOv5, the accuracy and mAP@0.5 of the networks with AMs increased, while the model size, number of parameters, and detection time also increased. CSPA-CA achieved the best accuracy of 77.1% and mAP@0.5 of 83.4%, increases of 3.5% and 1.6% over YOLOv5, respectively. Its detection time was 25 ms, 10 ms slower than that of YOLOv5. This was because CA attends to both the channel and spatial dimensions and uses two pooling kernels to compress the spatial information of the feature maps horizontally and vertically, respectively, which improves the feature extraction ability. The proposed model also outperformed the other detection methods: compared with Swin-Transformer, its accuracy improved by 3.1% and its detection time was reduced by 15 ms.

Table 2 Model experiments based on YOLOv5 and AMs added in the CSP1_X module on RAF-DB dataset

4.3 Experiments of CA at different positions

According to Table 2, the network with CA achieved the best accuracy. We then conducted a series of ablation experiments to study the effects of AMs on different modules in the Backbone of YOLOv5. First, we integrated CA into different modules of the Backbone, namely Focus, CBS, and SPP, as shown in Fig. 3. Table 3 shows the performance of YOLOv5 and of the SA, FA, CBSA, and CSPA networks. Compared with YOLOv5, the accuracy of the four networks increased by 1.3%, 1.7%, 2.1%, and 3.5%, respectively, and the mAP@0.5 increased by 0.3%, 0.5%, 0.7%, and 1.6%, respectively. The CSPA network again achieved the best accuracy and mAP@0.5. This was because CA combines spatial and channel attention: the features are abstracted into a series of attention weights by GAP and GMP, the relationships between these weights are established, and the weights are then applied to the original spatial or channel features. In terms of position, when CA followed the Focus module it extracted features from the RGB image, obtaining attention weights for only three channels and spatial attention over the whole image size; the number of channels was small and the spatial feature map was too large to focus on specific features, which increased the useless information. When CA was added after the SPP, the number of channels was too large, which easily caused over-fitting; meanwhile, SA was closer to the classification level, so its attention easily affected the decisions of the classification level. Therefore, considering both channel and spatial information, CA was most suitable for the middle layers of the network, where the number of channels and the spatial size are moderate. Moreover, CSPA retained the residual structure, which increased the gradient values of back propagation between layers and avoided the vanishing gradients caused by deepening the network, so that finer-grained features were extracted without network degradation. Therefore, given the location and advantages of CSP1_X, the combination of CSP1_X and CA was the best.

Table 3 Results of CAs in the Focus, CBS, SPP, and CSPA on RAF-DB dataset

Second, we integrated SA, FA, CBSA, and CSPA with each other to constitute diverse networks. Table 4 shows the performance of YOLOv5 and of the FA + SA, FA + CBSA + CSPA + SA, CSPA + SA, FA + CSPA, and CBSA + CSPA networks. Compared with YOLOv5, the accuracy of the five networks increased by 1.5%, 1.7%, 2.2%, 2.9%, and 3.2%, respectively, and the mAP@0.5 increased by 0.3%, 0.6%, 0.2%, 0.9%, and 1.2%, respectively. We found that when CSPA was combined with other modules, the performance degraded: expanding the attention range increased the number of useless features, which weakened the learning ability of the model, and the additional parameters caused over-fitting. In addition, the detection time of these models increased compared with YOLOv5. In view of the above experiments, our model (CSPA) achieved the best accuracy.

Table 4 Integrated results on RAF-DB dataset

5 Teachers' facial expression analysis

5.1 Teachers’ facial expression analysis by images

Considering both the accuracy and the detection time, we selected YOLOv5, CBSA + CSPA, and CSPA to verify the actual recognition effect on teachers' expression pictures with different genders, face sizes, and face postures. Figure 5 shows the qualitative results. CSPA made no errors, as shown in Fig. 5c. CBSA + CSPA made one error, on the third picture, as shown in Fig. 5b. YOLOv5 misclassified two pictures: the first picture was identified as sadness and the third as disgust, as shown in Fig. 5a.

Fig. 5 The qualitative results under the networks of a YOLOv5, b CBSA + CSPA, and c CSPA

5.2 Real-time teachers’ facial expression analysis

Then, the real-time FER system based on the CSPA network was designed to detect teachers' classroom expressions through the camera and local teaching videos, as shown in Fig. 6. We recorded a 1.5-min simulated teaching video of a freshman advanced mathematics guidance course for detection. The recognition results were logged to the corresponding window in real time and saved in a local folder. Finally, the system displayed the expression distribution over time with a scatter chart and a pie chart, as shown in Fig. 7. The scatter chart shows the real-time expression distribution over time obtained by analyzing the teacher's video: the abscissa indicates the detection time and the ordinate indicates the expression, as shown in Fig. 7a. From the scatter points, we can easily find the expression at every moment. However, no facial expressions were recognized from 28 to 40 s and from 50 to 65 s, because no face was present during these periods. The pie chart clearly shows the real-time proportion of each expression, as shown in Fig. 7b; from it, we can clearly see the proportion of each expression in the 1.5-min teaching video. For this teacher, most of the expressions were neutral, happiness, and surprise, with no expressions of disgust, anger, sadness, or fear.
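The following is a hedged sketch of such a pipeline: frames are read from a camera or a local teaching video with OpenCV, each frame is passed to the trained detector, the result is logged with a timestamp, and the per-expression counts are accumulated for the pie chart. The detect() callback stands in for CSPA inference and is a placeholder assumption.

```python
# Sketch of the real-time analysis loop; detect(frame) -> expression label or None is assumed.
import time
from collections import Counter

import cv2

def analyze_stream(source=0, detect=None, log_path="fer_log.txt"):
    """source may be a camera index or a video file path."""
    cap = cv2.VideoCapture(source)
    counts, start = Counter(), time.time()
    with open(log_path, "w") as log:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            label = detect(frame) if detect else None   # None when no face is visible
            if label is not None:
                counts[label] += 1
                log.write(f"{time.time() - start:.1f}s\t{label}\n")
    cap.release()
    return counts   # per-expression counts, e.g. for the pie chart
```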

Fig. 6 The real-time FER system

Fig. 7 Facial expression distribution shown by a the scatter chart and b the pie chart

6 Conclusion

In this paper, we proposed a real-time teachers' expression recognition network based on YOLOv5 and AMs. We studied the effects of different AMs on the CSP1_X module and of CA on different modules in the Backbone of YOLOv5. The results showed that the network in which CA was incorporated after each CBS module of the CSP1_X module (CSPA) achieved the best accuracy of 77.1% and mAP@0.5 of 83.4% on the RAF-DB dataset, increases of 3.5% and 1.6% over YOLOv5, respectively, with a detection time of 25 ms. The proposed model outperformed other detection methods, including Faster-RCNN, R-FCN, ResNext-101, DETR, Swin-Transformer, YOLOv3, and YOLOX. Finally, a real-time teachers' facial expression recognition system was designed based on CSPA to detect and analyze the distribution of teachers' facial expressions over time through the camera and teaching videos. However, the method analyzed teachers' expressions only through images, ignoring other information contained in real-time video, such as time sequence information, audio, and text. In addition, teachers' facial expression recognition still faces great challenges, such as the lack of datasets and the complexity of teachers' expressions. In future work, we will enrich the coverage of teachers' expressions and build a dedicated teacher dataset, and we will analyze teachers' expressions from multiple aspects, including the audio, text, and posture of teaching, to improve the recognition accuracy.