1 Introduction

With the growth of artificial intelligence and advances in computer vision and multimedia, automatic recognition of facial expressions has become an important area of research. Although the interpretation of facial expressions has been a longstanding topic in cognitive science and psychology, the development of computational models is comparatively recent. Part of the reason is that cognitive science and psychology researchers emphasize understanding the syntax and semantics of expressions, a problem that is highly complex and lacks consensus, whereas the aim of automatic recognition by machines is to classify expressions with high accuracy, often into a small set of discrete categories.

Facial expressions play a vital role in communication: they convey innermost feelings and emotions across different situations. Human–computer interaction can be more effective if machines can recognize and respond to human facial expressions. Achieving this requires computational models capable of interpreting facial expressions automatically, which has motivated researchers in computer vision and machine learning to model human expressions and to create more relevant and effective applications.

Facial expression has played a significant role in human social interaction for centuries. As one of the essential forms of non-verbal communication, it provides an important vehicle for expressing ideas and manifests a person's personality. It also conveys several important pieces of information in communication, such as the emotional and mental state, intention, and attitude of the communicating person. Several research fields have tried to replicate facial expression identification on computational platforms [1]. Simulating the human face and describing various facial expressions have long been subjects of interest in computer graphics and animation, while automatic recognition of facial expressions has been studied in pattern recognition and analysis. The recent rapid advancement of machine learning and artificial intelligence has provided an impetus to develop more realistic and reliable facial expression recognition systems for applications in affective computing, human–computer interaction (HCI), and social signal processing. Such intelligent systems can play a key role in bridging the gap between humans and machines, helping machines understand human behavior and respond more appropriately to the user's needs [2]. They enable more realistic and natural human-like interfaces for a variety of system simulations and game applications, and they can make the virtual and robotic agents now being developed for many applications more useful. In medical and mental health settings, such a system can be used to automatically assess emotional state and to help recognize various psychological disorders. It is therefore necessary to describe the foundations, evolution, and applications of facial expression recognition in affective computing and related fields, as these define the features and attributes on which intelligent facial expression recognition systems have been, and will continue to be, developed [3].

In educational environments, real-time analysis of learners' facial expressions is used to identify different emotions. Facial expressions, which reflect underlying emotional impulses through physical muscle movements such as raised eyebrows or wrinkled foreheads, provide valuable insights [3]. Automated observation of these expressions offers significant insight into online learners' emotional states. For instance, examining the images in Fig. 1 allows individuals' emotional states to be estimated from their expressions.

Fig. 1

Different states of mind of human beings. Happiness, sadness, contempt, surprise, fear, disgust, and anger

Recognition of facial expressions is a challenging problem because facial expressions are subtle, complex, and often ambiguous, and they vary from person to person. They are also influenced by culture, gender, and individual differences, and are often inconsistent even for the same emotion. This makes it difficult to obtain reliable and valid data for training and testing automatic facial expression analysis systems. Moreover, there is no widely accepted 'gold standard' for measuring and judging facial expressions, and manual judgment by human experts is often used as a surrogate; however, this can be subjective and suffers from reliability issues. The disparity between automatic and manual analysis in predicting behavior and classifying expressions has also led to poor comparability between different systems and techniques. This makes the evaluation of new systems problematic, and comparisons with other systems in the literature are rare, slowing progress in the field.

Another issue is the recognition of subtle differences between expressions, which may vary along a number of dimensions and be displayed with different intensities. Automatic recognition systems often focus on basic recognition of the six 'universal' emotions (happiness, sadness, anger, fear, disgust, surprise), which are shown in Fig. 2 and can each be displayed in a number of different ways. However, there are many more expressions than these basic six, and research continues on identifying further emotions and how they are displayed [1, 4]. The ability of a system to differentiate between these emotions and identify them correctly has obvious implications for human–computer interaction and the development of empathetic responses in machines. It is also important for the identification of emotional disorders, which are often linked to abnormalities in the expression of certain emotions.

Fig. 2

illustrates six facial expressions encompassing a range of emotions: anger, disgust, fear, happiness, neutrality, and sadness. Original material from MPI FACES database (Ebner et al., 2010). https://doi.org/10.1371/journal.pone.0257740.g001

Monitoring student learning through facial expression recognition represents a burgeoning area of research at the intersection of education, artificial intelligence, and affective computing. With the increasing prevalence of online education platforms, there is a growing need for innovative methods to gauge students' engagement, emotions, and learning behaviors remotely. Facial expression recognition technology offers a promising solution by harnessing computer vision techniques to analyze students' facial cues in real-time during online classes. This approach enables educators to gain valuable insights into students' cognitive states and adapt their teaching strategies accordingly, fostering a more personalized and effective learning experience. Recent studies have demonstrated the feasibility and potential of facial expression recognition in monitoring student learning, highlighting its role in providing actionable feedback, improving student outcomes, and enhancing the overall quality of online education [5, 6]. As advancements in machine learning algorithms and sensor technology continue to evolve, facial expression recognition holds promise as a valuable tool for enhancing educational practices and promoting student success in the digital age.

In the context of online classrooms, capturing dynamic changes in facial expressions over time is crucial for understanding students' engagement, emotions, and learning behaviors. However, this task presents significant challenges due to the complexity and variability of facial expressions, as well as the dynamic nature of online learning environments. Traditional methods often struggle to effectively capture and interpret subtle changes in facial expressions, leading to gaps and limitations in monitoring student learning effectively. By applying the combined capabilities of ResNet50, CBAM, and TCNs, these challenges can be addressed comprehensively. ResNet50's powerful feature extraction capabilities enable accurate recognition and classification of facial expressions, while CBAM enhances the model's ability to focus on relevant facial regions and features. Additionally, TCNs provide the temporal modeling necessary to capture the evolving nature of facial expressions over time. Together, these techniques enable the development of a robust facial expression recognition system capable of continuously monitoring students' emotional states and engagement levels in real-time during online classes. By bridging the gap in facial expression analysis, this integrated approach empowers educators with valuable insights into student learning behaviors, facilitating timely interventions and personalized support to enhance the online learning experience.

Facial expression recognition over time in online classrooms poses several primary challenges that necessitate advanced solutions. Traditional methods often struggle to accurately interpret the dynamic nature of facial expressions, especially in real-time online environments. Challenges include variations in lighting conditions, facial occlusions, and the need to capture subtle changes in expressions over extended periods. Additionally, traditional models may lack the ability to prioritize relevant facial features and effectively model temporal dependencies. By applying the combined capabilities of ResNet50, CBAM, and TCNs, these challenges are effectively addressed. ResNet50's feature extraction prowess ensures precise capture of nuanced expressions, while CBAM optimizes feature relevance by focusing on key facial regions. TCNs, on the other hand, enable the modeling of temporal dynamics, facilitating the analysis of how expressions evolve over time. Together, these integrated capabilities provide a robust solution for capturing dynamic changes in facial expressions over time in online classrooms, thereby enhancing the accuracy and reliability of facial expression recognition in educational settings.

This study integrates ResNet50, CBAM, and TCNs to capture dynamic changes in facial expressions over time in online classrooms, with several key goals. First, it aims to enhance the accuracy and reliability of facial expression recognition, enabling more precise interpretation of students' emotional states and engagement levels. By leveraging ResNet50's feature extraction capabilities, the system can effectively capture subtle nuances in facial expressions. The incorporation of CBAM allows relevant facial features to be prioritized, further improving the model's performance, and TCNs enable the modeling of temporal dependencies, facilitating analysis of how facial expressions evolve over the course of online classes. Ultimately, the overarching goal is to provide educators with real-time insights into student learning behaviors, enabling them to tailor their teaching approaches and provide personalized support to enhance the overall online learning experience.

The key objectives of applying the combined capabilities of ResNet50, CBAM, and TCNs to capture dynamic changes in facial expressions over time in online classrooms can be summarized as follows:

  • Enhanced Emotion Recognition: Utilize ResNet50's powerful feature extraction capabilities to accurately recognize and classify facial expressions in real-time online classroom environments.

  • Improved Feature Relevance: Incorporate CBAM to dynamically adjust the importance of spatial and channel-wise features, enhancing the model's ability to focus on relevant facial regions crucial for emotion recognition.

  • Temporal Modeling: Integrate TCNs to capture temporal dependencies in sequential facial expression data, allowing the system to analyze dynamic changes in emotions over time during online classes.

  • Real-time Monitoring: Enable the system to monitor students' facial expressions and engagement levels continuously, providing educators with real-time insights into learning behaviors and facilitating timely intervention when needed.

  • Robustness and Adaptability: Develop a robust and adaptable facial expression recognition system capable of accurately interpreting subtle changes in facial expressions across diverse student demographics and online learning scenarios.

Overall, the primary objectives are to enhance emotion recognition accuracy, improve feature relevance, capture temporal dynamics, enable real-time monitoring, and ensure robustness and adaptability in online classroom environments.

The combined capabilities of ResNet50, CBAM (Convolutional Block Attention Module), and TCNs (Temporal Convolutional Networks) offer a unique contribution to capturing dynamic changes in facial expressions over time in online classrooms. Here's a summary of their unique contributions:

  • ResNet50: ResNet50, a deep convolutional neural network architecture, provides powerful feature extraction capabilities. Its deep structure allows for the extraction of hierarchical features from facial images, enabling the model to capture intricate details and nuances in facial expressions. This helps in accurately recognizing and categorizing different emotions expressed by students during online classes.

  • CBAM (Convolutional Block Attention Module): CBAM enhances the discriminative power of the model by incorporating attention mechanisms. It dynamically adjusts the importance of different spatial and channel-wise features within each convolutional block, allowing the network to focus on relevant facial regions and features crucial for emotion recognition. This attention mechanism helps in improving the model's interpretability and robustness to variations in facial expressions and backgrounds.

  • TCNs (Temporal Convolutional Networks): TCNs are specifically designed to capture temporal dependencies in sequential data, making them well-suited for analyzing dynamic changes in facial expressions over time. By integrating TCNs into the network architecture, the model can effectively capture the temporal evolution of facial expressions throughout online classes. This temporal modeling capability enables the system to recognize subtle changes in students' emotions and engagement levels, facilitating more accurate monitoring of learning behaviors in real-time.

In summary, the combined capabilities of ResNet50, CBAM, and TCNs synergistically enhance the facial expression recognition system's ability to capture dynamic changes in facial expressions over time in online classrooms. This comprehensive approach enables the system to accurately interpret students' emotions and engagement levels, thereby facilitating more effective monitoring of learning behaviors and providing valuable insights for educators to improve teaching strategies and student outcomes.

The rest of the manuscript follows this structure: Sect. 2 summarizes relevant literature on engagement detection in online learning. Section 3 discusses the datasets used. Section 4 elaborates on methodologies and our proposed system. Section 5 presents experimental results, including a comparative analysis. Finally, Sect. 6 concludes with remarks on findings and future research directions.

2 Related work

2.1 Cyberbullying in virtual classrooms

The surge in online education, particularly amplified by the COVID-19 pandemic, has introduced a myriad of challenges and opportunities for educators and students. Challenges include the digital divide, where not all students have equal access to the necessary technology and internet connectivity, alongside technological hitches and the struggle with engagement and motivation in the absence of face-to-face interaction. Additionally, social isolation and the complexity of ensuring assessment integrity in online settings pose significant obstacles. Conversely, online education presents opportunities such as flexibility in scheduling and location, access to a vast array of educational resources, personalized learning experiences through adaptive technologies, enhanced collaboration and communication among students and educators, and the potential for global networking and collaboration, transcending geographical boundaries. Balancing these challenges and leveraging the opportunities is essential for maximizing the potential of online education.

Cyberbullying presents a key obstacle in virtual classroom settings, involving various forms of online harassment like verbal attacks, threats, spreading misinformation, and excluding individuals from discussions [7, 8]. This type of aggression raises serious concerns for both educators and students, as it hampers the learning atmosphere and can lead to emotional distress for victims. Effectively tackling cyberbullying demands proactive steps to foster a secure and welcoming online learning environment, such as educating on digital citizenship, implementing explicit anti-cyberbullying policies, and establishing efficient reporting and response mechanisms to address instances of harassment.

Cyberbullying has a significant impact on students' academic performance, mental health, and interpersonal connections [9]. Therefore, it is crucial to establish effective strategies and mechanisms for identifying and preventing cyberbullying in virtual classrooms. Previous research in this field has employed diverse methodologies, such as behavior analysis, natural language processing, social network analysis, machine learning, and emotion recognition, to recognize and address potential instances of bullying that students may face in online learning environments. Furthermore, studies underscore the importance of educational interventions and increasing awareness among both educators and students regarding cyberbullying, aiming to foster a more secure and supportive online educational setting [10]. Facial expression detection holds significant importance in addressing cyberbullying within online classrooms. By analyzing facial expressions, educators and administrators can gain valuable insights into students' emotional states, potentially identifying instances of distress or discomfort caused by cyberbullying. Early detection through facial expression analysis enables prompt intervention, allowing educators to offer support to affected students and address bullying behaviors effectively. Additionally, integrating facial expression detection into online classroom platforms enhances the monitoring and prevention of cyberbullying, creating a safer and more supportive learning environment where students feel valued and protected.

2.2 Research of facial expression recognition and learning emotions in education

Recently, there has been an increasing emphasis on the difficulty of measuring student engagement in digital learning environments [11]. Detecting engagement has become a key research area because of its ability to track students' mental states, even without direct teacher oversight. This section seeks to examine different methods that have been devised to accurately identify student engagement. These methodologies are instrumental in improving the efficiency of online education by offering valuable insights into students' participation and enthusiasm, empowering educators to adjust their teaching approaches and support strategies accordingly.

The authors in [12,13,14] have utilized the movements of students' heads and eyes to understand their states in online learning environments. The study in [15] presents a system aimed at recognizing emotions and categorizing students' levels of interest and participation during offline classroom sessions. Through the analysis of head and eye movements, the authors seek to interpret students' emotional reactions and assess their engagement and involvement during face-to-face classroom interactions. This method holds promise for refining teaching techniques and enhancing the overall learning experience of students in conventional classroom settings.

In their cited work [16], the authors introduced an innovative approach to assess user engagement while performing reading tasks. They utilized facial expressions, eye movements, and mouse actions as primary indicators, alongside tracking keystrokes and webcam footage to monitor attention levels. Their emphasis was on geometric features rather than appearance-based ones, resulting in an accuracy rate of 75.5% using an SVM classifier. This method, which integrates various modalities and prioritizes geometric characteristics, presents a promising means of precisely measuring user engagement during reading tasks. Its potential application extends to the enhancement of educational tools and interfaces, facilitating more effective learning experiences.

The research cited in [17] delved into investigating the connection between blinking frequency, facial surface temperature, and the psychological behaviors exhibited by students within classroom learning settings. Through experiments conducted during YouTube lesson videos, researchers monitored the frequency of blinking eyes and incorporated students' feedback responses from e-learning sessions. Through the analysis of these physiological and behavioral markers, the researchers aimed to gain insights into how students' levels of engagement and emotional states are expressed during online learning activities. This study significantly adds to the comprehension of utilizing physiological cues to evaluate and improve student engagement and overall well-being in digital learning environments.

In [18], a novel approach to detecting eye status was introduced by the authors, utilizing Convolutional Neural Network (CNN) architecture as the foundation. Simultaneously, researchers in [19] proposed a predictive framework aimed at monitoring classroom activities using camera technology. This method involved capturing students' facial expressions and body postures via a Fly-eye lens camera positioned at the front of the classroom. Employing vision-based techniques, the system aimed to track head movements using multiple high-resolution cameras to assess engagement levels. Additionally, [20] presented an ensemble model tailored for engagement detection, which integrated both face and body tracking. This model predicted students' engagement levels using a cluster-based conventional model framework, augmented with heuristic rules. Collectively, these studies underscore the array of methodologies available for monitoring and evaluating student engagement within classroom settings, leveraging advanced technologies such as deep learning, camera systems, and ensemble modeling techniques.

In [21], authors employed students' facial expressions as the basis for developing a method of affective content analysis. Their findings revealed promising results, with an accuracy rate of 87.65%. Concurrently, researchers in [22] delved into the relationship between eye movements and diverse emotional states. Their observations indicated that pupil dilation tends to be more pronounced during negative emotions compared to positive ones, and the frequency of eye blinks is higher in negative emotional states than in positive ones. Moreover, [23] introduced a pioneering approach that combines facial expressions and body gestures to automatically gauge student engagement levels. This innovative technique holds potential for accurately assessing students' participation and interest in learning activities. Collectively, these investigations contribute to advancing our understanding and utilization of physiological and behavioral indicators, thereby enhancing the analysis of student affect and engagement within educational settings.

Previous investigations [24,25,26] have examined diverse methodologies aimed at assessing student engagement, which include analyzing facial characteristics, monitoring eyeball movements, evaluating eye gaze patterns, interpreting bodily gestures, tracking head movements, measuring body surface temperature, analyzing mouse behaviors, and observing keystrokes. Typically, these studies entailed scrutinizing stored videos and images of students within traditional classroom environments. Nevertheless, the current research marks a notable advancement by directing its focus toward real-time learning scenarios as opposed to in-person classroom settings [27,28,29]. This innovative approach introduces a real-time system for detecting student engagement, utilizing deep learning methodologies. Specifically, the present study centers on three principal modalities: eye blink patterns, head movements, and the recognition of facial emotions. Through the utilization of these modalities in real-time, the proposed system aims to offer prompt and accurate assessments of student engagement levels, thereby amplifying the effectiveness of educational interventions and support strategies.

In [30], the authors introduced a sophisticated system designed to detect engagement levels in e-learning environments. Leveraging multimodal facial cues and employing deep learning techniques, the proposed system offers a comprehensive solution for accurately assessing student engagement. By integrating various facial cues such as expressions and head movements, the system can effectively gauge students' levels of involvement and interest during online learning activities. Through experimental validation and evaluation, the paper demonstrates the efficacy of the proposed approach in detecting engagement levels in real-time, thereby enhancing the effectiveness of e-learning platforms. This research contributes significantly to the field of educational technology, providing a valuable tool for monitoring and improving student engagement in online learning contexts.

In [31], the authors addressed the evolving landscape of education in the wake of the COVID-19 pandemic. Focusing on blended learning environments, the study delves into the attitudes, emotions, and perceptions of both teachers and students. By modeling these factors, the paper seeks to provide insights into the challenges and opportunities inherent in blended education, particularly in the context of transitioning from emergency remote teaching to more sustainable post-pandemic educational practices. Through a comprehensive analysis of data collected from teachers and students, the study aims to inform the development of effective strategies and interventions to optimize blended learning experiences. This research contributes to the ongoing discourse on post-pandemic education, offering valuable insights for educators, policymakers, and researchers striving to navigate the complexities of contemporary educational landscapes.

In [32], the authors presented a comprehensive investigation into the challenging task of recognizing fine-grained facial expressions in real-world settings. By leveraging advanced techniques and methodologies, the study aims to address the complexities associated with accurately identifying subtle variations in facial expressions in diverse and uncontrolled environments. Through experimental validation and evaluation, the paper demonstrates the efficacy of the proposed approach in achieving robust and accurate facial expression recognition results. This research significantly contributes to the field of facial expression recognition, offering valuable insights and techniques for enhancing the performance of recognition systems in real-world applications.

In [33], the authors addressed the challenges posed by masked faces in facial expression recognition systems. The study introduces an automatic method that enhances the accuracy of recognizing facial expressions even when faces are partially covered by masks, a scenario increasingly prevalent due to public health measures such as mask mandates. By leveraging advanced techniques and algorithms, the proposed approach aims to effectively capture and analyze facial features that remain visible despite the presence of masks, thus improving recognition performance. Through experimental validation and evaluation, the paper demonstrates the effectiveness of the proposed method in achieving accurate facial expression recognition results, even in challenging masked conditions. This research contributes valuable insights and methodologies for developing more robust and reliable facial expression recognition systems in real-world applications.

In [34], the authors presented a novel approach for facial expression image classification and regression tasks. The study introduces a method that leverages knowledge distillation to achieve both speed and accuracy in facial expression analysis. By transferring knowledge from a complex, deep neural network (teacher model) to a simpler, more efficient model (student model), the proposed approach aims to maintain high performance while reducing computational overhead. Through experimental validation and evaluation, the paper demonstrates the efficacy of the method in achieving fast and accurate facial expression classification and regression results. This research contributes valuable insights and techniques for developing efficient and reliable facial expression analysis systems in various real-world applications.

In [35], the authors explored recent advancements in emotion recognition techniques utilizing image analysis and neural networks. The study delves into emerging trends and methodologies for accurately detecting and interpreting human emotions from visual data. By leveraging the capabilities of neural networks, the paper discusses innovative approaches that enhance the accuracy and efficiency of emotion recognition systems. Through a comprehensive review of recent research, the paper sheds light on the evolving landscape of emotion recognition technology, offering valuable insights into state-of-the-art techniques and future directions. This research contributes to the ongoing discourse on emotion recognition, providing researchers and practitioners with a comprehensive overview of recent trends and advancements in the field.

2.3 Research of learning emotions in image recognition and attention mechanism

In [36], the authors introduced a novel network architecture, termed the Knowledge-Guided Semantic Transfer Network (KSTNet), which aims to enhance recognition performance by incorporating auxiliary prior knowledge. The proposed network integrates vision inference, knowledge transfer, and classifier learning within a unified framework, facilitating optimal compatibility. By leveraging prior knowledge of category relationships, the network disseminates knowledge among all categories to learn semantic-visual mappings and infer knowledge-based classifiers for novel categories. Through experimental evaluation, the paper demonstrates the efficacy of KSTNet in achieving improved few-shot image recognition performance compared to existing approaches. This research significantly contributes to the advancement of few-shot learning methodologies, offering valuable insights and techniques for enhancing image recognition systems in scenarios with limited training data availability.

In [37], the authors presented a novel approach for few-shot fine-grained recognition. The authors introduce a method that leverages attention-guided pyramidal features to improve recognition accuracy in scenarios with limited training data. By incorporating attention mechanisms into the feature extraction process, the proposed approach enhances the discriminative power of the features, particularly in fine-grained classification tasks where subtle differences between classes are crucial. Through experiments and evaluations, the paper demonstrates the effectiveness of the proposed method in achieving superior performance compared to existing approaches in few-shot learning contexts. This research contributes valuable insights and techniques to the field of fine-grained recognition, offering promising advancements for addressing challenges related to limited training data availability.

In [38], the authors contributed to the field of few-shot fine-grained recognition by proposing a novel approach that addresses challenges related to background clutter and foreground alignment. By incorporating background suppression techniques and foreground alignment methods, the proposed framework aims to enhance the discriminative power of fine-grained recognition models in scenarios with limited training data. Through experimental validation and evaluation, the paper demonstrates the effectiveness of the proposed approach in improving recognition accuracy and robustness. This research adds valuable insights to the existing body of literature on few-shot learning methodologies, offering promising advancements for tackling challenges in fine-grained recognition tasks.

In [39], the authors contributed to the field of text-based person search by proposing a novel approach that focuses on suppressing image-specific information and incorporating implicit local alignment techniques. This method aims to improve the accuracy and efficiency of text-based person search systems by effectively aligning textual descriptions with corresponding images. By suppressing irrelevant image-specific details and emphasizing textual cues, the proposed framework enhances the discriminative power of text-based person search models. Through experimental evaluation, the paper demonstrates the effectiveness of the approach in achieving superior performance compared to existing methods. This research provides valuable insights and techniques for enhancing text-based person search systems, thereby advancing the state-of-the-art in this domain.

3 Datasets

Facial Emotion Recognition (FER) research relies on various datasets to train and evaluate machine learning models. These datasets are essential for developing accurate and robust facial emotion recognition systems. Here are some popular datasets used in FER research:

3.1 CK+ (Extended Cohn-Kanade):

  • Description: The CK+ dataset is one of the most widely used datasets in FER research. It contains posed facial expressions of emotions such as happiness, sadness, anger, surprise, disgust, and fear.

  • Size: CK+ includes 327 image sequences from 123 subjects, with each sequence capturing a participant's facial expression changes over time.

  • Annotations: Each image sequence is annotated with emotion labels, allowing for supervised learning approaches in FER model training [40].

3.2 FER2013:

  • Description: FER2013 is a large-scale dataset of facial images collected from the internet via automated web image searches. It contains seven emotion categories: anger, disgust, fear, happiness, sadness, surprise, and neutral.

  • Size: FER2013 comprises over 35,000 grayscale images of 48 × 48 pixels, with each image labeled with one of the seven emotion categories.

  • Annotations: The dataset provides categorical labels for each facial image, enabling supervised learning for FER models [41].

3.3 RAF-DB (Real-world Affective Faces Database):

  • Description: RAF-DB is a dataset containing static images of facial expressions collected from online sources. It includes a variety of emotion categories such as surprise, fear, disgust, happiness, sadness, and anger.

  • Size: RAF-DB consists of approximately 30,000 facial images, with each image labeled with one of the emotion categories.

  • Annotations: The dataset provides categorical emotion labels for each facial image, facilitating supervised learning in FER research [42].

3.4 The Karolinska Directed Emotional Faces (KDEF):

  • Description: The Karolinska Directed Emotional Faces (KDEF) dataset is a standardized collection of grayscale images portraying various facial expressions. It was developed by researchers at the Karolinska Institute in Sweden. The dataset aims to provide a comprehensive set of facial expressions representing basic emotions for research in FER.

  • Size: KDEF contains a total of 4900 images. These images feature 70 models (35 females and 35 males), each posing seven different facial expressions. Each expression is captured from five different camera angles in two separate photo sessions, yielding 70 images per model.

  • Annotations: The dataset is annotated with detailed information for each image. Annotations include the emotion depicted in the facial expression, such as anger, disgust, fear, happiness, sadness, surprise, or neutral. Additionally, annotations provide the identity of the model, gender, age, and perceived intensity of the expressed emotion. Emotions are labeled using numerical codes ranging from 1 to 7, corresponding to the seven basic emotions mentioned above [43].

4 Methodology

By following this step-by-step methodology, we aim to address the challenges of facial expression recognition in online classrooms while leveraging the combined capabilities of ResNet50, CBAM, and TCNs to capture dynamic changes in facial expressions over time. This approach holds the potential to enhance the quality of online learning experiences by enabling more accurate and robust facial expression recognition.

  • Step 1: dataset collection and preprocessing

    1.1 Data collection:

      • Gather a diverse dataset of facial expression images captured during online learning sessions using webcams or cameras integrated into devices.

      • Ensure the dataset covers a wide range of lighting conditions, camera qualities, and student postures commonly encountered in online classrooms.

    1.2 Data annotation:

      • Manually annotate facial expression labels (e.g., happy, sad, angry) for each image in the dataset.

      • Consider using automated tools or crowdsourcing platforms for efficient annotation, ensuring high-quality annotations.

    1.3 Data preprocessing:

      • Resize all images to a uniform resolution to mitigate the impact of varying camera qualities.

      • Apply histogram equalization to standardize lighting conditions across images.

      • Augment the dataset by applying transformations such as rotation, flipping, and cropping to increase sample diversity and robustness (a preprocessing sketch follows this list).
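As a concrete illustration of the preprocessing steps listed above, the following sketch shows one possible pipeline using OpenCV and torchvision. The target resolution, augmentation parameters, and use of ImageNet normalization statistics are assumptions for illustration, not settings prescribed by this study.

```python
import cv2
import numpy as np
from torchvision import transforms

TARGET_SIZE = (224, 224)  # assumed uniform resolution

def preprocess_frame(bgr_image: np.ndarray) -> np.ndarray:
    """Resize to a uniform resolution and standardize lighting via histogram equalization."""
    resized = cv2.resize(bgr_image, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    # Equalize only the luminance channel so that colour information is preserved.
    ycrcb = cv2.cvtColor(resized, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    return cv2.cvtColor(equalized, cv2.COLOR_BGR2RGB)  # convert to RGB for torchvision

# Training-time augmentation: rotation, flipping, and cropping, as listed above.
train_augment = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomRotation(degrees=10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(TARGET_SIZE[0], scale=(0.9, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```

At training time each frame would pass through preprocess_frame and then train_augment; at evaluation time only the resize, equalization, and normalization steps would be applied.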

The impact of dataset collection and preprocessing in online classrooms, particularly when leveraging technologies like ResNet50, CBAM, and TCNs to capture dynamic changes in facial expressions over time, is substantial. Here's how:

  • Quality of data: Dataset collection and preprocessing directly influence the quality of data used for training machine learning models. By ensuring that the collected dataset is comprehensive, diverse, and representative of the target population, educators and researchers can improve the accuracy and reliability of facial expression analysis.

  • Annotation and labeling: Preprocessing involves annotating and labeling the dataset, which is crucial for supervised learning tasks. In the context of capturing facial expressions, this may involve accurately labeling different emotions or expressions displayed by individuals during online classes. High-quality annotations contribute to better model performance and generalization.

  • Data augmentation: Preprocessing techniques such as data augmentation can help enhance the robustness and generalization ability of machine learning models. Techniques like flipping, rotation, and scaling can be applied to augment the dataset, ensuring that the models can effectively capture dynamic changes in facial expressions from various angles and perspectives.

  • Feature extraction: Preprocessing steps may involve feature extraction, where relevant features or attributes are extracted from raw data to represent facial expressions effectively. Techniques such as facial landmark detection or feature point tracking can be used to extract meaningful features before feeding the data into the machine learning models.

  • Normalization and standardization: Standardizing the dataset through normalization techniques ensures that the data is consistent and comparable across different samples. This step is crucial for mitigating variations in lighting conditions, camera angles, and individual differences in facial appearances, thereby improving the model's robustness.

  • Reduced bias and variance: Proper dataset collection and preprocessing help in reducing bias and variance in machine learning models. By ensuring that the dataset is balanced and free from biases, such as demographic biases or cultural biases, educators can develop more inclusive and equitable facial expression recognition systems.

  • Model performance and generalization: Ultimately, the impact of dataset collection and preprocessing is reflected in the performance and generalization ability of machine learning models. A well-curated dataset, combined with effective preprocessing techniques, contributes to more accurate, reliable, and interpretable models for capturing dynamic changes in facial expressions over time in online classrooms.

In summary, dataset collection and preprocessing play a crucial role in the successful implementation of technologies like ResNet50, CBAM, and TCNs for capturing facial expressions in online classrooms. These steps directly impact the quality, robustness, and generalization ability of machine learning models, thereby enhancing the overall effectiveness of facial expression analysis in educational settings.

  • Step 2: model architecture design and integration

    2.1 ResNet50 network model:

      • Implement the ResNet50 architecture, a deep convolutional neural network known for its effectiveness in image classification tasks.

      • Utilize pre-trained ResNet50 weights from ImageNet to leverage learned features.

    2.2 Convolutional Block Attention Module (CBAM):

      • Integrate the CBAM mechanism into the ResNet50 architecture to enhance feature representation by dynamically recalibrating channel-wise and spatial attention.

      • CBAM helps the model focus on informative facial regions despite varying lighting conditions and student postures through its integrated mechanism of channel-wise and spatial attention.

        • Channel-wise attention: CBAM dynamically recalibrates channel-wise attention, allowing the model to focus on relevant features across different channels. This helps in emphasizing discriminative facial characteristics regardless of lighting variations. For example, it might emphasize key facial landmarks or patterns indicative of specific expressions, irrespective of lighting changes.

        • Spatial attention: CBAM also incorporates spatial attention, enabling the model to selectively attend to important spatial regions within the facial image. This aspect assists the model in recognizing facial expressions accurately even when there are variations in student postures. For instance, it might prioritize facial regions such as the eyes, mouth, or eyebrows, which are crucial for interpreting expressions, irrespective of changes in posture.

By combining both channel-wise and spatial attention mechanisms, CBAM ensures that the model can adaptively focus on relevant facial features under varying conditions, thus enhancing its ability to recognize facial expressions accurately in an online classroom setting.

CBAM integrates the Channel Attention Mechanism (CAM) and the Spatial Attention Mechanism (SAM) in a sequential framework, leading to better results than models that rely on CAM alone. The structure of the CBAM network is shown in Fig. 3. In this illustration, F denotes the feature map obtained from the convolution layer, Mc(F) the resulting channel attention map, F′ the feature map obtained by multiplying F with Mc(F), Ms(F′) the spatial attention map, and F″ the feature map obtained by multiplying F′ with Ms(F′). Figure 4 illustrates the integration of the ResNet50 network within the Convolutional Block Attention Module (CBAM).

Fig. 3

illustrates the network structure of CBAM

Fig. 4

depicts the integration of the ResNet50 network within the Convolutional Block Attention Module (CBAM)
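To make this sequential CAM-then-SAM structure concrete, the sketch below implements the F → Mc(F) → F′ → Ms(F′) → F″ flow in PyTorch. The reduction ratio of 16 and the 7 × 7 spatial kernel follow common CBAM defaults and are assumptions here rather than values reported in this study.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel Attention Mechanism (CAM): produces Mc(F) from average- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))                  # global average pooling
        mx = self.mlp(f.amax(dim=(2, 3)))                   # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)     # Mc(F)

class SpatialAttention(nn.Module):
    """Spatial Attention Mechanism (SAM): produces Ms(F') from channel-wise statistics."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_prime):
        avg = f_prime.mean(dim=1, keepdim=True)
        mx = f_prime.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F')

class CBAM(nn.Module):
    """Sequential refinement: F -> F' = F * Mc(F) -> F'' = F' * Ms(F')."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, f):
        f_prime = f * self.ca(f)
        return f_prime * self.sa(f_prime)
```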

The network's performance is enhanced and overfitting is mitigated by modifying the down-sampled residual module of ResNet and integrating a CBAM attention mechanism. This updated network configuration, depicted in Fig. 4, refines facial expression characteristics by employing advanced residual block operations after the initial convolutional layer. Residual blocks with varying dimensions are utilized to deepen neural networks, facilitating the extraction of more detailed attributes and mitigating gradient vanishing. Additionally, the CBAM attention module improves both channel and spatial dimensions of relevant features, leading to faster convergence and enhanced recognition accuracy.
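One possible realization of this integration, reusing the CBAM module from the previous sketch, is to attach an attention block after each residual stage of a pretrained torchvision ResNet50. The placement after each stage, the use of ImageNet weights, and the seven-class output are illustrative assumptions, not the exact modified residual configuration described here.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50CBAM(nn.Module):
    """Pretrained ResNet50 backbone with a CBAM block after each residual stage (illustrative placement)."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")   # ImageNet-pretrained weights
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        self.cbams = nn.ModuleList([CBAM(c) for c in (256, 512, 1024, 2048)])  # CBAM as defined above
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.stem(x)
        for stage, cbam in zip(self.stages, self.cbams):
            x = cbam(stage(x))   # refine each stage's output with channel and spatial attention
        return self.fc(self.pool(x).flatten(1))
```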

    2.3 Temporal Convolutional Networks (TCNs):

    • Extend the ResNet50-CBAM architecture with TCNs to capture temporal dynamics in facial expressions over consecutive frames.

    • Implement 3D convolutions or 2D convolutions with temporal context to extract spatiotemporal features from sequential facial expression frames (a sketch of this temporal branch follows the list).
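A minimal sketch of the temporal branch is given below, assuming that per-frame embeddings from a spatial backbone (such as the ResNet50-CBAM model above with its classification head removed) are stacked into a sequence and processed by dilated 1-D temporal convolutions. The number of blocks, dilation schedule, and feature dimension are illustrative choices rather than the configuration used in this study.

```python
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One TCN block: dilated 1-D convolution over time with a residual connection."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (batch, channels, frames)
        return self.relu(x + self.conv(x))

class ExpressionSequenceModel(nn.Module):
    """Per-frame spatial features, a small TCN over time, then sequence-level classification."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 2048, num_classes: int = 7):
        super().__init__()
        self.backbone = backbone                # maps an image batch to (N, feat_dim) embeddings
        self.tcn = nn.Sequential(TemporalBlock(feat_dim, dilation=1),
                                 TemporalBlock(feat_dim, dilation=2),
                                 TemporalBlock(feat_dim, dilation=4))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):                   # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))      # (batch*frames, feat_dim)
        feats = feats.view(b, t, -1).transpose(1, 2)    # (batch, feat_dim, frames)
        temporal = self.tcn(feats).mean(dim=2)          # pool over the temporal axis
        return self.classifier(temporal)
```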

The impact of model architecture design and integration in online classrooms, particularly when leveraging combined capabilities of ResNet50, CBAM, and TCNs to capture dynamic changes in facial expressions over time, is significant. Here's how:

  • Model accuracy and performance: The design of the model architecture directly influences its accuracy and performance in capturing facial expressions. ResNet50, CBAM, and TCNs are state-of-the-art architectures known for their effectiveness in image recognition and temporal analysis. Integrating these architectures can lead to more accurate and reliable recognition of dynamic changes in facial expressions, enhancing the overall quality of analysis in online classrooms.

  • Feature extraction and representation learning: Each component of the combined architecture (ResNet50, CBAM, and TCNs) contributes to different aspects of feature extraction and representation learning. ResNet50 is proficient in extracting hierarchical features from images, CBAM enhances feature attention to relevant regions, and TCNs excel in temporal feature learning. Integrating these capabilities enables the model to capture both spatial and temporal dynamics of facial expressions, resulting in more comprehensive analysis.

  • Model interpretability: The design of the model architecture can impact its interpretability, i.e., how well humans can understand and interpret the model's decisions. ResNet50, CBAM, and TCNs are architectures with relatively high interpretability, allowing educators and researchers to gain insights into how the model recognizes and interprets facial expressions over time. This transparency is essential for building trust in the model's decisions and facilitating collaboration between humans and AI systems in educational contexts.

  • Scalability and efficiency: The architecture design also influences the scalability and efficiency of the model, particularly in online classrooms where real-time or near-real-time analysis is required. ResNet50, CBAM, and TCNs are designed to be scalable and efficient, making them suitable for processing large volumes of data in online educational environments. Efficient integration of these architectures ensures timely analysis of facial expressions without compromising performance.

  • Adaptability to variability: Online classrooms present variability in terms of lighting conditions, camera angles, and individual differences in facial expressions. The integrated model architecture should be adaptable to such variability to ensure robust performance across diverse scenarios. ResNet50, CBAM, and TCNs are designed to be robust to variations in input data, allowing the model to effectively capture dynamic changes in facial expressions under different conditions.

  • Generalization to new scenarios: A well-designed model architecture should generalize well to new scenarios and unseen data. By leveraging the combined capabilities of ResNet50, CBAM, and TCNs, the model can learn abstract representations of facial expressions that generalize across different individuals, expressions, and environmental conditions. This generalization ability is crucial for deploying the model in real-world online classrooms where new scenarios may arise.

In summary, model architecture design and integration play a crucial role in the effectiveness of facial expression analysis in online classrooms. By leveraging the combined capabilities of ResNet50, CBAM, and TCNs, educators and researchers can develop accurate, interpretable, scalable, and adaptable models for capturing dynamic changes in facial expressions over time, thereby enhancing the overall educational experience.

  • Step 3: model training and optimization

    3.1 Data splitting:

      • Divide the preprocessed dataset into training, validation, and testing sets, maintaining a balanced distribution of facial expressions and environmental conditions.

    3.2 Model training:

      • Train the integrated model (ResNet50-CBAM-TCN) using the training set with appropriate hyperparameters and optimization techniques.

      • Employ techniques such as stochastic gradient descent (SGD) with momentum and learning rate scheduling to optimize model convergence (a training-loop sketch follows this list).
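The optimization choices listed above can be sketched as a compact training loop. The learning rate, momentum, weight decay, scheduler step size, and the train_loader/val_loader objects are placeholders for illustration rather than hyperparameter values taken from this study; the split itself would be stratified by expression label to keep class distributions balanced, as described in Step 3.1.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, val_loader, epochs: int = 30, device: str = "cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    scheduler = StepLR(optimizer, step_size=10, gamma=0.1)   # decay the learning rate every 10 epochs

    for epoch in range(epochs):
        model.train()
        for clips, labels in train_loader:                   # clips: (batch, frames, 3, H, W)
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Quick validation pass to monitor convergence.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for clips, labels in val_loader:
                preds = model(clips.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch + 1}: val accuracy {correct / total:.3f}")
```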

Model training and optimization significantly influence the overall performance of the facial expression recognition system, particularly in the context of monitoring student learning in online classrooms and leveraging the combined capabilities of ResNet50, CBAM, and TCNs. Here's how:

  • Accuracy and precision: Effective model training and optimization directly impact the accuracy and precision of the facial expression recognition system. By fine-tuning the parameters of ResNet50, CBAM, and TCNs and optimizing the learning process through techniques such as gradient descent optimization and learning rate scheduling, the system can better capture subtle changes in facial expressions over time. This accuracy is crucial for accurately monitoring student learning and understanding their engagement levels during online classes.

  • Generalization across students: Model training and optimization ensure that the facial expression recognition system generalizes well across different students, regardless of individual variations in facial features, expressions, or learning styles. Through techniques such as data augmentation, regularization, and transfer learning, the system learns robust representations of facial expressions that are applicable to diverse student populations, improving the overall monitoring of student learning in online classrooms.

  • Adaptability to classroom dynamics: Online classrooms often present dynamic and unpredictable environments, requiring the facial expression recognition system to adapt to changes in lighting conditions, camera angles, and student behaviors. By optimizing the model training process to incorporate temporal information and adaptively adjust to changing input data, the system can maintain high performance levels and accurately capture dynamic changes in facial expressions over time, thereby facilitating more effective monitoring of student learning.

  • Real-time analysis and feedback: Model training and optimization techniques can enable the facial expression recognition system to perform real-time analysis of student facial expressions, providing immediate feedback to educators and facilitating timely interventions to enhance student engagement and learning outcomes. Through efficient model architectures, parallelized training algorithms, and hardware acceleration, the system can achieve low-latency processing, ensuring that feedback is delivered in real-time during online classes.

  • Interpretability and transparency: Model training and optimization also influence the interpretability and transparency of the facial expression recognition system. By incorporating attention mechanisms, visualization techniques, and explainable AI approaches into the training process, the system can provide insights into how student facial expressions are interpreted and utilized to monitor learning. This transparency enhances trust in the system's decisions and fosters collaboration between educators and AI systems in online classrooms.

In summary, model training and optimization play a critical role in influencing the overall performance of facial expression recognition systems in online classrooms, especially concerning monitoring student learning. By optimizing accuracy, generalization, adaptability, real-time analysis, interpretability, and transparency, educators can leverage the combined capabilities of ResNet50, CBAM, and TCNs to effectively monitor student engagement and enhance learning outcomes in online educational settings.

  • Step 4: evaluation and performance analysis

    4.1 Evaluation metrics:

      • Evaluate the trained model's performance using metrics such as accuracy, precision, recall, and F1-score on the validation and test sets.

      • Conduct per-class analysis to identify any biases or weaknesses in facial expression recognition across different expression categories.

    4.2 Performance analysis:

      • Assess the model's robustness to varying lighting conditions, camera qualities, and student postures through systematic experimentation.

        • Data variation: Create subsets of the dataset with controlled variations in lighting conditions, camera qualities, and student postures. Ensure that each subset covers a range of scenarios, from optimal to challenging conditions.

        • Evaluation metrics: Define evaluation metrics such as accuracy, precision, recall, and F1-score to quantify the model's performance. Use these metrics to assess the model's ability to correctly classify facial expressions across different variations.

        • Experimental setup: Design experiments where the model is tested on each subset of the dataset representing different variations. Ensure that the experiments are conducted in a controlled environment to isolate the effects of specific variations.

        • Cross-validation: Employ cross-validation techniques, such as k-fold cross-validation, to ensure the robustness of the evaluation results.

        • Statistical analysis: Conduct statistical analysis to compare the model's performance across different variations. Use techniques such as ANOVA or t-tests to determine significant differences in performance under varying conditions.

        • Qualitative assessment: Perform qualitative assessments by visually inspecting the model's predictions on sample images from different subsets. Evaluate whether the model's predictions align with human judgment in different lighting, camera, and posture scenarios.

        • Iterative improvement: Use the insights gained from the experimentation to iteratively improve the model. Adjust model parameters, training strategies, or data augmentation techniques to enhance robustness to specific variations.

        • Real-world testing: Conduct real-world testing in actual online classroom settings to validate the model's performance in practical scenarios. Gather feedback from users and instructors to identify any remaining challenges and areas for further improvement.

By following this systematic approach, you can comprehensively assess the model's robustness to varying lighting conditions, camera qualities, and student postures, ensuring its effectiveness in real-world online classroom environments.

  • Compare the performance of the proposed approach with baseline models to demonstrate improvements in facial expression recognition accuracy (an evaluation sketch is given below).
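The evaluation protocol above can be scripted along the following lines, with scikit-learn and SciPy assumed to be available. The emotion label set and the comparison of per-condition cross-validation scores with a paired t-test mirror the metrics and statistical analysis described in Step 4; the function names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import accuracy_score, classification_report

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]  # assumed label set

def evaluate_subset(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Overall accuracy plus per-class precision, recall, and F1-score for one condition subset."""
    print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
    print(classification_report(y_true, y_pred, labels=range(len(EMOTIONS)),
                                target_names=EMOTIONS, zero_division=0))

def compare_conditions(fold_scores_a, fold_scores_b) -> None:
    """Paired significance test across cross-validation folds for two lighting/posture conditions."""
    stat, p_value = ttest_rel(fold_scores_a, fold_scores_b)
    print(f"paired t-test: t = {stat:.2f}, p = {p_value:.4f}")
```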

Evaluation and performance analysis are crucial aspects that directly impact the overall effectiveness of facial expression recognition systems in online classrooms, especially when leveraging the combined capabilities of ResNet50, CBAM, and TCNs. Here's how they influence the system's performance, particularly in the context of monitoring student learning:

  • Assessment of accuracy and reliability: Evaluation and performance analysis provide insights into the accuracy and reliability of the facial expression recognition system. By comparing the system's predictions against ground truth labels obtained from human annotators, educators can assess the system's ability to accurately capture dynamic changes in facial expressions over time. This assessment is vital for ensuring that the system effectively monitors student learning and engagement during online classes.

  • Identification of performance metrics: Evaluation allows educators to identify appropriate performance metrics for assessing the system's effectiveness in monitoring student learning. Metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) can be used to quantify the system's performance in recognizing different facial expressions and monitoring student engagement levels. By selecting relevant performance metrics, educators can gain a comprehensive understanding of the system's capabilities and limitations.

  • Analysis of false positives and false negatives: Performance analysis helps identify areas of improvement by analyzing false positives and false negatives generated by the facial expression recognition system. False positives occur when the system incorrectly detects a facial expression that is not present, while false negatives occur when the system fails to detect a facial expression that is present. By analyzing these errors, educators can identify patterns and refine the system's algorithms to reduce inaccuracies and improve overall performance.

  • Validation across diverse scenarios: Evaluation allows educators to validate the performance of the facial expression recognition system across diverse scenarios encountered in online classrooms. This includes variations in lighting conditions, camera angles, student demographics, and teaching styles. By conducting rigorous evaluation across diverse scenarios, educators can ensure that the system generalizes well and maintains high performance levels in real-world educational settings.

  • Feedback for model refinement: Performance analysis provides valuable feedback for refining the facial expression recognition system through iterative model improvement. By analyzing performance trends over time and identifying areas of weakness, educators can iteratively update the system's algorithms, fine-tune model parameters, and incorporate new training data to enhance performance. This iterative refinement process ensures that the system continuously improves and remains effective in monitoring student learning in online classrooms.

  • Ethical considerations and bias assessment: Evaluation allows educators to assess the ethical implications of the facial expression recognition system, including issues related to bias, fairness, and privacy. By evaluating the system's performance across different demographic groups and identifying potential biases, educators can mitigate ethical concerns and ensure that the system treats all students fairly and respectfully. This ethical assessment is essential for fostering a safe and inclusive learning environment in online classrooms.

In summary, evaluation and performance analysis are critical components that influence the overall effectiveness of facial expression recognition systems in online classrooms. By rigorously assessing accuracy, reliability, performance metrics, false positives/negatives, validation across diverse scenarios, feedback for model refinement, and ethical considerations, educators can leverage the combined capabilities of ResNet50, CBAM, and TCNs to effectively monitor student learning and enhance educational outcomes in online educational settings.

  • Step 5: deployment and real-world testing

    5.1 Deployment preparation:

      • Integrate the trained model into an online classroom platform for real-time facial expression recognition during learning sessions.

      • Ensure compatibility with different devices and environments commonly encountered in online education settings.

    5.2 Real-world testing:

      • Conduct extensive field tests in diverse online classroom environments to validate the effectiveness and reliability of the proposed approach.

      • Gather feedback from users and instructors to iteratively improve the model's performance and usability in real-world scenarios.

Deployment and real-world testing are critical phases that directly impact the overall performance of facial expression recognition systems in online classrooms, particularly when leveraging the combined capabilities of ResNet50, CBAM, and TCNs. Here's how they influence the system's performance, especially in the context of monitoring student learning:

  • Validation in real-world settings: Deployment places the facial expression recognition system in real-world online classrooms, where it can be tested under actual operating conditions. Real-world testing allows educators to validate the system's performance in live educational environments, ensuring that it effectively captures dynamic changes in facial expressions over time and accurately monitors student learning.

  • Assessment of robustness and stability: Real-world testing provides insights into the robustness and stability of the facial expression recognition system under diverse operating conditions encountered in online classrooms. This includes variations in lighting, background noise, camera angles, internet connectivity, and student behaviors. By testing the system in real-world settings, educators can assess its ability to maintain high performance levels and reliability in the face of practical challenges.

  • Feedback for system optimization: Deployment and real-world testing generate valuable feedback for optimizing the facial expression recognition system. By collecting user feedback, monitoring system performance metrics, and analyzing system behavior in real-time, educators can identify areas for improvement and refine the system's algorithms, user interface, and overall design to enhance performance and usability. This iterative optimization process ensures that the system meets the evolving needs of educators and students in online classrooms.

  • Integration with existing platforms: Deployment involves integrating the facial expression recognition system with existing online learning platforms used by educators and students. Seamless integration ensures that the system fits into the existing educational workflow, minimizing disruptions and maximizing user adoption. By integrating with popular platforms such as learning management systems (LMS) or video conferencing tools, educators can easily access and utilize the facial expression recognition system to monitor student learning in real-time.

  • Training and support for users: Deployment includes providing training and support for educators and students on how to effectively use the facial expression recognition system in online classrooms. Comprehensive training programs, user manuals, and technical support resources help users understand the system's capabilities, interpret its outputs, and troubleshoot any issues that may arise during use. By empowering users with the knowledge and skills to leverage the system effectively, educators can maximize its impact on monitoring student learning and enhancing educational outcomes.

  • Ethical considerations and privacy protection: Deployment and real-world testing involve addressing ethical considerations and ensuring the protection of student privacy. Educators must implement appropriate privacy safeguards, data encryption measures, and informed consent procedures to protect students' sensitive facial expression data and ensure compliance with privacy regulations. By prioritizing ethical considerations and privacy protection, educators can create a safe and trusted learning environment in which the facial expression recognition system can be deployed effectively.

In summary, deployment and real-world testing are crucial phases that influence the overall performance of facial expression recognition systems in online classrooms. By validating performance in real-world settings, assessing robustness and stability, optimizing system design, integrating with existing platforms, providing training and support for users, and addressing ethical considerations, educators can leverage the combined capabilities of ResNet50, CBAM, and TCNs to effectively monitor student learning and enhance educational outcomes in online educational settings.

Figure 5 illustrates the ResNet50-CBAM architecture with TCNs for capturing temporal dynamics in facial expressions over consecutive frames, visually representing the components and connections of the model.

Fig. 5 The ResNet50-CBAM architecture with TCNs for capturing temporal dynamics in facial expressions over consecutive frames

Table 1 presents simplified pseudocode outlining the key steps involved in enhancing facial expression recognition in an online classroom setting using ResNet50, CBAM, and TCNs.

Table 1 shows the pseudocode of the proposed method
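To make the pipeline in Table 1 concrete, the following condensed PyTorch sketch shows one plausible arrangement of the three components, with CBAM applied to the ResNet-50 feature map and a simplified dilated temporal convolution block standing in for a full TCN. Layer sizes, the seven-class head, and the clip shape are illustrative assumptions rather than the exact configuration reported in the experiments.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))          # shared MLP on avg-pooled features
        mx = self.mlp(x.amax(dim=(2, 3)))           # shared MLP on max-pooled features
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))   # spatial attention map

class ExpressionNet(nn.Module):
    """ResNet-50 + CBAM per frame, temporal convolutions over the clip, 7-way head."""
    def __init__(self, num_classes=7):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, 7, 7)
        self.cbam = CBAM(2048)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.tcn = nn.Sequential(   # simplified stand-in for a TCN with dilated convolutions
            nn.Conv1d(2048, 256, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=2, dilation=2), nn.ReLU())
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, clip):                        # clip: (B, T, 3, 224, 224)
        b, t = clip.shape[:2]
        f = self.features(clip.flatten(0, 1))       # per-frame spatial features
        f = self.pool(self.cbam(f)).flatten(1)      # attended, pooled: (B*T, 2048)
        f = f.view(b, t, -1).transpose(1, 2)        # (B, 2048, T)
        f = self.tcn(f).mean(dim=2)                 # temporal aggregation
        return self.classifier(f)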

4.1 System architecture

The online learning system integrates expression recognition technology to analyze students' emotions. It captures facial images, classifies expressions, and assesses students' learning states using a predefined emotional classification strategy. Feedback is then provided to students through the platform. Figure 6 illustrates the design process of this system.

Fig. 6 illustrates the design process of this system

The system workflow comprises the following steps:

  1. User registration and login.

  2. Learning phase: Learners engage with recommended content or search for their own.

  3. Emotion recognition: Facial images are periodically captured during the learning journey. The system detects facial expressions and evaluates learners' emotional states. Positive emotions prompt continued learning, while negative emotions trigger instructional adjustments.

  4. Adjustment of learning strategies: The curriculum is adapted, possibly by adjusting difficulty, pace, or integrating supplementary knowledge to enhance emotional well-being. Emotion recognition is critical, as shown in Fig. 7. The emotion recognition module focuses on expression recognition:

  5. Capture facial data.

  6. Apply image pre-processing.

  7. Detect and localize faces. If none are detected, return to the capture step; otherwise, proceed to expression extraction.

  8. Extract and classify facial expressions.

  9. Generate output indicating the result, as depicted in Fig. 8; a minimal code sketch of this recognition loop is given after the figure captions.

Fig. 7 illustrates the emotional state recognition module

Fig. 8 depicts the expression detection module
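The following minimal sketch mirrors steps 5-9 of the workflow above (capture, pre-process, detect and localize, classify, report), assuming a trained classifier model with a 224 × 224 RGB input. The OpenCV Haar face detector and the label ordering are illustrative stand-ins for the detector and class mapping actually used.

import cv2
import torch
import torchvision.transforms as T

# Label order is illustrative, not the mapping used in the experiments.
LABELS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

def recognise(frame, model):
    """Return the predicted expression for the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face: recapture
    x, y, w, h = faces[0]
    face = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        logits = model(preprocess(face).unsqueeze(0))
    return LABELS[int(logits.argmax(dim=1))]          # classify and report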

4.2 Online learning status assessment

In the realm of online education, learning behaviors undergo assessment through the following criteria (a minimal rule-based sketch follows the list):

  • Attention: Regular monitoring of students' facial presence occurs at 10-s intervals. A failure to detect their faces for more than half of this duration may signal possible inattention or technical difficulties.

  • Challenge: Ongoing scrutiny of students' emotional responses is carried out. Should consistent expressions of anger or distress coincide with challenging learning moments, either thorough explanations or immediate teacher assistance is provided.

  • Engagement: The degree of student engagement is determined by observing expressions of happiness or surprise throughout online sessions.

  • Participation rate: Examination of students' conduct discerns whether they are active or passive participants. Spending over 50% of the time with bowed heads indicates potential disinterest, prompting timely reminders.

  • Learning preferences: Comparative analysis of emotional responses aids in identifying courses that evoke positive emotions, thus guiding future recommendations aligned with students' preferences and fostering positive learning experiences.
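A hedged sketch of these rule-based checks follows. The 10-second window and the 50% thresholds come from the criteria above, while the field names and the head-pose flag are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class WindowStats:
    frames: int              # frames sampled in a 10-second window
    face_detected: int       # frames where a face was found
    head_bowed: int          # frames where the head was bowed
    negative: int            # frames classified as anger/sadness/fear/disgust
    positive: int            # frames classified as happiness/surprise

def assess(stats: WindowStats) -> dict:
    """Apply the attention, challenge, engagement, and participation rules."""
    return {
        "inattentive": stats.face_detected < stats.frames * 0.5,   # attention rule
        "needs_help": stats.negative > stats.frames * 0.5,         # challenge rule (threshold assumed)
        "engaged": stats.positive > 0,                              # engagement rule (threshold assumed)
        "passive": stats.head_bowed > stats.frames * 0.5,           # participation rule
    }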

5 Result and discussions

5.1 Experimental environment

The experimental setup aimed to enhance facial expression recognition in an online classroom setting utilizing a combination of ResNet50, Convolutional Block Attention Module (CBAM), and Temporal Convolutional Networks (TCNs). To validate the efficacy of the proposed algorithm, comprehensive testing was conducted across four prominent datasets renowned for their diversity and representation of facial expressions: RAF-DB, FER2013, KDEF, and CK + .

  • RAF-DB (Real-world Affective Faces Database):

    • Description: RAF-DB contains real-world facial images collected from the Internet and annotated with emotion labels through crowd-sourcing, offering a diverse range of emotional cues.

    • Image Dimension: The images in RAF-DB dataset typically have dimensions of [224 × 224] pixels.

    • Number of Images: RAF-DB comprises approximately 29,000 images.

  • FER2013 (Facial Expression Recognition 2013):

    • Description: FER2013 is a widely used benchmark dataset containing grayscale facial images annotated with seven emotion classes, providing a standardized platform for facial expression recognition research.

    • Image Dimension: Images in FER2013 dataset are grayscale and usually have dimensions of [48 × 48] pixels.

    • Number of Images: FER2013 dataset consists of approximately 35,000 images.

  • KDEF (Karolinska Directed Emotional Faces):

    • Description: KDEF offers standardized facial expressions posed by multiple models under controlled conditions, facilitating detailed analysis of emotion portrayal.

    • Image Dimension: Images in KDEF dataset typically have dimensions of [256 × 256] pixels.

    • Number of Images: KDEF dataset includes approximately 4,900 images.

  • CK + (Extended Cohn-Kanade Dataset):

    • Description: CK + provides an extensive collection of posed facial expressions captured in laboratory settings, enabling the study of nuanced emotional dynamics.

    • Image Dimension: Images in CK + dataset typically have dimensions of [640 × 490] pixels.

    • Number of Images: CK + dataset comprises approximately 593 sequences, each containing multiple frames.

Table 2 presents the total count of occurrences for each of the seven distinct facial expressions across four prominent datasets used in facial expression recognition research: RAF-DB, FER2013, KDEF, and CK + . These datasets encompass a diverse range of facial expressions, providing valuable resources for training and evaluating facial expression recognition algorithms. Table 2 facilitates an understanding of the distribution of facial expressions within each dataset, enabling researchers to assess the availability and diversity of emotional expressions for algorithm development and validation. The counts provided offer insights into the relative prevalence of different expressions, which is crucial for designing effective facial expression recognition systems and understanding emotional dynamics in various contexts.

Table 2 provides the total number of occurrences of the seven distinct expressions in the four datasets (RAF-DB, FER2013, KDEF, and CK +)

The evaluation methodology assesses the robustness of the proposed algorithm for enhancing facial expression recognition in online classroom settings across diverse datasets. By incorporating training and validation data from the four expression datasets (RAF-DB, FER2013, KDEF, and CK +), the study ensures comprehensive model training and validation. Using stochastic gradient descent, the proposed algorithm iteratively adjusts its parameters to optimize recognition outcomes, with careful parameter initialization and a learning rate of 0.0001 and a batch size of 16 chosen to facilitate convergence and mitigate issues such as vanishing or exploding gradients. This systematic setup addresses common challenges in deep learning model training and supports the algorithm's performance across various facial expression recognition tasks.

In the experimental setup, a Windows 10 computer serves as the platform for conducting model training, with the PyTorch deep learning framework utilized to facilitate the training process. The choice of PyTorch reflects its popularity and effectiveness in implementing deep learning models. Additionally, all code-based experiments are conducted within the PyCharm development environment, ensuring a cohesive and streamlined workflow for experiment design, implementation, and analysis. This arrangement provides a robust and efficient environment for conducting experiments, enabling researchers to focus on algorithm development and evaluation within a familiar and versatile development environment.
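A minimal PyTorch training-loop sketch consistent with this setup (stochastic gradient descent, learning rate 0.0001, batch size 16) is shown below. The momentum value, epoch count, dataset object, and model are placeholders rather than the exact configuration used in the experiments.

import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=50, device="cuda"):
    """Fit the model with SGD (lr=1e-4) and mini-batches of 16 samples."""
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    optimiser = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # momentum assumed
    criterion = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimiser.step()
    return model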

5.2 Evaluation metrics

In the assessment of facial expression recognition systems, it is imperative to evaluate performance across multiple metrics to ensure effectiveness. Four key metrics commonly utilized for this purpose are Accuracy, Precision, Recall, and F1-Score [44,45,46,47,48].

Accuracy (Acc) is the proportion of correctly classified facial expressions to the total number of expressions. Mathematically, it is represented as:

$$Acc=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)

where TP denotes True Positives (correctly recognized expressions), TN signifies True Negatives (correctly ignored expressions), FP represents False Positives (incorrectly recognized expressions), and FN indicates False Negatives (incorrectly ignored expressions).

Precision (Prec) measures the proportion of correct positive predictions among all positive predictions made by the system. It's calculated as:

$$Prec=\frac{TP}{TP+FP}$$
(2)

Recall (Rec), also known as Sensitivity, quantifies the system's ability to correctly identify all positive instances. It's computed as:

$$Rec=\frac{TP}{TP+FN}$$
(3)

F1-Score (F1) is the harmonic mean of Precision and Recall, providing a balanced assessment of the system's performance. It's expressed as:

$$F1=2\times \frac{Prec\times Rec}{Prec+Rec}$$
(4)

By evaluating facial expression recognition systems using these metrics, developers can gain insights into their accuracy, ability to detect specific expressions, and overall effectiveness, thereby facilitating enhancements and optimizations for improved performance.
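For reference, Eqs. (1)-(4) translate directly into the following short Python function for a single expression class; the input counts are placeholders.

def expression_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Acc, Prec, Rec, and F1 from the four confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)                         # Eq. (1)
    prec = tp / (tp + fp) if (tp + fp) else 0.0                   # Eq. (2)
    rec = tp / (tp + fn) if (tp + fn) else 0.0                    # Eq. (3)
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0   # Eq. (4)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}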

5.3 Results of the facial expression recognition experiment

5.3.1 Recognition of facial expressions within the RAF-DB dataset

The confusion matrices depicting the recognition of seven facial expressions in both the original and refined ResNet-50 models are displayed in Figs. 9 and 10, respectively. Examination of these matrices indicates a distinct enhancement in accuracy after refining the ResNet-50 model. This improvement can be attributed to the enhanced capability of the refined network to uphold the transmission of vital information throughout its architecture, thereby resulting in more proficient recognition of facial expressions.
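Confusion matrices such as those in Figs. 9 and 10 can be produced from model predictions as sketched below; the row normalization and plotting details are illustrative choices rather than the exact procedure used for the figures.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion(y_true, y_pred, class_names):
    """Plot a row-normalised confusion matrix over the seven expression classes."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(cmap="Blues")
    plt.show()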

Fig. 9 illustrates the confusion matrix depicting the identification results of ResNet-50 on the RAF-DB dataset

Fig. 10 displays the confusion matrix illustrating the improved identification results of ResNet-50 on the RAF-DB dataset

After refining ResNet-50, significant enhancements are apparent across all seven facial expressions in the RAF-DB dataset. The accuracy of recognizing easily distinguishable expressions, such as happiness and neutral, has seen an increase from 0.89 and 0.88 to 0.91 and 0.94, respectively. Similarly, recognition accuracy for more complex expressions like fear and disgust has also seen improvement, climbing from 0.73 and 0.70 to 0.88 and 0.95, respectively.

In order to establish the validity of the method presented in this paper, we conducted a comparative study between our proposed approach and a recently utilized deep learning method for facial expression recognition experiments using the RAF-DB dataset. The accuracy values obtained from this comparison are detailed in Table 3, offering a framework for evaluating the efficiency and trustworthiness of our proposed method in relation to existing techniques.

Table 3 illustrates the accuracy achieved by various models on the RAF-DB dataset

In the RAF-DB dataset, there exists an uneven distribution of expressions, especially noticeable for negative emotions like fear and disgust, which are somewhat less prevalent. Moreover, certain images display compound expressions, presenting hurdles during network training. Unlike other convolutional neural networks, this algorithm utilizes ResNet-50 as its primary architecture, a choice that adeptly tackles challenges such as vanishing gradients linked to deep network structures. As a result, this decision enables the extraction of richer feature information, ultimately resulting in enhanced recognition performance.

5.3.2 Recognition of facial expressions in the FER2013 dataset

The recognition confusion matrices, which demonstrate the identification of the seven facial expressions before and after adjustments, are showcased for ResNet50 models on both the public and private verification sets of FER2013. These are depicted in Figs. 11 and 12, respectively. Initially, ResNet-50 achieved recognition rates of 65.71% and 68.43% on the public and private verification sets, correspondingly. Subsequent to fine-tuning, there was a noticeable improvement, with rates rising to 91.71% and 93.71% on the public and private verification sets, respectively. It is evident that employing a fine-tuned ResNet-50 as the foundational network for expression recognition results in substantial enhancements in model performance.

Fig. 11 displays the confusion matrix depicting the identification results of ResNet-50 on both the private and public test sets of FER2013

Fig. 12 illustrates the confusion matrix representing the improved identification results of ResNet-50 on both the private and public test sets of FER2013

Upon examination of facial expressions, it is apparent that happiness and neutrality display noticeable facial changes, resulting in increased recognition rates. Remarkably, the accuracy rates for happiness and neutrality on the private test are 0.95 and 0.96, respectively, whereas on the public test, they are 0.90 and 0.95, respectively.

To ensure the validity of the proposed approach, a comparative investigation was conducted, comparing it with a recently deployed deep learning technique in an expression recognition experiment using the FER2013 dataset. The ensuing accuracy scores are thoroughly detailed in Table 4, offering a framework for assessing the efficacy and trustworthiness of the proposed method when juxtaposed with existing techniques.

Table 4 illustrates the accuracy achieved by various models on the FER2013 dataset

  • Complexities in FER2013 Dataset: The FER2013 dataset contains subtle differences among similar facial expression samples, including variations in factors such as the degree of head twisting and light intensity. These nuances make it challenging for algorithms to accurately recognize facial expressions.

  • Contrast in Recognition Rates: Due to these complexities, there is a significant difference in the average recognition rate achieved by the proposed algorithm when applied to the FER2013 dataset compared to the RAF-DB dataset. The algorithm performs differently because of the varying nature of the datasets and the unique challenges they present.

  • Performance of the Refined ResNet Model: Despite the challenges posed by the FER2013 dataset, experimental results show that the refined ResNet model performs exceptionally well. It outperforms alternative methods in improving expression classification accuracy. This success highlights the effectiveness and feasibility of the proposed approach, even in datasets with inherent complexities like FER2013.

In summary, the previous paragraph discusses how subtle differences in facial expression samples within the FER2013 dataset present challenges for recognition algorithms. Despite these challenges, the refined ResNet model proves to be effective in improving accuracy, demonstrating the robustness of the proposed approach.

5.3.3 Performing a comparative analysis of the proposed model against state-of-the-art techniques

A series of experiments were conducted to assess the performance of our proposed model in comparison to other state-of-the-art methods, as detailed in Table 5. Initially, we evaluated the model's performance against FER-2013 to measure its resilience. Our proposed model demonstrated superior performance compared to models labeled as [34, 58,59,60,61,62,63,64], and [3], achieving accuracy improvements of 25.71%, 19.71%, 3.69%, 4.91%, 18.11%, 40.42%, 14.91%, 19.33%, and 18.31% respectively. We assessed the model's resilience by comparing its performance against the CK + dataset. The results showed that our proposed model outperformed other methods significantly. Specifically, our model achieved higher accuracy rates by 14.85%, 8.15%, 17.26%, 18.75%, 6.29%, and 8.6% when compared to models referenced as [3, 34, 61, 63, 65], and [27] respectively. We evaluated the model's robustness by examining its performance with the KDEF dataset. The findings indicated that our model achieved greater accuracy than the method presented by [66], surpassing it by 3.68%. Further investigation revealed that [34, 67, 68], and [63] yielded performances that were 3.38%, 10.18%, 12.16%, and 25.58% lower, respectively, than our proposed model.

Table 5 presents a comparison of the performance of the proposed model with the state-of-the-art method across three benchmark datasets

Achieving the highest accuracy with the ResNet50, CBAM, and TCNs model for enhancing facial expression recognition involves several key factors:

  • Model architecture: The use of ResNet50, CBAM, and TCNs as components of the model architecture likely plays a significant role. ResNet50 is known for its depth and skip connections, CBAM enhances feature attention, and TCNs capture temporal dependencies, all contributing to improved recognition accuracy.

  • Dataset quality and diversity: The quality, size, and diversity of the dataset used for training are essential factors. A comprehensive dataset covering various facial expressions, demographics, and environmental conditions ensures the model learns robust features and generalizes well to unseen data.

  • Preprocessing techniques: Effective preprocessing steps such as image normalization, augmentation, and alignment can enhance the quality and relevance of input data, leading to better performance during training and inference.

  • Training strategy: The choice of optimization algorithm, learning rate schedule, batch size, and training epochs significantly influences model performance. Fine-tuning these hyperparameters and employing techniques like early stopping and learning rate annealing can lead to better convergence and higher accuracy.

  • Regularization methods: Incorporating regularization techniques such as dropout, batch normalization, and weight decay helps prevent overfitting and improves the model's ability to generalize to unseen data.

  • Interpretability and contextual understanding: The model's ability to interpret and leverage contextual cues specific to facial expressions, such as facial landmarks, subtle cues, and temporal dynamics, is crucial for accurate recognition.

  • Computational resources: Adequate computational resources, including processing power and memory, are necessary for training deep neural networks like ResNet50 and TCNs effectively.

  • Evaluation metrics: Choosing appropriate evaluation metrics and benchmark datasets ensures fair comparison with existing methods and provides insights into the model's performance across different scenarios.

By addressing these key factors and optimizing each aspect of the model's development and training pipeline, the ResNet50, CBAM, and TCNs model can achieve the highest accuracy in enhancing facial expression recognition tasks.
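As an example of the preprocessing and augmentation factor discussed above, a typical torchvision pipeline might look like the sketch below. The specific transforms and parameter values are illustrative, since the exact pipeline used in the experiments is not detailed here.

import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),
    T.RandomHorizontalFlip(),                 # simple augmentation
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
])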

5.3.4 Assessing the time complexity of the proposed model on GPU, CPU, and resource-limited devices

To conduct a real-time performance assessment of the ResNet50, CBAM, and TCNs model for enhancing facial expression recognition, we aimed to measure its processing time across different computing platforms. This evaluation included high-performance GPUs, standard CPUs, and a resource-constrained device, the Jetson Nano. By executing the model on each platform and recording the time taken for inference, we could compare its efficiency and responsiveness across diverse hardware configurations. Assessing performance on GPU and CPU setups provides insights into the model's scalability and suitability for both high-performance computing environments and conventional hardware setups. Furthermore, evaluating its performance on the Jetson Nano allows us to understand its feasibility for deployment in real-world, edge computing scenarios where computational resources are limited. By analyzing processing times across these platforms, we can determine the model's versatility and efficiency in real-time facial expression recognition applications. The Jetson Nano is a small, low-cost single-board computer developed by NVIDIA, designed for embedded systems and edge computing applications. It features a quad-core ARM Cortex-A57 CPU and a Maxwell-based NVIDIA GPU, providing hardware acceleration for deep learning tasks. The Jetson Nano is particularly well-suited for applications requiring real-time processing and inference on resource-constrained devices, such as robotics, drones, smart cameras, and IoT devices.
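The timing protocol behind this per-platform comparison can be sketched as follows; the input shape, warm-up count, and run count are illustrative assumptions, and the same routine would be executed on the GPU, the CPU, and the Jetson Nano.

import time
import torch

def measure_fps(model, device="cuda", runs=200):
    """Estimate single-image inference throughput (frames per second)."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(10):                   # warm-up iterations
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)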

The ResNet50, CBAM, and TCNs model for enhancing facial expression recognition achieved varying frames per second (fps) rates on different computing platforms, namely GPU, CPU, and Jetson Nano. Specifically, the model attained 28 fps on GPU, 12 fps on CPU, and 15 fps on Jetson Nano. These findings underscore the substantially reduced time complexity of the proposed model compared to conventional CPU-based implementations. Several factors contribute to these outcomes:

  • Utilization of hardware acceleration: GPUs are adept at parallel processing, rendering them highly efficient for deep learning tasks. Their capacity for handling vast data volumes concurrently results in accelerated inference times compared to CPUs.

  • Optimized model configuration: The ResNet50, CBAM, and TCNs model may have been fine-tuned to exploit GPU architectures, effectively harnessing their parallel computing capabilities. This optimization encompasses techniques like batching, kernel fusion, and memory optimization.

  • Enhanced architecture compatibility: The model's design may inherently align better with GPU architectures, facilitating optimal utilization of GPU resources and consequent performance maximization.

  • Adaptation to resource constraints: Despite the Jetson Nano's limited computational resources relative to GPUs, it benefits from hardware acceleration and streamlined model inference. The device's dedicated hardware accelerators enable efficient execution of deep learning tasks, mitigating the impact of resource limitations.

  • Feasibility for real-world scenarios: The notable reduction in time complexity positions the proposed model favorably for real-world deployment scenarios necessitating swift inference times. Applications such as real-time facial expression recognition in video streams demand low latency for seamless interaction, highlighting the model's practical utility.

In summary, the interplay of hardware acceleration, tailored model optimization, architecture compatibility, and adaptation to resource constraints collectively underpins the superior performance of the proposed model on GPU, CPU, and Jetson Nano platforms, thereby affirming its suitability for real-world deployment.

Table 6 offers a comparative overview of our results across the CK + and RAF-DB datasets. Our evaluation centered on three main metrics: accuracy (%), run-time (in seconds), and average CPU memory usage (%). Within the CK + dataset, our approach demonstrated a test accuracy of 96.95%, trailing only 0.03% behind the highest accuracy achieved by the method referenced as [69]. At the same time, our method attains a processing time of merely 73 s, showcasing a processing speed superior to recent methodologies and underscoring the importance of balancing accuracy against processing time, a crucial consideration for the practical implementation of deep learning techniques on edge devices in real-world scenarios. In summary, our proposed method demonstrated competitive accuracy and computational efficiency on the CK + dataset. Regarding the RAF-DB dataset, our proposed approach attained a test accuracy of 91.92%. What sets our method apart is its significantly quicker run-time, approximately one-fourth of the time taken by [73]. When processing 3050 test images, our method completes the recognition task within a mere 121 s, contrasting sharply with other methods that typically demand around 500 s or more to achieve an accuracy above 87%. This underscores not only the applicability of our method for real-time applications but also its adaptability to edge devices.

Table 6 displays a comparative analysis with the most recent state-of-the-art findings derived from the CK + and RAF-DB datasets. The term "run-time" denotes the time taken for recognizing 1500 images in the CK + dataset and 3050 images in the RAF-DB dataset, respectively

5.3.5 Impact of online educational platforms

The learning process produces facial expression recognition results over a designated historical timeframe and subsequently offers feedback and recommendations based on these results. Figure 13 illustrates the feedback on the learning effect. Figure 14 depicts the real-time engagement index plotted across the entire video in the Operating System course.

Fig. 13 demonstrates the feedback regarding the learning impact throughout the entirety of the video within the Operating System course

Fig. 14 presents a graph showcasing the real-time engagement index mapped throughout the entirety of the Operating System course video

Table 7 reports the proposed model trained with different backbone networks. As the results show, ResNet-50 benefits the model more than MobileNet-V2, VGG-16, and ShuffleNet-V2. We attribute this to our proposed method being better suited to small-sample data, avoiding the overfitting that can arise from the large number of parameters in VGG-16. In addition, compared with the backbone network alone, the AC module structure in the lightweight network attains a higher recognition rate while significantly reducing the number of parameters and the computational complexity. Our method achieves a high level of accuracy with a small amount of computation, exceeding the baseline by a large margin, from 84.57% to 97.08%.

Table 7 presents an evaluation comparing the most advanced networks in terms of classification accuracy, number of parameters, and FLOPs on the KDEF dataset
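The parameter counts compared in Table 7 can be reproduced for the candidate backbones with a few lines of torchvision code, as sketched below; FLOPs counting would require an additional profiler and is omitted here.

from torchvision import models

backbones = {
    "ResNet-50": models.resnet50(weights=None),
    "MobileNet-V2": models.mobilenet_v2(weights=None),
    "VGG-16": models.vgg16(weights=None),
    "ShuffleNet-V2": models.shufflenet_v2_x1_0(weights=None),
}
for name, net in backbones.items():
    n_params = sum(p.numel() for p in net.parameters()) / 1e6   # millions of parameters
    print(f"{name}: {n_params:.1f}M parameters")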

6 Conclusion and future work

This study presents a pioneering online educational platform that leverages facial expression recognition technology to enhance student monitoring and engagement within the classroom. The approach utilizes ResNet-50 for effective feature extraction, and the combination of ResNet50, CBAM, and TCNs facilitates accurate and efficient facial expression recognition, achieving impressive accuracies across multiple datasets. Our model exhibits significantly better performance than other methods; by conducting evaluations across four diverse datasets, we have substantiated the efficacy of our approach in improving E-learning, considering both quantitative and qualitative aspects. The platform's ability to surpass the initial ResNet50 model in accuracy and detection of students' learning states underscores its potential significance for educational institutions. Overall, the proposed online educational platform represents a promising avenue for leveraging facial expression recognition technology to enhance student engagement and support educators in fostering effective teaching strategies. Through continued research and development efforts, the platform has the potential to revolutionize online learning environments and contribute to improved student outcomes in the future.

Moving forward, future work could focus on several avenues to further enhance the platform's capabilities and applicability. Firstly, refining the facial expression recognition algorithms to accommodate diverse classroom environments and lighting conditions could improve accuracy and robustness. Additionally, integrating advanced machine learning techniques, such as reinforcement learning, could enable the platform to adapt and personalize the learning experience for individual students based on their facial expressions and engagement levels. Furthermore, exploring the integration of additional modalities, such as audio and text analysis, could provide a more comprehensive understanding of students' emotions and behaviors during online classes. This multi-modal approach could enrich the platform's insights and enable more tailored interventions to support student learning and well-being. Lastly, conducting longitudinal studies to assess the platform's impact on student outcomes, including academic performance and socio-emotional development, would provide valuable insights into its efficacy and potential benefits for educational practices.