Introduction

The COVID-19 pandemic has had a significant impact on higher education, prompting rapid shifts toward online instruction worldwide. This transformation necessitates innovative pedagogical approaches that can accommodate students’ individual needs and diverse motivations. In the emerging context of remote teaching and learning, collaborative learning has become a popular choice for promoting engagement among learners. Self-regulation is a critical component of individual learning success, whereas socially shared regulation of learning (SSRL) is the key factor contributing to group performance in collaborative learning (Järvelä et al., 2019). The sudden changes to educational practices caused by the pandemic have further emphasized the need to understand and support SSRL, which has become increasingly important in the new learning environment. However, assessing and supporting students’ socially shared regulation in collaborative learning has been challenging due to the unobservability of cognitive and emotional processes and the dynamics of collaborative interactions (Järvelä & Bannert, 2021). Previously published studies are mostly confined to self-reported surveys and interviews. Despite recent attempts to explore the use of multimodal data, including physiological data (e.g., Dindar et al., 2022) and video and audio data (e.g., Isohätälä et al., 2018), to study SSRL, one of the most significant discussions in SSRL research concerns the need for innovative methods to analyze new data modalities (Azevedo & Gašević, 2019). Recent advances in Artificial Intelligence (AI) in general, and facial expression recognition in particular, along with their applications in the learning sciences, have enabled novel methods for examining self-regulation and offered new insights into the self-regulated learning process (Nguyen et al., 2021, 2022b). To date, there is a dearth of empirical evidence on how AI methods can be leveraged to investigate students’ socially shared regulation in online collaborative learning. Addressing this research gap, our study employs AI facial expression recognition to explore students’ emotional regulation in synchronous computer-supported collaborative learning (CSCL).

Emotional regulation is one of the most challenging aspects of socially shared regulation, given the diverse involvement of learners at the group level and the individualized, sometimes idiosyncratic co- and socially shared emotion regulation strategies that are pertinent to each challenge (Järvenoja et al., 2019). However, prior studies investigating the factors associated with SSRL have more frequently employed traditional data collection instruments such as surveys, interviews, or a combination of both in self-reports (Järvelä et al., 2019; Kwon et al., 2014). Järvelä et al. (2019) suggested that regulation is “not linear and involves cyclical adaptation,” which “requires multiple data channels” to capture not only individual and social learning activities but also their interactions with the learning context (p. 429). We share this viewpoint and would argue that little further advancement along this route can be achieved, given the unavoidable subjectivity of participants embedded in the qualitative data amassed. The limited range of methodologies used in previous studies highlights the need for novel approaches that utilize emerging technologies in data collection and can provide objective insights into the SSRL process.

Furthermore, in light of the turbulence caused by the COVID-19 pandemic over the last few years, the dynamics of teaching and learning have evolved dramatically, especially in the way education is delivered. Technology-featured and technology-aided models such as hybrid learning and blended learning have become the norm, leading to an increase in collaborative learning in virtual environments. Despite this transformation, research on the social interactions of learners during collaborative learning in computer-supported environments has been limited to traditional questionnaires (Panadero et al., 2016). While previous research has explored SSRL in asynchronous CSCL, there is a gap in knowledge regarding learning regulation in synchronous CSCL. Prior to the pandemic, most research on regulation in CSCL focused on asynchronous settings, such as learning management systems (LMS) and Massive Open Online Courses (MOOCs). However, with the increased use of synchronous CSCL on platforms such as Microsoft Teams and Zoom, there is a growing demand for understanding SSRL in these contexts. Despite the importance of SSRL, especially during the pandemic, a literature review reveals a lack of evidence on SSRL in synchronous CSCL, and limited attention has been given to emotional regulation in this context. Consequently, this study aims to address this gap and provide new insights into SSRL in synchronous CSCL.

In this study, we explore how thirty-six students in a higher education setting collaboratively study English and complete assigned speaking tasks on a virtual learning platform (Microsoft Teams), and we use digital data (recordings of the sessions) to track the socio-emotional interactions that lead to regulation of learning. Our contribution is twofold. Methodologically, the study adds to the literature an AI-based tool for real-time emotion tracking and data collection, in an emerging field that is fast becoming a key instrument for understanding SSRL and, more broadly, higher-level cognitive learning. By tracking learners’ emotions during the learning process, theoretical implications regarding socially shared learning can be further investigated, since emotional processes are instrumental to successful learning. The dynamic and cyclical nature of this process has long hindered traditional methods such as interviews or surveys from capturing it accurately and fully. Utilizing AI tools to explore qualitative learning processes offers significant added value for interpreting otherwise unobservable processes and paves the way for applications in related interdisciplinary studies. Our study, which focuses on socially regulated learning in an online, computer-assisted environment targeting collaborative work within learning groups, is also anticipated to have practical ramifications for an evolving educational context marked by growing demand for online learning during and after the COVID era. Collectively, as one of the first studies to use AI to conduct research directly on Teams through video-based qualitative analysis, this work enables us to examine theories pertinent to group-level regulation in the context of synchronous collaborative learning.

Theoretical Background

Methodological Progress and Challenges in Studying SSRL

Socially shared regulation of learning (SSRL) is defined as a deliberate and strategic approach to planning, task enactment, reflection, and adaptation that takes place within a group context (Järvelä et al., 2019). Over the past ten years, there has been increasing interest in understanding SSRL. Because the mechanisms involved in the SSRL model within collaborative learning remain poorly understood, recent efforts have been made to elucidate the regulation processes of cognition, motivation, and emotion in SSRL (Nguyen et al., 2023; Winne, 2019). It has recently been suggested that the exploration and support of CoRL and SSRL have been hampered by the dynamic character of collaborative learning and the concealed processes of affective and cognitive change at the core of regulation (Järvelä et al., 2019).

To elaborate on the issue, previous literature has highlighted that students utilize metacognition consistently to adjust their learning tactics, and these adaptations may vary from cycle to cycle (Zimmerman & Schunk, 2011). To accurately capture SSRL, it is therefore necessary to comprehend and record each learner’s varied and interconnected elements (such as emotion, motivation, and cognition) as well as their interactions and regulation with others within the social learning context (Järvelä et al., 2019). Accordingly, there has been significant interest in new data collection methods for studying SSRL, with more emphasis on trace data and real-time measurements; one such approach is multimodal data analytics. This includes analyzing eye-tracking data, screen recordings, physiological sensor data, think-aloud protocols, and time-stamped descriptions of observed interactions between students, content, and machines (e.g., Taub & Azevedo, 2016).

Despite the benefits of using multimodal data analysis to investigate SSRL, there are limitations to this approach. Firstly, collecting multimodal data is costly, and ensuring that it captures students’ real conceptions and intentions during the learning process is challenging (Fan et al., 2021). Secondly, because the results obtained from multimodal data analysis are fragmented by nature, it is difficult to measure and infer the interrelated aspects of the SSRL process. It is therefore recommended that more comprehensive research techniques be employed to better examine SSRL in collaborative learning (Järvelä et al., 2019). Accordingly, this study explores a new approach utilizing AI techniques to examine SSRL in synchronous CSCL.

Artificial Intelligence (AI) Approaches for Examining SSRL Through Facial Emotion Recognition

Recent innovations in AI research, specifically in deep learning, have allowed AI algorithms to transform analytics-relevant tasks with fairly accurate results and minimal human effort. Notably, AI has made significant advances in both cognitive process analysis and real-time emotion recognition. In cognitive analysis, recent work by Gao et al. (2021) and Debie et al. (2019) demonstrates how pre-trained neural network models can predict and analyze cognitive processes using inputs from invasive and non-invasive cognitive monitoring. This provides SSRL researchers with various opportunities to analyze and understand cognitive processes more holistically. Meanwhile, deep learning CNN models have significantly improved the performance of emotion recognition from video inputs, a problem that has been studied for decades. Those interested in the history and state-of-the-art AI approaches to facial emotion recognition can refer to the recent survey by Canal et al. (2022).

With the help of AI, researchers can better understand how students regulate their learning in real time and in different contexts, providing insights that were not possible before. Additionally, AI can analyze large datasets and generate accurate predictions, providing a more comprehensive understanding of the complex SSRL processes. When used to complement multimodal analysis (Zhang et al., 2020), AI algorithms not only increase the accuracy of frame-by-frame analysis of emotions but also enable more in-depth analysis of physiological sensor and audio data (Nguyen et al., 2023). Using state-of-the-art deep learning facial recognition algorithms, this study aims to better capture and understand the student emotions involved in SSRL.

Methods

Participants and Procedures

Participants of the study were 36 college students (19 females, 17 males) from a higher education institution in Vietnam, a mixture of freshmen and sophomores. The study was carried out in an undergraduate English-Speaking Skills course that is part of an advanced program. This program is designed for non-native English speakers who study in small classes led by experienced lecturers, with intensive content delivered in foreign languages. The program also provides students with opportunities to participate in research and interdisciplinary team projects early in their studies and to practice solving practical problems faced by enterprises as part of their learning. The advanced program’s content is based on modern curricula from universities in developed countries such as the United States, Germany, Japan, and the United Kingdom. Teaching and learning materials are primarily delivered in English, with some special programs taught in Vietnamese under an English learning enhancement scheme. It is worth noting that the class was initially intended to be conducted in person, but due to the complexity of the Covid-19 situation, with constant lockdowns in Vietnam during the study period, teaching and learning had to shift to online environments.

The data used in the current study include recordings of small-group discussions in a collaborative learning task in which students were arranged into 14 groups of two to three members, each working on a topic assigned by the instructor. We designed a collaborative learning task for ESL learners in which a group of students discusses a topic of their own choice in English, while individuals within the group are encouraged to correct each other’s phrasing and pronunciation. Example topics discussed by the students were “Impacts of Covid 19 on our life” and “How to study well at university?”. The task focused on discussion in English as a second language rather than on the actual discussion content. Prior to the collaborative learning task, each group was given a topic by the instructor for preparation. During the collaborative learning session, students were asked to turn on their cameras and remain visible throughout the task, but no specific instructions regarding head or body position were given. Since the group project was conducted remotely, the quality of the recordings varies with each student’s personal webcam, ranging from 640 × 320 to 1920 × 1080 pixels. The maximum time allowed for the collaborative learning task was 15 min, the mean participation time per student was around 13 min, and the result is a dataset of 14 distinct videos.

Data Analysis

Facial Expression Recognition

Computational Facial Expression Recognition (FER) is an ongoing, challenging research topic in the machine learning community. Since the work of Krizhevsky et al. (2012), there has been a resurgence of interest in convolutional neural networks for image analysis. In recent years, convolution-based deep FER models have consistently outperformed SIFT-based algorithms in scalability and generalizability on FER tasks (Goodfellow et al., 2013). Moreover, recent state-of-the-art deep learning models for video-based facial expression analysis allow participants’ facial emotions to be classified on a wider, continuous spectrum, enabling more flexible and robust multi-class emotion classification through modification of the SoftMax classification layer (Giannopoulos et al., 2018). In addition, innovations in deep learning-based FER “in the wild” make it possible to analyze human emotions directly from video recordings, without the need for controlled lab environments. These recent developments provide the technical tools to better understand and analyze students’ emotions across the sequenced, recursive cognitive phases of the learning regulation process (Nguyen et al., 2022a, b).

Given the various FER algorithms available in the literature for evaluating how learners socially share regulation in group-based online learning tasks under the regulated learning framework (Canal et al., 2022; Mellouk & Handouzi, 2020), it is important to choose a FER model suitable for a practical setting. In an online collaborative group-based task, given the often-small scope of the dataset (< 100 examples per class), it is usually infeasible to train a domain-specific model from scratch or even to fine-tune existing FER models. The most practical approach is therefore to rely on publicly available pre-trained models for detecting emotional expression throughout the regulated collaborative learning process. To this end, existing AI approaches fall into two main categories: one-shot detectors, which locate the student’s face and detect the student’s emotion at once, and two-shot detectors, which tackle the problem in two steps: face localization and emotion detection (Canal et al., 2022). We adopted the second approach, as a recent FER survey (Dalvi et al., 2021) suggests that the latter achieves better performance on out-of-sample data.

As a result, this study applies an AI architecture using img2pose (Albiero et al., 2020), pre-trained on the WIDER FACE dataset (Yang et al., 2016), as the facial localization module and the Residual Masking Network (Pham et al., 2021) as the emotion detection module. img2pose is a state-of-the-art facial localization method with real-time inference, which has been shown to achieve reliable performance on the large-scale AFLW2000-3D dataset (Zhu et al., 2016) with a mean squared error of 3.913. Furthermore, most benchmarking datasets (WIDER FACE, FDDB (Jain & Learned-Miller, 2010), and Pascal Face (Zhang et al., 2017)) on the public benchmarking website paperswithcode.com also report reliable results for img2pose, with a minimum reported accuracy of 82%. We believe img2pose can handle most student movements in our dataset. For the emotion detection module, the Residual Masking Network is a well-known competitive FER method, achieving 73.28% accuracy on the classic FER2013 dataset. On the public ImageNet challenge, a more general image classification task than facial emotion recognition (Deng et al., 2009), the Residual Masking Network also achieved 74.16% top-1 accuracy and 91.91% top-5 accuracy. In our study, we adopt a common approach to emotion recognition, with a classification output comprising seven distinct emotional classes: Anger, Disgust, Fear, Happiness, Sadness, Surprise, and Neutral.

We believe this approach serves as a reliable starting point for qualitative analysis, given that a previous study (Canal et al., 2022) reported human recognition accuracy of 65.5% for emotions on FER2013. Our pipeline for processing video data into facial expression recognition results is described below (Fig. 1), followed by a minimal code sketch of the two-stage design.

Fig. 1

The experiment pipeline
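To make the two-stage design concrete, the following is a minimal sketch, assuming OpenCV is available; it is not the study’s exact implementation. OpenCV’s bundled Haar cascade stands in for img2pose as the face localization module, and classify_emotion is a hypothetical hook where a pre-trained model such as the Residual Masking Network would be plugged in.

```python
# A minimal sketch of the two-stage FER pipeline: face localization, then
# emotion classification. NOT the study's exact implementation: OpenCV's
# Haar cascade stands in for img2pose, and classify_emotion is a
# hypothetical hook for a Residual-Masking-style pre-trained model.
import cv2

EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise", "Neutral"]

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_emotion(face_crop):
    """Hypothetical hook: return a dict mapping each emotion to a probability."""
    raise NotImplementedError("plug in a pre-trained FER model here")

def process_video(path, sample_every_n=30):
    """Scan a session recording, emitting one emotion estimate per detected
    face roughly once per second (assuming 30 fps input)."""
    cap = cv2.VideoCapture(path)
    results, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
                probs = classify_emotion(frame[y:y + h, x:x + w])
                results.append((frame_idx, max(probs, key=probs.get)))
        frame_idx += 1
    cap.release()
    return results
```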

Video Qualitative Analysis

In this study, a multistep analysis was used to investigate when, how, and which shared forms of regulation emerged and functioned during the group work. The analysis also examined the challenges associated with regulation and the types of strategies the groups employed to regulate their emotions. In preparing the data, 30-second segments were created for qualitative content analysis (Silverman, 2020). As a unit of analysis, these segments provide a chronological overview of the group situation and a consistent, structured way to analyze the situation as it unfolds (Järvenoja et al., 2019), as sketched below.
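As an illustration of this segmentation step, the pandas sketch below bins timestamped events into 30-second analysis units; the column names and values are illustrative assumptions, not the study’s actual data format.

```python
# A small sketch of cutting a session timeline into 30-second analysis
# segments with pandas. Column names and values are illustrative.
import pandas as pd

events = pd.DataFrame({
    "t": pd.to_timedelta(["00:00:05", "00:00:40", "00:01:10"]),
    "speaker": ["S1", "S2", "S1"],
})
# Integer-divide elapsed seconds by 30 to get the segment index.
events["segment"] = (events["t"].dt.total_seconds() // 30).astype(int)
print(events.groupby("segment")["speaker"].apply(list))
# segment 0 -> [S1], segment 1 -> [S2], segment 2 -> [S1]
```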

In the first stage of analysis, each 30-second segment of the videos was coded, drawing on the self-regulated learning literature (Pintrich, 2000; Zimmerman & Schunk, 2011) and socially shared regulation of learning (Järvelä & Hadwin, 2013), to determine whether it constituted a regulatory episode (co-regulation or socially shared regulation) or a non-regulatory episode. Within this category, we distinguished each segment as cognitive, socio-emotional, or other interaction. The data were coded based on an adaptation of the coding schemes of Malmberg et al. (2017) and Järvenoja et al. (2019), which are particularly relevant to our research questions and consistent with the theoretical framework of SSRL. Each 30-second segment was coded with a single label. In instances where multiple characteristics were observable within a single segment, the code representing the strongest conceptual meaning was selected. The coding definitions are provided in Table 1 along with examples from our data. While acknowledging its limitations, the choice of a single code per segment was made to align the qualitative analysis more coherently with the subsequent automatic emotion analysis, thereby allowing more straightforward comparisons and interpretations.

Table 1 Coding scheme for interaction characteristic and regulation types

The qualitative video analysis was conducted by a single coder, who is also a researcher in the domain of socially shared regulation of learning. To validate the reliability of the coding process, a second coder independently annotated a subset of the dataset, specifically two groups, constituting 14.2% of the total dataset. A Cohen’s Kappa reliability test yielded a score of 0.71, signifying substantial inter-rater agreement.
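For reference, agreement of this kind can be computed with scikit-learn’s cohen_kappa_score; the labels below are illustrative stand-ins, not the study’s actual annotations.

```python
# A sketch of the inter-rater reliability check: two coders' labels for the
# same segments, compared with Cohen's Kappa. Labels here are illustrative.
from sklearn.metrics import cohen_kappa_score

coder_a = ["Cognitive", "Task execution", "SSRL", "Other", "Cognitive"]
coder_b = ["Cognitive", "Task execution", "CoRL", "Other", "Cognitive"]
print(f"Cohen's Kappa = {cohen_kappa_score(coder_a, coder_b):.2f}")
```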

Process Mining to Examine the Patterns of Interactions for Regulation

To investigate the emotional regulation patterns in synchronous computer-supported collaborative learning (CSCL), we utilized process-mining analysis. This involved applying the Fuzzy Miner method developed by Günther and van der Aalst (2007) to the qualitatively coded regulatory activities. The analysis was performed in Fluxicon’s Disco, a process mining tool commonly used in studies examining the process of learning events (Dindar et al., 2022; Nguyen et al., 2023).
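For readers without access to Disco, the sketch below shows the same kind of directly-follows process discovery over coded activities using the open-source pm4py library; it is a rough analogue of the Fuzzy Miner, not the tool used in this study, and the column names and rows are illustrative.

```python
# A sketch of directly-follows process discovery over coded regulation
# activities using the open-source pm4py library, as a rough analogue of
# Disco's Fuzzy Miner (the tool actually used in this study).
import pandas as pd
import pm4py

# One row per coded 30-second segment: group id, activity label, segment start.
df = pd.DataFrame({
    "group": ["G01", "G01", "G01", "G02", "G02"],
    "activity": ["Other", "Cognitive", "Task execution", "Cognitive", "CoRL"],
    "start": pd.to_datetime([
        "2021-10-01 09:00:00", "2021-10-01 09:00:30", "2021-10-01 09:01:00",
        "2021-10-01 09:00:00", "2021-10-01 09:00:30",
    ]),
})

log = pm4py.format_dataframe(df, case_id="group",
                             activity_key="activity", timestamp_key="start")
dfg, start_acts, end_acts = pm4py.discover_dfg(log)
print(dfg)  # e.g. {('Other', 'Cognitive'): 1, ('Cognitive', 'Task execution'): 1, ...}
```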

Learning Regulation Activities and Emotions Co-Occurrences Analysis

The results of facial expression recognition were resampled and matched to the 30-second segments of the qualitative video coding for a quantitative analysis exploring the relationship between detected facial emotions and regulatory activities. Descriptive statistics are reported for the distribution of emotions among different regulatory activities. Furthermore, the emotions of the members of each group were aligned for co-occurrence analysis. First, the emotion of each learner in a group was assigned only when the model’s predicted probability exceeded the threshold of 0.5 (p > 0.5). A Kruskal-Wallis H test was then applied to test for differences in emotion co-occurrences across learning regulation behaviors, as sketched below.
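The sketch below illustrates this computation under stated assumptions: each group member contributes one emotion label per second (or None when no class exceeded p > 0.5), synchrony is counted when two or more members share a label, and segment-level counts are compared with scipy’s Kruskal-Wallis H test. All numbers are illustrative, not the study’s data.

```python
# A minimal sketch of the emotional-synchrony count and the Kruskal-Wallis
# test. Assumes one emotion label per member per second, or None when no
# emotion exceeded the p > 0.5 threshold. All numbers are illustrative.
from collections import Counter
from scipy.stats import kruskal

def synchrony_count(emotions_by_member):
    """Count seconds in which two or more members share the same emotion."""
    n_seconds = len(next(iter(emotions_by_member.values())))
    shared = 0
    for t in range(n_seconds):
        counts = Counter(seq[t] for seq in emotions_by_member.values()
                         if seq[t] is not None)
        if counts and counts.most_common(1)[0][1] >= 2:
            shared += 1
    return shared

# Hypothetical per-segment synchrony counts grouped by regulation activity.
synchronies = {
    "SSRL": [12, 3, 11],
    "Cognitive": [9, 7, 8],
    "Task execution": [10, 6, 9],
    "Socio-emotional": [4, 2, 6],
}
h, p = kruskal(*synchronies.values())
print(f"H = {h:.2f}, p = {p:.3f}")
```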

Results

Learning Regulation Behaviors in Synchronous Online Collaborative Learning

The results of the qualitative video analysis for regulatory interactions and learning regulation are described in Table 2. In line with previous studies (e.g., Järvelä et al., 2016; Malmberg et al., 2017), most social interactions in collaborative learning related to task execution (f = 62.3%). While previous studies largely examined social interactions for regulation in face-to-face (Malmberg et al., 2017; Nguyen et al., 2022a, b) and asynchronous online collaborative learning (Iiskala et al., 2015; Lai & Hwang, 2016), our findings confirm similar learning regulation behaviors in the context of synchronous online collaborative learning. Furthermore, our findings show rare occurrences of CoRL and SSRL in synchronous online collaborative learning (f = 5.3% and f = 0.8%, respectively). These results reflect those of Malmberg et al. (2017), who also found that SSRL occurred infrequently in face-to-face collaborative learning, and corroborate the ideas of Järvelä et al. (2020), who suggested that social interactions do not often lead to learning regulation. Comparison of the findings with those of other studies confirms the need to promote learning regulation support in collaborative learning to enhance learning.

Table 2 Learning regulation activities coding

To investigate the patterns of learning regulation behaviors in synchronous online collaborative learning, we conducted a process mining analysis; the results are presented in Fig. 2. The literature has demonstrated the significance of learning regulation for success at both the individual and group levels (Dindar et al., 2022). While previous research has examined the patterns of learning regulation in face-to-face collaboration (Järvenoja et al., 2019) and asynchronous online collaborative learning (Iiskala et al., 2015), very little is known about whether similar patterns exist in synchronous CSCL. Our findings revealed that group learning patterns in synchronous CSCL frequently commence and conclude with other interactions unrelated to the learning process, while cognitive interactions set the stage for the learning regulation pattern. This aligns with prior results showing that cognitive interactions initiate regulatory adaptation cycles (Nguyen et al., 2023). In accordance with the extant SSRL literature (Dindar et al., 2020; Nguyen et al., 2023), our empirical investigation similarly reveals a limited incidence of SSRL in collaborative learning environments involving small groups (N = 3). Previous research has indicated that while collaborative settings theoretically offer opportunities for shared regulation, the actual manifestation of these regulatory activities in verbal interactions remains infrequent (Isohätälä et al., 2017). Our study extends the scope of inquiry to emotional regulation as operationalized through facial expressions. Interestingly, we found that co-regulation often iterates with task execution, whereas socially shared regulation mainly occurs after socio-emotional interactions. However, given the small sample size, caution must be applied in generalizing these findings. Notwithstanding the relatively limited sample, this study offers interesting insights into learning regulation patterns in synchronous CSCL.

Fig. 2

Learning regulation pattern in synchronous online collaborative learning

Learning Regulation Activities and Emotions Co-Occurrences Analysis

The results of facial emotion recognition were aligned among group members and matched with learning regulation activities for co-occurrence analysis. For each coded 30-second segment, we calculated the total counts of each type of emotion on a second-by-second basis. This granular approach enables a more nuanced understanding of the emotional dynamics within each coded interaction. Emotional synchrony was recognized whenever two or more group members shared the same emotion within a one-second frame. Figure 3 illustrates the distribution of shared emotions across learning regulation activities.

Fig. 3

Emotion distribution for different learning regulation activities

Interestingly, the most frequently shared emotion expressed by the learners was surprise. In the context of clinical reasoning, Lajoie et al. (2021) reported that anger appeared most frequently in self-regulated learning, with surprise occurring at the second-highest frequency. To check accuracy and gain more insight into this difference, we reviewed the video data for the various shared-emotion segments. A possible explanation lies in the nature of the learning tasks. In our study, the students were asked to share opinions and initiate discussions in English in the context of learning English as a second language, and learners often expressed surprise while listening to their group members share information and thoughts. A further study focusing on the effects of learning task design on the emotional aspects of learning is therefore suggested.

Table 3 reports the descriptive statistics for emotional synchronies in the different learning regulation activities. Mean refers to the average occurrence of emotional synchronies within a 30-second segment of interaction. The study sought evidence of differences in the number of emotional synchronies among regulatory activities; we acknowledge the coarseness of the 30-second segmentation, and future studies should apply more granular video coding. Emotional synchronies appeared most often in SSRL segments (M = 8.67, SD = 10.69) and least in socio-emotional interactions (M = 4.49, SD = 4.32). This result may be explained by the fact that SSRL is established on the sharedness of emotion and cognition among group members (Järvelä et al., 2019). Socio-emotional interactions, on the other hand, are needed for negotiation between learners confronting emotional challenges, in which group members may not share similar emotions (Järvenoja et al., 2019). The Kruskal-Wallis H test showed a statistically significant difference in the number of emotional synchronies among regulatory activities (p = 0.029). This combination of findings provides some support for the conceptual premise that AI techniques can offer new insights into the process of learning regulation. Nevertheless, the close mean values of emotional synchrony frequency for SSRL (M = 8.67), Cognitive (M = 8.31), and Task Execution (M = 8.50) interactions suggest that this metric alone is not sufficient to differentiate among them. The issue is compounded by the large standard deviations, which make it difficult to attribute an observed mean value, such as 8.59, to any specific condition. These findings indicate a need for more discriminative measures in future research.

Table 3 Emotional synchronies in learning regulation activities

Discussion

The main goal of the current study was to provide empirical evidence of how AI methods can be implemented to investigate the interrelation between individual affective states and shared regulatory activities in CSCL. While previous studies have provided evidence of learning regulation in face-to-face collaborative learning and asynchronous CSCL, much less is known about the phenomenon in synchronous CSCL. Accordingly, this study addresses the research gap by utilizing AI-enhanced techniques to examine learning regulation in synchronous CSCL. This is needed, as prior studies have noted the importance of developing innovative methods to capture all the phases, facets, and changes over time of regulatory activities to advance the field (Hadwin et al., 2018; Järvelä et al., 2019). The study introduces an approach for analyzing this relationship by aligning the core emotions captured by FER with the corresponding interactions when examining co-regulation and socially shared regulation in a synchronous CSCL environment. In this regard, computer vision and machine learning are proving useful for detecting the micro-expressions, moods, and temperaments of learners by automatically detecting nonverbal behavior and affect (Behera et al., 2020).

Despite growing recognition that learners’ emotions and cognition are intertwined in guiding the learning process (Nguyen et al., 2022a, b, 2023; Woolf et al., 2009), studies of emotion regulation remain few in comparison to other constructs (motivation, cognition, metacognition). Saariaho et al. (2016) examined the emotional landscape that student-teachers experience during self- and co-regulated learning. While their results confirm prior findings that positive emotions are essential elements across all regulatory phases, the study’s retrospective approach, analyzing sketched visualizations of past emotional experience, arguably affects data accuracy. In a study by Ucan and Webb (2015), emotion regulation was examined alongside motivational and metacognitive control. Our study illuminates the need for more refined empirical work to distinguish the high emotional synchrony values of SSRL from those of other interactions such as Cognitive and Task Execution. The current data point to a trend but do not offer a statistically significant basis for pairwise differentiation, a point that should be considered when interpreting these findings.

Järvelä et al. (2020) suggest that “with the aid of advanced technologies, multidisciplinary collaboration between the learning sciences, affective computing and machine learning can help to study these complex phenomena” (p. 2392). The complex phenomena refer to SSRL in collaborative learning, in which learners actively engage, sustain, and regulate their cognition, emotions, motivation, and behaviors toward the accomplishment of their learning goals. In the context of synchronous CSCL, this multidisciplinary study offers empirical evidence for that research proposition. The study not only contributes to our understanding of learning regulation but also provides a methodological approach that has proven useful for expanding our understanding of how AI can inform research in the learning sciences. We purposely focus on shared regulation and the affective aspect, adapting Järvenoja et al.’s (2019) method of studying emotion regulation from video data, with the added alignment of unobtrusive emotion measures from FER. Coding for regulatory episodes, metacognitive processes, and emotion regulation strategies during collaborative learning revealed the dynamic relationship between emotions and regulation.

The study contributes to the literature on educational technology, which has endorsed AI as a promising tool for transforming education. Although the literature has shown that AI technology brings great opportunities for improving learning and teaching, the design and development of AI-enhanced support for learning and teaching have faced several challenges (Järvelä et al., 2020; Ouyang & Jiao, 2021; Roll & Wylie, 2016). For instance, as AI in education is an interdisciplinary area of research, there is still a divide between those who understand AI methods and techniques and those who know how AI could be applied to benefit learning and teaching. Furthermore, there remains a paucity of evidence on systematic approaches to applying AI in educational research. Treating AI models as scientific tools (Baker, 2000), this study attempts to bridge the gap between AI machine learning and learning research. Despite the importance of learning regulation and collaborative learning, the literature contains few attempts to use AI technologies to advance our understanding of, and support for, learning regulation in synchronous CSCL. Accordingly, our work aims to provide methodological and theoretical grounding for the further development and implementation of AI in education to support learning regulation.

Limitation and Future Directions

We acknowledge a few limitations to our work. Firstly, the study is confined to a limited sample of participants from a single higher education setting, and observation was conducted online based on recordings of the language sessions. Nevertheless, cognitive and socio-emotional interactions, co-regulation, and socially shared regulation were examined through group dynamics using fixed analysis units, which yielded a reasonably sufficient amount of data to validate the findings. Secondly, due to the restricted range of problem-solving tasks and the predominance of certain learning task designs, some key emotions may have dominated the overall results. There is therefore a need for future research to examine emotional regulation in learning across different contexts and learner characteristics, in order to compare, contrast, and cross-validate the findings obtained here. Thirdly, although our research adopted a video analysis approach with 30-second segments, a technique validated in previous studies (Järvenoja et al., 2019), there may be scope to examine approaches that provide insights at different levels of granularity. Lastly, because our data collection used AI to conduct research directly on Teams through video-based qualitative analysis, risks may arise from spontaneous technical issues, and facial expressions may not fully reflect the regulatory processes. In response, the data analysis phase employed external coders to evaluate and infer participants’ subjectively experienced emotions and verify the results. We also partially traced students’ temporal emotional processes through the sessions and triangulated them with the coding of their interactions to help establish the link between emotions and regulatory interactions, allowing inferences about differences across learners and learning sessions. Regardless of these limitations, this work provides valuable insights into facial expression recognition for emotional regulation in online learning within a collaborative, socially shared learning context. Moreover, as the ultimate goal of this study is to inform learning regulation through the application of AI facial expression recognition, our results shed light on the potential of this methodological approach rather than proposing a new model. Future scholars are therefore encouraged to extend the sample size and test the approach across stages in diverse classroom settings (e.g., no internet connection, interruptions during teaching and learning activities, constrained teacher-student interactions). Future work may also add other covariates (e.g., gender, age) to determine the effectiveness of this methodology across contexts and to avoid sample bias.