1 Introduction

For developing intelligent tutoring systems, it is crucial to consider learners' affective states in the learner model, as they can impact a range of individual and social behaviours essential to learning (Yadegaridehkordi et al. 2019; Hasan et al. 2020). For example, emotions have been shown to directly impact students' behaviour, decision-making, thinking ability, well-being, resilience and communication (Tyng et al. 2017; Alarcão and Fonseca 2019). Acknowledging the impact of emotion in online learning, researchers have demonstrated ways in which educators and industry practitioners might incorporate emotional aspects of learning into the online learning environment to enhance motivation, persistence, engagement and overall learning outcomes (Yadegaridehkordi et al. 2019; Hasan et al. 2020; Linnenbrink-Garcia et al. 2016; Cunha-Perez et al. 2018). Examples include incorporating affective feedback and support into the intelligent tutoring system, which can make online learning more enjoyable and motivate students to learn (Jiménez et al. 2018), leading to higher learning outcomes (Cunha-Perez et al. 2018; Rajendran et al. 2019).

To this end, numerous efforts have been made over the last decade to design automated detectors for intelligent tutoring systems (ITSs) that can recognise learners' affective states (such as boredom, confusion, engaged concentration, frustration, surprise, and anxiety) in online learning. These affect detectors often use advanced machine learning techniques (such as supervised learning, unsupervised learning and active learning) (Hasan et al. 2020) and can be broadly divided into two categories: sensor-based and sensor-free (Yang et al. 2019; Leong 2015; Henderson 2023). Sensor-based affect detectors use physiological (e.g. web camera, microphone) and neurological sensors (e.g. electroencephalography, EEG) to measure affective states. Sensor-free detectors rely on the log files generated by students' interactions with the ITS, without using any sensors. While sensor-based affect detection methods have exhibited more accurate detection performance than sensor-free methods and can be generalisable across learning environments or educational domains, they have limitations (Henderson 2023). For example, despite their effectiveness, sensor-based detectors have been found impractical for large-scale deployment in real classrooms and difficult to scale to larger groups of students due to cost and privacy constraints (Yang et al. 2019; Richey et al. 2019). Using physiological and neurological sensors also requires adequate technical expertise for deployment and is prone to calibration issues, hardware failure, and mis-tracking (Leong 2015; Henderson 2023; Henderson et al. 2020). Sensor-free affect detectors, in contrast, are typically derived from trace log data generated by learner interactions with ITSs. These detectors can recognise learners' affective states at any stage and level during online learning by examining various interactions with the system, and they bypass many of the issues prevalent in sensor-based approaches (Henderson 2023). They are more privacy-aware and cost-effective, less obtrusive, and convenient to set up because they do not depend on external hardware. They can be ubiquitously deployed in an online learning environment, making them a promising option for large-scale deployment (Yang et al. 2019; Leong 2015; Henderson 2023; Richey et al. 2019; Henderson et al. 2020; Lan et al. 2020).

Nevertheless, existing sensor-free affect detectors exhibit other challenges and limitations concerning their transferability, effectiveness, interpretability, applicability, and generalisability across systems and diverse populations (Baker 2019). Regarding generalisability, these detectors show low detection accuracy and do not generalise across ITSs (Richey et al. 2019; Baker 2019; Paquette et al. 2015). In terms of transferability, a detector developed using a dataset obtained from one system (e.g. Cognitive Tutor, MathTutor) does not show realistic detection performance on separate datasets derived from different systems (e.g. ASSISTments, decimal tutor) (Richey et al. 2019; Paquette et al. 2015). It is also unknown whether such detectors are domain-independent, for example, whether a detector built in a mathematics learning domain can be applied to another domain such as computer programming (Baker 2019). Because these detectors rely heavily on their individual environment and domain, the set and number of features used in those systems are typically not directly generalisable (Richey et al. 2019). Therefore, it is necessary to develop sensor-free detectors by selecting a minimal set of relevant features related to students' affective states that yields high prediction performance and generalises to other systems and domains.

In this paper, we aim to examine whether it is possible to develop generalisable frustration detectors using ML classifiers that can recognise students' frustration from their interactions with an online learning system, without using any physical or physiological sensors. We limit our research to the automatic detection of frustration only. Frustration is a problematic cognitive-affective mental state that promotes a negative "state of stuck" (Meldrum 2002; Rajendran et al. 2013) and has high significance in online learning (Richey et al. 2019; Rajendran et al. 2013). It also refers to the blocking of behaviour directed towards a goal (Morgan et al. 1986), which is crucial in educational settings (Rajendran et al. 2013; Lawson 1965). It has been observed that prolonged, unacknowledged frustration may lead to student disengagement and attrition (Rajendran et al. 2013; Lawson 1965; DeFalco et al. 2018). Hence, detecting students' frustration is essential: it can enable an ITS to initiate a pedagogical intervention for struggling students who are frustrated and may otherwise lose confidence and interest in learning and eventually drop out (Leong 2015). Notably, in an interactive task-oriented learning environment like an ITS, where the construction and achievement of goals are critical to student learning episodes, early detection of frustration is pivotal to give the ITS sufficient time to enact corrective "affective scaffolding" strategies. To construct frustration detectors, we first investigate the minimal optimal set of features involved in frustration from the log data of students' interactions with the online learning system. Secondly, we examine the performance of popular machine learning classifiers using the identified set of optimal features in predicting frustration and compare the results with those reported for previously built detectors. Thirdly, we construct generalisable frustration detectors using the best-performing classifiers and attempt to improve their detection performance by applying several techniques. Lastly, we investigate how well they generalise across different learning domains and systems by evaluating them on independent datasets.

2 Related work

This section reviews popular tutoring systems that support student learning and in which researchers have applied various approaches to identify students' affective states from log files of students' interactions, or have modelled those affective states directly. We describe eight systems: Mindspark, AutoTutor, Programming Lab, Crystal Island, Cognitive Tutor, ASSISTments, MathTutor, and Decimal Tutor.

Mindspark (Rajendran et al. 2019, 2013) is a math ITS that can detect students' frustration while they interact with it. The frustration detection model is derived from a theoretical definition of frustration based on the analysis of goal-blocking events (Morgan et al. 1986). It selects and combines ITS log file features that relate to goal-blocking events and cause frustration. A total of seven features were selected, such as the response to a question, the response time to answer a question, and the time spent on the explanation of the answer. The extracted features were applied to various ML classifiers, such as Bayes, SVM, linear regression and decision tree classifiers. The highest classification accuracy reported was 88.84 per cent (F1 score 0.46, Cohen's kappa 0.46), achieved using the linear regression classifier.

Based on features extracted from the log data of students' interactions with the learning system, a dialogue-based tutoring system named AutoTutor was developed (D'Mello et al. 2008). The system can detect students' confusion, boredom and frustration in online learning. A set of eleven features is used for detecting affective states, such as response time, the number of characters in a student's response, and tutor feedback on a student's response. Various ML classifiers were applied to the selected features. The best classification accuracy for distinguishing between frustration and the neutral state was 77.7 per cent, achieved using logistic regression and C4.5 decision tree classifiers.

To detect students' average frustration in computer programming exercises across different labs, a learning environment called Programming Lab (Rodrigo and Baker 2009) was developed. The system can detect students' frustration based on information from compiler data, such as the average time between compilations and consecutive pairs of the same error. A set of four features, reduced from eleven after correlation analysis with affective states, was selected based on the researchers' knowledge and the system's Error Quotient (EQ) construct. With the linear regression classifier, a correlation coefficient of r = 0.3178 was obtained using the selected features.

Crystal Island (McQuiggan et al. 2007) is a task-oriented learning environment that can predict students' frustration using data from the log file and physiological signals. Four types of features (temporal, locational, intentional, and physiological responses) were selected based on the appraisal theory of frustration. The system applied various ML classifiers to the selected features, such as decision trees, Naïve Bayes, and support vector machines. The highest classification accuracy reported was 88.7 per cent, with a recall of 88.9 per cent, using the decision tree classifier.

Sabourin and colleagues proposed an emotion detection model based on dynamic Bayesian networks (DBN) (Sabourin et al. 2011). The model can capture students' frustration and confusion as they interact with a game-based learning environment named Crystal Island. Students self-reported their affective state during their interaction with the learning system, and the DBN model was created from the collected dataset. Students' personal attributes, such as mastery approach, and environmental variables (i.e. goals completed, worksheet checks) were considered during model creation. The model achieved accuracies of 28 per cent for predicting emotion and 56 per cent for valence.

Cognitive Tutor Algebra is a learning environment that helps students learn complex mathematical problems (Baker et al. 2012). Researchers developed automated detectors for predicting students' gaming behaviour (Paquette and Baker 2019) and affective states (engaged concentration, confusion, frustration, and boredom) solely from students' interactions within the Cognitive Tutor Algebra system (Baker et al. 2012). A hybrid model combining machine-learned and knowledge-engineered approaches showed superior predictive performance for students' gaming behaviour (Paquette and Baker 2019). For affect detection, Baker et al. (2012) conducted a study in which a total of 58 features were distilled after feature selection from the system's log file. These features were then fed into various ML classifiers, such as J48 decision trees, step regression, JRip, Naïve Bayes, and REP-Trees. For frustration detection, the best classifier was the REPTree, which achieved an AUC of 0.99 and a Kappa of 0.23.

ASSISTments is a popular free web-based math learning platform that provides immediate feedback to the many students who use it in the classroom and for homework daily (Botelho et al. 2017). The system's data have been used by numerous researchers to detect students' affective states: boredom, engaged concentration, confusion, and frustration (Footnote 1). The affect detection models were developed based on a dataset that contains synchronised log data from student actions within the ASSISTments system and human-labelled affective states. A total of forty-three features were generated from the log data (Pardos et al. 2013). The features were aggregated within each clip/observation interval by taking each feature's average, min, max, and sum, resulting in 204 features per clip. The process for developing sensor-free affect detectors for ASSISTments replicates a process that had been successful for a different intelligent tutor, Cognitive Tutor Algebra (Baker et al. 2012). Eight ML classifiers were considered for fitting sensor-free affect detectors, including J48 decision trees, step regression, JRip, Naive Bayes, K*, and REP-Trees. The best performance achieved for frustration detection was an A' of 0.682 and Cohen's Kappa of 0.324 using the Naive Bayes algorithm (Pardos et al. 2013), an A' of 0.65 and Cohen's Kappa of 0.23 using REPTree (Baker and Ocumpaugh 2014), and an A' of 0.60 and Cohen's Kappa of 0.20 using JRip (Baker and Ocumpaugh 2014).

In an attempt to improve the performance of the frustration detectors, Wang et al. (2015) applied several ML classifiers to ASSISTments' affective dataset, including linear regression, decision trees, step regression, Naïve Bayes, JRip, J48, REPTree, Bayesian logistic regression, and K*, using 232 regenerated features. The study reported an average frustration detection performance of A' 0.60 and Cohen's Kappa 0.15; however, the highest-performing classifier was not reported. In another work, Botelho et al. (2017) applied three deep learning models: a Recurrent Neural Network (RNN), a Gated Recurrent Unit (GRU) neural network, and a Long Short-Term Memory network (LSTM). For frustration detection, the LSTM model provided the best result (A' of 0.76 and Cohen's Kappa of 0.15).

A recent study (Richey et al. 2019) attempted to apply existing affect detectors, which had been built using interaction data from a different tutor (ASSISTments), to their own study data. They considered MathTutor (Aleven et al. 2009) and the decimal tutor (Richey et al. 2019), both of which were used to gather the data in their study and were implemented on the same platform. The aim was to detect students' affective states (boredom, confusion, and frustration) from the study data generated with these systems. The data contained 598 student records and 37 interaction features. Once the detectors were applied to the new dataset, the researchers found unrealistically low proportions of all states: 0% incidence of off-task behaviour, 3.86% incidence of boredom, 0.03% incidence of confusion, and 0.06% incidence of frustration. This indicated that the detectors were not directly generalisable to the new dataset due to differences in interaction features.

Our comprehensive review of literature delving into data mining approaches for automated frustration detection (Table 1) reveals a significant focus on sensor-free frustration detection over the past decade. While these methods prove valuable, they do not reach the high accuracy achieved by sensor-based detection and frequently depend on specific domains and systems. Moreover, very few studies explore the potential of creating affect detectors that could maintain high prediction accuracy while being applicable to a range of domains and systems. This area of study is relatively unexplored. Most strikingly, we found no studies addressing the construction of sensor-free detectors through identifying universally applicable optimal features and the subsequent testing of these detectors across different domains and learning environments.

Table 1 Summary of various systems and approaches for frustration detection from log files of students’ interactions with the learning environment

3 Research methodology

A quantitative research methodology is used in this research. This involves collecting and analysing data from human participants, developing sensor-free automatic frustration detectors using various ML classifiers, and conducting a series of experiments on datasets using ML classifiers to identify students' frustration in online learning.

3.1 Constructing generalisable sensor-free frustration detectors

Frustration is a reason for students’ disengagement and can eventually lead to attrition (Kapoor et al. 2007), with prolonged frustration being associated with poorer learning outcomes (Richey et al. 2019). Therefore, it is crucial to identify students’ frustration as early as possible in online learning. In this work, we attempted to develop frustration detectors using various ML classifiers. These detectors can determine a student’s affective state (frustrated/non-frustrated) at any point during their interaction with an online learning system, solely based on the student’s interaction with the system. To achieve this, we first identified the most important features that are involved in causing frustration from the log data of students’ interaction with the learning system. We adopted a popular real-world dataset and conducted a series of experiments to facilitate this process. Later, we attempted to fit the identified features into five common classification algorithms to build detectors and investigate how accurately they can detect frustration. We selected the classifiers that produced the highest results and attempted to improve the detectors’ performance further using certain techniques. The performance of the selected detectors was also validated on independently collected datasets using a different learning system to examine their generalisability across different learning environments and domains. The overview of the process is shown in Fig. 1.

Fig. 1 Illustration of developing generalisable sensor-free frustration detectors using various ML classification algorithms

3.2 Dataset

We used a dataset (Footnote 2) drawn from the ASSISTments learning platform. ASSISTments is a web-based tutoring platform for teaching mathematics (Pardos et al. 2013). The system assists students with sequences of scaffolding support, immediate feedback and on-demand hints when they make errors. Over 40,000 students, taught by nearly 1,400 teachers, used the system during the 2015–2016 school year in central Massachusetts. The system has been found to be effective in a large-scale randomised controlled trial (Botelho et al. 2017).

The affective dataset was collected in real classrooms as students worked within the ASSISTments system. Human observers were present in the classrooms and followed the Baker Rodrigo Ocumpaugh Monitoring Protocol (BROMP) to collect data (Ocumpaugh et al. 2015). The dataset contained 7,663 field observations from 646 students in six schools in urban, suburban, and rural settings (Botelho et al. 2017). Each observation contains a student's affective state label for a 20-s observation interval and a set of 51 action-level features, developed using an extensive feature-engineering process, that summarises the student's activities within ASSISTments during this interval. As the observation intervals, or clips, often contain more than one student activity within the learning system, the features were aggregated within each clip by taking each feature's average, min, max, and sum, giving 204 features per clip. Observers coded four affective states during the observation: confusion, boredom, frustration and engaged concentration.
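As a minimal illustration of this clip-level aggregation (a sketch assuming pandas and hypothetical column names; the actual ASSISTments pipeline used its own feature-engineering code), the 204 features can be thought of as the output of a grouped aggregation like the following:

```python
import pandas as pd

# Hypothetical action-level log: one row per student action within a clip
# (a 20-second observation interval); column names are illustrative only.
actions = pd.DataFrame({
    "clip_id":    [1, 1, 1, 2, 2],
    "correct":    [0, 1, 1, 0, 0],
    "hint_total": [1, 0, 0, 2, 1],
})

# Aggregate each action-level feature within a clip by average, min, max and
# sum, mirroring how 51 action-level features become 204 clip-level features.
clip_features = actions.groupby("clip_id").agg(["mean", "min", "max", "sum"])

# Flatten the resulting MultiIndex columns into names such as "correct_mean".
clip_features.columns = [f"{feat}_{stat}" for feat, stat in clip_features.columns]
print(clip_features)
```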

3.3 Optimal feature selection

Attribute or feature selection is a critical data preprocessing strategy (Gnanambal et al. 2018) and one of the essential and frequently used techniques for preparing data efficiently and effectively for various data-mining and machine-learning problems (Li et al. 2018). It can remove irrelevant, redundant or noisy features, reduce the risk of overfitting, decrease the overall training time and improve the generalisation ability of machine learning models (Sahebi et al. 2020; Chen et al. 2017; Boucher et al. 2015; Batchu and Seetha 2021). It also enhances machine learning performance by alleviating the problem of information abundance coupled with knowledge shortage in understanding data, reducing the curse of dimensionality, increasing the accuracy of the learning process, and reducing the complexity and time of model building (Sahebi et al. 2020; Li et al. 2017). In our study, we apply feature selection techniques to improve the generalisability and classification accuracy of the frustration detection models.

The ASSISTments dataset contains 204 features, and feeding all of them into the detectors makes it difficult to develop domain-independent and system-independent frustration detectors. Previous studies have reported that if the most important and common features involved in specific affective states are not distilled efficiently, detectors may underperform in independent learning environments (Richey et al. 2019; Paquette et al. 2015). Therefore, to construct generalisable frustration detectors, it is crucial to identify, from the set of 204 features, those primarily involved in detecting frustration without compromising the predictive accuracy of the detection models. We employed various feature selection techniques to find the optimal attributes that can determine frustration from interactions with the online learning system.

Our optimal feature selection process involves two phases. The first phase applies an attribute selection technique to extract the top-ranked features from the full set of 204 features of the ASSISTments dataset. The second phase applies various standard ML classifiers (D'Mello et al. 2008) to the resulting subsets of features to examine their prediction accuracies as the number of features is reduced from the full set of 204, through extensive comparison, until maximum accuracy is achieved. We used the Waikato Environment for Knowledge Analysis (Weka), a data mining package, for this analysis. Weka is an open-source software package that implements machine learning algorithms for data mining tasks (Witten and Frank 2005). Weka's attribute selection algorithms (Gnanambal et al. 2018; Sudha 2014) are used to filter and rank the attributes by removing irrelevant features. Attribute selection is a two-step process (Gnanambal et al. 2018; Sudha 2014): subset generation and ranking. Subset generation is a search process that compares a candidate subset to the best subset determined so far (Gnanambal et al. 2018). If the new candidate subset returns better results in the evaluation, it becomes the new best subset, and the process continues until the termination condition is reached. The ranking of attributes is used to determine their importance (Gnanambal et al. 2018; Sudha 2014). The Weka attribute evaluators used in this study are InfoGainAttributeEval and CfsSubsetEval, and the search methods used are Ranker, GreedyStepwise, and BestFirst (Table 2). The InfoGainAttributeEval evaluator assesses attributes by calculating their information gain, a measure of how much each attribute decreases the overall uncertainty (entropy) regarding the class variable (Gnanambal et al. 2018; Sudha 2014). This calculation is based on the change in entropy before and after the dataset is partitioned according to each attribute, and it effectively identifies the attributes that are most informative for classification. The CfsSubsetEval method in Weka evaluates the worth of attribute subsets by considering the individual predictive ability of each attribute and the degree of redundancy between them (Gnanambal et al. 2018; Sudha 2014). It selects attribute subsets that are highly correlated with the class while having low inter-correlation, aiming to identify groups of features that collectively contribute most effectively to the prediction task.
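For reference, the information gain that InfoGainAttributeEval computes for an attribute A with respect to the class C can be written in its standard form as

$$\mathrm{IG}(C, A) = H(C) - H(C \mid A), \qquad H(C) = -\sum_{c} p(c)\,\log_2 p(c),$$

so an attribute scores highly when partitioning the data by its values substantially reduces the entropy of the class distribution.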

Table 2 Selection of optimal features using various evaluators and methods from ASSISTments dataset

The Ranker method complements this by evaluating and ordering attributes according to their individual contributions to the prediction task, as determined by their information gain scores (Gnanambal et al. 2018; Sudha 2014). Attributes are ranked from the most to the least informative. The outcome of this process is a prioritised list of features from which the top contributors are selected for model building. It is useful to select features based on their individual contributions to the prediction task, as opposed to methods that consider attribute subsets. The GreedyStepwise method is a search algorithm used for feature selection. It operates by iteratively adding or removing attributes to find the best subset of features (Gnanambal et al. 2018; Sudha 2014). Starting with no attributes, it evaluates each attribute’s contribution to the prediction model, adding the most informative one at each step. The process continues adding new attributes until it no longer significantly improves the model’s performance. This method efficiently identifies a relevant subset of features from a larger set, ensuring that the final model is effective and not overly complex. Finally, the BestFirst method is a search algorithm used for feature selection that evaluates attribute subsets and selects the best subset based on their overall contribution to the prediction model’s performance (Gnanambal et al. 2018; Sudha 2014). It works by exploring the space of attribute subsets either by starting with no attributes and adding them (forward search) or beginning with all attributes and removing them (backward search). The algorithm evaluates subsets based on a given criterion (like accuracy) and selects the one that maximises this criterion. BestFirst employs a heuristic search and may consider multiple paths in the attribute space, unlike GreedyStepwise, which follows a single path by adding or removing one attribute at a time. This method is useful for finding an optimal or near-optimal set of features for predictive modelling.
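For readers who prefer Python tooling, the sketch below approximates the two-phase procedure with scikit-learn and pandas: mutual information stands in for Weka's information gain ranking (InfoGainAttributeEval with Ranker), and a simple CFS-style merit with a greedy forward search stands in for CfsSubsetEval with GreedyStepwise. It is an illustrative analogue under these assumptions, not the Weka procedure we actually ran.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_by_information_gain(X: pd.DataFrame, y: pd.Series, top_k: int = 10) -> list[str]:
    """Phase 1 analogue: rank features by mutual information with the class
    (the role InfoGainAttributeEval + Ranker play in Weka) and keep the top_k."""
    scores = mutual_info_classif(X, y, random_state=0)
    ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return list(ranking.head(top_k).index)

def cfs_merit(X: pd.DataFrame, y: pd.Series, subset: list[str]) -> float:
    """CFS-style merit: reward correlation with the class, penalise redundancy."""
    k = len(subset)
    rcf = np.mean([abs(X[f].corr(y)) for f in subset])  # feature-class correlation
    if k == 1:
        return rcf
    corr = X[subset].corr().abs().values
    rff = corr[np.triu_indices(k, k=1)].mean()          # feature-feature correlation
    return (k * rcf) / np.sqrt(k + k * (k - 1) * rff)

def greedy_cfs(X: pd.DataFrame, y: pd.Series, candidates: list[str]) -> list[str]:
    """Phase 2 analogue: greedy forward search (GreedyStepwise-like) over the
    phase-1 candidates, adding features while the CFS merit keeps improving."""
    selected, remaining, best = [], list(candidates), 0.0
    while remaining:
        scored = {f: cfs_merit(X, y, selected + [f]) for f in remaining}
        f_best, merit = max(scored.items(), key=lambda kv: kv[1])
        if merit <= best:   # stop when no candidate improves the merit
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best = merit
    return selected

# Usage (X: clip-level feature matrix, y: 0/1 frustration labels as a Series):
# top10 = rank_by_information_gain(X, y, top_k=10)
# optimal_features = greedy_cfs(X, y, top10)
```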

The InfoGainAttributeEval evaluator was applied to the full set of 204 features, with Ranker as the search method for ranking. As a result, a subset of 71 features was ranked by significance (Table 2). In an attempt to reduce the number of features further, the same evaluator and ranking method were reapplied, this time retaining only a predefined number of attributes specified through the user selection option, in this case the top 10 attributes.

In the next step, a different evaluator, CfsSubsetEval, and a different search method, GreedyStepwise, were applied to the subset of 10 features to see whether further reduction was possible. This generated a subset of 7 ranked features from the 10 features. The selection process produced the same result when a different search method (BestFirst) was used with the CfsSubsetEval evaluator. Finally, we obtained a subset of the 7 most important features from the original 204. The features are averagecorrect, mincorrect, sumcorrect, averagepercentcorrectperskill, sumhinttotal, averagehinttotal, and mintotalfrattempted, and they are described in Table 2. These features correlate with students' frustration levels: lower averages of correct actions and increased help requests are associated with frustration, as indicated in previous research (Pardos et al. 2013). They are generalisable because they capture universal aspects of the learning process that are independent of the specifics of the curriculum, subject matter, or learning style and can be adopted by different online learning platforms (Baker and Yacef 2009). They encompass key learning metrics, such as the correctness of responses ('averagecorrect', 'mincorrect', 'sumcorrect'), the extent of help-seeking behaviour ('sumhinttotal', 'averagehinttotal'), and student proficiency across skill areas ('averagepercentcorrectperskill'), which are all significant components of learning interactions across various educational contexts (Pardos et al. 2013; Baker et al. 2010). As a result, these features can offer a fundamental and universal representation of student behaviour and performance. For a computer-based learning system to collect this information effectively, it needs to meet several minimum requirements, sketched in the example below. Firstly, it needs a comprehensive tracking system that can record all student interactions, including the correctness of responses and the use of auxiliary resources such as hints (Romero and Ventura 2010). This allows a detailed understanding of how students interact with the learning material and resources. Secondly, robust logging mechanisms are required to ensure that all these interactions are accurately recorded and stored for subsequent analysis (Romero and Ventura 2010). Lastly, the system needs categorisation and compilation functionality that can compile this information on a per-skill or per-topic basis (Hershkovitz and Nachmias 2009). These capabilities facilitate a deeper understanding of a student's learning journey, providing insight into their struggles, successes, and overall learning trajectory, thereby aiding in accurately detecting students' affective states.
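To make these minimum requirements concrete, the sketch below shows one possible shape for an action-level log record from which the seven features above could later be aggregated; the field names are hypothetical and do not describe the schema of any particular LMS.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ActionLogRecord:
    """One logged student action; aggregating such records per clip yields
    correctness, hint-usage and per-skill features like those in Table 2."""
    student_id: str
    problem_id: str          # problem the action belongs to
    skill_id: str            # per-skill/topic categorisation
    timestamp: datetime
    attempt_number: int      # how many attempts so far on this problem
    is_correct: bool         # correctness of this response
    hints_requested: int     # auxiliary-resource (hint) usage for this action

# Example record a tracking/logging layer might emit:
record = ActionLogRecord(
    student_id="s001", problem_id="p07", skill_id="arrays",
    timestamp=datetime(2022, 5, 10, 14, 32, 5),
    attempt_number=2, is_correct=False, hints_requested=1,
)
```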

Next, we applied various popular and widely used machine learning classifiers to the generated subsets of features (Table 2) to observe the classifiers' performance (reported in Table 3). The evaluation results are discussed in Sect. 3.5, and the metrics used for this evaluation are explained in the next section.

3.4 Goodness metrics to evaluate classifiers

In the ASSISTments dataset, the distribution of the labels is non-uniform (Botelho et al. 2017). The vast majority of the clips, approximately 80%, are labelled as engaged concentration, followed by 12% as boredom and only around 4% as confusion and frustration, leaving both confusion and frustration underrepresented. Although it is encouraging that students are mostly concentrating when working within ASSISTments, standard measures such as the F measure, precision and recall will not reflect relevant evaluation results or accurately depict the effectiveness of a classifier when one class is underrepresented (Pardos et al. 2013). A classifier may predict all instances as the majority class (not frustrated) and still obtain high accuracy, resulting in an artificially good model on an imbalanced dataset. To counteract this, previous works conducted cost-sensitive analyses (Lan et al. 2020; Pardos et al. 2013; Wang et al. 2015; Ocumpaugh et al. 2015; Baker et al. 2014). They used the A' metric (equivalent to the area under the ROC curve) to examine whether a classifier is truly appropriate in a cost-sensitive manner. A' represents the probability that the model can discriminate a randomly chosen positive case from a randomly chosen negative one. An A' value of 0.5 indicates chance-level performance, while 1.0 represents perfect performance.

We also used Cohen's Kappa to assess the level of agreement and to compare the performance of our classifiers with previously reported work. Cohen's Kappa assesses the degree to which the model is better than chance at identifying the affective class labels. A Kappa of 0 indicates chance-level performance, while a Kappa of 1 indicates perfect performance; a Kappa of 0.15 corresponds to a detector that is 15% better than chance at identifying affect. All models were evaluated using fivefold cross-validation, as in previous works (Lan et al. 2020; Botelho et al. 2017).
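As a concrete illustration of why accuracy alone is misleading here, the short sketch below (using scikit-learn on a synthetic label vector with roughly a 4% positive rate, so not our actual data) shows a majority-class predictor obtaining high accuracy while Kappa stays at 0 and the A'/AUC remains at chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

# Synthetic imbalanced labels: roughly 4% "frustrated" (1), 96% "not frustrated" (0).
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.04).astype(int)

# A degenerate detector that always predicts the majority class.
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))       # high (~0.96) despite learning nothing
print(cohen_kappa_score(y_true, y_pred))    # 0.0: no better than chance

# A' / ROC AUC is computed from scores; a constant score cannot discriminate
# positives from negatives and therefore sits at the chance level of 0.5.
print(roc_auc_score(y_true, np.full(len(y_true), 0.5)))
```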

3.5 Performance evaluation of various ML classifiers

To recap, we applied various feature selection techniques to select the most appropriate features associated with frustration in an online learning environment, aiming to improve the generalisability and classification accuracy of the frustration detectors. Table 3 presents a comparative analysis of the performance of five machine learning classifiers, namely Bayesian Networks (BN), Naive Bayes (NB), the J48 decision tree, Random Forest (RF), and K-Nearest Neighbours (KNN), across three versions of the dataset with different feature counts (204, 71, and 7 features), using Weka's default settings for each classifier.
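For orientation, a scikit-learn analogue of this comparison might look like the sketch below; GaussianNB stands in for Naive Bayes, a CART decision tree stands in for J48, and the Bayesian Network has no direct scikit-learn counterpart, so it is omitted. The sketch reports mean AUC and Cohen's Kappa over stratified fivefold cross-validation and is illustrative only, since the results in Table 3 come from Weka's implementations and default settings.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, cohen_kappa_score

def compare_classifiers(X, y) -> pd.DataFrame:
    """Evaluate several classifiers with fivefold CV, reporting AUC and Kappa."""
    models = {
        "NB":  GaussianNB(),
        "J48": DecisionTreeClassifier(random_state=0),   # CART stand-in for J48
        "RF":  RandomForestClassifier(random_state=0),
        "KNN": KNeighborsClassifier(),
    }
    scoring = {"auc": "roc_auc", "kappa": make_scorer(cohen_kappa_score)}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    rows = []
    for name, model in models.items():
        res = cross_validate(model, X, y, cv=cv, scoring=scoring)
        rows.append({"model": name,
                     "AUC": res["test_auc"].mean(),
                     "Kappa": res["test_kappa"].mean()})
    return pd.DataFrame(rows)
```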

Table 3 Performance evaluation of the various classification algorithms on datasets having a different number of features

When all 204 features were considered, the Bayesian Network (BN) and Naive Bayes (NB) classifiers exhibited robust predictive performance, with AUC values of 0.76 and 0.74, respectively (Table 3). They showed modest agreement with the actual outcomes, demonstrated by Kappa values of 0.10 and 0.07. The Random Forest (RF) classifier presented an AUC value of 0.73, comparable to the Bayesian classifiers, but its Kappa value was 0.00, indicating no agreement with the actual outcomes beyond chance. This combination suggests that the RF model ranks instances reasonably well across thresholds, likely driven by strong performance on the majority class, while its actual label assignments are no better than random chance because of the class imbalance, an important consideration given that the dataset is highly imbalanced. The discrepancy between the Kappa and AUC values highlights the importance of understanding the different performance measures and the aspects of model performance they capture, particularly in the context of imbalanced datasets such as ASSISTments. The remaining classifiers, the J48 decision tree and K-Nearest Neighbours (KNN), demonstrated lower AUC values (0.51 and 0.53, respectively) and minimal Kappa values (0.01 for both), suggesting weaker predictive performance and agreement with the actual outcomes.

When the number of features was reduced to 71, BN and NB maintained AUC values of 0.76 and 0.77, respectively, indicating consistent predictive performance, and their Kappa values showed a minor improvement, reaching 0.10 for both classifiers. RF's AUC value remained steady at 0.73, and its Kappa value increased slightly to 0.01, suggesting a marginal increase in agreement with the actual outcomes. The J48 and KNN classifiers continued to lag behind, showing minor improvements in their AUC values and slight increases in their Kappa values (0.02 for J48 and 0.06 for KNN).

Lastly, when the feature count was further reduced to 7, the Bayesian Network (BN) and Naive Bayes (NB) classifiers demonstrated consistently robust predictive performance, with AUC values of 0.77 and 0.79, respectively. Their AUC values remained essentially unchanged compared with the larger feature sets (204 and 71), and their Kappa values improved slightly (0.14 for BN and 0.13 for NB), showing a minor enhancement in agreement with the actual outcomes. In essence, the reduction in features did not deteriorate their predictive performance, and the slight improvement in Kappa scores demonstrates their resilience to the dimensionality of the dataset. RF's AUC value dropped slightly to 0.71, but its Kappa value increased to 0.02, implying marginally improved agreement with actual outcomes despite slightly reduced predictive performance. The AUC values for the J48 and KNN classifiers remained relatively unchanged, and their Kappa values displayed minimal change (0.00 for J48, 0.04 for KNN). This suggests that these classifiers (RF, J48 and KNN) did not significantly benefit from the reduction of features in terms of predictive performance or agreement with actual outcomes. For the J48 classifier, the low AUC (0.50 ± 0.00) and Kappa value (0.00) indicate that it performs no better than a random guess. This poor performance is due to class imbalance: as previously mentioned, only 3.70% of instances belong to the "yes" class and 96.3% to the "no" class. Decision tree algorithms such as J48 often struggle with class imbalance, as they are biased towards the majority class; the algorithm likely classifies most instances as the "no" class, yielding high overall accuracy but poor performance in correctly identifying the "yes" class.

These findings suggest that the Bayesian classifiers (BN and NB) demonstrated superior and stable performance in both AUC and Kappa metrics by effectively managing high-dimensional data and extracting significant information even from a smaller feature set. In contrast, the RF classifier maintained its prediction accuracy across 204 and 71 features but experienced a decline in performance when only 7 features were used, even though there was a marginal improvement in its agreement with the actual outcomes. The consistently lower performance of the J48 and KNN classifiers across all feature sets suggested their struggle to handle high-dimensional data or interpret feature interactions effectively. Based on these findings, it appeared that the Bayesian classifiers were the most suitable for building frustration detectors.

3.6 Fine tuning Bayesian variants

We chose the Naive Bayes and Bayesian Network classifiers because of their superior frustration detection performance compared with the other classifiers evaluated. We utilised the Synthetic Minority Oversampling Technique (SMOTE) and cost-sensitive learning to fine-tune the classifiers' performance during training. Training was conducted on the resampled dataset, and detector effectiveness was validated on the original (non-resampled) dataset to ensure the models remain valid for data with natural distributions.

3.6.1 Applying SMOTE on the dataset

As previously mentioned, the affect labels in the ASSISTments dataset are distributed non-uniformly (Botelho et al. 2017). To address this imbalanced classification issue, we used SMOTE in this study. This technique, proven effective for constructing classifiers from imbalanced real-world datasets (Chawla et al. 2002), can improve classification performance by addressing the class imbalance problem (Henderson 2023; Chawla et al. 2002; Jishan et al. 2015). SMOTE balanced our dataset by creating synthetic samples of the minority class, enabling the classification algorithms to perform better without overfitting, and thereby increased the number of minority ('yes') instances relative to the majority ('no') class. Resampling was applied to the training dataset, which initially comprised 3.57% 'yes' class and 96.43% 'no' class instances. Following methodologies established in prior studies (Botelho et al. 2017; Pardos et al. 2013), we adjusted the dataset to increase the proportion of the minority 'yes' class to 20%. This ensured a more balanced distribution, aligned with established practices for handling imbalanced datasets, and gave the classifiers a better understanding of the minority class characteristics. Furthermore, to distribute instances more evenly across all folds and avoid overfitting, we also employed a randomisation technique to shuffle the order of instances.
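A minimal sketch of this resampling step, assuming the Python imbalanced-learn library rather than the tooling we actually used, and hypothetical X_train/y_train inputs, might look as follows:

```python
from imblearn.over_sampling import SMOTE
from sklearn.utils import shuffle

def resample_training_set(X_train, y_train, minority_share: float = 0.20):
    """Oversample the minority class with SMOTE so that it makes up roughly
    `minority_share` of the training set, then shuffle the result."""
    # For a binary problem, sampling_strategy is the desired minority:majority
    # ratio after resampling; a 20% minority share corresponds to a 1:4 ratio.
    ratio = minority_share / (1.0 - minority_share)
    smote = SMOTE(sampling_strategy=ratio, random_state=0)
    X_res, y_res = smote.fit_resample(X_train, y_train)
    # Shuffle so original and synthetic instances mix evenly across CV folds.
    return shuffle(X_res, y_res, random_state=0)
```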

3.6.2 Applying cost-sensitive learning on Bayesian variants

We employed cost-sensitive learning to further mitigate the imbalanced classification problem and to improve the models' performance during classifier training. Cost-sensitive learning is a type of learning that takes misclassification costs (and possibly other types of costs) into account (Ling and Sheng 2008), with the goal of minimising the total cost. During training, we assigned a higher penalty to incorrect predictions, in particular to misclassifications of the minority (frustrated) class.
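One common way to approximate such cost-sensitive training in Python is to weight each training instance by its misclassification cost, as sketched below; the cost values and the GaussianNB stand-in are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def fit_cost_sensitive_nb(X_train, y_train, cost_fn: float = 5.0, cost_fp: float = 1.0):
    """Train a Naive Bayes model with instance weights that penalise missing
    the minority 'frustrated' class (false negatives) more than false alarms."""
    # Each frustrated (1) instance carries the false-negative cost, each
    # non-frustrated (0) instance carries the false-positive cost.
    sample_weight = np.where(np.asarray(y_train) == 1, cost_fn, cost_fp)
    model = GaussianNB()
    model.fit(X_train, y_train, sample_weight=sample_weight)
    return model
```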

3.6.3 Training on resampled dataset

This study applied SMOTE and cost-sensitive learning to address the imbalance in the dataset, which tends to improve the performance of machine learning models by enhancing the representativeness of the minority class. Two cost-sensitive Bayesian variants, a cost-sensitive Bayesian Network (BN) and a cost-sensitive Naive Bayes (NB), were trained on the resampled dataset. The cost-sensitive BN yielded superior results, with an AUC of 0.94 and a Cohen's Kappa of 0.63, indicating high classification performance and strong agreement between predictions and actual values.

3.6.4 Validating on non-resampled dataset

When the models were tested on the non-resampled dataset, the performance of the models dropped compared to the training results, which is a common occurrence in machine learning due to overfitting or the potential discrepancy between training and testing data distributions. Among the models tested, the Cost-sensitive Bayesian models showed better performance.

3.7 Comparing Bayesian variants with traditional models and previous works

Results showed that the cost-sensitive BN and NB significantly outperformed J48, RF, and KNN on the non-resampled dataset with seven features in both the AUC and Kappa metrics (Table 4). No significant difference was found at the 95% confidence level between the cost-sensitive BN and NB and the conventional BN and NB.

Table 4 Performance evaluation of two cost-sensitive Bayesian variants, trained on the resampled dataset and validated on the non-resampled dataset, are compared to the previous highest reported results on the ASSISTments dataset

When compared with previous works, the Cost-sensitive BN model outperformed the others in terms of AUC. Its AUC values of 0.94 (during training) and 0.82 (during testing) are significantly higher than the maximum AUC of 0.76 achieved by the LSTM model in previous works (Table 4). This difference in performance could be attributed to a variety of factors, including the innate capability of Bayesian models to handle smaller datasets effectively, offer better interpretability, and be more computationally efficient than deep learning models like LSTMs, RNNs, or GRUs.

The cost-sensitive Bayesian variants utilised considerably fewer action-level features (only 7) compared with previous works (which used between 88 and 204). This reduction in complexity is beneficial, as it makes the models more interpretable and less prone to overfitting, leading to improved generalisability. However, in terms of Cohen's Kappa, the cost-sensitive BN model's value of 0.63 during training drops to 0.19 when tested on the non-resampled dataset, which is lower than the 0.32 achieved by the Naive Bayes model in previous work (Pardos et al. 2013) using 173 features. This suggests that there is room for improvement in the models' agreement with actual classifications. We therefore selected these two classifiers for further evaluation on independent datasets. This subsequent analysis allows us to assess the models' generalisability and potential applicability more comprehensively in real-world contexts; depending on the results, further tuning of the models may be required.

4 Evaluating the generalisability of the affect detectors

This section evaluates the generalisability of the affect detectors in identifying frustration across different learning domains and environments. We first examined the efficacy of these machine learning (ML) models, originally trained on the ASSISTments dataset, when applied to data from a different learning environment and domain. To do this, we collected data from students learning in a different domain (computer programming) using another online learning system (Moodle), distinct from ASSISTments, and tested the adaptability and versatility of the models on this independent dataset. Furthermore, we utilised "EmoDetect", a publicly available dataset, to assess the detectors' performance. The following sections detail the process of creating the independent datasets used to validate these models.

4.1 Creation of an independent validation dataset (StudySet_001)

The creation of an independent dataset for validation involves conducting several experimental sessions with students to collect data and following an observational method to identify participants’ frustration by analysing the collected data.

4.1.1 Observation methods

The process of creating a detector for an affective state almost always starts with obtaining "ground truth": human-labelled data indicating the presence or absence of the affective state in question for a sufficiently large sample of data (Richey et al. 2019). These labels are verified for acceptable inter-rater reliability (Ocumpaugh et al. 2015) and are then used to develop detectors, using machine-learning algorithms to identify the in-system behaviours corresponding to human affect judgements (Richey et al. 2019).

In this study, we used the video coding observation method (Rajendran et al. 2013) to identify students’ frustration based on their facial expressions, body language and interactions with an online learning system from the recorded video files. We were interested in predicting students’ frustration arising from their interactions with the system. The video recording helped the human observers to pause the video, note down the expressions as required and perform the affect labelling of the observations. Later, the observed coded affective states were stratified with the captured log data from the system to prepare the validation dataset.

4.1.2 Participants

This study involved adult university students aged between 20 and 35 from the ICT department of XYZ University. These students were enrolled in a Web Development unit as part of their Master's degree program. During the first semester of 2022, out of 74 invited students, 41 showed interest, but only two shared recorded video files, owing to privacy concerns and Covid-19 related precautions. In the second semester of the same year, 22 students from different groups participated out of 31 interested candidates; even so, many students expressed discomfort with the video recording process. In total, we collected video recordings from 24 students, consisting of 5 females and 19 males. The ethnic backgrounds of the participants were diverse: 61% were from Southeast Asia, 20% from Oceania, 18% from North-West Europe, 17% from North-East Asia, and 16% from Southern and Central Asia.

4.1.3 Learning environment and task

The data collection for this research was conducted using Moodle, a popular, open-source learning management system (LMS) with a solid architecture, implementation, and interoperability (Giuffra et al. 2013). We utilised this platform to present participants with a range of programming problems.

The Moodle LMS was set up on an institutional web server for data collection purposes. A Web Development unit was specially designed involving ten PHP programming problems on varied topics such as JavaScript String, Array, and Sorting. Students were required to solve the problems after studying the provided learning materials. Each problem comprised several similar types of questions. Students only needed to submit the final answer for each problem, on which they received immediate feedback. They automatically moved to the next problem if the first attempt was correct; students who did not answer correctly initially had further attempts to complete the problem by attempting similar questions, with up to five attempts allowed per problem. Furthermore, to aid problem-solving, guided discovery hints were embedded with every question. Students could request a hint by clicking the hint button, which provided a more explicit clue rather than the final hint needed to solve the problem, facilitating a deeper understanding and mastery of the topic. The research focused on multiple-choice and fill-in-the-blank questions. We defined a "clip" as the full sequence of actions a student undertook from the start to their final attempt at solving a problem.

4.1.4 Video recording procedure

While students attempted the programming problems, their facial expressions were captured using webcams, and their interactions with the LMS were recorded using cloud-based video conferencing software, Zoom (Fig. 2). The recording of facial expressions and the storage of videos adhered to the ethics committee approval, under which only non-identifiable information can be shared and used for publication. Zoom provides options to film the user's screen and the user simultaneously and to record both. Each student created a new meeting, turned on their camera, and recorded their facial expressions and on-screen interactions. At the end of their sessions, students shared their recorded videos with the research team via OneDrive. A total of 24 recorded videos were collected, with an average length of around 41.24 min. Trained human observers used these recorded videos to code students' affective states by analysing their facial expressions and interactions with the Moodle LMS, following an observation protocol.

Fig. 2 Screen recording with facecam

4.1.5 Observation protocol

To accurately observe students' affective states, a rigorous observational protocol (Footnote 3), inspired by the BROMP manual and previous studies by Rajendran and colleagues, was developed (Rajendran et al. 2019, 2013; Ocumpaugh et al. 2015). The protocol contains detailed guidance focusing on key behaviours and expressions indicative of frustration, including analysis of students' work context, actions, utterances, facial expressions, body language, and interactions with the LMS and other students. By adopting such a comprehensive approach to affect labelling, this research aimed to achieve the highest possible accuracy in the ground truth data.

The data collection sheet contained information such as student ID, question number, and observations made by the observer (frustrated/not frustrated). Sample observations for a student, consisting of 15 instances, are shown in Table 5.

Table 5 Sample human observation sheet to record students’ facial observations and to code it as frustration (Frus) or nonfrustration (Non-Frus) following observation protocol

4.1.6 Observation procedure

The observation procedure involved a team of four PhD students from a university: three from the ICT department and one from Psychology. All observers had previously taken a research methodology course and had several publications to their credit, giving them a sound understanding of observational data collection methods.

Before commencing the main study, the team participated in a pilot study to practise video coding using the provided observation protocol; 30 instances were considered during this pilot. This practice was crucial to ensure that all observers shared a common understanding of the coding process. The observers' goal was to watch the recorded videos and capture students' expressions when they learned whether their answers in the LMS were correct or incorrect. In addition, during video coding, observers were instructed to consider the learning behaviours students demonstrated while working on a problem, up until they moved to the next problem, before labelling the clip as frustrated or not frustrated. This could include observing their work context, problem-solving process, facial expressions, body language, utterances, and interactions with the LMS and other students.

Participants’ body language and their interaction behaviours with the learning system were included alongside the observed facial action units. This is because facial expressions, being spontaneous and context-dependent (Hoque et al. 2012; Kulke et al. 2020), might not always directly indicate a specific emotional state. For instance, a student might smile despite feeling frustration or appear visibly distressed when failing to meet a learning objective (Hoque et al. 2012; Kulke et al. 2020), particularly in an educational setting where frustration predominantly relates to students’ goals and outcomes (Rajendran et al. 2013). This concurs with Amsel’s frustration theory (Amsel 1990), which describes frustration as a disruption in behaviour directed towards achieving a goal or an obstacle in goal attainment, aligning with OCC’s Appraisal theory (Ortony et al. 1988). Some of the key behaviours noticed during the video coding were:

  • AU1: intensity of brow lowering (Example: outer brow raise, and inner brow raise)

  • BL1: pulling hair

  • BL2: expressing disappointment noise

  • BL3: cursing

  • BL4: head-shaking gesture

  • BL5: noseplay and touching gestures

  • IF1: number of attempts on each question

  • IF2: time spent on each question

  • IF3: correctness of last answer

  • IF4: give up after multiple attempts and move to the next problem

  • IF5: feeling unachievable

  • IF6: consecutive wrong answer pattern

Upon completion of the pilot study, inter-observer reliability was measured using Cohen's Kappa, in a manner similar to that previously reported (Rajendran et al. 2013; Pardos et al. 2013). Cohen's Kappa is a statistical measure that accounts for the probability of agreement by chance; it was used as a quantitative measure of reliability among raters evaluating the same material, adjusted for how often the raters may agree by chance. In the pilot study, 27 of the 30 instances were agreed upon completely by all four observers, and a Cohen's Kappa score of 0.87 was achieved, indicating a high level of agreement among the observers. This level of agreement is higher than typically seen for video coding of affect in prior studies, as we focused solely on a single negative affective state, frustration, excluding other negative states such as boredom, confusion, sadness, and fear from our study.

In the main study, we utilised a two-step approach—independent coding followed by group consensus. Observers were provided with recorded videos, and each observer independently watched each student’s video and documented their findings in their personal Google spreadsheet. This process involved pausing the video as needed, recording the expressions and behaviours within a clip, and labelling them (Yes/No) based on the observation. In total, 424 instances across 24 students were independently coded by each observer.

Notably, only one of these observers had been involved in developing the frustration detection model. This measure was taken to ensure the objectivity and independence of the observation process, as the remaining three observers were kept uninformed about the specifics of the developed models. Following the independent coding phase, all observers convened in a group Zoom session to share their observation sheets. Of the 424 instances, 402 were agreed upon completely by all four observers. The meeting allowed the observers to discuss and reconcile the discrepancies in their ratings, ensuring the consistency, validity, and reliability of our observational data.

4.1.7 Data processing and stratifying

As noted previously, during students' participation in the online learning activities, their facial expressions were recorded with cameras and their interactions with the Moodle LMS were screen-captured in video format. Additionally, students' activities and events during learning were gathered and stored by analysing their clickstreams and browsing histories via log files on the database server. This allowed us to determine which student activities and events within the LMS were taking place when the human observers made their observations from the recorded videos. Interactions with the LMS during a clip prior to the observer's data entry were aggregated, and data features were distilled. As previously mentioned, a clip here is defined as starting from the first action on an original problem and ending with the last attempt before the next original problem. A clip can comprise a single action that completes an original problem, or it can contain more than fifty actions associated with completing similar related questions to solve the original problem. Each action consisted of data on a student's attempt, and its context, in responding to each problem/skill, including the correctness of the answer (correct or incorrect), the number of hint requests for a problem, the number of attempts on similar questions, the time taken for each problem, the number of correct or incorrect actions on a problem step, and so on.

From the Moodle LMS’s log data, we extracted action-level features, which were then aggregated and processed, resulting in nine key features: student ID, problem number, averagecorrect, mincorrect, sumcorrect, averagepercentcorrectperskill, sumhinttotal, averagehinttotal, and mintotalfrattempted. These features were subsequently matched with the observational data collected for each student, using the student ID, problem number, and observation for each clip. The resulting concatenated and reconciled dataset included the aforementioned nine features along with the class label. To prepare the validation datasets, we removed the student ID and problem number, aligning the final feature set with the optimal features identified in the ASSISTments dataset, as detailed in Table 2.
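The sketch below illustrates, under assumed column names, how such action-level log records could be rolled up into clip-level features with pandas and matched to the observer labels by student ID and problem number; it reproduces only a subset of the nine features and is not the exact processing pipeline used in the study.

import pandas as pd

# Toy action-level records; column names are assumptions, not the Moodle log schema.
actions = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2],
    "problem_number": [10, 10, 11, 10, 10],
    "correct": [0, 1, 1, 0, 0],
    "hint_total": [1, 0, 0, 2, 1],
    "attempted": [1, 1, 1, 1, 1],
})

# Aggregate every action within a clip (student, problem) into clip-level features.
clip_features = (
    actions.groupby(["student_id", "problem_number"])
    .agg(
        averagecorrect=("correct", "mean"),
        mincorrect=("correct", "min"),
        sumcorrect=("correct", "sum"),
        sumhinttotal=("hint_total", "sum"),
        averagehinttotal=("hint_total", "mean"),
        mintotalfrattempted=("attempted", "min"),
    )
    .reset_index()
)

# Observer labels per clip (toy values), matched on student and problem number.
observations = pd.DataFrame({
    "student_id": [1, 1, 2],
    "problem_number": [10, 11, 10],
    "frustrated": ["No", "No", "Yes"],
})

validation_set = clip_features.merge(
    observations, on=["student_id", "problem_number"], how="inner"
)
print(validation_set)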

The created validation dataset (StudySet_001) comprises 424 instances, of which approximately 78% are non-frustrated instances and around 22% are instances of frustration. The proportion of frustrated instances in our study is higher than that reported in previous studies, which can be attributed to the diversity of our participants: they had varied backgrounds in PHP programming, predominantly possessing basic programming knowledge but being new to PHP programming. Additionally, the scope of our study, which focuses exclusively on frustration as a negative affective state during coding, contributes to these higher proportions; we did not consider other states such as confusion, engaged concentration, and boredom.

4.2 Creation of a validation dataset utilising a publicly accessible dataset (EmoDetect)

In order to further evaluate the performance of the affect detectors, we utilised a publicly available dataset known as EmoDetect. This dataset was generated by an independent research group at an educational institution with the aim of enhancing the online learning experiences of its students (Rahman et al. 2024). Notably, one of our team members was part of this external project.

As part of their data collection process, the research team captured students’ facial expressions and screen-recorded their interaction with an LMS while they participated in online learning activities, similar to our data collection methodology. All facial expression recordings and video storage complied with the institution’s ethical guidelines. The recordings included video captures of 30 South Asian students (9 female and 21 male). The participants were diploma-level college students from the ICT department, aged between 18 and 22, with a basic understanding of programming.

The recorded videos were shared with our research team upon request. Our observer team analysed students’ affective states from the videos and coded them as ‘frustration’ or ‘non-frustration’ based on observed facial expressions and interactions with the LMS. This coding process adhered to the same observation method and protocol described in the previous section.

Log files were analysed to distil seven key features (as listed in Table 1), which were then cross-referenced with the human observations. This procedure resulted in a second validation dataset (EmoDetect) comprising 300 instances, of which approximately 53% were coded as frustrated and approximately 46% as non-frustrated. The higher percentage of frustrated instances in the EmoDetect dataset could be attributed to the participants’ academic backgrounds and skill levels. As mentioned, the EmoDetect participants were diploma-level college students, most of whom were new to programming, particularly PHP programming, and struggled with solving PHP programming problems. In contrast, participants from XYZ University (StudySet_001) were primarily Master’s degree students from the ICT department. This difference in academic background and skill level might have influenced the students’ experiences with the learning tasks, potentially leading to higher levels of frustration among the EmoDetect participants. The dataset coded by our observers provides a valuable resource for further assessment of the performance of our Bayesian detectors and other models.

4.3 Validation results of affect detectors on the independent datasets (StudySet_001 and EmoDetect)

Table 6 presents the performance of the two cost-sensitive Bayesian models, the conventional Bayesian models (BN, NB), and three other conventional machine learning models (J48, RF, KNN) when trained on the ASSISTments dataset and tested on the StudySet_001 dataset. The results indicate that the two cost-sensitive Bayesian models (BN and NB) outperformed the J48, RF and KNN models in terms of AUC, with an AUC of 0.94 ± 0.03, exhibiting superior performance at distinguishing between frustrated and non-frustrated instances. Moreover, their Cohen’s Kappa values (0.71 for Cost-Sensitive BN and 0.78 for Cost-Sensitive NB) are substantially higher than those of the other models, indicating a higher degree of agreement between the models’ predictions and the actual values. The conventional BN and NB models also achieved higher AUCs than their cost-sensitive counterparts, but their κ values were markedly lower, indicating an inferior level of agreement. The J48 model had an AUC of 0.50, suggesting that its predictions are no better than random chance, and a κ value of 0, indicating no agreement between its predictions and the actual values. This stems from the highly imbalanced training dataset: with roughly 96.3% of instances belonging to the “no” class, the decision tree minimises training error simply by predicting the majority class, collapsing to a one-node tree. The J48 model thus ‘learned’ that the best strategy is always to predict “no”, which minimises error on the training set but performs poorly on the “yes” class in an unseen test set. The RF and KNN models achieved moderate AUCs (0.87 ± 0.04 and 0.61 ± 0.05, respectively) but low κ values, suggesting weaker agreement than the Bayesian models.
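To make the evaluation protocol and the majority-class failure mode concrete, the sketch below trains on synthetic, heavily imbalanced data and reports AUC and Cohen’s Kappa on a separate, differently distributed test set; the feature generator and class rates are invented for illustration only, and a majority-class baseline stands in for the degenerate one-node J48 behaviour described above.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, cohen_kappa_score

rng = np.random.default_rng(0)

def make_split(n, frustrated_rate):
    # Synthetic clips: frustrated instances get lower "correctness" and more "hints".
    y = (rng.random(n) < frustrated_rate).astype(int)
    X = np.column_stack([
        rng.normal(0.8 - 0.4 * y, 0.15),   # averagecorrect-like feature
        rng.normal(0.5 + 1.5 * y, 0.5),    # sumhinttotal-like feature
    ])
    return X, y

X_train, y_train = make_split(2000, 0.04)   # heavily imbalanced training data
X_test, y_test = make_split(400, 0.22)      # validation-set-like prevalence

for name, clf in [("majority-class baseline", DummyClassifier(strategy="most_frequent")),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    preds = clf.predict(X_test)
    print(f"{name}: AUC={roc_auc_score(y_test, scores):.2f}, "
          f"kappa={cohen_kappa_score(y_test, preds):.2f}")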

A similar pattern emerged when the models were tested on the EmoDetect dataset: the cost-sensitive Bayesian variants (BN and NB) outperformed the other models (J48, RF, KNN), confirming their generalisability across distinct datasets. The Cost-Sensitive NB model demonstrated a promising AUC score of 0.94 ± 0.03, with a substantial κ value of 0.77 (Table 7). While the performance of the Cost-Sensitive BN model was slightly inferior to that of its Cost-Sensitive NB counterpart, it still achieved a reasonable AUC score of 0.87 ± 0.04 and a κ value of 0.52.

Table 6 Performance evaluation of two cost-sensitive Bayesian variants, BN, NB, J48, RF and KNN. Trained on ASSISTments dataset and tested on StudySet_001 dataset

The conventional Bayesian models, BN and NB, showed strong class-distinguishing capability, with AUC scores of 0.92 ± 0.03 and 0.94 ± 0.02, respectively. However, their κ values (0.40 for BN and 0.12 for NB) were lower than those of the cost-sensitive Bayesian variants, indicating weaker agreement between their predictions and the actual outcomes. The J48, RF and KNN models were unable to match the Bayesian variants’ performance. Specifically, J48 showed an AUC score equivalent to random guessing (0.50 ± 0.00) and zero agreement with the actual outcomes (κ = 0), which, as explained previously, is due to the class imbalance in the training dataset. RF achieved a good AUC score of 0.84 ± 0.04, indicating strong predictive capability, but displayed a κ value of only 0.15, showing weak alignment with the actual outcomes. Similarly, KNN, with an AUC score of 0.61 ± 0.05 and a κ value of 0.19, demonstrated moderate predictive capability and relatively weak agreement with the ground truth. In summary, the results from the EmoDetect dataset show the superior performance and robustness of the cost-sensitive Bayesian variants, particularly the Cost-Sensitive NB model, despite the conventional Bayesian models’ strong class-distinguishing capabilities.

Our validation efforts across two independent datasets, StudySet_001 (Table 6) and EmoDetect (Table 7), consistently demonstrated the superior performance of the cost-sensitive Bayesian variants, specifically Cost-Sensitive Naïve Bayes (NB), in identifying student frustration during online learning. Despite the distinct populations and varying degrees of frustration prevalence within the datasets, the Cost-Sensitive NB model exhibited robust generalisability and yielded the highest AUC scores alongside substantial Cohen’s Kappa (κ) values, signifying excellent predictive capability and strong agreement with the actual outcomes.

The enhanced performance of the Bayesian classifiers can be attributed to our strategic decision to reduce the complexity inherent in multi-state classification to a binary classification problem. During training and testing with the ASSISTments dataset, we observed that the Bayesian classifiers confused ‘frustration’ with the other three states, a common issue in multi-state classification, where non-concentrating states are often misclassified. In contrast, classifiers trained for binary classification effectively distinguish between ‘frustrated’ and ‘not frustrated’ states, avoiding this confusion. Simplifying to binary classification sharpens the decision boundary, enhances model performance, and ensures more robust generalisation on test datasets. Overall, we believe that carefully selecting the most relevant and generalisable features related to frustration, simplifying the multi-state classification problem to a binary one, and applying cost-sensitive learning and SMOTE contributed to the strong results of our models.
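A minimal sketch of this label simplification, assuming the multi-state affect labels are available as a pandas Series, is shown below; the state names mirror those discussed above, but the mapping code itself is illustrative rather than the study’s actual preprocessing.

import pandas as pd

# Toy multi-state affect labels of the kind produced by multi-state coding schemes.
labels = pd.Series([
    "engaged concentration", "frustration", "boredom",
    "confusion", "frustration", "engaged concentration",
])

# Collapse everything that is not 'frustration' into a single negative class.
binary_target = labels.eq("frustration").map({True: "yes", False: "no"})
print(binary_target.value_counts())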

Table 7 Performance evaluation of two cost-sensitive Bayesian variants, BN, NB, J48, RF and KNN. Trained on ASSISTments dataset and tested on EmoDetect dataset

5 Discussion

Students’ webcam usage behaviour is shaped by factors ranging from personal thoughts and feelings (e.g. privacy) to course characteristics (e.g. group cohesion), and it differs across specific groups (gender, study level) (Bedenlier et al. 2021). Recent studies conducted during the Covid-19 lockdown, when most educational institutions around the globe shifted to online learning, reported that about a third of students hesitate to be visually present in video conferencing via webcam, and that a further third use their webcams only because of course requirements (Bedenlier et al. 2021; Händel et al. 2022). In particular, university students tend not to use the webcam to regulate their learning process, even when they frequently use digital technology (Yot-Domínguez and Marcelo 2017). Their desire to maintain separation between personal and educational settings, as well as to experience learning as a private phenomenon, may drive them to keep webcams off during online learning (Dennen and Burner 2017; Dennen et al. 2022). In line with these findings, we also noticed unwillingness and hesitation among our participants, adult students who were asked to turn on their webcams while accomplishing the learning objectives. Students reported feeling uncomfortable and shy and raised privacy concerns, which were the primary reasons behind the low participation rate in our study; we partly mitigated this obstacle by allowing students to complete the tasks from their homes. In addition to these factors, researchers have reported other challenges to scaling the use of physical and physiological sensors to larger groups of students or deploying them in classroom settings, as discussed previously. It is therefore pivotal to create sensor-free affect detectors that can determine a student’s affective state at any point during interaction with a learning system solely from the student’s interaction with that system. Such detectors should be generalisable across learning domains and systems and able to strike a balance between detection accuracy and generalisability.

Our research aimed to address these challenges by developing sensor-free affect detectors that are domain and system independent. We built generalisable frustration detectors that analyse students’ interactions with the online learning system without requiring any webcam or sensor data. By applying a feature selection process, we identified the seven features most strongly associated with frustration during online learning from a high-dimensional feature space of 204 features. The process removed irrelevant and redundant features and reduced the risk of overfitting, thus promoting the detectors’ generalisation ability and performance. This set of optimal features indicates that students who performed fewer correct actions and asked for more help during learning tended to be more frustrated than others, which is consistent with previously reported work (Pardos et al. 2013). This finding is significant because these features are common to most computer-based learning systems across diverse domains and can therefore generalise to any platform and domain. Using these optimal features, such systems could detect students’ frustration during learning, including its causes, and provide feedback accordingly, making learning more effective.
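As an illustration of this kind of dimensionality reduction, the sketch below selects seven features from a synthetic 204-dimensional feature space using scikit-learn’s SelectKBest with mutual information; the study’s own selection procedure and scoring criteria may well have differed, so this is only a stand-in.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the 204-dimensional interaction feature space.
X, y = make_classification(n_samples=1000, n_features=204, n_informative=7,
                           n_redundant=20, random_state=0)

# Keep the seven features that carry the most mutual information with the label.
selector = SelectKBest(score_func=mutual_info_classif, k=7).fit(X, y)
print("Indices of the selected features:", selector.get_support(indices=True))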

Our evaluation of various frustration detectors revealed that most of the ML classifiers benefited, in both Kappa and AUC, from using a smaller number of features (the seven optimal features) compared to using the full set of 204 features (Table 3). Among all the classifiers, the Naïve Bayes (NB) and Bayesian Network (BN) classifiers showed superior performance. Their high efficiency can be attributed to their ability to learn effectively from a small feature set and their modest training-data requirements (Rish 2001; Novakovic 2010; Pernkopf 2004; McCloskey 2000).

We also applied SMOTE and cost-sensitive learning to the BN- and NB-based detectors to explore the possibility of further improving their performance by addressing the class imbalance problem. A further intention was to train the detectors so that they perform well on both balanced and imbalanced datasets, because real-world datasets may be either balanced or imbalanced like ASSISTments, which was predominantly composed of the “no” class with only a small percentage of “yes” instances. Moreover, the cost of misclassifying a “yes” instance as “no” is often much higher than the cost of the reverse error (Chawla et al. 2002). A detector should therefore be able to handle both scenarios efficiently. A combination of over-sampling the minority (“yes”) class and under-sampling the majority (“no”) class helped us address the class imbalance problem, thus achieving better classifier performance in both AUC and Kappa (Table 4). Cost-sensitive learning was also applied to further mitigate the class imbalance issue and improve classification performance: we assigned a higher penalty to the more costly misclassifications and minimised the total cost. We observed that the cost-sensitive BN and NB variants produced better results on the ASSISTments dataset than previously reported works.
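The sketch below combines the two ideas on synthetic data: SMOTE over-sampling of the minority class plus random under-sampling of the majority class (via imbalanced-learn), followed by a Naïve Bayes model fitted with higher sample weights on minority instances as a simple stand-in for cost-sensitive learning. The sampling ratios and the 4:1 cost ratio are assumptions for illustration; the study’s own cost-sensitive Bayesian variants may have been configured differently.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, cohen_kappa_score

# Synthetic imbalanced data standing in for an ASSISTments-like training set.
X, y = make_classification(n_samples=3000, n_features=7, weights=[0.96, 0.04],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Over-sample the minority "yes" class, then under-sample the majority "no" class.
X_sm, y_sm = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X_tr, y_tr)
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8,
                                  random_state=0).fit_resample(X_sm, y_sm)

# Cost-sensitive twist: penalise misclassifying minority instances more heavily
# by giving them larger sample weights during fitting (assumed cost ratio of 4:1).
weights = np.where(y_res == 1, 4.0, 1.0)
nb = GaussianNB().fit(X_res, y_res, sample_weight=weights)

probs = nb.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, probs), 2),
      "kappa:", round(cohen_kappa_score(y_te, nb.predict(X_te)), 2))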

In this study, a crucial aspect of building generalisable frustration detectors was validating them across multiple datasets, including StudySet_001 and EmoDetect. The validation results consistently demonstrated that the cost-sensitive BN and NB models, particularly the Cost-Sensitive NB model, perform exceptionally well in detecting student frustration during online learning, as evidenced by their high AUC scores and substantial Cohen’s Kappa values, indicating excellent predictive capability and strong agreement with the actual outcomes. Notably, this level of performance was maintained across varying student populations and disparate levels of frustration prevalence in the different datasets, demonstrating the broad applicability and generalisability of these models. They hold great promise for widespread use across diverse platforms and domains, furthering our understanding and management of student frustration in an array of online learning environments.

6 Conclusion and future work

Sensor-free affect detectors can detect students’ emotions from their interactions within the online learning environment without using any physical sensors, making detection more scalable and less invasive. In this paper, we built frustration detectors using popular machine learning classifiers. Various feature selection techniques were applied to select the features most relevant to frustration, and these features were then fed into the classifiers to evaluate their detection performance. Our investigation showed that it is possible to build affect detectors using only a small number of interaction features (such as correct/incorrect actions and help requests) that strike a balance between detection accuracy and generalisability. These features are widely available in online learning and tutoring systems across different fields, and they capture general aspects of learning that do not rely on specific course content or teaching methods, suggesting their potential for widespread use in diverse online education platforms. Our study also highlights the superior performance of the cost-sensitive Bayesian variants in frustration detection, surpassing other classifiers and prior results. The robustness of these variants, confirmed by their strong performance and agreement with real outcomes across independent datasets, underscores their wide applicability across platforms and domains. Looking forward, there is potential to further refine these detectors and expand their capabilities to identify other affective states that play a vital role in learning. Implementing these detectors in existing learning management systems could offer real-time, personalised feedback to learners and educators, paving the way for informed interventions that alleviate student frustration and enhance learning outcomes.

In our ongoing research, we plan to integrate these Bayesian frustration detectors into the Moodle Learning Management System as plugins. The goal is to automate frustration detection and offer personalised support to struggling students in real-time, potentially improving their learning experiences and outcomes.

Our study represents a significant advancement in creating sensor-free frustration detectors for online learning environments. We have outlined a practical and efficient approach to identifying students’ emotional states without compromising their privacy. This approach is cost-effective and scalable, making it particularly beneficial in real classroom settings. By successfully developing this sensor-free method, we move a step closer to realising more empathetic, responsive, and inclusive online learning environments.