Introduction

Eye-tracking technology has progressed substantially in ease and extent of use over the last few decades. Early systems were often intrusive, like contact lenses fitted with search coils, and used in only a small number of specialized laboratories. More recent systems typically use non-invasive, video-based technologies and are used extensively in psychology, neuroscience, marketing, education, and other fields. These technological advances have helped to support major advances in our understanding of extensive relationships between gaze and cognition (Eckstein et al., 2017). For example, eye-gaze behaviors can be used to detect certain cognitive processes that can affect learning, such as mind wandering (referred to here as task-unrelated thought; TUT), and predict learning outcomes, such as comprehension (D’Mello et al., 2020; Faber et al., 2017; Hutt et al., 2016; Hutt, Mills et al., 2017b; Mills et al., 2016). These kinds of gaze-based detectors hold promise both as a basic-research tool for understanding the cognitive factors that relate to learning and for developing adaptive interventions to support more-effective student learning (D’Mello et al., 2012; Hutt et al., 2021; Mills et al., 2020).

However, to date, the practical application of gaze-based approaches to monitor and affect learning in the real world has been severely limited by the cost and availability of appropriate eye-tracking systems. For example, past studies that used eye tracking to infer task-unrelated thought and comprehension were limited to highly controlled laboratory settings and/or expensive eye-tracking hardware that can cost upwards of $40,000. Thus, relatively few individuals and schools have been able to take advantage of these promising technologies. Limiting accessibility to students in wealthy districts is not a viable path forward, because it would only further perpetuate the inequities that already exist in education. Instead, new tools are needed to bring the potential power of gaze-based detectors of cognitive states to more real-world contexts with broad and equitable accessibility.

The primary goal of the current work was thus to provide the first evidence of a scalable option for detecting comprehension and task-unrelated thought, in real time, using webcam-based eye tracking embedded within a web browser. We focused on systems that require no specialized hardware beyond the web camera built into most laptops and other computers, paired with publicly available software (Papoutsaki et al., 2016). This approach makes it possible to extend the benefits of detector-based measurement and automated personalization to a broader, more economically diverse population of individuals. As a secondary goal, we aimed to show that this system can reproduce and build on previous findings that quantified relationships between these learning-related cognitive constructs and gaze.

Theoretical background and related work

Eye movements have long been viewed as a window into the cognitive processes that unfold during reading (Rayner, Chace et al., 2006a; Rayner, Reichle et al., 2006b; Reichle et al., 2012). Although a complete account of the “eye–mind” link is outside the scope of this paper, it is relevant to mention that eye gaze is considered a real-time index of the information-processing priorities of the visual system. For example, visual information is acquired primarily during periods when the eye remains relatively stable, known as fixations. In contrast, visual input is suppressed during saccades, which are ballistic movements of the eyes between fixations (Campbell & Wurtz, 1978; Irwin & Carlson-Radvansky, 1996; Matin, 1974; Zuber & Stark, 1966). Therefore, ongoing task goals are often best served when patterns of fixation ensure that central gaze, and therefore visual attention, is allocated to the most important visual information within the environment. This idea is particularly relevant to reading: fixation patterns are sensitive to both features of text being read and the reader’s understanding of that text (Rayner, Chace et al., 2006a; Rayner, Reichle et al., 2006b). Below, we briefly summarize past work relating gaze patterns to reading comprehension and TUT and describe the specific contributions of the present work.

Reading comprehension

From a theoretical perspective, reading comprehension is often understood in terms of the Construction-Integration model (CI model). This model proposes that the mental model constructed while reading a text consists of three primary levels (Kintsch, 1998; McNamara & Magliano, 2009). The first, and most basic, level is the surface code. This level reflects the verbatim wording and structure of the text. This level fades quickly from memory but is used to identify semantic and syntactic relationships. The second level, which is constructed from the first, is the textbase. This level preserves the key fact-level information that is necessary to eventually represent the “gist” of the text. The third level, which builds on the textbase with information from the reader’s prior knowledge to construct a more elaborate mental representation of the text’s meaning, is the situation model. This level contains all inferences generated to establish connections amongst ideas in the text and prior knowledge. It may be helpful to consider textbase comprehension as fact-based memory, whereas situation-model comprehension can be seen as an overall conceptual model of the text.

Our understanding of reading comprehension via the CI and other models has benefitted greatly from the use of eye tracking (Rayner, 2009). For example, eye movements, such as regressions (moving backwards through the text) and longer fixations, have been linked to difficulties in constructing a situation model and, consequently, comprehension (Rayner et al., 2006; Schotter, Tran, & Rayner, 2014). In addition, eye movements can be sensitive to text characteristics such as difficulty (Rayner et al., 2006) and genre (Kraal et al., 2019). In recent years, attempts have been made to use these kinds of gaze-tracking metrics to predict comprehension (Ahn et al., 2020; D'Mello et al., 2020; Wallot et al., 2015). Historically, these predictions have been largely unsuccessful in terms of accuracy and generalizability. For example, in certain naturalistic reading contexts (e.g., text not altered for stimulus presentation), standard global features such as fixation duration and number of eye movements were not predictive of comprehension (Wallot et al., 2015). Likewise, in other studies, fixation times and overall reading times were not predictive of long-term memory and comprehension on their own (Yeari et al., 2015; Dirix et al., 2020).

Nevertheless, more recent research indicates that comprehension prediction from eye gaze may be possible. For example, one gaze-based model was able to explain ~ 40% of the variance in comprehension scores (r = 0.661; D’Mello et al., 2020). These kinds of eye-gaze-based models can also predict text-based comprehension and generalize across multiple datasets (Southwell et al., 2020). Despite this progress, there is still a need to: (1) extend these predictive models to situation-model comprehension, as a way to assess whether students have a deep level of understanding rather than simply recalling factual details of the text (i.e., to build a person-independent predictive model of whether a correct or incorrect inference is made about a text as it unfolds in real time); and (2) find more scalable solutions, given that the models mentioned above were all trained using a high-cost research-grade eye-tracker with a high sampling rate and high-fidelity data (Tobii TX Pro 300).

Task-unrelated thought (TUT)

One construct that has been closely linked to the disruption of comprehension is TUT (D’Mello & Mills, 2021; Phillips et al., 2016; Smallwood, 2011), commonly referred to as mind wandering. TUT is defined as the act of shifting attention from an external task (e.g., reading) to internal thoughts about something unrelated to the current task (Smallwood & Schooler, 2015). TUT is ubiquitous both in everyday life and during reading, with estimates ranging from 20 to 40% of the time on average (D’Mello & Mills, 2021; Killingsworth & Gilbert, 2010; Klinger & Cox, 1987). Critically, TUTs are consistently negatively related to measures of performance in cognitively demanding tasks, including reading comprehension (D’Mello & Mills, 2021; Randall et al., 2014).

TUT is thought to be a barrier to building an accurate mental model of a text because of its downstream effects on processing. For example, the cascade model of inattention (Smallwood, 2011) suggests that “perceptual decoupling” occurs during TUT, leading to slowed or diminished processing at lower levels of encoding (i.e., the surface code). This decoupling then causes breakdowns in the ability to integrate information across multiple levels, from processing the individual words to the meaning of a sentence. As such, interactive learning software that can adaptively respond to TUT improves students' deep comprehension (Mills et al., 2020), but reliable detection is a necessary first step.

A growing body of research suggests that changes in eye movements can be indicative of when people are off-task. For example, this relationship has been used to build gaze-based TUT detectors during reading (Bixler & D’Mello, 2014, 2016; Hutt, Hardey et al., 2017a). Commonly, supervised classification models are trained to discriminate between responses to embedded mind-wandering probes (“yes, I was off task” versus “no, I was on task”) using global (i.e., not context-specific) gaze features (such as average fixation duration, fixation dispersion, saccade frequency, angle, etc.). The models are then validated by testing their generalizability to unseen individuals.

As with comprehension, much of the work in this space has leveraged research-grade eye tracking in the laboratory. However, some recent work supports the idea that lower-fidelity eye tracking can be used to automatically detect TUT. For example, Hutt et al. (2016, 2019) demonstrated that TUT detection could also be achieved with a commercial off-the-shelf (COTS) eye-tracker, which retails for $100–150 USD. Though this tracking system uses a lower sampling frequency and provides less accurate and precise gaze measurements than more expensive systems, successful TUT detection was still possible and was later used to deliver learning interventions that benefited learners with low prior knowledge (Hutt et al., 2021). Though these eye-trackers present a more affordable approach, they still require additional, specialized hardware, thus limiting overall scalability.

Overview and novelty of current work

To overcome the limitations in scalability inherent to using expensive and/or specialized equipment, we focused on webcam-based eye-trackers that are beginning to be used in research and other settings (Degen et al., 2021; Semmelmann & Weigelt, 2018; Yang & Krajbich, 2021). A known limitation of these webcam systems is that they tend to be less accurate and precise than many specialized video-based systems (Zhang et al., 2019), particularly when they are deployed in real-world conditions in which lighting, head position, and other factors are not as controlled as they typically are in laboratory settings. Thus, a major open question is whether webcam systems provide a sufficiently reliable estimate of gaze position to be useful for monitoring gaze-sensitive cognitive states during reading.

A few studies have shown promise in this regard. For example, an unsupervised classification method has been used to derive areas of interest (AOIs) from gaze data collected with a webcam as users interacted with a communication task (Tran et al., 2019). In that study, gaze points were clustered to model users’ interpersonal behavior and ultimately improve interactions. Though AOIs present a slightly coarser-grained analysis than may be needed to monitor cognitive states (D’Mello et al., 2020), this work demonstrates that webcam-based gaze tracking can pick up a valid behavioral signal from eye movements. Particularly encouraging is recent work comparing webcam-based eye movements to data collected from the Tobii Pro Glasses 2 (Valliappan et al., 2020). Across four tasks, data from the standard camera embedded in a smartphone were comparable to data collected from the Tobii glasses. Though the Tobii glasses are not necessarily a “gold standard” pupil center corneal reflection (PCCR) tracker, with sampling rates lower than those of the EyeLink and other lab-based trackers, this work presents an important comparison between PCCR approaches and methods that use only an RGB webcam. Our work builds upon these successes (though using a different gaze-tracking system) to examine whether webcam data are sufficient for real-time modeling of TUT and comprehension.

Finally, almost all of the work reviewed above has used predominantly White samples to build detectors of TUT and comprehension, limiting its generalizability and potential scalability. Here we intentionally collected data from two different populations (Study 1: predominantly White university students; Study 2: mostly non-White adults recruited through the online platform Prolific) to examine how our models generalize across these populations, i.e., to check for algorithmic bias in the eye-tracking technology.

Methods

Below we describe our general data collection method for two different studies, noting any (minor) differences between the two.

Participants

In Study 1, 105 University of New Hampshire students participated in the experiment (age range 18–25, 77 self-identifying as female, 27 as male, one as non-binary; 83.1% White) for course credit in their psychology-related courses. In Study 2, 173 participants (age range 18–52, 130 self-identifying as female, 40 as male, three as non-binary) were recruited through Prolific, an online data collection platform that allows individuals to sign up and receive compensation for participating in research studies. Participants were paid $4 for completing the study. To create a more diverse sample, we used the Prolific selection criteria to oversample participants of color (see Table 6 for a complete breakdown of participants by race).

The location of both studies was at the participant’s own discretion (wherever they chose to complete the online study), without a researcher present, and no video (other than for gaze tracking) was recorded. As a result, we have no structured way to evaluate when tracking error is a fault of the tracker and/or when it might be a contextual issue (e.g., the participant looking down or covering their face with a hand).

Materials

Task

The study used a narrative anticipation task, which involved reading 65 narrative stories taken from Cranford and Moss (2018). The goal of this task was for participants to make an inference about the ending of a story, based on the information given, which is a common exercise for teaching and/or developing reading comprehension skills. Each story consisted of three sentences and had three possible endings. Each ending is initially plausible, but there is only one appropriate ending after reading all three sentences. An example story is: “Larry always wanted to know what it was like to live in a foreign country. He went to read at his favorite store on main street. The steam rose from the cup as Larry brought it to his lips and slowly…”. The three ending options were: (1) “sipped coffee”, (2) “bought muffin”, and (3) “rolled marble”. As the story unfolds, the incorrect options become less plausible until the reader can make the inference that the only appropriate ending is “sipped coffee.” We note that this task does not reflect reading long, extended texts, but rather a reading comprehension skill-building exercise common in English-language learning, standardized tests (and test prep), and other K-12 learning platforms.

Webcam-based gaze tracking

Gaze locations were collected using WebGazer (Papoutsaki et al., 2016). WebGazer is an online, webcam-based eye-tracker written in JavaScript that can be integrated in any website to infer gaze locations in real time using the user’s webcam. WebGazer initially uses facial and eye detection algorithms to detect pupil locations and represents the eye as an image patch. It then maps pupil locations and eye features to gaze locations using a ridge-regression model. WebGazer uses all eye features within a temporal interval of 500 ms when determining the onscreen x- and y-coordinates. Based on user interactions such as clicks and mouse movements that normally occur during web navigation, WebGazer is also able to continually self-calibrate to maintain mapping accuracy. In a lab study, WebGazer achieved 4.17° gaze accuracy (Papoutsaki et al., 2016). As a point of comparison, commercial eye-trackers achieve <1° gaze accuracy. It should be noted that because WebGazer runs on the client side, sampling rate cannot be guaranteed and varies as a result of available resources.
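To make this mapping step concrete, the following sketch illustrates the kind of regularized regression from eye features to screen coordinates that WebGazer performs, written here in Python with scikit-learn rather than WebGazer’s own JavaScript; the feature dimensions and calibration arrays are illustrative assumptions, not WebGazer’s internal representation.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative calibration data: each row is an eye-feature vector extracted from
# the webcam image patch, paired with the known on-screen location the user was
# looking at (e.g., a calibration dot or a mouse click).
rng = np.random.default_rng(0)
eye_features = rng.random((40, 120))             # 40 calibration samples, 120 image-patch features (assumed sizes)
screen_xy = rng.random((40, 2)) * [1920, 1080]   # known (x, y) gaze targets in pixels

# One ridge regression per screen coordinate, mirroring the regularized linear
# mapping from eye features to gaze location that WebGazer relies on.
model_x = Ridge(alpha=1.0).fit(eye_features, screen_xy[:, 0])
model_y = Ridge(alpha=1.0).fit(eye_features, screen_xy[:, 1])

# At run time, new eye features are mapped to a predicted on-screen gaze location.
new_features = rng.random((1, 120))
predicted_gaze = (model_x.predict(new_features)[0], model_y.predict(new_features)[0])
print(f"Predicted gaze location: {predicted_gaze}")
```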

Procedure

After participants provided consent and a remote connection to the webcam had been established by the software, participants completed WebGazer’s calibration process. Participants were directed to look at a red dot as it moved to 20 different locations around the screen. In Study 2, a pseudocalibration was additionally used to help account for possible calibration drift over time. The pseudocalibration did not affect WebGazer’s calibration but created an adjustment that could be applied to the gaze locations reported by WebGazer. In the pseudocalibration, participants looked at red dots at four locations, which corresponded to the three option locations and the center of the screen. The pseudocalibration was performed before the first main trial and after any trial in which none of the pseudocalibrated gaze locations fell within the option locations during the choice screen. Aside from the population differences and the pseudocalibration, nothing else was changed across the two datasets. For this initial feasibility study, and to ensure that the two datasets were comparable, the pseudocalibration was not used to correct any data collected in the second study.
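As an illustration of how such a pseudocalibration could be converted into an adjustment, the sketch below estimates a simple translation offset between the known dot locations and the gaze locations reported while the participant fixated them. This is a hypothetical example of the kind of correction described above; as noted, no such correction was actually applied to the data in this study.

```python
import numpy as np

def pseudocalibration_offset(dot_locations, reported_gaze):
    """Estimate a translation correction from a pseudocalibration.

    dot_locations: (n, 2) known on-screen dot positions (pixels).
    reported_gaze: (n, 2) mean gaze positions reported while the participant
                   fixated each dot.
    Returns a (dx, dy) offset that could be added to subsequent gaze samples.
    """
    dot_locations = np.asarray(dot_locations, dtype=float)
    reported_gaze = np.asarray(reported_gaze, dtype=float)
    return (dot_locations - reported_gaze).mean(axis=0)

# Four pseudocalibration points: three option locations and the screen center
# (coordinates are illustrative for a 1920 x 1080 display).
dots = [(320, 540), (960, 540), (1600, 540), (960, 270)]
reported = [(290, 560), (930, 555), (1575, 549), (928, 287)]

offset = pseudocalibration_offset(dots, reported)
corrected_sample = np.array([1000.0, 500.0]) + offset  # applying the offset to a later gaze sample
print(offset, corrected_sample)
```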

After calibration, participants completed the narrative anticipation task. The studies used a single between-participants manipulation, with participants randomly assigned to one of two conditions related to how the stimuli were delivered. This manipulation is not relevant for the current research, the goal of which is to build a generalizable detector that works in either of the two conditions. We nevertheless describe the design in full here in case others wish to replicate our work. The two conditions were: (1) “audio”, in which participants heard a reading of each sentence; and (2) “visual”, in which participants read each sentence presented on the screen. Figure 1 shows the timeline for a trial. For each trial, participants were initially presented with the three endings for a story and asked to familiarize themselves with the on-screen location of each option before progressing. Participants controlled when to move on to the next sentence, but in the audio condition they could not progress until the reading of the sentence was finished. In both conditions, the three answer options were displayed on the screen at all times, making it possible to collect gaze data relative to the positions of those options throughout each trial.

Fig. 1

Sample trial sequence. The probe screen could occur after any reading screen

Comprehension assessment

After reading or hearing all three sentences, participants were asked to click on the option that best completes the story. Participants completed five practice trials and 60 main trials. After completion of the narrative anticipation task, participants completed a demographic survey. The experiment lasted ~ 40 min.

Task-unrelated thought (TUT) probes

Detectors of TUT have almost exclusively relied on self-reports from participants to determine the ground truth data labels. The gold-standard in the field is to use a probe-caught method (Varao-Sousa & Kingstone, 2019; Weinstein, 2018), whereby participants are interrupted periodically to report on whether they are off-task (thinking about something else) or on-task at the current moment. Previous work has vetted the probe-caught method in a variety of ways, showing consistent results and reliable correlations with eye-gaze, pupillometry, reaction times, and performance (Foulsham et al., 2013; Franklin et al., 2013; McVay & Kane, 2012; Randall et al., 2014).

We used this method to probe participants on half of the stories. On these probe trials, participants were asked to report whether they were thinking about the story (on-task) or something else (off-task). Probes occurred on 30 stories, and their timing was balanced across sentences to prevent predictability: ten occurred after the first sentence, ten after the second sentence, and ten after the third sentence. Which stories contained probes, and where within each story the probe occurred (after sentence 1, 2, or 3), was randomly assigned.

Feature engineering

For TUT models, we calculated gaze features from the onset of a screen up until the probe occurred, to avoid using any data from after the probe. Because the probe could appear at different times across trials, only the screen immediately preceding the probe was used, to ensure a consistent amount of data per instance. For example, if the probe occurred just after screen two, the gaze data from screen two would be used to predict TUT; if the probe occurred just after screen three, then screen three would be used, and so on. No eye-tracking data were used from the probe screen itself. For the comprehension models, the question always appeared at the end of the reading (after three screens), allowing us to use more data while still maintaining a consistent data volume across instances. We calculated features from all three screens of reading prior to the user being presented with the question. If a probe occurred during a trial, the gaze data from the probe screen were excluded. No data from the choice screen were used.

We converted the raw gaze data into features to use in our prediction models. Based on previous work (Bixler & D’Mello, 2016; Hutt et al., 2019), we investigated both global and local gaze features. We did not consider contextual/interaction features such as response time, although some of these features are implicitly encoded in the gaze features; e.g., more samples likely correspond to longer response times. The specific global and local feature categories that we used are described below.

Global gaze features

Global gaze features focus on general gaze patterns and are independent of the content on the screen. The global features we used were selected to relate to previous studies of comprehension and TUT while allowing for the reduced accuracy that was expected from WebGazer compared with in-laboratory systems. Specifically, for each sentence/screen we calculated: (1) the number of gaze samples, a measure of how much valid gaze data there was during a sentence, giving an overview of how much the participant was looking at the screen; (2) the number of unique gaze samples, a measure of the number of distinct screen locations where a user looked. This value, though correlated with feature 1, removes duplicate screen locations and so gives a measure of how much gaze moved around the screen. This is further extended by (3) the dispersion of gaze points, which we quantified as the root mean square of the distances of each gaze point from the average gaze position, a measure of how spread out the gaze was. Because our data collection was based on variable and unknown sampling rates, we did not attempt to calculate fixation durations or identify saccades, as has been done in previous work. Additionally, our metrics do not correct for sampling rate, instead evaluating the robustness of gaze tracking from the raw data alone. Future work should address this limitation.
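A minimal sketch of how these three global features could be computed from raw gaze samples is shown below; the input format (a list of (x, y) pixel coordinates per screen) is an assumption for illustration.

```python
import numpy as np

def global_gaze_features(samples):
    """Compute the three global features from raw gaze samples for one screen.

    samples: iterable of (x, y) gaze coordinates in pixels.
    Returns (n_samples, n_unique_samples, dispersion).
    """
    pts = np.asarray(list(samples), dtype=float)
    if pts.size == 0:
        return 0, 0, 0.0

    n_samples = len(pts)                    # feature 1: amount of valid gaze data
    n_unique = len(np.unique(pts, axis=0))  # feature 2: distinct screen locations
    centroid = pts.mean(axis=0)
    # Feature 3: root mean square distance of each gaze point from the mean gaze position.
    dispersion = float(np.sqrt((((pts - centroid) ** 2).sum(axis=1)).mean()))
    return n_samples, n_unique, dispersion

print(global_gaze_features([(100, 200), (105, 198), (400, 420), (100, 200)]))
```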

Local gaze features

In contrast to the global features, local features encode where the gaze is fixated and thus were based on both gaze location and the corresponding screen content. To calculate these features, we first defined the three option locations and the center of the screen (the sentence location) as areas of interest (AOIs; see Fig. 2). For each page/screen, we then calculated the time spent on each AOI. Finally, we added task context to these features (see Table 1). Because this was an initial experiment, the selected features represent fundamental local features that are relevant to this work and to the reading literature while again acknowledging the expected reduced accuracy of the gaze tracker. We did not include features, such as gaze on individual words, that we considered to be too task-specific, given that our goal was to obtain a more general understanding of the potential for generalizable gaze tracking in this domain.

Fig. 2

Heatmap overlay showing a participant’s eye gaze during a reading page (top) and the choice page (bottom) of the task. Red indicates high concentration of fixations; purple indicates low concentration of fixations

Table 1 Local feature description per page

Figure 2 shows example heatmaps of one participant's eye gaze during a reading screen and during the choice screen, before providing their answer. Each of the three options was shown in a circle with a diameter equal to 20% of the screen height (e.g., if the screen height was 100 pixels, the diameter of the stimulus would be 20 pixels), each of which was associated with a slightly larger AOI (with a diameter equal to 30% of the screen height). In this case, the gaze followed some expected patterns, including higher gaze densities around the AOIs on the screen. However, calibration drift was also evident, for example in the top image, where the calibration appears to have drifted to the left. Subsequent analyses that used the eye-gaze data to predict cognitive states (TUT and comprehension) thus allowed a margin for error in AOI calculations.
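The sketch below illustrates one way the time-in-AOI features could be computed under the geometry just described (circular AOIs with diameters equal to 30% of the screen height), assuming timestamped gaze samples; the coordinates and data format are illustrative.

```python
import numpy as np

def time_in_aois(samples, aoi_centers, screen_height):
    """Accumulate looking time per AOI from timestamped gaze samples.

    samples: list of (timestamp_ms, x, y) gaze samples for one screen.
    aoi_centers: dict mapping AOI name -> (x, y) center in pixels.
    screen_height: screen height in pixels (AOI diameter = 30% of this).
    """
    radius = 0.30 * screen_height / 2.0
    totals = {name: 0.0 for name in aoi_centers}
    # Attribute the interval between consecutive samples to the earlier sample's AOI.
    for (t0, x, y), (t1, _, _) in zip(samples, samples[1:]):
        dt = t1 - t0
        for name, (cx, cy) in aoi_centers.items():
            if np.hypot(x - cx, y - cy) <= radius:
                totals[name] += dt
                break
    return totals

# Illustrative AOI centers for a 1920 x 1080 display and a few gaze samples.
aois = {"option_1": (320, 540), "option_2": (960, 540), "option_3": (1600, 540)}
samples = [(0, 318, 538), (33, 322, 545), (66, 958, 540), (99, 900, 900)]
print(time_in_aois(samples, aois, screen_height=1080))  # milliseconds per AOI
```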

Classification models and validation

To relate global and local gaze features to the TUT probes and reading-comprehension scores, we used scikit-learn (Pedregosa et al., 2011) to implement five classifiers (logistic regression, random forest, gradient boosted, support vector machine, and decision tree). We also implemented XGBoost with a separate library (Chen & Guestrin, 2016). Where appropriate, hyperparameters were tuned on the training set using scikit-learn’s cross-validated grid search (Pedregosa et al., 2011). Because of the limited volume of data and feature space, we did not consider neural networks or deep learning approaches at this time (see Future work, below).
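A sketch of this model setup is shown below; the hyperparameter grids are illustrative assumptions, since the exact values searched are not reported here.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # separate library (Chen & Guestrin, 2016)

# Candidate classifiers paired with small, illustrative hyperparameter grids.
classifiers = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "gradient_boosting": (GradientBoostingClassifier(), {"learning_rate": [0.05, 0.1]}),
    "svm": (SVC(), {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}),
    "decision_tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, None]}),
    "xgboost": (XGBClassifier(eval_metric="logloss"), {"max_depth": [3, 6]}),
}

def tune_on_training_fold(X_train, y_train):
    """Grid-search each classifier's hyperparameters using the training fold only."""
    tuned = {}
    for name, (estimator, grid) in classifiers.items():
        search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
        tuned[name] = search.fit(X_train, y_train).best_estimator_
    return tuned
```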

We validated the models with a participant-level, tenfold cross-validation scheme. This process ensures that no instances from any individual participant could appear in both the training and test sets within a fold. All features were z-scored by condition (visual or audio) within each fold. We used the training data to calculate the statistics needed for z-scoring (mean, standard deviation, maximum, minimum), which were then applied to the test sets.
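A minimal sketch of this participant-level scheme uses scikit-learn's GroupKFold with participant IDs as the grouping variable; for brevity, the sketch standardizes all features with training-fold statistics rather than separately by condition.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def participant_level_folds(X, y, participant_ids, n_splits=10):
    """Yield train/test splits in which no participant appears in both sets.

    X, y: NumPy arrays of features and labels (one row per instance).
    participant_ids: array-like of participant identifiers, one per instance.
    Features are z-scored using statistics from the training fold only.
    """
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=participant_ids):
        X_train, X_test = X[train_idx].copy(), X[test_idx].copy()
        mu = X_train.mean(axis=0)
        sd = X_train.std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features
        yield (X_train - mu) / sd, y[train_idx], (X_test - mu) / sd, y[test_idx]
```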

For both TUT and comprehension, we observed substantial imbalance between the classes (e.g., many more instances were not TUT than were). Class imbalance can present challenges because supervised learning methods tend to bias predictions towards the majority class label. To address this concern, we used the SMOTE algorithm (Chawla et al., 2002) to create synthetic instances of the minority class by interpolating feature values between an instance and its randomly chosen nearest neighbors until the classes were equated. SMOTE was applied only to the training sets. The original class distributions were maintained in the test sets to ensure the validity of the results.
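A sketch of this step is shown below, using the imbalanced-learn implementation of SMOTE (an assumption about tooling, since the specific implementation is not named here); the key point is that resampling touches only the training fold.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # assumed implementation of Chawla et al. (2002)

def balance_training_fold(X_train, y_train, random_state=0):
    """Oversample the minority class in the training fold only.

    The test fold is left untouched so that evaluation reflects the
    original class distribution.
    """
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X_train, y_train)
    print("before:", Counter(y_train), "after:", Counter(y_res))
    return X_res, y_res
```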

Evaluation

The analyses are described below, first for TUT and then for comprehension (correct/incorrect answers). Given the class imbalance in our data, we report precision, recall, and F1 scores as metrics for each class. Precision, which reflects how accurate the model is for a specific class, was calculated as the number of true positives divided by the total number of instances predicted to belong to that class. Thus, for example, if the model predicted class X 100 times and 40 of those predictions were correct, precision would be 0.4. Recall, sometimes also called true-positive rate or sensitivity, was calculated as the number of true positives divided by the total number of instances of class X in the ground truth, i.e., how many instances of class X were identified correctly. Both metrics are informative about our model and can present a trade-off; for example, over-predicting class X may increase recall but decrease precision. F1 is defined as the harmonic mean of precision and recall, combining the two scores into one meaningful evaluation. F1 was calculated as:

$$F_1 = 2\,\frac{\mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$

The highest possible F1 score is 1 (indicating perfect precision and recall), and the lowest possible score is 0 (indicating that either precision or recall is 0). We report the individual metrics along with the combined metric in acknowledgement that whereas F1 weights precision and recall equally, in practice different types of misclassification can be more or less important (Hand & Christen, 2018).

To support easier comparison with previous work, we also report kappa values (Landis & Koch, 1977). The kappa metric is similar to F1 score in that it can be viewed as a combination of precision and recall. However, unlike F1 score, the kappa metric attempts to correct for chance. Kappa values > 0 indicate improvement over chance, whereas a kappa value of 1 indicates perfect classification.
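All of these metrics can be computed directly with scikit-learn, as sketched below for a single test fold (the labels and predictions shown are placeholder values).

```python
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

# Placeholder ground-truth labels and model predictions for one test fold
# (1 = TUT / incorrect, 0 = on-task / correct).
y_true = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Per-class precision, recall, and F1 (one value for each class label).
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print("precision:", precision, "recall:", recall, "F1:", f1)

# Cohen's kappa corrects for chance agreement; values > 0 beat chance.
print("kappa:", cohen_kappa_score(y_true, y_pred))
```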

Baseline/chance models for comparison

We included two different “baselines” for model-comparison purposes. The first, a chance baseline, was generated using the DummyClassifier in scikit-learn. The DummyClassifier randomly assigns a label based on the base rates in the training sample; e.g., if 25% of the training data was TUT, then there is a 25% chance the dummy classifier will predict TUT.
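This baseline corresponds to scikit-learn's DummyClassifier with the "stratified" strategy, which samples predictions from the training-set base rates, as sketched below with toy data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy training data in which 25% of instances are TUT (label 1).
X_train = np.zeros((100, 3))
y_train = np.array([1] * 25 + [0] * 75)

# The 'stratified' strategy samples predictions from the training base rates,
# so roughly 25% of predictions on any test set will be TUT.
chance_model = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
print(chance_model.predict(np.zeros((10, 3))))
```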

As an additional baseline, we trained a model using interaction data that are separate from the webcam data. Specifically, we used the time that participants spent on either the page before the probe (for TUT) or all three pages (for comprehension), because reading time can be correlated with TUT and comprehension (Mills et al., 2017). The main purpose of these baselines was to determine whether the eye tracking was providing information beyond what we could get from less complex means, such as basic log files. By comparing to these baseline models, we can assess whether, and by how much, adding gaze data improves predictions of TUT and comprehension beyond reading time. Put even more simply, we can ask: is the eye tracking worth the trouble?

Results

Before considering the results of our predictive models, we first consider the gaze data themselves. Though it is not possible in this experimental design to statistically evaluate the quality of the gaze recognition, or indeed the precision of the points recorded, we are able to anecdotally analyze the data through the generated heatmaps. As noted above, we observed drift in the recordings, and in the reading task gaze did not always appear to be on the sentence being read. This is somewhat to be expected: prior work has consistently reported lower gaze precision for webcam-based approaches (Zhang et al., 2019). This challenge has been addressed in the past by looking at relative changes in gaze patterns rather than specific gaze locations (D’Mello & Mills, 2021; Hutt et al., 2016; Mills et al., 2020), an approach we adopted as well.

To evaluate the predictive models, we begin by presenting results using a combined dataset with all participants from both studies in a single model. This approach allowed us to test the feasibility of the webcam-based eye-tracker for detecting TUT and comprehension with the largest, most diverse dataset under realistic conditions likely to introduce multiple, uncontrolled sources of error. After examining the combined dataset results, we then present model performance for each individual study, as well as cross-training results (train on Study 1, test on Study 2, and vice versa). Finally, we conclude with a slicing analysis to determine if model performance changes under various environmental conditions or across racial/ethnic subgroups.

Overall models

Correctness

Across all participants and conditions, correct answers were given 88% of the time. Gaze patterns measured via the webcam-based eye-tracker could be used to predict correct responses (indicating comprehension) better than chance (Table 2). The best performance was achieved by models using only Local features (the time spent in the three AOIs corresponding to the locations of the answer options on the screen; kappa = 0.57). Models using only Global features (the number of gaze points, number of unique gaze points, and dispersion of gaze points) were also above chance but only marginally outperformed the interaction-feature baseline (kappa = 0.15 vs. 0.11). Combining the two feature sets produced worse performance than Local features alone, potentially due to increased noise in the dataset. Thus, under these (admittedly limited) conditions, our results are encouraging and suggest that webcam-based eye tracking can be useful for assessing constructs like comprehension in online environments.

Table 2 Results for predicting correctness and incorrectness

In terms of individual label values (correct or incorrect), the models also showed an increase in F1 relative to chance when predicting incorrect responses, with higher values for both precision and recall. Thus, the model could be used to identify when someone did not understand a text as well as when they did. This result is likely relevant for future applications, where the ability to diagnose when someone makes an incorrect inference provides a possible target for real-time interventions.

Given that there were two conditions for how the stimuli were delivered in the experiment (though these were not explicitly examined here), we compared participant-level accuracy of the best-performing model across conditions with a t test. Results showed no significant difference (p = 0.19).

TUT

Gaze patterns from the webcam-based eye-tracker could also be used to predict TUT (Table 3). Specifically, the results indicate that: (1) all models outperformed the chance baseline, and (2) the combined Global + Local model had the best performance. Our finding that the most effective TUT prediction for this task relies on a mixture of general (Global) and context-specific (Local) features differs from past results. In general, global features have tended to be most predictive of TUT, whereas context-sensitive (local) features have produced a variety of results across different domains and tasks (Bixler & D’Mello, 2016; Hutt et al., 2019; Hutt, Hardey et al., 2017a). For example, local features improved TUT detection during reading of extended texts relative to global features alone (Bixler & D’Mello, 2016). Our work shows that this benefit of local features can also be found in the kind of brief comprehension exercises we used.

Table 3 Results for predicting TUT

These results are somewhat modest in magnitude but are in line with previous studies using commercial eye-trackers, which reported kappa values of ~ 0.20 (Bixler et al., 2015; Blanchard et al., 2014; Mills et al., 2016). They thus demonstrate that even with lower-quality, scalable sensing, we can still harness the so-called “eye–mind link” and detect TUT from webcam eye-tracking data. The eye-gaze models notably outperform the baseline response-time-only model, indicating that a valid signal is being detected. The precision and recall scores indicate that a false positive is less likely than a false negative. Though some inaccuracy remains, this property is potentially useful for triggering interventions and shows promise for future work.

We again compared participant-level accuracy across the two stimulus conditions in the experiment (whether the sentence was presented as text or audio) with a t test. We observed no significant difference in model accuracy (p = 0.36).

Convergent validity

We explored the convergent validity of our models by calculating a set of correlations derived from the models’ predictions and the ground-truth data. Each correlation was calculated at the participant level in order to avoid violating independence assumptions. We calculated four participant-level values: TUT Ground Truth (the proportion of probes to which participants reported TUT), Correctness Ground Truth (the proportion of correct inferences made), TUT Prediction (the average TUT prediction for that participant), and Correctness Prediction (the average correctness prediction for that participant). The resulting correlation matrix is shown in Table 4.
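A sketch of these participant-level aggregations and the resulting correlation matrix is shown below, assuming a long-format table with one row per instance; the column names and values are illustrative.

```python
import pandas as pd

# One row per probe/trial instance; column names are illustrative.
df = pd.DataFrame({
    "participant":   ["p1", "p1", "p2", "p2", "p3", "p3"],
    "tut_truth":     [1, 0, 0, 0, 1, 1],
    "tut_pred":      [1, 0, 0, 1, 1, 0],
    "correct_truth": [0, 1, 1, 1, 0, 1],
    "correct_pred":  [0, 1, 1, 0, 0, 1],
})

# Aggregate to one value per participant to avoid violating independence assumptions.
participant_rates = df.groupby("participant").mean()

# Spearman correlation matrix among the four participant-level rates.
print(participant_rates.corr(method="spearman"))
```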

Table 4 Correlation matrix for student-level TUT and correctness rates, both ground truth values, and predicted values from the best reported models

The models’ predictions of TUT and correctness were each positively correlated with their respective ground-truth labels. Specifically, actual and predicted TUT were weakly correlated (Spearman’s rho = 0.27), whereas actual and predicted correctness were strongly correlated (rho = 0.77, which is comparable to correlations reported in D’Mello et al., 2020, with higher-quality equipment). For a few of the participants, the model-predicted rate for question correctness was identical to the ground-truth rate, contributing to this high correlation.

Moreover, the models’ predicted rates of TUT were negatively correlated with ground-truth correctness (average question score). The magnitude of this negative correlation was somewhat similar when using predicted TUT (rho = – 0.23) or participant-level ground-truth TUT rate (measured as the average of probe responses; rho = – 0.11). This test of convergent validity is based on the consistent negative relationship between self-reported instances of TUT and reading comprehension scores in the literature, with an average reported effect size of r = – 0.28 (D’Mello & Mills, 2021).

Feature importance

To evaluate the importance of each gaze-based feature to our predictive models, we calculated SHapley Additive exPlanations (SHAP) values (Lundberg & Lee, 2017) using the SHAP library in Python. For the two best models of TUT and Correctness reported above, we computed the mean absolute SHAP value of each feature, per fold, and then averaged across folds to generate one value per feature (each between 0 and 1).
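The fold-wise aggregation can be sketched as follows; XGBoost and shap.TreeExplainer are used here as stand-ins, since the specific best-performing models are not restated in this passage.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

def mean_abs_shap_per_fold(folds):
    """Average the mean |SHAP| value of each feature across folds.

    folds: iterable of (X_train, y_train, X_test) arrays, one per fold.
    Returns a 1-D array with one importance value per feature.
    Note: XGBoost is used here only as an illustrative tree model.
    """
    per_fold = []
    for X_train, y_train, X_test in folds:
        model = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)
        shap_values = shap.TreeExplainer(model).shap_values(X_test)
        per_fold.append(np.abs(shap_values).mean(axis=0))  # mean |SHAP| per feature in this fold
    return np.mean(per_fold, axis=0)
```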

TUT

The variance in feature importance was low (SD = 0.01), implying that an ensemble of features was necessary for effective prediction. The top three features were all global features, characterizing the number of gaze points on a given page. This result aligns with earlier research using eye gaze for TUT detection, which has shown that the number of fixations is highly predictive (Bixler & D’Mello, 2016; Hutt et al., 2019). The most predictive feature was the number of gaze points on the third page. This result indicates that the last page provided more predictive power than the previous two but could reflect the proximity of the third page to the probe.

Correctness

The variance in feature importance was also very low (SD = 0.01), although slightly larger than for TUT. The most important features for this model related to the number of gaze points on the answer options across all three sentences and the time spent looking at incorrect options. It should be noted that the feature-importance values for the three options were very close to each other (a range of 0.003), indicating that readers were most likely to be correct if they had spent time considering all options rather than focusing on one answer (even if it was the correct one). In both cases, we note the low variance among the SHAP values. Additional feature engineering and refinement may provide more detailed insight into the relationships between individual eye movements and these two constructs.

Individual studies and cross-training analyses

The above analyses combined both datasets to: (1) increase the amount of data available for training the model, and (2) avoid overfitting to a particular sample at the outset. Below, we report analyses that treated each study as a separate source of data to further probe the reliability and generalizability of our models that use webcam-based eye tracking to predict TUT and comprehension. We examined different combinations of training and testing sets (see Table 5). In cases where the model was trained and tested on data from the same study, the same cross validation approaches described above were employed. In cases where the training and test sets were from different studies, models were trained on the entire training data set and tested on the entire testing data set. We interpret the results in terms of whether there are algorithmic biases manifesting as a degradation in prediction from one sample to another; i.e., does training the model on the predominantly White sample generalize to a more diverse sample, and vice versa?

Table 5 Kappa values from cross training experiments

Individual dataset models

Results from training and testing on the individual datasets (i.e., train on Study 1 or 2, test on the same study) were similar to those from the combined dataset (Table 5; also see Supplementary materials for full details on individual dataset results). We did not observe major changes in the kappa values for the individual datasets (e.g., train on Study 1, test on Study 1) compared with the combined-dataset results presented above. Study 1 slightly outperformed the combined data for TUT (0.15 for combined compared to 0.19 in Study 1), whereas Study 2 slightly outperformed the combined dataset for correctness (0.55 for combined compared to 0.58 in Study 2). These findings suggest that there were no strong, systematic biases in the eye-tracking system that might have affected its ability to collect interpretable gaze data within each of the two study populations.

Cross-training models

When models were trained on one dataset and tested on the other dataset (keeping them completely independent), there was a slight degradation in performance. However, all models still performed above the respective chance baselines. Moreover, the degradation was bidirectional: in all cases, training on one sample (Study 1 or Study 2) led to a degradation in performance when tested using the other sample. Although these results indicate some level of generalizability between the two datasets, the differences should be noted. These results serve as a reminder of the importance of context and eventual use case when collecting/selecting training data.

Slicing analyses

To examine the data in finer detail, we used slicing analyses (Gardner et al., 2019) to identify if and how the predictiveness of the webcam data for reading comprehension and TUT differed for particular subpopulations. The best performing models for each construct (Tables 2 and 3, respectively) were evaluated in the slicing analysis, as well as models trained on each individual study. We considered four relevant subpopulations: (1) whether or not the participant wore glasses, (2) the lighting of the room the participant was in, (3) whether they reported having ever received treatment for a neurological problem, and (4) race/ethnicity. For each of the four categories, we relied on participant self-report and self-identification. For each subpopulation, we calculated model performance (kappa value) using just the instances from that subpopulation. For example, in Study 1, 15 participants wore glasses, so the model is then evaluated on instances only from those 15 participants.
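A sketch of this per-subpopulation evaluation is shown below, assuming a results table with one row per test instance and a column holding the self-reported subgroup; the column names and data are illustrative.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def slice_kappas(results, slice_column):
    """Compute kappa separately for each subpopulation in slice_column.

    results: DataFrame with 'y_true', 'y_pred', and the slicing column,
             one row per test-set instance.
    """
    return (
        results.groupby(slice_column)
        .apply(lambda g: cohen_kappa_score(g["y_true"], g["y_pred"]))
        .rename("kappa")
    )

# Example with illustrative data:
results = pd.DataFrame({
    "y_true": [1, 0, 1, 0, 1, 0, 0, 1],
    "y_pred": [1, 0, 0, 0, 1, 1, 0, 1],
    "wears_glasses": ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
})
print(slice_kappas(results, "wears_glasses"))
```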

Results of the slicing analysis are shown in Table 6. For both correctness and TUT, the results were relatively robust to wearing glasses or not, and to lighting changes. We did, however, notice a slight decrease in performance for individuals who self-identified as having a neurological condition, amounting to a 7% reduction in accurately predicting correctness and a 5% reduction for TUT.

Table 6 Slicing analysis by different moderators

The results for correctness were also relatively robust to differences in race/ethnicity, despite some variation across different racial categories. The overall kappa was 0.57, with the kappas for Asian/Pacific Islander, Black/African American, and Hispanic/Latinx participants in a comparable range of 0.56 to 0.59. There was a slight drop (~ 6%) for Other-identifying participants, at 0.51. There was a more substantial drop for the group of participants who identified as Native American (kappa = 0.26); however, this group contained only two participants, so this result should be interpreted with caution and followed up in future studies using larger samples (see Future work). Overall, we interpret these results as encouraging, providing the first (to our knowledge) glimpse into how gaze-based detectors of TUT/comprehension may differ across these moderators.

This relative stability in correct-inference prediction implies that the variation in detector performance may not be caused entirely by the quality of the eye tracking or by a potential bias in the tracking. Instead, the variation may result from noise in the self-reports and how participants responded (e.g., how comfortable participants are reporting being off-task), from patterns in the simple gaze features used or in the algorithm, or from other factors. Identifying these factors will require further study, but knowing that such variability in detection occurs is nevertheless an important step towards fixing it (Baker & Hawn, in press).

In contrast, TUT prediction was less stable across race/ethnicity. There was a reduction in performance for TUT detection for participants who identified as Black/African American (kappa = 0.04) versus those who identified as White (kappa = 0.17). This variation is the difference between a functioning (albeit modest) detector and chance-level prediction, suggesting that some aspect of the data or detector was different for these participants (see Discussion). However, the same reduction was not observed when predicting correct inferences for Black-identifying participants (kappa = 0.57 for all participants, 0.59 for White participants versus 0.56 for Black participants in the model trained on all data). Though there is still variation in predicting correct inferences across race/ethnicity, especially when training on just the Study 2 data (kappa = 0.67 for White participants and 0.51 for Black participants), the relative change is smaller, and the resulting detector would still be considered effective (as opposed to chance level for TUT).

In general, these results suggest that this tracking methodology, and the detection that it facilitates, are acceptably robust for the task of correctness prediction. However, additional work is necessary before including TUT prediction, especially given the variability across racial/ethnic groups. For example, our analyses were underpowered for Native American participants (N = 2), corresponding to an ineffective model; additional analysis (perhaps less quantitative) is required before we can draw any general conclusions. If pursuing a more robust predictive model from the data alone, additional training data would be required before the models can be adequately tested on this population. Thus, although these analyses are important to conduct, they are not intended to be conclusive or prescriptive about for whom and when webcam-based eye tracking will work.

General discussion

The idea that eye gaze behaviors provide a window into the mind has led to important research discoveries over many decades (Huey, 1908; Rayner, 1998; Rayner, Chace et al., 2006). However, the high cost of eye-trackers has severely limited the scalability of existing approaches to detect cognitive states in real time. Here we attempt to address this issue by integrating WebGazer into an online educational task in order to build models of TUT and comprehension during reading, with the goal of showing a proof-of-concept method for scalable eye-tracking.

Main findings

This work demonstrates the feasibility of using webcams for modeling internal states during learning. Though this data stream is of a lower fidelity than a typical PCCR eye-tracker (e.g., measured at 30 Hz rather than > 60 Hz, with reduced precision and accuracy), this work demonstrates that with appropriate calibration, WebGazer can be sufficient in most cases for user modeling. Despite having only a single calibration for a 40-min task, our models still made predictions at above chance levels. Our models also performed better (in terms of kappa values for the combined dataset, as well as for each dataset individually) than a model trained on participant response time alone, demonstrating that using webcams for this task was a useful augmentation. Moreover, both TUT and comprehension models performed comparably (as measured by kappa values) to prior work using research-grade equipment, for most groups of learners and tasks. This result is particularly encouraging given the poorer quality of our gaze data, which nonetheless was sufficient to model users at above chance rates and leverage this cheaper, more accessible technology to provide cognitive insights.

Throughout this work, we have used a chance baseline as a comparison point, with models performing above chance being considered successful. By this metric, our findings are roughly comparable to others using research-grade eye tracking (Bixler et al., 2015, whose highest reported kappa value for gaze was 0.15) or EEG signals (Dong et al., 2021, whose MCC was 0.206). No comparable values exist for comprehension, but given that we outperform the conventional rates for TUT, we speculate that webcam-based eye tracking may also be suitable for real-world applications. In general, the limited degree to which our model exceeds chance suggests that, if used for intervention, it should be applied in “fail-soft” interventions (see Applications, below). More generally, more nuanced definitions of “successful” will require taking into account the intended application and its risks.

Our slicing analyses indicated that our detectors were largely robust to individual differences in race/ethnicity (with one key exception; see discussion below) and whether participants wore eyeglasses, as well as to conditional differences such as lighting. We also found that the trained models generalized between the two datasets, though with some performance degradation in both directions (i.e., training on Study 1 and testing on Study 2, and vice versa, both produced a drop in performance).

The one notable instance of variability in our slicing analysis was the drop in performance for TUT detection across race/ethnicity, with lower performance for participants who identified as Black/African American or Native American. This reduction in performance could have many causes, such as differences in the self-reporting of TUT or in the simple eye-movement features. It seems unlikely to be a result of egregious computer-vision issues with the tracking, such as contrast issues or biases in the facial detection, given that the gaze tracking was sufficient for successful correct-inference prediction. However, more analyses are needed to rule out more minor differences in how the eyes are tracked. Future research is also needed to determine why we observed such differences, as well as how webcam-based eye tracking and TUT detection can be improved for all participants. Given the broad, robust nature of the tracking for correct-inference prediction, our work provides some initial evidence of feasibility for the webcam-based method in general. More evaluation is needed to determine which tasks this approach can be used for without concerns of bias.

This caveat notwithstanding, our work serves as a proof of concept for a future real-time detector that leverages webcam data. All data used in the models came from interactions prior to the prediction point and could be gathered in real time (either the previous page of reading or the previous three pages). Similarly, the model accuracy is within the range of detectors previously used in the literature for real-time intervention. This work adds to a growing body of research examining the feasibility of webcam-based eye tracking and adds further credence to its use as a proxy for PCCR gaze tracking. Furthermore, it offers potential to scale up decades of research examining links between comprehension, TUT, and eye gaze, taking these experiments into new, ecologically valid environments.

Applications

It is perhaps easier to start with how this approach should not be used. Sensor technologies such as webcams hold great potential but also pose great risk. This method should not be used to monitor students (or anyone) without their permission, or without transparency as to how their data is being collected/stored. Any future application should clearly inform the user of what data is being collected and how it is being used.

Assuming careful consideration of privacy and transparency, these methods have many possible applications in software development. Eye tracking has consistently been used to identify interaction patterns and improve software development (Jacob, 1995; Kukkonen, 2005). Being able to monitor constructs such as attention in a cheap and scalable way can improve this process and help developers understand when materials or software is not engaging the audience (Toreini et al., 2020).

More specific to educational contexts, our work sets the foundation for improving the scalability of modeling techniques, with the end goal of improving research methods, learning technologies, and student experiences. For example, our results show that webcam-based detectors provide more accurate detection than a response-time detector and are thus likely to be more useful for real-time intervention techniques that correct deficits “in the moment.” A student who is unlikely to answer a comprehension question correctly could be advised to read the text again before attempting the question or be given hints about which parts of the story are the most critical.

It is also important to consider that any intervention must rely on detection, which is inherently imperfect, especially in the case of TUT detection. False alarms (predicting someone is off-task when they are not) and misses (missing an instance of TUT) are both possible and must be accounted for in any application. In our view, detection does not need to be perfect to be useful. Indeed, prior work has used imperfect detection to trigger meaningful interventions for TUT using a probabilistic approach (e.g., if the likelihood of TUT is 70%, then there is a 70% chance of an intervention) (Mills et al., 2020). Any interventions should also be designed to “fail soft,” in that there are no harmful effects on learning if they are delivered incorrectly. For example, an intervention may ask students to provide a self-explanation of what they had just read if they have been detected to be off-task. If a student is not off-task, this will reinforce what they already know without damaging the experience too greatly. A student who is off-task will be prompted to realize that they are missing details and go back.

The comprehension detector has higher precision than recall for students who have not understood the text, meaning that a miss (predicting comprehension when the student has not understood the text) is more likely than a false positive (predicting that a student has not understood the text when they have). In this case, confidence in an individual prediction can be high, which is useful for most applications and reduces the need for a “fail-soft” approach, but more refinement is needed to reduce the number of misses in the model and to ensure that the detector supports all students.

Given these implementation considerations, the detectors presented in this work provide proof of concept for potential real-time integration using webcam eye-tracking solutions. Though there are known inaccuracies, these inaccuracies can, in principle, be accounted for to provide valuable real-time information and adaptation. Though we cannot directly measure the inaccuracies in this work, because we lacked an appropriate comparison set, previous evaluations of WebGazer have reported errors of up to approximately 4 degrees. It is thus encouraging that, despite this, our anecdotal evaluation shows that gaze is responsive to the stimuli and falls on expected AOIs in many cases.

Limitations and future work

There were several limitations of this work. Firstly, our study was designed to test low-cost eye tracking, by using the webcams included in devices. However, webcams have limited resolution and accuracy compared to research-grade eye-trackers. These limitations govern what can be derived from this data and the subsequent strength of any conclusions we can draw relative to the broader eye-tracking literature. Research-grade eye tracking will remain the gold standard for gaze-based research, even though webcam-based approaches have great promise in the real world. We have shown that despite low-quality tracking, we are able to model complex constructs in a manner that is scalable and easy to implement.

Second, though we have taken steps to improve ecological validity through webcam use, future work should focus on using other tasks. Given that our results likely depend strongly on exactly how we presented our stimuli, future work should consider alternate presentations of text, or alternate tasks. For example, the same approach could be implemented but using longer passages and with page-by-page reading. Although many psychological experiments routinely use word-by-word or sentence-by-sentence paradigms when studying reading comprehension, it is important to test the boundary conditions of the webcam-based eye-tracker, particularly as areas of interest become more difficult to outline (e.g., small, single-spaced typeface). This limitation also extends to our high baseline comprehension accuracy rate (average accuracy was 88% correct), resulting in unbalanced class labels. Given this imbalance, it is not surprising that the chance classifier performs much better for correctness than for incorrectness. Future work may also consider alternative forms of assessing comprehension.

We also note the better performance of Local versus Global features. This finding implies that, for correctness, where a participant is looking is more important than their more general gaze patterns. This result makes sense given the layout of the stimuli, with the three answers located at different positions on the screen. It also testifies to the general accuracy of the eye tracking we used; were it highly inaccurate, it is unlikely the local features would have been as effective for prediction. However, other formats of stimulus/material presentation would be helpful in future work. Similarly, we should consider more complex feature sets and deep-learning approaches as additional data are collected. As is often the case with human-participant data, the current dataset is not large enough for effective deep learning; however, the scalability of this approach presents the opportunity to collect vast datasets. Now that initial feasibility has been shown, future work should consider collecting larger datasets that enable more complex data mining and machine learning.

Future work should also consider a more in-depth feature-engineering process, with additional global and local features. The features included in this work, though theoretically relevant, provide a baseline for future gaze-feature development. Other features may include more detailed logging of reading regressions, for example, or word-level features considering how long a participant spends on each word of a sentence. Some features may also require that tracking accuracy and/or resolution is first improved before they can be calculated.

Though local features provided valuable predictive information for comprehension and TUT, we have not explored the mechanistic relationship between eye gaze and these constructs in this work. We argue that webcam-based gaze tracking is perhaps not suited to this kind of fine-grained analysis of reading behaviors, but we are encouraged that the data-driven models presented here are able to identify the cognitive events considered. Future work could consider how gaze mechanisms identified with research-grade tracking systems translate to this more accessible tracking option, which could help determine if and how such mechanisms can still be detected using webcam-based systems.

This work is further limited by our use of thought probes. Thought probes require users to be mindful of their unrelated thoughts and respond honestly. Although this methodology has been previously validated (Franklin et al., 2013; Randall et al., 2014), it is still limited due to the reliance on self-reports. Unfortunately, there is no clear alternative to track a highly internal state like TUT outside of measuring brain activity directly, which is also limited in many respects. Indeed, this too is part of the motivation for automated detectors. Future work should focus on validating our detectors so that thought probes are no longer necessary for measuring TUT, though fully automated TUT detections may well be a long way in the future.

Conclusions

In sum, we provide evidence for a scalable solution for detecting attention and comprehension using only stock web cameras. Although there is still much room for improvement, the possibility of reaching more individuals in more real-world settings – particularly those who are historically underrepresented – creates important opportunities for improving learning supports, and we hope to continue development along these lines.