Theory-based approach for assessing cognitive load during time-critical resource-managing human–computer interactions: an eye-tracking study

Computerized systems are taking on increasingly complex tasks. Consequently, monitoring automated computerized systems is becoming increasingly demanding for human operators, which is particularly relevant in time-critical situations. A possible solution might be adapting human–computer interfaces (HCI) to the operators’ cognitive load. Here, we present a novel approach for theory-based measurement of cognitive load based on tracking eye movements of 42 participants while playing a serious game simulating time-critical situations that required resource management at different levels of difficulty. Gaze data were collected within narrow time periods, calculated based on log data interpreted in the light of the time-based resource-sharing model. Our results indicated that eye fixation frequency, saccadic rate, and pupil diameter significantly predicted task difficulty, while performance was best predicted by eye fixation frequency. Subjectively perceived cognitive load was significantly associated with the rate of microsaccades. Moreover, our results indicated that more successful players tended to use breaks in gameplay to actively monitor the scene, while players who used these times to rest were more likely to fail the level. The presented approach seems promising for measuring cognitive load in realistic situations, considering adaptation of HCI.


Introduction
Our daily lives are becoming more and more automated, with computerized systems such as autopilots, speed and lane assistants, and robotic surgeons taking on increasingly complex tasks. Although human operators are supported considerably by such systems, monitoring them is becoming increasingly demanding due to the rising complexity of the tasks they are capable of executing. This problem is particularly evident in unexpected time-critical situations, where the operator does not have time for a detailed analysis of the interface and therefore relies more than usual on cognitive ergonomics.
A possible solution to this challenge may be the development of intelligent human-computer interfaces (HCIs) capable of adapting their demands and appearance (e.g., minimizing displayed information when the operator is cognitively overloaded) to the situation as well as to the operators' cognitive and emotional states to optimize performance and prevent failures. The so-called cognitive load on the operator seems to be particularly relevant in this case, because it is considered to reflect the degree to which available cognitive resources are engaged in the task at hand [1] and thus can be used to predict operators' performance. Accordingly, the detection of the actual degree of cognitive load appears to be of great importance in a variety of realistic settings [2]. For example, based on this information, the appearance of the user interface could be adapted appropriately.
Empirical evidence indicates that differently designed HCIs may induce different levels of cognitive load while performing the same task. As one example, Charabati et al. [3] compared interfaces designed to monitor anesthesia parameters during surgery and reported that participants rated their cognitive load significantly lower when using a mixed numerical-graphical interface compared to the numerical and advanced-graphical interfaces. Another example was provided by the study of Oviatt [4], who found that students performed significantly better at solving mathematical problems when using a digital pen and paper interface compared to graphical tablet interfaces. At the same time, online adaptation of the environment to the operators' current cognitive load was shown to lead to significant performance improvements [5].
This raises the question of which measurement methods and metrics are best suited for the assessment of cognitive load in such systems. Fortunately, digital systems easily allow for collecting individual user data that may well be used to model cognitive or emotional states of the operator [6], including but not limited to cognitive load. Data-driven approaches such as machine learning [2] appear very promising in this regard; however, they mostly require specific calibration and cannot be easily generalized to different users and settings. We aim to address the question of whether this limitation might be overcome by using a top-down approach, that is, detecting cognitive load states based on a suitable theoretical framework. By doing so, we hope to avoid additional calibrations and achieve good generalizability of the method when applying the proposed measurement system to similar settings. Specifically, in this study we investigated whether eye-tracking features collected during initial burst and initial idle time intervals (see Sect. 1.3.1), and calculated in a top-down procedure based on a theoretical model, may be successfully used to measure cognitive load during a time-critical emergency simulation. This article is structured as follows. The next section (Sect. 1.1) provides a brief introduction to the concept of cognitive load and its measurement, specifically addressing the eye-tracking method and the eye-tracking features used in this study (Sect. 1.2). The subsequent section (Sect. 1.3) describes the time-based resource-sharing model [7], which we chose as the theoretical basis for our later calculations. Section 1.3.1 presents the temporal action density decay metric [8], derived from the model, and introduces the calculation of burst and idle time periods, which are later used for recording eye-tracking features. Subsequently, the Materials and Methods (Sect. 2) and Results (Sect. 3) are described and discussed (Sect. 5).

Cognitive load
The concept of cognitive load is based on the realization that cognitive resources are limited [9]. It can be understood as the degree of "how hard the brain is working to meet task demands" [10]. At the same time, it needs to be considered that cognitive load evolves as a complex interplay between different task demands and mental processes [1], and thus represents a dynamic variable that fluctuates during task accomplishment.
An association between cognitive load and human performance was demonstrated in a variety of realistic settings such as e-learning [4,5], transportation [11][12][13][14][15], aviation [16], office work [17,18] and medicine [19]. It has been observed that this relation seems to be shaped like an "inverted U" [20], with best performance under medium cognitive load. Additionally, it seems to be associated with Csikszentmihalyi's concept of "flow" [21,22], which characterizes a state of total concentration on the task at hand and also assumes that performance usually declines when cognitive demands are too low or overstraining [e.g., 20, 23-25]. This indicates that human-computer interaction might be optimized by keeping the operators' cognitive load at a medium level [26].
The vision of an adaptive HCI to provide optimal driver support was already expressed by Michon [27] in 1993. Since then, several attempts have been made to develop adaptive systems. As one example, BMW [28] has developed a system that diverts incoming calls to voicemail when driver cognitive overload is detected. To this end, the cognitive state of the driver was estimated based on the traffic situation and driving dynamics. Toyota took a similar approach by resetting voice messages when the driver was overloaded; here, the cognitive state was estimated based on the use of the accelerator pedal. Similarly, researchers from Daimler [29] used motion dynamics in the car seat to estimate cognitive load. Another way to reduce cognitive load in a car consists in optimizing intervals between incoming messages [30]. Messages that appear too frequently or concurrently can increase the driver's cognitive load and impair driving performance, whereas an adaptive extension of the intervals between messages reduces cognitive load and improves driving performance [31]. As another example, Kohlmorgen et al. [32] substantiated this consideration by showing that an adaptive reduction of cognitive load improved driving performance under real traffic conditions. In aviation, too, research on cognitive load has a long tradition, with the first studies aiming to detect cognitive states in pilots dating back to the 1980s and 1990s [33,34]. In subsequent years, the idea of developing intelligent assistive systems adapted to cognitive load came to the foreground. As one example, an adaptive cognitive agent [35,36] was developed to individually support helicopter crew members. Its cognitive load estimator detected states of cognitive overload based on task load and behavioral data and adaptively decided whether the operator needed support.
Another adaptive cognitive agent was developed to help with air traffic management and was shown [37] to significantly improve operators' performance. As another example, Wilson and Russell [38] used a neural network to detect states of high cognitive load during a simulated uninhabited aerial vehicle task based on (neuro-)psychophysiological data and found that adaptive aiding may significantly enhance pilots' performance.
Besides the automotive and aviation fields, highly relevant contexts such as medicine, emergency management, and education were also shown to benefit from adaptation to cognitive load. To give a few examples, Sarkar et al. [39] designed a multitasking deep neural network that used electrocardiogram (ECG) data to classify high and low cognitive load states in experts and novices performing a trauma simulation. Mirbabaie and Fromm [40] developed an augmented reality system to support emergency management during realistic emergencies. Yuksel et al. [41] reported on the benefits of adaptively increasing task complexity for learning performance in pianists, while Walter et al. [5] observed significant improvement in learners' math performance when using an electroencephalography (EEG)-based adaptive learning environment.
Taken together, this brief summary indicates that online adaptation to cognitive load is beneficial as well as practicable. Already achieved results appear promising for a variety of realistic contexts while this research field remains vibrant. Thus, considering that cognitive load is a dynamic variable that fluctuates over time during task completion, we need a measurement method able to capture cognitive load at the early stages of task processing to adapt HCI to it in a timely manner, because the earlier a non-responsive cognitive state can be detected, the earlier HCI can be adapted accordingly.
Measurement techniques of cognitive load can be classified into four main categories: (i) Performance-based, (ii) Subjective, (iii) Behavioral, and (iv) Physiological measurements [42][43][44]. Performance-based measurements rely upon user performance, for example, the rate of correct responses while solving a sequence of arithmetic tasks. These measurements are hardly applicable in the context of human-computer interaction (HCI), where intermediate results can seldom be identified. Subjective measurements are usually obtained by (standardized) questionnaires such as SWAT [45] and NASA-TLX [46]. These measurements are well validated, easy to apply, and highly reliable. Unfortunately, they can primarily be collected after the task has already been completed and thus are hardly applicable for online assessment of cognitive load when aiming for timely HCI adaptation. Behavioral measurements rely on the analyses of differences in operators' interaction behavior with the system, such as mouse usage, click rates, etc. They potentially allow for online evaluation of fluctuations in cognitive load. However, behavioral measurements might be influenced by factors other than the task at hand such as attentional or motivational processes [e.g., 47]. Finally, physiological measurements (e.g., measurements of heart rate variability and electrodermal activity, electroencephalography (EEG), functional magnetic resonance imaging (fMRI), eye-tracking) relate physiological parameters to psychological constructs including cognitive load [44,48-50]. They seem very promising for the development of adaptive systems because they allow for continuous recording of the respective variables and thus online adaptation. However, they often require costly equipment and sophisticated methods of data analysis.
Moreover, some physiological methods are hardly feasible in realistic HCI environments [for an overview see 51] due to their immobility (e.g., fMRI) or high noise sensitivity (e.g., EEG).

Eye-tracking
One physiological method that is gaining increasing popularity for measuring cognitive load is eye-tracking. In particular, the unobtrusive design of modern video-based eye-tracking systems [for a general overview of eye-tracking methodology, see: 52] makes this technique potentially promising for the commercial development of user-friendly adaptive HCIs. In this article, we focused on fixations, blinks, saccades, microsaccades, and pupil diameter as specific indices of participants' oculomotor behavior because, as described in more detail below, evidence has shown that these features respond to fluctuations in cognitive load.

Fixations
Voluntarily controlled stable gazes that usually last for about 200-300 ms are called fixations [53]. During these periods the eyes stay relatively still while the person processes information from the fixation area [54]. The relation between cognitive load and fixation time seems to depend on the task at hand. There is evidence that increased task complexity is associated with fewer but longer fixations [for reviews see: 53,55]. For example, Chen et al. [56] concluded that increased fixation duration and decreased fixation rate indicate increased attentional effort on a more demanding task. Similarly, De Rivecourt et al. [57] found that increased task complexity is associated with longer fixations on the control instruments during simulated flight. In contrast, Van Orden et al. [58] observed fixation frequency to systematically increase with the visual complexity of a target classification task.

Saccades
Eye movements between two fixations that allow for exploration of the surroundings and attention control are called saccades. Like fixation frequency, saccadic rate also appears to be strongly dependent on the type of task. While increasing difficulty of a visual search task was shown to lead to an increased saccadic rate [59], there is also evidence that saccadic rate decreases with the difficulty of a non-visual task [60].

Microsaccades
If our eyes stayed completely still during fixation, the visual image would gradually fade because the neural response weakens with constant stimulation [54]. Microsaccades are small unintentional eye movements that cover less than 1° of visual angle and prevent currently viewed visual information from fading [54]. Evidence suggests that microsaccadic frequency increases with increasing visual complexity of the task at hand [61], whereas in non-visual tasks microsaccadic rate seems to decrease and microsaccadic magnitude to increase with task difficulty [62,63].

Blinks
A commonly known function of blinking consists in moisturizing the eyeball and protecting it from physical damage. Like microsaccades, blinking is also needed to prevent perceptual fading [64]. Moreover, bursts of blinks seem to occur before and after periods of intense information processing [65]. Additionally, high blink rates were found to be associated with higher cognitive load [60].

Pupil dilation
The main function of the eye pupil is controlling the amount of light that enters the eye to achieve the best possible visual perception. Pupil diameter is the metric most commonly considered in cognitive load research [for a general overview see: 66]. In states of high cognitive load, pupil diameter was repeatedly observed to increase proportionally in both visual and non-visual tasks [60,67,68]. However, besides this association, pupil size responds to a relatively broad range of stimuli, ranging from intake of drugs [69,70] to internal states such as interest and arousal [71,72]. Therefore, it is not trivial to determine the exact reason for an observed pupil dilation, and the results must be interpreted with caution.

Summary
Taken together, eye-tracking seems to be a promising technique that is well suited for assessing cognitive load during HCI. Empirical evidence indicates that the eye-tracking measures listed above fluctuate dynamically over time [65]. Hence, it might be advantageous to know the exact onset and offset of each stimulus to be able to effectively differentiate between states of low and high cognitive load. While this premise is easy to achieve in a controlled laboratory setting, realistic HCI usually consists of a variety of interlocking tasks and stimuli that are impossible to analyze separately. Therefore, we chose a specific analytic approach. Instead of analyzing eye-tracking data for the entire time course of HCI, we focused on the analysis of the most relevant time periods determined according to an established theoretical approach, allowing for better generalizability of these calculations to similar situations. In the context of time-critical interactions under severe time constraints, the time-based resource-sharing (TBRS) model [7], briefly described below, provided a suitable theoretical basis for assessing cognitive load.

Time-based resource-sharing model
The main idea proposed in the TBRS model by Barrouillet et al. [7] is that, in addition to task complexity, cognitive load also strongly depends on available time, which is particularly relevant in time-critical situations. The model describes working memory as a core system of cognition consisting of two processes indispensable for the execution of a cognitive task: information storage and processing. According to the model, both components require attention, which must be switched between subtasks, resulting in complex and time-critical interactions between them and eventually causing interruptions in the processing of subtasks. Based on these assumptions, TBRS predicts that cognitive load, and thus performance, "depends on the proportion of time during which attention is captured in such a way that the storage of information is disturbed" [7]. However, the authors acknowledge that it is not trivial to determine these time intervals.

Fig. 1 Inactive task forces are marked with black circles, active ones with green circles. A: At the beginning of the game level, all emergency personnel are ready and inactive. B: The initial burst begins with the first task the player assigns and lasts until all available personnel are actively engaged. In this example, the emergency doctor and paramedics are not available for assignment because there are no injured people to be treated. C: The initial burst ends as soon as the last available personnel are assigned; the emergency doctor and paramedics cannot yet be assigned. This time also marks the beginning of the initial idle, in which the player must wait until some personnel become available again or until new tasks occur. In this example, all available personnel are already active and the player must wait until the person is rescued from the burning building; only then can this person be treated by the doctor. D: The initial idle ends when the first personnel are available again. In this example, the initial idle phase ends as soon as the rescued person appears lying on the road. At this moment, an emergency doctor becomes free and can be assigned to treat the patient; at the same time, the ladder truck also becomes free again and can be assigned to rescue the next person.

Temporal action density decay metric
In a recent study [8], the authors addressed this challenge by proposing the temporal action density decay (TADD) metric, which is based on the TBRS model and was developed to estimate cognitive load in time-critical situations that require resource management. According to this approach, such situations can be divided into a series of so-called action blocks consisting of active phases (burst), in which resources are managed, and waiting phases (idle), in which all resources are occupied or unavailable and one must wait until a new task appears or a resource becomes available again (for a detailed example see Fig. 1). The ratio of the length of the first detected burst (initial burst) to the length of the first action block (initial action block, which occurs right at the beginning of each level) was shown to significantly predict performance [8]. In fact, it turned out that participants who completed their tasks faster, and therefore had to wait longer at the beginning of the level, were also significantly more likely to successfully complete the respective level.
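The ratio described above can be illustrated with a minimal sketch. The function name, the argument names, and the timestamps are hypothetical and chosen only for illustration; the sketch assumes that the initial action block spans the initial burst plus the subsequent initial idle.

```python
def tadd_ratio(burst_start, burst_end, block_end):
    """Ratio of initial burst length to initial action block length.

    All arguments are timestamps in seconds (hypothetical log fields).
    Assumption: the action block runs from the start of the burst
    to the end of the idle phase that follows it.
    """
    if not (burst_start <= burst_end <= block_end):
        raise ValueError("timestamps must be ordered")
    block_length = block_end - burst_start
    if block_length == 0:
        raise ValueError("action block has zero length")
    return (burst_end - burst_start) / block_length


# Made-up example: a player who assigns resources quickly waits longer,
# which yields a smaller ratio.
fast = tadd_ratio(0.0, 10.0, 40.0)
slow = tadd_ratio(0.0, 30.0, 40.0)
print(fast, slow)
```

Under this reading, a smaller ratio corresponds to the faster players who, according to [8], were more likely to complete the level.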

Present study
Although the TADD metric has been shown to be significantly related to cognitive load, it is a behavioral measure that does not allow direct conclusions to be drawn about the cognitive activities occurring at that time. While participants' cognitive engagement during the initial burst can be estimated based on logged in-game actions, this method cannot be used for the initial idle period, because no actions are performed during this time. It is conceivable that some participants might use the initial idle for relaxation, which would be reflected in decreased cognitive load, whereas others might use this time for planning and visual screening of the scenery, which we expect to lead to an increased level of cognitive load as compared to the first group.
In this study, we aim to further investigate the initial burst and idle time periods during play of a time-critical serious game using eye-tracking. To this end, we obtained fixations, blinks, saccades, microsaccades, and pupil diameter during these time intervals. Based on previous evidence, we expect fixation frequency [58,73] and saccadic and microsaccadic rates [61,73] to increase in response to visual load and to decrease [56,57,60,62,63] in response to non-visual cognitive load. We also expect pupil dilation [67,68,73] and blinking rate [60,65] to increase with increased difficulty.
We expect that cognitive behavior during the initial idle time period should have an impact on gaming success. Specifically, we expect that higher visual and cognitive activity during this time will help maintain and promptly update the cognitive model of the game scene, leading to better reaction times and thus better game success for participants who are more active during this time.
On the other hand, based on the theoretical considerations of Sevcenko et al. [8], we expect all participants to work at their limit, that is, to have maximum cognitive load during the initial burst phase. For this reason, we did not expect eye-tracking data recorded during the initial burst phase to be related to task difficulty, to participants' subjective assessment of cognitive load, or to their performance.

Materials and methods
The study was carried out as part of a larger project. Besides the eye-tracking features described below, it included other measurements that are not covered in this paper, i.e., behavioral in-game data, cardiac activity, galvanic skin response, and cortical hemodynamics measured by functional near-infrared spectroscopy [8,74].

Participants
47 participants took part in this study. In the following, we present data of 42 participants (31 females, 11 males) aged between 19 and 48 years (M = 24.3; SD = 5.4). Five participants were excluded from the analysis due to poor quality of their eye-tracking data. All participants spoke fluent German and were right-handed. They were recruited via an online database and compensated for their time expenditure. None of the participants reported neurologic, psychiatric, or cardiovascular disorders, and none of them were taking psychotropic medications. The study was approved by the local ethics committee and written informed consent was obtained prior to the experiment.

Task
Participants played an adapted version of the serious game Emergency [75], simulating time-critical emergencies. There were two different emergency scenarios with three different levels of difficulty each. During the game, participants had to coordinate different emergency personnel, such as emergency doctors, paramedics, and firefighters, as well as ambulances, fire trucks, and ladder trucks to rescue victims and extinguish fires.
After familiarizing themselves with the task by playing a learning sequence, all participants completed two experimental scenarios: Fire and Train Crash. The learning sequence consisted of a short tutorial followed by a car accident scenario where participants had to free all victims from the crashed vehicles, provide first aid and then arrange their transport to hospital. The time limit for the training scenario was 5 min.
In the Fire scenario, participants had to extinguish a burning building block, rescue some residents from burning houses, provide first aid, and arrange their transport to hospital within a time limit of 7.5 min. The scenario Train Crash involved a train crashing into a building and causing a quick-spreading fire. The scenario required participants to free trapped passengers, provide first aid, and arrange their transport to hospital, as well as extinguish numerous fires. The time limit for each level of this scenario was 10 min.
Each scenario was presented at three levels of difficulty: easy, medium, and hard, as defined by varying the number of tasks to be performed and the number of personnel to be coordinated (see Table 1). In this way, the increasing density of actions increased task demands in terms of planning, coordination, and prioritization, leading to varying levels of cognitive load. Time pressure was additionally induced by setting time limits for levels.

Experimental setup and design
The experiment was performed in a quiet room under constant light conditions (see Fig. 2). The Emergency serious game was presented on a 16" notebook at a screen resolution of 1920 × 1080. A conventional computer mouse was used as the only interaction device. Gaze data were recorded at 250 Hz using a SensoMotoric Instruments (SMI) RED250 eye tracker with 0.4° gaze position accuracy in combination with SMI Experiment Center 3.7.60 software installed on the same notebook. The eye tracker was

Note to Table 1: The number of fires depended on players' performance and might grow. These cases are marked by the '+' sign.

The study was implemented in a within-subject design, that is, each participant completed all scenarios and levels. Each participant performed the same predefined sequence of levels only once. To minimize order effects, we decided to present the levels and scenarios in a constant sorted order, starting with the easiest one. The experiment began with the calibration, followed by a baseline phase during which participants were asked to sit still and look at a fixation cross for 5 min to acquire baseline parameters of physiological measures. After that, participants completed an introductory learning sequence, followed by the two scenarios with their respective levels of difficulty (Fire: easy, medium, hard; Train Crash: easy, medium, hard). Subjective ratings of cognitive load experienced during the Emergency serious game were obtained after each level using the NASA-TLX questionnaire. The whole experiment lasted about one hour including the training time.

Features
To estimate cognitive load, we used eye-tracking features known from the literature in this regard. These data were recorded during narrow initial burst and initial idle time periods, which were calculated based on log data as described below. Hereafter, means of the respective eye-tracking measures for the respective time periods were associated with the difficulty scores of levels as well as participants' performance and their subjective ratings of cognitive load.

Difficulty score and performance
For each level, a difficulty score was defined as the percentage of participants who failed to complete all tasks within the predefined time limit (see Table 1). Performance was recorded individually for each participant as a binary indicator of whether the level was completed successfully or not.
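As a minimal illustration of this definition, the following sketch computes a difficulty score from per-participant binary performance indicators. The function name and the outcome data are made up for illustration and do not reproduce the values in Table 1.

```python
def difficulty_score(completed):
    """Percentage of participants who failed the level.

    `completed` holds one boolean per participant
    (True = all tasks finished within the time limit).
    """
    if not completed:
        raise ValueError("no participants")
    failed = sum(1 for ok in completed if not ok)
    return 100.0 * failed / len(completed)


# Made-up outcomes for 42 participants: 32 successes, 10 failures.
outcomes = [True] * 32 + [False] * 10
score = difficulty_score(outcomes)
print(round(score, 1))
```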

Subjective rating of cognitive load
After completing each level, we asked participants to rate their subjectively experienced cognitive load by completing selected items of the NASA-TLX questionnaire [46]. In order not to disturb the eye-tracking calibration, the items were read aloud by the experimenter while the participant was instructed to sit still and look at the monitor while answering. The NASA-TLX consists of six items rated on a 21-level scale (0 to 100 points in steps of 5), and its dimensions correspond to various theories distinguishing between physical, mental, and emotional demands imposed on the operator [76]. In this study, we considered the three items addressing the mental facet of operators' load (i.e., mental demand, temporal demand, and effort) [77,78].
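The rating scale described above can be sketched as follows. Note that the text does not state how the three selected items were combined; the unweighted mean used here is purely an assumption for illustration, as are the dictionary keys and the example ratings.

```python
# 21 rating levels: 0, 5, 10, ..., 100
VALID_RATINGS = list(range(0, 101, 5))

# The three items addressing the mental facet of load (hypothetical keys).
MENTAL_ITEMS = ("mental_demand", "temporal_demand", "effort")


def mental_load_score(ratings):
    """Unweighted mean of the three mental-facet items.

    ASSUMPTION: the aggregation rule is not given in the text;
    a simple mean is used here only as an example.
    """
    for item in MENTAL_ITEMS:
        if ratings[item] not in VALID_RATINGS:
            raise ValueError(f"invalid rating for {item}: {ratings[item]}")
    return sum(ratings[item] for item in MENTAL_ITEMS) / len(MENTAL_ITEMS)


# Made-up ratings for one participant after one level.
example = {"mental_demand": 70, "temporal_demand": 85, "effort": 55}
score = mental_load_score(example)
print(score)
```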

Eye-tracking features
When analyzing gaze data, we focused on fixation frequency, saccadic and microsaccadic rates, number of blinks, and pupil diameter recorded during the initial burst and initial idle time periods. These eye-tracking features were extracted using SMI Experiment Center 3.7.60 software. During preprocessing of pupillometric data, we removed all data points where pupil diameter was non-positive, because such artifacts typically indicate invalid data. We also did not consider data up to 100 ms immediately before and after each blink, because during these periods the pupil is partially occluded by the eyelid or eyelashes and thus cannot be detected reliably [79,80]. Finally, we linearly interpolated small gaps of up to 50 ms to increase the amount of usable data. The rate of microsaccades was computed using the method proposed by Krejtz et al. [81]. After preprocessing, we averaged the collected data over time, subtracted the respective baseline value, recorded during the 5 min prior to the start of the experiment (see Sect. 2.3 Apparatus and experimental setup), and z-standardized all features.
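The pupil preprocessing steps described above might look roughly like the following sketch. It is not the pipeline actually used in the study: the function names, the representation of blinks as sample-index ranges, and the example values are assumptions, and the final z-standardization across observations is omitted. At 250 Hz, one sample corresponds to 4 ms.

```python
SAMPLE_MS = 4.0  # one sample at 250 Hz


def clean_pupil(samples, blinks, pad_ms=100.0, max_gap_ms=50.0):
    """Mask invalid samples and interpolate short gaps.

    samples: pupil diameters, one value per 4 ms sample
    blinks:  (first_index, last_index) sample ranges, inclusive (assumed format)
    Returns a list in which unusable samples are None.
    """
    n = len(samples)
    # Step 1: non-positive diameters are treated as invalid.
    data = [s if s > 0 else None for s in samples]
    # Step 2: drop data up to pad_ms before and after each blink.
    pad = int(pad_ms / SAMPLE_MS)
    for first, last in blinks:
        for i in range(max(0, first - pad), min(n, last + pad + 1)):
            data[i] = None
    # Step 3: linearly interpolate gaps no longer than max_gap_ms.
    max_gap = int(max_gap_ms / SAMPLE_MS)
    i = 0
    while i < n:
        if data[i] is not None:
            i += 1
            continue
        j = i
        while j < n and data[j] is None:
            j += 1
        # The gap is data[i:j]; it needs valid neighbours on both sides.
        if i > 0 and j < n and (j - i) <= max_gap:
            left, right = data[i - 1], data[j]
            for k in range(i, j):
                frac = (k - i + 1) / (j - i + 1)
                data[k] = left + frac * (right - left)
        i = j
    return data


def baseline_corrected_mean(data, baseline):
    """Mean of the usable samples minus the baseline value."""
    valid = [d for d in data if d is not None]
    if not valid:
        raise ValueError("no valid samples")
    return sum(valid) / len(valid) - baseline


# Made-up example: a two-sample dropout (8 ms) is interpolated.
cleaned = clean_pupil([4.0, 4.0, 0.0, 0.0, 4.6, 4.6], blinks=[])
change = baseline_corrected_mean(cleaned, baseline=4.0)
```

In this sketch, gaps created by blink padding are at least 200 ms long and are therefore never interpolated.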

Analyzed time periods
In the present article, we determined initial burst and initial idle periods individually based on logged in-game activities, with both time periods varying between participants. At the beginning of each level, an initial burst period and an initial idle period were recorded, resulting in six pairs of periods per participant.
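To make the derivation of both periods from log data more concrete, the following is a simplified sketch. The event format and the event names ("assign", "available") are hypothetical, and the sketch assumes that the player assigns all available personnel before any of them becomes free again, which need not hold in real gameplay.

```python
def initial_periods(events):
    """Derive initial burst and initial idle from one level's event log.

    events: (time_s, kind) tuples sorted by time, where kind is
            "assign"    (the player assigns personnel to a task) or
            "available" (personnel become free again or a new task occurs).
    Returns (burst_start, burst_end, idle_end) in seconds.
    """
    burst_start = burst_end = idle_end = None
    for time_s, kind in events:
        if kind == "assign":
            if burst_start is None:
                burst_start = time_s  # the first assignment opens the burst
            burst_end = time_s        # the last assignment closes it
        elif kind == "available" and burst_start is not None:
            idle_end = time_s         # the first freed resource ends the idle
            break
    if burst_start is None or idle_end is None:
        raise ValueError("log contains no complete initial action block")
    return burst_start, burst_end, idle_end


# Made-up log for one level.
log = [(2.0, "assign"), (3.5, "assign"), (6.0, "assign"),
       (21.0, "available"), (22.0, "assign")]
burst_start, burst_end, idle_end = initial_periods(log)
burst_length = burst_end - burst_start
idle_length = idle_end - burst_end
```

Eye-tracking samples would then be selected by timestamp within the two resulting intervals.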

Statistical analysis
We employed linear mixed-effect analyses using statistical software R [82] with the lme4 package [83]. The p-values were obtained by likelihood ratio tests of the full model tested against a reduced model. Further model analyses were applied in case of a significant result, using the report package of Makowski, et al. [84]. Standardized parameters were obtained by fitting a model on a standardized version of the dataset.
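The principle of a likelihood ratio test between a full and a reduced model can be sketched in a few lines. This is only an illustration of the test statistic, not the R code used in the study; the log-likelihood values are made up, and the sketch assumes that the two nested models differ by exactly one parameter (one degree of freedom).

```python
import math


def likelihood_ratio_test(loglik_full, loglik_reduced):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with 1 degree of freedom.
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up log-likelihoods of a full and a reduced model.
stat, p = likelihood_ratio_test(-350.2, -353.1)
print(round(stat, 2), round(p, 3))
```

For models differing by more than one parameter, the chi-square distribution with the corresponding degrees of freedom would be needed instead of the closed form used here.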

Results
In this section, we present in detail the associations between gaze data collected during the initial burst and initial idle time periods and (1) difficulty score, (2) participants' performance, and (3) subjective estimation of cognitive load. To ensure better readability and not overwhelm readers with the vast amount of statistics, we have decided to report only significant results in detail.

Difficulty score
First, we aimed at investigating whether task difficulty affected oculomotor behavior during the initial burst and initial idle time periods. Therefore, we fitted linear mixed models for both time periods to predict eye-tracking data based on difficulty score. We considered difficulty score as a fixed effect (see footnote 1) and added random intercepts for subjects. Table 1 depicts the difficulty scores, calculated per level as the percentage of participants who failed to complete all tasks within the predefined time limit.
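The model structure described in this paragraph can be written out explicitly. The notation below is chosen here for illustration and is not taken from the source: $y_{ij}$ denotes an eye-tracking feature of subject $i$ in level $j$, $d_j$ the difficulty score of level $j$, $u_i$ the random intercept of subject $i$, and $\varepsilon_{ij}$ the residual.

```latex
y_{ij} = \beta_0 + \beta_1 \, d_j + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

In lme4 formula syntax, a model of this form would correspond to something like `feature ~ difficulty + (1 | subject)`, where the column names are hypothetical.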

Initial burst
During the initial burst period we found a significantly negative association of difficulty score with fixation frequency and with saccadic rate, whereas the effect of difficulty on pupil diameter was positive. That is, during the initial burst phase more challenging levels were significantly associated with fewer fixations and saccades as well as with increased pupil diameter (see Fig. 3 and Table 2).

Initial idle
We found no significant association between gaze behavior and difficulty score for any eye-tracking feature within the initial idle phase (see Table 3).

Performance
We fitted linear mixed-effect models for each combination of gaze features and time periods on the relationship between performance and gaze data. We added random intercepts for participants and considered performance as a fixed effect. We also considered the scenario (see footnote 2) as a fixed effect because we were interested in whether performance induces an effect in addition to the scenario.

Footnote 1: Both difficulty score and scenario reflect the same game characteristic. Here, the scenario represents a rough division of the six game levels into two difficulty classes, while the difficulty score sorts these six levels and represents a finely graded difficulty scale. Because these metrics do not have the same numerical order (Scenario 1 includes difficulty levels 2, 35, and 64, while Scenario 2 includes difficulty levels 4, 52, and 71), it is not convenient to include both in the same mixed model analysis. In this case, we were interested in whether eye-tracking features are able to discriminate between finer variations in difficulty, so difficulty score was included in the analyses.

Footnote 2: From the previous analyses, we already know that eye-tracking features recorded during the initial burst phase respond to difficulty variations. In this next step, we aim to discover whether it is possible to distinguish between winners and losers within the same difficulty range. To get a clearer picture of the effect and to account for the real situation, in which we are usually not able to make a reliable and fine-grained scenario classification, we included the scenario as an additional fixed effect in all further analyses.

Initial burst
During the initial burst phase we found a significantly negative effect of scenario on fixation frequency and saccadic rate, whereas the effect of scenario on pupil diameter was positive. That is, while playing the more challenging Train Crash scenario, participants made significantly fewer fixations and saccades, while their pupil diameter was significantly increased compared to the Fire scenario. However, the associations between gaze data and performance showed the opposite pattern: we found a significant positive effect of performance on fixation frequency, saccadic rate, and microsaccadic rate, along with a significant negative effect on blinks. That is, participants who successfully completed the level showed significantly more fixations, saccades, and microsaccades, but blinked significantly less often during the initial burst phase than unsuccessful participants (see Fig. 4 and Table 4).

Initial idle
During initial idle only pupil diameter differed significantly between scenarios, with the more challenging Train Crash scenario being associated with increased pupil diameter. In contrast, performance was significantly positively associated with fixation frequency, saccadic rate, and microsaccadic rate, meaning that participants who completed the level successfully showed significantly more fixations, saccades, and microsaccades than participants who failed. See Table 5 for detailed statistics.

Subjective assessment of cognitive load
To investigate whether gaze behavior was associated with subjectively reported cognitive load, we fitted linear mixed-effect models for each combination of the inspected NASA-TLX items and gaze features on the relationship between subjective cognitive load and gaze data, which resulted in 15 models for each time period. We included random intercepts for participants and considered scenario as a fixed factor.
For all considered NASA-TLX items we found a significant difference between scenarios: the Train Crash scenario was perceived as more demanding than the Fire scenario regarding mental demand, time demand, and effort (see Tables 6 and 7).

Initial burst
In addition to the effect of scenario, during the initial burst period we found a significant negative effect of microsaccades on all three items: mental demand, time demand, and effort. That is, participants who reported being more challenged in terms of mental demand, time demand, and effort exhibited significantly fewer microsaccades than participants who rated their cognitive load lower (see Tables 6 and 7).

Initial idle
We found a significant negative effect of microsaccades on time demand. That is, participants who reported being more challenged regarding time demand and effort made significantly fewer microsaccades during the initial idle phase than less challenged participants. We found no significant effect regarding mental demand (see Table 7).

Additional analyses
We hypothesized that more successful players should experience lower cognitive load, which should be reflected in their gaze behavior and, in particular, in lower ratings of subjective cognitive load among more successful participants. To evaluate this assumption, we fitted three linear mixed-effect models. We included scenario and performance as fixed factors and participants as random intercepts in the models. The association with scenario was significantly positive and the association with performance was significantly negative for all NASA-TLX items (see Table 8). That is, subjective ratings of cognitive load were higher in the more challenging Train Crash scenario, and at the same time more successful players reported lower cognitive load.
Note on the tables: *** significant at the 0.001 level, ** significant at the 0.01 level, * significant at the 0.05 level; fixations, blinks, saccades, and microsaccades denote the number of the respective events in the given time period.

Discussion
Our first goal was to investigate whether participants' cognitive load during a time-critical Emergency serious game can be reliably estimated based on gaze features collected at the beginning of a game session within initial burst and initial idle time periods. As a second goal, we aimed at deepening our understanding of what cognitive processes occur during initial burst and idle phases. To identify these time periods we used behavioral log data interpreted in the light of the TBRS model [7] in line with a recent approach by Sevcenko et al. [8].
In the following we will first describe in detail how the presented approach can be used for prediction of task difficulty, operators' performance, and subjectively perceived cognitive load, then we proceed to discuss which cognitive processes seem to happen during the initial idle phase and demonstrate correctness of the level construction of the used serious game. After that we present strengths and limitations of the study and briefly outline a possible direction of future research, followed by a general conclusion.

Difficulty, performance and cognitive load prediction
In general, the results of the present study substantiated our hypothesis that cognitive load might be predicted using eye-tracking features collected during time intervals related to the initial TADD metric. Indeed, we found significant associations between gaze features during the indicated time periods and difficulty, performance, and cognitive load, although some of these associations were unexpected. In line with our expectations, we found significant associations between performance and gaze behavior during initial idle: successful participants made significantly more fixations, saccades, and microsaccades during this time period than participants who failed the level. Additionally, we found a significant negative association between microsaccadic rates and subjective ratings of cognitive load for both the initial burst and initial idle time periods. Other investigated eye-tracking features showed no significant associations in this regard. In contrast to our expectations, we found strong associations between gaze behavior, difficulty, and performance when considering the initial burst time period. Because we assumed that all participants would play at their cognitive limit during the initial burst phase, we expected no effects during this time. Although gaze features were associated with task difficulty in the expected way, the observed association with performance was in the opposite direction. For instance, we observed that participants performed significantly fewer fixations during the more challenging levels, but this association was significantly less pronounced in more successful participants. The same pattern was also evident for other gaze features. Importantly, this finding seems sensible and might indicate that more successful players experienced lower cognitive load, which is reflected in their gaze behavior. To test this assumption we conducted additional analyses (see Sect. 3.4 Additional Analyses), which showed exactly the same pattern of subjectively reported cognitive load ratings and thus further supported this account. As such, contrary to our expectations, the initial burst might be well suited to determine task difficulty and to-be-expected performance. One possible explanation for this finding is that the time pressure induced by the Emergency serious game was not high enough to make all participants work at their cognitive limit. In this case, the initial burst might be well suited for measuring cognitive load in situations with low to medium time pressure, while for measuring cognitive load under high time pressure other options need to be identified. This hypothesis requires further investigation.
Taken together, our results support the idea that cognitive load and performance during HCI can be successfully captured based on gaze data collected during relatively narrow time windows, the latter derived by a theory-driven approach based on the TBRS model [7].

Cognitive processes during initial idle
Our second goal was to better understand what cognitive processes occur during initial idle, because no loggable in-game actions happen during this period. As expected, we found that successful participants performed significantly more fixations, saccades, and microsaccades during this time than participants who failed the level. Based on this finding and previous evidence [8], it seems that more successful participants tend to use the initial idle time for more intensive visual exploration and monitoring of the game scene [58,61,73].

Manipulation check
Last but not least, it is worth noting that our results substantiated that the levels of difficulty within the Emergency serious game were well constructed and effectively induced different levels of cognitive load. As expected, participants reported the Train Crash scenario to be cognitively more demanding than the Fire scenario. Likewise, the difficulty score, which was calculated for each game level as the percentage of participants who failed that level, indicated that level difficulty increased as expected within the respective scenarios. Furthermore, we found significant negative associations between level of difficulty and both fixation frequency and saccadic rate, suggesting that game levels differed in non-visual cognitive demand components such as strategic planning [56,57,60].

Strengths and limitations
Analytical approaches to HCIs are often based on data-driven probabilistic performance evaluations [85,86,87], which sometimes seem insufficient for estimating operators' cognitive and emotional states. In such cases, (neuro-)physiological methods [for a review see: 88] seem advantageous, although their data are often relatively complex to acquire and evaluate. Another peculiarity of physiological methods is that most of them require physical contact with the operator and therefore can hardly be used for online commercial developments. Based on these considerations, we employed eye tracking in this study, because modern video-based eye-tracking systems may not require physical contact and thus can be used in a very subtle way [for an overview see: 52].
Nevertheless, we think that the main strength of our approach is its theoretically informed top-down development relying on an appropriate theoretical framework (in this case, the TBRS model [7]). In this way, we hope to foster generalizability of this method to similar situations, which might be less feasible with purely data-driven approaches. The TBRS model [7] specifically emphasizes the role of time pressure in inducing cognitive load and therefore seems particularly suitable for predicting cognitive load during time-critical HCI. Moreover, the proposed method allows for early prediction of operator performance and can therefore be used for the development of interactive adaptive HCIs.
However, our approach is not free of limitations. First, the proposed method applies only to a relatively narrow family of situations or tasks with certain characteristics. Further research is needed to determine whether this approach can be applied or easily adapted to other contexts, e.g., interactions without time pressure. Second, as mentioned above, it is not clear whether different degrees of time pressure during HCI must be considered when using this method. Third, the recording frequency of the eye tracker used in the study was 250 Hz, which might be a technical limitation, for instance for detecting microsaccades.

Conclusion
In this paper, we presented a novel theory-driven approach that combines specific eye-tracking features with TBRS theory to predict cognitive load during time-critical resource-managing situations. Eye-tracking data were collected during relatively narrow time windows at the beginning of the interaction with the serious game and thus can potentially be used for real-time adaptation of human-computer interactions. Moreover, the detection of the time periods was based on log data and can easily be run in the background. The obtained results supported the proposed approach: eye-tracking features collected during the initial burst appeared to be well suited to predict performance and task difficulty. Fixation frequency, saccadic rate, and pupil diameter seem to be well suited to predict task difficulty during the initial burst phase. Fixation rate was the best indicator of performance during the initial burst. If an estimate of subjectively perceived cognitive load is required, the microsaccadic rate recorded during the initial action block might be a good option. These results illustrate how theoretical knowledge about the task structure may be used advantageously for the assessment of cognitive load. Although requiring further investigation in terms of reliability and generalizability, the presented approach seems promising for measuring cognitive load in realistic time-critical HCI, considering adaptation to operators' mental needs.
Authors' contributions All authors contributed to the conception of the work. NS, MN, KM and PG conceptualized and designed the study. NS conducted the study. TA preprocessed the data. NS conducted the statistical analyses. NS wrote the first draft of the manuscript, which was edited in several rounds with TA and MN. KM and PG provided the last rounds of edits on the manuscript. All authors revised the final manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL. University of Tuebingen/IWM.

Data Availability Not applicable.
Code availability Not applicable.

Conflicts of interest
The authors declare a conflict of interest: "Author Natalia Sevcenko was employed by the company Daimler Truck AG. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest."

Ethical approval Study was approved by IWM Ethics Committee.
Consent to participate All participants signed written informed consent prior to the experiment.

Consent for publication All authors give their consent for publication.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.