1 Introduction

One recurrent goal in human systems engineering is to have the ability to adapt the use of technology to the workload needs of the operator. This issue has often been explored with the use of adaptive systems whose function and behavior can be adjusted according to changes in the operator’s mental workload state during task performance.

2 Mitigating the adverse effects of workload with adaptive aiding

Adaptive systems that support human performance have been designed and developed with increasing sophistication and complexity over the years (Karwowski 2012; Dorneich et al. 2016). In their most basic form, they are closed-loop control systems that auto-regulate their effects within a changing environment to fulfill certain criteria or maintain a “set point.” To accomplish this aim, the environment is continuously monitored and its state is assessed against the target criterion. When the state deviates from the criterion, the system acts to return the state to the desired level. Refinements to the basic closed-loop system include targeting a range of criterion values, as in an autopilot that keeps an airplane within a safety envelope; control of dynamic behaviors, such as preventing an overshoot of the target state; and tracking a changing criterion, as in adaptive cruise control for vehicles. Adaptive systems are best known in engineering contexts such as vehicle control, but there is increasing interest in building systems that can regulate operator functional states such as workload, stress, and fatigue (e.g., Hockey 2003). That is, performance is regulated indirectly rather than directly, by supporting the operator’s readiness to deal with a range of performance challenges.

The present study addresses adaptive automation designed to limit mental workload as it fluctuates over the duration of a cognitively demanding task (e.g., Freeman et al. 2000; Prinzel et al. 2000; Bailey et al. 2006). Mental workload has been defined in terms of the attentional resources needed to meet task demands (“taskload”), which may be mediated by the operator’s functional state, past experience, and external support (Hockey 2003; Young and Stanton 2005; Matthews et al. 2015a). It results from the combination of task features, environmental factors, and operator characteristics (Young et al. 2015). Extreme levels of workload can be detrimental to task performance (Young and Stanton 2002; Young et al. 2015), so a system that adjusts its behavior to keep the operator’s workload within an optimum range (Hancock and Warm 1989) would be useful. Such an adaptive system would use sensors to measure and monitor the operator’s workload so that, when workload becomes excessive, it can render a suitable aid that relieves taskload and, in turn, reduces workload (Hancock and Caird 1993; Matthews and Reinerman-Jones 2017; Hancock and Matthews 2019). In safety-critical domains, this capability can contribute to reducing the accidents and errors that result from fatigue, attention lapses, distraction, or boredom, which are typically precipitated by operator overload or underload (Brookhuis and Waard 2001; Young et al. 2015).

While there are various measures of workload, several characteristics of psychophysiological workload measures make them suitable for use in adaptive systems. Unlike subjective and self-report measures, they are objective and do not disrupt the task since they do not require any overt response from the operator, who may also not have accurate insight into his or her own level of workload, especially when deeply engaged in the task (Kantowitz and Casper 2017). Psychophysiological workload measures allow continuous monitoring, providing high temporal resolution of operator state. Unlike performance-based workload measures, psychophysiological workload measures can be used to preempt performance declines before operational effectiveness is compromised.

3 Inter-individual variability in psychophysiology

The basis for using psychophysiological measures is that activation of the mental processes required for the task produces a corresponding physiological response that reflects this mental activity. Although large inter-individual variability in mental workload is observed with all workload measures for the same task performed in the same environment, the issue is particularly troublesome with psychophysiological workload measures (Hancock et al. 1985; Roscoe 1993; Johannes and Gaillard 2014). First, there is wide variability in individual physiology. Psychophysiological workload measures reflect a variety of distinct responses. These include central brain activity assessed using the electroencephalogram (EEG), and peripheral systems such as pupil diameter and cardiac activity measured with the electrocardiogram (ECG). Workload is also indexed by slower hemodynamic responses reflecting metabolic activity, i.e., cerebral blood flow velocity (CBFV) and regional oxygen saturation (rSO2). Workload responses differ with age, gender, cardiovascular fitness, and physical health. For instance, hypertension impacts ECG signals and cerebral blood flow, and there are age differences in EEG and pupil diameter (Birren et al. 1950; Bill and Linder 1976; Winn et al. 1994; Pierce et al. 2003; Ang and Lang 2008).

Individuals’ psychological responses to the same task demands also vary widely, which can require multiple measures for assessment. Even a well-defined task may not implicate the same set of mental processes in different individuals since measures that index autonomic and central nervous system function can dissociate. For instance, in performing the same task, one individual may show greater changes in brain activity while another may show more changes in cardiac activity (Matthews et al. 2015b). There are also other reasons to use multiple psychophysiological measures in adaptive systems. Different measures are sensitive to different task demands such that one measure may capture the levels of certain workload manipulations, while others may not (Wilson and O’Donnell 1988; Matthews et al. 2015b). For example, an EEG-based workload index and eye fixation durations were sensitive to the single/dual task workload manipulation, but rSO2 and heart rate variability (HRV) discriminated between different levels of certain single tasks instead (Matthews et al. 2015b). This finding suggests that especially for multitasking environments, the determination of workload levels should not be based only on one measure.

4 Workload models for adaptive aiding

In neuro-ergonomic applications, adaptive systems can alter the extent and schedule of the aid in response to the operator’s changing needs throughout the course of a task. In doing so, they can minimize many unintended consequences of indiscriminate and persistent aiding (Carmody and Gluckman 1993; Parasuraman et al. 1993; Endsley and Kiris 1995). Adaptive systems invoked by psychophysiological measures of workload can preempt performance declines and do not depend on operators being aware of their need for aid or making explicit requests for it. Such systems rely on a workload model based on psychophysiological measures to drive the schedule of adaptive aid. The model encodes when an excessive workload level is reached so that adaptive aid can be provided. To do so, it must be sensitive to, and able to differentiate, meaningful workload levels (e.g., levels that relate to different levels of performance). It should also provide diagnostic information about the measures associated with the need for aid, to help design appropriate aiding behaviors. For example, knowing that an adaptive aid was triggered by unusually high ocular activity, the system can work to relieve the visual demand.

This requirement for transparency is often cited among the limitations of using artificial neural networks (ANN), support vector machines (SVM), and other machine learning algorithms that include non-linear techniques for workload modeling (Mittelstadt et al. 2016). While some of these can produce models with very high accuracy rates (e.g., Wilson and Russell 2003a, b; Yeo et al. 2009; Baldwin and Penaranda 2012), not all are suitable for use in all adaptive systems or real-time applications. Some machine learning algorithms have limited diagnosticity because their features, rules, and criteria are less “transparent” and often inscrutable: it is typically unclear how the multiple inputs are selected and combined to predict the target outcome. This inscrutability can have serious implications for the user’s trust in adaptive systems (Knight 2017; Ribeiro et al. 2016).

In addition, the model should include a variety of measures to detect a range of workload responses capturing the operator’s workload, as well as accommodate large inter-individual variability in psychophysiology. A recent review of more than 20 workload assessment algorithms developed for use in several environments revealed that none of the algorithms reviewed fulfill all these requirements, and most do not generalize across individuals (Heard et al. 2018). Table 1 includes some of the commonly encountered psychophysiological workload measures.

Table 1 Common psychophysiological workload measures

5 An individualized workload model

We sought to develop a workload model (Teo et al. 2016) that met the following criteria:

  • It must reliably distinguish between low and high workload and must identify when high workload is reached in real time.

    • Justification: For the adaptive aid to be useful, the system needs to identify low and high levels of operator workload and respond appropriately. Aid that does not match the workload level is not as helpful (see Teo et al. 2018).

  • It must customize the set of workload measures to the individual to optimize sensitivity on an individual basis.

    • Justification: The large inter-individual variability in physiological responses to workload precludes the use of a common set of psychophysiological workload measures and target criteria for adaptation across all individuals. Having a workload model that can specify the best set of measures for each individual will improve the system’s ability to identify excessive workload for that person.

  • It must incorporate multiple measures that assess a range of psychophysiological workload responses to capture the complete workload state of the individual and increase diagnosticity.

    • Justification: Including multiple measures that are differentially sensitive to various cognitive processes activated by the tasking should yield richer information about the source or nature of the workload for the individual, which can be used to improve the quality of the adaptive aid.

5.1 Identifying the onset of high workload

First, a robust workload manipulation which does not produce any taskload-workload dissociations (Yeh and Wickens 1988) is required in order to capture the pattern of psychophysiological responses associated with low and high workload through manipulating taskload. Data from a previous study, Study 1 (Abich et al. 2013), were used to develop the workload model (i.e., Study 1 data served as the training dataset). Study 1 manipulated low and high workload with single vs. dual tasking, which, from the performance results and subjective workload measures, was shown to consistently yield the needed workload manipulation. The scenarios from Study 1 can be found in Table 2.

Table 2 Manipulation of workload levels in Study 1

The basis of the workload model lies in comparing two sets of change scores, calculated from a workload algorithm based on multiple psychophysiological indicators. The first set comprises the change scores on the algorithm value between conditions known to elicit low and high workload, i.e., the “baseline difference scores.” To determine the workload level induced by a new condition (so that adaptive aid can be rendered if workload is high), a second set of difference scores is formed. This set consists of the psychophysiological change between the original low workload condition and the new condition eliciting an unknown level of workload, i.e., the “test difference scores.” If the “baseline difference scores” and “test difference scores” are similar in magnitude and direction (i.e., they match), then the new condition is considered to have induced a workload level comparable with the original high workload condition. To illustrate, Fig. 1 depicts three hypothetical sets of difference scores. The Baseline difference scores and Test difference scores 1 are similar in magnitude and direction (i.e., they match), indicating that New Task 1 elicited workload as high as the dual task. The Baseline difference scores and Test difference scores 2, however, do not match, indicating that New Task 2 did not induce high workload (see Fig. 1).
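This comparison can be illustrated with a short sketch (Python; the measure names and z-scored condition means below are hypothetical illustrations, not data from Study 1):

```python
def difference_scores(low, other):
    """Change in each standardized measure from the low-workload condition."""
    return {m: other[m] - low[m] for m in low}

# Hypothetical z-scored condition means for one participant.
low  = {"HRV": 0.1, "IBI": -0.2, "theta_SPD": 0.0}   # single task (low workload)
high = {"HRV": -0.8, "IBI": -1.1, "theta_SPD": 0.9}  # dual task (high workload)
new  = {"HRV": -0.7, "IBI": -1.0, "theta_SPD": 0.8}  # unknown condition

baseline = difference_scores(low, high)  # "baseline difference scores"
test     = difference_scores(low, new)   # "test difference scores"
```

Here the baseline and test sets change in the same direction and by similar amounts on every measure, so the new condition would be judged to have induced high workload.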

Fig. 1 Comparing difference scores to determine workload level elicited by a new task (y-axis shows hypothetical values of psychophysiological measures)

By pairing the various scenarios in Study 1, we obtained multiple sets of difference scores with which we could develop the workload model (see Table 3).

Table 3 Sets from Study 1 scenarios

5.2 Individualization

Measures sensitive to workload changes for the individual are those on which the individual shows a large change going from a low to a high workload condition. For instance, for one individual, the measure “number of eye fixations” may show a large change between the low and high workload conditions, indicating that the workload increase could be related to heavier visual demand. For a second individual, the measure showing a large change could be HRV instead. Such information contributes to diagnosticity, as it suggests that an aid supporting visual processing may benefit the first individual more than the second.

There are different ways of specifying algorithms that capture, in a single index, the workload responses that are diagnostic for the individual. One approach is to weight responses according to their sensitivity. Another is to select only those responses that show large changes in response to increases in taskload. For some algorithms, “a large change” is defined as a change of 0.5 standard deviations (SD) or more between the low and high workload conditions; measures showing such a change are designated the individual’s workload “markers.” For other algorithms, the individual’s workload “markers” are the few measures that register the largest changes between low and high workload. In both approaches, only the measures most sensitive for the individual (i.e., his/her workload “markers”) are used to compute the workload index. Expressing the algorithms as “rules” in this way makes them generalizable across different individuals and populations while still accommodating inter-individual differences in psychophysiology.
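Both marker-selection rules can be sketched as follows (Python; the measure names, values, and 0.5 SD threshold default are illustrative, with the threshold taken from the rule described above):

```python
def select_markers(baseline_diff, threshold=0.5):
    """0.5 SD rule: measures whose absolute standardized change from low
    to high workload is at least `threshold` become the markers."""
    return {m for m, d in baseline_diff.items() if abs(d) >= threshold}

def top_k_markers(baseline_diff, k=2):
    """Alternative rule: keep only the k measures with the largest change."""
    ranked = sorted(baseline_diff, key=lambda m: abs(baseline_diff[m]),
                    reverse=True)
    return set(ranked[:k])

# Hypothetical baseline difference scores for one participant (z-units).
baseline = {"HRV": -0.9, "IBI": -0.7, "theta_SPD": 0.9, "left_CBFV": 0.2}
print(select_markers(baseline))      # measures exceeding the 0.5 SD criterion
print(top_k_markers(baseline, k=2))  # the two largest responders
```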

5.3 Combining multiple measures

We examined a total of 26 psychophysiological measures, including many listed in Table 1, i.e., EEG, ECG, CBFV, rSO2, eye fixation duration, Index of Cognitive Activity (ICA) (Marshall 2002), and pupil diameter measures, as potential workload markers. These measures were selected for their sensitivity to the workload induced by the tasks used, as shown in previous studies (Abich et al. 2013; Matthews et al. 2015b).

Scores from the multiple psychophysiological measures are first standardized to remove scale differences across the measures. Standardization of all scores allows a single workload index to be computed by combining multiple psychophysiological measures. The sets of “baseline difference scores” and “test difference scores” are obtained by combining the standardized values of multiple measures.
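The standardization step amounts to z-scoring each measure (a minimal sketch; the raw HRV values are hypothetical):

```python
from statistics import mean, stdev

def standardize(values):
    """z-score one measure's raw scores to remove scale differences,
    so measures on different units can be combined into one index."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

hrv_raw = [52.0, 48.0, 61.0, 55.0]   # hypothetical raw HRV values
hrv_z = standardize(hrv_raw)
print([round(z, 2) for z in hrv_z])  # mean 0, SD 1 after standardization
```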

Algorithms that quantified the similarity in psychophysiological changes across multiple measures, or sets of difference scores, were generated. The algorithms combined the values on these measures in different ways to yield a single workload index. To be implemented in the adaptive system, a cutoff score is then required to specify index values that indicate when high workload is reached.

5.4 Formulation of algorithms for the workload model

Various algorithms were devised to compute a workload index that reflected individual variability in psychophysiological responses to workload and incorporated multiple measures. These either weight responses according to their sensitivity or focus only on responses that show large changes in response to increases in taskload. The workload index under each algorithm quantified the similarity of these responses between the high workload-inducing dual task and the new task condition by comparing the baseline difference scores and test difference scores.

In addition to the two sets formed from Study 1’s scenarios, two more sets were formed to compare algorithm performance against at-chance accuracy. These sets used a separate data set comprising values drawn randomly from a theoretical normal distribution (all psychophysiological data having been standardized at this point); the random data served as the data from a new, unknown condition (see Table 4).

Table 4 Sets from Study 1 scenarios and random data

Unlike sets #1 and #2, in which the baseline and test difference scores are expected to reflect similar patterns of psychophysiological changes, the baseline and test difference scores in both sets #3 and #4 were not expected to match. Poor-performing algorithms may not yield workload indices that concur with these expectations.

5.4.1 Algorithm 1

In Algorithm 1, a workload index to quantify the similarity of psychophysiological changes is computed as the proportion of markers that show the same large change in workload response in the new condition. Computing the index as a proportion ensured a fixed range of values under this algorithm from 0 to 1. The more similar the workload response elicited by the new unknown condition is to that induced by the original high workload condition, the more the workload index would approach the value of 1 since in both instances, the same measures would have registered similarly large changes from the low workload condition. Examples of how the workload index would be computed under Algorithm 1 are as follows:

$$ \text{Workload index under Algorithm 1} = \frac{\text{Markers observed in both the Baseline and Test difference scores}}{\text{Markers observed in the Baseline difference scores}} $$

$$ \text{Conceptually, the workload index under Algorithm 1} = \frac{\text{Workload response common to both the new and high workload conditions}}{\text{Workload response in the high workload condition}} $$

Example 1

Workload index for an individual with 4 markers (i.e., HRV mean, interbeat-interval (IBI) mean, theta frontal mean spectral power density (SPD), left mean CBFV) when the new condition induces a workload level similar to that of the known high workload condition. The workload index value is high, at 0.75:

$$ \text{Workload index under Algorithm 1 for Example 1} = \frac{\text{HRV, IBI, Theta frontal SPD}}{\text{HRV, IBI, Theta frontal SPD, Left CBFV}} = \frac{3}{4} = 0.75 $$

Example 2

Workload index for an individual with 4 markers (i.e., HRV mean, IBI mean, theta frontal mean SPD, left mean CBFV) when the new condition induces a workload level different from that of the known high workload condition. The workload index value is low, at 0.25:

$$ \text{Workload index under Algorithm 1 for Example 2} = \frac{\text{HRV}}{\text{HRV, IBI, Theta frontal SPD, Left CBFV}} = \frac{1}{4} = 0.25 $$
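The two worked examples reduce to a one-line computation (an illustrative Python sketch; the marker names are those from the examples, treated as hypothetical set members):

```python
def algorithm1_index(baseline_markers, test_markers):
    """Algorithm 1: proportion of the individual's markers that also show
    a large change in the test difference scores (range 0 to 1)."""
    return len(baseline_markers & test_markers) / len(baseline_markers)

markers = {"HRV", "IBI", "theta_frontal_SPD", "left_CBFV"}   # 4 markers

# Example 1: three of the four markers recur in the test difference scores.
print(algorithm1_index(markers, {"HRV", "IBI", "theta_frontal_SPD"}))  # 0.75
# Example 2: only one marker recurs.
print(algorithm1_index(markers, {"HRV"}))                              # 0.25
```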

Workload index under Algorithm 1

For this algorithm, similarity in psychophysiological response was indicated by the proportion of the individual’s “markers” that registered a large workload change in both sets of difference scores. “Test difference scores” computed from single-dual task differences (i.e., Sets #1 and #2) would be expected to match the “baseline difference scores,” whereas “test difference scores” computed with random data (i.e., Sets #3 and #4) would not (see Table 5).

Table 5 Algorithm 1: workload index means and std. dev.

The effect size (Cohen’s d) between Sets #1 and #3 (which use the same baseline difference scores) is 1.347, while that between Sets #2 and #4 is 1.086, indicating that Algorithm 1 was well able to distinguish data from an actual high workload condition from random data. These values represent large effect sizes by Cohen’s (1988) criteria.

5.4.2 Algorithm 2

The workload index quantifying the similarity of the two sets of change scores for Algorithm 2 is the Euclidean distance between them, with smaller distances denoting greater similarity. The index can be individualized by including only the individual’s own set of markers in the distance computation. Whereas Algorithm 1 selects a subset of the available response measures for each individual, Algorithm 2 incorporates information from all responses, even those relatively insensitive for an individual. The Algorithm 2 workload index is computed as follows:

$$ \text{Workload index, } d(x, y) = \sqrt{\sum_{i=1}^{n} \left( x_i - y_i \right)^2} $$

where $x_i$ is the “baseline difference score” and $y_i$ the “test difference score” for psychophysiological measure i (e.g., i = 1 denotes heart rate variability or HRV, i = 2 denotes interbeat interval or IBI, etc.), and n is the number of measures.
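The distance computation can be sketched directly from the formula (Python; the measure names and difference-score values are hypothetical):

```python
import math

def algorithm2_index(baseline_diff, test_diff):
    """Algorithm 2: Euclidean distance between the baseline and test
    difference-score vectors; smaller values indicate greater similarity."""
    return math.sqrt(sum((baseline_diff[m] - test_diff[m]) ** 2
                         for m in baseline_diff))

# Hypothetical difference scores (z-units) for one participant.
baseline = {"HRV": -0.9, "IBI": -0.9, "theta_SPD": 0.9}
similar  = {"HRV": -0.8, "IBI": -0.9, "theta_SPD": 0.8}   # matching response
random_d = {"HRV": 0.4, "IBI": -0.1, "theta_SPD": -1.2}   # random data

print(algorithm2_index(baseline, similar))    # small distance
print(algorithm2_index(baseline, random_d))   # larger distance
```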

Workload index under Algorithm 2

Similarity in psychophysiological response, for this algorithm, was reflected as the Euclidean distance between the sets of all difference scores. As before, “test difference scores” computed from single-dual task differences (i.e., Set #1 and Set #2) should match and be “nearer” (i.e., smaller distance) to the “baseline difference scores”, while the distance between “test difference scores” computed with random data (i.e., Set #3 and Set #4) and “baseline difference scores” should be larger (Table 6).

Table 6 Algorithm 2: workload index means and std. dev.

Algorithm 2 was also well able to distinguish data from an actual high workload condition from random data. The effect size (Cohen’s d) between Sets #1 and #3 is 1.519, while that between Sets #2 and #4 is 1.343.

5.4.3 Algorithm 3

Similarity of the psychophysiological change is quantified as the number of workload measures whose signs match between the “baseline difference score” and “test difference score.” Matched signs indicate that the direction of change between conditions is similar. The more similar the change in psychophysiological workload response between the sets of “baseline difference scores” and “test difference scores,” the greater the number of matched signs relative to what would occur by chance. The workload index is computed as the number of measures for which the signs of the “baseline difference score” and “test difference score” match. Since 26 psychophysiological measures were used, the workload index under Algorithm 3 ranges from 0 to 26. Like Algorithm 2, this algorithm utilizes information from all responses, but on a categorical rather than a continuous basis.
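The sign-matching count can be sketched as follows (Python; four hypothetical measures stand in for the study’s 26):

```python
def algorithm3_index(baseline_diff, test_diff):
    """Algorithm 3: number of measures whose baseline and test difference
    scores share the same sign (range 0 to the number of measures)."""
    return sum((baseline_diff[m] >= 0) == (test_diff[m] >= 0)
               for m in baseline_diff)

# Hypothetical difference scores (z-units) for one participant.
base_d = {"HRV": -0.9, "IBI": -0.9, "theta_SPD": 0.9, "left_CBFV": 0.2}
test_d = {"HRV": -0.7, "IBI": -1.0, "theta_SPD": 0.8, "left_CBFV": -0.3}
print(algorithm3_index(base_d, test_d))   # 3 of 4 signs match
```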

Workload index under Algorithm 3

For this algorithm, the more similar the changes in psychophysiological responses were, the greater the number of matched signs between the sets of difference scores (Table 7). Index values for Sets #1 and #2 indicate greater match between the “baseline difference scores” and “test difference scores” while values for Sets #3 and #4, which involve random data, show poorer match.

Table 7 Algorithm 3: workload index means and std. dev.

Between Sets #1 and #3, the effect size (Cohen’s d) is 1.074, while that between Sets #2 and #4 is 0.962, indicating that Algorithm 3 was able to distinguish data from an actual high workload condition from random data. However, the ds were somewhat smaller than those for Algorithms 1 and 2.

5.4.4 Algorithm 4

For this algorithm, the top two psychophysiological “markers” for the individual are first identified from the “baseline difference scores” (i.e., the two measures that showed the largest difference between the original low and high workload-inducing conditions). Since only the top two markers are included, the workload index ranges from 0 to 2, with values approaching 2 when the “baseline difference scores” and “test difference scores” are similar in magnitude and direction. A derivative of this algorithm requires the change in the “test difference scores” to be in the same direction as, but only at least half the magnitude of, that in the “baseline difference scores.” This algorithm reverts to selecting key markers on an individual basis, focusing on those most responsive to workload.
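A sketch of this algorithm follows (Python; the exact matching criterion, requiring the test change to reach the full, or in the derivative half the, baseline magnitude in the same direction, is our reading of the description, and the data are hypothetical):

```python
def algorithm4_index(base_d, test_d, k=2, half_magnitude=False):
    """Algorithm 4 (assumed criterion): count how many of the top-k
    baseline markers change in the same direction, and by a comparable
    (or, in the derivative, at least half the) magnitude, in the test
    difference scores; range 0 to k."""
    top = sorted(base_d, key=lambda m: abs(base_d[m]), reverse=True)[:k]
    count = 0
    for m in top:
        same_dir = (base_d[m] >= 0) == (test_d[m] >= 0)
        needed = abs(base_d[m]) / 2 if half_magnitude else abs(base_d[m])
        if same_dir and abs(test_d[m]) >= needed:
            count += 1
    return count

# Hypothetical difference scores for one participant.
base_d = {"HRV": -1.2, "IBI": -1.0, "theta_SPD": 0.4}
test_d = {"HRV": -1.1, "IBI": -1.05, "theta_SPD": 0.1}
print(algorithm4_index(base_d, test_d))                        # 1
print(algorithm4_index(base_d, test_d, half_magnitude=True))   # 2
```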

Workload index under Algorithm 4

Under this algorithm, similarity in psychophysiological response was the extent to which the individual’s top 2 “markers” showed the greatest change in both sets of difference scores. Very similar sets of difference scores (i.e., Sets #1 and #2) should yield values close to 2 (Table 8).

Table 8 Algorithm 4: workload index means and std. dev.

Although Algorithm 4 was also able to distinguish data from an actual high workload condition from random data, the effect sizes were somewhat lower than those for the other algorithms. Cohen’s d between Sets #1 and #3 (which use the same baseline difference scores) ranged from 0.57 to 1.075, while that between Sets #2 and #4 ranged from 0.679 to 1.292.

All four algorithms seemed able to distinguish the psychophysiological changes resulting from high workload from random data. However, both Algorithms 3 and 4 produce discrete values that may limit their use. Algorithm 3 defines similarity only in terms of the direction of the psychophysiological change, without a criterion for change magnitude. Closer examination of the workload index values from Algorithm 4 showed that even when the “baseline” and “test” difference scores were supposed to match (i.e., both from single and dual task conditions), most of the participants had index values that did not reflect this similarity. Additionally, the range of values under Algorithm 4 is limited as it is equal to the number of “markers” considered to be “top markers”. Increasing the range of index values may result in including “markers” that are not as sensitive for the individual. The effect sizes for Algorithms 3 and 4 were also lower than those for Algorithms 1 and 2. For these reasons, only Algorithms 1 and 2 were selected for further analyses and evaluation.

6 Evaluation of workload models

The workload models generated with Algorithms 1 and 2 were further evaluated in a mock-up of an adaptive aiding system using Study 1 data, both to select threshold values and to assess the sensitivity of those thresholds.

6.1 Mock-up of the workload model in an adaptive aiding system

In the mock-up, 2-min blocks of data were streamed into the system as “live” samples (i.e., a 2-min “rolling” window) every 30 s, such that consecutive samples overlapped by 1.5 min. In place of a static set of “test difference scores,” a set of “rolling test difference scores” is updated every 30 s to reflect the individual’s psychophysiological responses during the new condition inducing an unknown level of workload. With Study 1 scenarios and data, four more sets of “baseline difference scores” and “rolling test difference scores” were generated to compare index values for conditions that matched to differing extents (see Table 9).

Table 9 Study 1 scenarios yielding various sets for the mock-up
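The 2-min window advanced every 30 s can be sketched as follows (an illustrative Python fragment; the 1-Hz sampling rate and the data stream are assumptions made for simplicity):

```python
def rolling_windows(samples, window_s=120, step_s=30, rate_hz=1):
    """Yield successive 2-min windows of 'live' data advanced every 30 s,
    so consecutive windows overlap by 90 s of data."""
    win = int(window_s * rate_hz)
    step = int(step_s * rate_hz)
    for start in range(0, len(samples) - win + 1, step):
        yield samples[start:start + win]

stream = list(range(300))             # 5 min of hypothetical 1-Hz samples
windows = list(rolling_windows(stream))
print(len(windows))                   # 7 windows over the 5-min stream
print(windows[1][0] - windows[0][0])  # consecutive windows start 30 s apart
```

Each window would be summarized into a fresh set of “rolling test difference scores” before the index is recomputed.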

From the mock-up, a potential threshold or cutoff score (i.e., the solid horizontal line in the figures below) was determined. This is the workload index value that differentiated similar sets of “baseline” and “rolling test” difference scores from dissimilar sets.

The mock-up with Algorithm 1 resulted in the expected order of similarity across all samples. The most similar sets of “baseline” and “rolling test” difference scores (i.e., Set #5) had the highest index values, followed by the next most similar sets (i.e., Set #6), then by Set #7, and finally by Set #8, which had the lowest index values denoting the least similarity. A possible cutoff score for this algorithm was 0.62 (see Fig. 2).

Fig. 2 Mock-up of adaptive system with Algorithm 1

With Algorithm 2, the expected order of sets was not observed: Set #6, which comprised difference scores that should be more closely matched than those of Set #7, instead had index values indicating lower similarity. In addition, the potential cutoff score of 7.2 may still result in misclassifications. For these reasons, Algorithm 2 was eliminated from further consideration (see Fig. 3).

Fig. 3 Mock-up of adaptive system with Algorithm 2 (smaller values indicate greater similarity)

This result prompted two derivatives of Algorithm 2 to be formulated. Algorithm 2a included only the top 5 measures showing the greatest magnitude of psychophysiological change between single and dual task, while Algorithm 2b included the top 10 measures in the workload index computation. For both Algorithms 2a and 2b, the expected order of set similarity was observed, although the separation between the similar sets (i.e., Sets #5 and #6) and the dissimilar sets (i.e., Sets #7 and #8) was not distinct enough for a cutoff score to be established for either new algorithm (see Figs. 4 and 5).

Fig. 4 Mock-up of adaptive system with Algorithm 2a (smaller values indicate greater similarity)

Fig. 5 Mock-up of adaptive system with Algorithm 2b (smaller values indicate greater similarity)

6.2 Sensitivity of workload models and thresholds

The adaptive system, with the appropriate cutoff score, should detect when participants are in conditions that induce high workload (i.e., dual task in this case). A signal detection paradigm can be applied to evaluate the sensitivity of the system. When the system correctly identifies the high workload-inducing condition, it has made a “Hit.” “Misses” occur when the system fails to identify the onset of high workload. “False Alarms” are instances when the system triggers aid during a low workload-inducing condition, and “Correct Rejections” occur when no aid is provided during a low workload-inducing condition (Table 10).

Table 10 Signal detection outcomes from the mock-up

The optimal cutoff score would show high sensitivity (d′), a signal detection measure, as it maximizes “Hits” and “Correct Rejections” while minimizing “Misses” and “False Alarms” (FAs). Sensitivity was computed as follows:

$$ \mathrm{Sensitivity}\ \left(d^{\prime}\right) = Z\left(\text{proportion of ``hits''}\right) - Z\left(\text{proportion of ``FAs''}\right) $$
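A minimal sketch of this computation, using Python's standard-library inverse-normal transform for Z; the outcome counts are illustrative, not the study's values:

```python
from statistics import NormalDist

def d_prime(hits, misses, fas, crs):
    """Sensitivity d' = Z(hit rate) - Z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF (the Z transform)
    hit_rate = hits / (hits + misses)  # proportion of high-workload trials detected
    fa_rate = fas / (fas + crs)        # proportion of low-workload trials mis-flagged
    return z(hit_rate) - z(fa_rate)

# Example: 18 hits, 2 misses, 4 false alarms, 16 correct rejections
print(round(d_prime(18, 2, 4, 16), 2))  # prints 2.12
```

Note that hit or FA proportions of exactly 0 or 1 would make the Z transform undefined; in practice such rates must be adjusted (e.g., with a log-linear correction) before d′ is computed.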

With data from Study 1, hit, miss, false alarm, and correct rejection rates were computed for Set #5 (most similar psychophysiological response) and Set #8 (most dissimilar psychophysiological response) using the most plausible thresholds of Algorithms 1, 2a, and 2b. Results favored Algorithm 1 at the 0.62 cutoff (Table 11).

Table 11 Study 1 sets with various algorithms at proposed thresholds

7 Testing the workload models

7.1 Robustness of models to different workload manipulations

The workload model under Algorithm 1 was next tested on a separate sample of participants. We also examined whether the model could identify high workload from dual tasking elicited by a slightly different set of tasks. In addition, we explored the use of event rate to manipulate workload.

Study 2 used the change detection (CD) task and a monitoring task (MT) to create single and dual tasking that elicited the low and high workload conditions. There were three levels of the monitoring task that differed in event rate. The scenarios in Study 2 were as follows (see Table 12):

Table 12 Manipulation of workload levels in Study 2

The scenarios were combined to create the following sets of baseline and test difference scores (see Table 13):

Table 13 Study 2 scenarios yielding sets of test data with alternative workload manipulations

These sets tested the workload model in the ways shown in Table 14.

Table 14 Testing robustness of the workload model

The workload index based on Algorithm 1 was computed with these sets using data from the Study 2 participants (see Table 15).

Table 15 Algorithm 1: workload index values with alternative workload manipulations

Comparing the values from these sets to the values from Sets #1 to #4 (i.e., Table 5), the workload model generalized to a different sample and to slightly different tasks, so long as the same single-dual tasking workload manipulation was used. The model performed less well with the event rate manipulation of workload or with the mixed manipulation, probably because psychophysiological responses differ across workload manipulations (Matthews et al. 2015b).

7.2 Distribution of workload index values

The distribution and range of workload index values obtained with Algorithm 1 showed that it was able to identify workload changes from single-dual tasking. The distribution generated from data in which both the “baseline” and “test” difference scores came from single-dual task manipulations (i.e., the graphs for Sets #1, #2, and #5, shown as filled circles) was distinct from the distribution involving random data (i.e., the graphs for Sets #3 and #4, shown as open circles). Furthermore, 50% of the workload index values from matched conditions (i.e., both “baseline” and “test” difference scores reflecting changes between single and dual task conditions) were at least 0.57 (solid arrow), while 90% of the values from unmatched conditions involving random data were below 0.50 (dotted arrow) (Fig. 6).

Fig. 6

Distribution of workload index values for Algorithm 1 (Sets #1 through #4 and Set #5 from Study 1 data)
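The separation between the two distributions can be expressed as a simple percentile check; the index values below are simulated for illustration and are not the study's data:

```python
def fraction_at_least(values, threshold):
    """Fraction of index values at or above a threshold."""
    return sum(v >= threshold for v in values) / len(values)

# Simulated Algorithm 1 workload index values:
matched = [0.71, 0.66, 0.59, 0.57, 0.63, 0.48, 0.70, 0.55]    # single-dual vs. single-dual
unmatched = [0.21, 0.35, 0.12, 0.44, 0.30, 0.52, 0.18, 0.27]  # involving random data

# Matched sets should mostly reach 0.57; unmatched sets should mostly fall below 0.50.
print(fraction_at_least(matched, 0.57))    # prints 0.75
print(fraction_at_least(unmatched, 0.50))  # prints 0.125
```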

Such distributions indicate that the workload index under Algorithm 1 would be able to identify when high workload is reached. In a separate study (Teo et al. 2018), this workload model (i.e., Algorithm 1 with the 0.62 cutoff) was implemented in an adaptive aiding system driven by workload-related psychophysiological changes. In that study, participants who received adaptive aid showed greater performance improvements than those whose aid was not adaptive.
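Assuming the Algorithm 1 index is computed once per task epoch, the triggering rule implied by the 0.62 cutoff might look like the following; the streaming interface and values are hypothetical illustrations, not the implementation of Teo et al. (2018):

```python
CUTOFF = 0.62  # Algorithm 1 threshold selected via the sensitivity analysis

def should_trigger_aid(workload_index, cutoff=CUTOFF):
    """True when the index indicates a high-workload (dual-task-like) state."""
    return workload_index >= cutoff

# Illustrative index values over successive task epochs:
stream = [0.41, 0.55, 0.63, 0.70, 0.58]
decisions = [should_trigger_aid(v) for v in stream]
# decisions -> [False, False, True, True, False]
```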

8 Future work and conclusions

An individualized workload model was developed to drive adaptive aiding. The methodology enabled various psychophysiological measures with different scale properties and sampling rates to be combined into a single workload index, formulated to accommodate the inter-individual variability in psychophysiological responses that is a major challenge in workload modeling. Comparing workload index values generated from random data provided a means to evaluate algorithm performance against chance level, while the sensitivity analysis provided a way to assess the selected threshold. Generalizability of the workload model was assessed with alternative workload manipulations. This methodology produced a viable model that incorporated multiple workload measures and accommodated individual variability in psychophysiological workload responses. The model was used with some success in an adaptive aiding system (Teo et al. 2018). Nevertheless, follow-on work is needed to improve the model's generalizability to other workload manipulations, as well as its sensitivity and specificity. It is also important to develop adaptive aiding that remains robust when task demands change dynamically and unpredictably.

The present work touches on several issues concerning workload and system design. First, the relationship between workload and performance is hardly straightforward and can be difficult to characterize. Operators' behavioral or compensatory strategies can produce different workload-performance relationships (i.e., associations, dissociations, insensitivities, linear, non-linear) (Yeh and Wickens 1988; Hancock and Matthews 2019). Second, different psychophysiological measures operate at different intrinsic frequencies, which affects the temporal resolution of workload characterization. For example, changes in EEG can be measured in milliseconds, while changes in heart rate are detected over seconds (Hancock and Matthews 2019). Designers of system aiding behaviors must also consider the effects of the aid and other task changes, since operator workload is susceptible to hysteresis effects (Cox-Fuenzalida 2007; Hancock and Matthews 2019).

A workload model that provides insight into individual operators' workload responses across various tasks offers a valuable opportunity for designing all manner of individualized technological aids and interventions. Although much work remains to be accomplished towards this end, the present work provides impetus for continued effort toward this vision.