1 Introduction

The construct of mental workload – defined as the degree to which mental resources are consumed by the task at hand – is difficult to quantify. This is due largely to the nature of the construct: a latent or hidden variable that results from an interplay of several other variables, such as the objective task load, external distractions (task-irrelevant stimuli that draw one’s attention and temporarily occupy mental resources), internal distractions (e.g. task-related stress, task-irrelevant mentation), the capacity of one’s mental resources, and the strategy of their utilization (Fig. 1). The overall capacity of mental resources and the strategy of allocating them to tasks are, in turn, strongly dependent on individual traits (e.g. personality profile, stress resiliency), previous training, and factors such as motivation, fatigue and stress. Furthermore, for any given individual, different types of mental resources, such as attention, audio-visual perception, cognition or motor control, will be mobilized to a different degree in different tasks even though the subjective perception of their ‘difficulty’ may be the same [1, 2]. An ideal measure of mental workload therefore needs to be multifaceted and diagnostic, such that it can quantify the engagement level of each mental resource before eventually combining them into a single global measure. Moreover, it should be able to model the impact of individual traits and psychophysiological states on the capacity and utilization of mental resources to a degree that does not hamper its ease of use.

Fig. 1.

Schematic presentation of the concept of mental workload and its relationship with pertinent variables: task load (TL), capacity and management of mental resources (MR, MS), individual traits and states (e.g. fatigue), and external and internal distractions (ED and ID).

The standard techniques for workload assessment include self-report scales, performance-based metrics, and physiological measures. Self-report scales are popular due to their low cost and consistency (assuming that the individual is cooperative and capable of introspection). Some of these scales are one-dimensional, such as the Rating Scale Mental Effort (RSME) and the Modified Cooper-Harper scale (MCH) [3], whereas others comprise subscales that measure specific mental resources, e.g., the NASA Task Load Index (TLX) [6], the Subjective Workload Assessment Technique (SWAT) [5], and the Visual Auditory Cognitive Psychomotor method (VACP) [4]. The major drawback of these measures is that they cannot be administered unobtrusively during the task, but are assessed retrospectively at its conclusion. Furthermore, the inherent subjectivity of self-ratings makes across-subject comparisons difficult. Self-report scales are, therefore, often complemented with performance measures, such as reaction time to different events or accuracy of responses. Performance assessment is relatively unobtrusive and can be accomplished in real time at low cost, but it is not sufficiently sensitive because of the complex relationship between performance and workload [7, 8]. Moreover, performance measures cannot tap into all cognitive resources with comparable accuracy. Lately, there has been renewed interest in physiological measures as workload assessment metrics, based on signals [9–16] such as electro-oculography (EOG), electromyography (EMG), pupil diameter, electrocardiography, respiration, electroencephalography and skin conductance. Until recently, their utility was limited by the obtrusive nature of earlier instrumentation, but this has changed with the advent of miniaturized sensors and embedded platforms capable of supporting complex signal processing techniques. Still, physiological workload measures have multiple drawbacks. First, physiological workload scales are often derived empirically on a set of tasks that are assumed to represent different workload levels and are selected ad hoc, without detailed consideration of their ecological validity and ability to tap into different mental resources (e.g., cognitive, visual, auditory, or motor workload). As a result, models trained on such atomic tasks may not perform well when applied to physiological signals acquired during other, non-atomic tasks, even though these seemingly require the same mental resources. Second, in spite of the well-known and considerable between- and within-subject variability of nearly all physiological signals and metrics, the majority of physiological workload models have been developed and validated on relatively small samples of subjects. Third, the classifiers used in the models introduced hitherto have typically lacked mechanisms for adjusting the model’s parameters to individual traits, which leads to models that do not generalize well. Finally, the models have mostly ignored the considerable amount of noise inherent in the acquired physiological signals; the poor performance of some models could thus be attributed to their reliance on a rather simple mathematical apparatus.

This paper introduces PHYSIOPRINT, a workload assessment tool based on physiological measures that is built around an established theoretical model, the Improved Performance Research Integration Tool (IMPRINT) [17]. The proposed model distinguishes among seven different workload types and is trained on tasks chosen to represent the key anchors on the respective workload scales. Its mathematical apparatus is not computationally expensive, so it is applicable in real time on a fine timescale.

The rest of the paper is organized as follows. In Sect. 2 we outline the experimental setting, while Sect. 3 reports on the experimental results. Finally, in Sect. 4, we summarize our results and give an outlook on future work.

2 Methods

2.1 IMPRINT Workload Model

The IMPRINT Workload Model, developed by the Army Research Laboratory (ARL) [17], discriminates among seven types of workload: visual, auditory, cognitive, fine motor, gross motor, speech, and tactile. Each workload type is quantified on an ordinal/interval scale, similar to the VACP scales [4]. Each of the seven scales is defined by a set of behaviors of increasing complexity, each associated with a numeric value between 0 and 7. Furthermore, for each point in time, IMPRINT produces a composite measure of the overall workload, defined as a weighted sum of the type-specific workload values calculated across all tasks that are being performed simultaneously. The model has been successfully applied to estimate mental workload in a number of settings of military relevance, including a strike fighter jet, a mounted combat system [18], and the Abrams tank [19].
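
To make the composite measure concrete, the following minimal sketch computes an IMPRINT-style composite workload as a weighted sum of type-specific demands across concurrently performed tasks. The seven workload types and the 0–7 range follow the description above; the demand values, channel weights, and task names in the example are illustrative assumptions, not values taken from IMPRINT itself.

```python
# Minimal sketch of an IMPRINT-style composite workload computation.
# The workload types and the 0-7 scale follow the model description above;
# the demand values and channel weights below are illustrative only.

WORKLOAD_TYPES = [
    "visual", "auditory", "cognitive", "fine_motor",
    "gross_motor", "speech", "tactile",
]

def composite_workload(active_tasks, weights=None):
    """Weighted sum of type-specific demands over all concurrently active tasks.

    active_tasks: list of dicts mapping workload type -> demand value (0-7).
    weights:      optional dict mapping workload type -> channel weight.
    """
    weights = weights or {t: 1.0 for t in WORKLOAD_TYPES}
    per_type = {t: 0.0 for t in WORKLOAD_TYPES}
    for task in active_tasks:
        for wtype, demand in task.items():
            per_type[wtype] += demand          # demands add across concurrent tasks
    total = sum(weights[t] * per_type[t] for t in WORKLOAD_TYPES)
    return per_type, total

# Example: driving while listening to navigation instructions (illustrative values).
driving = {"visual": 5.9, "cognitive": 4.6, "fine_motor": 2.6}
listening = {"auditory": 4.3, "cognitive": 1.2}
per_type, total = composite_workload([driving, listening])
print(per_type, total)
```

In the validated IMPRINT model the channel weights (and conflict penalties between channels) are part of the tool; here they default to 1.0 purely for illustration.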

2.2 Study Design

Twenty-two healthy subjects (11 females, 25 ± 3 years) who reported no significant previous or existing health problems participated in the study. They were required to maintain a sleep diary for 5 days prior to the experiment and to refrain from alcoholic and caffeinated beverages for 24 h before it. The experiment would typically start at 9 AM, when the attending technician would set up the subject with the sensors and recording equipment (Figs. 2 and 3). The wireless X24 sensor headset (Advanced Brain Monitoring Inc., Carlsbad, CA, USA) was used to acquire 20 channels of electroencephalography (EEG) along with electrocardiography (ECG), respiration and head movement data, while a smaller X4 device from the same manufacturer recorded forearm electromyography (EMG). Following the setup, the subject would engage in a series of computer-based auditory, visual, cognitive and memory tasks that corresponded to the key anchors of the respective workload scales from the IMPRINT model (atom tasks, Table 1). The subject would next perform a set of physical exercises on a treadmill (3 min of walking at 2 mph at 0° inclination, 3 min of running at 6 mph at 0° inclination, 3 min of walking at 2 mph at 15° inclination, 3 min of walking at 6 mph at 15° inclination) and with weights (lift-ups with 5–10 lb in each hand). The subject would then participate in a 30-min session in a driving simulator and, finally, repeat the computer-based atom tasks. The entire session was recorded with a microphone and a video camera mounted on the PC or treadmill displays in front of the participant. The protocol was approved by a local Institutional Review Board; all subjects signed an informed consent before the experiment began and were financially compensated for their participation in the study.

Fig. 2.

Sequence of experimental activities and their estimated duration

Fig. 3.

A subject during a driving simulator task (left) and computer-based atom tasks (right)

Table 1. Low workload PHYSIOPRINT tasks

2.3 Data Processing and Analyses

All computerized tasks, physical exercises and driving scenarios were scored on a second-by-second basis with respect to the workload they impose, in accord with the IMPRINT workload model [17, 18]. Each EEG channel was processed with proprietary algorithms to eliminate artifacts and derive spectral features for each consecutive 2-s data segment with 1-s (50 %) overlap. ECG signals were filtered, QRS complexes were detected, and the beat-to-beat heart rate (HR) was converted into second-by-second values. Time- and frequency-domain measures of heart-rate variability (HRV) were derived from the HR data in accord with the literature [20]. EOG signals were processed with our proprietary algorithms for the detection of eye blinks and eye fixations. EMG levels and body and limb motion were quantified in each second of the data using bin integration. In addition to these ‘absolute’ or primary variables, a number of secondary or ‘relative’ variables were derived by computing ratios and/or differences between different time instances of the same primary variable or between different but functionally or spatially related primary variables (e.g. the anterior-posterior gradient of the alpha EEG power). Finally, brain-state variables quantifying fatigue, alertness and distraction were derived using our validated classifiers [15, 16]. Step-wise regression analysis was used to identify the variables derived from the physiological signals that are most predictive of the IMPRINT workload profiles and performance. The analyses took into account the known relationships between specific workload types and certain physiological signals (e.g. the speech workload scale and respiration, or the gross motor workload and heart rate or body/limb motion).
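
As an illustration of the windowing scheme described above, the sketch below derives band-power features for consecutive 2-s EEG segments with 1-s (50 %) overlap and forms one ‘relative’ variable, an anterior-posterior alpha gradient. It is a minimal stand-in, assuming a 256-Hz sampling rate and generic channel indices; the proprietary artifact-removal and feature algorithms used in the actual pipeline are not reproduced here.

```python
# Minimal sketch of windowed EEG band-power extraction with 2-s segments and
# 1-s (50 %) overlap, assuming a 256-Hz sampling rate and illustrative channel
# indices; the proprietary artifact-removal step is not reproduced here.
import numpy as np
from scipy.signal import welch

FS = 256                    # assumed sampling rate (Hz)
WIN = 2 * FS                # 2-s analysis window
STEP = 1 * FS               # 1-s step -> 50 % overlap
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(segment):
    """Absolute band powers (area under the PSD) for one single-channel segment."""
    freqs, psd = welch(segment, fs=FS, nperseg=len(segment))
    df = freqs[1] - freqs[0]
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum() * df
            for name, (lo, hi) in BANDS.items()}

def sliding_features(eeg, anterior_ch=0, posterior_ch=1):
    """Primary band powers per window plus one 'relative' variable:
    the anterior-posterior alpha gradient."""
    features = []
    for start in range(0, eeg.shape[1] - WIN + 1, STEP):
        seg = eeg[:, start:start + WIN]
        ant, post = band_powers(seg[anterior_ch]), band_powers(seg[posterior_ch])
        features.append({
            "alpha_anterior": ant["alpha"],
            "alpha_posterior": post["alpha"],
            "alpha_ap_gradient": ant["alpha"] - post["alpha"],  # secondary variable
        })
    return features

# Example on synthetic data: 2 channels, 10 s of noise -> 9 overlapping windows.
rng = np.random.default_rng(0)
feats = sliding_features(rng.standard_normal((2, 10 * FS)))
print(len(feats), feats[0])
```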

3 Results

3.1 Speech Workload Scale

The impedance-based respiration signal and the sound envelope from our X12 device sufficed for precise identification of speech episodes across the pertinent tasks (A2, A4, C3, C4, S1 and S2). Between-subject variability was not significant, and the overall classification accuracy amounted to 88.7 % (Table 2).

Table 2. Classification of speech events
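
The speech classifier itself is not detailed in this paper; as a purely illustrative sketch, the code below flags speech episodes from a sound envelope by thresholding a windowed RMS envelope against its median (non-speech) level. The sampling rate, window length, and threshold ratio are assumptions, and the actual method additionally exploits the impedance-based respiration signal.

```python
# Minimal sketch of envelope-based speech-episode detection; the sampling rate,
# window length, and threshold ratio are illustrative assumptions, not the
# values used by PHYSIOPRINT (which also uses the respiration signal).
import numpy as np

def speech_episodes(audio, fs=8000, win_s=0.05, thresh_ratio=3.0):
    """Return per-window booleans marking likely speech, via a windowed RMS envelope."""
    win = int(win_s * fs)
    n_win = len(audio) // win
    rms = np.array([np.sqrt(np.mean(audio[i * win:(i + 1) * win] ** 2))
                    for i in range(n_win)])
    baseline = np.median(rms)                 # robust estimate of non-speech level
    return rms > thresh_ratio * baseline

# Example on synthetic data: 2 s of noise with a louder 0.5-s tone burst ("speech").
fs = 8000
rng = np.random.default_rng(0)
audio = 0.01 * rng.standard_normal(2 * fs)
audio[fs:fs + fs // 2] += 0.2 * np.sin(2 * np.pi * 220 * np.arange(fs // 2) / fs)
print(speech_episodes(audio, fs).astype(int))
```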

3.2 Fine Motor Workload Scale

The EMG acquired from the forearm was a good source for the identification of fine motor activities in the pertinent tasks (B4, B5, C1, C2, A1, F1). Between-subject variability was relatively large, and normalization with respect to the baseline EMG activity (defined as the EMG activity during tasks B1 and B2) was required to obtain a classification accuracy of 86.6 % (Table 3). As one can observe, the sensitivity was high for no activity and short discrete activities (on/off EMG pattern), but there was more confusion between the continuous activities (steering wheel adjustments vs. contour tracking).

Table 3. Classification of fine motor events
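
A minimal sketch of the baseline normalization step is given below: forearm EMG is bin-integrated into one value per second and then expressed relative to the median activity recorded during the baseline tasks (B1 and B2). The window length and the synthetic data are illustrative assumptions.

```python
# Minimal sketch of baseline normalization of bin-integrated forearm EMG,
# mirroring the normalization to B1/B2 baseline activity described above;
# the window length and the toy data are illustrative assumptions.
import numpy as np

def bin_integrate(emg, fs, win_s=1.0):
    """Integrated absolute EMG per non-overlapping window (one value per second)."""
    win = int(win_s * fs)
    n = len(emg) // win
    return np.array([np.abs(emg[i * win:(i + 1) * win]).sum() for i in range(n)])

def normalize_to_baseline(task_emg_bins, baseline_bins):
    """Express task EMG as a multiple of the median baseline activity (tasks B1/B2)."""
    return task_emg_bins / np.median(baseline_bins)

fs = 1000
rng = np.random.default_rng(1)
baseline = 0.05 * rng.standard_normal(10 * fs)        # resting forearm EMG
task = 0.15 * rng.standard_normal(10 * fs)            # fine-motor task EMG
ratio = normalize_to_baseline(bin_integrate(task, fs), bin_integrate(baseline, fs))
print(ratio.round(2))
```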

3.3 Gross Motor Workload Scale

The X-, Y-, and Z-axis signals from the accelerometers within our head-worn EEG recorder and arm-worn peripheral recorder proved to be an excellent source for differentiating gross motor activities (push-ups and treadmill exercises). Between-subject variability was not significant, and the classification accuracy reached 89.3 % (Table 4).

Table 4. Classification of gross motor events
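
The sketch below illustrates the kind of per-second accelerometer features that can separate such activities: the mean and variance of the acceleration magnitude plus the dominant movement frequency within each second. The feature choice and the simulated treadmill data are assumptions for illustration, not the exact PHYSIOPRINT feature set.

```python
# Minimal sketch of per-second features from a 3-axis accelerometer; the feature
# choice (magnitude mean/variance, dominant frequency) is an illustrative
# assumption, not the exact PHYSIOPRINT feature set.
import numpy as np

def accel_features(xyz, fs):
    """One feature row per second: mean and variance of the acceleration
    magnitude, plus the dominant movement frequency within that second."""
    mag = np.linalg.norm(xyz, axis=0)            # combine X, Y, Z axes
    rows = []
    for start in range(0, mag.size - fs + 1, fs):
        seg = mag[start:start + fs]
        spectrum = np.abs(np.fft.rfft(seg - seg.mean()))
        freqs = np.fft.rfftfreq(seg.size, d=1.0 / fs)
        rows.append((seg.mean(), seg.var(), freqs[spectrum.argmax()]))
    return np.array(rows)

# Example: 5 s of simulated treadmill data at 50 Hz with a ~2 Hz step rhythm.
fs = 50
rng = np.random.default_rng(3)
t = np.arange(5 * fs) / fs
xyz = np.vstack([np.sin(2 * np.pi * 2 * t),
                 0.1 * rng.standard_normal(t.size),
                 np.ones_like(t)])
print(accel_features(xyz, fs))
```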

3.4 Auditory Workload Scale

The classifier attempted to distinguish among five conditions: ‘no activity’ (silent breaks during tasks B1–B3), ‘register a sound’ (beeps delivered throughout tasks B1–B5), ‘discriminate sounds’ (uni- vs. bilateral beeps in tasks A1 and A4), ‘interpret speech’ (digits read during tasks C2 and C4), and ‘interpret sound patterns’ (different honking patterns during the driving task). The overall classification accuracy (shown for a classifier developed on a combination of feature-vector subsets from both times of the day) amounted to 75.8 % (Table 5).

Table 5. Classification of auditory events

3.5 Visual Workload Scale

The classifier attempted to distinguish among five conditions: ‘no activity’ (silent breaks during tasks B1–B3), ‘register an image’ (tasks B4, B5), ‘detect a difference’ (task V4), ‘read a symbol’ (digits read during tasks C1 and C3), and ‘scan/search’ (task V5). The overall classification accuracy (shown again for a classifier developed on a combination of feature-vector subsets from both times of the day) amounted to 76.7 % (Table 6).

Table 6. Classification of visual events

3.6 Cognitive Workload Scale

The classifier attempted to distinguish among four conditions: ‘no activity’ (silent breaks during tasks B1–B3), ‘alternative selection’ (tasks A2 and A4), ‘encoding/recall’ (tasks C1–C4), and ‘calculation’ (task C5 and the sign task during driving). The overall classification accuracy (shown again for a classifier developed on a combination of feature-vector subsets from both times of the day) amounted to 72.5 % (Table 7).

Table 7. Classification of cognitive events

4 Discussion

The current study sought to develop a physiologically based method for workload assessment applicable in the challenging automotive setting. We addressed this need by designing a comprehensive, sensitive, and multifaceted workload assessment tool that incorporates an already established theoretical workload framework which both (1) covers the different types of workload employed in complex tasks such as driving, and (2) helps define the atomic tasks necessary for building the model. The experimental results suggested that the classifier benefits from a combination of complementary input signals (EEG and ECG), better coverage of the scalp regions by an increased number of EEG channels, inclusion of concurrent physiological measures of fatigue and alertness levels, and short-term signal history. We aimed to overcome the individual variability inherent in the physiological data by including the relative PSD variables in the feature vector. The generalization capability of the trained model was tested using leave-one-subject-out cross-validation. The proposed method demonstrated that physiological monitoring holds great promise for real-time assessment of mental workload.
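
For reference, the following sketch shows the leave-one-subject-out scheme on synthetic data, using scikit-learn’s LeaveOneGroupOut splitter with subject IDs as groups and a linear discriminant classifier as a stand-in for the actual PHYSIOPRINT model; the feature dimensions, class labels, and data are illustrative only.

```python
# Minimal sketch of leave-one-subject-out cross-validation; the LDA classifier
# and the synthetic data are illustrative stand-ins for the actual PHYSIOPRINT
# features and model.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(2)
n_subjects, n_per_subject, n_features = 22, 40, 12
X = rng.standard_normal((n_subjects * n_per_subject, n_features))
y = rng.integers(0, 5, size=X.shape[0])                    # e.g. 5 task conditions
groups = np.repeat(np.arange(n_subjects), n_per_subject)   # subject ID per sample

# Each fold trains on 21 subjects and tests on the held-out one.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(f"per-subject accuracies: {scores.round(2)}")
print(f"mean accuracy: {scores.mean():.2f}")
```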

In the future, we plan to extend the model validation to other simulated environments (a flight simulator at Systems Technology Inc.) and pertinent real environments (a fully instrumented HMMWV at the Operator Performance Laboratory at the University of Iowa). We also plan to refine the existing atom tasks, especially in the cognitive and visual areas. Alternative classification algorithms such as multi-label learning [21] will be evaluated to facilitate the process of resolving conflicts between different workload types. The classifier will, finally, be validated on a much larger sample of subjects (target N = 150).

The ultimate PHYSIOPRINT workload assessment tool is envisioned as a flexible software platform that consists of three main components: (1) an executable that runs on a dedicated local (client) machine, acquires multiple physiological signals from one or more subjects, processes them in real time, and determines global and resource-specific workload on a fine timescale; (2) a large server-based database of physiological signals acquired during relevant atomic tasks from a large number of subjects with different socio-demographic and other characteristics (e.g., degree of driving experience); and (3) a palette of real-time signal processing, feature extraction, and workload classification algorithms. The platform will support a number of recording devices from a wide range of vendors (via the appropriate device drivers) and enable visualization of the workload measures. Users will essentially be able to build their own workload assessment methods from the available building blocks of feature extraction methods and implemented classifiers. Initially, the database will include 100–150 subjects, but we envision that it will continue to evolve as the community grows in the following years.
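
The sketch below is a purely hypothetical illustration of such a building-block design: feature extractors and classifiers register themselves in simple registries, and a user composes a pipeline by name. None of the names constitute an actual PHYSIOPRINT API; they are placeholders for the kinds of components described above.

```python
# Hypothetical sketch of the envisioned building-block architecture: users pick
# a feature extractor and a classifier from registries and compose a pipeline.
# All names and registries here are illustrative, not an actual PHYSIOPRINT API.
from typing import Callable, Dict, List

FEATURE_EXTRACTORS: Dict[str, Callable] = {}
CLASSIFIERS: Dict[str, Callable] = {}

def register(registry: Dict[str, Callable], name: str):
    """Decorator that adds a building block to the given registry."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@register(FEATURE_EXTRACTORS, "band_power")
def band_power_features(signal_window: List[float]) -> List[float]:
    # placeholder: a real extractor would compute PSD band powers per channel
    return [sum(abs(x) for x in signal_window) / max(len(signal_window), 1)]

@register(CLASSIFIERS, "threshold")
def threshold_classifier(features: List[float]) -> str:
    # placeholder: a real classifier would output resource-specific workload levels
    return "high_visual" if features[0] > 0.5 else "low_visual"

def build_pipeline(extractor_name: str, classifier_name: str):
    """Compose a per-window workload estimator from two registered blocks."""
    extract, classify = FEATURE_EXTRACTORS[extractor_name], CLASSIFIERS[classifier_name]
    return lambda window: classify(extract(window))

pipeline = build_pipeline("band_power", "threshold")
print(pipeline([0.2, -0.9, 0.7, 0.4]))
```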