1 Introduction

Mental workload (MWL) is a multi-faceted phenomenon with no clear and widely accepted definition. Intuitively, it can be described as the amount of cognitive work expended on a certain task during a given period of time. However, this is a simplistic definition, and other factors such as stress, time pressure and mental effort can all influence MWL [11]. The principal reason for measuring MWL is to quantify the mental cost of performing a task in order to predict operator and system performance [1]. It is an important construct, mainly used in the fields of psychology and ergonomics, with applications in the aviation and automobile industries [5, 20] and in interface and web design [15, 16, 23]. According to Young and Stanton, both underload and overload can weaken performance [28]. However, optimal workload has a positive impact on user satisfaction, system success, productivity and safety [12]. Often the information necessary for modelling the construct of MWL is uncertain, vague and contradictory [13]. State-of-the-art measurement techniques do not take into consideration the inconsistency of the data used in the modelling phase, which might lead to contradictions and loss of information. For example, if the time spent on a certain task is low, it can be inferred that the overall MWL is also low; however, if the effort invested in the task is extremely high, then the contrary can be inferred. The aim of this study is to investigate the use of rule-based expert systems for the modelling and inference of MWL. An expert system is a computer program designed to model the problem-solving ability of a human expert [3]. This human expert has to provide a knowledge base, which in turn is translated into computable rules. These rules are used by an inference engine aimed at inferring a numerical index of MWL.
Since there is no ground truth indicating whether such an index is fully correct, the inferential capacity of the defined expert system needs to be investigated in order to gauge its quality. To this end, the proposal is to adopt some of the criteria most commonly used in psychometrics, such as validity and sensitivity [4, 22, 24]. In simple terms, these criteria are aimed at assessing whether a technique is measuring the construct under investigation and whether it is capable of differentiating variations in workload. From this, the following research question can be defined: can implementations of rule-based expert systems, compared to state-of-the-art MWL inference techniques, enhance the modelling of mental workload according to sensitivity and validity?

The remainder of this paper is organised as follows: Sect. 2 describes related work on MWL and its assessment techniques, and provides a general view of rule-based expert systems. Section 3 presents the design of an experiment and the methodology adopted. Findings are discussed in Sect. 4, while Sect. 5 concludes our contribution and introduces future work.

2 Related Work

2.1 Mental Workload Assessment Techniques

As stated by several authors, there is no simple and agreed definition of mental workload [6, 20, 27]. It is thought to be multidimensional and multifaceted, resulting from the aggregation of many different factors, and is thus difficult to define uniquely [1]. The basic intuition is that mental workload is the amount of cognitive work necessary for a person to accomplish a task over a period of time. Nevertheless, a large number of measures have been developed [7, 29] and practitioners have found measuring MWL to be useful [25]. Most empirical assessment procedures can be classified into three major categories [19]:

  • Subjective measures: operators are required to evaluate their own MWL according to different rating scales or a set of questionnaires.

  • Performance-based measures: these infer an index of MWL from objective notions of performance on the primary task, such as number of errors, completion time or reaction time to respond to secondary tasks.

  • Physiological measures: these infer a value of MWL according to some physiological response from the operator such as pupillary reflex or muscle activity.

Further details for each category can be found in [5, 17]. This study makes use of two subjective measures of MWL that have been widely employed over the last four decades [7, 21, 24]. These are used as baselines: the NASA-Task Load Index (TLX) [7] and the Workload Profile (WP) [24].

The NASA-TLX is a multidimensional scale, initially developed for use in the aviation industry. Its application has since spread across several different areas, such as automobile driving, the medical profession, computer use and military cockpits. It has also achieved great importance and is considered a reference point for the development of new measures and models [6]. NASA-TLX consists of six subscales: mental demand, physical demand, temporal demand, frustration, effort and performance (Table 4, in the Appendix, questions 1–5 plus physical demand). An overall MWL index is computed as a weighted average of these six dimensions \(d_i\), each quantified using a questionnaire. The weights \(w_i\) are provided by the operator through a comparison of each possible pair of the six dimensions, for example “which contributed more to the MWL: mental demand or effort?” or “which contributed more to the MWL: performance or frustration?”, giving a total of 15 preferences. The number of times each dimension is chosen defines its weight (Eq. 1).

The Workload Profile is another MWL assessment technique, based on the Multiple Resource Theory (MRT) [26]. In contrast to the NASA-TLX, it is built upon 8 dimensions: perceptual/central processing, response processing, spatial processing, verbal processing, visual processing, auditory processing, manual responses and speech responses (Table 4, questions 6–13). The operator is asked to rate the proportion of attentional resources used, in the range 0 to 1, for each dimension; these ratings are then summed (Eq. 2). For comparison purposes, this sum is averaged.

$$\begin{aligned} \text {TLX}_\text {MWL} = \Big (\sum _{i=1}^{6} d_i \times w_i \Big ) \frac{1}{15} \end{aligned} \tag{1}$$
$$\begin{aligned} \text {WP}_\text {MWL} = \sum _{i=1}^{8} d_i \end{aligned} \tag{2}$$
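The two baseline computations can be sketched in a few lines of Python. This is only an illustrative sketch of Eqs. 1 and 2: the function names, the example ratings and the preference pattern (the higher-rated dimension wins each pair) are assumptions made here and are not taken from the study data.

```python
from itertools import combinations

TLX_DIMENSIONS = (
    "mental demand", "physical demand", "temporal demand",
    "frustration", "effort", "performance",
)


def tlx_weights(preferences):
    """Count how many of the 15 pairwise comparisons each dimension wins.

    `preferences` maps frozenset({a, b}) to the dimension chosen for that pair.
    """
    weights = {d: 0 for d in TLX_DIMENSIONS}
    for a, b in combinations(TLX_DIMENSIONS, 2):
        weights[preferences[frozenset((a, b))]] += 1
    return weights


def tlx_mwl(ratings, weights):
    """Eq. 1: weighted sum of the six ratings d_i, divided by 15."""
    return sum(ratings[d] * weights[d] for d in TLX_DIMENSIONS) / 15


def wp_mwl(ratings):
    """Eq. 2: sum of the eight Workload Profile ratings, each in [0, 1]."""
    if len(ratings) != 8:
        raise ValueError("Workload Profile expects 8 ratings")
    return sum(ratings)


# Hypothetical answers of one operator (illustrative values only).
tlx_ratings = {
    "mental demand": 70, "physical demand": 10, "temporal demand": 40,
    "frustration": 30, "effort": 60, "performance": 50,
}
# Assumed preference pattern: in each pair the higher-rated dimension is chosen.
preferences = {
    frozenset((a, b)): max((a, b), key=tlx_ratings.get)
    for a, b in combinations(TLX_DIMENSIONS, 2)
}
weights = tlx_weights(preferences)
tlx_index = tlx_mwl(tlx_ratings, weights)

wp_ratings = [0.5, 0.25, 0.75, 0.5, 1.0, 0.0, 0.25, 0.75]
wp_sum = wp_mwl(wp_ratings)  # Eq. 2
wp_mean = wp_sum / 8         # averaged form, as mentioned for comparison
```

Because each of the 15 comparisons awards exactly one point, the weights always sum to 15, which is why Eq. 1 divides by 15.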

According to [22], WP is preferred to NASA-TLX if the goal is to compare the MWL of two or more tasks with different levels of difficulty, while NASA-TLX is preferred if the goal is to predict the performance of a particular individual in a single task. Several criteria have been proposed for the selection and development of measurement techniques [19]. This study focuses on two of them:

  • validity: to determine whether the MWL measurement instrument is actually measuring MWL. Two variations of validity are usually employed in psychology: concurrent and convergent. The former aims at determining to what extent a technique can explain objective performance measures, such as task execution time. The latter indicates whether different MWL techniques correlate with each other [24]. In the literature, concurrent and convergent validity are calculated using statistical correlation coefficients [12, 22].

  • sensitivity: the capability of a technique to discriminate significant variations in MWL and changes in resource demand or task difficulty [19]. Formally, sensitivity has been assessed in two different ways: multiple regression [24] and ANOVA [12, 22]. The aim was to identify statistically significant differences between the MWL indexes associated with each task under examination.

2.2 Mental Workload and Rule-Based Expert System

An expert system is a computer program created in order to emulate an expert in a given field [3]. The goal is to imitate the expert's capability of solving different tasks in his or her area. Unlike usual procedural algorithms, an expert system normally has two modules: a knowledge base and an inference engine. The knowledge base is provided by the expert and translated into a set of rules, which will be utilised by an inference engine. A typical rule is of the form “IF ... THEN ...” and the engine will elicit and aggregate all the rules in order to infer a conclusion. In [9], a literature review of many areas in which expert systems have been applied is provided, while [8, 18] are examples of works in the more general field of knowledge representation. To the best of our knowledge, the only study that has attempted to model MWL employing inference rules is by Longo [10]. There, modelling MWL has been proposed as a defeasible reasoning process, which is a kind of reasoning built upon inference rules that are defeasible. Defeasible reasoning does not produce a final representation of MWL, but rather a dynamic representation that might change in the light of new evidence and rules. Following this approach, rule-based expert systems might be suitable complements because of their capacity to imitate the problem-solving ability of an expert and to facilitate the justification of the inferred conclusion.

3 Design and Methodology

In order to answer the research question, an experiment is designed as follows:

  1. acquisition of a knowledge base (KB) related to MWL from an expert;

  2. KB translation into different types of rules (forecast, undercutting, rebutting);

  3. construction of models (\(e_1-e_4\), \(fr_1-fr_4\)) based on two variations of the KB, each employing different types of rules and heuristics (\(h_1,...,h_4\));

  4. comparison of the inferential capacity of each model against selected baseline instruments (NASA-TLX and WP) according to validity and sensitivity:

    • validity is measured to investigate if the implemented rule-based expert system is capable of inferring MWL as well as the baseline instruments.

    • sensitivity is measured to determine the quality of the inference made by the implemented expert system.

Table 1. Experimental set-up: types of rules employed by two variations of the same knowledge base (left) and name of each model, variation used and heuristic adopted (right).

3.1 Knowledge Base (KB)

Research studies performed by Longo et al. have developed a knowledge base for the inference of MWL in the field of human-computer interaction [11, 12, 16]. The goal was to investigate the impact of structural changes to web interfaces on the mental workload imposed on end-users interacting with them. The knowledge base comprises 21 attributes (Table 4), a set of features believed to be useful for modelling MWL, each quantified through a subjective question in the range \([0, 100] \subset \mathbb {R}\). MWL has four possible levels, as per Definition 1.

Definition 1

(Mental workload level). Four MWL levels are defined: underload (U), fitting\(^-\) (F\(^-\)), fitting\(^+\) (F\(^+\)) and overload (O).

The set of rules built from the knowledge-base of the expert [11] can be seen in the Appendix and a formal definition follows.

Definition 2

(Rules). Three types of rules are defined.

  • Forecast rule (FR): takes a value \(\alpha \) of an attribute X and infers a MWL level \(\beta \) if \(\alpha \) is in a predefined range \([x_1, x_2]\) with \(x_1, x_2 \in \mathbb {N}\) and \(x_2 > x_1\).

    $$\begin{aligned} FR: \,\,{\varvec{IF}}\,\, \alpha \in [x_1, x_2] \,\, {\varvec{THEN}}\,\, \beta \end{aligned}$$
  • Undercutting rule (UR): takes one or more attribute values, \(\alpha _1, \cdots , \alpha _n\), and undercuts what is inferred by a forecast rule Y if \(\alpha _1 \in [x_1^1, x_2^1], \cdots , \alpha _n \in [x_1^n, x_2^n]\). In this case it is said that rule Y is discarded, d(Y), and it will not be considered for future inferences of MWL.

    $$\begin{aligned} UR: \,\,{\varvec{IF}}\,\, \alpha _1 \in [x_1^1, x_2^1] \,\, {\varvec{and}}\,\, \cdots \,\,{\varvec{and}}\,\, \alpha _n \in [x_1^n, x_2^n] \,\, {\varvec{THEN}}\,\, d(Y) \end{aligned}$$
  • Rebutting rule (RR): a relationship between two forecast rules, \(Y_1\) and \(Y_2\), that cannot coexist.

    $$\begin{aligned} RR: \,\,{\varvec{IF}}\,\, Y_1 \,\,{\varvec{and}}\,\, Y_2 \,\,{\varvec{THEN}}\,\, d(Y_1) \,\, {\varvec{and}}\,\, d(Y_2). \end{aligned}$$

Example 1

Examples of possible rules are:

  • Forecast rules

    EF1: [IF effort \(\in [0, 32]\) THEN U]

    EF4: [IF effort \(\in [67, 100]\) THEN O]

    MD1: [IF mental demand \(\in [0, 32]\) THEN U]

    PK1: [IF past knowledge \(\in [0, 32]\) THEN O]

  • Undercutting rule

    DS1: [IF task difficulty \(\in [67, 100]\) and skills \(\in [67, 100]\) THEN d(EF4)]

  • Rebutting rule

    r5: [IF PK1 and EF1 THEN d(PK1) and d(EF1)]

3.2 Inference Engine

Having defined the set of rules, the next step for inferring MWL is to implement an inference engine. Our inference engine starts with the activation of rules in the set of FR, based on the inputs provided by the user. These are called activated rules. Afterwards, rules from the sets of UR and RR might discard activated rules, resolving part of the contradictory information. This step is not compulsory: implementations of rule-based expert systems without UR and RR are also provided. Activated rules that are not discarded are called surviving rules. Once the set of surviving rules has been defined, some inconsistent inferences might still remain, since surviving rules are likely to infer different MWL levels even after the application of UR and RR. The expert system, therefore, must be able to aggregate the surviving rules and produce a final inference of MWL. An example follows.

Example 2

Following rules from Example 1 and given a numerical input it is possible to define the set of activated rules and the set of surviving rules.

  • Inputs: [effort = 80, past knowledge = 15, task difficulty = 90, mental demand = 20, skills = 70, temporal demand = 10]

  • Rules:

    Activated: [EF4, PK1, MD1, TD1, DS1]

    Discarded: [EF4]

    Surviving: [PK1, MD1, TD1]
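The activation and discarding steps of Example 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the class and function names are hypothetical, rule TD1 is not spelled out in Example 1 and is assumed here by analogy with MD1, and rebutting rules are assumed to be checked against the set of activated forecast rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForecastRule:
    name: str
    attribute: str
    low: int
    high: int
    level: str  # one of "U", "F-", "F+", "O"

    def fires(self, inputs):
        value = inputs.get(self.attribute)
        return value is not None and self.low <= value <= self.high


@dataclass(frozen=True)
class UndercuttingRule:
    name: str
    conditions: tuple  # ((attribute, low, high), ...)
    target: str        # name of the forecast rule to discard

    def fires(self, inputs):
        return all(
            attr in inputs and low <= inputs[attr] <= high
            for attr, low, high in self.conditions
        )


@dataclass(frozen=True)
class RebuttingRule:
    name: str
    first: str   # names of two forecast rules that cannot coexist
    second: str


FORECAST = [
    ForecastRule("EF1", "effort", 0, 32, "U"),
    ForecastRule("EF4", "effort", 67, 100, "O"),
    ForecastRule("MD1", "mental demand", 0, 32, "U"),
    ForecastRule("PK1", "past knowledge", 0, 32, "O"),
    # Assumption: TD1 is defined by analogy with MD1 (not given in Example 1).
    ForecastRule("TD1", "temporal demand", 0, 32, "U"),
]
UNDERCUTTING = [
    UndercuttingRule(
        "DS1", (("task difficulty", 67, 100), ("skills", 67, 100)), "EF4"
    ),
]
REBUTTING = [RebuttingRule("r5", "PK1", "EF1")]


def infer(inputs):
    """Return activated, fired undercutting, discarded and surviving rule names."""
    activated = [r.name for r in FORECAST if r.fires(inputs)]
    fired_ur = [u for u in UNDERCUTTING if u.fires(inputs)]
    discarded = {u.target for u in fired_ur if u.target in activated}
    for rb in REBUTTING:
        if rb.first in activated and rb.second in activated:
            discarded.update((rb.first, rb.second))
    surviving = [name for name in activated if name not in discarded]
    return {
        "activated": activated,
        "fired_undercutting": [u.name for u in fired_ur],
        "discarded": sorted(discarded),
        "surviving": surviving,
    }


EXAMPLE_INPUTS = {
    "effort": 80, "past knowledge": 15, "task difficulty": 90,
    "mental demand": 20, "skills": 70, "temporal demand": 10,
}
result = infer(EXAMPLE_INPUTS)
```

With the inputs of Example 2, DS1 fires and discards EF4, while r5 has no effect because EF1 is not activated, leaving PK1, MD1 and TD1 as surviving rules.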

Example 2 illustrates a set of surviving rules inferring underload MWL (MD1, TD1) and overload MWL (PK1) at the same time. At this stage, typical conflict resolution strategies for expert systems include deciding a priority for each rule, firing all possible lines of reasoning or choosing the first rule addressed. However, none of these strategies is applicable in our experiment, since there is no preference among rules, no order of evaluation and no possibility of computing more than one output. Because the knowledge base does not provide sufficient information for resolving such conflicts, four heuristics are defined to aggregate the surviving rules. These heuristics are designed to extract different pieces of information from the surviving rules and to aggregate them in different ways. The final MWL will be a value in the range \([0, 100] \subset \mathbb {R}\). Before presenting these heuristics, it is necessary to define the value of a surviving rule (Definition 3).

Definition 3

(Surviving rule value). The value of a surviving rule \(r \in FR\), with input \(0 \le \alpha \le 100\) related to attribute X, is given by the function

$$\begin{aligned} f(r) = {\left\{ \begin{array}{ll} \alpha ,&{} \text {if } X \, \propto \, MWL \\ 100 - \alpha ,&{} \text {if } X \, \propto \, \frac{1}{MWL} \end{array}\right. } \end{aligned}$$

with \(X \, \propto \, MWL\) denoting a direct relationship and \( X \, \propto \, \frac{1}{MWL}\) an inverse relationship.\(^{1}\)

Given Definition 3 the following heuristics are designed:

  • \(\varvec{h_1}\): the average value of the surviving rules of the MWL level with the largest number of surviving rules. If two or more levels are tied for the largest cardinality, the mean of their averages is computed. The idea is to give importance to the most widely supported point of view (the largest set of surviving rules) when inferring MWL.

  • \(\varvec{h_2}\): the highest of the average values of the surviving rules computed for each MWL level. This is a pessimistic point of view, which infers the highest MWL among the sets of surviving rules of each MWL level.

  • \(\varvec{h_3}\): the average value of all surviving rules. This gives equal importance to all surviving rules, regardless of which MWL level they support.

  • \(\varvec{h_4}\): the average of the per-level averages of the surviving rules. This gives equal importance to all MWL levels that have surviving rules.

Example 3

Following Example 2, the values of the surviving rules are \(f(PK1) = 85\), \(f(MD1) = 20\) and \(f(TD1) = 10\). The overall MWL computed by each heuristic is then: \(h_1 \text {:}\, \frac{20 + 10}{2}=15\), \(h_2 \text {:}\, max(85, \frac{20 + 10}{2}) = 85\), \(h_3\text {:}\, \frac{20 + 10 + 85}{3} \approx 38.3\) and \(h_4\text {:}\, \frac{\frac{20 + 10}{2} + 85}{2} = 50\).
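The four heuristics and the arithmetic of Example 3 can be reproduced with a short Python sketch. The representation of a surviving rule as a `(level, value)` pair and the function names are assumptions made for illustration; the rule values follow Definition 3, with past knowledge treated as inversely related to MWL, as implied by \(f(PK1) = 85\).

```python
from statistics import mean


def rule_value(alpha, direct):
    """Definition 3: alpha for a direct relationship, 100 - alpha otherwise."""
    return alpha if direct else 100 - alpha


def group_by_level(surviving):
    """Group the values of surviving rules by the MWL level they infer."""
    groups = {}
    for level, value in surviving:
        groups.setdefault(level, []).append(value)
    return groups


def h1(surviving):
    """Average of the largest level; mean of the averages if levels are tied."""
    groups = group_by_level(surviving)
    largest = max(len(values) for values in groups.values())
    return mean(mean(v) for v in groups.values() if len(v) == largest)


def h2(surviving):
    """Highest of the per-level averages (pessimistic view)."""
    return max(mean(v) for v in group_by_level(surviving).values())


def h3(surviving):
    """Average of all surviving rule values."""
    return mean(value for _, value in surviving)


def h4(surviving):
    """Average of the per-level averages."""
    return mean(mean(v) for v in group_by_level(surviving).values())


# Surviving rules of Example 2 as (inferred level, value) pairs.
surviving = [
    ("O", rule_value(15, direct=False)),  # PK1: past knowledge = 15
    ("U", rule_value(20, direct=True)),   # MD1: mental demand = 20
    ("U", rule_value(10, direct=True)),   # TD1: temporal demand = 10
]
indexes = {h.__name__: h(surviving) for h in (h1, h2, h3, h4)}
```

Here `indexes` holds 15 for `h1`, 85 for `h2`, roughly 38.3 for `h3` and 50 for `h4`, matching Example 3.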

4 Data Collection, Elicitation of Models and Evaluation

Nine information-seeking web-based tasks of varying difficulty and demand (Table 3) were performed by participants over three websites: Google, Wikipedia and YouTube. Two alterations of the interface of each website were proposed, giving \(9 \times 2 = 18\) configurations overall. 40 volunteers performed 9 tasks (on a random alteration) and, after each task, answered each question of Table 4 using a paper-based scale in the range \([0, 100] \subset \mathbb {N}\), partitioned into 3 regions delimited at 33 and 66. Due to loss of data or partial completion of questionnaires, 406 instances were valid. For each instance, the collected answers were used to elicit the rules of each model (Sect. 3), which were aggregated with the model's heuristic to produce an index of MWL on the scale \([0, 100] \subset \mathbb {R}\). The outputs formed a distribution of MWL indexes, one for each model, and these were compared against those of the baseline models according to validity and sensitivity (Fig. 1).

Fig. 1. Evaluation strategy schema.

4.1 Validity

In line with other studies [12, 22], validity was assessed using correlation coefficients. In order to select the most suitable statistic, the normality of the distributions of the MWL indexes produced by each model was tested using the Shapiro-Wilk test. This test did not achieve a significance greater than 0.05 for most of the models, indicating that the data are not normally distributed. As a consequence, Spearman’s rank-order correlation was selected.
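Spearman’s coefficient is the Pearson correlation computed on ranks, which is why it does not require normally distributed data. The following standard-library sketch illustrates the computation; the function names and the sample values are hypothetical, and in practice a statistics library (for example `scipy.stats.spearmanr`) would normally be used instead.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical MWL indexes of one model and task completion times (seconds).
mwl_indexes = [10, 20, 30, 40, 50]
completion_times = [1, 3, 2, 5, 4]
rho = spearman(mwl_indexes, completion_times)
```

For these illustrative values `rho` is 0.8; the same function applied to the indexes of two models would give a convergent validity coefficient.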

Convergent validity: aimed at determining to what extent a model correlates with other models of MWL. As can be seen from Fig. 2, the baseline instruments (NASA-TLX and WP) achieved a correlation of .538 (dashed reference line) with each other. When correlated with NASA-TLX, \(e_3\) and \(fr_3\) obtained a higher correlation than this. These two models both apply the heuristic \(h_3\), which is the average of all surviving rules, a computational method similar to that used by the baseline instruments. In just two other cases (\(e_1\), \(fr_1\)) a good correlation (close to the reference line) with WP was obtained. These 2 models implement heuristic \(h_1\), which is the average of the surviving rules of the MWL level (set of rules) with the largest cardinality. The above 4 cases demonstrate how models built using rule-based expert systems can show validity similar to that of baseline MWL assessment instruments believed to shape the construct of MWL.

Concurrent validity: aimed at determining the extent to which a model correlates with task completion time (objective performance measure).\(^{2}\) From Fig. 3, it is possible to note that even the baseline instruments do not have a high correlation with task completion time. The first dashed line represents the correlation of 0.178 between NASA-TLX and time, while the second represents the correlation of 0.119 between WP and time. Similarly to convergent validity, the models applying heuristic \(h_3\) (\(e_3\) and \(fr_3\)) plus the model \(e_2\) were the ones that correlated best with task completion time (Fig. 3), outperforming the NASA-TLX. Almost all the models also outperformed the WP baseline. These findings suggest that computational models of MWL can be built as rule-based expert systems, and that these are capable of enhancing the concurrent validity of the assessments when compared with state-of-the-art models.

Fig. 2. Convergent validity: \(p < 0.05\).

Fig. 3. Concurrent validity: \(p < 0.05\).

4.2 Sensitivity

In line with other studies [12, 22], sensitivity was assessed by analysis of variance. In particular, the non-parametric Kruskal-Wallis H test was performed over the MWL distributions generated by each model; it was selected because some of the assumptions behind its parametric equivalent, one-way ANOVA, were not met. Only model \(e_4\) failed to reject the null hypothesis of the same distribution of MWL indexes across tasks (\(p < 0.01\)). This means that, for the other models, statistically significant differences exist. The Kruskal-Wallis H test, however, does not tell exactly which pairs of tasks are different from each other. As a consequence, post hoc analysis was performed, and the Games-Howell test was chosen because of the unequal variances of the distributions under analysis. Table 2 shows how many pairs of tasks each model was capable of differentiating at different significance levels (\(p<0.05\) and \(p<0.01\)). As can be observed, models applying heuristic \(h_3\) (\(fr_3\) and \(e_3\)) outperformed the WP but underperformed the NASA-TLX. This result confirms that sensitive rule-based expert systems for mental workload can be successfully built and can compete with existing benchmarks in the field.
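For reference, the Kruskal-Wallis statistic in its usual textbook form (without the correction for ties) is given below. The symbols are introduced here for illustration and do not appear in the original text: \(k\) is the number of tasks, \(n_j\) the number of MWL indexes observed for task \(j\), \(R_j\) the sum of the ranks of those indexes within the pooled sample, and \(N = \sum _j n_j\) the total number of instances.

$$\begin{aligned} H = \frac{12}{N(N+1)} \sum _{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1) \end{aligned}$$

Under the null hypothesis, \(H\) approximately follows a chi-square distribution with \(k - 1\) degrees of freedom, that is, 8 degrees of freedom for the nine tasks considered here.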

Table 2. Sensitivity of MWL models with Games-Howell post hoc analysis. The maximum number of pairwise comparisons of 9 tasks is \(\left( {\begin{array}{c}9\\ 2\end{array}}\right) =36\).

4.3 Summary of Findings

Quantifications of the validity and the sensitivity of the developed models suggest that rule-based expert systems can be successfully built for mental workload modelling and assessment, because their inferential capacity lies between that of two state-of-the-art assessment instruments, namely the NASA Task Load Index and the Workload Profile. Moreover, it is argued here that these systems are more appealing and dynamic than the selected state-of-the-art approaches. Firstly, they use rules built with terms that are closer to the way humans reason and that imitate experts' problem-solving ability. Secondly, they embed heuristics for aggregating rules in a more dynamic way, with a better capacity for handling uncertainty and conflicting pieces of information than the fixed formulas of state-of-the-art models. Thirdly, they allow the comparison of the knowledge bases and beliefs of different MWL designers, thus increasing the understanding of the construct of mental workload itself.

5 Conclusion and Future Work

This research presents a new way of modelling and assessing the construct of Mental Workload (MWL) by means of rule-based expert systems. A knowledge base of a MWL designer was elicited and translated into computational rules of various types. Different heuristics for aggregating these rules were designed, aimed at inferring MWL as a numerical index. Inferred indexes were systematically compared with those generated by two state-of-the-art MWL assessment techniques: the NASA Task Load Index and the Workload Profile. This comparison included the quantification of two properties of each distribution of MWL indexes, namely sensitivity and validity, commonly employed in the literature. Findings suggest that rule-based expert systems are promising, not only because they can approximate the inferential capacity of selected state-of-the-art MWL assessment techniques, but also because they offer a flexible approach for translating the different knowledge bases and beliefs of MWL designers into computational rules. This supports the creation of models that can be replicated, extended and falsified, thus enhancing the understanding of the construct of mental workload itself. Future work will focus on the replication of the approach adopted in this study using other knowledge bases elicited from other MWL experts. Additionally, this approach will be extended by incorporating fuzzy representation of rules and acceptability semantics, borrowed from argumentation theory [2, 14], with the aim of improving the conflict resolution of rules and building models expected to have an even higher sensitivity and validity.