Designing computer-based tests: design guidelines from multimedia learning studied with eye tracking

The use of computer-based tests (CBTs), for both formative and summative purposes, has greatly increased over the past years. One major advantage of CBTs is the easy integration of multimedia. It is unclear, though, how to design such CBT environments with multimedia. The purpose of the current study was to examine whether guidelines for designing multimedia instruction based on the Cognitive Load Theory (CLT) and Cognitive Theory of Multimedia Learning (CTML) will yield similar effects in CBT. In a within-subject design, thirty-three vocational students completed a computer-based arithmetic exam, in which half of the items were presented in an original design format, and the other half was redesigned based on the CTML principles for instructional design (i.e., adapted format). Results showed that applying CTML principles to a CBT decreased the difficulty of the test items, i.e., students scored significantly higher on the adapted test items. Moreover, eye-tracking data showed that the adapted items required less visual search and increased attention for the question and answer. Finally, cognitive load, measured as silent pauses during a secondary think-aloud task, decreased. Mean fixation duration (a different indicator of cognitive load), however, did not significantly differ between adapted and original items. These results indicate that applying multimedia principles to CBTs can be beneficial. It seems to prevent cognitive overload and helps students to focus on important parts of the test items (e.g., the question), leading to better test results.


Introduction
Computers are frequently used in education to support and assess students' knowledge. The increasing use of computers for both formative and summative testing has many advantages. Computers provide, for example, an excellent medium to include different representations (e.g., videos, text, and pictures) and response formats (e.g., drag-and-drop questions, interactive questions) in tests (see Basu et al., 2007 for an overview; Mayer, 2014a, 2019). This makes computer-based tests (CBT) more engaging, effective, and entertaining as compared to paper-based tests (Azabdaftari & Mozaheb, 2012; Başoğlu & Akdemir, 2010; Basu et al., 2007; Chua & Don, 2013; Kaplan-Rakowski & Loranc-Paszylk, 2017) and increases student motivation (Lin & Yu, 2017).
Despite these benefits, concrete design guidelines on how best to present multimedia content in test items are lacking thus far. Research on learning with multimedia has, however, shown that the concrete design of learning materials (e.g., when to insert which type of pictures) has a strong influence on how learners process the presented information and thus on the learning outcome itself (Alemdag & Cagiltay, 2018). Although learning is not the same as testing (i.e., the former requires information intake and the latter information retrieval), both require sensory and cognitive processing of information that is subject to the same characteristics and restrictions of the human information-processing system (Baddeley, 1992). Moreover, information intake is the very first step in understanding what is given and what is asked in a test item. In this light, Kirschner et al. (2016) suggested that testing involves similar cognitive processes as learning (i.e., schema creation and retrieval) and that the design of test items will thus play a vital role in their processing (cf. Jarodzka et al., 2015; Ögren et al., 2016). Consequently, Kirschner et al. (2016) suggested investigating to what extent cognitive theories on learning with multimedia, namely the Cognitive Theory of Multimedia Learning (CTML; Mayer, 2014a, b) and the Cognitive Load Theory (CLT; Sweller, 2011; Sweller et al., 1998), apply to testing.

Human cognitive architecture and information processing
Designers' understanding of how humans learn is reflected in their composition of multimedia messages and instructional materials. One crucial maxim is that the cognitive system of the learner should not be overloaded and that meaningful cognitive processing should be supported through instructional design (Mayer, 2014a, b; Paas & Sweller, 2014). Both theories, CLT (Sweller, 2011; Sweller et al., 1998) and CTML (Mayer, 2009, 2014a), make specific assumptions about how people process multimedia material.

The CTML perspective
Against the background of cognitive science, CTML builds on three pillars describing human learning (Mayer, 2009). First, incoming visual and auditory information is processed through separate channels (dual-channel assumption; Baddeley, 1992; Paivio, 1986). Second, the human cognitive system is limited in the amount of information that it can process simultaneously (limited capacity assumption; Baddeley, 1992; Chandler & Sweller, 1991). Finally, the learner must engage in active processing to acquire knowledge (Mayer, 2009). Active processing refers to drawing attention to information, selecting relevant information from the presentation, organizing the selected information into coherent mental models, and integrating the mental models with prior knowledge from long-term memory (Mayer, 2009, 2014a). There are three demands on the learner's cognitive capacity that can be ascribed to external sources or to sources inherent to the learning task (Mayer, 2009, 2014a): extraneous processing is not related to the instructional objective but rather to the processing of task-irrelevant extraneous material, essential processing is the mental representation of the essential material as presented to the learner, and generative processing is the process of making sense of the material. Extraneous and essential processing can result in overload when they exceed the learner's cognitive capacities. The cognitive activities required for meaningful learning (i.e., selecting, organizing, and integrating) are linked to these kinds of load. Selection is associated with essential load; organization and integration are associated with generative load (Mayer, 2009). Therefore, good instructional design should support these processes, and hence meaningful learning, through (a) the reduction of extraneous processing due to insufficient presentation, (b) the management of essential processing, and (c) the fostering of generative processing.
Each of these areas is defined by specific design principles. The present study focuses on the reduction of extraneous processing.

The CLT perspective
Similarly, the CLT explains load related to the learning tasks which affect the learner and offers a theoretical framework of the human cognitive architecture from which design principles can be derived. The human cognitive structure consists of a long-term memory (LTM) and a working memory (WM). The capacity of the long-term memory is unlimited, whereas the working memory is limited in its capacity and duration (Choi et al., 2014; Sweller et al., 1998). Note that similar assumptions are also made within the CTML (Mayer, 2009). In particular, new and unfamiliar information challenges the limited capacity of the working memory (Choi et al., 2014). These challenges can be overcome when the new information can be easily connected and integrated with pre-existing information from long-term memory. The broader the prior knowledge base of a learner, the more easily she will manage new incoming information. Cognitive schemas are an efficient way to organize and store knowledge. The construction of cognitive schemas refers to the chunking of similar information elements into one single element which is stored in LTM. The retrieval of one element from LTM is less load-intensive and simultaneously activates other information pieces within the schema which can be linked with the new information piece in working memory. Through practice, the learner can reach a level of automation where the schema can be used unconsciously, thus successfully tackling the limitations of WM (Choi et al., 2014).
Within CLT, two load types can be distinguished: intrinsic load and extraneous load (Choi et al., 2014; Sweller et al., 2019). Intrinsic load refers to the given and unchangeable complexity of the learning task, specifically the number of interacting information elements that must be processed to master the learning task (Choi et al., 2014; Sweller et al., 2019). Extraneous load, on the other hand, is determined by the way the information is presented and the instructions provided to the learner. Unlike intrinsic load, the extraneous load can be changed through instructional design. Previous models have also introduced germane load as a third distinct load type defined as the cognitive load required to learn (Sweller et al., 2019). However, revised versions of the CLT consider intrinsic and germane cognitive load "closely intertwined" (Kalyuga, 2011; Sweller et al., 2019, p. 264), and rather call the WM resources dealing with intrinsic load "germane resources" (Choi et al., 2014, p. 227). The kinds of load in CTML can be understood within the CLT as follows: essential cognitive processing corresponds to intrinsic cognitive load, generative cognitive processing corresponds to germane cognitive load, and extraneous processing corresponds to extraneous load (Mayer, 2009).
The measurement of cognitive load can be conceptualized through mental load, mental effort, and performance. Task performance is considered an indirect indicator of cognitive load because the amount of cognitive load during task processing is interpreted post hoc in relation to the measured task performance (Korbach et al., 2017b). Measuring cognitive load through a secondary task is considered an objective and reliable method (Korbach et al., 2017b). It obtains behavioral data from a secondary task that is performed by the learner in addition to a primary task (dual-task methodology). The secondary task represents experienced cognitive load during a task, whereas the subjective measures indicate load after the completion of the task. Previous work has demonstrated that speech analysis can serve as a promising cognitive load measurement (Yin & Chen, 2007). Speech-based features such as silent pauses are considered potential indicators of the level of cognitive load: participants make more pauses when they experience higher load. In addition, pauses in speech are associated with fixation durations, which can be considered another indicator of cognitive load (cf. Jarodzka et al., 2015). Despite their many advantages, objective measures are less widely used because subjective measures can be administered more easily and quickly.

Multimedia in testing
Recently, several studies have aimed to reduce the extraneous load in testing by applying multimedia principles to computer-based testing (e.g., Jarodzka et al., 2015; Lindner et al., 2017; Lindner et al., 2016). When solving test questions, testees need to read test item stimuli and item stems, interpret visuals, and type in the correct response. It is assumed that CTML and CLT can inform test design for computer-based test items (Beddow, 2018; Kirschner et al., 2016), but recent studies show that applying CTML principles to testing does not always yield the expected results (e.g., Ögren et al., 2016). One of the reasons given in the literature is that the objectives of the constructed materials in instructional design and assessment are different (Beddow, 2018). Instructional materials are designed to produce instructional outcomes (i.e., acquisition and retention of knowledge) whereas assessment materials are designed to assess them.
The design principle that has gained most attention thus far in research on testing is the basic principle of combining text and pictures (i.e., the 'multimedia effect'). The empirical evidence confirms that adding pictures to test items has a positive effect on performance (see Butcher, 2014 for an overview; Lindner et al., 2017a; Lindner et al., 2016), which is explained as follows: pictures seem to serve as mental scaffolds that support comprehension and decision-making behavior during testing (Lindner et al., 2017a).
However, research also shows that pictures are not always helpful in CBT (Anglin et al., 2004; Jarodzka et al., 2015; Ögren et al., 2016). For instance, in a study by Jarodzka et al. (2015) on standardized testing material, pictures were either presented away from the text (as the testing material was originally designed) or positioned in close proximity to the text (contiguity principle). Analyses of eye-tracking data showed that the optimized, integrated presentation format guided students' attention to inspecting all of the multimedia material. This, however, did not improve testing performance. Post-hoc analyses of the content of the standard testing material revealed that a substantial amount of the provided information (pictorial and textual) in the question was not relevant for solving the test item and thus violated the coherence principle (Mayer, 2014b).
A very likely explanation for these findings can again be found in the learning literature. Here, irrelevant information is known to hamper performance (cf. negative effect of seductive details on learning: Abercrombie, 2013; or coherence principle: Mayer, 2014b). An interesting finding of Jarodzka et al. (2015), though, was that a longer visual inspection of the question and answer itself (as opposed to the picture and explanatory text) was positively related to test performance. Likewise, in the study by Ögren et al. (2016) on vector calculus with and without graphs, no overall multimedia effect was found. In that study, students were presented with items containing a statement about a presented formula. In half of the items, the formula was presented in the test question on the right side of the screen and in the other half of the items, the formula was accompanied by a relevant visual representation of that formula on the right side of the screen. The results showed that students looked proportionally less at the test question containing the written formula when a graph was present and that students experienced more cognitive load in the multimedia condition, as indicated by more silent pauses in thinking aloud. Further analysis revealed, however, again that the more students looked at the question, the better they performed. Thus, it seems that pictures and graphs draw students' attention and cognitive capacities away from the question and, under certain conditions, hinder performance. This could mean that applying one of the multimedia principles in isolation to reduce the extraneous load in testing (e.g., contiguity) is too simplistic and provides a distorted picture.
Instead, it seems more valid to suppose that when adding pictures to test questions, the pictures need to be relevant (i.e., multimedia principle and seductive details), pictures and text should be well integrated (i.e., contiguity principle), and redundant information should be removed (i.e., redundancy principle). However, no research thus far has combined several multimedia principles to optimize authentic computer-based test items. Such an investigation is an important starting point for determining whether the multimedia principles are worthwhile for multimedia assessment, and for more in-depth research and theory building that can inform practitioners.

Visual processing of multimedia material
Prior research (e.g., Ögren et al., 2016) has shown that eye-tracking is a useful technique to gain a better understanding of the effects of design modifications (see Alemdag & Cagiltay, 2018 for a review), as it provides insights into the cognitive and perceptual processes that the design evokes in the testee. Eye-tracking enables researchers to measure processes underlying learning or test performance in a more objective and online manner than, for example, self-reporting (Van Gog & Jarodzka, 2013). Eye-tracking captures two types of measurements: fixations and saccades. Fixations reflect where the learner is attending, whereas saccades show the change in focus of visual attention (cf. Holmqvist et al., 2011). The longer a person fixates on an area of interest (operationalized as, e.g., total fixation duration), the more processing effort this area evokes (Underwood et al., 2004). Besides fixation durations, the number of revisits to certain on-screen areas can also provide useful information, as it can indicate the extent to which the learner or testee engages in integration processes of the different elements (Alemdag & Cagiltay, 2018).
Another measure that is increasingly used to measure cognitive load is dual-task performance (cf. Jarodzka et al., 2015; Yin & Chen, 2007). Under dual-task conditions, performance on the secondary task (i.e., thinking aloud), or on both the primary task (e.g., successful completion of the computer-based test questions) and the secondary task, will suffer when the task itself imposes high levels of cognitive load; this is indicated, for example, by increased silent pauses (Yin & Chen, 2007). Used as such, dual-task performance enables researchers to measure cognitive load in an objective and online way (e.g., Brünken et al., 2003; Ericsson & Simon, 1993; van Gog et al., 2009).

Present study
The present study focuses on the extraneous load type and the principles aiming to reduce extraneous processing by omitting extraneous materials in the task. Learners experience extraneous processing overload when extraneous materials are irrelevant to the learning task but still attract their attention so that they are distracted from essential and generative processing (Mayer, 2009). Coherence, signaling, and spatial and temporal contiguity principles are examples of evidence-informed principles that were shown to be effective in reducing extraneous processing when designing learning materials (Mayer, 2009, 2014b). Similar principles should also be key in designing testing materials (Beddow, 2018). The present study aims to investigate whether design principles that reduce the extraneous load in learning can be transferred to testing situations and will reveal similar positive effects.
The effect of an extraneous load reduction in testing is expected to be similar to the one in learning with multimedia: since extraneous processing is linked to the presentation format, we assume that presenting an assessment task in a more beneficial way by applying the multimedia principles should reduce the extraneous load caused by extraneous material. Reducing extraneous load should free capacities in working memory for essential processing and release cognitive resources for retrieval of prior knowledge from LTM. Given these considerations, reducing extraneous processing in testing situations should result in more correct responses. Based on existing literature we formulated three hypotheses (cf. Lindner et al., 2017):
Hypothesis 1: Performance hypothesis Adapted items will result in lower item difficulty (i.e., higher item solving probability) than original items.
Hypothesis 2: Visual search hypothesis Adapted items lead to longer relative viewing times and more revisits to the question and answer (referred to in this paper as item stem) and shorter relative viewing times and fewer revisits to the explanatory text and/or picture (referred to in this paper as item stimulus) than original items.
Hypothesis 3: Cognitive load hypothesis Adapted items lead to lower cognitive load (i.e., shorter mean fixation durations and fewer silent pauses) than original items.

Participants and design
The sample consisted of thirty-three (44% female) Dutch vocational education students. Four students were excluded from the analyses because of inaccuracy in the eye-tracking data, resulting in valid data of 29 students. Power analysis with G*Power 3.1.9.4 showed that this should be a sufficiently large sample size to detect an effect. The students were in their first or second year of schooling and were between 16 and 28 years old (M = 18.77 years; SD = 3.05). The age range was wide because in vocational education students can drop out of school when they are 18 but return when they are older. All students followed math classes at the minimal level (i.e., 2F) required to manage in society as defined by the Dutch national government (Rijksoverheid, 2018). Participation in the study was voluntary and students were rewarded with a small treat afterwards. For the present study, a within-subject experiment with test item design as an independent factor was conducted. Dependent variables were item difficulty, visual search (fixation duration and revisits), and cognitive load (mean fixation duration and silent pauses).

Testing material
The test used in the present study was an adaptation of an authentic standardized computer-based mathematics test used nationwide in Dutch vocational education. The items were presented to students in a self-paced way. The test used here contained 10 items and was presented to the students in the digital assessment environment FACET 4.0. There were two versions of the test. In version 1, the even items were in an adapted format and the odd items were in the original format whereas, in version 2, the even items retained their original format and the odd items were adapted. Participants were randomly assigned to one of the two test versions. It is important to note that for the current study, we optimized the original items according to those multimedia principles that aim at reducing extraneous cognitive load. This means that, depending on how the original, nationally used items were constructed, several modifications were implemented. For post-hoc analyses, we made three clusters of items. For the first cluster of items (n = 4), no elements were deleted but elements that belonged together were presented close to each other and in a contiguous order (i.e., contiguity principle). For the second cluster (n = 3), some seductive elements were replaced by text and, if the item contained redundant information, that information was removed. After adaptation, the items still contained a visual element that contained necessary information. In addition, information was presented in a more contiguous manner on one side of the screen only (i.e., contiguity, redundancy, and coherence principles). For the third cluster of items (n = 3), all seductive visual elements and redundant information were removed and again information was presented in a more contiguous manner on one side of the screen only (i.e., contiguity, redundancy, and coherence principles). Figure 1 shows an example item of each of the three clusters including the original and adapted item.
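The counterbalancing of formats across the two test versions can be illustrated with a minimal sketch. The function name `item_formats` and the labels "adapted" and "original" are illustrative assumptions and are not taken from FACET 4.0 or from the study materials.

```python
def item_formats(version: int, n_items: int = 10) -> dict[int, str]:
    """Return the format of each item (numbered from 1) for one test version.

    Version 1: even items adapted, odd items original.
    Version 2: even items original, odd items adapted.
    """
    if version not in (1, 2):
        raise ValueError("version must be 1 or 2")
    # Remainder (item number modulo 2) that marks an adapted item in this version.
    adapted_parity = 0 if version == 1 else 1
    return {
        item: "adapted" if item % 2 == adapted_parity else "original"
        for item in range(1, n_items + 1)
    }


# Hypothetical usage: build both versions of the 10-item test.
v1 = item_formats(1)
v2 = item_formats(2)
```

With this scheme every item appears in both formats across the sample, and each participant sees five adapted and five original items.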

Item-difficulty
The computer-based testing environment, FACET 4.0, provided individual performance scores per test item from which we were able to calculate item difficulty (total score on the item/number of participants).
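As a minimal sketch of this calculation, assuming dichotomous scoring (1 = correct, 0 = incorrect), item difficulty is the proportion of participants who solved the item. The response data and item labels below are hypothetical and only serve as an illustration.

```python
def item_difficulty(scores: list[int]) -> float:
    """Total score on the item divided by the number of participants."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Hypothetical responses of four participants to two items (1 = correct).
responses = {
    "item_1": [1, 1, 0, 1],
    "item_2": [0, 1, 0, 0],
}
difficulty = {item: item_difficulty(s) for item, s in responses.items()}
# item_1 -> 0.75, item_2 -> 0.25
```

Note that a higher value means an easier item: the index is the probability of solving the item, so "lower item difficulty" corresponds to a higher proportion.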

Eye-tracking parameters
Relevant eye-tracking parameters were total and mean fixation durations and revisits on relevant on-screen elements. Therefore, these parameters were assigned to certain areas of interest (AOIs). The names of the AOIs were based on Beddow et al. (2009) and were: 'item stimulus' (i.e., explanatory text + visualization) and 'item stem' (i.e., question + answer). In Fig. 2 the AOIs for one item are presented as an example.
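To make the AOI-based parameters concrete, the sketch below derives them from a chronological list of fixations for one item. It is an illustration only, not the computation performed by the analysis software. The input format, the function name `aoi_metrics`, and the definition of a revisit as every entry into an AOI after the first one are assumptions.

```python
from collections import defaultdict


def aoi_metrics(fixations):
    """Summarize fixations per AOI for one item.

    fixations: chronological list of (aoi_name, duration_ms) tuples.
    Returns, per AOI, the total and mean fixation duration, the share of
    the total fixation time on the item, and the number of revisits.
    """
    total = defaultdict(float)   # summed fixation duration per AOI
    count = defaultdict(int)     # number of fixations per AOI
    entries = defaultdict(int)   # number of times gaze entered the AOI
    previous = None
    for aoi, duration in fixations:
        total[aoi] += duration
        count[aoi] += 1
        if aoi != previous:      # gaze moved into this AOI from elsewhere
            entries[aoi] += 1
        previous = aoi
    item_total = sum(total.values())
    return {
        aoi: {
            "total_ms": total[aoi],
            "mean_ms": total[aoi] / count[aoi],
            "relative": total[aoi] / item_total,
            "revisits": entries[aoi] - 1,  # assumption: entries after the first
        }
        for aoi in total
    }


# Hypothetical fixation sequence on one item.
fixations = [
    ("stimulus", 200), ("stimulus", 300),
    ("stem", 250), ("stimulus", 150), ("stem", 100),
]
metrics = aoi_metrics(fixations)
```

In this hypothetical sequence the item stimulus receives 650 ms of the 1000 ms total, and both AOIs are entered twice, which yields one revisit each.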

Cognitive load
There were two measures for cognitive load. Mean fixation duration was calculated as a first measure for cognitive load. Mean fixation duration is the mean duration of all fixations on a certain area of interest during a trial. Higher cognitive load has been shown to be related to increased fixation durations in prior research (Jarodzka et al., 2015; cf. Hyönä, 2010; van Gog et al., 2009). This parameter was measured per item.
Second, silent pauses were used as a speech-based measurement of cognitive load (Yin & Chen, 2007). Participants were trained and instructed in thinking aloud according to Ericsson and Simon (1993) while completing the test questions. They were asked to "verbalize everything that comes to mind, and disregard the experimenter's presence in doing so" (Ericsson & Simon, 1993; van Gog et al., 2005). If they were silent for 15 s, they were reminded to keep thinking aloud (van Gog et al., 2005). With the software Audacity, silent pauses longer than two seconds were identified and their durations registered in Excel (Jarodzka et al., 2015). We counted the frequency of pauses lasting more than two seconds (cf. Ericsson & Simon, 1993).
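The two-second criterion can be sketched as follows. The study identified pauses in Audacity, so the code below is not the authors' procedure. It assumes that speech has already been annotated as (start, end) intervals in seconds; the function name and the example intervals are hypothetical.

```python
def silent_pauses(speech_segments, min_pause_s=2.0):
    """Return the durations (s) of gaps between consecutive speech segments
    that are longer than min_pause_s.

    speech_segments: list of (start_s, end_s) tuples for one item.
    """
    ordered = sorted(speech_segments)
    pauses = []
    # Compare the end of each segment with the start of the next one.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if gap > min_pause_s:
            pauses.append(gap)
    return pauses


# Hypothetical think-aloud annotation for one item.
segments = [(0.0, 3.0), (4.0, 6.5), (9.5, 12.0), (17.0, 18.0)]
pauses = silent_pauses(segments)   # [3.0, 5.0]
pause_count = len(pauses)          # frequency of pauses longer than 2 s
pause_time = sum(pauses)           # summed pause duration in seconds
```

Both the frequency and the summed duration are returned here because the text refers to the number of pauses as well as to their duration.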

Apparatus
Eye movements were recorded using a remote, video-based eye-tracking system (SMI RED; 250 Hz sampling rate). The apparatus was placed in a quiet room in school and students were tested individually. Participants sat approximately 60 cm from the screen.
Before data collection, the system was calibrated using a 5-point pulsing calibration image and subsequent validation. The computer-based test items were presented on a standard 22-inch monitor with a resolution of 1680 × 1050 pixels, using the screen recording function of the software Experiment Center 3.7 from SensoMotoric Instruments (SMI; Teltow, Germany). Mean calibration accuracy was Mx = 0.56 (SD = 0.19) degrees and My = 0.52 (SD = 0.34) degrees of visual angle. Average tracking ratio was 81.1% (SD = 9.29). The data were analyzed with BeGaze version 3.7, from SensoMotoric Instruments (SMI; Teltow, Germany).
The silent pauses were recorded using the microphone of an external Logitech Pro 9000 Business webcam. The camera itself was covered so no video material of the participant was collected.
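To give a sense of what the reported calibration accuracy means on screen, degrees of visual angle can be converted to pixels from the viewing distance and the monitor geometry. The sketch below assumes square pixels, a visible diagonal of exactly 22 inches, and a viewing distance of exactly 60 cm; the function name is illustrative.

```python
import math


def visual_angle_to_pixels(angle_deg, distance_cm, diagonal_inch, res_x, res_y):
    """Convert a visual angle (degrees) to an on-screen extent in pixels."""
    # Extent on the screen subtended by the angle at the given distance.
    size_cm = 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)
    # Pixel density derived from the diagonal (assumes square pixels).
    diagonal_px = math.hypot(res_x, res_y)
    px_per_cm = diagonal_px / (diagonal_inch * 2.54)
    return size_cm * px_per_cm


# Reported mean calibration accuracy in x and y, under the stated assumptions.
offset_x_px = visual_angle_to_pixels(0.56, 60, 22, 1680, 1050)
offset_y_px = visual_angle_to_pixels(0.52, 60, 22, 1680, 1050)
```

Under these assumptions the mean calibration offsets correspond to roughly 20 pixels in each direction, which is relevant when AOIs are drawn close to each other.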

Procedure
Students were tested in a quiet room at their school in individual sessions. After signing informed consent, students were instructed to think aloud using an example item (i.e., "please think out loud while solving 20 × 11"; Ericsson & Simon, 1993). Next, the eye-tracking equipment was calibrated, and after successful calibration students were once more reminded to think aloud and the CBT began. Students worked at their own pace but had a maximum of 30 min to complete the test. Students were not allowed to use their calculators. During test completion, students were reminded to keep on thinking aloud when pauses of 15 s occurred. After the test was completed, students were rewarded with a small treat (i.e., chocolate bar).

Data analyses
The test scores were transferred to SPSS version 24 and, in accordance with classical test theory, we calculated mean item difficulty as the proportion (p) of participants who got the item correct. A one-way repeated measures ANOVA with 'format' as within-subject variable and 'difficulty' as the dependent variable was used to analyze the results. For the AOIs, relative fixation duration and revisits were calculated as indicators for visual search. Relative fixation duration was calculated as total fixation duration on the AOI/total fixation time on the item. We used MANOVA to analyze shifts in attention allocation between adapted and original items on the different AOIs. A one-way repeated measures ANOVA with the within-subject factor 'format' and the dependent variable 'mean fixation duration' or 'mean duration of silent pauses' was calculated to test hypothesis three. In addition, we ran correlation analyses between item difficulty and mean fixation duration, and between item difficulty and silent pauses.
Table 1 shows the means and standard deviations for all outcome measures under investigation in this study. The results will be presented per hypothesis below.

Performance
The overall item difficulty ranged between Pmin = 0.45 and Pmax = 0.82 with a mean item difficulty of M = 0.66 (SD = 0.28). Comparing items in the original and adapted format, we found a significant difference in item difficulty (F(1, 28) = 10.48, p = 0.03, η² = 0.27). Items in the original format were significantly more difficult, as indicated by a lower average probability score (M = 0.58; SD = 0.32), than items in the adapted format (M = 0.74; SD = 0.24).
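For readers who want to relate the F tests to their effect sizes, the sketch below may help. With only two levels of the within-subject factor, a one-way repeated measures ANOVA is equivalent to a paired t-test (F = t²). The effect size is assumed here to be partial eta squared, which is what SPSS reports for repeated measures designs; the text does not state this explicitly. The participant scores are hypothetical and the function names are illustrative.

```python
import math
from statistics import mean, stdev


def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f * df_effect) / (f * df_effect + df_error)


def paired_f_test(original, adapted):
    """F, degrees of freedom, and partial eta squared for a two-level
    within-subject factor, computed via the paired t-test (F = t**2)."""
    if len(original) != len(adapted):
        raise ValueError("both conditions need one score per participant")
    diffs = [a - o for o, a in zip(original, adapted)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    f = t ** 2
    df_error = n - 1
    return f, (1, df_error), partial_eta_squared(f, 1, df_error)


# Hypothetical proportion-correct scores of five participants per format.
original = [0.4, 0.6, 0.6, 0.8, 0.4]
adapted = [0.6, 0.8, 0.6, 1.0, 0.8]
f_value, dfs, eta_p2 = paired_f_test(original, adapted)

# Effect size implied by a reported test, e.g. F(1, 28) = 10.48.
reported_eta = round(partial_eta_squared(10.48, 1, 28), 2)
```

Applied to F(1, 28) = 10.48, the formula reproduces the reported effect size of 0.27, which supports reading the reported values as partial eta squared.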

Fixation duration
Before we tested our hypotheses, we first checked for differences in total time spent on the adapted and the original items. A one-way repeated-measures ANOVA showed a significant difference between the time spent on the two item formats (F(1, 28) = 5.40, p = 0.03, η² = 0.16).
We then examined whether the relative time spent on the areas of interest differed between the adapted and original items. A repeated measures MANOVA with the within-subject factor 'format' and the dependent variable 'relative fixation duration' was calculated. Results show a main effect of format on fixation duration (F(2, 27) = 15.18, p < 0.001, η² = 0.53). Univariate tests show that format did lead to different viewing times on both the 'item stimulus' and 'item stem'. Adapted formatting led to longer viewing times for the AOI 'item stem' (F(1, 28) = 28.99, p < 0.001, η² = 0.51) and shorter viewing times for the AOI 'item stimulus' (F(1, 28) = 14.73, p < 0.001, η² = 0.35). The hypothesis that the adapted items would lead to longer fixation durations on the item stem and shorter fixation durations on the item stimulus than original items is hereby confirmed.

Revisits
A repeated measures MANOVA with the within-subject factor 'format' and the dependent variable 'number of revisits per AOI' was calculated. Results show a significant effect of format (F(2, 27) = 31.26, p < 0.001, η² = 0.70). Univariate tests show that the number of revisits to the 'item stem' AOIs of the adapted items is significantly greater than the number of revisits to the 'item stem' AOIs of the originally formatted items, F(1, 28) = 38.18, p < 0.001, η² = 0.58. Moreover, in the adapted format there are significantly fewer revisits to the 'item stimulus' as compared to the originally formatted items, F(1, 28) = 9.69, p < 0.001, η² = 0.26. It thus seems that adapting the items helped students in their visual search in the sense that it helped them to focus more on the 'item stem' (question and answer) and be less distracted by the 'item stimulus' (picture and explanatory information). The hypothesis that the adapted format leads to fewer revisits to the explanatory text and picture but more revisits to the question and answer was confirmed.

Cognitive load
Pearson correlation analysis showed a significant correlation only between duration of silent pauses and item difficulty for original items (p < 0.002). There were no significant correlations between duration of silent pauses and item difficulty for the adapted items, nor between mean fixation duration and item difficulty for either original or adapted items.
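As a minimal sketch of such a correlation analysis, the Pearson coefficient can be computed directly from paired observations. The unit of analysis is not specified in the text, so the example below assumes one pair of values per item; all numbers are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical per-item values: silent pause duration (s) and item difficulty (p).
pause_duration = [20.0, 25.0, 30.0, 35.0, 40.0]
difficulty = [0.9, 0.8, 0.6, 0.5, 0.3]
r = pearson_r(pause_duration, difficulty)
```

In this made-up example the coefficient is strongly negative: items with longer silent pauses have a lower proportion of correct answers.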

Mean fixation duration
A one-way repeated measures ANOVA with the within-subject factor 'format' and the dependent variable 'mean fixation duration on the item' was calculated. Results showed no significant effect of format on mean fixation duration (F(1, 28) = 0.27, p = 0.61, η² = 0.01). The hypothesis that the adapted items would lead to lower cognitive load, as indicated by mean fixation duration, was not confirmed.

Silent pauses
A one-way repeated measures ANOVA with 'format' as within-subject variable and 'duration of silent pauses' as dependent variable shows a significant difference in average silent pauses between the two formats with a large effect size (F(1, 28) = 15.92, p < 0.001, η² = 0.36). The mean duration of the silent pauses (in seconds) is larger for the originally formatted items (M = 33.29; SD = 15.78) than for the adapted items (M = 25.99; SD = 10.60). The hypothesis that the adapted items lead to lower cognitive load than original items, as indicated by shorter silent pauses, is hereby confirmed.

Discussion
Computer-based testing provides the opportunity to include multimedia material (i.e., pictures, audio, etc.). However, the way such items are designed very much affects how students perform on them (e.g., Ögren et al., 2016). In this study, we investigated the effects of adapting computer-based test items according to the principles of multimedia learning on performance (i.e., operationalized as mean item difficulty) and used process measures to better understand how the design affects performance. For that purpose, items from a standardized computer-based test were converted. Based on prior research, we expected that adapted items would be less difficult than the original items (cf. Lindner et al., 2017a; performance hypothesis). From a process perspective, we furthermore expected that adapting items would lead to more beneficial visual processing (cf. Jarodzka et al., 2015; Ögren et al., 2016; visual search hypothesis) and lower cognitive load (i.e., shorter fixation durations and fewer silent pauses) (cf. CTML; cognitive load hypothesis).
The results confirmed our performance hypothesis. Adapting items according to the CTML and CLT lowered item difficulty; that is, items in the adapted format were easier for students. This result is in line with earlier studies (Lindner et al., 2016; Lindner, 2017).
Second, in accordance with the visual search hypothesis, the results showed that adapting items according to the principles of CTML led to longer viewing times on the question and answer but shorter viewing times on the pictures and explanatory information. These results suggest that applying multimedia principles to test items leads to a more balanced viewing behavior in which students pay more attention to the question and answer sections and are less distracted by visual elements. In addition, in the adapted items there were fewer revisits to pictures and explanatory information and more revisits to the question and answer parts of the items, indicating more integration between the different elements.
Finally, the results partly confirmed our cognitive load hypothesis. Adapting items according to the principles of CTML and CLT did not result in lower overall cognitive load as indicated by shorter mean fixation durations, but it did result in shorter silent pauses. Prior research has shown that more objective measures such as fixation durations and dual-task performance are not always useful for measuring extraneous load (van Gog et al., 2009). The stimuli used here contained a substantial amount of text, which might have affected the results (Rayner, 2009). In hindsight, fixation duration might not have been a sufficiently sensitive measure for such text-rich stimuli. For future research, different measures such as the mental effort rating scale (Paas, 1992) might be more useful.
Taking these results together, adapting multimedia computer-based items to conform to the multimedia principles helps students to process and comprehend the test questions and, in turn, to answer them correctly. The process data indicate that this is probably because there is more attention to the question and answer and more integration of the different elements (i.e., question, explanatory text, answer, picture). Previous research has shown that students often do not read the question text carefully, which probably results in them making mistakes when solving the test item (e.g., Bully & Valencia, 2002). Our findings are in line with such previous findings. Eliminating decorative pictures and redundant information and integrating text and picture material seems to draw students' attention towards the question and answer (see also Jarodzka et al., 2015) and makes it easier to process the contextual information, resulting in lower item difficulty (i.e., better performance). One possible explanation is that students understand the questions better because they read them more carefully and integrate the information better, but more research is needed to explain these effects properly.
Post-hoc analyses, however, also revealed that items from which all visual elements were removed did not benefit from the application of the multimedia principles. Students thus seem to benefit when (relevant) visual elements are used in test items in the right way (i.e., the multimedia effect), possibly because such elements increase comprehension by compensating for low reading ability (Wallen et al., 2005) and/or enhance student interest in and attention for the item (Wang & Adesope, 2014). Future research should look more closely into the effects of visualizations on item validity, difficulty, and processing.

Limitations
In the present study, we used a standardized test and adapted its items in such a way that they were in line with the principles in CTML and CLT. Our experimental approach of using authentic tests has the strong advantage that our findings are ecologically valid and thus highly relevant for educational practice. On the downside, this study did not examine how each single multimedia guideline affects the performance on and processing of computer-based test items when used in the absence of other multimedia guidelines, nor did it examine and compare the relative impact of each guideline when used in conjunction with other guidelines. More research is thus needed to investigate under which conditions applying multimedia guidelines benefits CBT and under which conditions it hampers CBT. Related to this issue is the question of how the multimedia design affects test performance. It can affect students' understanding of the test questions, or it can provide better retrieval cues in, for example, the answer options (Kirschner et al., 2016). These are two different research lines which might be combined in future research.
A second limitation of our study is that we focused on one domain only, namely mathematics, and that the sample had generally low verbal abilities. Had we used another domain such as language learning and/or conducted our study among a different sample, our results might have been different, especially since in some domains it is rather important that students can select the right or useful information from a wider set of (irrelevant) information, which shifts the purpose of the assessment. For future research, it is thus important to investigate the design principles for testing in different domains, among different populations, and for different test goals.
Third, we used eye-tracking and silent pauses to unravel cognitive processes. Both methods have their drawbacks. For instance, dual-task performance may slow down primary task performance, put additional load on participants, or be incomplete (in particular for highly visual processes). Still, dual-task performance has been a well-established method to study cognitive processes ever since the widely cited work of Ericsson and Simon (1993). Likewise, eye-tracking has its own disadvantages: it can only tell us where someone looked, but not why. Still, it allows us to see which information enters the cognitive system, in which order, and which elements are attended to for how long. In that way, both methods complement each other very well to arrive at a more complete picture of the cognitive processes at hand (e.g., Helle, 2017).

Conclusion and educational relevance
This study is one of the few studies investigating the effect of applying several multimedia principles to testing. Despite the limitations, the current study shows promising results of applying multimedia principles of learning to testing. In general, the results show that the central idea underlying CTML and CLT is also applicable to CBT (Kirschner et al., 2016). More specifically, the results seem to indicate that when designing computer-based tests, it is important to reduce extraneous load by integrating text and pictures and deleting seductive elements as much as possible. However, post-hoc analyses also showed that items from which relevant visualizations were removed did not benefit from the redesign.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval It is also approved by our own university's ethical committee (reference number: U2017/03387/FRO).

Informed consent Informed consent was obtained from all individual participants included in the study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.