Abstract
Accuracy in estimating knowledge with multiple-choice quizzes largely depends on the distractor discrepancy. The order and duration of distractor views provide significant information for itemizing knowledge estimates and detecting cheating. To date, a precise and accurate method for segmenting the time spent on a single quiz item has not been developed. This work proposes process mining tools for test-taking strategy classification by extracting informative trajectories of interaction with quiz elements. The efficiency of the method was verified in a real learning environment where difficult knowledge test items were mixed with simple control items. The proposed method can be used to segment the quiz-related thinking process for detailed knowledge examination.
Introduction
Test-taking activity is decisive in the modern computerized educational process. Beyond the basic knowledge of facts, the modern industry of achievement testing considers the speed of fact retrieval, confidence, and thinking skills. A variety of measured constructs are defined by specially crafted test items. The function of a quiz item can vary across a wide range—from the direct retrieval of a ready answer from memory (e.g., a knowledge test) to the formal submission of an answer obtained as a result of effortful offline activity (e.g., an engineering task) (Thompson and O’Loughlin, 2015). The nature of a quiz is nondeterministic by design. In theory, quiz-taking assumes alternation between perceiving the ongoing quiz item and deciding on it (Tversky et al., 2008). In real life, however, the time spent answering a question is occupied with various loosely arranged cognitive operations, some close to and some far from the problem being solved. If we want not only to measure someone’s ability to answer correctly, but also to understand why they fail, we should concentrate on the patterns of those cognitive operations, which can serve as fingerprints of the inner psychological state of a student in a critical, controlled situation.
In this paper, we consider quizzes which consist of multiple-choice questions (MCQs) (Haladyna and Rodriguez, 2013). Important characteristics of a student's behavior during a single test episode are the order and duration of the elementary operations that constitute the test-taking behavior. Since the focus of a person's attention is limited, a quantum of information is perceived per unit of time. The quanta of information in traditional MCQs are the text of the question, which may include additional images, and the text of the proposed response options. A standard quiz-taking sequence consists of (1) reading the question text, (2) forming a set for perception of response options, and (3) sequential perception of response options, with a final choice of one of them. Chronometry of these operations allows us to estimate the optimality of students’ behavior at the level of elementary cognitive operations.
Technically, the selection of elementary operations is possible on the basis of eye fixation metrics, which requires a testing environment with a calibrated eye tracker (Ben Khedher et al., 2017; Celiktutan and Demiris, 2018; Spiller et al., 2021). However, special software that can reliably map eye fixations onto positioned quiz item elements has yet to be developed, and eye trackers are not widely enough available for massive online testing (Hutt et al., 2021).
We have proposed a method for strict isolation of these elementary operations (Sherbina, 2015), in which the text of the question and of the response options is covered by opaque screens that become transparent when the mouse cursor hovers over them. This means that only one text element can be viewed at a time. The method allows us to measure the viewing time of each response option and the pattern of transitions between them.
For many years, testing specialists ignored the possibility of delving into the granularity of item responses, probably assuming that quiz behavior follows a single standard pattern.
In fact, students’ strategies for passing a test depend on their individual characteristics and level of knowledge (McCoubrie, 2004). The divergence of solution and guessing behaviors has theoretical justification based on the reaction time (Wise and DeMars, 2006; Goegebeur et al., 2008; Wang and Xu, 2015). However, the total solution time parameter does not allow one to reliably statistically separate groups of honest test-takers from simulators (Berinsky et al., 2014). We suggest that different quiz-taking strategies are reflected in the trajectory of the focus of attention moving across the components of the test item.
Empirical data from real quiz tasks have revealed that there are various non-optimal deviations from standard trajectory (Sherbina, 2016). Nonlinear learned combinations of the seven most informative metrics enabled reliable classification of quiz-taking strategies into five classes: confident knowledge, uncertain knowledge, partial knowledge (attempts to guess by matching the question and response options), random guessing, and cheating (copying from an external source).
Figure 1 shows some informative metrics depicted on the chronogram of one correct answer on an MCQ test on biostatistics.
Fig. 1 Example chronogram of text element views during correct quiz item solving. Codes on the vertical axis: t – text of the question, a1–a4 – response options. On the horizontal axis – time from the beginning of page load, s. Vertical gray line – the end of page load. Vertical black line – navigation to the next page. Metrics: d – duration of task solution, s; tstart – time of the first user event, s; tdecide – time up to the last click, s; dcheck – duration of re-checking activity
Uncertain knowledge manifests as a number of extra response option views, which correspond to comparison operations. Low confidence in the chosen response option is also reflected in additional checking and re-checking views of options that are alternatives to the already chosen answer.
One notable form of non-optimal testing behavior is stereotyped action: for example, reading the remaining response options even after the correct answer has already been chosen.
To understand the conditions that provoke such behavior patterns, we applied chronometry to difficult and simple quiz items. We hypothesized that difficult items would provoke extra re-checking activity, while simple items with overt answers would contribute to the gradual exhaustion of such activity. In terms of strategies, we expected that students would more often choose the confident response strategy for simple tasks and the guessing strategy for difficult tasks.
In this study, we had no ground truth data on cognitive operations (for example, in terms of brain activity) for specially validated items. However, we designed a set of items with certain properties which we expected would bend participants’ course of action in one direction or another. We employed item response theory (IRT) to confirm the empirical difficulty of the quiz items we designed. To describe the principal trajectories of weakly structured activity during quiz item solving, we used process mining tools (Bannert et al., 2014).
The objectives of this study were as follows:
1. Identify the peculiarities of behavior patterns when solving tasks of different complexity.
2. Develop criteria for predicting subjective test item difficulty by chronometric features.
Materials and Methods
Ethics approval for the study was granted by the University Research Ethics Committee.
Participants and data collection
A total of 112 students participated in the study. Participants were second-year university students in the biology department whose aim was to pass the quiz and receive credit. The quiz was prepared and presented to participants by means of the university educational portal built within the Moodle framework (https://moodle.org).
Procedure
The testing procedure was performed under the default settings of the Moodle course management system (CMS). Quiz items were presented one per page. To move on to the next task, students pressed the “Next” button. The quiz interface included a navigation bar to see how many questions were already completed and how many were still to be completed. No time limit was placed on the administration of items. At the end of the test, participants could view the number of points scored and the content of questions marked with correct and incorrect answers.
Stimuli content
The quiz consisted of 60 items. Items were derived from categories (as the difficulty increased) as follows:
- four questions on general knowledge easily extracted from memory, with a choice from four suggested answers (tagged “scala”);
- 46 questions on simple cognitive operations with numbers (tagged “simple”), including:
  - 38 questions with a choice from four suggested answers;
  - eight questions with checking several (2–4) correct answers out of six suggested ones;
- ten questions on skills for programming in Python language from the bank of questions of the data processing course passed in the previous semester (tagged “ipython”).
Items were sequenced in pseudo-random order with fixed positions for the categories. Hard “ipython” questions were randomly selected from the question bank and presented at positions [1, 4, 10, 28, 32, 39, 44, 49, 52, 57], with gaps of 3–18 items in between; very simple “scala” questions were presented at positions [7, 19, 42, 54]; and questions from the most numerous category, “simple”, filled the remaining positions.
Tasks in the “scala” category were used to obtain a reference strategy for solving tasks. They were presented at fixed positions during the test to control the transition to the guessing strategy. These tasks had a simple unambiguous answer, so an error in the solution would indicate an inadequate attitude toward the test; thus these tasks served as the “lie scale” in psychological tests. The constant order of distractors during the presentation of these items enabled reliable mining of the standard trajectory.
Tasks in the “ipython” category were relatively difficult. To successfully answer all these questions, a student had to have learned several principles of working with numeric arrays and basic commands for operations with them. Some of the tasks required students to solve a few logical steps in their mind, so for some students a paper drawing for assistance might be required. Distractors were crafted in such a way that students were unlikely to guess the correct answer without understanding the question. When creating questions for the training course, the authors sought to maximize the discriminatory ability of test tasks to ensure that the tasks were successfully solved only by a part of the student group.
Tasks in the “simple” category assumed a guaranteed solution for anyone with middle school education. To avoid dependence on the individual thesaurus, we used numerical abstractions to induce various cognitive operations. The assumption was that all students have concepts about integers/fractions, positive/negative, and even/odd numbers. The use of numbers reduced the assortment of response options by using the same options in different contexts to provoke different cognitive operations. For example, the tasks “How much is minus one multiplied by two?” and “Which number is greater than seven?” (in Russian in real tasks) were presented with the same set of options [11, 5, −2, 0.989].
Behavior data
The elementary events during testing were recorded with a specially crafted JavaScript extension to the standard quiz system implemented in the Moodle CMS. A web page generated for a given task included hypertext elements with the question text and text labels for 4-6 answer options. If only one answer could be selected, then the response options were accompanied by round radio input indicators. If several answers were correct, then the response options were accompanied by square checkbox input indicators. When a test item was presented on the screen, the item’s text elements (question and answer options) were covered with positioned opaque gray rectangles that became transparent if the mouse cursor hovered over them. When the mouse cursor hovered over the area occupied by the kth element, its text became visible and the start time of a view of the text element was registered. The text remained visible and therefore readable until the mouse cursor moved out of the kth text element and the text was again covered by the opaque rectangle. This modification of the standard quiz interface allowed the start and end times to be registered for every text element view. The time codes for events were collected along with auxiliary metadata and kept on a separate custom web server.
The raw data for every item response consisted of user, quiz, and question identifiers, time codes of views, and mouse clicks.
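As an illustration, the hover-in/hover-out time codes can be aggregated into per-element view durations with a few lines of Python. This is a minimal sketch: the field names below are hypothetical, and the actual log format is documented in the OSF repository.

```python
from collections import defaultdict

def view_durations(events):
    """Aggregate hover-in/hover-out event pairs into per-element view durations.

    `events` is assumed to be a list of dicts with hypothetical keys
    'element' (e.g. 't', 'a1'..'a4'), 'action' ('in' or 'out'), and
    'time' (seconds from the start of page load).
    """
    open_views = {}                  # element -> time the current view started
    durations = defaultdict(float)   # element -> accumulated viewing time, s
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["action"] == "in":
            open_views[ev["element"]] = ev["time"]
        elif ev["action"] == "out" and ev["element"] in open_views:
            durations[ev["element"]] += ev["time"] - open_views.pop(ev["element"])
    return dict(durations)

# Example trace: question read for ~5 s, then two options viewed once each
trace = [
    {"element": "t",  "action": "in",  "time": 1.0},
    {"element": "t",  "action": "out", "time": 6.2},
    {"element": "a1", "action": "in",  "time": 6.5},
    {"element": "a1", "action": "out", "time": 7.4},
    {"element": "a2", "action": "in",  "time": 7.6},
    {"element": "a2", "action": "out", "time": 9.1},
]
print(view_durations(trace))  # {'t': 5.2, 'a1': ~0.9, 'a2': ~1.5}
```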
We calculated the speed and accuracy metrics of user activity (Table 1).
The success of the task q was assessed as 1.0 if the selected answer matched the correct one, and 0.0 if they did not match. In tasks with 2–4 correct answers, a partially correct answer was rated 0.5.
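A sketch of this scoring rule is given below. The handling of the partial-credit case (in particular, whether checking an extra wrong option still earns 0.5) is one plausible interpretation, not a statement of the exact rule used.

```python
def item_score(selected, correct):
    """Score a response: 1.0 for an exact match with the correct option set,
    0.5 for a partially correct multiple-answer response (assumption: some,
    but not all, correct options checked and no wrong ones), 0.0 otherwise.
    `selected` and `correct` are iterables of option labels.
    """
    selected, correct = set(selected), set(correct)
    if selected == correct:
        return 1.0
    if len(correct) > 1 and selected & correct and not (selected - correct):
        return 0.5
    return 0.0
```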
Unanswered questions (viewed only or skipped) were excluded from the analysis. Several students made more than one attempt. The final data set was gathered from 134 quiz attempts.
Overall, 7766 behavioral traces of solutions of 62 different questions were analyzed. The content of the “scala” and “simple” questions was the same in all sessions, but the content of two of the “ipython” questions changed between student groups across different semesters. In total, behavioral data for 14 distinct “ipython” questions were collected, with 22 to 125 repeats (median 100, mean 87.2). The number of behavioral traces for tasks in the “scala” and “simple” categories varied from 117 to 143 (median 133).
Statistical analyses
Item response metrics were grouped by class. The percentage of correct answers was calculated to assess the quality of the test. Difficulty metrics controlled for student performance were obtained using the two-parameter logistic (2PL) item response model. The details of the model are described in Appendix A. The relationship between indicators was estimated using Spearman's correlation coefficient. The influence of factors on decision time was evaluated using repeated-measures analysis of variance. Decision tree regression was conducted using the scikit-learn package (Pedregosa et al., 2011). To analyze the behavior traces of task solutions, we used tools developed within the process mining framework (van der Aalst, 2016).
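For reference, a minimal sketch of the correlation step with SciPy, using synthetic data standing in for the Table 1 metrics:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
duration = rng.lognormal(mean=2.5, sigma=0.6, size=60)            # solution duration, s
success = np.clip(1.1 - 0.01 * duration + rng.normal(0, 0.05, size=60), 0, 1)

rho, p = spearmanr(duration, success)   # rank correlation between speed and accuracy
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")
```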
Results
Estimation of the objective difficulty of the task
According to the original hypothesis of the study, items in the “simple” category assumed the choice of a confident response strategy, and the harder MCQs in the “ipython” category a partial-response or guessing strategy. On average, this assumption was confirmed: the mean success rate of solutions in the “simple” category was 0.9304, and in the “ipython” category 0.4723. The deviation of these means from 1.0 and 0.25, respectively, suggests that some subjects made mistakes in easy tasks, while some coped with hard ones.
The characteristics of speed (duration of task solution, s) and accuracy (mean success rate) were related to each other (r = −0.85, p < 0.001).
The relation of the solution duration with accuracy can be detailed with SATF (speed–accuracy trade-off function) (Heitz, 2014). However, a negative SATF slope is not typical: most operator tasks with supra-threshold stimuli are characterized by a positive SATF slope, when the acceleration of reactions (haste) leads to a deterioration in quality. The concept of speed–accuracy trade-off (SAT) describes the complex relationship between an individual's desire to react slowly with fewer mistakes and a desire to react quickly despite mistakes (Zimmerman, 2011).
To check whether the relative acceleration of responses affected the quality of the solution, we calculated conditional accuracy functions (CAF) for each question. To do this, the set of solutions for each question was divided by the duration of the solution into four quartile classes, and the averaged metrics in each class were connected by lines on the diagram (Fig. 2).
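A possible implementation of this quartile-based CAF computation with pandas, assuming a flat table of responses with hypothetical column names:

```python
import pandas as pd

def conditional_accuracy(df):
    """Conditional accuracy function per question.

    `df` is assumed to have columns 'question', 'duration' (s), and
    'score' (0.0-1.0). For each question, responses are split into four
    duration quartiles, and the mean duration and score per quartile are
    returned, ready to be connected by lines on a diagram.
    """
    df = df.copy()
    df["quartile"] = df.groupby("question")["duration"].transform(
        lambda d: pd.qcut(d, 4, labels=False, duplicates="drop")
    )
    return (df.groupby(["question", "quartile"])[["duration", "score"]]
              .mean()
              .reset_index())
```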
For easy tasks (with quality > 0.8) there was no clear dependence of accuracy on duration. For tasks whose longest-duration quartile exceeded 25 s, accuracy in that quartile was paradoxically lower. For hard tasks, which produced a large number of errors, the CAF demonstrated a clear positive relationship between success rate and solution duration. This supports a speed–accuracy trade-off for hard-task solving.
Two hard tasks were from the “simple” category. Because of the wording of the question text, some students misinterpreted these two items and quickly chose a distractor, while other students took a longer time to solve the tasks accurately. Empirically, we discovered that the subjective difficulty of tasks in the “simple” category varied from easy to hard, but the difficulty of misinterpreted simple tasks is not the same as the difficulty of genuinely hard tasks, which require more time to solve. Ideally, we wanted a quantitative metric of difficulty that allows us to state how much harder one task is than another.
Coefficients of task difficulty were obtained with the 2PL IRT models for 54 tasks with a choice of one of four suggested options. The choice of an answer from four options allowed us to fix the third and fourth parameters of logistic curves at 0.25 (chance level choice) and 0.99 (near perfect correctness). To meet the completeness requirements of the model, we used data for the 87 users who had trials for all 54 tasks. Characteristic curves for tasks of different categories (Fig. 3) revealed low discriminability for one task in the “simple” category (correct answer was not related to the student’s performance) and one very difficult task in the “simple” category (almost all students chose an incorrect distractor). These were the same tricky questions.
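In this parameterization (lower and upper asymptotes fixed, as described above), the probability that student j with ability θ_j answers item i correctly can be restated in standard notation as

P(X_ij = 1 | θ_j) = c + (d − c) / (1 + exp(−a_i(θ_j − b_i))), with c = 0.25 and d = 0.99,

where a_i is the discrimination and b_i the difficulty of item i; only a_i and b_i are estimated, hence the 2PL label.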
IRT modeling confirmed that the difficulty of “scala” and “simple” tasks is lower than average (except two), and the difficulty of “ipython” tasks is higher than average (except two). Incorrect answers on some of the “simple” tasks do not matter because they were not related to strategy change: students confidently chose a distractor as though it were the correct answer. The same situation exists with hard tasks: if we suppose an unconfident guessing strategy with hard tasks, then it does not matter whether a correct answer or incorrect distractor was chosen by chance.
Patterns of quiz item solving
The improvements to the test procedure made it possible to register the detailed dynamics of elementary operations during the execution of a test task. We converted the log file of user actions into the XES format used in the ProM application (http://www.promtools.org/). The combination of attempt ID and position in the task sequence was used as the identifier of an individual item response. Each trace could include seven types of operations (actions) (Table 2).
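The paper only states that the log was converted to XES; as one possible route, the conversion can be scripted with the pm4py library. The three-event trace and column names below are hypothetical placeholders for the real export described in the repository.

```python
import pandas as pd
import pm4py

# Hypothetical flat log of user actions for one item response
log = pd.DataFrame({
    "attempt_position": ["134-07", "134-07", "134-07"],   # attempt ID + item position
    "action":           ["question", "var", "click"],
    "timestamp":        pd.to_datetime(
        ["2021-03-01 10:00:00", "2021-03-01 10:00:05", "2021-03-01 10:00:09"]),
})

# Map the columns onto the case / activity / timestamp attributes expected by XES
log = pm4py.format_dataframe(
    log,
    case_id="attempt_position",
    activity_key="action",
    timestamp_key="timestamp",
)
event_log = pm4py.convert_to_event_log(log)
pm4py.write_xes(event_log, "quiz_traces.xes")   # importable into ProM
```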
Visualization with a classic Petri net, which displays all the transition options between action codes, revealed a vast variety of trajectories. The Petri net for a set of traces in the “simple” category was an example of a “spaghetti” process, when the order of operations is weakly stratified and various transitions between operations are possible.
To detect the prevailing transitions between operations of different types, we used directly-follows graphs (DFG), which are built based on the α-algorithm (van der Aalst et al., 2004).
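The core of a DFG is a count of how often one action code directly follows another within a trace. A minimal sketch, with action codes approximating those in Table 2:

```python
from collections import Counter

def directly_follows(traces):
    """Count directly-follows transitions between action codes.

    `traces` is a list of action sequences; returns a Counter mapping
    (a, b) -> number of times action b directly followed action a,
    i.e. the edge weights of a directly-follows graph.
    """
    dfg = Counter()
    for trace in traces:
        dfg.update(zip(trace, trace[1:]))
    return dfg

traces = [
    ["question", "var", "target", "click", "next"],
    ["question", "var", "var", "target", "click", "question", "next"],
]
for (a, b), n in directly_follows(traces).most_common():
    print(f"{a} -> {b}: {n}")
```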
On the graph model of the process of completing tasks from the “scala” category, when the correct response option was always in the second position, the typical trajectory was easily traced by the thickest arrows (Fig. 4a). The trajectory shown explained 95% of the behavior traces in the sample. The overall trajectory of 460 traces was fairly straightforward except for the cyclical section around the “var” operation. Keep in mind that “var” operations are views of all non-target response options, which in typical tasks with four response variants are viewed three times more than the target response option under conditions of random transitions. It is obvious that in all cases, after reading the question, participants went to view the first wrong option, then the correct one, then in more than half of the cases (249 out of 460) viewed other distractors, although the answer was obvious, then returned to the correct answer and made a choice (“click”). It is worth noting that in a quarter of cases (113 out of 460), students repeatedly returned to reading the question—24 of them after having chosen the answer (arrow from a “click” event). This is a pattern of self-checking: rereading the question after having already made a decision to make sure that the task was solved correctly.
Fig. 4 Process models for the task solution traces. a “scala” category – assumes ready answer. b “simple” category – assumes 2–4 mental operations. c “ipython” category – typical academic test of knowledge. d 80% of cheating task solutions with switches to external sources (“out” event). Directly-follows graphs for the majority of typical traces in a given category are shown. The numbers next to the arrows represent the frequency of transitions; the thickness and color intensity of the arrows are relative statistical characteristics. Layouts were automatically optimized
In the graph for a more representative sample of traces in the “simple” category (Fig. 4b), we see a similar pattern with similar relationships between transition frequencies. The differences relate to the pattern of self-checking: while every tenth trace in the “scala” category had distractor views after the decision click, here we see self-checking in more than half of the cases (transitions from “click” to “var” and “target”).
The graph for the task solution process in the “ipython” category (Fig. 4c) was characterized by the absence of a single typical trajectory. After participants had read the question, there were transitions in all directions. The number of transitions between response options (from “var” to “var” and from “var” to “target”) was seven times the number of traces in the sample; students therefore read and compared response options again and again. There were also switches out of the testing environment after reading the question, which were related to searching for answers in external sources. Self-checking was initiated in about a third of cases.
To clarify the trajectory for cases when students accessed external sources, a model was built specifically for this subsample (Fig. 4d). Whereas the previous graph, which described 95% of the most typical paths, showed “out” actions occurring only after reading the question, this graph, which describes 80% of traces with switching out to other windows, shows many such transitions after viewing response options. “Out” actions helped students answer correctly: whereas clicks on distractors prevailed in the overall sample of hard “ipython” traces (568 vs. 507 transitions, Fig. 4c), target clicks were about three times as frequent as distractor clicks in the cheaters’ traces subsampled here (149 vs. 58 transitions, Fig. 4d). Despite probable help from external sources, the self-checking pattern was evident in about a third of cases (51 out of 172).
Thus, reducing the sample to a more uniform one made it easier to assess the trajectory of processes.
The trajectory analysis revealed that the self-checking pattern (superfluous question and response option views after the decisive click) is present in all types of task solutions regardless of difficulty. The self-checking pattern can be used to assess confidence in the choice made.
Decisions on questions from the “ipython” category were characterized by a high percentage of errors and long solution durations. Since the tasks did not require long calculations and assumed a quick choice of an answer retrieved from memory, the increase in solution success with increasing duration (a typical SAT pattern) indicated a time-consuming search for the answer in external sources. Of the 999 test task traces in the “ipython” category, only 13.2% of successful responses were accompanied by switching out of the browser tab with the test and back; 7.6% of incorrect answers were accompanied by attempts to resort to external sources, which for some reason remained unsuccessful. The relationship between successful solutions of a hard task and the use of external sources was significant (χ2 = 7.88, p = 0.005), but it did not explain the correct answers in most cases.
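This test amounts to a chi-square test on a 2 × 2 contingency table of correctness versus switching out. The counts below are hypothetical, reconstructed from the reported percentages, so the statistic only approximates the published value:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table reconstructed from the reported percentages
# (13.2% of ~472 correct and 7.6% of ~527 incorrect "ipython" responses
# involved switching out); the real counts are in the OSF data.
#                switched out   stayed in tab
table = [[62, 410],    # correct responses
         [40, 487]]    # incorrect responses

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")   # close to the reported chi2 = 7.88
```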
Relatively quick solutions (less than 7 s) to tasks in the “ipython” category without switching out were mostly incorrect. The comparison of distributions of correct and incorrect answers revealed that responses made before 16 s were wrong (probably guessing), and correct solutions in most cases required 20–60 s.
Tasks in the “ipython” category differed in subjective difficulty for different students, depending on which topics they had learned in the previous semester. One task was easy because it contained an explicit correct answer, understandable from the context when carefully read. Other tasks required knowledge about indexing arrays from zero and the matrix representation of simple images. For example, if only two correct answers were given out of ten programming questions, and one of them was the question with the explicit correct answer, then one could suspect that the student was simply lucky once out of the remaining nine times. The point is that the strategy classification of a given response can depend on neighboring responses of similar topic and difficulty. For example, one could judge that if a student read a fairly long question in 3.5 s and chose the correct answer without hesitation on the very first view of the second option (Fig. 5), then that pattern corresponds to a confident decision. However, taking into account that all the answers to similar questions were incorrect (the student did not understand the topic as a whole), while the patterns of those poor decisions were similar to this pattern in their time characteristics, we would prefer to classify this decision as random guessing. This is also supported by very fast decision-making when viewing the target: tdecide = 0.149 s. Such response times occur when the decision to click is made before one starts viewing the response option. The normal response time for a two-choice reaction is 0.25–0.45 s. Considering the need to read and comprehend the text, tdecide should be even longer. The mode of the distribution of decision times for all task types was about 0.8 s. According to the leading edge of the histogram of correct decision times, it can be assumed that, regardless of the ease of perception, a reasoned choice requires at least 0.3 s of text viewing before clicking.
Fig. 5 Sample of a quick correct solution of a task that can be classified as guessing provided the incorrect solutions for all similar tasks of that quiz attempt. Left: the appearance of the solved task asking about data array shape. Right: the chronogram of text element views. Codes on the vertical axis: t – text of the question, a1–a4 – response options. On the horizontal axis – time from the beginning of page load, s. Vertical gray line – the end of page load. Vertical black line – navigation to the next page
Criteria for predicting subjective task difficulty
To create criteria for predicting the subjective task difficulty that determines a chosen response strategy, we selected metrics with the codes [d, dcheck, pout, n0, naextra, tcfirst, tclast, tdecide, keffort, vmin] (see Table 1). Most indicators correlated with the solution duration (Appendix B). The strong correlation between keffort and tcfirst was also mediated by overall duration, as students with a long solution viewed more items (more effort) and performed the last mouse click later.
To select the most informative indicators, we built decision trees that predict the difficulty score. Since the distribution of difficulty scores was strongly skewed, the logarithm of the score was used as the target indicator. For other time-based metrics with similarly skewed distributions, we also used logarithmic versions coded with the prefix “log-”. The sample was randomly split into training and test samples. After averaging the feature importances calculated by repeatedly training a depth-five regressor on randomly selected subsamples (fivefold cross-validation of a stochastic gradient boosting regressor model), it turned out that the logarithmic metrics were more informative than the original ones (see Appendix B).
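A sketch of this importance-averaging procedure with scikit-learn; tree depth and subsampling follow the description above, while the synthetic data, feature names, and remaining hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cv_feature_importance(X, y, feature_names, n_splits=5, seed=0):
    """Average impurity-based feature importances of a depth-5 stochastic
    gradient boosting regressor over cross-validation folds."""
    importances = []
    for train_idx, _ in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = GradientBoostingRegressor(
            max_depth=5, subsample=0.8, random_state=seed
        ).fit(X[train_idx], y[train_idx])
        importances.append(model.feature_importances_)
    mean_imp = np.mean(importances, axis=0)
    return sorted(zip(feature_names, mean_imp), key=lambda t: -t[1])

# Synthetic stand-in for the behavior metrics and the log difficulty score
rng = np.random.default_rng(0)
names = ["logd", "logtclast", "keffort", "tdecide", "vmin"]
X = rng.normal(size=(600, len(names)))
y = 0.9 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.3, size=600)
for name, imp in cv_feature_importance(X, y, names):
    print(f"{name}: {imp:.2f}")
```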
Given the high covariance of the time-based metrics, we selected five of the most independent yet most important features: the logarithm of the time of the last click (logtclast), the coefficient of effort (keffort), the logarithm of the time of the first view of a suggested answer (logtafirst), decision time (tdecide), and minimum view speed (vmin), which indicates the presence of a long pause while viewing a text element of the MCQ. To find the critical values, we trained regression decision trees of depth three on this limited feature set and identified the decision rules that best predicted the difficulty of task solutions (see Appendix C).
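The rule extraction itself can be reproduced with a depth-three regression tree; the sketch below uses synthetic stand-in data and the MAE criterion mentioned in Appendix C:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
features = ["logtclast", "keffort", "logtafirst", "tdecide", "vmin"]

# Synthetic stand-in for the per-response metric matrix and log difficulty score;
# the real values come from the prepared data set (see the OSF repository).
X = rng.normal(size=(500, len(features)))
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.3, size=500)

# Depth-3 regression tree; criterion='absolute_error' is 'mae' in older scikit-learn
tree = DecisionTreeRegressor(max_depth=3, criterion="absolute_error", random_state=0)
tree.fit(X, y)

# Human-readable decision rules analogous to those listed in Appendix C
print(export_text(tree, feature_names=features))
```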
Discussion
Estimating the objective difficulty of the task
To increase the variability in behavior patterns within one subject, we generated a quiz with easy and hard tasks. This allowed us to compare different solutions in the same experimental settings.
To assess the difficulty of the task, the relationship between the success rate and the duration of tasks was considered. An increase in task difficulty metric correlated with an increase in solution duration.
Given the general limitations of human cognitive processes, the difficulty boundaries found in the work can be transferred to other types of tests without changes. However, the time limits for task solutions can be significantly different in special types of tests, for example, consisting of fundamentally different task types or for special categories of people.
Since the difficulty of a task is inversely related to the success of its solution, a baseline level of difficulty should correspond to the simplest tasks. Failure to achieve 100% success in solving the simplest tasks should be explained by violations of the testing procedure—for example, when the student does not read the questions at all or is not able to comprehend the meaning of what they read. We chose to include in the quiz several easy general-knowledge questions with the same answer, which served as a lie scale, as in psychological tests. The approach of including questions on general knowledge has advantages: (1) it provides information about the cognitive operations required to solve tasks at a basic difficulty level with the maximum speed for a given subject; (2) it adds only a minor additional load without switching the type of activity. The approach compares favorably with the introduction of special “confusing” instructions proposed for detecting low-motivation participants by the authors of the instructional manipulation check (IMC) (Oppenheimer et al., 2009). Our approach is smoothly integrated into the overall canvas of the test and does not cause strong emotional responses, as explicit countermeasures against low performance can. The maximum reaction that can occur in the subject is the question: “Why am I being asked about this?” But since the answer is easy, the person quickly moves on to the next item and does not dwell on such questions.
Since we mixed simple and hard test items, we specifically addressed the effect of item difficulty on solution duration. CAFs corresponding to the theoretical ones (Schnipke and Scrams, 2002) were obtained only for hard questions.
The success rate and solution duration are manifestations of hidden variables (different strategies) that are of interest to test organizers (van der Linden, 2007). To identify hidden variables, iterative algorithms based on the principles of Monte Carlo simulations are used (Meng et al., 2015). However, experiments on large open data sets have shown that IRT models with parameters obtained in such a way are worse than modern classification algorithms (Linacre, 2010).
The improvement of the testing methodology with the recording of individual operations made it possible to separate confident responses and unsure responses. Granular chronometry increases the reliability of dividing formally correct answers into true and randomly guessed answers as compared with previously known methods with one rough criterion for the solution duration (Lu and Sireci, 2007). Imagine that a student tries to make a choice between the right option and one distractor, and therefore we record many switches between two competing options. Even if that student finally clicks on the wrong option, we can assume the presence of partial knowledge on the subject of the question. Indeed, granular chronometry is capable of reproducing the functionality of the method of elimination of wrong options (Ben-Simon et al., 1997; Chang et al., 2007; Lau et al., 2011; Wu et al., 2019) without special modifications of test instructions.
In general, our approach allows us to move from the grades for correct responses to the grades for selected and implemented response strategies. The determining factor in the strategy selection is the perceived hardness of the task. In this work we tried to predict the empirical difficulty score with a number of behavior metrics.
We analyzed dynamic patterns of solving problems of various levels of difficulty. The formation of a pattern is caused by a cascade of individual decisions that the individual makes based on the current situation. Test taking involves the selection of a strategy aimed at the achievement of some utilitarian function. At the time of the task presentation, the subject is already pre-configured for a number of actions that they need to complete the task. To make a final decision, the subject needs to obtain information about the current task. This information—obtained, for example, when reading a question—can either activate the subject’s pre-configured strategy or, conversely, make them reconsider their action plan and ultimately change their strategy.
The choice of strategy depends on the subjective difficulty of the task, in which objective difficulty is superimposed on individual background, which is influenced by fluctuations in the current psychophysiological state. The influence of the current state can be reduced to one main factor of motivation. The results of testing individuals directly depend on their motivation (Liu et al., 2012). Motivation is considered as a score on a linear scale from “I don't want to pass the test” to “I want to pass the test.” The deterioration of health leads to a decrease in motivation. However, taking into account the strategies we have highlighted, this is not so much a quantitative change (for example, slower decision-making) but rather a qualitative change—a switching from one task execution strategy to another.
Some strategies only allow us to evaluate the level of knowledge, while others deal only with the attitude toward testing. Therefore, to create a consistent model of knowledge assessment, it is necessary to integrate assessments about global motivation, current motivation (modulated by well-being), strategies for solving tasks, and adjusted relative difficulty.
Diagnostic value of task solution patterns
Test taking has a large number of degrees of freedom. Within the total time allocated for passing the test, the student can perform any operations in any order. However, most students optimize their trajectory for completing tasks. Optimization is modeled using the utility function described in economic theory. Mechanisms that ensure the search for and consolidation of useful behavior patterns for an individual are studied in the field of neuroeconomics. However, test behavior optimization models have not gone beyond two-component mixed models (Wang and Xu, 2015).
If the set of all strategies is reduced to two—confident/uncertain response and random guess—then the pattern of frequency distribution of distractor selection is sufficient to describe all possible quantitative combinations between these strategies. However, in real test behavior, at least five strategies are allocated (confident knowledge, uncertain knowledge, partial knowledge, random guessing, cheating; Sherbina, 2016), so an adequate model for describing the entire variety of test behavior has yet to be created.
Trace analysis with process mining techniques has revealed phenomena of self-checking (reading the question after decision) and partial reading (excessive viewing of the question and distractor texts). Both phenomena happen more often with an increase in task difficulty and are thought to reflect uncertainty. The presence of a self-checking pattern in simple tasks can be a diagnostic sign of self-doubt or difficulty in perception (for example, a poor knowledge of the language).
Switching-out-of-test patterns found in hard task solutions offer a new way to estimate students’ commitment to excellence. The logging of Internet search queries in a parallel browser window can provide information about students' competencies in the field of information technology: the ability to choose keywords for a search query, expanding or narrowing the search query depending on the output, and so on.
Perspectives
Information about test-taking strategies allows us to make more accurate estimates based on empirical data. If we have estimates of the prevailing strategies, then the scoring issue is reduced to the choice of weighting coefficients. If the coefficients are negative, the student is penalized for incorrect answers. Algorithms for optimizing penalties for incorrect answers have been proposed in the literature (Espinosa and Gardeazabal, 2010).
Because of the nonlinear dependence of the task success rate on miscellaneous behavior metrics, typical methods of dimensionality reduction, such as principal component analysis, are not applicable, since they cannot reflect the complex structure of patterns unfolding within different strategies. Therefore, the next step for behavior trace analysis is a formal description of dynamic task execution patterns that considers the order and time of viewing individual elements.
The process mining models have provided us with information about typical transitions between elementary operations during test task solutions.
After a formal description of the patterns, we can proceed to the development of a probabilistic model for evaluating the ability to perform cognitive operations that are critical for the solution, based on the features of choosing and changing strategies in the course of solving test tasks of various operational complexity levels. For example, if the tasks in the adaptive test become progressively more complex, at some point the ongoing task will become too difficult for the subject, and they will change their strategy, for example, from uncertain decision to random guessing. Here, the indicator of knowledge is not the fact of a correct or incorrect answer, but the fact of a change in strategy, which depends on both knowledge and motivation. Switching strategies in the dynamics of the test can be represented as a hidden Markov model (HMM), which will allow one to estimate the probability of readiness to perform specified cognitive operations (hidden layer of HMM) by the probability of changing the strategy (observed layer of HMM).
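As a toy illustration of this idea (all probabilities below are invented for the example, not estimated from the data), the forward algorithm of an HMM can turn an observed sequence of per-item strategies into a posterior over the hidden readiness state:

```python
import numpy as np

# Toy HMM: hidden readiness states and observed test-taking strategies.
states = ["ready", "not_ready"]                    # hidden layer
strategies = ["confident", "uncertain", "guess"]   # observed layer

start = np.array([0.6, 0.4])
trans = np.array([[0.9, 0.1],     # ready -> ready / not_ready
                  [0.2, 0.8]])    # not_ready -> ready / not_ready
emit = np.array([[0.70, 0.25, 0.05],   # P(strategy | ready)
                 [0.05, 0.35, 0.60]])  # P(strategy | not_ready)

def forward(observed):
    """Forward algorithm: likelihood of the observed strategy sequence and
    filtered posteriors over the hidden readiness state after each item."""
    idx = [strategies.index(o) for o in observed]
    alpha = start * emit[:, idx[0]]
    posteriors = [alpha / alpha.sum()]
    for o in idx[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        posteriors.append(alpha / alpha.sum())
    return alpha.sum(), np.array(posteriors)

likelihood, post = forward(["confident", "uncertain", "guess", "guess"])
print(post[-1])   # P(ready) vs P(not_ready) after the last item
```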
In the future, it will be possible to develop a system for sequencing test tasks with a certain composition of cognitive operations in such a way that cognitive load will provoke changes in strategies at certain stages of the test. The results of the empirical probability of changing strategies will be used to clarify the a priori probability of choosing strategies, which is determined by the student's preparedness for the test.
Our method is a further development of the evidence-centered design (ECD) approach, which considers observed student behaviors on particular tasks and items to support claims about what students know and can do (Mislevy et al., 2017).
Granular analysis of distractor view metrics is potentially an important tool for evaluating multiple-choice test items (Gierl et al., 2016; Wind et al., 2019).
Recent attempts to train a deep learning model on big data for guessing correct answers using linguistic features of distractors have hardly been productive, with accuracy of about 50% in the best cases (Watson et al., 2018). Our chronometry approach yields a number of metrics for every item response, which can potentially feed much more precise classifiers of student behaviors.
Data availability
The data analyzed during the current study are available at https://osf.io/37tk8/. The repository includes code examples to deploy the test extension to register granular behavior data.
References
Bannert, M., Reimann, P., Sonnenberg, C., 2014. Process mining techniques for analysing patterns and strategies in students’ self-regulated learning. Metacognition and learning 9, 161–185.
Ben Khedher, A., Jraidi, I., Frasson, C., 2017. Assessing Learners’ Reasoning Using Eye Tracking and a Sequence Alignment Method, in: Huang, D.-S., Jo, K.-H., Figueroa-García, J.C. (Eds.), Intelligent Computing Theories and Application, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 47–57. https://doi.org/10.1007/978-3-319-63312-1_5
Ben-Simon, A., Budescu, D.V., Nevo, B., 1997. A Comparative Study of Measures of Partial Knowledge in Multiple-Choice Tests. Applied Psychological Measurement 21, 65–88. https://doi.org/10.1177/0146621697211006
Berinsky, A.J., Margolis, M.F., Sances, M.W., 2014. Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-Administered Surveys. American Journal of Political Science 58, 739–753. https://doi.org/10.1111/ajps.12081
Celiktutan, O., Demiris, Y., 2018. Inferring Human Knowledgeability from Eye Gaze in Mobile Learning Environments, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
Chang, S.-H., Lin, P.-C., Lin, Z.-C., 2007. Measures of partial knowledge and unexpected responses in multiple-choice tests. Journal of Educational Technology & Society 10, 95–109.
van der Aalst, W.M.P., Weijters, T., Maruster, L., 2004. Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16, 1128–1142.
Espinosa, M.P., Gardeazabal, J., 2010. Optimal correction for guessing in multiple-choice tests. Journal of Mathematical Psychology 54, 415–425. https://doi.org/10.1016/j.jmp.2010.06.001
Gierl, M.J., Lai, H., Pugh, D., Touchie, C., Boulais, A.-P., Champlain, A.D., 2016. Evaluating the Psychometric Characteristics of Generated Multiple-Choice Test Items. Applied Measurement in Education 29, 196–210. https://doi.org/10.1080/08957347.2016.1171768
Goegebeur, Y., De Boeck, P., Wollack, J.A., Cohen, A.S., 2008. A Speeded Item Response Model with Gradual Process Change. Psychometrika 73, 65–87. https://doi.org/10.1007/s11336-007-9031-2
Haladyna, T.M., Rodriguez, M.C., 2013. Developing and validating test items. Routledge.
Heitz, R.P., 2014. The speed-accuracy tradeoff: history, physiology, methodology, and behavior. Frontiers in Neuroscience 8. https://doi.org/10.3389/fnins.2014.00150
Hutt, S., Krasich, K., Brockmole, J.R., D’Mello, S., 2021. Breaking out of the Lab: Mitigating Mind Wandering with Gaze-Based Attention-Aware Technology in Classrooms, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21. Association for Computing Machinery, New York, NY, USA, pp. 1–14. https://doi.org/10.1145/3411764.3445269
Lau, P.N., Lau, S.H., Hong, K.S., Usop, H., 2011. Guessing, Partial Knowledge, and Misconceptions in Multiple-Choice Tests. Journal of Educational Technology & Society 14, 99–110.
Linacre, J.M., 2010. Predicting responses from Rasch measures. Journal of Applied Measurement 11, 1.
Liu, O.L., Bridgeman, B., Adler, R.M., 2012. Measuring Learning Outcomes in Higher Education: Motivation Matters. Educational Researcher 41, 352–362. https://doi.org/10.3102/0013189X12459679
Lu, Y., Sireci, S.G., 2007. Validity Issues in Test Speededness. Educational Measurement: Issues and Practice 26, 29–37. https://doi.org/10.1111/j.1745-3992.2007.00106.x
McCoubrie, P., 2004. Improving the fairness of multiple-choice questions: a literature review. Medical Teacher 26, 709–712. https://doi.org/10.1080/01421590400013495
Meng, X.-B., Tao, J., Chang, H.-H., 2015. A Conditional Joint Modeling Approach for Locally Dependent Item Responses and Response Times. Journal of Educational Measurement 52, 1–27. https://doi.org/10.1111/jedm.12060
Mislevy, R.J., Haertel, G., Riconscente, M., Rutstein, D.W., Ziker, C., 2017. Evidence-Centered Assessment Design, in: Assessing Model-Based Reasoning Using Evidence-Centered Design: A Suite of Research-Based Design Patterns, SpringerBriefs in Statistics. Springer International Publishing, Cham, pp. 19–24. https://doi.org/10.1007/978-3-319-52246-3_3
Oppenheimer, D.M., Meyvis, T., Davidenko, N., 2009. Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology 45, 867–872. https://doi.org/10.1016/j.jesp.2009.03.009
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Schnipke, D.L., Scrams, D.J., 2002. Exploring issues of examinee behavior: Insights gained from response-time analyses, in: Computer-Based Testing: Building the Foundation for Future Assessments, pp. 237–266.
Sherbina, D.N., 2015. Strategies for passing the knowledge tests, identified by the distractor view chronometry. Valeology 4, 112–121 (in Russian).
Sherbina, D.N., 2016. Improving the effectiveness of knowledge control on the basis of analysis of test tasks solution sequence. Educational Technology & Society 19 (4), 346–363 (in Russian).
Spiller, M., Liu, Y.-H., Hossain, M.Z., Gedeon, T., Geissler, J., Nürnberger, A., 2021. Predicting Visual Search Task Success from Eye Gaze Data as a Basis for User-Adaptive Information Visualization Systems. ACM Transactions on Interactive Intelligent Systems 11, 14:1–14:25. https://doi.org/10.1145/3446638
Thompson, A.R., O’Loughlin, V.D., 2015. The Blooming Anatomy Tool (BAT): A discipline-specific rubric for utilizing Bloom’s taxonomy in the design and evaluation of assessments in the anatomical sciences. Anatomical Sciences Education 8, 493–501.
Tversky, B., Zacks, J.M., Hard, B.M., 2008. The Structure of Experience, in: Understanding Events, pp. 436–465.
van der Aalst, W.M.P., 2016. Process Mining: Data Science in Action. Springer.
van der Linden, W.J., 2007. A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 72, 287.
Wang, C., Xu, G., 2015. A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology 68, 456–477. https://doi.org/10.1111/bmsp.12054
Watson, P., Ma, T., Tejwani, R., Chang, M., Ahn, J., Sundararajan, S., 2018. Human-level Multiple Choice Question Guessing Without Domain Knowledge: Machine-Learning of Framing Effects, in: Companion Proceedings of the Web Conference 2018, WWW ’18. International World Wide Web Conferences Steering Committee, Lyon, France, pp. 299–303. https://doi.org/10.1145/3184558.3186340
Wind, S.A., Alemdar, M., Lingle, J.A., Moore, R., Asilkalkan, A., 2019. Exploring student understanding of the engineering design process using distractor analysis. International Journal of STEM Education 6. https://doi.org/10.1186/s40594-018-0156-x
Wise, S.L., DeMars, C.E., 2006. An Application of Item Response Time: The Effort-Moderated IRT Model. Journal of Educational Measurement 43, 19–38. https://doi.org/10.1111/j.1745-3984.2006.00002.x
Wu, Q., Laet, T.D., Janssen, R., 2019. Modeling Partial Knowledge on Multiple-Choice Items Using Elimination Testing. Journal of Educational Measurement 56, 391–414. https://doi.org/10.1111/jedm.12213
Zimmerman, M.E., 2011. Speed–Accuracy Tradeoff, in: Kreutzer, J.S., DeLuca, J., Caplan, B. (Eds.), Encyclopedia of Clinical Neuropsychology. Springer, pp. 2344–2344. https://doi.org/10.1007/978-0-387-79948-3_1247
Acknowledgements
The author expresses gratitude for the support from the Strategic Academic Leadership Program of the Southern Federal University (“Priority 2030”).
Funding
The project is supported by the Russian Ministry of Science and Higher Education in the framework of Decree No. 218, project No. 2019-218-11-8185 “Creating a software complex for human capital management based on neurotechnologies for enterprises of the high-tech sector of the Russian Federation” (Internal number HD/19-22-NY).
Author information
Contributions
Dmitry N. Sherbina – method, analysis, manuscript.
Ethics declarations
Competing interest
None.
Highlights
• Decision rules that allow predicting subjective task difficulty.
• Trace analysis with process mining techniques revealed behavior patterns.
• Item response theory model in Python to determine the level of task difficulty.
Appendices
Appendix A
The 2PL IRT analysis was conducted with the pymc3 package. The resulting variable moda contains an object with the inference results. The full code, including construction of the figure with the resulting logistic curves per question, is available in the repository at https://osf.io/37tk8/.
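For orientation, a minimal sketch of such a model in pymc3 with the fixed asymptotes from the Results section. The priors, sampler settings, and placeholder response matrix are assumptions; the authoritative code is in the repository.

```python
import numpy as np
import pymc3 as pm

n_students, n_items = 87, 54
# Placeholder binary response matrix (students x items); the real one is built
# from the quiz data in the repository.
responses = np.random.binomial(1, 0.6, size=(n_students, n_items))

guess, ceiling = 0.25, 0.99   # fixed third and fourth parameters of the logistic curves

with pm.Model():
    theta = pm.Normal("theta", mu=0.0, sigma=1.0, shape=n_students)   # student ability
    a = pm.HalfNormal("a", sigma=1.0, shape=n_items)                  # item discrimination
    b = pm.Normal("b", mu=0.0, sigma=1.0, shape=n_items)              # item difficulty
    p = guess + (ceiling - guess) * pm.math.sigmoid(
        a[None, :] * (theta[:, None] - b[None, :]))
    pm.Bernoulli("obs", p=p, observed=responses)
    moda = pm.sample(1000, tune=1000, return_inferencedata=True)
```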


Appendix B
Informative features are shown in Figures 6 and 7.
Appendix C
Decision tree analysis to predict the difficulty of task solutions (Fig. 8).
Fig. 8 A decision tree that optimally splits the range of difficulty scores. A – based on the mean absolute error criterion (MAE), B – based on the mean squared error criterion (MSE). For each cluster, the number of samples and the predicted difficulty score (value) are specified. The color intensity corresponds to the difficulty metric of tasks in the cluster.
Decision rules correspond to the path from the top of the tree to the leaves, except for criteria with duplicate metrics. The rules are listed in ascending order of predicted task difficulty. Logarithmically scaled features were converted back to the original scale.

When we increased the tree depth to four levels, we got the following set of rules (for clusters larger than 10 leaves).

Cite this article
Sherbina, D.N. Chronometry of distractor views to discover the thinking process of students during a computer knowledge test. Behav Res 54, 2463–2478 (2022). https://doi.org/10.3758/s13428-021-01743-x
Keywords
- Multiple-choice quiz
- Time measurements
- Elementary cognitive operations
- Process mining
- Computer-administered tests