Introduction

Two long-standing concerns in large-scale assessments are examinees not sufficiently engaging with items (i.e., disengagement) and running out of time before responding to all items (i.e., speededness). Both behaviors are concerning because examinees’ achievement scores under the conditions of speededness and disengagement typically do not represent their true ability, diminishing the validity of arguments based on the scores (Lu & Sireci, 2007; Wise, 2015; Yamamoto, 1995) and leading to potential biases in parameter estimates (Oshima, 1994; Rogers & Swaminathan, 2016). In the case of international large-scale low-stakes assessments, such as TIMSS, speededness and disengagement may also impact the validity of performance comparisons across countries and years.

Some studies have compared the percentage of disengaged examinees across countries (e.g., Debeer et al., 2014; Rios & Guo, 2020) and years (Kuang & Sahin, 2021). However, none have examined disengagement, speededness, and other test-taking behaviors simultaneously in a cross-country analysis. Identifying multiple test-taking behaviors simultaneously paints a fuller picture of how students spend their testing time and how test-taking behaviors relate to performance. Digitally based assessments allow for the collection of log data; that is, the accumulation of examinees’ interactions with the testing screen and associated timestamps. Using these timestamps, it is possible to compute response times (RTs), which are then used to identify disengaged and speeded examinees (Lu & Sireci, 2007; Wise, 2017).

Recent studies have examined disengagement at the item level using either RT and scores (Goldhammer et al., 2017; Guo et al., 2016) or RT and response behaviors derived from process data (Sahin & Colvin, 2020). An alternative to item-level detection is test-level detection, where researchers analyze RTs over item sequences (i.e., pacing trajectories) using latent models, such as mixture modeling or growth modeling, to detect disengagement (Zheng, 2019; Zheng et al., 2018) or speededness (Bolt et al., 2002; Kahraman et al., 2013).

Our study is one of the first to compare speededness and disengagement simultaneously and also one of the first to use RT data from TIMSS 2019. This was the cycle in which TIMSS began the transition from a paper-and-pencil assessment to a computer-based assessment, introducing a digital version called “eTIMSS”. Specifically, we used data from four education systems: England (ENG), Singapore (SGP), the United Arab Emirates (UAE), and the United States (USA). Our objectives were, first, to examine the presence and extent of unfavorable test-taking behaviors using the test-level mixture models within selected countries and, second, to investigate the commonality and specificity of the test-taking behaviors across the countries.

Data

This study compared eTIMSS 2019 eighth-grade mathematics data from the USA and three other countries: SGP, ENG, and UAE. The comparison countries were selected for the following reasons: (1) like the USA, they administered the digital version, not the paper version, of TIMSS (i.e., eTIMSS) in 2019; (2) like the USA, they administered the eighth-grade assessment in the English language; and (3) compared to the USA, they had varying eighth-grade mathematics performance in TIMSS 2019. Specifically, SGP had performance higher than the USA’s; ENG had performance similar to the USA’s; and the UAE had performance lower than the USA’s.

Fig. 1 Histogram of the transformed RT for an example item (the last mathematics item in the eighth-grade TIMSS 2019 Booklet 1, Part 1) responded to by USA eighth-graders. Note: Transformed RT = log(RT + 1 s). The sample size for the USA is 618; of the 624 eighth-grade USA students who were administered Booklet 1, 6 did not have valid RT data for any of the 32 items examined in the study. Source: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2019.

The study used item scores and response time variables (i.e., total time spent by the student on each item screen) for one part of a test booklet from the student achievement data files in the TIMSS 2019 public-use international database (Fishbein et al., 2021).

TIMSS uses a matrix sampling approach that packages mathematics and science items into 14 booklets, with each student completing just one booklet. As shown in Exhibit 4.2 of the TIMSS 2019 Assessment Frameworks (Mullis & Martin, 2017), each booklet consists of two parts (Part 1 and Part 2), and each part contains two blocks of items (either two mathematics blocks or two science blocks). In half of the 14 booklets, the two mathematics blocks come first, followed by the two science blocks; in the other half, the order is reversed. Students in the eighth-grade assessment are allotted 45 min to complete each part of the booklet. Booklets are distributed such that approximately equal proportions of students respond to each booklet and the students completing each booklet are approximately equivalent in terms of student ability (Mullis & Martin, 2017).

To the extent possible, within each block the distribution of items across the TIMSS content and cognitive domains matches the overall item pool distribution (Mullis & Martin, 2017). This important design feature of TIMSS made either part of any booklet equally suitable for examination in this study. However, to limit the scope, this study focused on only one of the seven booklets that started with two mathematics blocks; specifically, blocks ME01 and ME02, which are administered in Part 1 of Booklet 1 of the 2019 TIMSS eighth-grade assessment. These blocks consisted of 31 mathematics items totaling 32 score points. Additional file 1: Table S1 lists the characteristics of these 31 items (e.g., item type, content domain, cognitive domain, item label) in the order in which they were administered during the first 45 min of Booklet 1 of the eighth-grade assessment. Examinees’ responses were scored using the scoring syntax provided in the international database, where “Omitted” and “Not Reached” responses were recoded as incorrect. For the 31 mathematics items examined, the database included only 28 RT variables because several items shared the same screen.
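To make the recode concrete, the following is a minimal sketch in Python using hypothetical response labels; the actual scoring was done with the syntax supplied with the international database, not this code:

```python
import pandas as pd

# Hypothetical scored-response labels; the database codes "Omitted" and
# "Not Reached" separately from incorrect answers.
responses = pd.Series(["Correct", "Incorrect", "Omitted", "Not Reached", "Correct"])

# Recode "Omitted" and "Not Reached" as incorrect, then score dichotomously.
recoded = responses.replace({"Omitted": "Incorrect", "Not Reached": "Incorrect"})
score = (recoded == "Correct").astype(int)  # 1 = credit, 0 = no credit
print(score.tolist())  # [1, 0, 0, 0, 1]
```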

Methods

This research used finite mixture models (see, e.g., Everitt & Hand, 1981; McLachlan & Basford, 1988; Titterington et al., 1985) to identify unique groups of examinees. Finite mixture modeling is a model-based clustering method that has been used to identify unobserved (latent) class memberships based on observed characteristics. In our study, the observed input variables were examinees’ RTs and item scores. If \(\mathbf{y}\) is a vector of the 28 RT and 31 score variables, the marginal distribution of the observed \(\mathbf{y}\), obtained by summing the joint distribution of \(\mathbf{y}\) and the latent class membership over the classes, can be expressed as:

$$ f(\mathbf{y}) = \sum_{k=1}^{K} \pi_k \, f_k\left(\mathbf{y} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right), \tag{1} $$

where \(K\) is the total number of latent classes enumerated in the mixture model, \(\pi_k\) is the probability of membership in class \(k\), and \(f_k\) is the class-specific density of \(\mathbf{y}\) for class \(k\), with mean vector \(\boldsymbol{\mu}_k\) and covariance structure \(\boldsymbol{\Sigma}_k\). Various specifications of \(\boldsymbol{\Sigma}_k\) could be tested to identify the model with the best fit. Because the RT variables were continuous and the item scores were ordinal, we tested covariance structures only for the RT variables. The covariances between RTs and item scores were not considered due to the complexity involved in parameterizing them as well as the lack of a one-to-one correspondence between all item scores and RTs.
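As an illustration of fitting such a mixture, the sketch below uses scikit-learn’s GaussianMixture on simulated placeholder data. Unlike the model above, it treats the ordinal item scores as numeric and ignores the complex sample design, so it is a conceptual approximation rather than the estimation we actually used:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data standing in for the 28 transformed-RT and 31 item-score
# variables (59 columns per examinee); real input would come from the TIMSS files.
rng = np.random.default_rng(0)
y = np.hstack([rng.gamma(2.0, 1.5, size=(618, 28)),   # transformed RTs
               rng.integers(0, 2, size=(618, 31))])   # item scores (0/1 here)

# Diagonal covariances: item-level variances only, no RT covariances.
gmm = GaussianMixture(n_components=5, covariance_type="diag",
                      n_init=10, random_state=0).fit(y)
pi_k = gmm.weights_               # estimated class probabilities (pi_k)
mu_k = gmm.means_                 # class-specific mean vectors (mu_k)
posterior = gmm.predict_proba(y)  # P(class k | y) for each examinee
classes = gmm.predict(y)          # modal class assignments
```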

Maximum likelihood estimation was used iteratively with increasing numbers of classes to determine the best-fitting model. Due to the clustering of students within classrooms, the design-based sandwich estimator (Asparouhov & Muthén, 2006; Rabe-Hesketh & Skrondal, 2006) was used to estimate robust standard errors, which were adjusted for clustering and stratification. Each model was evaluated using fit indices, such as entropy (Celeux & Soromenho, 1996) and the Bootstrapped Likelihood Ratio Test (BLRT; McLachlan & Peel, 2000). Values for entropy range from 0 to 1, with higher values indicating better separation of latent classes; a general rule of thumb is that an entropy above 0.8 indicates good separation (see, e.g., Clark, 2010; Nagin, 2005). The BLRT compares the fit of a model with K classes to one with K − 1 classes; if the test is significant, the model with the larger number of classes is favored. Each model was also evaluated on its interpretability based on the observed characteristics of the classes. The mixture model was run separately for each country using one to five classes.
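The sketch below illustrates how these two fit indices can be computed from a fitted mixture. The BLRT implementation is deliberately simplified (e.g., no convergence checks), and the diagonal-covariance Gaussian mixture is an assumption standing in for our actual model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def relative_entropy(posterior):
    """Relative entropy (Celeux & Soromenho, 1996) from an (n, K) matrix of
    posterior class probabilities; values near 1 indicate clean separation."""
    n, K = posterior.shape
    p = np.clip(posterior, 1e-12, 1.0)  # guard against log(0)
    return 1.0 - (-(p * np.log(p)).sum()) / (n * np.log(K))

def blrt_pvalue(y, k, n_boot=99, seed=0):
    """Simplified parametric-bootstrap LRT (McLachlan & Peel, 2000)
    comparing k-1 vs. k classes."""
    fit = lambda d, kk: GaussianMixture(n_components=kk, covariance_type="diag",
                                        n_init=5, random_state=seed).fit(d)
    m0, m1 = fit(y, k - 1), fit(y, k)
    observed = 2 * len(y) * (m1.score(y) - m0.score(y))  # observed LR statistic
    boot = []
    for _ in range(n_boot):
        sim, _ = m0.sample(len(y))            # simulate under the (k-1)-class model
        b0, b1 = fit(sim, k - 1), fit(sim, k)
        boot.append(2 * len(sim) * (b1.score(sim) - b0.score(sim)))
    return float((np.array(boot) >= observed).mean())
```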

The item-level RTs needed to be transformed because they were highly skewed, and natural logarithm transformations are commonly used on timing data in RT modeling. In our data, some students did not interact with certain items (e.g., if they ran out of time before reaching an item); the recorded RT for these items is 0, which cannot be transformed with a natural logarithm. To accommodate these cases, each RT was transformed by taking the natural logarithm of the examinee’s RT (in seconds) plus 1. If an examinee did not spend time on an item, their transformed RT would therefore be 0.
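A minimal illustration of the transformation:

```python
import numpy as np

rt_seconds = np.array([0.0, 2.5, 47.0, 112.3])  # hypothetical item-level RTs
transformed = np.log(rt_seconds + 1.0)          # equivalently np.log1p(rt_seconds)
# A never-reached item (RT = 0 s) maps to exactly 0 on the transformed scale.
```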

Figure 1 presents an example of the transformed RT distribution, based on the USA sample, for the last item administered. Of the 618 eighth-graders, approximately 40 did not interact with this item; therefore, the distribution is slightly zero-inflated.

Following mixture modeling, we examined the validity of the classifications by conducting cross-tabulations to explore the association between the classifications and selected student contextual variables. Specifically, we focused on two questions from the U.S. national version of the student questionnaire: Question #32, with variable name BSXG32, which asked about examinees’ effort while taking the test, and Question #33, with variable name BSXG33, which asked about examinees’ perception of the importance of the test. These two questions directly relate to students’ test-taking behaviors. We hypothesized that students classified as Disengaged in the USA would report lower effort on the test and lower perceived importance of the test.

However, these two questions were not included in the international version of the student questionnaire. As a result, data on these questions were not available for the other three countries in our study (SGP, ENG, and the UAE). To address this limitation, we selected a student questionnaire variable that was available in the TIMSS 2019 international database for all eTIMSS countries: Question 1Ba, with variable name BSBE01BA, which asked whether students had difficulty typing during the test. This variable served as a proxy for students’ familiarity with using computers, as some students might have been disengaged due to experiencing difficulties with the computer-based testing platform. We hypothesized that a greater percentage of disengaged students would report having trouble typing during the test. Additionally, we examined students’ gender distribution (using the ITSEX variable) within the identified classes, as previous research (Wise & DeMars, 2010) has shown that male students are typically overrepresented in the disengaged class. We hypothesized that the same pattern would emerge in the identified classes.
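A sketch of the planned cross-tabulation, using toy data; in a real analysis, the modal class assignments would be merged with the student questionnaire file by student ID, and BSBE01BA and ITSEX would be handled the same way:

```python
import pandas as pd

# Toy data: one row per examinee, with the modal latent class and a
# questionnaire variable named as in the database (BSXG32 = effort).
df = pd.DataFrame({
    "latent_class": ["Disengaged", "Disengaged", "Steady_high",
                     "Steady_high", "Speeded", "Steady_low"],
    "BSXG32": ["Not as hard", "About as hard", "About as hard",
               "Harder", "About as hard", "Harder"],
})

# Row-normalized crosstab: within-class percentage distribution of responses.
print(pd.crosstab(df["latent_class"], df["BSXG32"], normalize="index").round(2))
```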

Results

Mixture modeling results

Through testing different mean and covariance structures for the transformed RTs in the mixture model, we found that including class-specific variance–covariance matrices led to nonconvergence due to the large number of parameters to be estimated. Therefore, we constrained the variance of the transformed RT of a specific item to be equal across classes but allowed the variances of transformed RTs to vary across items within a class. The covariances of transformed RTs were excluded from the model for convergence considerations.
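Because this shared-diagonal structure is not a standard option in common mixture libraries (scikit-learn’s "tied" shares a full matrix and "diag" is class-specific), the following sketch makes the constraint concrete with a simple EM implementation: class-specific means with one diagonal covariance shared across classes. It omits the survey-design adjustments used in our estimation:

```python
import numpy as np

def fit_tied_diag_mixture(y, k, n_iter=200, seed=0):
    """EM for a Gaussian mixture with class-specific means and a single
    diagonal covariance shared across classes (variances differ by item,
    not by class) -- a sketch of the constraint described above."""
    rng = np.random.default_rng(seed)
    n, d = y.shape
    pi = np.full(k, 1.0 / k)                      # class probabilities
    mu = y[rng.choice(n, size=k, replace=False)]  # class mean vectors
    var = y.var(axis=0) + 1e-6                    # shared item-level variances
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each examinee.
        logp = (np.log(pi)
                - 0.5 * np.sum(np.log(2 * np.pi * var))
                - 0.5 * (((y[:, None, :] - mu[None]) ** 2) / var).sum(-1))
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weights, means, and the shared diagonal covariance.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ y) / nk[:, None]
        var = (resp[:, :, None] * (y[:, None, :] - mu[None]) ** 2).sum((0, 1)) / n
    return pi, mu, var, resp
```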

The BLRT p-values and the entropies for the mixture models are presented in Table 1. The entropies were high (> 0.9, above the recommended threshold of 0.8) for all models, indicating very good separation between the latent classes within each model. Thus, entropy by itself did not sufficiently differentiate between the models. As a result, more weight was placed on the BLRT p-values, which explicitly compared the fits of two solutions. As the number of classes increased, the BLRT p-values pointed to 5-class solutions for the USA, ENG, and the UAE at a significance level of 0.05. For SGP, the 5-class model did not converge; instead, a 4-class solution converged and gave the best model fit. The final results therefore adopted 5-class solutions for the USA, ENG, and the UAE and a 4-class solution for SGP.

Table 1 Fit indices for 2- to 5-class mixture models for the USA, ENG, UAE, and SGP

Figures 2, 3, 4, and 5 present the results for the USA, ENG, the UAE, and SGP, respectively. The two upper panels of each figure show the pacing trajectories of the classes, based on the median transformed RT, across the items in each of the two blocks examined. The bottom-left panel presents the cumulative RT in minutes across the item sequence in both blocks. The bottom-right panel shows the percentage distribution and average sum score (SS) of each class found in the sample. Table 2 summarizes the results for all four countries.
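The following sketch, using simulated placeholder data, illustrates how the median pacing trajectories and cumulative RTs of identified classes can be plotted; the figure layout only approximates the panels described above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins: trt is (examinees x 28) transformed RTs; labels holds
# each examinee's modal class from the mixture model.
rng = np.random.default_rng(1)
trt = rng.gamma(2.0, 1.0, size=(618, 28))
labels = rng.integers(0, 5, size=618)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for k in range(5):
    grp = trt[labels == k]
    ax1.plot(range(1, 29), np.median(grp, axis=0), label=f"Class {k + 1}")
    # Back-transform (expm1 inverts log(RT + 1)), then accumulate in minutes.
    ax2.plot(range(1, 29), np.cumsum(np.median(np.expm1(grp), axis=0)) / 60.0)
ax1.set(xlabel="Item sequence", ylabel="Median transformed RT")
ax2.set(xlabel="Item sequence", ylabel="Cumulative RT (min)")
ax1.legend(fontsize=8)
fig.tight_layout()
plt.show()
```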

Fig. 2 Pacing trajectories of student classes identified in the USA sample. Note: Pacing trajectories are depicted for the 28 RT variables for the 31 mathematics items in the eighth-grade TIMSS 2019 Booklet 1, Part 1. Transformed RT = log(RT + 1 s). SS is the average sum score. The sample size for the USA is 618; of the 624 eighth-grade USA students who were administered Booklet 1, 6 did not have valid RT data for any of the 32 items examined in the study. The median cumulative response times for the two steady groups (green and blue) overlap, making only one of them (blue) visible on the graph. Source: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2019.

Fig. 3 Pacing trajectories of student classes identified in the ENG sample. Note: Pacing trajectories are depicted for the 28 RT variables for the 31 mathematics items in the eighth-grade TIMSS 2019 Booklet 1, Part 1. Transformed RT = log(RT + 1 s). SS is the average sum score. The sample size for ENG is 242. The median cumulative response times for the two steady groups (green and blue) overlap, making one of them (blue) more visible in the graph. Source: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2019.

Fig. 4 Pacing trajectories of student classes identified in the UAE sample. Note: Pacing trajectories are depicted for the 28 RT variables for the 31 mathematics items in the eighth-grade TIMSS 2019 Booklet 1, Part 1. Transformed RT = log(RT + 1 s). SS is the average sum score. The sample size for the UAE is 1595; of the 1599 eighth-grade UAE students who were administered Booklet 1, 4 did not have valid RT data for any of the 32 items examined in the study. The median cumulative response times for the two steady groups (green and blue) overlap, making one of them (blue) more visible in the graph. Source: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2019.

Fig. 5 Pacing trajectories of student classes identified in the SGP sample. Note: Pacing trajectories are depicted for the 28 RT variables for the 31 mathematics items in the eighth-grade TIMSS 2019 Booklet 1, Part 1. Transformed RT = log(RT + 1 s). SS is the average sum score. The sample size for SGP is 348; of the 350 eighth-grade SGP students who were administered Booklet 1, 2 did not have valid RT data for any of the 32 items examined in the study. Source: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2019.

Table 2 Latent class distributions, mean sum scores, and median total response time (RT) of the selected countries in the eighth-grade TIMSS 2019 Booklet 1, Part 1

For the USA sample (Fig. 2), we identified five classes and labeled them as Disengaged, Very speeded, Speeded, Steady and high-performing, and Steady but low-performing. About 9% of the sample was identified as Disengaged (shown in black). These students consistently spent relatively less time across all items (median total RT was about 20 of the 45 min allotted) and had low mean sum scores (about 5 of a possible 32 points). Based on students’ timing patterns and low performance, we concluded that they might not have fully considered each item.

Two classes were interpreted as speeded; both are characterized by a large amount of time spent on the items in the first block and little to no time spent on the last few items in the second block. The Very speeded class (magenta) began to run out of time earlier than the Speeded class (purple) (at the 21st vs. the 25th item, respectively) and had a lower average sum score (10 vs. 12 points, respectively). The cumulative RT of the Very speeded group plateaued earlier than the Speeded group’s cumulative RT.

The remaining two majority classes had an almost identical, steady pace and spent 35–40 min of the 45 min allotted, although their performance differed greatly. The Steady and high-performing group (green) scored 23 points on average, while the Steady but low-performing group (blue) scored about 10 points.

For the ENG and UAE samples (Figs. 3, 4, respectively), we identified the Speeded, Disengaged, Steady and high-performing, and Steady but low-performing classes (as in the USA), but not the Very speeded class. Instead, we observed a Very disengaged group, shown in red (about 4% of each country’s sample), which spent only about 15 min on all the items out of the 45 min allotted.

For the SGP sample (Fig. 5), we observed the Very disengaged, Steady and high-performing, and Steady but low-performing groups, but no speeded group. Instead, SGP had a unique Efficient and high-performing class, shown in orange (24.4% of the sample), characterized by a relatively fast responding pace (about 25 min total out of the 45 min allotted) and relatively high score (27 points, on average).

Table 2 summarizes class memberships and characteristics across all four countries. In all countries, most students belonged to the two steady-pacing groups, but we consistently identified some disengaged students, even in the high-performing SGP. The proportion of disengaged students was relatively small in the USA and SGP (lower than 10%), but over 20% in ENG and the UAE. Speeded examinees were found in all countries but SGP, which had a unique Efficient and high-performing class not observed elsewhere.

Relating class membership to student contextual variables

The cross-tabulations of the two USA-specific questions with the latent classes are shown in Additional file 1: Table S2. For the question on effort on the test, a considerably higher percentage of Disengaged USA students (41%) reported trying “not as hard on TIMSS as on other tests” compared to students in the other classes (less than 30%). This confirmed our interpretation of the Disengaged class as exerting a lower level of effort on the test. In addition, we noted that a comparatively larger percentage of Steady and high-performing students (61%) reported trying “about as hard as on other tests,” while a comparatively smaller percentage reported trying “harder” or “much harder,” which is in line with the expectation that these students did not require as much effort given their higher performance. For the question on the perceived importance of the test, a considerably higher percentage of Disengaged USA students (29%) reported that it was “not very important to do well” compared to the students in the other classes (less than 13%). This further supports our hypothesis that disengaged students placed less importance on the test than other students.

Regarding the question on experiencing difficulty with typing, the results for the four countries are shown in Additional file 1: Table S3. In the USA, the UAE, and SGP, higher percentages of Very disengaged or Disengaged students reported experiencing difficulty with typing compared to students in the other classes. This supports our hypothesis that disengagement may be related to typing difficulties, which we considered a proxy for lack of familiarity with the digital testing platform. However, in ENG, a higher percentage of students experiencing typing difficulty was observed in the Steady and high-performing class (45%), not in the Very disengaged or Disengaged class (35%).

Lastly, the results for the gender variable are presented in Additional file 1: Table S4 for the four countries. In the USA, the UAE, and SGP, a significantly higher percentage of male students than female students were disengaged, consistent with previous research. Again, ENG stood out as the only exception, with a higher proportion of female students being disengaged. This finding suggests the presence of unique contextual factors in ENG that may have contributed to this outcome, warranting further investigation. Overall, the response patterns of the identified latent classes mostly confirmed our hypotheses and provided additional validity evidence for their interpretations.

Discussion

Classes discovered

In this study, the classes discovered were Very disengaged, Disengaged, Very speeded, Speeded, Steady but low-performing, Steady and high-performing, and Efficient and high-performing. Compared to previous studies, this study identified more fine-grained disengaged and speeded test-taking behaviors. Specifically, we differentiated between Disengaged and Very disengaged students as well as between Speeded and Very speeded students. Even though Wise and Kong’s (2005) response time effort (RTE) index considered disengagement on a continuum, it did not differentiate between Very disengaged and Disengaged behaviors. The RTE index reflects the number of items on which a student is identified as providing non-effortful responses out of the total number of items administered. If a student has an RTE index value of 0, the lowest possible value, the suggestion is that the student rushed through all the items, thus showing Very disengaged behavior. On the other hand, if a student has an RTE index value of 1, the highest possible value, the student did not rush through any of the items and is said to be fully engaged. The shortcoming of the RTE index is that it does not have an established cut-off point to distinguish Very disengaged from Disengaged students. In addition, there is no established RTE-like index to measure speededness, let alone to differentiate Speeded from Very speeded students.
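To make the index concrete, the following is a minimal sketch; the item thresholds shown are illustrative assumptions, not published cut-offs:

```python
import numpy as np

def rte_index(rt_seconds, thresholds):
    """Response time effort (Wise & Kong, 2005): the proportion of
    administered items whose RT meets an item-specific effort threshold."""
    rt = np.asarray(rt_seconds, dtype=float)
    return float((rt >= np.asarray(thresholds, dtype=float)).mean())

# Five items with a hypothetical 3-second threshold each; an RTE of 0 would
# flag Very disengaged behavior, an RTE of 1 fully engaged behavior.
print(rte_index([1.0, 12.0, 0.5, 30.0, 8.0], [3.0] * 5))  # 0.6
```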

Another limitation of many previous studies detecting test-taking behaviors is that some efficient test takers can be misclassified as speeded or disengaged, particularly when only response times are used for detection. For example, high-performing students may not find the test challenging and, therefore, spend little time on many items. Consequently, in the absence of scores, they may be misclassified as speeded or disengaged. To overcome these potential misclassifications, in this study we used both response times and item scores. This joint model allowed us to differentiate between speeded, disengaged, and efficient students. We provided further validity evidence for the classifications by examining selected student contextual background variables, as described in the Results section.

It should be noted that the two adjacent mathematics blocks (ME01 and ME02) examined in this study appeared in the first part of Booklet 1 of the eighth-grade eTIMSS 2019 assessment. The same two blocks also appeared in Booklets 14 and 2, respectively, but in the second part of those booklets. When the blocks are in a different position or timed together with other blocks, the mixture model results may differ due to the change in the test-taking experience. Consequently, we advise caution in generalizing analysis results from Booklet 1 without further evidence from other booklets.

Country comparisons

This study revealed distinct differences among countries in the existence and prevalence of various classes of test-taking behaviors. For example, the Disengaged and Very disengaged classes were more prevalent in England and the United Arab Emirates than in the United States or Singapore. Compared to the United States, the percentages of Disengaged and Very disengaged students were higher both in England, a country with eTIMSS performance similar to that of the United States, and in the United Arab Emirates, a country with eTIMSS performance lower than that of the United States. Even in Singapore, a high-performing country, we were still able to identify a small Very disengaged group of students. These results suggest that the existence or prevalence of Disengaged students may not be directly related to country performance. We hypothesize that the observed disengagement may be related to the low-stakes nature of TIMSS, as in other international large-scale assessments. Additionally, we found an Efficient and high-performing group only in Singapore, which is the highest performing country in our sample.

Methodological considerations

It should be noted that mixture modeling is a probability-based exploratory technique. Although entropy was high in all models, some examinees could have been misclassified. Thus, although we could make plausible interpretations about the classes at the group level, one should cautiously interpret these classifications at the individual student level.

We conducted the analyses separately for each country as opposed to analyzing a pooled dataset from the four countries. One reason for doing this was to be able to identify small but unique clusters within each country we examined. In general, working with a pooled dataset would be disadvantageous when the expectation is of small groups with distinct behaviors. Because both disengagement and speededness were expected to be nondominant testing behaviors (typically reported for less than 10% of the test takers), working with smaller but meaningful partitions of the data—in this case, split by countries—allowed us to identify these behaviors with more precision. Furthermore, we might not have identified the Efficient and high-performing class found in Singapore if we had combined the data from all countries and identified clusters within this pooled dataset.

In this study, we examined only the first part of one of the seven eighth-grade TIMSS 2019 test booklets that started with two mathematics blocks, drawing on data from 4 of the 27 education systems that administered TIMSS 2019 digitally at grade 8. Similarly, we analyzed only one type of log-data-derived variable (i.e., item RT). Future studies could examine data from more countries, from the fourth-grade assessment, from the other six eighth-grade booklets starting with two mathematics blocks, or from the seven eighth-grade booklets starting with two science blocks. Future studies could also use additional log-data-derived variables, such as the frequency of item visits, to see whether other distinct behaviors can be identified. These studies might find that students from other countries or in fourth grade interact differently with booklets containing different items.

Conclusions

This study is one of the first to examine distinct test-taking behaviors using response time, a process data variable included in the TIMSS 2019 database. As a result of this study, we discovered that even though most students in a country followed a steady testing pace, almost every country examined had students who demonstrated Speeded, Very speeded, Disengaged, or Very disengaged behavior. Also, an Efficient and high-performing group was found only in Singapore, which had the highest performance in our sample. Thus, future studies may consider studying the prevalence of efficient test-taking behavior, particularly in other high-performing countries.

The prevalence of the Disengaged group was not linearly related to the achievement of the countries in the study. For example, a Very disengaged student group was found in 2% of the sample in Singapore (one of the highest performing countries in TIMSS 2019 at the eighth grade), and in 4% of the sample both in England (a country that performed above the TIMSS scale center-point, 500, in TIMSS 2019 at the eighth grade) and in the United Arab Emirates (a country that performed below the TIMSS scale center-point). Similarly, the Disengaged group was found in 9% of the United States sample, 17% of the United Arab Emirates sample, and 24% of the England sample.

In addition, there was little difference in the percentage of students identified as Speeded across countries (between 1 and 5%), except in Singapore, where we did not observe any students in that class. A Very speeded class was observed only in the United States sample (3%), again suggesting that there is not a linear relationship between achievement and speededness at the country level.

This study demonstrated that mixture modeling is a useful technique for identifying various test-taking behaviors simultaneously. Previously, it had been used to study only one of the unfavorable testing behaviors at a time, in a two-cluster model that classified examinees into either “speeded” or “not speeded” groups (e.g., Schnipke & Scrams, 1997) or “disengaged” or “not disengaged” groups (e.g., Wang et al., 2018). This study also demonstrated that mixture modeling can differentiate between levels of speededness and disengagement. Identifying unfavorable testing behaviors simultaneously and differentiating the degree of these behaviors can help testing programs ensure the validity of inferences based on test scores. For example, testing programs can flag students who show these unfavorable behaviors and embed warning systems in the testing platform to prevent these behaviors on the fly. In addition, low-stakes testing programs may consider incorporating incentives, as appropriate, for students to show their best effort.