1 Introduction

The testing of abilities and skills has a long history in both psychology and educational measurement. While until recently the default administration mode of such tests was paper and pencil, with the advance of computerized testing it has become commonplace to administer tests digitally. One clear benefit of digital administration is the potential availability of process data that can be registered in addition to the response itself (Goldhammer & Zehner 2017). These process data can come in many forms, ranging from the number of attempts made on an item to data based on advanced mouse- and eye-tracking techniques. By far the most commonly considered type of process data, however, is the response time (RT, the time that passes between reaching an item and providing the final response), a measure that is generally considered to at least potentially contain relevant information in a wide range of testing settings.

While many different ways of looking at and using RTs have been proposed in the literature over at least the last 70 years (Gulliksen 1950; van der Linden 2009), one relatively new method that has gained a lot of attention in recent years is the hierarchical model (van der Linden 2007), which jointly models the RTs together with the correctness of the responses. Partly due to its relative simplicity, and partly due to its promise to improve the precision of measurement, practitioners are not only becoming aware of the existence of this model but are taking steps toward implementing it as part of their measurement practice. While the model itself is rather simple and well known, the challenges that one should be aware of when using it in practice are both less straightforward and less well known. This chapter aims to address these issues by providing a comprehensive overview of what the hierarchical model for RTs has to offer for measurement practice, what its limitations are (and how some of these limitations can be addressed), and what the risks are of using this model in practice.

2 An Overview of the Hierarchical Model and Its Advantages

The hierarchical model consists of two measurement models, one concerning the accuracy of the response (RA, for item j denoted by X_j) and one concerning the speed of the response (RT, for item j denoted by T_j). The measurement model for RA concerns the latent ability parameter θ, while the measurement model for RT concerns the latent speed parameter τ. The modeling framework leaves it open which specific measurement models are used for modeling RA and RT and as such is neutral with respect to the particular relationship that is expected between the response data and the latent variables in the model. In practice, standard IRT models are commonly considered for modeling RA, and RTs are often modeled through a lognormal model (van der Linden 2006).

Regardless of which particular measurement models are chosen, both measurement models are connected at a higher level, through the inclusion of correlations between the different item parameters (e.g., item difficulty and item time intensity) and the inclusion of a correlation between the person parameters θ and τ. It is through these correlations that the hierarchical model can explain possible associations observed at the response level between RA and RT. The general structure of the hierarchical model is presented in Fig. 16.1, which remains neutral with respect to the choice of measurement models for RA and RT.

Fig. 16.1 The general structure of the hierarchical model: θ is connected to the responses X_1, …, X_J, and τ to the RTs T_1, …, T_J
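To make this structure concrete, the following minimal sketch simulates data under one common specification: a 2PL model for the RAs, a lognormal model for the RTs, and a bivariate normal distribution for the person parameters. The specific measurement models and all parameter values are illustrative choices, not requirements of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 1000, 20  # respondents, items

# Higher-level person model: (theta, tau) bivariate normal with correlation rho
rho = -0.3  # illustrative speed-ability correlation
theta, tau = rng.multivariate_normal(
    [0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N).T

# Item parameters (illustrative values): difficulty b and discrimination a on
# the RA side; time intensity beta and residual SD sigma on the RT side.
# beta is generated to correlate positively with b, mimicking the commonly
# observed association between difficulty and time intensity.
b = rng.normal(0.0, 1.0, J)
a = rng.lognormal(0.0, 0.3, J)
beta = 4.0 + 0.5 * b + rng.normal(0.0, 0.2, J)
sigma = rng.uniform(0.3, 0.5, J)

# RA side: 2PL model for the probability of a correct response
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
X = (rng.uniform(size=(N, J)) < p).astype(int)

# RT side: lognormal model, log T_ij = beta_j - tau_i + eps_ij
T = np.exp(beta - tau[:, None] + rng.normal(0.0, sigma, size=(N, J)))
```

Note that the only places where the two sides of the model touch are the correlation between θ and τ and the induced correlation between the item parameters; given the person parameters, the RAs and RTs are generated independently.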

2.1 Using RTs to Improve the Precision of Measurement

When contrasting the hierarchical model for RT and RA with standard IRT models that consider only RA, one clear advantage of the hierarchical model becomes readily apparent: In addition to the information about ability that is captured by the IRT measurement model that considers the RA, the hierarchical model also considers information about ability that is contained in the RTs (van der Linden et al. 2010). As Fig. 16.1 shows, the RTs are indirectly linked to ability, through the latent speed variable τ. Thus, if in the population speed and ability are correlated, the measurement model for speed provides collateral information for the estimation of ability, on top of what is provided by standard IRT models.

The correlation between speed and ability can take on any value between − 1 and 1, and in practice, positive values (e.g., see Loeys et al. 2011; Wang & Xu 2015; Meng et al. 2015), negative values (e.g., see Klein Entink et al. 2009; Goldhammer & Klein Entink 2011; Scherer et al. 2015), and values close to 0 (e.g., see van der Linden et al. 1999; Bolsinova et al. 2017; Shaw et al. 2020) have all been observed. Rather intuitively, the amount of information that the RTs can provide for improving the precision with which ability is estimated is bounded by the size of this correlation: If there is only a weak correlation between speed and ability, even a perfectly estimated speed latent variable will only be able to explain a small part of the variance in the latent ability variable. This also means that the marginal amount of information about ability that is gained through the measurement model of speed quickly decreases as the test increases in length: Once speed is estimated with a reasonable amount of precision, further reducing the measurement error with which speed is measured adds little to the precision with which ability is measured (Ranger 2013). This is in contrast with the measurement model for ability, where each new item contains new and independent information about ability that continues to increase precision as test length increases. Effectively, the RAs on all the items together with speed provide information about ability in the hierarchical model, and the relative relevance of the speed latent variable decreases as more RAs are observed, even if the latent speed dimension ends up being measured with less measurement error as the test length increases. The consequence of this is that the biggest relative gains of using the hierarchical model instead of an “RA-only” model in terms of improving precision can be expected for relatively short tests, where the added explanatory power of including an additional (imperfectly measured) predictor can be expected to matter the most.
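This bound can be made explicit. If speed and ability follow a bivariate normal distribution with correlation ρ, as is commonly assumed at the higher level of the model, then

$$\displaystyle \begin{aligned} \operatorname{Var}(\theta \mid \tau) = \sigma^2_{\theta}\,(1-\rho^2), \end{aligned}$$

so that even a perfectly known speed level reduces the uncertainty about ability by at most a factor 1 − ρ²; with ρ = 0.3, for example, the variance shrinks by only 9%. The information contributed by the RAs, in contrast, grows roughly linearly with the number of items, which is why the relative contribution of the RT side diminishes as the test gets longer.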

2.2 Relevance of RT for Test Construction and Analysis

In addition to improving the precision of measurement of ability, the hierarchical model also provides the user with a more extensive toolbox to evaluate the quality of the test, the individual items, and the performance of individuals. In this sense, it can provide practitioners with more options for critically evaluating items during test construction, for evaluating the performance of an existing test, and for detecting aberrant responding.

Since for every item not only the characteristics in the measurement model for ability are considered, but also the characteristics in the model for speed, a more complex and more complete picture emerges of the properties of the different items on the test. Not only is it possible to determine which items are relatively time intensive, but it is also possible to assess the relationship between the different item characteristics in the two measurement models. Since in the context of the hierarchical model all commonly considered measurement models for ability and for speed contain a location parameter (item difficulty and item time intensity, respectively), the association between these two parameters is also the most commonly studied (van der Linden 2009). Unsurprisingly, the correlation between item difficulty and item time intensity is generally found to be positive, with more difficult items on average requiring more time from the respondents to be solved. While this pattern may not be unexpected, it is something that test constructors should keep in mind when designing a test, especially when there will be strict limits on the amount of testing time. Less studied, but equally relevant, is the relationship between time intensity and item discrimination in the RA model: Do items on the test that respondents spend more time on provide us with more information about ability than items that are answered more rapidly? If the answer is no for a particular testing setting, it may make sense for test constructors to focus on designing items with at most a moderate time intensity, to optimize the total testing time or the precision of measurement of ability obtained within a certain time limit.
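As an illustration of how such item-level associations could be inspected in practice, the following sketch correlates per-item parameter estimates from a fitted hierarchical model; the estimate arrays and their values are hypothetical placeholders, not the output of any particular software.

```python
import numpy as np

def item_parameter_associations(b_hat, a_hat, beta_hat):
    """Correlate RA-side item characteristics with RT-side time intensity.

    b_hat, a_hat, beta_hat: per-item estimates of difficulty,
    discrimination, and time intensity from a fitted hierarchical model.
    """
    return {
        "difficulty_vs_time_intensity": np.corrcoef(b_hat, beta_hat)[0, 1],
        "discrimination_vs_time_intensity": np.corrcoef(a_hat, beta_hat)[0, 1],
    }

# Illustrative call with made-up estimates for a five-item test
print(item_parameter_associations(
    b_hat=np.array([-1.2, -0.3, 0.1, 0.8, 1.5]),
    a_hat=np.array([0.9, 1.4, 1.1, 0.8, 1.2]),
    beta_hat=np.array([3.6, 3.9, 4.0, 4.3, 4.6]),
))
```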

On the person side, a similar picture emerges. Not only do we obtain information about the speed with which different individuals answer items on the test, but we also gain insight into the relationship between the speed with which persons take the test and their overall performance (as captured by their estimated ability). Since this speed-ability correlation takes on wildly different values in practice, studying that correlation can be considered important for getting a better picture of the response processes of different types of respondents who take the test: Do people who work fast on average show a better or worse performance than those who take more time on the test? It is important to stress that since this correlation concerns a between-person association, it should not be confused with the often-studied “speed-accuracy” trade-off (Heitz 2014): the well-known phenomenon that increasing the speed with which one executes cognitive tasks generally decreases the accuracy of the outcome of that task. This speed-accuracy trade-off (which in the context of IRT might better be considered in the form of a “speed-ability trade-off”; van der Linden 2009) describes a negative within-person association, which need not translate to a negative association at the between-person level. That is, the speed-ability trade-off is only one of the factors that contribute to the between-person association between speed and ability. Another phenomenon that contributes to this association is well known from cognitive psychology: More competent persons may be able to execute a task both faster and with higher accuracy than less competent persons (i.e., have a speed-accuracy trade-off curve that is positioned above those of other respondents). This explains why it is possible for the between-person association of speed and ability to be positive, even though the within-person speed-ability trade-off pushes this association in the negative direction. When the speed-ability trade-off is the main factor driving between-person differences, a negative correlation between speed and ability will emerge. In those cases, one could be worried about the validity of the measurement of ability, since it means that many respondents performed suboptimally on the test (i.e., unnecessarily sacrificed performance in favor of speed). This phenomenon may be especially prevalent in low-stakes assessment, where it may not be safe to assume that all respondents are fully engaged with the test and where differences in observed performance (as captured by estimated ability) could to a large extent be attributable to differences in engagement rather than to differences in actual ability.

Finally, the hierarchical model extends the possibilities for detecting aberrant persons and items on the test, compared to what is possible using standard IRT models (van der Linden & Guo 2008). When using the hierarchical model, in addition to determining whether (a set of) responses should be considered an outlier in terms of the observed RAs, other outliers can be studied. On the person side, outliers in the RTs on the full set of items could suggest that the person may not be taking the test seriously (in the case of either overly fast or overly slow responses). On the item side, observing overly fast or overly slow responses for a significant portion of the respondents could suggest problems with that item, such as possible guessing (in the case of many fast responses) or possible issues with the clarity of the item (in the case of many slow responses). Since the hierarchical model considers RAs and RTs simultaneously, these cases can be studied in further detail by considering whether the conjunction of the RA and RT of a (set of) response(s) should be considered an outlier. For example, observing many fast incorrect responses on an item might suggest that guessing is prevalent, while many fast correct responses might suggest that item preknowledge is a problem or that the item can be solved using an unintended heuristic. While these patterns can to some extent be studied without the use of advanced psychometric models, the advantage of using the hierarchical model is that one can truly consider whether a (combination of) response(s) should be considered an outlier, since one can determine whether a (set of) residual(s) is extreme compared to what is expected under the model. This makes it possible to contrast an item that is simply so easy that many people provide a fast correct response to it with an item where a part of the population provides unexpectedly fast responses with an unexpectedly high rate of success.
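A minimal sketch of what such residual-based checks could look like under the lognormal RT model is given below; the flagging threshold and all input names are illustrative, and the procedures proposed in the literature (e.g., van der Linden & Guo 2008) are more refined than this.

```python
import numpy as np

def rt_residuals(T, beta_hat, sigma_hat, tau_hat):
    """Standardized log-RT residuals under the lognormal RT model:
    z_ij = (log t_ij - (beta_j - tau_i)) / sigma_j."""
    return (np.log(T) - (beta_hat - tau_hat[:, None])) / sigma_hat

def flag_fast_correct(T, X, beta_hat, sigma_hat, tau_hat, z_crit=-2.0):
    """Flag responses that are both unexpectedly fast (residual below
    z_crit) and correct; when this occurs for many respondents on an
    item, it may point to preknowledge or an unintended shortcut.
    Returns the N x J flag matrix and the per-item flag rate."""
    z = rt_residuals(T, beta_hat, sigma_hat, tau_hat)
    flags = (z < z_crit) & X.astype(bool)
    return flags, flags.mean(axis=0)
```

Flags for overly slow responses (large positive residuals), or for the fast-incorrect pattern suggestive of guessing, follow the same template.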

2.3 Simple Structure and Flexibility

A final major advantage of using the framework of the hierarchical model is its relative simplicity and flexibility, which go hand in hand. The framework’s flexibility comes from the fact that a simple structure is assumed and the two measurement models are separated and are only linked through correlations at the higher level. Because of this, one can consider a wide range of models for the RA side (including all commonly considered IRT models) and independent of that choice also consider different models for the RT side of the model. This makes it possible to choose a model specification that is tailored to the specific needs of the testing context that is considered.

On the interpretation side, the simplicity that is entailed by this simple structure is also beneficial. On the RA side of the model, one generally uses one of the common IRT models for dichotomous or polytomous data, with the standard interpretation of both the item and the person parameters remaining applicable, unaffected by the fact that RTs are considered elsewhere in the model. Similarly, on the RT side, item and person parameters are considered that keep their standard interpretation and only relate to the RTs. The connection between the two measurement models is likewise easy to understand, since it consists only of correlations between the different item parameters and between the different person parameters. All of this can be considered an advantage for practitioners, who must both fully understand the workings of the model themselves and be able to effectively communicate findings based on the model to stakeholders.

3 Limitations: A Range of Conditional Independence Assumptions

While the simplicity of the hierarchical model is often considered one of its selling points, this simplicity is at the same time at the root of a set of limitations that have both important practical and theoretical impact. That is, the assumption of a simple structure can often be considered problematic in practice, not only in the sense that the model shows less than perfect fit but also in the sense that important patterns may be overlooked or even that bias may occur in one of the outcome measures (e.g., the ability estimates or the estimated precision). It is therefore of great importance that practitioners are aware of these limitations before they consider applying the framework in practice.

The different limitations of the hierarchical model that will be considered in this section all relate to different conditional independence assumptions that are made by (all standard versions of) the hierarchical model. The hierarchical model as it was presented graphically in Fig. 16.1 shows that various variables in the model are not directly connected to each other, although all of them are indirectly connected. Figure 16.2 provides a graphical overview of the five different forms of conditional independence that are assumed by the model, where dashed lines indicate a residual correlation of 0 (i.e., conditional independence). A violation of any of these conditional independence assumptions constitutes a violation of the hierarchical model, which can result in various issues beyond simply a reduced model fit, all of which will be covered in this section.

Fig. 16.2 The hierarchical model and its five conditional independence assumptions. Each assumption is indicated by a numbered broken line, marking the assumed absence of a relationship (i.e., a residual correlation of 0)

3.1 Conditional Independence of the RAs

The first conditional independence assumption considered is that of the RAs given the latent variables:

$$\displaystyle \begin{aligned} P(\mathbf{X}|\theta, \tau) = P(X_1|\theta, \tau)P(X_2|\theta, \tau)\ldots P(X_J|\theta, \tau), \end{aligned}$$

where X is the vector containing all the item responses X_1, …, X_J. Since the hierarchical model assumes a simple structure, the RAs do not depend on speed given ability, so this assumption reduces to the standard local independence assumption considered in IRT:

$$\displaystyle \begin{aligned} P(\mathbf{X}|\theta) = P(X_1|\theta)P(X_2|\theta)\ldots P(X_J|\theta). \end{aligned}$$

Compared to the other four assumptions of conditional independence, violations of local independence and their impact have been studied rather extensively (Yen 1984; Wainer & Thissen 1996; Chen & Thissen 1997; Hoskens & De Boeck 1997; Zenisky et al. 2001). Since this form of conditional independence is shared by almost all commonly used IRT models, and since one can in principle use a measurement model for RA that allows for local dependence, this conditional independence assumption will not be discussed here extensively. It is however important to note that the presence of local dependence generally results in an underestimation of the standard error of ability (e.g., see Zenisky et al. 2001), such that in its presence one overestimates the precision with which ability is measured.
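For concreteness, a minimal sketch of Yen’s (1984) Q3 statistic, one of the standard diagnostics in the literature referenced above, is given here; the matrix of model-implied success probabilities is an assumed input that would come from whatever RA measurement model has been fitted.

```python
import numpy as np

def q3_matrix(X, p_hat):
    """Yen's (1984) Q3 diagnostic for local dependence.

    X: N x J matrix of scored responses (0/1).
    p_hat: N x J matrix of model-implied success probabilities,
    evaluated at the ability estimates from the fitted RA model.
    Returns the J x J matrix of correlations between item residuals;
    off-diagonal entries clearly away from 0 suggest local dependence
    between the corresponding item pairs.
    """
    residuals = X - p_hat
    return np.corrcoef(residuals, rowvar=False)
```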

3.2 Conditional Independence of the RTs

Similar to the assumption of conditional independence of the RAs, standard versions of the hierarchical model assume conditional independence of the RTs given the latent variables:

$$\displaystyle \begin{aligned} P(\mathbf{T}|\theta, \tau) = P(T_1|\theta, \tau)P(T_2|\theta, \tau)\ldots P(T_J|\theta, \tau) = P(T_1|\tau)P(T_2|\tau)\ldots P(T_J|\tau), \end{aligned}$$

where T is the vector containing the RTs T_1, …, T_J. Effectively, this assumption tells us that the RT of a response depends only on the overall speed of the respondent (and the item parameters in the RT model), but not on the RT of the previous response or of any other response.

While this assumption has not been studied extensively in the context of the hierarchical model, it links directly to the extensive literature on RT modeling. For example, the phenomenon of speeding on the test is well established in many testing settings with effective time limits (e.g., see Lu & Sireci 2007). Similarly, it is well known that respondents generally spend a relatively long time answering the first few items presented on a test. Both of these phenomena concern violations of the assumption of the hierarchical model that the latent variables are “stationary” throughout the test (Fox & Marianti 2016). This non-stationarity of speed throughout the test may lead to conditional dependence between the RTs in two ways. Firstly, respondents may differ in the extent to which they work with a slow start and a speeded conclusion, which should result in positive dependence between the RTs of adjacent items at the beginning or at the end of the test. Secondly, if the items are presented in booklets, the position of an item will likely differ across respondents, and hence respondents will differ in whether they encounter the item at the beginning, in the middle, or at the end of the test. In that case, even if all respondents show the exact same pattern of slowing down at the beginning and speeding up near the end of the test, positive residual dependencies will remain between adjacent items (i.e., between items in a booklet).

While the impact of unmodeled conditional dependence between the RTs in the hierarchical model has to our knowledge not been studied, one can be hopeful that in practice its impact is relatively limited. That is, one can expect an impact similar to what is commonly found in IRT models where unmodeled local dependencies are present: an underestimation of the standard error of the latent variable in the measurement model. While this may be undesirable, its impact in settings where one mainly uses the hierarchical model for improving the precision of measurement of ability can be expected to be minor, since it only directly concerns the precision with which speed is estimated. It does however mean that there is relevant model misfit and that one misses potentially relevant information about the response processes. If getting a more complete picture of these processes is considered desirable, one could consider working with a more complex measurement model for RT that allows for local dependencies.
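One simple way to probe such non-stationarity is to examine how the RT residuals behave across item positions. The sketch below assumes two hypothetical inputs that would come from the fitted RT model and the test design: a matrix of standardized log-RT residuals (as in the earlier sketch) and a matrix recording the position at which each respondent encountered each item.

```python
import numpy as np

def mean_residual_by_position(z_rt, positions):
    """Average standardized log-RT residual at each item position.

    z_rt: N x J matrix of standardized log-RT residuals.
    positions: N x J integer matrix giving the position (1, ..., J) at
    which each respondent encountered each item (relevant when booklets
    rotate the item order). A systematic decline of the mean residual
    toward later positions is a sign of speeding; elevated means at
    early positions suggest a slow start.
    """
    J = z_rt.shape[1]
    return np.array([z_rt[positions == k].mean() for k in range(1, J + 1)])
```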

3.3 Conditional Independence of RT and RA

While the previous two forms of conditional independence both only concerned one of the two measurement models, the remaining three forms of conditional independence all concern the relationship between the RA and RT side of the hierarchical model. In this sense, the remaining three forms of conditional independence can be considered to be unique to models that jointly consider RA and RT. The most well known and well studied of these assumptions is conditional independence of RT and RA:

$$\displaystyle \begin{aligned} P(\mathbf{X},\mathbf{T}|\theta, \tau) = P(\mathbf{X}|\theta, \tau)P(\mathbf{T}|\theta, \tau). \end{aligned}$$

This assumption of conditional independence thus states that once the latent variables are taken into account, the accuracy of the response is not linked to the RT: Unexpectedly fast or slow responses cannot be expected to be more (or less) likely to be correct, and vice versa.

Conditional dependence between RA and RT implies that the association between RA and RT that is observed is not fully explained by the two latent variables in the model and hence that unexplained patterns remain. Since the hierarchical model purports to fully explain the observed association between RA and RT, this form of conditional dependence can be considered conceptually important. As will be discussed below, its presence both poses risks for the model inferences and creates opportunities to gain better insight into the response processes for specific items and for specific persons.

Bolsinova et al. (2017) have provided an extensive overview of the various possible sources of positive and negative conditional dependence, which will briefly be summarized here. As they point out, conditional dependence may be present both in situations where all individuals answer the items in similar ways (i.e., homogeneous response processes) and in situations where individuals differ in how they answer the items (i.e., heterogeneous response processes).

When respondents take the test in similar ways, conditional dependence may occur due to between-person differences in the item parameters (i.e., differential item functioning). If differential item functioning (DIF) is present, an item may be relatively more difficult for one respondent than for another respondent with the same ability level. Since time intensity is generally positively correlated with item difficulty, it is reasonable to expect the item time intensity to show DIF as well: Respondents for whom the item is relatively difficult may also spend a relatively large amount of time on solving it. This covariation of item difficulty and item time intensity will generally result in negative conditional dependence, since those persons who find the item more difficult are expected to provide both a less accurate and a slower response to the item. While DIF is normally only studied by contrasting specific subgroups in the population that is tested, the negative conditional dependence described here can occur even if there is no DIF that links specifically to group membership, but only “unexplained” between-person variation in the item parameters (e.g., the item having a higher difficulty parameter for one respondent than for another, without this difference being attributable to group membership). Such DIF is not studied in practice for the obvious reason that there is always too little data to consider it (since it concerns person-by-item interactions rather than group-by-item interactions), but this does not mean that such between-person variation in the item parameters should not be expected, as Bolsinova et al. (2017) explain. Thus, any between-person covariation of item difficulty and item time intensity is sufficient for causing negative conditional dependence, and this covariation can be present even if the DIF on the RA and on the RT side averages out at the level of the different groups and hence is not detected. This means that standard DIF analysis (even if extended to the hierarchical model) cannot show that such DIF is absent, since it only considers variation in the item parameter(s) across a small prespecified set of respondent groups. Unfortunately, this means that excluding the possibility of this kind of DIF is practically infeasible.

Additionally, conditional dependence may occur due to non-stationarity of the two latent variables. That is, while the hierarchical model assumes that all persons work at a constant speed and with a constant ability level, this assumption may often be unrealistic in practice. On any test with an effective time limit, speeding near the end of the test will occur for at least a subset of the respondents, meaning that their effective speed for those later items is higher than it was for the earlier items. Due to the speed-accuracy trade-off, we can expect responses to those later items to be both faster (i.e., negative residual RT) and more often incorrect (i.e., negative residual RA), resulting in positive dependence.

While the abovementioned sources of conditional dependence between RA and RT concern situations where respondents still take the test in comparable ways, additional sources of conditional dependence may play a role when there are qualitative differences in how respondents take the test. That is, when the response processes of respondents differ for a particular item, these differences can be expected to result in conditional dependence between the RA and RT of responses to that item. The most obvious example is rapid responding, which means that some respondents provide low-quality fast responses to the item, introducing positive dependence. In contrast, slow disengaged or unmotivated responding would result in negative dependence. Additionally, when engaged respondents show differences in their answer strategy, conditional dependence can be expected. For example, when some respondents produce the answer to an item through heuristics, while others solve the item algorithmically, differences in both the expected RA and the expected RT will be present, leading to dependence.

With all these different possible sources of conditional dependence between RT and RA, it should not come as a surprise that this assumption often appears to be violated in practice (Ranger & Ortner 2012; Meng et al. 2015; Bolsinova et al. 2017; Bolsinova & Molenaar 2018). It should also be noted that both positive and negative conditional dependence between RA and RT can be observed within the same test. This will, for example, be the case if a heuristic approach leads to the correct response on one item, while it leads to an incorrect response on another item. Thus, conditional dependence between RA and RT should always be studied at the item level.

It may be noted that in addition to a possible dependence between the RA and RT on the same item, dependencies across items are also possible. For example, the well-studied phenomenon of post-error slowing (Rabbitt & Rodgers 1977; Laming 1979) suggests that there may often be a negative dependence between the RA of one response and the RT of the subsequent response. To our knowledge, this phenomenon has not been studied in the context of the hierarchical model, but it seems reasonable to assume that the impact of this kind of violation of conditional independence will be similar to the impact of conditional dependence between the RA and RT of the same item.

Beyond the fact that misfit shows that the model inadequately captures the patterns observed in the data, the presence of conditional dependence between RA and RT suggests that there may be important aspects of the response process that are not captured by the model or perhaps even misrepresented. Thus, a variety of extensions of the hierarchical model have been considered (Ranger & Ortner 2012; Meng et al. 2015; Bolsinova et al. 2017) that attempt to incorporate possible residual dependencies between RA and RT in the model. These models generally provide a more flexible toolkit for jointly modeling RA and RT, allowing users to get a more complete picture of the response processes and of the item and person characteristics, at the cost of increased model complexity. Thus, it can be considered important to first critically test for the possible presence of conditional dependence between RA and RT (e.g., using the test proposed by Bolsinova & Tijmstra 2016) and to subsequently explore the use of one of the extensions of the hierarchical model if needed and desired.
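As a first descriptive check, before turning to the formal test of Bolsinova and Tijmstra (2016) or to one of the model extensions, one could inspect the per-item correlation between the RA residuals and the RT residuals. The sketch below assumes residuals obtained from a fitted hierarchical model and is a simple heuristic, not their test.

```python
import numpy as np

def item_level_dependence(X, p_hat, z_rt):
    """Per-item correlation between RA residuals and log-RT residuals.

    X: N x J scored responses (0/1); p_hat: N x J model-implied success
    probabilities; z_rt: N x J standardized log-RT residuals.
    Values clearly away from 0 for an item point to conditional
    dependence between RA and RT for that item, with the sign
    indicating the direction of the dependence.
    """
    d_ra = X - p_hat
    J = X.shape[1]
    return np.array([np.corrcoef(d_ra[:, j], z_rt[:, j])[0, 1]
                     for j in range(J)])
```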

3.4 Conditional Independence of RT and Ability

In addition to the RT of a response possibly depending on the RA of that response or on the RTs of other responses, there is also the possibility that RT depends on ability. That is, there may be differences between persons of different ability levels in terms of how much time they spend on each item, beyond what can be explained through their overall speed. This would entail a violation of the following conditional independence assumption:

$$\displaystyle \begin{aligned} P(\mathbf{T}|\theta, \tau) = P(\mathbf{T}|\tau). \end{aligned}$$

This possibility was considered by Bolsinova and Tijmstra (2018).

Conceptually, the possibility of ability being linked to how much time a respondent spends on one item, relative to the other items, makes a lot of sense. Low-ability respondents in all likelihood realize that some of the more difficult items are too difficult for them to solve and may decide to allocate most of their limited time to the easier items, where they do stand a reasonable chance of finding the right answer. In contrast, high-ability respondents likely do not need to spend a lot of time on easy items and allocate most of their time to tackling the more difficult ones. Effectively, the hierarchical model states that high-ability respondents do not differ from low-ability respondents anywhere on the test in how they allocate their time, an assumption that may not be plausible in most practical testing settings.

The ignored possibility of conditional dependence between RT and ability is not only a limitation for the standard hierarchical model in the sense that it introduces model misfit, but it also means that not all relevant information about ability that is contained in the RTs is utilized by the model. That is, conditional dependence between RT and ability means that there is collateral information in the RTs for the estimation of ability, beyond that which is contained in the overall correlation between speed and ability. Bolsinova and Tijmstra (2018) developed a model that allows for this kind of conditional dependence and found that in practice the gain in precision with which ability is estimated when allowing for this dependence can be notable and may exceed the original gain in precision obtained when moving from an IRT model to the standard hierarchical model (i.e., from including speed as a predictor of ability). This is especially likely for longer tests, since in the extended model the collateral information in the RTs for the estimation of ability effectively increases linearly with every additional item, while in terms of collateral information, the standard hierarchical model can never do better than the inclusion of a single perfectly estimated covariate (i.e., speed). Thus, if one’s main reason for using the hierarchical model is to increase the precision with which ability is estimated, it makes sense to explore whether the extended model proposed by Bolsinova and Tijmstra makes better use of the collateral information in the RTs than the standard hierarchical model.
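One way to picture such an extension is as a set of item-specific cross-loadings of ability on the RTs. The following is a generic specification in this spirit, not necessarily the exact parameterization used by Bolsinova and Tijmstra (2018):

$$\displaystyle \begin{aligned} \log T_{ij} = \beta_j - \tau_i - \lambda_j \theta_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2_j). \end{aligned}$$

Whenever λ_j differs from zero, the RT on item j carries information about θ beyond what is transmitted through τ, and each additional item then contributes its own increment of RT-based information about ability, in line with the (roughly) linear growth in collateral information described above.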

3.5 Conditional Independence of RA and Speed

In addition to the RA of a response possibly depending on the RAs of other responses and the RTs, it may also be the case that under the hierarchical model, a residual association remains between RA and speed. In that case, one is dealing with a violation of the following conditional independence assumption:

$$\displaystyle \begin{aligned} P(\mathbf{X}|\theta, \tau) = P(\mathbf{X}|\theta). \end{aligned}$$

Such violations can be expected when the effect of “operating speed” on the probability of success is not the same for all items. For example, it may be realistic that some items can be solved rather easily using heuristics, in which case operating at a high speed would not necessarily lead to a lower expected RA than operating at a lower speed. If there are other items on the same test where using heuristics does not lead to the correct (and possibly to an incorrect) answer, respondents who operate at that same high speed level would be expected to do relatively worse on those items than respondents operating at a lower speed level. This differential impact of speed on the expected accuracy of the response for different items would show up as a negative residual dependence between speed and RA.

Extensions of the hierarchical model that specifically attempt to address possible residual dependencies between RA and speed have to our knowledge not been developed. Additionally, no formal study into the possible presence of this kind of dependence in real-life data has to our knowledge been conducted, nor have tests been developed that specifically aim to detect such possible dependence. However, an approach similar to the one proposed by Bolsinova and Tijmstra (2018) for dealing with conditional dependence between RT and ability could be explored. While such an extension would not lead to a notable improvement in the precision with which ability is estimated, it would provide users with relevant information about how different items function, which can be considered relevant for testing practice and especially test design (e.g., intentionally excluding or including items where a fast operating speed improves the expected accuracy).
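By analogy with the sketch in the previous subsection, one hypothetical way to accommodate such dependence would be to let speed enter the RA model with item-specific weights, for example by extending a 2PL model as follows:

$$\displaystyle \begin{aligned} P(X_{ij} = 1 \mid \theta_i, \tau_i) = \frac{\exp\left(a_j(\theta_i - b_j) + \delta_j \tau_i\right)}{1 + \exp\left(a_j(\theta_i - b_j) + \delta_j \tau_i\right)}, \end{aligned}$$

where δ_j would capture the item-specific effect of operating speed on the probability of success: positive for items that can be solved well with a fast heuristic, negative for items that punish haste. This is a sketch of the general idea only; as noted, no such extension has to our knowledge been formally developed.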

4 Risks of Using the Hierarchical Model in Practice

In addition to the formal and practical limitations of the standard hierarchical modeling framework discussed above, there are important risks and misconceptions surrounding the framework that should be well understood by practitioners before they choose to use the model in practice; these will be covered in this section.

One important misconception that should be avoided concerns the interpretation of the correlation between speed and ability in the model. Given that in standard formulations of the model, τ effectively captures the (weighted) average RT on the test (what could be called “effective speed”), while θ captures “effective ability” (i.e., overall performance on the test), all that this correlation tells us is whether persons who provide answers faster generally do so with higher or lower accuracy than those who provide answers more slowly (Tijmstra & Bolsinova 2018). While it may be tempting to take this between-person association and assume that it informs us what would happen to the expected performance of respondents if they were to provide answers more (or less) quickly, no such inferences can be made, since this concerns a (counterfactual) within-person association that cannot be assessed based on the model. Fundamentally different models and a fundamentally different testing setting, in which respondents are made to operate at different levels of effective speed, are needed if one wants to assess this within-person speed-ability trade-off (e.g., see Goldhammer 2015).

Another potential risk of the hierarchical model is that, unlike in standard IRT models, the estimates of ability depend on more than just the accuracy of the responses, since the correlation between speed and ability means that speed estimates affect ability estimates. Of course, this is also one of its main selling points, but the inclusion of speed as an additional predictor of ability does run the risk of introducing bias. That is, while the precision of measurement will increase through the inclusion of this additional predictor, if the actual relationship between these two variables does not fully match their relationship in the model, we will introduce bias in the ability estimates that would not have been there if we had used an “RA-only” model. With the complexity of standard test-taking settings in mind, it may not be overly realistic to assume that the simple linear relationship between speed and ability in the model completely and correctly captures the relationship between RT and RA, meaning that at least some degree of bias in the estimate of ability should be expected. The risk of introducing systematic bias is especially prominent if the actual relationship between speed and ability is not captured well by a linear correlation. This will, for example, be the case when respondents differ in how they take the test, such as when a subset of the respondents provide fast disengaged responses. It is therefore important to ascertain that respondents all took the test in similar ways (e.g., with similar levels of engagement and using similar response processes), which will of course be difficult to actually establish in testing practice, where there is only a limited amount of information available per respondent.

There is another risk that follows from using speed as an additional predictor of ability that specifically applies to high-stakes testing. Since the speed with which responses are given will have an influence on the estimated ability, it may be possible to optimize one’s speed to maximize one’s estimated ability. Since the association between speed and ability is assumed by the model to be linear, this is simply a matter of responding as fast as possible in case of a positive association between speed and ability and as slowly as possible when the association is negative. Giving very fast responses will likely result in a strong reduction in the accuracy of the responses, meaning that this strategy will likely not be very effective in case of a positive association between speed and ability. However, if the association is negative, there is nothing stopping a well-informed respondent from giving slow responses to all of the items (to the extent that the time limit allows) in order to obtain a speed estimate that is as favorable as possible. While one could partially address this issue by not informing respondents of how their speed will affect their estimated ability, this would mean that the scoring rule cannot be communicated to respondents before or during the test, which may also be problematic. These issues, together with the general possibility of introducing bias discussed in the previous paragraph, mean that using the hierarchical model for improving the precision of ability estimates in high-stakes testing settings may be ill-advised.

In contrast, using the model in low-stakes testing settings may be more defensible, since the introduction of some degree of bias in the individual ability estimates could be considered acceptable there if it leads to a relevant increase in the precision of those estimates. However, in these settings, the risk of heterogeneous response processes will be more prominent, since, unlike in high-stakes testing settings, there will likely be a relevant subset of respondents who provide fully or partially disengaged responses. If these “deviant” responses and respondents are not detected and excluded from the analysis, they will likely have a notable impact on the estimated correlation between speed and ability. Concretely, when many fast disengaged responses are present, the correlation between speed and ability will likely be more negative than it would be if those disengaged responses were excluded from the analysis. While RTs may provide relevant information for detecting disengaged responding (e.g., see Goldhammer et al. 2016; Nagy & Ulitzsch 2021), it is unlikely that any method will succeed in detecting disengaged responses with such a degree of accuracy that their presence no longer biases the estimate of the correlation between speed and ability. Consequently, there remains a risk of introducing notable bias in the estimates of ability of engaged respondents due to the failure to sufficiently exclude disengaged respondents and responses from the analysis.

Even if all disengaged responses and respondents can be eliminated from the analysis, the possibility of heterogeneous response processes remains. For example, if respondents differ in the extent to which they work heuristically versus algorithmically, this will affect both their expected RAs and their expected RTs on the test. If one ignores these differences, one ends up with one overall association between speed and ability that aggregates the patterns found for the two styles of taking the test, which will likely not adequately represent the association between speed and ability in either of the subgroups, and hence potentially introduces bias in the ability estimates. While one would ideally study each subgroup separately, there may often be a variety of differences between persons in how they take the test, and adequately capturing this heterogeneity in the response processes will often not be feasible in practice. Thus, the possibility of heterogeneous response processes poses a challenge for the use of the hierarchical model, in both low- and high-stakes testing settings.

An additional risk of bias lies in the assumption of the model that RTs are informative of ability (through speed) regardless of the accuracy of the response. Bolsinova and Tijmstra (2019) have found that in some settings, it may be plausible that only the RTs of correct responses are informative of ability. They proposed the possibility of separately measuring the speed with which correct responses and the speed with which incorrect responses are given. Since the standard hierarchical model assumes that there is a single speed latent variable that explains the RTs and that RT (conditionally) does not depend on RA, it is not equipped to deal with this possibility. By combining the RTs of correct and incorrect responses in a single latent speed variable, bias in the estimated ability may be introduced. This makes it important to carefully check whether there is indeed a single latent speed variable that explains the RTs when using the hierarchical model in practice.
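A sketch of the general idea, with separate speed variables for correct and incorrect responses (not necessarily the exact parameterization of Bolsinova & Tijmstra 2019), could look as follows:

$$\displaystyle \begin{aligned} \log T_{ij} = \beta_j - X_{ij}\,\tau^{(1)}_i - (1 - X_{ij})\,\tau^{(0)}_i + \varepsilon_{ij}, \end{aligned}$$

where τ⁽¹⁾ and τ⁽⁰⁾ denote the speed with which person i produces correct and incorrect responses, respectively; the standard hierarchical model then corresponds to the special case where the two coincide.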

Finally, it may be relevant to point out the importance of distinguishing between θ and the construct of interest that the test is supposed to measure. While ideally the two overlap perfectly, even in the best of settings, it may be realistic to assume that there is some degree of construct-irrelevant variance present in the true values of θ of different respondents, meaning that θ is not a perfect proxy for the construct of interest even if there were no uncertainty in its estimates. For example, in addition to depending on ability, someone’s test performance might be influenced by their test-taking experience or by their general reading skill. Such construct-irrelevant factors that influence θ could easily affect the expected RTs as well. Thus, there is a risk that the predictive power of speed is especially linked to this construct-irrelevant variance in θ, which would mean that using the hierarchical model instead of a standard IRT model exacerbates the confounding of measurement that occurs. That is, the precision with which θ is estimated would increase, but at the cost of increasing the discrepancy between θ and the construct that the test was intended to measure. Using the hierarchical model therefore requires users to be confident that there is no issue with construct-irrelevant variance in θ, which may be difficult to establish in practice.