A heteroscedastic hidden Markov mixture model for responses and categorized response times
Abstract
Various mixture modeling approaches have been proposed to identify within-subjects differences in the psychological processes underlying responses to psychometric tests. Although valuable, the existing mixture models are associated with at least one of the following three challenges: (1) A parametric distribution is assumed for the response times that—if violated—may bias the results; (2) the response processes are assumed to result in equal variances (homoscedasticity) in the response times, whereas some processes may produce more variability than others (heteroscedasticity); and (3) the different response processes are modeled as independent latent variables, whereas they may be related. Although each of these challenges has been addressed separately, in practice they may occur simultaneously. Therefore, we propose a heteroscedastic hidden Markov mixture model for responses and categorized response times that addresses all the challenges above in a single model. In a simulation study, we demonstrated that the model is associated with acceptable parameter recovery and acceptable resolution to distinguish between various special cases. In addition, the model was applied to the responses and response times of the WAIS-IV block design subtest, to demonstrate its use in practice.
Keywords
Mixture models · Item response theory · Response times · Hidden Markov models

In psychological and educational measurement of constructs and abilities, within-subjects differences may exist in the psychological processes that result in the responses to the items of a test. For instance, respondents may resort to fast guessing on some of the items of an educational measurement test but use a regular response process on the other items (Schnipke & Scrams, 1997); respondents may alternate between memory retrieval and actual calculation on the items of an arithmetic test (Grabner et al., 2009); or they may use trial and error on some items of a spatial puzzle but use an analytical strategy on others (Goldstein & Scheerer, 1941).
The objective of this article is to improve on existing statistical methods to detect these within-subjects differences in response processes. In psychological and educational measurement, the dominant source of information is the item responses themselves, which indicate the accuracy of the underlying response process. In this article, we additionally focus on the item response times as a valuable source of information concerning the response process, as they indicate the amount of time it took for the response processes to be executed (Luce, 1986). That is, everything else being equal, a systematic difference in response time suggests a difference in the underlying response process.
Various psychometric modeling approaches based on mixture modeling have been proposed that—in addition to the item responses—use the response times to identify within-subjects differences in response processes (Molenaar, Oberski, Vermunt, & De Boeck, 2016; Schnipke & Scrams, 1997; Wang & Xu, 2015; Wang, Xu, & Shang, 2018). However, although valuable, the existing mixture models are associated with at least one of the following three challenges: (1) A parametric distribution is assumed for the response times that—if violated—may bias the results; (2) the response processes are assumed to result in equal variances (homoscedasticity) in the response times, whereas some processes may produce more variability than others (heteroscedasticity; e.g., fast guessing is commonly associated with less variance than the regular response process); and (3) the different response processes are modeled as independent latent variables, whereas they may be related (e.g., after a guess, a subject may be more likely to guess on the next item).
Challenges 1, 2, and 3 have all been studied separately. That is, Challenge 1 has been addressed by Molenaar, Bolsinova, and Vermunt (2018), who proposed a mixture modeling approach based on the categorized response times to avoid assumptions about the specific parametric shape of the response time distribution. The approach was demonstrated to perform better than a parametric approach based on the lognormal response time distribution if the observed response time distribution departs from lognormality. In addition, Challenge 2 has been addressed by Wang and Xu (2015) and Wang et al. (2018), who proposed a model for two response processes, fast guessing and a regular solution process, in which the processes were heteroscedastic, that is, associated with differences in the underlying response time variance. Finally, Challenge 3 has been addressed by Molenaar et al. (2016), who modeled the possible relation between the response processes underlying two subsequent items using a time-homogeneous hidden Markov process of order one.
Although the three challenges above have been addressed separately, in practice they may occur simultaneously. In the present article, we therefore propose a heteroscedastic hidden Markov mixture model for responses and categorized response times in which we explicitly address Challenges 1, 2, and 3 in a joint model. That is, we combine the categorized response time approach of Molenaar et al. (2018), the heteroscedastic response processes approach of Wang and Xu (2015) and Wang et al. (2018), and the Markov process approach of Molenaar et al. (2016) in a single model. The outline is as follows: First, the full model is derived and tested in a simulation study to investigate parameter recovery and the resolution to distinguish between different special cases. Next, the model is applied to a real dataset to demonstrate its use in practice.
The general mixture framework
A joint modeling approach
Within traditional item response theory models, it is assumed either that the item responses to psychometric tests are the results of a single response process (e.g., an information accumulation process; see Tuerlinckx & De Boeck, 2005; van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011) or that the response processes are homogeneous (e.g., multiple processes underlie the scores of an arithmetic test, such as subtraction and addition, but these processes are homogeneous in the sense that, statistically, they are commonly unidimensional). As a result, between-subjects differences in the accuracy of these response processes can be modeled by posing a latent ability variable, θ_{p}, to underlie the item responses of respondent p = 1, . . . , N to a test. Similarly, individual differences in the speed with which these processes are executed can be captured by posing a latent speed variable, τ_{p}, to underlie the response times to a test.
A mixture joint modeling approach
The general idea of the mixture approach by Schnipke and Scrams (1997), Wang and Xu (2015), Wang et al. (2018), and Molenaar et al. (2016) is to model within-subjects differences in response processes by extending the joint model above to include item-specific latent class variables, ζ_{pi}, with two states c = 0, 1 to underlie the responses and response times of item i. The two states either correspond to a discrete difference in two qualitative response processes that produce heterogeneity in the data (e.g., memory retrieval and logical reasoning), or they correspond to two statistical states that capture heterogeneity in the data that is due to discrete differences in multiple response processes (e.g., multiple solution strategies) or due to continuous differences in one or more response processes (e.g., motivation or fatigue).
Table 1. Parameter restrictions in the general mixture framework necessary to obtain special cases from the literature (ν_{ci}, λ_{ci}, and σ_{ci} concern the response times; γ_{ci}, α_{ci}, and β_{ci} concern the responses)

| Model | References | c | ν_{ci} | λ_{ci} | σ_{ci} | γ_{ci} | α_{ci} | β_{ci} |
|---|---|---|---|---|---|---|---|---|
| Hierarchical model (baseline) | van der Linden (2007) | 0 | ν_{0i} | 1 | σ_{0i} | γ_{0i} | α_{0i} | β_{0i} |
| | | 1 | – | – | – | – | – | – |
| Standard mixture model | Schnipke and Scrams (1997) | 0 | ν_{0i} | 0 | σ_{0i} | – | – | – |
| | | 1 | ν_{1i} | 0 | σ_{1i} | – | – | – |
| Common-guessing mixture model | Schnipke and Scrams (1997) | 0 | ν_{0} | 0 | σ_{0} | – | – | – |
| | | 1 | ν_{1i} | 0 | σ_{1i} | – | – | – |
| Mixture hierarchical model | Wang and Xu (2015); Wang et al. (2018) | 0 | ν_{0} | 0 | σ_{0} | 0 | 0 | β_{0i} |
| | | 1 | ν_{1i} | 1 | σ_{1i} | γ_{1i} | α_{1i} | β_{1i} |
| Independent-states mixture model | Molenaar et al. (2016) | 0 | ν_{0i} | 1 | σ_{i} | 0 | α_{0i} | β_{0i} |
| | | 1 | ν_{0i}+δ_{1} | 1 | σ_{i} | 0 | α_{1i} | β_{1i} |
From the table, it can be seen that the first model, the hierarchical model by van der Linden (2007) discussed above, arises by specifying a lognormal model with λ_{0i} = 1 for the response times and a three-parameter model for the responses in state 0, leaving state 1 empty. Because this model assumes a single state only, it corresponds to a single-process model or homogeneous-process model that can be used as a baseline in drawing inferences about within-subjects differences in response processes in the data. Note that the factor loadings are constrained to be equal to 1 in the single-state model and in all other models that include τ_{p}, which amounts to an essentially tau-equivalent factor model (Lord & Novick, 1968). This assumption has been relaxed in the hierarchical model by, for instance, Fox, Klein Entink, and van der Linden (2007) and Molenaar, Tuerlinckx, and van der Maas (2015).
The next two models in Table 1 are by Schnipke and Scrams (1997). These models consider response times only. As can be seen, neither model includes a latent speed variable, as λ_{ci} = 0 in both states. In the standard mixture model, the intercept and variance are estimated for each item in both states. In the common-guessing mixture model, the intercepts and variances in Class 0 (the guessing class) are restricted to be equal across items. Although these models by Schnipke and Scrams are not latent variable models, to our knowledge they were the first to include a within-subjects mixture component for response times. In addition, the idea of common guessing has been adopted by Wang and Xu (2015) and Wang et al. (2018), who proposed a common-guessing latent-variable model for both responses and response times. As can be seen in Table 1, the response time model includes a latent speed variable in state 1 (i.e., λ_{1i} = 1) with item-specific intercepts and residual variances, and a common intercept and residual variance in state 0, but without a latent speed variable. In addition, the response model includes a three-parameter latent-variable model for the responses in state 1 and a fast-guessing parameter β_{0i} in state 0 without a latent variable. Finally, Molenaar et al. (2016) proposed a model with a latent speed variable in both states (i.e., λ_{0i} = 1 and λ_{1i} = 1), in which the item-specific intercepts in state 1 are equal to the intercepts of state 0 shifted by a common scalar, δ_{1}. In addition, the residual standard deviation is assumed to be equal across states (σ_{ci} = σ_{i}). For the responses, a two-parameter model is used in both states (γ_{ci} = 0).
Challenges and a possible solution
The response time distribution
The mixture approaches discussed above are all associated with one of the following challenges. First, the approaches all assume a lognormal distribution for the response times within the states. As has been argued by Vermunt (2011) for standard mixture models, and demonstrated by Bauer and Curran (2003) for growth mixture models and by Molenaar et al. (2018) for the independent-states mixture model in Table 1, violations of the assumed within-state distribution may result in (1) spurious states—that is, states that are not actually in the data but appear as a significant source of variation in the modeling to capture the misfit in the data distribution—and (2) biased true states—that is, differences between true states (that are actually in the data) may seem smaller or larger depending on the source of the misfit in the data distribution (e.g., positive or negative skew, truncation, etc.).
In principle, this challenge can be solved by specifying a more appropriate response time distribution within each state. However, there is commonly no theory about the response time distribution within each state. In addition, inferring the within-state response time distribution from the data is difficult, because only the observed distribution of the response times is available, which cannot straightforwardly be used to make inferences about the parametric form of the within-state distribution, as the observed response time distribution will depart from the within-state distribution by definition. Kuipers, Visser, and Molenaar (2018) proposed a test of lognormality of the within-state response time distribution. However, if the lognormality assumption fails, the above mixture models are not suitable for the data.
As a solution, Molenaar et al. (2018) proposed to categorize the continuous response times so that the resulting response time distribution could be better captured using category-specific threshold parameters. Specifically, Molenaar et al. (2018) proposed to replace the lognormal linear model above by a partial-credit model (Masters, 1982), which is an adjacent-category model for ordered categories, or any other model for ordered categories (e.g., the graded response model [Samejima, 1969], which is a cumulative probability model). With respect to the categorization of the response times, Molenaar et al. (2018) proposed an item-wise categorization procedure using the observed percentiles. For five or seven categories, this approach worked well in terms of both parameter recovery and power.
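To make the adjacent-category idea concrete, the category probabilities of a partial-credit model can be sketched as follows. This is a minimal illustration of the model family, not the exact parameterization used by Molenaar et al. (2018); the person parameter `eta` and the threshold values are hypothetical.

```python
import numpy as np

def partial_credit_probs(eta, thresholds):
    """Category probabilities of an adjacent-category (partial-credit) model.

    eta        : person parameter (e.g., a function of latent speed tau_p)
    thresholds : the T - 1 category thresholds of one item
    Category t receives log-weight sum_{s <= t} (eta - thresholds[s - 1]),
    with an empty sum (log-weight 0) for the first category.
    """
    logits = np.concatenate(([0.0], np.cumsum(eta - np.asarray(thresholds))))
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    return weights / weights.sum()

# Hypothetical person parameter and thresholds for a five-category item
p = partial_credit_probs(eta=0.5, thresholds=[-1.0, 0.0, 1.0, 2.0])
```

The probabilities sum to 1 by construction, and increasing `eta` shifts probability mass toward higher categories, which is the mechanism by which the latent variable orders the categorized response times.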
Dependency between the states
In the general model in Eq. 5, it is assumed that the latent class variables underlying the items, ζ_{pi}, are independent. However, various examples show why the ζ_{pi} variables can be dependent. First, if a respondent guesses on one item, it may be more likely that this respondent will also guess on the next item. A similar example concerns response strategies in general: if multiple solution strategies are possible that differ in their efficiency, using an efficient solution strategy on one item will probably increase the probability that this strategy will also be used on the next item. Another example is post-error slowing (Rabbitt, 1979), which refers to the phenomenon that respondents who know (or think) that they made an error on a given item slow down on the next item, resulting in a dependency between subsequent ζ_{pi}s.
Within the general mixture framework in Eq. 5, Molenaar et al. (2016) accounted for a possible dependency of the item-specific latent class variable of item i, ζ_{pi}, on the item-specific latent class variable of item i – 1, ζ_{p(i–1)}. That is, in a model for continuous lognormal response times, the assumption of independent ζ_{pi} was relaxed by introducing a first-order Markov structure (e.g., MacDonald & Zucchini, 1997) on ζ_{pi}. Molenaar et al. (2016) showed that the presence of a Markov structure in the data can successfully be detected using the fit indices BIC, CAIC, AIC with a triple penalty term (AIC3), and the sample-size-adjusted BIC (saBIC). The conventional AIC (which uses a double penalty term) was associated with an increased false positive rate.
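As a sketch of what this first-order structure implies, the marginal state probabilities across items can be propagated from the initial state probability and the transition matrix. The numerical values below are taken from the simulation design reported later in this article; the 14-item test length is only illustrative.

```python
import numpy as np

# First-order Markov structure on the item-specific latent class variables
# zeta_pi (values from the paper's simulation design: pi_1 = .666,
# pi_01 = .231, pi_11 = .769).
pi1_init = 0.666                       # P(zeta_p1 = 1)
P = np.array([[0.769, 0.231],          # rows: state at item i - 1 (0, 1)
              [0.231, 0.769]])         # cols: state at item i     (0, 1)

def marginal_state_probs(pi1_init, P, n_items):
    """Marginal P(zeta_pi = c) for each item implied by the Markov chain."""
    probs = [np.array([1 - pi1_init, pi1_init])]
    for _ in range(n_items - 1):
        probs.append(probs[-1] @ P)    # one-step Chapman-Kolmogorov update
    return np.vstack(probs)

m = marginal_state_probs(pi1_init, P, n_items=14)
```

With this symmetric transition matrix, the marginal probability of state 1 drifts from the initial .666 toward the stationary distribution (.5, .5) as the test progresses.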
Heteroscedasticity between the states
The categorized response time model and the Markov structure thus provide solutions to the spurious-state and independence challenges of the general framework in Eq. 5. However, contrary to Wang and Xu (2015), Wang et al. (2018), and Schnipke and Scrams (1997), both models assume that the within-state response time variance is homoscedastic (equal across states). In the Markov mixture model, this assumption is explicit, as σ_{0i} = σ_{1i} in the model by Molenaar et al. (2016). In the categorized response time model, it is less explicit, since traditional item response theory models do not have a variance parameter. However, the same thresholds are applied in both states to categorize the response times (since the marginal response time distribution is categorized and not the within-state response time distribution, because the latter is unknown). Therefore, heteroscedasticity across states will not be detected and will bias the results, as we demonstrate in the simulation study below.
Proposed model
Note that, contrary to Wang and Xu (2015) and Wang et al. (2018), we follow Molenaar et al. (2018; Molenaar et al., 2016) and use a two-parameter model for the responses (see also Table 1). Our main reason is that we want to operate in a generalized linear modeling framework, which does not include the three-parameter model as a special case.^{2} Using a three-parameter model would increase model complexity, resulting in a potentially poorly identified model. Within the generalized linear modeling framework, we can be sure that the model is identified and can be estimated properly. In addition, our modeling interest is mainly in detecting possible differences in item discrimination and item easiness across the different states (suggesting different response processes). However, extending the present model to a three-parameter model would be possible in principle.
In sum, we consider the following five models:

1. Baseline: A baseline model with one state (see Table 1).
2. Heteroscedastic Markov states: The full model with a Markov structure on the latent class variables and heteroscedastic states.
3. Homoscedastic Markov states: A model with a Markov structure on the latent class variables and homoscedastic states.
4. Heteroscedastic independent states: A model with independent latent class variables and heteroscedastic states.
5. Homoscedastic independent states: A model with independent latent class variables and homoscedastic states.
In all models, we use categorized response times. In the simulation study below, we investigate the viability of the general model in terms of parameter recovery and the resolution to distinguish between the different models above in responses and categorized response time data.
Categorization of response times
Estimation
The models above were implemented in LatentGold (Vermunt & Magidson, 2013) and estimated using marginal maximum likelihood. We optimized the marginal log-likelihood function in Eq. 13 above by numerically integrating the double integral using ten quadrature points for each dimension. Next, we used the Baum–Welch adapted EM algorithm (Baum, Petrie, Soules, & Weiss, 1970; Welch, 2003) to obtain reasonable starting values, after which we used the Newton–Raphson algorithm to find the maximum of the likelihood function. Because this procedure is full-information, missing data in the responses or the response times do not pose a problem as long as these are missing at random (Little & Rubin, 1987). The syntax to fit the full model (heteroscedastic Markov states) is available in the Appendix.
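As an illustration of how the Markov states are handled inside such a full-information procedure, the following sketch evaluates the hidden-Markov part of the likelihood with the standard forward recursion (Baum et al., 1970), conditional on the continuous latent variables. In the actual estimation, this quantity is embedded in the numerical integration over θ_{p} and τ_{p}; the inputs below are hypothetical toy values.

```python
import numpy as np

def forward_loglik(pi_init, P, cond_lik):
    """Forward recursion for the hidden-Markov part of the likelihood,
    with rescaling to avoid numerical underflow.

    pi_init  : (2,) initial state probabilities (pi_0, pi_1)
    P        : (2, 2) transition matrix, P[c, d] = P(state d | previous state c)
    cond_lik : (n_items, 2) likelihood of item i's observed response and
               categorized response time in each state (given theta, tau)
    """
    alpha = pi_init * cond_lik[0]          # forward weights after item 1
    log_scale = 0.0
    for i in range(1, cond_lik.shape[0]):
        alpha = (alpha @ P) * cond_lik[i]  # propagate states, absorb item i
        s = alpha.sum()
        alpha /= s                         # rescale; accumulate log of scale
        log_scale += np.log(s)
    return log_scale + np.log(alpha.sum())

# Toy inputs: three items, two states
pi_init = np.array([1 / 3, 2 / 3])
P = np.array([[0.769, 0.231], [0.231, 0.769]])
cond_lik = np.array([[0.2, 0.5], [0.3, 0.1], [0.4, 0.6]])
ll = forward_loglik(pi_init, P, cond_lik)
```

The recursion sums over all 2^n state paths in linear time, which is what makes the Markov mixture tractable for tests of realistic length.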
Simulation study
Design
To study the viability of the proposed models, we investigated the parameter recovery of the latent state parameters α_{ic}, β_{ic}, π_{1}, π_{10}, and π_{11}. We considered the situation in which the response time distribution departs from a lognormal distribution, such that the continuous mixture model for the response times in Eq. 7 is unsuitable (i.e., it will produce bias and false positives, as discussed above).

Heteroscedastic Markov states. To generate data for the first scenario, we used the heteroscedastic Markov states model with a continuous lognormal response time distribution with mean ν_{i} − δ_{c} − τ_{p} and standard deviation σ_{c}, which is the continuous version of Eq. 11 from the heteroscedastic Markov states model for categorized response times. For the mixture parameters, we used π_{1} = .666 for the initial state parameter and π_{01} = .231 and π_{11} = .769 for the transition parameters (note that these choices imply that π_{0} = .333, π_{10} = .231, and π_{00} = .769). These effect sizes correspond to moderately imbalanced initial state probabilities (Dias, 2006) and moderately unstable transition parameters (Bacci et al., 2014). The responses were simulated using α_{0i} = 1.5 and α_{1i} = 1 for all i for the discrimination parameters. For the easiness parameters, we used increasing, equally spaced values between –2 and 0 for β_{0i} and between 0 and 2 for β_{1i}. For the response times, we simulated τ_{p} with σ_{τ} = √0.13 and a correlation between τ_{p} and θ_{p} of .4. For the intercepts, we used ν_{i} = 2 for all i, δ_{0} = 0, and δ_{1} = 0.5. For the residual standard deviations, we used σ_{0} = √0.39 and σ_{1} = √0.13. These choices result in communalities of .25 in Class 0 and .5 in Class 1 on the log scale (as we simulated lognormal data; see above). In addition, the intercept difference of 0.5 between the states was considered to be of medium effect size by Molenaar et al. (2018). After the lognormal response time data were simulated, we log-transformed the simulated response times, resulting in normally distributed log-response times. These log-response times were subsequently transformed using the inverse Box–Cox transformation, (ξx + 1)^{1/ξ}, with transformation parameter ξ = 0.3, such that the raw response times (i.e., the exponentially transformed Box–Cox log-response times) are overly skewed as compared to a lognormal distribution.
As we mentioned, this makes these data unsuitable for mixture models like the one in Eq. 7, calling for our categorized response time mixture model. See Fig. 4 for an example response time distribution from the present simulation study.
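The response time part of this data-generating scheme can be sketched as follows. The sample size and random seed are our own illustrative choices, the response part and the correlation between τ_{p} and θ_{p} are omitted, and the final Box–Cox-type skewing step is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_items = 500, 14                                # illustrative dimensions

# Design values from the simulation section
sigma_tau = np.sqrt(0.13)
tau = rng.normal(0.0, sigma_tau, N)                 # latent speed
nu, delta = 2.0, np.array([0.0, 0.5])               # intercept and state shifts
sigma_c = np.array([np.sqrt(0.39), np.sqrt(0.13)])  # heteroscedastic SDs

# Simulate the state chain: pi_1 = .666, pi_01 = .231, pi_11 = .769
pi1, p01, p11 = 0.666, 0.231, 0.769
states = np.empty((N, n_items), dtype=int)
states[:, 0] = rng.random(N) < pi1
for i in range(1, n_items):
    p_state1 = np.where(states[:, i - 1] == 1, p11, p01)
    states[:, i] = rng.random(N) < p_state1

# Lognormal response times: log RT ~ Normal(nu - delta_c - tau_p, sigma_c)
mu = nu - delta[states] - tau[:, None]
log_rt = rng.normal(mu, sigma_c[states])
rt = np.exp(log_rt)
# (The paper additionally skews log_rt with a Box-Cox-type transform
#  before exponentiating; that step is omitted here.)
```

State 1 responses are, on average, faster (by δ_{1} = 0.5 on the log scale) and less variable than state 0 responses, which is exactly the heteroscedasticity the model is meant to pick up.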

Homoscedastic Markov states. In this scenario, we used the same setup and procedure as for the heteroscedastic Markov states scenario, but with σ_{0} = σ_{1} = √0.13.

Heteroscedastic independent states. In this scenario, we used the same setup and procedure as in the heteroscedastic Markov states scenario, but without the Markov structure on the states (i.e., P(ζ_{pi} = 1) = π_{1} for all i).

Homoscedastic independent states. In this scenario, we used the same setup and procedure as in the heteroscedastic independent states scenario, but with σ_{0} = σ_{1} = √0.13.

Baseline. In this scenario, we used a baseline model without a mixture (i.e., only one state: δ_{0} = δ_{1} = 0, σ_{0} = σ_{1} = σ, α_{0i} = α_{1i} = α_{i}, and β_{0i} = β_{1i} = β_{i}). For the response time parameters ν_{i} and σ_{i}, we used the parameters from state 0 in the homoscedastic independent states model above. For the responses, we used α_{i} = 1.5 and equally spaced values between –2 and 2 for β_{i}. All other parameters were the same as in the homoscedastic independent states model above. In addition, as in the other scenarios, the response time data were transformed according to the Box–Cox transformation as explained above.
After the responses and continuous response times had been simulated, the raw response times were categorized at percentiles 2.28, 25.25, 74.75, and 97.73, resulting in five response time categories. Note that it does not make a difference whether the raw or transformed response times are categorized as the percentile scores will be the same. The percentiles that we used are obtained from a standard normal distribution at – 2, – 2/3, 2/3, and 2.
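The item-wise categorization step can be sketched as follows. This is a minimal illustration of the procedure described above; note that, because only percentile ranks matter, any monotone transformation of the response times yields identical categories.

```python
import numpy as np

def categorize_rts(rt_matrix, pctiles=(2.28, 25.25, 74.75, 97.73)):
    """Item-wise categorization of raw response times at fixed percentiles.

    rt_matrix : (N, n_items) raw response times
    Returns an integer matrix with categories 0, ..., len(pctiles).
    """
    cats = np.empty_like(rt_matrix, dtype=int)
    for i in range(rt_matrix.shape[1]):
        cuts = np.percentile(rt_matrix[:, i], pctiles)  # item-specific cutoffs
        cats[:, i] = np.searchsorted(cuts, rt_matrix[:, i])
    return cats
```

Because the cutoffs are computed per item, items with very different time scales all end up with the same five-category coding.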
We used 50 replications for each data scenario. To the replications within each data scenario we fit the five models discussed above. Note that we thus did not fit the true model to the simulated data as the data were generated according to the Box–Coxtransformed lognormal model, and we fit a model for categorized response times. However, if the categorized model is viable, the latent state parameters α_{ic}, β_{ic}, π_{1}, π_{10}, and π_{11} should be correctly recoverable despite the response times being categorized. The recovery of the response time measurement model parameters ν_{it}, λ_{i}, and σ_{c} cannot be studied as they do not have a corresponding true parameter value.
Results
Parameter recovery
We limit our presentation of the parameter recovery results to the most complex model (heteroscedastic Markov states model) as this is the model of key interest and the most challenging model to fit in terms of the number of parameters, but the results for the other, more parsimonious, models are comparable.
Table 2. Recovery results for the Markov parameters and for ρ

| Parameter | True | Mean(Est) | SD(Est) | RMSE | Mean SE | Coverage |
|---|---|---|---|---|---|---|
| ρ | –.400 | –.420 | .033 | .038 | .032 | .940 |
| π_{1} | .667 | .661 | .085 | .085 | .073 | .940 |
| π_{10} | .231 | .222 | .014 | .016 | .015 | .900 |
| π_{11} | .769 | .768 | .018 | .017 | .014 | .960 |
True positive rates
Table 3. Detection rates of the BIC, AIC, AIC3, CAIC, and saBIC for the five models in each data scenario
As can be seen from Table 3, for the baseline model and the heteroscedastic Markov states model, the true positive rates are perfect (i.e., 1.00) for all fit indices, except that the true positive rate of the AIC for the baseline model is only .24. As can be seen from the false positive rates in the baseline scenario, using the AIC, the baseline model is hard to distinguish from the homoscedastic Markov states model, which is associated with a false positive rate of .40. For the homoscedastic Markov states model, the true positive rates are all acceptable to good, with values between .86 and .98. For the heteroscedastic independent states model, the true positive rates are also acceptable to good, with values between .72 and 1.00, and for the homoscedastic independent states model, the true positive rate is moderate for the AIC, with a rate of .62, but acceptable to good for the other fit indices, with values between .80 and .98.
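The fit indices compared above can all be computed from the maximized log-likelihood, the number of free parameters, and the sample size. The sketch below uses the standard textbook definitions, with the common (n + 2)/24 sample-size adjustment for the saBIC; LatentGold's exact definitions may differ slightly, and the example values are hypothetical.

```python
import math

def fit_indices(loglik, k, n):
    """Information criteria of the kind used for model selection here.

    loglik : maximized marginal log-likelihood
    k      : number of free parameters
    n      : sample size
    """
    return {
        "AIC":   -2 * loglik + 2 * k,
        "AIC3":  -2 * loglik + 3 * k,                    # triple penalty term
        "BIC":   -2 * loglik + k * math.log(n),
        "CAIC":  -2 * loglik + k * (math.log(n) + 1),
        "saBIC": -2 * loglik + k * math.log((n + 2) / 24),
    }

ic = fit_indices(loglik=-13100.0, k=60, n=978)           # hypothetical values
```

Lower values indicate better fit for all five indices; they differ only in how heavily they penalize the parameter count k.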
Conclusion
In conclusion, it appeared that parameter recovery is acceptable and that all fit indices but the AIC behaved acceptably in selecting among the different models under the circumstances simulated. The poor behavior of the AIC in model selection is in line with the findings of Molenaar et al. (2016), who also found poor performance of the AIC in selecting among models that did and did not include (Markov) mixtures. In addition, we found that neglecting heteroscedasticity between classes may bias the item parameter estimates and increase their variance.
The main purpose of these simulations was a proof of principle, in the sense that we wanted to show that we can adequately recover the true parameter values of the model and that we can distinguish well between the different models given a reasonable sample size and reasonable effect sizes. However, the results above depend on the choices we made concerning parameter values. That is, the true positive rates will decrease with decreasing differences between the states in terms of δ_{c}, β_{ci}, and α_{ci}. In addition, if the stability of the states decreases (reflected by larger values for π_{10} and smaller values for π_{11}), true positives will also decrease (see, e.g., Molenaar et al., 2016).
Illustration
Data
In this section, we demonstrate the viability of the present modeling approach on a real dataset. We used the responses and response times to the block design subtest of the Hungarian WAIS-IV (Nagyné Réz et al., 2008). These data were previously analyzed by Molenaar, Bolsinova, Rósza, and De Boeck (2016) using a mixture model for the responses but not for the response times. The data consist of the responses and response times of 978 respondents to 14 items. The items were designed to be decreasing in easiness. The raw response times are between 1 and 360 s. We omitted Item 1 from the analysis, as this item caused numerical problems due to its high success rate (.999). We used the same procedure as in the simulation study: the same categorization procedure for the raw response times, the same models, and the same estimation procedure.
Results
Table 4. Model fit indices for the five models considered in the application, for T = 5

| Model | BIC | AIC | AIC3 | CAIC | saBIC |
|---|---|---|---|---|---|
| Baseline | 27,612 | 27,163 | 27,255 | 27,704 | 27,320 |
| Heteroscedastic Markov states | 27,043 | 26,442 | 26,565 | 27,166 | 26,652 |
| Homoscedastic Markov states | 27,068 | 26,472 | 26,594 | 27,190 | 26,681 |
| Heteroscedastic independent states | 27,428 | 26,837 | 26,958 | 27,549 | 27,044 |
| Homoscedastic independent states | 27,437 | 26,851 | 26,971 | 27,557 | 27,056 |
Robustness analysis
To see whether the results above are robust to the exact number of response time categories used, we also conducted the above analyses using T = 3 and T = 2 response time categories. In the case of T = 3, we categorized the continuous response times of each item at percentiles 15.87 and 84.13 (obtained from a standard normal distribution at – 1 and 1). In the case of T = 2, we used a median split of the continuous response times of each item (i.e., we used a cutoff corresponding to percentile 50).
First, the estimates of the parameters π_{1}, π_{10}, and π_{11} are .599 (SE: .059), .152 (SE: .027), and .848 (SE: .016) for T = 3, and .582 (SE: .067), .241 (SE: .018), and .759 (SE: .021) for T = 2. As discussed above, for T = 5 these estimates were .617 (SE: .052), .124 (SE: .016), and .840 (SE: .015), respectively. As judged by the standard errors, these estimates do not differ importantly.
Table 5. Model fit indices for the five models considered in the application, for T = 3

| Model | BIC | AIC | AIC3 | CAIC | saBIC |
|---|---|---|---|---|---|
| Baseline | 20,553 | 20,231 | 20,297 | 20,619 | 20,343 |
| Heteroscedastic Markov states | 20,158 | 19,689 | 19,785 | 20,254 | 19,853 |
| Homoscedastic Markov states | 20,178 | 19,709 | 19,805 | 20,274 | 19,873 |
| Heteroscedastic independent states | 20,417 | 19,958 | 20,052 | 20,511 | 20,118 |
| Homoscedastic independent states | 20,426 | 19,967 | 20,061 | 20,520 | 20,128 |
Table 6. Model fit indices for the five models considered in the application, for T = 2

| Model | BIC | AIC | AIC3 | CAIC | saBIC |
|---|---|---|---|---|---|
| Baseline | 18,603 | 18,344 | 18,397 | 18,656 | 18,435 |
| Heteroscedastic Markov states | 18,096 | 17,686 | 17,770 | 18,180 | 17,829 |
| Homoscedastic Markov states | 18,102 | 17,696 | 17,779 | 18,185 | 17,838 |
| Heteroscedastic independent states | 18,408 | 18,008 | 18,090 | 18,490 | 18,148 |
| Homoscedastic independent states | 18,404 | 18,008 | 18,089 | 18,485 | 18,147 |
Discussion
In this article, we presented a mixture model to detect heterogeneity in the response processes underlying psychometric test items. The new model combines the strengths of previous mixture models by Schnipke and Scrams (1997), Wang and Xu (2015), Wang et al. (2018), Molenaar et al. (2016), and Molenaar et al. (2018). In our modeling approach, we used mixture modeling in an indirect application (Yung, 1997). That is, the mixture components in our model are not necessarily substantively interpretable but are rather statistical tools to detect heterogeneity in the data that is due to differences in response processes. This differs from the modeling perspective of, for instance, Wang and Xu, who used mixture modeling in a direct application (Dolan & van der Maas, 1998), in which the mixture components are substantively interpreted. Specifically, Wang and Xu distinguished between a fast-guessing process and a solution process. Regardless of the nature of the mixture application (direct or indirect), the methodology presented in this article is equally amenable to the modeling of fast-guessing and solution behavior. That is, if the measurement model for the responses in the faster state is restricted to represent fast guessing (i.e., discrimination equal to 0; see Table 1), the model is in essence the model by Wang and Xu, but with Markov-dependent states. Other restrictions are possible, as we illustrate below. However, such restrictions need a strong theory about the response processes, which is not always available.
Throughout this article, we have assumed two latent states to underlie the item responses and response times. This choice is mainly pragmatic, in the sense that we think two states can capture the most important patterns in the data. In addition, some theories describe binary processing, for instance, the automated versus controlled processing theory (Shiffrin & Schneider, 1977) and the fast versus slow intelligence theory (DiTrapani, Jeon, De Boeck, & Partchev, 2016; Partchev & De Boeck, 2012). However, some situations may certainly require more than two states (e.g., if three clearly distinct solution strategies underlie the response behavior of the respondents). In principle, it is straightforward to extend the present model to include three or more item-specific states. However, the number of parameters grows rapidly: for three item-specific states, six parameters need to be estimated for each response variable (three discriminations and three easiness parameters). In such a situation, either the sample size should be very large, or one should incorporate reasonable model restrictions, that is, restrictions that are either pragmatically defensible or derived from theory. For instance, Molenaar et al. (2018) considered a model in which the item parameters have an overall difference across states and not an item-specific difference (as in the models considered in the present article). In addition, Molenaar et al. (2016) used the restrictions that van der Maas and Jansen (2003) derived from the developmental theory by Siegler (1981) to distinguish different solution strategies underlying the Piagetian balance scale task. Using these restrictions, Molenaar et al. (2016) identified up to five states in a hidden Markov model for responses and continuous response times.
To solve the problem of spurious mixtures, we followed Molenaar et al. (2018) and categorized the continuous response times. This approach is pragmatic, but it has been shown to be effective in countering false positives in the case of distributional misfit. However, it has the drawback that information about individual differences is lost, such that the power to detect an effect may depend on arbitrary choices concerning the number and location of the cutoff values. It is therefore advisable to always investigate the robustness of the results with respect to the cutoff values, as was illustrated in our real-data example.
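A minimal sketch of such a robustness check is given below, using simulated log response times and quantile-based cutoffs. The number of categories and the cutoff rule are arbitrary choices here, which is exactly the point: downstream results should be compared across several such choices.

```python
import numpy as np

rng = np.random.default_rng(1)
log_rt = rng.normal(loc=2.0, scale=0.5, size=1000)  # simulated log response times

def categorize(x, n_categories):
    """Split a continuous variable into ordered categories at equally spaced
    quantile cutoffs (one pragmatic choice among many)."""
    cutoffs = np.quantile(x, np.linspace(0, 1, n_categories + 1)[1:-1])
    return np.digitize(x, cutoffs)  # categories 0 .. n_categories - 1

# Robustness check: re-categorize under different numbers of categories
# (in practice, one would refit the model to each categorization).
for m in (3, 5, 7):
    cats = categorize(log_rt, m)
    print(m, np.bincount(cats))
```

If the substantive conclusions are stable across such alternative categorizations, the arbitrariness of the cutoffs is less of a concern.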
Another aspect of the general mixture modeling framework considered in this article (Table 1) is the operationalization of response processes in terms of the item properties (discrimination and easiness) and the expected response times. That is, a difference in response processes is assumed to be characterized by (1) a difference in the discrimination and/or easiness parameters and (2) a difference in the expected response time. The first difference can be justified by the statistical theory of measurement invariance (Mellenbergh, 1989; Meredith, 1993), which dictates that a difference in measurement model parameters indicates a difference in the interpretation of the underlying latent variable. That is, if faster responses are associated with different measurement parameters (discrimination and/or easiness) than the slower responses, the latent variable has a different interpretation for these responses, indicating a different response process. As discussed above, the second difference can be justified by the theory of response times in experimental psychology (e.g., Luce, 1986), which dictates that a response time reflects the time needed to execute a certain psychological process. A difference in expected response times thus indicates a different process (all other things being equal).
An alternative to the statistical operationalization of response processes adopted here is the process-modeling operationalization from mathematical psychology. In this framework, stronger assumptions are made about the response process (e.g., that a response process consists of noisy information accumulation that stops once enough information for one of the response alternatives has been gathered). From these assumptions, a mathematical model can be derived and fit to the data. Examples of such models include the diffusion model (Ratcliff, 1978), the linear ballistic accumulator model (Brown & Heathcote, 2008), and the race model (Audley & Pike, 1965). However, these models are mathematically more complex, which makes them less suitable to the aims of the present article. Yet it will certainly be interesting to consider models from mathematical psychology in light of the present mixture modeling framework.
Footnotes
 1.
Note that the restrictions provided in Table 1 result in models equivalent to the models discussed in the text (i.e., equivalent in terms of the likelihood of the model). The exact parameterization in the corresponding articles is in some cases slightly different. For instance, Schnipke and Scrams (1997) estimated ln(ν_{ci}) instead of ν_{ci}, and Wang and Xu (2015) used α_{1i}(θ_{p} – β_{1i}) in the three-parameter model instead of α_{1i}θ_{p} + β_{1i}.
 2.
However, note that it is possible to specify the three-parameter model as a mixture of a two-parameter model and a guessing model, both of which are generalized linear models.
 3.
Note that we cannot speak of the “true model,” because the response time data were simulated under a different model (a categorized Box–Cox-transformed lognormal model) than the model applied to the data (a partial-credit model; see Eq. 11).
 4.
For numerical reasons, we estimated the logits of the initial-state and transition parameters. In addition, we estimated exp(–σ_{1}). The reported standard errors were obtained using the delta method.
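The back-transformations in this footnote can be sketched as follows. The estimates and standard errors below are made-up numbers, purely to show the delta-method algebra:

```python
import numpy as np

# 1. A transition probability estimated on the logit scale.
eta_hat, se_eta = 1.2, 0.3               # eta = logit(pi) and its standard error
pi_hat = 1.0 / (1.0 + np.exp(-eta_hat))  # back-transform to a probability
se_pi = se_eta * pi_hat * (1.0 - pi_hat) # delta method: d pi / d eta = pi * (1 - pi)

# 2. A scale parameter estimated as phi = exp(-sigma).
phi_hat, se_phi = 0.6, 0.05
sigma_hat = -np.log(phi_hat)             # back-transform: sigma = -ln(phi)
se_sigma = se_phi / phi_hat              # delta method: |d sigma / d phi| = 1 / phi

print(pi_hat, se_pi, sigma_hat, se_sigma)
```

Estimating on the unconstrained scale keeps the optimizer away from boundary problems (probabilities outside [0, 1], negative scale parameters), and the delta method then translates the standard errors back to the reporting scale.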
Notes
Author note
The research by D.M. was made possible by a grant from the Netherlands Organization for Scientific Research (NWO VENI grant 451-15-008).
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Audley, R. J., & Pike, A. R. (1965). Some alternative stochastic models of choice. British Journal of Mathematical and Statistical Psychology, 18, 207–225.
Bacci, S., Pandolfi, S., & Pennoni, F. (2014). A comparison of some criteria for states selection in the latent Markov model for longitudinal data. Advances in Data Analysis and Classification, 8, 125–145.
Bauer, D. J., & Curran, P. J. (2003). Distributional assumptions of growth mixture models: Implications for overextraction of latent trajectory classes. Psychological Methods, 8, 338–363. https://doi.org/10.1037/1082-989X.8.3.338
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.
Bozdogan, H. (1993). Choosing the number of component clusters in the mixture model using a new informational complexity criterion of the inverse-Fisher information matrix. In O. Opitz, B. Lausen, & R. Klar (Eds.), Information and classification (pp. 40–54). Heidelberg, Germany: Springer-Verlag.
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178. https://doi.org/10.1016/j.cogpsych.2007.12.002
Dias, J. (2006). Latent class analysis and model selection. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. Nurnberger, & W. Gaul (Eds.), From data and information analysis to knowledge engineering (pp. 95–102). Berlin, Germany: Springer.
DiTrapani, J., Jeon, M., De Boeck, P., & Partchev, I. (2016). Attempting to differentiate fast and slow intelligence: Using generalized item response trees to examine the role of speed on intelligence tests. Intelligence, 56, 82–92.
Dolan, C. V., & van der Maas, H. L. (1998). Fitting multivariate normal finite mixtures subject to structural equation modeling. Psychometrika, 63, 227–253.
Fox, J.-P., Klein Entink, R. H., & van der Linden, W. J. (2007). Modeling of responses and response times with the package cirt. Journal of Statistical Software, 20, 1–14.
Goldstein, K., & Scheerer, M. (1941). Abstract and concrete behavior: An experimental study with special tests. Psychological Monographs, 53(2), 1–151.
Grabner, R. H., Ansari, D., Koschutnig, K., Reishofer, G., Ebner, F., & Neuper, C. (2009). To retrieve or to calculate? Left angular gyrus mediates the retrieval of arithmetic facts during problem solving. Neuropsychologia, 47, 604–608.
Gudicha, D. W., Schmittmann, V. D., & Vermunt, J. K. (2016). Power computation for likelihood ratio tests for the transition parameters in latent Markov models. Structural Equation Modeling, 23, 234–245.
Hedeker, D., Berbaum, M., & Mermelstein, R. (2006). Location-scale models for multilevel ordinal data: Between- and within-subjects variance modeling. Journal of Probability and Statistical Science, 4, 1–20.
Kuipers, R., Visser, I., & Molenaar, D. (2018). Testing the within-class distribution in mixture models for responses and response times. Manuscript in preparation.
Little, R. J., & Rubin, D. B. (1987). Statistical analysis with missing data. New York, NY: Wiley.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization (No. 8). Oxford, UK: Oxford University Press.
MacDonald, I. L., & Zucchini, W. (1997). Hidden Markov and other models for discrete-valued time series (Vol. 110). New York, NY: CRC Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Mehta, P. D., Neale, M. C., & Flay, B. R. (2004). Squeezing interval change from ordinal panel data: Latent growth curves with ordinal outcomes. Psychological Methods, 9, 301–333.
Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143.
Mellenbergh, G. J. (1994). A unidimensional latent trait model for continuous item responses. Multivariate Behavioral Research, 29, 223–236.
Meredith, W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525–543.
Molenaar, D., Bolsinova, M., & Vermunt, J. K. (2018). A semiparametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71, 205–228. https://doi.org/10.1111/bmsp.12117
Molenaar, D., Bolsinova, M., Rozsa, S., & De Boeck, P. (2016). Response mixture modeling of intraindividual differences in responses and response times to the Hungarian WISC-IV Block Design test. Journal of Intelligence, 4, 10.
Molenaar, D., Oberski, D., Vermunt, J. K., & De Boeck, P. (2016). Hidden Markov IRT models for responses and response times. Multivariate Behavioral Research, 51, 606–626.
Molenaar, D., Tuerlinckx, F., & van der Maas, H. L. J. (2015). A bivariate generalized linear item response theory modeling framework to the analysis of responses and response times. Multivariate Behavioral Research, 50, 56–74.
Nagyné Réz, I., Lányiné Engelmayer, Á., Kuncz, E., Mészáros, A., Mlinkó, R., Bass, L., . . . Kõ, N. (2008). WISC-IV: A Wechsler Gyermek Intelligenciateszt Legújabb Változata [Hungarian version of the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV)]. Budapest: OS Hungary Tesztfejlesztõ.
Partchev, I., & De Boeck, P. (2012). Can fast and slow intelligence be differentiated? Intelligence, 40, 23–32.
Rabbitt, P. (1979). How old and young subjects monitor and control responses for accuracy and speed. British Journal of Psychology, 70, 305–311.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. https://doi.org/10.1037/0033-295X.85.2.59
Samejima, F. (1969). Psychometric monographs: Vol. 17. Estimation of ability using a response pattern of graded scores. Austin, TX: Psychometric Society.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213–232.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. https://doi.org/10.1214/aos/1176344136
Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333–343.
Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84, 127–190. https://doi.org/10.1037/0033-295X.84.2.127
Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46(2). https://doi.org/10.2307/1165995
Tan, B., & Yılmaz, K. (2002). Markov chain test for time dependence and homogeneity: An analytical and empirical evaluation. European Journal of Operational Research, 137, 524–543. https://doi.org/10.1016/S0377-2217(01)00081-9
Tuerlinckx, F., & De Boeck, P. (2005). Two interpretations of the discrimination parameter. Psychometrika, 70, 629–650. https://doi.org/10.1007/s11336-000-0810-3
van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287.
van der Linden, W. J., & Glas, C. A. (2010). Statistical tests of conditional independence between responses and/or response times on test items. Psychometrika, 75, 120–139.
van der Maas, H. L. J., & Jansen, B. R. (2003). What response times tell of children’s behavior on the balance scale task. Journal of Experimental Child Psychology, 85, 141–177.
van der Maas, H. L. J., Molenaar, D., Maris, G., Kievit, R. A., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339–356. https://doi.org/10.1037/a0022749
Vermunt, J. K. (2011). K-means may perform as well as mixture model clustering but may also be much worse: Comment on Steinley and Brusco (2011). Psychological Methods, 16, 82–88.
Vermunt, J. K., & Magidson, J. (2013). Technical guide for Latent GOLD 5.0: Basic, advanced, and syntax. Belmont, MA: Statistical Innovations Inc.
Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68, 456–477.
Wang, C., Xu, G., & Shang, Z. (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83, 223–254. https://doi.org/10.1007/s11336-016-9525
Welch, L. R. (2003). Hidden Markov models and the Baum–Welch algorithm. IEEE Information Theory Society Newsletter, 53, 10–13.
Yung, Y. F. (1997). Finite mixtures in confirmatory factor-analysis models. Psychometrika, 62, 297–330.
Zucchini, W., MacDonald, I. L., & Langrock, R. (2016). Hidden Markov models for time series: An introduction using R. New York, NY: Chapman & Hall/CRC.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.