1 Introduction

Learning effects may take place in educational and psychological testing when the items share a set of solution principles that can be extrapolated from one item to another, so examinees may learn to respond more effectively during the test. There is a wide range of settings, both research and applied, where the detection and measurement of these learning effects may be of potential interest, such as those related to competence acquisition in developmental and educational contexts (e.g., Spada, 1977; Spada & McGaw, 1985) or to the substantive analysis of the learning processes that occur during a psychometric test (e.g., Lozano & Revuelta, 2020, 2021). Additionally, the presence of learning effects during the test may involve meaningful item associations beyond those explained by conventional item response models. In that case, assuming that the responses are locally independent would lead to incorrect parameter estimates and standard errors. Moreover, the inherent difficulty in distinguishing local dependence from multidimensionality (see Ip, 2010) may lead to overestimating the number of underlying factors when there are local dependencies between items due to learning effects. Incorporating previous practice into the models may allow for the detection and measurement of the learning effects as well as for obtaining unbiased estimates of item and person parameters while avoiding over-factoring.

A variety of models have been developed to account for the learning that takes place throughout a test (e.g., Deonovic et al., 2018; Fischer & Formann, 1982; Hohensinn et al., 2008; Kempf, 1977; Scheiblechner, 1972; Spada, 1977; Verguts & De Boeck, 2000; Verhelst & Glas, 1993). These models may be classified as contingent or non-contingent learning models (Verguts & De Boeck, 2000). Contingent learning models assume that learning depends on the correctness of the responses given to the items (e.g., Kempf, 1977; Verguts & De Boeck, 2000; Verhelst & Glas, 1993), whereas non-contingent learning models assume that learning occurs regardless of the correctness of the responses (e.g., Fischer & Formann, 1982; Scheiblechner, 1972; Spada, 1977). Another distinction can be made between descriptive and explanatory learning models (De Boeck & Wilson, 2004). Descriptive learning models are aimed solely at measuring the learning effect, whereas explanatory learning models not only measure the learning effect but also explain it in terms of person and/or item properties. Most of the existing learning models are descriptive (e.g., Kempf, 1977; Verguts & De Boeck, 2000; Verhelst & Glas, 1993); however, a few models may be considered explanatory in that they account for the learning effect in terms of the operations involved in the items (e.g., Deonovic et al., 2018; Fischer & Formann, 1982; Scheiblechner, 1972; Spada, 1977). Interestingly, to date, all the explanatory learning models are non-contingent models and, therefore, do not make any distinction between correct and incorrect responses.

In the present paper, an explanatory contingent learning model is presented that is a generalization of the operation-specific learning model (OSLM) introduced by Scheiblechner (1972; see also Fischer & Formann, 1982; Spada, 1977). The OSLM accounts for the non-contingent learning that takes place during a psychometric test due to the repeated use of the cognitive operations required by the items. In the OSLM, the learning parameter is specific to each cognitive operation, and the learning component of the model is derived from the number of times the person has practiced in previous items each of the operations involved in the current item. The OSLM is subsumed by the proposed model, which accounts for the possibilities that learning may be derived from all the previous responses equally (non-contingent learning), from correct responses only (contingent learning), or from correct and incorrect responses to different degrees (differential contingent learning). The distinction between correct and incorrect responses is reasonable in that learning is traditionally assumed to be greater when the examinee answers the items correctly. However, the reverse may also be true, since, according to the definition of learning implied in the OSLM (i.e., a decrease in the difficulty associated with a specific cognitive operation throughout the test as a function of practice), learning is potentially greater for those operations that are more difficult and, therefore, result in a greater number of incorrect responses at the beginning of the test.

In the next section, the new model is introduced and described in detail by discussing special cases subsumed by the general formulation. Model identification is described in Sect. 3. Section 4 describes a Bayesian framework for model estimation and evaluation. Section 5 includes a simulation study in which the performance of the estimation and evaluation methods is examined. Section 6 provides an empirical analysis to illustrate the applicability of the model to real data. Finally, a summary and concluding remarks are given in Sect. 7.

2 Model Specification

The models presented in this paper are based on the Rasch model (Rasch, 1960). For a Rasch model, the logit of a correct response for person i (\(i=1,2,\ldots ,I\)) to item j (\(j=1,2,\ldots ,J\)) is given by:

$$\begin{aligned} {{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] =\theta _i-\beta _j, \end{aligned}$$
(1)

where \(\theta _i\) is the ability of person i, and \(\beta _j\) is the difficulty of item j. The linear logistic test model (LLTM; Fischer, 1973, 1983, 1995; Scheiblechner, 1972) decomposes the difficulty parameter of the Rasch model into a linear combination that represents the weighted sum of the difficulties of the cognitive operations involved in the item. That is:

$$\begin{aligned} {{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] =\theta _i-\sum _{m=1}^Mw_{jm}\alpha _m, \end{aligned}$$
(2)

where \(\alpha _m\) is a basic parameter that represents the difficulty of operation m (\(m=1,2,\ldots ,M\)), and \(w_{jm}\) is the weight of item j on operation m. The model is completed by \(\mathbf{W}\), a \(J\times M\) matrix that contains the weights (\(w_{jm}\)) of each of the J items on each of the M operations. Each weight is given by the number of times operation m is involved in the solution of item j. The LLTM may be considered a restricted version (in which all the learning parameters are constrained to zero) of each of the learning models presented in the following subsections.
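
As a minimal illustration of Eq. (2), the following R sketch computes the implied Rasch difficulties from a hypothetical weight matrix and hypothetical operation difficulties:

```r
# LLTM (Eq. 2): item difficulties as weighted sums of operation difficulties.
W <- matrix(c(1, 0,
              1, 1,
              0, 2), nrow = 3, byrow = TRUE)  # hypothetical J = 3 items, M = 2 operations
alpha <- c(0.5, -0.3)                         # hypothetical operation difficulties
beta <- as.vector(W %*% alpha)                # implied Rasch item difficulties

theta <- 0.8                                  # ability of one hypothetical person
p_correct <- plogis(theta - beta)             # P(X_ij = 1) for each item
```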

2.1 Operation-specific Learning Model

Based on the idea underlying the LLTM, Scheiblechner (1972; see also Fischer & Formann, 1982; Spada, 1977) introduced the OSLM. The OSLM is a non-contingent learning model; that is, it considers that learning is derived from both correctly and incorrectly answered items equally. According to this model, the logit of a correct response for person i to item j is a function of the person ability, the difficulty of the cognitive operations involved in the item, and the practice of said operations accumulated during previous items:

$$\begin{aligned} {{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] =\theta _i-\sum _{m=1}^Mw_{jm}\left( \alpha _m -\delta _m\sum _{k=1}^{j-1}w_{km}\right) , \end{aligned}$$
(3)

where \(\delta _m\) is a practice parameter that represents the change in the difficulty of operation m that occurs each time the operation is practiced, and \(w_{km}\) is the weight of the previous item \(k\, (k=1,2,\ldots ,j-1)\) on operation m. In this model, \(\alpha _m\) represents the initial difficulty of operation m, independently of the practice effect. As can be seen, the Rasch item parameter is decomposed into an initial-difficulty component (\(\sum w_{jm}\alpha _m\)), derived from the cognitive operations involved in solving the item, and a practice component (\(\sum w_{jm}\delta _m\sum w_{km}\)), derived from practicing said operations in previous items. Note that only when operation m is involved in both the previous item and the current item is the practice effect associated with operation m (\(\delta _m\)) subtracted from \(\alpha _m\). A positive sign for the \(\delta _m\) parameter implies a decrease in the difficulty associated with operation m throughout the test as a function of practice, which may be interpreted as a learning effect. A negative sign, on the other hand, implies an increase in the difficulty associated with operation m as a function of practice, which may be interpreted as fatigue or loss of attention. Such fatigue effects associated with specific operations may occur, for example, in relatively easy operations that the subjects tend to perform correctly at the beginning of the test but that become prone to errors later on in the test due to the progressive effects of fatigue or loss of interest and/or attention. It should be noted that, although the OSLM models the effect of previous practice on the item response, like the LLTM and the Rasch model, it does not assume local dependence between items.
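
The practice component of Eq. (3) can be sketched in R; previous practice of each operation is the column-wise sum of the weights over items \(1,\ldots ,j-1\) (all values below are hypothetical):

```r
# OSLM (Eq. 3): the difficulty of each operation decreases (delta > 0) with the
# number of times it has been practiced in previous items.
oslm_logit <- function(theta, j, W, alpha, delta) {
  practice <- if (j > 1) colSums(W[1:(j - 1), , drop = FALSE]) else numeric(ncol(W))
  theta - sum(W[j, ] * (alpha - delta * practice))
}

W <- matrix(c(1, 0,
              1, 1,
              0, 2), nrow = 3, byrow = TRUE)  # hypothetical weight matrix
oslm_logit(theta = 0.8, j = 3, W = W,
           alpha = c(0.5, -0.3), delta = c(0.10, 0.05))
```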

2.2 Operation-specific Contingent Learning Model

In contrast to the OSLM, the operation-specific contingent learning model (OSCLM) assumes that the mere exposure to items does not contribute to learning. According to the OSCLM, learning takes place only when the items are answered correctly:

$$\begin{aligned} {{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] =\theta _i-\sum _{m=1}^Mw_{jm}\left( \alpha _m -\delta _m\sum _{k=1}^{j-1}x_{ik}w_{km}\right) , \end{aligned}$$
(4)

where \(\delta _m\) represents the change in the difficulty of operation m that results from practicing the operation in a correctly answered item, and \(x_{ik}\) is the response of person i to the previous item k. Note that only when \(x_{ik}=1\) is the practice effect associated with operation \(m\,(\delta _m)\) subtracted from \(\alpha _m\). The contingent nature of the practice component implies that, unlike the OSLM, the OSCLM assumes local dependencies between items.

2.3 Operation-specific Differential Contingent Learning Model

Finally, the operation-specific differential contingent learning model (OSDCLM) considers that learning takes place in both correctly and incorrectly answered items, although, unlike the OSLM, the amount of learning that is derived in both cases may differ:

$$\begin{aligned} {{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] =\theta _i-\sum _{m=1}^Mw_{jm}\left[ \alpha _m -\delta _m\sum _{k=1}^{j-1}x_{ik}w_{km}-\gamma _m\sum _{k=1}^{j-1}(1-x_{ik})w_{km}\right] , \end{aligned}$$
(5)

where \(\gamma _m\) is a practice parameter that represents the change in the difficulty of operation m that results from practicing the operation in an incorrectly answered item. Note that when \(x_{ik}=0\), it is \(\gamma _m\) and not \(\delta _m\) that is subtracted from \(\alpha _m\). A positive sign for the \(\gamma _m\) parameter indicates that even when an item involving operation m is incorrectly answered, the difficulty of that operation decreases in subsequent items. This may be due to the fact that many participants perform operation m correctly (and, therefore, some amount of learning is derived from practicing the operation), but they fail to perform other operations involved in the item and, consequently, answer the item incorrectly. Alternatively, the positive sign may be due to the fact that, for many participants, operation m requires successive approximations over several items in order for it to be properly performed. A negative sign, on the other hand, indicates that incorrectly answering an item involving operation m increases the difficulty of that operation in subsequent items, which may be attributed to fatigue or loss of interest and/or attention. The OSDCLM generalizes both the OSLM and the OSCLM. In this regard, the OSCLM is a restricted OSDCLM in which all \(\gamma _m=0\), whereas the OSLM is a restricted OSDCLM in which \(\delta _m=\gamma _m\) for each m.
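
A sketch of Eq. (5) in R, with hypothetical values, makes the role of the two practice parameters concrete; setting \(\gamma _m=0\) yields the OSCLM of Eq. (4), and setting \(\delta _m=\gamma _m\) recovers the OSLM:

```r
# OSDCLM (Eq. 5): correct and incorrect previous practice are accumulated
# separately and weighted by delta and gamma, respectively.
osdclm_logit <- function(theta, j, x, W, alpha, delta, gamma) {
  v <- u <- numeric(ncol(W))
  if (j > 1) {
    Wprev <- W[1:(j - 1), , drop = FALSE]
    v <- colSums(x[1:(j - 1)] * Wprev)        # correct previous practice
    u <- colSums((1 - x[1:(j - 1)]) * Wprev)  # incorrect previous practice
  }
  theta - sum(W[j, ] * (alpha - delta * v - gamma * u))
}

W <- matrix(c(1, 0,
              1, 1,
              0, 2), nrow = 3, byrow = TRUE)  # hypothetical weight matrix
x <- c(1, 0, 1)                               # hypothetical responses
osdclm_logit(0.8, j = 3, x = x, W = W, alpha = c(0.5, -0.3),
             delta = c(0.10, 0.05), gamma = c(0.02, 0.01))
```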

3 Model Identification

In the LLTM, for the basic parameters (\(\alpha _m\)) to be estimated by means of conditional maximum likelihood (CML), the matrix \(\mathbf{W}^+=(\mathbf{W};\mathbf{1})\) (i.e., \(\mathbf{W}\) supplemented with a column vector of ones) must have full column rank; that is, \(rank(\mathbf{W}^+)=M+1\) (Fischer, 1983). As a result, the number of operations is restricted to \(M\le J-1\). The full column rank condition of \(\mathbf{W}^+\) ensures that the Rasch item parameters (\(\beta _j\)) can be decomposed uniquely into the LLTM basic parameters (\(\alpha _m\)) while fixing the scale of the latent variable (\(\theta _i\)). In Bayesian inference, by contrast, the \(\theta \) scale is fixed by specifying the prior distribution of the parameter, so the looser condition of full column rank of \(\mathbf{W}\), \(rank(\mathbf{W})=M\), is enough to ensure the uniqueness of the relation between the parameters of the Rasch model and the LLTM. Consequently, in Bayesian inference, the original restriction \(M\le J-1\) is relaxed to \(M\le J\).

Mathematically, the OSLM is an LLTM with weight matrix \(\mathbf{Q}=(\mathbf{W};\mathbf{V})\), where \(\mathbf{V}\) is a \(J\times M\) matrix whose elements represent previous practice. More specifically, the elements in \(\mathbf{V}\) are given by:

$$\begin{aligned} v_{jm}=w_{jm}\sum _{k=1}^{j-1}w_{km}. \end{aligned}$$
(6)

Therefore, in the OSLM, the full column rank condition for CML estimation is \(rank(\mathbf{Q}^+)=2M+1\), and the number of operations is restricted to \(M\le (J-1)/2\). In Bayesian inference, these restrictions are relaxed to \(rank(\mathbf{Q})=2M\) and \(M\le J/2\).
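
These conditions can be verified directly; the following R sketch builds \(\mathbf{V}\) according to Eq. (6) for a hypothetical weight matrix and checks the Bayesian condition \(rank(\mathbf{Q})=2M\):

```r
# Construct V (Eq. 6) and check rank(Q) = 2M for a hypothetical W.
W <- matrix(c(1, 0,
              1, 1,
              0, 2,
              1, 1,
              2, 0), nrow = 5, byrow = TRUE)                      # J = 5, M = 2
cum_prev <- apply(W, 2, function(w) c(0, cumsum(w)[-length(w)]))  # sums over k < j
V <- W * cum_prev                                                 # v_jm = w_jm * sum_{k<j} w_km
Q <- cbind(W, V)
qr(Q)$rank == 2 * ncol(W)                                         # TRUE if the condition holds
```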

In the OSCLM and the OSDCLM, the weight matrices are \(\mathbf{Q}=(\mathbf{W};\mathbf{V}_t)\) and \(\mathbf{Q}=(\mathbf{W};\mathbf{V}_t;\mathbf{U}_t)\), respectively, where \(\mathbf{V}_t\) and \(\mathbf{U}_t\) are \(J\times M\) matrices whose elements represent the amount of correct and incorrect previous practice for each item and operation. Specifically, the elements in \(\mathbf{V}_t\) and \(\mathbf{U}_t\) are given by:

$$\begin{aligned} \begin{aligned} v_{tjm}&=w_{jm}\sum _{k=1}^{j-1}x_{tk}w_{km} \text{ and }\\ u_{tjm}&=w_{jm}\sum _{k=1}^{j-1}(1-x_{tk})w_{km}, \end{aligned} \end{aligned}$$
(7)

where \(t\,(t=1,2,\ldots ,T)\) denotes a specific response pattern, and \(T=2^J\) is the number of different response patterns.

Let \({{\varvec{x}}}_t'=(x_{t1},x_{t2},\ldots ,x_{tJ})\) be a vector of responses to the J items. Assuming that \(\theta \) is a random effect that follows a standard normal distribution, the marginal probability of \({{\varvec{x}}}_t\) is:

$$\begin{aligned} p_t=\int _{-\infty }^{\infty }\frac{\exp (\lambda _t)}{\sum _{h=1}^T \exp (\lambda _h)}f(\theta )d\theta , \end{aligned}$$
(8)

where \(\lambda _t\) is a parameter associated with response pattern t, and \(f(\theta )\) is the standard normal density function. The OSCLM and the OSDCLM impose the following structure on the parameters:

$$\begin{aligned} \lambda _t=s_t\theta +{{\varvec{r}}}_t'\varvec{\xi }, \end{aligned}$$
(9)

where \(s_t=\sum _{j=1}^J x_{tj}\) is the number-right score of response pattern t, \({{\varvec{r}}}_t'\) is a row-vector of coefficients associated with response pattern t, and \({\varvec{\xi }}\) is the vector of structural parameters. The OSCLM and the OSDCLM assume that the vector of \(\lambda _t\) parameters, \({\varvec{\lambda }}=(\lambda _t)_{t=1}^T\), is:

$$\begin{aligned} {\varvec{\lambda }}={{\varvec{s}}}\theta +\mathbf{R}{{\varvec{\xi }}}, \end{aligned}$$
(10)

where \({{\varvec{s}}}=(s_t)_{t=1}^T\) is the vector of number-right scores of the T response patterns, and \(\mathbf{R}\) is a matrix of coefficients whose rows are the vectors \({{\varvec{r}}}_1',{{\varvec{r}}}_2',\ldots ,{{\varvec{r}}}_T'\).

The analysis of the identifiability of \({{\varvec{\xi }}}\) is based on the Jacobian matrix (Bishop et al., 2007; Cox, 1984):

$$\begin{aligned} \mathbf{J}=\frac{\partial }{\partial {{\varvec{\xi }}}}\log {{\varvec{p}}} =(\mathbf{I}-\mathbf{1}{{\varvec{p}}}')\mathbf{R}, \end{aligned}$$
(11)

where \(\mathbf{I}\) is an identity matrix of order T, \(\mathbf{1}\) is a vector of ones, and \({{\varvec{p}}}=(p_t)_{t=1}^T\) is the vector of probabilities of the T response patterns. The vector \({{\varvec{\xi }}}\) is identifiable if \(\mathbf{J}\) has full column rank. The matrix \((\mathbf{I}-\mathbf{1}{{\varvec{p}}}')\) is deficient in rank (it has rank \(T-1\)) because the elements in \({\varvec{p}}\) are constrained to sum to 1. Specifically, \((\mathbf{I}-\mathbf{1}{{\varvec{p}}}')\mathbf{1}=\mathbf{0}\). Therefore, if the vector \(\mathbf{1}\) were in the column space of \(\mathbf{R}\), there would be a vector \({\varvec{\tau }}\) such that \((\mathbf{I}-\mathbf{1}{{\varvec{p}}}')\mathbf{R}{\varvec{\tau }}=\mathbf{0}\), and \(\mathbf{J}\) would be deficient in rank. Moreover, from the theory of multinomial maximum likelihood estimation, the information matrix for \({\varvec{\xi }}\) can be computed from the Jacobian matrix by the equation (Revuelta, 2012):

$$\begin{aligned} {\mathcal {I}}=\mathbf{J}'\mathbf{D}\mathbf{J}, \end{aligned}$$
(12)

where \(\mathbf{D}=\text{ diag }({{\varvec{p}}})\). If \(\mathbf{J}\) were deficient in rank, \({\mathcal {I}}\) would be too. Consequently, the identifiability condition for \({\varvec{\xi }}\) is that the matrix \(\mathbf{R}^+=(\mathbf{R};\mathbf{1})\) has full column rank. In practice, the analysis of empirical identifiability is based on the response patterns that have actually been realized in the sample. Let \(\hat{\mathbf{R}}^+\) be the matrix of coefficients based on the realized response patterns. The full column rank of \(\hat{\mathbf{R}}^+\) is necessary for the observed information matrix to be of full rank. However, since \(\hat{\mathbf{R}}^+\) has size \(N \times (3M+1)\), where N can be on the order of hundreds or thousands, it is more computationally convenient to verify the equivalent condition that the matrix \({\hat{\mathbf{R}}}^{+'}{\hat{\mathbf{R}}}^+\) has full rank.
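
Because \({\hat{\mathbf{R}}}^{+'}{\hat{\mathbf{R}}}^+\) is only \((3M+1)\times (3M+1)\), the check is inexpensive. A sketch in R, with a randomly generated stand-in for the realized coefficient matrix:

```r
# Empirical identifiability check for an OSDCLM with M = 5 operations
# (3M = 15 coefficient columns). R_hat is a hypothetical stand-in here.
set.seed(1)
R_hat <- matrix(rnorm(500 * 15), nrow = 500)   # realized coefficients (stand-in)
R_hat_plus <- cbind(R_hat, 1)                  # supplement with a column of ones
G <- crossprod(R_hat_plus)                     # R^+' R^+, only 16 x 16
qr(G)$rank == ncol(R_hat_plus)                 # TRUE: xi is empirically identified
```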

4 Bayesian Framework

A Bayesian framework is presented for the estimation and evaluation of the proposed model. In this work, Bayesian methods were implemented by means of Markov chain Monte Carlo (MCMC) simulation (Brooks et al., 2011). Applications of Bayesian MCMC in the field of item response modeling can be seen in Fox (2010) and Levy and Mislevy (2016).

4.1 Model Estimation

In Bayesian analysis, MCMC routines are usually employed to derive an empirical approximation to the posterior distribution of the parameters. In the present work, MCMC simulation was run using Stan (Carpenter et al., 2017; Gelman et al., 2015). Stan is a probabilistic programming language that implements the no-U-turn sampler (NUTS; Hoffman & Gelman, 2014), an extension of the Hamiltonian Monte Carlo (HMC; Duane et al., 1987; Neal, 1994, 2011) algorithm. HMC overcomes some of the limitations of the traditional Gibbs sampler (Geman & Geman, 1984) and the Metropolis algorithm (Metropolis et al., 1953), particularly in terms of computational efficiency in exploring the posterior parameter space (Gelman et al., 2013).

4.2 Model Evaluation

In the Bayesian context, model assessment is typically based on posterior predictive model checking (PPMC; Gelman et al., 1996). PPMC is conducted based on discrepancy measures that are intended to capture relevant features of the data. The realized values of the model-data discrepancy, \(D({\mathbf {X}};{\varvec{\theta }},{\varvec{\xi }})\) (where \({\varvec{\theta }}\) and \({\varvec{\xi }}\) represent the vectors of incidental and structural parameters, respectively), are compared to those obtained from the posterior predictive distribution, \(D({\mathbf {X}}^{\mathrm {rep}};{\varvec{\theta }},{\varvec{\xi }})\) (where rep stands for replicated data). The results are summarized by means of the posterior predictive p value (PPP value; Gelman et al., 1996; Meng, 1994), the tail-area probability of the realized value of the discrepancy under the posterior predictive distribution of the discrepancy measure:

$$\begin{aligned} {\mathrm {PPP}}=P\left[ D\left( {\mathbf {X}}^{\mathrm {rep}};{\varvec{\theta }},{\varvec{\xi }}\right) \ge D\left( {\mathbf {X}};{\varvec{\theta }},{\varvec{\xi }}\right) \mid {\mathbf {X}}\right] . \end{aligned}$$
(13)

In the present study, the discrepancy between the data and the model was estimated via two discrepancy statistics: the odds-ratio (OR; Chen & Thissen, 1997; Sinharay, 2005) and the Bayesian latent residual (BLR; Albert & Chib, 1995; Fox, 2010). The OR is a measure of association between pairs of items that is computationally simple and does not depend on the fitted model. The OR for items j and \(j'\) is defined as:

$$\begin{aligned} \hbox {OR}_{jj'}=\frac{n_{11}n_{00}}{n_{10}n_{01}}, \end{aligned}$$
(14)

where \(n_{xx'}\) is the number of individuals scoring x on item j and \(x'\) on item \(j'\). The OR is useful for identifying inter-item associations beyond those explained by the model. Given that practice effects may elicit local dependencies between items, the OR is potentially useful for detecting the presence of learning effects during the test. Measures of inter-item associations at the item level and at the test level are obtained by summing the OR values over the pairs of items.
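
Equation (14) is straightforward to compute from a binary response matrix; a minimal R sketch with simulated (hypothetical) responses:

```r
# Odds ratio for items j and jp (Eq. 14).
odds_ratio <- function(X, j, jp) {
  n11 <- sum(X[, j] == 1 & X[, jp] == 1)
  n00 <- sum(X[, j] == 0 & X[, jp] == 0)
  n10 <- sum(X[, j] == 1 & X[, jp] == 0)
  n01 <- sum(X[, j] == 0 & X[, jp] == 1)
  (n11 * n00) / (n10 * n01)
}

set.seed(1)
X <- matrix(rbinom(100 * 5, 1, 0.6), nrow = 100)  # hypothetical 100 x 5 responses
odds_ratio(X, 1, 2)
```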

The BLR is a measure of overall fit that is not specifically tied to local dependencies. The BLR is based on an augmented (latent) data approach and is defined as the difference between the latent response and the expected response according to the model. For instance, for a Rasch model, the BLR corresponding to observation \(X_{ij}\) is defined as:

$$\begin{aligned} \varepsilon _{ij}=Z_{ij}-\theta _i+\beta _j, \end{aligned}$$
(15)

where \(Z_{ij}\) is the latent response of person i to item j, which, conditional on person and item parameters, follows a logistic distribution with expected value given by \({{\,\mathrm{logit}\,}}\left[ X_{ij}=1\right] \). Computational formulas for the BLR are given in Fox (2010). The squared residuals can be summed over individuals to obtain an item-specific discrepancy statistic. A global measure of fit at the test level is obtained by summing the values of the squared residuals over the items.
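
One augmentation step for a single observation may be sketched as follows, assuming the Rasch parametrization of Eq. (15): the latent response is drawn from a logistic distribution truncated to agree with the observed response, and its model-implied location is then subtracted:

```r
# One Bayesian latent residual draw (Eq. 15), a sketch for a Rasch model.
blr_draw <- function(x, theta, beta) {
  mu <- theta - beta                       # location of the latent logistic Z
  p0 <- plogis(0, location = mu)           # P(Z <= 0)
  u  <- if (x == 1) runif(1, p0, 1) else runif(1, 0, p0)  # truncate to match x
  z  <- qlogis(u, location = mu)           # inverse-CDF draw of Z
  z - mu                                   # epsilon_ij = Z_ij - theta_i + beta_j
}

set.seed(1)
blr_draw(x = 1, theta = 0.8, beta = -0.2)
```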

The PPP value is the proportion of draws in which the posterior predictive value of the discrepancy statistic is equal to or higher than the realized value. PPP values close to .5 indicate that the realized value is in the middle of the posterior predictive distribution of the discrepancy, evidencing adequate data-model fit, whereas extreme PPP values, close to zero or one, indicate that the realized value is in the upper or lower tail of the distribution, respectively, evidencing that the model is underpredicting or overpredicting the features captured by the discrepancy statistic. For instance, in the case of the OR, PPP values close to zero (one) indicate that the observed data exhibit more (less) local dependence than expected based on the model.
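
Computationally, the PPP value reduces to a single mean over posterior draws; a sketch with hypothetical discrepancy values:

```r
# PPP value (Eq. 13): proportion of draws in which the replicated discrepancy
# is at least as large as the realized discrepancy.
ppp_value <- function(D_rep, D_obs) mean(D_rep >= D_obs)

set.seed(1)
D_obs <- rnorm(2000, mean = 10.0)   # hypothetical realized discrepancies per draw
D_rep <- rnorm(2000, mean = 10.2)   # hypothetical replicated discrepancies per draw
ppp_value(D_rep, D_obs)             # values near .5 indicate adequate fit
```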

4.3 Model Comparison and Selection

Complementarily, other methods can be used for model comparison and selection: the widely applicable information criterion (WAIC; Watanabe, 2010, 2013) and leave-one-out cross-validation (LOO; Gelman et al., 2014). These methods quantify the out-of-sample predictive performance of competing models using the log-likelihood evaluated at the posterior simulations of the parameter values. WAIC and LOO adjust the log pointwise predictive density (lpd) of the observed data by penalizing for model complexity based on the effective number of parameters. This penalty prevents the over-fitting that more complex models exhibit by virtue of their greater flexibility.

Let \(l\,(l=1,2,\ldots ,L)\) be a draw from the posterior distribution. In the case of WAIC, the estimated expected log pointwise predictive density (elpd) is given by (Vehtari et al., 2016):

$$\begin{aligned} {\widehat{\mathrm {elpd}}_{\mathrm {waic}}}={\widehat{\mathrm {lpd}}}-{\widehat{p}}_{\mathrm {waic}}, \end{aligned}$$
(16)

where \({\widehat{\mathrm {lpd}}}\) is the computed log pointwise predictive density:

$$\begin{aligned} {\widehat{\mathrm {lpd}}}=\sum _{i=1}^I\sum _{j=1}^J\log \left[ \frac{1}{L} \sum _{l=1}^Lp\left( x_{ij}\mid {\varvec{\theta }}^l,{{\varvec{\xi }}}^l\right) \right] , \end{aligned}$$
(17)

and \({\widehat{p}}_{\mathrm{waic}}\) is the estimated effective number of parameters, which can be obtained based on the posterior variance of the log predictive density for each data point \(x_{ij}\):

$$\begin{aligned} {\widehat{p}}_{\mathrm{waic}}=\sum _{i=1}^I\sum _{j=1}^JVar_{l=1}^L \left[ \log p\left( x_{ij}\mid {\varvec{\theta }}^l,{{\varvec{\xi }}}^l\right) \right] . \end{aligned}$$
(18)

The \({\widehat{\mathrm {elpd}}_{\mathrm{waic}}}\) is usually converted to deviance scale as follows:

$$\begin{aligned} {\mathrm {WAIC}}=-2{\widehat{\mathrm {elpd}}_{\mathrm{waic}}}. \end{aligned}$$
(19)

In the case of LOO, the estimated elpd, obtained by Pareto smoothed importance sampling, is given by (Vehtari et al., 2016):

$$\begin{aligned} {\widehat{\mathrm {elpd}}_{\mathrm{loo}}}=\sum _{i=1}^I\sum _{j=1}^J\log \left[ \frac{\sum _{l=1}^Lw_{ij}^lp\left( x_{ij}\mid {\varvec{\theta }}^l,{{\varvec{\xi }}}^l \right) }{\sum _{l=1}^Lw_{ij}^l}\right] , \end{aligned}$$
(20)

where \(w_{ij}^l\) is the smoothed importance weight for data point \(x_{ij}\) at draw l. For LOO, the effective number of parameters is given by:

$$\begin{aligned} {\widehat{p}}_{\mathrm{loo}}={\widehat{\mathrm {lpd}}}-{\widehat{\mathrm {elpd}}_{\mathrm{loo}}}. \end{aligned}$$
(21)

The LOO information criterion (LOOIC), expressed on the deviance scale, is defined as:

$$\begin{aligned} {\mathrm {LOOIC}}=-2{\widehat{\mathrm {elpd}}_{\mathrm{loo}}}. \end{aligned}$$
(22)

Lower values of WAIC and LOOIC indicate higher predictive accuracy. Compared to PPMC, WAIC and LOO have the advantage of avoiding re-sampling and, therefore, are less computationally intensive. However, WAIC and LOO are not intended to test a hypothesis of model fit but to compare models in order to select the one that fits the data best. In the present work, PPMC was used for model evaluation, whereas WAIC and LOO were used complementarily for model comparison and selection.
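
Both criteria can be obtained from a matrix of pointwise log-likelihood values (posterior draws by data points) with the loo R package used in this work; the matrix below is a hypothetical stand-in:

```r
# WAIC (Eq. 19) and LOOIC (Eq. 22) from a draws-by-observations log-likelihood matrix.
library(loo)

set.seed(1)
log_lik <- matrix(rnorm(2000 * 150, mean = -0.7, sd = 0.1), nrow = 2000)  # stand-in
waic(log_lik)$estimates   # elpd_waic, p_waic, and waic, with standard errors
loo(log_lik)$estimates    # elpd_loo, p_loo, and looic, with standard errors
```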

5 Simulation Study

A simulation study was conducted to test whether the Bayesian estimation and model evaluation methods allow for the recovery of the true item parameters and the identification of the model used to generate the data, respectively. Particular attention was paid to examining the bias of the estimates when there were learning effects in the data that were not taken into account in the model.

5.1 Method

In order to study different conditions of misspecification, a \(4\times 5\) factorial design was used for the simulation study, resulting from the combination of generating models and fitted models (the OSDCLM, OSCLM, OSLM, and LLTM were used as generating models, while the same models plus the Rasch model were used as fitted models).

One hundred data sets of dichotomous responses were simulated from each generating model. The simulation was conducted with R version 3.6.1 (R Development Core Team, 2019). The sample size, test length, weight matrix, and true values of the structural parameters (\(\alpha _m\), \(\delta _m\), and \(\gamma _m\)) were taken from the empirical study described in Sect. 6 (the weight matrix is shown in Table 4, and the structural parameters are shown in Table 9). The true values of the incidental parameters (\(\theta _i\)) were generated from a standard normal distribution.

The models were estimated from each simulated data set using the RStan R package version 2.19.2 (Stan Development Team, 2019). Four Markov chains of 2,000 samples each were run. The first half of the samples were discarded as burn-in, and the remaining samples were used to approximate the posterior distributions. The potential scale reduction statistic (Gelman & Rubin, 1992) was used to evaluate the convergence of parameter estimates. A weakly informative prior, N(0, 100), was used for all structural parameters, whereas a standard normal distribution was used as prior for the incidental parameters.
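
These settings can be sketched as follows; the Stan program name and its data block are hypothetical placeholders (the actual scripts are provided as supplementary material):

```r
# MCMC estimation with RStan: 4 chains of 2,000 samples, first half as burn-in.
library(rstan)

fit <- stan(file = "osdclm.stan",                  # hypothetical Stan program
            data = list(I = 536, J = 15, M = 5,    # assumed data block; X and W
                        X = X, W = W),             # are taken from the workspace
            chains = 4, iter = 2000, warmup = 1000)
print(fit, pars = c("alpha", "delta", "gamma"))    # includes Rhat diagnostics
```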

To assess the fit of the models to the data, a sample of predicted responses was generated for each sample of simulated parameters, and the PPP value (Gelman et al., 1996; Meng, 1994) was computed at the test level based on the OR (Chen & Thissen, 1997; Sinharay, 2005) and BLR (Albert & Chib, 1995; Fox, 2010) discrepancy measures. The hypothesis that the model fits the data was rejected when the PPP value was less than .05 or greater than .95. The performance of the discrepancy measures was assessed by the average PPP value over the 100 simulated samples as well as by the empirical proportion of rejections (EPR), that is, the proportion of simulated samples in which the fitted model is rejected. When the fitted model coincides with the model used to generate the data, the EPR is an estimate of the false-positive error rate of the test, whereas when the fitted model and the generating model do not coincide, the EPR is an estimate of the sensitivity of the test.

Additionally, two information criterion measures were obtained using the loo R package (Vehtari et al., 2016): WAIC (Watanabe, 2010, 2013) and LOOIC (Gelman et al., 2014). As described above, these measures quantify the discrepancy between the model and the data while taking into account model complexity. They are not intended to test a hypothesis of model fit but to select the best model from a number of competing models. Lower values of WAIC and LOOIC indicate better balance between fit and parsimony. In this study, for each simulated sample, WAIC and LOOIC were used to select the best model from among the fitted models. For each condition of the study, the performance of WAIC and LOOIC was assessed by their average value over the simulated samples as well as by the empirical proportion of selections (EPS), that is, the proportion of simulated samples in which the fitted model is selected.

Item parameter recovery was assessed using measures of precision, bias, and accuracy of the estimation procedure. The standard error (SE) of the estimate was used as a measure of statistical variability (precision) of the estimation procedure. For instance, the SE for the \(\alpha \) parameter is defined as:

$$\begin{aligned} {\mathrm {SE}}\left( {\hat{\alpha }}\right) =\sqrt{\frac{1}{N-1}\sum _{n=1}^N \left[ \frac{\displaystyle \sum _{m=1}^M\left( {\hat{\alpha }}_{nm} -\overline{{\hat{\alpha }}}_{m}\right) ^2}{M}\right] }, \end{aligned}$$
(23)

where \(n\,(n=1,2,\ldots ,N)\) denotes a simulated sample, N is the number of simulated samples (in this study, \(N=100\)), M is the number of \(\alpha \) parameters, \({\hat{\alpha }}_{nm}\) is the EAP estimate of the m-th \(\alpha \) parameter in sample n, and \(\overline{{\hat{\alpha }}}_m\) is the mean of the estimates of \(\alpha _m\) over the N samples. Unlike bias and accuracy, precision depends only on the estimates (it does not depend on the true value of the parameter).

The bias quantifies the difference between the mean of the parameter estimates over the N samples and the true value of the parameter. The absolute bias for the \(\alpha \) parameter is defined as:

$$\begin{aligned} {\mathrm {Bias}}\left( {\hat{\alpha }}\right) =\frac{\displaystyle \sum _{m=1}^M \left| \overline{{\hat{\alpha }}}_m-\alpha _m\right| }{M}, \end{aligned}$$
(24)

where \(\alpha _m\) is the true value of the m-th \(\alpha \) parameter.

The root-mean-square error (RMSE) combines precision and bias to provide a measure of accuracy in parameter recovery. The RMSE quantifies the average difference between the true and the estimated parameters over the N samples. The RMSE for the \(\alpha \) parameter is defined as:

$$\begin{aligned} {\mathrm {RMSE}}\left( {\hat{\alpha }}\right) =\frac{1}{N}\sum _{n=1}^N \sqrt{\frac{\displaystyle \sum _{m=1}^M\left( {\hat{\alpha }}_{nm}-\alpha _m\right) ^2}{M}}. \end{aligned}$$
(25)

The SE, bias, and RMSE for the \(\delta \) and \(\gamma \) parameters are defined in the same way.
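
These three measures can be computed jointly from an \(N\times M\) matrix of EAP estimates; a sketch with hypothetical values:

```r
# SE (Eq. 23), bias (Eq. 24), and RMSE (Eq. 25) from an N x M matrix of
# estimates (rows = simulated samples, columns = parameters).
recovery <- function(est, true) {
  N <- nrow(est); M <- ncol(est)
  col_means <- colMeans(est)
  dev  <- sweep(est, 2, col_means)                 # deviations from per-parameter means
  se   <- sqrt(sum(rowSums(dev^2) / M) / (N - 1))  # Eq. (23)
  bias <- mean(abs(col_means - true))              # Eq. (24)
  err  <- sweep(est, 2, true)                      # deviations from true values
  rmse <- mean(sqrt(rowSums(err^2) / M))           # Eq. (25)
  c(SE = se, Bias = bias, RMSE = rmse)
}

set.seed(1)
true <- c(0.5, -0.3, 0.2, 1.0, -0.8)               # hypothetical true parameters
est  <- matrix(rnorm(100 * 5, mean = rep(true, each = 100), sd = 0.1), nrow = 100)
recovery(est, true)
```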

5.2 Results

Table 1 shows the mean PPP value and the EPR of the discrepancy measures for each combination of generating model and fitted model. As expected, for each generating model, fitting the true or a more general model led to a mean PPP value close to .5, indicating good model-data fit. On the contrary, fitting a more restrictive model than the one used to generate the data led to an extreme mean PPP value, close to zero or one, indicating model misfit. Likewise, when the true or a more general model was fitted to the data, the EPR was close to zero, indicating a low false-positive error rate. However, fitting a more restrictive model led to an EPR close to one, revealing the high sensitivity of the procedure in detecting the different types of learning. The above applies to all conditions except when there were non-contingent learning effects in the data (i.e., when the OSLM was the generating model) and the estimated model was the LLTM. In that condition, the BLR and, to a lesser extent, the OR showed low sensitivity. It is also worth noting the low EPR values associated with the Rasch model when the data were generated with the OSLM. This result was due to the fact that the OSLM, like the LLTM, is a restricted Rasch model that does not model local dependencies between items. Consequently, as a more general model, the Rasch model is expected to fit data generated with the OSLM.

Table 1 Average posterior predictive p-value (\(\overline{\mathrm {PPP}}\)) and empirical proportion of rejections (EPR) of the discrepancy statistics for each combination of generating model and fitted model

Table 2 shows the mean values of WAIC and LOOIC and their corresponding EPS for each combination of generating model and fitted model. Based on both WAIC and LOOIC, for each condition of generating model, the true model (followed by more general models) led to the lowest mean discrepancy between the data and the model as well as to the highest EPS.

Table 2 Average WAIC and LOOIC (Mean) and empirical proportion of selections (EPS) for each combination of generating model and fitted model

Table 3 shows the SE, bias, and RMSE for each combination of estimated parameter, generating model, and fitted model. For each generating model, the SE was minimized by the most restrictive model (the LLTM, OSCLM, and OSDCLM, for the \(\alpha \), \(\delta \), and \(\gamma \) parameters, respectively), whereas the bias was minimized by the true or a more general model. As expected, for each generating model, fitting the true model minimized the RMSE and, therefore, maximized the accuracy of the estimates. Conversely, fitting a more restrictive model than the one used to generate the data led to inaccurate estimates of the difficulty and practice parameters. In order to rule out potential differential effects associated with the sign of the parameter, the SE, bias, and RMSE were also obtained for each operation separately, with no evidence of differential effects found.

Table 3 Standard error (SE), bias, and root-mean-square error (RMSE) for each combination of estimated parameter, generating model, and fitted model

5.3 Conclusions

The simulation study illustrates the good performance of PPMC for model evaluation and selection as well as the accuracy of the MCMC algorithm in recovering the true parameters from simulated data. Regarding model evaluation, PPMC based on the discrepancy statistics showed good performance in identifying learning effects in the data. Specifically, the OR and BLR statistics showed low sensitivity in only one condition. Additionally, WAIC and LOO demonstrated relatively good performance in model comparison and selection, although they showed a certain tendency to favor complex models. Based on these results, when sufficient computational resources are available, PPMC should also be preferred for model comparison and selection, with the decision rule of selecting the simplest model that shows an acceptable fit to the data. Regarding parameter recovery, as expected, fitting the true model provided the most accurate parameter estimates. On the contrary, when there were learning effects in the data that were not taken into consideration in the model formulation, the resulting parameter estimates were considerably inaccurate.

6 Empirical Study

An empirical study was conducted to illustrate the performance and applicability of the proposed framework for detecting practice effects in real data. Specifically, the models were fitted to data from a fraction arithmetic test (Tatsuoka, 1984) whose items are based on several arithmetic operations that are repeatedly applied throughout the test.

6.1 Method

The data consist of responses by 536 examinees to 15 items involving subtraction of fractions. The data set was originally used by Tatsuoka (1984) and is included in the CDM R package (George et al., 2016). The matrix \(\mathbf{W}\) used in this study was defined by de la Torre (2009) in the context of cognitive diagnosis modeling (see Table 4). In this example, the matrix \(\hat{\mathbf{R}}^+\) satisfies the rank condition, \(rank(\hat{\mathbf{R}}^+)=16\), and, consequently, the vector \({\varvec{\xi }}\) is empirically identified.
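
The responses can be loaded directly from the CDM package; the sketch below only loads the full data set, and the 15-item subset and weight matrix follow de la Torre (2009):

```r
# Tatsuoka (1984) fraction-subtraction responses from the CDM package.
library(CDM)
data(fraction.subtraction.data)
X_full <- as.matrix(fraction.subtraction.data)  # 536 examinees, 20 items in the full set
dim(X_full)
# The analyses use the 15-item subset and the W matrix of de la Torre (2009);
# see Table 4 for the transposed weight matrix.
```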

The same models, estimation method, and model evaluation procedures tested in the simulation study were used with the empirical data. A prior sensitivity study was conducted to investigate the effect of prior choice on the posterior parameter estimates. A normal prior distribution was used with mean set equal to zero, while the value of the variance was manipulated across conditions (i.e., 1, 5, 10, 50, 100, 500, 1,000, 5,000, and 1,000,000).

Table 4 Transposed weight matrix for the fraction-subtraction items (de la Torre, 2009)

6.2 Results

The prior sensitivity analysis revealed that the posterior parameter estimates were robust to different prior distributions. More specifically, the average standard deviations for the \(\alpha \), \(\delta \), and \(\gamma \) parameter estimates were .168, .042, and .040, respectively. Moreover, when removing the estimates corresponding to N(0, 1), the average standard deviations were .017, .004, and .004, whereas when removing the estimates corresponding to N(0, 1) and N(0, 5), the average standard deviations were .010, .003, and .003. The results shown in this section were obtained by using a weakly informative prior, N(0, 100), for all structural parameters and a standard normal distribution for the incidental parameters.

Table 5 shows the model evaluation statistics at the test level for each of the fitted models. The PPP values of the discrepancy measures led to the rejection of the hypothesis of fit for the LLTM and the Rasch model in all cases (PPP \(< .05\) or PPP \(> .95\)). More specifically, the observed and simulated values of the OR indicated that the data showed more local dependence than would be expected based on these models. According to the OR, the OSDCLM was the only model that reproduced the local dependencies present in the data (\(\hbox {PPP} = .200\)). Similarly, the PPP value of the BLR suggested that the OSDCLM was the only model that fitted the data well (\(\hbox {PPP} = .467\)).

Table 5 Model evaluation statistics at the test level for the fitted models

Tables 6 and 7, respectively, show the OR and BLR statistics at the item level for the fitted models. Based on the PPP value of both the OR and the BLR, the OSDCLM was the model that fitted the data best, showing the lowest proportion of non-fitting items (PPP \(< .05\) or PPP \(> .95\)).

Table 6 Odds-ratio at the item level for the fitted models
Table 7 Bayesian latent residuals at the item level for the fitted models

Table 8 shows the WAIC and LOOIC values for the fitted models. As can be observed, both indices coincided in selecting the OSDCLM as the model that showed the best balance between fit and parsimony.

Table 8 Comparison indices for the fitted models

Table 9 shows the expected a posteriori (EAP) estimates, posterior standard deviations, and posterior probability intervals of the parameters of the OSDCLM. According to the magnitude of the estimates, the second operation defined in the matrix \(\mathbf{W}\) was the most difficult operation at the beginning of the test, followed by the fifth, the fourth, the first, and finally the third. It is interesting to note that the EAP estimates obtained by fitting the LLTM led to a different order of difficulty: \({\hat{\alpha }}_1=-1.009\), \({\hat{\alpha }}_2=0.002\), \({\hat{\alpha }}_3=-0.420\), \({\hat{\alpha }}_4=1.810\), and \({\hat{\alpha }}_5=0.355\). These estimates represent the marginal difficulty associated with each cognitive operation; that is, its difficulty confounded with the practice effect.

Table 9 Expected a posteriori (EAP) estimates, posterior standard deviations (SD), and posterior probability intervals (\(2.5\%-97.5\%\)) of the difficulty (\(\alpha _m\)) and practice parameters (\(\delta _m\) and \(\gamma _m\)) of the operation-specific differential contingent learning model

The positive sign of the estimates of the \(\delta _1\), \(\delta _2\), \(\delta _5\), and \(\gamma _2\) parameters, together with the absence of zero in their corresponding posterior probability intervals, indicated the existence of learning associated with correct responses in operations 1, 2, and 5, and learning associated with incorrect responses in operation 2. Note that the second operation was the most difficult operation at the beginning of the test and, therefore, the most prone to require successive approximations for it to be properly performed. The magnitude of the estimates of the parameters suggested a greater learning effect for the second operation, followed by the first, and finally the fifth. The interpretation of these estimates is straightforward. For instance, responding correctly (incorrectly) to an item in which operation 2 was involved provided a decrease of 0.724 (0.676) in the difficulty of this operation.
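
In other words, under the estimates in Table 9, the effective difficulty of operation 2 when person i reaches item j is

$$\begin{aligned} \hat{\alpha }_2-0.724\sum _{k=1}^{j-1}x_{ik}w_{k2}-0.676\sum _{k=1}^{j-1}(1-x_{ik})w_{k2}, \end{aligned}$$

so that, for example, after one correct and one incorrect previous practice of the operation, its difficulty has decreased by \(0.724+0.676=1.400\) logits.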

The negative sign of the estimate of \(\delta _3\), together with the absence of zero in its posterior probability interval, indicated an increase in difficulty during the test associated with operation 3 as a function of correct practice, which may be interpreted in terms of progressive fatigue or loss of attention during the test. The negative sign of the estimates of \(\gamma _3\), \(\gamma _4\), and \(\gamma _5\), together with the absence of zero in their corresponding posterior probability intervals, indicated an increase in difficulty during the test associated with operations 3, 4, and 5 as a function of incorrect practice. These results suggested that those individuals who failed in applying these operations at the beginning of the test increased their failure rate in subsequent items, which may be interpreted as loss of interest and/or attention.

The posterior probability interval of the difference between the \(\delta _m\) and \(\gamma _m\) parameters (Table 10) indicated that this difference was credibly different from zero for operations 1, 3, 4, and 5. These results explain why the OSDCLM fitted the data better than the OSLM, which assumes no difference between \(\delta _m\) and \(\gamma _m\) for each m. Moreover, the fact that the \(\gamma _m\) parameter was credibly different from zero for operations 2, 3, 4, and 5 explains why the OSDCLM fitted the data better than the OSCLM, which assumes that \(\gamma _m\) equals zero for all m.

Table 10 Expected a posteriori (EAP) estimates, posterior standard deviations (SD), and posterior probability intervals (\(2.5\%-97.5\%\)) of the differences (\(d_m\)) by operation (m) between the practice parameters (\(\delta _m\) and \(\gamma _m\)) of the operation-specific differential contingent learning model

Figure 1 shows the difficulty of the cognitive operations as a function of previous practice for the first two subjects in the response matrix (\(i=1\) and \(i=2\)), whose response patterns were 010101111110111 and 111111111111111, respectively. It should be noted that previous practice equals zero the first time an operation appears in the test. The figure illustrates the decrease in difficulty during the test in operations 1, 2, and 5 for the two subjects. Note that the difficulty throughout the test of operation 2, as well as that of operation 5, was the same for both subjects because their response patterns to the items involving said operations were the same. By contrast, the difficulty of operation 1, as well as that of operations 3 and 4, evolved slightly differently for the two subjects because their response patterns to the items involving said operations were not the same.

Fig. 1 Difficulty of the five cognitive operations as a function of previous practice for subjects \(i = 1\) (left) and \(i = 2\) (right)

6.3 Conclusions

This study illustrates the utility of the proposed model for investigating a variety of practice effects in real data. The best fitting model was the OSDCLM, which suggests the presence of different practice effects in the data derived from correct and incorrect responses. Specifically, learning effects associated with correct responses were observed for operations 1, 2, and 5, whereas a learning effect associated with incorrect responses was observed for operation 2. Additionally, a fatigue effect associated with correct responses was observed for operation 3, whereas fatigue effects associated with incorrect responses were observed for operations 3, 4, and 5.

7 Discussion

The purpose of the present work was to introduce a new explanatory item response model for the detection and measurement of differential contingent learning effects during psychometric tests due to the repeated use of the operations involved in the items. To that end, a Bayesian approach was adopted for model estimation and evaluation. The performance of the proposed framework was illustrated with a simulation study and an empirical application. The simulation study demonstrated the accuracy of the MCMC algorithm in parameter recovery as well as the good performance of PPMC and the information criterion indices in model evaluation and selection. The empirical study demonstrated the presence of differential contingent practice effects in real assessment data, which illustrates the utility of incorporating previous practice from correct and incorrect responses separately into item response models. The proposed framework, therefore, proves useful when there is a suspicion of practice effects during the test and the goal of the researcher is to adopt an explanatory approach to account for the cognitive processes underlying the item responses. The R and RStan scripts used in this work for model estimation and evaluation are available as supplementary material to this paper.

Nevertheless, it is worth highlighting that the proposed model, as presented in this paper, is based on strong assumptions that might not always be justified. The main assumptions are inherited from the LLTM and the OSLM. Specifically, the LLTM assumes that item difficulty can be linearly decomposed into the difficulties of a well-defined set of operations, and that said difficulties are constant throughout the test and equal for all examinees. In the OSLM, by contrast, the difficulties are allowed to vary linearly as a function of practice, although they are still assumed to be equal for all examinees. These assumptions are highly restrictive and may lead to incorrect results when the assumed operations do not truly reflect the way in which individuals actually solve the items, when the practice effects have a more complex pattern, or when there are individual differences in the practice effects.

The proposed model is more flexible in that it accounts for differential contingent practice effects. In this regard, the model allows for different patterns of change in the difficulty associated with the cognitive operations throughout the test as a function of the persons’ particular response patterns. However, the model still assumes that item difficulty is exclusively determined by the cognitive operations involved in the item, an assumption that may not hold in all cases. For instance, other item properties, such as those related to drawing features in figural items, may also have an influence on item difficulty in certain types of tests. Nevertheless, provided that the researcher is able to operationalize these features, they could be incorporated into the matrix \(\mathbf{W}\) to account for their associated effects (e.g., Lozano & Revuelta, 2020). Likewise, learning effects during the test are still assumed to be completely explained by the accumulated practice in the assumed operations, which may be a strong assumption for tests where there are other learning sources to consider (e.g., becoming familiar with test instructions, item response format, item time limit, etc.). Additionally, the practice effects are still assumed to be linear throughout the test, which need not be the case. For instance, a learning effect may show a quadratic trend, with a smaller effect at the beginning of the test and a more pronounced effect toward the end, or vice versa. In such a case, a nonlinear variant of the model, such as that proposed by Spada (1977) and Spada and McGaw (1985) for the OSLM, may be useful.

The model also makes the assumption that practice effects do not differ across items as a function of item difficulty. In this regard, the amount of learning or fatigue derived from performing an operation in a difficult item or in an easy one is assumed to be the same. Although this assumption may be true for many educational and psychological tests in which the items do not show a wide range of difficulty (such as the fraction arithmetic test used in the present study: Range \(= .310-.795\), \(\hbox {Var} = 0.023\), \(\hbox {Sd} = 0.151\)), it may not hold for tests with greater variability in item difficulty. In such cases, if there is a suspicion of interaction effects between operations combined in the same items, it may be useful to incorporate the corresponding product terms into the matrix \({\mathbf{W}}\) to account for the extra difficulty and practice effects derived from said interactions instead of using an additive model.
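
As a hypothetical illustration (the present study’s \(\mathbf{W}\) contains no such term), a product column for two operations can be appended as follows:

```r
# Augment a hypothetical W with a product term for operations 1 and 2, so that
# items combining both operations receive extra difficulty/practice parameters.
W <- matrix(c(1, 0,
              1, 1,
              0, 2), nrow = 3, byrow = TRUE)
W_int <- cbind(W, int12 = W[, 1] * W[, 2])  # extra column = interaction weights
```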

Finally, unlike the OSLM, the model accounts for individual learning patterns based on the persons’ particular response patterns to the items. However, the model still assumes that the learning effects are the same for all examinees, which may be too restrictive an assumption for particular sets of data. In this regard, future studies may be directed to extend the proposed framework to incorporate individual differences in learning (e.g., Embretson, 1991; Rijmen et al., 2002). It would also be interesting for future studies to investigate the influence of practice effects on dimensionality assessment and, more particularly, on over-factoring.

In summary, the proposed model has demonstrated its usefulness in detecting and measuring learning effects during a psychometric test, providing a promising range of applicability. In this regard, the model may be useful in a variety of settings. For instance, it may be used for the assessment of competence acquisition in developmental and educational contexts (e.g., Spada, 1977; Spada & McGaw, 1985), for the substantive analysis of the processes underlying the item responses (e.g., Lozano & Revuelta, 2020, 2021), or for the study of differences in learning ability between populations (e.g., normal vs. impaired, children at different developmental stages, etc.). However, the model may also enable novel and interesting methodological applications in the field of adaptive testing. Based on a prior assessment of the difficulty and practice effects associated with each cognitive operation, the model allows for on-the-fly estimation of the difficulty that an item would show in any position within the test as a function of the operations involved in the item and the person’s response pattern to previous items. This opens the door for future studies to investigate the applicability of the model to deal with practice effects in computerized adaptive testing.