Abstract
The measurement of latent traits and investigation of relations between these and a potentially large set of explaining variables is typical in psychology, economics, and the social sciences. Corresponding analysis often relies on surveyed data from large-scale studies involving hierarchical structures and missing values in the set of considered covariates. This paper proposes a Bayesian estimation approach based on the device of data augmentation that addresses the handling of missing values in multilevel latent regression models. Population heterogeneity is modeled via multiple groups enriched with random intercepts. Bayesian estimation is implemented in terms of a Markov chain Monte Carlo sampling approach. To handle missing values, the sampling scheme is augmented to incorporate sampling from the full conditional distributions of missing values. We suggest modeling the full conditional distributions of missing values in terms of nonparametric classification and regression trees. This offers the possibility to consider information from latent quantities functioning as sufficient statistics. A simulation study reveals that this Bayesian approach provides valid inference and outperforms complete-case analysis and multiple imputation in terms of statistical efficiency and computation time involved. An empirical illustration using data on mathematical competencies demonstrates the usefulness of the suggested approach.
1 Introduction
Models for measurement and structural analysis of latent traits have been developed among others by Muthén (1979), Zwinderman (1991) and Adams et al. (1997). These latent regression models (LRM) typically use a regression equation to assess the relationship between the latent trait and additional covariates and link measurements to the latent trait via a model, possibly arising from the context of item response theory (IRT; e.g., Embretson & Reise, 2000). As demonstrated by Rijmen et al. (2003), and described extensively in Wilson and De Boeck (2004), these models can be conceptualized within the wider context of nonlinear mixed models. Since the derived likelihood functions involve multiple integrals arising from the involved latent variables, a Bayesian framework using Markov chain Monte Carlo (MCMC) techniques is eminently suited to provide inference, see e.g. Edwards (2010). The seminal article of Albert (1992) adopts data augmentation (DA), see Tanner and Wong (1987), within a Bayesian estimation approach for measurement models with dichotomous items.^{Footnote 1} Further work adopted Albert’s DA procedure for extended model structures incorporating multilevel and clustered data structures (Aßmann & Boysen-Hogrefe, 2011; Fox, 2005; Fox & Glas, 2001; Johnson & Jenkins, 2005). Prominent applications of these models arise in the context of large-scale assessment studies like the Programme for International Student Assessment (PISA; e.g., OECD, 2014), the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis & Martin, 2013), the Programme for the International Assessment of Adult Competencies (PIAAC; e.g., OECD, 2013b) or the National Assessment of Educational Progress (NAEP; e.g., Allen et al., 1999).
However, surveyed information is often seriously afflicted by item nonresponse. Si and Reiter (2013), for example, report less than five percent complete cases on a set of 80 background variables in a data file of the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis & Martin, 2013). Especially in multilevel contexts, such a large fraction of missing values poses a challenge to efficient parameter estimation. An appropriate strategy for handling missing values and corresponding model specification is required when analyzing the data. While several studies deal with the impact of missing or omitted competence items (Köhler et al., 2015; Pohl et al., 2014), there has been less work on missing values in background variables. By default, the educational assessment studies cited above treat missing values in context questionnaires via dummy-variable adjustments, see e.g. OECD (2014). Aside from the obvious information loss, dummy-variable adjustments for missing values can cause biased estimation, see Jones (1996). The involved categorization of information may have negative side effects on the assumed functional relationship, see also Grund et al. (2020) for a more detailed discussion. These results are in line with a study by Rutkowski (2011), who found non-negligible bias and misleading interpretations at the population level when partially missing covariates are dummy coded.^{Footnote 2}
With the latent factor being of substantial interest, the Bayesian approach implemented in terms of an MCMC algorithm using the DA device has the advantage of providing direct access to the latent factors in terms of the posterior distribution.^{Footnote 3} Furthermore, in the presence of missing values in background variables, DA in the Bayesian context offers a conceptually straightforward way to deal with missing values. The vector of unknown quantities can be augmented with the missing values in covariates. Correspondingly, the MCMC sampling scheme incorporates the set of full conditional distributions of the missing values. This approach has the advantage that the modeling of the full conditional distributions can incorporate information available in the form of a latent variable serving in the considered model context as a sufficient statistic.^{Footnote 4} These advantages result in increased statistical efficiency and reduced computational costs as illustrated in this paper. Such a handling of information is in principle also possible in the context of Maximum Likelihood estimation in terms of a chained equation approach via iteratively sampling from an assumed or approximated set of full conditional distributions, see Grund et al. (2020) for a discussion in the absence of hierarchical structures. In addition, in data contexts with a large number of covariates relative to the number of observations, the Bayesian approach incorporates shrinkage in terms of the involved prior distributions and facilitates updating of information with regard to the modeled relationships. Finally, Bayesian estimators of parameters or functions thereof, like context effects and uncertainty measures, are directly accessible without the use of combining rules.
The DA principle has been successfully applied in different contexts ranging from multivariate panel models to social network analysis and educational large-scale assessments by Liu et al. (2000), Koskinen et al. (2010), Blackwell et al. (2017) and Kaplan and Su (2018). Full conditional distributions of missing values are typically operationalized in terms of a parametric modeling approach as discussed by Grund et al. (2020) and Erler et al. (2016). Goldstein et al. (2014), Erler et al. (2016) and Grund et al. (2018) provide a discussion in the context of linear regression models for metrically scaled hierarchical data.
In this article, we extend the DA approach towards missing values in covariate data in extended hierarchical structures in LRMs for dependent variables with binary and ordinal scale.^{Footnote 5} We also illustrate that DA allows for direct access to a valid model specification for the missing values incorporating information available in the form of sufficient statistics as suggested by the Hammersley–Clifford theorem, see Robert and Casella (2004). Further, specifying the full conditional distributions of missing values in terms of sufficient statistics arising in the hierarchical latent regression context has the potential to reduce the computational burden. The role of sufficient statistics has also been stressed by Neal and Kypraios (2015), who discuss situations where the augmented variables and sufficient statistics are discrete and the models of interest belong to well-known probability distributions. Our approach extends this work, as we consider hierarchical structures and identifying restrictions arising from the factor-like model structures resulting in complex posterior distributions.^{Footnote 6} Consideration of full conditional distributions for handling of missing values enriched with information from latent model structures also extends the sequential imputations approach discussed by Kong et al. (1994). Whereas the sequential imputations approach builds on predictive distributions for missing values, thereby separating the model for the missing values in the covariate variables from the considered latent model structures, our approach is based on smoothed, i.e. full conditional, distributions incorporating information from the latent model structures via the DA principle.^{Footnote 7}
In combination with modeling the full conditional distributions of missing values via nonparametric sequential regression trees as suggested by Burgette and Reiter (2010) and Doove et al. (2014), the DA approach suggested in this paper offers high flexibility in empirical applications to cope with nonlinear relationships, e.g. interaction terms, within a potentially large set of covariates having different scales. The proposed modeling approach allows hence for tackling research questions typically addressed in sociology, psychology, and economics in the field of educational inequality and the role of institutions, see among others Carlsson et al. (2015), Passaretta and Skopek (2021) and Cornelissen and Dustmann (2019). It simultaneously addresses the uncertainty associated with the estimation of a latent trait variable and the imputation of missing values in manifest covariate variables. The reciprocal dependence of outcomes and predictors is reflected to the full extent by the Bayesian DA estimation algorithm. The benefits of the suggested fully Bayesian approach arise in terms of methodological stringency and gains in statistical efficiency. Illustration of the suggested approach is provided by means of a simulation study and an empirical application using the first wave of the starting cohort of ninth graders surveyed in the German National Educational Panel Study—Educational Trajectories in Germany (NEPS), see Blossfeld and Roßbach (2019). To highlight the benefits of considering sufficient statistics within the suggested DA approach towards missing values in covariates, we provide a comparison with a classical imputation setup, where the full conditional distributions of missing values are defined on the basis of directly observable quantities only, see e.g. von Hippel (2007). 
As shown in the simulations, the consideration of sufficient statistics accelerates the computation up to a third and ensures the feasibility of specifying full conditional distributions in multilevel contexts.
The paper proceeds as follows. Section 2 outlines the specification of the considered model setup and provides the corresponding Bayesian sampling algorithm that deals with structures reflecting heterogeneity and missing values in covariates via DA. Performance of the estimation routine is demonstrated through a simulation study in Sect. 3, whilst Sect. 4 provides the empirical illustration using data from the NEPS. Section 5 concludes.
2 Model Setup and Bayesian Inference
2.1 Model Setup
Consider J measurement items observed on N individuals, summarized in an \(N\times J\) data matrix \(Y=(y_1,\ldots ,y_N)'\) with row vectors \(y_i=(y_{i1},\ldots , y_{ij},\ldots ,y_{iJ})\) for each \(i=1,\ldots ,N\) and \(j=1,\ldots ,J\). In the case of binary measurements, \(y_{ij}\) denotes a random variable taking the value \(y_{ij}=1\) if, in an educational assessment context, respondent i is able to solve item j and the value \(y_{ij}=0\) otherwise. To analyze this kind of test item, Lord (1952, 1953) proposes an IRT model generally known as the two-parameter normal ogive (2PNO) model, stating the probability (\(\text {Pr}\)) for a correctly solved item as \(\text {Pr}(y_{ij}=1\mid \theta _i,\alpha _j,\beta _j) = \Phi (\alpha _j\theta _i-\beta _j)\), where \(\theta _i\) denotes a scalar person parameter, \(\alpha _j\) is an item discrimination parameter and \(\beta _j\) denotes the item difficulty or item fixed effect. We adopt the standard normal cumulative distribution function \(\Phi (\cdot )\) as the link function, as it offers computational advantages for MCMC based Bayesian estimation. Also, it allows for an alternative representation in terms of a threshold mechanism, which was first formalized in the context of individual level data by McKelvey and Zavoina (1975) and can be found for multivariate binary variables in Maddala (1983, p. 138). Extending towards the analysis of ordered polytomous item responses, see Samejima (1969), the observed item responses can be seen as an ordered polytomous version of an underlying continuous variable \(y_{ij}^*=\alpha _j\theta _i-\beta _j+\varepsilon _{ij}\), where the independent and identically distributed error term \(\varepsilon _{ij}\) follows a standard normal distribution. Then one can link the observed categorical and the underlying continuous variable using a threshold mechanism, namely
$$\begin{aligned} y_{ij}=\sum _{q=0}^{Q_j-1} q\,{\mathcal {I}}\big (\kappa _{jq}<y_{ij}^*<\kappa _{jq+1}\big ), \end{aligned}$$
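where \(\kappa _j=(\kappa _{j0},\kappa _{j1},\ldots ,\kappa _{jQ_j})'\) is the \((Q_j+1)\)-dimensional vector of item category cutoff parameters and \({\mathcal {I}}(\cdot )\) denotes the indicator function. The resulting probability that respondent i achieves grade q on item j, given the latent trait and item parameters, is hence implied by
$$\begin{aligned} \text {Pr}(y_{ij}=q\mid \theta _i,\alpha _j,\beta _j,\kappa _j)=\Phi \big (\kappa _{jq+1}-(\alpha _j\theta _i-\beta _j)\big )-\Phi \big (\kappa _{jq}-(\alpha _j\theta _i-\beta _j)\big ), \end{aligned}$$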
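thus nesting the binary case as well. This probability can be represented in terms of the latent variables as \(\int f(y_{ij},y_{ij}^*\mid \theta _i,\alpha _j,\beta _j,\kappa _j) {\textrm{d}}y_{ij}^*\), where, with \(\phi (\cdot )\) denoting the standard normal density,
$$\begin{aligned} f(y_{ij},y_{ij}^*\mid \theta _i,\alpha _j,\beta _j,\kappa _j)=\phi \big (y_{ij}^*-(\alpha _j\theta _i-\beta _j)\big )\,{\mathcal {I}}\big (\kappa _{jy_{ij}}<y_{ij}^*<\kappa _{jy_{ij}+1}\big ). \end{aligned}\tag{2}$$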
The necessary identifying restrictions for all parameters will be discussed jointly below.
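The threshold mechanism can be illustrated with a short simulation. The following Python sketch assumes the latent response \(y_{ij}^*=\alpha _j\theta _i-\beta _j+\varepsilon _{ij}\) and categories coded \(0,\ldots ,Q_j-1\) with cutoffs \((-\infty ,0,\ldots ,+\infty )\); the function names and all numerical values are illustrative assumptions and are not taken from the R implementation in the supplementary material.

```python
import math
import random
from statistics import NormalDist


def category_probabilities(theta, alpha, beta, kappa):
    """Pr(y = q | theta) for q = 0..Q-1 under the ordinal probit model.

    kappa is the full cutoff vector (-inf, 0, ..., +inf) of length Q + 1.
    """
    cdf = NormalDist().cdf
    eta = alpha * theta - beta  # mean of the latent response y*
    return [cdf(kappa[q + 1] - eta) - cdf(kappa[q] - eta)
            for q in range(len(kappa) - 1)]


def draw_response(theta, alpha, beta, kappa, rng):
    """Draw y* = alpha * theta - beta + eps and map it to its category."""
    y_star = alpha * theta - beta + rng.gauss(0.0, 1.0)
    for q in range(len(kappa) - 1):
        if kappa[q] < y_star <= kappa[q + 1]:
            return q, y_star
    raise ValueError("cutoffs must cover the real line")


rng = random.Random(1)
kappa = [-math.inf, 0.0, 0.8, math.inf]  # Q = 3 categories (illustrative)
probs = category_probabilities(theta=0.5, alpha=1.2, beta=0.3, kappa=kappa)
draws = [draw_response(0.5, 1.2, 0.3, kappa, rng)[0] for _ in range(20000)]
freqs = [draws.count(q) / len(draws) for q in range(3)]
```

The simulated category frequencies in `freqs` should be close to the model probabilities in `probs`, which is the sense in which the threshold representation and the probit probabilities describe the same model.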
IRT models are designed to directly link items and persons to a common scale. To enlarge their scope, the focus of analysis was broadened towards structural analysis by Muthén (1979), addressing the issue that persons may not only differ in terms of their competence, but also in terms of covariates which are correlated with their competence. The standard framework assuming \(\theta _i\), \(i=1,\ldots ,N\), to be identically and independently normally distributed can be extended to incorporate a conditional mean operationalized as \(\text {E}[\theta _i\mid X_i]=X_i\gamma \), \(i=1,\ldots ,N\). Thereby \(X=(X_1',\ldots ,X_N')'=(X^{(1)},\ldots ,X^{(P)})\), in terms of row vectors \(X_i\), \(i=1,\ldots ,N\), and column vectors \(X^{(p)}\), \(p=1,\ldots ,P\), denotes an \(N\times P\) matrix of individual-specific covariates and \(\gamma \) the corresponding vector of regression coefficients. When hierarchical clustering in observations is present, this needs to be incorporated in the model as well, as consideration of hierarchical data structures is an important prerequisite for valid inference on the relationship between explaining and latent variables. The multiple forms of population heterogeneity in educational research are reviewed in Muthén (1989) and Burstein (1980), whereas Greene (2004b) provides a discussion for economic applications of the panel probit model incorporating latent heterogeneity structures. Population heterogeneity may be considered in terms of a nested multilevel structure, thereby assuming a composite population consisting of a finite number of G mutually exclusive groups indexed by \(g=1,\ldots ,G\), where \(L=(L_1,\ldots ,L_N)\) with \(L_i\in \{1,\ldots ,G\}\), \(i=1,\ldots ,N\), denotes the individual group membership. Within these groups, separate LRMs may hold. Sample stratification may be based on an explicitly observed cluster variable, e.g., gender or school type. 
This type of modeling dates back to the early works of Muthén and Christoffersson (1981) and Mislevy (1985), but without consideration of covariates except the cluster variable. Often, the specification is theory driven with the aim to discover substantial differences of covariate effects and variances for predefined groups. These differences are captured through the estimation of group-specific latent trait distributions. Additionally, hierarchical structures may be related to random effects. As in multilevel models, there is a composite population consisting of clusters \(c=1,\ldots ,C\), where the individual cluster membership is also known a priori and is captured by \(S=(S_1,\ldots ,S_N)\) with \(S_i\in \{1,\ldots ,C\}\) for all \(i=1,\ldots ,N\). While fixed group-specific regression parameters are suitable for a relatively small number of groups, consideration of hierarchical structures with regard to schools or classes often implies a prohibitively large number of parameters. Difficulties regarding the computation and the statistical properties of the maximum likelihood estimator in this context were studied by Greene (2004a).^{Footnote 8} Thus, the introduction of identically and independently normally distributed cluster-specific random effects \(\omega =(\omega _1,\ldots ,\omega _C)\) offers an appropriate alternative or addition to the fixed effects approach. The most basic multilevel specification is the random intercept latent regression item response model. Depending on the specific hierarchical structure under consideration, combinations of both approaches are possible and allow for multiple hierarchical levels.
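To illustrate, consider a model with nested hierarchical structure with \(S_i=S_{i'}\) implying \(L_i=L_{i'}\), i.e. individuals within the same cluster also belong to the same group, but not vice versa, given as
$$\begin{aligned} \theta _i=X_i\gamma _{L_i}+\omega _{S_i}+\epsilon _i,\quad i=1,\ldots ,N. \end{aligned}$$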
Thereby \(\epsilon _{i}\), \(i=1,\ldots ,N\), is independently normally distributed with mean zero and heteroscedastic variance \(\sigma ^2_{L_{i}}\). Likewise \(\omega _{S_i}\) is independently normally distributed with mean zero and heteroscedastic variance \(\upsilon ^2_{L_i}\). The assumed heteroscedasticity is hence a further way to implement features of (nested) hierarchical structures.^{Footnote 9} We summarize all model parameters as \(\psi =(\{\alpha _j,\beta _j,\kappa _j\}_{j=1}^J,\{\gamma _g,\sigma ^2_g,\upsilon ^2_g\}_{g=1}^G)\). The implied conditional covariance structure with regard to two elements of \(\theta =(\theta _1,\ldots ,\theta _N)\) denoted with i and \(i'\) can be described as
$$\begin{aligned} \text {Cov}(\theta _i,\theta _{i'}\mid X,S,L,\psi )={\left\{ \begin{array}{ll} \sigma ^2_{L_i}+\upsilon ^2_{L_i} &{} \text {if } i=i',\\ \upsilon ^2_{L_i} &{} \text {if } i\ne i' \text { and } S_i=S_{i'},\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
This covariance structure allows for group-specific conditional variances but possibly similar or different correlations within clusters. The corresponding likelihood function in case of completely observed data is given as
Thereby
where \(f(y_{i},y_{i}^*\mid \theta _i,\psi )=\prod _{j=1}^{J} f(y_{ij},y_{ij}^*\mid \psi ,\theta _i)\) with \(f(y_{ij},y_{ij}^*\mid \psi ,\theta _i)\) as in Eq. (2),
and \(f(\omega \mid \psi ,S,L)\) following a multivariate normal distribution with mean zero and covariance matrix \(\text {diag}(\upsilon _{L_1}^2,\ldots ,\upsilon _{L_N}^2)\).
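To make the nested random intercept structure concrete, the following Python sketch simulates latent traits under the assumed form \(\theta _i=X_i\gamma _{L_i}+\omega _{S_i}+\epsilon _i\), which is the reading implied by the description of \(\epsilon _i\) and \(\omega _{S_i}\) above. The function name, the dictionary-based parameter layout, and the numerical values are illustrative assumptions.

```python
import random


def simulate_theta(X, L, S, gamma, sigma2, upsilon2, rng):
    """Simulate theta_i = X_i gamma_{L_i} + omega_{S_i} + eps_i.

    X: covariate rows; L: group label per person; S: cluster label per person.
    gamma, sigma2, upsilon2: dicts keyed by group label.
    """
    cluster_group = {}
    for s, g in zip(S, L):
        # Nesting: a cluster must not appear in two different groups.
        assert cluster_group.setdefault(s, g) == g, "clusters must be nested in groups"
    # One random intercept per cluster, with group-specific variance.
    omega = {s: rng.gauss(0.0, upsilon2[g] ** 0.5) for s, g in cluster_group.items()}
    theta = []
    for x_i, g, s in zip(X, L, S):
        mean = sum(x * c for x, c in zip(x_i, gamma[g]))
        theta.append(mean + omega[s] + rng.gauss(0.0, sigma2[g] ** 0.5))
    return theta, omega


rng = random.Random(7)
n_clusters = 4000
S = [c for c in range(n_clusters) for _ in range(2)]  # two persons per cluster
L = [1] * len(S)                                      # a single group
X = [[1.0] for _ in S]
theta, omega = simulate_theta(X, L, S, {1: [0.0]}, {1: 1.0}, {1: 0.5}, rng)
pairs = [(theta[2 * c], theta[2 * c + 1]) for c in range(n_clusters)]
within_cov = sum(a * b for a, b in pairs) / n_clusters  # conditional means are zero here
```

With \(\sigma ^2=1\) and \(\upsilon ^2=0.5\), the empirical covariance between cluster mates in `within_cov` should be close to \(\upsilon ^2\), since the shared random intercept is the only source of dependence within a cluster.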
In case of completely observed data Y and X, the Bayesian model setup is then completed by an appropriate prior distribution \(\pi (\psi )\). However, the estimation of IRT models is in general plagued by an identification problem, where the classical identification strategies impose restrictions on the parameter space. For the given model, the identification problem can be described as follows. First, the overall means of \(y_{ij}^*\) are implied by the mean values of \(\theta _i\), \(\beta _j\), and \(\kappa _j\), as well as the signs of \(\alpha _j\). The mean values of \(\theta _i\) in turn arise from the regression coefficients \(\gamma _g\) in combination with the observed covariates \(X_i\). Second, the scaling of \(y_{ij}^*\) is implied by the scaling of \(\theta _i\) and \(\alpha _j\), where the scaling of \(\theta _i\) arises from the variance parameters \(\upsilon _g^2\) and \(\sigma _g^2\). The given interdependencies lead to the fact that these parameters are not jointly identifiable. However, for given signs of \(\alpha _j\) and mean values for two of the three quantities \(\theta _i\), \(\beta _j\), and \(\kappa _j\), mean values for the remaining quantity become identifiable. The same holds for the scaling issue, where for given signs of \(\alpha _j\) and a given scaling for one of the two quantities \(\theta _i\) and \(\alpha _j\), the remaining scaling becomes identifiable. The decision which mean and scaling parameters to fix is in principle arbitrary. However, for the considered hierarchical structures it is more convenient, also in terms of the implied sampling scheme, to restrict the item parameters \(\alpha _j\), \(\beta _j\), and \(\kappa _j\). The typical choice as discussed in the literature by Fox (2010) and Albert and Chib (1997) imposes the following ordering and value constraints on the parameter space. 
With regard to the threshold parameter the restrictions can be formulated in terms of the condition \(\prod _{j=1}^J {\mathcal {I}}(\kappa _{j0}=-\infty ,\kappa _{j1}=0< \kappa _{j2}<\cdots< \kappa _{jQ_j-1} < \kappa _{jQ_j}=+\infty )\), while for the item difficulties and discrimination parameters, we have \({\mathcal {I}}(\sum _{j=1}^J \beta _j=0)\) and \({\mathcal {I}}(\prod _{j=1}^J \alpha _j {\mathcal {I}}(\alpha _j>0)=1)\). Given these identifying restrictions, appropriate (conjugate) prior distributions can be formulated as given in Table 1. In the light of the Hammersley–Clifford theorem, see Robert and Casella (2004) for theorem and proof, the implied joint posterior distribution
is accessible in terms of the corresponding set of full conditional distributions. With \(Z=\{Y,X,S,L\}\), we have
where the chosen sequence ordering \(\theta ,Y^*,\omega ,\psi \) is arbitrary and \({\tilde{\cdot }}\) denotes any admissible point of the indicated variable. The set of full conditional distributions resulting from Eq. (7) and employed within an MCMC algorithm taking the form of an iterative sequential Metropolis-Hastings (MH) within Gibbs sampling scheme to provide inference based on a sample from the posterior distributions is given in detail in Sect. 2.2.
Next, we will discuss the handling of missing values. Given the factorization of the likelihood described in Eqs. (2) and (5), handling of missing values in item responses \(Y=(Y_{\text {obs}},Y_{\text {mis}})\) is directly possible by dropping the corresponding elements \(Y_{\text {mis}}\) from the likelihood. That means, per item j, only the observed \(y_{ij}\) are used to estimate the parameters. An alternative approach of handling missing values in Y may be to consider missing values as wrong answers. Our approach is also fully compatible with von Hippel (2007) suggesting to consider draws of \(Y_{\text {mis}}\) from the posterior predictive distribution for the specification of the full conditional distributions of the missing values of the covariate variables X but not using them for analysis.^{Footnote 10}
However, when facing partially observed X one has to think of an appropriate missing data technique to facilitate estimation. In the following, we will denote \(X=(X_{\text {obs}}, X_{\text {mis}})\). In the context of the considered model structure, the latent variables and hierarchical structures take the role of sufficient statistics and may play a crucial role for implementing appropriate models defining the uncertainty associated with missing values \(X_{\text {mis}}\). We suggest handling missing values \(X_{\text {mis}}\) by means of DA, as this allows for advantageous use of the latent and hierarchical model structures within the modeling of missing values by means of Rao-Blackwellization and, owing to a lower-dimensional representation of the relevant information, also reduces the computational burden.^{Footnote 11} The advantages relate to gains in statistical efficiency in estimation of \(\psi \) captured by the bias, root mean square error, and coverage. Hence, the augmented posterior distribution
incorporating an appropriate prior distribution \(\pi (X_{\text {mis}}\mid X_{\text {obs}},\psi )\), is of interest and subject to inference. The characterization in terms of the full conditional distributions given in Eq. (7) is then extended as follows. With \({\tilde{Z}}=\{Y,X_{\text {obs}},{\tilde{X}}_{\text {mis}},S,L\}\), we have
thereby augmenting the MCMC sampling scheme.^{Footnote 12}
The suggested sequential sampling is also well suited to deal with regression specifications involving cross products of variables considered in X. Given an initialization of \(X_{\text {mis}}\) and thus the involved cross products, missing values for one variable can be drawn. If this variable is involved in cross products, these cross products are updated. This procedure is then repeated for each variable in X. In order to establish highly flexible modeling of the distributions of \(X_{\text {mis}}\) and allow for handling of a possibly large number of background variables, we adopt sequential recursive classification and regression trees in combination with sampling via a Bayesian bootstrap (CART-BB) for the construction of full conditional distributions, see Burgette and Reiter (2010) and Rubin (1981). Modeling the full conditional distributions of missing values in this way is compatible with assuming prior distributions for the missing values proportional to the empirical densities observed for each variable, see also Table 1.^{Footnote 13} This choice is motivated by the flexibility of CART-BB to handle variables of any scale and the potential to cope with nonlinear relationships among the variables, see also Doove et al. (2014). The application of CART-BB to model the full conditional distributions of missing values is particularly useful because the analyst does not need to specify the full conditional distributions of missing values (imputation models) explicitly. The complete set of full conditional distributions and further details referring to the augmented parameter vector are provided in the following. We label the suggested Bayesian estimation approach using data augmentation and sequential recursive partitioning classification and regression trees combined with a Bayesian bootstrap for handling missing values in covariate variables as the DART approach.
2.2 Bayesian Inference
Bayesian inference is based on a posterior sample generated via the following MCMC algorithm iteratively sampling from the set of full conditional distributions.^{Footnote 14} The algorithm is based on the blocking scheme \(y_{11}^*,\ldots ,y_{NJ}^*\), \(\alpha _1,\beta _1,\ldots ,\alpha _J,\beta _J\), \(\kappa _1,\ldots ,\kappa _J\), \(X_{\text {mis}}^{(1)},\ldots ,X_{\text {mis}}^{(P)}\), \(\theta _1,\ldots ,\theta _N\), \(\gamma _1,\ldots ,\gamma _G\), \(\sigma _1^2,\ldots ,\sigma _G^2\), \(\omega _1,\ldots ,\omega _C\), \(\upsilon _1^2,\ldots ,\upsilon _G^2\), where the initialization of all quantities except \(y_{11}^*,\ldots ,y_{NJ}^*\) is described in Table 1 and initial values for \(\theta \) and \(\omega \) are drawn from standard normal distributions. An implementation of this MCMC sampling algorithm in R is available within the supplementary material. The set of full conditional distributions can be described as follows.
 \(f(y_{ij}^*\mid \cdot )\):

The full conditional distributions of the random variables \(y_{ij}^*\), \(i=1,\ldots ,N\) and \(j=1,\ldots ,J\), are independent and sampled from a truncated normal distribution with moments \( \mu _{y_{ij}^*} = \alpha _j\theta _{i}-\beta _j\) and \(\sigma ^2_{y_{ij}^*} = 1\), where the truncation interval is \((\kappa _{jy_{ij}},\kappa _{jy_{ij}+1})\).
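A draw from this truncated normal distribution can be sketched in Python via the inverse-CDF method. The sketch assumes unit variance and the mean \(\alpha _j\theta _i-\beta _j\) as stated above; the function names are illustrative, and the inverse-CDF approach is one of several possible samplers rather than the method used in the supplementary R code.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def draw_truncated_normal(mean, lower, upper, rng):
    """Inverse-CDF draw from N(mean, 1) restricted to (lower, upper)."""
    p_lo = _STD.cdf(lower - mean)
    p_hi = _STD.cdf(upper - mean)
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf is undefined at 0 and 1; the clamp limits accuracy in extreme tails.
    u = min(max(u, 1e-15), 1.0 - 1e-15)
    return mean + _STD.inv_cdf(u)


def draw_y_star(y, theta, alpha, beta, kappa, rng):
    """Draw y*_ij given the observed category y and the cutoff vector kappa."""
    mean = alpha * theta - beta
    return draw_truncated_normal(mean, kappa[y], kappa[y + 1], rng)


rng = random.Random(3)
kappa = [-math.inf, 0.0, 0.8, math.inf]  # illustrative cutoffs
middle = [draw_y_star(1, 0.5, 1.2, 0.3, kappa, rng) for _ in range(1000)]
lowest = [draw_y_star(0, 0.5, 1.2, 0.3, kappa, rng) for _ in range(1000)]
```

Every draw in `middle` lies between the two finite cutoffs and every draw in `lowest` is non-positive, so the augmented latent responses remain consistent with the observed categories.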
 \(f(\alpha _1,\beta _1,\ldots ,\alpha _J,\beta _J\mid \cdot )\):

Note that for the assumed model structure in absence of the identifying restrictions all full conditional distributions of the item parameters \(\xi _j=(\alpha _j,\beta _j)'\), \(j=1,\ldots ,J\), are mutually independent. In the presence of the identifying restrictions, however, an arbitrarily chosen single element, say \(\xi _{j'}\), is completely determined by the other \(J-1\) item parameters, i.e. \(\xi _{j'}=((\prod _{j\ne j'}\alpha _j)^{-1},-\sum _{j\ne j'}\beta _j)\). In this sense, the joint distribution of all item parameters is defective, as the distribution of the element implied by the other elements is not specified. Further, sampling from the full conditional distribution of item parameters \(\xi _j\) in absence of identifying restrictions can be characterized in terms of the linear regression equation \(y_{j}^*=H\xi _j+\epsilon _j\), where H is an \(N\times 2\) auxiliary matrix consisting of \(\theta \) and \(-\iota _N\), where \(\iota _N\) denotes an \(N\times 1\) vector of ones. Since \(\epsilon _j\) is normally distributed, the full conditional distribution of \(\xi _j\) is proportional to a bivariate truncated normal distribution with covariance matrix and mean vector
$$\begin{aligned} \Sigma _{\xi _j} = \big (H'H+\Omega _{\xi _j}^{-1}\big )^{-1} \quad \text {and}\quad \mu _{\xi _j} = \Sigma _{\xi _j}\big (H'y_{j}^* + \Omega _{\xi _j}^{-1}\nu _{\xi _j}\big ). \end{aligned}$$
The positivity constraints on the item discrimination parameters causing the truncation are handled via accept-reject sampling. In each iteration sampling is performed until a draw is accepted. The values of the hyperparameters \(\nu _{\xi _j}\) and \(\Omega _{\xi _j}\) are chosen as given in Table 1. Note that for any possible subset containing \(J-1\) item parameters, the remaining item parameters, say \(\xi _{j'}\), are implied by the assumed identifying restrictions. Although this element is determined by all other elements, the data driven information contained within the above regression is not incorporated in the characterization of these item parameters. Further, J equivalent possibilities exist to characterize the redundant element. Hence, incorporating these J alternative possibilities to draw from the full conditional distribution into a single draw via averaging seems preferable in order to use all available data based information and thus improve mixing and convergence. Given draws for \(\alpha =(\alpha _1,\ldots ,\alpha _J)\) and \(\beta =(\beta _1,\ldots ,\beta _J)\), averaging the J characterizations is possible in terms of the geometric mean and the arithmetic mean, resulting in \(\alpha =(\alpha _1(\prod _{j=1}^J\alpha _j)^{-\frac{1}{J}},\ldots ,\alpha _J(\prod _{j=1}^J\alpha _j)^{-\frac{1}{J}})\) and \(\beta =(\beta _1-\frac{1}{J}\sum _{j=1}^J\beta _j,\ldots ,\beta _J-\frac{1}{J}\sum _{j=1}^J \beta _j)\). We refer to this approach to handling identifying restrictions as a kind of marginal data augmentation, see among others Imai and van Dyk (2005).
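The item parameter step can be sketched in Python as follows. The sketch assumes \(H=(\theta ,-\iota _N)\), so that \(y_j^*=\alpha _j\theta -\beta _j+\epsilon _j\), and reads the normalization as dividing each \(\alpha _j\) by the geometric mean and centering each \(\beta _j\) at the arithmetic mean. The prior mean and precision used here are illustrative placeholders, not the values of Table 1, and all function names are hypothetical.

```python
import math
import random


def draw_item_parameters(theta, y_star_j, nu, omega_inv, rng):
    """Draw xi_j = (alpha_j, beta_j) from its bivariate normal full conditional,
    truncated to alpha_j > 0 via accept-reject sampling.

    omega_inv: symmetric 2x2 prior precision; nu: prior mean (illustrative).
    """
    n = len(theta)
    s_tt = sum(t * t for t in theta)
    s_t = sum(theta)
    s_ty = sum(t * y for t, y in zip(theta, y_star_j))
    s_y = sum(y_star_j)
    # Posterior precision P = H'H + Omega^{-1} with H = (theta, -1).
    p11 = s_tt + omega_inv[0][0]
    p12 = -s_t + omega_inv[0][1]
    p22 = n + omega_inv[1][1]
    det = p11 * p22 - p12 * p12
    c11, c12, c22 = p22 / det, -p12 / det, p11 / det  # Sigma = P^{-1}
    # b = H'y* + Omega^{-1} nu, posterior mean mu = Sigma b.
    b1 = s_ty + omega_inv[0][0] * nu[0] + omega_inv[0][1] * nu[1]
    b2 = -s_y + omega_inv[1][0] * nu[0] + omega_inv[1][1] * nu[1]
    m1, m2 = c11 * b1 + c12 * b2, c12 * b1 + c22 * b2
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(c11)
    l21 = c12 / l11
    l22 = math.sqrt(c22 - l21 * l21)
    while True:  # repeat until the positivity constraint holds
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        alpha = m1 + l11 * z1
        if alpha > 0.0:
            return alpha, m2 + l21 * z1 + l22 * z2


def normalize_items(alphas, betas):
    """Rescale so that prod(alpha) = 1 and sum(beta) = 0."""
    geo = math.exp(sum(math.log(a) for a in alphas) / len(alphas))
    mean_b = sum(betas) / len(betas)
    return [a / geo for a in alphas], [b - mean_b for b in betas]


rng = random.Random(11)
theta = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_star = [1.3 * t - 0.4 + rng.gauss(0.0, 1.0) for t in theta]
prior_precision = [[0.01, 0.0], [0.0, 0.01]]  # assumed weak prior
alpha_draw, beta_draw = draw_item_parameters(theta, y_star, (1.0, 0.0), prior_precision, rng)
a_norm, b_norm = normalize_items([0.5, 2.0, 3.0], [1.0, 2.0, 6.0])
```

After `normalize_items`, the discrimination parameters multiply to one and the difficulties sum to zero, so the identifying restrictions hold for every retained draw without singling out one item as the redundant element.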
 \(f(\kappa _j\mid \cdot )\):

Draws from the mutually independent full conditional distributions of the item category cutoff parameters \(\kappa _j\) are obtained via an MH step following Albert and Chib (1997). To perform this sampling step it is convenient to consider a reparameterization of the elements \(\kappa _{j2},\ldots ,\kappa _{jQ_j-1}\), where \(\kappa _{jq}=\sum _{w=2}^q \exp \{\tau _{jw}\}\) for all \(j=1,\ldots ,J\) and \(q=2,\ldots ,Q_j-1\). The threshold parameters can then be stated as \(\kappa _j=(-\infty ,0,\kappa _{j2},\ldots ,\kappa _{jQ_j-1},\infty )=h(\tau _j)=(h_{j0},h_{j1},h_{j2},\ldots ,h_{jQ_j-1},h_{jQ_j})=(-\infty ,0,\exp \{\tau _{j2}\},\exp \{\tau _{j2}\}+\exp \{\tau _{j3}\},\ldots ,\sum _{q=2}^{Q_j-1}\exp \{\tau _{jq}\},\infty )\). Given the prior for \(\kappa _j\) this transformation induces a multivariate normal prior for \(\tau _j=(\tau _{j2},\ldots ,\tau _{jQ_j-1})\) given as
$$\begin{aligned} \pi (\tau _j)\propto \prod _{q=2}^{Q_j-1} \exp \left\{ -\frac{1}{2\Omega _{\kappa _{jq}}^2}\big (\tau _{jq}-\nu _{\kappa _{jq}}\big )^2\right\} . \end{aligned}$$
Hence, the posterior and thus full conditional distribution can be reformulated in terms of \(\tau _j\). To generate a draw from the full conditional of \(\tau _j\), we choose as a proposal a multivariate t-distribution with mean vector \(m_j\), covariance matrix \(V_j\) and \(Q_j-2\) degrees of freedom, where
$$\begin{aligned} m_j=\arg \underset{\tau _j}{\max }\ \text{ ln }\{f(y_{j}\mid \xi _j,h(\tau _j),\psi ,\theta )\pi (\tau _j)\} \end{aligned}$$
and \(V_j\) is the inverse of the Hessian of \(\text{ ln }\{f(y_{j}\mid \xi _j,h(\tau _j),\psi ,\theta )\pi (\tau _j)\}\) evaluated at \(m_j\). Note that \(f(y_j\mid \xi _j,h(\tau _j),\theta ,\psi )=\prod _{i=1}^N [\Phi (h_{jy_{ij}+1}-(\alpha _j\theta _i-\beta _j))-\Phi (h_{jy_{ij}}-(\alpha _j\theta _i-\beta _j))]\). The probability of accepting candidate values \(\tau _j^{\text{ cand }}\) is given as
$$\begin{aligned} a_{\tau _j}=\min \left\{ 1,\frac{f\big (y_{j}\xi _j,h\big (\tau ^{\text{ cand }}_j\big ),\psi ,\theta \big )\pi \big (\tau ^{\text{ cand }}_j\big )}{f\big (y_{j}\xi _j,h(\tau _j),\psi ,\theta \big )\pi (\tau _j)}\frac{f_t\big (\tau _jm_j,V_j,Q_j2\big )}{f_t\big (\tau ^{\text{ cand }}_jm_j,V_j,Q_j2\big )}\right\} . \end{aligned}$$The acceptance rates within the simulation study and the empirical application where found to be at least 0.95. A draw for \(\kappa _j\) is then implied by \(h(\tau _j)\). The chosen hyperparameter values for \(\Omega _{\kappa _{jq}}^2\) and \(\nu _{\kappa _{jq}}\) are given in Table 1.
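The reparameterization \(h(\tau _j)\), the ordinal likelihood contribution of a single item, and the acceptance probability can be sketched as follows. This is a minimal illustration, assuming responses coded \(0,\ldots ,Q_j-1\) and a linear predictor \(\alpha _j\theta _i-\beta _j\); the optimization for \(m_j\), the Hessian and the t-proposal density are not implemented, and all function names are hypothetical.

```python
import math


def thresholds_from_tau(tau):
    """kappa_j = h(tau_j) = (-inf, 0, e^t2, e^t2 + e^t3, ..., +inf)."""
    kappa = [-math.inf, 0.0]
    running = 0.0
    for t in tau:
        running += math.exp(t)   # cumulative sums keep thresholds ordered
        kappa.append(running)
    kappa.append(math.inf)
    return kappa


def tau_from_thresholds(free_thresholds):
    """Inverse map for the free thresholds 0 < kappa_j2 < kappa_j3 < ..."""
    tau, previous = [], 0.0
    for k in free_thresholds:
        tau.append(math.log(k - previous))
        previous = k
    return tau


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_likelihood(y, theta, alpha_j, beta_j, kappa):
    """Sum over persons of ln[Phi(h_{y+1} - eta_i) - Phi(h_y - eta_i)]."""
    total = 0.0
    for y_i, theta_i in zip(y, theta):
        eta = alpha_j * theta_i - beta_j
        p = std_normal_cdf(kappa[y_i + 1] - eta) - std_normal_cdf(kappa[y_i] - eta)
        total += math.log(max(p, 1e-300))   # guard against underflow to zero
    return total


def log_acceptance(log_target_cand, log_target_curr, log_prop_curr, log_prop_cand):
    """ln a = min{0, [ln target(cand) - ln target(curr)] + [ln f_t(curr) - ln f_t(cand)]}."""
    return min(0.0, (log_target_cand - log_target_curr)
               + (log_prop_curr - log_prop_cand))


# Thresholds (0, 0.5, 1) of a four-category item map to tau = (ln 0.5, ln 0.5).
tau_example = tau_from_thresholds([0.5, 1.0])
kappa_example = thresholds_from_tau(tau_example)
```

Working with \(\tau _j\) instead of \(\kappa _j\) removes the ordering constraint on the thresholds, which is what makes an unconstrained t-proposal admissible.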
 \(f(X_{\text {mis}}^{(p)}|\cdot )\):

Values of \(X_{\text {mis}}\) are sampled sequentially for each column vector \(X^{(p)}\), \(p=1,\ldots ,P\), in two steps. Let \(X_{\text{ com }}^{({\setminus } p)}=(X_{\text{ obs }}^{({\setminus } p)},X_{\text{ mis }}^{({\setminus } p)})\) denote the completed matrix of conditioning variables in \(X\) except column p, with the operator \(\setminus p\) meaning without p.^{Footnote 15} First, a decision tree is built for the observed values \(X_{\text{ obs }}^{(p)}\) conditional on the corresponding values of all remaining variables \(X_{\text {com}}^{(\setminus p)}\) as well as conditional on \(\theta ,\omega , S,L\), and Y. A further possibility is to consider only subsets of the conditioning variables \(\theta \), \(\omega \), S, L, and Y. To incorporate a priori uncertainty on the hyperparameters of the sequential partitioning regression trees, we build trees with a randomly varying minimum number of elements within nodes. Every missing observation can then be assigned to a node and thus to a group of observations implied by the binary partition in terms of the conditioning variables. The values within each node provide an empirical distribution function, which serves as an approximation to the full conditional distribution of a missing value and thus as the key element of the data generating mechanism for missing values. With prior distributions of missing values proportional to observed data densities, draws from the empirical distribution function within a node correspond to draws from the full conditional distributions of missing values. To account for the estimation uncertainty of the full conditional distribution, the Bayesian bootstrap is applied to the assigned group of observations, see Rubin (1981).
Thereby, the uncertainty regarding the estimated empirical distribution implied by the proposed set of observed values is fully considered.^{Footnote 16} The approach further offers the flexibility to include any function of observed or augmented data in the set of conditioning variables. Besides the matrices \(Y^*\) and Y, statistics thereof might also be considered. This may include draws of missing values in Y or \(Y^*\) from the posterior predictive distributions as suggested by von Hippel (2007). When the analysis is restricted to observed values of Y only, as in the empirical illustration, missing categories might additionally be considered. Note that this is the default of the R function rpart within the implementation of the CARTBB algorithm, see Therneau and Atkinson (2018). Further, group specific or individual specific specifications of the full conditional distributions could be obtained by considering group specific variables within the set of conditioning variables only, i.e. by creating a binary partition only for those values fulfilling the conditions \(L_i=g\) or \(S_i=c\). The sampled \(X_{\text {mis}}\) values provide an updated completed matrix of covariates for all other steps of the MCMC algorithm.
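The node-level draw can be illustrated with a short Python sketch of the Bayesian bootstrap. It assumes that the tree has already been fitted and that the observed values falling into the same node as the missing entry (the donors) are available; tree fitting itself, which the text delegates to rpart, is not reproduced here. Function names are hypothetical.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Dirichlet(1, ..., 1) weights via normalised standard exponentials."""
    gaps = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(gaps)
    return [g / total for g in gaps]


def bayesian_bootstrap_draw(donors, n_draws, rng):
    """Draw imputations for the missing entries assigned to one node.

    donors  : observed values of X^(p) falling into the node
    n_draws : number of missing entries assigned to the same node
    """
    if not donors:
        raise ValueError("a node must contain at least one observed value")
    # Random weights reflect uncertainty about the node's empirical
    # distribution; a plain bootstrap would use equal weights instead.
    weights = bayesian_bootstrap_weights(len(donors), rng)
    return rng.choices(donors, weights=weights, k=n_draws)


rng = random.Random(2024)
imputed = bayesian_bootstrap_draw([0.0, 1.0, 1.0, 0.0, 1.0], n_draws=3, rng=rng)
```

Because the weights are redrawn in every MCMC iteration, repeated visits to the same node do not always sample from the same empirical distribution.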
 \(f(\theta _i|\cdot )\):

The full conditional distributions for \(\theta _i\), \(i=1,\ldots ,N\) are elementwise conditionally independent. Let \(B_{i}=y_{i}^* + \beta \). This allows for stating the conditional distribution of the individual abilities as normal with moments
$$\begin{aligned} \sigma ^2_{\theta _{i}} = \big (\alpha '\alpha + \sigma _{S_i}^{-2}\big )^{-1} \quad \text {and}\quad \mu _{\theta _{i}} = \sigma ^2_{\theta _{i}}\left( \alpha ' B_{i} + \sigma _{S_i}^{-2}(\omega _{S_i} + X_{i}\gamma _{L_i})\right) . \end{aligned}$$(9)
 \(f(\gamma _g|\cdot )\):

To sample from the full conditional distributions of the regression coefficients, let \(D^C\) denote an \(N\times C\) design matrix of zeros and ones. Each row of \(D^C\) has a single entry 1 indicating the respondent's cluster membership \(S_i\). The operator [g] selects the elements of \(\theta \), respectively the rows of X and \(D^C\), for which the condition \(L_i=g\) holds. Further, let \(\Sigma _{\epsilon }\) be an \(N_g\times N_g\) diagonal matrix with elements \(\sigma ^2_{\epsilon ,g}\). Draws from the conditional distribution of \(\gamma _g\) are obtained from a multivariate normal with covariance matrix and mean vector
$$\begin{aligned} \Sigma _{\gamma _g} = \big (X_{[g]}'\Sigma _\epsilon ^{-1}X_{[g]} + \Omega _{\gamma _g}^{-1}\big )^{-1} \text { and } \mu _{\gamma _g} = \Sigma _{\gamma _g}\big (X_{[g]}'\Sigma _\epsilon ^{-1}(\theta _{[g]} - D^C_{[g]}\omega ) + \Omega _{\gamma _g}^{-1}\nu _{\gamma _g}\big ). \end{aligned}$$
The values of the hyperparameters \(\nu _{\gamma _g}\) and \(\Omega _{\gamma _g}\) are chosen as given in Table 1.
 \(f(\sigma _{g}^2|\cdot )\):

Each group g contains \(C_g\) clusters and \(N_g\) respondents, with \(\sum _{g=1}^{G}C_g=C\) and \(\sum _{g=1}^{G}N_g=N\). Choosing a conjugate prior, the full conditional distribution of \(\sigma _{g}^2\) is inverse gamma with shape and scale parameters
$$\begin{aligned} a_{\sigma _{g}^2} = a^0_{\sigma _{g}^2} + N_g/2, ~ b_{\sigma _{g}^2}&= \Bigg (b^0_{\sigma _{g}^2} + \frac{1}{2}\big (\theta _{[g]} - D^C_{[g]}\omega - X_{[g]}\gamma _g\big )'\big (\theta _{[g]} - D^C_{[g]}\omega - X_{[g]}\gamma _g\big )\Bigg )^{-1}, \end{aligned}$$
where the values of the hyperparameters \(a^0_{\sigma _g^2}\) and \(b^0_{\sigma _{g}^2}\) are chosen as given in Table 1.
 \(f(\omega _c|\cdot )\):

Let the operator [c] select the elements of \(\theta \), respectively the rows of X, belonging to cluster c, and let \(N_c\) be the total number of persons in cluster c. The cluster-specific random intercepts \(\omega _c\) are conditionally independent and follow a full conditional distribution given as a normal distribution with moments
$$\begin{aligned} \sigma ^2_{\omega _c}&= \left( \upsilon ^{-2}_{S_c} + N_c/\sigma ^2_{S_c}\right) ^{-1} \quad \text {and}\quad \mu _{\omega _c} = \sigma ^2_{\omega _c}\left( \sigma ^{-2}_{S_c}(\theta _{[c]} - X_{[c]}\gamma _{S_c})'\iota _{N_c}\right) . \end{aligned}$$
The chosen values for hyperparameters are given in Table 1.
 \(f(\upsilon _g^2|\cdot )\):

Given a conjugate prior and making use of the operator [g], \(\upsilon _{\omega ,g}^2\) is distributed inverse gamma with shape and scale parameters
$$\begin{aligned} a_{\upsilon _g^2} = a^0_{\upsilon _g^2} + C_g/2 \quad \text {and}\quad b_{\upsilon _g^2} = \left( b^0_{\upsilon _{g}^2} + 0.5\omega _{[g]}'\omega _{[g]}\right) ^{-1}. \end{aligned}$$
The values of the hyperparameters \(a^0_{\upsilon _g^2}\) and \(b^0_{\upsilon _{g}^2}\) are chosen as given in Table 1.
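The scalar conjugate updates among the steps above can be sketched in a few lines of Python. The sketch covers the moments of \(\theta _i\) in Eq. (9), the moments of \(\omega _c\), and the inverse gamma update of a variance parameter. It follows the formulas as printed and assumes that the inverse gamma is parameterized such that the reciprocal of the variance is gamma distributed with shape a and scale b; this reading of the parameterization, like all function names, is an assumption.

```python
import math
import random


def theta_moments(alpha, beta, y_star_i, sigma2, omega_c, x_gamma):
    """Mean and variance of theta_i given everything else, Eq. (9)."""
    b_i = [y + b for y, b in zip(y_star_i, beta)]            # B_i = y_i^* + beta
    precision = sum(a * a for a in alpha) + 1.0 / sigma2     # alpha'alpha + sigma^-2
    var = 1.0 / precision
    mean = var * (sum(a * b for a, b in zip(alpha, b_i))
                  + (omega_c + x_gamma) / sigma2)
    return mean, var


def omega_moments(residuals, sigma2, upsilon2):
    """Mean and variance of omega_c; residuals are theta_[c] - X_[c] gamma."""
    var = 1.0 / (1.0 / upsilon2 + len(residuals) / sigma2)
    mean = var * sum(residuals) / sigma2
    return mean, var


def variance_update(residuals, a0, b0):
    """Shape a and scale b of the inverse gamma full conditional."""
    shape = a0 + len(residuals) / 2.0
    scale = 1.0 / (b0 + 0.5 * sum(r * r for r in residuals))
    return shape, scale


def draw_inverse_gamma(shape, scale, rng):
    """ASSUMPTION: 1 / variance ~ Gamma(shape, scale)."""
    return 1.0 / rng.gammavariate(shape, scale)


rng = random.Random(7)
mean_theta, var_theta = theta_moments(
    alpha=[1.0, 1.0], beta=[0.0, 0.0], y_star_i=[1.0, 1.0],
    sigma2=1.0, omega_c=0.0, x_gamma=0.0,
)
theta_draw = rng.gauss(mean_theta, math.sqrt(var_theta))
shape, scale = variance_update([1.0, -1.0], a0=1.0, b0=1.0)
sigma2_draw = draw_inverse_gamma(shape, scale, rng)
```

For the variance of the random intercepts, the same `variance_update` applies with the vector \(\omega _{[g]}\) in place of the residuals and \(C_g\) in place of \(N_g\).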
Given this MCMC algorithm, parameter estimates and functions of interest thereof can be readily obtained from the MCMC output denoted as \(\{\psi ^{(r)},\theta ^{(r)},\omega ^{(r)}\}_{r=1}^R\), with R denoting the number of iterations after burn-in. Under an absolute loss function, the estimates are given by the medians of the posterior sample. Their calculation does not involve the application of any combining rules, as required by other approaches to handling missing values. If relevant, the MCMC output for the augmented quantities \(\{Y^{*,(r)}, X^{(r)}=(X_{\text {obs}},X_{\text {mis}}^{(r)})\}_{r=1}^R\) may be considered as well. To illustrate, given the hierarchical model structure, the within group correlation may also be of interest, i.e.
with the corresponding estimator given as
Next, the effects of changes in X on the individual competence level conditional on school type g (CE) might be of interest. Additionally, also the relative effects to another school type \(g'\) (RE) or the conditional effects in standardized form (CSE), see e.g. Nieminen et al. (2013), can be considered, i.e.
where \(\text {sd}\) denotes the vector of standard deviations of the column vectors in \(X_{[g]}\). Also context effects in the form of ceteris paribus effects can be considered, e.g. \({\text {CP}}=\text {E}[\theta _i|X_i,\psi ,L_i=g]-\text {E}[\theta _i|X_i,\psi ,L_i=g']=X_i(\gamma _g-\gamma _{g'})\) or \({\text {CPA}}=\text {E}[\theta _i|X_i,\psi ,L_i=g,S_i=c,y_i^*,y_i]-\text {E}[\theta _i|X_i,\psi ,L_i=g',y_i^*,y_i]=\frac{1}{C}\sum _{c=1}^C \mu _{\theta _i}(X_i,L_i=g,S_i=c,\psi ,\omega _c,y_i^*)-\mu _{\theta _i}(X_i,L_i=g',S_i=c,\psi ,\omega _c,y_i^*)\), where \(\mu _{\theta _i}(\cdot )\) is given in Eq. (9).
Estimates of conditional, relative and conditional standardized effects are readily available as
whereas for the context effects we have \({\widetilde{{\text {CP}}}}=\text {median}\left\{ X_i^{(r)}(\gamma _g^{(r)}-\gamma _{g'}^{(r)})\right\} _{r=1}^R\) and
Note that measures of uncertainty, e.g. posterior standard deviation or highest posterior density intervals, are likewise directly accessible without use of combining rules.
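How such estimates follow from the MCMC output without combining rules can be sketched in Python. The sketch evaluates a function of the parameters for every retained draw and takes the median. The ceteris paribus effect follows the expression \(X_i(\gamma _g-\gamma _{g'})\) given above; the within group correlation is written in the usual intraclass form \(\upsilon _g^2/(\upsilon _g^2+\sigma _g^2)\), which is an assumption here, as are the dictionary layout of the draws and all function names.

```python
import statistics


def posterior_median(draws, fn):
    """Posterior median of a scalar function fn evaluated at each draw."""
    return statistics.median(fn(draw) for draw in draws)


def within_group_correlation(draw):
    # ASSUMPTION: intraclass form upsilon^2 / (upsilon^2 + sigma^2).
    return draw["upsilon2"] / (draw["upsilon2"] + draw["sigma2"])


def ceteris_paribus_effect(x_i):
    """CP = X_i (gamma_g - gamma_g') for a covariate row x_i.

    For simplicity x_i is treated as fully observed; with missing entries
    the augmented row X_i^(r) of the respective draw would be used instead.
    """
    def effect(draw):
        return sum(x * (a - b)
                   for x, a, b in zip(x_i, draw["gamma_g"], draw["gamma_gp"]))
    return effect


# Three toy draws standing in for the R retained MCMC iterations.
toy_draws = [
    {"sigma2": 1.0, "upsilon2": 1.0, "gamma_g": [1.0, 0.5], "gamma_gp": [0.0, 0.5]},
    {"sigma2": 3.0, "upsilon2": 1.0, "gamma_g": [2.0, 0.5], "gamma_gp": [0.0, 0.0]},
    {"sigma2": 1.0, "upsilon2": 3.0, "gamma_g": [0.5, 1.0], "gamma_gp": [0.5, 0.0]},
]
icc_hat = posterior_median(toy_draws, within_group_correlation)
cp_hat = posterior_median(toy_draws, ceteris_paribus_effect([1.0, 2.0]))
```

Posterior standard deviations or interval estimates are obtained in the same way by replacing the median with the corresponding summary of the evaluated draws.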
Finally, note that computation of the marginal data likelihood, i.e.
involved in Bayes factors to allow for non-nested model comparison is possible along the lines suggested by Chib (1995), Chib and Jeliazkov (2001) and Aßmann and Preising (2020) in the context of linear dynamic panel models.
3 Simulation Study
We assess the proposed strategy via a simulation study. To illustrate the possible gains arising from the handling of missing values by means of DA, we consider as benchmarks estimation without missing values, i.e., before any values have been discarded from the data sets (BD), estimation of complete cases only (CC), and a third benchmark situation mimicking the situation of handling missing values without latent structures, i.e., handling of missing values in an imputation sense before estimating the model of interest (IBM). For the IBM benchmark, the full conditional distributions of missing values are also constructed via CARTBB by using information from observable variables only. For this, we consider
The IBM strategy conditions on all observables \((Y,X_{\text {obs}},S,L)\) but not on latent model structures like \(\theta \) or \(\omega \).^{Footnote 17} These three benchmarks are contrasted with the suggested Bayesian estimation approach DART. Within the DART approach, we will add to the considered observable set of conditioning variables also the latent variables \(\theta \) and \(\omega \) to assess the full conditional distribution of \(X_{\text {mis}}\), i.e.
Next, we will consider also a modified version of the DART approach, labeled DARTm. We discard Y and S from the set of conditioning variables entering the CARTBB algorithm. This illustrates that the latent variables \(\theta \) and \(\omega \) serve as a kind of sufficient statistics of Y and S. When specifying the full conditional distributions of missing values \(X_{\text {mis}}\) the sufficient statistics allow for incorporation of the relevant information but provide a more parsimonious representation of this information leading to a noticeable reduction in computation time.
The simulation study is based on the following data generating process (DGP), where the comparison is based on estimates averaged over \({\mathcal {S}}=1000\) replications of the same DGP. The considered DGP satisfies the following conditions. The response matrix Y is simulated assuming the model outlined in Eqs. (1), (2) and (3) with a sample setup of \(N=4000\) students allocated equally to \(C=20\) schools, each of which belongs to one of \(G=2\) school types. Thus, there are 200 students per school and 10 schools per school type, corresponding to a nested hierarchical structure. The respondents face a test of altogether \(J=20\) items, of which the first 18 are binary and the last two are ordinal with \(Q_{19}=Q_{20}=4\) categories. The J discrimination and difficulty parameters are fixed across replications and were obtained once by drawing from uniform distributions on the interval (0.7, 1.3) for discrimination and \((-0.7,0.7)\) for difficulty parameters respectively. To fulfill the identifying restrictions, the item discrimination and difficulty parameters are transformed in terms of the geometric and arithmetic mean respectively, see also Sect. 2.2 for details. Finally, the item category cutoff parameters for the two ordinal items are set to \(\kappa _{19}=(0,0.5,1)'\) and \(\kappa _{20} = (0,0.7,1.4)'\).
We consider three regressors, of which two covariates \(X^{(p)}\), \(p=2,3\), capture individual differences in the latent trait \(\theta _i\). Adding a constant, the regressor matrix can be stated as \(X=(\iota _N, X^{(2)}, X^{(3)})\). Since participants in large-scale studies are often heterogeneous, we also reflect this in our simulation study. The chosen DGP leans towards the data situation in empirical surveys such as the NEPS, as we consider heterogeneity between groups of individuals. Therefore \(X^{(2)}\) is sampled from a Bernoulli distribution with \(\Pr (X_{i,g=1}^{(2)} = 1) =0.3\) for group 1 (\(g=1\)) and \(\Pr (X_{i,g=2}^{(2)} = 1) =0.6\) for group 2 (\(g=2\)). \(X^{(3)}\) is sampled from a normal distribution with school specific means and a variance set to one. The overall means in group 1 are chosen to be smaller than in group 2. The corresponding parameters of the population model are set to \(\gamma _1 = (-0.5,0.4,0.2)'\), \(\gamma _2 = (1,0.2,-0.2)'\), \(\sigma ^2_{1} = 0.64\), \(\sigma ^2_{2} = 0.36\), \(\upsilon ^2_{1} = 0.81\) and \(\upsilon ^2_{2} = 0.49\). The simulation study consists of four missing data scenarios. For scenarios 1 and 2 the missing rates for \(X^{(2)}\) and \(X^{(3)}\) depend exclusively on the latent trait variable \(\theta \). As pointed out by a reviewer, when the missing probability depends on the latent variable \(\theta \), the mechanism can be characterized as approximately at random, since the latent variable \(\theta \) becomes estimable in the considered model framework via observable quantities. For scenario 3 missing probabilities are determined by weighted sum scores depending on the observed variables \(X^{(2)}\), \(X^{(3)}\), and the latent variable \(\theta \). Scenario 4 is similar to scenario 3, but missingness in \(X^{(3)}\) depends on \(X^{(3)}\) itself, thus characterizing the mechanism as not at random. For further details on the four described missing scenarios, see Table 2.
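The complete-data part of this DGP can be sketched in Python as follows. The sample sizes, item counts, uniform ranges, cutoffs, Bernoulli probabilities and population parameters are taken from the text. The school specific means of \(X^{(3)}\) are not reported here, so the values used below are placeholders, and the measurement model \(y^*_{ij}=\alpha _j\theta _i-\beta _j+\varepsilon _{ij}\) with standard normal errors is assumed from the sampling steps in Sect. 2.2. The missing data mechanisms of Table 2 are not reproduced.

```python
import math
import random

GAMMA = {1: (-0.5, 0.4, 0.2), 2: (1.0, 0.2, -0.2)}   # from the text
SIGMA2 = {1: 0.64, 2: 0.36}                          # from the text
UPSILON2 = {1: 0.81, 2: 0.49}                        # from the text
P_X2 = {1: 0.3, 2: 0.6}                              # from the text
X3_GROUP_MEAN = {1: -0.5, 2: 0.5}   # ASSUMPTION: placeholder values
X3_SCHOOL_SD = 0.5                  # ASSUMPTION: placeholder value


def simulate_dataset(seed=1, n_per_school=200, schools_per_type=10, n_items=20):
    rng = random.Random(seed)
    # Item parameters: uniform draws, then mapped onto the restrictions.
    alpha = [rng.uniform(0.7, 1.3) for _ in range(n_items)]
    beta = [rng.uniform(-0.7, 0.7) for _ in range(n_items)]
    geo_mean = math.exp(sum(math.log(a) for a in alpha) / n_items)
    mean_beta = sum(beta) / n_items
    alpha = [a / geo_mean for a in alpha]
    beta = [b - mean_beta for b in beta]
    # Finite cutoffs: one for binary items, three for the two ordinal items.
    cuts = [[0.0]] * (n_items - 2) + [[0.0, 0.5, 1.0], [0.0, 0.7, 1.4]]

    rows, school = [], 0
    for g in (1, 2):
        g0, g1, g2 = GAMMA[g]
        for _ in range(schools_per_type):
            school += 1
            omega = rng.gauss(0.0, math.sqrt(UPSILON2[g]))   # random intercept
            x3_mean = rng.gauss(X3_GROUP_MEAN[g], X3_SCHOOL_SD)
            for _ in range(n_per_school):
                x2 = 1.0 if rng.random() < P_X2[g] else 0.0
                x3 = rng.gauss(x3_mean, 1.0)
                theta = (g0 + g1 * x2 + g2 * x3 + omega
                         + rng.gauss(0.0, math.sqrt(SIGMA2[g])))
                y = []
                for a, b, c in zip(alpha, beta, cuts):
                    y_star = a * theta - b + rng.gauss(0.0, 1.0)
                    # Category = number of finite cutoffs below y*.
                    y.append(sum(y_star > k for k in c))
                rows.append({"group": g, "school": school, "x2": x2,
                             "x3": x3, "theta": theta, "y": y})
    return rows


data = simulate_dataset(seed=1)
```

Each element of `data` holds one student, so deleting entries of `x2` and `x3` according to a chosen missing data mechanism yields the incomplete data sets on which the competing approaches would be compared.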
Each of the repeated estimations is based on MCMC chains of length 25,000. After discarding the first 5000 iterations as burn-in, inference is based on the remaining 20,000 simulated draws from the joint posterior distribution. Convergence is monitored via the Geweke statistic, the Gelman–Rubin statistic, and the effective sample size, see Geweke (1992), Gelman et al. (2013), and Vehtari et al. (2021), as well as the supplementary material for further information. The convergence diagnostics indicate overall convergence.
Results for the four different missing scenarios are presented in Tables 3, 4, 5 and 6. They provide the true parameter values used in the DGP, mean posterior medians and averaged standard deviations over the 1000 replications obtained for the BD, CC, IBM, DART, and DARTm sample estimates with regard to the regression coefficients and conditional variance parameters. Besides the averaged estimates, simulation results are also evaluated in terms of the root mean square error (RMSE) and the coverage, i.e. the proportion of \(95\%\) highest posterior density regions (HDRs) that contain the true DGP parameter values. For completeness, results on item characteristics (item discrimination, item difficulty and item category cutoff parameters) are available in the supplementary material. For the BD estimates we find overall unbiased results for all parameters. The results indicate a correct implementation of the algorithm and further serve as a benchmark to assess the relative performance of the different methods in the case of missing values. As expected, the CC results show a substantial bias, which becomes larger as the proportion of missing values increases. The results also show that the biases tend to be larger when the probability of missing values in \(X^{(2)}\) and \(X^{(3)}\) depends only on \(\theta \), see Tables 3 and 4, and not additionally on the covariates themselves, see Tables 5 and 6. Not unexpectedly, coverage rates for CC are the lowest, see e.g. the parameters \(\gamma _{0,1}\) and \(\sigma _1^2\) in Table 3.
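The two evaluation criteria can be sketched in Python. The RMSE is computed from the point estimates over replications, and the coverage is the share of replications whose interval contains the true value. The interval is taken here as the shortest interval containing the requested share of the sorted draws, which is one common way to approximate a highest posterior density region for a unimodal posterior; whether the study computed its HDRs in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def hdi(draws, mass=0.95):
    """Shortest interval containing a share `mass` of the posterior draws."""
    s = sorted(draws)
    n = len(s)
    k = max(1, math.ceil(mass * n))   # number of draws inside the interval
    # Slide a window of k consecutive order statistics, keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]


def rmse(estimates, truth):
    """Root mean square error of point estimates across replications."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def coverage(intervals, truth):
    """Share of replications whose interval contains the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


# Toy usage with two replications of four posterior draws each.
replications = [[0.0, 1.0, 2.0, 10.0], [3.0, 3.5, 4.0, 9.0]]
intervals = [hdi(draws, mass=0.75) for draws in replications]
share_covered = coverage(intervals, truth=1.5)
```

In the simulation study the same two functions would be applied per parameter to the 1000 replications of each approach.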
When comparing IBM to DART and DARTm, the differences are less pronounced. Nevertheless, across all four simulation scenarios, DART and DARTm consistently achieve smaller biases. Further inspection of the RMSE for IBM, DART and DARTm suggests no severe loss of statistical efficiency compared to BD, with a small advantage for DART and DARTm. These results are supported by the coverage rates, which meet the \(95\%\) level for most of the parameters when using DART and especially DARTm. This becomes especially clear for Scenario 2 in Table 4, which has the highest proportion of missing values. Here, IBM achieves a coverage rate of only around \(50\%\) for the parameters \(\gamma _{1,1}\) and \(\gamma _{2,1}\), whereas DART yields higher and DARTm even higher coverage rates.
Taking a look at the averaged standard deviations, these tend to be smaller for IBM, since without the latent variables \(\theta \) and \(\omega \) drawn from the full conditional distributions in each iteration, we do not consider an important source of variability affecting the uncertainty of the missing values. Further, without consideration of \(\theta \) and \(\omega \), the bias increases as shown by our simulation results.
The advantages of the DARTm approach are particularly evident in the runtimes (mean runtimes per data set in minutes) given in Tables 3, 4, 5 and 6. DARTm efficiently uses the information from the latent variables \(\theta \) and \(\omega \), which serve as sufficient statistics and therefore can replace the item response Y and the school affiliation S. The resulting runtimes show that the suggested DARTm approach saves up to one third of the computation time compared to the IBM approach.
Similar effects can be seen when inspecting the properties of the sampled trajectories \(\{X_{\text {mis}}^{(r)}\}_{r=1}^R\). The properties arising from the different approaches can be assessed via calculating for each missing value the absolute and squared distance to the true (before deletion) and estimated value. With the former providing bias and the latter the variance, we summarize the findings per variable and aggregate over the missing values per variable and over the simulated data sets. The same procedure is also used to obtain root mean square errors. Note that after averaging over missing values per variable and over data sets, root mean square errors are not exactly identical to variance plus squared bias. With regard to bias and variance we calculate bias as \(\frac{1}{{\mathcal {S}}} \sum _{{\mathcal {s}}=1}^{{\mathcal {S}}} \frac{1}{\# X_{\text {mis}}^{(j)}}\sum _{k=1}^{\# X_{\text {mis}}^{(j)}}\big (X_{\text {mis},k,{\mathcal {s}}}^{(j)}-{\tilde{X}}_{\text {mis},k,{\mathcal {s}}}^{(j)}\big )\), variance as \(\frac{1}{{\mathcal {S}}} \sum _{{\mathcal {s}}=1}^{{\mathcal {S}}} \frac{1}{\# X_{\text {mis}}^{(j)}}\sum _{k=1}^{\# X_{\text {mis}}^{(j)}}(X_{\text {mis},k,{\mathcal {s}}}^{(j)}-{\hat{X}}_{\text {mis},k,{\mathcal {s}}}^{(j)})^2\), and root mean square error as \(\frac{1}{{\mathcal {S}}} \sum _{{\mathcal {s}}=1}^{{\mathcal {S}}} \frac{1}{\# X_{\text {mis}}^{(j)}}\sum _{k=1}^{\# X_{\text {mis}}^{(j)}}\sqrt{(X_{\text {mis},k,{\mathcal {s}}}^{(j)}-{\tilde{X}}_{\text {mis},k,{\mathcal {s}}}^{(j)})^2}\). Thereby, \(\# X_{\text {mis}}^{(j)}\) denotes the number of missing values per variable, \(X_{\text {mis},k,{\mathcal {s}}}^{(j)}\) the kth missing value in variable \(j=2,3\), and \({\tilde{X}}_{\text {mis},k,{\mathcal {s}}}^{(j)}\) and \({\hat{X}}_{\text {mis},k,{\mathcal {s}}}^{(j)}\) denote true (before deletion) and estimated values within repeated estimation \({\mathcal {s}}\) respectively. The results are described in Table 7.
As expected and in line with the other simulation results presented, the suggested augmentation approaches DART and DARTm show reduced bias although slightly increased variance compared to the IBM approach. This in turn then causes the improved inference regarding the regression coefficients both in terms of bias and coverage.
To summarize, the simulation illustrates that the combination of data augmentation and sequential recursive partitioning offers a suitable solution for the treatment of missing covariates in the context of LRMs, both with regard to estimation efficiency and computational burden.
4 Empirical Illustration
In order to illustrate the usefulness of the suggested Bayesian data augmentation approach in empirical analysis, we provide exemplary applications using the scientific data use file of the German National Educational Panel Study: Starting Cohort Grade 9, doi: 10.5157/NEPS:SC4:10.0.0, see NEPS Network (2019), on mathematical competencies of ninth graders. Children of this cohort have been surveyed in an institutional context. Data collection has taken place in schools in Germany between fall 2010 and winter 2010/2011 based on a stratified sampling of schools according to school types, see Aßmann et al. (2011). Both factors, the institutional setting of schools in Germany as well as the stratified sampling approach, give reason to consider a differentiated hierarchical data structure.
We chose the mathematical competency domain as an example for latent variable modeling with person covariates. The relationship between mathematical competency and individual characteristics is thereby structured by the type of secondary schooling. Mathematical competency was assessed in the first survey wave. The corresponding test comprises four content areas: quantity, change and relationships, space and shape, and data and chance (Neumann et al., 2013), where a total of 15,629 ninth graders have taken the considered test. For an overview and further results on the mathematics test data see Duchhardt and Gerdes (2013). As most of the items have low missing rates, the estimation within the empirical illustration is based on the likelihood involving observed values of Y only, and only students with a valid response to at least three mathematics test items are considered.^{Footnote 18} Of the \(J=22\) tasks that had to be solved in the test, 20 items have a binary format and two are treated as ordinal items with four categories. In addition to the test data, we consider two clustering variables (schooltype and school) and student covariates. Merging mathematics test data and all student information results in a final data set with 14,320 observations. The available school type variable (Bayer et al., 2014) was transformed to cover four tracks of the German secondary education system: Hauptschule (HS; lower track), Realschule (RS; intermediate track), Gymnasium (GYM; academic or upper track) and a residual category (OTHER) for observations where a clear assignment to these tracks was not possible. With \(37\%\) of students, GYM is the modal track. The school identifier school assigns a unique number to each school and serves as a further clustering variable with a total of 532 schools. Table 8 provides the descriptive statistics on the sample and considered variables.
The illustration is provided in form of the following two model specifications.
The first model specification considers a small set of background variables with different scales including cross terms, whereas the second model specification has an enlarged set of categorical background variables to illustrate that the suggested DARTm approach is feasible and efficient in terms of computational cost and statistical efficiency. For the first model specification (model I) we adapt a specification discussed by Passaretta and Skopek (2021) to assess the role of schools in socioeconomic inequality of learning. Following a differential exposure approach, the relationship of mathematical competency is analyzed with regard to the student variables gender, parents’ socioeconomic status (HISEI), school exposure (schoolexp), and age at time of assessment (agetest).^{Footnote 19} In line with the literature, we expect more school exposure and higher assessment age to be positively correlated with mathematical competence, whereas the (un)balancing effect of schools on competence is captured by the cross terms of socioeconomic status with age at testing and with school exposure. A positive effect for the considered cross terms would indicate that school experience accelerates competence more for students with higher socioeconomic status. The total amount of missing data for the variables within this model specification is to be considered as moderate to strong. Whereas the number of missing values in gender is negligible, about one fifth of the values are missing for HISEI. For agetest almost no missing values are present, whereas for school exposure the defining date of school entry was surveyed in the parental interview with a missing rate of 42.9%, see Table 8. The ratio of students having complete background information is \(47.1\%\), which corresponds to 6,748 observations.
The second model specification (model II) considers an enlarged set of background variables and contains gender (binary), generation status (4 categories), grade final report card mathematics (6 categories), school year repeated (binary), computer in your home (3 categories), homepos room (binary), and highest parental education qualification (HCASMIN, 9 categories). We can see a substantial heterogeneity within the covariate HCASMIN between the school types. For example, we observe that \(29.5\%\) of the students in HS have parents in category 2 (basic vocational training above and beyond compulsory schooling) but only \(3.2\%\) of the students in GYM, or the other way round with category 8 (completed traditional, academically orientated university education) which have only \(3.1\%\) of students in HS, but \(32.2\%\) for GYM. Most of the variables have a negligible amount of missing values. However, we have over \(40\%\) of missing values for the covariate HCASMIN, as this information has been surveyed within the parental interview. Therefore the ratio of students with complete background information drops to \(57.3\%\), i.e. only 7708 complete case observations.
For each of the two models, estimates are based on 25,000 MCMC iterations, where a burn-in phase of 5000 iterations has been found sufficient to mitigate the effects of initialization within the empirical analysis, see the supplementary material for corresponding results and further information concerning the convergence diagnostics and the assessment of the Monte Carlo error for the obtained point estimates.
Corresponding results for the two model specifications with regard to regression and conditional variance parameters are given in Table 9 for model I and Table 12 for model II respectively. Tables 10 and 11 provide corresponding estimates on relative effects between school types.^{Footnote 20} These tables provide the resulting estimates in terms of medians, standard deviations, and highest posterior density intervals (HDI). The results regarding the structural relationships show clear school type specific differences in the distribution of competencies, see upper panels of Figs. 1 and 2. The highest mean scores are consistently found for GYM, followed by the other school types RS, OTHER, and HS. In the same way, the conditional variances on the person and the school level, \(\sigma ^2_{g}\) and \(\upsilon ^2_{g}\), differ across the different types of secondary schooling. However, students’ idiosyncratic error terms, i.e. interindividual differences not captured by the covariates, consistently contribute more to the variability in mathematical competency than school membership across the different educational tracks, see lower panels of Figs. 1 and 2.
Regarding covariate effects, the models indicate interactions with the grouping variable. For more details, let us first look at the effects of the additional personal covariates used in model I. The negative effect of being female on mathematical competency (gender : 1) is shown to be stable across all school types, but to varying degrees. The effects of school exposure and age at testing are completely subsumed by the school type, i.e. in ninth grade these variables have no effect beyond school type, in contrast to gender. This complements the findings from the literature discussing effects in primary schools, see Passaretta and Skopek (2021).
Next, we consider the structural parameter estimates of model II. Again, we see the negative, although slightly reduced, effect of being female in all school types. Compared to students without a migration background, a first-generation migration background has a substantial negative (99% HDI not including zero) impact on mathematics competency across all school types. The negative effects also prevail for a migration background of the second generation, while for third generation migrants the negative effects are reduced (GYM and OTHER) or become not substantially different from zero (HS and RS). For the covariate grade mathematics in the previous year, where grade 1 (very good) is the reference category, we see that a good result from the previous year has a negative effect on mathematics competence compared to very good, and the effect grows in magnitude as grades worsen. This pattern can be observed throughout all school types, where the overall effect is strongest in the school type GYM. With regard to the covariate school year repeated, we also find differences across the school types, where this variable has no impact for school types RS and OTHER, but an effect positively different from zero for school types HS and GYM. Not having your own computer, but sharing one with other family members, is found to have no impact on the individual competence level across all school types, where we point at the possibility that this relationship may have changed substantially since 2010. Also having an own room has no substantial effect given the considered set of covariates, except for school type RS. With regard to the variable HCASMIN, we find positive effects for higher HCASMIN levels for school type HS, while no effects substantially different from zero are found for all other school types. However, this variable further illustrates that the inspection of relative effects as defined in Eq. (10), with corresponding results for model specification II given in Table 11, is important to gauge differences across schools correctly. The relative effects between the different school types for the variable HCASMIN show no substantial differences between the school types. In this regard, the findings relate to the school specific distribution of HCASMIN, compare Table 8. For this model, we also calculated within group correlations, see bottom of Table 12. Although the groups show different conditional variances, estimates show no evidence for differing within group correlations.
While these effects are in line with the results from the literature, the suggested Bayesian estimation approach allows for effectively incorporating all available information, i.e. all information and model features with regard to the measurement model in terms of discrimination and difficulty parameters, intraclass correlation, and school type heterogeneity are reflected within the corresponding full conditional distributions. Given this, the results document a clear shift in means and covariate effects as well as unequal variances of the school type-specific density curves. The results of these two empirical applications extend the findings of our simulation study in Sect. 3.
5 Conclusion
To handle missing values, this paper discusses a Bayesian estimation approach making use of the device of data augmentation. The missing values in conditioning variables are hence considered along with the underlying continuous outcomes, the model parameters, and the latent traits or hierarchical structures in the MCMC sampling scheme that operationalizes the Bayesian estimation. The DA device enables the estimation of all these quantities in a statistically efficient one-step procedure. The uncertainty stemming from partially missing covariate data is directly incorporated into parameter estimation. At every iteration of the algorithm, an imputed version of the covariate data is used to sample from the set of full conditional posterior distributions. Vice versa, the iteratively updated parameter values resulting from posterior sampling are in turn considered within the full conditional distribution of missing values. Thus, compared to existing methods, the novel method carries out parameter estimation and the handling of missing values in background variables simultaneously. Taken together, several advantages result from such an approach. First, it is statistically efficient in the sense that values for the latent trait, item characteristics, and missing values of background variables are all provided at once. Second, all possible sources of uncertainty are taken into account. Third, the approach is especially well suited to deal with latent variables corresponding to competencies or arising from hierarchical structures, where the mutual dependence can be directly handled in terms of the full conditional distributions inserted into the sampler.
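The alternation between imputing covariates and sampling parameters can be illustrated with a deliberately small toy example. The sketch below is not the sampler of this paper: it assumes a single-coefficient linear model \(y_i = b x_i + e_i\) with \(e_i \sim N(0,1)\), a standard normal covariate, and a flat prior on \(b\), so that both full conditional distributions are normal and available in closed form. The function name `da_gibbs` and all settings are hypothetical, and the parametric full conditional for the missing covariate values merely stands in for the tree-based one suggested in the paper.

```python
import random

def da_gibbs(y, x, n_iter=2000, burn_in=500, seed=1):
    """Toy data-augmentation Gibbs sampler for y_i = b * x_i + e_i.

    Illustrative assumptions: e_i ~ N(0, 1), x_i ~ N(0, 1), flat prior on b.
    Entries of x that are None are treated as missing.
    """
    rng = random.Random(seed)
    missing = [i for i, xi in enumerate(x) if xi is None]
    x_cur = [0.0 if xi is None else xi for xi in x]  # crude starting values
    draws = []
    for it in range(n_iter):
        # Step 1: draw b from its full conditional given the completed covariate.
        sxx = sum(xi * xi for xi in x_cur)
        sxy = sum(xi * yi for xi, yi in zip(x_cur, y))
        b = rng.gauss(sxy / sxx, (1.0 / sxx) ** 0.5)
        # Step 2: draw each missing x_i from its full conditional given b and y_i.
        prec = 1.0 + b * b
        for i in missing:
            x_cur[i] = rng.gauss(b * y[i] / prec, (1.0 / prec) ** 0.5)
        if it >= burn_in:
            draws.append(b)
    return draws

# Simulated example: true coefficient 1.5, roughly 30% of x set to missing.
sim = random.Random(0)
x_true = [sim.gauss(0.0, 1.0) for _ in range(300)]
y = [1.5 * xi + sim.gauss(0.0, 1.0) for xi in x_true]
x_obs = [None if sim.random() < 0.3 else xi for xi in x_true]

draws = da_gibbs(y, x_obs)
post_mean = sum(draws) / len(draws)
```

Each retained draw of `b` is conditional on a different completed version of the covariate, which is how the uncertainty about the missing values propagates into the posterior of the parameter in this toy setting.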
These advantages show in terms of statistical efficiency, and the computational burden is possibly eased when latent quantities in the sense of sufficient statistics can be used to specify the full conditional distributions of missing values. An empirical example using the NEPS further demonstrates the broad applicability of the approach to a wide range of social science topics. Besides permitting the estimation of competency scores and their correlations with the context variables purified from measurement error, any number of completed data sets arising from the MCMC output may also serve as multiple imputations of the missing background information. Future research may investigate in detail the possibilities to perform nested and non-nested model comparison via Bayes factors based on the marginal data likelihood. Alternative models for the full conditional distributions of missing values could also be considered, as could automated variable selection based on the spike-and-slab prior specification, see Ročková and George (2018), to determine which variables have group-specific influence and which have homogeneous influence across the different groups.
Notes
Thereby, DA facilitates efficient sampling from the posterior distribution by augmenting it with quantities that are not necessarily of primary interest but possibly function as sufficient statistics, thus enabling and operationalizing Rao-Blackwellization. As a by-product, DA enables smoother sampling from the posterior distribution of the quantities of primary interest, as either closed-form sampling becomes available or the construction of an importance or enveloping density is considerably simplified, see Carlin and Louis (1998).
Note also that complete cases analysis, which excludes from estimation all observations with a missing value on any covariate, not only uses the sample information inefficiently in situations with high rates of missing values but may also result in biased estimation, especially when observations are missing at random (Little & Rubin, 2002, pp. 41–44). Only in missing completely at random situations, possibly related to multiple matrix designs, may estimation stay unbiased.
When performing maximum likelihood based estimation, typically implemented in terms of an expectation maximization algorithm, only point estimates are directly available, and extra calculations are required to obtain corresponding uncertainty measures.
This may include information in terms of prevailing missing patterns, where Muthén et al. (1987) consider conditioning on missing data patterns for improved estimation.
In addition, the computational cost of the approach discussed by Neal and Kypraios (2015) grows exponentially with the total number of observations, whereas the cost of our MCMC-based approach grows linearly with the number of observations.
The problem has been discussed extensively under the term incidental parameter problem in the statistics literature, see Lancaster (2000) for a survey.
Note that extensions in the form of random coefficients within groups or homogeneous coefficients across groups rendering the latent regression function as \(\theta _i=\omega _{S_i}+X_i\gamma _{iL_i} +W_i\lambda +\epsilon _i\) with \(\gamma _{iL_i}\) following a multivariate normal distribution with expectation \(\mu _{\gamma _{L_i}}\) and covariance \(\Omega _{L_i}\) and \(W_i\) denoting a set of covariates with homogeneous influence are also possible as discussed in Aßmann et al. (2011).
This also extends to missing-by-design values in item responses. Sampling of missing-by-design values \(Y_{\text {mis}}\) is implied by Eq. (1) in the paper, where \(y_{ij}^*\) then follows a normal distribution not subject to truncation, as the truncation is implied by the observed values in Y only. The completed \(Y^*\) and hence the completed Y might possibly be helpful for sampling values in \(X_{\text {mis}}\), as pointed out by von Hippel (2007). However, as illustrated and implied by the considered model framework in the paper, \(\theta \) takes the role of a sufficient statistic also for \(Y^*\), and thus consideration of sampled \(Y_{\text {mis}}\) values within the imputation of \(X_{\text {mis}}\) does not necessarily result in further gains in terms of statistical efficiency. Given this, we point out that the suggested approach should be applied to data situations where at least some elements of \(y_i\) are observed for each individual \(i=1,\ldots ,N\). Situations where several competence domains are investigated can be addressed via multivariate extensions of the suggested modeling framework.
Consideration of sufficient statistics may also serve as a guiding principle for model specification.
Note that sampling from \(f(X_{\text {mis}}\mid \theta ,Y^*,\omega ,\psi ,Y,X_{\text {obs}},S,L)\) will be based on sequential iterative sampling from the set of univariate full conditional distributions for each variable \(X_{\text {mis}}^{(p)}\), \(p=1,\ldots ,P\), see also Sect. 2.2 for details.
In combination with sampling from the empirical cumulative distribution function, i.e., sampling from the range of observed values only, this ensures that the CART-BB approach towards full conditional distributions involves only proper prior distributions, thus ensuring the existence of the integrating constant of the joint posterior distribution. Furthermore, the existence of the joint posterior distribution and the corresponding integrating constant as implied by Eq. (8) is directly ensured in case the missing values relate to variables with finite sample spaces. In case the missing values relate to variables with theoretically countably infinite or uncountably infinite sample spaces, the CART-BB algorithm constructs the empirical cumulative distribution function implied by the obtained partition based on measures of homogeneity, e.g., the variance, and incorporates the restriction to the range of observed values as a modeling assumption. Thus, the suggested approach may be most useful in situations with many categorical variables, as in our empirical applications. For applications where the restriction to the range of observed values raises concerns, the suggested CART-BB approach could be applied to the set of categorical variables only, and alternative modeling approaches for the missing values of variables with continuous infinite support may be considered as well.
The proposed Bayesian analysis and its MCMC implementation is further suited to incorporate information arising from weighting factors. In case of nonstochastic weights, e.g. design weights, the variables entering the modeling can be transformed accordingly, whereas in case of stochastic weights, e.g. nonresponse adjusted weights typically handled via replication weights, the variables can be transformed within each MCMC iteration.
In case interaction terms are also considered, \((X_{\text{ com }}^{({\setminus } p)})\) also subsumes all columns referring to cross terms not involving variable p. Cross terms involving variable p are hence not subject to modeling but are updated each time an underlying variable has been updated.
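The bookkeeping for cross terms can be sketched as follows. This is only an illustration under the assumption that each cross term is the product of two parent variables; the function `refresh_cross_terms`, the dictionary layout, and the variable names are hypothetical and not taken from the paper.

```python
def refresh_cross_terms(row, cross_terms):
    """Recompute product terms from their (possibly just imputed) parent variables."""
    updated = dict(row)
    for name, (a, b) in cross_terms.items():
        updated[name] = updated[a] * updated[b]
    return updated

# One observation with a cross term that depends on a covariate with missing values.
row = {"female": 1.0, "hisei": 4.2, "female_x_hisei": None}
cross = {"female_x_hisei": ("female", "hisei")}

row["hisei"] = 5.0                      # a new imputed value arrives from the sampler
row = refresh_cross_terms(row, cross)   # the cross term is derived, never modeled itself
```

Deriving the cross term from its parents after each update keeps the completed design matrix internally consistent.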
Sampling from the empirical distribution function via the Bayesian bootstrap corresponds to running the data generating process of a parametric imputation model, with the involved parameters being sampled from the estimated distributions in order to fully account for the uncertainty of the data generating process, i.e., the uncertainty about what the empirical cumulative distribution function would look like if the missing values were observed.
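A minimal sketch of a Bayesian bootstrap draw in the sense of Rubin (1981) is given below. It assumes that `donors` holds the observed values of the variable within the partition element (for instance, a tree leaf) into which a case with a missing value falls; the function name and the example values are hypothetical. Dirichlet(1, ..., 1) weights are generated by normalizing independent unit-rate gamma draws.

```python
import random

def bayesian_bootstrap_draw(donors, n_draws, rng):
    """Draw imputations from the donors' empirical distribution, reweighted
    by one Dirichlet(1, ..., 1) draw over the donor values."""
    gammas = [rng.gammavariate(1.0, 1.0) for _ in donors]
    total = sum(gammas)
    weights = [g / total for g in gammas]  # one draw of the probability vector
    return rng.choices(donors, weights=weights, k=n_draws)

rng = random.Random(42)
donors = [2, 2, 3, 5]  # observed values in the relevant partition element
imputations = bayesian_bootstrap_draw(donors, n_draws=3, rng=rng)
```

Because only observed donor values can be drawn, imputations stay within the range of observed values, and redrawing the weights at each call reflects the uncertainty about the underlying distribution.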
The IBM benchmark is hence in line with a typical multiple imputation strategy, although no combining rules are required as sampling is performed within the MCMC sampler. This further ensures that the comparison of the different approaches is conditional on the same level of numerical precision as implied by the number of MCMC iterations after burn-in.
For ten items we have missing rates of less than \(2\%\), for another eight items less than \(5\%\), for three items between \(5\%\) and \(10\%\), and only one item has a missing rate of \(20\%\).
Regarding socioeconomic status, there are many operationalizations implemented in the NEPS. In line with recent analyses of the PISA data (OECD, 2013a, p. 132), we took the highest occupational level of parents measured by the index ISEI-08 (Ganzeboom, 2010) and calculated a variable HISEI as the higher ISEI-08 score of either the student's mother or the student's father, or the only available score. To calibrate the scale of the regression coefficient associated with HISEI, the original values are divided by 10. HISEI ranges from 1.16 to 8.90, with higher values indicating a higher level of occupational status. This variable in particular shows strong differences between the school types, which can be seen in Table 8. Age at assessment and school exposure are defined as the difference between the date of assessment and the date of birth or the date of school entry, respectively.
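The rule for combining the parental scores can be sketched as below. The function name `highest_isei` and the use of `None` for an unavailable score are assumptions made for illustration; the subsequent rescaling of the values is left out of the sketch.

```python
def highest_isei(isei_mother, isei_father):
    """Return the higher parental ISEI-08 score, the only available one,
    or None when neither score is available."""
    scores = [s for s in (isei_mother, isei_father) if s is not None]
    return max(scores) if scores else None

both = highest_isei(45.0, 60.0)       # higher of the two scores
only_one = highest_isei(None, 37.5)   # only available score
neither = highest_isei(None, None)    # stays missing
```

Cases where neither score is available remain missing and would be handled like any other missing covariate value.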
Results on item characteristics (discrimination, difficulty, and cutoff parameters) are available within the supplementary material.
References
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47–76.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17(3), 251–269.
Albert, J. H., & Chib, S. (1997). Bayesian methods for cumulative, sequential and two-step ordinal data regression models. Bowling Green: Department of Mathematics and Statistics, Bowling Green State University.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report (NCES1999452).
Aßmann, C., & Boysen-Hogrefe, J. (2011). A Bayesian approach to model-based clustering for binary panel probit models. Computational Statistics & Data Analysis, 55, 261–279.
Aßmann, C., Gaasch, C., Pohl, S., & Carstensen, C. H. (2015). Bayesian estimation in IRT models with missing values in background variables. Psychological Test and Assessment Modeling, 57(4), 595–618.
Aßmann, C., & Preising, M. (2020). Bayesian estimation and model comparison for linear dynamic panel models with missing values. Australian & New Zealand Journal of Statistics, 62(4), 536–557. https://doi.org/10.1111/anzs.12316
Aßmann, C., Steinhauer, H. W., Kiesl, H., Koch, S., Schönberger, B., Müller-Kuller, A., Rohwer, G., Rässler, S., & Blossfeld, H.-P. (2011). Sampling designs of the national educational panel study: Challenges and solutions. Zeitschrift für Erziehungswissenschaft, Special Issue 14. In H.-P. Blossfeld, H.-G. Roßbach, & J. von Maurice (Eds.), Education as a lifelong process. The German national educational panel study (NEPS) (pp. 51–65). Wiesbaden: VS Verlag für Sozialwissenschaften.
Bayer, M., Goßmann, F., & Bela, D. (2014). NEPS technical report: Generated school type variable t723080_g1 in starting cohorts 3 and 4 (NEPS working paper no. 46). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.
Blackwell, M., Honaker, J., & King, G. (2017). A unified approach to measurement error and missing data: Details and extensions. Sociological Methods & Research, 46(3), 342–369. https://doi.org/10.1177/0049124115589052
Blossfeld, H.-P., & Roßbach, H.-G. (Eds.). (2019). Education as a lifelong process. The German National Educational Panel Study (NEPS): Edition ZfE. New York: Springer VS.
Burgette, L. F., & Reiter, J. P. (2010). Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172(9), 1070–1076.
Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. Review of Research in Education, 8, 158–233.
Carlin, B. P., & Louis, T. A. (1998). Bayes and empirical Bayes methods for data analysis. Monographs on statistics and applied probability (Vol. 69). Boca Raton: Chapman & Hall/CRC.
Carlsson, M., Dahl, G. B., Öckert, B., & Rooth, D.-O. (2015). The effect of schooling on cognitive skills. The Review of Economics and Statistics, 97(3), 533–547. https://doi.org/10.1162/REST_a_00501
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432), 1313–1321.
Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453), 270–281.
Cornelissen, T., & Dustmann, C. (2019). Early school exposure, test scores, and noncognitive outcomes. American Economic Journal: Economic Policy, 11(2), 35–63. https://doi.org/10.1257/pol.20170641
Doove, L. L., van Buuren, S., & Dusseldorp, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, 92–104.
Duchhardt, C., & Gerdes, A. (2013). NEPS technical report for mathematics: Scaling results of starting cohort 4 in ninth grade (NEPS working paper no. 22). University of Bamberg, Leibniz Institute for Educational Trajectories, National Educational Panel Study.
Edwards, M. C. (2010). A Markov chain Monte Carlo approach to confirmatory item factor analysis. Psychometrika, 75(3), 474–497. https://doi.org/10.1007/s1133601091619
Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah: Lawrence Erlbaum Associates.
Erler, N. S., Rizopoulos, D., van Rosmalen, J., Jaddoe, V. W. V., Franco, O. H., & Lesaffre, E. M. E. H. (2016). Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17), 2955–2974.
Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58(1), 145–172.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer.
Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66(2), 271–288.
Ganzeboom, H. B. G. (2010). A new international socioeconomic index [ISEI] of occupational status for the international standard classification of occupation 2008 [ISCO-08] constructed with data from the ISSP 2002–2007; with an analysis of quality of occupational measurement in ISSP. In Annual conference of international social survey programme, Lisbon, May 1 2010.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton: CRC Press.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian statistics 4 (pp. 169–193). Oxford: Oxford University Press.
Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and nonlinear terms. Journal of the Royal Statistical Society: Series A (Statistics in Society), 177(2), 553–564.
Greene, W. (2004a). The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects. The Econometrics Journal, 7(1), 98–119.
Greene, W. (2004b). Convenient estimators for the panel probit model: Further results. Empirical Economics, 29(1), 21–47. https://doi.org/10.1007/s001810030187z
Grund, S., Lüdtke, O., & Robitzsch, A. (2018). Multiple imputation of missing data at level 2: A comparison of fully conditional and joint modeling in multilevel designs. Journal of Educational and Behavioral Statistics, 43(3), 316–353. https://doi.org/10.3102/1076998617738087
Grund, S., Lüdtke, O., & Robitzsch, A. (2020). On the treatment of missing data in background questionnaires in educational large-scale assessments: An evaluation of different procedures. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/1076998620959058
Imai, K., & van Dyk, D. A. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2), 311–334.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the national assessment of educational progress (ETS RR0438). Princeton: Educational Testing Service.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.
Kaplan, D., & Su, D. (2018). On imputation for planned missing data in context questionnaires using plausible values: A comparison of three designs. Large-scale Assessments in Education, 6(1), 6. https://doi.org/10.1186/s4053601800599
Köhler, C., Pohl, S., & Carstensen, C. H. (2015). Taking the missing propensity into account when estimating competence scores: Evaluation of item response theory models for nonignorable omissions. Educational and Psychological Measurement, 75(5), 850–874.
Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425), 278–288.
Koskinen, J. H., Robins, G. L., & Pattison, P. E. (2010). Analysing exponential random graph (pstar) models with missing data using Bayesian data augmentation. Statistical Methodology, 7(3), 366–384.
Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics, 95(2), 391–413.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Hoboken: Wiley.
Liu, M., Taylor, J. M. G., & Belin, T. R. (2000). Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics, 56(4), 1153–1157.
Lord, F. M. (1952). A theory of test scores. Psychometric monograph no. 7. Richmond: Psychometric Corporation.
Lord, F. M. (1953). An application of confidence intervals and of maximum likelihood to the estimation of an examinee’s ability. Psychometrika, 18(1), 57–75.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.
McKelvey, R., & Zavoina, W. (1975). A statistical model for the analysis of ordered level dependent variables. Journal of Mathematical Sociology, 4(1), 103–120.
Mislevy, R. J. (1985). Estimation of latent group effects. Journal of the American Statistical Association, 80(392), 993–997.
Mullis, I. V. S., & Martin, M. O. (Eds.). (2013). TIMSS 2015 assessment frameworks. Chestnut Hill: TIMSS & PIRLS International Study Center.
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52(3), 431–462.
Muthén, B. O. (1979). A structural probit model with latent variables. Journal of the American Statistical Association, 74(368), 807–811.
Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557–585.
Muthén, B. O., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.
Neal, P., & Kypraios, T. (2015). Exact Bayesian inference via data augmentation. Statistics and Computing, 25(2), 333–347.
NEPS Network. (2019). German national educational panel study, scientific use file of starting cohort grade 9. Leibniz Institute for Educational Trajectories (LIfBi), Bamberg. Retrieved from https://doi.org/10.5157/NEPS:SC4:10.0.0.
Neumann, I., Duchhardt, C., Grüßing, M., Heinze, A., Knopp, E., & Ehmke, T. (2013). Modeling and assessing mathematical competence over the lifespan. Journal for Educational Research Online, 5(2), 80–109.
Nieminen, P., Lehtiniemi, H., Vähäkangas, K., Huusko, A., & Rautio, A. (2013). Standardised regression coefficient as an effect size index in summarising findings in epidemiological studies. Epidemiology Biostatistics and Public Health, 10(4), 1–16.
OECD. (2013a). PISA 2012 results: Excellence through equity: Giving every student the chance to succeed (volume II). Paris: OECD Publishing.
OECD. (2013b). Technical report of the survey of adult skills (PIAAC). Paris: OECD Publishing.
OECD. (2014). PISA 2012 technical report. Paris: OECD Publishing.
Passaretta, G., & Skopek, J. (2021). Does schooling decrease socioeconomic inequality in early achievement? A differential exposure approach. American Sociological Review, 86(6), 1017–1042. https://doi.org/10.1177/00031224211049188
Pohl, S., Gräfe, L., & Rose, N. (2014). Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educational and Psychological Measurement, 74(3), 423–452.
Richard, J. F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2), 1385–1411.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8(2), 185–205.
Robert, C. P., & Casella, G. (2004). Monte Carlo statistical methods (2nd ed.). New York: Springer.
Ročková, V., & George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521), 431–444.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9(1), 130–134.
Rutkowski, L. (2011). The impact of missing background data on subpopulation estimation. Journal of Educational Measurement, 48(3), 293–312.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric monograph no. 17. Richmond, VA: Psychometric Corporation.
Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38(5), 499–521.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–549.
Therneau, T., & Atkinson, B. (2018). rpart: Recursive partitioning and regression trees [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=rpart. (R package version 4.1-13).
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved \({\widehat{R}}\) for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667–718.
von Hippel, P. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37, 83–117.
Wilson, M., & De Boeck, P. (2004). Explanatory item response models. New York, NY: Springer.
Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56(4), 589–600.
Acknowledgements
The authors thank David Kaplan, Roman Liesenfeld, and participants of the statistics seminars of the University of Cologne and the Leibniz Institute for Educational Trajectories, as well as the participants of the workshop on quality aspects of machine learning organized by the Statistics Network Bavaria, for helpful comments and suggestions that helped to improve the manuscript, in addition to the comments received from three reviewers and the associate editor. Financial support by the Deutsche Forschungsgemeinschaft (DFG) within priority programme SPP 1646 under grants AS 368/31 and AS 368/32 is acknowledged. Further, we would like to thank the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities for the provision of the system resources with which the simulation studies were performed. This paper uses data from the German National Educational Panel Study (NEPS), see Blossfeld and Roßbach (2019). The NEPS is carried out by the Leibniz Institute for Educational Trajectories (LIfBi, Germany) in cooperation with a nationwide network.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aßmann, C., Gaasch, JC. & Stingl, D. A Bayesian Approach Towards Missing Covariate Data in Multilevel Latent Regression Models. Psychometrika 88, 1495–1528 (2023). https://doi.org/10.1007/s11336022098880
DOI: https://doi.org/10.1007/s11336022098880