A Bayesian Approach Towards Missing Covariate Data in Multilevel Latent Regression Models

The measurement of latent traits and the investigation of relations between these and a potentially large set of explaining variables is typical in psychology, economics, and the social sciences. Corresponding analysis often relies on surveyed data from large-scale studies involving hierarchical structures and missing values in the set of considered covariates. This paper proposes a Bayesian estimation approach based on the device of data augmentation that addresses the handling of missing values in multilevel latent regression models. Population heterogeneity is modeled via multiple groups enriched with random intercepts. Bayesian estimation is implemented in terms of a Markov chain Monte Carlo sampling approach. To handle missing values, the sampling scheme is augmented to incorporate sampling from the full conditional distributions of missing values. We suggest modeling the full conditional distributions of missing values in terms of non-parametric classification and regression trees. This offers the possibility to consider information from latent quantities functioning as sufficient statistics. A simulation study reveals that this Bayesian approach provides valid inference and outperforms complete-case analysis and multiple imputation in terms of statistical efficiency and computation time. An empirical illustration using data on mathematical competencies demonstrates the usefulness of the suggested approach. Supplementary Information: The online version contains supplementary material available at 10.1007/s11336-022-09888-0.


Introduction
Models for measurement and structural analysis of latent traits have been developed, among others, by Muthén (1979), Zwinderman (1991), and Adams et al. (1997). These latent regression models (LRM) typically use a regression equation to assess the relationship between the latent trait and additional covariates and link measurements to the latent trait via a model, possibly arising from the context of item response theory (IRT; e.g., Embretson & Reise, 2000). As demonstrated by Rijmen et al. (2003), and described extensively in Wilson and De Boeck (2004), these models can be conceptualized within the wider context of nonlinear mixed models. Since the derived likelihood functions involve multiple integrals arising from the involved latent variables, a Bayesian framework using Markov chain Monte Carlo (MCMC) techniques is eminently suited to provide inference, see, e.g., Edwards (2010). The seminal article of Albert (1992) adopts data augmentation (DA), see Tanner and Wong (1987), within a Bayesian estimation approach for measurement models with dichotomous items. Further work adopted Albert's DA procedure for extended model structures incorporating multilevel and clustered data structures (Aßmann & Boysen-Hogrefe, 2011; Fox, 2005; Fox & Glas, 2001; Johnson & Jenkins, 2005). Prominent applications of these models arise in the context of large-scale assessment studies like the Programme for International Student Assessment (PISA; e.g., OECD, 2014), the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis & Martin, 2013), the Programme for the International Assessment of Adult Competencies (PIAAC; e.g., OECD, 2013b), or the National Assessment of Educational Progress (NAEP; e.g., Allen et al., 1999).
However, surveyed information is often seriously afflicted by item nonresponse. Si and Reiter (2013), for example, report less than five percent complete cases on a set of 80 background variables in a data file of the Trends in International Mathematics and Science Study (TIMSS; e.g., Mullis & Martin, 2013). Especially in multilevel contexts, such a large fraction of missing values poses a challenge to efficient parameter estimation. An appropriate strategy for handling missing values and corresponding model specification is required when analyzing the data. While several studies deal with the impact of missing or omitted competence items (Köhler et al., 2015; Pohl et al., 2014), there has been less work on missing values in background variables. By default, the educational assessment studies cited above treat missing values in context questionnaires via dummy variable adjustments, see, e.g., OECD (2014). Aside from the obvious information loss, dummy-variable adjustments for missing values can cause biased estimation, see Jones (1996). The involved categorization of information may have negative side effects on the assumed functional relationship, see also Grund et al. (2020) for a more detailed discussion. These results are in line with a recent study by Rutkowski (2011), who found non-negligible bias and misleading interpretations at the population level when partially missing covariates are dummy coded. With the latent factor being of substantial interest, the Bayesian approach implemented in terms of an MCMC algorithm using the DA device has the advantage of providing direct access to the latent factors in terms of the posterior distribution.
Furthermore, in the presence of missing values in background variables, DA in the Bayesian context offers a conceptually straightforward way to deal with missing values. The vector of unknown quantities can be augmented with the missing values in covariates. Correspondingly, the MCMC sampling scheme incorporates the set of full conditional distributions of the missing values. This approach has the advantage that the modeling of the full conditional distributions can incorporate information available in form of a latent variable serving in the considered model context as a sufficient statistic. These advantages result in increased statistical efficiency and reduced computational costs, as illustrated in this paper. Such a handling of information is in principle also possible in the context of maximum likelihood estimation in terms of a chained equation approach via iteratively sampling from an assumed or approximated set of full conditional distributions, see Grund et al. (2020) for a discussion in the absence of hierarchical structures. In addition, in data contexts with a large number of covariates relative to the number of observations, the Bayesian approach incorporates shrinkage in terms of the involved prior distributions and facilitates updating of information with regard to the modeled relationships. Finally, Bayesian estimators of parameters or functions thereof, like context effects and uncertainty measures, are directly accessible without the use of combining rules.
The DA principle has been successfully applied in different contexts ranging from multivariate panel models to social network analysis and educational large-scale assessments by Liu et al. (2000), Koskinen et al. (2010), Blackwell et al. (2017), and Kaplan and Su (2018). Full conditional distributions of missing values are typically operationalized in terms of a parametric modeling approach as discussed by Grund et al. (2020) and Erler et al. (2016). Goldstein et al. (2014), Erler et al. (2016), and Grund et al. (2018) provide a discussion in the context of linear regression models for metrically scaled hierarchical data.
In this article, we extend the DA approach towards missing values in covariate data in extended hierarchical structures in LRMs for dependent variables with binary and ordinal scale. We also illustrate that DA allows for direct access to a valid model specification for the missing values incorporating information available in form of sufficient statistics as suggested by the Hammersley-Clifford theorem, see Robert and Casella (2004). Further, specifying the full conditional distributions of missing values in terms of sufficient statistics arising in the hierarchical latent regression context has the potential to reduce the computational burden. The role of sufficient statistics has also been stressed by Neal and Kypraios (2015), discussing situations where the augmented variables and sufficient statistics are discrete and the models of interest belong to well known probability distributions. Our approach extends this, as we consider hierarchical structures and identifying restrictions arising from the factor-like model structures resulting in complex posterior distributions. Consideration of full conditional distributions for handling of missing values enriched with information from latent model structures also extends the sequential imputations approach discussed by Kong et al. (1994). Whereas the sequential imputations approach builds on predictive distributions for missing values, thereby separating the model for the missing values in the covariate variables from the considered latent model structures, our approach is based on smoothed, i.e., full conditional distributions incorporating information from the latent model structures via the DA principle. In combination with modeling the full conditional distributions of missing values via nonparametric sequential regression trees as suggested by Burgette and Reiter (2010) and Doove et al. (2014), the DA approach suggested in this paper offers high flexibility in empirical applications to cope with nonlinear relationships, e.g., interaction terms, within a potentially large set of covariates having different scales. The proposed modeling approach hence allows for tackling research questions typically addressed in sociology, psychology, and economics in the field of educational inequality and the role of institutions, see among others Carlsson et al. (2015), Passaretta and Skopek (2021), and Cornelissen and Dustmann (2019). It simultaneously addresses the uncertainty associated with the estimation of a latent trait variable and the imputation of missing values in manifest covariate variables. The reciprocal dependence of outcomes and predictors is reflected to the full extent by the Bayesian DA estimation algorithm. The benefits of the suggested fully Bayesian approach arise in terms of methodological stringency and gains in statistical efficiency. Illustration of the suggested approach is provided by means of a simulation study and an empirical application using the first wave of the starting cohort of ninth graders surveyed in the German National Educational Panel Study-Educational Trajectories in Germany (NEPS), see Blossfeld and Roßbach (2019). To highlight the benefits of considering sufficient statistics within the suggested DA approach towards missing values in covariates, we provide a comparison with a classical imputation setup, where the full conditional distributions of missing values are defined on the basis of directly observable quantities only, see, e.g., von Hippel (2007). As shown in the simulations, the consideration of sufficient statistics reduces computation time by up to a third and ensures the feasibility of specifying full conditional distributions in multilevel contexts.
The paper proceeds as follows. Section 2 outlines the specification of the considered model setup and provides the corresponding Bayesian sampling algorithm that deals with structures reflecting heterogeneity and missing values in covariates via DA. Performance of the estimation routine is demonstrated through a simulation study in Sect. 3, whilst Sect. 4 provides the empirical illustration using data from the NEPS. Section 5 concludes.

Model Setup
Consider J measurement items observed on N individuals summarized in an N × J data matrix Y = (y_1, ..., y_N)′ with row vectors y_i = (y_i1, ..., y_ij, ..., y_iJ) for each i = 1, ..., N and j = 1, ..., J. In the case of binary measurements, y_ij denotes a random variable taking the value y_ij = 1 if, in an educational assessment context, respondent i is able to solve item j, and the value y_ij = 0 otherwise. To analyze this kind of test items, Lord (1952, 1953) proposes an IRT model generally known as the two-parameter normal ogive (2PNO), stating the probability (Pr) of a correctly solved item as Pr(y_ij = 1|θ_i, α_j, β_j) = Φ(α_j θ_i − β_j), where θ_i denotes a scalar person parameter, α_j is an item discrimination parameter, and β_j denotes the item difficulty or item fixed effect. We adopt the standard normal cumulative distribution function Φ(·) as the link function, as it offers computational advantages for MCMC-based Bayesian estimation. Also, it allows for an alternative representation in terms of a threshold mechanism, which was first formalized in the context of individual level data by McKelvey and Zavoina (1975) and can be found for multivariate binary variables in Maddala (1983, p. 138). Extending towards the analysis of ordered polytomous item responses, see Samejima (1969), the observed item responses can be seen as an ordered polytomous version of an underlying continuous variable y*_ij = α_j θ_i − β_j + ε_ij, where the independent and identically distributed error term ε_ij follows a standard normal distribution. Then one can link the observed categorical and the underlying continuous variable using a threshold mechanism, namely

y_ij = q  if and only if  κ_jq < y*_ij ≤ κ_j,q+1,  q = 0, ..., Q_j − 1,  (1)

where κ_j = (κ_j0, κ_j1, ..., κ_jQ_j) is the (Q_j + 1)-dimensional vector of item category cutoff parameters and I(·) denotes the indicator function. The resulting probability that respondent i achieves grade q on item j, given his latent trait and item parameters, is hence implied by Pr(y_ij = q|θ_i, α_j, β_j, κ_j) = Φ(κ_j,q+1 − (α_j θ_i − β_j)) − Φ(κ_jq − (α_j θ_i − β_j)), thus nesting the binary case as well. This probability can be represented in terms of the latent variables as ∫ f(y_ij, y*_ij|θ_i, α_j, β_j, κ_j) dy*_ij, where

f(y_ij, y*_ij|θ_i, α_j, β_j, κ_j) = φ(y*_ij − (α_j θ_i − β_j)) I(κ_j,y_ij < y*_ij ≤ κ_j,y_ij+1),  (2)

with φ(·) denoting the standard normal density. The necessary identifying restrictions for all parameters will be discussed jointly below.
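As a numerical illustration of the response probabilities above, the following minimal sketch evaluates the binary and ordinal 2PNO probabilities in plain Python; the helper names are hypothetical and not part of the paper's R implementation.

```python
import math

def norm_cdf(x):
    """Standard normal CDF Phi, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_correct(theta, alpha, beta):
    """Binary 2PNO: Pr(y_ij = 1 | theta_i, alpha_j, beta_j) = Phi(alpha*theta - beta)."""
    return norm_cdf(alpha * theta - beta)

def p_category(theta, alpha, beta, kappa, q):
    """Ordinal 2PNO: probability that y_ij = q, i.e. the normal mass of the
    underlying y*_ij between the adjacent cutoffs kappa[q] and kappa[q+1]."""
    mu = alpha * theta - beta
    return norm_cdf(kappa[q + 1] - mu) - norm_cdf(kappa[q] - mu)
```

With the outer cutoffs fixed at −∞ and ∞, the category probabilities telescope to one, which mirrors how the ordinal model nests the binary case.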
IRT models are designed to directly link items and persons to a common scale. To enlarge their scope, the focus of analysis was broadened towards structural analysis by Muthén (1979), addressing the issue that persons may not only differ in terms of their competence, but also in terms of covariates which are correlated with their competence. The standard framework assuming θ_i, i = 1, ..., N, to be identically and independently normally distributed can be extended to incorporate a conditional mean operationalized as E[θ_i|X_i] = X_i γ, where X = (X^(1), ..., X^(P)) in terms of row vectors X_i, i = 1, ..., N, and column vectors X^(p), p = 1, ..., P, denotes a matrix of N × P individual specific covariates and γ the corresponding vector of regression coefficients. When hierarchical clustering in observations is present, this needs to be incorporated in the model as well, as consideration of hierarchical data structures is an important prerequisite for valid inference on the relationship between explaining and latent variables. The multiple forms of population heterogeneity in educational research are reviewed in Muthén (1989) and Burstein (1980), whereas Greene (2004b) provides a discussion for economic applications of the panel probit model incorporating latent heterogeneity structures. Population heterogeneity may be considered in terms of a nested multilevel structure, thereby assuming a composite population consisting of a finite number of G mutually exclusive groups indexed by g = 1, ..., G, where L = (L_1, ..., L_N) with L_i ∈ {1, ..., G}, i = 1, ..., N, denotes the individual group membership. Within these groups, separate LRMs may hold. Sample stratification may be based on an explicitly observed cluster variable, e.g., gender or school type. This type of modeling dates back to the early works of Muthén and Christoffersson (1981) and Mislevy (1985), but without consideration of covariates except the cluster variable. Often, the specification is theory driven with the aim to discover substantial differences of covariate effects and variances for predefined groups. These differences are captured through the estimation of group-specific latent trait distributions. Additionally, hierarchical structures may be related to random effects. As in multilevel models, there is a composite population consisting of clusters c = 1, ..., C, where the individual cluster membership is also known a priori and is captured by S = (S_1, ..., S_N) with S_i ∈ {1, ..., C} for all i = 1, ..., N. While fixed group-specific regression parameters are suitable for a relatively small number of groups, consideration of hierarchical structures with regard to schools or classes often implies a prohibitively large number of parameters. Difficulties regarding the computation and the statistical properties of the maximum likelihood estimator in this context were studied by Greene (2004a). Thus, the introduction of identically and independently normally distributed cluster-specific random effects ω = (ω_1, ..., ω_C) offers an appropriate alternative or addition to the fixed effects approach. The most basic multilevel specification is the random intercept latent regression item response model. Depending on the specific hierarchical structure under consideration, combinations of both approaches are possible and allow for multiple hierarchical levels.
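The group and random-intercept structure just described can be sketched as a small data-generating simulation; the function name, the list-based interface, and the dictionary indexing by group are illustrative assumptions, not part of the paper's implementation.

```python
import random

def simulate_theta(X, gamma, group, cluster, sigma2, upsilon2, seed=0):
    """Simulate latent traits theta_i = X_i'gamma_g + omega_{S_i} + eps_i,
    with group-specific variances sigma2[g] (individual level) and
    upsilon2[g] (cluster-level random intercept)."""
    rng = random.Random(seed)
    omega = {}
    for c in sorted(set(cluster)):
        # nested structure: all members of a cluster share one group
        g = group[cluster.index(c)]
        omega[c] = rng.gauss(0.0, upsilon2[g] ** 0.5)
    theta = []
    for i, x in enumerate(X):
        g = group[i]
        mean = sum(xp * gp for xp, gp in zip(x, gamma[g]))
        theta.append(mean + omega[cluster[i]] + rng.gauss(0.0, sigma2[g] ** 0.5))
    return theta, omega
```

Individuals in the same cluster receive the same intercept draw, which is what induces the within-cluster covariance discussed below.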
To illustrate, consider a model with nested hierarchical structure with S_i = S_i′ implying L_i = L_i′, i.e., individuals within the same cluster also refer to the same group, but not vice versa, given as

θ_i = X_i γ_{L_i} + ω_{S_i} + ε_i.  (3)

Thereby ε_i, i = 1, ..., N, is independently normally distributed with mean zero and heteroscedastic variance σ²_{L_i}. Likewise, ω_{S_i} is independently normally distributed with mean zero and heteroscedastic variance υ²_{L_i}. The assumed heteroscedasticity is hence a further way to implement features of (nested) hierarchical structures. We summarize all model parameters as ψ = ({α_j, β_j, κ_j}_{j=1}^J, {γ_g, σ²_g, υ²_g}_{g=1}^G). The implied conditional covariance structure with regard to two elements of θ = (θ_1, ..., θ_N), denoted with i and i′, can be described as

Cov(θ_i, θ_i′|X, S, L, ψ) = I(i = i′) σ²_{L_i} + I(S_i = S_i′) υ²_{L_i}.  (4)

This covariance structure allows for group specific conditional variances but possibly similar or different correlations within clusters. The corresponding likelihood function in case of completely observed data is given as

f(Y|X, S, L, ψ) = ∫∫∫ f(Y, Y*|θ, ψ) f(θ|X, ω, S, L, ψ) f(ω|ψ, S, L) dY* dθ dω,  (5)

where f(Y, Y*|θ, ψ) = ∏_{i=1}^N f(y_i, y*_i|θ_i, ψ) with f(y_i, y*_i|θ_i, ψ) = ∏_{j=1}^J f(y_ij, y*_ij|ψ, θ_i) and f(y_ij, y*_ij|ψ, θ_i) as in Eq. (2), and f(ω|ψ, S, L) following a multivariate normal distribution with mean zero and diagonal covariance matrix with cluster-specific entries υ²_{L_i}. In case of completely observed data Y and X, the Bayesian model setup is then completed by an appropriate prior distribution π(ψ). However, the estimation of IRT models is in general plagued by an identification problem, where the classical identification strategies impose restrictions on the parameter space. For the given model, the identification problem can be described as follows. First, the overall means of y*_ij are implied by the mean values of θ_i, β_j, and κ_j, as well as the signs of α_j. The mean values of θ_i in turn arise from the regression coefficients γ_g in combination with the observed covariates X_i. Second, the scaling of y*_ij is implied by the scaling of θ_i and α_j, where the scaling of θ_i arises from the variance parameters υ²_g and σ²_g. The given interdependencies lead to the fact that these parameters are not jointly identifiable. However, for given signs of α_j and mean values for two of the three quantities θ_i, β_j, and κ_j, mean values for the remaining quantity become identifiable. The same holds for the scaling issue, where for given signs of α_j and a given scaling for one of the two quantities θ_i and α_j, the remaining scaling becomes identifiable. The decision which mean and scaling parameters to fix is in principle arbitrary. However, for the considered hierarchical structures it is more convenient, also in terms of the implied sampling scheme, to restrict the item parameters α_j, β_j, and κ_j. The typical choice as discussed in the literature by Fox (2010) and Albert and Chib (1997) imposes the following ordering and value constraints on the parameter space. With regard to the threshold parameters, the restrictions can be formulated in terms of the condition κ_j0 = −∞, κ_j1 = 0, κ_jQ_j = ∞, and κ_jq ≤ κ_j,q+1, while for the item difficulties and discrimination parameters, we have I(∑_{j=1}^J β_j = 0) and I(∏_{j=1}^J α_j I(α_j > 0) = 1). Given these identifying restrictions, appropriate (conjugate) prior distributions can be formulated as given in Table 1. In the
light of the Hammersley-Clifford theorem, see Robert and Casella (2004) for theorem and proof, the implied joint posterior distribution is accessible in terms of the corresponding set of full conditional distributions. With Z = {Y, X, S, L}, we have

f(θ, Y*, ω, ψ|Z) ∝ [f(θ|Y*•, ω•, ψ•, Z)/f(θ•|Y*•, ω•, ψ•, Z)] [f(Y*|θ, ω•, ψ•, Z)/f(Y*•|θ, ω•, ψ•, Z)] [f(ω|θ, Y*, ψ•, Z)/f(ω•|θ, Y*, ψ•, Z)] [f(ψ|θ, Y*, ω, Z)/f(ψ•|θ, Y*, ω, Z)],  (7)

where the chosen sequence ordering θ, Y*, ω, ψ is arbitrary and • denotes any admissible point of the indicated variable. The set of full conditional distributions resulting from Eq. (7) and employed within an MCMC algorithm taking the form of an iterative sequential Metropolis-Hastings (MH) within Gibbs sampling scheme to provide inference based on a sample from the posterior distribution is given in detail in Sect. 2.2. Next, we will discuss the handling of missing values. Given the factorization of the likelihood described in Eqs. (2) and (5), handling of missing values in item responses Y = (Y_obs, Y_mis) is directly possible by dropping the corresponding elements Y_mis from the likelihood. That means, per item j, only the observed y_ij are used to estimate the parameters. An alternative approach of handling missing values in Y may be to consider missing values as wrong answers. Our approach is also fully compatible with von Hippel (2007), who suggests considering draws of Y_mis from the posterior predictive distribution for the specification of the full conditional distributions of the missing values of the covariate variables X, but not using them for analysis.
However, when facing partially observed X, one has to think of an appropriate missing data technique to facilitate estimation. In the following, we will denote X = (X_obs, X_mis). In the context of the considered model structure, the latent variables and hierarchical structures take the role of sufficient statistics and may play a crucial role for implementing appropriate models defining the uncertainty associated with missing values X_mis. We suggest handling missing values X_mis by means of DA, as this allows for advantageous use of the latent and hierarchical model structures within the modeling of missing values by means of Rao-Blackwellization and, due to a lower dimensional representation of the relevant information, also reduces the computational burden. (Note to Table 1: the hyperparameters for the inverse gamma distribution are chosen to provide finite variance and the smallest possible prior sample size.) The advantages relate to gains in statistical efficiency in estimation of ψ captured by the bias, root mean square error, and coverage. Hence, the augmented posterior distribution incorporating an appropriate prior distribution π(X_mis|X_obs, ψ) is of interest and subject to inference. The characterization in terms of the full conditional distributions given in Eq. (7) is then extended as follows. With Z = {Y, X_obs, X_mis, S, L}, the joint posterior distribution

f(θ, Y*, ω, ψ, X_mis|Y, X_obs, S, L)  (8)

is accessible via the set of full conditional distributions from Eq. (7) extended by the full conditional distribution of the missing values f(X_mis|θ, Y*, ω, ψ, Y, X_obs, S, L), thereby augmenting the MCMC sampling scheme. The suggested sequential sampling is also well suited to deal with regression specifications involving cross products of variables considered in X. Given an initialization of X_mis and thus the involved cross products, missing values for one variable can be drawn. If this variable is involved in cross products, these cross products are updated. This procedure is then repeated for each variable in X. In order to establish highly flexible modeling of the distributions of X_mis and to allow for handling of a possibly large number of background variables, we adopt sequential recursive classification and regression trees in combination with sampling via a Bayesian bootstrap (CART-BB) for the construction of full conditional distributions, see Burgette and Reiter (2010) and Rubin (1981). Modeling the full conditional distributions of missing values in this way is compatible with assuming prior distributions for the missing values proportional to the empirical densities observed for each variable, see also Table 1. This choice is motivated by the flexibility of CART-BB to handle variables of any scale and the potential to cope with nonlinear relationships among the variables, see also Doove et al. (2014). The application of CART-BB to model the full conditional distributions of missing values is particularly useful because the analyst does not need to specify the full conditional distributions of missing values (imputation models) explicitly. The complete set of full conditional distributions and further details referring to the augmented parameter vector are provided in the following. We label the suggested Bayesian estimation approach using data augmentation and sequential recursive partitioning classification and regression trees combined with a Bayesian bootstrap for handling missing values in covariate variables as the DART approach.
This applies to the set of univariate full conditional distributions for each variable X^(p)_mis, p = 1, ..., P; see also Sect. 2.2 for details. In combination with sampling from the empirical cumulative distribution function, i.e., sampling from the range of observed values only, this ensures that the CART-BB approach towards full conditional distributions involves only proper prior distributions, thus ensuring the existence of the integrating constant of the joint posterior distribution. Furthermore, the existence of the joint posterior distribution and the corresponding integrating constant as implied by Eq. (8) is directly ensured in case the missing values relate to variables with finite sample spaces. In case the missing values relate to variables with theoretically possible countably infinite or uncountably infinite sample spaces, the CART-BB algorithm constructs the empirical cumulative distribution function implied by the obtained partition based on measures of homogeneity, e.g., the variance, and incorporates the restriction to the range of observed values as a modeling assumption. Thus, the suggested approach may be most useful in situations with many categorical variables, as in our empirical applications. For applications where the restriction to the range of observed values raises concerns, the suggested CART-BB approach could be applied to the set of categorical variables only, and alternative modeling approaches for the missing values of variables with continuous infinite support may be considered as well.
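To make the CART-BB mechanism concrete, the following sketch isolates its second ingredient: a Bayesian-bootstrap draw from the donor values of a tree node (Rubin, 1981). The leaf assignment is taken as given here; a real CART step would derive it from the conditioning variables. All names, including the `node_of` bookkeeping, are hypothetical illustration and not the paper's R implementation.

```python
import random

def bayesian_bootstrap_draw(donors, rng):
    """Draw one value from a node's donor pool under Bayesian-bootstrap weights.
    The weights follow a flat Dirichlet, generated as gaps between sorted
    uniforms, so the uncertainty of the node's empirical distribution is
    propagated into the draw."""
    n = len(donors)
    cuts = sorted(rng.random() for _ in range(n - 1))
    gaps = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]  # Dirichlet(1,...,1)
    u, acc = rng.random(), 0.0
    for value, w in zip(donors, gaps):
        acc += w
        if u <= acc:
            return value
    return donors[-1]

def impute_missing(x, node_of, donors_by_node, rng=None):
    """Replace None entries of x by draws from the donor pool of their leaf;
    observed entries are left unchanged."""
    rng = rng or random.Random(1)
    return [bayesian_bootstrap_draw(donors_by_node[node_of[i]], rng)
            if v is None else v
            for i, v in enumerate(x)]
```

Because only observed donor values can be drawn, the sketch also reflects the restriction to the range of observed values discussed above.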

Bayesian Inference
Bayesian inference is based on a posterior sample generated via the following MCMC algorithm, iteratively sampling from the set of full conditional distributions. The algorithm is based on a blocking scheme starting with y*_11, ..., y*_NJ, where the initialization of all quantities except y*_11, ..., y*_NJ is described in Table 1 and initial values for θ and ω are drawn from standard normal distributions. An implementation of this MCMC sampling algorithm in R is available within the supplementary material. The set of full conditional distributions can be described as follows.
The full conditional distributions of the random variables y*_ij, i = 1, ..., N and j = 1, ..., J, are independent and sampled from a truncated normal distribution with mean μ_{y*_ij} = α_j θ_i − β_j and variance σ² = 1, where the truncation sphere is (κ_{j,y_ij}, κ_{j,y_ij+1}).
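One simple way to generate such a truncated normal draw is inverse-CDF sampling; the sketch below uses a bisection quantile function, which is adequate for illustration though not what a production sampler would use. All helper names are hypothetical.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection (illustrative precision only)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_y_star(theta, alpha, beta, kappa, y, rng):
    """One draw of y*_ij ~ N(alpha*theta - beta, 1) truncated to the
    interval (kappa[y], kappa[y+1]) via inverse-CDF sampling."""
    mu = alpha * theta - beta
    a, b = norm_cdf(kappa[y] - mu), norm_cdf(kappa[y + 1] - mu)
    u = a + rng.random() * (b - a)
    return mu + norm_ppf(u)
```

With the binary cutoffs (−∞, 0, ∞), a response y = 1 always yields a positive latent draw and y = 0 a negative one, matching the threshold mechanism in Eq. (1).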
Note that for the assumed model structure, in absence of the identifying restrictions, all full conditional distributions of the item parameters ξ_j = (α_j, β_j)′, j = 1, ..., J, are mutually independent. In the presence of the identifying restrictions, however, an arbitrarily chosen single element, say ξ_j′, is completely determined by the other J − 1 item parameters, i.e., ξ_j′ = ((∏_{j≠j′} α_j)^{−1}, −∑_{j≠j′} β_j). In this sense, the joint distribution of all item parameters is defective, as the distribution of the element implied by the other elements is not specified. Further, sampling from the full conditional distribution of item parameters ξ_j in absence of identifying restrictions can be characterized in terms of the linear regression equation y*_j = H ξ_j + ε_j, where H is a N × 2 auxiliary matrix consisting of θ and −ι_N, where ι_N denotes a N × 1 vector of ones. Since ε_j is normally distributed, the full conditional of ξ_j is proportional to a bivariate truncated normal distribution with covariance matrix (H′H + Σ_{ξ_j}^{−1})^{−1} and mean vector (H′H + Σ_{ξ_j}^{−1})^{−1}(H′y*_j + Σ_{ξ_j}^{−1} ν_{ξ_j}). The positivity constraints on the item discrimination parameters causing the truncation are handled via accept-reject sampling. In each iteration, sampling is performed until a draw is accepted. The values of the hyperparameters ν_{ξ_j} and Σ_{ξ_j} are chosen as given in Table 1. Note that for any possible subset containing J − 1 item parameters, the remaining item parameter, say ξ_j′, is implied by the assumed identifying restrictions. Although this element is determined by all other elements, the data driven information contained within the above regression is not incorporated in the characterization of these item parameters. Further, J equivalent possibilities exist to characterize the redundant element. Hence, incorporating these J alternative possibilities to draw from the full conditional distribution into a single draw via averaging seems preferable in order to use all available data based information and thus improve mixing and convergence. Given draws for α = (α_1, ..., α_J) and β = (β_1, ..., β_J), averaging the J characterizations is possible in terms of the geometric mean for the discrimination parameters and the arithmetic mean for the difficulty parameters. We refer to this approach to handling identifying restrictions as a kind of marginal data augmentation, see among others Imai and van Dyk (2005). f(κ_j|•): Draws from the mutually independent full conditional distributions of the item category cutoff parameters κ_j are retained via a MH step following Albert and Chib (1997). To perform this sampling step it is convenient to consider a reparameterization of the elements κ_j2, ..., κ_j,Q_j−1, where κ_jq = ∑_{w=2}^q exp{τ_jw} for all j = 1, ..., J and q = 2, ..., Q_j − 1. The threshold parameters can then be stated as κ_j = h(τ_j). Given the prior for κ_j, this transformation induces a multivariate normal prior for τ_j = (τ_j2, ..., τ_j,Q_j−1). Hence, the posterior and thus full conditional distribution can be reformulated in terms of τ_j.
To generate a draw from the full conditional of τ_j, we choose as a proposal a multivariate t-distribution with mean vector m_j, covariance matrix V_j, and Q_j − 2 degrees of freedom, where m_j denotes the mode of the log full conditional of τ_j and V_j is the inverse of the negative Hessian of the log full conditional evaluated at m_j. The probability of accepting candidate values τ_j^cand is given by the usual MH ratio of the full conditional and proposal densities evaluated at the candidate and current values. The acceptance rates within the simulation study and the empirical application were found to be at least 0.95. A draw for κ_j is then implied by h(τ_j). The chosen hyperparameter values for σ²_{κ_jq} and ν_{κ_jq} are given in Table 1.
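Returning to the identifying restrictions on the item parameters, the constraint surface (discriminations with product one, difficulties summing to zero) amounts to a simple rescaling of any unrestricted draw. The sketch below shows only this rescaling, not the paper's averaging over the J alternative characterizations; the function name is hypothetical.

```python
def enforce_identification(alpha, beta):
    """Map an unrestricted draw of item parameters onto the identified
    surface: prod(alpha) = 1 via the geometric mean, sum(beta) = 0 via
    the arithmetic mean."""
    J = len(alpha)
    prod = 1.0
    for a in alpha:
        prod *= a
    gm = prod ** (1.0 / J)        # geometric mean of the discriminations
    b_bar = sum(beta) / J         # arithmetic mean of the difficulties
    return [a / gm for a in alpha], [b - b_bar for b in beta]
```

The geometric mean appears for α and the arithmetic mean for β precisely because the restrictions are multiplicative and additive, respectively.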
Values of X_mis are sampled sequentially for each column vector X^(p), p = 1, ..., P, in two steps. Let X^(\p)_com = (X^(\p)_obs, X^(\p)_mis) denote the completed matrix of conditioning variables in X except column p, with the operator \p meaning without p. First, a decision tree is built for X^(p) conditional on the corresponding values of all remaining variables X^(\p)_com as well as conditional on θ, ω, S, L, and Y. A further possibility is to consider only subsets of the conditioning variables θ, ω, S, L, and Y. To incorporate a priori uncertainty on the hyperparameters of the sequential partitioning regression trees, we build trees with a randomly varying minimum number of elements within nodes. Every missing observation can then be assigned to a node and thus to a grouping of observations implied by the binary partition in terms of the conditioning variables. The values within each node provide access to an empirical distribution function serving as an approximation to the full conditional distribution of a missing value and thus as the key element for running the data generating mechanism for missing values. With prior distributions of missing values proportional to observed data densities, draws from the empirical distribution function within a node correspond to draws from the full conditional distributions of missing values. Second, to account for the estimation uncertainty of the full conditional distribution, the Bayesian bootstrap is applied to the assigned group of observations, see Rubin (1981). Thereby, the uncertainty regarding the estimated empirical distribution implied by the proposed set of observed values is fully considered.
The considered approach further offers the flexibility to include any function of observed or augmented data within the set of conditioning variables. Next to the matrices Y* and Y, statistics thereof might also be considered. This may include draws of missing values in Y or Y* from the posterior predictive distributions as suggested by von Hippel (2007). In case of restricting the analysis to observed values of Y only, as in the empirical illustration, additional missing categories might be considered. Note that this is the default of the R function rpart within the implementation of the CART-BB algorithm, see Therneau and Atkinson (2018). Further, group-specific or individual-specific specifications of the full conditional distributions could be adopted by considering group-specific variables within the set of conditioning variables only, i.e., creating a binary partition only for those values fulfilling the conditions L_i = g or S_i = c. The sampled values of X_mis yield an updated completed matrix of covariates used in all other steps of the MCMC algorithm.
This allows for stating the full conditional distribution of the individual abilities as a normal distribution. To sample from the full conditional distributions of the regression coefficients, let D_C denote an N × C design matrix of zeros and ones. Each row of D_C has a single entry of one indicating the respondent's cluster membership S_i. The operator [g] selects the elements of θ, respectively the rows of X and D_C, for which the condition L_i = g holds. Further, let Σ_g denote an N_g × N_g diagonal matrix with elements σ²_g. Draws from the full conditional distribution of γ_g are obtained from a multivariate normal distribution with the conjugate posterior covariance matrix and mean vector. Note that the values of the hyperparameters of the prior for γ_g are chosen as given in Table 1.
In each group g there are C_g clusters and N_g respondents. It holds that Σ_{g=1}^G C_g = C and Σ_{g=1}^G N_g = N. Choosing a conjugate prior, the full conditional distribution of σ²_g is inverse gamma with updated shape and scale parameters, where the values of the hyperparameters a⁰_{σ²_g} and b⁰_{σ²_g} are chosen as given in Table 1.
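The conjugate inverse-gamma update can be sketched as follows. This is a minimal illustration under the standard shape/scale update (posterior shape a⁰ + N_g/2, posterior scale b⁰ + half the sum of squared residuals); the residual definition and hyperparameter values below are illustrative assumptions.

```python
import random

def draw_sigma2_g(residuals, a0, b0, rng=None):
    """Draw sigma^2_g from its inverse-gamma full conditional under a
    conjugate IG(a0, b0) prior. `residuals` collects the group-g
    person-level residuals of the latent regression."""
    rng = rng or random
    shape = a0 + len(residuals) / 2.0
    scale = b0 + 0.5 * sum(r * r for r in residuals)
    # random.gammavariate(shape, s) uses the shape/scale form; the
    # reciprocal of a Gamma(shape, 1/scale) draw is InvGamma(shape, scale).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)

# Illustrative call with 200 unit residuals and IG(2, 1) prior:
sigma2 = draw_sigma2_g([1.0] * 200, a0=2.0, b0=1.0, rng=random.Random(5))
```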
Let the operator [c] select the elements of θ, respectively the rows of X, belonging to cluster c, and let N_c be the total number of persons in cluster c. The cluster-specific random intercepts ω_c are conditionally independent and follow a full conditional distribution given as a normal distribution. The chosen values for the hyperparameters are given in Table 1.
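The normal full conditional for a cluster intercept follows the standard normal-normal conjugate update; the paper's exact moment expressions are not reproduced here, so the zero prior mean and argument layout below are our assumptions for illustration.

```python
import random

def draw_omega_c(theta_c, xgamma_c, sigma2_g, upsilon2_g, rng=None):
    """Draw the random intercept omega_c from its normal full conditional.
    theta_c holds the current ability draws of the N_c persons in
    cluster c, xgamma_c the fitted values x_i' gamma_g; the prior is
    N(0, upsilon2_g) and the likelihood variance is sigma2_g."""
    rng = rng or random
    n_c = len(theta_c)
    resid_sum = sum(t - f for t, f in zip(theta_c, xgamma_c))
    prec = n_c / sigma2_g + 1.0 / upsilon2_g   # posterior precision
    mean = (resid_sum / sigma2_g) / prec       # posterior mean
    return rng.gauss(mean, (1.0 / prec) ** 0.5)

# Illustrative draw for a cluster of 100 persons:
omega = draw_omega_c([1.0] * 100, [0.0] * 100, 1.0, 1.0, random.Random(2))
```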
Given a conjugate prior and making use of the operator [g], υ²_{ω,g} is distributed inverse gamma with updated shape and scale parameters. Note that the values of the hyperparameters a⁰_{υ²_g} and b⁰_{υ²_g} are chosen as given in Table 1.
Given this MCMC algorithm, parameter estimates and functions of interest thereof can be readily obtained from the MCMC output, denoted as {ψ^(r), θ^(r), ω^(r)}_{r=1}^R, with R denoting the number of iterations after burn-in. Under an absolute loss function, the estimates are given by the medians of the posterior sample. Their calculation does not involve the application of any combining rules, in contrast to other approaches for handling missing values. If relevant, the MCMC output with regard to the augmented quantities {Y*^(r), X^(r) = (X_obs, X^(r)_mis)}_{r=1}^R may be considered as well. To illustrate, given the hierarchical model structure, the within-group correlation may also be of interest, i.e.,

Cor(θ_i, θ_{i'} | L_i = L_{i'} = g, S_i = S_{i'}) = υ²_{ω,g} / (υ²_{ω,g} + σ²_g) for i ≠ i', with the corresponding estimator given as the posterior median of this ratio over the MCMC draws. Next, the effects of changes in X on the individual competence level conditional on school type g (CE) might be of interest. Additionally, the effects relative to another school type g' (RE) or the conditional effects in standardized form (CSE), see, e.g., Nieminen et al. (2013), can be considered, i.e.,
where sd denotes the vector of standard deviations of the column vectors in X_[g]. Context effects in the form of ceteris paribus effects can also be considered, e.g.,
Estimates of the conditional, relative, and conditional standardized effects are readily available as the medians of the corresponding functions of the draws {γ^(r), X^(r)}_{r=1}^R, whereas the context effects CP are likewise obtained as the median of the corresponding function of X^(r), ω_c^(r), and y*^(r).
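All of these estimators reduce to medians of per-draw functionals of the MCMC output. A minimal sketch (the argument layouts and function names are our assumptions, and the within-group correlation uses the random-intercept ratio stated above):

```python
from statistics import median

def conditional_effect(gamma_draws, p):
    """CE: posterior median (absolute loss) of the MCMC draws of the
    coefficient of covariate p for one school type."""
    return median(draw[p] for draw in gamma_draws)

def relative_effect(gamma_draws_g, gamma_draws_h, p):
    """RE between two school types: median of the per-draw
    coefficient differences."""
    return median(g[p] - h[p] for g, h in zip(gamma_draws_g, gamma_draws_h))

def within_group_correlation(upsilon2_draws, sigma2_draws):
    """Within-group correlation for group g: median over draws of
    upsilon^2_g / (upsilon^2_g + sigma^2_g)."""
    return median(u / (u + s) for u, s in zip(upsilon2_draws, sigma2_draws))
```

Because each functional is evaluated draw by draw before taking the median, posterior standard deviations and highest posterior density intervals are available from the same per-draw sequences without combining rules.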
Note that measures of uncertainty, e.g., posterior standard deviations or highest posterior density intervals, are likewise directly accessible without the use of combining rules. Finally, note that computation of the marginal data likelihood, which is involved in Bayes factors allowing for non-nested model comparison, is possible along the lines suggested by Chib (1995), Chib and Jeliazkov (2001), and, in the context of linear dynamic panel models, Aßmann and Preising (2020).

Simulation Study
We assess the proposed strategy via a simulation study. To illustrate the possible gains arising from the handling of missing values by means of DA, we consider as benchmarks estimation without missing values, i.e., before any values have been discarded from the data sets (BD), estimation of complete cases only (CC), and a third benchmark mimicking the handling of missing values without latent structures, i.e., handling of missing values in an imputation sense before estimating the model of interest (IBM). For the IBM benchmark, the full conditional distributions of missing values are also constructed via CART-BB, but using information from observable variables only. For this, we consider f(X^(p)_mis | X^(\p)_com, Y, S, L), p = 1, …, P.
The IBM strategy conditions on all observables (Y, X_obs, S, L) but not on latent model structures like θ or ω. These three benchmarks are contrasted with the suggested Bayesian estimation approach, labeled DART. Within the DART approach, we add the latent variables θ and ω to the considered set of observable conditioning variables to assess the full conditional distribution of X_mis.
Next, we also consider a modified version of the DART approach, labeled DART-m, in which we discard Y and S from the set of conditioning variables entering the CART-BB algorithm. This illustrates that the latent variables θ and ω serve as a kind of sufficient statistic for Y and S. When specifying the full conditional distributions of missing values X_mis, these sufficient statistics allow for incorporating the relevant information while providing a more parsimonious representation, leading to a noticeable reduction in computation time.
The simulation study is based on the following data generating process (DGP), where the comparison is based on averaged estimation over S = 1000 replications referring to the same DGP. The considered DGP satisfies the following conditions. The response matrix Y is simulated assuming the model outlined in Eqs. (1), (2) and (3) with a sample setup of N = 4000 students allocated equally to C = 20 schools which belong to either one of G = 2 school types. Thus, there are 200 students per school and 10 schools per school type, corresponding to a nested hierarchical structure. The respondents face a test of altogether J = 20 items, of which the first 18 are binary and the last two are ordinal with Q_19 = Q_20 = 4 categories. The J discrimination and difficulty parameters are fixed across replications and were obtained once via drawing from uniform distributions on the interval (0.7, 1.3) for the discrimination and (−0.7, 0.7) for the difficulty parameters, respectively. To fulfill the identifying restrictions, the item discrimination and difficulty parameters are transformed in terms of the geometric and arithmetic mean, respectively, see also Sect. 2.2 for details. Finally, the item category cutoff parameters for the two ordinal items are set to κ_19 = (0, 0.5, 1)' and κ_20 = (0, 0.7, 1.4)'.
We consider three covariates, with two covariates X^(p), p = 2, 3, capturing individual differences in the latent trait θ_i. Adding a constant, the regressor matrix can be stated as X = (ι_N, X^(2), X^(3)). Since participants in large-scale studies are often heterogeneous, we also map this circumstance in our simulation study. The chosen DGP leans towards the data situation in empirical surveys such as the NEPS, as we consider heterogeneity between groups of individuals. Therefore X^(2) is sampled from a Bernoulli distribution with Pr(X^(2)_{i,g=1} = 1) = 0.3 for group 1 (g = 1) and Pr(X^(2)_{i,g=2} = 1) = 0.6 for group 2 (g = 2). X^(3) is sampled from a normal distribution with school-specific means and a variance set to one. The overall means in group 1 are chosen to be smaller compared to group 2. The corresponding parameters of the population model are set to γ_1 = (−0.5, 0.4, 0.2)', γ_2 = (1, 0.2, −0.2)', σ²_1 = 0.64, σ²_2 = 0.36, υ²_1 = 0.81 and υ²_2 = 0.49. The simulation study consists of four missing scenarios. For scenarios 1 and 2 the missing rates for X^(2) and X^(3) depend exclusively on the latent trait variable θ. As suggested by a reviewer, dependence of the missing probability on the latent variable θ suggests characterizing the mechanism as approximately at random, since the latent variable θ becomes estimable in the considered model framework via observable quantities. For scenario 3, missing probabilities are determined by weighted sum scores depending on the observed variables X^(2), X^(3), and the latent variable θ. Scenario 4 is similar to scenario 3, but missingness in X^(3) depends on X^(3) itself, thus characterizing the mechanism as not at random. For further details on the four described missing scenarios, see Table 2.
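To make the DGP concrete, the following single-school sketch simulates covariates, latent traits, and binary probit item responses. It is a simplification for illustration (one school, binary items only, illustrative item parameters, probit link as in the Albert-type DA setup), not the study's simulation code.

```python
import math
import random

def simulate_school(n, gamma, sigma2, upsilon2, p_x2, mean_x3, alpha, beta, rng):
    """Simulate one school: X2 ~ Bernoulli(p_x2), X3 ~ N(mean_x3, 1),
    latent trait theta = gamma0 + gamma1*X2 + gamma2*X3 + omega + eps
    with eps ~ N(0, sigma2) and a shared school intercept
    omega ~ N(0, upsilon2), and binary probit item responses with
    P(y_ij = 1) = Phi(alpha_j * theta_i - beta_j)."""
    omega = rng.gauss(0.0, upsilon2 ** 0.5)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal cdf
    out = []
    for _ in range(n):
        x2 = 1.0 if rng.random() < p_x2 else 0.0
        x3 = rng.gauss(mean_x3, 1.0)
        theta = (gamma[0] + gamma[1] * x2 + gamma[2] * x3
                 + omega + rng.gauss(0.0, sigma2 ** 0.5))
        y = [1 if rng.random() < phi(a * theta - b) else 0
             for a, b in zip(alpha, beta)]
        out.append((x2, x3, theta, y))
    return out

# Group-1 parameters from the DGP; 18 illustrative binary items
school = simulate_school(200, (-0.5, 0.4, 0.2), 0.64, 0.81, 0.3, 0.0,
                         [1.0] * 18, [0.0] * 18, random.Random(42))
```

Repeating this over 20 schools in two school-type groups, adding the two ordinal items, and then deleting covariate values according to the scenario-specific missing mechanisms reproduces the structure of one replication.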
Each of the repeated estimations is based on MCMC chains of length 25,000. After discarding the first 5000 iterations as burn-in, inference is based on the remaining 20,000 simulated draws from the joint posterior distribution. Convergence is monitored via the Geweke statistic, the Gelman-Rubin statistic, and the effective sample size, see Geweke (1992) and the supplementary material for further information.

As the missing mechanisms for scenarios 1-3 depend on latent variables, these scenarios can be characterized as approximately missing at random, and scenario 4, where the missing probability depends on the variable itself, as missing not at random. All simulation runs have been performed with 25,000 Gibbs iterations, with the first 5000 iterations as burn-in. Results for the four missing scenarios are presented in Tables 3, 4, 5 and 6. They provide the true parameter values used in the DGP, mean posterior medians, and averaged standard deviations over the 1000 replications obtained for the BD, CC, IBM, DART, and DART-m estimates with regard to the regression coefficients and conditional variance parameters. Besides the averaged estimates, simulation results are also evaluated in terms of the root mean square error (RMSE) and the coverage, i.e., the proportion of 95% highest posterior density regions (HDRs) that contain the true DGP parameter values. For completeness, results on item characteristics (item discrimination, item difficulty, and item category cutoff parameters) are available in the supplementary material. For the BD estimates we find overall unbiased results for all parameters.
The results indicate a correct implementation of the algorithm and further serve as a benchmark to assess the relative performance of the different methods in the case of missing values. As expected, the CC results show a substantial bias, which becomes larger as the proportion of missing values increases. The results also show that the biases tend to be larger when the probability of missing values in X^(2) and X^(3) depends only on θ, see Tables 3 and 4, and not additionally on the covariates themselves, see Tables 5 and 6. Not unexpectedly, coverage rates for CC are the lowest, see, e.g., the parameters γ_{0,1} and σ²_1 in Table 3. When comparing IBM to DART and DART-m, the differences are less pronounced. Nevertheless, it appears consistently across all four simulation scenarios that DART and DART-m achieve smaller biases. Further inspection of the RMSE for IBM, DART, and DART-m suggests no severe loss of statistical efficiency compared to BD, but with a small advantage for DART and DART-m. These results are supported by coverage rates meeting the nominal 95% level for most parameters using DART and especially DART-m; this becomes particularly clear in scenario 2 (Table 4), which shows the highest proportion of missing values. Here, IBM achieves a coverage rate of only around 50% for the parameters γ_{1,1} and γ_{2,1}, whereas DART yields higher coverage rates and DART-m higher still. The averaged standard deviations tend to be smaller for IBM, since without the latent variables θ and ω drawn from the full conditional distributions in each iteration, an important source of variability affecting the uncertainty of the missing values is not considered. Further, without consideration of θ and ω, the bias increases, as shown by our simulation results.
The advantages of the DART-m approach are particularly evident in the runtimes (mean runtimes per data set in minutes) given in Tables 3, 4, 5 and 6. DART-m efficiently uses the information from the latent variables θ and ω, which serve as sufficient statistics and can therefore replace the item responses Y and the school affiliation S. The resulting runtimes show a clear computational advantage for DART-m. To summarize, the simulation illustrates that the combination of data augmentation and sequential recursive partitioning offers a suitable solution for the treatment of missing covariates in the context of LRMs, both with regard to estimation efficiency and computational burden.

Empirical Illustration
In order to illustrate the usefulness of the suggested Bayesian data augmentation approach in empirical analysis, we provide exemplary applications using the scientific data use file of the German National Educational Panel Study: Starting Cohort Grade 9, doi: 10.5157/NEPS:SC4:10.0.0, see NEPS Network (2019), on mathematical competencies of ninth graders. Children of this cohort have been surveyed in an institutional context. Data collection has taken place in schools in Germany between fall 2010 and winter 2010/2011 based on a stratified sampling of schools according to school types, see Aßmann et al. (2011). Both factors, the institutional setting of schools in Germany as well as the stratified sampling approach, give reason to consider a differentiated hierarchical data structure.
We chose the mathematical competency domain as an example for latent variable modeling with person covariates. The relationship between mathematical competency and individual characteristics is thereby structured by the type of secondary schooling. Mathematical competency was assessed in the first survey wave. The corresponding test comprises four content areas: quantity; change and relationships; space and shape; and data and chance (Neumann et al., 2013). A total of 15,629 ninth graders have taken the considered test. For an overview and further results on the mathematics test data see Duchhardt and Gerdes (2013). As most of the items have low missing rates, the estimation within the empirical illustration is based on the likelihood involving observed values of Y only, and only students with a valid response to at least three mathematics test items are considered. Of the J = 22 tasks that had to be solved in the test, 20 items have a binary format and two are treated as ordinal items with four categories. In addition to the test data, we consider two clustering variables (schooltype and school) and student covariates. Merging mathematics test data and all student information together results in a final data set with 14,320 observations. The available school type variable (Bayer et al., 2014) was transformed to cover four tracks of the German secondary education system: Hauptschule (HS; lower track), Realschule (RS; intermediate track), Gymnasium (GYM; academic or upper track) and, for observations where a clear assignment to these tracks was not possible, a residual category (OTHER). With 37% of students, GYM is the modal track. The school identifier school assigns a unique number to each school and serves as a further clustering variable with a total of 532 schools. Table 8 provides the descriptive statistics on the sample and considered variables. The illustration is provided in form of the following two model specifications.
The first model specification considers a small set of background variables with different scales including cross terms, whereas the second model specification has an enlarged set of categorical background variables to illustrate that the suggested DART-m approach is feasible and efficient in terms of computational cost and statistical efficiency. For the first model specification (model I) we adapt a specification discussed by Passaretta and Skopek (2021) to assess the role of schools in socioeconomic inequality of learning. Following a differential exposure approach, the relationship of mathematical competency is analyzed with regard to the student variables gender, parents' socio-economic status (HISEI), school exposure (schoolexp), and age at time of assessment (agetest). In line with the literature, we expect more school exposure and higher assessment age to be positively correlated with mathematical competence, whereas the (un)balancing effect of schools on competence is captured in terms of the cross terms between socio-economic status and age of testing as well as school exposure. A positive effect for the considered cross terms would indicate that school experience accelerates competence more for students with higher socio-economic status. The total amount of missing data for the variables within this model specification is to be considered moderate to strong. Whereas the number of missing values in gender is negligible, about one fifth of the values are missing for HISEI. For agetest almost no missing values are present, whereas for school exposure the defining date of school entry was surveyed in the parental interview with a missing rate of 42.9%, see Table 8. The ratio of students having complete background information is 47.1%, which corresponds to 6,748 observations. The second model specification (model II) considers an enlarged set of background variables and contains gender (binary), generation status (4 categories), grade final report card mathematics (6 categories), and further categorical background variables including school year repeated, own computer, own room, and HCASMIN.
For parents' occupational status, we used the index ISEI-08 (Ganzeboom, 2010) and calculated a variable HISEI as the higher ISEI-08 score of either the student's mother or the student's father, or the only available score. To calibrate the scale of the regression coefficient associated with HISEI, the original values are divided by 100. HISEI ranges from 1.16 to 8.90, with higher values indicating a higher level of occupational status. This variable in particular shows strong differences between the school types, as can be seen in Table 8. Age at assessment and school exposure are defined as the difference between the date of assessment and the date of birth or the date of school entry, respectively. For each of the two models, estimates are based on 25,000 MCMC iterations, where a burn-in phase of 5000 iterations has been found sufficient to mitigate the effects of initialization within the empirical analysis, see the supplementary material for corresponding results and further information concerning the convergence diagnostics and the assessment of the Monte Carlo error for the obtained point estimates.
Corresponding results for the two model specifications with regard to regression and conditional variance parameters are given in Table 9 for model I and Table 12 for model II.
Tables 10 and 11 provide corresponding estimates of the relative effects between school types. These tables provide the resulting estimates in terms of medians, standard deviations, and highest posterior density intervals (HDIs). The results regarding the structural relationships show clear school-type-specific differences in the distribution of competencies, see the upper panels of Figs. 1 and 2. The highest mean scores are consistently found for GYM, followed by the other school types RS, OTHER, and HS. In the same way, the conditional variances on the person and the school level, σ²_g and υ²_g, differ across the different types of secondary schooling. However, students' idiosyncratic error terms, i.e., inter-individual differences not captured by the covariates, consistently contribute more to the variability in mathematical competency than school belonging across the different educational tracks, see the lower panels of Figs. 1 and 2.
Regarding covariate effects, the models indicate interactions with the grouping variable. For more detail, consider first the effects of the personal covariates used in model I. The negative effect of being female on mathematical competency (gender: 1) is stable across all school types, but to varying degrees. The effects of school exposure and age at testing are completely subsumed by the school type, i.e., in ninth grade these variables have no effect beyond school type, in contrast to gender. This complements the findings from the literature discussing effects in primary schools, see Passaretta and Skopek (2021). Next, we consider the structural parameter estimates of model II. Again, we see the negative, although slightly reduced, effect of being female in all school types. Compared to students without a migration background, a first-generation migration background has a substantial negative (99% HDI not including zero) impact on mathematics competency across all school types. The negative effects also prevail for a migration background of the second generation, while for third-generation migrants the negative effects are reduced (GYM and OTHER) or not substantially different from zero (HS and RS). For the covariate grade mathematics in the previous year, where grade 1 (very good) is the reference category, a good result from the previous year has a negative effect on mathematics competence compared to very good, and the effect accelerates with worsening grades. This pattern can be observed throughout all school types, with the overall effect strongest in school type GYM. With regard to the covariate school year repeated, we also find differences across the school types: this variable has no impact for school types RS and OTHER, but an effect positively different from zero for school types HS and GYM. Not having one's own computer but sharing one with the family is found to have no impact on the individual competence level across all school types, where we point to the possibility that this relationship may have changed substantially since 2010. Having an own room likewise has no substantial effect given the considered set of covariates, except for school type RS. With regard to the variable HCASMIN, we find positive effects for higher HCASMIN levels for school type HS, while no effects substantially different from zero are found for the other school types. However, this variable further illustrates that the inspection of relative effects as defined in Eq. (10), with corresponding results for model specification II given in Table 11, is important to gauge differences across schools correctly. The relative effects between the different school types for the variable HCASMIN show no substantial differences between the school types. In this regard, the findings relate to the school-specific distribution of HCASMIN, compare Table 8. For this model, we also calculated within-group correlations, see the bottom of Table 12. Although the groups show different conditional variances, the estimates show no evidence of differing within-group correlations.
While these effects are in line with the results from the literature, the suggested Bayesian estimation approach allows for effectively incorporating all available information, i.e., all information and model features with regard to the measurement model in terms of discrimination and difficulty parameters, intra-class correlation, and school type heterogeneity are reflected within the corresponding full conditional distributions. Given this, the results document a clear shift in means and covariate effects as well as unequal variances of the school-type-specific density curves. The results of these two empirical applications extend the findings of our simulation study from Sect. 3.

Conclusion
To handle missing values, this paper discusses a Bayesian estimation approach making use of the device of data augmentation. The missing values in conditioning variables are hence considered along with the underlying continuous outcomes, the model parameters, and the latent traits or hierarchical structures in the MCMC sampling scheme involved in operationalizing the Bayesian estimation. The DA device enables the estimation of all these quantities in a statistically efficient one-step procedure. The uncertainty stemming from partially missing covariate data is directly incorporated into parameter estimation. At every iteration of the algorithm, an imputed version of the covariate data is used to sample from the set of full conditional posterior distributions. Vice versa, the iteratively updated parameter values resulting from posterior sampling can in turn be considered within the full conditional distributions of missing values. Thus, compared to existing methods, the novel method carries out parameter estimation while simultaneously handling missing values in background variables. Taken together, there are several advantages resulting from such an approach. First, it is statistically efficient in the sense that values for the latent trait, item characteristics, and missing values of background variables are all provided at once; second, all possible sources of uncertainty are taken into account; and third, the approach is especially well suited to deal with latent variables corresponding to competencies or arising from hierarchical structures, where the mutual dependence can be directly handled in terms of the full conditional distributions inserted into the sampler.
These advantages show up in terms of statistical efficiency, and the computational burden is possibly eased when latent quantities in the sense of sufficient statistics can be used to specify the full conditional distributions of missing values. An empirical example using the NEPS further demonstrates the broad applicability of the approach to a wide range of social science topics. Besides permitting the estimation of competency scores and their correlations with context variables purified from measurement error, any number of completed data sets arising from the MCMC output may also serve as multiple imputations of the missing background information. Future research may investigate in detail the possibilities to perform nested and non-nested model comparison via Bayes factors based on the marginal data likelihood. Alternative models for the full conditional distributions of missing values, or automated variable selection based on the spike-and-slab prior specification, see Ročková and George (2018), to determine which variables have group-specific influence and which variables have homogeneous influence across the different groups, could also be considered.

G = 2; C = 20; N = 4000; J = 20; n_iter = 20,000 + 5000. RMSE = root mean square error; BD = before deletion; CC = complete cases; IBM = multiple imputation before modeling based on observed data; DART = data augmentation using sequential recursive partitioning based on all data and latent parameters; DART-m = data augmentation using sequential recursive partitioning based on the sufficient statistics θ and ω. Runtimes = mean runtimes per data set in minutes (Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities).

Convergence assessment further follows Vehtari et al. (2021); see the supplementary material for further information. The convergence diagnostics indicate overall convergence.




PSYCHOMETRIKA
Figure 1. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level σ²_g and school level υ²_g and expected a posteriori estimates of the scalar person parameter θ_i, i = 1, …, N, referring to mathematical competence in model I. School types: Hauptschule (HS; lower track), Realschule (RS; intermediate track), Gymnasium (GYM; academic or upper track), and, for observations where a clear assignment was not possible, a residual category (OTHER).

Figure 2. NEPS grade 9, Gaussian kernel density estimates for the set of conditional variances on person level σ²_g and school level υ²_g and expected a posteriori estimates of the scalar person parameter θ_i referring to mathematical competence in model II.

Table 1 .
Prior specifications and MCMC starting values.

Table 9 .
NEPS grade 9, mathematical competencies-parameter estimates of model I.

Table 10 .
NEPS grade 9, mathematical competencies-relative effects for structural parameter estimates of model I.

Table 11 .
NEPS grade 9, mathematical competencies-relative effects for structural parameter estimates of model II. We can see substantial heterogeneity in the covariate HCASMIN between the school types. For example, 29.5% of the students in HS have parents in category 2 (basic vocational training above and beyond compulsory schooling) but only 3.2% of the students in GYM; conversely, category 8 (completed traditional, academically oriented university education) covers only 3.1% of students in HS, but 32.2% in GYM. Most of the variables have a negligible amount of missing values. However, over 40% of values are missing for the covariate HCASMIN, as this information has been surveyed within the parental interview. Therefore the ratio of students with complete background information drops to 57.3%, i.e., only 7708 complete-case observations.

Table 12 .
NEPS grade 9, mathematical competencies-structural parameter estimates of model II.