A novel Bayesian approach for latent variable modeling from mixed data with missing values
Abstract
We consider the problem of learning parameters of latent variable models from mixed (continuous and ordinal) data with missing values. We propose a novel Bayesian Gaussian copula factor (BGCF) approach that is proven to be consistent when the data are missing completely at random (MCAR) and that is empirically quite robust when the data are missing at random, a less restrictive assumption than MCAR. In simulations, BGCF substantially outperforms two state-of-the-art alternative approaches. An illustration on the ‘Holzinger & Swineford 1939’ dataset indicates that BGCF is favorable over the so-called robust maximum likelihood.
Keywords
Latent variables · Gaussian copula factor model · Parameter learning · Mixed data · Missing values

1 Introduction
In psychology, social sciences, and many other fields, researchers are usually interested in “latent” variables that cannot be measured directly, e.g., depression, anxiety, or intelligence. To get a grip on these latent concepts, one commonly used strategy is to construct a measurement model for such a latent variable, in the sense that domain experts design multiple “items” or “questions” that are considered to be indicators of the latent variable. For exploring evidence of construct validity in theorybased instrument construction, confirmatory factor analysis (CFA) has been widely studied (Jöreskog 1969; Castro et al. 2015; Li 2016). In CFA, researchers start with several hypothesized latent variable models that are then fitted to the data individually, after which the one that fits the data best is picked to explain the observed phenomenon. In this process, the fundamental task is to learn the parameters of a hypothesized model from observed data, which is the focus of this paper. For convenience, we simply refer to these hypothesized latent variable models as CFA models from now on.
The most common method for parameter estimation in CFA models is maximum likelihood (ML), because of its attractive statistical properties (consistency, asymptotic normality, and efficiency). The ML method, however, relies on the assumption that the observed variables follow a multivariate normal distribution (Jöreskog 1969). When the normality assumption is not empirically tenable, ML may not only reduce the accuracy of parameter estimates, but may also yield misleading conclusions drawn from empirical data (Li 2016). To this end, a robust version of ML was introduced for CFA models in which the normality assumption is slightly or moderately violated (Kaplan 2008), but it still requires the observations to be continuous. In the real world, the indicator data in questionnaires are usually measured on an ordinal scale (resulting in a set of ordered categorical, or simply ordinal, variables) (Poon and Wang 2012), for which neither normality nor continuity is plausible (Lubke and Muthén 2004). In this case, Item Response Theory (IRT) models (Embretson and Reise 2013) are widely used, in which a mathematical item response function links an item to its corresponding latent trait. However, the likelihood of the observed ordinal random vector has no closed form and is considerably complex due to the presence of a multidimensional integral, so that learning the model from just the ordinal observations is typically intractable, especially when the number of latent variables and the number of categories of the observed variables are large. Another class of methods designed for ordinal observations is diagonally weighted least squares (DWLS), which has been suggested to be superior to the ML method and is usually considered preferable over other methods (Barendse et al. 2015; Li 2016).
Various implementations of DWLS are available in popular software packages, e.g., LISREL (Jöreskog 2005), Mplus (Muthén 2010), lavaan (Rosseel 2012), and OpenMx (Boker et al. 2011).
However, there are two major issues that the existing approaches do not consider. One is the mixture of continuous and ordinal data. As mentioned above, ordinal variables are omnipresent in questionnaires, whereas sensor data are usually continuous. Therefore, a more realistic case in real applications is mixed continuous and ordinal data. A second important issue concerns missing values. In practice, all branches of experimental science are plagued by missing values (Little and Rubin 1987), e.g., failure of sensors, or unwillingness to answer certain questions in a survey. A straightforward idea in this case is to combine missing-value techniques with existing parameter estimation approaches, e.g., performing listwise deletion or pairwise deletion first on the original data and then applying DWLS to learn parameters of a CFA model. However, such deletion methods are only consistent when the data are missing completely at random (MCAR), which is a rather strong assumption (Rubin 1976), and they cannot transfer the sampling variability incurred by missing values to follow-up studies. The two modern missing data techniques, maximum likelihood and multiple imputation, are valid under a less restrictive assumption, missing at random (MAR) (Schafer and Graham 2002), but they require the data to be multivariate normal.
Therefore, there is a strong demand for an approach that is not only valid under MAR but also works for mixed continuous and ordinal data. For this purpose, we propose a novel Bayesian Gaussian copula factor (BGCF) approach, in which a Gibbs sampler is used to iteratively draw pseudo Gaussian data in a latent space restricted by the observed data (unrestricted where values are missing) and to draw posterior samples of the parameters given the pseudo data. We prove that this approach is consistent under MCAR and empirically show that it works quite well under MAR.
The rest of this paper is organized as follows. Section 2 reviews background knowledge and related work. Section 3 gives the definition of a Gaussian copula factor model and presents our novel inference procedure for this model. Section 4 compares our BGCF approach with two alternative approaches on simulated data, and Sect. 5 gives an illustration on the ‘Holzinger & Swineford 1939’ dataset. Section 6 concludes this paper and provides some discussion.
2 Background
This section reviews basic missingness mechanisms and related work on parameter estimation in CFA models.
2.1 Missingness mechanism
Following Rubin (1976), let \(\varvec{Y} = (y_{ij}) \in \mathbb {R}^{n \times p}\) be a data matrix with the rows representing independent samples, and \( \varvec{R} = (r_{ij}) \in \{0,1\}^{n \times p}\) be a matrix of indicators, where \(r_{ij} = 1\) if \(y_{ij}\) was observed and \(r_{ij} = 0\) otherwise. \(\varvec{Y}\) consists of two parts, \(\varvec{Y}_\mathrm{obs}\) and \(\varvec{Y}_\mathrm{miss}\), representing observed and missing elements in \(\varvec{Y}\), respectively. When the missingness does not depend on the data, i.e., \(P(\varvec{R} \mid \varvec{Y}, \theta ) = P(\varvec{R} \mid \theta )\) with \(\theta \) denoting unknown parameters, the data are said to be missing completely at random (MCAR), which is a special case of a more realistic assumption called missing at random (MAR). MAR allows the missingness to depend on observed values, i.e., \(P(\varvec{R} \mid \varvec{Y}, \theta ) = P(\varvec{R} \mid \varvec{Y}_\mathrm{obs},\theta )\). For example, all people in a group are required to take a blood pressure test at time point 1, while only those whose values at time point 1 lie in the abnormal range need to take the test at time point 2. This results in some missing values at time point 2 that are MAR.
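The blood-pressure example above can be simulated directly: the time-2 measurement is missing exactly when the observed time-1 value is normal, so missingness depends only on observed data (MAR, not MCAR). The following is a minimal sketch; the threshold, means, and variances are illustrative values, not from the paper.

```python
import numpy as np

def simulate_mar_blood_pressure(n, threshold=140.0, seed=0):
    """Toy MAR mechanism: everyone is measured at time 1; only subjects
    whose time-1 value is abnormal (>= threshold) are measured again at
    time 2.  Missingness at time 2 thus depends only on *observed*
    time-1 values, which is MAR but not MCAR."""
    rng = np.random.default_rng(seed)
    y1 = rng.normal(125.0, 15.0, size=n)       # time-1 readings, all observed
    y2 = y1 + rng.normal(0.0, 5.0, size=n)     # underlying time-2 readings
    r2 = y1 >= threshold                       # indicator: observed at time 2
    y2_obs = np.where(r2, y2, np.nan)          # NaN marks a missing value
    return y1, y2_obs, r2
```

Here deleting the incomplete rows would bias any estimate of the time-2 mean upward, which is why the deletion methods discussed in Sect. 1 fail under MAR.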
2.2 Parameter estimation in CFA models
3 Method
In this section, we introduce the Gaussian copula factor model and propose a Bayesian inference procedure for this model. Then, we theoretically analyze the identifiability and prove the consistency of our procedure.
3.1 Gaussian copula factor model
Definition 1
The model is also defined in Murray et al. (2013), but those authors restrict the factors to be independent of each other, while we allow for their interactions. Our model is a combination of a Gaussian factor model (from \(\varvec{\eta }\) to \(\varvec{Z}\)) and a Gaussian copula model (from \(\varvec{Z}\) to \(\varvec{Y}\)). The factor model allows us to grasp latent concepts that are measured by multiple indicators. The copula model provides a good way to conduct multivariate data analysis for two reasons. First, it provides a theoretical framework in which multivariate associations can be modeled separately from the univariate distributions of the observed variables (Nelsen 2007). In particular, when we use a Gaussian copula, the multivariate associations are uniquely determined by the covariance matrix because of the elliptically symmetric joint density, which makes the dependency analysis very simple. Second, the use of copulas is advocated for modeling multivariate distributions involving diverse types of variables, say binary, ordinal, and continuous (Dobra and Lenkoski 2011). A variable \(Y_j\) that takes a finite number of ordinal values \(\{1,\, 2,\, \ldots ,\, c\}\) with \(c \ge 2\) is incorporated into our model by introducing a latent Gaussian variable \(Z_j\), which complies with the well-known standard assumption for an ordinal variable (Muthén 1984) (see Eq. 1). Figure 1 shows an example of the model. Note that we allow the special case of a factor having a single indicator, e.g., \(\eta _1 \rightarrow Z_1 \rightarrow Y_1\), because this allows us to incorporate other (explicit) variables (such as age and income) into our model. In this special case, we set \(\lambda _{11} = 1\) and \(\epsilon _1 = 0\), thus \(Y_1 = F_1^{-1}(\varPhi [\eta _1])\).
In the typical design of questionnaires, one tries to get a grip on a latent concept through a particular set of well-designed questions (Martínez-Torres 2006; Byrne 2013), which implies that a factor (latent concept) in our model is connected to multiple indicators (questions), while an indicator is only used to measure a single factor, as shown in Fig. 1. This kind of measurement model is called a pure measurement model (Definition 8 in Silva et al. (2006)). Throughout this paper, we assume that all measurement models are pure, which means that there is only a single non-zero entry in each row of the factor loadings matrix \(\varLambda \). This inductive bias about the sparsity pattern of \(\varLambda \) is fully motivated by the typical design of a measurement model.
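The one-non-zero-entry-per-row structure of \(\varLambda \) can be made concrete with a small helper. This is an illustrative sketch (the loading values and the helper's name are hypothetical, not from the paper):

```python
import numpy as np

def pure_loading_matrix(loadings_per_factor):
    """Build a factor-loading matrix Lambda for a *pure* measurement model:
    exactly one non-zero entry per row, i.e., each indicator measures a
    single factor.  `loadings_per_factor` lists the loadings of the
    indicators of each factor in turn."""
    k = len(loadings_per_factor)
    p = sum(len(l) for l in loadings_per_factor)
    Lam = np.zeros((p, k))
    row = 0
    for q, lams in enumerate(loadings_per_factor):
        for lam in lams:
            Lam[row, q] = lam            # indicator `row` loads on factor q only
            row += 1
    return Lam

# two factors: three indicators for the first, two for the second
Lam = pure_loading_matrix([[0.8, 0.7, 0.6], [0.9, 0.75]])
```

Every row of `Lam` has exactly one non-zero entry, which is the sparsity pattern the identifiability analysis in Sect. 3.3 relies on.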
Definition 2
3.2 Inference for Gaussian copula factor model
We first introduce the inference procedure for complete mixed data and incomplete Gaussian data, respectively, based on which the procedure for mixed data with missing values is then derived. From this point on, we use S to denote the correlation matrix over the response vector \(\varvec{Z}\).
3.2.1 Mixed data without missing values
1. Sample \(\varvec{Z}\): \(\varvec{Z} \sim P(\varvec{Z} \mid \varvec{\eta }, \varvec{Z} \in \mathscr {D}(\varvec{Y}), \varOmega )\).
Since each coordinate \(Z_j\) directly depends on only one factor, i.e., the \(\eta _q\) such that \(\lambda _{jq} \ne 0\), we can sample each of them independently through \( Z_j \sim P(Z_j \mid \eta _q, \varvec{z}_j \in \mathscr {D}(\varvec{y}_j), \varOmega ) \).
2. Sample \(\varvec{\eta }\): \(\varvec{\eta } \sim P(\varvec{\eta } \mid \varvec{Z}, \varOmega )\).
3. Sample \(\varOmega \): \(\varOmega \sim P(\varOmega \mid \varvec{Z}, \varvec{\eta }, G)\).
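Step 1 for an ordinal coordinate amounts to drawing from a univariate normal truncated to the interval \(\mathscr {D}(y_j)\) implied by the observed category. A minimal sketch of that sub-step, using simple rejection sampling for clarity (the thresholds and numeric settings are illustrative; the paper's implementation may differ):

```python
import numpy as np

def sample_z_given_ordinal(y, mean, sd, thresholds, rng, max_tries=100000):
    """Draw Z_j from a normal with the conditional mean/sd implied by the
    current eta and Omega, truncated to the interval that maps to the
    observed ordinal category y.  `thresholds` has length c+1 and starts
    with -inf and ends with +inf; category y occupies (t_{y-1}, t_y]."""
    lo, hi = thresholds[y - 1], thresholds[y]
    for _ in range(max_tries):
        z = rng.normal(mean, sd)
        if lo < z <= hi:            # accept only draws inside D(y_j)
            return z
    raise RuntimeError("truncation interval has negligible mass")
```

In practice an inverse-CDF truncated-normal sampler would replace the rejection loop, but the accepted distribution is the same.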
3.2.2 Gaussian data with missing values
1. \(\varvec{Z}_\mathrm{miss} \sim P(\varvec{Z}_\mathrm{miss} \mid \varvec{Z}_\mathrm{obs}, S)\);
2. \(S \sim P(S \mid \varvec{Z}_\mathrm{obs}, \varvec{Z}_\mathrm{miss})\).
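Step 1 above is a standard conditional-Gaussian draw: given the current correlation matrix S, the missing coordinates are jointly normal given the observed ones. A self-contained sketch of that draw for a single row (zero-mean margins, as in the model; the function name is ours):

```python
import numpy as np

def draw_missing_given_observed(z_obs, S, miss_idx, obs_idx, rng):
    """Draw Z_miss ~ N(mean, cov) with
    mean = S_mo S_oo^{-1} z_obs  and  cov = S_mm - S_mo S_oo^{-1} S_om,
    the usual conditional distribution of a zero-mean Gaussian."""
    S_oo = S[np.ix_(obs_idx, obs_idx)]
    S_mo = S[np.ix_(miss_idx, obs_idx)]
    S_mm = S[np.ix_(miss_idx, miss_idx)]
    A = S_mo @ np.linalg.inv(S_oo)          # regression coefficients
    cond_mean = A @ z_obs
    cond_cov = S_mm - A @ S_mo.T
    return rng.multivariate_normal(cond_mean, cond_cov)
```

For \(S_{12} = 0.5\) and an observed value of 2, the conditional mean of the missing coordinate is 1 and its conditional variance is 0.75.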
3.2.3 Mixed data with missing values
1. \(\varvec{Z}_\mathrm{obs} \sim P(\varvec{Z}_\mathrm{obs} \mid \varvec{\eta }, \varvec{Z}_\mathrm{obs} \in \mathscr {D}(\varvec{Y}_\mathrm{obs}), \varOmega )\);
2. \(\varvec{Z}_\mathrm{miss} \sim P(\varvec{Z}_\mathrm{miss} \mid \varvec{\eta }, \varvec{Z}_\mathrm{obs}, \varOmega )\);
3. \(\varvec{\eta } \sim P(\varvec{\eta } \mid \varvec{Z}_\mathrm{obs}, \varvec{Z}_\mathrm{miss}, \varOmega )\);
4. \(\varOmega \sim P(\varOmega \mid \varvec{Z}_\mathrm{obs}, \varvec{Z}_\mathrm{miss}, \varvec{\eta }, G)\).
3.2.4 Discussion on prior specification
Our model enjoys some flexibility for incorporating prior knowledge into the inference procedure. As mentioned in Sect. 3.1, placing a G-Wishart prior on \(\varOmega \) is equivalent to placing an inverse-Wishart prior on C, a product of multivariate normals on \(\varLambda \), and an inverse-gamma on the diagonal elements of D. Therefore, one could choose one’s favorite informative priors on C, \(\varLambda \), and D separately, and then derive the resulting G-Wishart prior on \(\varOmega \). While the inverse-Wishart and inverse-gamma distributions have been criticized as unreliable when the variances are close to zero (Schuurman et al. 2016), our model does not suffer from this issue. This is because in our model the response variables (i.e., the Z variables) depend only on the ranks of the observed data, and in our sampling process we always set the variances of the response variables and latent variables to one, which makes the procedure scale-invariant with respect to the observed data.
One limitation of the current inference procedure is that one has to choose the prior on C from the inverse-Wishart family, on \(\varLambda \) from the normal family, and on D from the inverse-gamma family in order to keep conjugacy, so that one retains fast and concise inference. When a prior is chosen from other families, sampling \(\varOmega \) from the posterior distribution (Step 4 in Algorithm 1) is no longer straightforward. In this case, a different strategy such as the Metropolis-Hastings algorithm might be needed to implement Step 4.
3.3 Theoretical analysis
3.3.1 Identifiability of C
Without additional constraints, C is non-identifiable (Anderson and Rubin 1956). More precisely, given a decomposable matrix \(S = \varLambda C \varLambda ^\mathrm{T} + D\), we can always replace \(\varLambda \) with \(\varLambda U\) and C with \(U^{-1} C U^{-\mathrm{T}}\) to obtain an equivalent decomposition \(S = (\varLambda U)(U^{-1} C U^{-\mathrm{T}})(U^\mathrm{T} \varLambda ^\mathrm{T}) + D\), where U is a \(k \times k\) invertible matrix. Since \(\varLambda \) has only one non-zero entry per row in our model, U must be diagonal to ensure that \(\varLambda U\) has the same sparsity pattern as \(\varLambda \) (see Lemma 1 in “Appendix”). Thus, from the same S, we get a class of solutions for C, i.e., \(U^{-1} C U^{-1}\), where U can be any invertible diagonal matrix. In order to get a unique solution for C, we impose two sufficient identifying conditions: (1) restrict C to be a correlation matrix; (2) force the first non-zero entry in each column of \(\varLambda \) to be positive. See Lemma 2 in “Appendix” for the proof. Condition 1 is implemented via line 31 in Algorithm 1. As for the second condition, we force the covariance between a factor and its first indicator to be positive (line 27), which is equivalent to Condition 2. Note that these conditions are not unique; one could choose other conditions to identify C, e.g., setting the first loading of each factor to 1. The reason for our choice is to keep it consistent with our model definition, where C is a correlation matrix.
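The non-identifiability argument can be checked numerically: for any invertible diagonal U, the rescaled pair \((\varLambda U,\, U^{-1} C U^{-1})\) reproduces exactly the same S. The matrices below are small arbitrary examples, not from the paper:

```python
import numpy as np

# pure measurement model: one non-zero loading per row
Lam = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.9], [0.0, 0.6]])
C = np.array([[1.0, 0.3], [0.3, 1.0]])
D = np.diag([0.36, 0.51, 0.19, 0.64])      # residual variances
U = np.diag([2.0, -0.5])                   # any invertible diagonal matrix
Ui = np.linalg.inv(U)

S1 = Lam @ C @ Lam.T + D                   # original decomposition
S2 = (Lam @ U) @ (Ui @ C @ Ui.T) @ (Lam @ U).T + D  # rescaled decomposition
assert np.allclose(S1, S2)                 # same S, different (Lambda, C)
```

Note that the second factor's loadings flip sign under this U, which is exactly what identifying Condition 2 (first non-zero loading per column positive) rules out.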
3.3.2 Identifiability of \(\varLambda \) and D
Under the two conditions for identifying C, the factor loadings \(\varLambda \) and residual variances D are also identified, except for the case in which a factor that is independent of all the others has only two indicators. For such a factor, we have 4 free parameters (2 loadings, 2 residual variances) but only 3 available equations (2 variances, 1 covariance), which yields an underdetermined system. See Lemmas 3 and 4 in “Appendix” for a detailed analysis. If this happens, one could impose additional constraints to guarantee a unique solution, e.g., by setting the variance of the first residual to zero. However, we recommend leaving such an independent factor out (especially in association analysis) or studying it separately from the other factors.
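The counting argument (4 parameters, 3 equations) can be illustrated numerically: two different parameter sets for a standalone two-indicator factor imply exactly the same \(2 \times 2\) covariance matrix. The specific values below are ours, chosen only to make the equality exact:

```python
import numpy as np

def implied_cov_two_indicators(lam1, lam2, d1, d2):
    """Implied covariance of two indicators of a standalone factor with
    unit variance: var_i = lam_i^2 + d_i, cov = lam1 * lam2.  That is
    3 observable quantities against 4 free parameters."""
    return np.array([[lam1**2 + d1, lam1 * lam2],
                     [lam1 * lam2, lam2**2 + d2]])

S_a = implied_cov_two_indicators(0.8, 0.5, 0.36, 0.75)
S_b = implied_cov_two_indicators(0.4, 1.0, 0.84, 0.0)   # different parameters
assert np.allclose(S_a, S_b)                            # same implied covariance
```

Both parameter sets give unit variances and a covariance of 0.4, so no amount of data on these two indicators alone can distinguish them.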
Under sufficient conditions for identifying C, \(\varLambda \), and D, our BGCF approach is consistent even with MCAR missing values. This is shown in Theorem 1, whose proof is provided in “Appendix”.
Theorem 1
4 Simulation study
In this section, we compare our BGCF approach with alternative approaches via simulations.
4.1 Setup
4.1.1 Model specification
4.1.2 Data generation
Given the specified model, one can generate data in the response space (the \(\varvec{Z}\) in Definition 1) via Eqs. (2) and (3). When the observed data (the \(\varvec{Y}\) in Definition 1) are ordinal, we discretize the corresponding margins into the desired number of categories. When the observed data are nonparanormal, we set the \(F_j(\cdot )\) in Eq. (4) to the CDF of a \(\chi ^2\)-distribution with degrees of freedom df. The reason for choosing a \(\chi ^2\)-distribution is that we can easily use df to control the extent of non-normality: a higher df implies a distribution closer to a Gaussian. To fill in a certain percentage \(\beta \) of missing values (we only consider MAR), we follow the procedure in Kolar and Xing (2012), i.e., for \(j = 1,\ldots ,\lfloor p/2 \rfloor \) and \(i = 1,\ldots ,n\): \(y_{i,2j}\) is missing if \(z_{i,2j-1} < \varPhi ^{-1}(2\beta )\).
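The Kolar–Xing fill-in rule can be sketched directly: each even-numbered column is masked wherever its odd-numbered neighbor falls below \(\varPhi ^{-1}(2\beta )\), which makes the missingness depend on the (observed) neighbor and removes roughly a fraction \(\beta \) of all entries. A minimal sketch, assuming standard-normal Z margins:

```python
import numpy as np
from statistics import NormalDist

def fill_mar(Y, Z, beta):
    """MAR fill-in following Kolar and Xing (2012): y_{i,2j} is set to
    missing whenever z_{i,2j-1} < Phi^{-1}(2*beta).  Half of the columns
    lose a fraction 2*beta of entries, so roughly a fraction beta of all
    entries becomes missing."""
    Y = Y.astype(float).copy()
    t = NormalDist().inv_cdf(2 * beta)      # threshold Phi^{-1}(2 beta)
    n, p = Y.shape
    for j in range(1, p // 2 + 1):          # 1-based pair index j
        mask = Z[:, 2 * j - 2] < t          # column 2j-1 (0-based: 2j-2)
        Y[mask, 2 * j - 1] = np.nan         # column 2j  (0-based: 2j-1)
    return Y
```

Because the masking condition involves only a column that stays fully observed, the mechanism is MAR but not MCAR.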
4.1.3 Evaluation metrics
4.2 Ordinal data without missing values
In this subsection, we consider ordinal complete data since this matches the assumptions of the diagonally weighted least squares (DWLS) method, in which we set the number of ordinal categories to be 4. We also incorporate the robust maximum likelihood (MLR) as an alternative approach, which was shown to be empirically tenable when the number of categories is more than 5 (Rhemtulla et al. 2012; Li 2016). See Sect. 2 for details of the two approaches.
Potential Scale Reduction Factor (PSRF), with its 95% upper confidence limit in parentheses, of the 6 inter-factor correlations and 16 factor loadings over 5 chains

Parameter | PSRF | Parameter | PSRF | Parameter | PSRF
\(C_{12}\) | 1.00 (1.00) | \(\lambda _1\) | 1.01 (1.02) | \(\lambda _9\) | 1.01 (1.02)
\(C_{13}\) | 1.00 (1.01) | \(\lambda _2\) | 1.00 (1.01) | \(\lambda _{10}\) | 1.00 (1.01)
\(C_{14}\) | 1.00 (1.01) | \(\lambda _3\) | 1.01 (1.02) | \(\lambda _{11}\) | 1.00 (1.00)
\(C_{23}\) | 1.00 (1.01) | \(\lambda _4\) | 1.00 (1.00) | \(\lambda _{12}\) | 1.00 (1.00)
\(C_{24}\) | 1.00 (1.01) | \(\lambda _5\) | 1.00 (1.00) | \(\lambda _{13}\) | 1.00 (1.01)
\(C_{34}\) | 1.00 (1.00) | \(\lambda _6\) | 1.01 (1.03) | \(\lambda _{14}\) | 1.02 (1.05)
 | | \(\lambda _7\) | 1.02 (1.06) | \(\lambda _{15}\) | 1.00 (1.00)
 | | \(\lambda _8\) | 1.01 (1.03) | \(\lambda _{16}\) | 1.01 (1.02)
4.3 Mixed data with missing values
In this subsection, we consider mixed nonparanormal and ordinal data with missing values, since some latent variables in real-world applications are measured by sensors that usually produce continuous but not necessarily Gaussian data. The 8 indicators of the first 2 factors (4 per factor) are transformed into a \(\chi ^2\)-distribution with \(df = 8\), which yields a slightly non-normal distribution (skewness is 1, excess kurtosis is 1.5) (Li 2016). The 8 indicators of the last 2 factors are discretized into ordinal variables with 4 categories.
One alternative approach in such cases is DWLS with pairwise deletion (\(\hbox {DWLS} + \hbox {PD}\)), in which heterogeneous correlations (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables) are first computed based on pairwise complete observations, after which DWLS is used to estimate model parameters. A second alternative is DWLS with multiple imputation (\(\hbox {DWLS} + \hbox {MI}\)), where we choose 20 imputed datasets for the follow-up study.^{3} Specifically, we use the R package mice (Buuren and Groothuis-Oudshoorn 2010), in which the default imputation method “predictive mean matching” is applied. A third alternative is full information maximum likelihood (FIML) (Arbuckle 1996; Rosseel 2012), which first applies an EM algorithm to impute missing values and then uses MLR to learn model parameters.
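The pairwise-deletion step of the \(\hbox {DWLS} + \hbox {PD}\) baseline can be sketched for the numeric/numeric (Pearson) case; polyserial and polychoric entries would need their own estimators, so this is only a partial illustration under that assumption:

```python
import numpy as np

def pairwise_complete_corr(Y):
    """Pairwise-deletion correlations: each entry uses only the rows where
    *both* variables are observed, so different entries may be based on
    different subsamples, and the resulting matrix need not be positive
    definite (one drawback of pairwise deletion)."""
    p = Y.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(Y[:, i]) & ~np.isnan(Y[:, j])
            r = np.corrcoef(Y[ok, i], Y[ok, j])[0, 1]
            R[i, j] = R[j, i] = r
    return R
```

As noted in Sect. 1, estimates built this way are only consistent under MCAR, which is why the comparison below stresses the MAR setting.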
Two more experiments are provided in “Appendix”. One concerns incomplete ordinal data with different numbers of categories, showing that BGCF is favorable over the alternatives for learning factor loadings. Another one considers incomplete nonparanormal data with different extents of deviation from a Gaussian, which indicates that FIML is rather sensitive to the deviation and only performs well for a slightly nonnormal distribution, while the deviation has no influence on BGCF at all. See “Appendix” for more details.
5 Application to real-world data

Y1: Visual perception;
Y2: Cubes;
Y3: Lozenges;
Y4: Paragraph comprehension;
Y5: Sentence completion;
Y6: Word meaning;
Y7: Speeded addition;
Y8: Speeded counting of dots;
Y9: Speeded discrimination of straight and curved capitals.
A summary of the 9 variables in this dataset is provided in Table 2, showing the number of unique values, skewness, and (excess) kurtosis for each variable (this dataset contains no missing values). From the column of unique values, we notice that the data are approximately continuous. The averages of the absolute skewness and absolute excess kurtosis over the 9 variables are around 0.40 and 0.54, respectively, which is considered slightly non-normal (Li 2016). Therefore, we choose MLR as the alternative to be compared with our BGCF approach, since these conditions match the assumptions of MLR.
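The skewness and excess kurtosis figures used to judge (non-)normality here are the standard sample moments; both are 0 for a Gaussian, and a \(\chi ^2\)-distribution with df degrees of freedom has skewness \(\sqrt{8/df}\) and excess kurtosis \(12/df\) (so df = 8 gives the values 1 and 1.5 quoted in Sect. 4.3). A minimal sketch of the computation:

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis (third and fourth standardized
    moments; the latter minus 3 so that a Gaussian scores 0)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()      # standardize (population std)
    return (z**3).mean(), (z**4).mean() - 3.0
```

Applying this to each column of a dataset reproduces the kind of summary shown in Table 2.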
We run our Bayesian Gaussian copula factor approach on this dataset. The learned parameter estimates are shown in Fig. 6, in which inter-factor correlations are on the bidirected edges, factor loadings are on the directed edges, and the unique variance of each variable is next to its self-referring arrow. The parameters learned by the MLR approach are not shown here; since the ground truth is unknown, it is hard to conduct a comparison between the two approaches.
The number of unique values, skewness, and (excess) kurtosis of each variable in the ‘HolzingerSwineford1939’ dataset

Variable | Unique values | Skewness | Kurtosis
Y1 | 35 | -0.26 | 0.33
Y2 | 25 | 0.47 | 0.35
Y3 | 35 | 0.39 | -0.89
Y4 | 20 | 0.27 | 0.10
Y5 | 25 | -0.35 | -0.54
Y6 | 40 | 0.86 | 0.84
Y7 | 97 | 0.25 | -0.29
Y8 | 84 | 0.53 | 1.20
Y9 | 129 | 0.20 | 0.31
For MLR, we first learn the model parameters on the training set, from which we extract the linear regression intercept and coefficients of \(Y_j\) on \(\varvec{Y}_{\backslash j}\). Then, we predict the value of \(Y_j\) based on the values of \(\varvec{Y}_{\backslash j}\). See Algorithm 2 for pseudo code of this procedure.

For BGCF, we first estimate the correlation matrix \(\hat{S}\) over the response variables (the \(\varvec{Z}\) in Definition 1) and the empirical CDF \(\hat{F}_j\) of \(Y_j\) on the training set. Then we draw latent Gaussian data \(Z_j\) given \(\hat{S}\) and \(\varvec{Y}_{\backslash j}\), i.e., from \(P(Z_j \mid \hat{S}, \varvec{Z}_{\backslash j} \in \mathscr {D}(\varvec{Y}_{\backslash j}))\). Lastly, we obtain the value of \(Y_j\) from \(Z_j\) via \(\hat{F}_j\), i.e., \(Y_j = \hat{F}_j^{-1} \big (\varPhi [Z_j]\big )\). See Algorithm 3 for pseudo code of this procedure. Note that in the actual implementation we iterate the prediction stage (lines 7–8) multiple times to obtain multiple solutions for \(Y_j^{(new)}\); the average over these solutions is then taken as the final predicted value of \(Y_j^{(new)}\). This idea is quite similar to multiple imputation.
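The final mapping \(Y_j = \hat{F}_j^{-1}(\varPhi [Z_j])\) can be sketched in a few lines: push the latent Gaussian value through the standard-normal CDF, then through the inverse empirical CDF of the training values, implemented here as an empirical quantile. This is a hypothetical helper illustrating the transform, not the paper's exact implementation:

```python
import math
import numpy as np

def ecdf_inverse_transform(z, y_train):
    """Map a latent Gaussian value z to the observed scale of Y_j:
    u = Phi(z) via the error function, then the empirical quantile of the
    training values plays the role of F_hat_j^{-1}(u)."""
    u = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # Phi(z)
    return float(np.quantile(y_train, u))            # empirical F_hat^{-1}(u)
```

Because the output is an empirical quantile of the training values, predictions automatically stay on the observed scale and respect the (possibly non-Gaussian) margin of \(Y_j\).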
The mean squared error (MSE) is used to evaluate the prediction accuracy, where we repeat tenfold cross-validation 10 times (thus 100 MSE estimates in total). Also, we take each \(Y_j\) as the outcome variable in turn while treating the others as predictors (thus 9 tasks in total). Figure 7 provides the results of BGCF and MLR for all 9 tasks, showing the mean MSE over the 100 estimates with standard errors represented by error bars. We see that BGCF outperforms MLR for Tasks 5 and 6, while the two perform indistinguishably for the other tasks. The advantage of BGCF over MLR is encouraging, considering that the experimental conditions match the assumptions of MLR. Additional experiments (not shown), conducted after making the data moderately or substantially non-normal, suggest that BGCF is significantly favorable over MLR, as expected.
6 Summary and discussion
In this paper, we proposed a novel Bayesian Gaussian copula factor (BGCF) approach for learning parameters of CFA models that can handle mixed continuous and ordinal data with missing values. We analyzed the separate identifiability of inter-factor correlations C, factor loadings \(\varLambda \), and residual variances D, since different researchers may care about different parameters. For instance, identifying C is sufficient for researchers interested in learning causal relations among latent variables (Silva and Scheines 2006; Silva et al. 2006; Cui et al. 2016), with no need to worry about additional conditions for identifying \(\varLambda \) and D. Under sufficient identification conditions, we proved that our approach is consistent for MCAR data and empirically showed that it works quite well for MAR data.
In the experiments, our approach outperforms DWLS even under the assumptions of DWLS. Apparently, the approximations inherent in DWLS, such as the use of the polychoric correlation and its asymptotic covariance, incur a small loss in accuracy compared to an integrated approach like BGCF. When the data follow a more complicated distribution and contain missing values, the advantage of BGCF over its competitors becomes more prominent. Another highlight of our approach is that the Gibbs sampler converges quite fast, with a rather short burn-in period. To further reduce the time complexity, a potential optimization of the sampling process is available (Kalaitzis and Silva 2013).
There are various generalizations of our inference approach. While our focus in this paper is on correlated k-factor models, it is straightforward to extend the current procedure to other classes of latent variable models that are often considered in CFA, such as bifactor models and second-order models, by simply adjusting the sparsity structure of the prior graph G.
Also, one may consider models with impure measurement indicators, e.g., a model with an indicator measuring multiple factors (cross-loadings) or a model with residual covariances (Bollen 1989), which can easily be handled by BGCF by changing the sparsity pattern of \(\varLambda \) and D. However, two critical issues might arise in this case: non-identification problems due to the large number of parameters, and slow convergence of MCMC algorithms because of dependencies in D. The first issue can be addressed by introducing strongly informative priors (Muthén and Asparouhov 2012), e.g., putting small-variance priors on all cross-loadings. The caveat here is that one needs to choose such priors very carefully to reach a good balance between incorporating correct information and avoiding non-identification. See Muthén and Asparouhov (2012) for more details about the choice of priors on cross-loadings and correlated residuals. Once the priors on C, \(\varLambda \), and D are chosen, one can derive the prior on \(\varOmega \). The second issue can be alleviated via the parameter expansion technique (Ghosh and Dunson 2009; Merkle and Rosseel 2018), in which the residual covariance matrix is decomposed into a couple of simple components through phantom latent variables, resulting in an equivalent model called a working model. Our inference procedure can then proceed on the working model.
It is possible to extend the current approach to multiple groups, to accommodate cross-national research or a multilevel structure, although this is not quite straightforward. One might then not be able to draw the precision matrix directly from a G-Wishart distribution (Step 4 in Algorithm 1), since different groups may have different C and D while sharing the same \(\varLambda \). However, this step can be implemented by drawing C, \(\varLambda \), and D separately.
Another line of future work is to analyze standard errors and confidence intervals while this paper concentrates on the accuracy of parameter estimates. Our conjecture is that BGCF is still favorable because it naturally transfers the extra variability incurred by missing values to the posterior Gibbs samples: we indeed observed a growing variance of the posterior distribution with the increase of missing values in our simulations. On top of the posterior distribution, one could conduct further studies, e.g., causal discovery over latent factors (Silva et al. 2006; Cui et al. 2018), regression analysis (as we did in Sect. 5), or other machine learning tasks. Instead of using a Gaussian copula, some other choices of copulas are available to model advanced properties in the data such as tail dependence and tail asymmetry (Krupskii and Joe 2013, 2015).
Footnotes
1. The code, including that used in the simulations and the real-world application, is provided at https://github.com/cuiruifei/CopulaFactorModel.
2. Note that the parameter values used here to specify the Gibbs sampler are based on empirical results. They can be treated as default choices, but we recommend re-testing convergence for a specific real-world problem. If this is difficult, one could simply choose larger values than the current ones to stay on the safe side, since larger is better for all these parameters.
3. The general recommendation is to use 20 imputations to obtain proper estimated coefficients, and 100 imputations to obtain proper estimated coefficients and standard errors.
Notes
Acknowledgements
This research has been partially financed by the Netherlands Organisation for Scientific Research (NWO) under project 617.001.451.
Compliance with ethical standards
Conflicts of interest
The authors declare that they have no conflict of interest.
References
 Anderson, T.W., Rubin, H.: Statistical inference in factor analysis. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 5: Contributions to Econometrics, Industrial Research, and Psychometry, University of California Press, Berkeley, CA, pp. 111–150 (1956)
 Arbuckle, J.L.: Full information estimation in the presence of incomplete data. In: Marcoulides, G.A., Schumacker, R.E. (eds.) Advanced Structural Equation Modeling: Issues and Techniques, vol. 243, p. 277. Lawrence Erlbaum Associates, Mahwah (1996)
 Barendse, M., Oort, F., Timmerman, M.: Using exploratory factor analysis to determine the dimensionality of discrete responses. Struct. Equ. Model. 22(1), 87–101 (2015)
 Barnard, J., McCulloch, R., Meng, X.L.: Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Stat. Sin. 10, 1281–1311 (2000)
 Boker, S., Neale, M., Maes, H., Wilde, M., Spiegel, M., Brick, T., Spies, J., Estabrook, R., Kenny, S., Bates, T., et al.: OpenMx: an open source extended structural equation modeling framework. Psychometrika 76(2), 306–317 (2011)
 Bollen, K.: Structural Equations with Latent Variables. Wiley, New York (1989)
 Browne, M.W.: Asymptotically distribution-free methods for the analysis of covariance structures. Br. J. Math. Stat. Psychol. 37(1), 62–83 (1984)
 Buuren, S.V., Groothuis-Oudshoorn, K.: mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–68 (2010)
 Byrne, B.M.: Structural Equation Modeling with EQS: Basic Concepts, Applications, and Programming. Routledge, London (2013)
 Castro, L.M., Costa, D.R., Prates, M.O., Lachos, V.H.: Likelihood-based inference for Tobit confirmatory factor analysis using the multivariate Student-t distribution. Stat. Comput. 25(6), 1163–1183 (2015)
 Cui, R., Groot, P., Heskes, T.: Copula PC algorithm for causal discovery from mixed data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 377–392 (2016)
 Cui, R., Groot, P., Heskes, T.: Learning causal structure from mixed data with missing values using Gaussian copula models. Stat. Comput. (2018). https://doi.org/10.1007/s11222-018-9810-x
 Curran, P.J., West, S.G., Finch, J.F.: The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychol. Methods 1(1), 16 (1996)
 DiStefano, C.: The impact of categorization with confirmatory factor analysis. Struct. Equ. Model. 9(3), 327–346 (2002)
 Dobra, A., Lenkoski, A., et al.: Copula Gaussian graphical models and their application to modeling functional disability data. Ann. Appl. Stat. 5(2A), 969–993 (2011)
 Embretson, S.E., Reise, S.P.: Item Response Theory. Psychology Press, Hove (2013)
 Gelman, A., Rubin, D.B., et al.: Inference from iterative simulation using multiple sequences. Stat. Sci. 7(4), 457–472 (1992)
 Ghosh, J., Dunson, D.B.: Default prior distributions and efficient posterior computation in Bayesian factor analysis. J. Comput. Graph. Stat. 18(2), 306–320 (2009)
 Hoff, P.D.: Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat. 1(1), 265–283 (2007)
 Holzinger, K.J., Swineford, F.: A study in factor analysis: the stability of a bi-factor solution. Suppl. Educ. Monogr. 48, 468–469 (1939)
 Jöreskog, K.G.: A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34(2), 183–202 (1969)
 Jöreskog, K.G.: Structural Equation Modeling with Ordinal Variables Using LISREL. Technical Report. Scientific Software International Inc, Lincolnwood, IL (2005)
 Kalaitzis, A., Silva, R.: Flexible sampling of discrete data correlations without the marginal distributions. In: Advances in Neural Information Processing Systems, pp. 2517–2525 (2013)
 Kaplan, D.: Structural Equation Modeling: Foundations and Extensions, vol. 10. Sage Publications, Thousand Oaks (2008)
 Kolar, M., Xing, E.P.: Estimating sparse precision matrices from data with missing values. In: International Conference on Machine Learning (2012)
 Krupskii, P., Joe, H.: Factor copula models for multivariate data. J. Multivar. Anal. 120, 85–101 (2013)
 Krupskii, P., Joe, H.: Structured factor copula models: theory, inference and computation. J. Multivar. Anal. 138, 53–73 (2015)
 Li, C.H.: Confirmatory factor analysis with ordinal data: comparing robust maximum likelihood and diagonally weighted least squares. Behav. Res. Methods 48(3), 936–949 (2016)
 Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (1987)
 Lubke, G.H., Muthén, B.O.: Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Struct. Equ. Model. 11(4), 514–534 (2004)
 Marsh, H.W., Hau, K.T., Balla, J.R., Grayson, D.: Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivar. Behav. Res. 33(2), 181–220 (1998)
 Martínez-Torres, M.R.: A procedure to design a structural and measurement model of intellectual capital: an exploratory study. Inf. Manag. 43(5), 617–626 (2006)
 Merkle, E.C., Rosseel, Y.: blavaan: Bayesian structural equation models via parameter expansion. J. Stat. Softw. 85(4), 1–30 (2018)
 Murphy, K.P.: Conjugate Bayesian analysis of the Gaussian distribution. Technical report (2007)
 Murray, J.S., Dunson, D.B., Carin, L., Lucas, J.E.: Bayesian Gaussian copula factor models for mixed data. J. Am. Stat. Assoc. 108(502), 656–665 (2013)
 Muthén, B.: A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49(1), 115–132 (1984)
 Muthén, B., Asparouhov, T.: Bayesian structural equation modeling: a more flexible representation of substantive theory. Psychol. Methods 17(3), 313 (2012)
 Muthén, B., du Toit, S., Spisic, D.: Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Psychometrika (1997)
 Muthén, L.: Mplus User’s Guide. Muthén & Muthén, Los Angeles (2010)
 Nelsen, R.B.: An Introduction to Copulas. Springer, Berlin (2007)
 Olsson, U.: Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika 44(4), 443–460 (1979)
 Poon, W.Y., Wang, H.B.: Latent variable models with ordinal categorical covariates. Stat. Comput. 22(5), 1135–1154 (2012)
 Rhemtulla, M., Brosseau-Liard, P.É., Savalei, V.: When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychol. Methods 17(3), 354 (2012)
 Rosseel, Y.: lavaan: an R package for structural equation modeling. J. Stat. Softw. 48(2), 1–36 (2012)
 Roverato, A.: Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scand. J. Stat. 29(3), 391–411 (2002)
 Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976)
 Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997)
 Schafer, J.L., Graham, J.W.: Missing data: our view of the state of the art. Psychol. Methods 7(2), 147 (2002)
 Schuurman, N., Grasman, R., Hamaker, E.: A comparison of inverse-Wishart prior specifications for covariance matrices in multilevel autoregressive models. Multivar. Behav. Res. 51(2–3), 185–206 (2016)
 Silva, R., Scheines, R.: Bayesian learning of measurement and structural models. In: International Conference on Machine Learning, pp. 825–832 (2006)
 Silva, R., Scheines, R., Glymour, C., Spirtes, P.: Learning the structure of linear latent variable models. J. Mach. Learn. Res. 7, 191–246 (2006)
 Yang-Wallentin, F., Jöreskog, K.G., Luo, H.: Confirmatory factor analysis of ordinal variables with misspecified models. Struct. Equ. Model. 17(3), 392–423 (2010)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.