# A note on large-scale logistic prediction: using an approximate graphical model to deal with collinearity and missing data

## Abstract

Large-scale prediction problems are often plagued by correlated predictor variables and missing observations. We consider prediction settings in which logistic regression models are used and propose a novel approach to make accurate predictions even when predictor variables are highly correlated and only partly observed. Our approach comprises three steps: first, to overcome the collinearity issue, we propose to model the joint distribution of the outcome variable and the predictor variables using the Ising network model. Second, to render the application of Ising networks feasible, we use a latent variable representation to apply a low-rank approximation to the network’s connectivity matrix. Finally, we propose an approximation to the latent variable distribution that is used in the representation to handle missing observations. We demonstrate our approach with numerical illustrations.

## Keywords

Logistic regression · Ising model · IRT model

## 1 Introduction

Most large-scale or big data applications involve conditional models that utilize covariates to make predictions about a variable of interest. For instance, Google needs to predict which links to websites will be most advantageous based on millions of previous clicks, and Netflix needs to predict movie preferences based on millions of previous viewings and rankings. In these applications the interest is not in explaining why the connections between websites or movies exist, but in predicting which website will be most often requested or which movie will be preferred by an individual user. We will focus on the prediction problem where both the outcome variable and the covariates are binary, and the logistic regression model is an appropriate statistical model.

As is the case with all regression models, the logistic regression model was developed for situations where the covariates are independent and completely observed. In large-scale applications, however, covariates are typically correlated. As a consequence of the correlations between covariates, i.e., collinearity, the obtained set of coefficients is no longer unique. This can be seen in, for instance, the coordinate descent algorithm (Hastie et al. 2015), where each covariate is treated separately. For two equivalent covariates, any solution with a linear combination of the two (normalized) coefficients is correct, even when regularization is applied. This is certainly an issue for the identification of relevant covariates, i.e., variable selection. In a particular sample, one of the collinear covariates will have a slightly larger coefficient and, therefore, end up in the solution, while the other does not. But in another sample it could be the other way around. This means that variable selection with collinear covariates is unreliable. We will illustrate that collinearity is nevertheless not a problem for prediction.
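The non-uniqueness of the coefficients, and the fact that it leaves predictions untouched, can be illustrated with a minimal sketch. It assumes two perfectly collinear \(\pm 1\) covariates and hand-picked coefficient values; the function name `predict_proba` and all numbers are illustrative and not taken from the paper.

```python
import numpy as np

def predict_proba(X, alpha, beta):
    """Logistic regression probabilities P(y = 1 | x) for each row of X."""
    return 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))

rng = np.random.default_rng(0)
x1 = rng.choice([-1.0, 1.0], size=50)
# Two perfectly collinear covariates: the second column duplicates the first.
X = np.column_stack([x1, x1])

# Two coefficient vectors that split the same total weight (1.2) differently.
beta_a = np.array([1.2, 0.0])
beta_b = np.array([0.3, 0.9])

p_a = predict_proba(X, alpha=-0.5, beta=beta_a)
p_b = predict_proba(X, alpha=-0.5, beta=beta_b)

# The coefficients differ, yet the predicted probabilities coincide.
same_predictions = bool(np.allclose(p_a, p_b))
```

Variable selection would keep only the first covariate under `beta_a` and both under `beta_b`, while any prediction based on either vector is the same.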

Another issue is that in most large-scale applications covariates are only partially observed (e.g., Rubin 1976; Rousseeuw 2016). For estimation and variable selection, it is then pertinent to know in which way the data came to be missing. For instance, data could be missing completely at random, which means that there is no connection between the missing observations and the data-generating process. But data could also be missing precisely because of the data-generating process. For instance, a response to the question “do you drink more on average than others in your circle of friends” will be missing if a negative response is observed to the question “do you take alcohol”. In such cases, conditional on taking alcohol, the two covariates will be correlated, and so the missing observations cannot be ignored. We illustrate that predictions based on partially observed data can still be accurate, even when the process that generates missing data cannot be ignored in a statistical sense.

Our goal in this paper was to introduce a novel approach to make accurate predictions with the logistic regression model when covariates are highly correlated and only partly observed. Our approach comprises three steps: First, we propose to model the joint distribution of the outcome variable and the predictor variables with the Ising model. In the Ising model the correlations between the observed variables are explicitly modeled, which overcomes the collinearity issue. Second, we use recent results that relate Ising networks to latent variable models to render the application of Ising models computationally tractable. Specifically, we use a low-rank approximation to the network’s connectivity matrix, which is opportune when variables are highly correlated (i.e., collinearity). Finally, we propose to approximate the latent variable distribution in the representation of the Ising model, which results in a model-based approximation to the full Ising model that is able to handle missing observations. Numerical illustrations are used to demonstrate different features of our approach.

## 2 Step I: the Ising model to overcome collinearity

Throughout, \(x_{\setminus i}\) denotes the vector *x* excluding the *i*th element. Even though we are only interested in the predictive distribution \(\mathbb {P} (x_i \mid x_{\setminus i})\), our observations are the realizations of a multivariate random variable *x*. The multivariate distribution \(\mathbb {P}(x)\) that is consistent with the logistic regression model is the Ising network model (Lenz 1920; Ising 1925).

Correlations between the variables in *x* are no problem for the Ising model as their interactions \(\sigma _{ij}\) are explicitly modeled. That is, there is no collinearity issue when estimating the full Ising model \(\mathbb {P}(x)\). From the joint distribution we can then obtain the correct full-conditional \(\mathbb {P}(x_i \mid x_{\setminus i})\), and it is easily seen that the full-conditional distribution that is obtained from the Ising model is a logistic regression model.

The Ising model comes with two problems of its own. First, whereas the number of parameters is linear in *p* for the logistic regression model, it is quadratic in *p* for the Ising model. A second problem is that the density of the Ising model is computationally intractable, except for small or heavily constrained networks. This computational burden is due entirely to the model’s normalizing constant, which requires a sum over all \(2^p\) possible realizations of *x*. Thus, even though we have resolved the collinearity issue with the Ising model, we have also increased the number of parameters by an order of magnitude and need to deal with estimating a model that is computationally intractable.
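Both points of this section can be checked numerically on a small network. The sketch below assumes \(x_i \in \{-1, +1\}\) and the parameterization \(\mathbb{P}(x) \propto \exp (\sum_i \mu_i x_i + \sum_{i<j} \sigma_{ij} x_i x_j)\); the exact scaling used in the paper may differ, and all function names are illustrative. Under this assumption the full-conditional is a logistic regression with log-odds \(2(\mu_i + \sum_{j \ne i} \sigma_{ij} x_j)\), and the normalizing constant is obtained by enumerating all \(2^p\) states.

```python
import itertools
import numpy as np

def ising_log_potential(x, mu, sigma):
    """Unnormalized log-probability: sum_i mu_i x_i + sum_{i<j} sigma_ij x_i x_j."""
    # sigma is symmetric with a zero diagonal, hence the factor 0.5.
    return x @ mu + 0.5 * (x @ sigma @ x)

def ising_joint(mu, sigma):
    """Enumerate all 2^p states; returns (states, probabilities, Z)."""
    p = len(mu)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
    weights = np.exp([ising_log_potential(x, mu, sigma) for x in states])
    z = weights.sum()  # the normalizing constant: a sum over 2^p terms
    return states, weights / z, z

def conditional_from_joint(i, x, states, probs):
    """P(x_i = +1 | x_{-i}) computed by brute force from the joint distribution."""
    others = np.delete(np.arange(states.shape[1]), i)
    match = np.all(states[:, others] == x[others], axis=1)
    plus = match & (states[:, i] == 1.0)
    return probs[plus].sum() / probs[match].sum()

def conditional_logistic(i, x, mu, sigma):
    """The same conditional in closed form: a logistic regression on x_{-i}."""
    linear = mu[i] + sigma[i] @ x - sigma[i, i] * x[i]
    return 1.0 / (1.0 + np.exp(-2.0 * linear))

rng = np.random.default_rng(1)
p = 5
mu = rng.normal(scale=0.3, size=p)
a = rng.normal(scale=0.5, size=(p, p))
sigma = np.triu(a, 1) + np.triu(a, 1).T  # symmetric, zero diagonal

states, probs, z = ising_joint(mu, sigma)
x = states[11]
brute = [conditional_from_joint(i, x, states, probs) for i in range(p)]
closed = [conditional_logistic(i, x, mu, sigma) for i in range(p)]
```

The enumeration in `ising_joint` doubles in cost with every added variable, which is why this brute-force route is only feasible for small *p*.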

## 3 Step II: low-rank approximations for computational tractability

In the latent variable representation of the Ising model due to Kac (1968), the variables in *x* are independent given the full set of latent variables \(\eta \). Since the diagonal elements from the connectivity matrix \(\Sigma \) are not identifiable from the data, we decompose it as \(\Sigma + cI = UU^\mathsf{T}\), where the constant *c* serves to ensure that all eigenvalues are positive, i.e., ensuring that \(UU^\mathsf{T}\) is positive (semi-)definite, and at the same time preserves the off-diagonal elements of \(\Sigma \). The latent variable representation of Kac then follows immediately from a clever use of the Gaussian identity.

Conditional on \(\eta \), the distribution of *x* is a multidimensional item response theory (MIRT) model in which the *i*th variable depends on \(\eta \) through the *i*th column of \(U^\mathsf{T}\). MIRT models are frequently used in psychological and educational measurement (Ackerman et al. 2003; Ackerman 1996), where the observed variables correspond to item responses on some test or questionnaire, and the latent variables relate to the trait or abilities being assessed (Borsboom and Molenaar 2015). Importantly, this representation inspired a full-data-information estimation procedure that avoids having to compute the Ising model’s intractable normalizing constant (Marsman et al. 2015).

A second crucial ingredient is the low-rank approach that Marsman et al. (2015) proposed to approximate the full connectivity matrix, such that the number of parameters becomes linear in *p*. Their low-rank approach makes use of the Eckart–Young theorem (Eckart and Young 1936), which states that in a least squares sense the best rank-*r* approximation to the full connectivity matrix \(\Sigma \) is one in which all but the *r* largest eigenvalues are equated to zero. Low-rank approximations have become increasingly popular in prediction problems since their crucial role in winning the Netflix Prize competition (Koren et al. 2009; Bell and Koren 2007; Bell et al. 2010), and they have been part of Google’s system ever since the very first implementation of the PageRank algorithm (Page et al. 1999; Brin and Page 2012). Most important for our present endeavors, however, is that a low-rank approximation to the full connectivity matrix is expected to make accurate predictions when predictor variables are highly correlated.
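The diagonal shift and the rank-*r* truncation can be sketched in a few lines. The example assumes the convention \(U = Q_r \Lambda_r^{1/2}\) (the *r* leading eigenvectors scaled by the square roots of their eigenvalues) and picks *c* as the smallest shift that makes all eigenvalues non-negative; the function name `low_rank_factor` and the random matrix are illustrative only.

```python
import numpy as np

def low_rank_factor(sigma, r):
    """Return (U, c, eigenvalues) with U U^T the rank-r approximation of sigma + c I."""
    eigvals, eigvecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    c = max(0.0, -eigvals[0])                      # shift so that no eigenvalue is negative
    shifted = eigvals + c
    top = np.argsort(shifted)[::-1][:r]           # indices of the r largest eigenvalues
    u = eigvecs[:, top] * np.sqrt(shifted[top])   # p x r matrix U = Q_r Lambda_r^{1/2}
    return u, c, np.sort(shifted)[::-1]

rng = np.random.default_rng(2)
p, r = 8, 2
a = rng.normal(size=(p, p))
sigma = (a + a.T) / 2.0
np.fill_diagonal(sigma, 0.0)                      # the diagonal is not identifiable

u, c, shifted = low_rank_factor(sigma, r)
target = sigma + c * np.eye(p)                    # same off-diagonal elements as sigma
error = np.linalg.norm(target - u @ u.T, ord="fro")

# Eckart-Young: the least squares error is the root sum of squares
# of the eigenvalues that were set to zero.
expected_error = np.sqrt(np.sum(shifted[r:] ** 2))

n_params_full = p * (p - 1) // 2 + p              # interactions plus main effects
n_params_low_rank = p * r + p                     # entries of U plus main effects
```

For fixed *r* the low-rank parameter count grows linearly in *p*, whereas the count for the full connectivity matrix grows quadratically.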

## 4 Step III: an approximate latent variable distribution for missing data

Given the data *x*, we have a multivariate normal posterior distribution for the latent variables with mean \(U^\mathsf{T}x\) and variance \(2 I_p\). Even though the latent variable distribution \(f(\eta )\) is computationally intractable, it often takes a simple form in each of its dimensions. Specifically, it closely resembles either a single normal distribution with a mean of zero or a mixture of two normal distributions with their respective means placed symmetrically about zero.

*U*, \(\mu \) (and \(\eta \)) in the MIRT model. By replacing the latent variable distribution \(f(\eta )\) with a distribution \(g(\eta )\), the parameters are placed on a different scale. This indeterminacy does not affect the marginal distribution of observable variables (i.e., predictions), although it does affect parameter recovery.

The validity of our approximate model rests on how well \(f(\eta )\) is approximated by \(g(\eta )\). To assess the validity of this approach, we can make use of recent advances in plausible value methodology and explicitly consider whether or not the true latent variable distribution is equal (or similar) to \(g(\eta )\). Plausible values are draws from the posterior distribution of the latent variables \(\eta \) (Mislevy 1991; von Davier et al. 2009) and are commonly used in large-scale educational surveys to accommodate researchers in the field that are not able to estimate the complex IRT models used for these surveys. Recently, it was shown that the marginal distribution of plausible values is a consistent estimator of the true latent variable distribution \(f(\eta )\) (Marsman et al. 2016), meaning that one can assess the validity of using a single multivariate normal distribution by inspecting the (marginal) distribution of plausible values.

## 5 Numerical illustrations

### 5.1 Generating correlated binary data

*Q* and \(\mu \) will be sampled uniformly between \(-0.1/p\) and \(0.1/p\) for the *p*-variable networks (*Q* is made orthogonal), where scaling by \(p^{-1}\) ensured similar dynamics for the different sized networks.

Data were generated from the Ising model using a Gibbs sampler (Geman and Geman 1984) applied to the joint distribution \(f(x\text {, }\eta )\) of the latent variables \(\eta \) and the data *x*. In each iteration of the Gibbs sampler, we sample from the full-conditional posterior distribution \(f(\eta \mid x)\) of the latent variables, and the full-conditional distribution \(\mathbb {P}(x \mid \eta )\) of the data. Both full-conditional distributions are easy to sample from; the posterior distribution of the latent variable \(f(\eta \mid x)\) is a multivariate normal distribution with mean vector \(U^\mathsf{T}x\) (where \(U^\mathsf{T} = Q^\mathsf{T}{\Lambda }^{\frac{1}{2}}\) as before) and a variance-covariance matrix \(2I_{10}\), and \(\mathbb {P}(x \mid \eta )\) is a ten-dimensional IRT model.
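The alternating scheme can be sketched as follows. The sketch assumes \(x_i \in \{-1, +1\}\) and a joint density \(f(x, \eta ) \propto \exp (\mu ^\mathsf{T}x + \tfrac{1}{2}\eta ^\mathsf{T}U^\mathsf{T}x - \tfrac{1}{4}\eta ^\mathsf{T}\eta )\); these scaling constants are an assumption, chosen so that \(f(\eta \mid x)\) is normal with mean \(U^\mathsf{T}x\) and covariance \(2I\) as stated above. Under that assumption the variables are independent given \(\eta \), with \(\mathbb{P}(x_i = +1 \mid \eta ) = 1 / (1 + \exp (-(2\mu _i + u_i^\mathsf{T}\eta )))\), where \(u_i\) is the *i*th row of the \(p \times r\) matrix *U*. The function name and the parameter values are illustrative.

```python
import numpy as np

def gibbs_ising(mu, u, n_samples, burn_in=200, seed=0):
    """Draw binary vectors by alternating eta | x and x | eta."""
    rng = np.random.default_rng(seed)
    p, r = u.shape
    x = rng.choice([-1.0, 1.0], size=p)           # arbitrary starting state
    draws = np.empty((n_samples, p))
    for t in range(burn_in + n_samples):
        # eta | x is multivariate normal with mean U^T x and covariance 2 I_r.
        eta = u.T @ x + np.sqrt(2.0) * rng.standard_normal(r)
        # x | eta factorizes: each variable follows a logistic (IRT-type) model.
        prob_plus = 1.0 / (1.0 + np.exp(-(2.0 * mu + u @ eta)))
        x = np.where(rng.random(p) < prob_plus, 1.0, -1.0)
        if t >= burn_in:
            draws[t - burn_in] = x
    return draws

rng = np.random.default_rng(3)
p, r = 6, 2
mu = rng.uniform(-0.1, 0.1, size=p)
u = rng.normal(scale=0.5, size=(p, r))
x_train = gibbs_ising(mu, u, n_samples=500)
```

Consecutive draws from a single chain are autocorrelated; thinning the chain or running independent chains is a design choice that this sketch leaves out.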

### 5.2 Estimating the MIRT model

Estimation with the Gibbs sampler requires prior distributions for the parameters *U* and \(\mu \). We will use logistic prior distributions with location 0 and scale 1 for both *U* and \(\mu \).

The full-conditional distributions of *U* and \(\mu \) are also intractable: they involve a product over the *n* observations, indexed by *v*. The problem of sampling from these full-conditional distributions has been addressed in several places (e.g., Patz and Junker 1999a, b; Maris and Maris 2002). We use a Metropolis approach (Metropolis et al. 1953; Hastings 1970; Tierney 1994) that was specifically designed to handle full-conditional distributions of this form (Marsman et al. 2015, 2017).

### 5.3 Calculating prediction accuracy

We express the prediction accuracy \(c(x\text {, }x^*)\) as the percentage of correct predictions among the *n* predictions that are made.

Since each of the *p* variables could be used as a dependent variable in logistic regression (there are *p* full-conditionals \(\mathbb {P}(x_i \mid x_{\setminus i})\)), we calculate the prediction accuracy for each of the *p* variables and then average them. We furthermore repeat each procedure five times and average the results.
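The averaging over variables can be sketched as below. The sketch assumes that accuracy is the percentage of predictions that match the data, computed per variable (column) and then averaged; the function name and the optional `mask` argument, which restricts the comparison to a subset of entries such as the observed or the missing part, are illustrative.

```python
import numpy as np

def prediction_accuracy(x_obs, x_pred, mask=None):
    """Percentage of matching entries per variable (column), averaged over variables."""
    x_obs = np.asarray(x_obs)
    x_pred = np.asarray(x_pred)
    hits = (x_obs == x_pred).astype(float)
    if mask is not None:
        hits = np.where(mask, hits, np.nan)     # ignore entries outside the mask
    per_variable = np.nanmean(hits, axis=0)     # accuracy for each of the p variables
    return 100.0 * np.nanmean(per_variable)

x_obs = np.array([[1, -1, 1], [1, 1, -1], [-1, -1, 1], [1, -1, -1]])
x_pred = np.array([[1, -1, -1], [1, -1, -1], [-1, -1, 1], [-1, -1, -1]])
accuracy_all = prediction_accuracy(x_obs, x_pred)

# Restrict the comparison to the first observation only.
first_row = np.zeros(x_obs.shape, dtype=bool)
first_row[0] = True
accuracy_first_row = prediction_accuracy(x_obs, x_pred, mask=first_row)
```

In this toy example each variable is predicted correctly in three of four rows, so `accuracy_all` is 75.0.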

### 5.4 The two stages of our prediction procedure

Our prediction procedure consists of two stages, a training stage and a testing stage.

#### 5.4.1 The training stage comprises six steps

- (1) Generate training data \(x_{\text {train}}\) from the Ising model.
- (2) Generate new predictions \(x^*\) from the Ising model and compute \(c_{\text {true}} = c (x_{\text {train}}\text {, }x^*)\).
- (3) Split the data into an observed part \(x_{\text {train}}^{(O)}\) and a \(0\%\), \(50\%\) or \(90\%\) missing part \(x_{\text {train}}^{(M)}\). The missing part \(x_{\text {train}}^{(M)}\) will only be used to evaluate predictions.
- (4) Use the Gibbs sampler to estimate the IRT parameters \(\theta _{\text {train}} = \{U\text {, }\mu \}\) and the latent variables \(\eta _{\text {train}}\) using the observed training data \(x_{\text {train}}^{(O)}\).
- (5) Generate new predictions from the IRT model on the observed part, $$\begin{aligned} x_{*}^{(O)} \sim \mathbb {P}(x \mid \eta _{\text {train}}\text {, }\theta _{\text {train}}), \end{aligned}$$ and compute \(c_{\text {train}}^{(O)} = c(x_{\text {train}}^{(O)}\text {, }x_{*}^{(O)})\).
- (6) Generate new predictions from the IRT model on the missing part, $$\begin{aligned} x_{*}^{(M)} \sim \mathbb {P}(x \mid \eta _{\text {train}}\text {, }\theta _{\text {train}}), \end{aligned}$$ and compute \(c_{\text {train}}^{(M)} = c(x_{\text {train}}^{(M)}\text {, }x_{*}^{(M)})\).

These six steps constitute the *training stage*. Observe that the missing data were not used to estimate the parameters \(\theta _{\text {train}}\) and \(\eta _{\text {train}}\).

#### 5.4.2 The testing stage comprises five steps

- (7) Generate testing data \(x_{\text {test}}\) from the Ising model.
- (8) Split the data into an observed part \(x_{\text {test}}^{(O)}\) and a \(0\%\), \(50\%\) or \(90\%\) missing part \(x_{\text {test}}^{(M)}\). The missing part will only be used to evaluate predictions.
- (9) Use the Gibbs sampler to estimate the latent variables \(\eta _{\text {test}}\) using the *observed* testing data \(x_{\text {test}}^{(O)}\) and the IRT parameters \(\theta _{\text {train}}\) obtained from the training stage, e.g., step (4).
- (10) Generate new predictions from the IRT model on the observed part, $$\begin{aligned} x_{*}^{(O)} \sim \mathbb {P}(x \mid \eta _{\text {test}}\text {, }\theta _{\text {train}}), \end{aligned}$$ and compute \(c_{\text {test}}^{(O)}= c(x_{\text {test}}^{(O)}\text {, }x_{*}^{(O)})\).
- (11) Generate new predictions from the IRT model on the missing part, $$\begin{aligned} x_{*}^{(M)} \sim \mathbb {P}(x \mid \eta _{\text {test}}\text {, }\theta _{\text {train}}), \end{aligned}$$ and compute \(c_{\text {test}}^{(M)}=c(x_{\text {test}}^{(M)}\text {, }x_{*}^{(M)})\).

These five steps constitute the *testing stage*. Observe that the *testing* data were not used to estimate the IRT parameters \(\theta _{\text {train}}\) and that only the latent variables \(\eta _{\text {test}}\) were estimated on the *observed* part from the *testing* data.

### 5.5 Illustration I: Collinearity

The prediction accuracy reported in Table 1 is similar across the different values of *n* and *p*, ensuring that the procedure scales when more observations *n* and/or more variables *p* become available. This scalability is important for situations where the number of observations *n* becomes too large, and one has to use a selection of the available observations, with the estimated IRT model cross-validating well in such applications.

Table 1. Prediction accuracy for the two-dimensional IRT model applied to correlated data

| | | \(c_{\text {true}}\) | \(c_{\text {train}}\) | \(c_{\text {test}}\) |
|---|---|---|---|---|
| \(\tau = 1.0\) network data | | | | |
| 100 | 1000 | 81 | 81 | 80 |
| 100 | 100 | 80 | 80 | 79 |
| 1000 | 100 | 81 | 81 | 81 |
| 1000 | 1000 | 82 | 82 | 82 |
| \(\tau = 0.5\) network data | | | | |
| 100 | 1000 | 91 | 91 | 90 |
| 100 | 100 | 91 | 91 | 91 |
| 1000 | 100 | 92 | 92 | 92 |
| 1000 | 1000 | 91 | 90 | 89 |

### 5.6 Illustration II: Ignorable missing data

Table 2 reports the prediction accuracy of a two-dimensional IRT model applied to data generated from a \(\tau = 1.0\) network, with either \(50\%\) or \(90\%\) of the data missing completely at random (MCAR; see Appendix A). Similarly, in Table 3 we report the prediction accuracy of a two-dimensional IRT model applied to data generated from a \(\tau = 0.5\) network, with either \(50\%\) or \(90\%\) of the data MCAR. We report both the accuracy in predicting the observed data \(c^{(O)}_{\text {train}} =c(x_{\text {train}}^{(O)}\text {, }x_{*}^{(O)})\), and the accuracy in predicting the missing data \(c^{(M)}_{\text {train}} =c(x_{\text {train}}^{(M)}\text {, }x_{*}^{(M)})\). Since the true model cannot be used with missing observations, we evaluate the predictions from the true model using the completely observed training data: \(c_{\text {true}} = c(x_{\text {train}}\text {, }x_{*})\).
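Splitting a data matrix into an observed and an MCAR missing part can be sketched as follows. The sketch assumes that every entry is dropped independently with the same probability, which is one simple way to realize MCAR; the function names and the use of NaN as a missing-value marker are illustrative choices.

```python
import numpy as np

def mcar_mask(shape, fraction_missing, seed=0):
    """Boolean mask, True where an entry is missing; entries are dropped independently."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < fraction_missing

def split_observed_missing(x, missing):
    """Return two float copies of x, with NaN marking entries outside each part."""
    x = np.asarray(x, dtype=float)
    observed_part = np.where(missing, np.nan, x)   # used for estimation
    missing_part = np.where(missing, x, np.nan)    # held back to evaluate predictions
    return observed_part, missing_part

rng = np.random.default_rng(5)
x = rng.choice([-1.0, 1.0], size=(100, 20))
missing = mcar_mask(x.shape, fraction_missing=0.5)
x_observed, x_missing = split_observed_missing(x, missing)
```

Because the mask is generated without looking at `x`, the missingness carries no information about the data values, which is what makes it ignorable.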

Table 2. Prediction accuracy for the two-dimensional IRT model applied to \(\tau = 1.0\) network data with ignorable missing observations

| | | \(c_{\text {true}}\) | \(c_{\text {train}}^{(O)}\) | \(c_{\text {train}}^{(M)}\) | \(c_{\text {test}}^{(O)}\) | \(c_{\text {test}}^{(M)}\) |
|---|---|---|---|---|---|---|
| \(50\%\) missing observations | | | | | | |
| 100 | 1000 | 81 | 82 | 81 | 80 | 80 |
| 100 | 100 | 81 | 82 | 79 | 79 | 79 |
| 1000 | 100 | 81 | 80 | 80 | 80 | 80 |
| 1000 | 1000 | 81 | 82 | 81 | 81 | 81 |
| \(90\%\) missing observations | | | | | | |
| 100 | 1000 | 82 | 86 | 79 | 78 | 77 |
| 100 | 100 | 81 | 88 | 71 | 78 | 71 |
| 1000 | 100 | 80 | 81 | 76 | 79 | 76 |
| 1000 | 1000 | 81 | 82 | 81 | 81 | 80 |

Table 3. Prediction accuracy for the two-dimensional IRT model applied to \(\tau = 0.5\) network data with ignorable missing observations

| | | \(c_{\text {true}}\) | \(c_{\text {train}}^{(O)}\) | \(c_{\text {train}}^{(M)}\) | \(c_{\text {test}}^{(O)}\) | \(c_{\text {test}}^{(M)}\) |
|---|---|---|---|---|---|---|
| \(50\%\) missing observations | | | | | | |
| 100 | 1000 | 91 | 91 | 90 | 90 | 90 |
| 100 | 100 | 92 | 92 | 91 | 91 | 91 |
| 1000 | 100 | 92 | 92 | 91 | 92 | 91 |
| 1000 | 1000 | 92 | 91 | 90 | 90 | 90 |
| \(90\%\) missing observations | | | | | | |
| 100 | 1000 | 91 | 91 | 85 | 85 | 84 |
| 100 | 100 | 92 | 93 | 83 | 89 | 84 |
| 1000 | 100 | 91 | 90 | 86 | 90 | 86 |
| 1000 | 1000 | 91 | 91 | 90 | 91 | 90 |

#### 5.6.1 Logistic regression

Table 4. Prediction accuracy for the logistic regression model applied to \(n = 1000\) observations of \(p = 100\) correlated variables (one dependent and \(p-1\) covariates)

| | \(\tau = 1.0\): observed | \(\tau = 1.0\): missing | \(\tau = 1.0\): % missing | \(\tau = 0.5\): observed | \(\tau = 0.5\): missing | \(\tau = 0.5\): % missing |
|---|---|---|---|---|---|---|
| \(c_{\text {true}}\) | 80 | – | 0 | 91 | – | 0 |
| \(c_{\text {train}}\) | 72 | – | 0 | 76 | – | 0 |
| \(c_{\text {test}}\) | 72 | – | 0 | 76 | – | 0 |
| \(c_{\text {train}}\) | 92 | 59 | 50 | 92 | 55 | 50 |
| \(c_{\text {test}}\) | 94 | 59 | 50 | 93 | 55 | 50 |
| \(c_{\text {train}}\) | 97 | 62 | 90 | 97 | 65 | 90 |
| \(c_{\text {test}}\) | 99 | 61 | 90 | 99 | 64 | 90 |

The results that are reported in Tables 2 and 3 are based on data with missing observations. To allow a meaningful comparison between logistic regression and MIRT when some observations are missing, we use multiple imputation to complete the datasets. Unfortunately, we now encounter a serious complication. To impute the missing observations, we need to formulate an *a priori* distribution for the missing observations (cf. Ibrahim et al. 2005). The most straightforward solution is to specify a joint distribution for a \((p + 1)\)-dimensional vector of binary variables *x*. That is, we assume that the variables are dependent on each other and inform about each other’s missing values. Even though this is a straightforward strategy, it will boil down to an *a priori* distribution for the missing observations that is at least as complex as the computationally intractable Ising model.

We therefore assume instead that the missing observations are *a priori* independent. Specifically, we assume for each missing observation that the prior probability that its value is \(+1\) is equal to some number \(\pi \). We use the value \(\pi = 0.5\) since we have no *a priori* preference for a particular value of the missing observation. Denote the logistic regression model as \(\mathbb {P}(y \mid x)\), with parameters \(\alpha \) and \(\beta \).

We use the Gibbs sampler to estimate the logistic regression model and use logistic prior distributions with location 0 and scale 1 for the model’s parameters \(\alpha \) and \(\beta \). Our imputation strategy expands the Gibbs sequence by two distinct steps. In the first step, missing values for the dependent variable are drawn from the predictive distribution \(\mathbb {P}(y \mid x)\) (i.e., the logistic regression model). In the second step, missing values for each of the *p* covariates are drawn from their respective posterior distributions \({\mathbb {P}}(x_i \mid y,\,x_{\setminus i})\). After these two steps the data are complete and we can simulate the model’s parameters \(\alpha \) and \(\beta \) from their full-conditional posterior distributions as if all data had been observed.
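The covariate imputation step can be sketched as follows. The sketch assumes an outcome \(y \in \{0, 1\}\), covariates coded \(\pm 1\), and the model \(\mathbb{P}(y = 1 \mid x) = 1 / (1 + \exp (-(\alpha + x^\mathsf{T}\beta )))\); this coding and all function names are assumptions made for illustration. With an independent prior \(\pi \), Bayes' rule gives \(\mathbb{P}(x_i = +1 \mid y, x_{\setminus i}) \propto \pi \, \mathbb{P}(y \mid x_i = +1, x_{\setminus i})\).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(y, x, alpha, beta):
    """P(y | x) under the logistic regression model, with y in {0, 1}."""
    p1 = logistic(alpha + x @ beta)
    return p1 if y == 1 else 1.0 - p1

def prob_covariate_plus(i, y, x, alpha, beta, prior=0.5):
    """P(x_i = +1 | y, x_{-i}) with an independent prior P(x_i = +1) = prior."""
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[i], x_minus[i] = 1.0, -1.0
    w_plus = prior * likelihood(y, x_plus, alpha, beta)
    w_minus = (1.0 - prior) * likelihood(y, x_minus, alpha, beta)
    return w_plus / (w_plus + w_minus)

def impute_row(y, x, missing, alpha, beta, rng, prior=0.5):
    """One pass over the missing covariates of a single observation.

    x must hold a current value (+1 or -1) for every entry, including the
    missing ones, as it would in the middle of a Gibbs sequence.
    """
    x = x.copy()
    for i in np.flatnonzero(missing):
        p_plus = prob_covariate_plus(i, y, x, alpha, beta, prior)
        x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x

rng = np.random.default_rng(4)
alpha, beta = 0.0, np.array([0.0, 2.0, -1.0])
x_current = np.array([1.0, -1.0, 1.0])
missing = np.array([True, True, False])

p_irrelevant = prob_covariate_plus(0, 1, x_current, alpha, beta)  # beta_0 = 0
p_given_y1 = prob_covariate_plus(1, 1, x_current, alpha, beta)
p_given_y0 = prob_covariate_plus(1, 0, x_current, alpha, beta)
x_imputed = impute_row(1, x_current, missing, alpha, beta, rng)
```

A covariate with a zero coefficient is imputed from the prior alone, whereas a covariate with a large coefficient is pulled toward the value that makes the observed outcome more likely.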

We generate 300 datasets from the \(\tau = 1.0\) network and 300 datasets from the \(\tau = 0.5\) network and randomly remove \(0\%\), \(50\%\) or \(90\%\) of the observations. The prediction accuracy for logistic regression applied to these datasets is reported in Table 4 and reveals three important results. The first result is that the MIRT model performs better on the completely observed data than the logistic regression model. This is likely due to collinearity, as the relative performance of logistic regression deteriorates with increasing correlations. For instance, when compared to the true model’s prediction accuracy we observe a drop in accuracy of 8 percentage points for the \(\tau = 1.0\) network data and of 15 percentage points for the \(\tau = 0.5\) network data.

The second important result is that the logistic regression model’s prediction accuracy on observed data is much improved when missing observations are introduced. In fact, logistic regression outperforms both the MIRT model and the true generating model on the remaining—observed—data. (This improvement was only seen in the dependent variable, not the covariates.) This striking increase in accuracy is due to the way that we impute the missing values. The imputation distribution \({\mathbb {P}}(x_i \mid y,\, x_{\setminus i})\) tends to favor values that make the observed outcomes *y* more likely and minimize \(|(2y-1) - \mathbb {P}(y\mid x)|\). As a result, prediction accuracy increases when more observations are missing and the model over-fits the remaining observed data.

The final important result that we observe from Table 4 is the poor prediction accuracy on the missing observations. Compared to the MIRT model the accuracy of predicting missing values drops approximately 25–\(35\%\). This striking difference between the accuracy on the missing data and on the observed data is a clear illustration of the poor cross-validation that follows from over-fitting on the observed data points. This is particularly problematic when one aims to predict non-observed data points, e.g., classification of future preferences: judging by the predictions of the observed data one may believe the model is doing quite well, while it is in fact doing very poorly in predicting non-observed data.

### 5.7 Illustration III: Nonignorable missing data

From the results that are reported in Tables 2 and 3 we have learned that the two-dimensional IRT model provides accurate predictions when applied to correlated data where some of the observations are MCAR. Since there is no additional difficulty for the case when the data are MAR instead of MCAR, we consider here the situation where the IRT model is applied to data where the missing data mechanism is nonignorable, i.e., not missing at random (NMAR; see Appendix A). We compare situations where either the training data, the testing data, or both the training data and the testing data have \(50\%\) data NMAR or \(50\%\) data MCAR.

*U*, for the two situations described by Fig. 5a, c, respectively, against the estimates that are obtained from data where the missing observations are MCAR. Clearly, ignoring the missing data mechanism biases the estimates of \(u_{1}\), especially for the severe nonignorability case.

Table 5. Prediction accuracy of the two-dimensional IRT model applied to \(\tau =1.0\) network data with moderate nonignorable missingness

| \(x_{\text {train}}\) | \(x_{\text {test}}\) | | | \(c_{\text {true}}\) | \(c_{\text {train}}^{(O)}\) | \(c_{\text {train}}^{(M)}\) | \(c_{\text {test}}^{(O)}\) | \(c_{\text {test}}^{(M)}\) |
|---|---|---|---|---|---|---|---|---|
| NMAR | NMAR | 100 | 1000 | 81 | 81 | 80 | 77 | 79 |
| NMAR | NMAR | 100 | 100 | 81 | 81 | 79 | 77 | 78 |
| NMAR | NMAR | 1000 | 100 | 81 | 79 | 79 | 78 | 79 |
| NMAR | NMAR | 1000 | 1000 | 82 | 79 | 80 | 79 | 80 |
| NMAR | MCAR | 100 | 1000 | 81 | 80 | 79 | 77 | 78 |
| NMAR | MCAR | 100 | 100 | 80 | 80 | 77 | 75 | 76 |
| NMAR | MCAR | 1000 | 100 | 80 | 77 | 77 | 76 | 77 |
| NMAR | MCAR | 1000 | 1000 | 82 | 80 | 80 | 79 | 80 |
| MCAR | NMAR | 100 | 1000 | 81 | 82 | 81 | 80 | 80 |
| MCAR | NMAR | 100 | 100 | 80 | 81 | 78 | 78 | 78 |
| MCAR | NMAR | 1000 | 100 | 81 | 80 | 80 | 80 | 80 |
| MCAR | NMAR | 1000 | 1000 | 81 | 81 | 81 | 81 | 81 |

Table 6. Prediction accuracy of the two-dimensional IRT model applied to \(\tau =1.0\) network data with severe nonignorable missingness

| \(x_{\text {train}}\) | \(x_{\text {test}}\) | | | \(c_{\text {true}}\) | \(c_{\text {train}}^{(O)}\) | \(c_{\text {train}}^{(M)}\) | \(c_{\text {test}}^{(O)}\) | \(c_{\text {test}}^{(M)}\) |
|---|---|---|---|---|---|---|---|---|
| NMAR | NMAR | 100 | 1000 | 81 | 82 | 65 | 72 | 73 |
| NMAR | NMAR | 100 | 100 | 82 | 83 | 61 | 73 | 71 |
| NMAR | NMAR | 1000 | 100 | 82 | 81 | 63 | 75 | 75 |
| NMAR | NMAR | 1000 | 1000 | 81 | 81 | 64 | 76 | 75 |
| NMAR | MCAR | 100 | 1000 | 81 | 82 | 64 | 72 | 73 |
| NMAR | MCAR | 100 | 100 | 81 | 82 | 59 | 71 | 70 |
| NMAR | MCAR | 1000 | 100 | 79 | 79 | 56 | 71 | 69 |
| NMAR | MCAR | 1000 | 1000 | 81 | 81 | 64 | 75 | 74 |
| MCAR | NMAR | 100 | 1000 | 81 | 82 | 81 | 80 | 80 |
| MCAR | NMAR | 100 | 100 | 81 | 83 | 79 | 79 | 79 |
| MCAR | NMAR | 1000 | 100 | 80 | 79 | 79 | 79 | 78 |
| MCAR | NMAR | 1000 | 1000 | 81 | 81 | 81 | 81 | 81 |

## 6 Discussion

We have illustrated that the combination of the Ising model and its latent variable approximation can be used to overcome collinearity and missing data issues in prediction applications of the logistic regression model. The prediction accuracy of the latent variable model on the observed data compares favorably to the true model used in Illustrations I–III (e.g., Fig. 2). The latent variable model was also able to accurately predict the non-observed data points in Illustration II (e.g., Tables 2, 3) and compares favorably to our illustration of the logistic regression model (e.g., Table 4). The model has its limits, which was clearly demonstrated in Illustration III using nonignorable missing data mechanisms. The prediction accuracy deteriorates when missing data mechanisms have a strong effect on the data that is used to train the model (e.g., Tables 5, 6). However, even with the poor quality of the data that were used to train the model in Illustration III, the prediction accuracy of the latent variable model compares favorably to our application of the logistic regression model (e.g., compare Tables 4 and 6). Therefore, we believe that our approach and the associated latent variable model are superior to the logistic regression model in prediction settings with correlated covariates and/or missing observations.

We have considered a specific prediction setting with only binary random variables for our approach to overcome collinearity and missing data. Observe, however, that the two primary ideas that form our approach are entirely general. The first idea is to model the joint distribution of dependent and independent variables when the variables are correlated, e.g., with Ising network models, Gaussian graphical models (Lauritzen 1996), or mixtures thereof (Olkin and Tate 1961; Lauritzen and Wermuth 1989). A specific feature of these models is that they explicitly model the correlations between variables and thus overcome collinearity issues in conditional regression models. However, the application of these models will also increase the number of parameters that need to be estimated. Our second idea is to approximate the full graphical model with a low-rank latent variable model, e.g., an IRT model, a factor model or mixtures thereof. This has two important benefits: it reduces the number of parameters that need to be estimated and introduces an elegant way of handling missing data. We believe that these two general ideas will inspire new avenues of future research and furthermore offer practical solutions to issues that are widespread in large-scale applications.

The latent variables \(\eta \) in this paper are used to summarize the observed data in order to make predictions. However, the latent variable, in combination with the model’s parameters \(\theta = \{U\text {, }\mu \}\), can also be used to inform about the structure of the prediction problem. For instance, in regular applications of IRT models to educational tests (e.g., an end of primary school test), the model informs about the dimensionality of the test (e.g., separates a mathematics and language dimension) and informs how different aspects of the test covary across the ability spectrum. In much the same way we may study the latent variable model in prediction settings, which might improve the model and its predictions. For example, in psychological data we nearly always observe a pattern of positive correlations (in contrast to Fig. 3), which means that the entries in the first eigenvector tend to have the same sign. Without much loss of accuracy we can then replace the *p* unknown values in \(u_1\) with a single (positive) number. This would significantly reduce the number of parameters that we need to estimate, and we obtain sparse models that can make accurate predictions.

## Notes

### Compliance with ethical standards

### Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

## References

- Ackerman T (1996) Developments in multidimensional item response theory. Appl Psychol Measure 20:309–310
- Ackerman T, Gierl M, Walker C (2003) Using multidimensional item response theory to evaluate educational and psychological tests. Educ Measure Issues Pract 22:37–51
- Anderson C, Vermunt J (2000) Log-multiplicative association models as latent variable models for nominal and/or ordinal data. Sociol Methodol 30:81–121
- Bartlett P, Jordan M, McAuliffe J (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101:138–156
- Bell R, Koren Y (2007) Lessons from the Netflix prize challenge. ACM SIGKDD Explor Newslett 9:75–79
- Bell R, Koren Y, Volinsky C (2010) All together now: a perspective on the Netflix prize. Chance 23:24–29
- Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc Ser B (Methodol) 36:192–236
- Borsboom D, Molenaar D (2015) Psychometrics. In: Wright J (ed) International Encyclopedia of the Social and Behavioral Sciences, vol 19, 2nd edn, pp 418–422
- Brin S, Page L (2012) Reprint of: the anatomy of a large-scale hypertextual web search engine. Comput Netw 56:3825–3833
- Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1:211–218
- Eggen TJHM (2004) Contributions to the theory and practice of computerized adaptive testing. Ph.D. thesis, University of Twente, Enschede, The Netherlands
- Emch G, Knops H (1970) Pure thermodynamical phases as extremal KMS states. J Math Phys 11:3008–3018
- Epskamp S, Maris G, Waldorp L, Borsboom D (2017) Network psychometrics. In: Irwing P, Hughes D, Booth T (eds) Handbook of psychometrics. Wiley, New York (in press)
- Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Patt Anal Mach Intell 6:721–741
- Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press
- Hastings W (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
- Heitjan D (1994) Ignorability in general incomplete-data models. Biometrika 81:701–708
- Ibrahim J, Chen M, Lipsitz S, Herring A (2005) Missing-data methods for generalized linear models: a comparative review. J Am Stat Assoc 100:332–346
- Ising E (1925) Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik 31:253–258
- Kac M (1968) Mathematical mechanisms of phase transitions. In: Chretien M, Gross E, Deser S (eds) Statistical physics: phase transitions and superfluidity, vol 1. Brandeis University Summer Institute in Theoretical Physics, pp 241–305. Gordon and Breach Science Publishers, New York
- Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42:30–37
- Lauritzen S (1996) Graphical models. Oxford University Press
- Lauritzen S, Wermuth N (1989) Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann Stat 17(1):31–57
- Lenz W (1920) Beiträge zum verständnis der magnetischen eigenschaften in festen körpern. Physikalische Zeitschrift 21:613–615
- Little R, Rubin D (1987) Statistical analysis with missing data. Wiley, New York
- Maris G, Maris E (2002) A MCMC-method for models with continuous latent responses. Psychometrika 67:335–350
- Marsman M, Maris G, Bechger T, Glas C (2015) Bayesian inference for low-rank Ising networks. Sci Rep 5(9050):1–7
- Marsman M, Maris G, Bechger T, Glas C (2016) What can we learn from Plausible values? Psychometrika. doi: 10.1007/s11336-016-9497-x
- Marsman M, Maris G, Bechger T, Glas C (2017) Turning simulation into estimation: generalized exchange algorithms for exponential family models. PLoS One 12(e0169787):1–15
- McCullagh P (1994) Exponential mixtures and quadratic exponential families. Biometrika 81:721–729
- Metropolis N, Rosenbluth A, Rosenbluth M, Teller A (1953) Equation of state calculations by fast computing machines. J Chem Phys 21:1087–1092
- Mislevy R (1991) Randomization-based inference about latent variables from complex samples. Psychometrika 56:177–196
- Olkin I, Tate R (1961) Multivariate correlation models with mixed discrete and continuous variables. Ann Math Stat 32:448–465
- Page L, Brin S, Motwani R, Winograd T (1999) The pagerank citation ranking: bringing order to the web
- Patz R, Junker B (1999) Applications and extensions of MCMC in IRT: multiple item types, missing data, and rated responses. J Educ Behav Stat 24:342–366
- Patz R, Junker B (1999) A straightforward approach to Markov chain Monte Carlo methods for item response models. J Educ Behav Stat 24:146–178
- Reckase M (2009) Multidimensional item response theory. Springer
- Rousseeuw P, van den Bossche W (2016) Detecting deviating data cells. arXiv preprint arXiv:1601.07251
- Rubin D (1976) Inference and missing data. Biometrika 63:581–592
- Rubin D (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
- Tierney L (1994) Markov chains for exploring posterior distributions. Ann Stat 22:1701–1762
- von Davier M, Gonzalez E, Mislevy R (2009) What are plausible values and why are they useful? In: von Davier M, Hastedt D (eds) IERI monograph series: issues and methodologies in large scale assessments, vol 2. IEA-ETS Research Institute

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.