A Nonparametric Test of Missing Completely at Random for Incomplete Multivariate Data


Abstract

Missing data occur in many real-world studies. Knowing the type of missing-data mechanism is important for choosing an appropriate statistical analysis procedure. Many statistical methods assume data are missing completely at random (MCAR) because of the simplicity this affords, so it is necessary to test whether this assumption is satisfied before applying those procedures. In the literature, most procedures for testing MCAR were developed under a normality assumption, which is sometimes difficult to justify in practice. In this paper, we propose a nonparametric test of MCAR for incomplete multivariate data that does not require distributional assumptions. The proposed test compares the distributions of the observed data across the different missing-pattern groups. We prove that the proposed test is consistent against any distributional differences in the observed data. Simulations show that the proposed procedure controls the Type I error at the nominal level and has good power against a variety of non-MCAR alternatives.
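
To make the procedure concrete, here is a minimal computational sketch of the test statistic \(Q\) defined in the Appendix (not the authors' implementation): observations are grouped by missing pattern, and each pair of groups is compared through the two-sample energy distance of Székely and Rizzo (2005) computed on the variables observed in both patterns. All names are illustrative, and the within-group term \(g\) is assumed here to be the mean within-group pairwise distance, which is consistent with the \(U\)-statistic property invoked in the Appendix but is defined precisely only in the paper's main text.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_dist(a, b):
    """Mean Euclidean distance between every row of a and every row of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).mean()

def energy_distance(x, y):
    """Two-sample energy distance (V-statistic form) of Szekely and Rizzo:
    2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    return (2 * mean_pairwise_dist(x, y)
            - mean_pairwise_dist(x, x)
            - mean_pairwise_dist(y, y))

def q_statistic(data):
    """Q = (B / (s - 1)) / (W / (n - s)) over missing-pattern groups.

    data: (n, p) array with np.nan marking missing entries.  Assumes at
    least two distinct missing patterns; each pair of pattern groups is
    compared on the variables observed in *both* patterns.
    """
    n = data.shape[0]
    observed = ~np.isnan(data)
    groups = {}
    for obs_row, row in zip(observed, data):
        groups.setdefault(tuple(obs_row), []).append(row)
    masks = [np.array(k) for k in groups]            # observed masks o_i
    samples = [np.array(v) for v in groups.values()]
    s = len(masks)

    # Between-group term B: pairs sharing at least one observed variable.
    B = 0.0
    for i, j in combinations(range(s), 2):
        o_ij = masks[i] & masks[j]
        if not o_ij.any():
            continue                                 # o_ij is empty: skip pair
        n_i, n_j = len(samples[i]), len(samples[j])
        d = energy_distance(samples[i][:, o_ij], samples[j][:, o_ij])
        B += (n_i * n_j / (2.0 * n)) * d

    # Within-group term W; g is taken to be the mean within-group pairwise
    # distance on the group's own observed variables (an assumption).
    W = 0.0
    for o_i, y_i in zip(masks, samples):
        W += len(y_i) * mean_pairwise_dist(y_i[:, o_i], y_i[:, o_i]) / 2.0

    return (B / (s - 1)) / (W / (n - s))
```

The sketch only computes the statistic; in practice the critical value \(c_{\alpha }\) would be calibrated by resampling in the spirit of Efron and Tibshirani (1993).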


References

  • Chen, H. Y., & Little, R. (1999). A test of missing completely at random for generalised estimating equations with missing data. Biometrika, 86, 1–13.

  • Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

  • Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.

  • Fuchs, C. (1982). Maximum likelihood estimation and model selection in contingency tables with missing data. Journal of the American Statistical Association, 77, 270–278.

  • Jamshidian, M., & Jalal, S. (2010). Tests of homoscedasticity, normality and missing completely at random for incomplete multivariate data. Psychometrika, 75, 649–674.

  • Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67, 609–624.

  • Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198–1202.

  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.

  • Qu, A., & Song, P. X. K. (2002). Testing ignorable missingness in estimating equation approaches for longitudinal data. Biometrika, 89, 841–850.

  • Rizzo, M. L., & Székely, G. J. (2010). DISCO analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics, 4, 1034–1055.

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

  • Székely, G. J., & Rizzo, M. L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58–80.

Acknowledgments

The authors thank the Editor, the Associate Editor, and the referees for their thoughtful and constructive comments, which have helped improve our article. This research has been supported in part by Institute of Education Sciences Grant R305D090019.

Author information

Corresponding author

Correspondence to Jun Li.

Appendix

Proof of Proposition 1

We first denote the vector of the \(p\) random variables by \({\varvec{y}}\) and the vector of missing indicators for the \(p\) random variables by \({\varvec{r}}\). We define the joint density function of \({\varvec{y}}\) and \({\varvec{r}}\) as \(f({\varvec{y}},{\varvec{r}})\). We further define \({\varvec{r}}_{i}\) as the vector of missing indicators for the \(i\)th missing-pattern group, and \(f_{i}({\varvec{y}})\) as the joint density function of the \(p\) variables (both observed and missing) for the \(i\)th missing-pattern group, \(i=1,\ldots ,s\). It is clear that \(f_{i}({\varvec{y}})=f({\varvec{y}}|{\varvec{r}}_i)\).

We first prove that, if the missingness is MCAR, \(F_1=\cdots =F_s\). Based on the definition of MCAR, the missingness does not depend on the data, which implies \(f({\varvec{r}}|{\varvec{y}})=f({\varvec{r}})\). Therefore, \(f({\varvec{y}},{\varvec{r}})=f({\varvec{r}}|{\varvec{y}}) f({\varvec{y}})=f({\varvec{r}})f({\varvec{y}})\). Since \(f({\varvec{y}},{\varvec{r}})=f({\varvec{y}}|{\varvec{r}})f({\varvec{r}})\), we have \(f({\varvec{y}}|{\varvec{r}})=f({\varvec{y}})\). This further implies that \(f_{1}({\varvec{y}})=\cdots =f_{s}({\varvec{y}})\). In other words, \(F_1=\cdots =F_s\).

Next, we prove that, if \(F_1=\cdots =F_s\), the missingness is MCAR. Since \(F_1=\cdots =F_s\), we have \(f_{1}({\varvec{y}})=\cdots =f_{s}({\varvec{y}})\), i.e., \(f({\varvec{y}}|{\varvec{r}}_1)=\cdots =f({\varvec{y}}|{\varvec{r}}_s)=f({\varvec{y}})\). Therefore, we have

$$\begin{aligned} f({\varvec{r}}|{\varvec{y}})=\frac{f({\varvec{y}}|{\varvec{r}})f({\varvec{r}})}{f({\varvec{y}})} =\frac{f({\varvec{y}})f({\varvec{r}})}{f({\varvec{y}})}=f({\varvec{r}}), \end{aligned}$$

which suggests that the missingness is independent of the data. Therefore, the missingness is MCAR.
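
A quick numerical illustration of the proposition (a toy simulation with arbitrary settings, not taken from the paper): under an MCAR mechanism the missing-pattern groups share the same data distribution, while a mechanism that depends on the data separates them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(size=(n, 2))                      # complete data (y1, y2)

# MCAR: whether y2 is missing ignores the data entirely.
miss_mcar = rng.random(n) < 0.3
# Non-MCAR: whether y2 is missing depends on the (observed) value of y1.
miss_dep = rng.random(n) < 1.0 / (1.0 + np.exp(-y[:, 0]))

for label, miss in [("MCAR", miss_mcar), ("non-MCAR", miss_dep)]:
    # Two missing-pattern groups: "y2 missing" vs. "fully observed".
    # Under MCAR their y1-distributions coincide (Proposition 1), so both
    # group means of y1 should be near 0; otherwise they separate.
    print(f"{label}: mean(y1 | y2 missing) = {y[miss, 0].mean():+.3f}, "
          f"mean(y1 | complete) = {y[~miss, 0].mean():+.3f}")
```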

Proof of Theorem 1

Suppose the null hypothesis is false, say \(F_{k,\mathbf{o}_{kl}} \ne F_{l,\mathbf{o}_{kl}}\) for some \(k \ne l \in \{1,\ldots ,s\}\) with \(\mathbf{o}_{kl} \ne \emptyset \). Since \(Q=\frac{B/(s-1)}{W/(n-s)}\), where \(B=\sum _{\begin{array}{c} 1 \le i<j\le s \\ \mathbf{o}_{ij} \ne \emptyset \end{array}} (\frac{n_in_j}{2n})d (\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})\), and \(d(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})\) is always nonnegative, we have

$$\begin{aligned} Q \ge \frac{n_kn_l}{2n}\cdot \frac{d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})}{s-1} \cdot \frac{n-s}{W}. \end{aligned}$$

Therefore,

$$\begin{aligned} P(Q>c_{\alpha })&\ge P\left( \frac{n_kn_l}{2n}\cdot \frac{d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})}{s-1} \cdot \frac{n-s}{W}>c_{\alpha }\right) \\&= P\left( d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})>\frac{2c_{\alpha } n(s-1)W}{n_{k}n_{l}(n-s)}\right) . \end{aligned}$$

Since \(W=\sum _{i=1}^{s} n_{i}g(\mathbb {Y}_{i,\mathbf{o}_i},\mathbb {Y}_{i,\mathbf{o}_i})/2\) and \(n_ig(\mathbb {Y}_{i,\mathbf{o}_i},\mathbb {Y}_{i,\mathbf{o}_i})/(n_i-1)\) is a \(U\)-statistic, based on the properties of \(U\)-statistics,

$$\begin{aligned} n_ig(\mathbb {Y}_{i,\mathbf{o}_i},\mathbb {Y}_{i,\mathbf{o}_i})/(n_i-1) \rightarrow \eta _{i},\quad \text { a.s.}, \end{aligned}$$

where \(\eta _{i}\) is a constant. This implies that

$$\begin{aligned} W/(n-s)&=\sum _{i=1}^{s} \frac{n_{i}-1}{n-s}\cdot \frac{1}{2}\cdot \frac{n_{i}}{n_{i}-1}g(\mathbb {Y}_{i,\mathbf{o}_i},\mathbb {Y}_{i,\mathbf{o}_i})\\&\rightarrow \frac{1}{2}\sum _{i=1}^{s}\lambda _{i}\eta _i,\quad \text { a.s.}, \end{aligned}$$

where \(\lambda _i=\lim _{n\rightarrow \infty }\frac{n_i}{n_{1}+\cdots +n_{s}}\). Therefore,

$$\begin{aligned} \lim _{n\rightarrow \infty }P(Q>c_{\alpha })&\ge \lim _{n\rightarrow \infty }P\left( d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})>\frac{2c_{\alpha } n(s-1)W}{n_{k}n_{l}(n-s)}\right) \nonumber \\&= \lim _{n\rightarrow \infty }P\left( d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})>\frac{c_{\alpha }(s-1) \sum _{i=1}^s \lambda _i\eta _i}{n\lambda _k\lambda _l}\right) . \end{aligned}$$
(4)

Next we show that \(c_{\alpha }\) is bounded above by a constant which does not depend on \(n\). Recall that \(Q=\frac{B/(s-1)}{W/(n-s)}\), where \(B=\sum _{\begin{array}{c} 1 \le i<j\le s \\ \mathbf{o}_{ij} \ne \emptyset \end{array}} (\frac{n_in_j}{2n})d(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})\). Denote by \(t\) the number of pairs \((i,j)\) satisfying \(1 \le i<j\le s \) and \( \mathbf{o}_{ij} \ne \emptyset \); in other words, there are \(t\) terms in \(B\). Clearly, \(t \le s(s-1)/2\). Therefore, for any \(k\),

$$\begin{aligned} P(Q> k )&\le P\left( \text {at least one of the }\frac{n_in_j}{2n}\cdot \frac{d(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})}{s-1} \cdot \frac{n-s}{W}> k /t\right) \\&\le \sum _{\begin{array}{c} 1 \le i<j\le s \\ \mathbf{o}_{ij} \ne \emptyset \end{array}} P\left( \frac{n_in_j}{n_i+n_j}d(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})>\frac{2k n(s-1)W}{t (n_{i}+n_{j})(n-s)}\right) , \end{aligned}$$

and

$$\begin{aligned} \lim _{n\rightarrow \infty }P(Q>k)\le \lim _{n\rightarrow \infty } \sum _{\begin{array}{c} 1 \le i<j\le s \\ \mathbf{o}_{ij} \ne \emptyset \end{array}}P\left( \frac{n_in_j}{n_i+n_j}d(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})>\frac{k(s-1) \sum _{i=1}^s \lambda _i\eta _i}{t (\lambda _i+\lambda _j)}\right) . \end{aligned}$$
(5)

Based on Székely and Rizzo (2005), under the null hypothesis of equal distributions, \(n_in_jd(\mathbb {Y}_{i,\mathbf{o}_{ij}},\mathbb {Y}_{j,\mathbf{o}_{ij}})/(n_i+n_j)\) converges in distribution to a quadratic form

$$\begin{aligned} Q_{i,j}=\sum _{l=1}^{\infty } \omega _l Z^2_l, \end{aligned}$$

where the \(Z_l\) are independent standard normal random variables and the \(\omega _{l}\) are positive constants that do not depend on \(n\). Therefore, we can choose \(k=k_{\alpha }\), a constant which does not depend on \(n\), such that

$$\begin{aligned} \sum _{\begin{array}{c} 1 \le i<j\le s \\ \mathbf{o}_{ij} \ne \emptyset \end{array}}P\left( Q_{i,j}>\frac{k_{\alpha }(s-1) \sum _{i=1}^s \lambda _i\eta _i}{t (\lambda _i+\lambda _j)}\right) =\alpha . \end{aligned}$$

For such a \(k_{\alpha }\), we have \(\lim _{n\rightarrow \infty }P(Q>k_{\alpha }) \le \alpha \) under \(H_0\) based on (5). Since \(\lim _{n\rightarrow \infty }P(Q>c_{\alpha })=\alpha \) under \(H_0\), it follows that \(\lim _{n\rightarrow \infty } c_{\alpha } \le k_{\alpha }\). Therefore, we have shown that \(c_{\alpha }\) is bounded above by \(k_{\alpha }\), a constant which does not depend on \(n\).

Applying this result to (4), we have \(c_{\alpha }(s-1) \sum _{i=1}^s \lambda _i\eta _i/(n\lambda _k\lambda _l) \rightarrow 0\) as \(n \rightarrow \infty \). Since \(d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})\) is a \(V\)-statistic, it converges in probability to \(0\) if \(F_{k,\mathbf{o}_{kl}}= F_{l,\mathbf{o}_{kl}}\), and to a positive constant if \(F_{k,\mathbf{o}_{kl}} \ne F_{l,\mathbf{o}_{kl}}\). Therefore,

$$\begin{aligned} \lim _{n\rightarrow \infty }P\left( d(\mathbb {Y}_{k,\mathbf{o}_{kl}},\mathbb {Y}_{l,\mathbf{o}_{kl}})>\frac{c_{\alpha }(s-1) \sum _{i=1}^s \lambda _i\eta _i}{n\lambda _k\lambda _l}\right) =1, \end{aligned}$$

which implies that \(\lim _{n\rightarrow \infty }P(Q>c_{\alpha })=1\). As a result, our \(Q\) test is consistent. This completes the proof.
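
As an aside on the tail bound above: once the weights \(\omega _l\) are given, tail probabilities of the limiting quadratic form \(\sum _l \omega _l Z^2_l\) are straightforward to approximate by Monte Carlo. The sketch below uses hypothetical weights and a truncated series purely for illustration; in the proof the \(\omega _l\) are unknown, which is why only the existence of the bounding constant \(k_{\alpha }\) is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
omega = 0.5 ** np.arange(1, 21)       # hypothetical weights, truncated at 20 terms
z = rng.normal(size=(200_000, omega.size))
q_draws = (omega * z ** 2).sum(axis=1)  # Monte Carlo draws of sum_l omega_l Z_l^2

threshold = 2.0                        # an arbitrary threshold
print(f"P(Q_ij > {threshold}) ~ {(q_draws > threshold).mean():.4f}")
# In the proof, k_alpha is chosen so that the summed pairwise tail
# probabilities of such quadratic forms equal alpha.
```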

About this article

Cite this article

Li, J., Yu, Y. A Nonparametric Test of Missing Completely at Random for Incomplete Multivariate Data. Psychometrika 80, 707–726 (2015). https://doi.org/10.1007/s11336-014-9410-4
