An information criterion for model selection with missing data via complete-data divergence

Abstract

We derive an information criterion for selecting a parametric model of the complete-data distribution when only incomplete or partially observed data are available. Compared with AIC, our new criterion has an additional penalty term for missing data, expressed in terms of the Fisher information matrices of the complete data and the incomplete data. We prove that our criterion is an asymptotically unbiased estimator of the complete-data divergence, namely the expected Kullback–Leibler divergence between the true distribution and the estimated distribution for the complete data, whereas AIC is an asymptotically unbiased estimator of the corresponding divergence for the incomplete data. The additional penalty term of our criterion for missing data turns out to be only half of that in the previously proposed information criteria PDIO and AICcd; the difference is attributed to the fact that our criterion is derived under a weaker assumption. A simulation study under this weaker assumption shows that our criterion is unbiased while the other two criteria are biased. In addition, we review the geometrical view of the alternating minimization in the EM algorithm, which plays an important role in deriving our new criterion.

References

  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

  • Amari, S. (1995). Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8, 1379–1408.

  • Amari, S., Nagaoka, H. (2007). Methods of information geometry (Vol. 191). Providence, RI: American Mathematical Society.

  • Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.

  • Burnham, K. P., Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York: Springer.

  • Byrne, W. (1992). Alternating minimization and Boltzmann machine learning. IEEE Transactions on Neural Networks, 3, 612–620.

  • Cavanaugh, J. E., Shumway, R. H. (1998). An Akaike information criterion for model selection in the presence of incomplete data. Journal of Statistical Planning and Inference, 67, 45–65.

  • Chapelle, O., Schölkopf, B., Zien, A. (2006). Semi-supervised learning. Cambridge: The MIT Press.

  • Claeskens, G., Consentino, F. (2008). Variable selection with incomplete covariate data. Biometrics, 64, 1062–1069.

  • Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3, 146–158.

  • Csiszár, I., Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, 1, 205–237.

  • Dempster, A. P., Laird, N. M., Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39, 1–38.

  • Ip, E. H., Lalwani, N. (2000). A note on the geometric interpretation of the EM algorithm in estimating item characteristics and student abilities. Psychometrika, 65, 533–537.

  • Kawakita, M., Takeuchi, J. (2014). Safe semi-supervised learning based on weighted likelihood. Neural Networks, 53, 146–164.

  • Konishi, S., Kitagawa, G. (2008). Information criteria and statistical modeling. New York: Springer.

  • Meng, X.-L., Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86, 899–909.

  • Seghouane, A. K., Bekara, M., Fleury, G. (2005). A criterion for model selection in the presence of incomplete data based on Kullback’s symmetric divergence. Signal Processing, 85, 1405–1417.

  • Shimodaira, H. (1994). A new criterion for selecting models from partially observed data. Selecting Models from Data (pp. 21–29). New York: Springer.

  • Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.

  • White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.

  • Yamazaki, K. (2014). Asymptotic accuracy of distribution-based estimation of latent variables. The Journal of Machine Learning Research, 15, 3541–3562.

Acknowledgements

We would like to thank the reviewers for their comments, which helped improve the manuscript. We are grateful to Kei Hirose and Shinpei Imori for their suggestions and comments. While preparing an earlier version of the manuscript, published as Shimodaira (1994), Hidetoshi Shimodaira was indebted to Shun-ichi Amari for the geometrical view of the EM algorithm and to Noboru Murata for the derivation of the Takeuchi information criterion.

Author information

Corresponding author

Correspondence to Hidetoshi Shimodaira.

Additional information

The research was supported in part by JSPS KAKENHI Grants (24300106, 16H02789).

Appendix A: Technical details

A.1 Proof of Lemma 1

For brevity, we omit \((\varvec{y},\varvec{z})\) of \(f_x(\varvec{y},\varvec{z})\) in the integrals below.

$$\begin{aligned} D_x(g_x;f_x)&= \int \int g_{z|y} g_y \left( \log g_{z|y} + \log g_y - \log f_{z|y} - \log f_y \right) d\varvec{z}\,d\varvec{y} \\&= \int g_y \int g_{z|y} \left( \log g_{z|y} - \log f_{z|y} \right) d\varvec{z}\,d\varvec{y} + \int g_y \left( \int g_{z|y}\,d\varvec{z}\right) \left( \log g_y - \log f_y \right) d\varvec{y} \\&= \int g_y \int g_{z|y} \left( \log g_{z|y} g_y - \log f_{z|y} g_y \right) d\varvec{z}\,d\varvec{y} + \int g_y \left( \log g_y - \log f_y \right) d\varvec{y} \\&= D_x(g_{z|y}g_y; f_{z|y}g_y) + D_y(g_y; f_y), \end{aligned}$$

thus showing (6). Next,

$$\begin{aligned} D_y(g_y; f_y) = \int \int h_{z|y} g_y \left( \log g_y - \log f_y + \log h_{z|y} - \log h_{z|y}\right) d\varvec{z}\, d\varvec{y} = D_x(h_{z|y}g_y ; h_{z|y} f_y), \end{aligned}$$

which shows (7). \(\square \)
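The decomposition (6) is easy to verify numerically with discrete distributions, for which the integrals become finite sums. The following is a minimal sketch (not from the paper; the grid size and the random choice of \(g\) and \(f\) are arbitrary):

```python
# Minimal numerical check of the decomposition in (6), using discrete
# distributions so that the integrals become sums. The grid, g, and f
# below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# Joint distributions g_x(y, z) and f_x(y, z) on a 4 x 3 grid (rows = y, cols = z).
g = rng.random((4, 3)); g /= g.sum()
f = rng.random((4, 3)); f /= f.sum()

def kl(p, q):
    """Kullback-Leibler divergence: sum of p * log(p / q) over all cells."""
    return float(np.sum(p * np.log(p / q)))

g_y, f_y = g.sum(axis=1), f.sum(axis=1)          # marginals of y
g_zy = g / g_y[:, None]                          # conditional g_{z|y}
f_zy = f / f_y[:, None]                          # conditional f_{z|y}

lhs = kl(g, f)                                   # D_x(g_x; f_x)
rhs = kl(g_zy * g_y[:, None], f_zy * g_y[:, None]) + kl(g_y, f_y)
print(lhs, rhs)                                  # the two values agree numerically
```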

A.2 Proof of Lemma 2

We assume \(q_{z|y}=p_{z|y}(\varvec{\bar{\theta }}_y)\) and \(\varvec{\bar{\theta }}_x = \varvec{\bar{\theta }}_y\). From the definitions of \(\varvec{\bar{\theta }}_x\) and \(H_x\), we have

$$\begin{aligned} \frac{\partial D_x(q_x; p_x(\varvec{\theta }))}{\partial \varvec{\theta } }\Bigr |_{\varvec{\bar{\theta }}_y} =0, \quad \frac{\partial ^2 D_x(q_x; p_x(\varvec{\theta }))}{\partial \varvec{\theta } \partial \varvec{\theta }^T}\Bigr |_{\varvec{\bar{\theta }}_y} = H_x. \end{aligned}$$

Hence, the Taylor expansion of \(D_x(q_x; p_x(\varvec{\theta }))\) around \(\varvec{\theta }=\varvec{\bar{\theta }}_y\) is

$$\begin{aligned} D_x(q_x; p_x(\varvec{\theta }) )= D_x(q_x; p_x(\varvec{\bar{\theta }}_y)) + \frac{1}{2} (\varvec{\theta } - \varvec{\bar{\theta }}_y)^T H_x (\varvec{\theta } - \varvec{\bar{\theta }}_y) + O(n^{-3/2}) \end{aligned}$$

for \(\varvec{\theta } - \varvec{\bar{\theta }}_y = O(n^{-1/2})\). The first term on the right-hand side is \( D_y(q_y ; p_y(\varvec{\bar{\theta }}_y) )\) as shown in (16). Substituting \(\varvec{\theta } = \varvec{\hat{\theta }}_y\) in \( D_x(q_x; p_x(\varvec{\theta }) )\) and taking its expectation gives (17) by noting

$$\begin{aligned} E \left\{ (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y)^T H_x (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y) \right\} = \mathrm{tr}\left( H_x \, E \left\{ (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y) (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y)^T \right\} \right) , \end{aligned}$$

which becomes \(\mathrm{tr}\left( H_x H_y^{-1} G_y H_y^{-1} \right) /n + O(n^{-2})\) from (4). \(\square \)
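The trace identity above, combined with the covariance approximation from (4), can be checked by simulation. Below is a minimal Monte Carlo sketch of the identity \(E\{\varvec{d}^T H_x \varvec{d}\} = \mathrm{tr}\left(H_x\, E\{\varvec{d}\varvec{d}^T\}\right)\); the matrices used are arbitrary positive-definite choices for illustration, not quantities from the paper.

```python
# Monte Carlo check of E{ d^T H_x d } = tr( H_x * E{ d d^T } ) for a
# random vector d playing the role of theta_hat_y - theta_bar_y.
# H_x and the covariance below are arbitrary positive-definite matrices.
import numpy as np

rng = np.random.default_rng(1)
k = 3

A = rng.random((k, k)); H_x = A @ A.T + k * np.eye(k)        # a symmetric "H_x"
B = rng.random((k, k)); cov = B @ B.T / 100                  # covariance of d

d = rng.multivariate_normal(np.zeros(k), cov, size=200_000)  # draws of d
lhs = np.mean(np.einsum('ni,ij,nj->n', d, H_x, d))           # E{ d^T H_x d }
rhs = np.trace(H_x @ cov)                                    # tr(H_x Cov(d))
print(lhs, rhs)                                              # close up to Monte Carlo error
```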

A.3 Proof of Theorem 1

From the definitions of \(\varvec{\hat{\theta }}_y\) and \(\hat{H}_y = H_y(\hat{q}_y; \varvec{\hat{\theta }}_y)\), we have

$$\begin{aligned} \frac{\partial L_y(\hat{q}_y; p_y(\varvec{\theta }))}{\partial \varvec{\theta } }\Bigr |_{\varvec{\hat{\theta }}_y} =0, \quad \frac{\partial ^2 L_y(\hat{q}_y; p_y(\varvec{\theta }))}{\partial \varvec{\theta } \partial \varvec{\theta }^T}\Bigr |_{\varvec{\hat{\theta }}_y} = \hat{H}_y. \end{aligned}$$

Hence, the Taylor expansion of \(L_y(\hat{q}_y ; p_y(\varvec{\theta }))\) around \(\varvec{\theta }=\varvec{\hat{\theta }}_y\) is

$$\begin{aligned} L_y(\hat{q}_y ; p_y(\varvec{\theta })) = L_y(\hat{q}_y ; p_y(\varvec{\hat{\theta }}_y)) + \frac{1}{2} (\varvec{\theta } - \varvec{\hat{\theta }}_y)^T \hat{H}_y (\varvec{\theta } - \varvec{\hat{\theta }}_y) +O_p(n^{-3/2}) \end{aligned}$$

for \(\varvec{\theta } - \varvec{\hat{\theta }}_y = O_p(n^{-1/2})\). Substituting \(\varvec{\theta }=\varvec{\bar{\theta }}_y\) into \(L_y(\hat{q}_y ; p_y(\varvec{\theta }))\), we take the expectation of this expansion. By noting \(\hat{H}_y = H_y + O_p(n^{-1/2})\), we have

$$\begin{aligned} E \left\{ (\varvec{\bar{\theta }}_y - \varvec{\hat{\theta }}_y)^T \hat{H}_y (\varvec{\bar{\theta }}_y - \varvec{\hat{\theta }}_y) \right\} = \mathrm{tr}\left( H_y \, E \left\{ (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y) (\varvec{\hat{\theta }}_y - \varvec{\bar{\theta }}_y)^T \right\} \right) + O(n^{-3/2}), \end{aligned}$$

which becomes \( \mathrm{tr}(H_y H_y^{-1} G_y H_y^{-1} )/n + O(n^{-3/2}) \) from (4). This proves (20) because

$$\begin{aligned} E\{ L_y(\hat{q}_y ; p_y(\varvec{\bar{\theta }}_y)) \} = E\{ L_y(\hat{q}_y ; p_y(\varvec{\hat{\theta }}_y)) \} + \frac{1}{2n}\mathrm{tr}( G_y H_y^{-1}) + O(n^{-3/2}), \end{aligned}$$

and \(E\{ L_y(\hat{q}_y ; p_y(\varvec{\bar{\theta }}_y)) \} = L_y(q_y ; p_y(\varvec{\bar{\theta }}_y))\). Substituting (20) into (17) and comparing it with (18) yields (21). \(\square \)
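Although the explicit form of (21) is not reproduced on this page, the two trace terms appearing in (17) and (20), \(\mathrm{tr}(H_x H_y^{-1} G_y H_y^{-1})\) and \(\mathrm{tr}(G_y H_y^{-1})\), are the quantities into which plug-in estimates would be substituted in practice. The sketch below only computes these two traces from hypothetical estimates (all names are placeholders); how they are combined into the criterion follows the main text.

```python
# Hypothetical plug-in computation of the two trace terms appearing in (17)
# and (20). Their combination into the final criterion (21) is given in the
# main text and is not reproduced here; variable names are placeholders.
import numpy as np

def penalty_terms(H_x_hat, H_y_hat, G_y_hat):
    """Return tr(H_x H_y^{-1} G_y H_y^{-1}) and tr(G_y H_y^{-1}) from
    plug-in estimates of the complete- and incomplete-data matrices."""
    H_y_inv = np.linalg.inv(H_y_hat)
    t_complete = np.trace(H_x_hat @ H_y_inv @ G_y_hat @ H_y_inv)
    t_incomplete = np.trace(G_y_hat @ H_y_inv)
    return t_complete, t_incomplete
```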

About this article

Cite this article

Shimodaira, H., Maeda, H. An information criterion for model selection with missing data via complete-data divergence. Ann Inst Stat Math 70, 421–438 (2018). https://doi.org/10.1007/s10463-016-0592-7

Keywords

  • Akaike information criterion
  • Alternating projections
  • Data manifold
  • EM algorithm
  • Fisher information matrix
  • Incomplete data
  • Kullback–Leibler divergence
  • Misspecification
  • Takeuchi information criterion