Bayesian Treatment of Incomplete Discrete Data Applied to Mutual Information and Feature Selection

  • Marcus Hutter
  • Marco Zaffalon
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2821)


Abstract

Given the joint chances of a pair of discrete random variables, one can compute quantities of interest such as the mutual information. The Bayesian treatment of unknown chances involves computing, from a second-order prior distribution and the data likelihood, a posterior distribution over the chances. A common treatment of incomplete data is to assume ignorability and to determine the chances by the expectation-maximization (EM) algorithm. These two methods are well established but typically kept separate. This paper joins the two approaches in the case of Dirichlet priors, and derives efficient approximations for the mean, mode, and (co)variance of the chances and of the mutual information. Furthermore, we prove the unimodality of the posterior distribution, from which follows the important property that EM converges to the global maximum in the chosen framework. These results are applied to the problem of selecting features for incremental learning and naive Bayes classification. A fast filter based on the distribution of mutual information is shown to outperform the traditional filter based on empirical mutual information on a number of incomplete real data sets.
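To make the filter concrete, here is a minimal sketch of the leading-order posterior moments of the mutual information under a Dirichlet posterior over the joint chances of a contingency table. It uses only the O(1/n) expansion (plug-in mean, first-order variance); the paper's exact correction terms and the EM completion of missing data are omitted, and the function name and the uniform prior parameter `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mi_posterior_moments(counts, alpha=1.0):
    """Leading-order posterior mean and variance of the mutual information
    between two discrete variables, given a table of joint counts and a
    symmetric Dirichlet(alpha) prior over the joint chances.

    This is a sketch of the O(1/n) expansion only; higher-order
    correction terms are omitted.
    """
    t = np.asarray(counts, dtype=float) + alpha   # posterior Dirichlet parameters
    n = t.sum()                                   # effective sample size
    p = t / n                                     # posterior mean of the joint chances
    pi = p.sum(axis=1, keepdims=True)             # row marginals
    pj = p.sum(axis=0, keepdims=True)             # column marginals
    log_ratio = np.log(p / (pi * pj))             # pointwise information
    mi = (p * log_ratio).sum()                    # plug-in estimate of E[I]
    var = ((p * log_ratio**2).sum() - mi**2) / n  # first-order Var[I] ~ O(1/n)
    return mi, var
```

A filter can then rank features by, e.g., the probability that the mutual information with the class exceeds a threshold, using these two moments, rather than by the empirical mutual information alone.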


Keywords: Feature Selection · Mutual Information · Posterior Distribution · Class Label · Expectation Maximization





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Marcus Hutter, IDSIA, Manno-Lugano, Switzerland
  • Marco Zaffalon, IDSIA, Manno-Lugano, Switzerland
