Journal of Intelligent Information Systems, Volume 29, Issue 3, pp 231–252

Bayesian networks for imputation in classification problems

  • Estevam R. Hruschka Jr.
  • Eduardo R. Hruschka
  • Nelson F. F. Ebecken


Missing values are an important problem in data mining. To tackle this problem in classification tasks, we propose two imputation methods based on Bayesian networks. These methods are evaluated in the context of both prediction and classification tasks. We compare the obtained results with those achieved by classical imputation methods (Expectation–Maximization, Data Augmentation, Decision Trees, and Mean/Mode). Our simulations were performed on four datasets (Congressional Voting Records, Mushroom, Wisconsin Breast Cancer, and Adult), which are benchmarks for data mining methods. Missing values were simulated in these datasets by eliminating some known values. Thus, it is possible to assess the prediction capability of an imputation method by comparing the original values with the imputed ones. In addition, we propose a methodology to estimate the bias introduced by imputation methods in classification tasks. To this end, we use four classifiers (One Rule, Naïve Bayes, J4.8 Decision Tree, and PART) to evaluate the employed imputation methods in classification scenarios. Computing times consumed to perform imputations are also reported. Simulation results in terms of prediction, classification, and computing times allow us to perform several analyses, leading to interesting conclusions. Bayesian networks are shown to be competitive with classical imputation methods.
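
The prediction-oriented evaluation described above (remove known values, impute them, compare with the originals) can be illustrated with a minimal sketch. It is not the authors' implementation: mode imputation, one of the classical baselines named in the abstract, stands in for the Bayesian-network methods, the toy records are invented, and all function names are hypothetical.

```python
import random
from collections import Counter


def mask_values(rows, column, fraction, rng):
    """Copy `rows` and set a random `fraction` of entries in `column` to None.

    Returns the masked copy and the sorted indices of the masked rows.
    """
    n_masked = int(round(fraction * len(rows)))
    masked_idx = sorted(rng.sample(range(len(rows)), n_masked))
    masked_rows = [list(r) for r in rows]  # copy so the originals stay intact
    for i in masked_idx:
        masked_rows[i][column] = None
    return masked_rows, masked_idx


def impute_mode(rows, column):
    """Fill missing entries in `column` with the most frequent observed value."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mode = Counter(observed).most_common(1)[0][0]
    return [
        [mode if (j == column and v is None) else v for j, v in enumerate(r)]
        for r in rows
    ]


def prediction_accuracy(original, imputed, column, masked_idx):
    """Fraction of masked entries whose imputed value equals the original."""
    if not masked_idx:
        raise ValueError("no masked entries to evaluate")
    hits = sum(original[i][column] == imputed[i][column] for i in masked_idx)
    return hits / len(masked_idx)


if __name__ == "__main__":
    # Invented toy records: (vote, party). Not taken from any benchmark dataset.
    data = [["y", "a"]] * 8 + [["n", "b"]] * 2
    rng = random.Random(0)
    masked, idx = mask_values(data, column=0, fraction=0.3, rng=rng)
    imputed = impute_mode(masked, column=0)
    print("accuracy on masked entries:",
          prediction_accuracy(data, imputed, 0, idx))
```

Any other imputation method could be substituted for `impute_mode` as long as it returns a completed copy of the rows; the masking and scoring steps stay the same.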


Keywords: Missing values · Bayesian networks · Data mining





Copyright information

© Springer Science+Business Media, LLC 2006

Authors and Affiliations

  • Estevam R. Hruschka Jr. (1)
  • Eduardo R. Hruschka (2)
  • Nelson F. F. Ebecken (3)

  1. UFSCar/Federal University of São Carlos, São Carlos, Brazil
  2. Catholic University of Santos (UniSantos), Santos, Brazil
  3. COPPE/Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
