Machine Learning, Volume 29, Issue 2–3, pp 131–163

Bayesian Network Classifiers

  • Nir Friedman
  • Dan Geiger
  • Moises Goldszmidt

Abstract

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian networks. These networks are factored representations of probability distributions that generalize the naive Bayesian classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness that characterize naive Bayes. We experimentally tested these approaches, using problems from the University of California at Irvine repository, and compared them to C4.5, naive Bayes, and wrapper methods for feature selection.

Keywords: Bayesian networks, classification
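To make the abstract's claim of "no search involved" concrete, the sketch below illustrates the kind of tree-construction step a TAN-style classifier typically relies on: estimating the conditional mutual information I(X_i; X_j | C) between pairs of discrete attributes given the class, then extracting a maximum-weight spanning tree over the attributes. This is an illustrative sketch, not the authors' code; the function names, the data layout (rows of discrete values addressed by column index), and the choice of Prim's algorithm are assumptions made for the example.

    import math
    from collections import Counter

    def conditional_mutual_information(rows, i, j, c):
        """Estimate I(X_i; X_j | C) (in nats) from rows of discrete values."""
        n = len(rows)
        n_ijc = Counter((r[i], r[j], r[c]) for r in rows)
        n_ic = Counter((r[i], r[c]) for r in rows)
        n_jc = Counter((r[j], r[c]) for r in rows)
        n_c = Counter(r[c] for r in rows)
        cmi = 0.0
        for (vi, vj, vc), count in n_ijc.items():
            # p(x_i, x_j, c) * log[ p(x_i, x_j | c) / (p(x_i | c) p(x_j | c)) ]
            cmi += (count / n) * math.log(
                (count * n_c[vc]) / (n_ic[(vi, vc)] * n_jc[(vj, vc)])
            )
        return cmi

    def tan_tree(rows, attr_indices, class_index):
        """Maximum-weight spanning tree over the attributes (Prim's algorithm).

        Returns undirected edges (i, j); directing them away from an arbitrary
        root and adding the class as a parent of every attribute gives a
        tree-augmented naive Bayes structure.
        """
        weights = {
            (i, j): conditional_mutual_information(rows, i, j, class_index)
            for a, i in enumerate(attr_indices)
            for j in attr_indices[a + 1:]
        }
        in_tree = {attr_indices[0]}
        edges = []
        while len(in_tree) < len(attr_indices):
            # Pick the heaviest edge crossing the cut between tree and non-tree nodes.
            best = max(
                ((i, j) for (i, j) in weights
                 if (i in in_tree) != (j in in_tree)),
                key=weights.get,
            )
            edges.append(best)
            in_tree.update(best)
        return edges

Under these assumptions the structure is fixed once the spanning tree is computed, which is consistent with the abstract's point that, unlike general Bayesian network induction, no search over candidate structures is needed.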

References

  1. Binder, J., D. Koller, S. Russell, & K. Kanazawa (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, this issue.
  2. Bouckaert, R. R. (1994). Properties of Bayesian network learning algorithms. In R. López de Mantarás & D. Poole (Eds.), Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 102–109). San Francisco, CA: Morgan Kaufmann.
  3. Buntine, W. (1991). Theory refinement on Bayesian networks. In B. D. D'Ambrosio, P. Smets, & P. P. Bonissone (Eds.), Proceedings of the Seventh Annual Conference on Uncertainty in Artificial Intelligence (pp. 52–60). San Francisco, CA: Morgan Kaufmann.
  4. Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8, 195–210.
  5. Cestnik, B. (1990). Estimating probabilities: a crucial task in machine learning. In L. C. Aiello (Ed.), Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 147–149). London: Pitman.
  6. Chickering, D. M. (1995). Learning Bayesian networks is NP-complete. In D. Fisher & H. Lenz (Eds.), Learning from Data. Springer-Verlag.
  7. Chickering, D. M. & D. Heckerman (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In E. Horvitz & F. Jensen (Eds.), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 158–168). San Francisco, CA: Morgan Kaufmann.
  8. Chow, C. K. & C. N. Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 462–467.
  9. Cooper, G. F. & E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
  10. Cormen, T. H., C. E. Leiserson, & R. L. Rivest (1990). Introduction to Algorithms. Cambridge, MA: MIT Press.
  11. Cover, T. M. & J. A. Thomas (1991). Elements of Information Theory. New York: John Wiley & Sons.
  12. Dawid, A. P. (1976). Properties of diagnostic data distributions. Biometrics, 32, 647–658.
  13. DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.
  14. Domingos, P. & M. Pazzani (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (pp. 105–112). San Francisco, CA: Morgan Kaufmann.
  15. Dougherty, J., R. Kohavi, & M. Sahami (1995). Supervised and unsupervised discretization of continuous features. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 194–202). San Francisco, CA: Morgan Kaufmann.
  16. Duda, R. O. & P. E. Hart (1973). Pattern Classification and Scene Analysis. New York: John Wiley & Sons.
  17. Ezawa, K. J. & T. Schuermann (1995). Fraud/uncollectable debt detection using a Bayesian network based learning system: A rare binary outcome with mixed data structures. In P. Besnard & S. Hanks (Eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 157–166). San Francisco, CA: Morgan Kaufmann.
  18. Fayyad, U. M. & K. B. Irani (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 1022–1027). San Francisco, CA: Morgan Kaufmann.
  19. Friedman, J. (1997a). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1, 55–77.
  20. Friedman, N. (1997b). Learning belief networks in the presence of missing values and hidden variables. In D. Fisher (Ed.), Proceedings of the Fourteenth International Conference on Machine Learning (pp. 125–133). San Francisco, CA: Morgan Kaufmann.
  21. Friedman, N. & M. Goldszmidt (1996a). Building classifiers using Bayesian networks. In Proceedings of the National Conference on Artificial Intelligence (pp. 1277–1284). Menlo Park, CA: AAAI Press.
  22. Friedman, N. & M. Goldszmidt (1996b). Discretization of continuous attributes while learning Bayesian networks. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (pp. 157–165). San Francisco, CA: Morgan Kaufmann.
  23. Friedman, N. & M. Goldszmidt (1996c). Learning Bayesian networks with local structure. In E. Horvitz & F. Jensen (Eds.), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 252–262). San Francisco, CA: Morgan Kaufmann.
  24. Geiger, D. (1992). An entropy-based learning algorithm of Bayesian conditional trees. In D. Dubois, M. P. Wellman, B. D. D'Ambrosio, & P. Smets (Eds.), Proceedings of the Eighth Annual Conference on Uncertainty in Artificial Intelligence (pp. 92–97). San Francisco, CA: Morgan Kaufmann.
  25. Geiger, D. & D. Heckerman (1996). Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82, 45–74.
  26. Geiger, D., D. Heckerman, & C. Meek (1996). Asymptotic model selection for directed graphs with hidden variables. In E. Horvitz & F. Jensen (Eds.), Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (pp. 283–290). San Francisco, CA: Morgan Kaufmann.
  27. Heckerman, D. (1991). Probabilistic Similarity Networks. Cambridge, MA: MIT Press.
  28. Heckerman, D. (1995). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.
  29. Heckerman, D. & D. Geiger (1995). Learning Bayesian networks: a unification for discrete and Gaussian domains. In P. Besnard & S. Hanks (Eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 274–284). San Francisco, CA: Morgan Kaufmann.
  30. Heckerman, D., D. Geiger, & D. M. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
  31. John, G. & R. Kohavi (1997). Wrappers for feature subset selection. Artificial Intelligence. Accepted for publication. A preliminary version appears in Proceedings of the Eleventh International Conference on Machine Learning, 1994, pp. 121–129, under the title “Irrelevant features and the subset selection problem”.
  32. John, G. H. & P. Langley (1995). Estimating continuous distributions in Bayesian classifiers. In P. Besnard & S. Hanks (Eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338–345). San Francisco, CA: Morgan Kaufmann.
  33. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1137–1143). San Francisco, CA: Morgan Kaufmann.
  34. Kohavi, R., G. John, R. Long, D. Manley, & K. Pfleger (1994). MLC++: A machine learning library in C++. In Proceedings of the Sixth International Conference on Tools with Artificial Intelligence (pp. 740–743). IEEE Computer Society Press.
  35. Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of the Sixth European Working Session on Learning (pp. 206–219). Berlin: Springer-Verlag.
  36. Kullback, S. & R. A. Leibler (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 76–86.
  37. Lam, W. & F. Bacchus (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10, 269–293.
  38. Langley, P., W. Iba, & K. Thompson (1992). An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 223–228). Menlo Park, CA: AAAI Press.
  39. Langley, P. & S. Sage (1994). Induction of selective Bayesian classifiers. In R. López de Mantarás & D. Poole (Eds.), Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (pp. 399–406). San Francisco, CA: Morgan Kaufmann.
  40. Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191–201.
  41. Lewis, P. M. (1959). Approximating probability distributions to reduce storage requirements. Information and Control, 2, 214–225.
  42. Murphy, P. M. & D. W. Aha (1995). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  43. Pazzani, M. J. (1995). Searching for dependencies in Bayesian classifiers. In D. Fisher & H. Lenz (Eds.), Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL.
  44. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Francisco, CA: Morgan Kaufmann.
  45. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.
  46. Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
  47. Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
  48. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
  49. Singh, M. & G. M. Provan (1995). A comparison of induction algorithms for selective and non-selective Bayesian classifiers. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 497–505). San Francisco, CA: Morgan Kaufmann.
  50. Singh, M. & G. M. Provan (1996). Efficient learning of selective Bayesian network classifiers. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning (pp. 453–461). San Francisco, CA: Morgan Kaufmann.
  51. Spiegelhalter, D. J., A. P. Dawid, S. L. Lauritzen, & R. G. Cowell (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–283.
  52. Suzuki, J. (1993). A construction of Bayesian networks from databases based on an MDL scheme. In D. Heckerman & A. Mamdani (Eds.), Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (pp. 266–273). San Francisco, CA: Morgan Kaufmann.

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Nir Friedman (1)
  • Dan Geiger (2)
  • Moises Goldszmidt (3)
  1. Computer Science Division, University of California, Berkeley
  2. Computer Science Department, Technion, Haifa, Israel
  3. SRI International, Menlo Park