Bayesian Networks for Data Mining

Abstract

A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data modeling. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
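To make the Dirichlet machinery behind these learning methods concrete, here is a minimal Python sketch of Bayesian parameter updating and structure scoring for a two-node discrete network X -> Y with complete data. The toy data set, the uniform one-pseudo-count prior, and all function names are illustrative assumptions for this sketch, not material taken from the paper:

```python
# Minimal sketch (not code from the paper): Dirichlet parameter learning and
# closed-form marginal-likelihood scoring for a toy network X -> Y with
# binary variables and complete data. Data and prior counts are made up.
from math import lgamma
from collections import Counter

# Hypothetical complete data: observed (x, y) pairs.
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]

# One pseudo-count per cell: a uniform Dirichlet prior on each distribution.
prior = 1.0

# Sufficient statistics: counts of X, and joint counts of (X, Y).
n_x = Counter(x for x, _ in data)
n_xy = Counter(data)

def posterior_p_x(x):
    """Posterior predictive P(X = x) under a Dirichlet(prior, prior) prior."""
    return (prior + n_x[x]) / (2 * prior + len(data))

def posterior_p_y_given_x(y, x):
    """Posterior predictive P(Y = y | X = x), updating the prior with counts."""
    return (prior + n_xy[(x, y)]) / (2 * prior + n_x[x])

def log_marginal_likelihood():
    """Closed-form log P(data | structure X -> Y) under Dirichlet priors,
    the kind of score used to compare network structures from complete data."""
    score = 0.0
    # Term for X, which has no parents (one parent configuration).
    score += lgamma(2 * prior) - lgamma(2 * prior + len(data))
    for x in (0, 1):
        score += lgamma(prior + n_x[x]) - lgamma(prior)
    # Terms for Y, one per configuration of its parent X.
    for x in (0, 1):
        score += lgamma(2 * prior) - lgamma(2 * prior + n_x[x])
        for y in (0, 1):
            score += lgamma(prior + n_xy[(x, y)]) - lgamma(prior)
    return score

print(posterior_p_x(1))             # P(X = 1 | data)
print(posterior_p_y_given_x(1, 1))  # P(Y = 1 | X = 1, data)
print(log_marginal_likelihood())    # log P(data | X -> Y)
```

Because the score decomposes per node given complete data, comparing candidate structures amounts to repeating the same computation for each candidate (for example, the empty graph with X and Y independent) and comparing log marginal likelihoods directly.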


Cite this article

Heckerman, D. Bayesian Networks for Data Mining. Data Mining and Knowledge Discovery 1, 79–119 (1997). https://doi.org/10.1023/A:1009730122752

Keywords

  • Bayesian networks
  • Bayesian statistics
  • learning
  • missing data
  • classification
  • regression
  • clustering
  • causal discovery