A Bayesian Approach to Causal Discovery

  • David Heckerman
  • Christopher Meek
  • Gregory Cooper
Part of the Studies in Fuzziness and Soft Computing book series (STUDFUZZ, volume 194)


We examine the Bayesian approach to the discovery of causal DAG models and compare it to the constraint-based approach. Both approaches rely on the Causal Markov condition, but the two differ significantly in theory and practice. An important difference between the approaches is that the constraint-based approach uses categorical information about conditional-independence constraints in the domain, whereas the Bayesian approach weighs the degree to which such constraints hold. As a result, the Bayesian approach has three distinct advantages over its constraint-based counterpart. One, conclusions derived from the Bayesian approach are not susceptible to incorrect categorical decisions about independence facts that can occur with data sets of finite size. Two, using the Bayesian approach, finer distinctions among model structures—both quantitative and qualitative—can be made. Three, information from several models can be combined to make better inferences and to better account for modeling uncertainty. In addition to describing the general Bayesian approach to causal discovery, we review approximation methods for missing data and hidden variables, and illustrate differences between the Bayesian and constraint-based methods using artificial and real examples.
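The scoring idea the abstract contrasts with categorical independence decisions can be sketched with the Bayesian-Dirichlet (BD) marginal likelihood for discrete DAGs. The sketch below is illustrative only, not the chapter's own code: variable names are hypothetical and a simple uniform Dirichlet prior (alpha = 1 for every parameter) is assumed. It scores a dependent structure X -> Y against a fully disconnected one, then turns the two scores into a posterior over structures.

```python
from collections import defaultdict
from math import exp, lgamma

def log_marginal_likelihood(data, structure, arities, alpha=1.0):
    """Bayesian-Dirichlet score log P(D | G) for a discrete DAG.

    data:      list of dicts mapping variable name -> observed state
    structure: dict mapping variable name -> tuple of parent names
    arities:   dict mapping variable name -> number of states
    alpha:     Dirichlet hyperparameter alpha_ijk (uniform, an assumption)
    """
    score = 0.0
    for var, parents in structure.items():
        r = arities[var]
        # N_ijk: count of each child state k under each parent configuration j.
        counts = defaultdict(lambda: defaultdict(int))
        for row in data:
            j = tuple(row[p] for p in parents)
            counts[j][row[var]] += 1
        for kcounts in counts.values():
            n_ij = sum(kcounts.values())
            score += lgamma(r * alpha) - lgamma(r * alpha + n_ij)
            for n_ijk in kcounts.values():
                score += lgamma(alpha + n_ijk) - lgamma(alpha)
    return score

# Toy data: Y almost always equals X, so X and Y are strongly dependent.
data = [{"X": x, "Y": x} for x in (0, 1)] * 10 + [{"X": 0, "Y": 1}]
arities = {"X": 2, "Y": 2}

s_dep = log_marginal_likelihood(data, {"X": (), "Y": ("X",)}, arities)
s_ind = log_marginal_likelihood(data, {"X": (), "Y": ()}, arities)

# Posterior over the two structures under a uniform structure prior:
# the data shift belief by degree rather than forcing a categorical
# accept/reject decision about the independence of X and Y.
z = exp(s_dep) + exp(s_ind)
p_dep, p_ind = exp(s_dep) / z, exp(s_ind) / z
```

These posterior weights also illustrate the abstract's third advantage: predictions from several candidate structures can be combined with weights `p_dep` and `p_ind`, rather than committing all inference to a single selected model.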







Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Heckerman, Microsoft Research, Redmond
  • Christopher Meek, Microsoft Research, Redmond
  • Gregory Cooper, University of Pittsburgh, Pittsburgh
