Innovations in Machine Learning, pp. 1–28

# A Bayesian Approach to Causal Discovery

## Abstract

We examine the Bayesian approach to the discovery of causal DAG models and compare it to the constraint-based approach. Both approaches rely on the Causal Markov condition, but the two differ significantly in theory and practice. An important difference between the approaches is that the constraint-based approach uses categorical information about conditional-independence constraints in the domain, whereas the Bayesian approach weighs the degree to which such constraints hold. As a result, the Bayesian approach has three distinct advantages over its constraint-based counterpart. One, conclusions derived from the Bayesian approach are not susceptible to incorrect categorical decisions about independence facts that can occur with data sets of finite size. Two, using the Bayesian approach, finer distinctions among model structures—both quantitative and qualitative—can be made. Three, information from several models can be combined to make better inferences and to better account for modeling uncertainty. In addition to describing the general Bayesian approach to causal discovery, we review approximation methods for missing data and hidden variables, and illustrate differences between the Bayesian and constraint-based methods using artificial and real examples.
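The core quantitative difference described above can be made concrete with a small sketch. The Bayesian approach scores a candidate structure by its marginal likelihood, which decomposes per node; with Dirichlet priors on discrete variables this is the Bayesian-Dirichlet score of Cooper and Herskovits. The code below is a minimal illustration, not the chapter's own implementation: the function name `local_log_marginal`, the toy two-variable data, and the uniform prior `alpha=1.0` are all assumptions made for the example. It compares the structure X → Y against the empty (independence) structure; rather than a categorical independence decision, the difference in log scores measures the degree of support for the edge.

```python
from math import lgamma

def local_log_marginal(data, child, parents, arity=2, alpha=1.0):
    """Log marginal likelihood of one node given its parent set,
    with a Dirichlet(alpha,...,alpha) prior per parent configuration
    (the Bayesian-Dirichlet score of Cooper & Herskovits)."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)  # parent configuration
        counts.setdefault(key, [0] * arity)
        counts[key][row[child]] += 1
    ll = 0.0
    for nijk in counts.values():
        # ratio of Dirichlet normalizing constants, prior vs posterior
        ll += lgamma(arity * alpha) - lgamma(arity * alpha + sum(nijk))
        for n in nijk:
            ll += lgamma(alpha + n) - lgamma(alpha)
    return ll

# Toy data over two binary variables (index 0 = X, index 1 = Y):
# Y agrees with X in 80 of 100 cases, so X and Y are clearly dependent.
data = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40

# Structure scores decompose over nodes: X -> Y versus no edge.
score_edge = local_log_marginal(data, 0, ()) + local_log_marginal(data, 1, (0,))
score_indep = local_log_marginal(data, 0, ()) + local_log_marginal(data, 1, ())

# A positive difference is graded evidence for the edge, not a yes/no verdict.
print(score_edge - score_indep)
```

Because the score is a continuous quantity, two structures that a constraint-based test would treat identically can receive different posterior weights, and those weights can be combined directly for model averaging.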

## Keywords

Bayesian approach, intelligence quotient, causal model, hidden variable, marginal likelihood

