# Learning Functional Causal Models with Generative Neural Networks

## Abstract

We introduce a new approach to functional causal modeling from observational data, called *Causal Generative Neural Networks* (CGNN). CGNN leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables by minimizing the Maximum Mean Discrepancy between generated and observed data. An approximate learning criterion is proposed to scale the computational cost of the approach to linear complexity in the number of observations. The performance of CGNN is studied through three experiments. First, CGNN is applied to cause-effect inference, where the task is to identify the best causal hypothesis out of “*X* → *Y*” and “*Y* → *X*”. Second, CGNN is applied to the problem of identifying v-structures and conditional independences. Third, CGNN is applied to multivariate functional causal modeling: given a skeleton describing the direct dependences in a set of random variables *X* = [*X*_{1}, …, *X*_{d}], CGNN orients the edges in the skeleton to uncover the directed acyclic causal graph describing the causal structure of the random variables. On all three tasks, CGNN is extensively assessed on both artificial and real-world data, comparing favorably to the state-of-the-art. Finally, CGNN is extended to handle the case of confounders, where latent variables are involved in the overall causal model.
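To make the training criterion concrete, the following is a minimal NumPy sketch of an empirical squared Maximum Mean Discrepancy between an observed and a generated sample. It is not the authors' implementation. The Gaussian kernel, the single `bandwidth` parameter, and the function names are assumptions made for illustration. The second estimator uses random Fourier features, one standard way to obtain a cost that is linear in the number of observations; the abstract does not say which approximation CGNN uses, so treat that choice as an assumption too.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(observed, generated, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD; quadratic in sample size."""
    k_oo = gaussian_kernel(observed, observed, bandwidth)
    k_gg = gaussian_kernel(generated, generated, bandwidth)
    k_og = gaussian_kernel(observed, generated, bandwidth)
    return k_oo.mean() + k_gg.mean() - 2.0 * k_og.mean()


def mmd2_random_features(observed, generated, bandwidth=1.0,
                         n_features=256, seed=0):
    """Approximate squared MMD via random Fourier features; linear in sample size."""
    rng = np.random.default_rng(seed)
    dim = observed.shape[1]
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    w = rng.normal(scale=1.0 / bandwidth, size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def embed(x):
        return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

    # Squared distance between the two mean embeddings.
    diff = embed(observed).mean(axis=0) - embed(generated).mean(axis=0)
    return float(diff @ diff)
```

Both functions take arrays of shape `(n_samples, n_variables)`. In a generative setting, `generated` would come from the candidate causal model, and a smaller value indicates a closer match to the observed joint distribution.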

## Keywords

Generative neural networks · Causal structure discovery · Cause-effect pair problem · Functional causal models · Structural equation models