Causal feature learning: an overview


Causal feature learning (CFL) (Chalupka et al., in: Proceedings of the thirty-first conference on uncertainty in artificial intelligence, AUAI Press, pp 181–190, 2015) is a causal inference framework rooted in the language of causal graphical models (Pearl, Causality: models, reasoning, and inference, Cambridge University Press, Cambridge, 2009; Spirtes et al., Causation, prediction, and search, MIT Press, Cambridge, MA, 2000) and computational mechanics (Shalizi, PhD thesis, University of Wisconsin at Madison, 2001). CFL aims to discover high-level causal relations from low-level data, and to reduce the experimental effort required to understand confounding among the high-level variables. We first review the scientific motivation for CFL, then present a detailed introduction to the framework, laying out the definitions and algorithmic steps. A simple example illustrates the techniques involved in the learning steps and provides visual intuition. Finally, we discuss the limitations of the current framework and list a number of open problems.
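As a rough illustration of the core idea (a toy sketch, not the paper's algorithm), the observational step of CFL can be thought of as grouping micro-level cause states that induce the same conditional distribution over the effect. The example below uses hypothetical binary data and a simple rounding-based clustering in place of the learned clustering described in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical micro-level data: four micro cause states, but only two
# distinct effect distributions. CFL-style grouping should recover a
# two-class macro-level cause variable.
p_effect = {0: 0.2, 1: 0.2, 2: 0.8, 3: 0.8}  # P(E=1 | C=c)
C = rng.integers(0, 4, size=20000)
E = (rng.random(20000) < np.vectorize(p_effect.get)(C)).astype(int)

# Step 1: estimate the conditional P(E=1 | C=c) for each micro state.
cond = np.array([E[C == c].mean() for c in range(4)])

# Step 2: group micro states whose estimated conditionals agree
# (here via crude rounding; the framework uses a learned clustering).
rounded = {c: round(cond[c], 1) for c in range(4)}
labels = {v: i for i, v in enumerate(sorted(set(rounded.values())))}
macro_cause = {c: labels[rounded[c]] for c in range(4)}
print(macro_cause)  # micro states 0,1 share one macro class; 2,3 the other
```

The grouping step is what turns a high-dimensional micro variable into a small macro variable whose states are exactly the equivalence classes of causal effect.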


[Figs. 1–10: available in the full article]


  1. As discussed in some detail in Chalupka et al. (2016a), a causal interpretation of purely observational data is not possible without further assumptions.

  2. Python code that implements the learning algorithms and reproduces all the figures and experimental results is available online at

  3. It is possible that this \(\gamma \) is not in \(P'[\gamma ;\alpha , \beta ]\). However, it is guaranteed to be in \(P[\gamma ;\alpha , \beta ]\). Since a subset of measure zero in \(P[\gamma ;\alpha , \beta ]\) is also measure zero in \(P'[\gamma ; \alpha , \beta ]\), this does not influence the proof.

  4. It is possible that this \(\gamma \) is not in \(P'[\gamma ;\alpha , \beta ]\). However, it is guaranteed to be in \(P[\gamma ;\alpha , \beta ]\). Since a subset of measure zero in \(P[\gamma ;\alpha , \beta ]\) is also measure zero in \(P'[\gamma ; \alpha , \beta ]\), this does not influence the proof.


  1. Bishop CM (1994) Mixture density networks. Technical report

  2. Chaloner K, Verdinelli I (1995) Bayesian experimental design: a review. Stat Sci 10(3):273–304

  3. Chalupka K, Perona P, Eberhardt F (2015) Visual causal feature learning. In: Proceedings of the thirty-first conference on uncertainty in artificial intelligence. AUAI Press, Corvallis, pp 181–190

  4. Chalupka K, Bischoff T, Perona P, Eberhardt F (2016a) Unsupervised discovery of El Niño using causal feature learning on microlevel climate data. In: Proceedings of the thirty-second conference on uncertainty in artificial intelligence

  5. Chalupka K, Perona P, Eberhardt F (2016b) Multi-level cause-effect systems. In: 19th international conference on artificial intelligence and statistics (AISTATS)

  6. Chickering DM (2002) Learning equivalence classes of Bayesian-network structures. J Mach Learn Res 2:445–498


  7. Claassen T, Heskes T (2012) A Bayesian approach to constraint-based causal inference. In: Proceedings of UAI. AUAI Press, Corvallis, pp 207–216

  8. Entner D, Hoyer PO (2012) Estimating a causal order among groups of variables in linear models. In: Artificial neural networks and machine learning – ICANN 2012, pp 84–91


  9. Hoel EP, Albantakis L, Tononi G (2013) Quantifying causal emergence shows that macro can beat micro. Proc Natl Acad Sci 110(49):19790–19795


  10. Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: Advances in neural information processing systems, pp 689–696

  11. Hyttinen A, Eberhardt F, Hoyer PO (2012) Causal discovery of linear cyclic models from multiple experimental data sets with overlapping variables. arXiv:1210.4879

  12. Hyttinen A, Eberhardt F, Järvisalo M (2014) Constraint-based causal discovery: conflict resolution with answer set programming. In: Proceedings of UAI

  13. Jacobs KW, Hustmyer FE (1974) Effects of four psychological primary colors on GSR, heart rate and respiration rate. Percept Motor Skills 38(3):763–766


  14. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25, pp 1097–1105

  15. Lacerda G, Spirtes PL, Ramsey J, Hoyer PO (2012) Discovering cyclic causal models by independent components analysis. arXiv:1206.3273

  16. Levina E, Bickel P (2001) The earth mover’s distance is the mallows distance: some insights from statistics. In: Eighth IEEE international conference on computer vision, 2001. ICCV 2001. Proceedings, vol 2. IEEE, New York, pp 251–256

  17. Mooij JM, Janzing D, Heskes T, Schölkopf B (2011) On causal discovery with cyclic additive noise models. In: Advances in neural information processing systems, pp 639–647

  18. Okamoto M (1973) Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann Stat 1(4):763–765


  19. Parviainen P, Kaski S (2015) Bayesian networks for variable groups. arXiv:1508.07753

  20. Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge


  21. Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2)

  22. Reise SP, Moore TM, Haviland MG (2010) Bifactor models and rotations: exploring the extent to which multidimensional data yield univocal scale scores. J Pers Assess 92(6):544–559


  23. Richardson T (1996) A discovery algorithm for directed cyclic graphs. In: Proceedings of the twelfth international conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., USA, pp 454–461

  24. Rumelhart DE, Hinton GE, Williams RJ (1985) Learning internal representations by error propagation. Technical report no. ICS-8506, Institute for Cognitive Science, University of California, San Diego, La Jolla

  25. Shalizi CR (2001) Causal architecture, complexity and self-organization in the time series and cellular automata. PhD thesis, University of Wisconsin at Madison

  26. Shalizi CR, Crutchfield JP (2001) Computational mechanics: pattern and prediction, structure and simplicity. J Stat Phys 104(3–4):817–879


  27. Shalizi CR, Moore C (2003) What is a macrostate? Subjective observations and objective dynamics. arXiv:cond-mat/0303625

  28. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A (2006) A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 7:2003–2030


  29. Silander T, Myllymäki P (2006) A simple approach for finding the globally optimal Bayesian network structure. In: Proceedings of UAI. AUAI Press, Corvallis, pp 445–452

  30. Silva R, Scheines R, Glymour C, Spirtes P (2006) Learning the structure of linear latent variable models. J Mach Learn Res 7:191–246


  31. Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems, pp 2951–2959

  32. Spirtes P, Scheines R (2004) Causal inference of ambiguous manipulations. Philos Sci 71(5):833–845


  33. Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search, 2nd edn. MIT Press, Cambridge, MA

  34. Srinivas N, Krause A, Seeger M, Kakade SM (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp 1015–1022

  35. Tong S, Koller D (2001) Support vector machine active learning with applications to text classification. J Mach Learn Res 2:45–66


  36. Tsao DY, Freiwald WA, Tootell RBH, Livingstone MS (2006) A cortical region consisting entirely of face-selective cells. Science 311(5761):670–674




We thank an anonymous reviewer for pointing out an error in our original theorem. This work was supported by NSF Award #1564330.

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Author information



Corresponding author

Correspondence to Krzysztof Chalupka.

Additional information

Communicated by Shohei Shimizu.

About this article


Cite this article

Chalupka, K., Eberhardt, F. & Perona, P. Causal feature learning: an overview. Behaviormetrika 44, 137–164 (2017).



Keywords

  • Causal discovery
  • Causal inference
  • Graphical models
  • Bayesian networks
  • Macrovariables
  • Multiscale modeling