Learning Functional Causal Models with Generative Neural Networks

Chapter in Explainable and Interpretable Models in Computer Vision and Machine Learning

Abstract

We introduce a new approach to functional causal modeling from observational data, called Causal Generative Neural Networks (CGNN). CGNN leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables, by minimizing the Maximum Mean Discrepancy between generated and observed data. An approximate learning criterion is proposed to scale the computational cost of the approach to linear complexity in the number of observations. The performance of CGNN is studied throughout three experiments. Firstly, CGNN is applied to cause-effect inference, where the task is to identify the best causal hypothesis out of “X → Y ” and “Y → X”. Secondly, CGNN is applied to the problem of identifying v-structures and conditional independences. Thirdly, CGNN is applied to multivariate functional causal modeling: given a skeleton describing the direct dependences in a set of random variables X = [X 1, …, X d], CGNN orients the edges in the skeleton to uncover the directed acyclic causal graph describing the causal structure of the random variables. On all three tasks, CGNN is extensively assessed on both artificial and real-world data, comparing favorably to the state-of-the-art. Finally, CGNN is extended to handle the case of confounders, where latent variables are involved in the overall causal model.
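
To make the learning criterion concrete, here is a minimal PyTorch sketch of CGNN-style cause-effect inference on a single pair. It is our illustration, not the authors' implementation (their code is available at https://github.com/GoudetOlivie/CGNN): the network sizes, Gaussian noise, kernel bandwidths, and training schedule below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def mmd_loss(a, b, gammas=(0.05, 0.5, 5.0)):
    """Biased empirical (squared) MMD between samples a and b, Gaussian kernels."""
    dxx, dyy, dxy = torch.cdist(a, a) ** 2, torch.cdist(b, b) ** 2, torch.cdist(a, b) ** 2
    return sum(torch.exp(-g * dxx).mean() + torch.exp(-g * dyy).mean()
               - 2.0 * torch.exp(-g * dxy).mean() for g in gammas)

class PairCGNN(nn.Module):
    """Candidate FCM for the hypothesis X -> Y: X_hat = f_X(E_X), Y_hat = f_Y(X_hat, E_Y)."""
    def __init__(self, hidden=20):
        super().__init__()
        self.f_x = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f_y = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, n):
        e_x, e_y = torch.randn(n, 1), torch.randn(n, 1)   # independent noise draws
        x_hat = self.f_x(e_x)
        y_hat = self.f_y(torch.cat([x_hat, e_y], dim=1))
        return torch.cat([x_hat, y_hat], dim=1)           # generated sample of (X, Y)

def fitted_mmd(data, epochs=300, lr=0.01):
    """Train one candidate CGNN on `data` (an n x 2 float tensor) and return its final MMD."""
    model = PairCGNN()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = mmd_loss(data, model(data.shape[0]))
        loss.backward()
        opt.step()
    return loss.item()

# Orientation by comparing the two causal hypotheses (columns swapped for Y -> X):
#   score_xy = fitted_mmd(torch.stack([x, y], dim=1))
#   score_yx = fitted_mmd(torch.stack([y, x], dim=1))
#   predict "X -> Y" if score_xy < score_yx, else "Y -> X"
```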


Notes

  1.

    The so-called constraint-based methods base the recovery of graph structure on conditional independence tests. In general, proofs of model identifiability assume the existence of an “oracle” providing perfect knowledge of the CIs, i.e. de facto assuming an infinite amount of training data.

  2.

    Following Ramsey (2015), in the linear model with Gaussian variables, the individual BIC score to minimize for a variable X given its parents is, up to a constant, n ln(s) + c k ln(n), where n ln(s) is the likelihood term, with s the residual variance after regressing X onto its parents and n the number of data samples. The term c k ln(n) penalizes the complexity of the graph (here the number of edges), with k = 2p + 1, p the total number of parents of the variable X in the graph, and c = 2 by default, chosen empirically. The global score minimized by the algorithm is the sum over all variables of the individual BIC scores given the parent variables in the graph; a minimal sketch of this per-variable score is given after these notes.

  3.

    These methods can be extended to the multivariate case and used for causal graph identification by orienting each edge in turn.

  4.

    In some specific cases, such as the bivariate linear FCM with Gaussian noise and Gaussian input, the DAG cannot be identified from purely observational data even when the class of functions considered is restricted (Mooij et al. 2016).

  5.

    The first four datasets are available at http://dx.doi.org/10.7910/DVN/3757KX. The Tuebingen cause-effect pairs dataset is available at https://webdav.tuebingen.mpg.de/cause-effect/.

  6.

    Using the R program available at https://github.com/ssamot/causality for ANM, IGCI, PNL, GPI and LiNGAM.

  7.

    The data generator is available at https://github.com/GoudetOlivie/CGNN. The datasets considered are available at http://dx.doi.org/10.7910/DVN/UZMB69.

  8.

    The datasets considered are available at http://dx.doi.org/10.7910/DVN/UZMB69.
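
Referring back to note 2 above, the following is a minimal numpy sketch of that per-variable BIC score. It is our illustration under the stated assumptions (ordinary least-squares regression of the variable on its parents); the function name and defaults are ours.

```python
import numpy as np

def individual_bic_score(x, parents=None, c=2.0):
    """Per-variable score n*ln(s) + c*k*ln(n) described in note 2.

    x       : (n,) array with the values of the variable.
    parents : (n, p) array with the values of its parents (None if no parent).
    s is the residual variance after least-squares regression of x on its
    parents, k = 2*p + 1, and c = 2 by default.
    """
    n = x.shape[0]
    p = 0 if parents is None else parents.shape[1]
    if p == 0:
        residuals = x - x.mean()
    else:
        design = np.column_stack([np.ones(n), parents])   # intercept + parents
        coef, *_ = np.linalg.lstsq(design, x, rcond=None)
        residuals = x - design @ coef
    s = residuals.var()
    k = 2 * p + 1
    return n * np.log(s) + c * k * np.log(n)

# The global score of a candidate graph is the sum of these individual scores,
# each variable being scored with respect to its parent set in the graph.
```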

References

  • Bühlmann, P., Peters, J., Ernest, J., et al. (2014). Cam: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556.

  • Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554.

  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv.

  • Colombo, D. and Maathuis, M. H. (2014). Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1):3741–3782.

  • Colombo, D., Maathuis, M. H., Kalisch, M., and Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321.

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314.

  • Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. (2012). Inferring deterministic causal relations. arXiv preprint arXiv:1203.3475.

  • Drton, M. and Maathuis, M. H. (2016). Structure learning in graphical modeling. Annual Review of Statistics and Its Application, (0).

  • Edwards, R. (1964). Fourier analysis on groups.

  • Fonollosa, J. A. (2016). Conditional distribution variability measures for causality detection. arXiv preprint arXiv:1601.06680.

  • Goldberger, A. S. (1984). Reverse regression and salary discrimination. Journal of Human Resources.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems (NIPS), pages 2672–2680.

  • Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., Smola, A. J., et al. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems 19, page 513.

  • Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6(Dec):2075–2129.

  • Guyon, I. (2013). Chalearn cause effect pairs challenge.

  • Guyon, I. (2014). Chalearn fast causation coefficient challenge.

  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.

  • Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Neural Information Processing Systems (NIPS), pages 689–696.

  • Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., Bühlmann, P., et al. (2012). Causal inference using graphical models with the r package pcalg. Journal of Statistical Software, 47(11):1–26.

  • Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. ArXiv e-prints.

  • Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. NIPS.

  • Lopez-Paz, D. (2016). From dependence to causation. PhD thesis, University of Cambridge.

  • Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. O. (2015). Towards a learning theory of cause-effect inference. In ICML, pages 1452–1461.

  • Lopez-Paz, D. and Oquab, M. (2016). Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545.

  • Mendes, P., Sha, W., and Ye, K. (2003). Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19(suppl_2):ii122–ii129.

  • Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. (2016). Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102.

  • Nandy, P., Hauser, A., and Maathuis, M. H. (2015). High-dimensional consistency in score-based and hybrid structure learning. arXiv preprint arXiv:1507.02608.

  • Ogarrio, J. M., Spirtes, P., and Ramsey, J. (2016). A hybrid causal search algorithm for latent variable models. In Conference on Probabilistic Graphical Models, pages 368–379.

  • Pearl, J. (2003). Causality: models, reasoning and inference. Econometric Theory, 19(675-685):46.

  • Pearl, J. (2009). Causality. Cambridge university press.

  • Pearl, J. and Verma, T. (1991). A formal theory of inductive causation. University of California (Los Angeles). Computer Science Department.

  • Peters, J. and Bühlmann, P. (2013). Structural intervention distance (sid) for evaluating causal graphs. arXiv preprint arXiv:1306.1043.

  • Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference - Foundations and Learning Algorithms. MIT Press.

  • Quinn, J. A., Mooij, J. M., Heskes, T., and Biehl, M. (2011). Learning of causal relations. In ESANN.

  • Ramsey, J. D. (2015). Scaling up greedy causal search for continuous variables. arXiv preprint arXiv:1507.07749.

  • Richardson, T. and Spirtes, P. (2002). Ancestral graph markov models. The Annals of Statistics, 30(4):962–1030.

  • Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529.

  • Scheines, R. (1997). An introduction to causal inference.

  • Sgouritsa, E., Janzing, D., Hennig, P., and Schölkopf, B. (2015). Inference of cause and effect with unsupervised inverse regression. In AISTATS.

  • Shen-Orr, S. S., Milo, R., Mangan, S., and Alon, U. (2002). Network motifs in the transcriptional regulation network of escherichia coli. Nature genetics, 31(1):64.

  • Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. (2006). A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030.

  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.

  • Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, prediction and search. Lecture Notes in Statistics.

  • Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, prediction, and search. MIT press.

  • Spirtes, P., Meek, C., Richardson, T., and Meek, C. (1999). An algorithm for causal inference in the presence of latent variables and selection bias.

  • Spirtes, P. and Zhang, K. (2016). Causal discovery and inference: concepts and recent methodological advances. In Applied informatics, volume 3, page 3. Springer Berlin Heidelberg.

  • Statnikov, A., Henaff, M., Lytkin, N. I., and Aliferis, C. F. (2012). New methods for separating causes from effects in genomics data. BMC genomics, 13(8):S22.

  • Stegle, O., Janzing, D., Zhang, K., Mooij, J. M., and Schölkopf, B. (2010). Probabilistic latent variable models for distinguishing between cause and effect. In Neural Information Processing Systems (NIPS), pages 1687–1695.

  • Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65(1):31–78.

  • Van den Bulcke, T., Van Leemput, K., Naudts, B., van Remortel, P., Ma, H., Verschoren, A., De Moor, B., and Marchal, K. (2006). Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC bioinformatics, 7(1):43.

  • Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI ’90, pages 255–270, New York, NY, USA. Elsevier Science Inc.

  • Yamada, M., Jitkrittum, W., Sigal, L., Xing, E. P., and Sugiyama, M. (2014). High-dimensional feature selection by feature-wise kernelized lasso. Neural computation, 26(1):185–207.

  • Zhang, K. and Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 647–655. AUAI Press.

  • Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. (2012). Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775.

  • Zhang, K., Wang, Z., Zhang, J., and Schölkopf, B. (2016). On estimation of functional causal models: general results and application to the post-nonlinear causal model. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):13.

Author information

Corresponding author

Correspondence to Olivier Goudet.


Appendix

1.1 The Maximum Mean Discrepancy (MMD) Statistic

The Maximum Mean Discrepancy (MMD) statistic (Gretton et al. 2007) measures the distance between two probability distributions P and \(\hat {P}\), defined over \(\mathbb {R}^d\), as the real-valued quantity

$$\displaystyle \begin{aligned} \text{MMD}_k(P, \hat{P}) = \left\| \mu_k(P) - \mu_k(\hat{P}) \right\|_{\mathcal{H}_k}. \end{aligned}$$

Here, \(\mu_k(P) = \int k(x, \cdot)\, \mathrm{d}P(x)\) is the kernel mean embedding of the distribution P, according to the real-valued symmetric kernel function \(k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}_k}\) with associated reproducing kernel Hilbert space \(\mathcal{H}_k\). Therefore, \(\mu_k(P)\) summarizes P as the expected value of the features computed by k over samples drawn from P.

In practical applications, we do not have access to the distributions P and \(\hat{P}\), but only to their respective sets of samples \(\mathcal{D}\) and \(\hat{\mathcal{D}}\), defined in Sect. 4.2.1. In this case, we approximate the kernel mean embedding \(\mu_k(P)\) by the empirical kernel mean embedding \(\mu_k(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} k(x, \cdot)\), and similarly for \(\hat{P}\). Then, the empirical MMD statistic is

$$\displaystyle \begin{aligned} \widehat{\text{MMD}}_k(\mathcal{D}, \hat{\mathcal{D}}) &= \left\| \mu_k(\mathcal{D}) - \mu_k(\hat{\mathcal{D}}) \right\|_{\mathcal{H}_k}^2 \\ &=\frac{1}{n^2} \sum_{i, j}^{n} k(x_i, x_j) + \frac{1}{n^2} \sum_{i, j}^{n} k(\hat{x}_i, \hat{x}_j) - \frac{2}{n^2} \sum_{i,j}^n k(x_i, \hat{x}_j). \end{aligned} $$

Importantly, the empirical MMD tends to zero as n → ∞ if and only if \(P = \hat{P}\), as long as k is a characteristic kernel (Gretton et al. 2007). This property makes the MMD an excellent choice to measure how close the observational distribution P is to the estimated observational distribution \(\hat{P}\). Throughout this chapter, we employ a particular characteristic kernel: the Gaussian kernel \(k(x, x') = \exp(-\gamma \| x - x' \|_2^2)\), where γ > 0 is a hyperparameter controlling the smoothness of the features.
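
As a concrete reference for the formula above, here is a small numpy sketch (ours, not the chapter's code) of the quadratic-time statistic with a single Gaussian kernel; the bandwidth γ and any multi-kernel combination used in the experiments are not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def empirical_mmd(x, x_hat, gamma=1.0):
    """Quadratic-time empirical (squared) MMD between samples x and x_hat (both n x d)."""
    n = x.shape[0]
    return (gaussian_kernel(x, x, gamma).sum()
            + gaussian_kernel(x_hat, x_hat, gamma).sum()
            - 2.0 * gaussian_kernel(x, x_hat, gamma).sum()) / n**2
```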

In terms of computation, the evaluation of \(\widehat{\text{MMD}}_k(\mathcal{D}, \hat{\mathcal{D}})\) takes O(n^2) time, which is prohibitive for large n. When using a shift-invariant kernel, such as the Gaussian kernel, one can invoke Bochner's theorem (Edwards 1964) to obtain a linear-time approximation to the empirical MMD (Lopez-Paz et al. 2015), of the form

$$\displaystyle \begin{aligned} \widehat{\text{MMD}}^m_k(\mathcal{D}, \hat{\mathcal{D}}) = \left\| \hat{\mu}_k(\mathcal{D}) - \hat{\mu}_k(\hat{\mathcal{D}}) \right\|_{\mathbb{R}^m} \end{aligned}$$

and O(mn) evaluation time. Here, the approximate empirical kernel mean embedding has the form

$$\displaystyle \begin{aligned} \hat{\mu}_k(\mathcal{D}) = \sqrt{\frac{2}{m}} \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \left[ \cos{}(\langle w_1, x \rangle + b_1), \ldots, \cos{}(\langle w_m, x \rangle + b_m) \right], \end{aligned}$$

where \(w_i\) is drawn from the normalized Fourier transform of k, and \(b_i \sim U[0, 2\pi]\), for i = 1, …, m. In our experiments, we compare the performance and computation times of both \(\widehat{\text{MMD}}_k\) and \(\widehat{\text{MMD}}_k^m\).
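
For illustration, here is a numpy sketch (ours) of this linear-time approximation, assuming the Gaussian kernel defined above: by Bochner's theorem its spectral density is Gaussian, and the same random draws (w_i, b_i) must be shared by the two samples.

```python
import numpy as np

def approx_mmd(x, x_hat, m=100, gamma=1.0, seed=0):
    """O(mn) random-feature approximation of the empirical MMD.

    For k(x, x') = exp(-gamma * ||x - x'||^2), the random frequencies are
    w_i ~ N(0, 2*gamma*I) and the phases b_i ~ U[0, 2*pi].
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, m))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)

    def embedding(s):
        # Approximate empirical kernel mean embedding, a vector in R^m.
        return (np.sqrt(2.0 / m) * np.cos(s @ w + b)).mean(axis=0)

    return np.linalg.norm(embedding(x) - embedding(x_hat))
```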

1.2 Proofs

Proposition 1

Let X = [X_1, …, X_d] denote a set of continuous random variables with joint distribution P, and further assume that the joint density function h of P is continuous and strictly positive on a compact and convex subset of \(\mathbb{R}^{d}\), and zero elsewhere. Letting \(\mathcal{G}\) be a DAG such that P can be factorized along \(\mathcal{G}\),

$$\displaystyle \begin{aligned}P(X) = \prod_i P(X_i | X_{\text{Pa}({i}; \mathcal{G})}),\end{aligned}$$

there exists f = (f_1, …, f_d), with f_i a continuous function with compact support in \(\mathbb{R}^{|\text{Pa}({i}; \mathcal{G})|}\times [0,1]\), such that P(X) equals the generative model defined from the FCM \((\mathcal{G}, f, \mathcal{E})\), with \(\mathcal{E} = \mathcal{U}[0,1]\) the uniform distribution on [0, 1].

Proof

The proof is by induction on the topological order of \(\mathcal{G}\). Let X_i be such that \(|\text{Pa}({i}; \mathcal{G})|=0\) and consider the cumulative distribution function F_i(x_i) defined over the domain of X_i by F_i(x_i) = Pr(X_i < x_i). F_i is strictly monotonic since the joint density function is strictly positive; therefore its inverse, the quantile function \(Q_i : [0, 1] \mapsto \text{dom}(X_i)\), is well defined and continuous. By construction, \(Q_i(e_i) = F_i^{-1}(e_i)\), and setting f_i = Q_i yields the result.

Assume f_i is defined for all variables X_i with topological order less than m. Let X_j be a variable with topological order m and Z the vector of its parent variables. For any noise vector \(e = (e_i, i \in \text{Pa}({j}; \mathcal{G}))\), let \(z = (x_i, i \in \text{Pa}({j}; \mathcal{G}))\) be the vector of values of the variables in Z computed from e. The conditional cumulative distribution function F_j(x_j | Z = z) = Pr(X_j < x_j | Z = z) is continuous and strictly monotonic with respect to x_j, and can be inverted using the same argument as above. We can then define \(f_j(z, e_j) = F_j^{-1}(e_j \mid Z = z)\), the inverse being taken with respect to x_j.

Let K_j = dom(X_j) and \(K_{\text{Pa}({j}; \mathcal{G})} = \text{dom}(Z)\). We now show that the function f_j is continuous on \(K_{\text{Pa}({j}; \mathcal{G})} \times [0,1]\), a compact subset of \(\mathbb{R}^{|\text{Pa}({j}; \mathcal{G})|}\times [0,1]\).

By assumption, there exists \(a_j \in \mathbb{R}\) such that, for \((x_j, z) \in K_j \times K_{\text{Pa}({j}; \mathcal{G})}\), \(F(x_j|z) = \int_{a_j}^{x_j} \frac{h_j(u,z)}{h_j(z)} \mathrm{d}u\), with h_j a continuous and strictly positive density function. For \((a,b) \in K_j \times K_{\text{Pa}({j}; \mathcal{G})}\), as the function \((u, z) \rightarrow \frac{h_j(u,z)}{h_j(z)}\) is continuous on the compact \(K_j \times K_{\text{Pa}({j}; \mathcal{G})}\), we have \(\lim\limits_{x_j \rightarrow a} F(x_j|z) = \int_{a_j}^{a} \frac{h_j(u,z)}{h_j(z)} \mathrm{d}u\) uniformly on \(K_{\text{Pa}({j}; \mathcal{G})}\) and \(\lim\limits_{z \rightarrow b} F(x_j|z) = \int_{a_j}^{x_j} \frac{h_j(u,b)}{h_j(b)} \mathrm{d}u\) on K_j; by the theorem on exchanging limits, F is continuous at (a, b).

For any sequence z_n → z, we have that F(x_j | z_n) → F(x_j | z) uniformly in x_j. Define two sequences u_n and x_{j,n}, on [0, 1] and K_j respectively, such that u_n → u and x_{j,n} → x_j. As F(x_j | z) = u has the unique root x_j = f_j(z, u), the root of F(x_j | z_n) = u_n, that is, x_{j,n} = f_j(z_n, u_n), converges to x_j. Hence the function (z, u) → f_j(z, u) is continuous on \(K_{\text{Pa}({j}; \mathcal{G})} \times [0,1]\).
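
For intuition about this construction, here is a toy numpy/scipy sketch (ours) of the quantile-function argument on the two-variable graph X → Y; the Gaussian target distributions are an illustrative assumption, not part of the proof.

```python
import numpy as np
from scipy.stats import norm

def sample_fcm_x_to_y(n=1000, seed=0):
    """Rewrite the toy joint distribution X ~ N(0,1), Y | X ~ N(X,1) as an FCM
    with independent U[0,1] noise: each mechanism f_i is the (conditional)
    quantile function of X_i given its parents, as in the proof above."""
    rng = np.random.default_rng(seed)
    e_x = rng.uniform(size=n)      # E_X ~ U[0, 1]
    e_y = rng.uniform(size=n)      # E_Y ~ U[0, 1], independent of E_X
    x = norm.ppf(e_x)              # f_X(e_X) = F_X^{-1}(e_X)
    y = x + norm.ppf(e_y)          # f_Y(x, e_Y) = F_{Y|X}^{-1}(e_Y | x)
    return x, y
```

The returned sample follows the target joint distribution, matching the statement that any distribution satisfying the proposition's assumptions can be generated by an FCM driven by uniform noise.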

Proposition 2

For m ∈ [[1, d]], let Z_m denote the set of variables with topological order less than m and let d_m be its size. For any d_m-dimensional vector of noise values e^{(m)}, let z_m(e^{(m)}) (resp. \(\widehat{z_m}(e^{(m)})\)) be the vector of values computed in topological order from the FCM \((\mathcal{G}, f, \mathcal{E})\) (resp. the CGNN \((\mathcal{G}, \hat{f}, \mathcal{E})\)). For any ε > 0, there exists a set of networks \(\hat{f}\) with architecture \(\mathcal{G}\) such that

$$\displaystyle \begin{aligned} \forall e^{(m)}, \quad \|z_m(e^{(m)})- \widehat{z_m}(e^{(m)})\| < \epsilon \end{aligned} $$
(14)

Proof

The proof is by induction on the topological order of \(\mathcal{G}\). Let X_i be such that \(|\text{Pa}({i}; \mathcal{G})|=0\). By the universal approximation theorem (Cybenko 1989), as f_i is a continuous function on a compact subset of \(\mathbb{R}\), there exists a neural network \(\hat{f_{i}}\) such that \(\|f_i - \hat{f_{i}}\|_\infty < \epsilon/d_1\). Thus Eq. (14) holds for the set of networks \(\hat{f_i}\) for i ranging over the variables with topological order 0.

Assume that Proposition 2 holds up to order m, and assume for brevity that there exists a single variable X_j with topological order m + 1. Let \(\hat{f_j}\) be such that \(\|f_j - \hat{f_j}\|_\infty < \epsilon/3\) (by the universal approximation property); let δ be such that for all u, \(\|\hat f_j(u) - \hat f_j(u+\delta)\| < \epsilon/3\) (by uniform continuity on a compact); and let \(\hat f_i\) satisfy Eq. (14) for i with topological order less than m, with tolerance \(\min(\epsilon/3, \delta)/d_m\). It follows that \(\|(z_m, f_j(z_m,e_j)) - (\hat z_m, \hat{f_j}(\hat{z_m}, e_j))\| \le \|z_m - \hat z_m\| + |f_j(z_m,e_j) - \hat{f_j}(z_m, e_j)| + |\hat{f_j}(z_m,e_j) - \hat{f_j}(\hat{z_m}, e_j)| < \epsilon/3 + \epsilon/3 + \epsilon/3 = \epsilon\), which ends the proof.

Proposition 3

Let \(\mathcal{D}\) be an infinite observational sample generated from \((\mathcal{G}, f, \mathcal{E})\). With the same notations as in Proposition 2, for every sequence \(\epsilon_t > 0\) going to zero as t → ∞, there exists a set \(\widehat{f_t} = (\hat f^{t}_1, \ldots, \hat f^{t}_d)\) such that \(\widehat{\text{MMD}}_k\) between \(\mathcal{D}\) and an infinite-size sample \(\widehat{\mathcal{D}}_{t}\) generated from the CGNN \((\mathcal{G},\widehat{f_{t}},\mathcal{E})\) is less than \(\epsilon_t\).

Proof

According to Proposition 2 and with the same notations, letting \(\epsilon_t > 0\) go to 0 as t goes to infinity, consider \({\hat f}_t=(\hat f^{t}_1, \ldots, \hat f^{t}_d)\) and \(\hat{z_t}\) defined from \({\hat f}_t\) such that for all e ∈ [0, 1]^d, \(\|z(e)- \widehat{z}_t(e)\| < \epsilon_t\).

Let \(\hat{\mathcal{D}_t}\) denote the infinite sample generated from \(\hat{f_t}\). The score of the CGNN \((\mathcal{G},\hat{f_t},\mathcal{E})\) is \(\widehat{\text{MMD}}_k(\mathcal{D}, \hat{\mathcal{D}_t}) = \mathbb{E}_{e,e'}[k(z(e),z(e')) - 2 k(z(e),\widehat{z}_t(e')) + k(\widehat{z}_t(e), \widehat{z}_t(e'))]\).

As \(\hat{f_t}\) converges towards f on the compact [0, 1]^d, using the bounded convergence theorem on a compact subset of \(\mathbb{R}^{d}\), \(\widehat{z_t}(e) \rightarrow z(e)\) uniformly as t → ∞; it then follows from the Gaussian kernel function being bounded and continuous that \(\widehat{\text{MMD}}_k(\mathcal{D}, \hat{\mathcal{D}_t}) \rightarrow 0\) as t → ∞.

Proposition 4

Let X = [X_1, …, X_d] denote a set of continuous random variables with joint distribution P, generated by a CGNN \(\mathcal{C}_{\mathcal{G},f} = (\mathcal{G}, f, \mathcal{E})\) with \(\mathcal{G}\) a directed acyclic graph, and let \(\mathcal{D}\) be an infinite observational sample generated from this CGNN. We assume that P is Markov and faithful to the graph \(\mathcal{G}\), and that every pair of variables (X_i, X_j) that are d-connected in the graph are not independent. We denote by \(\widehat{\mathcal{D}}\) an infinite sample generated by a candidate CGNN \(\mathcal{C}_{\widehat{\mathcal{G}},\hat{f}} = (\widehat{\mathcal{G}}, \hat{f}, \mathcal{E})\). Then,

  (i)

    If \(\widehat {\mathcal {G}} = \mathcal {G}\) and \(\hat {f} = f\) , then \(\widehat {\mathit{\text{MMD}}}_k(\mathcal {D}, \widehat {\mathcal {D}}) = 0\).

  (ii)

    For any graph \(\widehat {\mathcal {G}}\) characterized by the same adjacencies but not belonging to the Markov equivalence class of \(\mathcal {G}\) , for all \(\hat {f}\) , \(\widehat {\mathit{\text{MMD}}}_k(\mathcal {D}, \widehat {\mathcal {D}}) \neq 0\).

Proof

(i) is immediate: with \(\widehat{\mathcal{G}} = \mathcal{G}\) and \(\hat{f} = f\), the joint distribution \(\hat{P}\) generated by \(\mathcal{C}_{\widehat{\mathcal{G}},\hat{f}} = (\widehat{\mathcal{G}}, \hat{f}, \mathcal{E})\) is equal to P, hence \(\widehat{\text{MMD}}_k(\mathcal{D}, \widehat{\mathcal{D}}) = 0\).

(ii) Consider \(\widehat{\mathcal{G}}\), a DAG characterized by the same adjacencies but that does not belong to the Markov equivalence class of \(\mathcal{G}\). According to Verma and Pearl (1991), as the DAGs \(\mathcal{G}\) and \(\widehat{\mathcal{G}}\) have the same adjacencies but are not Markov equivalent, they are not characterized by the same v-structures.

  a)

    First, we consider the case where a v-structure {X, Y, Z} exists in \(\mathcal{G}\), but not in \(\widehat{\mathcal{G}}\). As the distribution P is faithful to \(\mathcal{G}\) and X and Z are not d-separated by Y in \(\mathcal{G}\), X and Z are not independent conditionally on Y in P. Now we consider the graph \(\widehat{\mathcal{G}}\). Let \(\hat{f}\) be a set of neural networks. We denote by \(\hat{P}\) the distribution generated by the CGNN \(\mathcal{C}_{\widehat{\mathcal{G}},\hat{f}}\). As \(\widehat{\mathcal{G}}\) is a directed acyclic graph and the variables E_i are mutually independent, \(\hat{P}\) is Markov with respect to \(\widehat{\mathcal{G}}\). As {X, Y, Z} is not a v-structure in \(\widehat{\mathcal{G}}\), X and Z are d-separated by Y. By the causal Markov assumption, we obtain that (X ⊥⊥ Z | Y) in \(\hat{P}\).

  b)

    Second, we consider the case where a v-structure {X, Y, Z} exists in \(\widehat{\mathcal{G}}\), but not in \(\mathcal{G}\). As {X, Y, Z} is not a v-structure in \(\mathcal{G}\), there is an "unblocked path" between the variables X and Z: the variables X and Z are d-connected. By assumption, there does not exist a set D not containing Y such that (X ⊥⊥ Z | D) in P. In \(\widehat{\mathcal{G}}\), as {X, Y, Z} is a v-structure, there exists a set D not containing Y that d-separates X and Z. As every CGNN \(\mathcal{C}_{\widehat{\mathcal{G}},\hat{f}}\) generates a distribution \(\hat{P}\) that is Markov with respect to \(\widehat{\mathcal{G}}\), we have that (X ⊥⊥ Z | D) in \(\hat{P}\).

In the two cases a) and b) considered above, P and \(\hat{P}\) do not encode the same conditional independence relations, and thus are not equal. Hence \(\widehat{\text{MMD}}_k(\mathcal{D}, \widehat{\mathcal{D}}) \neq 0\).

1.3 Table of Scores for the Experiments on Cause-Effect Pairs

See Table 3.

1.4 Table of Scores for the Experiments on Graphs

See Tables 4, 5 and 6.


Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Goudet, O., Kalainathan, D., Caillou, P., Guyon, I., Lopez-Paz, D., Sebag, M. (2018). Learning Functional Causal Models with Generative Neural Networks. In: Escalante, H., et al. Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-98131-4_3

  • DOI: https://doi.org/10.1007/978-3-319-98131-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98130-7

  • Online ISBN: 978-3-319-98131-4
