Least-squares independence regression for nonlinear causal inference under non-Gaussian noise
Abstract
The discovery of nonlinear causal relationships under additive non-Gaussian noise models has attracted considerable attention recently because of their high flexibility. In this paper, we propose a novel causal inference algorithm called least-squares independence regression (LSIR). LSIR learns the additive noise model through the minimization of an estimator of the squared-loss mutual information between inputs and residuals. A notable advantage of LSIR is that tuning parameters such as the kernel width and the regularization parameter can be naturally optimized by cross-validation, allowing us to avoid overfitting in a data-dependent fashion. Through experiments with real-world datasets, we show that LSIR compares favorably with a state-of-the-art causal inference method.
Keywords
Causal inference · Nonlinear · Non-Gaussian · Squared-loss mutual information · Least-squares independence regression

1 Introduction
Learning causality from data is one of the important challenges in the artificial intelligence, statistics, and machine learning communities (Pearl 2000). A traditional method of learning causal relationships from observational data is based on the linear-dependence Gaussian-noise model (Geiger and Heckerman 1994). However, the linear-Gaussian assumption is too restrictive and may not be fulfilled in practice. Recently, non-Gaussianity and nonlinearity have been shown to be beneficial in causal inference, allowing one to break the symmetry between observed variables (Shimizu et al. 2006; Hoyer et al. 2009). Since then, much attention has been paid to the discovery of nonlinear causal relationships through non-Gaussian noise models (Mooij et al. 2009).
In the framework of nonlinear non-Gaussian causal inference, the relation between a cause X and an effect Y is assumed to be described by Y = f(X) + E, where f is a nonlinear function and E is non-Gaussian additive noise that is independent of the cause X. Given two random variables X and X′, the causal direction between X and X′ is decided based on a hypothesis test of whether the causal model X′ = f(X) + E or the alternative model X = f′(X′) + E′ fits the data well; here, the goodness of fit is measured by the independence between inputs and residuals (i.e., estimated noise). Hoyer et al. (2009) proposed to learn the functions f and f′ by Gaussian process (GP) regression (Bishop 2006), and to evaluate the independence between inputs and residuals by the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al. 2005).
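To make the two-direction test concrete, the following is a self-contained toy sketch, not the GP+HSIC implementation of Hoyer et al. (2009): kernel ridge regression plays the role of f, a biased empirical HSIC with a fixed Gaussian width measures input-residual independence, and all function names and parameter values are our own illustrative choices.

```python
import numpy as np

def gauss_kernel(a, b, sigma):
    # Gaussian kernel matrix between two 1-D sample vectors
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def hsic(x, y, sigma=0.5):
    # biased empirical HSIC: tr(K H L H) / n^2 with centering matrix H = I - (1/n) 1 1^T
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gauss_kernel(x, x, sigma) @ H
                    @ gauss_kernel(y, y, sigma) @ H) / n ** 2

def residuals(inp, out, sigma=0.5, lam=1e-2):
    # kernel ridge regression of `out` on `inp`; returns out - f(inp)
    K = gauss_kernel(inp, inp, sigma)
    return out - K @ np.linalg.solve(K + lam * np.eye(len(inp)), out)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = x ** 3 + 0.3 * rng.uniform(-1, 1, 300)  # ground truth: X causes Y, uniform (non-Gaussian) noise

dep_xy = hsic(x, residuals(x, y))   # model Y = f(X) + E: input vs residual
dep_yx = hsic(y, residuals(y, x))   # model X = f'(Y) + E': input vs residual
direction = "X->Y" if dep_xy < dep_yx else "X<-Y"
```

In the correct direction the residuals recover the independent noise, so the input-residual dependence is smaller; in the wrong direction the residuals remain dependent on the input.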
However, since standard regression methods such as GP regression are designed to handle Gaussian noise, they may not be suited to discovering causality in the non-Gaussian additive noise formulation. To cope with this problem, a novel regression method called HSIC regression (HSICR) has been introduced recently (Mooij et al. 2009). HSICR learns a function so that the dependence between inputs and residuals is directly minimized based on HSIC. Since HSICR does not impose any parametric assumption on the distribution of the additive noise, it is suited to nonlinear non-Gaussian causal inference. Indeed, HSICR was shown to outperform the GP-based method in experiments (Mooij et al. 2009).
However, HSICR still has limitations for its practical use. The first weakness of HSICR is that the kernel width of HSIC needs to be determined manually. Since the choice of the kernel width heavily affects the sensitivity of the independence measure (Fukumizu et al. 2009), the lack of a systematic model selection strategy is critical in causal inference. Setting the kernel width to the median distance between sample points is a popular heuristic in kernel methods (Schölkopf and Smola 2002), but it does not always perform well in practice. Another limitation of HSICR is that the kernel width of the regression model is fixed to the same value as that of HSIC. This crucially limits the flexibility of function approximation in HSICR.
To overcome the above weaknesses, we propose an alternative regression method for causal inference called least-squares independence regression (LSIR). Like HSICR, LSIR learns a function so that the dependence between inputs and residuals is directly minimized. The difference is that, instead of HSIC, LSIR adopts an independence criterion called least-squares mutual information (LSMI) (Suzuki et al. 2009), which is a consistent estimator of the squared-loss mutual information (SMI) with the optimal convergence rate. An advantage of LSIR over HSICR is that tuning parameters such as the kernel width and the regularization parameter can be naturally optimized through cross-validation (CV) with respect to the LSMI criterion.
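To convey the idea of dependence-minimizing regression concretely, here is a naive, self-contained sketch. It is not the authors' implementation (which uses the analytic gradient of LSMI): a linear-in-parameters model is warm-started by ordinary least squares and then updated by finite-difference descent on an LSMI estimate of input-residual dependence, accepting only improving steps; all parameter values are illustrative.

```python
import numpy as np

def lsmi(x, e, sigma=0.5, lam=1e-3):
    # compact LSMI estimate with product Gaussian kernels centred at all paired samples
    n = len(x)
    Kx = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    Ke = np.exp(-(e[:, None] - e[None, :]) ** 2 / (2 * sigma ** 2))
    h = (Kx * Ke).mean(axis=0)                    # mean of basis values over paired samples
    H = (Kx.T @ Kx) * (Ke.T @ Ke) / n ** 2        # mean over all n^2 input-residual pairings
    return 0.5 * h @ np.linalg.solve(H + lam * np.eye(n), h) - 0.5

def lsir_sketch(x, y, n_basis=10, n_iter=20, eps=1e-4, step=0.5, seed=0):
    # f(x) = sum_l beta_l k(x, c_l); beta is updated by finite-difference
    # descent on LSMI(inputs, residuals), keeping only improving steps
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=n_basis, replace=False)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.5 ** 2))
    beta = np.linalg.lstsq(Phi, y, rcond=None)[0]  # least-squares warm start
    obj0 = obj = lsmi(x, y - Phi @ beta)
    for _ in range(n_iter):
        grad = np.array([(lsmi(x, y - Phi @ (beta + eps * np.eye(n_basis)[l]))
                          - obj) / eps for l in range(n_basis)])
        cand = beta - step * grad
        cand_obj = lsmi(x, y - Phi @ cand)
        if cand_obj < obj:
            beta, obj = cand, cand_obj
        else:
            step *= 0.5                            # backtrack when no improvement
    return beta, obj0, obj                         # obj0: initial LSMI, obj: final LSMI
```

Because a step is accepted only when the objective decreases, the final input-residual dependence estimate never exceeds the warm-start value.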
Furthermore, we propose to determine the kernel width of the regression model based on CV with respect to SMI itself. Thus, the kernel width of the regression model is determined independently of that of the independence measure. This allows LSIR greater flexibility in nonlinear causal inference than HSICR. Through experiments with benchmark and real-world biological datasets, we demonstrate the superiority of LSIR.
A preliminary version of this work appeared in Yamada and Sugiyama (2010); here we provide a more comprehensive derivation and discussion of LSIR, as well as a more detailed experimental section.
2 Dependence minimizing regression by LSIR
2.1 Estimation of squaredloss mutual information
SMI cannot be directly computed since it contains the unknown densities \(p(x, \hat{e})\), \(p(x)\), and \(p(\hat{e})\). Here, we briefly review an SMI estimator called least-squares mutual information (LSMI) (Suzuki et al. 2009).
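Although the full derivation is omitted here, the analytic form of LSMI can be sketched compactly. The following is our own illustrative implementation under simplifying assumptions (one-dimensional variables, product Gaussian kernel basis functions centred at a random subset of the paired samples); it uses one common plug-in form of the estimate, \(\widehat{\mathrm{SMI}} = \frac{1}{2}\hat{h}^{\top}\hat{\alpha} - \frac{1}{2}\) with \(\hat{\alpha} = (\hat{H} + \lambda I)^{-1}\hat{h}\).

```python
import numpy as np

def lsmi(x, e, sigma=0.5, lam=1e-3, n_basis=100, seed=0):
    """Least-squares mutual information (LSMI) estimate of SMI(X, E).

    Illustrative sketch: the density ratio p(x,e)/(p(x)p(e)) is modelled as
    alpha^T phi(x,e) with product Gaussian kernel basis functions, and alpha
    is obtained analytically by regularized least squares.
    """
    x, e = np.asarray(x, float), np.asarray(e, float)
    n = len(x)
    rng = np.random.default_rng(seed)
    c = rng.choice(n, size=min(n_basis, n), replace=False)
    Kx = np.exp(-(x[:, None] - x[c][None, :]) ** 2 / (2 * sigma ** 2))
    Ke = np.exp(-(e[:, None] - e[c][None, :]) ** 2 / (2 * sigma ** 2))
    h = (Kx * Ke).mean(axis=0)                # mean of phi over the paired samples
    H = (Kx.T @ Kx) * (Ke.T @ Ke) / n ** 2    # mean of phi phi^T over all n^2 pairings
    alpha = np.linalg.solve(H + lam * np.eye(len(c)), h)
    return 0.5 * h @ alpha - 0.5              # plug-in SMI estimate
```

The estimate is close to zero for independent variables and grows with the strength of the dependence.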
2.2 Model selection in LSMI
2.3 Least-squares independence regression
2.4 Model selection in LSIR
2.5 Causal direction inference by LSIR
In the previous section, we described a dependence-minimizing regression method, LSIR, equipped with CV for model selection. In this section, following Hoyer et al. (2009), we explain how LSIR can be used for causal direction inference.
Our final goal is, given i.i.d. paired samples \(\{(x_{i}, y_{i})\}_{i = 1}^{n}\), to determine whether X causes Y or vice versa. To this end, we test whether the causal model \(Y = f_{Y}(X) + E_{Y}\) or the alternative model \(X = f_{X}(Y) + E_{X}\) fits the data well, where the goodness of fit is measured by the independence between inputs and residuals (i.e., estimated noise). Independence of inputs and residuals may be decided in practice by the permutation test (Efron and Tibshirani 1993).
More specifically, we first run LSIR for \(\{(x_{i}, y_{i})\}_{i = 1}^{n}\) as usual, and obtain a regression function \(\hat{f}\). This procedure also provides an SMI estimate for \(\{(x_{i}, \hat{e}_{i}) \mid \hat{e}_{i} = y_{i} - \hat{f}(x_{i})\}_{i = 1}^{n}\). Next, we randomly permute the pairs of inputs and residuals \(\{(x_{i}, \hat{e}_{i})\}_{i = 1}^{n}\) as \(\{(x_{i}, \hat{e}_{\kappa(i)})\}_{i = 1}^{n}\), where \(\kappa(\cdot)\) is a randomly generated permutation function. Note that the permuted pairs of samples are independent of each other, since the random permutation breaks the dependency between X and \(\hat{E}\) (if it exists). Then we compute SMI estimates for the permuted data \(\{(x_{i}, \hat{e}_{\kappa(i)})\}_{i = 1}^{n}\) by LSMI. This random permutation process is repeated many times (in our experiments, the number of repetitions is set at 1000), and the distribution of SMI estimates under the null hypothesis (i.e., independence) is constructed. Finally, the p-value is approximated by evaluating the relative ranking of the SMI estimate computed from the original input-residual data over the distribution of SMI estimates for the randomly permuted data.
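The permutation procedure above can be summarized in a few lines. This is a generic sketch: the `smi_estimator` argument is a placeholder for any dependence estimate such as LSMI, and the +1 correction is a common finite-sample convention, not necessarily the one the authors used.

```python
import numpy as np

def permutation_pvalue(smi_estimator, x, residuals, n_perm=1000, seed=0):
    # null distribution of the dependence estimate, obtained by randomly
    # permuting residuals so that any input-residual dependence is broken
    rng = np.random.default_rng(seed)
    observed = smi_estimator(x, residuals)
    null = [smi_estimator(x, rng.permutation(residuals)) for _ in range(n_perm)]
    # p-value: relative ranking of the observed estimate in the null distribution
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)
```

A small p-value means the observed input-residual dependence is unusually large compared with the permuted (independent) case, so the null hypothesis of independence is rejected.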

Based on the p-values \(p_{X \rightarrow Y}\) and \(p_{X \leftarrow Y}\) for the two candidate models and significance thresholds \(\delta_{1}\) and \(\delta_{2}\), the causal direction is determined as follows:

- If \(p_{X \rightarrow Y} > \delta_{2}\) and \(p_{X \leftarrow Y} \le \delta_{1}\), the causal model X → Y is chosen.
- If \(p_{X \leftarrow Y} > \delta_{2}\) and \(p_{X \rightarrow Y} \le \delta_{1}\), the causal model X ← Y is selected.
- If \(p_{X \rightarrow Y}, p_{X \leftarrow Y} \le \delta_{1}\), the causal relation is not an additive noise model.
- If \(p_{X \rightarrow Y}, p_{X \leftarrow Y} > \delta_{1}\), the joint distribution seems to be close to one of the few exceptions that admit additive noise models in both directions.
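The four-way decision rule can be stated directly in code. This sketch assumes \(\delta_1 \le \delta_2\) and uses our own labels for the outcomes:

```python
def decide_direction(p_xy, p_yx, delta1=0.01, delta2=0.05):
    """Causal decision from the two permutation-test p-values.

    p_xy: p-value for the model X -> Y (inputs vs residuals),
    p_yx: p-value for the model X <- Y; delta1 <= delta2 is assumed."""
    if p_xy > delta2 and p_yx <= delta1:
        return "X->Y"                      # only the forward model fits
    if p_yx > delta2 and p_xy <= delta1:
        return "X<-Y"                      # only the backward model fits
    if p_xy <= delta1 and p_yx <= delta1:
        return "no additive noise model"   # neither direction fits
    if p_xy > delta1 and p_yx > delta1:
        return "fits both directions"      # rare symmetric case
    return "undecided"                     # p-values in the grey zone
```

The final `"undecided"` branch covers the remaining combinations, e.g. one p-value between the two thresholds.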
In our preliminary experiments, we empirically observed that SMI estimates obtained by LSIR tend to be affected by the choice of basis functions in LSIR. To mitigate this problem, we run LSIR and compute an SMI estimate five times with randomly chosen basis functions. Then the regression function that gives the smallest SMI estimate among the five repetitions is selected, and the permutation test is performed for that regression function.
2.6 Illustrative examples
Figure 4 depicts the regressor obtained by LSIR, which gives a good approximation to the true function. We repeated the experiment 1000 times with different random seeds. At the significance level 5%, LSIR successfully accepted the null hypothesis 992 times out of 1000 runs.
Next, we consider the backward case where the roles of X and Y are swapped. In this case, the ground truth is that the input and the residual are dependent (see Fig. 4). Therefore, the null hypothesis should be rejected (i.e., the p-values should be small). Figure 5(b) shows the histogram of \(p_{X \leftarrow Y}\) obtained by LSIR over 1000 runs. LSIR rejected the null hypothesis 989 times out of 1000 runs; the mean and standard deviation of the p-values over the 1000 runs are 0.0035 and 0.0094, respectively.
Figure 5(c) depicts the p-values for both directions in a trial-wise manner. The graph shows that LSIR correctly estimated the causal direction in every trial (i.e., \(p_{X \rightarrow Y} > p_{X \leftarrow Y}\)), and the margin between \(p_{X \rightarrow Y}\) and \(p_{X \leftarrow Y}\) is clear (i.e., most of the points lie well below the diagonal line). This illustrates the usefulness of LSIR in causal direction inference.
Finally, we investigate the values of the independence measure \(\widehat {\mbox{SMI}}\), which are plotted in Fig. 5(d), again in a trial-wise manner. The graph implies that the values of \(\widehat{\mbox{SMI}}\) may be used directly for determining the causal direction, instead of the p-values. Indeed, the correct causal direction (i.e., \(\widehat{\mbox{SMI}}_{X\rightarrow Y}< \widehat{\mbox{SMI}}_{X\leftarrow Y}\)) was found in 999 out of 1000 trials by this simplified method. This is a practically useful heuristic since it avoids performing the computationally intensive permutation test.
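The simplified heuristic amounts to a single comparison of the two dependence estimates; a sketch (the labels are ours):

```python
def direction_by_smi(smi_xy, smi_yx):
    # smaller estimated input-residual dependence indicates the
    # better-fitting causal model, so no permutation test is needed
    return "X->Y" if smi_xy < smi_yx else "X<-Y"
```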
3 Existing method: HSIC regression
In this section, we review the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al. 2005) and HSIC regression (HSICR) (Mooij et al. 2009).
3.1 Hilbert-Schmidt independence criterion (HSIC)
The Hilbert-Schmidt independence criterion (HSIC) (Gretton et al. 2005) is a state-of-the-art measure of statistical independence based on characteristic functions (see also Feuerverger 1993; Kankainen 1995). Here, we review the definition of HSIC and explain its basic properties.
The cross-covariance operator is a generalization of the cross-covariance matrix between random vectors. When \({\mathcal{F}}\) and \({\mathcal{G}}\) are universal RKHSs (Steinwart 2001) defined on compact domains \({\mathcal{X}}\) and \({\mathcal{E}}\), respectively, the largest singular value of the cross-covariance operator C is zero if and only if x and e are independent. Gaussian RKHSs are examples of universal RKHSs.
\(\widehat{\mathrm{HSIC}}\) depends on the choice of the universal RKHSs \({\mathcal{F}}\) and \({\mathcal{G}}\). In the original HSIC paper (Gretton et al. 2005), the Gaussian RKHS with its width set at the median distance between sample points was used, which is a popular heuristic in the kernel method community (Schölkopf and Smola 2002). However, to the best of our knowledge, there is no strong theoretical justification for this heuristic. On the other hand, the LSMI method is equipped with cross-validation, so all of the tuning parameters, such as the Gaussian width and the regularization parameter, can be optimized in an objective and systematic way. This is an advantage of LSMI over HSIC.
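For reference, the median-distance heuristic discussed above can be written in a few lines; this is a generic sketch for one-dimensional samples, not code from either paper:

```python
import numpy as np

def median_width(z):
    # Gaussian kernel width set to the median of the pairwise
    # distances between distinct sample points (the "median heuristic")
    z = np.asarray(z, dtype=float)
    d = np.abs(z[:, None] - z[None, :])
    return float(np.median(d[d > 0]))
```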
3.2 HSIC regression
4 Experiments
In this section, we evaluate the performance of LSIR using benchmark datasets and real-world gene expression data.
4.1 Benchmark datasets
Results for the ‘Cause-Effect Pairs’ task. Each cell denotes ‘the number of correct causal directions detected / the number of incorrect causal directions detected (the ratio of correct detections to all detections)’. Rows correspond to δ1 and columns to δ2.

(a) LSIR

δ1 \ δ2    0.01          0.05          0.10          0.15          0.20
0.01       23/9 (0.72)   17/5 (0.77)   12/4 (0.75)    9/3 (0.75)    7/3 (0.70)
0.05       —             26/8 (0.77)   20/6 (0.77)   15/5 (0.75)   12/4 (0.75)
0.10       —             —             23/9 (0.72)   18/8 (0.69)   14/6 (0.70)
0.15       —             —             —             19/9 (0.68)   15/7 (0.68)
0.20       —             —             —             —             16/7 (0.70)

(b) HSICR

δ1 \ δ2    0.01          0.05          0.10          0.15          0.20
0.01       18/17 (0.51)  14/14 (0.50)  11/12 (0.48)  10/11 (0.48)  10/7 (0.59)
0.05       —             18/18 (0.50)  14/15 (0.48)  13/13 (0.50)  11/8 (0.58)
0.10       —             —             16/18 (0.47)  15/15 (0.50)  13/10 (0.57)
0.15       —             —             —             17/16 (0.52)  14/11 (0.56)
0.20       —             —             —             —             14/11 (0.56)
Figure 7(c) shows that merely comparing the values of \(\widehat{\mbox{SMI}}\) is sufficient to decide the correct causal direction in LSIR; using this heuristic, LSIR successfully identified the correct causal direction in 54 out of 80 cases. Thus, this heuristic allows us to identify the causal direction in a computationally efficient way. On the other hand, as shown in Fig. 7(d), HSICR gave the correct decision in only 36 out of 80 cases with this heuristic.
4.2 Gene function regulations
Finally, we apply the proposed LSIR method to real-world biological datasets, which contain known causal relationships about gene function regulations from transcription factors to gene expressions.
Causal prediction is biologically and medically important because it gives us clues to disease-causing genes or drug-target genes. Transcription factors regulate the expression levels of the genes related to them: when the expression level of a transcription factor gene is high, the genes regulated by that transcription factor become highly expressed or suppressed.
Results for the ‘E. coli’ task. If \(p_{X \rightarrow Y} > 0.05\) and \(p_{Y \rightarrow X} < 10^{-3}\), we denote the estimated direction by ⇒. If \(p_{X \rightarrow Y} > p_{Y \rightarrow X}\), we denote the estimated direction by →. When the p-values of both directions are less than \(10^{-3}\), we conclude that the causal direction cannot be determined (indicated by ‘?’). Estimated directions in brackets are determined by comparing the values of \(\widehat{\mbox{SMI}}\) or \(\widehat{\mbox{HSIC}}\).

(a) LSIR

X     Y     p(X→Y)   p(X←Y)   SMI(X→Y)   SMI(X←Y)   Estimated   Truth
lexA  uvrA  <10^-3   <10^-3    0.0177     0.0255    ? (→)       →
lexA  uvrB  <10^-3   <10^-3    0.0172     0.0356    ? (→)       →
lexA  uvrD   0.043   <10^-3    0.0075     0.0227    → (→)       →
crp   lacA   0.143   <10^-3   −0.0004     0.0399    ⇒ (→)       →
crp   lacY   0.003   <10^-3    0.0118     0.0247    → (→)       →
crp   lacZ   0.001   <10^-3    0.0122     0.0307    → (→)       →
lacI  lacA   0.787   <10^-3   −0.0076     0.0184    ⇒ (→)       →
lacI  lacZ   0.002   <10^-3    0.0096     0.0141    → (→)       →
lacI  lacY   0.746   <10^-3   −0.0082     0.0217    ⇒ (→)       →

(b) HSICR

X     Y     p(X→Y)   p(X←Y)   HSIC(X→Y)  HSIC(X←Y)  Estimated   Truth
lexA  uvrA   0.005   <10^-3    0.0013     0.0037    → (→)       →
lexA  uvrB  <10^-3   <10^-3    0.0026     0.0037    ? (→)       →
lexA  uvrD  <10^-3   <10^-3    0.0020     0.0041    ? (→)       →
crp   lacA   0.017   <10^-3    0.0013     0.0036    → (→)       →
crp   lacY   0.002   <10^-3    0.0018     0.0051    → (→)       →
crp   lacZ   0.008   <10^-3    0.0013     0.0054    → (→)       →
lacI  lacA   0.031   <10^-3    0.0012     0.0043    → (→)       →
lacI  lacZ  <10^-3   <10^-3    0.0019     0.0020    ? (→)       →
lacI  lacY   0.052   <10^-3    0.0011     0.0027    ⇒ (→)       →
5 Conclusions
In this paper, we proposed a new dependence-minimizing regression method called least-squares independence regression (LSIR). LSIR adopts the squared-loss mutual information as the independence measure, and estimates it by the method of least-squares mutual information (LSMI). Since LSMI provides an analytic-form solution, we can explicitly compute the gradient of the LSMI estimator with respect to the regression parameters.
A notable advantage of the proposed LSIR method over the state-of-the-art method of dependence-minimizing regression (Mooij et al. 2009) is that LSIR is equipped with a natural cross-validation procedure, allowing us to objectively optimize tuning parameters such as the kernel width and the regularization parameter in a data-dependent fashion.
We experimentally showed that LSIR is promising in real-world causal direction inference. We note that the use of LSMI instead of HSIC does not in itself necessarily improve the performance of causal direction inference; indeed, the experimental performances of LSMI and HSIC were on par when fixed Gaussian kernel widths were used. This implies that the performance improvement of the proposed method was brought about by the data-dependent optimization of the kernel widths via cross-validation.
In this paper, we focused solely on the additive noise model, where the noise is independent of the inputs. When this modeling assumption is violated, LSIR (as well as HSICR) may not perform well. In such a case, employing a more elaborate model such as the post-nonlinear causal model would be useful (Zhang and Hyvärinen 2009). We will extend LSIR to such a general model in future work.
Acknowledgements
The authors thank Dr. Joris Mooij for providing us with the HSICR code. We also thank the editor and the anonymous reviewers for their constructive feedback, which helped us improve the manuscript. MY was supported by the JST PRESTO program, MS was supported by AOARD and KAKENHI 25700022, and JS was partially supported by KAKENHI 24680032, 24651227, and 25128704, and by the Young Investigator Award of the Human Frontier Science Program.
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken: Wiley.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Faith, J. J., Hayete, B., Thaden, J. T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J. J., & Gardner, T. S. (2007). Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology, 5(1), e8.
Feuerverger, A. (1993). A consistent test for bivariate dependence. International Statistical Review, 61(3), 419–433.
Fukumizu, K., Bach, F. R., & Jordan, M. (2009). Kernel dimension reduction in regression. The Annals of Statistics, 37(4), 1871–1905.
Geiger, D., & Heckerman, D. (1994). Learning Gaussian networks. In 10th annual conference on uncertainty in artificial intelligence (UAI-1994) (pp. 235–243).
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In 16th international conference on algorithmic learning theory (ALT 2005) (pp. 63–78).
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 21, pp. 689–696). Cambridge: MIT Press.
Janzing, D., & Steudel, B. (2010). Justifying additive noise model-based causal discovery via algorithmic information theory. Open Systems & Information Dynamics, 17(2), 189–212.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.
Kankainen, A. (1995). Consistent testing of total independence based on the empirical characteristic function. Ph.D. thesis, University of Jyväskylä, Jyväskylä, Finland.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, Series B, 45, 503–528.
Mooij, J., Janzing, D., Peters, J., & Schölkopf, B. (2009). Regression by dependence minimization and its application to causal inference in additive noise models. In 26th annual international conference on machine learning (ICML-2009), Montreal, Canada (pp. 745–752).
Patriksson, M. (1999). Nonlinear programming and variational inequality problems. Dordrecht: Kluwer Academic.
Pearl, J. (2000). Causality: models, reasoning and inference. New York: Cambridge University Press.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge: MIT Press.
Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. J. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003–2030.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67–93.
Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25(3), 725–758.
Suzuki, T., Sugiyama, M., Kanamori, T., & Sese, J. (2009). Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(Suppl. 1), S52.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Yamada, M., & Sugiyama, M. (2010). Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise. In Proceedings of the twenty-fourth AAAI conference on artificial intelligence (AAAI-2010) (pp. 643–648).
Zhang, K., & Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th conference on uncertainty in artificial intelligence (UAI '09) (pp. 647–655). Arlington: AUAI Press.