Machine Learning, Volume 96, Issue 3, pp 249–267

Least-squares independence regression for non-linear causal inference under non-Gaussian noise



The discovery of non-linear causal relationships under additive non-Gaussian noise models has attracted considerable attention recently because these models are highly flexible. In this paper, we propose a novel causal inference algorithm called least-squares independence regression (LSIR). LSIR learns the additive noise model by minimizing an estimator of the squared-loss mutual information between inputs and residuals. A notable advantage of LSIR is that tuning parameters such as the kernel width and the regularization parameter can be naturally optimized by cross-validation, which allows us to avoid overfitting in a data-dependent fashion. Through experiments with real-world datasets, we show that LSIR compares favorably with a state-of-the-art causal inference method.
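The idea of judging a causal direction by the dependence between inputs and residuals can be illustrated with a minimal sketch. It is not the LSIR algorithm itself: as a simplification, it fits each direction with an ordinary least-squares polynomial (via `numpy.polyfit`) instead of minimizing the dependence estimate over the regression parameters, and it fixes the kernel width `sigma` and the regularization parameter `lam` instead of selecting them by cross-validation. The function names, the Gaussian kernel basis, and all default values are assumptions made for illustration only.

```python
import numpy as np


def gaussian_design(z, centers, sigma):
    """Return the n x b matrix of Gaussian kernel values for 1-D samples."""
    diff = z[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))


def smi_estimate(x, y, sigma=0.5, lam=1e-3, n_basis=50, seed=0):
    """Least-squares estimate of squared-loss mutual information (1-D inputs).

    sigma and lam are fixed here (an assumption of this sketch); the abstract
    states that LSIR optimizes them by cross-validation.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=min(n_basis, n), replace=False)
    kx = gaussian_design(x, x[idx], sigma)
    ky = gaussian_design(y, y[idx], sigma)
    # h: basis functions averaged over the paired samples (joint distribution).
    h = (kx * ky).mean(axis=0)
    # H: second moments over all n^2 cross pairs (product of the marginals).
    H = (kx.T @ kx) * (ky.T @ ky) / n ** 2
    # Regularized least-squares fit of the density-ratio coefficients.
    alpha = np.linalg.solve(H + lam * np.eye(len(idx)), h)
    return 0.5 * h @ alpha - 0.5


def residual(x, y, degree=3):
    """Residual of a least-squares polynomial fit of y on x (hypothetical helper)."""
    coeffs = np.polyfit(x, y, degree)
    return y - np.polyval(coeffs, x)


def infer_direction(x, y, degree=3):
    """Prefer the direction whose residual looks less dependent on its input."""
    forward = smi_estimate(x, residual(x, y, degree))
    backward = smi_estimate(y, residual(y, x, degree))
    direction = "x->y" if forward < backward else "y->x"
    return direction, forward, backward
```

Calling `infer_direction(x, y)` on two one-dimensional NumPy arrays returns the preferred direction together with both dependence estimates. With fixed `sigma` and `lam` the estimates can be unreliable, which is the problem the cross-validation step described in the abstract is meant to address.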


Causal inference, Non-linear, Non-Gaussian, Squared-loss mutual information, Least-squares independence regression



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. 701 1st Ave., Sunnyvale, USA
  2. Tokyo, Japan
