A Practical Data Repository for Causal Learning with Big Data

  • Conference paper
Benchmarking, Measuring, and Optimizing (Bench 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12093)

Abstract

The recent success of machine learning (ML) has led to a massive emergence of AI applications and rising expectations that AI systems will achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, whereas human-level intelligence often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use as they are too “ideal”, i.e., small, clean, homogeneous, and low-dimensional, to describe real-world scenarios where data is often large, noisy, heterogeneous, and high-dimensional. This mismatch severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages, and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.

R. Guo and R. Moraffah—Equal contribution.

Notes

  1. https://github.com/vdorie/npci.

  2. Downloaded from the UCI repository [7].

  3. https://www.spotify.com/.

  4. http://www.cs.cornell.edu/~adith/Criteo/.

  5. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r.

References

  1. Almond, D., Chay, K.Y., Lee, D.S.: The costs of low birth weight. Q. J. Econ. 120(3), 1031–1083 (2005)

  2. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  3. Beygelzimer, A., Langford, J.: The offset tree for learning with partial labels. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138. ACM (2009)

  4. Bonner, S., Vasile, F.: Causal embeddings for recommendation. In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. ACM (2018)

  5. Brost, B., Mehrotra, R., Jehan, T.: The music streaming sessions dataset. In: The World Wide Web Conference, pp. 2594–2600. ACM (2019)

  6. Dehejia, R.H., Wahba, S.: Propensity score-matching methods for nonexperimental causal studies. Rev. Econ. Stat. 84(1), 151–161 (2002)

  7. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml

  8. Duncan, G.J., Brooks-Gunn, J., Klebanov, P.K.: Economic deprivation and early childhood development. Child Dev. 65(2), 296–318 (1994)

  9. Galagate, D., Schafer, J., Galagate, M.D.: Package ‘causaldrf’ (2015)

  10. Guo, R., Cheng, L., Li, J., Hahn, P.R., Liu, H.: A survey of learning causality with data: problems and methods. arXiv preprint arXiv:1809.09337 (2018)

  11. Guo, R., Li, J., Liu, H.: Learning individual treatment effects from networked observational data. arXiv preprint arXiv:1906.03485 (2019)

  12. Guyon, I., et al.: Design and analysis of the causation and prediction challenge. In: Guyon, I., et al. (eds.) Proceedings of the Workshop on the Causation and Prediction Challenge at WCCI 2008. Proceedings of Machine Learning Research PMLR, Hong Kong, 03–04 June 2008, vol. 3, pp. 1–33. http://proceedings.mlr.press/v3/guyon08a.html

  13. Hahn, P.R., Dorie, V., Murray, J.S.: Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017. Technical report (2018)

  14. Hill, J.L.: Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 20(1), 217–240 (2011)

  15. Jiang, N., Li, L.: Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722 (2015)

  16. Joachims, T., Swaminathan, A., de Rijke, M.: Deep learning with logged bandit feedback (2018)

  17. Johansson, F., Shalit, U., Sontag, D.: Learning representations for counterfactual inference. In: International Conference on Machine Learning, pp. 3020–3029 (2016)

  18. Kocaoglu, M., Dimakis, A., Vishwanath, S.: Cost-optimal learning of causal graphs. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1875–1884. JMLR.org (2017)

  19. LaLonde, R.J.: Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 76, 604–620 (1986)

  20. Lefortier, D., Swaminathan, A., Gu, X., Joachims, T., de Rijke, M.: Large-scale validation of counterfactual learning methods: a test-bed. arXiv preprint arXiv:1612.00367 (2016)

  21. Li, J., Guo, R., Liu, C., Liu, H.: Adaptive unsupervised feature selection on attributed networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 92–100. ACM (2019)

  22. Li, Y., Guo, R., Wang, W., Huan, L.: Causal learning in question quality improvement. In: 2019 BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing (Bench 2019) (2019)

  23. Liang, D., Charlin, L., Blei, D.: Causal inference for recommendation (2016)

  24. Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R., Welling, M.: Causal effect inference with deep latent-variable models. In: Advances in Neural Information Processing Systems, pp. 6446–6456 (2017)

  25. McAuley, J., Pandey, R., Leskovec, J.: Inferring networks of substitutable and complementary products. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2015)

  26. Mitrovic, J., Sejdinovic, D., Teh, Y.W.: Causal inference via kernel deviance measures (2018). CoRR abs/1804.04622. http://arxiv.org/abs/1804.04622

  27. Mooij, J.M., Peters, J., Janzing, D., Zscheischler, J., Schölkopf, B.: Distinguishing cause from effect using observational data: methods and benchmarks (2014). CoRR abs/1412.3773. http://arxiv.org/abs/1412.3773

  28. Neyman, J.S., Dabrowska, D.M., Speed, T.P.: On the application of probability theory to agricultural experiments. Essay on principles. Stat. Sci. 5, 465–480 (1990). Ann. Agric. Sci. 10, 1–51 (1923). Section 9. Translated and Edited by Dabrowska, D.M., Speed, T.P

  29. Rakesh, V., Guo, R., Moraffah, R., Agarwal, N., Liu, H.: Linked causal variational autoencoder for inferring paired spillover effects. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1679–1682. ACM (2018)

  30. Rubin, D.B.: Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6, 34–58 (1978)

  31. Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D.A., Nolan, G.P.: Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721), 523–529 (2005). https://doi.org/10.1126/science.1105809. https://science.sciencemag.org/content/308/5721/523

  32. Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352 (2016)

  33. Schwab, P., Linhardt, L., Karlen, W.: Perfect match: a simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656 (2018)

  34. Shakarian, P., Bhatnagar, A., Aleali, A., Shaabani, E., Guo, R.: Diffusion in Social Networks. SCS. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23105-1

  35. Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect: generalization bounds and algorithms. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3076–3085. JMLR.org (2017)

  36. Shanmugam, K., Kocaoglu, M., Dimakis, A.G., Vishwanath, S.: Learning causal graphs with small interventions. In: Advances in Neural Information Processing Systems, pp. 3195–3203 (2015)

  37. Smith, J.A., Todd, P.E.: Does matching overcome Lalonde’s critique of nonexperimental estimators? J. Econ. 125(1–2), 305–353 (2005)

  38. Swaminathan, A., Joachims, T.: Counterfactual risk minimization: learning from logged bandit feedback. In: International Conference on Machine Learning, pp. 814–823 (2015)

  39. Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems, pp. 3231–3239 (2015)

  40. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006). https://doi.org/10.1007/s10994-006-6889-7

  41. Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113 (2013)

  42. Yoon, J., Jordon, J., van der Schaar, M.: GANITE: estimation of individualized treatment effects using generative adversarial nets (2018)

Acknowledgement

This material is based upon work supported by ARO/ARL and the National Science Foundation (NSF) Grant #1610282, NSF #1909555.

Author information

Corresponding author

Correspondence to Lu Cheng.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cheng, L., Guo, R., Moraffah, R., Candan, K.S., Raglin, A., Liu, H. (2020). A Practical Data Repository for Causal Learning with Big Data. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_23

  • DOI: https://doi.org/10.1007/978-3-030-49556-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-49555-8

  • Online ISBN: 978-3-030-49556-5

  • eBook Packages: Computer Science (R0)
