A Practical Data Repository for Causal Learning with Big Data

Cheng, Lu; Guo, Ruocheng; Moraffah, Raha; Candan, K. Selçuk; Raglin, Adrienne; Liu, Huan

doi:10.1007/978-3-030-49556-5_23

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12093)

Included in the following conference series:

International Symposium on Benchmarking, Measuring and Optimization

Abstract

The recent success of machine learning (ML) has led to a massive emergence of AI applications and to rising expectations that AI systems will achieve human-level intelligence. Nevertheless, these expectations have met with multi-faceted obstacles. One major obstacle is that ML aims to predict future observations given real-world data dependencies, whereas human-level intelligence often goes beyond prediction and seeks the underlying causal mechanism. Another major obstacle is that the availability of large-scale datasets has significantly influenced causal study in various disciplines. It is crucial to leverage effective ML techniques to advance causal learning with big data. Existing benchmark datasets for causal inference have limited use because they are too “ideal” (small, clean, homogeneous, and low-dimensional) to describe real-world scenarios, where data is often large, noisy, heterogeneous, and high-dimensional. This mismatch severely hinders the successful marriage of causal inference and ML. In this paper, we formally address this issue by systematically investigating existing datasets for two fundamental tasks in causal inference: causal discovery and causal effect estimation. We also review the datasets for two ML tasks naturally connected to causal inference. We then provide hindsight regarding the advantages, disadvantages, and limitations of these datasets. Please refer to our GitHub repository (https://github.com/rguo12/awesome-causality-data) for all the datasets discussed in this work.

R. Guo and R. Moraffah—Equal contribution.

Notes

1. https://github.com/vdorie/npci
2. Downloaded from the UCI repository [7].
3. https://www.spotify.com/
4. http://www.cs.cornell.edu/~adith/Criteo/
5. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r

References

Almond, D., Chay, K.Y., Lee, D.S.: The costs of low birth weight. Q. J. Econ. 120(3), 1031–1083 (2005)
Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Beygelzimer, A., Langford, J.: The offset tree for learning with partial labels. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138. ACM (2009)
Bonner, S., Vasile, F.: Causal embeddings for recommendation. In: Proceedings of the 12th ACM Conference on Recommender Systems, pp. 104–112. ACM (2018)
Brost, B., Mehrotra, R., Jehan, T.: The music streaming sessions dataset. In: The World Wide Web Conference, pp. 2594–2600. ACM (2019)
Dehejia, R.H., Wahba, S.: Propensity score-matching methods for nonexperimental causal studies. Rev. Econ. Stat. 84(1), 151–161 (2002)
Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
Duncan, G.J., Brooks-Gunn, J., Klebanov, P.K.: Economic deprivation and early childhood development. Child Dev. 65(2), 296–318 (1994)
Galagate, D., Schafer, J., Galagate, M.D.: Package ‘causaldrf’ (2015)
Guo, R., Cheng, L., Li, J., Hahn, P.R., Liu, H.: A survey of learning causality with data: problems and methods. arXiv preprint arXiv:1809.09337 (2018)
Guo, R., Li, J., Liu, H.: Learning individual treatment effects from networked observational data. arXiv preprint arXiv:1906.03485 (2019)
Guyon, I., et al.: Design and analysis of the causation and prediction challenge. In: Guyon, I., et al. (eds.) Proceedings of the Workshop on the Causation and Prediction Challenge at WCCI 2008. Proceedings of Machine Learning Research PMLR, Hong Kong, 03–04 June 2008, vol. 3, pp. 1–33. http://proceedings.mlr.press/v3/guyon08a.html
Hahn, P.R., Dorie, V., Murray, J.S.: Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017. Technical report (2018)
Hill, J.L.: Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 20(1), 217–240 (2011)
Jiang, N., Li, L.: Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722 (2015)
Joachims, T., Swaminathan, A., de Rijke, M.: Deep learning with logged bandit feedback (2018)
Johansson, F., Shalit, U., Sontag, D.: Learning representations for counterfactual inference. In: International Conference on Machine Learning, pp. 3020–3029 (2016)
Kocaoglu, M., Dimakis, A., Vishwanath, S.: Cost-optimal learning of causal graphs. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1875–1884. JMLR.org (2017)
LaLonde, R.J.: Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 76, 604–620 (1986)
Lefortier, D., Swaminathan, A., Gu, X., Joachims, T., de Rijke, M.: Large-scale validation of counterfactual learning methods: a test-bed. arXiv preprint arXiv:1612.00367 (2016)
Li, J., Guo, R., Liu, C., Liu, H.: Adaptive unsupervised feature selection on attributed networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 92–100. ACM (2019)
Li, Y., Guo, R., Wang, W., Liu, H.: Causal learning in question quality improvement. In: 2019 BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing (Bench 2019) (2019)
Liang, D., Charlin, L., Blei, D.: Causal inference for recommendation (2016)
Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R., Welling, M.: Causal effect inference with deep latent-variable models. In: Advances in Neural Information Processing Systems, pp. 6446–6456 (2017)
McAuley, J., Pandey, R., Leskovec, J.: Inferring networks of substitutable and complementary products. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM (2015)
Mitrovic, J., Sejdinovic, D., Teh, Y.W.: Causal inference via kernel deviance measures (2018). CoRR abs/1804.04622. http://arxiv.org/abs/1804.04622
Mooij, J.M., Peters, J., Janzing, D., Zscheischler, J., Schölkopf, B.: Distinguishing cause from effect using observational data: methods and benchmarks (2014). CoRR abs/1412.3773. http://arxiv.org/abs/1412.3773
Neyman, J.S., Dabrowska, D.M., Speed, T.P.: On the application of probability theory to agricultural experiments. Essay on principles. Stat. Sci. 5, 465–480 (1990). Ann. Agric. Sci. 10, 1–51 (1923). Section 9. Translated and Edited by Dabrowska, D.M., Speed, T.P.
Rakesh, V., Guo, R., Moraffah, R., Agarwal, N., Liu, H.: Linked causal variational autoencoder for inferring paired spillover effects. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1679–1682. ACM (2018)
Rubin, D.B.: Bayesian inference for causal effects: the role of randomization. Ann. Stat. 6, 34–58 (1978)
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D.A., Nolan, G.P.: Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721), 523–529 (2005). https://doi.org/10.1126/science.1105809. https://science.sciencemag.org/content/308/5721/523
Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352 (2016)
Schwab, P., Linhardt, L., Karlen, W.: Perfect match: a simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656 (2018)
Shakarian, P., Bhatnagar, A., Aleali, A., Shaabani, E., Guo, R.: Diffusion in Social Networks. SCS. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23105-1
Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect: generalization bounds and algorithms. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 3076–3085. JMLR.org (2017)
Shanmugam, K., Kocaoglu, M., Dimakis, A.G., Vishwanath, S.: Learning causal graphs with small interventions. In: Advances in Neural Information Processing Systems, pp. 3195–3203 (2015)
Smith, J.A., Todd, P.E.: Does matching overcome Lalonde’s critique of nonexperimental estimators? J. Econ. 125(1–2), 305–353 (2005)
Swaminathan, A., Joachims, T.: Counterfactual risk minimization: learning from logged bandit feedback. In: International Conference on Machine Learning, pp. 814–823 (2015)
Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems, pp. 3231–3239 (2015)
Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006). https://doi.org/10.1007/s10994-006-6889-7
Weinstein, J.N., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45(10), 1113 (2013)
Yoon, J., Jordon, J., van der Schaar, M.: GANITE: estimation of individualized treatment effects using generative adversarial nets (2018)

Acknowledgement

This material is based upon work supported by ARO/ARL and the National Science Foundation (NSF) Grant #1610282, NSF #1909555.

Author information

Authors and Affiliations

Arizona State University, Tempe, AZ, USA
Lu Cheng, Ruocheng Guo, Raha Moraffah, K. Selçuk Candan & Huan Liu
Army Research Laboratory, Adelphi, MD, USA
Adrienne Raglin

Corresponding author

Correspondence to Lu Cheng.

Editor information

Editors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Wanling Gao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jianfeng Zhan
School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
Geoffrey Fox
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Xiaoyi Lu
Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX, USA
Dan Stanzione

About this paper

Cite this paper

Cheng, L., Guo, R., Moraffah, R., Candan, K.S., Raglin, A., Liu, H. (2020). A Practical Data Repository for Causal Learning with Big Data. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_23

DOI: https://doi.org/10.1007/978-3-030-49556-5_23
Published: 09 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49555-8
Online ISBN: 978-3-030-49556-5
eBook Packages: Computer Science, Computer Science (R0)
