Elements of an Automatic Data Scientist

  • Luc De Raedt
  • Hendrik Blockeel
  • Samuel Kolb
  • Stefano Teso
  • Gust Verbruggen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11191)

Abstract

We introduce a simple but non-trivial setting for automating data science: given a set of worksheets in a spreadsheet, the goal is to automatically complete some of the values. We also outline the elements of the Synth framework that tackles this task: Synth-a-Sizer, an automated data wrangling system that transforms the problem into attribute-value format; TacLe, an inductive constraint learning system for inducing formulas in spreadsheets; Mercs, a versatile predictive learning system; and the autocompletion component that integrates these systems.
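To make the autocompletion task concrete, the following is a minimal illustrative sketch and not the Synth, TacLe, or Mercs implementation. It assumes a single worksheet stored as a list of rows, with `None` marking a missing cell, and it considers only one very simple kind of constraint: a target column that equals the sum of a subset of other columns. The function names `find_sum_constraint` and `autocomplete` and the example data are hypothetical.

```python
from itertools import combinations


def find_sum_constraint(rows, target):
    """Return a tuple of column indices whose row-wise sum equals column
    `target` on every fully observed row, or None if no such subset exists.

    Assumption: cells hold integers, so exact equality is meaningful.
    """
    n_cols = len(rows[0])
    others = [c for c in range(n_cols) if c != target]
    # Only rows without missing cells can serve as evidence for a constraint.
    complete = [r for r in rows if all(v is not None for v in r)]
    if not complete:
        return None
    # Try smaller subsets first so the simplest consistent formula wins.
    for size in range(2, len(others) + 1):
        for subset in combinations(others, size):
            if all(sum(r[c] for c in subset) == r[target] for r in complete):
                return subset
    return None


def autocomplete(rows, target):
    """Return a copy of `rows` with missing cells in column `target` filled
    in from the induced sum constraint, when one is found."""
    subset = find_sum_constraint(rows, target)
    if subset is None:
        return [list(r) for r in rows]
    filled = []
    for r in rows:
        r = list(r)  # copy so the input sheet is left untouched
        if r[target] is None and all(r[c] is not None for c in subset):
            r[target] = sum(r[c] for c in subset)
        filled.append(r)
    return filled


# Hypothetical worksheet: the last column of the third row is missing.
sheet = [
    [10, 20, 5, 30],
    [7, 3, 9, 10],
    [4, 6, 1, None],
]
completed = autocomplete(sheet, target=3)
```

Here the two complete rows are consistent with "column 3 = column 0 + column 1", so the missing cell is filled from that formula. A real system would need to handle many more constraint types, noisy data, and cells for which no formula holds, which is where a predictive model would come in.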

Keywords

Automated data science; Autocompletion; Data wrangling; Learning constraints; Versatile models

Notes

Acknowledgments

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No [694980] Synth: Synthesising Inductive Data Models) and the Research Foundation, Flanders.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Luc De Raedt (1, corresponding author)
  • Hendrik Blockeel (1)
  • Samuel Kolb (1)
  • Stefano Teso (1)
  • Gust Verbruggen (1)

  1. Department of Computer Science, KU Leuven, Leuven, Belgium
