OpenML: An R package to connect to the machine learning platform OpenML

  • Giuseppe Casalicchio
  • Jakob Bossek
  • Michel Lang
  • Dominik Kirchhoff
  • Pascal Kerschke
  • Benjamin Hofner
  • Heidi Seibold
  • Joaquin Vanschoren
  • Bernd Bischl
Original Paper

Abstract

OpenML is an online machine learning platform where researchers can easily share data, machine learning tasks and experiments as well as organize them online to work and collaborate more efficiently. In this paper, we present an R package to interface with the OpenML platform and illustrate its usage in combination with the machine learning R package mlr (Bischl et al. J Mach Learn Res 17(170):1–5, 2016). We show how the OpenML package allows R users to easily search, download and upload data sets and machine learning tasks. Furthermore, we also show how to upload results of experiments, share them with others and download results from other users. Beyond ensuring reproducibility of results, the OpenML platform automates much of the drudge work, speeds up research, facilitates collaboration and increases the users’ visibility online.

Keywords

Databases, Machine learning, Reproducible research

References

  1. Asuncion A, Newman DJ (2007) UCI Machine Learning Repository. University of California, School of Information and Computer Science
  2. Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: Massive online analysis. J Mach Learn Res 11:1601–1604 http://www.jmlr.org/papers/v11/bifet10a.html
  3. Bischl B, Lang M (2015) parallelMap: Unified Interface to Parallelization Back-Ends. https://CRAN.R-project.org/package=parallelMap, R package version 1.3
  4. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016) mlr: Machine learning in R. J Mach Learn Res 17(170):1–5, http://jmlr.org/papers/v17/15-066.html
  5. Bischl B, Richter J, Bossek J, Horn D, Thomas J, Lang M (2017) mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373
  6. Carpenter J (2011) May the best analyst win. Science 331(6018):698–699
  7. Casalicchio G, Bischl B, Kirchhoff D, Lang M, Hofner B, Bossek J, Kerschke P, Vanschoren J (2017) OpenML: Exploring machine learning better, together. https://CRAN.R-project.org/package=OpenML, R package version 1.3
  8. Feurer M, Springenberg JT, Hutter F (2015) Initializing Bayesian hyperparameter optimization via meta-learning. In: AAAI, pp 1128–1135
  9. Hall MA, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newslett 11(1):10–18, http://www.cs.waikato.ac.nz/ml/weka/
  10. Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Gr Stat 15(3):651–674
  11. Kuhn M, Weston S, Coulter N, Culp M (2015) C50: C5.0 decision trees and rule-based models. https://CRAN.R-project.org/package=C50, R package version 0.1.0-24, C code for C5.0 by R. Quinlan
  12. Lang M, Kotthaus H, Marwedel P, Weihs C, Rahnenführer J, Bischl B (2015) Automatic model selection for high-dimensional survival analysis. J Stat Comput Simul 85(1):62–76
  13. Lang M, Bischl B, Surmann D (2017) batchtools: Tools for R to work on batch systems. J Open Source Softw 2(10), https://doi.org/10.21105/joss.00135
  14. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22, http://CRAN.R-project.org/doc/Rnews/
  15. Nielsen M (2012) Reinventing discovery: the new era of networked science. Princeton University Press, http://www.jstor.org/stable/j.ctt7s4vx
  16. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830, http://scikit-learn.org/
  17. Post MJ, van der Putten P, van Rijn JN (2016) Does feature selection improve classification? A large scale experiment in OpenML. In: International Symposium on Intelligent Data Analysis, Springer, pp 158–170
  18. Probst P, Au Q, Casalicchio G, Stachl C, Bischl B (2017) Multilabel classification with R package mlr. arXiv preprint arXiv:1703.08991
  19. R Core Team (2016) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/
  20. Schiffner J, Bischl B, Lang M, Richter J, Jones ZM, Probst P, Pfisterer F, Gallo M, Kirchhoff D, Kühn T, Thomas J, Kotthoff L (2016) mlr Tutorial. arXiv preprint arXiv:1609.06146
  21. Therneau T, Atkinson B, Ripley B (2015) rpart: Recursive Partitioning and Regression Trees. http://CRAN.R-project.org/package=rpart, R package version 4.1-10
  22. van Rijn JN, Umaashankar V, Fischer S, Bischl B, Torgo L, Gao B, Winter P, Wiswedel B, Berthold MR, Vanschoren J (2013) A RapidMiner Extension for Open Machine Learning. In: Proceedings of the 4th RapidMiner Community Meeting and Conference (RCOMM 2013), pp 59–70
  23. Vanschoren J, Blockeel H, Pfahringer B, Holmes G (2012) Experiment Databases. A new way to share, organize and learn from experiments. Mach Learn 87(2):127–158
  24. Vanschoren J, van Rijn JN, Bischl B, Torgo L (2013) OpenML: networked science in machine learning. SIGKDD Explor 15(2):49–60
  25. Wickham H (2009) ggplot2: Elegant Graphics for Data Analysis. Springer, New York, NY, USA, http://ggplot2.org

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  • Giuseppe Casalicchio (1)
  • Jakob Bossek (2)
  • Michel Lang (3)
  • Dominik Kirchhoff (4)
  • Pascal Kerschke (2)
  • Benjamin Hofner (5)
  • Heidi Seibold (6)
  • Joaquin Vanschoren (7)
  • Bernd Bischl (1)

  1. Department of Statistics, Ludwig-Maximilians-University Munich, Munich, Germany
  2. Information Systems and Statistics, University of Münster, Münster, Germany
  3. Department of Statistics, TU Dortmund University, Dortmund, Germany
  4. Dortmund University of Applied Sciences and Arts, Dortmund, Germany
  5. Section of Biostatistics, Paul-Ehrlich-Institut, Langen, Germany
  6. Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland
  7. Eindhoven University of Technology, Eindhoven, The Netherlands
