Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

  • Randal S. Olson
  • Ryan J. Urbanowicz
  • Peter C. Andrews
  • Nicole A. Lavender
  • La Creis Kidd
  • Jason H. Moore
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9597)

Abstract

Over the past decade, data science and machine learning have grown from mysterious art forms into staple tools across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning—pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
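
To make the idea concrete, the following is a minimal, hypothetical Python sketch of the pipeline-optimization idea described above: candidate scikit-learn pipelines are assembled from a small pool of preprocessing and modeling operators (including a synthetic-feature constructor), scored by cross-validation, and the best candidate found so far is kept. The operator pool, the simple (1+1)-style search loop, and all names here are illustrative assumptions, not the TPOT implementation.

    import random
    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy stand-in for a genetic data set: 300 samples, 20 features.
    X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                               random_state=0)

    # Operator pool. Each entry is a factory so every candidate pipeline gets
    # fresh estimator instances. PolynomialFeatures plays the role of a
    # synthetic-feature constructor.
    PREPROCESSORS = [
        lambda: ("scale", StandardScaler()),
        lambda: ("synth", PolynomialFeatures(degree=2, include_bias=False)),
        lambda: ("select", SelectKBest(f_classif, k=10)),
    ]
    CLASSIFIERS = [
        lambda: ("tree", DecisionTreeClassifier(random_state=0)),
        lambda: ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]

    def random_pipeline():
        """Assemble 0-2 preprocessing steps plus one classifier."""
        steps = [make() for make in random.sample(PREPROCESSORS, random.randint(0, 2))]
        steps.append(random.choice(CLASSIFIERS)())
        return Pipeline(steps)

    def fitness(pipeline):
        """Score a candidate by 3-fold cross-validated accuracy."""
        return cross_val_score(pipeline, X, y, cv=3).mean()

    # Crude (1+1)-style search: "mutation" is simply re-sampling the operators.
    random.seed(0)
    best = random_pipeline()
    best_score = fitness(best)
    for _ in range(20):
        candidate = random_pipeline()
        score = fitness(candidate)
        if score > best_score:
            best, best_score = candidate, score

    print(round(best_score, 3), [name for name, _ in best.steps])

TPOT itself goes further: each pipeline is represented as a genetic programming tree and populations of such trees are evolved with mutation and crossover (built on DEAP), rather than re-sampled at random as in this sketch.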

Keywords

Pipeline optimization · Hyperparameter optimization · Data science · Machine learning · Genetic programming

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Randal S. Olson (1)
  • Ryan J. Urbanowicz (1)
  • Peter C. Andrews (1)
  • Nicole A. Lavender (2)
  • La Creis Kidd (2)
  • Jason H. Moore (1)
  1. Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
  2. University of Louisville, Louisville, USA
