Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9597)


Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
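As a rough illustration of the idea summarized above, the following Python sketch searches over tiny "pipelines" that optionally construct one synthetic feature and then classify with a single feature. It is a toy under stated assumptions, not TPOT's implementation: the data set, the operator names (`xor`, `and`), and the functions `random_pipeline`, `accuracy`, and `evolve` are all hypothetical, and the loop uses selection plus random replacement rather than the crossover and mutation of full genetic programming.

```python
import random
from itertools import product

# Toy data (hypothetical): three binary features; the label is the XOR of the
# first two, so no single raw feature is predictive on its own.
ROWS = [list(bits) for bits in product([0, 1], repeat=3)]
LABELS = [row[0] ^ row[1] for row in ROWS]

# Hypothetical synthetic feature constructors combining two existing features.
OPS = {"xor": lambda a, b: a ^ b, "and": lambda a, b: a & b}


def random_pipeline(rng):
    """Return (constructor, column): an optional synthetic feature, then the
    index of the single feature used directly as the predicted label."""
    constructor = None
    if rng.random() < 0.5:
        i, j = rng.sample(range(3), 2)
        constructor = (rng.choice(sorted(OPS)), i, j)
    n_columns = 3 + (constructor is not None)
    return constructor, rng.randrange(n_columns)


def accuracy(pipeline, rows=ROWS, labels=LABELS):
    """Fraction of rows classified correctly. Scored on the same data used for
    the search, so this toy says nothing about overfitting."""
    constructor, column = pipeline
    correct = 0
    for row, label in zip(rows, labels):
        features = list(row)
        if constructor is not None:
            op, i, j = constructor
            features.append(OPS[op](row[i], row[j]))
        correct += features[column] == label
    return correct / len(rows)


def evolve(generations=10, pop_size=20, seed=0):
    """Simplified evolutionary loop: keep the better half of the population,
    refill the rest with fresh random pipelines, and return the best one."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=accuracy, reverse=True)
        survivors = population[: pop_size // 2]
        newcomers = [random_pipeline(rng) for _ in range(pop_size - len(survivors))]
        population = survivors + newcomers
    return max(population, key=accuracy)


if __name__ == "__main__":
    best = evolve()
    print(best, accuracy(best))
```

In this toy setting, a pipeline that classifies on a raw feature alone scores 0.5, while one that first constructs the XOR of the first two features and classifies on it scores 1.0, which mirrors in miniature how a constructed feature can expose an interaction that individual features hide.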


Keywords: Pipeline optimization, Hyperparameter optimization, Data science, Machine learning, Genetic programming



We thank Sebastian Raschka for his valuable input during the development of this project. We also thank the Michigan State University High Performance Computing Center for the use of their computing resources. This work was supported by National Institutes of Health grants LM009012, LM010098, and EY022300.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
  2. University of Louisville, Louisville, USA
