Automating Biomedical Data Science Through Tree-Based Pipeline Optimization
Over the past decade, data science and machine learning have grown from a mysterious art form into staple tools across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning—pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and can discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
Keywords: Pipeline optimization · Hyperparameter optimization · Data science · Machine learning · Genetic programming
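To make the core idea concrete, the following is a minimal, self-contained sketch of evolutionary pipeline search. It is a toy illustration, not TPOT itself: pipelines are flat operator sequences rather than full expression trees, the operator names and the scoring function are hypothetical stand-ins for cross-validated accuracy over scikit-learn operators, and only mutation (no crossover) is used.

```python
# Toy sketch of evolutionary pipeline optimization. Operator names and the
# score() function are illustrative stand-ins, not TPOT's actual operators.
import random

random.seed(0)

OPERATORS = ["scale", "select_features", "construct_feature",
             "decision_tree", "random_forest"]

def score(pipeline):
    """Stand-in for cross-validated accuracy: rewards pipelines that end
    in a classifier and that include a feature-construction step."""
    s = 0.5
    if pipeline and pipeline[-1] in ("decision_tree", "random_forest"):
        s += 0.3
    if "construct_feature" in pipeline:
        s += 0.1
    return s - 0.02 * len(pipeline)  # mild parsimony pressure

def mutate(pipeline):
    """Randomly replace one operator or insert a new one."""
    p = list(pipeline)
    if p and random.random() < 0.5:
        p[random.randrange(len(p))] = random.choice(OPERATORS)
    else:
        p.insert(random.randrange(len(p) + 1), random.choice(OPERATORS))
    return p

# Simple elitist (mu + lambda) evolutionary loop.
population = [[random.choice(OPERATORS)] for _ in range(20)]
for generation in range(30):
    offspring = [mutate(random.choice(population)) for _ in range(20)]
    population = sorted(population + offspring, key=score, reverse=True)[:20]

best = population[0]
print(best, round(score(best), 3))
```

After a few generations, selection pressure drives the population toward pipelines that end in a classifier and exploit feature construction, mirroring (in miniature) how genetic programming discovers useful operator combinations in the full system.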
We thank Sebastian Raschka for his valuable input during the development of this project. We also thank the Michigan State University High Performance Computing Center for the use of their computing resources. This work was supported by National Institutes of Health grants LM009012, LM010098, and EY022300.