
Baseball Informatics—From MiLB to MLB Debut

Analytics Enabled Decision Making


Drafted baseball players typically begin their professional careers with Minor League teams and are not guaranteed an opportunity in the Major Leagues. Accurately estimating a player’s likelihood of reaching a Major League debut can reduce costs and increase value for both players and franchises. We mined both baseball performance statistics and non-baseball data for players drafted from 2001 to 2010, and applied machine learning techniques to analyze and rank these variables. We compared four variable selection sets for training and validating models that predict the likelihood of a drafted player reaching the Majors. We fitted extreme gradient boosting, random forest, decision tree, and support vector machine models to identify the high-impact variables in the prediction. Finally, we translated the model results into guidance for drafted Minor League players on what to improve to increase their chances of playing in the Major Leagues.
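As a rough illustration of what ranking variables by predictive impact can look like, the sketch below scores each variable by the largest Gini impurity reduction a single threshold split achieves, which is the criterion a decision tree commonly uses at one node. It is not the chapter's pipeline: the variable names (on_base_pct, draft_round), the eight-row data set, and the helper functions are all hypothetical and synthetic, chosen only to make the idea concrete.

```python
from typing import Dict, List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest impurity reduction from any single threshold split on one variable."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Every distinct value except the maximum is a candidate threshold;
    # splitting at the maximum would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_variables(
    columns: Dict[str, List[float]], labels: Sequence[int]
) -> List[Tuple[str, float]]:
    """Sort variables from most to least informative under the split criterion."""
    scores = [(name, best_split_gain(vals, labels)) for name, vals in columns.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Synthetic toy data (hypothetical, NOT from the chapter): one row per player.
columns = {
    "on_base_pct": [0.28, 0.30, 0.31, 0.33, 0.35, 0.36, 0.38, 0.40],
    "draft_round": [1, 12, 3, 20, 2, 15, 5, 8],
}
reached_majors = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = player made an MLB debut

ranking = rank_variables(columns, reached_majors)
for name, gain in ranking:
    print(f"{name}: impurity reduction {gain:.3f}")
```

In this toy data the hypothetical on_base_pct column separates the two classes perfectly, so it ranks first. Ensemble methods such as random forests and gradient boosting aggregate criteria of this kind over many trees, so their importance scores are generally more stable than a single-split score.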





The authors thank Drs. Paul Shapiro, Margrét Bjarnadóttir, and Teddy Helfers for their critiques of both the baseball content and the writing.


Corresponding author

Correspondence to Woei-jyh Lee.


© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this chapter

Lee, CH., Lee, Wj. (2023). Baseball Informatics—From MiLB to MLB Debut. In: Sharma, V., Maheshkar, C., Poulose, J. (eds) Analytics Enabled Decision Making. Palgrave Macmillan, Singapore.
