Predicting University Students’ Academic Success and Major Using Random Forests

  • Cédric Beaulac
  • Jeffrey S. Rosenthal


In this article, a large data set containing every course taken by every undergraduate student at a major Canadian university over 10 years is analysed. Modern machine learning algorithms can use such large data sets to build useful tools for the data provider, in this case the university. Two classifiers are constructed using random forests. First, the courses completed by a student in their first two semesters are used to predict whether they will obtain an undergraduate degree. Second, for the students who completed a program, their major is predicted, again using the first few courses they registered in. A classification tree is an intuitive and powerful classifier, and building a random forest of trees improves on it. Random forests also allow for reliable variable importance measurements. These measures indicate which variables are useful to the classifiers and can be used to better understand what is statistically related to the students' outcomes. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.
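The general approach described above can be illustrated with a minimal sketch. This is not the authors' code: it uses scikit-learn's random forest and purely synthetic stand-in data (hypothetical course grades and a degree-completion label), whereas the article's actual features, data, and implementation may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in features: grades in 10 first-year courses for 500 students.
n_students, n_courses = 500, 10
X = rng.normal(70, 10, size=(n_students, n_courses))

# Synthetic label: degree completion, loosely driven by the first few courses.
y = (X[:, :3].mean(axis=1) + rng.normal(0, 5, n_students) > 70).astype(int)

# A random forest aggregates many classification trees grown on bootstrap
# samples of the data, which stabilizes the single-tree classifier.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based variable importances: which courses the forest found useful.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
print("Courses ranked by importance:", ranking)
```

The importances are normalized to sum to one, so they can be read as relative contributions; on this synthetic data the first three courses, which drive the label, should rank highest.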


Keywords: Higher education · Student retention · Academic success · Machine learning · Classification tree · Random forest · Variable importance



We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with students' grade data. The authors also gratefully acknowledge the financial support of NSERC of Canada.



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. University of Toronto, Toronto, Canada
