# Predicting University Students’ Academic Success and Major Using Random Forests

## Abstract

This article analyses a large data set containing every course taken by every undergraduate student at a major Canadian university over 10 years. Modern machine learning algorithms can turn such large data sets into useful tools for the data provider, in this case the university. Two classifiers are constructed using random forests. The first uses the courses a student completed in their first two semesters to predict whether they will obtain an undergraduate degree. The second, for students who completed a program, predicts their major, again from the first few courses they registered for. A classification tree is an intuitive and powerful classifier, and building a random forest of trees improves on it. Random forests also allow for reliable variable importance measurements. These measures indicate which variables are useful to the classifiers and can be used to better understand what is statistically related to the students’ situation. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.
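To make the ideas in the abstract concrete, here is a minimal sketch of a random forest with a permutation-based variable importance measure, written in plain Python. It is an illustration only, not the authors' implementation: the synthetic "grades", the rule that generates the degree-completion label, the tree depth, and the number of trees are all assumptions made up for this example. The importance measure shown (drop in accuracy after shuffling one variable on held-out data) is one common choice and may differ from the one used in the article.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (impurity, feature, threshold) of the best split, or None."""
    best = None
    for f in features:
        # Skipping the smallest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=4):
    """Grow a classification tree; each node considers n_try random features."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_try))
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left],
                      n_try, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [y for _, y in right],
                      n_try, rng, depth + 1, max_depth))

def predict_tree(node, row):
    """Follow splits until a leaf (a plain class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node

def grow_forest(rows, labels, n_trees, n_try, rng):
    """Bagging: each tree is grown on a bootstrap sample of the training rows."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

def accuracy(forest, rows, labels):
    return sum(predict_forest(forest, r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(forest, rows, labels, rng):
    """Accuracy drop when one variable's values are shuffled across rows."""
    base = accuracy(forest, rows, labels)
    drops = []
    for f in range(len(rows[0])):
        column = [r[f] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        drops.append(base - accuracy(forest, shuffled, labels))
    return drops

# Hypothetical data: two course grades that drive the outcome and one
# unrelated variable. The labelling rule below is invented for illustration.
rng = random.Random(0)
rows, labels = [], []
for _ in range(300):
    a, b, noise = rng.randint(40, 100), rng.randint(40, 100), rng.randint(40, 100)
    rows.append([a, b, noise])
    labels.append(int(0.8 * a + 0.2 * b + rng.gauss(0, 3) > 70))

train_rows, train_labels = rows[:200], labels[:200]
test_rows, test_labels = rows[200:], labels[200:]

forest = grow_forest(train_rows, train_labels, n_trees=20, n_try=2, rng=rng)
importances = permutation_importance(forest, test_rows, test_labels, rng)
print("held-out accuracy:", accuracy(forest, test_rows, test_labels))
print("importance per variable:", importances)
```

In this toy setting the first variable carries most of the signal, so shuffling it should hurt accuracy far more than shuffling the unrelated third variable. That is the kind of contrast a variable importance analysis is meant to surface.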

## Keywords

Higher education, Student retention, Academic success, Machine learning, Classification tree, Random forest, Variable importance

## Notes

### Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with student grade data. The authors also gratefully acknowledge the financial support from the NSERC of Canada.
