Predicting University Students’ Academic Success and Major Using Random Forests
In this article, we analyse a large data set containing every course taken by every undergraduate student at a major Canadian university over a period of ten years. Modern machine learning algorithms can turn such data sets into useful tools for the data provider, in this case the university. We construct two classifiers using random forests. First, the courses a student completes in their first two semesters are used to predict whether they will obtain an undergraduate degree. Second, for students who completed a program, the first few courses they registered in are used to predict their major. A classification tree is an intuitive and powerful classifier, and combining many trees into a random forest improves it further. Random forests also provide reliable variable importance measures, which identify the variables most useful to the classifiers and help explain what is statistically related to the students' outcomes. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.
Keywords: Higher education · Student retention · Academic success · Machine learning · Classification tree · Random forest · Variable importance
We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with students' grade data. The authors also gratefully acknowledge financial support from the Natural Sciences and Engineering Research Council (NSERC) of Canada.
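The pipeline described in the abstract — a forest of classification trees grown on bootstrap samples with random feature subsets, combined by majority vote, with variable importance measured by permuting a feature and recording the drop in accuracy — can be sketched minimally on synthetic data. Everything below is illustrative: the grade-generating rule, the number of features, and the hyper-parameters are assumptions for this sketch, not the data or settings used in the article.

```python
import random
from collections import Counter

# Hypothetical synthetic data: each "student" is a vector of four first-year
# course grades; the label is 1 (graduates) or 0. This generating rule is an
# assumption for illustration only.
random.seed(0)

def make_student():
    grades = [random.gauss(70, 10) for _ in range(4)]
    label = 1 if sum(grades) / len(grades) + random.gauss(0, 5) > 68 else 0
    return grades, label

data = [make_student() for _ in range(300)]
X = [g for g, _ in data]
y = [lab for _, lab in data]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    best = None  # (weighted impurity, feature, threshold)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3, n_feats=2):
    if depth == max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]        # leaf: majority class
    feats = random.sample(range(len(X[0])), n_feats)  # random feature subset
    split = best_split(X, y, feats)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    li = [i for i, row in enumerate(X) if row[f] <= t]
    ri = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in li], [y[i] for i in li], depth + 1, max_depth, n_feats),
            build_tree([X[i] for i in ri], [y[i] for i in ri], depth + 1, max_depth, n_feats))

def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(X, y, n_trees=15):
    trees = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict(trees, row):
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]

def permutation_importance(trees, X, y, f):
    # Importance = drop in accuracy after scrambling feature f.
    base = sum(predict(trees, r) == lab for r, lab in zip(X, y)) / len(X)
    Xp = [row[:] for row in X]
    col = [row[f] for row in Xp]
    random.shuffle(col)
    for row, v in zip(Xp, col):
        row[f] = v
    perm = sum(predict(trees, r) == lab for r, lab in zip(Xp, y)) / len(X)
    return base - perm

trees = grow_forest(X, y)
acc = sum(predict(trees, r) == lab for r, lab in zip(X, y)) / len(X)
print(f"training accuracy: {acc:.2f}")
for f in range(4):
    print(f"feature {f} importance: {permutation_importance(trees, X, y, f):+.3f}")
```

In practice one would use an established implementation (the article cites the R `randomForest` package of Liaw and Wiener) rather than hand-rolled trees; this sketch only makes the mechanics of bootstrapping, random feature selection, voting, and permutation importance concrete.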