## Abstract

In this article, a large data set containing every course taken by every undergraduate student at a major Canadian university over 10 years is analysed. Modern machine learning algorithms can use large data sets to build useful tools for the data provider, in this case the university. Here, two classifiers are constructed using random forests. First, the courses completed by a student in the first two semesters are used to predict whether the student will obtain an undergraduate degree. Second, for the students who completed a program, the major is predicted, again using the first few courses in which they registered. A classification tree is an intuitive and powerful classifier, and building a random forest of trees improves this classifier further. Random forests also allow for reliable variable importance measurements. These measures explain which variables are useful to the classifiers and can be used to better understand what is statistically related to the students' situation. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.

## References

Aulck, L., Velagapudi, N., Blumenstock, J., & West, J. (2016). Predicting student dropout in higher education. arXiv e-prints.

Bailey, M. A., Rosenthal, J. S., & Yoon, A. H. (2016). Grades and incentives: Assessing competing grade point average measures and postgraduate outcomes. *Studies in Higher Education*, *41*(9), 1548–1562. https://doi.org/10.1080/03075079.2014.982528.

Bar, T., Kadiyali, V., & Zussman, A. (2009). Grade information and grade inflation: The Cornell experiment. *Journal of Economic Perspectives*, *23*(3), 93–108.

Breiman, L. (1996a). Bagging predictors. *Machine Learning*, *24*(2), 123–140. https://doi.org/10.1007/BF00058655.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. *The Annals of Statistics*, *24*(6), 2350–2383. https://doi.org/10.1214/aos/1032181158.

Breiman, L. (2001). Random forests. *Machine Learning*, *45*(1), 5–32. https://doi.org/10.1023/A:1010933404324.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). *Classification and regression trees*. Belmont, CA: Wadsworth Publishing Company.

Chen, R., & DesJardins, S. L. (2008). Exploring the effects of financial aid on the gap in student dropout risks by income level. *Research in Higher Education*, *49*(1), 1–18. https://doi.org/10.1007/s11162-007-9060-9.

Chen, R., & DesJardins, S. L. (2010). Investigating the impact of financial aid on student dropout risks: Racial and ethnic differences. *The Journal of Higher Education*, *81*(2), 179–208. http://www.jstor.org/stable/40606850.

Chen, Y.-L., Hsu, C.-L., & Chou, S.-C. (2003). Constructing a multi-valued and multi-labeled decision tree. *Expert Systems with Applications*, *25*(2), 199–209. https://doi.org/10.1016/S0957-4174(03)00047-2.

Chou, S., & Hsu, C.-L. (2005). MMDT: A multi-valued and multi-labeled decision tree classifier for data mining. *Expert Systems with Applications*, *28*(4), 799–812. https://doi.org/10.1016/j.eswa.2004.12.035.

Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In L. De Raedt & A. Siebes (Eds.), *Principles of Data Mining and Knowledge Discovery: 5th European Conference, PKDD 2001, Freiburg, Germany, September 3–5, 2001, Proceedings* (pp. 42–53). Berlin, Heidelberg: Springer. https://doi.org/10.1007/3-540-44794-6.

Eddelbuettel, D., & Francois, R. (2011). Rcpp: Seamless R and C++ integration. *Journal of Statistical Software*, *40*(1), 1–18. https://doi.org/10.18637/jss.v040.i08.

Glaesser, J., & Cooper, B. (2012). Gender, parental education, and ability: Their interacting roles in predicting GCSE success. *Cambridge Journal of Education*, *42*(4), 463–480. https://doi.org/10.1080/0305764X.2012.733346.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The elements of statistical learning* (2nd ed.). Berlin: Springer.

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. *Journal of Computational and Graphical Statistics*, *15*(3), 651–674. https://doi.org/10.1198/106186006X133933.

Johnson, S. R., & Stage, F. K. (2018). Academic engagement and student success: Do high-impact practices mean higher graduation rates? *The Journal of Higher Education*, *0*(0), 1–29. https://doi.org/10.1080/00221546.2018.1441107.

Johnson, V. E. (2003). *Grade inflation: A crisis in college education*. New York: Springer.

Kappe, R., & van der Flier, H. (2012). Predicting academic success in higher education: What's more important than being smart? *European Journal of Psychology of Education*, *27*(4), 605–619. https://doi.org/10.1007/s10212-011-0099-9.

Kim, H., & Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. *Journal of the American Statistical Association*, *96*, 589–604. http://www.stat.wisc.edu/~loh/treeprogs/cruise/cruise.pdf.

Kononenko, I. (1995). On biases in estimating multi-valued attributes. In *Proceedings of the 14th International Joint Conference on Artificial Intelligence* (Vol. 2, pp. 1034–1040). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1643031.1643034.

Leeds, D. M., & DesJardins, S. L. (2015). The effect of merit aid on enrollment: A regression discontinuity analysis of Iowa's National Scholars Award. *Research in Higher Education*, *56*(5), 471–495. https://doi.org/10.1007/s11162-014-9359-2.

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. *R News*, *2*(3), 18–22. http://CRAN.R-project.org/doc/Rnews/.

Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. *Statistica Sinica*, *12*, 361–386. http://www.stat.wisc.edu/~loh/treeprogs/guide/guide02.pdf.

Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. *Statistica Sinica*, *7*, 815–840. http://www3.stat.sinica.edu.tw/statistica/j7n4/j7n41/j7n41.htm.

Mills, J. S., & Blankstein, K. R. (2000). Perfectionism, intrinsic vs extrinsic motivation, and motivated strategies for learning: A multidimensional analysis of university students. *Personality and Individual Differences*, *29*(6), 1191–1204. https://doi.org/10.1016/S0191-8869(00)00003-9.

Niessen, A. S. M., Meijer, R. R., & Tendeiro, J. N. (2016). Predicting performance in higher education using proximal predictors. *PLoS ONE*, *11*(4), 1–14. https://doi.org/10.1371/journal.pone.0153663.

Ost, B. (2010). The role of peers and grades in determining major persistence in the sciences. *Economics of Education Review*, *29*, 923–934.

Sabot, R., & Wakeman-Linn, J. (1991). Grade inflation and course choice. *Journal of Economic Perspectives*, *5*, 159–170.

Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. *BMC Bioinformatics*, *8*(1), 25. https://doi.org/10.1186/1471-2105-8-25.

University of Toronto. (2017). Degree requirements (H.B.A., H.B.Sc., BCom). Retrieved August 30, 2017, from http://calendar.artsci.utoronto.ca/Degree_Requirements_(H.B.A.,_H.B.Sc.,_BCom).html.

## Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with students grade data. The authors also gratefully acknowledge the financial support from the NSERC of Canada.

## Appendix

The following section contains some mathematical notation and definitions for readers interested in a more thorough explanation of the content of the "Classification Tree" and "Random Forest" sections. A full understanding of the appendix is not needed to grasp the essentials of the article, but it serves as a brief yet precise introduction to the mathematical formulation of decision trees and random forests.

Rigorously, a typical supervised statistical learning problem is defined when the relationship between a response variable \({\mathbf {Y}}\) and an associated *m*-dimensional predictor vector \({\mathbf {X}} = (X_1,\ldots ,X_m)\) is of interest. When the response variable is categorical and takes *k* different possible values, this problem is called a *k*-class classification problem. One challenge in classification problems is to use a data set \(D = \{ (Y_i,X_{1,i},\ldots ,X_{m,i}) ; i = 1,\ldots ,n \}\) to construct a classifier \(\varphi (D)\). A classifier is built to emit a class prediction for any new data point \({\mathbf {X}}\) belonging to the feature space \({\mathcal {X}} = {\mathcal {X}}_1 \times \cdots \times {\mathcal {X}}_m\). A classifier therefore divides the feature space \({\mathcal {X}}\) into *k* disjoint regions \(B_1,\ldots ,B_k\) such that \(\cup _{j =1}^k B_j = {\mathcal {X}}\), i.e. \(\varphi (D,{\mathbf {X}}) = \sum _{j=1}^k j \, {\mathbf {1}}\{ {\mathbf {X}} \in B_j\}\).

As explained in the "Classification Tree" section, a classification tree (Breiman et al. 1984) is an algorithm that forms these regions by recursively dividing the feature space \({\mathcal {X}}\) until a stopping rule applies. Most algorithms stop the partitioning process whenever every terminal node of the tree contains fewer than \(\beta\) observations, where \(\beta\) is a tuning parameter that can be established by cross-validation. Let \(p_{rk}\) be the proportion of class *k* in region *r*; if region \(R_r\) contains \(n_r\) observations, then:

\(p_{rk} = \frac{1}{n_r} \sum _{{\mathbf {x}}_i \in R_r} {\mathbf {1}}\{ y_i = k \}.\)

The class prediction for a new observation that falls in region *r* is the majority class in that region, i.e. if \({\mathbf {X}} \in R_r\), then \(\varphi (D,{\mathbf {X}}) = \text {argmax}_k \, (p_{rk})\). When splitting a region into two new regions \(R_1\) and \(R_2\), the algorithm computes the total impurity of the new regions, \(n_{1} Q_1 + n_2 Q_2\), and picks the split variable *j* and split location *s* that minimize that total impurity. If predictor *j* is continuous, the possible splits are of the form \(X_{j} \le s\) and \(X_j > s\), which usually results in \(n_r-1\) possible splits. For a categorical predictor with *q* possible values, it is common to consider all \(2^{q-1} -1\) possible splits. Hastie et al. (2009) introduce many possible region impurity measures \(Q_r\); in this project, the *Gini index* has been chosen:

\(Q_r = \sum _{k} p_{rk} (1 - p_{rk}).\)
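As a quick illustration (not part of the original article), the Gini index of a region can be computed in R directly from its vector of class proportions:

```r
# Gini index Q_r of a region, given its class proportions p_rk
gini <- function(p) sum(p * (1 - p))

gini(c(0.5, 0.5))  # 0.5: maximal impurity for a two-class region
gini(c(1, 0))      # 0: a pure region
```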

Here is a pseudo-code of the algorithm:
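The original listing was not preserved in this version of the page; the following is a reconstructed sketch of the standard CART-style tree-growing procedure described above:

```
GrowTree(region R containing n_R observations):
    if n_R < beta:                              # stopping rule
        label R with its majority class
        return
    for each candidate split variable j and location s:
        compute the total impurity n_1 Q_1 + n_2 Q_2 of the resulting regions
    choose the pair (j, s) minimizing the total impurity
    split R into R_1 = {X_j <= s} and R_2 = {X_j > s}
    GrowTree(R_1)
    GrowTree(R_2)
```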

Since decision trees are unstable procedures (Breiman 1996b), they benefit greatly from bootstrap aggregating (bagging) (Breiman 1996a). In classifier aggregation, the goal is to use an entire set of classifiers \(\{ \varphi (D_q) \}\) to obtain a new classifier \(\varphi _a\) that is better than any of them individually. One method of aggregating the class predictions \(\{ \varphi (D_q,{\mathbf {X}}) \}\) is by *voting*: the predicted class for the input \({\mathbf {X}}\) is the class picked most often among the classifiers. More precisely, let \(T_k = | \{ q : \varphi (D_q, {\mathbf {X}}) = k \} |\); then the aggregated classifier becomes \(\varphi _a({\mathbf {X}}) = \text {argmax}_k \, (T_k)\).
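A minimal R sketch of this voting rule (illustrative only; `preds` stands for the predictions \(\varphi (D_q, {\mathbf {X}})\) of five hypothetical classifiers):

```r
# Aggregate the class predictions of five classifiers for one input X
preds <- c("A", "B", "A", "A", "C")  # phi(D_q, X) for q = 1, ..., 5
T_k   <- table(preds)                # vote counts T_k for each class
names(which.max(T_k))                # majority vote: "A"
```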

One way to form a set of classifiers is to draw bootstrap samples of the data set *D*, which forms a set of learning sets \(\{ D_b \}\). Each bootstrap sample is of size *n*, drawn at random with replacement from the original training set *D*. For each of these learning sets a classifier \(\varphi (D_b)\) is constructed, and the resulting set of classifiers \(\{ \varphi (D_b) \}\) can be used to create an aggregated classifier. If the base classifier is an unpruned tree, then the aggregated classifier is a random forest.
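In R, one bootstrap learning set \(D_b\) can be drawn as follows (toy data, for illustration only):

```r
# A toy data set D with n = 6 observations
D <- data.frame(y = factor(c("a", "b", "a", "b", "a", "b")), x = 1:6)
n <- nrow(D)

# One bootstrap learning set D_b: n rows sampled with replacement from D
set.seed(42)
D_b <- D[sample(n, n, replace = TRUE), ]
nrow(D_b)  # still n = 6, but some rows repeat and others are left out
```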

A random forest classifier is more precise than a single classification tree in the sense that it has a lower mean-squared prediction error (Breiman 1996a). By bagging a classifier, the bias remains the same but the variance decreases. One way to further decrease the variance of the random forest is to construct trees that are as uncorrelated as possible. Breiman (2001) introduced random forests with random inputs. In these forests, instead of finding the best split variable and split location among all *m* variables, the algorithm randomly selects \(p < m\) covariates and finds the best split among those *p* covariates.

The fitted random forest classifiers were compared to two logistic regression models. A simple logistic model is used to predict whether or not a student completes his or her program, with the following parametrization:

\(\log \left( \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} \right) = \beta _0 + \beta _1 x_{i,1} + \cdots + \beta _m x_{i,m},\)

where \(Y_i=1\) means that student *i* completed his or her program, *m* is the number of predictors, the \(\beta _j\) are the parameters and the \(x_{i,j}\) are the predictor values. To predict the major completed, a generalization of the logistic regression, the multinomial logistic regression, is used with the following parametrization:

\(\log \left( \frac{P(Y_i = p)}{P(Y_i = k)} \right) = \beta _{0,p} + \beta _{1,p} x_{i,1} + \cdots + \beta _{m,p} x_{i,m}, \quad p = 1,\ldots ,k-1,\)

where \(Y_i = p\) means that student *i* completed program *p* and *k* is the number of programs.

Finally, here is a short example of code to fit random forests, get predictions for new observations and produce variable importance plots using the R language:
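The original code listing was not preserved in this version of the page; below is a minimal reconstructed sketch using the `randomForest` package (Liaw and Wiener 2002). The data set and variable names are illustrative, not the article's actual data:

```r
library(randomForest)

# Illustrative training data: two predictors and a binary outcome
set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- factor(ifelse(train$x1 + 0.5 * rnorm(200) > 0,
                         "completed", "dropped"))

# Fit a random forest; mtry is the number p of covariates tried at each split
fit <- randomForest(y ~ x1 + x2, data = train,
                    ntree = 500, mtry = 1, importance = TRUE)

# Predict the class of new observations
newdata <- data.frame(x1 = c(-1.5, 1.5), x2 = c(0, 0))
predict(fit, newdata)

# Variable importance plots (mean decrease in accuracy and in Gini index)
varImpPlot(fit)
```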

## About this article

### Cite this article

Beaulac, C., Rosenthal, J.S. Predicting University Students’ Academic Success and Major Using Random Forests.
*Res High Educ* **60**, 1048–1064 (2019). https://doi.org/10.1007/s11162-019-09546-y
