Predicting University Students’ Academic Success and Major Using Random Forests

Beaulac, Cédric; Rosenthal, Jeffrey S.

doi:10.1007/s11162-019-09546-y

Published: 25 January 2019

Volume 60, pages 1048–1064, (2019)

Research in Higher Education

Abstract

In this article, a large data set containing every course taken by every undergraduate student in a major university in Canada over 10 years is analysed. Modern machine learning algorithms can use large data sets to build useful tools for the data provider, in this case, the university. Two classifiers are constructed using random forests. First, the first two semesters of courses completed by a student are used to predict whether they will obtain an undergraduate degree. Second, for the students who completed a program, their major is predicted, again using the first few courses they registered for. A classification tree is an intuitive and powerful classifier, and building a random forest of trees improves this classifier. Random forests also allow for reliable variable importance measurements. These measures explain which variables are useful to the classifiers and can be used to better understand what is statistically related to the students’ situation. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.

References

Aulck, L., Velagapudi, N., Blumenstock, J., & West, J. (2016, June). Predicting student dropout in higher education. ArXiv e-prints.
Bailey, M. A., Rosenthal, J. S., & Yoon, A. H. (2016). Grades and incentives: assessing competing grade point average measures and postgraduate outcomes. Studies in Higher Education, 41(9), 1548–1562. https://doi.org/10.1080/03075079.2014.982528
Bar, T., Kadiyali, V., & Zussman, A. (2009). Grade information and grade inflation: The Cornell experiment. Journal of Economic Perspectives, 23(3), 93–108.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/BF00058655
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350–2383. https://doi.org/10.1214/aos/1032181158
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth Publishing Company.
Chen, R., & DesJardins, S. L. (2008). Exploring the effects of financial aid on the gap in student dropout risks by income level. Research in Higher Education, 49(1), 1–18. https://doi.org/10.1007/s11162-007-9060-9
Chen, R., & DesJardins, S. L. (2010). Investigating the impact of financial aid on student dropout risks: Racial and ethnic differences. The Journal of Higher Education, 81(2), 179–208. http://www.jstor.org/stable/40606850
Chen, Y.-L., Hsu, C.-L., & Chou, S.-C. (2003). Constructing a multi-valued and multi-labeled decision tree. Expert Systems with Applications, 25(2), 199–209. https://doi.org/10.1016/S0957-4174(03)00047-2. Retrieved from http://www.sciencedirect.com/science/article/pii/S0957417403000472
Chou, S., & Hsu, C.-L. (2005). MMDT: A multi-valued and multi-labeled decision tree classifier for data mining. Expert Systems with Applications, 28(4), 799–812. https://doi.org/10.1016/j.eswa.2004.12.035
Clare, A., & King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In L. De Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge Discovery: 5th European Conference, PKDD 2001, Freiburg, Germany, September 3–5, 2001 Proceedings (pp. 42–53). Berlin, Heidelberg: Springer Berlin Heidelberg. Retrieved from https://doi.org/10.1007/3-540-44794-6
Eddelbuettel, D., & Francois, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18. https://doi.org/10.18637/jss.v040.i08. Retrieved from https://www.jstatsoft.org/index.php/jss/article/view/v040i08
Glaesser, J., & Cooper, B. (2012). Gender, parental education, and ability: their interacting roles in predicting GCSE success. Cambridge Journal of Education, 42(4), 463–480. https://doi.org/10.1080/0305764X.2012.733346
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Berlin: Springer.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933
Johnson, S. R., & Stage, F. K. (2018). Academic engagement and student success: Do high-impact practices mean higher graduation rates. The Journal of Higher Education, 0(0), 1–29. https://doi.org/10.1080/00221546.2018.1441107
Johnson, V. E. (2003). Grade inflation: A crisis in college education. New York: Springer.
Kappe, R., & van der Flier, H. (2012). Predicting academic success in higher education: what’s more important than being smart? European Journal of Psychology of Education, 27(4), 605–619. https://doi.org/10.1007/s10212-011-0099-9
Kim, H., & Loh, W.-Y. (2001). Classification trees with unbiased multiway splits. Journal of the American Statistical Association, 96, 589–604. Retrieved from http://www.stat.wisc.edu/~loh/treeprogs/cruise/cruise.pdf
Kononenko, I. (1995). On biases in estimating multi-valued attributes. In Proceedings of the 14th international joint conference on artificial intelligence (vol. 2, pp. 1034–1040). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. Retrieved from http://dl.acm.org/citation.cfm?id=1643031.1643034
Leeds, D. M., & DesJardins, S. L. (2015). The effect of merit aid on enrollment: A regression discontinuity analysis of Iowa’s National Scholars Award. Research in Higher Education, 56(5), 471–495. https://doi.org/10.1007/s11162-014-9359-2
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from http://CRAN.R-project.org/doc/Rnews/
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361–386. Retrieved from http://www.stat.wisc.edu/~loh/treeprogs/guide/guide02.pdf
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815–840. Retrieved from http://www3.stat.sinica.edu.tw/statistica/j7n4/j7n41/j7n41.htm
Mills, J. S., & Blankstein, K. R. (2000). Perfectionism, intrinsic vs extrinsic motivation, and motivated strategies for learning: a multidimensional analysis of university students. Personality and Individual Differences, 29(6), 1191–1204. https://doi.org/10.1016/S0191-8869(00)00003-9. Retrieved from http://www.sciencedirect.com/science/article/pii/S0191886900000039
Niessen, A. S. M., Meijer, R. R., & Tendeiro, J. N. (2016). Predicting performance in higher education using proximal predictors. PLoS ONE, 11(4), 1–14. https://doi.org/10.1371/journal.pone.0153663
Ost, B. (2010). The role of peers and grades in determining major persistence in sciences. Economics of Education Review, 29, 923–934.
Sabot, R., & Wakeman-Linn, J. (1991). Grade inflation and course choice. Journal of Economic Perspectives, 5, 159–170.
Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://doi.org/10.1186/1471-2105-8-25
University of Toronto. (2017). Degree requirements (H.B.A., H.B.Sc., BCom). Retrieved 2017-08-30 from http://calendar.artsci.utoronto.ca/Degree_Requirements_(H.B.A.,_H.B.Sc.,_BCom).html

Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with student grade data. The authors also gratefully acknowledge the financial support from the NSERC of Canada.

Author information

Authors and Affiliations

University of Toronto, 100 St. George Street, Toronto, ON, Canada
Cédric Beaulac & Jeffrey S. Rosenthal

Corresponding author

Correspondence to Cédric Beaulac.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The following section contains some mathematical notation and definitions for readers who are interested in a more thorough explanation of the content of the "Classification Tree" and "Random Forest" sections. A full understanding of the appendix is not needed to grasp the essentials of the article, but it serves as a brief yet precise introduction to the mathematical formulation of decision trees and random forests.

Rigorously, a typical supervised statistical learning problem is defined when the relationship between a response variable ${\mathbf {Y}}$ and an associated m-dimensional predictor vector ${\mathbf {X}} = (X_1,\ldots ,X_m)$ is of interest. When the response variable is categorical and takes k different possible values, this problem is defined as a k-class classification problem. One challenge in classification problems is to use a data set $D = \{ (Y_i,X_{1,i},\ldots ,X_{m,i}) ; i = 1,\ldots ,n \}$ in order to construct a classifier $\varphi (D)$. A classifier is built to emit a class prediction for any new data point ${\mathbf {X}}$ that belongs to the feature space ${\mathcal {X}} = {\mathcal {X}}_1 \times \cdots \times {\mathcal {X}}_m$. Therefore a classifier divides the feature space ${\mathcal {X}}$ into k disjoint regions $B_1,\ldots ,B_k$ such that $\cup _{j =1}^k B_j = {\mathcal {X}}$, i.e. $\varphi (D,{\mathbf {X}}) = \sum _{j=1}^k j {\mathbf {1}}\{ {\mathbf {X}} \in B_j\}$.

As explained in the "Classification Tree" section, a classification tree (Breiman et al. 1984) is an algorithm that forms these regions by recursively dividing the feature space ${\mathcal {X}}$ until a stopping rule is applied. Most algorithms stop the partitioning process once every terminal node of the tree contains fewer than $\beta$ observations. This $\beta$ is a tuning parameter that can be established by cross-validation. Let $p_{rk}$ be the proportion of class k in region $R_r$. If $R_r$ contains $n_r$ observations, then:

$$\begin{aligned} p_{rk}= \frac{1}{n_r} \sum _{x_i \in R_r} {\mathbf {1}}\{y_i = k\}. \end{aligned} \tag{1}$$

The class prediction for a new observation that falls in region r is the majority class in that region, i.e. if ${\mathbf {X}} \in R_r$, $\varphi (D,{\mathbf {X}}) = \text {argmax}_k (p_{rk})$. When splitting a region into two new regions $R_1$ and $R_2$, the algorithm computes the total impurity of the new regions, $n_{1} Q_1 + n_2 Q_2$, and picks the split variable j and split location s that minimize that total impurity. If the predictor j is continuous, the possible splits are of the form $X_{j} \le s$ and $X_j > s$, which usually results in $n_r-1$ possible splits. For a categorical predictor having q possible values, it is common to consider all of the $2^{q-1} -1$ possible splits. Hastie et al. (2009) introduce many possible region impurity measures $Q_r$; in this project, the Gini index was chosen:

$$\begin{aligned} Q_r = \sum _{j=1}^k p_{rj}(1-p_{rj}). \end{aligned} \tag{2}$$
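To make Eqs. (1) and (2) concrete, the following minimal Python sketch computes the class proportions of a region, its Gini index, and the total impurity $n_1 Q_1 + n_2 Q_2$ of a candidate split. The function names and the six-observation toy region are illustrative assumptions, not taken from the study's data or code.

```python
from collections import Counter

def class_proportions(labels):
    """Proportion of each class among the labels of one region (Eq. 1)."""
    n = len(labels)
    return {k: count / n for k, count in Counter(labels).items()}

def gini(labels):
    """Gini index of one region (Eq. 2): sum over classes of p * (1 - p)."""
    return sum(p * (1.0 - p) for p in class_proportions(labels).values())

def total_impurity(left, right):
    """Size-weighted impurity n_1 * Q_1 + n_2 * Q_2 of a candidate split."""
    return len(left) * gini(left) + len(right) * gini(right)

# Hypothetical toy region: six students labelled by outcome.
region = ["grad", "grad", "grad", "drop", "drop", "drop"]
mixed_split = (["grad", "drop", "grad"], ["drop", "grad", "drop"])
clean_split = (["grad", "grad", "grad"], ["drop", "drop", "drop"])

print(gini(region))                  # 0.5 (two equally frequent classes)
print(total_impurity(*mixed_split))  # about 2.667
print(total_impurity(*clean_split))  # 0.0 (both new regions are pure)
```

The split that separates the two classes perfectly has zero total impurity and would therefore be preferred over the mixed split.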

Here is a pseudo-code of the algorithm:
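A minimal Python sketch of the recursive partitioning described above, assuming numeric predictors only and the Gini index as impurity measure. It is an illustration under these assumptions rather than the authors' listing: the names `gini`, `best_split`, `grow_tree` and `predict`, the `min_size` argument standing in for $\beta$, and the toy data are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a region: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(X, y):
    """Return the (variable j, location s) minimising n_1*Q_1 + n_2*Q_2,
    or None when no predictor takes two distinct values."""
    best, best_score = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for s in values[:-1]:  # splitting at the largest value leaves one side empty
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best_score:
                best, best_score = (j, s), score
    return best

def grow_tree(X, y, min_size=2):
    """Split recursively until a node is pure, holds fewer than min_size
    observations, or cannot be split any further."""
    majority = Counter(y).most_common(1)[0][0]
    if len(y) < min_size or len(set(y)) == 1:
        return {"leaf": True, "label": majority}
    split = best_split(X, y)
    if split is None:
        return {"leaf": True, "label": majority}
    j, s = split
    left = [i for i, row in enumerate(X) if row[j] <= s]
    right = [i for i, row in enumerate(X) if row[j] > s]
    return {
        "leaf": False, "var": j, "cut": s,
        "left": grow_tree([X[i] for i in left], [y[i] for i in left], min_size),
        "right": grow_tree([X[i] for i in right], [y[i] for i in right], min_size),
    }

def predict(tree, x):
    """Follow the splits down to a terminal node and return its majority class."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["label"]

# Hypothetical toy data: (first-year average, an uninformative 0/1 flag).
X = [[55, 1], [62, 0], [71, 1], [80, 0], [85, 1], [90, 0]]
y = ["drop", "drop", "drop", "grad", "grad", "grad"]
tree = grow_tree(X, y)
print(tree["var"], tree["cut"])  # 0 71
print(predict(tree, [60, 1]))    # drop
print(predict(tree, [88, 0]))    # grad
```

Categorical predictors, which would require enumerating subsets of levels, are deliberately left out to keep the sketch short.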

Since decision trees are unstable procedures (Breiman 1996b), they greatly benefit from bootstrap aggregating (bagging) (Breiman 1996a). In classifier aggregating, the goal is to find a way to use an entire set of classifiers $\{ \varphi (D_q) \}$ to get a new classifier $\varphi _a$ that is better than any of them individually. One method of aggregating the class predictions $\{ \varphi (D_q,{\mathbf {X}}) \}$ is by voting: the predicted class for the input ${\mathbf {X}}$ is the class picked most often among the classifiers. More precisely, let $T_k = | \{ q : \varphi (D_q, {\mathbf {X}}) = k \} |$; the aggregating classifier then becomes $\varphi _a({\mathbf {X}}) = \text {argmax}_k (T_k)$.
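The voting rule can be sketched in a few lines of Python. The three threshold classifiers below are hypothetical stand-ins for fitted trees, and ties are broken in favour of the class encountered first, which is an arbitrary choice made for this sketch.

```python
from collections import Counter

def aggregate_by_vote(classifiers, x):
    """Return argmax_k T_k, where T_k counts the classifiers predicting class k."""
    votes = Counter(clf(x) for clf in classifiers)  # T_k for every class k
    return votes.most_common(1)[0][0]

# Three hypothetical classifiers thresholding a first-year average.
classifiers = [
    lambda x: "grad" if x >= 60 else "drop",
    lambda x: "grad" if x >= 70 else "drop",
    lambda x: "grad" if x >= 80 else "drop",
]

print(aggregate_by_vote(classifiers, 75))  # grad (two votes to one)
print(aggregate_by_vote(classifiers, 65))  # drop (two votes to one)
```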

One way to form a set of classifiers is to draw bootstrap samples of the data set D, which forms a set of learning sets $\{ D_b \}$. Each of the bootstrap samples is of size n, drawn at random with replacement from the original training set D. For each of these learning sets a classifier $\varphi (D_b)$ is constructed, and the resulting set of classifiers $\{ \varphi (D_b) \}$ can be used to create an aggregating classifier. If the classifier is an unpruned tree, then the aggregating classifier is a random forest.
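Drawing the bootstrap learning sets can be sketched as follows. The helper names, the seed and the integer stand-in for the student records are assumptions made for illustration only.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def bootstrap_learning_sets(data, n_sets, seed=0):
    """Return n_sets bootstrap learning sets D_b, each of size n."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(n_sets)]

D = list(range(1000))  # stand-in for n = 1000 student records
learning_sets = bootstrap_learning_sets(D, n_sets=5)
for D_b in learning_sets:
    distinct = len(set(D_b)) / len(D)
    print(f"size={len(D_b)}, distinct fraction={distinct:.3f}")
```

Because sampling is with replacement, each learning set is expected to contain roughly 63% of the distinct original observations (the limit is $1 - e^{-1}$), so the classifiers are fitted on overlapping but different data.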

A random forest classifier is more precise than a single classification tree in the sense that it has lower mean-squared prediction error (Breiman 1996a). By bagging a classifier, the bias remains the same but the variance decreases. One way to further decrease the variance of the random forest is by constructing trees that are as uncorrelated as possible. Breiman (2001) introduced random forests with random inputs. In these forests, instead of searching for the best split among all m variables, the algorithm randomly selects $p < m$ covariates at each split and finds the best condition among those p covariates.
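The random-input step changes only the split search: at each node, p of the m covariates are drawn and the best split is sought among them. The following minimal Python sketch assumes numeric predictors and Gini impurity; the function name `best_split_random_inputs`, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini index of a region: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split_random_inputs(X, y, p, rng):
    """Search for the best split among p covariates drawn at random,
    without replacement, from the m available covariates."""
    m = len(X[0])
    candidates = rng.sample(range(m), p)
    best, best_score = None, float("inf")
    for j in candidates:
        values = sorted(set(row[j] for row in X))
        for s in values[:-1]:  # keep both sides of the split non-empty
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best_score:
                best, best_score = (j, s), score
    return candidates, best

# Hypothetical toy data with m = 4 numeric covariates.
X = [[55, 3, 1, 10], [62, 4, 0, 12], [71, 2, 1, 9],
     [80, 5, 0, 15], [85, 1, 1, 11], [90, 6, 0, 14]]
y = ["drop", "drop", "drop", "grad", "grad", "grad"]

rng = random.Random(1)
candidates, split = best_split_random_inputs(X, y, p=2, rng=rng)
print("covariates considered:", candidates)
print("chosen (variable, location):", split)
```

Since the most informative covariate is not always among the p candidates, different trees end up splitting on different variables, which is what reduces the correlation between them.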

The fitted random forest classifiers were compared to two logistic regression models. A simple logistic model is used to predict whether a student completes their program, with the following parametrization:

$$\begin{aligned} P(Y_i =1) = \frac{\exp (\sum _{j=0}^m \beta _j x_{ij})}{1+\exp (\sum _{j=0}^m \beta _j x_{ij})}, \end{aligned} \tag{3}$$

where $Y_i=1$ means that student i completed their program, m is the number of predictors, the $\beta _j$ are the parameters and the $x_{ij}$ are the predictor values of student i, with $x_{i0} = 1$ for the intercept. To predict the major completed, a generalization of logistic regression, multinomial logistic regression, is used with the following parametrization:

$$\begin{aligned} P(Y_i = p) = \frac{\exp (\sum _{j=0}^m \beta _j^{(p)} x_{ij})}{1+\sum _{l=1}^{k-1} \exp (\sum _{j=0}^m \beta _j^{(l)} x_{ij})}, \quad p = 1,\ldots ,k-1, \end{aligned} \tag{4}$$

where $Y_i =p$ means that student i completed program p, k is the number of programs, and program k serves as the reference category, whose probability is the remaining mass $1/(1+\sum _{l=1}^{k-1} \exp (\sum _{j=0}^m \beta _j^{(l)} x_{ij}))$.
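The two parametrizations can be evaluated numerically as in the following minimal Python sketch. It assumes the usual reference-category form of multinomial logistic regression, in which one program has its linear predictor fixed at zero; the coefficient values and function names are purely illustrative and are not estimates from the study.

```python
import math

def linear_predictor(beta, x):
    """beta[0] is the intercept; x holds the m predictor values."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def logistic_probability(beta, x):
    """P(Y = 1) under the simple logistic model."""
    eta = linear_predictor(beta, x)
    return 1.0 / (1.0 + math.exp(-eta))  # equals exp(eta) / (1 + exp(eta))

def multinomial_probabilities(betas, x):
    """betas holds one coefficient vector per non-reference program.
    Returns k probabilities, with the reference program last."""
    etas = [linear_predictor(b, x) for b in betas] + [0.0]
    shift = max(etas)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(e - shift) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical coefficients: intercept plus two predictors.
x = [0.8, 1.5]
beta = [-1.0, 2.0, 0.5]
betas = [[-1.0, 2.0, 0.5], [0.3, -0.4, 0.1]]  # k = 3 programs

print(round(logistic_probability(beta, x), 4))
print([round(p, 4) for p in multinomial_probabilities(betas, x)])
```

With a single non-reference class, `multinomial_probabilities` reduces to the simple logistic model, which gives a quick consistency check between the two formulas.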

Finally, here is a short example of code to fit random forests, get predictions for new observations and produce variable importance plots using the R language:

About this article

Cite this article

Beaulac, C., Rosenthal, J.S. Predicting University Students’ Academic Success and Major Using Random Forests. Res High Educ 60, 1048–1064 (2019). https://doi.org/10.1007/s11162-019-09546-y

Received: 12 September 2017
Issue Date: November 2019
