
Predicting University Students’ Academic Success and Major Using Random Forests

Published in Research in Higher Education.

Abstract

In this article, a large data set containing every course taken by every undergraduate student at a major university in Canada over 10 years is analysed. Modern machine learning algorithms can turn such large data sets into useful tools for the data provider, in this case the university. Two classifiers are constructed using random forests. First, the courses completed by a student in their first two semesters are used to predict whether they will obtain an undergraduate degree. Second, for the students who completed a program, their major is predicted, once again using the first few courses they registered for. A classification tree is an intuitive and powerful classifier, and building a random forest of such trees improves it further. Random forests also allow for reliable variable importance measurements. These measures identify which variables are useful to the classifiers and can be used to better understand what is statistically related to the students' outcomes. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.



Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with students' grade data. The authors also gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Author information


Corresponding author

Correspondence to Cédric Beaulac.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


The following section contains mathematical notation and definitions for readers interested in a more thorough explanation of the content of the "Classification Tree" and "Random Forest" sections. A full understanding of the appendix is not needed to grasp the essentials of the article, but it serves as a brief yet precise introduction to the mathematical formulation of decision trees and random forests.

Rigorously, a typical supervised statistical learning problem is defined when the relationship between a response variable \({\mathbf {Y}}\) and an associated m-dimensional predictor vector \({\mathbf {X}} = (X_1,\ldots ,X_m)\) is of interest. When the response variable is categorical and takes k different possible values, the problem is defined as a k-class classification problem. The challenge in classification problems is to use a data set \(D = \{ (Y_i,X_{1,i},\ldots ,X_{m,i}) ; i = 1,\ldots ,n \}\) to construct a classifier \(\varphi (D)\). A classifier is built to emit a class prediction for any new data point \({\mathbf {X}}\) belonging to the feature space \({\mathcal {X}} = {\mathcal {X}}_1 \times \cdots \times {\mathcal {X}}_m\). A classifier therefore divides the feature space \({\mathcal {X}}\) into k disjoint regions \(B_1,\ldots ,B_k\) such that \(\cup _{j =1}^k B_j = {\mathcal {X}}\), i.e. \(\varphi (D,{\mathbf {X}}) = \sum _{j=1}^k j \, {\mathbf {1}}\{ {\mathbf {X}} \in B_j\}\).
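As a toy illustration of this definition (not from the article), here is a minimal Python sketch of a classifier as a partition of the feature space: a one-dimensional feature space is split into k = 2 regions, and the classifier returns the index of the region containing the input.

```python
# Toy illustration: a classifier as a partition of the feature space.
# The feature space [0, 1) is split into k = 2 disjoint regions
# B_1 = [0, 0.5) and B_2 = [0.5, 1); phi(x) = sum_j j * 1{x in B_j}
# simply returns the index j of the region containing x.

def classify(x):
    """Return the index of the region of [0, 1) containing x."""
    return 1 if x < 0.5 else 2

print(classify(0.2))  # x falls in B_1 -> class 1
print(classify(0.9))  # x falls in B_2 -> class 2
```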

As explained in the "Classification Tree" section, a classification tree (Breiman et al. 1984) is an algorithm that forms these regions by recursively dividing the feature space \({\mathcal {X}}\) until a stopping rule applies. Most algorithms stop the partitioning process when every terminal node of the tree contains fewer than \(\beta\) observations; this \(\beta\) is a tuning parameter that can be chosen by cross-validation. Let \(p_{rk}\) be the proportion of class k in region r. If region r contains \(n_r\) observations, then:

$$\begin{aligned} p_{rk}= \frac{1}{n_r} \sum _{x_i \in R_r} {\mathbf {1}}\{y_i = k\}. \end{aligned}$$
(1)

The class prediction for a new observation that falls in region r is the majority class in that region, i.e. if \({\mathbf {X}} \in R_r\), then \(\varphi (D,{\mathbf {X}}) = \text {argmax}_k (p_{rk})\). When splitting a region into two new regions \(R_1\) and \(R_2\), the algorithm computes the total impurity of the new regions, \(n_{1} Q_1 + n_2 Q_2\), and picks the split variable j and split location s that minimize that total impurity. If predictor j is continuous, the possible splits are of the form \(X_{j} \le s\) and \(X_j > s\), which usually yields \(n_r-1\) possible splits. For a categorical predictor with q possible values, it is common to consider all \(2^{q-1} -1\) possible splits. Hastie et al. (2009) introduce several possible region impurity measures \(Q_r\); in this project, the Gini index was chosen:

$$\begin{aligned} Q_r = \sum _{j=1}^k p_{rj}(1-p_{rj}). \end{aligned}$$
(2)
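The proportions \(p_{rk}\) and the Gini index \(Q_r\) can be computed directly from the class labels observed in a region. The following short Python sketch (an illustration, not the article's code) does exactly that:

```python
from collections import Counter

def gini(labels):
    """Gini impurity Q_r = sum_k p_rk * (1 - p_rk) for one region,
    where p_rk is the proportion of class k among the region's labels."""
    n_r = len(labels)
    counts = Counter(labels)
    return sum((c / n_r) * (1 - c / n_r) for c in counts.values())

# A pure region has impurity 0; a 50/50 two-class region has impurity 0.5.
print(gini(["a", "a", "a", "a"]))  # 0.0
print(gini(["a", "a", "b", "b"]))  # 0.5
```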

Here is pseudo-code for the algorithm:

[Pseudo-code figure from the original article not reproduced.]
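Since the pseudo-code figure is not reproduced here, the recursive partitioning described above can be sketched in Python. This is an illustrative implementation, not the article's code: it assumes numeric predictors, uses the Gini criterion, and stops when a node would hold fewer than beta observations.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one region."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def grow_tree(X, y, beta=5):
    """Recursively split (X, y) until a node holds fewer than beta points
    or is pure. X: list of feature lists; y: class labels.
    Returns a nested dict (internal node) or a majority-class leaf."""
    if len(y) < beta or len(set(y)) == 1:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    best = None
    for j in range(len(X[0])):                    # candidate split variable j
        for s in sorted({row[j] for row in X}):   # candidate split location s
            left = [i for i, row in enumerate(X) if row[j] <= s]
            left_set = set(left)
            right = [i for i in range(len(y)) if i not in left_set]
            if not left or not right:
                continue
            # total impurity n_1 * Q_1 + n_2 * Q_2 of the candidate regions
            cost = (len(left) * gini([y[i] for i in left])
                    + len(right) * gini([y[i] for i in right]))
            if best is None or cost < best[0]:
                best = (cost, j, s, left, right)
    if best is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, j, s, left, right = best
    return {"var": j, "split": s,
            "left": grow_tree([X[i] for i in left], [y[i] for i in left], beta),
            "right": grow_tree([X[i] for i in right], [y[i] for i in right], beta)}

def predict(tree, x):
    """Follow the splits down to a leaf and return its majority class."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["split"] else tree["right"]
    return tree["leaf"]

tree = grow_tree([[0], [1], [2], [3]], [0, 0, 1, 1], beta=2)
print(predict(tree, [2.5]))  # falls in the right region -> class 1
```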

Since decision trees are unstable procedures (Breiman 1996b), they benefit greatly from bootstrap aggregating (bagging) (Breiman 1996a). In classifier aggregation, the goal is to use an entire set of classifiers \(\{ \varphi (D_q) \}\) to obtain a new classifier \(\varphi _a\) that is better than any of them individually. One method of aggregating the class predictions \(\{ \varphi (D_q,{\mathbf {X}}) \}\) is voting: the predicted class for the input \({\mathbf {X}}\) is the class picked most often among the classifiers. More precisely, let \(T_k = | \{ q : \varphi (D_q, {\mathbf {X}}) = k \} |\); the aggregated classifier is then \(\varphi _a({\mathbf {X}}) = \text {argmax}_k (T_k)\).
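The voting rule can be sketched directly; the toy classifiers below are hypothetical stand-ins for the set \(\{\varphi(D_q)\}\):

```python
from collections import Counter

def aggregate(classifiers, x):
    """Majority vote: T_k counts the classifiers predicting class k,
    and the aggregated prediction is argmax_k T_k."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three toy threshold classifiers that disagree on x = 0.4:
clfs = [lambda x: 0 if x < 0.5 else 1,
        lambda x: 0 if x < 0.3 else 1,
        lambda x: 0 if x < 0.7 else 1]
print(aggregate(clfs, 0.4))  # two of the three vote for class 0 -> 0
```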

One way to form a set of classifiers is to draw bootstrap samples of the data set D, which forms a set of learning sets \(\{ D_b \}\). Each bootstrap sample is of size n, drawn at random with replacement from the original training set D. For each of these learning sets a classifier \(\varphi (D_b)\) is constructed, and the resulting set of classifiers \(\{ \varphi (D_b) \}\) can be used to create an aggregated classifier. If the base classifier is an unpruned tree, the aggregated classifier is a random forest.
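Drawing the bootstrap learning sets is straightforward; a short sketch (illustration only):

```python
import random

def bootstrap_samples(D, B):
    """Draw B bootstrap learning sets, each of size n = len(D),
    sampled at random with replacement from the training set D."""
    n = len(D)
    return [[random.choice(D) for _ in range(n)] for _ in range(B)]

random.seed(0)
D = list(range(10))
sets = bootstrap_samples(D, B=3)
# Each learning set has size n; sampling with replacement means some
# points repeat and others are left out of any given set.
print([len(s) for s in sets])  # [10, 10, 10]
```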

A random forest classifier is more precise than a single classification tree in the sense that it has lower mean-squared prediction error (Breiman 1996a). Bagging a classifier leaves the bias unchanged but decreases the variance. One way to further decrease the variance of a random forest is to construct trees that are as uncorrelated as possible. Breiman (2001) introduced random forests with random inputs: instead of searching for the best variable and partition among all m variables, the algorithm randomly selects \(p < m\) covariates and finds the best split among those p covariates.
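The article's own code is in R; as a Python analogue (an assumption, not the authors' implementation), scikit-learn exposes the random-inputs mechanism through the max_features parameter, which plays the role of p < m. The synthetic data below is only a stand-in for the student records.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the student records (illustration only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features is the p < m covariates sampled at each split;
# "sqrt" uses p = sqrt(m), a common default for classification forests.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print(forest.score(X, y))  # training accuracy
```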

The fitted random forest classifiers were compared to two logistic regression models. A simple logistic model is used to predict whether a student completes their program, with the following parametrization:

$$\begin{aligned} P(Y_i =1) = \frac{\exp \left( \sum _{j=0}^m \beta _j x_{ij}\right) }{1+\exp \left( \sum _{j=0}^m \beta _j x_{ij}\right) }, \end{aligned}$$
(3)

where \(Y_i=1\) means student i completed their program, m is the number of predictors, the \(\beta _j\) are the parameters and the \(x_{ij}\) are the predictor values (with \(x_{i0}=1\) for the intercept). To predict the major completed, a generalization of the logistic regression, the multinomial logistic regression, is used with the following parametrization:

$$\begin{aligned} P(Y_i = p) = \frac{\exp \left( \sum _{j=0}^m \beta _j^{(p)} x_{ij}\right) }{\sum _{l=1}^k \exp \left( \sum _{j=0}^m \beta _j^{(l)} x_{ij}\right) }, \end{aligned}$$
(4)

where \(Y_i =p\) means that student i completed program p and k is the number of programs.
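Both models can be fitted with scikit-learn's LogisticRegression (a Python analogue; the article does not specify the software used for these baseline models). The synthetic data below is a stand-in for the student records.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Binary case, Eq. (3): probability that a student completes the program.
Xb, yb = make_classification(n_samples=200, n_features=5, random_state=0)
completion_model = LogisticRegression(max_iter=1000).fit(Xb, yb)
print(completion_model.predict_proba(Xb[:1]))  # [P(Y=0), P(Y=1)]

# Multinomial case, Eq. (4): one probability per major (k = 3 here).
Xm, ym = make_classification(n_samples=300, n_features=6, n_classes=3,
                             n_informative=4, random_state=0)
major_model = LogisticRegression(max_iter=1000).fit(Xm, ym)
probs = major_model.predict_proba(Xm[:1])
print(probs)  # k probabilities summing to 1
```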

Finally, here is a short example of code, in the R language, that fits random forests, obtains predictions for new observations, and produces variable importance plots:

[R code figure from the original article not reproduced.]
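Since the R code figure is not reproduced, here is a Python sketch with scikit-learn performing the analogous steps (an analogue, not the authors' code): fit a forest, predict new observations, and rank variable importances. All data and names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course/grade records used in the article.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

predictions = forest.predict(X_new)  # class predictions for new observations

# Variable importances (mean decrease in impurity), ranked high to low.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]
for rank, j in enumerate(ranking[:3], start=1):
    print(f"{rank}. covariate {j}: importance {importances[j]:.3f}")
```

In R, `varImpPlot(forest)` from the randomForest package draws the corresponding plot directly; here the importances are printed, and a matplotlib bar chart would be the plotting analogue.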


About this article


Cite this article

Beaulac, C., Rosenthal, J.S. Predicting University Students’ Academic Success and Major Using Random Forests. Res High Educ 60, 1048–1064 (2019). https://doi.org/10.1007/s11162-019-09546-y
