1 Introduction

Data mining and machine learning approaches are increasingly being used to analyze educational and learning-related datasets in order to understand how students learn and to improve learning outcomes. These efforts reflect the need of modern higher education for meaningful and useful tools that support students’ decisions throughout their studies. Reliable performance prediction can be an essential component of such tools.

Our work focuses on developing methods that utilize historical student–course grade information to accurately estimate how well students will perform (as measured by their grade) on courses that they have not yet taken. Being able to accurately estimate students’ grades in future courses is important as it can be used by them (and/or their academic advisers) to identify the appropriate set of courses to take during the next term and create personalized degree pathways that enable them to successfully and effectively acquire the required knowledge to complete their studies in a timely fashion.

In this paper, we develop various future course grade prediction methods that utilize approaches based on sparse linear models and low-rank matrix factorizations. Regression and matrix factorization have been applied before in related work with a variety of data, but our methods rely entirely on the performance that the students achieved in previously taken courses. A unique aspect of many of our methods is that their associated models are specific either to each course or to each student–course tuple. This allows them to identify and utilize the relevant information from the prior courses associated with the grade of each target course and to better address the problems associated with the reliable estimation of low-rank models and the not-missing-at-random nature of the historical student–course grade data.

We experimentally evaluated the performance of our methods on a dataset obtained from the University of Minnesota that contains historical grades spanning 12.5 years. Our results showed that the course-specific models outperformed various competing schemes. Another conclusion was that performance can vary significantly across different departments.

The remainder of the paper is organized as follows. Section 2 discusses the related work in this area of performance prediction. Section 3 introduces the notation and definitions used. Section 4 describes the methods developed, and Sect. 5 provides information about the experimental design. Section 6 presents an extensive experimental evaluation of the methods and compares them against existing approaches. Finally, Sect. 7 provides some concluding remarks.

2 Related work

In recent years, there has been a lot of research activity in using data analysis approaches to understand, support, and enhance student learning. This research addresses a variety of problems in different settings, such as “early warning systems”, systems that adapt the educational material to the student’s learning (e.g., intelligent tutoring systems), systems that recommend additional educational material, and the traditional classroom environment. The key problems include student modeling and the prediction of student performance on tasks, course activities, homework questions, and examinations, as well as final grades, either during a course or after its completion. Some studies target the prediction of students’ term and final GPA [1, 16, 17].

The need for tools that understand and support student learning has led to the development of intelligent “early warning systems” that monitor students’ performance during the term [2, 14, 24]. The data collected by learning management systems (LMS) have also been exploited [21, 22]. Text mining of written comments has been applied to performance prediction [13, 23], while [15, 20] apply classification and genetic algorithms to features describing student interaction and LMS usage. Multi-regression models have also been proposed [9] to analyze a student’s past performance and interaction with the LMS and to predict how well he/she will perform in course activities. Various approaches have been developed for modeling and predicting the success or failure of students in the context of intelligent tutoring systems; in that setting, the task is to predict the correctness of a student’s attempt to solve a single problem or a sequence of problems/tasks/exercises. These approaches include regression models [3, 5, 10], HMMs and bagged decision trees [18], collaborative filtering techniques and their combinations (k-NN, SVD, RBM) [30], matrix completion [11, 27, 28], and tensor factorization [29].

More recently, research efforts have aimed to predict the grade that a student will obtain in a future course, which is the problem addressed in this paper. Within the context of developing methods to predict next-term grades, most existing approaches [4, 7, 8, 19] rely on neighborhood-based collaborative filtering methods. For each student whose grade needs to be predicted, a set of similar students that have already taken that course is identified, and their grades are used to estimate the desired grade via some similarity-weighted aggregation function. Despite their relative simplicity, the estimates obtained by these methods are reasonably accurate, indicating that there is sufficient information in the historical student–course grade data to make the estimation problem feasible. Influenced by the area of recommender systems, the authors of [25, 26] treat grade prediction as rating prediction using a matrix completion approach. In [26], the features describe the student, course, and instructor, while in [25] (most relevant to our problem) the matrix contains the grades of past courses and is approximated by the product of low-rank matrices. The authors also point out that bias terms are important and quite informative for these models.

The models developed in this paper are based on linear regression and matrix factorization, but they utilize only a course-specific or student-course-specific subset of the data. According to our results, estimating a separate model per course improves prediction accuracy and enables more reliable model estimation, while also allowing the models to fit the data better.

3 Definitions and notations

Throughout the paper, bold lowercase letters will denote column vectors (e.g., \(\mathbf {y}\)) and bold uppercase letters will denote matrices (e.g., \(\mathbf {G}\)). Individual elements will be denoted using subscripts (e.g., \({y}_i\) for a vector and \(g_{s,c}\) for a matrix). A single subscript on a matrix will denote its corresponding row. Sets will be represented by calligraphic letters.

The historical student–course grade information will be represented by a sparse matrix \(\mathbf {G} \in \mathbb {R}^{n \times m}\), where n and m are the number of students and courses, respectively, and \(g_{i,j}\) is the grade in the range of [0,4] that student i achieved in course j. If a student has not taken a course, the corresponding entry will be missing. The course, semester, and student, whose grades need to be predicted, will be called target course, target semester, and target student, respectively.

4 Methods

In this section, we describe various classes of methods that we developed for predicting the grade that a student will obtain on a course that he/she has not yet taken.

4.1 Course-specific regression (CSR)

Undergraduate degree programs are structured in such a way that courses taken by students provide the necessary knowledge and skills for them to do well in future courses. As a result, the performance that a student achieved in a subset of the earlier courses can be used to predict how well he/she will perform in future courses. Motivated by this, we developed a grade prediction method, called course-specific regression (CSR), that predicts the grade that a student will achieve in a specific course as a sparse linear combination of the grades that the student obtained in past courses.

In order to estimate the CSR model for course c, we extract from the overall student–course matrix \(\mathbf {G}\) the set of rows corresponding to the students that have taken c. For each of these students (rows), we keep only the grades that correspond to courses taken prior to course c. Let \(\mathbf{G}^c \in \mathbb {R}^{n_c \times m}\) be the matrix representing that extracted information, where \(n_c\) is the number of students that took course c. In addition, let \(\mathbf{{y}}^c \in \mathbb {R}^{n_c}\) be the grades that the students in \(\mathbf{G}^c\) obtained in course c (\({y^c_i}\) is the grade that was obtained by the student of the ith row of \(\mathbf{G}^c\)). Given this, the CSR model \(\mathbf{w}^c \in \mathbb {R}^{m}_{+}\) for c is estimated as:

$$\begin{aligned} \underset{\mathbf{w}^c \succeq 0}{\hbox {minimize}}~\left||\mathbf {y}^c-\mathbbm {1}w^c_0-\mathbf{G}^c\mathbf{w}^c\right||^2_2 + \lambda _{1}\left||\mathbf{w}^c\right||^2_2 + \lambda _{2}\left||\mathbf{w}^c\right||_1, \end{aligned}$$
(1)

where \(w^c_0\) is a bias term, \(\mathbbm {1} \in \mathbb {R}^{n_c}\) is a vector of ones, and \(\lambda _{1},\lambda _{2}\) are regularization parameters to control overfitting and promote sparsity. The model is nonnegative because we assume that prior courses can only provide knowledge to future courses. The individual weights of \(\mathbf{w}^c\) indicate how much each prior course contributes to the prediction and represent a measure of the importance of the prior course within the context of the estimated model. Using this model, the grade that a student will obtain in course c is given by:

$$\begin{aligned} {\hat{y}^c}= w^{c}_0+\mathbf{s}^T\mathbf{w}^c, \end{aligned}$$
(2)

where \(\mathbf{s} \in \mathbb {R}^m\) is the vector of the student’s grades in the courses he/she has taken so far.
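As an illustration, the optimization problem of Eq. 1 can be solved with a simple projected-gradient scheme. The sketch below is for exposition only, assuming NumPy; the function names, step-size rule, and default parameters are our own choices, not the implementation used in the experiments:

```python
import numpy as np

def fit_csr(G, y, lam1=0.01, lam2=0.01, iters=5000):
    """Estimate the CSR model of Eq. 1 by projected gradient descent:
    squared loss + L2 + L1 penalties, with the weights kept nonnegative."""
    n, m = G.shape
    w = np.zeros(m)
    # Safe step size derived from the Lipschitz constant of the smooth part.
    step = 1.0 / (2.0 * (np.linalg.norm(G, 2) ** 2 + lam1))
    for _ in range(iters):
        w0 = (y - G @ w).mean()                  # closed-form update of the bias term
        r = y - w0 - G @ w                       # residual
        grad = -2.0 * G.T @ r + 2.0 * lam1 * w + lam2  # +lam2: L1 subgradient on w >= 0
        w = np.maximum(0.0, w - step * grad)     # project onto the nonnegative orthant
    return w0, w

def predict_csr(w0, w, s):
    """Eq. 2: predicted grade from a student's prior-grade vector s."""
    return w0 + s @ w
```

Note that on the nonnegative orthant the L1 penalty reduces to a constant subgradient of \(\lambda_2\), which is why the sparsity term appears as a simple additive shift before projection.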

In this approach, prior to estimating the model using Eq. 1, we first subtract from each grade \(g^c_{i,j}\) the GPA of the ith student (the GPA is calculated based on the information in \(\mathbf{G}^c\)). This centers the data for each student and takes into consideration a notion of student bias, as it predicts the performance with respect to the current state of a student. Note that in the case of GPA-centered data, we remove the nonnegativity constraint on \(\mathbf{w}^c\). We found that centering each student’s grades around his/her GPA leads to more accurate predictions (see Sect. 6.1).
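The GPA-centering step can be sketched as follows, assuming (purely for illustration) that missing grades are stored as NaN in the per-course matrix:

```python
import numpy as np

def center_by_gpa(G):
    """Subtract each student's GPA (row mean over observed grades) from
    that student's grades; missing grades are represented as NaN."""
    gpa = np.nanmean(G, axis=1, keepdims=True)  # per-student GPA over observed entries
    return G - gpa, gpa
```

The returned GPA vector is kept so that predictions made on centered data can be shifted back to the original grade scale.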

4.2 Student-specific regression (SSR)

Depending on the major, the structure of different undergraduate degree programs can be different. Some degree programs have limited flexibility as to the set of courses that a student has to take and at which point in their studies they can take them (i.e., specific semester). Other degree programs are considerably more flexible and are structured around a fairly small number of core courses and a large number of elective courses.

For the latter type of degree programs, a drawback of the CSR method is that it requires the same linear regression model to be applied to all students. However, given that the sets of prior courses taken by students in such flexible degree programs can be quite different, there can be cases in which many of the most important courses identified by the CSR model were simply not taken by some students, even though these students have acquired the necessary knowledge and skills by taking a different set of courses. To address this limitation, we developed a different method, called student-specific regression (SSR), which estimates course-specific linear regression models that are also specific to each student.

The student-specific model is derived by creating a student-course-specific grade matrix \(\mathbf{G}^{s,c}\) for each target student s and each target course c from the \(\mathbf{G}^c\) matrix used in the CSR method. \(\mathbf{G}^{s,c}\) is created in two steps. First, we eliminate from \(\mathbf{G}^c\) any grades for courses that were not taken by the target student. Second, we eliminate from \(\mathbf{G}^c\) the rows that correspond to the students that have not taken a sufficient number of courses in common with the target student s. Specifically, if \(\mathcal {C}_s\) and \(\mathcal {C}_i\) are the sets of courses taken by students s and i, respectively, we compute the overlap ratio \((\text{ OR })={|\mathcal {C}_s \cap \mathcal {C}_i|}/{|\mathcal {C}_s|}\), and if OR\(<t\), then student i is not included in \(\mathbf{G}^{s,c}\). The value of t is a parameter of the SSR method; high values ensure that the set of students forming \(\mathbf{G}^{s,c}\) have taken many courses in common with s and have followed similar degree plans. Given \(\mathbf{G}^{s,c}\), the SSR method proceeds to estimate the model using Eq. 1 (with \(\mathbf{G}^{s,c}\) replacing \(\mathbf{G}^{c}\)) and uses Eq. 2 for prediction.
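The overlap-ratio filtering that selects the SSR cohort can be sketched as follows; the function names and the set-based course representation are illustrative choices, not part of the original method description:

```python
def overlap_ratio(courses_s, courses_i):
    """OR = |C_s intersect C_i| / |C_s|, as defined in Sect. 4.2."""
    return len(courses_s & courses_i) / len(courses_s)

def ssr_cohort(courses_target, courses_by_student, t=0.6):
    """Keep only the students whose overlap ratio with the target
    student's course set is at least the threshold t."""
    return [sid for sid, courses in courses_by_student.items()
            if overlap_ratio(courses_target, courses) >= t]
```

Rows of \(\mathbf{G}^c\) corresponding to students outside the returned cohort are dropped to form \(\mathbf{G}^{s,c}\).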

4.3 Methods based on matrix factorization

Low-rank matrix factorization (MF) approaches have been shown to be very effective for accurately estimating ratings in the context of recommender systems [12]. These approaches can be directly applied to the problem of predicting the grade that a student will achieve on a particular course by treating the student–course grade matrix \(\mathbf {G}\) as the user-item rating matrix.

The use of such MF-based approaches for grade prediction is postulated on the fact that there is a low-dimensional latent feature space that can jointly represent both students and courses. Given the nature of the domain, this latent space can correspond to the space of knowledge components. Each course vector represents the set of knowledge components associated with that course, and each student vector represents the student’s level of knowledge across these knowledge components.

By applying the common approaches of MF-based rating prediction to the problem of grade prediction, the grade that student i will obtain on course j is estimated as

$$\begin{aligned} {\hat{g}_{i,j}}= \mu + sb_i + cb_j + \mathbf {p}_i\mathbf {q}^{T}_{j}, \end{aligned}$$
(3)

where \(\mu \) is a global bias term, \(sb_i\) and \(cb_j\) are the student and course bias terms, respectively, and \(\mathbf{p}_i\) and \(\mathbf{q}_j\) are the latent representations of student i and course j, respectively. The parameters of the MF method (\(\mu , \mathbf{sb} \in \mathbb {R}^n, \mathbf{cb} \in \mathbb {R}^m, \mathbf{P} \in \mathbb {R}^{n \times l}\), and \(\mathbf{Q} \in \mathbb {R}^{m \times l}\)) are estimated following a matrix completion approach that considers only the observed entries in \(\mathbf {G}\) as

$$\begin{aligned} \begin{aligned} \underset{\mu , \mathbf{sb}, \mathbf{cb}, \mathbf{P}, \mathbf{Q}}{\text{ minimize }} ~ \sum _{g_{i,j} \in \mathbf{G}} {(g_{i,j}-\mu -sb_i-cb_j- \mathbf{p}_i\mathbf{q}^{T}_{j})}^2\\ + \lambda (\left||\mathbf{P}\right||^2_F + \left||\mathbf{Q}\right||^2_F + \left||\mathbf{sb}\right||^2_2 + \left||\mathbf{cb}\right||^2_2), \end{aligned} \end{aligned}$$
(4)

where \(\lambda \) is a regularization parameter and l is the dimensionality of the latent space, which is a parameter to this method.
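A common way to minimize the objective of Eq. 4 is stochastic gradient descent over the observed entries. The paper does not specify the optimizer, so the sketch below is one plausible choice, with hyperparameter defaults picked only for illustration:

```python
import numpy as np

def fit_mf(triples, n, m, l=2, lam=0.05, lr=0.02, epochs=500, seed=0):
    """SGD for the biased MF model of Eqs. 3-4 on observed (i, j, grade) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n, l))   # student latent factors
    Q = 0.1 * rng.standard_normal((m, l))   # course latent factors
    sb, cb = np.zeros(n), np.zeros(m)       # student and course biases
    mu = np.mean([g for _, _, g in triples])  # global bias
    for _ in range(epochs):
        for i, j, g in triples:
            e = g - (mu + sb[i] + cb[j] + P[i] @ Q[j])  # prediction error
            sb[i] += lr * (e - lam * sb[i])
            cb[j] += lr * (e - lam * cb[j])
            # Update both factor vectors using the pre-update values.
            P[i], Q[j] = (P[i] + lr * (e * Q[j] - lam * P[i]),
                          Q[j] + lr * (e * P[i] - lam * Q[j]))
    return mu, sb, cb, P, Q

def predict_mf(mu, sb, cb, P, Q, i, j):
    """Eq. 3: predicted grade for student i in course j."""
    return mu + sb[i] + cb[j] + P[i] @ Q[j]
```

In practice the global bias \(\mu\) can also be learned, but fixing it to the mean observed grade is a common simplification.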

The accurate recovery of the low-rank model (when such a model exists) from a set of partial observations depends on having a sufficient number of observed entries and on these entries being randomly sampled from the entries of the target matrix \(\mathbf {G}\) [6]. However, in the context of student grade data, the set of courses that students take is not a random subset of the courses being offered, as students need to satisfy their degree program requirements. As a result, such an MF approach may lead to suboptimal prediction performance.

In order to address this problem, we developed a course-specific matrix factorization (CSMF) approach that estimates an MF model for each course by utilizing a course-specific subset of the data that is denser (in terms of the number of observed entries relative to the dimensions of the matrix) and therefore contains a larger number of randomly sampled subsets of sufficient size. The denser course-specific matrix allows a more reliable estimation of the low-rank models. At the same time, the sub-matrix is more homogeneous, as the students it includes are likely to be more similar to each other (i.e., to share more prior courses) than the overall student population, which allows the model to fit the data better.

Given a course c and a set of students \(\mathcal {S}^c\) for which we need to estimate their grade for c (i.e., the students in \(\mathcal {S}^c\) have not yet taken this course), the data that CSMF utilizes are:

  (i) the students and grades of the \(\mathbf {G}^c\) matrix and the \(\mathbf{y}^c\) vector of the CSR method (Sect. 4.1), and

  (ii) the students in \(\mathcal {S}^c\) and their grades.

These data are used to form a matrix \(\mathbf{X}^c \in \mathbb {R}^{(n_c+n_t) \times (m_c+1)}\), where \(n_c\) is the number of students in \(\mathbf {G}^c\), \(n_t = |\mathcal {S}^c|\), and \(m_c\) is the number of distinct courses that have at least one grade in \(\mathbf {G}^c\) or \(\mathcal {S}^c\). The values stored in \(\mathbf{X}^c\) are the grades that exist in \(\mathbf {G}^c\) and \(\mathcal {S}^c\). The last column of \(\mathbf{X}^c\) stores the grades \(\mathbf{y}^c\) that the students in \(\mathbf {G}^c\) obtained in course c. Thus, \(\mathbf{X}^c\) contains all the prior grades associated with the students who have already taken course c and with the students whose grade on c needs to be predicted. Matrix \(\mathbf{X}^c\) is then used in place of matrix \(\mathbf G\) in Eq. 4 to estimate the parameters of the CSMF method, which are then used to predict the missing entries of the last column of \(\mathbf{X}^c\), i.e., the grades that need to be predicted.
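The assembly of \(\mathbf{X}^c\) can be sketched as follows, again representing missing grades as NaN purely for illustration:

```python
import numpy as np

def build_Xc(Gc, yc, S_target):
    """Assemble X^c: the prior grades of students who took course c (Gc),
    with their grade in c as the last column (yc), stacked on top of the
    prior grades of the target students (S_target), whose last-column
    entries remain missing (NaN) and are the values to be predicted."""
    n_c, m_c = Gc.shape
    n_t = S_target.shape[0]
    X = np.full((n_c + n_t, m_c + 1), np.nan)
    X[:n_c, :m_c] = Gc          # prior grades of students who took c
    X[:n_c, -1] = yc            # their observed grades in c
    X[n_c:, :m_c] = S_target    # prior grades of target students
    return X
```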

Table 1 Statistics for course-specific datasets

5 Experimental design

5.1 Dataset

The student–course grade dataset that we used in our experiments was obtained from the University of Minnesota, which has a very flexible degree program. It contains the students that have been part of the Computer Science and Engineering (CSE) and Electrical and Computer Engineering (ECE) programs from Fall 2002 to Spring 2014. Both of these degree programs are part of the College of Science and Engineering. Students have to take a common set of core science courses during the first 2–3 semesters, after which they can select courses from different levels and departments.

Because of the nature of these departments, the curriculum tends to be vertically aligned, i.e., what students learn in one lesson, course, or level is most likely going to be used in the next lesson, course, or level. Students select courses in order to acquire the knowledge and skills that progressively prepare them for more challenging, higher-level topics. However, we need to point out that this might not always be the case, as there are departments that are more horizontally aligned, where such strong dependencies across different courses and levels do not exist.

While preprocessing the dataset, we removed any courses that are not part of those offered by departments in the college, as these correspond to various liberal arts and physical education courses, which are taken by few students and in general do not count toward degree requirements. Furthermore, we eliminated any courses that were taken as pass/fail. The initial grades were in the A–F scale, which was converted to the 4–0 scale using the standard letter grade to GPA conversion. The resulting dataset consists of 2949 students, 2556 different courses, and 76,748 student–course grades.

We used this dataset to assess the performance of the different methods for the task of predicting the grades that the students will obtain in the last semester (i.e., the most recent semester for which we have data). For this reason, the dataset was further split into two parts, one containing the students that are still active, i.e., have taken courses in the last semester (\(D_{active}\)) and another that contains the remaining students (\(D_{inactive}\)). \(D_{active}\) contains 876 students, 19,089 grades, out of which 3427 grades are for the 475 distinct classes taken in the last semester. \(D_{inactive}\) contains 2073 students and 57,659 grades.

These datasets were used to derive various training and testing datasets for the different methods that we developed. Specifically, for the CSR method, we extracted the course-specific training and testing datasets as follows. For each course c that was offered in the last semester, we extracted course-specific training and testing sets (\(D^{c, \ge k}_{\text{ train }}\) and \(D^{c, \ge k}_{\text{ test }}\)) by selecting from \(D_{inactive}\) and \(D_{active}\), respectively, the students that have taken c and, prior to taking c, also took at least k other courses. These datasets were parameterized with respect to k because we wanted to assess how the methods perform when different amounts of historical student performance information are available. In our experiments, we used k in the set \(\{5, 7, 9\}\). This information forms the grade matrix \(\mathbf{G}^c\), where \(g^{c}_{i,j}\) is the grade of the ith student on the jth course from the training set \(D^{c, \ge k}_{\text{ train }}\). Table 1 shows various statistics about the course-specific datasets for different values of k.

For the CSMF method, the training dataset for course c was obtained by combining \(D^{c, \ge k}_{\text{ train }}\) and \(D^{c, \ge k}_{\text{ test }}\) into a single matrix after removing the grades that the target students achieved in course c.

For the MF method, the matrix \(\mathbf {G}\) is constructed using data from all the \(\mathbf{X}^c\) matrices. It corresponds to the union of the sets \(D^{c, \ge k}_{\text{ train }}\) and \(D^{c, \ge k}_{\text{ test }}\) over every course to be predicted, after removing the grades that the active students achieved in the courses we want to predict. We formulated the dataset in this way in order to provide the same information for training and testing to all our models. Moreover, since we predict the grades for a specific semester, matrix \(\mathbf {G}\) does not contain any grade information from subsequent semesters.

For SSR, the grade matrix \(\mathbf {G}^{s,c}\) is created by selecting from \(D^{c, \ge k}_{\text{ train }}\) the set of courses that were also taken by student s and the set of students whose OR with s is at least t. Figure 1 shows some statistics about these datasets as a function of t, and Fig. 2 shows statistics for only the common subsets that can be predicted by both the course-specific and SSR datasets. When the OR is greater than 0.8, few grades can be predicted, because not enough students have followed the same degree plan as the target student.

Fig. 1
figure 1

Statistics of the datasets used in SSR w.r.t. overlap ratio

Fig. 2
figure 2

Statistics of the common subset of datasets used in SSR and in course-specific approaches w.r.t. overlap ratio

Table 2 Performance achieved by linear course-specific regression per department

Finally, we did not consider courses that have fewer than 20 students in their corresponding dataset, as we consider them to have too few training instances for reliable estimation, or fewer than 4 test students, as the test results might not be meaningful.

5.2 Competing methods

In our experiments, we compared our methods with the following competing approaches.

  1. BiasOnly. We only take into consideration local and global biases to predict the students’ grades. These biases were estimated using Eq. 4 by setting \(l=0\).

  2. Student-Based Collaborative Filtering (SBCF). This method implements the approach described in [4]. For a target course c, every student i is represented by a vector whose nonzero entries are the grades that the student obtained in the courses taken prior to c. We compare the vector of a target student s against the vectors of the other students that have taken course c using Pearson’s correlation coefficient. We then predict the grade while taking into consideration the positively similar students to s according to

    $$\begin{aligned} \hat{g}_{s,c} = \bar{g_s} + \frac{\min (r,nbr)}{r}\frac{\sum _{i=1}^{nbr}(g_{i,c}-\bar{g_i}) \text{ sim }_{s,i}}{\sum _{i=1}^{nbr} \text{ sim }_{s,i}}, \end{aligned}$$
    (5)

    where nbr is the number of students selected, r is a confidence lower limit for significance weighting, \(\bar{g_i}\) is the average grade of student i prior to taking c, and sim\(_{s,i}\) represents the similarity of target student s with student i.
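The prediction rule of Eq. 5 can be sketched as follows, assuming (for illustration) that each selected neighbor is given as a (grade in c, mean prior grade, similarity) triple with positive similarity:

```python
def sbcf_predict(mean_grade_target, neighbors, r=30):
    """Eq. 5: similarity-weighted average of the neighbors' deviations from
    their own mean grades, shrunk by the min(r, nbr)/r significance weight.
    neighbors: list of (grade_in_c, mean_grade, similarity) with sim > 0."""
    nbr = len(neighbors)
    num = sum((g - gbar) * sim for g, gbar, sim in neighbors)
    den = sum(sim for _, _, sim in neighbors)
    return mean_grade_target + (min(r, nbr) / r) * num / den
```

When fewer than r neighbors are available, the significance weight shrinks the correction toward the target student's own mean grade.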

Fig. 3
figure 3

Performance achieved by the SSR model w.r.t. overlap ratio

Fig. 4
figure 4

Comparison of SSR model and CSR-RC with 9 prior courses w.r.t. overlap ratio. The other options for number of prior courses have similar behavior

5.3 Parameters and model selection

For CSR, we let \(\lambda _1\) take values from 0 to 40 in increments of 1 and \(\lambda _2\) from 0 to 50 in increments of 1. For SSR, we let \(\lambda _1\) take values from 0 to 10 in increments of 1 and \(\lambda _2\) from 0 to 14 in increments of 2, while the tested values for the overlap ratio range from 0.3 to 1 in increments of 0.04. For BiasOnly, MF, and CSMF, we let \(\lambda \) take values from 0 to 16 in increments of 0.05. For SBCF, the confidence lower limit ranges from 10 to 100 in increments of 10, and we tested the number of neighbors from 10 to 100 in increments of 10. For the MF and CSMF methods, we tested 2, 5, and 8 latent dimensions.

For SBCF, CSR, and SSR, we used the semester before the target semester to estimate and select the best parameters. For BiasOnly, MF, and CSMF, model selection was based on the performance on a validation set, which was a randomly selected 10 % subset of the training data. For the CSMF model, the best-performing parameters were selected for each course.

5.4 Evaluation methodology and performance metrics

We evaluated the performance of the different approaches by using them to predict the grades for the last semester in our dataset using the data from the previous semesters for training. We report the results for the courses belonging to CSE and ECE departments.

We assessed the performance using the root-mean-square error (RMSE) between the actual grades and the predicted ones. Since the courses whose grades are predicted have different numbers of students, we computed two RMSE-based metrics. The first is the overall RMSE, in which all the grades across the different courses are pooled together, and the second is the average RMSE obtained by averaging the per-course RMSE values. We will denote the first by RMSE and the second by AvgRMSE.
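The two metrics can be sketched as follows (the per-course grouping of (actual, predicted) pairs is an illustrative representation):

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def overall_and_avg_rmse(per_course):
    """per_course: dict mapping a course to its list of (actual, predicted)
    pairs. RMSE pools all grades; AvgRMSE averages the per-course RMSEs."""
    pooled = [pair for pairs in per_course.values() for pair in pairs]
    avg = sum(rmse(pairs) for pairs in per_course.values()) / len(per_course)
    return rmse(pooled), avg
```

The two metrics differ when course sizes are unbalanced: RMSE weights large courses more heavily, while AvgRMSE weights every course equally.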

In order to get a better understanding of the quality of the predictions, we also report the distribution of the actual vs predicted letter grades. The grading system used by the University of Minnesota has 11 letter grades (A, A\({-}\), B+, B, B\({-}\), C+, C, C\({-}\), D+, D, F) that correspond to grades from 4 to 0 (4, 3.667, 3.333, 3, 2.667, 2.333, 2, 1.667, 1.333, 1, 0). After converting the predicted grades to their closest letter grade, we compute the percentage of predictions that are within or more than x ticks away from the actual grades. A tick is defined as the difference between two successive letter grades (e.g., B vs B+ is one tick, A vs B is three ticks).
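The tick computation can be sketched as:

```python
# The 11-point letter-grade scale used by the University of Minnesota.
LETTER_POINTS = [4.0, 3.667, 3.333, 3.0, 2.667, 2.333, 2.0, 1.667, 1.333, 1.0, 0.0]

def nearest_letter_idx(grade):
    """Index of the closest letter grade on the 11-point scale."""
    return min(range(len(LETTER_POINTS)), key=lambda k: abs(LETTER_POINTS[k] - grade))

def tick_error(predicted, actual):
    """Number of ticks between the rounded predicted and actual letter grades."""
    return abs(nearest_letter_idx(predicted) - nearest_letter_idx(actual))
```

For example, a predicted 4.0 (A) against an actual 3.0 (B) is a three-tick error, matching the convention above.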

Table 3 Errors per department for matrix factorization methods

6 Experimental results

6.1 Course-specific regression

Table 2 shows the performance achieved by the CSR and CSR-RC models when trained using the three different datasets discussed in Sect. 5.1. These results show that between the two models, CSR-RC, which operates on the GPA-centered grades, leads to considerably lower errors both in terms of RMSE and AvgRMSE, especially for the CSE courses.

In terms of the sensitivity of their performance to the amount of historical information available when estimating these models (i.e., the minimum number of prior courses), we can see that the performance of the CSR-RC method does not change significantly. CSR predicts the CSE courses better when using 5 prior courses, while it predicts the ECE courses better with 9 prior courses. This indicates that the model benefits more from an increased number of students than from an increased number of prior courses: for CSE, the students with 9 prior courses are only 67 % of the students with 5 prior courses. The ECE department does not suffer from such a low number of remaining students with 9 prior courses, as the corresponding percentage is 80 % (statistics according to Table 1).

6.2 Student-specific regression

As one of the parameters of this problem is the overlap ratio between the courses of the target student and the other students, Fig. 3 presents the behavior of the model’s RMSE (left) and AvgRMSE (right) as we vary the OR for \(D^{c, \ge 9}_{\text{ test }}(k=9)\). When the OR is increased, the selected students have more courses in common with the target student, which leads to better performance.

In order to compare the performance of SSR against CSR-RC, Fig. 4 shows the RMSE of the best CSR-RC and SSR models. The RMSE values were computed on the subset of the test set that was predicted by both models for \(D^{c, \ge 9}_{\text{ test }}(k=9)\). These results show that SSR leads to consistently worse predictions for the CSE courses than the CSR-RC model. However, in the case of the ECE courses, SSR does better than CSR-RC when the OR is greater than 0.8. This might be related to the fact that the ECE degree program is more structured than the CSE degree program, giving some advantage to the SSR method. As shown in Fig. 1, at such high OR values the number of grades that can be predicted by SSR is small. For example, when the OR is 0.8, the SSR model can predict fewer than 10 % of the grades in the target semester.

6.3 Methods based on matrix factorization

The performance of the methods based on matrix factorization (Sect. 4.3) is shown in Table 3.

These results show that for the CSE courses, CSMF performs best in terms of both RMSE and AvgRMSE, for any number of prior courses. This confirms that by building matrix factorization models on smaller but denser course-specific sub-matrices, we can derive low-rank models that lead to more accurate matrix completion. On the other hand, the performance on the ECE courses does not vary much. For that department, the best predictions are obtained by MF, followed by CSMF with an RMSE difference of 0.002. A potential explanation for these results is that the ECE courses are part of a stricter degree program, whose structure is present even in the more general setting of MF. As a result, selecting the course-specific sub-matrices does not provide any further insight into the data, as it does for the CSE courses.

In order to see how the size of the training set associated with the different courses impacts the performance of the MF and CSMF methods, Fig. 5 shows the cumulative AvgRMSE over the courses with increasing training size, as well as the RMSE achieved per course by each method. The cumulative AvgRMSE provides some insight into the impact that the training size has on the performance of our models. We can see that for the ECE courses, the MF model has an advantage over CSMF for relatively smaller courses. MF performs better for eight out of the ten smallest courses, indicating that it gains its accuracy by utilizing data not included in the course-specific datasets in order to compute better biases. Moreover, the bottom part of the figure confirms that the performance of MF and CSMF is more similar for the ECE courses than for the CSE courses.

In terms of the number of latent factors, we see that when using the smallest dataset for training (the one with 9 prior courses), the best performance is achieved with a smaller number of latent factors than for the datasets with 5 or 7 prior courses. In that case, the average number of grades per course is lower, which might not support a large number of latent factors.

Fig. 5 Cumulative AvgRMSE w.r.t. increasing training size (top) and RMSE achieved per course (bottom) of CSMF and MF models for \(D^{c, \ge 9}_{\text{ test }}(k=9)\)

Table 4 Errors per department
Table 5 Wins/ties/losses for every pair of methods tested
Table 6 Analysis of the accuracy of the predictions in terms of letter grades
Table 7 Analysis of the error severity of the predictions in terms of letter grades
Table 8 Errors per course for all methods for the case of 9 prior courses (\(D^{c, \ge 9}_{\text{ test }}(k=9)\))

6.4 Comparison with other methods

Table 4 compares the performance of the baseline approaches described in Sect. 5.2 (BiasOnly and SBCF) with the best-performing course-specific regression method (CSR-RC) and the MF and CSMF methods. From these results, we can see that CSR-RC leads to the best RMSE for the CSE courses, while MF leads to the best RMSE for the ECE courses, closely followed by CSMF (a difference of 0.002).

A summary of the comparison between every pair of methods tested is shown in Table 5. For each pair, we count the courses for which one method wins, ties, or loses in terms of RMSE against the other. This analysis shows that for the CSE courses, CSR-RC outperforms the other methods in the majority of the courses, except for SBCF, which is very close, whereas for the ECE courses, CSMF outperforms every other method in the majority of the courses (even the MF method, which has a slightly better overall RMSE).
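A per-course win/tie/loss tally of this kind could be computed as follows (the per-course RMSE values are hypothetical, and the tie tolerance is an assumption):

```python
from itertools import combinations

# Hypothetical per-course RMSE values for three of the methods.
rmse_per_course = {
    "CSR-RC": {"c1": 0.60, "c2": 0.70, "c3": 0.65},
    "CSMF":   {"c1": 0.62, "c2": 0.70, "c3": 0.60},
    "MF":     {"c1": 0.61, "c2": 0.72, "c3": 0.66},
}

def wins_ties_losses(a, b, eps=1e-6):
    """Count the courses where method a beats, ties, or loses to method b
    (lower RMSE wins)."""
    w = t = l = 0
    for course in rmse_per_course[a]:
        diff = rmse_per_course[a][course] - rmse_per_course[b][course]
        if abs(diff) < eps:
            t += 1
        elif diff < 0:
            w += 1
        else:
            l += 1
    return w, t, l

results = {(a, b): wins_ties_losses(a, b)
           for a, b in combinations(rmse_per_course, 2)}
```

Counting wins per course, rather than comparing a single aggregate RMSE, is what allows a method with a slightly worse overall RMSE (such as CSMF for ECE) to still win the majority of the individual courses.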

6.5 Fine-grained analysis of the predictions

In order to gain a better understanding of the types of errors generated by the different methods and the real-world implications of the predictions, Tables 6 and 7 analyze the performance achieved by the different methods by focusing on grade ticks as opposed to RMSE values.

Table 6 shows the percentage of predicted grades that were close to the true grades, over all the instances predicted by a model. For the CSE department, CSMF is the model with the most grades predicted within two ticks of their true values, while CSR-RC is the best model when focusing on exact predictions. For the ECE department, MF has the highest percentages; CSMF is better only within two letter-grade ticks, and only in the case of 9 prior courses.
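A within-n-ticks accuracy of this kind could be computed as sketched below, assuming one tick is the step between adjacent letter grades on a 4-point scale (e.g. B+ to A−, i.e. 1/3 of a grade point; the tick size and all grade values are assumptions for illustration):

```python
TICK = 1.0 / 3.0  # assumed tick size: one letter-grade step on a 4-point scale

def pct_within(preds, actuals, n_ticks):
    """Percentage of predictions within n_ticks of the true grade."""
    hits = sum(1 for p, a in zip(preds, actuals)
               if abs(p - a) <= n_ticks * TICK + 1e-9)
    return 100.0 * hits / len(preds)

# Hypothetical predicted vs. actual grades.
preds = [3.7, 2.3, 3.0, 4.0]
actuals = [4.0, 3.0, 3.0, 3.3]

exact = pct_within(preds, actuals, 0)   # exact predictions
close2 = pct_within(preds, actuals, 2)  # within two ticks
```

The small tolerance guards against floating-point noise when a prediction lands exactly on a tick boundary.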

Table 7 analyzes the performance of the models on the instances that they fail to predict accurately. We examine the difference between the true grades and those that were over- or under-predicted, i.e., predicted to be higher or lower than their real values, respectively. In this case, the lower the percentage the better the model, as there are fewer inaccurate predictions. These results show that, compared to CSE, ECE has fewer under-predictions but a higher number of over-predictions of more than one tick. Moreover, the best methods for the CSE courses are CSR-RC and CSMF, and for the ECE courses, MF and CSMF. Another finding is that CSR-RC has the highest percentages of under-prediction errors for the ECE department. This happens because a student might not have taken an important course, in which case the corresponding regressor will be missing when estimating their grade. As a result, for this department, which has a stricter degree program, CSR-RC (being a linear model) cannot handle the absence of an important prior course. However, CSR-RC is the only model that manages to lower the over-prediction error when using denser data (the case of 9 prior courses).
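A sketch of this over-/under-prediction breakdown (the sign convention, one-tick threshold, and sample grades are assumptions matching the description above):

```python
TICK = 1.0 / 3.0  # assumed tick size (one letter-grade step)

def error_breakdown(preds, actuals):
    """Split the inaccurate predictions into over- and under-predictions,
    and count those that are off by more than one tick."""
    over = under = over_big = under_big = 0
    for p, a in zip(preds, actuals):
        diff = p - a
        if diff > 1e-9:          # predicted higher than the real grade
            over += 1
            over_big += diff > TICK + 1e-9
        elif diff < -1e-9:       # predicted lower than the real grade
            under += 1
            under_big += -diff > TICK + 1e-9
    return over, under, over_big, under_big

# Hypothetical predicted vs. actual grades.
counts = error_breakdown([3.7, 2.0, 3.0, 3.3], [3.0, 2.3, 3.0, 4.0])
```

Separating the two error directions matters in this setting: an over-predicted grade can lead a student into a course they are not prepared for, while an under-prediction may only discourage them from taking it.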

Table 8 compares the per-course RMSE of BiasOnly, SBCF, CSR-RC, CSMF, and MF for both the CSE and ECE departments. Some statistical information per course is also included. This information suggests that if a course has a poor RMSE, then it is very likely that the standard deviation of the grades in the test set is quite high, or higher than the standard deviation of the grades in the training set.

7 Conclusions

In this paper, we presented two course-specific approaches, based on linear regression and matrix factorization, that perform better than existing approaches based on traditional methods, assuming that the degree programs involved have a vertical structure. In that case, focusing on a course-specific subset of the data can result in more accurate predictions. Moreover, the performance for different departments can vary significantly, as they may have different characteristics and structures. A student–course-specific approach was also developed, but its accuracy in grade prediction is limited by the diverse nature of degree plans. Overall, the course-specific methods improve the performance of grade prediction over the other methods tested on our dataset, while the degree of improvement depends on the department.