1 Introduction

Online education systems are becoming increasingly popular due to their flexibility and ability to reach a larger audience. For students, the main advantages of this form of learning are lower cost, flexible schedules, more in-depth information with directed tasks, and adaptation of the learning space [1, 2]. Despite these benefits, teachers find it very difficult to monitor their students. In contrast to traditional educational systems, it is hard to keep track of each student, mainly because of the large number of students per course and the lack of face-to-face interaction. Students must therefore be disciplined and able to organize their time in order to pass the subject. For this reason, completion rates in online learning are notoriously lower than in face-to-face learning [3].

In addition, online systems make it possible to obtain additional data on students that can serve as an indication of their progress [4, 5]. However, the large volume of information stored on each student makes individual monitoring unfeasible. It is therefore necessary to use data mining techniques to establish rules and predictions about student performance. These techniques, together with predictive models, make it possible to understand the critical factors that contribute to student success [6], identify students in need of support in time [7], or store opinion data [8]. Moreover, the initial learning period of a new course is crucial for students [9]. During this period, students experience the novelty of the course, resolve doubts, and establish the foundation for the later learning stages. In this scenario, a system capable of predicting student performance is crucial in online learning, where the number of students who fail or drop out of a course is very high. Such systems could help to quickly detect students with learning difficulties and students who can be given more complex material, and to obtain valuable information about which activities have allowed students to pass the course. Therefore, there is a need for an effective framework to assess student performance in online education systems that predicts expected outcomes and early failures.

Although there is significant research addressing the problem of predicting the critical factors for student tracking [10], most works simplify the problem to “student will or will not pass a course” or “student will or will not drop out of a course”. Nevertheless, this information alone is not always enough for early and effective action. Instead, it is necessary to know which characteristics allow us to predict the different students’ grades or marks, from dropout to passing with excellence. In this way, it is useful to model students according to their potential performance in order to increase success rates and manage resources well.

In this context, the main contributions can be summarized as:

  1. Ordinal classification is a novelty in the problem of student performance prediction. This paper proposes an algorithm to predict student performance considering four ranked categories: Withdrawn < Fail < Pass < Distinction. The model penalizes mistakes according to this ranking. Thus, predicting withdrawal for a student who passed with distinction is penalized more than predicting withdrawal for a student who failed. As a result, the final classification is more reliable.

  2. Understandable results that can be used easily by teachers. This paper proposes an algorithm that also provides interpretability: it uses fuzzy logic and a rule-based system to display comprehensible information. This information identifies the resources, activities, and materials available in a course that can affect student performance, covering both the behaviors that help students succeed in a subject and those that can harm them. In this way, the teacher will be able to redirect students who have problems following the course and further encourage those who are doing excellent work.

  3. An exhaustive experimental study, with statistical analysis, considering 10 state-of-the-art methods over 7 different courses is carried out. Numerous studies in this area use datasets that are not publicly available, which makes it very difficult to analyze the improvements of new proposals. We use the Open University Learning Analytics Dataset (OULAD) [11]. This dataset comprises a large sample of students and courses, and it is widely used as a benchmark, which allows meaningful conclusions to be drawn.

The rest of this paper is organized as follows. Section 2 reviews the previous studies to predict the student’s performance. Section 3 details preliminary concepts of fuzzy logic, machine learning performance metrics and the dataset used in this work. Section 4 describes the proposed framework for student’s performance prediction. Experimentation results and analysis are reported in Section 5, followed by the conclusions and future works in Section 6.

2 Related work

Predicting academic performance is one of the most studied tasks in Educational Data Mining (EDM) [10]. This section presents the latest advancements in this field with a special focus on the approaches using different classification categories. It should be taken into account that open-access datasets in the field of EDM are difficult to find [12], making comparisons of results difficult. For this reason, the OULA [11] dataset has become a benchmark in the field. Therefore, this section also reviews the different contributions that have used this dataset.

Table 1 Predicting students’ performance in a general context

Fig. 1 Classes to predict vs ML model used

2.1 Student performance prediction

The prediction of academic success encompasses many attributes of students’ experience. Most studies focus on determining the student’s final grade in a course according to the grading system [12]. Specifically, most studies focus on predicting whether or not a student passes the course [13]. This section reviews the most relevant and recent works on predicting student performance.

Recent surveys, such as that of Namoun and Alshanqiti [10], show that the prediction of academic performance is generally addressed as a classification problem (47% of the works) rather than as a regression problem (28% of the works) [14]. Moreover, works consider different academic aspects such as perceived competence, educational self-reports, or attendance. Among classification works, 56% consider only two classes, while the rest add different grades: 15% consider three classes, 9% consider four classes, and 20% consider more than four. In most studies, the number of classes to predict is given by the final grades [10]; however, other measures can be success at the program level or student satisfaction at the personal level. Binary classification commonly uses the two classes ’pass’ and ’fail’ [15,16,17], while multi-class classification usually uses the values ’dropout’, ’fail’, ’satisfactory’, ’good’, or ’excellent’ [18].

Besides the type of problem (classification or regression) and the number of classes, interpretability is also vital in this problem. Table 1 summarizes the most relevant studies analyzed according to the aim of the prediction, the factors used for that prediction, the machine learning model implemented, and the number of classes predicted. The last column gives an idea of how feasible it is to deploy the prediction system in a real scenario based on the interpretability of its output, i.e., whether the solutions are meaningful and give instructors understandable explanations for assigning students to a specific category.

The analysis of the works detailed in Table 1 leads to Fig. 1, which summarizes the information most relevant to the work proposed in this paper. This figure shows the number of classes and the type of machine learning algorithm applied. A large share of the works are based on binary classification (54%) and, in general, the most popular machine learning techniques are those based on decision trees and recurrent neural networks. It is also interesting to note that the analyzed works that predict more than two classes do not use ordinal classification to fix an order relation between classes. Indeed, to the best of the authors’ knowledge, only one previous work [31] makes a first approach to ordinal classification in the EDM domain. However, that work does not focus on predicting academic performance; it explores the feasibility of applying ordinal classification for data labeling in semi-supervised learning environments on three public EDM datasets.

2.2 Student performance prediction in OULAD

This subsection covers papers that use OULAD [11]. This dataset is one of the few open datasets available for learning analytics and educational data mining. It was collected from a real case study at the Open University, the largest distance education institution in the United Kingdom. Section 3.1 describes the dataset in detail.

As OULAD is one of the few existing open EDM datasets, it is also one of the most used by authors to validate their proposals. Table 2 summarizes the most relevant previous works that use OULAD to predict academic performance. For each one, we analyze the aim of the prediction, the factors used in the context of OULAD, the machine learning model, and the number of classes among those contemplated in OULAD. The last column indicates whether or not the model obtained by the proposal is comprehensible. As in the previous analysis, the works have also been counted by classification technique, distinguishing the number of classes to be predicted. Figure 2 shows this analysis. In this case, most of the works focus on binary classification (86%), grouping similar classes such as Pass and Distinction or ignoring the minority ones. Regarding the machine learning technique used, the trends are similar to those of the general analysis in the previous section: decision tree models are widely used, probably due to their combination of accuracy and interpretability. On the other hand, in contrast to the previous analysis, there is an increase in the use of deep learning methods, specifically recurrent networks, because they are particularly good at capturing the temporal information available in OULAD.

Table 2 Predicting students’ performance in the OULAD dataset

Fig. 2 Classes to predict vs ML model used in OULAD

Once the latest contributions in EDM have been reviewed, we propose the implementation of a fuzzy ordinal classification algorithm to model the order relationship existing between the groups in which a student can be classified. Thus, a student who drops out should be closer to failing than to being excellent. In our proposal, we seek to minimize the error of the prediction of the classes. A mistake with more than one order of difference is penalized more. In this way, the system will be more reliable, especially for minority classes. In addition, the system seeks to be as explainable as possible. Thus, the model provides a set of rules that give information on the most relevant factors associated with each category considered in the classification. In this line, an understandable algorithm based on fuzzy rules is proposed.

3 Preliminary

In this section, we present the preliminary background needed to understand the contribution of this work. First, we explain in detail the OULA dataset, which is used to validate and compare our proposal. Second, we describe the problem and the main concepts of fuzzy systems and ordinal classification. Finally, we describe the main metrics used in Section 5.

3.1 OULA dataset

The OULA dataset [11] is one of the few open-access EDM datasets (available for download). Its characteristics make it usable in different EDM problems, such as predicting students’ grades [44], measuring the engagement factors of the courses [34], or classifying the final outcome (see Table 2). This section explains what information about students and their academic activity the data source contains, and the following subsection details the preprocessing operations for adapting the data to a pattern mining scenario.

OULAD focuses on distance learning in higher education, i.e. fully online interaction through VLE systems. Thus, it contains anonymous information about seven independent courses presented at the Open University (The United Kingdom), with 10,655,280 entries of 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries during 2013 and 2014.

Although the courses in OULAD differ in domain and difficulty level, they share an equivalent structure that allows EDM proposals to be tested in different scenarios. Each course has several resources on the VLE used to present its contents, one or more assignments that mark the milestones of the course, and a final exam. An overview of the course structure can be seen in Fig. 3. The curriculum and contents of a course are usually available in the VLE a couple of weeks before the official course start, so that enrolled students can access them early. Course content is classified into 20 types according to its nature (homepage, forum, glossary...), and OULAD tracks the number of daily clicks that a student made on each type of resource. During the course presentation, students’ knowledge is evaluated through assignments that define milestones. Two types of assignments are considered: Tutor Marked Assessment (TMA) and Computer Marked Assessment (CMA). If a student decides to submit an assignment, the VLE records the date of submission and the mark obtained. At the end of a course edition, enrolled students can take a final exam that gives them a final grade. Based on this grade, each student receives a final outcome that can take three different values: pass, distinction, or fail. Additionally, if a student does not take the exam, the student does not finish the course and the final grade is set to withdrawn. In addition to the course activity, OULAD tracks demographic information about the students, such as their gender, region, or age band, and extra academic information, such as the number of previous enrollments in the course, if any, or the total credits currently enrolled.

Fig. 3 Typical course structure

Having commented on the similarities, we now analyze the differences between OULAD courses. Each of the seven courses belongs to a different domain and has its own difficulty level, i.e., they are independent of each other and have different patterns and enrolled students. The aaa course is a level-3 course [11] belonging to the field of Social Sciences. As a specialized course, it usually has few students (around 374 per edition), which makes it possible to have only TMA assignments. It has high rates of VLE interaction and a high success rate: around 70% of students pass the course or obtain a distinction. Courses bbb, ccc, ddd, eee, and fff are level-1 courses, mostly belonging to the STEM field (except for bbb, which is a Social Sciences course) [11]. All of them have more than 1000 students per edition, and around 55% of these students do not pass the course, most of them due to withdrawal. These courses show a decrease in VLE activity as the course progresses. This lack of follow-up is also reflected in the assignments submitted: on average, these courses have six assignments of each type (TMA and CMA), but students submit around 3-4 of each type. The eee course deserves a special mention, with a failure rate of 44% and only TMA assignments. Finally, ggg is a Social Sciences preparatory course [11] with an average of 840 students. Its moderate follow-up in the VLE, with most assignments being of TMA type, contrasts with its relatively high success rate of 60%. This suggests that it is an easy course, consistent with its introductory level.

3.1.1 OULA dataset preprocessing

The original OULAD format is composed of several CSV files which contain tables related to the different components: course, student, VLE, and assessments, as well as the interactions among them. These tables have to be processed in order to properly join all the factors considered in this study and match them with the ordered label to predict. Thus, these files have been loaded into a MySQL database and have been slightly restructured to ensure that it is maintained in Codd’s normal form [45] and avoid data duplication. Finally, the appropriate queries have been carried out with the aim of joining all the information available for each student in each course in a single pattern that can be used to extract the fuzzy rules in the proposed method. The attributes considered are the following:

  • Pattern identification: composed of the students’ identifications and the course edition.

  • Student demographics: the student’s gender and age band, the highest level of studies reached, region, index of multiple deprivation, and whether the student has a disability.

  • Enrollment information: date of registration in the course, number of previous attempts to pass the course and the total credits enrolled by the student currently.

  • Assignments information: for each TMA/CMA assignment in the course, an attribute is created to keep the score obtained by the student. An empty value is used if the student does not submit the assignment.

    • Note that the number and identification of assignments depend on each course or module, so these attributes differ across the datasets created.

  • VLE logs information: the total number of clicks made by each student on VLE resources during the whole course. As mentioned above, there are twenty different types of resources, so twenty attributes are created under this category.

  • Final grade: the student’s final grade achieved in each enrolled module. This is a categorical value with an implicit order so that the lowest category would be withdrawn, then fail, then pass, and finally distinction.
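The attribute list above can be illustrated with a minimal Python sketch that joins toy records into one flat pattern per student and course edition and encodes the final grade by its rank. This is only an illustration under stated assumptions: the function `build_patterns`, the field names (`id_student`, `edition`, `resource_type`, `n_clicks`, `assignment`, `score`), and the sample values are hypothetical and do not reproduce the MySQL queries used in this work.

```python
from collections import defaultdict

# Ordered final-grade labels, lowest to highest, as described in the text.
GRADE_ORDER = ["Withdrawn", "Fail", "Pass", "Distinction"]
GRADE_RANK = {label: rank for rank, label in enumerate(GRADE_ORDER)}


def build_patterns(students, clicks, scores, resource_types, assignment_ids):
    """Join toy records into one flat pattern per (student, edition).

    All field names are hypothetical placeholders, not the OULAD schema.
    """
    patterns = {}
    for s in students:
        pattern = {
            "gender": s["gender"],
            # Ordinal encoding of the label: 0 = Withdrawn ... 3 = Distinction.
            "final_grade": GRADE_RANK[s["final_grade"]],
        }
        # One click counter per resource type, starting at zero.
        for rtype in resource_types:
            pattern["clicks_" + rtype] = 0
        # One score attribute per assignment; None marks "not submitted".
        for aid in assignment_ids:
            pattern["score_" + aid] = None
        patterns[(s["id_student"], s["edition"])] = pattern

    # Total clicks per resource type over the whole course.
    totals = defaultdict(int)
    for c in clicks:
        totals[(c["id_student"], c["edition"], c["resource_type"])] += c["n_clicks"]
    for (sid, edition, rtype), total in totals.items():
        patterns[(sid, edition)]["clicks_" + rtype] = total

    # Scores of the assignments that were actually submitted.
    for a in scores:
        patterns[(a["id_student"], a["edition"])]["score_" + a["assignment"]] = a["score"]
    return patterns
```

Keeping the label as an integer rank preserves the order Withdrawn < Fail < Pass < Distinction, which a plain one-hot or nominal encoding would lose.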

3.2 Fuzzy rule-based system

Two types of models are distinguished for solving classification problems. On the one hand, there are models which are called black-box models. Some examples of these models are neural networks, ensembles, and deep learning [46], among others. On the other hand, there are others called white-box models with decision trees [47] or rule-based systems [48] as examples. Both models solve the problem by giving an output from some inputs. The difference is that in black-box models, it is unknown how the input parameters are related to obtaining the outputs; while in white-box models, the steps and relationships between the input variables to provide the output are known [49].

Focusing on white-box models, in this work the model knowledge is represented as a set of IF-THEN rules in which, if the antecedent is true, the consequent gives the output. To make the rules more flexible, fuzzy logic concepts [50] are used. Fuzzy logic is an extension of Boolean logic that uses a notion of set membership closer to human reasoning. In classical sets (called crisp sets), membership takes only two values: one when an element belongs to the set, and zero when it does not. Fuzzy set theory is not limited to these two extreme values: an element can belong to a fuzzy set with a membership degree ranging from zero to one. The variables in fuzzy logic are called “linguistic variables” and are defined by “linguistic terms”; each linguistic term is associated with a membership function that indicates the degree to which an object belongs to that label. Figure 4 shows the age variable with three linguistic terms (child, young, and old) and the membership function that defines each of them.

Fig. 4 Example of age variable in fuzzy logic
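As a rough illustration of how membership degrees work, the following Python sketch defines three linguistic terms for an age variable. The breakpoints (15, 30, and 60 years) and the function names are assumptions chosen for this example; they are not read from Fig. 4.

```python
def left_shoulder(x, b, c):
    """Membership 1 up to b, decreasing linearly to 0 at c."""
    if x <= b:
        return 1.0
    if x >= c:
        return 0.0
    return (c - x) / (c - b)


def triangular(x, a, b, c):
    """Membership 0 at a and c, rising linearly to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def right_shoulder(x, a, b):
    """Membership 0 up to a, increasing linearly to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


# Hypothetical breakpoints for the three linguistic terms of "age".
AGE_TERMS = {
    "child": lambda x: left_shoulder(x, 15, 30),
    "young": lambda x: triangular(x, 15, 30, 60),
    "old": lambda x: right_shoulder(x, 30, 60),
}


def fuzzify_age(x):
    """Return the membership degree of age x in every linguistic term."""
    return {term: mu(x) for term, mu in AGE_TERMS.items()}
```

Under these assumed breakpoints, an age of 45 belongs to young and to old with degree 0.5 each, which is the kind of gradual transition that a crisp set cannot express.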

Therefore, when fuzzy logic is used, the rules become more flexible rules, called fuzzy rules. Models that use a set of fuzzy rules are called fuzzy rule-based models, and the corresponding systems are called Fuzzy Rule-Based Systems (FRBS), which lead to eXplainable Artificial Intelligence (XAI) [51].

3.3 Ordinal classification

As was introduced previously, in the pattern classification field there are two kinds of methods or problems: classification and regression. However, between them, there is a special category of methods called ordinal classification [52].

Ordinal classification problems [53, 54] are defined as problems of predicting the unknown value of an attribute \(y \in \{y_1, y_2, \dots , y_Q\}\), where Q is the number of classes. Unlike in other types of prediction problems, the labels have a predefined order among them, \(y_1 \prec y_2 \prec \dots \prec y_Q\). For example, in an age classification problem with child, young, and old classes, there is a logical order among them: \(child \prec young \prec old\).

Hence, ordinal classification could seem like nominal classification because the target is the prediction of several nominal classes; however it is different because, as was commented before, the classes have a pre-established order among them. Moreover, ordinal classification methods share features with regression problems because there is an order in the output predicted values; however, in ordinal classification, the set of output values is finite in contrast with regression where the output values are undefined continuous values.

3.4 Evaluation metrics

As previously described, classification models try to predict the class label of new patterns. In this context, the confusion matrix summarizes the model performance. Most of the EDM literature converts the problem into a binary classification considering only the ’pass’ and ’fail’ classes or, equivalently, a positive class and a negative class. In binary classification, the confusion matrix is specified as follows:

$$ M = \left( \begin{array}{cc} tp & fn \\ fp & tn \end{array} \right) $$

where:

  • tp (true positive): represents the number of elements of the positive class correctly classified by the model.

  • fn (false negative): represents the number of elements of the positive class classified as negative class by the model.

  • fp (false positive): represents the number of elements of the negative class classified as positive class by the model.

  • tn (true negative): represents the number of elements of the negative class correctly classified by the model.

However, this confusion matrix is limited to a binary classification problem and cannot be used in ordinal classification problems. Hence, the confusion matrix is modified as follows to allow a general use of labels:

$$\begin{aligned} M = \left\{ n_{ij} \,\middle|\, \sum _{i,j=1}^{Q} n_{ij} = N \right\} \end{aligned}$$

where:

  • \(n_{ij}\): represents the number of elements of class i that the classifier has assigned to class j.

  • \(n_{i\bullet }\): is the number of elements of class i.

  • \(n_{\bullet j}\): is the number of elements that the classifier has assigned to class j.

Also, in this new matrix, the values of the diagonal represent the number of elements classified correctly by the model, and the other ones represent the classification errors. However, the confusion matrix is complex to handle when the number of classes grows or when comparisons with other proposals are intended. Thus, model performance metrics are extracted based on the information provided by the confusion matrix. The most usual metrics [53] within the field of classification are the following:

  • Accuracy or Correct Classification Rate (CCR): the ratio of correctly classified elements to the total number of elements, as shown in (1):

    $$\begin{aligned} CCR = \frac{1}{N}\sum _{i=1}^{Q} n_{ii} \; \end{aligned}$$
    (1)
  • F1 Score: the harmonic mean of Precision and Recall, so this measure takes both false positives and false negatives into account. This metric is calculated following (2):

    $$\begin{aligned} F1\,Score = 2 \cdot \frac{Recall \cdot Precision}{Recall + Precision} \end{aligned}$$
    (2)

    where, for each class i, \(Precision_{i} = \frac{n_{ii}}{n_{\bullet i}}\) and \(Recall_{i} = \frac{n_{ii}}{n_{i \bullet }}\).

  • Ordinal Mean Absolute Error (OMAE): the error measure for misclassification is even more relevant in ordinal classification problems than in nominal ones. To measure these errors, the OMAE is obtained using (3):

    $$\begin{aligned} OMAE = \frac{1}{N} \sum _{i,j=1}^{Q} abs(i-j)n_{ij} \; \end{aligned}$$
    (3)

    where abs() indicates the function that computes the absolute value.
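The three metrics can be computed directly from a Q x Q confusion matrix, as the following Python sketch shows. It assumes that rows hold the true class and columns the predicted class, following the definition of \(n_{ij}\) above. The function names are illustrative, and returning 0 when a precision or recall denominator is empty is a convention adopted only for this example.

```python
def ccr(m):
    """Correct Classification Rate, Eq. (1): diagonal sum over N."""
    n = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / n


def f1_per_class(m):
    """F1 score of every class, Eq. (2), from per-class precision and recall."""
    q = len(m)
    scores = []
    for i in range(q):
        row_total = sum(m[i])                        # n_{i.}: true class i
        col_total = sum(m[r][i] for r in range(q))   # n_{.i}: predicted class i
        precision = m[i][i] / col_total if col_total else 0.0
        recall = m[i][i] / row_total if row_total else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def omae(m):
    """Ordinal Mean Absolute Error, Eq. (3): errors weighted by rank distance."""
    q = len(m)
    n = sum(sum(row) for row in m)
    return sum(abs(i - j) * m[i][j] for i in range(q) for j in range(q)) / n


# Toy matrix with classes ordered Withdrawn, Fail, Pass, Distinction.
EXAMPLE = [
    [4, 0, 1, 0],
    [1, 3, 1, 0],
    [0, 1, 5, 1],
    [0, 0, 1, 2],
]
```

In the toy matrix, the single Withdrawn student predicted as Pass contributes twice as much to OMAE as any error between neighboring classes, while CCR counts all six other errors and that one identically.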

4 Proposed methodology

This section describes the methodology proposed to carry out the prediction of student performance by applying fuzzy ordinal classification algorithms. The first subsection describes the base algorithm chosen to carry out fuzzy rule learning. The second subsection specifies the main novelties introduced to the base algorithm to enhance the expected classification results.

As previously mentioned, many authors simplify the performance classification problem to a binary classification problem. However, the challenge arises when there are several classes and these have a logical order associated with them. In that case, we are faced with an ordinal multi-class classification problem. Therefore, this paper proposes the use of the ordinal classification algorithm FlexNSLVOrd, which is based on the NSLVOrd algorithm [54]. First, the general features of the NSLVOrd algorithm are given. Then, the main improvements included in FlexNSLVOrd are detailed.

4.1 NSLVOrd algorithm

NSLVOrd is a machine learning algorithm that is categorized within fuzzy rule-based algorithms. This algorithm provides the advantage of generating rules whose antecedents are composed of fuzzy variables, allowing greater flexibility. Each rule is composed of a set of fuzzy input variables linked by a conjunctive operator. The values of the input variables can be a set of linguistic terms linked by disjunctive operators. In this way, the rules become fuzzy rules with the following general form for a specific rule \(R_B(A)\):

IF \(X_1\) is \(A_1\) and \(\ldots\) and \(X_n\) is \(A_n\) THEN \(Y\) is \(B\) with weight \(w\);

Fig. 5 Labels of TMA fuzzy variable

Fig. 6 Example of the meaning of the union of fuzzy labels in a fuzzy rule

where:

  • \(X = \{X_1,X_2,\ldots , X_n\}\): is the set of antecedent variables.

  • \(A = \{A_1, A_2, \ldots , A_n\}\): where each \(A_i\) is a subset of values of the fuzzy domain of variable \(X_i\).

  • Y: is the consequent.

  • B: is the value of the consequent (the class), with a specific weight (w) for the rule.

Regarding the antecedents of the rules, NSLVOrd employs fuzzy logic by converting numerical variables into fuzzy ones with a defined number of homogeneous labels. This transformation of variables can be defined by a determined number of linguistic terms (labels) using triangular, left linear, and right linear fuzzy membership functions on the domain boundaries. An example of the conversion of a numerical variable to a fuzzy variable is shown in Fig. 5 using the TMA input variable.

In addition, an antecedent variable of a rule may take a subset of labels joined with a disjunctive operator such as OR. This feature makes it possible to generate more understandable rules. An example using the previous TMA variable is shown in Fig. 6, which represents the antecedent of the rule (“IF TMA = S2, S1 ...”) and its meaning: TMA is less than or equal to the label S1.
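A minimal Python sketch of a homogeneous fuzzy partition and of the matching degree of a rule antecedent is given below. It rests on several assumptions that are not taken from NSLVOrd: the disjunction of labels is computed with the maximum, the conjunction of variables with the minimum, labels are identified by integer positions instead of names such as S2 or S1, and the variable names and domains are invented for the example.

```python
def uniform_partition(lo, hi, n_labels):
    """Evenly spaced labels on [lo, hi]: triangles inside, linear pieces at the ends."""
    step = (hi - lo) / (n_labels - 1)
    peaks = [lo + k * step for k in range(n_labels)]

    def make(k):
        def mu(x):
            x = min(max(x, lo), hi)  # values outside the domain take the boundary degree
            return max(0.0, 1.0 - abs(x - peaks[k]) / step)
        return mu

    return [make(k) for k in range(n_labels)]


def antecedent_degree(value, labels, selected):
    """Degree of 'X is L_a OR L_b ...', with OR taken as the maximum (assumption)."""
    return max(labels[k](value) for k in selected)


def rule_degree(sample, antecedent, partitions):
    """Degree of the whole antecedent, with AND taken as the minimum (assumption)."""
    return min(
        antecedent_degree(sample[var], partitions[var], selected)
        for var, selected in antecedent.items()
    )


# Hypothetical variables: a TMA score in [0, 100] and a click count in [0, 200].
PARTITIONS = {
    "TMA": uniform_partition(0, 100, 5),     # peaks at 0, 25, 50, 75, 100
    "clicks": uniform_partition(0, 200, 3),  # peaks at 0, 100, 200
}
# IF TMA is (label 0 or label 1) AND clicks is (label 2)
ANTECEDENT = {"TMA": [0, 1], "clicks": [2]}
```

For a student with a TMA score of 30 and 150 clicks, the TMA condition holds with degree 0.8 and the clicks condition with degree 0.5, so the rule antecedent matches with degree 0.5. Other choices of operators would change these degrees between label peaks.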

Once the fuzzy input and output variables have been defined, we can explain the learning algorithm. NSLVOrd employs an iterative rule learning (IRL) [55] approach together with a Genetic Algorithm (GA) [56] in which each individual of the population represents one rule of the rule set. An individual is composed of a codification of antecedents of the rule and its corresponding consequent. More details about the codification are explained in [54].

This set of rules is created using a sequential covering strategy algorithm, which is detailed in Algorithm 1.

Algorithm 1 Sequential covering strategy of NSLVOrd

Algorithm 1 receives as input a set of instances or samples denoted as E. First, the RemovedRules variable is initialized to true. This is the control variable of the first loop and indicates whether, after learning, some rule has been removed and learning must continue. Next, the LearnedRules variable is initialized with the default rule. This variable contains the set of learned rules, which will be the output of the algorithm. At the beginning of the first loop, a new rule is obtained using the \(Learn\_One\_Ord\_Rule\) function, in which the GA is used to obtain the best rule at that point, which is a candidate to be added to the set of rules. The second loop checks whether the new rule improves the performance of the system, so that learning can continue. For this, the \(PERFORMANCE\_ORD\) function is used together with (4). At the beginning of the second loop, if the performance is better, the new rule (Rule) is added to the set of rules (LearnedRules). Next, the PENALIZE function marks the examples covered by the set of rules. To end the second loop, a new rule is learned. When there is no improvement in the performance, the \(FILTER\_RULES\) function removes the superfluous rules. If there are no superfluous rules, the RemovedRules variable is set to false and the algorithm finishes. Once the algorithm has finished, the LearnedRules variable stores the best rules that describe the behavior of E.
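The control flow described above can be sketched in Python as follows. This is a skeleton under stated assumptions, not the NSLVOrd implementation: the four callables stand in for \(Learn\_One\_Ord\_Rule\), \(PERFORMANCE\_ORD\), PENALIZE, and \(FILTER\_RULES\), their signatures are hypothetical, and the toy demo replaces the genetic search with a greedy choice among fixed candidate rules.

```python
def sequential_covering(examples, default_rule, learn_one_rule,
                        improves, penalize, filter_rules):
    """Skeleton of the two nested loops of the sequential covering strategy."""
    learned_rules = [default_rule]
    removed_rules = True
    while removed_rules:
        rule = learn_one_rule(examples, learned_rules)
        # Keep adding rules while each new rule improves the rule set.
        while improves(rule, learned_rules, examples):
            learned_rules.append(rule)
            penalize(examples, learned_rules)  # mark the covered examples
            rule = learn_one_rule(examples, learned_rules)
        # Drop superfluous rules; repeat only if something was removed.
        filtered = filter_rules(learned_rules, examples)
        removed_rules = len(filtered) < len(learned_rules)
        learned_rules = filtered
    return learned_rules


def demo():
    """Toy run: a rule is the set of example ids that it covers."""
    examples = {1, 2, 3, 4, 5}
    candidates = [frozenset({1, 2, 3}), frozenset({3, 4}),
                  frozenset({5}), frozenset({1, 2})]
    covered = set()

    def learn_one_rule(_examples, _learned):
        # Stand-in for the genetic search: pick the rule covering most new examples.
        return max(candidates, key=lambda r: len(r - covered))

    def improves(rule, _learned, _examples):
        return len(rule - covered) > 0

    def penalize(_examples, learned):
        for r in learned:
            covered.update(r)

    def filter_rules(learned, _examples):
        return list(learned)  # nothing is considered superfluous in this toy run

    return sequential_covering(examples, frozenset(), learn_one_rule,
                               improves, penalize, filter_rules)
```

In the toy run, the outer loop executes once because the stand-in filter never removes a rule. A filter that removes rules would trigger another pass of learning.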

The \(Learn\_One\_Ord\_Rule\) function is the core of the algorithm where the rules are learned from the set E. This function uses a Steady State Genetic Algorithm (SSGA) over the individuals which represent the rules. These individuals are modified with mutation and crossover operations to get the best rule. The improvement of the rule is guided through a fitness function shown in (4).

This fitness function is a multi-criteria function where the selection of the best rule is guided by the next lexicographical order:

$$\begin{aligned} fitness(R_{B}(A)) = [\Phi (R_{B}(A)),\ \Psi (R_{B}(A)),\ svar(R_{B}(A)),\ sval(R_{B}(A))] \end{aligned}$$
(4)

where:

  • \(\Phi (R_{B}(A))\) is the CCR-OMAE-Rate, that is, a measure proposed by us based on a modification of CCR (Correct Classification Rate) and OMAE (Ordinal Mean Absolute Error).

  • \(\Psi (R_{B}(A))\) is a modification of completeness and consistency proposed originally in [57].

  • \(svar(R_{B}(A))\) is the simplicity in variables of a rule \(R_B(A)\).

  • \(sval(R_{B}(A))\) is the comprehensibility also called the simplicity in values of a rule.

More details of these concepts can be found in [54].
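A lexicographic comparison of this kind can be sketched in Python with tuples, which are compared element by element from left to right. The criterion values below are invented, and the sketch assumes that all four criteria are to be maximized, which the text does not state explicitly.

```python
def fitness(rule):
    """Fitness vector in the order of Eq. (4): (Phi, Psi, svar, sval)."""
    return (rule["phi"], rule["psi"], rule["svar"], rule["sval"])


# Hypothetical criterion values for three candidate rules.
CANDIDATES = [
    {"name": "A", "phi": 0.8, "psi": 0.5, "svar": 0.9, "sval": 0.9},
    {"name": "B", "phi": 0.8, "psi": 0.6, "svar": 0.1, "sval": 0.1},
    {"name": "C", "phi": 0.7, "psi": 0.9, "svar": 0.9, "sval": 0.9},
]

# Later criteria only break ties left by the earlier ones.
best = max(CANDIDATES, key=fitness)
```

Here rule B is selected: it ties with A on the first criterion and wins on the second, so its lower simplicity values are never consulted, and C is discarded at the first criterion despite its higher remaining values.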

4.2 Flexible NSLVOrd algorithm

This section describes the FlexNSLVOrd algorithm, which introduces two relevant improvements over the NSLVOrd algorithm. These improvements adapt the algorithm to this specific problem and are discussed in the following two subsections.

4.2.1 First improvement: cost matrix

To define the first two criteria in the fitness function, the concepts of coverage and the number of positive and negative examples, which can be found in [54], are considered for measuring the successes and errors in the classification. These errors are weighted with the same value for contiguous classes using the position of the class as it is shown in (5).

$$\begin{aligned} \widetilde{U(e,\overline{B})} = \mathrm{Rank}(B) - \mathrm{Class}(e) \; \end{aligned}$$
(5)

where:

  • Rank(B) is the ranking of the consequent.

  • Class(e) is the class of the example e.

Using this equation, we can build a general misclassification cost matrix for Q classes as follows.

$$\begin{aligned} Cost\_matrix = \left( \begin{array}{ccccc} 0 &{} \cdots &{} 1-\textrm{j} &{} \cdots &{} 1-\textrm{Q} \\ \vdots &{} \ddots &{} \vdots &{} \ddots &{} \vdots \\ \textrm{i}-1 &{} \cdots &{} 0 &{} \cdots &{} \mathrm {i-Q} \\ \vdots &{} \ddots &{} \vdots &{} \ddots &{} \vdots \\ \textrm{Q}-1 &{} \cdots &{} \mathrm {Q-j} &{} \cdots &{} 0 \\ \end{array} \right) \end{aligned}$$

For the academic qualifications considered in this paper, the cost matrix composed of four classes would be:

$$ Original\_Cost\_matrix(OULAD) = \left( \begin{array}{cccc} 0 &{} 1 &{} 2 &{} 3 \\ 1 &{} 0 &{} 1 &{} 2 \\ 2 &{} 1 &{} 0 &{} 1 \\ 3 &{} 2 &{} 1 &{} 0 \\ \end{array} \right) \; $$
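As a quick check of the four-class matrix above, the following sketch builds a Q x Q matrix from class positions. It assumes that the cost of confusing two classes is the magnitude of the difference between their positions, which is what the entries of the four-class matrix show; the function name is hypothetical.

```python
def ordinal_cost_matrix(q):
    """Q x Q matrix whose (i, j) entry is the distance |i - j| between class positions."""
    return [[abs(i - j) for j in range(q)] for i in range(q)]

original_cost_matrix_oulad = ordinal_cost_matrix(4)
for row in original_cost_matrix_oulad:
    print(row)
```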

However, the importance of errors between contiguous classes may differ depending on the type of problem. For example, in the case of academic qualifications, predicting that a student has dropped out when the student has actually failed is not the same kind of error as a confusion between other contiguous classes. Considering this peculiarity, we need a cost matrix that gives more importance to certain errors. For this problem, a possible cost matrix is the one presented below:

$$ Cost\_matrix(OULAD) = \left( \begin{array}{cccc} 0 &{} 20 &{} 40 &{} 60\\ 20 &{} 0 &{} 1 &{} 2\\ 40 &{} 1 &{} 0 &{} 1\\ 60 &{} 2 &{} 1 &{} 0\\ \end{array} \right) \; $$

This cost matrix is used to compute the weighted error in (6), which replaces (5). Consequently, FlexNSLVOrd applies this modification to the concepts defined above and, ultimately, to the fitness function.

$$\begin{aligned} \widetilde{U_{cost}(e,\overline{B})} = Cost\_matrix[B][Class(e)] \; \end{aligned}$$
(6)
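The lookup in (6) amounts to indexing the cost matrix with the consequent of the rule and the class of the example. The sketch below assumes that rows and columns follow the order Withdrawn, Fail, Pass, Distinction and uses 0-based indices; the function names are hypothetical.

```python
CLASSES = ["Withdrawn", "Fail", "Pass", "Distinction"]  # assumed row/column order

COST_MATRIX_OULAD = [
    [0, 20, 40, 60],
    [20, 0, 1, 2],
    [40, 1, 0, 1],
    [60, 2, 1, 0],
]

def weighted_error(consequent, true_class, cost_matrix=COST_MATRIX_OULAD):
    """Cost of assigning `consequent` to an example whose class is `true_class`."""
    return cost_matrix[CLASSES.index(consequent)][CLASSES.index(true_class)]

def total_cost(predictions, truths):
    """Sum of the weighted errors over paired predictions and true classes."""
    return sum(weighted_error(p, t) for p, t in zip(predictions, truths))

print(weighted_error("Withdrawn", "Fail"))  # 20
print(weighted_error("Fail", "Pass"))       # 1
```

Under this matrix, any confusion involving the Withdrawn class costs far more than a confusion among the remaining three classes.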

4.2.2 Second improvement: non-homogeneous linguistic terms

The original implementation of NSLVOrd converts numerical variables into fuzzy variables with a given number of linguistic terms. Moreover, the distribution of these linguistic terms is homogeneous throughout the variable domain. However, this feature is a disadvantage when working with numerical variables whose values are not homogeneously distributed. A clear example of this problem can be found in the OULAD dataset with the variable “vle_homepage”, among others. If the original implementation of NSLVOrd is used, as shown in Fig. 7, the distribution of the linguistic terms is homogeneous. However, as the different colors in the same figure show, the percentages of examples in the dataset covered by each linguistic term of the fuzzy variable are not homogeneous. Indeed, a high percentage of the data falls under the S2 label. Therefore, combining non-homogeneous data distributions with homogeneous distributions of linguistic terms has a great influence on rule learning and can lead to errors in rule training.

Fig. 7 Distribution of examples in fuzzy labels for vle_homepage variable (homogeneous)

To overcome this disadvantage, FlexNSLVOrd changes the original implementation of NSLVOrd to balance the distribution of the training examples across the fuzzy labels of the new fuzzy variables. This new definition of labels, for each fuzzy variable, is carried out using Algorithm 2.

Algorithm 2 Build non-homogeneous fuzzy labels in variable

The inputs of Algorithm 2 are the set of samples E and the set of labels L. The first step of the algorithm consists of ordering the set of samples by value, obtaining a new set denoted by Eo. Next, with the information provided by Eo, we calculate the number of samples N for each label. Once these variables have been calculated, a subset \(SetEo_{(i)}\) of N samples is obtained for label i. The remaining calculations in the for loop (\(a_i\), \(b_i\), and \(c_i\)) yield the parameters of the triangular membership function. The calculation of the subset \(SetEo_{(i)}\) and the subsequent steps are repeated for all labels. Finally, when the loop is over, the triangular membership parameters for all labels are returned. It is important to note that this label redefinition does not fit the sample distribution exactly, which helps to avoid overfitting.
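The steps just described can be sketched with equal-frequency bins. The formulas for \(a_i\), \(b_i\), and \(c_i\) are not given in the text, so this sketch makes its own assumptions: the peak \(b_i\) is the median of the i-th subset, the feet \(a_i\) and \(c_i\) are the peaks of the neighbouring labels (or the domain limits at the extremes), and the last subset absorbs any remainder. The function names are hypothetical.

```python
from statistics import median

def build_labels(values, labels):
    """Return {label: (a, b, c)} triangular parameters from equal-frequency bins.

    Assumption: b_i is the median of the i-th bin and neighbouring peaks act as
    feet; the exact formulas of Algorithm 2 may differ.
    """
    eo = sorted(values)                  # Eo: samples ordered by value
    n = len(eo) // len(labels)           # N: number of samples per label
    if n == 0:
        raise ValueError("need at least one sample per label")
    peaks = []
    for i in range(len(labels)):
        is_last = i == len(labels) - 1
        subset = eo[i * n:] if is_last else eo[i * n:(i + 1) * n]   # SetEo(i)
        peaks.append(median(subset))
    feet = [eo[0]] + peaks + [eo[-1]]    # domain limits close the first and last labels
    return {lab: (feet[i], feet[i + 1], feet[i + 2]) for i, lab in enumerate(labels)}

def membership(x, a, b, c):
    """Triangular membership degree of x for parameters a <= b <= c."""
    if x == b:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0
```

With a skewed variable, the labels produced this way become narrow where the samples are dense and wide where they are sparse, which is the effect the non-homogeneous definition is after.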

Fig. 8 shows the new distribution of non-homogeneous fuzzy labels obtained with this algorithm. Unlike in Fig. 7, the labels are now closer to the data distribution of the “vle_homepage” variable. For better understanding and visual perception, as in Fig. 7, the percentages covered by each linguistic term of the fuzzy variable are represented with different colors. Moreover, the second sub-plot is a zoom of the linguistic terms S2, S1, CE, and part of B1.

Fig. 8 Distribution of examples in fuzzy labels for vle_homepage variable (non-homogeneous)

Table 3 Average CCR results on test

5 Experimental study

This section evaluates the performance of our proposal. In order to carry out a detailed performance analysis, this section is divided into two parts. In Section 5.1, our proposal is compared with both shallow and deep machine learning algorithms. In Section 5.2, a comparative study from an XAI point of view is presented, together with an analysis of the obtained rules.

5.1 Comparison with other shallow and deep learning algorithms

This section compares our proposal with a wide selection of algorithms previously studied for the problem of predicting academic performance. The comparison is carried out in terms of performance using the metrics presented in Section 3.4. The analyzed algorithms include both shallow and deep learning algorithms that have been widely used to solve these problems.

The traditional shallow machine learning algorithms used in the comparison belong to the state of the art in different approaches (Bayesian, decision trees, ensembles, etc.) and correspond to the most common approaches taken by previous work to perform multi-class performance prediction in OULAD (see Table 2):

  • Naive Bayes: a probabilistic classifier based on Bayes’ theorem that assumes conditional independence between attributes. These models are used by Pei et al. [36].

  • SimpleLogistic: a classifier for building linear logistic regression models. These models are used by Radovanovic et al. [32].

  • RBFNetwork: a normalized Gaussian radial basis function network that uses the k-means clustering algorithm to provide the basis functions. This model is used to build the approach of Quiao et al. [39].

  • Random Forest: an ensemble of random trees constructing a forest where each tree is trained using bagging, that is, on a sample of the training data drawn with replacement. This model is used in [35, 37].

  • J48: an implementation of the C4.5 method that generates decision trees based on information gain. This model is used by Hussain et al. [34].

  • ZeroR: the simplest classification method that relies on the target and ignores all predictors, so it simply predicts the majority class.

  • OneR: a simple classification algorithm that generates one rule for each predictor in the data, then selects the rule with the smallest total error as its “one rule”.

  • PART: a rule learner that uses separate-and-conquer, builds a partial C4.5 decision tree in each iteration, and makes the “best” leaf into a rule. This model is used by Ruiz et al. [26].

Table 4 Average F1-score results on test
Table 5 Average OMAE results on test

The deep learning algorithms used in the comparison are based on the models previously analyzed in Table 2. Specifically, we have taken as reference the proposals that do not involve time series analysis, since our data does not include that information. The deep learning algorithms used are the following:

  • DeepMLP: a deep multi-layer perceptron based on the proposal of Waheed et al. [38].

  • DeepCNN: a deep convolutional neural network based on the work of Song et al. [41].

Regarding implementation details, for the classical shallow machine learning algorithms we have used the versions available in Weka [58] with default configurations, and the deep learning models have been implemented using the Python library TensorFlow [59]. FlexNSLVOrd is the proposal presented in this work, which includes the modifications presented in Section 4.2.1 and Section 4.2.2.

Tables 3, 4, and 5 present the results of the CCR, F1-Score, and OMAE metrics, respectively, for each course. These results have been obtained using a 5x2 cross-validation scheme (5x2CV), that is, five repetitions of 2-fold cross-validation: each course presented in OULAD (aaa, bbb, ..., ggg) is analyzed as a separate dataset, and each dataset is divided into two partitions of the same size (50% / 50%) five different times following a stratification approach. In each partitioning, two experiments are performed: one using the first partition for training and the second for testing, and one the other way around. Thus, for each course and algorithm, ten experiments are carried out.
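The partitioning scheme can be sketched in pure Python as follows. This is a minimal illustration under our own assumptions (the function name, the fixed seed, and the way odd-sized classes are split are not taken from the paper): each repetition shuffles the indices of every class separately, sends half of them to each partition, and then uses the two partitions in both roles.

```python
import random
from collections import defaultdict

def five_by_two_cv(labels, repetitions=5, seed=0):
    """Yield (train_idx, test_idx) pairs: stratified 50/50 splits, each used both ways."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for _ in range(repetitions):
        part_a, part_b = [], []
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2    # with an odd class size, part_b gets the extra one
            part_a += shuffled[:half]
            part_b += shuffled[half:]
        yield part_a, part_b             # train on A, test on B
        yield part_b, part_a             # and the other way around

labels = ["Pass"] * 6 + ["Fail"] * 4     # toy labels, for illustration only
splits = list(five_by_two_cv(labels))
print(len(splits))  # 10
```

Splitting each class separately is what keeps the class proportions of both partitions close to those of the full dataset.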

Regarding the results of FlexNSLVOrd shown in the tables, it is important to remark that we used five non-homogeneous labels along with the cost matrix presented in Section 4.2.1. In general, we can see that FlexNSLVOrd outperforms all the methods analyzed. Only for the “ggg” course are its average CCR and OMAE results slightly worse; in this case, DeepMLP would be the best proposal. However, if we observe the average F1-score for the same course, FlexNSLVOrd is the clear winner. These results reinforce the use of an ordinal algorithm for this type of problem.

In order to confirm the superiority of the ordinal proposal, a statistical test is carried out with the results in the previous tables. Thus, the Friedman test [60] is applied to determine whether there are significant differences between the performance of the different algorithms included in the comparative study. Then, the Shaffer procedure [60] is applied as a post-hoc procedure to evaluate the differences between proposals more precisely. The results of Friedman’s test, including Friedman’s statistics and the p-values, are shown in Table 6, and the rankings assigned by Friedman’s test are shown in Table 7. These results show that FlexNSLVOrd obtains the lowest ranking for all measures. According to this test, lower ranking values are achieved by algorithms that show better performance. Moreover, for all metrics, Friedman’s test rejects the null hypothesis (p-value lower than 0.01), and therefore significant differences exist in the performance of the algorithms at 99% confidence. Shaffer’s post-hoc test is applied to check which algorithms can be considered worse proposals. Significant differences among algorithms for these measures at the 99% confidence level are shown in Fig. 9. These tests indicate that, for the problem studied, FlexNSLVOrd is significantly better than all other shallow and deep learning algorithms. Only for the F1-score does the test determine that there are no significant differences between FlexNSLVOrd and RBFNetwork. However, FlexNSLVOrd has a lower ranking and offers higher interpretability, as studied in the following section.
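To make the ranking step concrete, the sketch below computes Friedman average ranks (rank 1 is the best algorithm on a dataset, ties share the mean rank) and the uncorrected Friedman chi-square statistic, \(\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]\), for k algorithms and N datasets. The scores are invented toy values, not the results in Tables 3 to 7, the function names are hypothetical, and no tie correction or p-value computation is included.

```python
def average_ranks(scores, higher_is_better=True):
    """scores: {algorithm: [score per dataset]}; returns {algorithm: mean rank}."""
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = dict.fromkeys(names, 0.0)
    for d in range(n_datasets):
        ordered = sorted(names, key=lambda a: scores[a][d], reverse=higher_is_better)
        pos = 0
        while pos < len(ordered):
            end = pos
            while end + 1 < len(ordered) and scores[ordered[end + 1]][d] == scores[ordered[pos]][d]:
                end += 1
            shared = (pos + end) / 2 + 1     # tied algorithms share the mean of their positions
            for a in ordered[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return {a: totals[a] / n_datasets for a in names}

def friedman_statistic(ranks, n_datasets):
    """Uncorrected Friedman chi-square computed from the average ranks."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

toy_scores = {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.8, 0.6], "C": [0.7, 0.5, 0.9]}
ranks = average_ranks(toy_scores)
print(min(ranks, key=ranks.get))  # A
```

For an error metric such as OMAE, where lower values are better, the same function would be called with `higher_is_better=False`.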

Table 6 Friedman’s test results
Table 7 Friedman’s average rankings

Finally, regarding the potential limitations of the proposed FlexNSLVOrd, we should mention its higher computational cost compared with most of the methods included in the comparative study. As an evolutionary algorithm, the proposed method is slower than simpler algorithms like Naive Bayes, logistic regression, or tree-based algorithms. However, the building times of FlexNSLVOrd reach a few minutes for every studied course, which is acceptable in the context of outcome prediction in semester-long courses, where data arrives daily or weekly. Nevertheless, there is room for improvement in this aspect of the proposed method, which can be addressed in future work.

5.2 Explainable knowledge obtained and analysis of the rules

As observed from the results in the previous section, FlexNSLVOrd obtains the most accurate results, with significant differences compared to the other methods. However, these results only cover one of the objectives proposed in this work. The other objective was to obtain an understandable model. Therefore, in this section, we present the advantage of using a technique that allows us to obtain understandable knowledge in line with the XAI trend [61]. XAI refers to methods and techniques of artificial intelligence whose operation can be understood by a human. Thus, XAI aims to extract knowledge that can help a human expert understand the behavior of a system or problem. Accordingly, this section presents the results, in terms of the number of rules, of the algorithms used in the previous section to compare with our proposal. However, as the majority of the previous algorithms do not represent knowledge in an explainable way, only the algorithms that generate models providing output rules are considered in this section.

Fig. 9 Critical distance for internal metrics of Shaffer’s procedure at 99% confidence

Table 8 shows the average number of rules for each course and each algorithm, again using the 5x2CV data partitioning scheme of the previous section. The ZeroR algorithm is not considered because it works for one class only and classifies all examples as belonging to this class. Similarly, the OneR algorithm creates a rule for each attribute in the training data, then chooses the rule with the smallest error rate as its “one rule”. Finally, the randomForest algorithm has been run with ten trees, and we show the mean number of rules. We can observe that, in general, the number of rules is high for the randomForest algorithm. Leaving this algorithm aside, we can see that only for course “aaa” is there a similar number of rules in all considered algorithms. The other courses present a notable difference in the number of rules, which can indicate great difficulty in explaining the behavior in these cases. This provides a clearer idea of the benefits of using FlexNSLVOrd and ordinal classification algorithms compared to other techniques in terms of interpretability.

Table 8 Average number of rules by algorithm

Another interesting point to analyze, in addition to the number of rules generated, is the composition of the rules. By analyzing the rules in detail, it is possible to obtain relevant information on the number of attributes used and their importance for identifying each class. As an example case, we have applied our FlexNSLVOrd proposal to the course “aaa” without considering partitioning schemes. In this case, a total of 23 rules explain the students’ behavior: two rules for the Withdrawn class, four for Fail, five for Pass, and twelve for Distinction. The complete set of rules can be seen in the Appendix.

Table 9 shows the variables used in the rules for the Withdrawn, Fail, Pass and Distinction classes. From this table, which comes from the rule analysis, we can determine that the “code_module”, “AgeBand” and “studiedCredits” variables are irrelevant and are not used in the classification. Regarding the Withdrawn class, it can be observed that “HigestEducation”, “DateUnregistration” and two assignments (“TMA1755” and “TMA1756”) are the only variables used. These variables are of special interest because they are present not only in the Withdrawn class but in most classes. It can also be observed that most of the variables used in lower classes are used in upper classes too, except for “CodePresentation”, “ImdBand” and “VleHomepage”, which are used in the Fail and Distinction classes but not in the Pass class. This indicates that these variables are key to identifying whether a student fails the course or passes it with a good grade. A similar effect can be seen in the assignments “TMA1754” and “TMA1752”, which are used only in the Fail and Distinction classes, respectively. This behavior indicates that the marks assigned by the tutor in these assignments are important for classifying students into these two classes.

Finally, looking at the ratio of the number of variables used for all classes, it is relevant to remark that the number increases as the mark increases. This behavior can be considered normal because it is necessary to consider more aspects to get higher qualifications.

Table 9 Use of variables in each classification

6 Conclusions

This paper has proposed a system for the classification of students’ academic performance in online and distance education courses. In total four classes have been identified: Withdrawn, Fail, Pass, Distinction. However, within the education environment, the order of the classes is not of equal importance. For example, in the case of an early classification, it is not the same to consider that a student is going to obtain a distinction when the student finally is going to drop out of the course. Therefore, and since the use of ordinal classification algorithms has not been widely explored in this context, this paper has proposed a fuzzy ordinal classification algorithm to perform the prediction task. Specifically, the FlexNSLVOrd algorithm has been presented in this work. The most relevant features of this algorithm are the inclusion of cost matrices, which weigh the distances between the four classes considered, in the fitness functions of the training phase of the algorithm and the generation of non-homogeneous labels based on the data distributions to adapt the fuzzy labels to the real data of the studied problem.

To analyze the performance of our proposal, which combines ordinal classification and a rule-based system, an experimental study with the OULAD datasets has been carried out. In particular, comparisons have been made with ten shallow and deep learning algorithms. Experimental results show an excellent performance for FlexNSLVOrd, which obtains the most accurate results in terms of classification using the CCR, F1-Score, and OMAE metrics. FlexNSLVOrd outperforms even deep learning methods.

Moreover, most of the models employed in related works are based on black-box models where the knowledge is not interpretable by humans. With respect to other models that obtain rules-based systems, FlexNSLVOrd achieves models with a lower number of rules using a fuzzy rules-based system. Hence, from the point of view of XAI, FlexNSLVOrd is the most interpretable.

Finally, given the good results obtained by applying ordinal classification using FlexNSLVOrd, it would be interesting to analyze and compare different ordinal classification algorithms in future research. Moreover, other interesting future work could be the performance analysis of ordinal classification and FlexNSLVOrd algorithms with other more flexible representations such as multi-instance and multi-label learning.