Introduction

The university period is one of the most decisive stages in a person's life. Students often experience it as a very stressful time, mainly due to the fear of failure [1]. In some cases, the lack of success is caused by factors related to the students themselves, including the freedom they have to plan their learning processes and the flexibility of university schedules [2]. According to Eurostat, the statistical office of the European Union (EU), over 3 million young people in the EU had been to university but had discontinued their studies at some point in their life. The reasons for not continuing their education are numerous: a desire to work instead, finding the studies uninteresting or not meeting their needs, family reasons, etc.

Long-term course planning (LTCP) is an important task in academic advising [3]: it aims to help students by proposing a course list for all future semesters so that the dropout rate might be reduced. Nevertheless, LTCP is especially challenging for several reasons, such as the number of constraints related to university regulations, the students’ abilities and background knowledge, or simply personal preferences caused by external factors [4]. As a result, when building a study plan, many different aspects are required to prioritize courses. Some of these aspects quantify the importance of including a specific course in the study plan (students’ preferences, expected grades due to easy courses, etc.), whereas others rate the chronology of those courses (complexity of the semester based on the courses it contains). LTCP, however, does not consider graduated students to model the course priority, which might be a good starting point.

Taking all the above into consideration, this paper aims to provide a course index [1] based on the sequence of courses a student has already taken and the grades obtained by graduated students who followed a similar sequence. This analysis not only provides an index for a specific course but also yields general paths (courses grouped by semesters) that should be followed. The recommendation does not rely on any external factor, only on the know-how of the system; in other words, it is based on the paths followed by other students and the final grade those students obtained, not only in a specific course but in the whole degree. To this aim, we propose (ES)\(^2\)P (Evolutionary Search of Emerging Sequential Patterns), a sequential pattern mining algorithm [5] that extracts general paths (a set of semesters with different courses per semester) that were frequently followed by excellent students, but infrequently or never followed by not so good students. In this regard, (ES)\(^2\)P brings together two well-known tasks in descriptive analysis: sequential pattern mining [6] and emerging pattern mining [7]. This synergy is key to identifying paths that provide a course recommendation for each student.

The proposal should also deal with an extra issue: the size of the provided set of solutions. Generally, frequent itemset mining algorithms [8] require a minimum frequency threshold value to be predefined, which is related to the number of solutions. It has been shown [9] that a small change in that threshold value may lead to an extreme variation in the number of solutions as well as a significant increase in the execution time, especially on high-dimensional data [10]. Hence, determining the right threshold value is key, and it is not trivial even when the user has a profound background in the application field (the user needs to try different thresholds by guessing and re-executing the algorithm again and again until the results are good enough). To overcome this issue, the proposed algorithm, which is based on evolutionary algorithms [11], is able to extract a reduced set of solutions without requiring any frequency threshold. The proposal guides the search process through the growth rate, that is, the difference in frequency between two groups of students (excellent and not so good ones). The novelty of the paper can be summarized as follows:

  • An evolutionary algorithm, known as (ES)\(^2\)P, for mining emerging sequential patterns. Sequential pattern mining [5] is a descriptive data mining task that has been applied to the discovery of frequent patterns. This task, however, has not been considered for discriminative or emerging patterns, whose frequency increases significantly from one group or dataset to another. (ES)\(^2\)P extracts the top-k solutions discovered along the evolutionary process: a reduced set of solutions that provides useful information to the students, requiring neither a threshold value nor, therefore, any background in the application field.

  • A methodology for rating courses is also proposed. Unlike LTCP, the proposal requires nothing more than the courses (ordered by semesters) already passed by the student being advised. This methodology provides recommendations on which courses should be taken, and it uses the proposed (ES)\(^2\)P algorithm to obtain the most promising sequences of courses.

  • A methodology for ordering the courses that students should take to succeed in the degree. This methodology, also based on the proposed (ES)\(^2\)P algorithm, provides full study plans that should be followed by students to reduce dropout and failure rates.

The rest of the paper is organized as follows. Preliminaries and Related Work introduces some important concepts and related works. An Evolutionary Algorithm for the Search of Emerging Sequential Patterns describes the proposed methodology, and Experimental Analysis presents some experimental studies to demonstrate the performance of the proposal. Finally, Cases of Study: Applying (ES)\(^2\)P to Course Recommendation shows some study cases for a real scenario, and Conclusion draws some conclusions.

Preliminaries and Related Work

This section describes some concepts concerning sequential pattern mining and emerging patterns that need to be understood. It also presents related studies on course recommendation systems.

Preliminaries

The sequential pattern mining task was introduced by Agrawal and Srikant [5] as a way to identify useful patterns in a set of sequences. Although the task was originally proposed to mine sequences of patterns, it has been extended to time series of ordered events [12]. Formally speaking, let \(I = \{i_{1}, i_{2}, ..., i_{n}\}\) be the set of n items contained in a database \(\Omega\). Let us also define an itemset X as a set of items from I, that is, \(X \subseteq I\). A sequence s is described as an ordered list of itemsets \(\langle X_{1}, X_{2}, ..., X_{m}\rangle\). In a sequence s, an item \(i_{j}\) appears at most once in an itemset \(X_{k}\), but such an item \(i_{j}\) is allowed to appear multiple times in different itemsets belonging to s. As a matter of clarification, let us consider the set of items \(I = \{a, b, c, d\}\) and the sequence \(s = \langle \{a, c\}, \{b\}, \{a, c, d\}\rangle\). In this example, the item \(a \in I\) appears in the itemsets \(X_1=\{a, c\}\) and \(X_3=\{a, c, d\}\), but it only appears once in each itemset. Additionally, the sequence s is formed by a set of itemsets: \(X_1=\{a, c\}\), \(X_2=\{b\}\) and \(X_3=\{a, c, d\}\). Generally speaking, the meaning of a sequence is that events within an itemset occur at the same time, whereas itemsets take place one after another (the itemset \(X_1\) always appears before \(X_2\)).

Sequential pattern mining aims to extract any sequence that appears in \(\Omega\). A database \(\Omega\) gathers a set of sequences \(S = \langle s_{1}, s_{2}, ..., s_{l} \rangle\), as shown in Table 1, where S comprises three different sequences, that is, \(S = \{s_1, s_2, s_3\}\). Additionally, given two sequences \(s_1 = \langle X_{1}, X_{2}, ..., X_{n} \rangle\) and \(s_2 = \langle Y_{1}, Y_{2}, ..., Y_{m} \rangle\), \(s_1\) is called a subsequence of \(s_2\), denoted as \(s_1 \subseteq s_2\), if there exist integers \(1 \le t_{1}< t_{2}< ... < t_{n} \le m\) such that \(X_{1} \subseteq Y_{t_{1}}, X_{2} \subseteq Y_{t_{2}}, ..., X_{n} \subseteq Y_{t_{n}}\). Thus, in Table 1, \(s_3 \subseteq s_1\) since \(\{c\} \in s_3\) is contained in \(\{c\} \in s_1\), and \(\{d, e\} \in s_3\) is contained in \(\{d, e, f\} \in s_1\).
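To make the subsequence relation concrete, the following minimal Python sketch (not part of the original algorithms; the function name and the sample sequences are ours) checks whether one sequence is contained in another using a greedy left-to-right scan:

```python
from typing import List, Set

def is_subsequence(s1: List[Set[str]], s2: List[Set[str]]) -> bool:
    """True if s1 is a subsequence of s2: every itemset of s1 is a subset of a
    distinct itemset of s2, and the original order is preserved."""
    j = 0  # current position in s2
    for itemset in s1:
        # advance in s2 until an itemset containing the current itemset is found
        while j < len(s2) and not itemset <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # the next itemset of s1 must be matched strictly later in s2
    return True

# <{c}, {d, e}> is a subsequence of <{a, b}, {c}, {d, e, f}>
print(is_subsequence([{"c"}, {"d", "e"}], [{"a", "b"}, {"c"}, {"d", "e", "f"}]))  # True
print(is_subsequence([{"c"}, {"d", "e"}], [{"d", "e"}, {"c"}]))                   # False (wrong order)
```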

Table 1 Sample sequence database

Most sequential pattern mining algorithms focus on the extraction of frequent sequences. This frequency is quantified by the support quality measure, defined as the percentage of sequences in \(S \subseteq \Omega\) that contain a specific sequence. It can also be defined in absolute terms as the number of sequences in \(S \subseteq \Omega\) that contain the sequence to be evaluated. Given a sequence \(s_j\), the frequency of such a sequence in a database \(\Omega\) is denoted as \(support(s_j, \Omega )\) and formally defined, in relative terms, as shown in Eq. 1.

$$\begin{aligned} support (s_j, \Omega ) = \frac{|\{s \in S: s_j \subseteq s\}|}{|S|} \end{aligned}$$
(1)
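As a small illustration, Eq. 1 can be computed with the following sketch, which reuses the is_subsequence function introduced above over a hypothetical sequence database:

```python
from typing import List, Set

Sequence = List[Set[str]]

def support(seq: Sequence, database: List[Sequence]) -> float:
    """Relative support of `seq` (Eq. 1): fraction of the sequences in the
    database that contain `seq` as a subsequence."""
    return sum(is_subsequence(seq, s) for s in database) / len(database)

database = [
    [{"a", "b"}, {"c"}, {"d", "e", "f"}],
    [{"a", "b"}, {"b"}, {"g", "h"}],
    [{"c"}, {"d", "e"}],
]
print(support([{"c"}, {"d", "e"}], database))  # 2/3 of the sequences contain it
```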

A wide variety of sequential pattern mining algorithms have been proposed so far. GSP, proposed by Srikant et al. [6], is considered one of the first algorithms in the field. Many algorithms can be found in the specialized literature for mining sequences that appear in data above a predefined frequency value: Spade [13], Spam [14], PrefixSpan [15], CM-Spade [16] and CM-Spam [16]; the last two are considered the best-performing ones. Some additional approaches were proposed for mining the top-k frequent sequences, without requiring a minimum frequency threshold value to be predefined: TKS [17] and TSP [18]. Finally, there are some approaches based on evolutionary computation, such as G-CSPM [19], a genetic algorithm for mining closed sequential patterns; another genetic algorithm, proposed in [20], for mining negative sequential patterns; and a Particle Swarm Optimization algorithm for mining sequential patterns proposed in [21].

The search for interesting patterns encompasses many additional tasks, including periodic pattern mining, high-utility itemset mining and graph mining, among others. Emerging pattern mining (EPM), mainly categorized as a supervised descriptive pattern mining technique, is an additional task that aims to discover discriminative patterns. EPM seeks patterns whose frequency greatly differs from one group or dataset \(\Omega _{1}\) to another dataset \(\Omega _{2}\). The quality of a pattern p is therefore quantified by such a difference in frequency, generally known as the growth rate (GR), which is formally defined in Eq. 2.

$$\begin{aligned} GR (p, \Omega _{1}, \Omega _{2}) = \frac{support (p, \Omega _{1})}{support (p, \Omega _{2})} \end{aligned}$$
(2)
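Following the same sketch style (reusing the support function above), the growth rate of Eq. 2 could be computed as shown below; how the 0/0 case is handled is an implementation choice, assumed here to be 0:

```python
import math

def growth_rate(p, omega1, omega2) -> float:
    """Growth rate of pattern p from omega2 to omega1 (Eq. 2); infinite when p
    occurs in omega1 but never in omega2 (a jumping emerging pattern)."""
    s1, s2 = support(p, omega1), support(p, omega2)
    if s2 == 0:
        return math.inf if s1 > 0 else 0.0  # 0/0 handled as 0 by convention here
    return s1 / s2
```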

Many different algorithms have been proposed so far for mining emerging patterns: MDB-LLBorder [7], JEPProducer [22], ConsEPMiner [23] and iEP-Miner [24]. Some of these works focused on the discovery of intrinsic properties of each group, that is, patterns that do not appear in one group but appear at least once in the other; in other words, patterns having a GR value of \(\infty\). These patterns, known as jumping emerging patterns, are really useful in fields such as image classification [25], human task recognition [26] and bioinformatics [27], among others. Emerging patterns have also been considered in a research work [28] to compare sequences in two groups. Nevertheless, to the best of our knowledge, no algorithm for mining emerging sequential patterns has been proposed yet.

Related Work

Recommender systems have received a lot of attention from institutions seeking to improve the overall satisfaction of their students, which in turn results in higher enrollment. According to a recent review [1], course recommendation systems can be categorized into collaborative filtering-based recommendation systems (CFRS), content-based recommendation systems (CRS), knowledge-based recommendation systems (KRS), hybrid approaches and data mining approaches.

CFRSs are based on the assumption that predictions are made by considering the choices of other students with similar preferences and interests. Chen et al. [29] proposed a collaborative filtering algorithm based on the history of students’ course selection records. This filtering algorithm also considered introductory text from the courses and the students’ performance in those courses. Huang et al. [30] recommended courses by means of a novel cross-user-domain collaborative filtering algorithm. The algorithm was able to predict a score for each student based on the course score distribution of similar students who had already passed the course. On the other hand, CRSs rely on similarities between course features. CRSs recommend courses to students based on previously studied courses and their degree of satisfaction. Lessa et al. [31] proposed the use of LinkedIn profiles to recommend appropriate courses. Mostafa et al. [32] also developed a recommendation system but, instead of taking the students’ preferences from an external source, they analyzed the descriptions of the courses already completed by those students. The final aim was to recommend courses that are similar to those previously chosen and, therefore, of interest to the students.

A major problem of CFRSs and CRSs is the huge amount of data they require to make recommendations. When these systems begin to be used, the recommendation power decreases significantly due to a lack of information, what is known in the literature as the cold-start problem [1]. Knowledge-based recommendation systems (KRSs) are able to overcome that issue since the recommendation is performed by matching users’ requirements with courses. Huang et al. [33] designed a course recommendation system based on an ontology. The system relied on several curricular profiles needed by the students to meet the requirements of different jobs. Like KRSs, hybrid course recommendation systems are also commonly used to overcome the problems of CFRSs and CRSs. The authors of [34] proposed a hybrid system combining course analyses and ratings given by students to provide a course ranking. Esteban et al. [35] also proposed a hybrid system that combines information from both the student and the courses, and includes collaborative and content-based filters.

Last but not least, recommendation systems based on data mining approaches have been proposed to help students choose the courses that best fit them [36]. The UniNet [37] method was recently proposed as a recommendation system based on deep learning to help students make the right decision on the order, combination and number of courses to take. Britto et al. [38] proposed to recommend courses according to the background and preferences of the students, paying special attention to those courses in which the students obtained the best grades. In [39], the authors proposed the use of clustering algorithms to group students and make recommendations based on similarities. Similarly, the authors of [40] proposed a subgroup discovery algorithm to group types of learners. Wang et al. [41] proposed the use of sequential pattern mining to build a course recommendation system. The system searches for sequences of courses followed by students with high GPAs. The longest sequences were considered since they are usually infrequent and, therefore, may offer a better path to achieve the same academic success as previous students. The proposed algorithm took into consideration features such as the percentage of students enrolled in the course, the amount of time spent by the students to graduate, the requirements of the courses, etc. A sequential pattern mining approach was also proposed in [42] to recommend courses based on the learning outcome that each student would obtain if he/she enrolled in the course.

An Evolutionary Algorithm for the Search of Emerging Sequential Patterns

This section describes the proposed (ES)\(^2\)P algorithm. First, it describes the data representation that should be followed by any problem including information about students: courses (organized into semesters) and their grades. It then details how each solution is represented in the proposed evolutionary algorithm, together with each of the procedures of the proposal. Finally, it describes how to deal with the resulting set of patterns.

Data Representation

In the proposed approach, the original database is stored in two different data representations while the data are read, transaction by transaction. The idea behind these data representations is to provide fast data access and to avoid maintaining useless information. First, data are kept in memory through a vertical data representation that creates, for each item, a list of indices of the sequences in which that item appears at least once. Obtaining the frequency or support of each single item in the data is therefore straightforward, since the algorithm just needs to compute the length of the list associated with the item at hand. Given two or more items, this vertical data representation also makes it possible to obtain the set of data records in which all of those items appear: the only operation to be performed is an intersection of the lists associated with each item. Second, a horizontal data representation is built where a list of indices is also stored; here, instead of the indices of the sequences, it stores the indices of the itemsets in which the item appears. This second data representation is based on a hashing function, so given a key k based on an item, it maps k to the corresponding set of indices in which k is included. Index values increase from sequence to sequence, depending on the number of itemsets each sequence has, which is also saved.

Fig. 1
figure 1

Data representations of the proposed algorithm

Figure 1 illustrates the data representation followed by the proposed approach for the sample transactional dataset shown in Table 1. The vertical representation (see Fig. 1a) includes eight different lists of indices, one per item in data. Since each index denotes the data record (sequence) in which the item appears, the length of the list is the frequency of each item in data. Hence, the item \(\{a\}\) appears twice in the dataset (first and second sequences). The item \(\{b\}\) also appears twice in the dataset (first and second sequences), even though it appears twice within the second sequence, that is, \(\langle \{a, b\}, \{b\}, \{g, h\} \rangle\). Similarly, the horizontal data representation (see Fig. 1b) is responsible for storing the itemsets in which each item appears. This data representation is as if all the sequences (IDs 1, 2, 3, etc.) were placed in a single row (a single sequence) and we considered the position in which the item appears. For example, the item \(\{a\}\) appears in the first itemset of the sequence with ID 1 and the first itemset of the sequence with ID 2; in other words, \(\{a\}\) appears in the first and fourth itemsets from Table 1 if all the sequences were placed in a row. Additionally, the item \(\{b\}\) appears in the first itemset of the sequence with ID 1 as well as in the first and second itemsets of the sequence with ID 2; in other words, \(\{b\}\) appears in the first, fourth and fifth itemsets from Table 1. Finally, the proposed data representation needs to store the number of itemsets included in each sequence (see Fig. 1c). The accumulated sum is maintained, so the last value corresponds to the number of itemsets in data. This accumulated sum is really useful to determine whether the horizontal representation values belong to one or different sequences. In other words, taking the vector of values shown in Fig. 1c, any horizontal data representation value in the range [1, 3] belongs to the first sequence; any value in the range [4, 6] belongs to the second sequence; and any value in the range [7, 8] belongs to the third sequence.

The proposed data representation is really useful to compute the frequency in a fast way. The frequency of a single item is simply computed as the length of its vertical representation list: the frequency of \(\{a\}\) is 2, the frequency of \(\{b\}\) is 2, etc. Additionally, the frequency of a sequence (including itemsets) is computed through both the vertical and horizontal representations. Let us consider the sequence \(s = \langle \{c\}, \{d, e\} \rangle\). For this sequence, the intersection of the vertical data representation lists of each single item results in indices 1 and 3 (see Fig. 1a). At this point, it is necessary to check whether every itemset in s is also satisfied within those sequences. The itemset \(\{c\}\) appears in indices 2 and 7 according to the horizontal representation (see Fig. 1b). Since the resulting indices from the vertical representation were 1 and 3, we have to check the first and third ranges of values from Fig. 1c as follows: \(2 \in [1, 3]\) and \(7 \in [7, 8]\). As a result, the itemset \(\{c\}\) is satisfied in the first and third sequences. Let us do the same for the itemset \(\{d, e\}\), which appears in indices 3 and 8 according to the horizontal representation (see Fig. 1b). Again, since the resulting indices from the vertical representation were 1 and 3, we have to check the first and third ranges of values from Fig. 1c as follows: \(3 \in [1, 3]\) and \(8 \in [7, 8]\). As a result, the sequence s appears twice in data: in the first and third sequences.
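The following Python sketch reproduces this bookkeeping for a database consistent with the description of Fig. 1 (the function names and the explicit sequences are our assumptions; the full support check additionally intersects the per-item lists for each itemset and verifies the chronological order, as explained above):

```python
from bisect import bisect_left
from collections import defaultdict

def build_representation(database):
    """Vertical lists (item -> sequence ids), horizontal lists (item -> global
    itemset ids) and the accumulated number of itemsets per sequence (Fig. 1c)."""
    vertical = defaultdict(set)
    horizontal = defaultdict(list)
    accumulated = []
    global_id = 0
    for seq_id, sequence in enumerate(database, start=1):
        for itemset in sequence:
            global_id += 1
            for item in itemset:
                vertical[item].add(seq_id)
                horizontal[item].append(global_id)
        accumulated.append(global_id)
    return vertical, horizontal, accumulated

def sequence_of(global_id, accumulated):
    """Map a global itemset id back to its (1-based) sequence id using the
    accumulated counts, i.e., the ranges [1, 3], [4, 6], [7, 8] of the example."""
    return bisect_left(accumulated, global_id) + 1

database = [
    [{"a", "b"}, {"c"}, {"d", "e", "f"}],   # sequence 1
    [{"a", "b"}, {"b"}, {"g", "h"}],        # sequence 2
    [{"c"}, {"d", "e"}],                    # sequence 3
]
vertical, horizontal, accumulated = build_representation(database)
print(sorted(vertical["a"]))          # [1, 2] -> item a appears in two sequences
print(horizontal["b"])                # [1, 4, 5] -> global itemset ids containing b
print(accumulated)                    # [3, 6, 8]
print(sequence_of(7, accumulated))    # 3 -> the 7th itemset belongs to the third sequence
```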

Encoding Criterion

The proposed algorithm uses an encoding vector of variable length to represent each individual or solution to the problem. The vector includes the set of items that belong to the represented solution. Such a set of items, in turn, is grouped into subsets that represent the itemsets within a sequence. The proposed encoding criterion imposes a single restriction: the same item cannot appear twice in the same sequence. Since the algorithm was proposed for mining sequences of subjects (items) ordered by semesters (itemsets), it does not make sense to include the same subject twice in a sequence. For a matter of clarification, let \(I = \{a, b, c, d\}\) be a sample set of items (subjects). A valid sequence is \(s=\langle \{a, b\}, \{c\}, \{d\}\rangle\), denoting that a student passed the subjects a and b in a semester; then, a semester later, the student passed the subject c; and finally, in a following semester, the student passed the subject d. An invalid solution would be \(s=\langle \{a, b\}, \{a\}\rangle\) since, once the subject a is passed by a student, there is no sense in enrolling in it again.

The proposed encoding criterion is easily adapted to the data representation since each item has two different pointers, one to the corresponding list of sequences (vertical data representation) and the other to the list of itemsets (horizontal data representation). Thus, to determine in which sequences the items appear, it only performs an intersection of the lists (vertical data representation) associated with the items in the sequence. Additionally, to obtain which itemsets appear in the sequences, it only has to perform an intersection of the lists (horizontal data representation) associated with the items of each itemset in the sequence. In order to clarify this methodology, let us consider again the valid sequence \(s=\langle \{a, b\}, \{c\}, \{d\}\rangle\). The sequences in which the items belonging to s appear are obtained by intersecting the vertical representations of a, b, c and d. Additionally, the intersection of the lists obtained from the horizontal data representation returns the sequences in which the itemsets in s are satisfied. Such itemsets are \(\{a, b\}\), \(\{c\}\), and \(\{d\}\).
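A minimal sketch of the validity rule imposed by this encoding (no course may appear twice across an encoded sequence; the function name is ours) could look as follows:

```python
from typing import List, Set

def is_valid(individual: List[Set[str]]) -> bool:
    """An individual is a sequence (list of itemsets); it is valid only if no
    item (course) appears more than once across the whole sequence."""
    seen = set()
    for itemset in individual:
        if seen & itemset:          # some course was already passed earlier
            return False
        seen |= itemset
    return True

print(is_valid([{"a", "b"}, {"c"}, {"d"}]))  # True: the valid sequence from the text
print(is_valid([{"a", "b"}, {"a"}]))         # False: course a is repeated
```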

(ES)\(^2\)P Algorithm

The proposed (ES)\(^2\)P algorithm comprises three main procedures, which were specifically designed for the problem at hand. Descriptions of all these procedures, as well as how they are combined to form the whole algorithm, can be found below.

  1. 1.

    Initial solutions. The proposed algorithm creates the initial set of solutions randomly; each solution is a sequence including random itemsets. The number of itemsets that each solution (sequence) may include is limited by a maximum predefined value. Each itemset includes, in turn, a random subset of items from I. A maximum number of items per itemset is also predefined. It is finally important to highlight that an item \(i \in I\) cannot appear more than once in a sequence, as previously described in Encoding Criterion.

  2. 2.

    Evaluation procedure. This procedure is responsible for assigning a fitness value F to each individual or solution s. The evaluation procedure calculates how close a given solution is to the optimum solution on a dataset \(\Omega\). In the proposed approach, F for a solution s is formally defined based on the support of s and its GR (previously described in Eq. 2, Preliminaries) on the dataset \(\Omega\), as shown in Eq. 3. F is defined in the range [0, 1] and the best solutions are close to 1. \(\Omega\) is split into two groups, that is, \(\Omega _1\) for good students (those that obtained a high final mark) and \(\Omega _2\) for not so good students. The fitness value F is based on the frequency of s in the subset of good students and the normalized growth rate obtained by s. In other words, a solution s is good if it represents a high percentage of good students and a low percentage of not so good students. Finally, it is important to clarify that, in those situations where the difference in the frequencies between \(\Omega _1\) and \(\Omega _2\) is maximum, that is, \(GR(s, \Omega _{1}, \Omega _{2})=\infty\), only the frequency of s in \(\Omega _1\) is considered to compute F.

    $$\begin{aligned} F(s,\Omega _{1}, \Omega _{2})=\frac{support(s,\Omega _{1})}{GR(s, \Omega _{1}, \Omega _{2})} \times (GR(s, \Omega _{1},\Omega _{2})-1) \end{aligned}$$
    (3)
    Table 2 Sample database including sequences for good and not so good students

    For a matter of clarification, let us consider a sample dataset (see Table 2) that is divided into \(\Omega _1\) (good students) and \(\Omega _2\) (not so good students). A sample solution \(s_1=\langle \{a, b\}, \{c\}\rangle\) appears in 60% of the good students (the first two sequences as well as the last sequence in \(\Omega _1\)), which is mathematically represented as \(support(s_1,\Omega _{1})=\frac{3}{5}=0.6\). This solution \(s_1\) appears in 20% of the not so good students (third sequence in \(\Omega _2\)), also denoted as \(support(s_1,\Omega _{2})=\frac{1}{5}=0.2\). It implies that \(GR(s_1,\Omega _{1},\Omega _{2})=\frac{0.6}{0.2}=3\) and, therefore, \(F(s_1,\Omega _1,\Omega _2)=\frac{0.60}{3} \times (3-1) = 0.40\). Let us now consider an additional sample solution \(s_2=\langle \{a, b\}, \{c\}, \{e\}\rangle\). Its support value in \(\Omega _1\) is \(support(s_2,\Omega _{1})=\frac{3}{5}=0.6\), whereas in \(\Omega _2\) it is \(support(s_2,\Omega _{2})=\frac{0}{5}=0\). Hence, \(GR(s_2,\Omega _{1},\Omega _{2})=\frac{0.6}{0}=\infty\) and, therefore, the fitness value F is only computed as the frequency of \(s_2\) in \(\Omega _1\), that is, \(F(s_2,\Omega _1,\Omega _2)=support(s_2,\Omega _{1})=0.60\). A code sketch of this fitness computation is given right after this list.

  3. 3.

    Genetic operators. The proposal includes two genetic operators: the crossover operator, which focuses on exploiting current individuals by examining their neighbors, and the mutation operator, which aims to diversify the search process and explore new areas of the search space.

    figure a

    The crossover genetic operator (see Algorithm 1) combines the genetic material of two solutions acting as parents (\(p_1\) and \(p_2\)) to generate offspring (\(o_1\) and \(o_2\)). This operator works as a two-point crossover that considers the itemsets as feasible cut-points; hence, it is not possible to split any itemset within \(p_1\) or \(p_2\). An additional requirement of this operator is that the cut-points cannot cover the whole individual. This two-point crossover operator works differently depending on a probability. On the one hand, it leaves the itemsets of \(p_1\) within the two cut-points unaltered (see lines 5 to 14, Algorithm 1). It adds the itemsets that appear outside the range of the cut-points of \(p_2\), removing those items that were already in the itemsets obtained from \(p_1\). Additionally, it does the same for \(p_2\), that is, it takes the itemsets within the cut-points and adds those itemsets outside the cut-points of \(p_1\); again, the items already included due to \(p_2\) are removed. As a result, no item can appear more than once either in \(o_1\) or in \(o_2\). On the other hand, the genetic operator leaves the itemsets of \(p_1\) and \(p_2\) outside the two cut-points unaltered (see lines 14 to 23, Algorithm 1). It then adds the corresponding itemsets within the two cut-points, removing repeated items. As a matter of clarification, Fig. 2 illustrates an example of the proposed crossover operator (a simplified code sketch is also given after Fig. 2). Let us consider the following individuals taken from a sample dataset (see Table 2) to act as parents, \(p_1 = \langle \{b\}, \{c\}, \{d, e, f\} \rangle\) and \(p_2 = \langle \{a, c\}, \{j, k\}, \{f, l\}, \{d, m\} \rangle\), with the cut-points marked as dotted lines. Let us also consider that the range within the cut-points is copied unaltered. Thus, \(o_1\) is first initialized as \(\langle \{b\}, \{c\} \rangle\). The itemsets outside the cut-points of \(p_2\) are then added to \(o_1\), that is, \(\{a, c\}\) and \(\{d, m\}\). Since the item c is already in \(\langle \{b\}, \{c\} \rangle\), it is removed and the resulting offspring is \(o_1 = \langle \{a\}, \{b\}, \{c\}, \{d, m\} \rangle\). Finally, \(o_2\) is initialized as \(\langle \{j, k\}, \{f, l\} \rangle\) and the itemsets outside the cut-points of \(p_1\) are added: \(\{d, e, f\}\). Since f is already in \(o_2\), it is removed and the offspring is finally formed as \(\langle \{j, k\}, \{f, l\}, \{d, e\} \rangle\).

    figure b

    On the other hand, the mutation genetic operator (see Algorithm 2) has been designed to perform different tasks with certain probabilities. It slightly modifies solutions, looks for near neighbors, and seeks far unexplored areas of the search space, while maintaining part of the information of the original sequence (solution or individual). Given an individual, we first give the same probability to applying a more disruptive operator or a subtler one (see line 2, Algorithm 2). The disruptive option (see lines 2 to 10, Algorithm 2) randomly selects a cut-point within the solution, and any itemset before that cut-point (see lines 5 to 7) or after it (see lines 7 to 9) is replaced by a set of randomly generated itemsets. It adds a random number of itemsets, ensuring that the sequence does not exceed the maximum number of itemsets; those items that are already included in p are not considered. On the contrary, the less disruptive option (see lines 11 to 24, Algorithm 2) provides three different alternatives: 1) to include a new itemset at a random position of the sequence (see lines 12 to 15), where those items within such an itemset that are already included in p are not considered; 2) to remove a random itemset from p (see lines 16 to 18); 3) to replace a random itemset from p by a new one randomly generated (see lines 19 to 24), where, again, those items within such an itemset that are already included in p are not considered. As a matter of clarification, let us consider the following individual to act as a parent, \(p = \langle \{b\}, \{c\}, \{d, e, f\} \rangle\), which is a feasible solution from the sample dataset shown in Table 2. Considering the disruptive operator (see lines 2 to 10, Algorithm 2), the cut-point between the second and third itemsets, and removing every itemset on the left of such a cut-point, the result is a partial new solution comprising just the itemset \(\{d, e, f\}\). After generating the random itemset \(X = \{a, b\}\), the resulting solution o obtained from p is \(o = \langle \{a, b\}, \{d, e, f\}\rangle\). As for the less disruptive alternatives, let us consider the last one, which replaces an itemset by another randomly obtained (see lines 19 to 24, Algorithm 2). A random itemset \(\{b\}\) is chosen from the individual \(p = \langle \{b\}, \{c\}, \{d, e, f\} \rangle\), and that itemset is replaced by a new one randomly generated: \(\{a, c, h\}\). Since the item c was already included in p, it is removed from the new itemset and the resulting individual is \(o = \langle \{a, h\}, \{c\}, \{d, e, f\} \rangle\).
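As announced in the evaluation procedure above, the following sketch implements the fitness of Eq. 3 (reusing the support and growth_rate sketches from the Preliminaries; the handling of the degenerate cases is our assumption):

```python
import math

def fitness(s, omega1, omega2) -> float:
    """Fitness F of a candidate sequence s (Eq. 3): frequency among good
    students weighted by the normalized growth rate against the rest."""
    sup1 = support(s, omega1)
    gr = growth_rate(s, omega1, omega2)
    if gr == math.inf:      # jumping pattern: only the support in omega1 counts
        return sup1
    if gr == 0:
        return 0.0
    return sup1 / gr * (gr - 1)

# Values from the worked example above: support is 0.6 in omega1 and 0.2 in omega2,
# so GR = 3 and F = 0.6 / 3 * (3 - 1) = 0.4
```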

figure c
Fig. 2
figure 2

Example of crossover operator
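The following simplified sketch (our own naming; the cut-points are given explicitly rather than drawn at random, and only the branch that keeps the itemsets inside the cut-points is shown) reproduces the crossover example of Fig. 2:

```python
def crossover_keep_inside(p1, p2, cuts1, cuts2):
    """Two-point crossover at itemset boundaries. The itemsets of a parent inside
    its cut-points (start, end), end exclusive, are kept unaltered; the itemsets
    of the other parent outside its own cut-points are added before/after,
    dropping any item already present so that no course appears twice."""
    def build(keep_parent, keep_cuts, other_parent, other_cuts):
        inside = [set(x) for x in keep_parent[keep_cuts[0]:keep_cuts[1]]]
        seen = set().union(*inside)

        def strip(itemsets):
            kept = []
            for itemset in itemsets:
                itemset = set(itemset) - seen   # the kept itemsets have priority
                if itemset:
                    kept.append(itemset)
                    seen.update(itemset)
            return kept

        return strip(other_parent[:other_cuts[0]]) + inside + strip(other_parent[other_cuts[1]:])

    return build(p1, cuts1, p2, cuts2), build(p2, cuts2, p1, cuts1)

p1 = [{"b"}, {"c"}, {"d", "e", "f"}]
p2 = [{"a", "c"}, {"j", "k"}, {"f", "l"}, {"d", "m"}]
o1, o2 = crossover_keep_inside(p1, p2, (0, 2), (1, 3))
print(o1)  # [{'a'}, {'b'}, {'c'}, {'d', 'm'}]      (as in Fig. 2; item order within sets may vary)
print(o2)  # [{'j', 'k'}, {'f', 'l'}, {'d', 'e'}]
```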

Finally, it is important to combine all the procedures described above to produce the final algorithm (see Algorithm 3). The proposed (ES)\(^2\)P algorithm maintains a fixed-size elite with the best individuals (sequences) produced along the evolutionary process, and this set of best solutions is finally returned. The first step is to split the dataset \(\Omega\) into two datasets; then, the algorithm creates the population or initial set of solutions, which are evaluated according to the fitness function (see lines 2 and 3, Algorithm 3). At this point, the elite is also initialized with the best e individuals from the initial population (see line 4), and the number of generations (iterations of the algorithm) is set to 0. An iterative process then starts and is performed for a number of generations g (see lines 6 to 21, Algorithm 3). In each iteration, the algorithm proceeds as follows. First, a set of individuals is selected from \(\mathcal {P}\) to act as parents; this selection is carried out by a tournament selector of size 2. The set \(\mathcal {P}\) is used to apply the genetic operators, considering an \(\alpha\) probability for the crossover and a \(\beta\) probability for the mutation. Such genetic operators were already described in Algorithms 1 and 2. Right after the application of the genetic operators, a restoration operator is applied (see line 9) to guarantee that no invalid solutions are formed, so the individuals can be evaluated again (see line 11). The population is then updated by replacing the previous population with the set of offspring obtained in the current generation. At this point, the best individual is never lost, so if it is not in the new population then it is taken from the previous one (see lines 12 to 15, Algorithm 3). The following step carried out by the proposed algorithm is to update the elite set \(\mathcal {E}\) with the best e unrepeated solutions found along the evolutionary process (see line 16).

Last but not least, it is important to highlight that, in the event that the algorithm gets stuck (i.e., the elite does not improve after m generations), the current population is restarted (lines 20 to 24, Algorithm 3). The population \(\mathcal {P}\) is then formed by n random individuals, and the crossover and mutation probabilities are also reset to their default values. Finally, once the maximum number of generations is reached, the elite population is returned (see line 27).

Resulting Set

The previously described (ES)\(^2\)P algorithm is really useful to extract not only the paths followed by students during the whole degree but also the paths previous to some specific course. The idea is the same in both cases, but the input dataset is obtained by considering either the full paths or the specific subpaths before the course of interest. The resulting set \(\mathcal {E}\), which is given by the elite of the algorithm for a specific purpose (for example, to extract paths that lead to a specific objective course j), is key to performing course recommendations. Given a course j, the recommendation is based on the analysis of the sequences that describe the paths followed by the excellent students in course j.

The set \(\mathcal {E}\) of sequences returned by the algorithm is ordered by the fitness value \(F_s\) of each \(s \in \mathcal {E}\). Additionally, each item i within a sequence s has an associated value \(F_i^s = F_s \times c_i\), where \(c_i\) is the number of credit hours of the course i. Thus, courses with a higher number of credit hours are more important than those with fewer credit hours. Once a new student t is analyzed by the system, his/her complete path \(s_t\) is taken into account, and every sequence \(s \in \mathcal {E}\) is activated for t if \(s \subseteq s_t\), i.e., if the student passed the courses in the order denoted by s. A recommendation index \(d_t^j\) for the student t to take the course j is calculated based on Eq. 4. The numerator sums the highest \(F_i^s\) value for the items in the activated sequences, i.e., in those sequences \(s \subseteq s_t\). To provide a recommendation index in the [0, 1] range, the denominator sums the maximum \(F_i^s\) value for the items in all the sequences in the set. Values of \(d_t^j\) close to 1 mean that the student t is prepared to enroll in course j given his/her background, whereas values of \(d_t^j\) close to 0 mean that this course should not be taken at this moment. Applying the proposed recommendation index to a given student t on all the courses returns the set of courses that are most appropriate for t.

$$\begin{aligned} d_t^j \ = \frac{\sum _{ i \in s | s \subseteq s_t} \max \left( F_i^s | s \subseteq s_t \right) }{\sum _{ i \in s | s \in \mathcal {E}} \max \left( F_i^s | s \in \mathcal {E} \right) } \end{aligned}$$
(4)
Table 3 Sample set \(\mathcal {E}\) of sequences returned by the algorithm and their \(F_s\) values, paths \(s_t\) already followed by some sample students, and credit hours \(c_i\) of each course

For a matter of clarification, let us consider the sample set \(\mathcal {E}\) shown in Table 3, which is ordered by the fitness value \(F_s\). Let us also consider three sample students and the corresponding credit hours of each of the courses. Considering the course j as objective, the first student \(t=1\) activates the sequences #1 and #3 from \(\mathcal {E}\), since \(s_1 \subseteq s_t\) and \(s_3 \subseteq s_t\). Then, for each item i in \(s_1 \cup s_3\), its maximum \(F_i^s\) is obtained and added to calculate the recommendation index \(d_t^j\). In the case of student \(t=1\), the numerator of \(d_t^j\) would be: \(0.8\times 3 + 0.8\times 2 + 0.8\times 3 + 0.8\times 2 + 0.6\times 2 + 0.6\times 1 = 9.8\). Note that from \(s_3\) only two courses, f and g, increment the index, since b and d were already present in \(s_1\) with a higher \(F_i^s\) value. Additionally, the denominator is calculated as if all the items in \(\mathcal {E}\) were satisfied. Hence, according to Eq. 4, the recommendation index for student \(t=1\) would be \(d_1^j = \frac{9.8}{13.4} = 0.731\). On the other hand, the recommendation index values for the other two students (\(t=2\) and \(t=3\)) and the course j are \(d_2^j = 0.112\) and \(d_3^j = 0\). Note that student \(t=3\) does not match any of the sequences in the resulting set, so the algorithm does not recommend him/her to enroll in course j at all.
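A minimal sketch of Eq. 4 is shown below; it reuses the is_subsequence function from the Preliminaries, and the elite sequences, fitness values and credit hours are hypothetical (not the ones of Table 3):

```python
def recommendation_index(student_path, elite, credits) -> float:
    """Recommendation index d_t^j (Eq. 4) for one objective course. `elite` is a
    list of (sequence, F_s) pairs returned by (ES)^2P for that course and
    credits[i] holds the credit hours c_i of course i."""
    def best_item_scores(pairs):
        # for each course i, keep the maximum F_i^s = F_s * c_i over the given sequences
        best = {}
        for seq, f_s in pairs:
            for itemset in seq:
                for i in itemset:
                    best[i] = max(best.get(i, 0.0), f_s * credits[i])
        return best

    activated = [(seq, f_s) for seq, f_s in elite if is_subsequence(seq, student_path)]
    numerator = sum(best_item_scores(activated).values())
    denominator = sum(best_item_scores(elite).values())
    return numerator / denominator if denominator else 0.0

elite = [([{"a", "b"}, {"c"}], 0.8), ([{"b"}, {"d"}], 0.6)]   # hypothetical set E
credits = {"a": 3, "b": 2, "c": 3, "d": 2}                    # hypothetical credit hours
student = [{"a", "b"}, {"c"}]                                 # activates only the first sequence
print(round(recommendation_index(student, elite, credits), 3))  # 0.842
```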

Experimental Analysis

This section presents the experimental study, describing first the experimental setup. It then analyzes the performance of the proposed algorithm, which is finally compared to exhaustive search algorithms to demonstrate that the evolutionary process makes sense.

Experimental Setup

Table 4 Datasets and their main characteristics

All the exhaustive search algorithms used in this comparison are available in the SPMF library [43]: Spade [13], Spam [14], PrefixSpan [15], CM-Spade [16] and CM-Spam [16]. Additionally, the algorithms TKS [17] and TSP [18] are also considered since they do not require any frequency threshold, and their aim is to extract the top-k most frequent sequences from data. The experiments are carried out on a set of 13 real datasets (see Table 4) taken from King Abdulaziz University, including information about different faculties: sequences of courses taken by students. All the gathered data were divided into 13 groups or datasets, one per faculty. \(\#Records\) stands for the number of students, Length is the average number of subjects taken by the students to obtain the degree and, finally, \(\#Courses\) is the number of different courses that each faculty provides. Last but not least, it is important to highlight that each dataset (faculty) is split into two: \(\Omega _1\) includes the 30% of students with the best GPA in the degree, whereas \(\Omega _2\) includes the rest of the students. All the experiments are performed on a machine with 6 Intel Xeon E5-2620 CPUs at 2.10 GHz and 64 GB of RAM. The experiments are run ten times, and the average results are considered to reduce environmental variations.

Analysis of the Proposed (ES)\(^2\)P Approach

Table 5 Average fitness values obtained by the proposed (ES)\(^2\)P algorithm considering different population sizes (n) and number of generations without improvement (m) to reset the population. Best results are in bold type-face

The goal of this first analysis is to demonstrate how well the proposal behaves on multiple datasets and to determine the best values for the hyperparameters, that is, those that provide the best fitness values while requiring a lower computational time. Table 5 shows the average results obtained by different combinations of values for the population size (n) and the number of generations without improvement (m) after which the population is restarted. A hypothesis testing by means of non-parametric statistical tests has been conducted with the aim of determining whether there exist significant differences in the overall performance for the aforementioned combinations of values. The Friedman’s test [44] has been used to analyze the general differences, whereas the Shaffer’s post hoc test [45] has been employed to perform all pairwise comparisons. The Friedman’s test detected general statistical differences in the ten combinations of values at a significance level of \(\alpha =0.01\), rejecting the null hypothesis with a p-value smaller than \(2.2\mathrm {e}{-16}\). Then, the Shaffer’s post hoc test was performed to detect where these significant differences were located. The results of this post hoc test, at a significance level of \(\alpha =0.01\), are summarized through the critical difference diagram shown in Fig. 3, illustrating that the values n = 500 and m = 100 produce the best results. However, according to the post hoc test, no statistical difference is found among n values between 500 and 300, and m values between 50 and 100, the only exception being the combination of n = 300 and m = 50. At this point, it is interesting to analyze the runtime (see Table 6) for these six combinations of parameters that present the same performance, with a statistical significance of 99%. The Friedman’s test revealed statistical differences in these combinations of values at a significance level \(\alpha =0.01\), thus rejecting the null hypothesis with a p-value smaller than \(5.074\mathrm {e}{-10}\). Finally, the Shaffer’s post hoc test at a significance level of \(\alpha =0.01\) (see Fig. 4) revealed no significant differences for \(n=300\) and any m value, as well as for \(n=400\) and \(m=100\). According to the results shown in Table 6, the combination of parameters \(m=100\) and \(n=300\) presents the best runtime. Taking all the above into consideration, we recommend this combination of parameters.

Fig. 3
figure 3

Critical difference diagram of different parameter combinations considered. The comparisons were performed using a Shaffer’s test

Table 6 Average time in seconds obtained by the proposed (ES)\(^2\)P algorithm considering different population sizes (n) and number of generations without improvement (m) to reset the population. Best results are in bold type-face
Fig. 4
figure 4

Critical difference diagram of different execution times obtained for the parameter combinations considered. The comparisons were performed using a Shaffer’s test

Fig. 5
figure 5

Analysis of the convergence of (ES)\(^2\)P on different datasets

Let us continue with the analysis of the convergence of the proposed approach. Figure 5 shows how the algorithm behaves on four different datasets: Computing & Information Technology; Design & Art; Economics & Administration; Law. The results on this heterogeneous group of datasets (\(\#Records\) varies from 144 to 2,206; Length is between 48.3 and 54.67; and \(\#Courses\) varies from 109 to 387; see Table 4) demonstrate that the algorithm converges quickly in different scenarios, and that beyond around 1,500 generations it does not widely improve the results. Thus, in order to avoid spending time and computational resources on small fitness improvements, the number of generations is set to 1,500. Last but not least, the crossover and mutation probability values are fixed to 0.8 and 0.3, respectively. With the aim of providing a better description of the experimental study, readers can find a further study on the combination of probability values at the website http://www.uco.es/kdis/course-recommendation. In summary, to obtain the best combination of parameter values, more than 50 parameter configurations were considered, resulting in more than 6,500 executions. Additionally, 10 independent runs were performed for each dataset and parameter configuration to study the algorithm’s performance given its stochastic component.

(ES)\(^2\)P Against Other Sequential Pattern Mining Algorithms

Table 7 Number of solutions and average fitness value returned by exhaustive search algorithms (considering different support threshold values) and the proposed (ES)\(^2\)P approach. Memory stands for memory problems when running (out of memory). The average fitness value was calculated by taking the best 50 solutions based on the fitness value

This second analysis aims to study how well the proposed (ES)\(^2\)P algorithm behaves when compared to existing exhaustive search algorithms. First, we analyze the number of solutions returned by any exhaustive search algorithm (results are exactly the same for any such algorithm, as expected) when different frequency thresholds are considered (see Table 7). At this point, it should be highlighted that our proposal returns exactly the same number of solutions for any dataset, since it obtains the best 50 solutions found. As shown, the number of solutions extracted by exhaustive search algorithms varies greatly depending on the dataset, ranging from 59 to 148,461,607 solutions. Second, we analyze the average fitness value obtained by the algorithms for different support threshold values (see Table 7). It is important to remark that this is not the average frequency value but the fitness value already described in (ES)\(^2\)P Algorithm. Additionally, since the number of solutions is completely different, we have taken the best 50 solutions to perform a fair comparison with regard to the proposal. As expected, exhaustive search algorithms obtained the best results (bold type-face) and, the lower the support threshold value, the better the results, since a larger number of solutions is analyzed. However, for some datasets, our proposal achieved better results in average fitness: Design & Arts; Economics & Administration; Engineering; Home Economics; Information Technology; Law; Sciences & Arts. This interesting behavior occurs because the fitness function depends on the GR measure, which can be high for small support values. Thus, given a specific support threshold value, extremely good results for the fitness value might be missed. On the contrary, the proposal guides the search process by the fitness value, achieving better results.

Additionally, let us analyze the runtime required by the different algorithms for different support threshold values. In this analysis, we consider a set of exhaustive search algorithms regarded as the best ones in the specialized literature [12]: Spade [13], Spam [14], PrefixSpan [15], CM-Spade [16] and CM-Spam [16]. Table 8 shows the runtime, in seconds, required by the algorithms for a support threshold value of 0.5. At this point, it is important to recall that small differences in the average fitness value were obtained (see Table 7): 0.028 in Design & Arts; 0.037 in Economics & Administration; 0.017 in Engineering; 0.057 in Information Technology; 0.053 in Law. Additionally, three datasets could not be run due to memory problems when using exhaustive search algorithms with such a threshold value (see Tables 7 and 8). In general terms, our proposal needs lower runtimes and these values do not widely vary from dataset to dataset. Huge differences are found on multiple datasets. For example, in Economics & Administration exhaustive search approaches need more than 2,000 seconds, whereas our proposal only needs 260 seconds; in fact, for this specific dataset the difference in the resulting average fitness value was really low (0.457 for exhaustive search algorithms and 0.420 for our proposal). Among all the results, the maximum differences in runtime are found for the Information Technology dataset, since exhaustive search approaches require more than 16,000 seconds whereas our proposal needs just 108 seconds. Additionally, for this dataset, the difference in average fitness value was really small (0.446 for exhaustive search algorithms and 0.389 for our proposal). In summary, after analyzing all the results for a support threshold value of 0.5, it is possible to assert that the proposed approach is really useful to obtain really good results (according to the average fitness value) in a small amount of time. Furthermore, this proposal can be run on any dataset, whereas exhaustive search approaches fail on some datasets due to memory requirements.

Table 8 Runtime, in seconds, required by each algorithm on different datasets and considering different support threshold values (0.5, 0.6, 0.7 and 0.8). Our proposal does not require any threshold. Memory stands for memory problems when running (out of memory)

If we continue the analysis for other support threshold values (0.6, 0.7 and 0.8), we observe that the higher the threshold value, the lower the runtime required by exhaustive search approaches (see Table 8). However, analyzing the average fitness value (see Table 7), the higher the threshold value, the lower the average fitness value obtained by exhaustive search approaches. In fact, considering a support threshold of 0.8, our proposal obtains better results on seven datasets (see Table 7).

Last but not least, it is important to remark that those algorithms that require a minimum support threshold value to be predefined need an extra (previous) process to determine the exact value. This procedure is not trivial and generally requires a profound background in the application field. Inexperienced and even many expert users need to try different thresholds by guessing and re-executing the algorithms again and again until the results are good for them [9]. All of this, together with the large runtimes required on different datasets and the small differences in the resulting average fitness values, leads us to the conclusion that our proposal outperforms exhaustive search algorithms.

Comparison to Top-k Sequential Pattern Mining Algorithms

This third analysis aims to study how well the proposed (ES)\(^2\)P algorithm behaves when compared to existing exhaustive search algorithms for mining the top-k solutions. The main advantage of these approaches is that they do not require a previous study to determine a good threshold value. Additionally, they return the same number of solutions regardless of the input dataset, which is easier for experts to manage. However, the main disadvantage of these approaches is related to the fitness values. Existing algorithms for mining top-k sequential patterns were proposed for mining the best results in terms of frequency (support values). Nevertheless, for the problem at hand, the support value cannot establish the importance of a sequence: a sequence can be frequent for both excellent and not so good students and, therefore, have a low GR value; conversely, a sequence can be infrequent for excellent students and absent for not so good students, providing an excellent GR value. This theoretical behavior is tested by running two state-of-the-art algorithms, TKS [17] and TSP [18], on different datasets. The results (see Table 9) demonstrate that these algorithms obtain extremely poor results. The runtime needed by TKS and TSP is much lower than that of (ES)\(^2\)P, but the results are useless for the problem at hand (fitness values close to 0).

Table 9 Runtime and average fitness value returned by top-k search algorithms and the proposed approach considering different k values (25, 50, 100, 200)

Cases of Study: Applying (ES)\(^2\)P to Course Recommendation

In this section, we propose two different methodologies to apply the proposed (ES)\(^2\)P algorithm to course recommendation. First, we propose a methodology for ordering the courses that students should take to succeed in the degree. This methodology, based on the proposed (ES)\(^2\)P algorithm, provides full study plans that should be followed by students to reduce dropout and failure rates. Second, we propose a methodology for rating courses with the aim of providing students with advice on which courses should be taken at any specific moment of their degree. The aim is to recommend the subjects that best fit them according to their paths (previous courses). This methodology requires nothing more than the courses (ordered by semesters) already passed by the student being advised. Last but not least, it is important to remark that courses are represented by IDs in these cases of study to simplify the results. The real names of the courses are given at the website http://www.uco.es/kdis/course-recommendation.

Study Plan Recommendation Based on the Best Ordering of Courses

When no course is given, the algorithm extracts discriminative sequential patterns from the complete paths followed by students. As a study case, we have considered two different faculties: Business and Information Technology (see Table 4). As a matter of simplification, we have taken only the top 5 solutions returned by the proposed methodology; however, the whole set of obtained paths is available at the aforementioned website, together with extra information, including the real names of the courses.

Table 10 Top 5 complete paths returned by the proposal for the Faculty of Business

Let us start with the Faculty of Business study case. Table 10 shows the 5 best solutions found according to the fitness value; the support on the set of good students and the GR value are also shown. The path with the best fitness value denotes that more than 70% of the excellent students have passed course 30 in a semester and courses 11 and 43 together in a subsequent semester. This path is satisfied 1.69 times more often by excellent students than by not so good students. A similar behavior is denoted by the second (\(\langle\){44}, {11, 43}\(\rangle\)) and the third (\(\langle\){11, 43}\(\rangle\)) paths. As a result, it is possible to assert that taking the subjects with IDs 11 and 43 in the same semester is a hallmark of excellent students in the Faculty of Business. Nevertheless, it is fair to say that no outstanding result was obtained in terms of courses that heavily denote a difference between excellent and not so good students. This is mainly because, for this Faculty, there are no clearly better paths to be followed and, in general, all the students behave similarly.

Even more interesting are the results obtained for the Faculty of Information Technology (see Table 11). Analyzing the top 5 solutions, we observe that the courses with IDs 35 and 47 appear in all of the paths and, in fact, they are studied in the same semester. In all cases, the returned paths describe a behavior that is three times more frequent among excellent students than among not so good students. For example, focusing on the solution \(\langle\){50}, {39}, {35, 47}\(\rangle\), it denotes that if a student passes the course with ID 50 in a semester, then passes the course with ID 39 in a later semester, and then passes the courses with IDs 35 and 47 (this time in the same semester) in yet another semester, such a student is 3.8 times more likely to be an excellent student at the end of the degree. Hence, this information is really useful to provide study plans and to analyze why such differences arise among students when they take these courses in that order.

Table 11 Top 5 complete paths returned by the algorithm for the Faculty of Information Technology

Course Recommendation Based on the Previous Academic Path

This second case study is related to the recommendation of which courses should be taken by a student in a specific semester according to his/her path (courses already taken). The aim is to improve his/her academic success. In this case study, we have considered the same faculties as in the previous one: Business and Information Technology. We have also taken four different students (two from each faculty) to provide them with a recommendation.

Table 12 Top 5 courses recommended to the student \(t_1=\langle\){12, 22, 30, 44, 45, 75}, {1, 9, 14, 15, 17, 33}, {3, 11, 34}\(\rangle\) belonging to the Faculty of Business

Let us start with a student \(t_1\) from the Faculty of Business, with the path \(t_1=\langle\){12, 22, 30, 44, 45, 75}, {1, 9, 14, 15, 17, 33}, {3, 11, 34}\(\rangle\). Thus, \(t_1\) has passed 6 subjects in one semester, 6 subjects in a later semester, and 3 subjects in a subsequent semester. For this student, our proposal recommends taking the courses shown in Table 12 at this point. The recommendation index scores of the recommended courses are very high (close to the maximum of 1), meaning that all (or almost all) students with the same path obtained excellent marks in those courses. Analyzing what this student actually did, we found that one of the recommended courses was taken (the course with ID 88). In this course, the student \(t_1\) obtained a GPA within the top 5.4% of the GPAs obtained for that course among all students. This supports the claim that, when a student follows the recommendations, he/she obtains really good GPAs.

Let us now consider a second student \(t_2\) from the same faculty, who has followed the path \(t_2=\langle\){12, 22, 30, 44, 45, 75}, {1, 9, 14, 15, 17, 33}, {3, 21}, {2, 11, 26, 43}, {10, 24, 36, 74, 76}, {20, 46, 63, 77, 142}, {7, 8, 68, 69, 71}, {34}\(\rangle\). This student is close to finishing his/her degree, having completed 8 semesters and 34 different subjects. The top 5 courses recommended by the algorithm and their recommendation index scores are summarized in Table 13. This time, the student did not follow any of the recommendations and took some courses that were not appropriate for him/her at all. For example, analyzing the path finally followed by this student, he/she took the courses with IDs 72 and 88, whose recommendation index scores are 0.069 and 0.000, respectively. These courses were therefore not appropriate for student \(t_2\), as confirmed by the GPAs obtained. In the course with ID 72, the student obtained a GPA of 87, which only places him/her within the top 34.89% of the students ranked by GPA for that course. Similarly, the student obtained a GPA of 85 in course 88, placing him/her within the top 55.94% of that ranking. These results reinforce that following the recommendations is crucial to obtaining good GPAs.
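As a rough illustration of how a "what to take next" recommendation can be derived from the mined paths, the sketch below matches the student's path against the leading itemsets of each mined pattern and credits the courses of the first uncovered itemset with that pattern's fitness. This is an assumed, simplified scoring scheme for illustration only; the actual recommendation index also accounts for other factors, such as the number of credits of each course (as discussed later).

def next_courses(student_path, mined_patterns):
    # `mined_patterns` is assumed to be a list of (pattern, fitness) pairs.
    # For each pattern whose leading itemsets are already covered by the
    # student's path, the courses of the first uncovered itemset are credited
    # with the pattern's fitness.  Simplified, hypothetical scoring.
    taken = set().union(*student_path)          # courses already passed
    scores = {}
    for pattern, fitness in mined_patterns:
        i = 0
        for semester in student_path:
            if i < len(pattern) and pattern[i] <= semester:
                i += 1
        if i < len(pattern):                    # pattern not fully followed yet
            for course in pattern[i] - taken:
                scores[course] = scores.get(course, 0.0) + fitness
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)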

Table 13 Top 5 courses recommended to the student \(t_2=\langle\){12, 22, 30, 44, 45, 75}, {1, 9, 14, 15, 17, 33}, {3, 21}, {2, 11, 26, 43}, {10, 24, 36, 74, 76}, {20, 46, 63, 77, 142}, {7, 8, 68, 69, 71}, {34}\(\rangle\) belonging to the Faculty of Business

The following analysis is carried out on a different faculty, namely Information Technology. For this faculty, we take a student \(t_3\) who has followed the path \(t_3=\langle\){5, 8, 9, 18, 25, 45}, {4, 12, 13, 23, 28}, {1, 10, 15, 21, 51}, {3, 6, 14, 27, 74}, {26, 29, 75, 78, 84}\(\rangle\). Table 14 summarizes the top 5 courses recommended to this student by the proposed methodology, together with their recommendation index scores. On this occasion, the student \(t_3\) actually took two of the five best courses recommended to him/her, namely the courses with IDs 64 and 77. To show the adequacy of the proposal and the validity of the recommendations, let us analyze the GPAs obtained by \(t_3\) in those courses. \(t_3\) obtained a GPA of 96 in the course with ID 64, placing him/her among the top 11.11% of students for that course. As for the course with ID 77, he/she obtained a GPA of 95, which corresponds to the top 13.58% of students for that course.

Table 14 Top 5 courses recommended to the student \(t_3=\langle\){5, 8, 9, 18, 25, 45}, {4, 12, 13, 23, 28}, {1, 10, 15, 21, 51}, {3, 6, 14, 27, 74}, {26, 29, 75, 78, 84}\(\rangle\) belonging to the Faculty of Information Technology
Table 15 Top 5 courses recommended to the student \(t_4=\langle\){5, 12, 23, 28, 45}, {4, 8, 9, 13, 18, 25}, {1, 10, 15, 21, 51}, {3, 6, 14, 27, 74}\(\rangle\) belonging to the Faculty of Information Technology

Finally, let us consider a student \(t_4\) belonging to the Faculty of Information Technology, who has passed the courses identified by the following path \(t_4=\langle\){5, 12, 23, 28, 45}, {4, 8, 9, 13, 18, 25}, {1, 10, 15, 21, 51}, {3, 6, 14, 27, 74}\(\rangle\). For \(t_4\), the top 5 courses according to the recommendation index score are shown in Table 15. Analyzing what the student finally did, we observe that he/she took two of the five recommended courses. For the course with ID 26, \(t_4\) obtained a GPA of 85 (top 37.32% of the GPA ranking for that course). For the course with ID 77, the GPA obtained by \(t_4\) was again 85, this time in the top 27.16% of the ranking for that course. However, the remaining courses taken by this student have very low recommendation index scores. For example, the course with ID 78 has a recommendation index score of 0.365 (student \(t_4\) finally obtained a GPA of 68 in that course, which only places him/her within the top 80.25% of the GPAs for that course), whereas for the course with ID 84 the recommendation index value was 0. The student took this course anyway and obtained a GPA of 87, which is within the top 64.36% of students. These cases show that the recommendation index score is low when taking such courses is not associated with belonging to the group of excellent students. Last but not least, it is important to clarify that the number of credits of each course is taken into account to obtain the recommendation index score, which explains why a lower score value may still correspond to a higher position in the ranking.
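For reference, the percentile figures quoted throughout this section ("within the top X% of the ranking for that course") can be reproduced from the per-course GPA lists as in the following sketch; the exact tie-handling used in the paper is not specified, so the computation below is an assumption.

def top_percentile(gpa, course_gpas):
    # Percentage of students in the course whose GPA is greater than or equal
    # to the given GPA; smaller values mean a position closer to the top.
    return 100.0 * sum(g >= gpa for g in course_gpas) / len(course_gpas)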

Conclusion

In this paper, we have proposed an evolutionary algorithm, called (ES)\(^2\)P, for mining top-k emerging sequential patterns. It is able to discover a reduced set of discriminative patterns whose frequency increases significantly from one group or dataset to another. Its main advantage is that, unlike existing algorithms, it does not require any frequency threshold and, therefore, no deep background in the application field. Additionally, a methodology for rating courses has been proposed, which requires nothing except the courses (ordered by semesters) already passed by the students being advised. This methodology relies on the proposed (ES)\(^2\)P algorithm to obtain the most promising sequences of courses. Last but not least, we have also proposed a methodology for ordering the courses that students should take so they can be successful in their degree. This methodology, also based on the proposed (ES)\(^2\)P algorithm, provides full study plans that students should follow to reduce dropout and failure. The experimental analysis has shown that the proposed (ES)\(^2\)P algorithm behaves really well in terms of runtime and is able to extract useful information from huge datasets where other algorithms fail. The usefulness of the proposal has been tested on two case studies, providing excellent recommendations in a real scenario.