1 Introduction

Education is the cornerstone of civilizations, and its growing importance has made it a priority in everyday life. Recent technological advances have given rise to a new medium for learning and tutoring, known among practitioners as E-learning or online learning [28]. Online learning comes with several advantages. Flexible enrollment management, lower course costs, and time saved for both students and teachers have contributed to its popularity in education [29]. It can also be offered to everybody, without university enrollment requirements and at affordable cost. In addition, learners can arrange their study time at their convenience, free of timetable restrictions. Furthermore, E-learning platforms hold massive amounts of user data. Hence, machine learning, and specifically recommendation systems, are deployed to enhance the learning experience by personalizing the content for each learner. The online education market is forecast to reach $350 billion by 2025, and this forecast does not account for the growth impact of COVID-19 on the online learning market [32]. E-learning platforms are continuously proliferating. Among the best-known ones are Udemy, Coursera, Udacity and Edx. More recently, the COVID-19 pandemic forced educational institutions to recognize the importance of online teaching. E-learning has become the de facto solution for maintaining the continuity of the educational process around the world [31, 38]. This opens the door to substantial research opportunities [30].

Dossou [1] conducted a study to determine the probability of E-learning becoming the de facto mode of delivery in the educational sector during the COVID-19 pandemic. The findings show that the probability of de facto E-learning implementation is 0.65 worldwide, 0.87 in High-Income, 0.70 in Upper-Middle-Income, 0.52 in Lower-Middle-Income, and 0.29 in Low-Income economies. Unlike in-class sessions, online learning sessions exhibit less interaction between instructor and students. Consequently, the instructor has fewer cues at hand, such as cognition, sentiment and visual contact, from which student performance in a session is usually assessed. Jordan [2] reported that the completion rate of Massive Open Online Courses (MOOCs) is low (0.7%-52.1%), with a median value of 12.6%. Similar conclusions were drawn in [3] from a study of online courses offered by Open University UK and China.

In this work, we address the problem of predicting learner performance in E-learning by adopting a novel strategy. Unlike state-of-the-art techniques, where Long Short-Term Memory (LSTM) is commonly adopted, we transform the clickstream time series (the interaction of learners during the online session) into images using the Gramian Angular Field. This allows us to benefit from the power of Convolutional Neural Networks (CNNs) and to avoid LSTM shortcomings such as the vanishing/exploding gradient problem. In addition, the proposed model incorporates demographic and online assessment data of learners through a second pathway of fully connected layers. Both pathways are aggregated and followed by a classifier that outputs the probabilities of the classes: Distinction, Pass, Fail and Withdrawal. Batch Normalization is used to regularize the model and hence to cope with the problem of imbalanced classes. We demonstrate the benefit of the proposed approach through comprehensive experiments and comparison against state-of-the-art approaches.

The rest of the paper is organized as follows. Sections 1.1 and 1.2 review the research works that addressed the problem of predicting learner performance in E-learning. Section 2 details the dataset used in this study. Section 3 details the proposed approach. Experimental results and validation of our approach are presented in Section 4. We conclude and present future directions in the last section.

1.1 Machine learning techniques for student performance prediction

Machine learning has revolutionized various sectors, industries and services. Educational data mining has emerged as a new field aiming at supporting decision makers [33, 34]. With MOOCs and online learning becoming more and more popular, multiple research efforts have focused on applying state-of-the-art machine learning techniques to address several research questions, including but not limited to student performance prediction. Al-Shabandar et al. [4] conducted an exploratory data analysis on MOOC data with the objective of predicting student outcome in a course. The authors identified a strong correlation between clickstream actions and the outcome of the online learners. In addition, multiple machine learning algorithms, including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), Feedforward Neural Network (NN), Linear Discriminant Analysis (LDA) and Self-Organizing Map (SOM), were used to predict learner outcome. RF achieved the best performance, with 98.8% accuracy when trained on all features and 98.5% when trained on the most important features only. In [5], a semi-automated peer-assessment system was used in two undergraduate computer courses to collect data. Students asked questions about topics addressed during the sessions and answered questions from their peers. Answers were then rated. Several features were extracted and used to train multiple linear regression models to predict student scores, which range from 18 to 30 out of 30. Root Mean Square Error (RMSE) was used for evaluation. The final trained model achieved an RMSE of 2.93 and 3.44 for the first and second course, respectively. The authors argued that this prediction could provide insights about the attrition rate of online courses. Azizah et al. [6] conducted a comparative performance study between the C4.5 [35] and Naive Bayes algorithms for the prediction of student performance in a virtual learning environment.
The data consist of web history and the sum of webpage interactions of the students. Both algorithms achieved comparable performance, with 63% of the data instances accurately classified. In [7], the authors adopted an ensemble learning approach to predict students’ academic performance based on their socio-economic status and historic grades. Ensemble approaches have shown significant performance improvement compared to classic ones in several learning tasks. The idea is to aggregate the decisions of multiple classifiers. In other words, the decision is taken collectively by relying on the "wisdom of the crowd". The authors used three classification algorithms: DT, k-Nearest Neighbors (kNN) and Aggregating One-Dependence Estimators (AODE). This ensemble was tested on three datasets and achieved 87% average accuracy, the best compared to single models. Peach et al. [8] adopted a different strategy by applying an unsupervised learning technique. Specifically, given time series of student engagement during online sessions, the objective is to identify groups or clusters of learners having similar temporal behavior. Dynamic Time Warping is first used to compute the pairwise similarity between time series of learner actions. Then, a multiscale graph clustering algorithm is applied to identify the groups. The findings show distinct engagement patterns of learners with different levels of regularity, adherence to the pre-planned course structure and task completion. The results also revealed that low-performing learners are grouped in a distinct cluster and hence accurately identified.

Although machine learning algorithms and techniques are popular choices in educational data mining and have shown great performance, their applicability is still limited to small datasets.

1.2 Deep learning for student performance prediction

Since the AlexNet breakthrough at the 2012 ImageNet competition [9], Deep Learning (DL) has been the leading technology in several applications, including computer vision, healthcare, security, self-driving cars and many others. Recently, DL has attracted practitioners and researchers in educational data mining [39]. Aljohani et al. [10] used the weekly clickstream of students during online learning sessions to predict whether learners will pass or fail using deep learning. The data are formed by sequentially stacking the weekly clickstream sequences. These data are part of the well-known Open University Learning Analytics (OULA) dataset [11]. The deep neural network consists of a stack of LSTM [12] layers. LSTM is a specific type of neural network well suited to sequential data such as time series and text. To mitigate the notorious problem of overfitting, dropout [13] is used. This model achieved 95.23% accuracy when predicting pass/fail in the last week of the online course, outperforming SVM, Logistic Regression and Artificial Neural Network (ANN). In [14], the authors used 54 features from the OULA dataset to train a DL model to predict student performance. As a preprocessing step, a sparse feature reduction technique is applied in order to select the subset of features that most affect student performance. Then, Singular Value Decomposition is applied to cater for the data sparsity. Finally, min-max scaling is applied for normalization. The preprocessed data are used to train an ANN with three hidden layers of 50, 20 and 10 neurons. In their experiments, the authors established four binary scenarios: pass/fail, distinction/fail, distinction/pass and withdrawn/pass. Results showed that withdrawn/pass students are highly distinguishable, with 94% accuracy. Lower accuracy is obtained for the other classification formulations, with distinction/pass having the lowest accuracy: 80%.
The findings also showed that, for students achieving distinction, age, region and having special needs negatively affect their performance. Students in rural areas face difficulty accessing the online learning system due to connectivity problems. For students with failure and withdrawal outcomes, the overall credits, highest achieved education and region significantly affected their performance. The age factor has shown a positive association with the withdrawal outcome. Indeed, mature learners are less likely to withdraw compared to younger ones. Although this study is comprehensive, its formulation is still restricted to binary classification. This can be explained by the class imbalance in the data, which makes the classification task more challenging. DL has also opened the door to learning from a variety of data at the same time, i.e. gaining insights by combining different data modalities to strengthen the learning process and achieve better performance. Such a learning strategy is widely used in computer vision [16] and health applications [17]. Using the demographic data along with the clickstream from the OULA dataset, Karimi et al. [15] proposed deep learning models that learn from both modalities, termed DOPP and DOPPFCN. In the literature, demographic is a broad term and may refer to anything ranging from age, gender, location and nationality to income. In their study, Karimi et al. considered the gender, age, highest education level and special needs status of students. The authors proposed a network with two pathways. The first one is a stack of fully connected layers dedicated to extracting hidden representations from the demographic data. The second pathway is a sequence of LSTM layers that learns from the clickstream, modeled as time series. Both pathways are merged using simple concatenation, followed by a classifier that outputs the probability of the student outcome.
Binary and multiclass classification experiments are conducted and the F1 score is evaluated for the different courses. The model achieved an F1 score of more than 0.85 for binary classification and 0.54 for multiclass classification. The latter score is considerably lower due to the class imbalance problem. Experiments also showed that accounting for the demographic information boosted the model performance. A similar approach is reported in [18]. He et al. [19] proposed a three-pathway deep neural network where, in addition to the demographic and clickstream pathways, a third pathway for assessment data is considered. The assessment stream consists of the outcomes of learners’ evaluations during the semesters: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam). Results showed that when training is conducted using data up to the 5th week, the proposed model achieved 60% accuracy. As the course progresses, more data become available and accuracy improves, reaching more than 90% when all data are used.

DL approaches have shown significant performance improvement compared to traditional machine learning algorithms. Nevertheless, class imbalance is a recurrent problem and is challenging to overcome.

2 Open university learning analytics dataset

Virtual learning platforms are commonly used by online educational platforms to collect data about learners’ interaction during the online session and to gain better insights into their learning behavior. In this study, we use the OULA dataset. This dataset contains information about 32,593 learners monitored for 9 months. It includes the demographics of learners: area of residence, gender, age, highest education level on entry to the module and special needs status. These learners were enrolled in 7 different courses. Each course was taught at least twice, starting in different months of the years 2013 and 2014. Among the seven courses, this study focuses on three particular ones, those with high enrollment. Table 1 describes these courses, code-named BBB, DDD and FFF.

Table 1 OULA dataset: Description of the studied courses
Fig. 1 Number of clicks per learner outcome: time series representation

The dataset also contains assessment information and interaction data, i.e. the activity of learners during the online session. This interaction was logged as the number of clicks per day for each course. Twenty click types are recorded, including completing quizzes, visiting URLs and resources, filling in questionnaires, etc. A learner can have four possible outcomes in a course: Distinction, Pass, Fail and Withdrawn. The exploratory data analysis conducted by He et al. [19] showed that learners with frequent interaction and high assessment scores are highly likely to pass the course, while learners who fail the course rarely interact during the online session. This observation is visually supported by Fig. 1, which illustrates samples of click time series per learner outcome.

Fig. 2 Overview of the student performance analysis

3 Proposed approach

Figure 2 illustrates the proposed deep learning model for predicting learners’ outcomes. As a preprocessing step, the clickstream data, i.e. the time series of clicks, are transformed into images using the Gramian Angular Summation Field. The demographics and assessment data are presented in tabular format. In the forward pass, the images are fed into the first pathway, which consists of a sequence of CNN blocks. In each CNN block, the input is convolved with a set of filters. This is followed by Batch Normalization and a non-linear transformation, typically a ReLU function. The output of each CNN block is then downsized using a Pooling layer. This process is repeated through the rest of the CNN blocks and Pooling layers. In the second pathway, the demographics and assessment tabular data are fed into a sequence of fully connected dense layers. The outputs of both pathways are then merged using simple concatenation. We opt for this merging strategy for its simplicity and low computational cost, although more elaborate techniques can be adopted (e.g. Compact Bilinear Pooling). The merged output goes through a sequence of fully connected layers. Finally, a Softmax classifier outputs the predicted class, i.e. the learner’s outcome. The filters and the weights of the fully connected layers are the parameters updated during the training process, i.e. the backward pass, also called backpropagation. This process aims at minimizing the error between the true label of the input data and the model output. A typical error measure for classification tasks is the cross-entropy loss.
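To make the loss mentioned above concrete, the following minimal NumPy sketch computes the categorical cross-entropy for a batch of classifier outputs. The function name, the integer class encoding and the `eps` clipping constant are assumptions made for this illustration and are not taken from the authors' implementation.

```python
import numpy as np

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and true labels.

    probs  : array of shape (batch, classes), each row summing to 1
    labels : integer class indices of shape (batch,)
    eps    : clipping constant (an assumption) that avoids log(0)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    labels = np.asarray(labels, dtype=int)
    # Probability the model assigned to the true class of each sample
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

# Hypothetical class order: 0=Distinction, 1=Pass, 2=Fail, 3=Withdrawn
confident = categorical_cross_entropy([[0.1, 0.7, 0.1, 0.1]], [1])
uniform = categorical_cross_entropy([[0.25, 0.25, 0.25, 0.25]], [1])
```

A confident correct prediction yields a loss of -ln(0.7), whereas a uniform prediction over four classes yields ln(4), so minimizing this quantity pushes probability mass towards the true class.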

In the following, we present the technical details of time series imaging using Gramian Angular Field. We also describe the layers used to build this model.

3.1 Imaging time series: the Gramian Angular Field

Transforming time series into images has demonstrated performance improvement in several applications [20, 36, 37]. The intuition is to exploit spatial features by projecting the raw time series data into another space and then applying a trigonometric transformation. This approach was proposed by Wang et al. [20] and applied to classification tasks on 20 datasets, including electrocardiogram and human motion data. The obtained images are used to train a Tiled deep CNN [21] for classification. Hitami et al. [22] applied the Gramian Angular Field [20] and Recurrence Plot [23] for time series imaging. The obtained images are used to train a deep CNN. The pipeline showed better performance compared to traditional ones, where particular features, e.g. Scale-Invariant Feature Transform (SIFT) [24], Gabor and Local Binary Patterns (LBP) [25] features, are extracted and classified. De Santo et al. [40] encoded time series as images using several techniques, including the Gramian Angular Field and Recurrence Plot, for predictive maintenance. Hong et al. [42] applied the time series imaging paradigm to predictive maintenance in the context of photovoltaic arrays. In [41], the authors used imaging techniques to accurately predict natural gas consumption. Ding et al. [43] used the Gramian Angular Field for fast and accurate fault detection in Direct Current electricity grids. Kong et al. [45] transformed financial time series into images to represent the temporal characteristics and reveal intrinsic feature details for better prediction performance. Time series imaging has also been successfully applied to cancer prediction [44].

Fig. 3 Imaging time series using the Gramian Angular Field. (Left) Time series (Middle) Polar Transformation (Right) Gramian Angular Field

The pipeline of imaging time series using the Gramian Angular Field is depicted in Fig. 3. Given a time series \(ts=[ts_1,ts_2,\cdots ,ts_N]\), normalization is applied:

$$\begin{aligned} \hat{ts}_i = \frac{2ts_i -\max \limits _i(ts_i) - \min \limits _i(ts_i)}{\max \limits _i(ts_i)-\min \limits _i(ts_i)} \end{aligned}$$

where \(\hat{ts}_i\) is the i\(^{\text {th}}\) element of the normalized time series \(\hat{ts}\). The normalization brings the range of values to \([-1,1]\). \(\hat{ts}\) is then transformed to the polar coordinate system by encoding each value as an angle through the inverse cosine and its timestamp i as the radius:

$$\begin{aligned} \begin{aligned}&\Phi _i = \text {arccos}(\hat{ts}_i) \quad -1\le \hat{ts}_i \le 1 \\&r_i = \frac{i}{R} \end{aligned} \end{aligned}$$

where arccos is the inverse cosine and R is a constant that controls the span of the polar coordinate system. This mapping is unique: each time series value corresponds to one and only one point in the polar coordinate system. After this transformation, the angular property can be exploited to capture the temporal correlation between time steps and obtain the images. This is achieved using the Gramian Angular Summation Field (GASF). The GASF image is a matrix of the form:

$$\begin{aligned} GASF = \left( \begin{matrix} \cos (\Phi _1 + \Phi _1) &{} \ldots &{} \cos (\Phi _1 + \Phi _N) \\ \cos (\Phi _2 + \Phi _1) &{} \ldots &{} \cos (\Phi _2 + \Phi _N) \\ \vdots &{} \ddots &{} \vdots \\ \cos (\Phi _N + \Phi _1) &{} \ldots &{} \cos (\Phi _N + \Phi _N) \end{matrix} \right) \end{aligned}$$

It can also be written as:

$$\begin{aligned} GASF = \Big (\hat{ts}\Big )^{tr}\hat{ts} -\Big (\sqrt{Id-\hat{ts}^2}\Big )^{tr}\sqrt{Id-\hat{ts}^2} \end{aligned}$$

where tr is the transpose operator and Id is the all-ones vector \([1,1,\cdots ,1]\), with the square and the square root taken element-wise. Each \(GASF_{i,j}\) represents the temporal correlation by summation of angular directions. The main diagonal of GASF is a special case that retains the original angular information. It can be used to approximately reconstruct the time series from the high-level features learned by a deep neural network [21]. We illustrate in Fig. 4 the imaging of sin(x) and sin(2x) using the GASF transform. The difference in patterns between the two images is clearly noticeable.
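The equivalence between the two expressions of the GASF follows from the angle-sum identity. As a short worked step (added here for clarity, not part of the original derivation), since \(\Phi _i \in [0,\pi ]\) we have \(\sin \Phi _i = \sqrt{1-\hat{ts}_i^2} \ge 0\), so that:

$$\begin{aligned} GASF_{i,j} = \cos (\Phi _i + \Phi _j) = \cos \Phi _i \cos \Phi _j - \sin \Phi _i \sin \Phi _j = \hat{ts}_i \hat{ts}_j - \sqrt{1-\hat{ts}_i^2}\sqrt{1-\hat{ts}_j^2} \end{aligned}$$

On the main diagonal this reduces to \(GASF_{i,i} = \cos (2\Phi _i) = 2\hat{ts}_i^2 - 1\), so the diagonal alone determines the magnitude \(|\hat{ts}_i|\) of each normalized value.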

Fig. 4 Gramian Angular Summation Field of sin(x) (left) and sin(2x) (right)
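The imaging pipeline of this subsection (normalization, polar encoding, GASF) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the default `R` equal to the series length is an arbitrary choice, and the handling of a constant series (for which the normalization is undefined) is an assumption.

```python
import numpy as np

def rescale(ts):
    """Min-max rescale a 1-D series to [-1, 1] (normalization step)."""
    ts = np.asarray(ts, dtype=float)
    lo, hi = ts.min(), ts.max()
    if hi == lo:
        # Assumption: a constant series (0/0 in the formula) is mapped to zeros.
        return np.zeros_like(ts)
    return (2.0 * ts - hi - lo) / (hi - lo)

def polar_encode(ts_hat, R=None):
    """Return (phi, r): angles arccos(ts_hat) and radii i / R."""
    ts_hat = np.clip(ts_hat, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    phi = np.arccos(ts_hat)
    n = len(ts_hat)
    R = n if R is None else R             # assumption: R defaults to the series length
    r = np.arange(1, n + 1) / R
    return phi, r

def gasf(ts):
    """GASF image: entry (i, j) is cos(phi_i + phi_j)."""
    phi, _ = polar_encode(rescale(ts))
    return np.cos(phi[:, None] + phi[None, :])

def gasf_closed_form(ts):
    """Same image from the matrix expression, without computing angles."""
    ts_hat = np.clip(rescale(ts), -1.0, 1.0)
    s = np.sqrt(1.0 - ts_hat ** 2)
    return np.outer(ts_hat, ts_hat) - np.outer(s, s)

# The two signals used for illustration in the text
x = np.linspace(0.0, 2.0 * np.pi, 64)
image_sin_x = gasf(np.sin(x))
image_sin_2x = gasf(np.sin(2.0 * x))
```

Note that the radius does not enter the GASF itself; only the angles do. A series of length N yields an N x N symmetric image, so long clickstreams may need to be aggregated or downsampled before imaging.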

3.2 Deep learning layers

3.2.1 Convolution layer

A convolution layer consists of a set of filters. Parameters of these filters are learned as a result of the training process of the neural network model. These filters are small in size, typically \(3 \times 3\), \(5 \times 5\) or \(7 \times 7\) and are convolved with the data coming from the previous layer. Given an input I of size \(M \times N\) and a filter K of size \(m \times n\), the convolution of I by the filter K is expressed as:

$$\begin{aligned} \begin{aligned}&O(i,j) = \sum _{k=1}^{m} \sum _{l=1}^{n} I(i+k-1,j+l-1)K(k,l) \\&\quad \forall \; i=1,...,M-m+1, \; j=1,...,N-n+1 \end{aligned} \end{aligned}$$

In other words, the filter is slid across the width and height of I, and the dot product between K and the corresponding patch of I is calculated at every position (i, j). In this way, each O(i, j) is connected only to a small local region of the input I. The outputs of the convolutions with all filters are stacked along the depth dimension.
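The sliding-window operation described above can be written directly from the formula. The sketch below is a deliberately naive, loop-based NumPy version for a single-channel input and a single filter; the function name is hypothetical, and practical frameworks use much faster implementations. As in the equation, the filter is not flipped, which is the convention deep learning libraries call convolution.

```python
import numpy as np

def conv2d_valid(I, K):
    """Slide filter K over input I without padding (output is (M-m+1) x (N-n+1))."""
    I = np.asarray(I, dtype=float)
    K = np.asarray(K, dtype=float)
    M, N = I.shape
    m, n = K.shape
    O = np.empty((M - m + 1, N - n + 1))
    for i in range(M - m + 1):
        for j in range(N - n + 1):
            # Dot product between the filter and the local patch of the input
            O[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return O

example_input = np.arange(16).reshape(4, 4)
example_output = conv2d_valid(example_input, np.ones((3, 3)))
```

With a 4 x 4 input and a 3 x 3 filter, the output has size 2 x 2, matching the index ranges in the formula.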

3.2.2 Batch normalization

Batch normalization [26] was introduced to combat the effect of the change in the distribution of inputs from layer to layer. This was previously addressed by lowering the learning rate and carefully initializing the parameters of each layer. Batch normalization addresses this issue by normalizing each layer’s input, which enables the use of a high learning rate and hence accelerates the learning process. It also acts as a regularizer, strengthening the network by reducing overfitting.
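As an illustration of the normalization step, the sketch below applies the usual training-time batch normalization transform to a mini-batch of feature vectors: each feature is standardized with the batch mean and variance, then rescaled by a learnable scale `gamma` and shift `beta`. The function name and the value of `eps` are assumptions; the running statistics used at inference time are omitted for brevity.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over the batch axis.

    x     : array of shape (batch, features)
    gamma : per-feature scale, shape (features,)
    beta  : per-feature shift, shape (features,)
    eps   : small constant (an assumption) for numerical stability
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, (almost) unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
normalized = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to one and `beta` to zero, every feature of the output has approximately zero mean and unit standard deviation, regardless of the scale of the input batch.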

3.2.3 Activation function

The activation function defines the output at the neuron level. It is a mathematical function applied to the representation value. The obtained value determines whether the neuron should be activated ("fired") or not, which reflects whether the neuron input is relevant for the model’s prediction. Multiple activation functions have been proposed and studied in the literature. Historically, the sigmoid and hyperbolic tangent functions were used. More recently, more efficient activation functions have been proposed. The efficiency is reflected in better learning performance and in avoiding the notorious problem of vanishing gradients during the minimization of the network loss. \(ReLU(x)=\max (0,x)\) is a widely used activation function that reduces the likelihood of vanishing gradients.
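The following short sketch contrasts ReLU with the sigmoid in terms of their derivatives, which is where the vanishing gradient issue originates. The function names are hypothetical and the ten-layer depth is an arbitrary choice for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise
    return (np.asarray(x) > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Backpropagation multiplies one derivative per layer; with 10 layers:
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth   # best case for the sigmoid
relu_chain = relu_grad(1.0) ** depth         # active ReLU units
```

The sigmoid derivative never exceeds 0.25, so the product over ten layers is at most 0.25 to the power 10, which is below one millionth, whereas active ReLU units pass the gradient through unchanged.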

3.2.4 Pooling

The pooling layer is periodically inserted between convolution layers. Its purpose is to reduce the spatial size of the output, known as the representation, which results in reducing the number of parameters of the deep network. The most common type is max pooling. Specifically, a window of typically \(2 \times 2\) is used and downsampling is applied by keeping the maximum value of the representation within the window, hence discarding 75% of it. Other pooling operations, such as average pooling, can also be applied.
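A minimal NumPy sketch of non-overlapping 2 x 2 max pooling on a single feature map is given below. The function name is hypothetical, and cropping odd-sized inputs to an even size is an assumption made to keep the example short.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling on a 2-D feature map."""
    x = np.asarray(x)
    H2, W2 = x.shape[0] // 2, x.shape[1] // 2
    x = x[:H2 * 2, :W2 * 2]                    # assumption: drop a trailing odd row/column
    # Group pixels into 2 x 2 blocks, then keep the maximum of each block
    return x.reshape(H2, 2, W2, 2).max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
```

The 4 x 4 map (16 values) is reduced to 2 x 2 (4 values), i.e. three quarters of the representation are discarded.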

3.2.5 Classifier: Softmax

The top layer of a deep network is commonly a softmax. Given \(x=(x_1, x_2,...,x_K) \in \mathbb {R}^K\), let:

$$\begin{aligned} \textrm{softmax}(x_i;x) = \frac{e^{x_i}}{\sum _{j=1}^K e^{x_j}} \end{aligned}$$

The exponential function is applied to each element \(x_i\) of the input x, and the result is normalized by dividing it by the sum of all the exponentials. The output layer consists of K neurons, where K is the number of classes. The i\(^{th}\) output is the probability that the input belongs to the i\(^{th}\) class.
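The softmax can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical safeguard that leaves the result unchanged; it is an addition of this example, not something stated in the text, and the function name is hypothetical.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D score vector, returning probabilities that sum to 1."""
    x = np.asarray(x, dtype=float)
    z = x - x.max()          # shift for numerical stability; does not change the output
    e = np.exp(z)
    return e / e.sum()

# Hypothetical scores for the four outcomes of a learner
scores = np.array([2.0, 1.0, 0.5, -1.0])
probabilities = softmax(scores)
```

The predicted class is the index of the largest probability, which is also the index of the largest input score.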

3.2.6 Merge

The merge layer receives the output of each pathway and performs a simple concatenation at a specific dimension. Other possible merging approaches include summation, product, Compact Bilinear Pooling etc.
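As a small illustration of concatenation-based merging, the sketch below joins the outputs of two pathways along the feature axis. The function name, the batch size and the feature dimensions are hypothetical and chosen only for the example.

```python
import numpy as np

def merge_concat(image_features, tabular_features, axis=1):
    """Concatenate the outputs of two pathways along the feature axis."""
    return np.concatenate([image_features, tabular_features], axis=axis)

# Hypothetical shapes: 4 learners, 8 features from the image pathway
# (after flattening) and 3 features from the tabular pathway.
image_features = np.ones((4, 8))
tabular_features = np.zeros((4, 3))
merged = merge_concat(image_features, tabular_features)
```

Concatenation adds no trainable parameters: the merged vector simply has as many features as both pathways combined, and the subsequent dense layers learn how to weight them.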

3.2.7 Fully connected dense layer

The fully connected dense layer applies a linear transformation of the form \(Wx+b\), where W and b are the weight matrix and bias vector. This transformation is followed by an activation function. This layer consists of different units called neurons. Each neuron is connected to all neurons from the previous layer, hence the fully connected terminology. The number of neurons per layer is commonly known as the layer size. The fully connected dense layer is data agnostic, i.e. no assumptions are needed about the input data.
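A dense layer can be sketched in NumPy as below. The function name and the convention of storing W with one row per output neuron are assumptions of this example; inputs are processed as a batch of row vectors, so the transformation Wx+b is applied to each row.

```python
import numpy as np

def dense(x, W, b, activation=None):
    """Fully connected layer: computes W x + b for every row x of the batch.

    x : array of shape (batch, in_features)
    W : weight matrix of shape (out_features, in_features)
    b : bias vector of shape (out_features,)
    """
    z = x @ W.T + b
    return z if activation is None else activation(z)

def relu(z):
    return np.maximum(0.0, z)

x = np.array([[1.0, 2.0]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, -5.0])
hidden = dense(x, W, b, activation=relu)
```

Here the layer size is 3: each of the three neurons combines both inputs, and the third neuron is switched off by the ReLU because its pre-activation (1 + 2 - 5) is negative.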

4 Experimental results

In this section, we assess the performance of the proposed approach and validate it on the OULA dataset. Our experimental protocol is as follows:

  • We compare the proposed approach against: Support Vector Machine (SVM) with radial basis function kernel, Logistic Regression (LR), Deep Online Performance Prediction model (DOPP) [15] and Deep Online Performance Prediction model with fully connected layers (DOPPFCN) [15].

  • We assess the model under two formulations: binary and multiclass classification. In the first one, we consider the pass/distinction as a single class i.e. pass and fail/withdrawn as fail. For the second setting, we consider each outcome as a class on its own, i.e. 4-class classification problem.

  • As the courses’ duration is 39 weeks, we evaluate the proposed approach at different weeks e.g. 5th, 10th, 15th etc. We expect that as course progresses, hence more data become available, model accuracy will improve.

  • We also evaluate the model performance under two settings, intra- and inter-course outcome evaluation, i.e. we train the model on data of one specific course and evaluate it on data from that course and from other courses.

  • We demonstrate the importance of including assessment and demographic information of learners in the learning process by reporting the performance with and without this information.

  • For the binary classification case, we report both the Accuracy and the F1 score. For the multiclass classification case, we are interested in predicting the critical cases, i.e. students at risk of failure (Withdrawn and Fail), which represent the minority classes in the data. Hence, we report the Recall in addition to the F1 score, expressed as:

    $$\begin{aligned} \mathrm {F1 = \frac{2*Recall*Precision}{Precision+Recall}} \end{aligned}$$


    $$\begin{aligned} \mathrm {Precision = \frac{TP}{TP+FP}} \end{aligned}$$
    $$\begin{aligned} \mathrm {Recall = \frac{TP}{TP+FN}} \end{aligned}$$

    and where TP, FP and FN are the number of true positive, false positive and false negative samples, respectively.
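The metrics defined above can be computed for one class of interest as in the following sketch, written in plain Python. The function name and the toy labels are hypothetical. How per-class scores are averaged in the multiclass setting is not specified in the text, so the example reports a single class only.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (returning 0.0 there is an assumption)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with hypothetical labels
y_true = ["Fail", "Fail", "Pass", "Withdrawn", "Pass", "Fail"]
y_pred = ["Fail", "Pass", "Pass", "Fail", "Pass", "Fail"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="Fail")
```

In this toy example there are two true positives, one false positive and one false negative for the Fail class, so precision, recall and F1 all equal 2/3. Recall is the quantity of interest for at-risk learners because a false negative corresponds to a struggling student who goes undetected.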

We use the data of three courses: BBB, DDD and FFF. 80% of the data are used for training and 20% for testing. 20% of the training data are held out for validation, to monitor the model behavior during the learning process. The model configuration consists of 3 CNN blocks with 16, 32 and 32 filters, respectively, and 2 dense layers of 128 and 32 neurons, respectively. For binary classification, the loss is the binary cross-entropy, while for multiclass classification we use the categorical cross-entropy loss. The proposed model is trained using the Adam optimizer [27] with a learning rate of 0.00001 and a batch size of 32 in all our experiments.
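The data preparation implied by this protocol can be sketched as follows: mapping the four outcomes to the binary setting, splitting learners into training, validation and test sets, and truncating a daily click series at a given evaluation week. All names are hypothetical, and the use of a random (non-stratified) shuffle with a fixed seed is an assumption, since the text does not state how the split was drawn.

```python
import numpy as np

# Binary formulation: Pass/Distinction -> Pass, Fail/Withdrawn -> Fail
BINARY_MAP = {"Distinction": "Pass", "Pass": "Pass",
              "Fail": "Fail", "Withdrawn": "Fail"}

def to_binary(outcomes):
    return [BINARY_MAP[o] for o in outcomes]

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """80/20 train/test split, then 20% of the training part for validation."""
    rng = np.random.default_rng(seed)        # assumption: random, non-stratified split
    idx = rng.permutation(n)
    n_test = int(round(test_frac * n))
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = int(round(val_frac * len(train_full)))
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

def clicks_up_to_week(daily_clicks, week):
    """Keep only the clicks observed during the first `week` weeks."""
    return np.asarray(daily_clicks)[: 7 * week]

train, val, test = split_indices(1000)
early_clicks = clicks_up_to_week(np.zeros(39 * 7), week=5)
```

Under these fractions, 64% of the learners are used to fit the model, 16% to monitor training and 20% for testing; a prediction at week 5 only sees the first 35 days of the clickstream.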

Fig. 5 Samples of Gramian Angular Summation Field (GASF) of the clickstream per learner outcome: Fail outcome

Fig. 6 Samples of Gramian Angular Summation Field (GASF) of the clickstream per learner outcome: Withdraw outcome

Fig. 7 Samples of Gramian Angular Summation Field (GASF) of the clickstream per learner outcome: Pass outcome

Fig. 8 Samples of Gramian Angular Summation Field (GASF) of the clickstream per learner outcome: Distinction outcome

4.1 Gramian angular summation field of clickstream

Figures 5, 6, 7 and 8 illustrate the GASF images of the clickstream time series of learners for each outcome. A visual inspection of the images shows clearly different patterns between success (pass/distinction) and failure (fail/withdrawn). We also notice visual similarity between the two success outcomes, i.e. pass and distinction, and likewise between the fail and withdrawn outcomes. This suggests that the multiclass classification task is more challenging than the binary one.

4.2 Binary classification

Figures 9 and 10 depict the variation of the accuracy and F1 score for the three courses, BBB, DDD and FFF, for binary classification (Fail versus Success), where the classification is conducted at different weeks from week 5 to week 39. The findings show that, as the courses progress, the performance of all models improves as more click data become available. By the end of the course, the models reach their best accuracy and F1 score. Overall, the proposed approach achieved the best performance. For the BBB course, it significantly outperforms SVM, LR and DOPPFCN. For the DDD course, DOPP achieved the best performance up to week 30 and was then outperformed by our approach at weeks 35 and 39. For the FFF course, SVM failed to accurately classify learner performance. Our approach achieved the best classification performance at weeks 10, 15, 20, 25, 35 and 39.

Fig. 9 Accuracy evaluation for the three courses: Binary classification

Fig. 10 F1 score for the three courses: Binary classification

Fig. 11 Recall score for the three courses: Multiclass classification

4.3 Multiclass classification

We report in Figs. 11 and 12 the Recall and F1 scores for the classification of learners’ performance: Fail, Pass, Withdrawal and Distinction. Both scores are lower than in the binary classification task, as more confusion between Pass and Distinction and between Fail and Withdrawal is introduced. The proposed model achieved the best Recall and F1 scores for all three courses, indicating less confusion between the four classes and better detection of the minority classes, i.e. Withdrawal and Fail. In fact, the introduction of batch normalization contributed to reducing overfitting and guided the model towards better distinction between classes. SVM resulted in the lowest performance, with an F1 score not exceeding 0.5 for all weeks, while DOPP showed competitive results. As in the binary classification task, classification performance improved as the course progressed and more click data were obtained.

4.4 Intra and inter-course evaluation

In this experiment, we train the proposed model on data of a specific course and test it on data from the same and from other courses, for both binary and multiclass classification, at week 20. The results, detailed in Tables 2, 3, 4 and 5, show that our approach achieved good performance in the intra-course experiments and outperformed DOPP. It also achieved good results when trained on data of one course and tested on data of another. The proposed approach successfully extracted common features for cross-course learning, although the performance is lower than in the intra-course experiments.

Fig. 12 F1 score for the three courses: Multiclass classification

Table 2 Intra and Inter course binary classification results: performance of the proposed approach
Table 3 Intra and Inter course binary classification results: DOPP performance
Table 4 Intra and Inter course multiclass classification results: performance of the proposed approach
Table 5 Intra and Inter course multiclass classification results: DOPP performance

4.5 Importance of including extra information of learners

Fig. 13 F1 Score: Proposed approach trained with and without extra data at the 20th week

Fig. 14 F1 Score: Proposed approach trained with and without extra data at the 39th week

To demonstrate the importance of including the extra information, we assess the performance of the proposed model when trained with and without the non-click data. Binary classification experiments are conducted at weeks 20 and 39. The results, illustrated in Figs. 13 and 14, demonstrate that classification performance improves for all courses when the model is trained with the extra non-click data.

5 Conclusion

We addressed the problem of predicting learners’ outcomes in online learning environments based on their interaction during online sessions, together with additional demographic and assessment data. Our approach relies on the time series nature of the clicks and uses the Gramian Angular Summation Field to transform these time series into images. The proposed model, trained on both the click images and the additional data, achieved competitive performance when tested at different weeks of the courses. The findings also indicate that interaction is closely associated with student academic outcome. Hence, more attention and research effort should be dedicated to the development and implementation of new learning techniques and methodologies that keep learners engaged in the online session. This aspect is critical as E-learning is becoming a viable learning option. In future work, we will investigate residual architectures as a potential upgrade to the proposed model, with the objective of reducing the confusion between classes. We also plan to investigate the potential of other imaging techniques, such as the Recurrence Plot, and their combination with the Gramian Angular Field. A further direction is to apply state-of-the-art attention models, which are well suited to sequential data and have led to breakthroughs in natural language processing.