1 Introduction

The COVID-19 pandemic has forced many schools and universities to switch to e-learning, also known as online distance learning [15]. E-learning involves using the Internet and related technologies for learning, teaching, and managing courses in an organization [1]. E-learning has been widely accepted as a significant educational platform not only by organizations but also by teachers and students [27]. Beyond the COVID-19 pandemic, the expansion of e-learning is driven by its own benefits, chief among them a variety of learning materials, cost-effectiveness, and self-pacing [1, 28, 32]. The flexibility of time and place is perhaps the most crucial of these advantages and has contributed greatly to the spread of e-learning.

Despite these advantages, one of the main disadvantages of e-learning is the lack of interaction between teachers and students. Some aspects of education, including learning with peers and interacting with professors, evidently cannot be replaced by online formats [15]. These disadvantages often reduce the effectiveness of education and result in learning loss for many students. In particular, e-learning demands immense self-motivation and self-discipline from students, which poses significant challenges. Several attempts have been made to overcome these limitations; enhancing the interaction between teachers and students and installing systems that monitor learning progress are often considered appropriate approaches.

Concentration plays an essential role in learning, and it has become even more critical in online education. Effective and efficient assessment of e-learners' concentration is crucial for providing the necessary feedback to learners and tutors. The development of effective and customizable intelligent tutoring systems (ITSs) has been proposed to understand a learner's knowledge, emotions, and concentration [17]. De Carolis et al. [8] argued that it is important to develop personalized e-learning environments that can customize the learning experience of students.

This paper aims to develop a methodology for predicting e-learners' concentration by applying recurrent neural network models to eye gaze and facial landmark data extracted from e-learners' videos. One hundred eighty-four videos of ninety-two e-learners were obtained, and their features were extracted using the OpenFace 2.0 toolkit. The data were then divided into 5-s units, and their concentration levels were labeled by education experts. Recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) models were compared in the experiments. The proposed methodology is expected to predict the concentration level of students in a natural e-learning environment, thereby increasing the effectiveness of education by facilitating feedback between students and e-learning systems.

The structure of the paper is as follows. The relevant theories and literature are reviewed in Sect. 2. Section 3 explains the proposed RNN-based concentration classification model for e-learners. The experimental results are presented in Sect. 4. Finally, Sect. 5 discusses the benefits and limitations of our methodology.

2 Literature review

2.1 Recurrent neural networks

Recurrent neural networks (RNNs) are a variant of artificial neural networks (ANNs) capable of selectively passing information across sequence steps while processing sequential data one element at a time [22]. RNNs were proposed to handle sequential data by overcoming a major limitation of conventional ANNs: the assumption of independence among data points. They can model inputs and/or outputs consisting of sequences of elements that are not independent, and they can simultaneously model sequential and temporal dependencies on multiple scales.

RNNs have been successfully applied to numerous applications, including time-series prediction [34, 37], speech recognition [9, 16], image classification [26], and video analysis [40], where a model effectively captures the dynamics of sequences via cycles in the network nodes.

Training on time-series data often requires information from both the past and the future of a specific time frame [24], for which bidirectional RNNs have been proposed. A bidirectional RNN splits the state neurons of a regular RNN into a forward state (positive time direction) and a backward state (negative time direction); outputs from forward states are not connected to inputs of backward states, and vice versa. Because bidirectional RNNs have shown good performance in modeling time-series data, they were adopted in our model.

Another limitation of traditional RNNs is the vanishing gradient problem. To overcome it, Hochreiter and Schmidhuber [12] introduced the long short-term memory (LSTM) model. Unlike a traditional RNN unit, an LSTM unit maintains a gated memory cell, allowing it to better capture dependencies across an entire sequence of data.

Gated recurrent units (GRUs) are another notable approach to the vanishing gradient problem. GRUs create shortcut paths that bypass multiple temporal steps [7]. These shortcuts allow errors to be back-propagated easily without vanishing as a result of passing through multiple bounded nonlinearities, thus reducing the difficulty caused by vanishing gradients.

A GRU adaptively makes each recurrent unit capture dependencies at different time scales. Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but without separate memory cells. While the LSTM uses input, forget, and output gates, the GRU performs its gating with only reset and update gates and no memory cell, so it has fewer parameters and greater computational efficiency. In this study, three models using bidirectional RNNs, LSTM, and GRU were proposed, and comparison experiments were conducted using video data collected from real e-learners.

2.2 Related works on e-learning

Recently, there has been increasing research interest in areas related to e-learning. To address the lack of interaction between teachers and students, Troussas et al. [36] proposed an alternative educational tool built on a social network service. Lopez et al. [24] presented a comparative study of the effectiveness of face-to-face and remote educational escape rooms. Stevens [33] compared learning outcomes within and between online and face-to-face education.

Face retrieval and recognition are essential in computer vision and in e-learning environments as well [25]. Lin et al. [21] proposed a cloud-based face video retrieval system using deep learning. Cognitive theory has also been utilized in e-learning research. Wen et al. [38] presented a chaos optimization cognitive learning model, in which the learning process of distance learning was formulated as a multi-objective optimization problem. Liu and Peng [23] proposed an online user focus evaluation system in which eye tracking and face recognition technologies were combined with cognitive theory to evaluate the concentration of students.

Several attempts have been made to determine the concentration level of e-learners based on their behavior and biological information. Asteriadis et al. [2] presented a neuro-fuzzy inference system that utilizes the position and movement of the eyes and irises of an e-learner to determine the concentration level in the context of reading an electronic document. To monitor the concentration level of e-learners, Lee et al. [19] utilized the pupillary responses and eye-blinking patterns of students. A one-class support vector machine (SVM) was used to determine the concentration levels. Li et al. [20] utilized data collected by a webcam and a mouse to determine the concentration levels of e-learners. SVM techniques were applied to identify useful features for recognizing human attention levels.

Convolutional neural networks (CNNs) have been widely used for image classification [35], and they have also been used to determine the concentration levels of students. Hasnine et al. [10] extracted six types of basic emotions with a pre-trained CNN and used them to detect the concentration level of students in a virtual classroom. Sharma et al. [31] proposed a CNN-based machine learning system for student engagement detection using emotion analysis, eye tracking, and head movement captured by a web camera. Although these CNN-based methods are noteworthy, they share a weakness: because they operate on still images, they cannot capture the sequential and temporal nature of e-learners' responses and can therefore hardly represent an actual e-learning environment. Consequently, RNNs have attracted researchers' interest as a way to effectively capture the dynamics of sequential data obtained from videos.

Sharma et al. [30] presented LIVELINET to estimate the liveliness of educational videos. While LIVELINET combines audio and visual information to predict the liveliness of educational videos using convolutional neural networks and LSTM, it does not utilize the behavior and biological information of e-learners.

De Carolis et al. [8] presented a method to determine the concentration, also referred to as engagement, of e-learners using LSTM. The OpenFace toolkit was used to extract the necessary features from the video data, and LSTM was applied to features consisting of eye gaze, facial landmarks, head pose, and facial expressions to predict the degree of concentration. Engagement was evaluated subjectively in their study through a questionnaire based on the psychological notion of "flow". Although the method is noteworthy, the limited dataset and the subjective nature of the questionnaire pose limitations for practical applications: students had to answer questionnaires to assess their own engagement. In practice, the need for questionnaires or special instruments incurs additional costs and causes difficulties in real e-learning environments. Therefore, research is needed to determine the degree of learning concentration by extracting various features using only the videos obtained in an actual e-learning environment.

3 Methods

3.1 Overview

Figure 1 shows the overall procedure of our study. First, video data of e-learners were collected and preprocessed as sequential temporal data so that they could be used as input for the RNN models. Each dataset was labeled with its concentration level prior to the supervised learning tasks. Three RNN variants, a vanilla RNN, LSTM, and GRU, were used in the experiment, along with an SVM baseline model.

Fig. 1
figure 1

Overview of procedures

3.2 Participants

Ninety-two undergraduate students between the ages of 20 and 31 participated in the experiment. Prior to video recording, participants were given a consent form covering the provision and use of their personal information. The recording resolution was 480 × 640 pixels at a frame rate of 30 frames per second. During recording, interference with participants was minimized; they received only the instructions necessary to conduct the experiment.

3.3 Procedures

Two distinct online lectures were used in the experiments: an interesting lecture and an interest-inhibiting lecture, shown to elicit different learner behaviors. During the experiments, the participants watching the lectures were unaware of this difference between the lectures.

The first lecture, intended to evoke interest, was a famous history lecture. The second, a mathematics lecture selected from MIT OpenCourseWare, was intended to evoke boredom. All participants watched the first lecture for about 9 min and the second for approximately 15 min. To control environmental variables effectively, recording was performed only in a laboratory, with a camera located at the upper center of the monitor. One hundred eighty-four videos were obtained from the participants.

3.4 Data preprocessing

The video data were converted to structured data using the OpenFace toolkit, a tool for facial behavior analysis [3]. The output provided by OpenFace consists of a point distribution model (PDM) of facial landmark locations, head pose, eye gaze, facial expressions, and facial action units (AUs). Among these outputs, the facial landmark locations, head pose, and eye gaze information were mainly utilized in our model. Figures 2 and 3 show the 2D eye landmarks and 2D facial landmarks, respectively, as detected by OpenFace.

Fig. 2
figure 2

2D eye landmarks as detected by OpenFace

Fig. 3
figure 3

2D facial landmarks as detected by OpenFace

Each PDM data point comprises three-dimensional coordinates (X, Y, Z). Among the PDM data, sixteen iris points from the eye landmark data and seventeen face contour points from the facial landmark data were utilized: points #20–#27 and #48–#55 from the eye landmarks in Fig. 2 and points #0–#16 from the facial landmarks in Fig. 3. In addition, two eye gaze vectors with (X, Y, Z) coordinates and one eye gaze direction with (X, Y) coordinates were used in our model. A total of 109 features were used; their details are presented in Table 1.

Table 1 Description of data features
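For concreteness, the following minimal sketch shows how such columns could be selected from an OpenFace 2.0 output CSV. The file name is hypothetical, and the column names follow OpenFace 2.0 naming conventions (which can vary slightly between versions); the full 109-feature set is specified in Table 1, so this sketch only illustrates the landmark and gaze selection described above.

```python
import pandas as pd

# Hypothetical OpenFace 2.0 output file; column names follow OpenFace 2.0
# conventions (3D eye landmarks eye_lmk_X/Y/Z_*, 3D face landmarks X/Y/Z_*,
# gaze vectors gaze_0_* / gaze_1_*, gaze direction gaze_angle_x/y).
df = pd.read_csv("participant_001.csv")

eye_points = list(range(20, 28)) + list(range(48, 56))  # iris landmarks #20-#27, #48-#55
face_points = list(range(0, 17))                        # face contour landmarks #0-#16

cols = []
for i in eye_points:
    cols += [f"eye_lmk_X_{i}", f"eye_lmk_Y_{i}", f"eye_lmk_Z_{i}"]
for i in face_points:
    cols += [f"X_{i}", f"Y_{i}", f"Z_{i}"]
cols += ["gaze_0_x", "gaze_0_y", "gaze_0_z",
         "gaze_1_x", "gaze_1_y", "gaze_1_z",
         "gaze_angle_x", "gaze_angle_y"]

features = df[cols].to_numpy()  # one feature vector per video frame
```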

3.5 Data sets

The video data were divided into 5-s units. Each unit was given a binary label according to whether the learner appeared to be concentrating. Three education experts reviewed the videos, and majority voting determined the final labels.

Although recording proceeded with the learner positioned at the center, each participant appeared at a slightly different location on the screen and often changed position during the experiment. Thus, the data were scaled so that the head positions of the participants were aligned as closely as possible.

The video data were preprocessed to obtain 150 frames per unit. Because each recording contains motion noise from shooting preparation at its beginning, frames were used starting from t = 150 (i.e., the first 5 s were discarded). A total of 27,026 sequences were used in the experiment. Each 5-s clip was modeled as a temporal sequence \(\{x_{1}, x_{2}, \dots, x_{t}, \dots, x_{T}\}\), where \(x_{t}\) (t = 1, 2, …, 150) is a vector representing the input data at time instant t. The data were divided into training, validation, and test sets at a ratio of 8:1:1.
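Under these assumptions (150-frame clips at 30 fps, the first 150 frames discarded, one binary label per clip), the sequence construction and 8:1:1 split could be sketched as follows; the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

SEQ_LEN = 150   # 5 s at 30 fps
WARMUP = 150    # discard the first 5 s of shooting-preparation noise

def to_sequences(features, labels):
    """Cut one recording into consecutive 150-frame clips.

    features: (num_frames, num_features) array from OpenFace
    labels:   one binary concentration label per 5-s clip
    """
    usable = features[WARMUP:]
    n_clips = len(usable) // SEQ_LEN
    xs = [usable[k * SEQ_LEN:(k + 1) * SEQ_LEN] for k in range(n_clips)]
    return np.stack(xs), np.asarray(labels[:n_clips])

def split_811(X, y, seed=0):
    """Shuffle and split into training, validation, and test sets (8:1:1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr, n_va = int(0.8 * len(X)), int(0.1 * len(X))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```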

3.6 Modeling

Figure 4 illustrates the overall architecture of the proposed RNN model. The sequential temporal data are fed into the bidirectional RNN layers, go through the normalization process, and pass through deep neural network layers to generate a binary classification of concentration levels.

  1. (1)

    RNN

Fig. 4
figure 4

Architecture of RNN

For a given sequence x = (x1, x2, …, xT), the recurrent state ht is determined from the recurrent state ht−1 at the previous time step and the current input xt through a transition function [7, 15]; the output ot of the RNN cell is then determined as follows:

$$h_{t} = f\left( {x_{t} ,h_{t - 1} ;\theta } \right) = \tanh \left( {W_{x} x_{t} + W_{h} h_{t - 1} + b_{h} } \right)$$
(1)
$$o_{t} = W_{o} h_{t} + b_{o}$$
(2)

where h0 = 0 and \(\theta\) denotes the parameters of the function f. The W terms and b terms are the weight matrices and bias vectors of the corresponding layers. The hyperbolic tangent activation function, tanh(·), guarantees that the recurrent state ht remains within the range (− 1, 1). Figure 5 shows the structure of the RNNs used in our experiment.

Fig. 5
figure 5

Illustration of RNNs
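Eqs. (1) and (2) translate directly into a single step function. The following is a minimal NumPy illustration of one vanilla-RNN step, not the library implementation used in our experiments.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, bh, Wo, bo):
    """One vanilla-RNN step, following Eqs. (1)-(2)."""
    h_t = np.tanh(Wx @ x_t + Wh @ h_prev + bh)  # recurrent state, bounded in (-1, 1)
    o_t = Wo @ h_t + bo                         # cell output
    return h_t, o_t
```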

As illustrated in Fig. 6, a bidirectional RNN computes both the forward hidden sequence \(\overrightarrow{h}\) and the backward hidden sequence \(\overleftarrow{h}\) [9, 29]. The output sequence is given by iterating the backward layer from t = T to 1 and the forward layer from t = 1 to T.

  1. (2)

    LSTM

Fig. 6
figure 6

Illustration of bidirectional RNNs

As shown in Fig. 7, each LSTM unit maintains a memory Ct at time t [7]. The activation of the LSTM unit ht is

$$h_{{\text{t}}} = o_{{\text{t}}} *\tanh \left( {C_{{\text{t}}} } \right),$$
(3)

where ot is an output gate. The output gate is determined by

$$o_{{\text{t}}} = \sigma \left( {W_{{\text{o}}} \left[ {h_{t - 1} ,x_{{\text{t}}} } \right] + b_{{\text{o}}} } \right),$$
(4)

where σ is the logistic sigmoid function and bo is a bias vector. The memory cell Ct and the new memory cell \(\widetilde{{C_{t} }}\) are then given by

$$C_{{\text{t}}} = f_{{\text{t}}} *C_{t - 1} + i_{{\text{t}}} *\widetilde{{C_{{\text{t}}} }}, {\text{and}}$$
(5)
$$\widetilde{{C_{{\text{t}}} }} = \tanh \left( {W_{{\text{C}}} \cdot \left[ {h_{t - 1} ,x_{{\text{t}}} } \right] + b_{{\text{C}}} } \right).$$
(6)
Fig. 7
figure 7

Illustration of LSTM. i, f, and o represent the input, forget, and output gates, respectively. C is the memory cell and \(\widetilde{C}\) is the new memory cell

A forget gate ft and an input gate it are given by

$$f_{{\text{t}}} = \sigma \left( {W_{{\text{f}}} \cdot \left[ {h_{t - 1} ,x_{t} } \right] + b_{{\text{f}}} } \right), {\text{and}}$$
(7)
$$i_{{\text{t}}} = \sigma \left( {W_{{\text{i}}} \cdot \left[ {h_{t - 1} ,x_{{\text{t}}} } \right] + b_{{\text{i}}} } \right).$$
(8)
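For illustration, Eqs. (3)–(8) can be collected into a single step function. The following minimal NumPy sketch assumes, as in the equations above, that each weight matrix acts on the concatenation [ht−1, xt]; it is not the implementation used in the experiments.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step, following Eqs. (3)-(8).

    W and b are dicts of weight matrices and bias vectors, each acting
    on the concatenation [h_{t-1}, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ hx + b["f"])        # forget gate, Eq. (7)
    i = sigmoid(W["i"] @ hx + b["i"])        # input gate, Eq. (8)
    c_tilde = np.tanh(W["c"] @ hx + b["c"])  # new memory cell, Eq. (6)
    c_t = f * c_prev + i * c_tilde           # memory cell update, Eq. (5)
    o = sigmoid(W["o"] @ hx + b["o"])        # output gate, Eq. (4)
    h_t = o * np.tanh(c_t)                   # unit activation, Eq. (3)
    return h_t, c_t
```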
  1. (3)

    GRU

As shown in Fig. 8, the GRU [5, 7] is designed to adaptively capture dependencies at different time scales using a more sophisticated transition function. The recurrent state ht is given as

$$h_{{\text{t}}} = \left( {1 - z_{{\text{t}}} } \right) \odot \widetilde{{h_{{\text{t}}} }} + z_{{\text{t}}} \odot h_{t - 1} ,$$
(9)

where

$$z_{{\text{t}}} = \sigma \left( {W_{{{\text{xz}}}} x_{{\text{t}}} + W_{{{\text{hz}}}} h_{t - 1} + b_{{\text{z}}} } \right),$$
(10)
$$r_{{\text{t}}} = \sigma \left( {W_{{{\text{xr}}}} x_{{\text{t}}} + W_{{{\text{hr}}}} h_{t - 1} + b_{{\text{r}}} } \right),{\text{ and}}$$
(11)
$$\widetilde{{h_{{\text{t}}} }} = \tanh \left( {W_{{{\text{xh}}}} x_{{\text{t}}} + W_{{{\text{hh}}}} \left( {r_{{\text{t}}} \odot h_{t - 1} } \right)} \right).$$
(12)
Fig. 8
figure 8

Illustration of gated recurrent units. r and z are the reset and update gates, respectively

Note that \(\odot\) denotes the element-wise multiplication operator.
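Analogously, Eqs. (9)–(12) amount to the following step function. This is a minimal NumPy sketch with an illustrative parameter layout, not the implementation used in the experiments; note that Eq. (12) carries no bias term, and the sketch follows it.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W):
    """One GRU step, following Eqs. (9)-(12). W is a dict of parameters."""
    z = sigmoid(W["xz"] @ x_t + W["hz"] @ h_prev + W["bz"])    # update gate, Eq. (10)
    r = sigmoid(W["xr"] @ x_t + W["hr"] @ h_prev + W["br"])    # reset gate, Eq. (11)
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r * h_prev))  # candidate state, Eq. (12)
    return (1.0 - z) * h_tilde + z * h_prev                    # new state, Eq. (9)
```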

  1. (4)

    Configurations

Six configurations were used in training, comprising one or two bidirectional RNN layers combined with one to three deep neural network (feed-forward) layers. Vanilla RNN, LSTM, and GRU cells were applied to each configuration, along with batch normalization and dropout. Note that both the LSTM and GRU models were constructed on the basis of bidirectional RNNs. A minimal sketch of one such configuration is given below.
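The following is one plausible Keras realization of the best-performing configuration reported in Sect. 4 (one bidirectional GRU layer and two feed-forward layers). The layer widths and dropout rate are illustrative assumptions, as the exact values are not reported; swapping `layers.GRU` for `layers.SimpleRNN` or `layers.LSTM` yields the other two variants.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 109  # per-frame features (Table 1)
SEQ_LEN = 150       # 5-s clips at 30 fps

# Illustrative hyper-parameters; the layer widths and dropout rate
# below are assumptions, not values reported in the paper.
model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),
    layers.Bidirectional(layers.GRU(64)),   # one bidirectional GRU layer
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),    # FF layer 1
    layers.Dense(32, activation="relu"),    # FF layer 2
    layers.Dense(1, activation="sigmoid"),  # binary concentration output
])
```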

Prior to the comparative experiments, the following requirements were considered in selecting a proper baseline classifier. First, the classifier needed to perform well with a limited number of data samples while minimizing overfitting. Second, it needed to be capable of nonlinear classification [18]. In addition, it should have been used in related works [19, 20] to allow comparison. Upon a review of machine learning approaches in the relevant literature, SVMs were identified as the baseline classifier best suited to these requirements.

By standardizing the inputs to a layer for each mini-batch, batch normalization stabilizes the learning process and accelerates the training of deep neural networks. It reduces internal covariate shift, that is, changes in the distributions of the internal nodes of a deep network [14].

To optimize the binary cross-entropy loss function shown in Eq. (13), Nesterov-accelerated adaptive moment estimation (Nadam) was used. Nadam is an extension of the adaptive moment estimation (Adam) algorithm [17] that incorporates Nesterov's accelerated gradient (NAG) and can improve the performance of the optimization [6].

$${\text{Loss Function}} = - \sum\limits_{i = 1}^{C = 2} t_{i} \log \left( {f\left( {s_{i} } \right)} \right) = - t_{1} \log \left( {f\left( {s_{1} } \right)} \right) - \left( {1 - t_{1} } \right)\log \left( {1 - f\left( {s_{1} } \right)} \right)$$
(13)

The learning rate is "the single most important hyper-parameter" [4] in training neural networks. Learning rate decay (lrDecay) is a de facto standard technique for training modern neural networks, in which an initially large learning rate is adopted and then decayed by a certain factor after pre-defined epochs. Popular deep networks such as ResNet [11] and DenseNet [13] are trained by stochastic gradient descent (SGD) with lrDecay.

As it has been empirically observed that learning rate decay helps to learn complex patterns [39], the learning rate decay was set to \(1 \times 10^{-5}\) with an initial learning rate of \(1 \times 10^{-4}\). Although the maximum number of epochs was set to 300, training was terminated early if the validation loss did not decrease for 50 epochs. The batch size was set to 256. A sketch of this training configuration is given below.
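The reported settings could be expressed in Keras roughly as follows. This is a hedged sketch: `model`, `X_train`, `y_train`, `X_val`, and `y_val` are assumed from the earlier sketches, and the legacy `decay` argument stands in for the reported learning rate decay.

```python
from tensorflow import keras

# Settings reported in the text: Nadam, learning rate 1e-4, decay 1e-5,
# up to 300 epochs, early stopping after 50 epochs without improvement,
# batch size 256. The `decay` argument follows the legacy tf.keras
# optimizer API; newer Keras versions use learning-rate schedules instead.
optimizer = keras.optimizers.Nadam(learning_rate=1e-4, decay=1e-5)

model.compile(optimizer=optimizer,
              loss="binary_crossentropy",  # Eq. (13)
              metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=50,
                                           restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=300,
                    batch_size=256,
                    callbacks=[early_stop])
```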

The specifications of the computational machine include an AMD Ryzen 7 3.20 GHz processor with 32 GB of RAM, and an NVIDIA GeForce RTX 3070 GPU running the 64-bit Windows 10 operating system. The Keras Python library was used on top of a source build of TensorFlow.

4 Experimental results

The experimental results are summarized in Table 2. Overall, the RNN models performed better than the baseline SVM method. Among them, the GRU model with one RNN layer and two FF layers provided the best performance, with an accuracy of 0.8431; its recall and precision were 0.8512 and 0.9077, respectively.

Table 2 Summary of experiments. "True" means that the participant in the video is concentrating on learning, and "False" means otherwise. Each experiment was repeated five times; the mean and standard deviation are reported in the table

Figures 9–11 present the comparison of the RNN models through accuracy/loss plots and ROC curves. Figure 9 shows the accuracy/loss and AUC plot of the vanilla bidirectional RNN with two RNN layers and three FF layers; its validation loss reaches a minimum at epoch 90, and the AUC is 0.8664.

Fig. 9
figure 9

Accuracy/loss and AUC plot of RNNs

Figure 10 shows the accuracy/loss and AUC plot of the LSTM with one RNN layer and two FF layers; its validation loss reaches a minimum at epoch 15, and the AUC is 0.9076. Note that the LSTM model converges to the minimum loss faster than the other two models but shows overfitting after a certain number of epochs.

Fig. 10
figure 10

Accuracy/loss and AUC plot of LSTM

Figure 11 shows the accuracy/loss and AUC plot of the GRU with one RNN layer and one FF layer; its validation loss reaches a minimum at epoch 42, and the AUC is 0.9210. While the GRU model approaches the minimum loss gradually, it shows instability after a certain number of epochs.

Fig. 11
figure 11

Accuracy/loss and AUC plot of GRU

5 Conclusion

This study explored the use of RNNs to determine the concentration of students in an e-learning environment. Three RNN models, namely bidirectional RNNs, LSTM, and GRU, were utilized, along with an SVM baseline model. A total of 27,026 data sequences obtained in a natural e-learning environment were used in the experiment. Overall, the RNN models proved suitable for predicting the concentration of students, showing better performance than the baseline model. Among the RNN models, the GRU exhibited the best performance, with an overall accuracy of 84.3%.

The contributions of this work are summarized as follows. Our main contribution lies in designing a prediction model for e-learners' concentration in an actual e-learning environment; ours is one of the few studies implemented in such a realistic setting. The proposed model does not require any additional questionnaires or special instruments, which makes it easy to implement in an online education system. The detailed procedures of the model, including data collection, preprocessing, data modeling, and testing, were presented, which can stimulate research in this area. To effectively evaluate e-learners' concentration, architectures and configurations of the bidirectional RNN, LSTM, and GRU models were proposed, and comparative experiments were conducted to demonstrate the usefulness of the proposed model. Finally, the applicability of the proposed models was examined.

Despite these contributions, we must acknowledge the limitations of our approach. Significantly, our model was applied only to data collected in a well-structured setting. Moreover, video data must be transformed and preprocessed before being applied to our model, which requires additional time and effort in a real application. Automating these processes would improve the usability of the proposed system.

As our approach has focused on RNNs, LSTM, and GRU, expanding the model to other architectures, such as CNNs and temporal convolutional networks (TCNs), is left for future work. Another limitation concerns the robustness of the model: the experiments were conducted in a controlled environment, whereas real e-learning environments involve unexpected situations that our model does not consider. For example, students may excessively change their posture or even leave their seats during an online lecture, which may cause problems for our model. Thus, developing a model that can effectively handle such situations would be a suitable topic for future research.