Introduction

Online learning platforms, such as Blackboard, Khan Academy, and Moodle, have revolutionized how students learn by providing learner-centric environments [1]. These platforms augment traditional education systems, facilitate learners’ adaptation to contemporary technological approaches [2], and offer a broad spectrum of courses catering to all learning levels. Online learning systems also house educational resources, enable instructors to share materials with learners, and give learners easy access to lecture notes, quizzes, and exams, among other activities, contributing to the formation of educational big data [3]. Furthermore, online learning allows instructors to monitor student behavior and improve student–teacher interactions. With these tools, data mining techniques can provide intelligent services that enhance student learning efficacy [4].

One of the intelligent services that has emerged as an area of keen interest in big data analysis in education is forecasting academic performance [3]. Academic performance prediction is an efficient method for managing learning and assessing the learning process [5]. It can help instructors identify students at risk of failing or dropping out, provide timely feedback and intervention, and personalize learning paths and resources. It can also help students monitor their progress, adjust their learning strategies, and improve their motivation and self-regulation [6]. Numerous studies have been undertaken in the realm of academic performance prediction [7]. Conventional methods, such as linear regression (LR), random forests (RF), artificial neural networks (ANN), and support vector machines (SVM), have proven effective in this endeavor [6, 8]. Deep neural network-based methods, including recurrent neural networks (RNN) [9], convolutional neural networks (CNN) [10], and attention networks [5], have also made significant progress in the era of educational big data.

However, predicting academic performance is a challenging task that requires capturing the complex relationships in online learning activities and student attributes, which is essential for improving prediction accuracy [8]. While conventional machine learning and deep learning-based models have made significant progress in this endeavor, many of these methods struggle to exploit both the structural information derived from interaction activities and the attribute feature spaces of students [11]. Moreover, existing methods often ignore the temporal dynamics and long-term dependencies of learning behavior and performance data, which can reveal students’ academic states and learning patterns. The advent of graph neural networks (GNNs) has broadened the application scope of deep neural networks to encompass graph data. Graph neural networks are a class of deep learning methods that can process graph-structured data and capture relationships between nodes and edges [12]. GNNs use a recursive aggregation strategy to gather information from neighboring nodes, effectively utilizing node features and graph structure for advanced node classification tasks [13]. This approach offers the potential to represent and uncover relationships between students [14]. For example, a graph-based model called MTGNN has been developed to encode learning behavior data using graph structures, which can preserve valuable information for performance prediction [15]. However, research has shown that the sequential patterns of learning behaviors or interaction activities can reveal students’ academic states [16, 17]. This suggests that encoding online learning behavior data in graph structures with temporal properties may better preserve valuable learning cues for predicting academic performance. As such, developing a graph-based model that can capture both the static relationships between learners and the dynamic interaction patterns among learning activities is a promising solution for academic performance prediction tasks.

To address these challenges, we propose a novel model for predicting academic performance in this paper. This model harnesses the power of dual graph neural networks to effectively utilize both the interaction activities and the attribute features of students. Using graph neural networks, we can represent and uncover the complex and dynamic interaction patterns among students and learning activities and the similarities and differences among hierarchical feature spaces of student attributes.

The proposed model, APP-DGNN, consists of two main components: an interaction-based graph convolutional network (IGCN) and an attribute-based graph convolutional network (AGCN). The IGCN module learns local academic performance representations from online interaction activities, which reflect students’ engagement, behavior, and performance in different periods. The AGCN module learns global academic performance representations from the attribute features of all students using dynamic graph convolution operations, which capture the similarities and differences among students in terms of demographics, academic background, and learning behavior. The learned representations from the local and global levels are combined in a local-to-global representation learning module to generate predicted academic performances.

The major contributions are summarized as follows:

  • An academic performance prediction model is introduced that utilizes dual graph neural networks to leverage both the interaction activities and the attribute features of students, resulting in highly accurate model predictions.

  • An interaction-based graph neural network is designed with a novel graph convolutional operator to extract helpful information from online learning activities.

  • An attribute-based graph neural network is developed with novel dynamic relation graph convolution blocks to exploit the structural information from learners’ attribute features.

  • Extensive experiments are conducted on a widely recognized public dataset derived from a real-world educational application, demonstrating that the proposed model significantly outperforms existing methods.

The rest of the paper is organized as follows. Section “Related works” reviews the related works on predicting academic performance and graph neural networks in educational applications. Section “Methods” introduces the proposed APP-DGNN model, which uses dual graph neural networks to exploit both the interaction activities and the attribute features of students. Section “Case study” presents the experimental results and discussions on a public dataset and compares the performance of APP-DGNN with several baseline models. Section “Conclusions” concludes the paper and discusses future directions.

Related works

Predicting academic performance

Predicting academic performance is a challenging task that has been the focus of many researchers [18]. Various conventional machine learning techniques have been applied to predict academic performance [19]. For example, Julie et al. used linear regression, decision tree, and Naive Bayes classifiers to identify factors affecting academic performance in K-12 education, with Naive Bayes achieving the highest accuracy [20]. Burgos et al. applied logistic regression to foresee student attrition in distance learning courses, which effectively decreased the number of withdrawals [21]. Mohd et al. compared logistic regression and artificial neural networks in predicting student depression and found that artificial neural networks achieved the best accuracy [22]. Riestra et al. conducted an in-depth analysis of Learning Management System (LMS) log files using distinct machine learning techniques to predict academic performance across various learning periods. In addition, they employed a clustering algorithm to distinguish six unique student groups and discovered a significant correlation between the interaction patterns of these groups and their academic performance [23]. Suresh et al. collected a large amount of data on students’ meta-information and learning experiences to build a robust SVM model for predicting learning styles in an e-learning environment, which was then used for further research on academic performance [24]. Xu et al. employed a decision tree, a neural network, and a support vector machine to model the correlations between internet usage behaviors and undergraduate academic performance. Their findings affirmed that internet usage data can differentiate and forecast academic performance [6]. Priya et al. proposed an EDM framework with a rule-based recommendation system for analyzing academic performance and providing explanations by gathering information from various sources. This framework can identify individual student weaknesses and provide practical suggestions for improvement [25]. Alshabandar et al. designed two models to separately predict student assessment grades and final performances. Their experiments showed that random forest had the lowest RMSE for assessment grades, while GBM had the best score for final performance prediction [26].

Other research has focused on using deep learning techniques for better prediction on real teaching scenario datasets [27]. Poudyal et al. developed a hybrid 2D Convolutional Neural Network (CNN) that transformed student learning process data into 2D image data for predicting student scores and found it to surpass baseline models such as Decision Tree and K-Nearest Neighbors (KNN) in accuracy [28]. Ali et al. devised a method that integrates Convolutional Neural Networks (CNNs) with traditional machine learning techniques to predict low-performing students, where features were extracted from student interests and learning patterns using CNN and then filtered with the mRMR method. Finally, an LDA classifier was employed to generate a probability distribution [29]. He et al. used RNN-GRU joint neural networks to capture time-series features of student behavior in virtual learning environments and identify at-risk students. Their results showed that simple RNN and GRU models may have higher accuracy than LSTM and that a joint model achieved 80% accuracy in predicting at-risk students [30]. Liu et al. built an RNN fused with an attention mechanism as a knowledge-tracing model. This framework utilized student behavior and exercise features to achieve a score of up to 98%, thereby establishing a new benchmark in the field [31]. Chen et al. proposed an explainable academic performance prediction (ESPP) framework. They collected student activity data weekly and used a CNN-LSTM network to extract spatiotemporal features. The model was explained using visualization techniques and analysis of typical predictions and student activity maps. Their results showed better performance than the LSTM baseline model [32].

Graph neural networks in educational applications

With the recent application of deep learning methods, graphs have become a powerful tool for data analysis. Graph Neural Networks (GNNs) are particularly effective at processing graph-structured data and capturing relationships between nodes and edges. With advantages such as end-to-end learning and scalability, GNNs have broad application prospects in social network analysis, recommendation systems, and chemical molecular analysis. Several survey papers cover this topic [33,34,35,36,37].

Intelligent education is rapidly advancing, and Graph Neural Networks (GNNs) are becoming increasingly popular in education data analysis. One example is Knowledge Tracing (KT), which aims to estimate students’ learning status and future performance by modeling their learning behavior and performance [38]. Nakagawa et al. introduced Graph Knowledge Tracing (GKT). This model uses a graph structure to represent the knowledge structure of coursework, with concepts as vertices and their relationships as edges; by doing so, it provides more understandable predictions and superior accuracy in forecasting academic performance compared with Deep Knowledge Tracing (DKT) [39]. Abdelrahman et al. introduced the Deep Graph Memory Network (DGMN). This model employs a forget-gating mechanism and learns a dynamic latent concept graph to track the evolving knowledge states across latent concepts and their interrelationships [40]. Another model, Bi-CLKT, proposed by [41], utilizes graph-level and node-level GCNs to assess a student’s proficiency in a particular concept by considering their past performance on related exercises. By leveraging global and local information with a contrastive loss, Bi-CLKT can better capture the complex relationships between students, concepts, and exercises. Overall, this approach represents a significant advancement in knowledge-tracing tasks.

Cognitive diagnosis, a pivotal aspect of intelligent education, aims to discern students’ knowledge proficiency and cognitive capabilities within a specific domain by analyzing their learning behaviors and performance. Several researchers have proposed innovative approaches to address this issue [42]. Zhang et al. have developed GKT-CD that integrates knowledge tracing and cognitive diagnosis via Gated Graph Neural Networks. By refining the Q-matrix, this model enhances the precision of predicting a student’s performance, leading to more accurate assessments of their knowledge level and cognitive abilities [43]. Su et al. have constructed a graph-based Cognitive Diagnosis model (GCDM) for intelligent tutoring systems that involves the creation of two graph-based layers. By leveraging these two layers, this approach can effectively model the complex relationships between a student’s cognitive state and learning environment, leading to more accurate predictions of their performance [44]. Gao et al. devised a model, RCD, capable of concurrently assimilating relation-aware representations of students, exercises, and concepts utilizing a heterogeneous relation map. This methodology facilitates a holistic comprehension of a student’s learning trajectory while providing focused suggestions for performance enhancement [45]. Several researchers have proposed innovative approaches to improve the accuracy of cognitive diagnosis for students with sparse data. Wang et al. have developed a self-supervised Cognitive Diagnosis (SCD) framework that leverages self-supervised methods in graph neural networks to achieve this goal [46]. Meng et al. [47] have proposed a Dual Autoencoder Enhanced Subgraph Pattern Mining (DASPM) approach that considers collaborative information among students to improve the accuracy of predicting their knowledge state and future performance. This approach uses a subgraph extraction algorithm and a dual autoencoder module to eliminate irrelevant information, minimize interference, and enhance the representation of nodes in the subgraph.

Graph neural networks have also been widely applied to various domains, such as agriculture, transportation, bioinformatics, health care, and remote sensing. For example, Bukumira et al. proposed a carrot grading system using computer vision and a cascaded graph convolutional neural network [48]. Lan et al. developed a dynamic spatial-temporal aware graph neural network for traffic flow forecasting [49]. Sun et al. presented a deep learning method for predicting metabolite-disease associations via graph neural network [50]. Bacanin et al. designed an intelligent wireless health care system using graph LSTM pollution prediction and dragonfly node localization [51]. Yao et al. proposed a deep hybrid multi-graph neural network collaboration for hyperspectral image classification [52]. Jiang et al. proposed a face2nodes model that learns facial expression representations with relation-aware dynamic graph convolution networks [53]. These applications demonstrate the effectiveness and versatility of graph neural networks in modeling complex data and solving challenging problems.

Table 1 Comparison of the relevant academic performance prediction models

The above methods demonstrate that existing approaches to academic performance prediction have certain limitations, as shown in Table 1. First, they feed multidimensional information into a single model, or split it across a multi-channel model, and attempt to fit it uniformly; lacking a dedicated temporal channel to capture time-series information as auxiliary guidance for the main GNN structure, they capture non-temporal and temporal information only incompletely. Second, because real student data is sparse, simple GNNs cannot effectively capture students’ knowledge levels, behavioral performance, and inter-student interactions, resulting in poor generalization. These limitations have inspired further research in this field. In contrast, the proposed APP-DGNN model leverages dual graph neural networks to exploit students’ interaction activities and attribute features, resulting in highly accurate model predictions. The proposed model differs from conventional machine learning and non-graph deep learning methods in that it can handle graph data and capture the complex and dynamic relationships among students and learning activities. It also differs from existing graph neural network methods: it uses dynamic graph processing, an adaptive graph convolution operator with low- and high-frequency filters, and effective relation learning between interaction activities and the attribute feature spaces. The proposed APP-DGNN is thus more comprehensive and robust than previous methods in predicting academic performance.

Methods

This section introduces the details of the proposed APP-DGNN. First, a brief introduction to the framework of APP-DGNN is presented, followed by an explanation of its different components.

Framework of APP-DGNN

Figure 1 illustrates the proposed solution framework using DGNN, which comprises four main components: Data Collection and Pre-processing, Graph Generation, Dual Graph Neural Networks, and Performance Representation and Prediction. Below is a brief overview of these modules:

Fig. 1 Framework of the proposed APP-DGNN for academic performance prediction

Data collection and pre-processing. This phase encompasses the selection of attributes, cleaning of data, and transformation of data to prepare it for the creation of two distinct types of graphs.

Graph generation. This phase converts the pre-processed data from online learning systems into graphs via interaction-based and attribute-based graph generation methods. These methods serve as the source of input graphs for the Dual Graph Neural Networks.

Dual graph neural networks. This module employs two types of graph neural networks. One processes interaction-based graphs to learn local features from interaction activities, while the other processes attribute-based graphs to obtain global features from academic performance relations with other students.

Performance representation and prediction. Local and global representations are concatenated and passed through a multi-head attention module to learn a final academic performance representation. An MLP-based classifier then predicts the academic performances of these candidate students.

The following subsections describe each module of the framework in detail.

Data cleaning and pre-processing

Data from Learning Management System (LMS) logs must be prepared through data cleaning and pre-processing before using APP-DGNN. This process consists of several vital steps to obtain well-formatted data, such as attribute selection, data cleaning, and data transformation. As illustrated in Fig. 1, the input for APP-DGNN is divided into two parts: interaction-based graphs and attribute-based graphs. The details on the generation of both types of graphs are presented in the following.

Data preparation for interaction-based graph generation. The interactions between learners and learning resources contain valuable information about academic performance. Candidate data is prepared to generate interaction-based graphs for improving performance predictions. The key to generating an interaction-based graph from raw log data is determining the types of nodes and edges. Since the target graph has a temporal property, online activities are chosen as the nodes \(V=\{v_1, v_2,\ldots , v_{N_v}\}\), where \(v_i\) denotes the ith type of learning activity. The edges \(E=\{e_1, e_2,\ldots \}\) represent possible interactions between these nodes. For example, nodes occurring within a specific time frame can be assumed to have edges between them. The notation \(ac(i, j)\) denotes the jth data unit in the learning activity log sequence of learner \(l_i\). A sequence of learning activities \(Ac(l_i)\) for learner \(l_i\) can be formulated using Eq. (1).

$$\begin{aligned} Ac(l_i) = \{ac(i, 1), ac(i, 2),\ldots , ac(i, N_{ac_i})\}, \end{aligned}$$
(1)
$$\begin{aligned} Fg(L) = \{Ac(l_1), Ac(l_2),\ldots , Ac(l_M)\}, \end{aligned}$$
(2)

where \(L=\{l_1, l_2,\ldots , l_M\}\) denotes a set of M learners, and Fg(L) denotes the set of activity log data for these learners. \(N_{ac_i}\) denotes the length of the activity log for learner \(l_i\). The following subsection details how these interaction activity logs are converted into interaction-based graphs.
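To make the data preparation concrete, the following minimal sketch shows one way such per-learner activity sequences could be assembled from raw LMS logs; the column names (student_id, activity_type, date, sum_click) are illustrative assumptions rather than the dataset’s actual schema.

```python
import pandas as pd

def build_activity_sequences(log_df: pd.DataFrame) -> dict:
    """Group raw LMS log rows into per-learner activity sequences Ac(l_i).

    Assumes hypothetical columns: student_id, activity_type, date, sum_click.
    Returns {learner_id: [(activity_type, date, sum_click), ...]}, ordered by
    date, i.e., the collection Fg(L) of Eq. (2).
    """
    sequences = {}
    for learner_id, rows in log_df.sort_values("date").groupby("student_id"):
        sequences[learner_id] = list(
            rows[["activity_type", "date", "sum_click"]]
            .itertuples(index=False, name=None)
        )
    return sequences
```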

Data preparation for attribute-based graph generation. An attribute-based graph is utilized to encode the information from the relationships between students and enhance the accuracy of academic performance predictions. Appropriate attributes are selected to model each student’s performance. These attributes, such as Gender, Region, Disability, and Highest_education, are obtained from learning management systems and denoted by \(at(i, j)\), the jth attribute of learner \(l_i\); each \(at(i, j)\) can be a real-valued scalar or an integer obtained through one-hot or multi-hot encoding. A record \(At(l_i)\) for learner \(l_i\) can be expressed as

$$\begin{aligned} At(l_i) = \{at(i, 1),\ldots , at(i, N_{at})\}, \end{aligned}$$
(3)
$$\begin{aligned} Fa(L) = \{At(l_1), At(l_2),\ldots , At(l_{M})\}, \end{aligned}$$
(4)

where \(N_{at}\) is the number of chosen attribute features for attribute-based graph generation. Note that Fa(L) is generated only from the training dataset, not all raw data from the LMS.

Interaction- and attribute-based graph generation

This subsection details how the data from Eqs. (2) and (4) are used to construct interaction-based and attribute-based graphs for the dual graph neural networks.

Interaction-based graph generation. In our setting, as mentioned in Eqs. (1) and (2), an interaction-based graph \({\mathcal {G}}\) is generated using a sequence of online learning activities. An interaction-based graph can be expressed as

$$\begin{aligned} {\mathcal {G}}(T) = \{{\mathcal {V}}(T), {\mathcal {E}}(T)\}, \end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {V}}(T) = \{i: \exists v_i(t) \in \Omega , t\in T\}, \end{aligned}$$
(6)
$$\begin{aligned} {\mathcal {E}}(T) = \{(i,j): \exists e_{i,j}(t)\in \Omega , t\in T\}, \end{aligned}$$
(7)
$$\begin{aligned} \Omega = \{ x(t_1), x(t_2),\ldots \},\quad 0 \le t_1 \le t_2 \le \cdots , \end{aligned}$$
(8)

where \(x(t_i)\) denotes an interaction event in the sequence of online learning activities, and \(v_i(t)\) denotes an online learning activity from the collection of candidate online learning activities V in time interval T. \(e_{i,j}(t)\) denotes an edge between node \(v_i\) and node \(v_j\); in our setting, \(e_{i,j}(t)\) equals 1 if both \(v_i\) and \(v_j\) appear in \(\Omega \) within the interval, and 0 otherwise. \({\mathcal {N}}_i(T) = \{j: (i,j) \in {\mathcal {E}}(T)\}\) refers to the neighborhood of node \(v_i(t)\) in time interval T. Thus, a collection of interaction-based graphs is denoted by \(\{{\mathcal {G}}({\mathfrak {t}})\}_{{\mathfrak {t}}\in \{1, 2,\ldots , T\}}\). Specifically, the number of node types in an interaction-based graph \({\mathcal {G}}({\mathfrak {t}})\) is determined by the types of online learning activities, i.e., \(N_v\). The features of node \(v_i\) are denoted as a tuple \( (v_{i,1},\ldots , v_{i,j},\ldots )\), where \(v_{i,j}\) denotes the jth feature of \(v_i\), e.g., the type of learning activity, the total clicks, and so on.
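As a minimal illustration of this construction, the sketch below builds one adjacency matrix per time window from a learner’s activity sequence, assuming day-granularity timestamps and the co-occurrence edge rule described above; the window length is a tunable assumption.

```python
import numpy as np

def build_interaction_graphs(sequence, num_activity_types, window_days=6):
    """Build adjacency matrices {A_t} from one learner's activity sequence.

    sequence: list of (activity_type_id, day, clicks) tuples, as in Eq. (1).
    Activity types observed within the same window are linked by
    non-directional edges (e_{i,j} = 1), as assumed in the text.
    """
    if not sequence:
        return []
    first = min(day for _, day, _ in sequence)
    last = max(day for _, day, _ in sequence)
    graphs = []
    for start in range(first, last + 1, window_days):
        active = {a for a, day, _ in sequence
                  if start <= day < start + window_days}
        A = np.zeros((num_activity_types, num_activity_types))
        for i in active:
            for j in active:
                if i != j:
                    A[i, j] = 1.0   # symmetric, non-directional edge
        graphs.append(A)
    return graphs
```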

Attribute-based graph generation. A k-NN algorithm is employed to generate attribute-based graphs, capturing relations among students’ academic performance from varied perspectives. Before each graph convolution layer, a dilated k-NN algorithm [54] is applied to identify k neighbors for each student \(l_i\) within the attribute feature space: k students are sampled uniformly from the top \(k \times d\) neighbors, where d denotes the dilation rate. Let \(At(l_i)\) be the centroid node and \({\mathcal {N}}^{l}_{d}(l_i)\) the dilated neighbors of \(l_i\) at layer l. Given the sorted \(k \times d\) nearest neighbors \(\{n_1, n_2, \dots , n_{k\times d} \}\), the dilated neighbors of \(l_i\) are

$$\begin{aligned} {\mathcal {N}}^{l}_{d}(l_i) = \{n_1, n_{1+d}, n_{1+2d}, \dots , n_{1+(k-1)d}\}, \end{aligned}$$
(9)

where cosine similarity is utilized to measure the distance between vectors in the attribute space. A collection of dilated k-NN graphs, denoted as \({\mathcal {G}}^l = \{ {\mathcal {G}}^l_i \in {\mathbb {R}}^{k\times D} |i=1, 2,\dots ,N\}\), emerges for layer l. Each layer’s k-NN graph is thus dynamic and adaptive.
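The dilated neighbor selection of Eq. (9) can be sketched in PyTorch as follows; the function name and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dilated_knn_indices(x: torch.Tensor, k: int, d: int) -> torch.Tensor:
    """Dilated k-NN in the attribute space (Eq. (9)).

    x: (N, D) node features. For each node, rank the others by cosine
    similarity, take the top k*d neighbors, then keep every d-th one.
    Returns an (N, k) tensor of neighbor indices.
    """
    x_norm = F.normalize(x, dim=1)
    sim = x_norm @ x_norm.t()                 # pairwise cosine similarity
    sim.fill_diagonal_(float("-inf"))         # exclude self-loops
    topkd = sim.topk(k * d, dim=1).indices    # sorted top k*d neighbors
    return topkd[:, ::d]                      # n_1, n_{1+d}, ..., n_{1+(k-1)d}
```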

Interaction-based graph convolutional network module

As shown in Fig. 1, the dual graph networks consist of two main components: an interaction-based graph convolutional network (IGCN) and an attribute-based graph convolutional network (AGCN). The interaction-based graph convolutional network is detailed in the following subsection, while the attribute-based graph network is detailed in the subsequent subsection.

Fig. 2 An illustrative example (seven nodes) of the interaction-based graph data being processed in the IGCN module

An interaction-based graph convolutional network is a neural network architecture that can capture the complex and dynamic interactions among students and learning activities in online learning environments. It can learn the latent features of students from their interaction graphs, which reflect their engagement, behavior, and performance in different periods. IGCN mainly consists of three parts (as shown in Fig. 2): graph convolutional operator (GCO), recurrent neural aggregator (RNA), and local attention representation, which are detailed as follows:

Graph convolutional operator. GCO is a function that performs convolution on the interaction graphs to learn the local features of each node and its neighboring nodes (related learning activities). GCO uses two kinds of filters: one for the self-connections (the diagonal entries of the adjacency matrix) and another for the other connections (the non-diagonal entries of the adjacency matrix). The self-connection filter captures the inherent attributes of each node, while the other-connection filter captures the relational attributes of each node and its adjacent nodes. GCO combines the results of both filters using a nonlinear activation function (ReLU) to obtain the hidden representation after the element-wise pooling of nodes (FFN-P). Therefore, GCO plays a crucial role in handling an interaction-based graph \({\mathcal {G}}(t)\) from a specific period \(t\in T\). The process of GCO is formulated as follows:

$$\begin{aligned} H_t^1 = \textit{ReLU} \left( \textit{FFN-P} \left( \hat{D}^{\frac{1}{2}}_t \hat{A}_t \hat{D}^{\frac{1}{2}}_t X^0_t \hat{W}_\theta + \hat{D}^{\frac{1}{2}}_t \tilde{A}_t \hat{D}^{\frac{1}{2}}_t X^0_t \tilde{W}_\theta \right) \right) , \end{aligned}$$
(10)

where \(X^{0}_t\) denotes the initial node embeddings, with \(X^{0(i)}_t\) the embedding of \(v_i(t)\); \(H^1_t\) denotes the learned hidden representation for the interaction-based graph \({\mathcal {G}}(t)\); \(\hat{A}_t = A_t + I_t\) and \(\tilde{A}_t = A_t - I_t\), where \(A_t\) denotes the adjacency matrix of \({\mathcal {G}}(t)\) and \(I_t\) is an identity matrix with the same shape as \(A_t\); \(\hat{D}_t\) is the normalization matrix of the degree matrix \(D_t\) of \({\mathcal {G}}(t)\); and \(\hat{W}_\theta \), \(\tilde{W}_\theta \) are learnable weights shared across different periods.
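As an illustration, a minimal PyTorch sketch of GCO follows. It assumes symmetric degree normalization and approximates FFN-P by a linear layer followed by element-wise mean pooling over nodes, since the text leaves these details open.

```python
import torch
import torch.nn as nn

class GCO(nn.Module):
    """Sketch of the graph convolutional operator in Eq. (10)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.w_hat = nn.Linear(in_dim, hid_dim, bias=False)  # self-connection filter
        self.w_til = nn.Linear(in_dim, hid_dim, bias=False)  # other-connection filter
        self.ffn_p = nn.Linear(hid_dim, hid_dim)             # part of FFN-P (assumed)

    def forward(self, A, X):
        # A: (N, N) adjacency of G(t); X: (N, in_dim) initial embeddings X_t^0
        I = torch.eye(A.size(0), device=A.device)
        A_hat, A_til = A + I, A - I
        deg = A_hat.sum(dim=1).clamp(min=1.0)
        Dn = torch.diag(deg.pow(-0.5))                       # degree normalization
        prop = Dn @ A_hat @ Dn @ self.w_hat(X) + Dn @ A_til @ Dn @ self.w_til(X)
        return torch.relu(self.ffn_p(prop)).mean(dim=0)      # pooled vector H_t^1
```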

Recurrent neural aggregator. RNA is a function that combines the hidden representations of each node over various periods using a recurrent neural network (RNN). RNA employs a long short-term memory (LSTM) cell as the fundamental unit of the RNN, which is capable of maintaining the node features’ long-term dependencies and temporal dynamics. RNA also incorporates a cosine similarity-based attention mechanism to allocate different weights to the node features at different time steps, thereby capturing the significance and relevance of each period for the final prediction. Therefore, with the combined sequences \(\{H_t^1\}\) from GCO, a recurrent neural network is employed as an information aggregator across different periods. The process can be formulated as follows:

$$\begin{aligned} H_t^2&= \textit{LSTM}(H^1_t, \alpha _t H^2_{t-1}, W_\theta ^h), \\ \alpha _t&= \cos (H_t^1, H^1_{t-1}), \end{aligned}$$
(11)

where LSTM denotes an LSTM cell, \(H_t^2\) denotes the hidden state at period t, \(\alpha _t\) is the attention weight between \(H^1_t\) and \(H^1_{t-1}\), \(\cos \) is the cosine similarity function, and \(W_\theta ^h\) is a learnable weight matrix shared across different periods.
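A minimal PyTorch sketch of RNA is given below; treating the first period’s attention weight as 1 and gating only the previous hidden state are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNA(nn.Module):
    """Sketch of the recurrent neural aggregator in Eq. (11)."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)

    def forward(self, H1):                      # H1: (T, dim) sequence from GCO
        h = H1.new_zeros(1, H1.size(1))
        c = H1.new_zeros(1, H1.size(1))
        states = []
        for t in range(H1.size(0)):
            if t > 0:                           # alpha_t = cos(H_t^1, H_{t-1}^1)
                h = F.cosine_similarity(H1[t], H1[t - 1], dim=0) * h
            h, c = self.cell(H1[t].unsqueeze(0), (h, c))
            states.append(h.squeeze(0))
        return torch.stack(states)              # (T, dim) hidden states H_t^2
```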

Local attention representation. After applying GCO and RNA to the interaction graphs, the IGCN obtains a sequence of hidden representations for the interaction graphs at varying time steps. To consolidate these features into a single vector, IGCN employs an additional cosine similarity-based attention mechanism to calculate a weighted sum of the node features. This vector is subsequently transformed by a feed-forward neural network (FFN) to derive the local representation of each student’s academic performance in terms of interaction graphs. As previously stated, a portion of the final representation of student academic performance is generated locally by the interaction-based graph processing network. Hence, a straightforward attention-based method is utilized to combine the features and obtain the local representation as follows:

$$\begin{aligned} {H}^3&= \sum _{t=1}^T \tilde{\alpha }_t H^2_t, \\ \tilde{\alpha }_t&= \cos (H^2_t, H^2_{t-1}),\\ Z^L&= \textit{FFN}({H}^3), \end{aligned}$$
(12)

where \(\tilde{\alpha }_t\) is the attention weight between \(H^2_t\) and \(H^2_{t-1}\), and FFN is an MLP-based feature transformation function. With these operations, the local representation of academic performance is obtained from the interaction-based graphs.
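A sketch of this attention pooling is shown below, again assuming a weight of 1 for the first period; ffn stands for the MLP-based transformation in Eq. (12).

```python
import torch
import torch.nn.functional as F

def local_representation(H2, ffn):
    """Sketch of the attention pooling and transformation in Eq. (12).

    H2: (T, dim) hidden states from RNA; ffn: an MLP mapping dim -> dim.
    """
    weights = [H2.new_ones(())]                 # first period weight assumed 1
    for t in range(1, H2.size(0)):
        weights.append(F.cosine_similarity(H2[t], H2[t - 1], dim=0))
    alpha = torch.stack(weights)                # (T,) attention weights
    H3 = (alpha.unsqueeze(1) * H2).sum(dim=0)   # weighted sum over periods
    return ffn(H3)                              # local representation Z^L
```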

Attribute-based graph convolutional network module

With the collection of attribute features \(\textit{Fa}(L)\), a dynamic graph neural network is applied to process the attribute-based graphs and obtain a global representation of academic performance. This module aims to capture the complex and dynamic relationships among students’ attributes, such as demographics, academic background, and learning behavior. To achieve this, relation graph convolution (RGC) and RGCN blocks (as shown in Fig. 3) are proposed, which are designed to handle heterogeneous and dynamic graphs with different types of edges and varying node degrees. The attribute-based graph convolutional network (AGCN) is detailed as follows:

Fig. 3 An illustrative example (six nodes) of the attribute-based graph data processed in the AGCN module

Relation graph convolution (RGC): The purpose of using RGC is to leverage both the low- and high-frequency filters to capture the different levels of similarity among the students’ attributes. The low filter aims to aggregate the features of the students who are similar in most of their attributes, such as gender, age, major, etc. The high filter aims to aggregate the features of the students who are different in some of their attributes, such as learning styles, preferences, etc. A more comprehensive and diverse representation of the students’ attributes can be obtained by combining the low and high filters. In numerous frameworks for graph convolutional networks, a standard graph convolution that encompasses both the aggregation and update processes can be articulated as follows:

$$\begin{aligned} {x_i}^{l+1} = \textit{UPD}\left( {x_i}^l, \textit{AGG}({x_i}^l, {\mathcal {N}}({x_i}^l); W_{agg}); W_{upd}\right) , \end{aligned}$$
(13)

where \(\textit{AGG}(\cdot ;W_{agg})\) represents a node feature aggregation function with learnable parameters \(W_{agg}\), and \(\textit{UPD}(\cdot ;W_{upd})\) signifies an update function with learnable parameters \(W_{upd}\). In layer l, \(x_i^l\) represents the attribute features of student \(l_i\); in layer \((l+1)\), the attribute features of the same student are denoted as \(x_i^{l+1}\). \({\mathcal {N}}(x_i^l)\) signifies the neighboring nodes of \(x_i^l\) in layer l. Given the potential presence of redundant or irrelevant edges in the k-NN graph, this approach aids in extracting meaningful and positive correlations within the attribute-based graph. More specifically, the aggregation operation (AGG) for feature \(x_i\) in layer l is defined as

$$\begin{aligned} \textit{AGG}({x_i}^l, {\mathcal {N}}({x_i}^l); W_{agg}) = \sum _{{x_j}^l\in {\mathcal {N}}({x_i}^l)} \alpha _{i,j}^L \left( \textit{FFN} ({x_j}^l+{x_i}^l) \right) + \alpha _{i,j}^H\left( \textit{FFN} ({x_j}^l-{x_i}^l) \right) , \end{aligned}$$
(14)

where \(\alpha _{i,j}^L\) and \(\alpha _{i,j}^H\) are learnable attention weights for the low and high filters, respectively, and FFN represents a fully connected layer followed by the sigmoid function. The update process is defined as \({x_i}^{l+1} = \textit{UPD}([{x_i}^l, \textit{AGG}(\cdot ;W_{agg})];W_{upd})\), with the square brackets [, ] representing a concatenation operation.
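To make the low/high filtering concrete, a minimal PyTorch sketch of the RGC aggregation and update follows; producing \(\alpha ^L\) and \(\alpha ^H\) as two sigmoid outputs of one fully connected layer is an assumption about an implementation detail the text does not fix.

```python
import torch
import torch.nn as nn

class RGC(nn.Module):
    """Sketch of the relation graph convolution in Eqs. (13)-(14)."""
    def __init__(self, dim):
        super().__init__()
        self.ffn_low = nn.Linear(dim, dim)                    # low filter FFN
        self.ffn_high = nn.Linear(dim, dim)                   # high filter FFN
        self.att = nn.Sequential(nn.Linear(2 * dim, 2), nn.Sigmoid())
        self.upd = nn.Linear(2 * dim, dim)                    # UPD on [x_i, AGG]

    def forward(self, x, nbr_idx):              # x: (N, dim); nbr_idx: (N, k)
        xi = x.unsqueeze(1).expand(-1, nbr_idx.size(1), -1)
        xj = x[nbr_idx]                         # neighbor features (N, k, dim)
        a = self.att(torch.cat([xi, xj], dim=-1))   # (N, k, 2) low/high weights
        msg = (a[..., :1] * self.ffn_low(xj + xi)
               + a[..., 1:] * self.ffn_high(xj - xi))
        agg = msg.sum(dim=1)                    # AGG: sum over neighbors
        return self.upd(torch.cat([x, agg], dim=-1))
```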

RGCN blocks: The purpose of using RGCN blocks is to adapt to the dynamic and heterogeneous nature of the attribute-based graphs. Since the students’ attributes may have complex relationships that RGC alone cannot capture with a fixed topological structure, the graph structure and node features are updated at each block. Within an RGCN block, the node features from different graphs are mapped onto a common feature space, and RGC is then applied to update the node features based on the latest graph structure. This way, multiple variations of the students’ attributes and relationships can be captured. The module comprises L cascaded blocks sharing an identical structure, each consisting of two primary components: an RGC and an MLP. In the lth block, node features \(X^{l}\) from graph \({\mathcal {G}}^l\) are mapped onto a common feature space, enhancing node feature diversity. Subsequently, a k-NN graph is formed, and the RGC operator is applied to update the graph features, followed by a linear transformation. The process is expressed as

$$\begin{aligned} Z^{l} = {\mathcal {F}}_{\textrm{out}}^{l} (RGC({\mathcal {F}}_{\textrm{in}}^{l}(X^{l};W_{\textrm{in}}));W_{\textrm{out}}) + X^{l}, \end{aligned}$$
(15)

where \(Z^{l}\) is the block output, and \({\mathcal {F}}_{\textrm{in}}^{l}(\cdot ;W_{\textrm{in}})\) and \({\mathcal {F}}_{\textrm{out}}^{l}(\cdot ;W_{\textrm{out}})\) are fully connected layers. After applying the RGC operator, normalization and GELU activation functions are employed. The node feature transformation capacity is enhanced using a two-layer MLP, denoted as FFN, as follows:

$$\begin{aligned} X^{l+1} =\textit{FFN} (Z^{l};W_G) + Z^{l}, \end{aligned}$$
(16)

where the hidden-layer dimension of \(\textit{FFN} (\cdot ;W_G)\) equals four times the dimension of \(X^{l}\). The final representation from the RGCN blocks captures the global attribute-based graphs; the node feature \(Z^G_i\) for student i is fed to the final representation learning module.
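One RGCN block could then be sketched as follows, reusing the RGC sketch above; the use of LayerNorm and the exact placement of the residual connections are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class RGCNBlock(nn.Module):
    """Sketch of one RGCN block (Eqs. (15)-(16))."""
    def __init__(self, dim):
        super().__init__()
        self.f_in = nn.Linear(dim, dim)         # map into a common feature space
        self.rgc = RGC(dim)                     # RGC sketch from above
        self.f_out = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(               # hidden width is 4x the feature dim
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, X, nbr_idx):
        Z = self.f_out(self.rgc(self.f_in(X), nbr_idx)) + X   # Eq. (15), residual
        Z = F.gelu(self.norm(Z))                              # normalization + GELU
        return self.ffn(Z) + Z                                # Eq. (16), residual
```

In use, nbr_idx would be recomputed with the dilated k-NN routine before each block, making the graph dynamic across layers.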

Local-to-global representation learning and prediction

As shown in Fig. 1, the final representation of academic performance is generated from a local branch of an interaction-based module and a global branch of an attribute-based module. The local branch captures the temporal and sequential patterns of students’ online learning behaviors, such as video watching, quiz taking, and forum posting. It uses a recurrent neural network (RNN) to encode the sequential data and generate a local representation for each student, denoted as \(Z^L\). The global branch captures students’ static and demographic information, such as age, gender, education level, and major. It uses a relational graph convolutional network (RGCN) to encode the attribute data and generate a global representation for each student, denoted as \(Z^G_i\).

A simplified multi-head attention mechanism is applied to fuse these local–global features and obtain the academic performance representation, i.e., \(Z_i = \textit{Multi-Head}(Z^G_i, Z^L)\). The attention mechanism computes the similarity between each pair of students based on their global features and then weights their local features accordingly. This way, the final representation \(Z_i\) can capture the individual and the relational aspects of students’ academic performance. The process can be formulated as follows:

$$\begin{aligned}&Q_i = \textit{FFN}({{z}}^G_i), \\&\{K_{i'}\} = \textit{FFN}(\{{z}^L_{i'}\}),\\&{V}_{i'} = \frac{Q_i \tilde{\otimes }K_{i'}}{\sqrt{d_k}} \tilde{\otimes } {z}^L_{i'}, \\&Z_i = \textit{CPooling}(\{{V}_1, {V}_2,\ldots , {V}_{N_v}\}), \end{aligned}$$
(17)

where CPooling denotes an element-wise pooling function; \(\tilde{\otimes }\) denotes an element-wise product.

With the final representation \(Z_i\), an MLP-based classifier is applied to \(Z_i\) to obtain the academic performance prediction of online candidate learner \(l_i\), i.e., \(y= MLP(Z_i)\), where y is the predicted result on a given representation \(Z_i\). Following the convention of classification tasks with neural networks, a cross-entropy loss is also employed to train the proposed model.
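A minimal sketch of this fusion and classification step is given below; the single-head form and the use of the mean as the element-wise pooling function (CPooling) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LocalToGlobalFusion(nn.Module):
    """Sketch of the simplified attention fusion in Eq. (17) plus the classifier."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.q = nn.Linear(dim, dim)            # FFN producing the query Q_i
        self.k = nn.Linear(dim, dim)            # FFN producing the keys {K_i'}
        self.scale = dim ** 0.5                 # sqrt(d_k)
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, zg_i, zl):
        # zg_i: (dim,) global representation; zl: (N_v, dim) local representations
        Q = self.q(zg_i)                        # query from the global branch
        K = self.k(zl)                          # keys from the local branch
        V = (Q * K / self.scale) * zl           # element-wise product attention
        z_i = V.mean(dim=0)                     # element-wise pooling (CPooling)
        return self.classifier(z_i)             # logits y, trained with cross-entropy
```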

Complexity analysis: Let |E| be the number of edges, N the number of nodes, \(d_i\) the input feature size, and \(d_o\) the output feature size. The complexity of a one-layer GCN [55] is \({\mathcal {O}}(|E|d_id_o)\). The complexity of a one-layer GAT is \({\mathcal {O}}(Nd_id_o + |E|d_o)\). The complexity of k-NN is \({\mathcal {O}}(N^2d_i)\). For the IGCN in APP-DGNN, there are \(L_s\) layers of GCN, so the complexity is \({\mathcal {O}}(L_s(|E|d_id_o))\). For the AGCN in APP-DGNN, there are L layers of RGCN, which consist of GCN and GAT modules. The complexity of the GCN module is \({\mathcal {O}}(L|E|d_id_o)\). The complexity of the GAT module is \({\mathcal {O}}(L(Nd_id_o + |E|d_o + d_md_o))\), where \(d_m\) is the number of neurons in the MLP layer for attention coefficients. The complexity of the k-NN module is \({\mathcal {O}}(LN^2d_i)\). Thus, the total complexity of APP-DGNN is \({\mathcal {O}}((L+L_s)|E|d_id_o + L(Nd_id_o + d_md_o) + LN^2d_i)\). In practice, L and \(L_s\) are usually small to avoid information redundancy, so the complexity of APP-DGNN is comparable to that of GCN and GAT with different depths, except for the k-NN cost.

Case study

Research questions

This study centers on demonstrating the capabilities of the proposed APP-DGNN model in forecasting academic performance. The widely recognized public Open University Learning Analytics Dataset (OULAD) [56] serves as the basis for the case study.

  • Question One (Q1): Can our APP-DGNN outperform non-graph-based and graph-based models in predicting student final grades based on their online learning logs?

  • Question Two (Q2): Can our APP-DGNN accurately identify students at risk of failing with partial online learning logs?

  • Question Three (Q3): How do the different components of our APP-DGNN, such as IGCN, AGCN, and local-to-global representation learning module, influence its final prediction performance?

Dataset and baselines

Dataset: The Open University Learning Analytics Dataset (OULAD) [56] contains data from courses presented at the Open University (OU) in 2013 and 2014, including student demographics, performance, and interactions with the virtual learning environment (VLE). Some descriptions of the dataset follow.

  • Dataset source and size: The dataset was obtained from the Open University Learning Analytics Dataset, which contains data about 22 courses, 32,593 students, and their interactions with the Virtual Learning Environment (VLE) for seven selected courses.

  • Dataset features and labels: The dataset includes student demographics, course information, student assessment results, and student online activity records. The student grades were categorized into four groups: Pass, Distinction, Fail, and Withdrawn.

  • Dataset pre-processing and graph generation: The dataset was cleaned and preprocessed to remove missing values, outliers, and irrelevant features. The students’ online activity records were used to generate interaction-based graphs, whose nodes represent learning activities and whose edges link activities that co-occur within a time window. The student demographics and course information were used to generate attribute-based graphs, where nodes represent students and edges connect students with similar attributes.

In accordance with the research conducted by Li et al. [15], code-module CCC from OULAD was chosen for evaluation. Following pre-processing, academic data were gathered for 3983 students, encompassing basic details, online learning behavior data, and learning evaluation data. The student grades were distributed as follows: Distinction (498 students), Pass (1179 students), Fail (753 students), and Withdrawn (1553 students). Figure 4 presents the distribution of categories in the dataset. For the experimental study, the students were categorized into three groups: Pass (comprising both Pass and Distinction), Withdrawn, and Fail. Various combinations of these groups were utilized for the different tasks. Table 2 summarizes the online learning activities used to generate interaction-based graphs, and Table 3 summarizes the attributes used to generate attribute-based graphs.

Fig. 4 Statistics of code-module CCC in OULAD

Table 2 Online learning activities used to construct interaction-based graphs
Table 3 Attributes for constructing attribute-based graphs [15]

Baselines. The case study employs a variety of machine learning models as a standard of reference to evaluate the proposed APP-DGNN. These encompass three traditional approaches: Support Vector Machine (SVM), Linear Regression (LR), and Multilayer Perceptron neural networks (MLP) [57]. It also includes a gated recurrent unit network (GRU) [9], a Multi-Topology-based Graph Neural Network (MTGNN) [15], and a modified Multi-View Graph Transformer from [58] (AP-GT). The reference models and the proposed APP-DGNN are built using PyTorch and Python.

Experimental settings

Training and testing setup. The dataset is allocated into two sections. The larger section, comprising 80% of the samples, is dedicated to training. The smaller section, containing the remaining 20%, is reserved for testing. The training section undergoes further division, where 90% of the samples are used for training, and the remaining portion serves to assess candidate models and determine optimal hyper-parameters and architectures. To achieve optimal performance, the window size hyper-parameter is tuned for sequential models like GRU and APP-DGNN. As detailed in sections “Interaction- and attribute-based graph generation” and “Interaction-based graph convolutional network module”, interaction-based graph construction involves feature selection, and learning materials (denoted as id_site in the dataset) are chosen as nodes. Not all learning materials or activities are used in graph construction; those used are summarized in Table 2. Directed edges between nodes cannot be built due to the lack of fine-grained timestamps for each learning activity, so it is assumed that materials or nodes used within a day have non-directional edges between them. The raw features for a node are a tuple (id_site, sum_click, date).

Evaluation metrics: Predicting academic performance is framed as a binary classification task. The metrics employed for evaluating performance are

  • Classification Accuracy (ACC):

    $$\begin{aligned} ACC = \frac{\text {TP} + \text {TN}}{\text {TP} + \text {FP} + \text {FN} + \text {TN}}, \end{aligned}$$

    where TP, FP, FN, and TN denote the count of True-Positive, False-Positive, False-Negative, and True-Negative instances in the confusion matrix.

  • F1-score (F1):

    $$\begin{aligned} \text {F}1 = \frac{2 * \text {REL} * \text {PRE}}{\text {REL} + \text {PRE}}, \end{aligned}$$

    where F1 is the harmonic mean of recall (REL = TP / (TP + FN)) and precision (PRE = TP / (TP + FP)); a short computation sketch follows these definitions.
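For reference, the short sketch below computes both metrics from confusion-matrix counts.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute ACC and F1 from confusion-matrix counts, per the definitions above."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    rel = tp / (tp + fn)        # recall (REL)
    pre = tp / (tp + fp)        # precision (PRE)
    return acc, 2 * rel * pre / (rel + pre)
```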

Results and discussion

Predicting academic performance with complete online learning logs

The first experiment aims to answer the research question: Can our APP-DGNN outperform non-graph-based and graph-based models in predicting student final grades based on their online learning logs? To investigate this, students’ complete learning logs from a semester are used to forecast their course performance (Pass/Withdrawn or Pass/Fail). For the binary classification problems of Pass/Fail and Pass/Withdrawn, only samples from the relevant classes were used. The experiment encompasses two distinct sub-tasks. The first task involves categorizing students into two groups: Pass or Fail, based on their risk of failing. The second task focuses on classifying students into Pass or Withdrawn, depending on their risk of discontinuing their studies.

Table 4 The experiment evaluates baseline models alongside APP-DGNN for the prediction of students at risk. An asterisk (*) signifies enhancements statistically significant over the superior baseline. These enhancements are confirmed by a two-sided t-test yielding a p value less than \(10^{-5}\)
Fig. 5 Comparisons on different training sizes of the two sub-tasks

Table 4 presents the experimental results for the tasks, with the best performance denoted in bold font. Several observations can be made. First, the proposed APP-DGNN excels in comparison to the other models in both sub-tasks, attaining prediction performance of 83.96% on the Pass/Fail task and 90.18% on the Pass/Withdrawn task. Second, graph-based models consistently perform better than non-graph-based models on all metrics, demonstrating the effectiveness of using a graph-based model to predict academic performance. Third, among the graph-based models, the proposed APP-DGNN with its interaction-based graph processing module achieves better prediction performance than models with static graph neural networks (e.g., MTGNN, AP-GT), suggesting that an interaction-based graph structure may better encode learning behavior data for academic performance prediction. The proposed APP-DGNN introduces a suitable graph structure with temporal properties to encode learning behavior data, capturing academic states in complex learning processes and improving predictive performance. Figure 5 shows the experimental results for different training sizes on the two sub-tasks. Compared to the other evaluated models, our APP-DGNN maintains better performance across different training set sizes, demonstrating its ability to capture students’ academic states from learning behavior data.

Statistical analysis: Table 4 also shows that the enhancements achieved by APP-DGNN over the superior baseline, AP-GT, are statistically significant, as confirmed by a two-sided t-test yielding a p value less than \(10^{-5}\). Some observations regarding test scores, t-statistics, and p values are drawn in the following.

(1) t-statistics: The table also reports the t-statistic for each comparison between APP-DGNN and AP-GT. The t-statistic measures the difference between the means of the two models in units of standard error; the larger its magnitude, the less likely the difference is due to chance. The table shows that the t-statistics are negative, indicating that APP-DGNN has a lower mean error than AP-GT. The t-statistics are also larger in magnitude for the Pass/Withdrawn category than for the Pass/Fail category, suggesting that APP-DGNN has a greater advantage over AP-GT in predicting students who withdraw from the course.

(2) p value: The table also reports the p value for each comparison between APP-DGNN and AP-GT. The p value is the probability of obtaining a result equal to or more extreme than what was observed, assuming that the null hypothesis (i.e., no difference between the two models) is true. The lower the p value, the less likely the null hypothesis is true and the more confident we can be that the difference is real. The table shows that the p values are all very small, ranging from \(2.57\times 10^{-7}\) to \(8.01\times 10^{-7}\), far below the significance level of \(10^{-5}\). We can therefore reject the null hypothesis and conclude that APP-DGNN is significantly better than AP-GT in predicting at-risk students.

Early prediction for at-risk students with partial online learning logs

The second experiment delves into a crucial research question: Can our APP-DGNN accurately identify students at risk of failing with partial online learning logs? The effectiveness of APP-DGNN and the baseline models was assessed by predicting academic performance during the initial weeks of a semester. Predicting academic performance early on is a vital feature within online learning management systems, as it facilitates the timely identification of students who might fail or withdraw. This paves the way for active intervention strategies or measures to assist these students in bolstering their skills and comprehension. This task is split into two parts: early prediction for students at risk of failing, by classifying them as Pass or Fail, and early prediction for students at risk of withdrawal, by classifying them as Pass or Withdrawn. Experimental settings were otherwise identical except for the duration of learning logs (weeks 5, 10, 15, and 20) utilized for training and testing.

Table 5 Comparisons between the baseline models and APP-DGNN in early predicting at-risk students. An asterisk (*) signifies enhancements statistically significant over the superior baseline. These enhancements are confirmed by a two-sided t-test yielding a p value less than \(10^{-5}\)
Fig. 6 Test accuracies of APP-DGNN, MTGNN, and GRU for early predicting at-risk students

Table 5 reports the comparison results between the baseline models and APP-DGNN in early predicting at-risk students in terms of accuracy. APP-DGNN still outperforms the other baseline models in all learning periods, showing an even more impressive performance compared to the results in Table 4. Also, compared to traditional machine learning algorithms like SVM and MLP, deep-learning-based methods show better performance in the early prediction of at-risk students. Figure 6 depicts the test accuracy values for different learning periods of APP-DGNN, AP-GT, and GRU. The increasing prediction accuracies over time for APP-DGNN, AP-GT, and GRU show that more academic information becomes available to enhance prediction precision as the course progresses. Specifically, as depicted in Fig. 6a, by the course’s twentieth week, the proposed APP-DGNN achieves an accuracy of 80.50% for predicting students at risk of failing. Moreover, it attains an accuracy of 82.10% for predicting students at risk of dropping out, as shown in Fig. 6b. This illustrates the potential for early prediction of students who might fail or withdraw. Furthermore, as Fig. 6 shows, APP-DGNN achieves better prediction performance than the other compared models in the earlier periods, indicating the high effectiveness of the APP-DGNN model for early intervention. With these predictive results, proactive services can be developed to prevent students from dropping out and keep them learning.

Statistical analysis: Table 5 also reports the two-sided t-test results. The table shows that APP-DGNN outperforms all the baseline models in both categories (Pass/Fail and Pass/Withdrawn) and across all periods (Week 5, Week 10, Week 15, and Week 20). The enhancements achieved by APP-DGNN are marked with an asterisk (*). They are statistically significant over the superior baseline, the model with the highest accuracy among the baselines for each category and period. Some observations regarding test scores, t-statistics, and p values can be drawn.

(1) t-statistics: The t-statistics in the table represent the degree of difference between the mean scores of APP-DGNN and the superior baseline model for each category and period. A higher absolute value of the t-statistic indicates a more significant difference. For instance, in week 5, the t-statistic for the Pass/Fail category is \(-11.11\), signifying a substantial difference in accuracy favoring APP-DGNN over the superior baseline model, MTGNN. Similarly, in week 20, the t-statistic for the Pass/Withdrawn category is \(-13.22\), indicating a significant difference in accuracy favoring APP-DGNN over the superior baseline model, AP-GT. These results suggest that APP-DGNN consistently outperforms the superior baseline models, demonstrating its effectiveness.

(2) p value: The p value tests the hypothesis that there is no difference between the mean scores of APP-DGNN and the superior baseline model for each category and period. A smaller p value provides stronger evidence to reject this hypothesis and conclude that a significant difference exists. For example, in week 5, the p value for the Pass/Fail category is \(3.84\times {10^{-6}}\), which is significantly smaller than the threshold of \(10^{-5}\). This allows us to reject the null hypothesis and affirm that APP-DGNN significantly outperforms MTGNN. Similarly, in week 20, the p value for the Pass/Withdrawn category is \(1.02\times 10^{-6}\), also significantly smaller than the threshold, leading us to conclude that APP-DGNN significantly outperforms AP-GT. These p values collectively suggest that APP-DGNN significantly outperforms the superior baseline models across all categories and periods, underscoring its effectiveness in early predicting at-risk students.

Effectiveness of APP-DGNN

This subsection addresses the third research question: How do the different components of the proposed APP-DGNN, such as IGCN, AGCN, and local-to-global representation learning module, influence its final prediction performance? As the proposed APP-DGNN consists of several major components and hyper-parameters, we investigate their contribution to the performance of model predictions through an ablation study and analysis of parameter sensitivities.

Table 6 Effectiveness of different components of APP-DGNN in terms of accuracy (%)

Effectiveness of components of APP-DGNN. Table 6 presents the impact of the various components of APP-DGNN on accuracy (%), with the numbers in parentheses indicating deviations from the best prediction performance. For convenience, some notations are introduced to denote the different ablation settings of APP-DGNN: SPP-IGCN denotes APP-DGNN without the interaction-based graph neural network module; SPP-AGCN denotes APP-DGNN without the attribute-based graph neural network module; APP-GRU denotes APP-DGNN with a GRU network [9] in place of the IGCN module, with the rest remaining the same; SPP-L2G denotes APP-DGNN without local-to-global representation learning, simply concatenating the two vectors instead; and SPP-GCN denotes APP-DGNN with general graph convolutional operators in IGCN and AGCN instead of the proposed operators. Several observations can be made. First, all main components of APP-DGNN contribute to the prediction performance on the two sub-tasks: classification of Pass/Fail and Pass/Withdrawn. Second, SPP-IGCN, the model without an IGCN module, ignores the interaction information in learning behavior data, which causes significant degradation: it shows the worst prediction performance for both sub-tasks, achieving 81.66% and 88.05% for Pass/Fail and Pass/Withdrawn, respectively. Third, both SPP-GCN and APP-DGNN include a temporal modeling process, yet APP-DGNN shows better prediction performance than SPP-GCN on both sub-tasks. The difference is that the proposed IGCN module in APP-DGNN uses a novel filtering information aggregation design, while SPP-GCN uses a conventional implementation. This suggests that the proposed design is better at capturing academic information during students’ learning processes.

Table 7 Accuracy (%) of APP-DGNN with different settings of time units to generate interaction-based graphs

Influence of window size on APP-DGNN: A parameter sensitivity analysis was carried out to scrutinize how window size impacts APP-DGNN’s performance, as interaction-based graph construction plays a pivotal role within APP-DGNN. Experimental outcomes for various window size configurations are displayed in Table 7. Prediction outcomes for both sub-tasks exhibit considerable sensitivity to this hyper-parameter. APP-DGNN achieves the best performance at a window size of 6 days for both sub-tasks, and its performance decreases as the window size grows beyond 6 days, suggesting that a large window size may not be a good choice for updating an interaction-based graph, as it may cause information loss due to poor graph construction. Conversely, a more fine-grained window size may not be a good choice either, as it requires more computing resources and can result in lower prediction accuracy.

Fig. 7 Test accuracy of APP-DGNN and SPP-L2G with different settings of the number of RGCN blocks in AGCN

Number of RGCN blocks in APP-DGNN. We also investigated the influence of the number of RGCN blocks in the AGCN of the proposed APP-DGNN. Figure 7 visualizes the experimental results for APP-DGNN and SPP-L2G with different hyper-parameter settings for the number of RGCN blocks. From Fig. 7a, APP-DGNN achieves the best performance when the number is set to three, while SPP-L2G needs a larger number of RGCN blocks (four) to achieve its best performance. Figure 7b shows similar results, demonstrating that local-to-global representation learning contributes to the performance of the proposed APP-DGNN.

Limitations: The APP-DGNN model also has some limitations that need to be addressed in future work. First, the model depends on the availability and quality of online learning logs, which may vary across different courses and platforms. Second, while the APP-DGNN model can handle dynamic networks, it may not fully capture the complexity of student interactions and attributes; for instance, it does not consider students’ individualized learning paths or learning styles. Third, the model does not address the potential biases and ethical issues arising from using AI-based predictions to guide student learning and interventions. Finally, the model needs to be validated and generalized in other educational settings and domains, and with other types of data and tasks.

Conclusions

This paper investigates the use of dynamic interaction patterns in online learning activities and hierarchical relations in student attribute feature spaces to improve the accuracy of model predictions. A novel academic performance prediction model called APP-DGNN, which employs dual graph neural networks, is proposed to address this problem. Specifically, the proposed model uses an interaction-based graph neural network module to learn local academic performance representations from online interaction activities and an attribute-based graph neural network to learn global academic performance representations from the attribute features of all students using dynamic graph convolution operations. The learned representations from the local and global levels are combined in a local-to-global representation learning module to generate predicted academic performances. The empirical study addresses several questions, including whether APP-DGNN can outperform non-graph-based and graph-based models in predicting academic performance and in improving the early prediction of at-risk students. Experimental outcomes indicate a significant enhancement in performance by our APP-DGNN over the existing approaches. Ablation studies further validate the efficacy and superiority of the techniques incorporated within APP-DGNN.

Future directions: Several areas could be explored in future work: (i) applying APP-DGNN to temporal directed acyclic graphs, since the interaction-based graphs in APP-DGNN are non-directional and may not capture all available information; this would involve incorporating the directionality and temporal order of interactions into the graph structure and the convolution operator, and devising effective propagation and aggregation functions for processing directed graphs; (ii) conducting a comprehensive theoretical analysis of how and why the IGCN and AGCN modules contribute to the final prediction performance, investigating the mathematical properties and assumptions of the proposed graph convolution operators, and comparing them with existing graph convolution methods in terms of complexity, expressiveness, and generalization; and (iii) applying the method to other analytical tasks in educational research, such as student dropout prediction, course recommendation, and learning behavior analysis, and investigating a pretrain-fine-tune schema for APP-DGNN, in which the model is pretrained on large-scale, heterogeneous online learning data for a general task and then fine-tuned on a specific task with a smaller or different dataset.