Abstract
Predicting student performance is crucial for both preventing failure and enabling personalized teaching-and-learning strategies. The digitalization of educational institutions has led to the collection of extensive student learning data over the years. Current research primarily focuses on short-term data, e.g. a single year or semester. In contrast, long-term data has the potential to offer a deeper insight into student behavior, thereby increasing the accuracy of predictions. However, the direct application of long-term data in prediction models assumes consistent data distributions over time. In the real world, evolutions in course content and structure can lead to variations in feature spaces (heterogeneity) and distribution shifts across different academic years, compromising the effectiveness of prediction models. To address these challenges, we introduce the Learning Ability Self-Adaptive Algorithm (LASA), which can adapt to the evolving feature spaces and distributions encountered in long-term data. LASA comprises two primary components: Learning Ability Modeling (LAM) and Long-term Distribution Alignment (LTDA). LAM assumes that students’ responses to exercises are samples from distributions that are parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to counteract the heterogeneity present in long-term data. Subsequently, LTDA employs multiple asymmetric transformations to align distributions of these new features across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA can generate well-aligned features with meaningful semantics. Furthermore, we propose an interpretable prediction framework including three components, i.e. LASA, a base classifier for outcome predictions, and Shapley Additive Explanations (SHAP) for elucidating the impact of specific features on student performance. Our exploration of long-term student data covers an eight-year period (2016-2023) from a face-to-face course at Tsinghua University. Comprehensive experiments demonstrate that leveraging long-term data significantly enhances prediction accuracy compared to short-term data, with LASA achieving up to a 7.9% increase. Moreover, when employing long-term data, LASA outperforms state-of-the-art models, ProbSAP and SFERNN, by an average accuracy improvement of 6.8% and 6.4%, respectively. We also present interpretable insights for pedagogical interventions based on a quantitative analysis of feature impacts on student performance. To the best of our knowledge, this study is the first to investigate student performance prediction in long-term data scenarios, addressing a significant gap in the literature.
Introduction
Background
As institutions aim to improve education quality and reduce student failure and dropout rates, student performance prediction has become a key research topic in educational data mining (EDM). Large classroom environments, which are prevalent in higher education [1], often suffer from high failure and dropout rates. Given the large number of students in these settings, educators find it challenging to promptly identify individual learning behaviors and capabilities, which hinders the implementation of personalized educational strategies [2]. Thus, it is imperative to develop student performance prediction systems for campus students, allowing educators to implement data-driven educational interventions based on solid insights.
With the rapid development of digital campuses, substantial amounts of student academic performance data have been collected over the years, providing a foundation for data-driven academic prediction systems [3]. Existing studies typically utilize short-term data, often limited to 1 or 2 years or even shorter periods [4, 5]. Intuitively, leveraging long-term data should offer a more comprehensive view of student behavior, potentially improving the predictive accuracy of models. However, directly employing long-term data in prediction models implies the assumption that data distributions remain consistent across different periods [6]. This practice overlooks the dynamic nature of educational settings, where both teaching methodologies and curricular content may evolve. Furthermore, this evolution can lead to shifts in data distributions and heterogeneity in feature spaces across different academic years, ultimately compromising the effectiveness of prediction models.
Primary motivations
In this paper, we address a more realistic and challenging scenario: long-term student performance prediction. As courses evolve and historical data from multiple years become available, both teaching methodologies and curricular content undergo changes. This evolution results in long-term data characterized by distribution discrepancies and varying feature dimensionalities, posing significant challenges to the efficacy of predictive models. Figure 1 illustrates the differences in data distribution and feature space assumptions between previous studies and our current work. While distribution shifts and heterogeneous feature spaces in long-term data are ubiquitous in educational settings, they remain largely understudied.
Previous studies have predominantly utilized short-term datasets, assuming that these datasets possess consistent feature spaces and similar distributions. This approach commonly involves aggregating data across different years and randomly dividing it into training and test sets, potentially neglecting variations in distribution over time. In contrast, this paper emphasizes the analysis of long-term data. Due to factors such as curriculum changes, the feature space naturally evolves, leading to alterations in both the number and the meaning of features. These changes also induce shifts in the data distribution over time. In practical scenarios, when predicting student grades for a course, one can only leverage the historically labeled data from previous iterations of the course, since labels (student exam grades) for the target prediction set are not available until after final exams are completed. Hence, simplistically mixing and splitting the data is impractical. To evaluate model performance more realistically, our methodology involves selecting a particular year’s unlabeled data as the prediction target and using all preceding available labeled data. This approach better reflects real-world applications and enhances practicality.
Domain adaptation (DA) is a promising technique to address the challenge of distribution shifts in long-term data. It aims to bridge the distributional gap between the source and target domains by aligning their feature distributions [7]. However, many DA methods assume that a single source domain [8] or multiple source domains [9, 10] share the same feature space as the target domain, differing only in distributions. This assumption limits their applicability to long-term heterogeneous data, which often presents multiple source domains with distinct feature spaces and distributions. Moreover, many DA methods rely on symmetric subspace mappings [11, 12] or deep learning structures [9], which may undermine the semantics of the feature space, resulting in predictions that lack interpretability. Considering the crucial role of interpretability in educational settings, where educators rely on understanding the rationales behind predictions to carry out informed interventions [13], it is imperative to develop a student performance prediction approach that not only accommodates the heterogeneity and distribution shifts in data but also ensures interpretable outcomes.
Innovation aspects
Our work includes three innovation aspects:
Research Question Innovations: We tackle the critical yet underexplored area of utilizing long-term data to predict student performance. Our work illuminates the unique challenges and opportunities presented by long-term educational data, such as feature heterogeneity and distribution shifts, which have not been adequately addressed in existing studies.
Methodological Innovations: To address these challenges, we introduce the Learning Ability Self-Adaptive Algorithm (LASA), designed to adapt the predictive model to the evolving feature spaces and distributions encountered in long-term data. LASA comprises two innovative components: Learning Ability Modeling (LAM) and Long-term Distribution Alignment (LTDA). LAM assumes that students’ responses to exercises are samples from distributions parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to overcome the heterogeneity present in long-term data. Meanwhile, LTDA employs multiple asymmetric transformations to align feature distributions across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA generates well-aligned features with meaningful semantics. Building on LASA, we propose an interpretable prediction framework that utilizes LASA to obtain semantically meaningful features, incorporates a base classifier for outcome predictions, and employs Shapley Additive Explanations (SHAP) to elucidate the impact of specific features on student performance.
Dataset Innovations: To empirically validate our approach and bridge the research gap in long-term student performance prediction, we introduce an 8-year dataset (2016-2023) from the face-to-face course, Principles of Electric Circuits, at Tsinghua University. This dataset, collected using Rain Classroom, a leading educational tool in China [14], provides an opportunity to apply various student performance prediction methods in a real-world, long-term data scenario. The evolving features and data distributions exemplify the limitations of conventional prediction methods and underscore the necessity of LASA.
Contributions
The main contributions of this paper are as follows:
1. For the first time, this study delves into the critical yet underexplored area of utilizing long-term data to predict student performance, highlighting the advantages and challenges of this scenario. We demonstrate that challenges such as feature heterogeneity and distribution shifts within long-term data can potentially impair classifier performance. Moreover, we show that overcoming these challenges and effectively leveraging long-term data can significantly enhance predictive accuracy, yielding more robust classification results compared to those based on short-term data.
2. We propose a novel method, the Learning Ability Self-Adaptive Algorithm (LASA), specifically designed to harness the full potential of long-term student data. To the best of our knowledge, LASA is the first method aimed at tackling the issues of heterogeneity, distribution shifts, and obtaining interpretable features in the context of long-term student performance prediction.
3. We introduce an interpretable prediction framework that includes three key components: first, it generates semantically meaningful features using the Learning Ability Self-Adaptive Algorithm (LASA); second, it employs a classifier to derive prediction outcomes based on these features; and third, it applies SHAP-based model interpretation to elucidate the influence of specific features on the prediction outcomes. This comprehensive framework provides educators with actionable insights, enabling targeted pedagogical interventions.
4. We present the Long-term Student Performance Dataset, a pioneering dataset that encompasses an 8-year span (2016-2023) for long-term student performance prediction. This new dataset lays the foundation for groundbreaking insights into the challenges and benefits of using long-term educational data. The de-identified dataset and code are available at https://github.com/EDM314/LASA.
5. Extensive experiments on our real-world datasets demonstrate that our LASA can effectively utilize long-term data to enhance predictive performance, surpassing state-of-the-art models, ProbSAP and SFERNN, by average accuracy improvements of 6.8% and 6.4%, respectively. Our analysis also provides interpretable insights into the factors influencing student success, paving the way for more effective pedagogical strategies.
Paper organization
The remainder of this paper is organized as follows: Sect. “Related work” provides a review of related work, positioning our research within the existing scholarly landscape. Section “The long-term student performance dataset” describes the long-term dataset, including the methodologies for data collection and analysis. Section “Methodology” delineates the methodology underpinning the Learning Ability Self-Adaptive Algorithm (LASA), detailing its components and outlining the proposed framework for interpretable predictions. Section “Experiments” presents the experimental results and discusses the implications of our findings. Section “Interpretability and case study” offers an interpretable analysis of the prediction outcomes, yielding insights for pedagogical interventions. Section “Discussion” discusses the main findings of this study, acknowledges its limitations, and suggests potential avenues for future research. Finally, Sect. “Conclusion” concludes the paper.
Related work
In this section, we review prior work related to our study in two areas: student performance prediction and domain adaptation.
Student performance prediction
Student performance prediction is a critical research topic in educational data mining (EDM). In recent years, numerous studies have been dedicated to leveraging data mining techniques to predict student academic performance. Li et al. [4] introduced a model that combines a CNN with an LSTM to predict students’ GPA based on 145 days of students’ daily living data. Similarly, Riestra-González et al. [3] attempted to predict students’ performance by analyzing log files from a learning management system (LMS) covering an entire academic year. Additionally, CatBoost-SHAP [15] built student profiles using K-prototypes and employed CatBoost to detect at-risk students, utilizing two years of data, with one year serving as the training set and the other as the test set. While these studies have proposed various predictive models, most of them are based on short-term data, using data from only 1–2 years or even less. Unlike them, our study focuses on the benefits and challenges of utilizing long-term data, which is more aligned with practical applications.
A few studies also utilize data spanning several years. For example, Alcaraz et al. [6] collected data on the Power Electronic Systems course from 2010 to 2017. They incorporated a set of expert features and employed traditional machine learning algorithms to predict student grades. ProbSAP [16] explored using academic year GPA and prerequisite course scores to predict student grades for the Probability Statistics course, drawing upon data from 2015 to 2018. Lu et al. [17] predicted student grades for the final academic year using historical course data and LMS interaction data from 2014 to 2017. Although these studies leveraged long-term data, they often treated the data from multiple years as a single distribution, thus overlooking potential shifts in data distributions across years, an assumption that rarely holds in real-world scenarios.
While some research has been directed towards managing distribution shifts in student performance prediction, these efforts primarily revolve around adapting models for different courses [18,19,20,21] or developing multi-course applications [22, 23]. These approaches, while aiming to address distribution variations, primarily concentrate on the course-level differences, often neglecting the intricate challenges posed by long-term educational data.
In summary, there has been a lack of research on long-term student performance prediction. We tackle the unique distributional challenges presented by historical data accumulated over extended periods, an unexplored area in EDM. This focus allows us to harness the full potential of long-standing data records, significantly enhancing the efficacy of our predictive model.
Domain adaptation algorithm
The DA technique is designed to address distribution differences between the source and target domains by aligning their respective feature distributions. In recent years, DA has been widely applied to various fields, including computer vision [12], natural language processing [24], and recommender systems [25]. This method shows promise for addressing distribution discrepancies in long-term data. Given that labels for future data are unavailable in real applications, our focus is on Unsupervised Domain Adaptation (UDA) techniques. According to the number and properties of source domains, UDA methods can be divided into Single-source Homogeneous UDA (SsHoUDA), Multi-source Homogeneous UDA (MsHoUDA), Single-source Heterogeneous UDA (SsHeUDA), and Multi-source Heterogeneous UDA (MsHeUDA).
SsHoUDA aims to transfer knowledge from a single source domain to a target domain, with both domains being homogeneous (having the same dimensionalities). Transfer Component Analysis (TCA) [26] is a widely recognized domain adaptation technique that seeks to identify a shared subspace between different domains, thereby reducing the Maximum Mean Discrepancy (MMD) [27]. Unlike TCA, Correlation Alignment (CORAL) [28] is proposed to align the second-order statistics between two domains. Moreover, Wang et al. introduced the Manifold Embedded Distribution Alignment (MEDA) [12], which integrates the Grassmann manifold and adjusts the weights between the marginal and conditional distributions for more accurate alignment. Although these methods have shown promising results in specific contexts, they are primarily designed for a single source and might neglect a significant portion of the information when dealing with long-term data.
MsHoUDA focuses on mitigating the distribution discrepancies across multiple source domains. As an extension of TCA, Multi-Domain TCA [29] was introduced to align multiple source domains with the target domain. Similarly, Peng et al. [9] presented M3SDA, a method that aligns the moments between multiple source domains and the target domain. TWMDA [10] further refines this alignment by accounting for sample weights. However, these techniques, primarily designed for homogeneous domains, might not yield optimal results when the feature spaces of the source and target domains diverge, a challenge frequently encountered with heterogeneous long-term data in educational settings.
SsHeUDA methods can bridge two heterogeneous cross-domain feature spaces and can be applied to domains with different dimensionalities. Yeh et al. [30] introduced Reduced Kernel Canonical Correlation Analysis (RKCCA), leveraging a derived correlation subspace to associate data across domains. However, this method depends on parallel instances, which are often lacking in many domains. Addressing this constraint, Liu et al. [31] proposed Shared Fuzzy Equivalence Relations (SFER). This facilitates knowledge transfer from a heterogeneous source domain to the target domain without requiring parallel instances. Despite its strengths, this strategy is still designed specifically for a single source domain.
MsHeUDA remains an inadequately explored challenge. Fuzzy Relation Neural Networks [32] appear to be a potential solution, integrating multiple neural networks within the SFER framework to learn domain-invariant features across multiple heterogeneous domains. However, constrained by the SFER structure, this method exhibits a limited capability to capture students’ behavioral patterns in the original response data. Furthermore, its intricate architecture lacks interpretability, rendering it unable to provide meaningful decision support. Our approach, LASA, can be regarded as an MsHeUDA method that not only transfers knowledge across multiple heterogeneous domains but also retains interpretability.
Summary
In Table 1, we summarize the characteristics of recent methods, focusing on whether they address the challenges present in long-term data and their interpretability. Most student performance prediction methods are based on short-term data and do not thoroughly explore the role of long-term data. Some studies that utilize long-term data simply amalgamate all available data for model validation without considering the potential impacts of long-term data on future predictions. Additionally, most DA methods fall short in addressing distribution shifts across multiple heterogeneous feature spaces, which are typical of long-term educational data. Moreover, they often rely on complex structures and lack interpretability, failing to provide the interpretable predictions that educators need to implement targeted teaching interventions. Our proposed LASA not only effectively resolves the challenges posed by heterogeneous feature spaces and distribution shifts but also provides interpretable predictions.
The long-term student performance dataset
In this section, we present the Long-term Student Performance Dataset to bridge the existing gap in long-term student performance prediction research. This real-world dataset poses challenges such as heterogeneous feature spaces and distribution shifts, paving the way for the development of more practical and robust predictive models.
Data collection
We utilize Rain Classroom, one of the most popular teaching tools in China, to automatically collect in-class data. With Rain Classroom, upon completing the teaching of a knowledge point, the teacher can push corresponding exercises to students for data collection. In this way, data collection can be seamlessly integrated into the teaching process without disrupting the teaching rhythm or placing an extra burden on students and teachers. Meanwhile, students can receive and answer questions via their smartphones without the need for an extra device, and the exercise results are recorded into the database. The results also count towards the students’ regular course scores, which encourages them to take it seriously. A schematic representation of the data collection process is depicted in Fig. 2.
In each lesson, after teaching a particular topic, instructors can use the Rain Classroom platform to push relevant in-class exercises to students. Students can complete these exercises using their phones, eliminating the need for additional devices. Simultaneously, their responses are recorded, contributing to their in-class behavior data. After the course concludes, a final exam is administered, and students’ scores on this exam reflect their overall performance in the course
We collected data from the Principles of Electric Circuits course using Rain Classroom over a total of 8 years, from 2016 to 2023. Principles of Electric Circuits is a required second-year course for many majors at Tsinghua University and was also one of the first courses to use Rain Classroom for teaching purposes. The questions involved in our study are multiple-choice, with data samples and their features presented in Table 2. Because it is a large-scale face-to-face course, its in-class data are highly representative for student performance prediction. In the following, we provide a detailed analysis of the challenges encountered in the Long-term Student Performance Dataset, namely heterogeneous feature spaces and distribution shifts.
Yearly statistics from 2016 to 2023 for the Principles of Electric Circuits course at Tsinghua University. The bar graphs represent the number of students, lessons, and exercises per year. The line graph showcases the average correct rate of student answers with standard deviation. Noticeable fluctuations in the data components reflect the dynamic evolution of the course content, student enrollment, and their performance
Data analysis
Figure 3 presents the basic statistics of our dataset. Notably, the dataset exhibits annual fluctuations in both the number of lessons and exercises from 2016 to 2023, reflecting the evolving nature of the course content. Such ongoing modifications hint at potential challenges associated with heterogeneous feature spaces that possess varying dimensions. Furthermore, the variations observed in student enrollments and their average accuracy rates underscore the existence of distribution shifts in the dataset.
To quantitatively analyze the dataset’s evolving feature-level characteristics year-over-year, we employ both single-feature and multi-feature evaluations. Using the students’ correct answer rate as a representative statistical feature, we calculate the Jensen-Shannon (JS) divergence [35] between different years (the smaller the value, the smaller the difference). As depicted in Fig. 4a, the distribution difference grows over time, with two noticeable similarity blocks: 2016–2020 and 2021–2023, confirming feature-level distribution shifts in long-term data. Moreover, in Fig. 4b, we employ the first principal component after Principal Component Analysis (PCA) for assessment. PCA is a statistical procedure that utilizes orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components. In line with our previous findings, the closeness in color intensity around the diagonal indicates a higher similarity in distributions for consecutive years. The similarity blocks also exist, but they differ from the similarity block in Fig. 4a, which might be attributed to PCA altering the feature structures and distributions. Additionally, we observe a significant discrepancy in the distribution between 2021 and its adjacent years. These findings further underscore the distribution shifts in long-term data.
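To make the single-feature comparison concrete, the following minimal Python sketch estimates the JS divergence between two years’ correct-rate samples via shared-bin histograms; the Beta-distributed cohorts are synthetic stand-ins, not our dataset.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(a: np.ndarray, b: np.ndarray, bins: int = 20) -> float:
    """JS divergence between two 1-D samples, estimated on shared histogram bins."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return jensenshannon(p, q) ** 2  # scipy returns the JS distance (the square root)

# Synthetic stand-ins for per-student correct answer rates in two academic years.
rng = np.random.default_rng(0)
rates_a = rng.beta(8, 3, size=300)  # hypothetical earlier cohort
rates_b = rng.beta(6, 4, size=280)  # hypothetical later cohort with a shifted mean
print(js_divergence(rates_a, rates_b))
```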
a Distribution discrepancy of a single statistical feature, where deeper shades of blue signify smaller distribution differences. Data from 2016–2020 exhibit similarities amongst each other, as does the data from 2021–2023, yet significant differences exist between these two clusters. b Distribution discrepancy of single PCA features, revealing similar patterns. c Joint distribution differences of multiple PCA features. Classification outcomes from SVM trained on data from different years confirm the presence of these distribution shifts
To provide a comprehensive understanding beyond single-feature analysis, we delve into distribution discrepancies across yearly dataset splits using a classifier-based multi-feature approach. Given the heterogeneous nature of our long-term data and varying feature dimensions, as depicted in Fig. 3, directly applying traditional machine learning classifiers proves challenging due to their expectation of consistent feature dimensions. Consequently, we utilize the first 30 principal components post-PCA as a consistent feature set and employ an SVM classifier [36]. The labeling process is elaborated in Sect. “Experimental settings”. In Fig. 4c, each heatmap cell indicates classification accuracy, with the model trained on data from the x-axis year and evaluated on the y-axis year. Diagonal cells represent average accuracies from ten-fold evaluations on identical-year data. Notably, training and testing on data from 2018 to 2020 yield commendable performances, aligning with the similarity patterns observed in Fig. 4(b). However, when predicting the years 2021 to 2023, we observe substantial variances in results based on different training years, further underscoring the presence of distribution shifts.
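A minimal sketch of this classifier-based check appears below. It fits PCA per year, one plausible way to reconcile differing feature dimensions, and reports cross-year SVM accuracy; the random matrices stand in for real response features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical yearly datasets with differing feature dimensions (45 vs. 52).
data = {2018: (rng.normal(size=(200, 45)), rng.integers(0, 2, 200)),
        2019: (rng.normal(size=(200, 52)), rng.integers(0, 2, 200))}

def cross_year_accuracy(train_year: int, test_year: int, n_components: int = 30) -> float:
    (X_tr, y_tr), (X_te, y_te) = data[train_year], data[test_year]
    # PCA is fitted per year so both sets end up with a comparable 30-D representation.
    X_tr = PCA(n_components=n_components).fit_transform(X_tr)
    X_te = PCA(n_components=n_components).fit_transform(X_te)
    clf = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

print(cross_year_accuracy(2018, 2019))  # one cell of the heatmap in Fig. 4c
```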
Methodology
In this section, we present the Learning Ability Self-Adaptive Algorithm (LASA) and the accompanying interpretable prediction framework for long-term student performance prediction. The main components of LASA and the architecture of the prediction framework are illustrated in Fig. 5. To aid in understanding the methodologies presented, we provide a table of notations used throughout this section. Table 3 summarizes the symbols and their descriptions for quick reference.
The interpretable prediction framework with LASA for long-term student performance prediction. The first component of LASA, Learning Ability Modeling, assumes that students’ responses to exercises are samples from distributions that are parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to overcome the heterogeneity present in long-term data. Meanwhile, the second component of LASA, Long-term Distribution Alignment, employs multiple asymmetric transformations to align feature distributions across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA generates well-aligned features with meaningful semantics. Then, a base classifier is used for outcome predictions and we employ Shapley Additive Explanations (SHAP) for elucidating the impact of specific features on student performance
Learning ability self-adaptive algorithm
To better utilize long-term data, we propose the Learning Ability Self-Adaptive Algorithm, which can adapt predictive models to the evolving features and distributions within long-term data. Figure 5 illustrates the two primary components of LASA: Learning Ability Modeling and Long-term Distribution Alignment. Each component is specifically designed to address the challenges of heterogeneity and distribution shifts in long-term datasets. Before detailing these components, we first define the challenges inherent in long-term datasets.
Challenges of long-term datasets
We consider the long-term data as consisting of \(T+1\) distinct timestamps, denoted as \(\mathcal {T} = \{1,..., T+1\}\). Each timestamp t corresponds to a specific data distribution \( p^{(t)}(\textbf{x}^{(t)}, \textbf{y}^{(t)}) \), where \( \textbf{x}^{(t)} \) denotes the input feature vector of dimension \( d^{(t)} \), and \( \textbf{y}^{(t)} \) represents its corresponding labels. It is important to note that for predictions at the \((T+1)^{\text {th}}\) timestamp, we only have access to the features \(\textbf{x}^{(T+1)}\); the labels \(\textbf{y}^{(T+1)}\) are not available. Due to the dynamic nature of long-term educational data, both the feature dimension d and the distribution p may change over time, leading to heterogeneity and distribution shifts. Specifically, for two timestamps \(t_1\) and \(t_2\) (with \(t_1, t_2 \in \mathcal {T}\) and \(t_1 \ne t_2\)), we might encounter \(d^{(t_1)} \ne d^{(t_2)}\) and \(p^{(t_1)} \ne p^{(t_2)}\). The former poses challenges for many traditional machine learning algorithms that require consistent feature dimensions throughout training and prediction phases. The latter might impact the efficacy of prediction models that operate under the independent and identically distributed (IID) assumption.
Learning ability modeling
To address heterogeneity, namely to enable the predictive model to adapt to changes in the number of features within long-term data, we introduce the Learning Ability Modeling. We assume that students’ responses to exercises are samples from distributions parameterized by their learning abilities. Thus, even with a varying number of exercises, we can still obtain a fixed number of distribution parameters as new features, allowing data from multiple years to be used simultaneously in training prediction models. Notably, these distribution parameters are related to students’ learning abilities, offering semantic insights for interpreting the model’s predictions. The process for estimating these parameters is detailed as follows.
For a given timestamp with N students and D exercises, each having M potential response types, a student’s responses can be denoted as \( \textbf{x}_n = (x_{n,1}, \ldots , x_{n,D}) \in \mathbb {R}^{D \times M} \). Specifically, \( x_{n,i} \in \mathbb {R}^M \) is an M-dimensional one-hot vector indicating the \( n^{th} \) student’s answer to the \( i^{th} \) exercise, with each dimension corresponding to potential response types. The entire dataset is thus represented as \( X = \{\textbf{x}_n\}_{n=1}^N \in \mathbb {R}^{N \times D \times M} \).
Students, with their diverse learning abilities, are expected to show different response patterns to the same exercise. The distribution of \( x_{n,i} \) can be represented as \( x_{n,i} \sim p(x_{n,i}; \theta _{n,i}) \), where \(\theta _{n,i}\in \mathbb {R}^M\) denotes the student’s learning ability for exercise i. We further assume that all exercises can be grouped into K categories with no overlap among them. The categories of each exercise can be defined as \( Z = (z_1, \ldots , z_D) \), with \( z_i \) indicating the category of exercise i. If exercises i and j are both in category k, it implies that the learning abilities corresponding to these exercises are considered equivalent, hence \(\theta _{n,i} = \theta _{n,j} = \theta _{n,k}\), where \(\theta _{n,k}\) denotes the learning ability parameters for category k. Consequently, the responses, \(x_{n,i}\) and \(x_{n,j}\), can be considered as samples drawn from the same distribution \(p(x_n; \theta _{n,k})\). Based on this observation, all exercise responses of one student originate from the K distributions \( \{p(x_n; \theta _{n,k})\}_{k=1}^K \). This approach transforms the original feature spaces, with their varying numbers of responses, into a consistent set of K distributions whose parameters indicate students’ learning abilities. We define the learning ability embeddings of one student as \( \textbf{x}_{n,\theta } = (\theta _{n,1}, \ldots , \theta _{n,K}) \in \mathbb {R}^{K \cdot M} \), which constitutes a new fixed-dimension, semantically meaningful feature space.
Specifically, since the student responses are discrete values, we can reasonably suggest that if \(z_i = k\), i.e., exercise i belongs to category k, then the response \(x_{n,i}\) follows a categorical distribution, articulated by
$$ p(x_{n,i} \mid z_i = k; \theta _{n,k}) = \prod _{m=1}^{M} \theta _{n,k,m}^{\,x_{n,i,m}}, \tag{1} $$
where \( \theta _{n,k} \) is an M-dimensional parameter vector. The element \( \theta _{n,k,m} \) stands for its \( m^{th} \) entry and meets the condition \(\sum _{m=1}^M \theta _{n,k,m} = 1 \). Assuming the prior distribution of exercise category \( p(z_i=k)=\pi _k \) and the condition \( \sum _{k=1}^K \pi _k=1 \) holds, the marginal distribution of the responses of exercise i is
$$ p(\textbf{f}_{i}; \Theta , \varvec{\pi }) = \sum _{k=1}^{K} \pi _k \prod _{n=1}^{N} p(x_{n,i}; \theta _{n,k}), \tag{2} $$
where \(\textbf{f}_{i}=(x_{1,i},\ldots ,x_{N,i})\in \mathbb {R}^{N\times M}\) and \(\Theta =(\theta _{1,1},\ldots , \theta _{N,K})\in \mathbb {R}^{N\times K \times M}\). The log-likelihood function of all responses then becomes
$$ \ln p(X; \Theta , \varvec{\pi }) = \sum _{i=1}^{D} \ln \left( \sum _{k=1}^{K} \pi _k \prod _{n=1}^{N} p(x_{n,i}; \theta _{n,k}) \right) . \tag{3} $$
Considering \( \lambda _{n,k} \) and \( \mu \) as the Lagrange multipliers for the normalization constraints on \(\theta _{n,k}\) and \(\varvec{\pi }\), our optimization objective is
$$ \max _{\Theta , \varvec{\pi }} \; \ln p(X; \Theta , \varvec{\pi }) + \sum _{n=1}^{N} \sum _{k=1}^{K} \lambda _{n,k} \left( \sum _{m=1}^{M} \theta _{n,k,m} - 1 \right) + \mu \left( \sum _{k=1}^{K} \pi _k - 1 \right) . \tag{4} $$
Through optimization, we obtain
$$ \gamma _{i,k} = \frac{\pi _k \prod _{n=1}^{N} p(x_{n,i}; \theta _{n,k})}{\sum _{k^{\prime }=1}^{K} \pi _{k^{\prime }} \prod _{n=1}^{N} p(x_{n,i}; \theta _{n,k^{\prime }})}, \qquad \theta _{n,k,m} = \frac{\sum _{i=1}^{D} \gamma _{i,k}\, x_{n,i,m}}{\sum _{i=1}^{D} \gamma _{i,k}}, \qquad \pi _k = \frac{1}{D} \sum _{i=1}^{D} \gamma _{i,k}. \tag{5} $$
Using the Expectation–Maximization algorithm [37], we iteratively refine the initial parameters to optimize Eq. 3, and obtain the optimized parameters \(\theta _{n,k,m}\). Given features of different dimensions \( \{X^{(t)}\}_{t=1}^{T+1} \), we can derive homogeneous and semantically meaningful learning ability embeddings \( \{\Theta ^{(t)}\}_{t=1}^{T+1} \). This resolves the inconsistency in feature dimensions across timestamps. The pseudo code is presented in Phase 1 of Algorithm 1.
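The following Python sketch illustrates one way to implement these EM updates under the stated mixture assumption. For brevity it uses random Dirichlet initialization, whereas, as described in Sect. “Experimental settings”, the initial parameters are better obtained from hierarchical clustering; names and details are illustrative, not taken from the released code.

```python
import numpy as np
from scipy.special import logsumexp

def lam_em(X: np.ndarray, K: int, n_iter: int = 20, eps: float = 1e-4, seed: int = 0):
    """EM sketch for LAM. X: (N, D, M) one-hot responses -> theta: (N, K, M), pi: (K,)."""
    N, D, M = X.shape
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(M), size=(N, K))  # random init for illustration only
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility gamma[i, k] of category k for exercise i.
        log_joint = np.log(pi) + np.einsum('nim,nkm->ik', X, np.log(theta))
        ll = logsumexp(log_joint, axis=1).sum()          # log-likelihood of Eq. (3)
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: closed-form updates of Eq. (5) for theta and pi.
        theta = np.einsum('ik,nim->nkm', gamma, X) / gamma.sum(0)[None, :, None]
        theta = np.clip(theta, 1e-12, None)              # guard against log(0)
        pi = np.clip(gamma.mean(axis=0), 1e-12, None)
        if abs(ll - prev_ll) < eps:                      # convergence threshold epsilon
            break
        prev_ll = ll
    return theta, pi
```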
Long-term distribution alignment
To adapt the predictive model to distribution shifts over time, we introduce the Long-term Distribution Alignment. The efficacy of machine learning models can be significantly impacted by shifts in mean and variance. Hence, our first step is to align these statistical measures. For any element \( \theta _{n,k,m} \) in \( \Theta \), its transformation is given by
$$ \bar{\theta }_{n,k,m} = \frac{\theta _{n,k,m} - \mu _{k,m}}{\sigma _{k,m}}, \tag{6} $$
where \(\mu _{k,m}\) and \(\sigma _{k,m}\) denote the sample mean and standard deviation of the corresponding feature over all students at the given timestamp. Through this procedure, each feature undergoes individual centering and scaling based on sample-driven statistics. We obtain the transformed sample \( \textbf{x}_{n,\bar{\theta }} = (\bar{\theta }_{n,1}, \ldots , \bar{\theta }_{n,K})\) and the transformed features \(\bar{\Theta }=(\bar{\theta }_{1,1},\ldots ,\bar{\theta }_{N,K})\in \mathbb {R}^{N\times K \times M}\).
To align distributions across two domains, a common method is moment matching [38]. Inspired by this, we minimize the difference between two timestamps by solving for the mappings \(\varvec{\psi }^{(t_1)}\) and \(\varvec{\psi }^{(t_2)}\), as illustrated below:
$$ \min _{\varvec{\psi }^{(t_1)},\, \varvec{\psi }^{(t_2)}} \left\| \frac{1}{N^{(t_1)}} \sum _{n=1}^{N^{(t_1)}} \varvec{\psi }^{(t_1)}\big (\textbf{x}_{n,\bar{\theta }}^{(t_1)}\big )^{\otimes p} - \frac{1}{N^{(t_2)}} \sum _{n=1}^{N^{(t_2)}} \varvec{\psi }^{(t_2)}\big (\textbf{x}_{n,\bar{\theta }}^{(t_2)}\big )^{\otimes p} \right\| _\textbf{F}^2, \tag{7} $$
where \(\Vert \cdot \Vert _\textbf{F}^2\) represents the squared matrix Frobenius norm, and \(N^{(t)}\) denotes the sample size for the corresponding timestamp. The symbol \(u^{\otimes p}\in \mathbb {R}^{L^p}\) represents the p-level outer product of vector \(u\in \mathbb {R}^{L}\), and L is the dimension of the corresponding distribution. To achieve a more refined alignment compared to first-order moment matching, our approach aims to match the second-order moment, setting \(p=2\) in Eq. (7).
In contrast to the method presented in [39], which focuses on aligning the distribution between a single source and target domain, our study extends this concept to align the distributions across multiple timestamps within long-term data. Furthermore, the work of [9] suggests that the upper bound of the target error for a learned classifier is influenced by the pairwise moment divergence between the target domain and each source domain. Therefore, to harness the full potential of historical data and minimize target empirical errors, our objective is to minimize the distribution difference between the T historical timestamps and the target timestamp \(T+1\). We modify Eq. (7) to align multiple timestamps, which can be expressed as
$$ \min _{\{\varvec{\psi }^{(t)}\}_{t=1}^{T+1}} \sum _{t=1}^{T} \left\| \frac{1}{N^{(t)}}\, \varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})^\top \varvec{\psi }^{(t)}(\bar{\Theta }^{(t)}) - \frac{1}{N^{(T+1)}}\, \varvec{\psi }^{(T+1)}(\bar{\Theta }^{(T+1)})^\top \varvec{\psi }^{(T+1)}(\bar{\Theta }^{(T+1)}) \right\| _\textbf{F}^2, \tag{8} $$
where \(\varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})^\top \varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})=\sum _{n=1}^{N^{(t)}}\varvec{\psi }^{(t)}(\textbf{x}_{n,\bar{\theta }}^{(t)})^{\otimes p}\), when \(p=2\).
Previous research typically utilized symmetric transformations to align within a subspace, which can result in the degradation of feature structures. Therefore, to preserve the semantic characteristics of features, we employ a series of asymmetric transformations. Specifically, we map the distribution of historical timestamps to the target timestamp for alignment. We assume that \(\varvec{\psi }^{(t)}:\bar{\Theta }^{(t)} \rightarrow \bar{\Theta }^{(t)}A^{(t)}\), where \(A^{(t)}\in \mathbb {R}^{L\times L}\) acts as a linear transformation. Here, \(A^{(T+1)}=\mathcal {I}\) functions as an identity mapping, which maps the target timestamp to itself, while \(\{A^{(t)}\}_{t=1}^T\) maps the respective timestamps to the target timestamp. Our objective can be redefined as
$$ \min _{\{A^{(t)}\}_{t=1}^{T}} \sum _{t=1}^{T} \left\| {A^{(t)}}^\top C^{(t)} A^{(t)} - C^{(T+1)} \right\| _\textbf{F}^2, \tag{9} $$
where \(C^{(t)}\) is the symmetric matrix defined as \(\frac{1}{N^{(t)}}{\bar{\Theta }^{(t)^\top }}\bar{\Theta }^{(t)}\). Recognizing the symmetry of \(C^{(t)}\), we apply singular value decomposition (SVD) to obtain \(C^{(t)}=U^{(t)}\Sigma ^{(t)}{U^{(t)^\top }}\). The most significant r singular values are denoted by \(\Sigma _{[1:r]}^{(t)}\). Their corresponding left singular vectors are represented as \(U_{[1:r]}^{(t)}\). Additionally, \({\Sigma ^{(t)+}}\) denotes the Moore-Penrose pseudoinverse of \(\Sigma ^{(t)}\). Consequently, the optimal transformation is given by
$$ \hat{A}^{(t)} = \left( U^{(t)} \big ({\Sigma ^{(t)+}}\big )^{\frac{1}{2}} {U^{(t)}}^\top \right) \left( U_{[1:r_t]}^{(T+1)} \big (\Sigma _{[1:r_t]}^{(T+1)}\big )^{\frac{1}{2}} {U_{[1:r_t]}^{(T+1)}}^\top \right) , \tag{10} $$
where \(r_t=\min (r_{C^{(t)}},r_{C^{(T+1)}})\), and \(r_{C^{(t)}}\), \(r_{C^{(T+1)}}\) denote the ranks of \(C^{(t)}\) and \(C^{(T+1)}\), respectively. Utilizing the optimal transformation \(\hat{A}^{(t)}\), we can obtain the well-aligned features \(\hat{\Theta }^{(t)} = \bar{\Theta }^{(t)}\hat{A}^{(t)}\), which can be used to train a classifier.
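A compact sketch of this closed-form alignment is given below; it whitens each historical year’s standardized features with their own second-order statistics and re-colors them with the target year’s, which is our reading of the transformation above. Shapes and names are illustrative.

```python
import numpy as np

def align_to_target(theta_src: np.ndarray, theta_tgt: np.ndarray, tol: float = 1e-8) -> np.ndarray:
    """Map standardized source features (N_s, L) into the target timestamp's space."""
    C_s = theta_src.T @ theta_src / len(theta_src)   # second-moment matrix C^(t)
    C_t = theta_tgt.T @ theta_tgt / len(theta_tgt)   # second-moment matrix C^(T+1)
    U_s, S_s, _ = np.linalg.svd(C_s)                 # SVD of a symmetric PSD matrix
    U_t, S_t, _ = np.linalg.svd(C_t)
    r = min(int((S_s > tol).sum()), int((S_t > tol).sum()))  # r_t = min of the ranks
    whiten = U_s[:, :r] @ np.diag(S_s[:r] ** -0.5) @ U_s[:, :r].T
    recolor = U_t[:, :r] @ np.diag(S_t[:r] ** 0.5) @ U_t[:, :r].T
    A_hat = whiten @ recolor                         # optimal transformation, Eq. (10)
    return theta_src @ A_hat                         # aligned features for training
```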
Therefore, our proposed LASA initially obtains a new homogeneous feature space, \(\Theta \), through LAM, addressing the issue of heterogeneity in long-term data. Subsequently, through LTDA, it aligns the distributions across different timestamps, tackling the problem of distribution shift. By integrating these two components, we can adaptively adjust features to accommodate the dynamic characteristics of long-term data. In terms of interpretability, the features derived from LAM are semantically meaningful representations in \(\Theta \), representing students’ learning abilities related to their responses to exercises. This provides a foundation for explaining students’ performances. To preserve the semantic structure within the features, we employ asymmetric transformations in LTDA. Since all timestamps are transformed to align with the target timestamp, these transformed features share the same semantic properties as those of the target timestamp. This ensures that the aligned features, \(\hat{\Theta }\), continue to reside within the same semantic space, thus maintaining their semantic integrity.
Base classifier
To adapt our prediction model to varying numbers of features and distribution shifts, we introduced LASA to obtain robust features \(\hat{\Theta }\), which can be utilized for training classifiers. In the proposed framework, various machine learning algorithms can serve as the classifier. Given that SVM is commonly used as a base classifier both in diverse Educational Data Mining (EDM) scenarios [40] and in domain adaptation methods [41], we opt for SVM as our classifier in this study. Assuming the target of our prediction is the student performance at timestamp \(T+1\), we use the data from the previous T timestamps as the training set, denoted as \(\{(\hat{\Theta }^{(t)}, Y^{(t)})\}_{t=1}^{T}\), with the test set being \((\hat{\Theta }^{(T+1)}, Y^{(T+1)})\). Here, Y represents the labels of the dataset. In real-world predictions, \(Y^{(T+1)}\) is inaccessible. The trained classifier is denoted as f. For the specific principles and training process of SVM, we refer readers to [42].
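The supervised step can then be sketched as follows: pool the aligned features from the T historical timestamps, select the SVM hyperparameters by 5-fold cross-validation (as in Sect. “Experimental settings”), and predict the target year. The search grid and names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_and_predict(aligned_train, labels_train, aligned_target):
    """aligned_train: list of (N_t, L) arrays, one per historical timestamp."""
    X = np.vstack(aligned_train)              # pool all T aligned source years
    y = np.concatenate(labels_train)
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}
    clf = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
    return clf.predict(aligned_target)        # predicted labels for timestamp T+1
```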
SHAP-based model interpretation
After obtaining predictions from the trained classifier f for the test set, it is crucial to interpret these results to improve teaching methods and course content. Although traditional methods of assessing feature importance identify significant features, they fail to elucidate the specific impact of a feature on prediction outcomes [16]. The SHAP (SHapley Additive exPlanations) method [43] draws on the Shapley value from cooperative game theory. It builds an easy-to-understand additive model based on Shapley values, allowing us to see both the positive and negative effects of each feature on individual predictions. By using SHAP, we can better understand how the learning ability parameters \(\hat{\Theta }\) contribute to our predictions.
In our study, we assess the impact on predictions when excluding the \(i^{th}\) parameter, \(\textbf{f}_{\hat{\theta }_i}\), from our model. Given a set of parameters \(F_{\hat{\Theta }}\) and a sample feature vector \(x_{\hat{\Theta }}\), we evaluate the importance of \(\textbf{f}_{\hat{\theta }_i}\) by measuring its marginal contribution across all possible subsets of \(F_{\hat{\Theta }}\) that do not include \(\textbf{f}_{\hat{\theta }_i}\). This is achieved by calculating the difference in prediction accuracy with and without \(\textbf{f}_{\hat{\theta }_i}\) in each subset \(S \subseteq F_{\hat{\Theta }} {\setminus } \{\textbf{f}_{\hat{\theta }_i}\}\). The contribution of \(\textbf{f}_{\hat{\theta }_i}\), denoted as \(\phi _{i}\), is defined through the following equation, which represents the weighted average of these differences across all subsets S:
$$ \phi _{i} = \sum _{S \subseteq F_{\hat{\Theta }} \setminus \{\textbf{f}_{\hat{\theta }_i}\}} \frac{|S|!\,\big (|F_{\hat{\Theta }}| - |S| - 1\big )!}{|F_{\hat{\Theta }}|!} \left[ f_{S \cup \{\textbf{f}_{\hat{\theta }_i}\}}\big (x_{S \cup \{\textbf{f}_{\hat{\theta }_i}\}}\big ) - f_S(x_S) \right] . \tag{11} $$
Here, \(f_S(\cdot )\) denotes the classifier’s output using only the feature subset S, and \(x_S\) represents the sample values associated with the parameter subset S.
Consequently, the classifier’s output for sample \(x_{\hat{\Theta }}\) can be expressed as
$$ f(x_{\hat{\Theta }}) = g(x^{\prime }) = \phi _0 + \sum _{i=1}^{K \cdot M} \phi _i\, x_i^{\prime }. \tag{12} $$
Here, g is the explanatory model, and \(x^{\prime }\in \{0,1\}^{K\cdot M}\) is the feature vector indicating the presence or absence of specific features. The size of our parameter matrix is given by \(K\cdot M\). The term \(\phi _i \in \mathbb {R}\) measures the contribution of the \(i^{th}\) feature, often called the Shapley value. Importantly, \(\phi _0\) represents the average prediction when no specific feature information is provided, roughly matching the mean of predictions across the training dataset. The Shapley values offer insights into how each feature influences the prediction, providing a foundation for educational improvements. Section “Interpretability and case study” will present detailed examples, along with relevant educational insights and suggestions.
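In practice, these Shapley values can be estimated with the shap package; the sketch below treats the trained SVM as a black box via KernelExplainer. Function and variable names are illustrative rather than taken from our released code.

```python
import shap  # model-agnostic Shapley value estimation

def explain_predictions(clf, X_train, X_test, n_background: int = 100):
    """Estimate per-feature Shapley values for the aligned learning-ability features."""
    background = shap.sample(X_train, n_background)   # background set speeds up KernelSHAP
    explainer = shap.KernelExplainer(clf.decision_function, background)
    shap_values = explainer.shap_values(X_test)       # contributions phi_i per sample
    shap.summary_plot(shap_values, X_test)            # global view of feature impacts
    return shap_values
```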
Thus far, we have introduced the interpretable prediction framework for long-term student performance prediction. The hyperparameters of LASA are described in Table 4. The pseudo code of the framework is presented in Algorithm 1.
Experiments
In this section, we conduct comprehensive experiments to evaluate the proposed LASA method for predicting long-term student performance in real-world settings.
Comparison methods
To demonstrate the superiority of LASA, we compare it with the state-of-the-art student performance prediction models (non-transfer models) and novel DA methods across SsHoUDA, MsHoUDA, SsHeUDA, and MsHeUDA categories as shown in Table 5. ORACLE refers to the scenario where training and testing data are proportionally split from the same year, indicative of the classifier’s performance under IID conditions. Non-transfer models, which ignore distribution shifts in long-term data, provide a baseline for performance comparison. SsHoUDA models merge PCA-reduced data from all historical timestamps as a singular source domain. MsHoUDA models consider PCA-processed data from each timestamp as independent source domains, while SsHeUDA models average results from each past timestamp treated as a distinct source domain. Lastly, MsHeUDA models treat the raw data from each timestamp as separate source domains without PCA reduction.
Experimental settings
Data Preprocessing: In our long-term student performance dataset, the question type is multiple-choice. First, we compare each student’s selected options with the correct answers to categorize the responses into three types: correct, incorrect, and unanswered. To represent these outcomes, we employ one-hot encoding instead of numerical coding to avoid the bias introduced by numerical order. We then perform data cleaning by removing students who never participated in answering questions, as these students did not substantially contribute to our data collection, rendering the prediction of their performance meaningless. Additionally, we remove certain anomalous questions that all students failed to answer or answered incorrectly due to external factors. Such instances often involved technical issues during the class that prevented students from answering questions or errors in the answer key, resulting in correct student responses being marked as incorrect. For the task of predicting student performance categories, scores are typically divided based on certain thresholds. Following the recommendation by [46], students’ final exam scores can be categorized into three groups: good, pass, and at-risk, using the following formula:
$$ y = \begin{cases} \text {good}, & \text {rank}(S)/N \le Thr_1, \\ \text {pass}, & Thr_1 < \text {rank}(S)/N \le 1 - Thr_2, \\ \text {at-risk}, & \text {rank}(S)/N > 1 - Thr_2. \end{cases} \tag{13} $$
Here, \(Thr_1\) and \(Thr_2\) are thresholds that segment student performances. In this study, we focus on identifying at-risk students and we set \(Thr_2\) to 0.3. The term \(\text {rank}(S)\) denotes the rank of a student’s score, where the highest score is ranked first. N represents the total number of students.
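The sketch below shows one rank-based reading of this labeling rule, assuming \(Thr_1\) and \(Thr_2\) are fractions taken from the top and bottom of the score ranking; only \(Thr_2 = 0.3\) is fixed by our setup, so the \(Thr_1\) value here is illustrative.

```python
import numpy as np

def label_students(scores: np.ndarray, thr1: float = 0.3, thr2: float = 0.3) -> np.ndarray:
    """Map final exam scores to {'good', 'pass', 'at-risk'} by rank fraction."""
    ranks = (-scores).argsort().argsort() + 1   # rank 1 = highest score
    frac = ranks / len(scores)
    return np.where(frac <= thr1, "good",
                    np.where(frac > 1 - thr2, "at-risk", "pass"))
```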
Training Process: For the segmentation of our long-term dataset, we first select a target year for prediction. The response data from this year serve as our test set, while all data from preceding years, including both responses and grade labels, are utilized as the training set. This ensures alignment with realistic scenarios where, during the prediction of student performance for the target year, only past features and labels, along with current target features, are accessible. Regarding the selection of LASA hyperparameters, the optimal number of question types, K, is identified through an analysis of the distance gap observed in the hierarchical dendrogram, as illustrated in Fig. 7(a). Consequently, K is set to 3, allowing us to use the outcomes of agglomerative hierarchical clustering to establish our initial parameters \(\Theta _{(0)}\) and \(\pi _{k,(0)}\). The maximum number of iterations, I, is set to 20 and the convergence threshold, \(\epsilon \), is set to \(10^{-4}\). We choose the parameters of the SVM classifier by 5-fold cross-validation. For the comparison methods listed in Table 5, we report the optimal results achieved through parameter tuning within the recommended range specified in their original publications, facilitated by cross-validation. For methods requiring PCA, the dimensionality reduction is uniformly set to 30 dimensions. To validate the robustness of our findings, we replicate the experiments 20 times to obtain the average accuracy of all methods on the test set.
In this paper, we focus on predicting at-risk students. We use Accuracy as our evaluation metric to assess model performance, defined by
$$ \text {Accuracy} = \frac{1}{|\mathcal {T}^{te}|} \sum _{(\textbf{x},\, \textbf{y}) \in \mathcal {T}^{te}} \mathbb {I}\big (f_{1\sim T}(\textbf{x}) = \textbf{y}\big ), \tag{14} $$
where \(\mathbb {I}(\cdot )\) is the indicator function, \(f_{1\sim T}(\textbf{x})\) represents the predicted category label by the algorithm for instance \(\textbf{x}\), \(\textbf{y}\) is the ground truth label for the instance \(\textbf{x}\), and \(\mathcal {T}^{te}\) comprises all the instances in the testing split.
The experiments reported in this study were carried out on a high-performance workstation configured with an Intel Xeon W-3175X CPU, 192GB DDR4 RAM, and an NVIDIA RTX 3090 graphics card, running on a Windows 10 operating system. The software environment was based on Python 3.10, with machine learning models implemented using Scikit-learn 1.1.0 and deep learning models developed and trained via PyTorch 1.13.0.
Performance evaluation on long-term student performance dataset
We first evaluate LASA and the comparison methods on the Long-term Student Performance Dataset, focusing on predicting at-risk students from 2017 to 2023. For each year, starting with 2017, it serves as the target prediction year, with the preceding years used as training data. The results, presented in Table 6, highlight LASA’s consistent superiority in accuracy across most tasks. We also apply the Friedman test to establish a ranking of the different methods included in our experiments, as shown in the last column of Table 6. It is evident that LASA ranks first, with a significant margin compared to other methods. Subsequently, we conduct a post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction, to adjust p-values for each pairwise comparison of methods, as detailed in Table 8. The outcomes of the post-hoc analysis conclusively demonstrate that our proposed LASA significantly outperforms all other methods in the study, except for PCA-ORACLE, with p-values consistently below 0.05. Although the p-value in the comparison with PCA-ORACLE is above 0.05, our method still ranks higher in terms of average performance and Friedman ranking.
LASA notably outperformed the idealized PCA-ORACLE model by 5.7% and surpassed other leading student performance prediction models by at least 6.8%. When compared to various DA methods, LASA showed an improvement of at least 6.1%. This enhanced performance is attributed to LASA’s unique ability to capture the nuances of student learning through capacity parameters and effectively leverage historical data for prediction.
Within the Non-transfer models category, both SVM-BSS and SVM-MS were less effective than PCA-ORACLE, highlighting the negative impact of distribution shifts in long-term data. Notably, SVM-MS, which aggregates data from multiple timestamps, does not address distributional differences, leading to potential negative transfer. State-of-the-art EDM methods such as CatBoost-SHAP, DNNMS, and ProbSAP, despite their good performance in short-term data, are less effective in long-term scenarios due to their assumptions of identical distributions, which do not align with the situation of long-term data.
UDA methods, designed to alleviate distribution shift issues, showed varying results. Under SsHoUDA, the MMD method exhibited the poorest performance, likely due to its limitation of using data from a single source domain. In contrast, MEDA’s approach of aligning both marginal and conditional distributions resulted in better outcomes. However, deep learning-based UDA methods such as DANN and D-CORAL underperformed, probably due to negative transfer from merging timestamp data without acknowledging intrinsic differences. MsHoUDA methods consider distributional discrepancies across multiple source domains but face challenges with heterogeneous feature spaces. TWMDA, aiming for finer granularity at the sample level in alignment, might have introduced noise, thereby compromising its performance. Although M3SDA considers aligning each pair of domains, it still relies on PCA to obtain features of the same dimensionality, which may amplify discrepancies between domains and thus result in poor performance. SsHeUDA methods generally showed limited success. These methods rely on a single source domain; moreover, KCCA requires parallel instances and GLG requires equal sample sizes across domains, limiting the information they can utilize and indicating their inability to learn effectively from raw data and align distributions for classification purposes.
In the MsHeUDA category, SFERNN represents a noteworthy attempt to address long-term heterogeneous educational data, utilizing data from multiple domains without prior PCA processing. However, its performance is capped by the SFER architecture and lacks interpretability. In contrast, LASA stands out by directly extracting semantically meaningful representations from raw features and aligning distributions effectively in long-term data. It not only achieved the highest accuracy but also outperformed SFERNN by 6.4%, highlighting its efficiency in handling heterogeneous source data while maintaining model interpretability. Overall, LASA’s exceptional performance, enhanced by robust data alignment and interpretability, sets a new benchmark in the field of long-term student performance prediction.
Additionally, we analyze the effectiveness of LASA’s components: LAM and LTDA, through ablation experiments in Sect. “Ablation study”. The results demonstrate that LAM can effectively extract the learning ability features of students, while LTDA can align feature distributions, further enhancing performance, thereby proving the efficacy of LASA.
Performance evaluation with varied timestamp utilization in long-term student data prediction
To further validate the effectiveness of our model in utilizing long-term student data, we evaluated LASA’s performance across varying numbers of timestamps. Specifically, when predicting student performance for a target timestamp given T available timestamps (all data prior to the target timestamp), LASA’s predictive capability is assessed using between 1 and T timestamps. Notably, for each t timestamps used, there are \(C_T^t\) potential scenarios. The average accuracy over these scenarios serves as our metric. We also evaluate the state-of-the-art student performance prediction method, ProbSAP, under the same conditions. The results are depicted in Fig. 6.
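This protocol can be expressed compactly as below, where run_pipeline is a placeholder for the full LASA-plus-classifier pipeline returning test accuracy for a chosen set of source years; the average over the \(C_T^t\) subsets gives one point on each curve in Fig. 6.

```python
from itertools import combinations
from statistics import mean

def accuracy_vs_history(source_years, target_year, run_pipeline):
    """Average accuracy over all C(T, t) subsets of t source years, for t = 1..T."""
    curve = {}
    for t in range(1, len(source_years) + 1):
        accs = [run_pipeline(list(subset), target_year)
                for subset in combinations(source_years, t)]
        curve[t] = mean(accs)
    return curve
```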
a Performance of ProbSAP with varying amounts of historical data. b Performance of our proposed LASA using different amounts of historical data. The x-axis represents the number of years of historical data used, and the y-axis shows the average accuracy obtained when using a specific amount of historical data. Different curves represent different target prediction years. The results indicate that using more historical data, i.e., long-term data, can increase accuracy. Furthermore, our LASA consistently outperforms ProbSAP, demonstrating its more effective utilization of long-term data
As depicted in Fig. 6(b), LASA substantially benefits from leveraging multiple timestamps in its predictions. There is a consistent improvement in accuracy as the number of timestamps increases, which underscores the importance of utilizing long-term data. For certain target timestamps, such as 2020 and 2022, using data from just a single timestamp yields commendable predictive performance. This could be attributed to the close distributions between these target timestamps and past timestamps, indicating minor distribution differences. Nonetheless, even under these conditions, we notice a slight performance gain as more timestamps are incorporated. Specifically, the performance for 2020 and 2022 improved by 2.4% and 3.9%, respectively, which is in line with our expectations. On the other hand, for scenarios like 2021, where a single timestamp does not produce satisfactory results, possibly due to significant distribution differences, the inclusion of additional timestamps boosts the predictive accuracy by up to 7.9%. This improvement could be attributed to LASA’s ability to discern nuanced patterns and internal variations across multiple timestamps with diverse distributions, which might not be apparent when relying solely on one timestamp.
Fig. 7 a The hierarchical clustering dendrogram for exercise data reveals the most significant distance similarities at height 2, which leads us to divide the questions into three categories, thereby setting K to 3. b A comparison of average accuracy between LASA initialized randomly and LASA initialized via hierarchical clustering across different values of K shows that initialization with hierarchical clustering yields better performance. Additionally, both initialization methods outperform ProbSAP across a broad range of parameters, demonstrating the robustness of our model. Furthermore, the optimal K value determined from the dendrogram in a yields near-optimal prediction performance, validating the effectiveness of our parameter selection method
By comparing with Fig. 6a, we find that ProbSAP underperforms in 2021, a year characterized by considerable distribution differences, and its performance does not markedly improve even with the incremental addition of timestamps. This underperformance arises because prediction models like ProbSAP primarily rely on short-term data and IID assumptions; they may falter when confronted with long-term data exhibiting stark distribution differences. Furthermore, for predictions in 2023, ProbSAP’s performance deteriorates as more timestamps are employed, while LASA’s remains stable. This suggests that ProbSAP experiences negative transfer when data with considerable distribution differences are introduced, whereas LASA can effectively align long-term distributions, mitigating distribution disparities and preventing negative transfer. Moreover, even when LASA relies on data from just one timestamp, it outperforms ProbSAP, further emphasizing the effectiveness of our learning ability modeling: LASA derives more fundamental representations of students’ learning abilities from exercise data.
Sensitivity analysis
In this section, we evaluate the sensitivity of LASA to its parameters. As mentioned earlier, the value of K can be determined using a hierarchical clustering dendrogram, which results from clustering all the exercises. As shown in Fig. 7a, we visually observe the most significant cluster similarities at height 2 and, therefore, plot a cut line at this point. All questions are consequently divided into three categories. Thus, we determine the optimal value for the number of question categories K to be 3. Additionally, we compare the performance for different values of K, as illustrated in Fig. 7b. We note that around the optimal K value identified by the dendrogram, LASA achieves satisfactory prediction accuracy, indicating the efficacy of our parameter selection method.
We also investigated different initialization methods for LASA’s initial iteration parameters \(\pi _{k,(0)}^{(t)}\) and \(\Theta _{(0)}^{(t)}\), specifically random initialization and hierarchical clustering. We find that the initial parameters obtained from hierarchical clustering outperform those from random initialization, and both methods yield better performance across a wide range of parameters than the state-of-the-art model, ProbSAP. This observation suggests that our approach offers a flexible parameter selection range. However, as K grows large, the model’s performance diminishes. This decline may be attributed to the fact that, as K increases, the learning ability parameters derived by LASA become less category-specific and more tailored to individual questions, especially when K equals the total number of questions. This shift gradually weakens the feature semantics, diminishing the model’s robustness, making it more susceptible to noise, and leading to a decrease in performance.
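A minimal sketch of this clustering-based setup is given below, using SciPy’s hierarchical clustering; the response matrix and the cut height of 2 are placeholders mirroring Fig. 7a, and the per-cluster statistics only stand in for the actual computation of \(\pi _{(0)}\) and \(\Theta _{(0)}\):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical exercise-by-student response matrix
# (0 = unanswered, 1 = incorrect, 2 = correct).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(30, 200)).astype(float)

# Ward linkage over exercises; cutting the dendrogram at height 2
# (as read off Fig. 7a) assigns each exercise to a category.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="distance")
K = len(np.unique(labels))  # K = 3 in our setting

# Cluster proportions give initial mixing weights pi_(0); per-cluster
# response frequencies would likewise seed Theta_(0).
pi0 = np.bincount(labels - 1) / labels.size
```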
Regarding the number of iterations for the algorithm, we discuss this in Sect. “Convergence of LASA”, where we note that convergence is generally achieved within 15 iterations. Consequently, we set the maximum number of iterations, I, to 20. Additionally, the sensitivity of classifier hyperparameters is addressed in Sect. “Hyperparameter optimization of the classifier”. The results indicate that using features extracted by the LASA algorithm, the SVM classifier performs consistently well across a wide range of hyperparameters.
Furthermore, our proposed LASA can be adapted to various classifiers. To explore its sensitivity across diverse base classifiers, we incorporated LASA with a range of machine learning classifiers. As depicted in Fig. 8, LASA consistently delivers noteworthy performance across most classifiers, highlighting its adaptability. The best predictive performance is observed when LASA is combined with classifiers such as SVM and LR. Although there is a slight decrease in performance when combined with classifiers like KNN, LASA still outperforms state-of-the-art models such as SFERNN and ProbSAP. These results demonstrate that LASA can extract robust features, making it relatively insensitive to the type of classifier used.
Fig. 8 Comparative analysis of LASA’s predictive accuracy when integrated with various base classifiers. Even when combined with different classifiers, the performance of LASA, including the least performing LASA-KNN variant, still surpasses that of other comparative methods, showcasing LASA’s robustness and effectiveness
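To illustrate this plug-and-play property, the sketch below swaps scikit-learn base classifiers over LASA’s output features; the feature matrices here are synthetic stand-ins for the aligned learning-ability features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-ins for LASA's aligned features (K*M = 9 dims)
# and binary at-risk labels.
rng = np.random.default_rng(0)
Z_train, y_train = rng.normal(size=(200, 9)), rng.integers(0, 2, 200)
Z_test, y_test = rng.normal(size=(50, 9)), rng.integers(0, 2, 50)

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(Z_train, y_train)
    acc = accuracy_score(y_test, clf.predict(Z_test))
    print(f"LASA-{name}: accuracy = {acc:.3f}")
```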
Interpretability and case study
In student performance prediction models, interpretable predictions are essential for curriculum improvements and student interventions. Our proposed Learning Ability Modeling component can extract the student’s learning ability embeddings \( \Theta \) from raw exercise data, providing a semantically meaningful representation. We compute the average student learning ability embeddings for each year, as shown in Fig. 9. \(\text {H}\), \(\text {M}\), and \(\text {E}\) represent three distinct types of learning abilities, while \(\text {H}_{\{1,2,3\}}\), \(\text {M}_{\{1,2,3\}}\), and \(\text {E}_{\{1,2,3\}}\) correspond to specific learning abilities. Denote \(\{\cdot \}_1\) as \(\{\text {H}_1,\text {M}_1,\text {E}_1\}\), and similarly, \(\{\cdot \}_2\) and \(\{\cdot \}_3\) follow this notation. Based on the LAM process, we understand that \(\{\cdot \}_1\) learning ability parameter indicates the probability of not answering questions of the corresponding category, \(\{\cdot \}_2\) represents the likelihood of answering incorrectly, and \(\{\cdot \}_3\) signifies the probability of a correct response. We observe that there is a trend of \(\text {H}_2\), \(\text {M}_2\), \(\text {E}_2\) decreasing and \(\text {H}_3\), \(\text {M}_3\), \(\text {E}_3\) increasing. Thus, \(\text {H}\) pertains to a student’s ability to tackle challenging problems, exemplifying their capability to understand complex knowledge and its comprehensive application. Conversely, \(\text {E}\) encapsulates a student’s mastery over straightforward questions, indicating their grasp of basic concepts and attentiveness during lessons. \(\text {M}\), positioned between the two, denotes a student’s proficiency in addressing questions of moderate difficulty, emphasizing their skill in integrating basic concepts. It is noteworthy that there is a pronounced collinearity between \(\{\cdot \}_2\) and \(\{\cdot \}_3\), and their semantics overlap. To enhance interpretability, we have opted to omit the \(\{\cdot \}_2\) parameter in subsequent analyses.
Fig. 9 This figure illustrates the average learning ability parameter values across different years, segmented into three categories representing distinct types of learning abilities: H, M, and E. Each category is further divided into three specific abilities: \(\{\cdot \}_1\) indicating the probability of not answering, \(\{\cdot \}_2\) for the likelihood of incorrect answers, and \(\{\cdot \}_3\) for the probability of correct answers. We observe that there is a trend of \(\text {H}_2\), \(\text {M}_2\), \(\text {E}_2\) decreasing and \(\text {H}_3\), \(\text {M}_3\), \(\text {E}_3\) increasing. Thus, \(\text {H}\) pertains to a student’s ability to tackle challenging problems, exemplifying their capability to understand complex knowledge and its comprehensive application. Conversely, \(\text {E}\) encapsulates a student’s mastery over straightforward questions, indicating their grasp of basic concepts and attentiveness during lessons. \(\text {M}\), positioned between the two, denotes a student’s proficiency in addressing questions of moderate difficulty, emphasizing their skill in integrating basic concepts
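The yearly profiles in Fig. 9 amount to a simple per-year average of the flattened K x M ability parameters; a sketch with synthetic embeddings, where the column ordering is an assumption for illustration:

```python
import numpy as np

# Theta[t]: hypothetical (num_students, K*M) ability embeddings for
# year t, with columns assumed ordered H1..H3, M1..M3, E1..E3.
rng = np.random.default_rng(0)
Theta = {t: rng.random((100, 9)) for t in range(2016, 2024)}

cols = [f"{c}{i}" for c in ("H", "M", "E") for i in (1, 2, 3)]
for year, emb in Theta.items():
    profile = emb.mean(axis=0)  # average ability profile for the year
    print(year, dict(zip(cols, np.round(profile, 2))))
```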
Upon obtaining interpretable and semantically meaningful features, we proceed to identify the core factors influencing student academic performance, following the SHAP-based Model Interpretation step in LASA. This analysis aids teachers in identifying areas where students require more attention and in understanding the reasons behind poor grades, thereby enabling targeted guidance. Taking 2020 as an example, the feature importance ranking and feature contributions are illustrated in Fig. 10.
Fig. 10 a Feature importance, representing the average absolute contribution of the features. b Contribution of features for each sample prediction, ranked by their importance. c Contribution of feature values in a case where the actual label is at-risk and the predicted label matches. d Feature contributions in a situation where the student is correctly predicted as not being at-risk. For c and d, the letters on the left represent the features and their corresponding values. Blue indicates that the feature contributes negatively to the prediction, while red indicates a positive contribution to the prediction. We have converted the SHAP values to probability magnitudes, where f(x) represents the final probability of being predicted as an at-risk student
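The analysis in Fig. 10 can be reproduced in spirit with the shap library; the sketch below uses a model-agnostic KernelExplainer over the at-risk class probability of an SVM on synthetic stand-in features, since the exact SHAP configuration is not restated here:

```python
import numpy as np
import shap
from sklearn.svm import SVC

# Synthetic stand-ins for LASA's aligned features and at-risk labels.
rng = np.random.default_rng(0)
Z, y = rng.normal(size=(200, 9)), rng.integers(0, 2, 200)
clf = SVC(kernel="rbf", probability=True).fit(Z, y)

# Explain the probability of the at-risk class (assumed to be class 1)
# against a small background sample.
f = lambda data: clf.predict_proba(data)[:, 1]
explainer = shap.KernelExplainer(f, shap.sample(Z, 50))
sv = explainer.shap_values(Z[:10])

# Mean |SHAP| per feature mirrors the importance ranking of Fig. 10a.
importance = np.abs(sv).mean(axis=0)
print(importance)
```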
Upon observing Fig. 10a, b, we find that for at-risk students, learning abilities \(\text {E}\) and \(\text {M}\) have a more pronounced influence on the prediction than learning ability \(\text {H}\). Specifically, as the value of \(\text {E}_3\) decreases, the SHAP value grows larger, pushing the model’s prediction towards the at-risk outcome. A similar trend is observed for \(\text {M}_3\), albeit with less influence than \(\text {E}_3\). This implies that a student’s grasp of basic concepts and their ability to apply combinations of these concepts largely influence whether they receive a low score on the final test. In contrast, a student’s capability to understand and apply complex knowledge does not directly lead to a low score, as evidenced by the minor contribution of learning ability \(\text {H}\) to predicting whether a student is at-risk.
Here, we quantitatively analyze the influence of these learning abilities on student performance through two cases. Figure 10c showcases the contribution of the three types of learning abilities to the prediction for a student who is actually at-risk and is also predicted as at-risk, using the well-adapted learning ability parameters \(\hat{\Theta }\) obtained after LTDA. We notice that \(\text {E}_3\) for this student is \(-\)1.775, \(\text {M}_3\) is \(-\)1.358, and \(\text {H}_3\) is \(-\)1.88, indicating that the student’s various learning abilities are below average and relatively low. However, \(\text {H}_3\) adds only a 0.03 probability to the prediction of being at-risk, while \(\text {E}_3\) and \(\text {M}_3\) contribute probabilities of 0.42 and 0.17, respectively. This suggests that if students do not pay attention in class and thus have a weaker understanding of basic concepts, they are likely to receive lower scores. However, if their ability to tackle challenging problems is limited, it does not necessarily mean they will fail the course. Therefore, for at-risk students, teachers can provide materials that help them revisit basic concepts and assign exercises on understanding and applying those concepts, thereby enhancing academic performance and preventing failures. In the second case, the analysis of a student who is not at-risk and is also predicted as such by the model, shown in Fig. 10d, indicates an above-average understanding of basic concepts and ability to apply them in combination. Moreover, the \(\text {H}_3\) value of 1.194 indicates a notably higher-than-average understanding and application ability for complex knowledge. Consequently, this student performed well in the final exam, with no risk of failing.
In conclusion, for at-risk students, we recommend that teachers pay more attention to their understanding and application of basic concepts. Specifically, in class, after teaching each knowledge point, teachers can provide some reviews or simple exercises to enhance the student’s understanding. They can also increase attendance checks to prevent students from missing classes, leading to knowledge gaps. Additionally, providing some extracurricular materials or videos can help these students review or re-understand these basic concepts and their applications. At the same time, we suggest that these students focus on mastering the basics before diving into complex knowledge.
Discussion
This section discusses the principal findings of the study, as well as its limitations and directions for future research.
Main findings
Prior research has largely focused on short-term data, failing to fully leverage the student data accumulated over the years. Our study introduces the use of long-term data for predicting student performance. However, we have identified two main challenges in utilizing long-term data: heterogeneity, as seen in the yearly change in the number of features, and distribution shift, with yearly feature distribution alterations as displayed in Figs. 3 and 4. To overcome these challenges and fully harness the potential of long-term data, we proposed the LASA algorithm, which, when combined with classifiers and SHAP, forms an interpretable prediction framework.
Experimental comparisons indicate that our method significantly outperforms others, exceeding the state-of-the-art student performance prediction model ProbSAP and the domain adaptation model SFERNN by 6.8% and 6.4% in average accuracy, respectively. The Effectiveness Analysis reveals that even when employing methods designed for short-term data, the use of long-term data does indeed enhance predictive performance, as shown in Fig. 6. This confirms the significance of using long-term data and illustrates that our proposed LASA is more effective because it addresses heterogeneity and distribution shift, aspects not considered by other methods. Furthermore, the Sensitivity Analysis of LASA’s parameters suggests that our method for selecting hyperparameters yields robust results, and LASA itself is resilient to parameter variations.
Finally, an interpretive analysis of the predictive framework incorporating LASA reveals that LASA can extract meaningful features regarding students’ test-taking abilities and analyze their contributions to prediction outcomes, leading to actionable educational interventions. For example, students performing poorly often struggle with simpler questions, indicating a weaker foundation. Targeted remediation and reinforcement of fundamental knowledge are recommended for these students, whereas recommending integrated problem-solving exercises may not effectively improve their performance.
Limitations and future research
This paper introduces a predictive framework that combines the Learning Ability Self-Adaptive Algorithm (LASA) with a base classifier to harness long-term data for student performance predictions. As it is not an end-to-end method, the parameters for each module were optimized separately, potentially resulting in suboptimal outcomes. Heuristic and evolutionary optimization algorithms present promising approaches for improving this process, with recent developments showing success in optimizing machine learning parameters [47, 48]. Furthermore, employing these techniques for feature selection could also serve as a potential method for addressing heterogeneity in long-term data. Future research will explore these techniques to further enhance the performance of our predictive framework.
Moreover, we note that the distributional differences between historical and recent data can vary significantly, impacting the model’s predictive accuracy. Large disparities may lead to a marked decrease in accuracy. Upcoming studies will seek to quantify these differences, assigning appropriate weights to long-term data to prioritize data with distributions that align more closely with the target year while diminishing the influence of data with greater discrepancies, thus improving the predictive performance and robustness of our model.
Our proposed LASA primarily caters to objective question types, such as multiple-choice questions, and may not perform well when the question type is subjective, such as in humanities courses, because it does not consider how to extract features from textual responses. This limitation restricts LASA’s capacity to assess complex reasoning and problem-solving: it cannot effectively evaluate higher-order cognitive processes such as critical thinking, creative thinking, or decision-making, which manifest more readily in subjective question types. However, with advancements in NLP technology and the widespread adoption of large language models (LLMs), we may employ LLMs to extract textual features in the future, enabling LASA’s application to a broader range of courses.
Additionally, although LASA has shown promising results in predicting student performance in circuit courses, its effectiveness in other subjects, such as calculus or linear algebra, remains untested due to the lack of long-term data in these areas. Moreover, teaching style differences among instructors may lead to data distribution changes, which could affect the model. To broadly apply our model to various teachers and courses, we plan to collect a wider range of data and develop more advanced methods to extract features and adapt to distribution changes in the future.
Conclusion
This study introduces the Learning Ability Self-Adaptive Algorithm (LASA), a novel approach designed to leverage long-term student data to predict student performance more accurately. By addressing the challenges of heterogeneity and distribution shifts in long-term educational data, LASA significantly improves predictive accuracy over existing state-of-the-art models. Our comprehensive experiments, spanning data from 2016 to 2023, underscore LASA’s ability to adaptively model student learning abilities and align distributions across academic years, thus overcoming the limitations posed by the dynamic nature of educational settings.

The superiority of LASA was demonstrated through extensive comparative analysis, where it consistently outperformed other benchmark models by a notable margin in accuracy. These results underscore the importance of considering the evolving features and distributions of long-term data for student performance prediction. Furthermore, the integration of SHAP-based Model Interpretation into our framework provides valuable insights into the impact of various features on prediction outcomes. This not only enhances the interpretability of our model but also offers actionable guidance for educational interventions aimed at improving student learning outcomes.

Despite the promising results, our study acknowledges limitations, such as the model’s focus on choice-based questions and its application to a single course taught by one instructor. Future work will explore extending LASA to other types of questions and courses, incorporating advanced NLP techniques for feature extraction from subjective assessments, and collecting more diverse long-term data across different instructors and disciplines.

In conclusion, LASA represents a significant step forward in the domain of educational data mining, particularly in the use of long-term student data for performance prediction. Its development aligns with the growing need for adaptive, accurate, and interpretable models in the educational sector, paving the way for more personalized and effective learning experiences.
Data availability
The desensitized data are available at https://github.com/EDM314/LASA.
References
Smith DH IV, Hao Q, Dennen V, Tsikerdekis M, Barnes B, Martin L, Tresham N (2020) Towards understanding online question & answer interactions and their effects on student performance in large-scale stem classes. Int J Educ Technol Higher Educ 17(1):20. https://doi.org/10.1186/s41239-020-00200-7
Xu Z, Yuan H, Liu Q (2021) Student performance prediction based on blended learning. IEEE Trans Educ 64(1):66–73. https://doi.org/10.1109/TE.2020.3008751
Riestra-González M, Paule-Ruíz MdP, Ortin F (2021) Massive lms log data analysis for the early prediction of course-agnostic student performance. Comput Educ 163:104108. https://doi.org/10.1016/j.compedu.2020.104108
Li X, Zhang Y, Cheng H, Li M, Yin B (2022) Student achievement prediction using deep neural network from multi-source campus data. Complex Intell Syst. https://doi.org/10.1007/s40747-022-00731-8
Al-azazi FA, Ghurab M (2023) Ann-lstm: a deep learning model for early student performance prediction in mooc. Heliyon 9(4):15382. https://doi.org/10.1016/j.heliyon.2023.e15382
Alcaraz R, Martínez-Rodrigo A, Zangróniz R, Rieta JJ (2021) Early prediction of students at risk of failing a face-to-face course in power electronic systems. IEEE Trans Learn Technol 14(5):590–603
Yao S, Kang Q, Zhou M, Rawa MJ, Albeshri A (2022) Discriminative manifold distribution alignment for domain adaptation. IEEE Trans Syst Man Cybernet 53(2):1183–1197
Lu W, Chen Y, Wang J, Qin X (2021) Cross-domain activity recognition via substructural optimal transport. Neurocomputing 454:65–75. https://doi.org/10.1016/j.neucom.2021.04.124
Peng X, Bai Q, Xia X, Huang Z, Saenko K, Wang B (2019) Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1406–1415
Liu Y-H, Ren C-X (2022) A two-way alignment approach for unsupervised multi-source domain adaptation. Pattern Recognit 124:108430. https://doi.org/10.1016/j.patcog.2021.108430
Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2066–2073. https://doi.org/10.1109/CVPR.2012.6247911
Wang J, Feng W, Chen Y, Yu H, Huang M, Yu PS (2018) Visual domain adaptation with manifold embedded distribution alignment. In: Proceedings of the 26th ACM International Conference on Multimedia. MM ’18, pp. 402–410. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3240508.3240512
Borg E (2015) Classroom behaviour and academic achievement: How classroom behaviour categories relate to gender and academic performance. Br J Sociol Educ 36(8):1127–1148. https://doi.org/10.1080/01425692.2014.916601
Quadir B, Yang JC, Wang W (2022) Factors influencing the acquisition of english skills in an english learning environment using rain classroom. Interactive Learning Environments 0(0), 1–19. https://doi.org/10.1080/10494820.2022.2075015
Mingyu Z, Sutong W, Yanzhang W, Dujuan W (2022) An interpretable prediction method for university student academic crisis warning. Complex & Intell Syst 8(1):323–336. https://doi.org/10.1007/s40747-021-00383-0
Wang X, Zhao Y, Li C, Ren P (2023) Probsap: a comprehensive and high-performance system for student academic performance prediction. Pattern Recognit 137:109309. https://doi.org/10.1016/j.patcog.2023.109309
Lu X, Zhu Y, Xu Y, Yu J (2021) Learning from multiple dynamic graphs of student and course interactions for student grade predictions. Neurocomputing 431:23–33. https://doi.org/10.1016/j.neucom.2020.12.023
Tsiakmaki M, Kostopoulos G, Kotsiantis S, Ragos O (2020) Transfer learning from deep neural networks for predicting student performance. Appl Sci 10(6):2145. https://doi.org/10.3390/app10062145
Xu B, Yan S, Li S, Du Y (2022) A federated transfer learning framework based on heterogeneous domain adaptation for students’ grades classification. Appl Sci 12(21):10711. https://doi.org/10.3390/app122110711
Kim B-H, Vizitei E, Ganapathi V (2018) Domain adaptation for real-time student performance prediction. arXiv preprint arXiv:1809.06686
Nguyen DH, Vo Thi NC, Nguyen HP (2016) Combining transfer learning and co-training for student classification in an academic credit system. In: 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), pp 55–60. https://doi.org/10.1109/RIVF.2016.7800269
Swamy V, Marras M, Käser T (2022) Meta transfer learning for early success prediction in moocs. In: Proceedings of the Ninth ACM Conference on Learning @ Scale. L@S ’22, pp. 121–132. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3491140.3528273
Yu C et al (2018) Spoc-mflp: a multi-feature learning prediction model for spoc students using machine learning. J Appl Sci Eng 21(2):279–290
Chu C, Wang R (2018) A survey of domain adaptation for neural machine translation. arXiv preprint arXiv:1806.00258
Yuan F, Yao L, Benatallah B (2019) Darec: Deep domain adaptation for cross-domain recommendation via transferring rating patterns. arXiv preprint arXiv:1905.10760
Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281
Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14):49–57
Long M, Wang J, Ding G, Sun J, Yu PS (2013) Transfer feature learning with joint distribution adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2200–2207
Grubinger T, Birlutiu A, Schöner H, Natschläger T, Heskes T (2017) Multi-domain transfer component analysis for domain generalization. Neural Process Lett 46(3):845–855. https://doi.org/10.1007/s11063-017-9612-8
Yeh Y-R, Huang C-H, Wang Y-CF (2014) Heterogeneous domain adaptation and classification by exploiting the correlation subspace. IEEE Trans Image Process 23(5):2009–2018. https://doi.org/10.1109/TIP.2014.2310992
Liu F, Lu J, Zhang G (2018) Unsupervised heterogeneous domain adaptation via shared fuzzy equivalence relations. IEEE Trans Fuzzy Syst 26(6):3555–3568. https://doi.org/10.1109/TFUZZ.2018.2836364
Liu F, Zhang G, Lu J (2021) Multisource heterogeneous unsupervised domain adaptation via fuzzy relation neural networks. IEEE Trans Fuzzy Syst 29(11):3308–3322. https://doi.org/10.1109/TFUZZ.2020.3018191
Sun B, Saenko K (2016) Deep coral: Correlation alignment for deep domain adaptation. In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14, pp. 443–450. Springer
Liu F, Zhang G, Lu J (2020) Heterogeneous domain adaptation: an unsupervised approach. IEEE Trans Neural Netw Learning Syst 31(12):5588–5602. https://doi.org/10.1109/TNNLS.2020.2973293
Fuglede B, Topsoe F (2004) Jensen-shannon divergence and hilbert space embedding. In: International Symposium on Information Theory, 2004. ISIT 2004. Proceedings, p. 31. IEEE
Noble WS (2006) What is a support vector machine? Nat Biotechnol 24(12):1565–1567
Bilmes JA et al (1998) A gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. Int Comput Sci Inst 4(510):126
Chen C, Fu Z, Chen Z, Jin S, Cheng Z, Jin X, Hua X-s (2020) Homm: Higher-order moment matching for unsupervised domain adaptation. Proc AAAI Conf Artificial Intell 34(04):3422–3429. https://doi.org/10.1609/aaai.v34i04.5745
Sun B, Feng J, Saenko K (2016) Return of frustratingly easy domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30
Shafiq DA, Marjani M, Habeeb RAA, Asirvatham D (2022) Student retention using educational data mining and predictive analytics: a systematic literature review. IEEE Access 10:72480–72503. https://doi.org/10.1109/ACCESS.2022.3188767
Kouw WM, Loog M (2021) A review of domain adaptation without target labels. IEEE Trans Pattern Anal Mach Intell 43(3):766–785. https://doi.org/10.1109/TPAMI.2019.2945942
Suthaharan S (2016) Support vector machine. In: Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, pp 207–235. Springer
Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. Advances in neural information processing systems 30
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):2096–2030
Cui X, Bollegala D (2020) Multi-Source Attention for Unsupervised Domain Adaptation. arXiv. https://doi.org/10.48550/arXiv.2004.06608
Xie S-T, He Z-B, Chen Q, Chen R-X, Kong Q-Z, Song C-Y (2021) Predicting learning behavior using log data in blended teaching. Sci Programm 2021:4327896. https://doi.org/10.1155/2021/4327896
Shahraki NS, Zahiri SH (2021) Drla: Dimensionality ranking in learning automata and its application on designing analog active filters. Knowl-Based Syst 219:106886
Shahraki NS, Zahiri S-H (2017) Inclined planes optimization algorithm in optimal architecture of mlp neural networks. In: 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA), pp. 189–194. IEEE
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant 61877063.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 61877063.
Author information
Contributions
YR was involved in the conception and design of the study, data preprocessing, analysis, and manuscript writing. XY contributed to the study’s design, data collection, supervised the research, and revised the manuscript critically for important intellectual content. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
All authors declare no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Code availability
The code is available at https://github.com/EDM314/LASA.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: overview and preliminaries
A.1 Graphical abstract
The graphical abstract of our work is shown in Fig. 11.
1. Background: With the rapid development of digital campuses, large amounts of student academic performance data have been collected over the years, laying the groundwork for data-driven academic prediction systems. However, existing studies typically utilize short-term data, often limited to 1 or 2 years or even shorter periods. Intuitively, leveraging long-term data offers a more comprehensive understanding of student behavior, potentially enhancing the predictive accuracy of models. However, directly employing long-term data in prediction models implies the assumption that data distributions remain consistent across different periods. This assumption overlooks the dynamic nature of educational settings, where both teaching methodologies and curricular content undergo changes. Such course evolution can lead to shifts in data distributions and heterogeneity in feature spaces across different academic years, ultimately compromising the effectiveness of prediction models.

2. Innovation: To address these challenges, we introduce the Learning Ability Self-Adaptive Algorithm (LASA), which is designed to adapt the predictive model to the evolving feature spaces and distributions encountered in long-term data. LASA comprises two innovative components: Learning Ability Modeling and Long-term Distribution Alignment. Learning Ability Modeling assumes that students’ responses to exercises are samples from distributions parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to overcome the heterogeneity present in long-term data. Meanwhile, Long-term Distribution Alignment employs multiple asymmetric transformations to align feature distributions across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA generates well-aligned features with meaningful semantics. Building on LASA, we propose an interpretable prediction framework that utilizes LASA for obtaining semantically meaningful features, incorporates a base classifier for outcome predictions, and employs Shapley Additive Explanations (SHAP) for elucidating the impact of specific features on student performance.

3. Key findings and results: Our exploration of long-term student data covers an eight-year period (2016-2023) from a face-to-face course at Tsinghua University. Comprehensive experiments demonstrate that leveraging long-term data significantly enhances prediction accuracy compared to short-term data, with LASA achieving up to a 7.9% increase. Moreover, when employing long-term data, LASA outperforms the state-of-the-art model ProbSAP by an average accuracy improvement of 6.8%. We also present a quantitative analysis of feature impacts on student performance, offering interpretable insights for pedagogical interventions. To the best of our knowledge, this study is the first to investigate student performance prediction in long-term data scenarios.
A.2 Abbreviations
The list of abbreviations is provided in Table 7.
Appendix B: Additional experimental details
B.1 Statistical analysis
We first apply the Friedman test to establish a ranking of the different methods included in our experiments. Subsequently, we conduct a post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction to adjust p-values for each pairwise comparison of methods, as shown in Table 8. The results of these statistical tests clearly show that our proposed Learning Ability Self-Adaptive Algorithm (LASA) significantly outperforms all other methods in the study, except for PCA-ORACLE, with p-values consistently below 0.05, indicating statistical significance. Although the p-value in the comparison with PCA-ORACLE is above 0.05, our method still ranks higher in terms of average performance and Friedman ranking.
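This testing pipeline follows standard SciPy routines; a minimal sketch with illustrative (not actual) accuracy values:

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# accs[m]: per-task accuracies of each method on the same target
# years; the numbers below are illustrative only.
accs = {
    "LASA":    np.array([0.81, 0.79, 0.84, 0.80, 0.83]),
    "ProbSAP": np.array([0.74, 0.71, 0.78, 0.73, 0.76]),
    "SFERNN":  np.array([0.75, 0.72, 0.77, 0.74, 0.75]),
}

stat, p = friedmanchisquare(*accs.values())
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.4f}")

pairs = list(combinations(accs, 2))
for a, b in pairs:
    _, p = wilcoxon(accs[a], accs[b])
    # Bonferroni correction: scale each p-value by the number of tests.
    print(f"{a} vs {b}: corrected p = {min(p * len(pairs), 1.0):.4f}")
```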
Fig. 12 The log-likelihood function values during the iterative process of LASA for each year from 2016 to 2023. Different graphs represent different years, with the horizontal axis showing the number of iterations and the vertical axis displaying the value of the log-likelihood function. It can be observed that the data for all years converge within 15 iterations
B.2 Convergence of LASA
The principle behind LASA’s Learning Ability Modeling involves iteratively updating the learned parameters to maximize the likelihood function. To demonstrate LASA’s convergence, we present the evolution of the likelihood values throughout the iterative process in Fig. 12. It is important to note that although significant differences exist in the likelihood values across various years, these discrepancies, stemming from annual variations in the number of samples and exercises, do not affect the algorithm’s performance or its convergence. Our focus is on the trend of the log-likelihood values within each year. Our findings indicate that for the years 2016 through 2023, LASA consistently achieves convergence rapidly, typically within 15 iterations. Therefore, we can set the maximum number of iterations I to 20. This demonstrates the algorithm’s swift convergence rate and robust convergence properties.
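The stopping rule is the usual EM criterion; a sketch, where `em_step` is a hypothetical function performing one E-step and M-step of LASA’s parameter updates:

```python
def fit_em(em_step, params, max_iter=20, tol=1e-6):
    """Iterate EM updates until the log-likelihood stabilizes.

    `em_step(params)` is a placeholder that performs one E-step and
    M-step and returns (updated_params, log_likelihood).
    """
    prev_ll = -float("inf")
    for _ in range(max_iter):  # I = 20 in our experiments
        params, ll = em_step(params)
        if ll - prev_ll < tol:  # EM never decreases the likelihood
            break
        prev_ll = ll
    return params
```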
B.3 Learning curve
Additionally, we have plotted the learning curve of LASA, as shown in Fig. 13. A cross-validation generator divides the entire dataset into training and test data across 10 iterations. Various subsets of the training set are used to train the classifier, and a score (accuracy) is computed for each subset size on the test set. The scores are then averaged over all 10 iterations for each subset size. We observe that with fewer data points, the model is prone to overfitting, indicated by a significantly higher training accuracy compared to testing accuracy. As the amount of data increases, training accuracy decreases and stabilizes, while testing accuracy gradually improves, indicating a reduction in overfitting. This suggests that larger datasets enhance the model’s generalization ability and that our model can still accommodate more data to improve test accuracy, as indicated by the continued gradual increase in test accuracy.
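This procedure matches scikit-learn’s learning_curve utility; a sketch on synthetic stand-in features:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.svm import SVC

# Synthetic stand-ins for LASA features and labels.
rng = np.random.default_rng(0)
Z, y = rng.normal(size=(300, 9)), rng.integers(0, 2, 300)

# 10 shuffled train/test splits, as described above.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf"), Z, y, cv=cv, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5))

print(sizes)
print(train_scores.mean(axis=1))  # averaged over the 10 iterations
print(test_scores.mean(axis=1))
```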
B.4 Hyperparameter optimization of the classifier
In this study, the proposed LASA framework is utilized in conjunction with the Support Vector Machine (SVM) classifier. Consequently, we conducted experiments to determine the best settings for the SVM’s hyperparameters. The results of the grid search and 5-fold cross-validation are presented in Fig. 14. The axes display the hyperparameter values, and each cell shows the corresponding cross-validation score. Brighter colors denote higher performance, while darker colors represent lower performance. We employed the radial basis function (rbf) kernel, with C representing the regularization parameter and gamma indicating the parameter of the rbf kernel. Our findings imply that the SVM classifier achieves satisfactory performance over a broad range of parameter values, highlighting one of SVM’s advantages of being relatively insensitive to parameter settings, thus eliminating the necessity for extensive optimization.
Fig. 14 Relationship between SVM classification performance and hyperparameters. The x-axis and y-axis represent the values of the parameters, respectively, where each cell’s value indicates the cross-validation score. Brighter colors signify higher performance, while darker colors indicate lower performance
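The search itself is a standard grid search with 5-fold cross-validation; the grid below is illustrative rather than the exact search space used in the paper:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-ins for LASA features and labels.
rng = np.random.default_rng(0)
Z, y = rng.normal(size=(200, 9)), rng.integers(0, 2, 200)

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=5, scoring="accuracy")
search.fit(Z, y)
print(search.best_params_, round(search.best_score_, 3))
```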
B.5 Exploring the impact of temporal data proximity on predictive model performance
In this section, we conduct supplementary experiments using our dataset, which spans from 2016 to 2023. These experiments are designed to understand how the proximity of training data to the target prediction year affects model performance, considering the evolution of course content and teaching methodologies over time.
We conducted an experiment focusing on predicting outcomes for the year 2023. The results are shown in Fig. 15. We used a sliding window method with window sizes of 3, 2, and 1 year, spanning from 2016 to 2022. This approach generates subsets of data reflecting varying degrees of closeness to the target year. For example, with a window size of 3, we start with data from 2020 to 2022. In this case, the proximity to the prediction target year is 1, i.e., the gap between the closest year in the window and the target year is one year; this corresponds to the point on the graph where the horizontal axis is 1. The window then shifts back by one year at a time, with the final window containing data from 2016 to 2018, corresponding to the point where the horizontal axis is 5. We use the data from each window to predict student performance for 2023 using our LASA algorithm. To test the robustness of our results, we also conducted experiments with window sizes of 2 and 1 year(s), examining the effects of utilizing temporal data over different lengths. Each experiment was repeated 10 times, and the average was taken to ensure the reliability of our findings.
Fig. 15 Accuracy trends over different proximities to the prediction target year for various window sizes are presented. The red line indicates a window size of 3 years, the green line for 2 years, and the blue line for 1 year. Each point represents the average accuracy from 10 runs of the LASA algorithm, with shaded areas showing the variability across these runs. A noticeable decrease in accuracy for a window size of 1 at the horizontal axis 2 points to a significant change in data characteristics during that period
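The window construction can be made precise in a few lines; here proximity is the gap in years between the newest year in a window and the target year 2023:

```python
def sliding_windows(first, last, size):
    """Yield (proximity, source_years) for every `size`-year window
    within [first, last], newest window first; proximity is the gap
    between the newest year in the window and the target year."""
    target = last + 1
    for end in range(last, first + size - 2, -1):
        yield target - end, list(range(end - size + 1, end + 1))

# Windows of size 3 over 2016-2022 for predicting 2023:
for proximity, years in sliding_windows(2016, 2022, 3):
    print(proximity, years)
# 1 [2020, 2021, 2022]  ...  5 [2016, 2017, 2018]
```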
Our analysis uncovers an interesting pattern: when the window size is set to 1, accuracy tends to decline as the training data becomes more distant from the target prediction year. This observation aligns with the common understanding that more recent data might more accurately represent the current educational environment. However, a significant decrease in accuracy for the year 2021 (horizontal axis value of 2) suggests a considerable change in the data for that year, which then improves in 2022 (horizontal axis value of 1). This trend highlights that sudden changes can affect predictive performance. For larger window sizes, though some declines in performance at horizontal axis values of 1 and 2 are noted, the impact is much less severe than with a window size of 1. Additionally, we found that a window size of 3 generally outperforms the other two settings most of the time. These findings emphasize the value of using long-term data to cushion against abrupt shifts. Given the unpredictability of which years might experience significant changes, incorporating long-term data is essential for building reliable predictive models.
Fig. 16 a Feature contributions in a scenario where the student is not actually at-risk but is mistakenly predicted as one. b Contribution of feature values in an instance where the student is genuinely at-risk but is not predicted as such. The letters on the left represent the features and their corresponding values. In the figure, blue indicates that the feature contributes negatively to the prediction, while red indicates a positive contribution to the prediction. The numerical values and the value of f(x) represent the probability
B.6 Complexity analysis
In this section, we address the computational complexity of our proposed Learning Ability Self-Adaptive Algorithm (LASA) and the overall predictive framework.
B.6.1 LASA complexity
Learning Ability Modeling: This involves matrix operations and optimization via the EM algorithm, with complexity primarily driven by the number of students (N), exercises (D), and response types (M). The overall computational load is \(O(I \cdot N \cdot D \cdot M)\), where I represents the number of iterations until convergence.
Long-term Distribution Alignment: This entails statistical normalization and distribution alignment across timestamps, with complexity affected by the number of categories (K), response types (M), and the process of singular value decomposition for asymmetric transformations. The computational demand is significant, especially for datasets with a large number of timestamps and features.
B.6.2 SVM Classifier complexity
Training an SVM classifier involves complexities ranging from \(O(N^2 \cdot K \cdot M)\) to \(O(N^3 \cdot K \cdot M)\), depending on the choice of kernel and optimization techniques used, where \(K \cdot M\) represents the dimensionality of the features obtained from Learning Ability Modeling. The complexity of making predictions with an SVM is \(O(SV \cdot K \cdot M)\), where SV denotes the number of support vectors and \(K \cdot M\) is the feature dimensionality.
B.6.3 SHAP-based model interpretation
Computing SHAP values for model interpretation involves a computational complexity that can reach up to \(O(2^{K \cdot M})\) for exact calculations, where \(K \cdot M\) represents the dimensionality of the features. Nevertheless, the adoption of efficient algorithms and approximation techniques can significantly lower this complexity, making it feasible for practical applications. For instance, specific models and implementation strategies can reduce the computational load to \(O((K \cdot M)^2)\), thereby enhancing the feasibility of applying SHAP for interpretative analysis in educational data mining contexts.
B.7 More interpretability case studies
Figure 16a presents an analysis for a student whose three types of learning abilities, \(\text {H}\), \(\text {M}\), and \(\text {E}\), are significantly below average, leading the model to predict them as at-risk. However, in reality, this student scored only 34 in the midterm but achieved 83 in the final exam, which is an above-average score. This improvement might be attributed to the typical 1-2 week review period between the end of the course and the final exam. Despite the student’s lack of focus during regular classes, their intensive study right before the exam helped them grasp the course content, resulting in satisfactory performance ultimately. This case highlights the importance of predicting student performance, as early warnings and guidance can prevent them from eventually failing. The analysis of a student who is genuinely at-risk but is not predicted as such is depicted in Fig. 16b. The learning ability values indicate that they have an above-average understanding and ability to apply basic concepts in combination. Consequently, the model classifies them as not being at-risk. However, in reality, they are at-risk and received a low score on the final exam. This situation reveals the limitations of our model. To identify students like these, who are not successfully classified as at-risk, we might need to consider additional factors, such as their response times or homework performance, to uncover more behavioral patterns and thus achieve accurate classification.
B.8 Ablation study
LASA mainly consists of the following components on top of the base classifier: Learning Ability Modeling and Long-term Distribution Alignment. It is worth noting that although SHAP is also a component, it does not affect the prediction results. To assess the efficacy of the proposed LASA, we first conduct ablation studies by removing each component of LASA separately to observe its impact on performance. The variants of LASA are as follows:

(a) LASA-\(\text {L}_1\): without both LAM and LTDA, using the original features in place of the learning ability parameter features.
(b) LASA-\(\text {L}_2\): without LAM, using the original features in place of the learning ability parameter features.
(c) LASA-\(\text {L}_3\): without both LAM and LTDA, but using PCA to generate features of the same dimensionality.
(d) LASA-\(\text {L}_4\): without LAM, but using PCA to generate features of the same dimensionality.
(e) LASA-\(\text {L}_5\): without part of LTDA, treating multiple timestamps as one dataset.
(f) LASA-\(\text {L}_6\): without LTDA.
For each variant, and for every Target Timestamp, we perform 20 runs and take the mean of the results. The classification accuracy results of the variants and the complete LASA on different target timestamps are reported in Table 9.
We observe that LASA consistently outperforms its variants across most tasks, underscoring the significance of each component within LASA. Contrasting LASA-\(\text {L}_1\), which lacks the LTDA, with LASA-\(\text {L}_2\), which includes it, we notice inferior performance when the alignment is present, indicating negative transfer. This may arise from the high dimensionality and discreteness of the original data, which make direct LTDA on raw data challenging. It suggests that a suitable distribution representation is essential for alignment, highlighting the importance of our modeling of the student learning ability distribution. Comparing LASA-\(\text {L}_3\) and LASA-\(\text {L}_6\), both without the LTDA, the latter exhibits superior performance, surpassing the former by an average of 5.7%. This demonstrates that the ability parameter matrix derived from LAM provides a more intrinsic reflection of student performance than the feature space obtained via PCA. Moreover, with both LASA-\(\text {L}_4\) and LASA incorporating the LTDA, LASA achieves an average accuracy 5.2% higher, further validating the efficacy of our proposed LAM. Furthermore, LASA-\(\text {L}_6\), which omits the LTDA, lags behind the complete LASA, especially in the 2021 tasks, where LASA outperforms LASA-\(\text {L}_6\) by 11.8%. This suggests that the LTDA effectively narrows the distribution differences across multiple timestamps, further boosting model performance. Additionally, comparing LASA-\(\text {L}_5\) and LASA shows that treating multiple timestamps as a single distribution does not aptly address distribution differences, since it overlooks the inherent distribution variations between timestamps. In summary, our proposed LAM effectively captures parameters reflecting students’ learning abilities, and the LTDA alleviates distribution disparities between timestamps; their synergistic interplay further enhances model performance.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ren, Y., Yu, X. Long-term student performance prediction using learning ability self-adaptive algorithm. Complex Intell. Syst. (2024). https://doi.org/10.1007/s40747-024-01476-2