Introduction

Background

As institutions aim to improve education quality and reduce student failure and dropout rates, student performance prediction has become a key research topic in educational data mining (EDM). Large classroom environments, which are prevalent in higher education [1], often suffer from high failure and dropout rates. Given the large number of students in these settings, educators find it challenging to promptly identify individual learning behaviors and capabilities, which hinders the implementation of personalized educational strategies [2]. Thus, it is imperative to develop student performance prediction systems for campus students, allowing educators to implement data-driven educational interventions based on solid insights.

With the rapid development of digital campuses, substantial amounts of student academic performance data have been collected over the years, providing a foundation for data-driven academic prediction systems [3]. Existing studies typically utilize short-term data, often limited to 1 or 2 years or even shorter periods [4, 5]. Intuitively, leveraging long-term data should offer a more comprehensive view of student behavior, potentially improving the predictive accuracy of models. However, directly employing long-term data in prediction models implies the assumption that data distributions remain consistent across different periods [6]. This practice overlooks the dynamic nature of educational settings, where both teaching methodologies and curricular content may evolve. Furthermore, this evolution can lead to shifts in data distributions and heterogeneity in feature spaces across different academic years, ultimately compromising the effectiveness of prediction models.

Primary motivations

In this paper, we address a more realistic and challenging scenario: long-term student performance prediction. As courses evolve and historical data from multiple years become available, both teaching methodologies and curricular content undergo changes. This evolution results in long-term data characterized by distribution discrepancies and varying feature dimensionalities, posing significant challenges to the efficacy of predictive models. Figure 1 illustrates the differences in data distribution and feature space assumptions between previous studies and our current work. While distribution shifts and heterogeneous feature spaces in long-term data are ubiquitous in educational settings, they remain largely understudied.

Fig. 1

Previous studies have predominantly utilized short-term datasets, assuming that these datasets possess consistent feature spaces and similar distributions. This approach commonly involves aggregating data across different years and randomly dividing it into training and test sets, potentially neglecting variations in distribution over time. In contrast, this paper emphasizes the analysis of long-term data. Due to factors such as curriculum changes, the feature space naturally evolves, leading to alterations in both the number and the meaning of features. These changes also induce shifts in the data distribution over time. In practical scenarios, when predicting student grades for a course, one can only leverage the historically labeled data from previous iterations of the course, since labels (student exam grades) for the target prediction set are not available until after final exams are completed. Hence, simply mixing and splitting the data is impractical. To evaluate model performance more realistically, our methodology involves selecting a particular year’s unlabeled data as the prediction target and using all preceding available labeled data. This approach better reflects real-world applications and enhances practicality

Domain adaptation (DA) is a promising technique to address the challenge of distribution shifts in long-term data. It aims to bridge the distributional gap between the source and target domains by aligning their feature distributions [7]. However, many DA methods assume that single [8] or multiple sources [9, 10] domains share the same feature space as the target domain, differing only in distributions. This assumption hinders the applicability to long-term heterogeneous data, which often present multiple source domains with distinct feature spaces and distributions. Moreover, many DA methods rely on symmetric subspace mappings [11, 12] or deep learning structures [9], which may undermine the semantics of the feature space, resulting in predictions that lack interpretability. Considering the crucial role of interpretability in educational settings, where educators rely on understanding the rationales behind predictions to carry out informed interventions [13], it is imperative to develop a student performance prediction approach that not only accommodates the heterogeneity and distribution shifts in data but also ensures interpretable outcomes.

Innovation aspects

Our work includes three innovation aspects:

Research Question Innovations: We tackle the critical yet underexplored area of utilizing long-term data to predict student performance. Our work illuminates the unique challenges and opportunities presented by long-term educational data, such as feature heterogeneity and distribution shifts, which have not been adequately addressed in existing studies.

Methodological Innovations: To address these challenges, we introduce the Learning Ability Self-Adaptive Algorithm (LASA), designed to adapt the predictive model to the evolving feature spaces and distributions encountered in long-term data. LASA comprises two innovative components: Learning Ability Modeling (LAM) and Long-term Distribution Alignment (LTDA). LAM assumes that students’ responses to exercises are samples from distributions parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to overcome the heterogeneity present in long-term data. Meanwhile, LTDA employs multiple asymmetric transformations to align feature distributions across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA generates well-aligned features with meaningful semantics. Building on LASA, we propose an interpretable prediction framework that utilizes LASA to obtain semantically meaningful features, incorporates a base classifier for outcome predictions, and employs Shapley Additive Explanations (SHAP) to elucidate the impact of specific features on student performance.

Dataset Innovations: To empirically validate our approach and bridge the research gap in long-term student performance prediction, we introduce an 8-year dataset (2016-2023) from the face-to-face course, Principles of Electric Circuits, at Tsinghua University. This dataset, collected using Rain Classroom, a leading educational tool in China [14], provides an opportunity to apply various student performance prediction methods in a real-world, long-term data scenario. The evolving features and data distributions exemplify the limitations of conventional prediction methods and underscore the necessity of LASA.

Contributions

The main contributions of this paper are as follows:

  1.

    For the first time, this study delves into the critical yet underexplored area of utilizing long-term data to predict student performance, highlighting the advantages and challenges of this scenario. We demonstrate that challenges such as feature heterogeneity and distribution shifts within long-term data can potentially impair classifier performance. Moreover, we show that overcoming these challenges and effectively leveraging long-term data can significantly enhance predictive accuracy, yielding more robust classification results compared to those based on short-term data.

  2.

    We propose a novel method, the Learning Ability Self-Adaptive Algorithm (LASA), specifically designed to harness the full potential of long-term student data. To the best of our knowledge, LASA is the first method aimed at tackling the issues of heterogeneity, distribution shifts, and obtaining interpretable features in the context of long-term student performance prediction.

  3.

    We introduce an interpretable prediction framework that includes three key components: first, it generates semantically meaningful features using the Learning Ability Self-Adaptive Algorithm (LASA); second, it employs a classifier to derive prediction outcomes based on these features; and third, it applies SHAP-based Model Interpretation to elucidate the influence of specific features on the prediction outcomes. This comprehensive framework provides educators with actionable insights, enabling targeted pedagogical interventions.

  4.

    We present the Long-term Student Performance Dataset, a pioneering dataset that encompasses an 8-year span (2016-2023) for long-term student performance prediction. This new dataset lays the foundation for groundbreaking insights into the challenges and benefits of using long-term educational data. The anonymized dataset and code are available at https://github.com/EDM314/LASA.

  5.

    Extensive experiments on our real-world datasets demonstrate that our LASA can effectively utilize long-term data to enhance predictive performance, surpassing state-of-the-art models, ProbSAP and SFERNN, by average accuracy improvements of 6.8% and 6.4%, respectively. Our analysis also provides interpretable insights into the factors influencing student success, paving the way for more effective pedagogical strategies.

Paper organization

The remainder of this paper is organized as follows: Sect. “Related work” provides a review of related work, positioning our research within the existing scholarly landscape. Section “The long-term student performance dataset” describes the long-term dataset, including the methodologies for data collection and analysis. Section “Methodology” delineates the methodology underpinning the Learning Ability Self-Adaptive Algorithm (LASA), detailing its components and outlining the proposed framework for interpretable predictions. Section “Experiments” presents the experimental results and discusses the implications of our findings. Section “Interpretability and case study” offers an interpretable analysis of the prediction outcomes, yielding insights for pedagogical interventions. Section “Discussion” discusses the main findings of this study, acknowledges its limitations, and suggests potential avenues for future research. Finally, Sect. “Conclusion” concludes the paper.

Related work

In this section, we review some previous works related to our study in terms of student performance prediction and domain adaptation.

Student performance prediction

Student performance prediction is a critical research topic in educational data mining (EDM). In recent years, numerous studies have been dedicated to leveraging data mining techniques to predict student academic performance. Li et al., based on 145 days of students’ daily living data, introduced a model that combines CNN with LSTM to predict students’ GPA [4]. Similarly, Riestra–Gonzalez et al. [3] attempted to predict students’ performance by analyzing log files from a learning management system (LMS) for an entire academic year. Additionally, CatBoost-SHAP [15] built student profiles using K-prototypes and employed CatBoost to detect at-risk students, utilizing two years of data, with one year serving as the training set and the other as the test set. While these studies have proposed various predictive models, most of them are based on short-term data, using data from only 1–2 years or even less. Unlike them, our study focuses on the benefits and challenges of utilizing long-term data, which is more aligned with practical applications.

A few studies also utilize data spanning several years. For example, Alcaraz et al. [6] collected data on the Power Electronic Systems course from 2010 to 2017. They incorporated a set of expert features and employed traditional machine learning algorithms to predict student grades. ProbSAP [16] explored using academic year GPA and prerequisite course scores to predict student grades for the Probability Statistics course, drawing upon data from 2015 to 2018. Lu et al. [17] predicted student grades for the final academic year using historical course data and LMS interaction data from 2014 to 2017. Although these studies leveraged long-term data, they often treated the data from multiple years as a single distribution, thus overlooking potential shifts in data distributions across years, an assumption that rarely holds in real-world scenarios.

While some research has been directed towards managing distribution shifts in student performance prediction, these efforts primarily revolve around adapting models for different courses [18,19,20,21] or developing multi-course applications [22, 23]. These approaches, while aiming to address distribution variations, primarily concentrate on the course-level differences, often neglecting the intricate challenges posed by long-term educational data.

In summary, there has been a lack of research on long-term student performance prediction. We tackle the unique distributional challenges presented by historical data accumulated over extended periods, an unexplored area in EDM. This focus allows us to harness the full potential of long-standing data records, significantly enhancing the efficacy of our predictive model.

Domain adaptation algorithm

The DA technique is designed to address distribution differences between the source and target domains by aligning their respective feature distributions. In recent years, DA has been widely applied to various fields, including computer vision [12], natural language processing [24], and recommender systems [25]. This method shows promise for addressing distribution discrepancies in long-term data. Given that labels for future data are unavailable in real applications, our focus is on Unsupervised Domain Adaptation (UDA) techniques. According to the number and properties of source domains, UDA methods can be divided into Single-source Homogeneous UDA (SsHoUDA), Multi-source Homogeneous UDA (MsHoUDA), Single-source Heterogeneous UDA (SsHeUDA), and Multi-source Heterogeneous UDA (MsHeUDA).

SsHoUDA aims to transfer knowledge from a single source domain to a target domain, with both domains being homogeneous (having the same dimensionalities). Transfer Component Analysis (TCA) [26] is a widely recognized domain adaptation technique that seeks to identify a shared subspace between different domains, thereby reducing the Maximum Mean Discrepancy (MMD) [27]. Unlike TCA, Correlation Alignment (CORAL) [28] is proposed to align the second-order statistics between two domains. Moreover, Wang et al. introduced the Manifold Embedded Distribution Alignment (MEDA) [12], which integrates the Grassmann manifold and adjusts the weights between the marginal and conditional distributions for more accurate alignment. Although these methods have shown promising results in specific contexts, they are primarily designed for a single source and might neglect a significant portion of the information when dealing with long-term data.

MsHoUDA focuses on mitigating the distribution discrepancies across multiple source domains. As an extension of TCA, Multi-Domain TCA [29] was introduced to align multiple source domains with the target domain. Similarly, Peng et al. [9] presented M3SDA, a method that aligns the moments between multiple source domains and the target domain. TWMDA [10] further refines this alignment by accounting for sample weights. However, these techniques, primarily designed for homogeneous domains, might not yield optimal results when the feature spaces of the source and target domains diverge, a challenge frequently encountered with heterogeneous long-term data in educational settings.

SsHeUDA methods can bridge two heterogeneous cross-domain feature spaces and can be applied to domains with different dimensionalities. Yeh et al. [30] introduced Reduced Kernel Canonical Correlation Analysis (RKCCA), leveraging a derived correlation subspace to associate data across domains. However, this method depends on parallel instances, which are often lacking in many domains. Addressing this constraint, Liu et al. [31] proposed Shared Fuzzy Equivalence Relations (SFER). This facilitates knowledge transfer from a heterogeneous source domain to the target domain without requiring parallel instances. Despite its strengths, this strategy is still designed specifically for a single source domain.

MsHeUDA remains an inadequately explored challenge. Fuzzy Relation Neural Networks [32] appear to be a potential solution, integrating multiple neural networks within the SFER framework to learn domain-invariant features across multiple heterogeneous domains. However, constrained by the SFER structure, this method exhibits a limited capability in capturing students’ behavioral patterns in the original response data. Furthermore, its intricate architecture lacks interpretability, rendering it unable to provide meaningful decision support. Our approach, LASA, can be considered a kind of MsHeUDA approach, which not only transfers knowledge across multiple heterogeneous domains but also retains interpretability.

Summary

In Table 1, we summarize the characteristics of recent methods, focusing on whether they address the challenges present in long-term data and their interpretability. Most student performance prediction methods are based on short-term data and do not thoroughly explore the role of long-term data. Some studies that utilize long-term data simply amalgamate all available data for model validation without considering the potential impacts of long-term data on future predictions. Additionally, most DA methods fall short in addressing the distribution shifts in multiple heterogeneous feature spaces, which are typical in long-term educational data. Moreover, they often come with complex structures and lack interpretability, failing to provide the interpretable predictions that educators need in order to implement targeted teaching interventions. Our proposed LASA not only effectively resolves the challenges posed by heterogeneous feature spaces and distribution shifts but also provides interpretable predictions.

Table 1 Summary of Related Works and Their Characteristics

The long-term student performance dataset

In this section, we present the Long-term Student Performance Dataset to bridge the existing gap in long-term student performance prediction research. This real-world dataset poses challenges such as heterogeneous feature spaces and distribution shifts, paving the way for the development of more practical and robust predictive models.

Data collection

We utilize Rain Classroom, one of the most popular teaching tools in China, to automatically collect in-class data. With Rain Classroom, upon completing the teaching of a knowledge point, the teacher can push corresponding exercises to students for data collection. In this way, data collection can be seamlessly integrated into the teaching process without disrupting the teaching rhythm or placing an extra burden on students and teachers. Meanwhile, students can receive and answer questions via their smartphones without the need for an extra device, and the exercise results are recorded into the database. The results also count towards the students’ regular course scores, which encourages them to take it seriously. A schematic representation of the data collection process is depicted in Fig. 2.

Fig. 2

In each lesson, after teaching a particular topic, instructors can use the Rain Classroom platform to push relevant in-class exercises to students. Students can complete these exercises using their phones, eliminating the need for additional devices. Simultaneously, their responses are recorded, contributing to their in-class behavior data. After the course concludes, a final exam is administered, and students’ scores on this exam reflect their overall performance in the course

We collected data from the Principles of Electric Circuits course using Rain Classroom over a total of 8 years, from 2016 to 2023. Principles of Electric Circuits is a required second-year course for many majors at Tsinghua University and was also one of the first courses to use Rain Classroom for teaching purposes. The exercises involved in our study are multiple-choice questions, with data samples and their features presented in Table 2. Because it is a large-scale face-to-face course, its in-class data are highly representative for student performance prediction. In the following, we provide a detailed analysis of the challenges encountered in the Long-term Student Performance Dataset: heterogeneous feature spaces and distribution shifts.

Table 2 Example of raw data samples
Fig. 3

Yearly statistics from 2016 to 2023 for the Principles of Electric Circuits course at Tsinghua University. The bar graphs represent the number of students, lessons, and exercises per year. The line graph showcases the average correct rate of student answers with standard deviation. Noticeable fluctuations in the data components reflect the dynamic evolution of the course content, student enrollment, and their performance

Data analysis

Figure 3 presents the basic statistics of our dataset. Notably, the dataset exhibits annual fluctuations in both the number of lessons and exercises from 2016 to 2023, reflecting the evolving nature of the course content. Such ongoing modifications hint at potential challenges associated with heterogeneous feature spaces that possess varying dimensions. Furthermore, the variations observed in student enrollments and their average accuracy rates underscore the existence of distribution shifts in the dataset.

To quantitatively analyze the dataset’s evolving feature-level characteristics year-over-year, we employ both single-feature and multi-feature evaluations. Using the students’ correct answer rate as a representative statistical feature, we calculate the Jensen-Shannon (JS) divergence [35] between different years (the smaller the value, the smaller the difference). As depicted in Fig. 4a, the distribution difference grows over time, with two noticeable similarity blocks: 2016–2020 and 2021–2023, confirming feature-level distribution shifts in long-term data. Moreover, in Fig. 4b, we employ the first principal component after Principal Component Analysis (PCA) for assessment. PCA is a statistical procedure that utilizes orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components. In line with our previous findings, the closeness in color intensity around the diagonal indicates a higher similarity in distributions for consecutive years. Similarity blocks also exist, but they differ from those in Fig. 4a, which might be attributed to PCA altering the feature structures and distributions. Additionally, we observe a significant discrepancy in the distribution between 2021 and its adjacent years. These findings further underscore the distribution shifts in long-term data.
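For concreteness, the single-feature comparison can be reproduced with a short script. The sketch below assumes that each year is summarized by its students’ per-student correct-answer rates and that these continuous rates are discretized into a shared histogram before the JS divergence is computed; the bin count and function names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def yearly_js_divergence(rates_a, rates_b, bins=20):
    """JS divergence between two years' per-student correct-rate distributions.

    rates_a, rates_b: 1-D arrays of correct-answer rates in [0, 1].
    The continuous rates are discretized into a shared histogram so that both
    distributions are compared over the same support.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(rates_a, bins=edges)
    q, _ = np.histogram(rates_b, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return jensenshannon(p, q) ** 2   # scipy returns the JS distance (its square root)

# Pairwise matrix over years, given a dict {year: rates array}:
#   years = sorted(yearly_rates)
#   D = [[yearly_js_divergence(yearly_rates[a], yearly_rates[b]) for b in years]
#        for a in years]
```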

Fig. 4

a Distribution discrepancy of a single statistical feature, where deeper shades of blue signify smaller distribution differences. Data from 2016–2020 exhibit similarities amongst each other, as does the data from 2021–2023, yet significant differences exist between these two clusters. b Distribution discrepancy of single PCA features, revealing similar patterns. c Joint distribution differences of multiple PCA features. Classification outcomes from SVM trained on data from different years confirm the presence of these distribution shifts

To provide a comprehensive understanding beyond single-feature analysis, we delve into distribution discrepancies across yearly dataset splits using a classifier-based multi-feature approach. Given the heterogeneous nature of our long-term data and varying feature dimensions, as depicted in Fig. 3, directly applying traditional machine learning classifiers proves challenging due to their expectation of consistent feature dimensions. Consequently, we utilize the first 30 principal components post-PCA as a consistent feature set and employ an SVM classifier [36]. The labeling process is elaborated in Sect. “Experimental settings”. In Fig. 4c, each heatmap cell indicates classification accuracy, with the model trained on data from the x-axis year and evaluated on the y-axis year. Diagonal cells represent average accuracies from ten-fold evaluations on identical-year data. Notably, training and testing on data from 2018 to 2020 yield commendable performance, aligning with the similarity patterns observed in Fig. 4b. However, when predicting the years 2021 to 2023, we observe substantial variances in results based on different training years, further underscoring the presence of distribution shifts.
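The classifier-based check can be sketched as follows. Here each year’s raw responses are assumed to be reduced to 30 principal components independently, after which an SVM is trained on one year and evaluated on another; the kernel choice and helper names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cross_year_accuracy(data, labels, n_components=30):
    """Cross-year accuracy matrix: train an SVM on one year, test on another.

    data:   dict {year: (n_students, n_features_year) response matrix}
    labels: dict {year: (n_students,) performance labels}
    Each year is reduced to n_components principal components so that years
    with different numbers of exercises share a common dimensionality.
    """
    years = sorted(data)
    reduced = {y: PCA(n_components=n_components).fit_transform(data[y]) for y in years}
    acc = np.zeros((len(years), len(years)))
    for i, train_year in enumerate(years):
        for j, test_year in enumerate(years):
            clf = SVC(kernel="rbf")
            if train_year == test_year:
                # diagonal: average ten-fold accuracy on the same year
                acc[i, j] = cross_val_score(clf, reduced[train_year],
                                            labels[train_year], cv=10).mean()
            else:
                clf.fit(reduced[train_year], labels[train_year])
                acc[i, j] = clf.score(reduced[test_year], labels[test_year])
    return years, acc
```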

Methodology

In this section, we present the Learning Ability Self-Adaptive Algorithm (LASA) and the accompanying interpretable prediction framework for long-term student performance prediction. The main components of LASA and the architecture of the prediction framework are illustrated in Fig. 5. To aid in understanding the methodologies presented, we provide a table of notations used throughout this section. Table 3 summarizes the symbols and their descriptions for quick reference.

Table 3 Summary of Notations
Fig. 5

The interpretable prediction framework with LASA for long-term student performance prediction. The first component of LASA, Learning Ability Modeling, assumes that students’ responses to exercises are samples from distributions that are parameterized by their learning abilities. It then estimates these parameters from the heterogeneous student exercise response data, thereby creating a new homogeneous feature space to overcome the heterogeneity present in long-term data. Meanwhile, the second component of LASA, Long-term Distribution Alignment, employs multiple asymmetric transformations to align feature distributions across different years, thus mitigating the impact of distribution shifts on the model’s performance. With these steps, LASA generates well-aligned features with meaningful semantics. Then, a base classifier is used for outcome predictions and we employ Shapley Additive Explanations (SHAP) for elucidating the impact of specific features on student performance

Learning ability self-adaptive algorithm

To better utilize long-term data, we propose the Learning Ability Self-Adaptive Algorithm, which can adapt predictive models to the evolving features and distributions within long-term data. Figure 5 illustrates the two primary components of LASA: Learning Ability Modeling and Long-term Distribution Alignment. Each component is specifically designed to address the challenges of heterogeneity and distribution shifts in long-term datasets. Before detailing these components, we first define the challenges inherent in long-term datasets.

Challenges of long-term datasets

We consider the long-term data as consisting of \(T+1\) distinct timestamps, denoted as \(\mathcal {T} = \{1,..., T+1\}\). Each timestamp t corresponds to a specific data distribution \( p^{(t)}(\textbf{x}^{(t)}, \textbf{y}^{(t)}) \), where \( \textbf{x}^{(t)} \) denotes the input feature vector of dimension \( d^{(t)} \), and \( \textbf{y}^{(t)} \) represents its corresponding labels. It is important to note that for predictions at the \((T+1)^{\text {th}}\) timestamp, we only have access to the features \(\textbf{x}^{(T+1)}\); the labels \(\textbf{y}^{(T+1)}\) are not available. Due to the dynamic nature of long-term educational data, both the feature dimension d and the distribution p may change over time, leading to heterogeneity and distribution shifts. Specifically, for two timestamps \(t_1\) and \(t_2\) (with \(t_1, t_2 \in \mathcal {T}\) and \(t_1 \ne t_2\)), we might encounter \(d^{(t_1)} \ne d^{(t_2)}\) and \(p^{(t_1)} \ne p^{(t_2)}\). The former poses challenges for many traditional machine learning algorithms that require consistent feature dimensions throughout training and prediction phases. The latter might impact the efficacy of prediction models that operate under the independent and identically distributed (IID) assumption.

Learning ability modeling

To address heterogeneity, namely to enable the predictive model to adapt to changes in the number of features within long-term data, we introduce the Learning Ability Modeling. We assume that students’ responses to exercises are samples from distributions parameterized by their learning abilities. Thus, even with a varying number of exercises, we can still obtain a fixed number of distribution parameters as new features, allowing data from multiple years to be used simultaneously in training prediction models. Notably, these distribution parameters are related to students’ learning abilities, offering semantic insights for interpreting the model’s predictions. The process for estimating these parameters is detailed as follows.

For a given timestamp with N students and D exercises, each having M potential response types, a student’s responses can be denoted as \( \textbf{x}_n = (x_{n,1}, \ldots , x_{n,D}) \in \mathbb {R}^{D \times M} \). Specifically, \( x_{n,i} \in \mathbb {R}^M \) is an M-dimensional one-hot vector indicating the \( n^{th} \) student’s answer to the \( i^{th} \) exercise, with each dimension corresponding to potential response types. The entire dataset is thus represented as \( X = \{\textbf{x}_n\}_{n=1}^N \in \mathbb {R}^{N \times D \times M} \).

Students, with their diverse learning abilities, are expected to show different response patterns to the same exercise. The distribution of \( x_{n,i} \) can be represented as \( x_{n,i} \sim p(x_{n,i}; \theta _{n,i}) \), where \(\theta _{n,i}\in \mathbb {R}^M\) denotes the student’s learning ability for exercise i. We further assume that all exercises can be grouped into K categories with no overlap among them. The categories of each exercise can be defined as \( Z = (z_1, \ldots , z_D) \), with \( z_i \) indicating the category of exercise i. If exercises i and j are both in category k, it implies that the learning abilities corresponding to these exercises are considered equivalent, hence \(\theta _{n,i} = \theta _{n,j} = \theta _{n,k}\), where \(\theta _{n,k}\) denotes the learning ability parameters for category k. Consequently, the responses, \(x_{n,i}\) and \(x_{n,j}\), can be considered as samples drawn from the same distribution \(p(x_n; \theta _{n,k})\). Based on this observation, all exercise responses of one student originate from the K distributions \( \{p(x_n; \theta _{n,k})\}_{k=1}^K \). This approach transforms the original feature spaces with varying numbers of responses into a consistent set of K distributions, whose parameters indicate students’ learning abilities. We define the learning ability embeddings of one student as \( \textbf{x}_{n,\theta } = (\theta _{n,1}, \ldots , \theta _{n,K}) \in \mathbb {R}^{K \cdot M} \), which constitutes a new fixed-dimension, semantically meaningful feature space.

Specifically, since the student responses are discrete values, we can reasonably assume that if \(z_i = k\), i.e., exercise i belongs to category k, then the response \(x_{n,i}\) follows a categorical distribution, articulated by

$$\begin{aligned} p(x_{n,i}\mid z_i=k)=p(x_{n,i}; \theta _{n,k})=\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}}, \end{aligned}$$
(1)

where \( \theta _{n,k} \) is an M-dimensional parameter vector. The element \( \theta _{n,k,m} \) stands for its \( m^{th} \) entry and meets the condition \(\sum _{m=1}^M \theta _{n,k,m} = 1 \). Assuming the prior distribution of exercise category \( p(z_i=k)=\pi _k \) and the condition \( \sum _{k=1}^K \pi _k=1 \) holds, the marginal distribution of the responses of exercise i is

$$\begin{aligned} p(\textbf{f}_i;\Theta )&=\sum _{k=1}^K p(z_i=k)p(\textbf{f}_i\mid z_i=k) \nonumber \\&=\sum _{k=1}^K \pi _k\prod _{n=1}^{N}\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}}, \end{aligned}$$
(2)

where \(\textbf{f}_{i}=(x_{1,i},\ldots ,x_{N,i})\in \mathbb {R}^{N\times M}\) and \(\Theta =(\theta _{1,1},\ldots , \theta _{N,K})\in \mathbb {R}^{N\times K \times M}\). The log-likelihood function of all responses then becomes

$$\begin{aligned} l(\Theta )=\sum _{i=1}^D \log \sum _{k=1}^K \pi _k\prod _{n=1}^{N}\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}}. \end{aligned}$$
(3)

Considering \( \lambda _{n,k} \) and \( \mu \) as the Lagrange multipliers, our optimization objective is

$$\begin{aligned} \max \mathcal {L} =&\sum _{i=1}^D \log \sum _{k=1}^K \pi _k\prod _{n=1}^{N}\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}} \nonumber \\&+ \sum _{n=1}^N\sum _{k=1}^K\lambda _{n,k}\left( 1-\sum _{m=1}^M \theta _{n,k,m}\right) +\mu \left( 1-\sum _{k=1}^K \pi _k\right) . \end{aligned}$$
(4)

Through optimization, we obtain

$$\begin{aligned}&\theta _{n,k,m}=\frac{\sum _{i=1}^D\gamma _{i,k} \cdot x_{n,i,m}}{\sum _{i=1}^D\gamma _{i,k}}, \nonumber \\&\gamma _{i,k}=\frac{\pi _k\prod _{n=1}^{N}\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}}}{\sum _{k=1}^K \pi _k\prod _{n=1}^{N}\prod _{m=1}^M \theta _{n,k,m}^{x_{n,i,m}}}, \\&\pi _k=\frac{\sum _{i=1}^D\gamma _{i,k}}{D} \nonumber . \end{aligned}$$
(5)

Using the Expectation–Maximization algorithm [37], we iteratively refine the initial parameters to optimize Eq. 3, and obtain the optimized parameters \(\theta _{n,k,m}\). Given features of different dimensions \( \{X^{(t)}\}_{t=1}^{T+1} \), we can derive homogeneous and semantically meaningful learning ability embeddings \( \{\Theta ^{(t)}\}_{t=1}^{T+1} \). This resolves the inconsistency in feature dimensions across timestamps. The pseudo code is presented in Phase 1 of Algorithm 1.
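For illustration, the E- and M-steps corresponding to Eq. (5) admit a compact implementation. The sketch below uses random initialization for brevity, whereas the paper seeds the parameters with agglomerative hierarchical clustering (see Sect. “Experimental settings”); function and variable names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def learning_ability_modeling(X, K, n_iter=20, tol=1e-4, eps=1e-12, seed=0):
    """EM sketch for Learning Ability Modeling (Eqs. (1)-(5)).

    X: one-hot response tensor of shape (N, D, M) -- N students, D exercises,
       M response types (e.g., correct / incorrect / unanswered).
    Returns theta of shape (N, K, M), the learning-ability embeddings, and
    pi of shape (K,), the prior over the K exercise categories.
    """
    rng = np.random.default_rng(seed)
    N, D, M = X.shape
    theta = rng.dirichlet(np.ones(M), size=(N, K))   # each theta[n, k] sums to 1 over M
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log p(f_i | z_i = k) = sum_n sum_m x[n, i, m] * log theta[n, k, m]
        log_lik = np.einsum('ndm,nkm->dk', X, np.log(theta + eps))   # (D, K)
        log_joint = np.log(pi + eps) + log_lik                       # (D, K)
        ll = logsumexp(log_joint, axis=1).sum()                      # log-likelihood, Eq. (3)
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: closed-form updates of Eq. (5)
        weight = gamma.sum(axis=0) + eps                             # (K,)
        theta = np.einsum('dk,ndm->nkm', gamma, X) / weight[None, :, None]
        pi = weight / D
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return theta, pi
```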

Long-term distribution alignment

To adapt the predictive model to distribution shifts over time, we introduce the Long-term Distribution Alignment. The efficacy of machine learning models can be significantly impacted by shifts in mean and variance. Hence, our first step is to align these statistical measures. For any element \( \theta _{n,k,m} \) in \( \Theta \), its transformation is given by

$$\begin{aligned}&\bar{\theta }_{n,k,m}=\frac{\theta _{n,k,m}-\mu _{k,m}}{\sigma _{k,m}}\nonumber , \\&\mu _{k,m} = \frac{1}{N}\sum _{n=1}^N\theta _{n,k,m},\\&\sigma _{k,m} = \sqrt{\frac{1}{N}\sum _{n=1}^N(\theta _{n,k,m}-\mu _{k,m})^2}\nonumber . \end{aligned}$$
(6)

Through this procedure, each feature undergoes individual centering and scaling based on sample-driven statistics. We obtain the transformed sample \( \textbf{x}_{n,\bar{\theta }} = (\bar{\theta }_{n,1}, \ldots , \bar{\theta }_{n,K})\) and the transformed features \(\bar{\Theta }=(\bar{\theta }_{1,1},\ldots ,\bar{\theta }_{N,K})\in \mathbb {R}^{N\times K \times M}\).

To align distributions across two domains, a common method is moment matching [38]. Inspired by this, we minimize the difference between two timestamps by solving for the mappings \(\varvec{\psi }^{(t_1)}\) and \(\varvec{\psi }^{(t_2)}\), as illustrated below:

$$\begin{aligned}&\min _{\varvec{\psi }^{(t_1)},\varvec{\psi }^{(t_2)}} \frac{1}{L^p}\Vert \frac{1}{N^{(t_1)}}\sum _{n=1}^{N^{(t_1)}}\varvec{\psi }^{(t_1)}(\textbf{x}_{n,\bar{\theta }}^{(t_1)})^{\otimes p}\nonumber \\&\qquad -\frac{1}{N^{(t_2)}}\sum _{n=1}^{N^{(t_2)}}\varvec{\psi }^{(t_2)}(\textbf{x}_{n,\bar{\theta }}^{(t_2)})^{\otimes p}\Vert _\textbf{F}^2,\forall t_1,t_2 \in \mathcal {T}, \end{aligned}$$
(7)

where \(\Vert \cdot \Vert _\textbf{F}^2\) represents the matrix Frobenius norm, and \(N^{(t)}\) denotes the sample size for the corresponding timestamp. The symbol \(u^{\otimes p}\in \mathbb {R}^{c^p}\) represents the p-level outer product of vector \(u\in \mathbb {R}^{c}\), and L is the dimension of the corresponding distribution. To achieve a more refined alignment compared to first-order moment matching, our approach aims to match the second-order moment, setting \(p=2\) in Eq. (7).

In contrast to the method presented in [39], which focuses on aligning the distribution between a single source and target domain, our study extends this concept to align the distributions across multiple timestamps within long-term data. Furthermore, the work of [9] suggests that the upper bound of the target error for a learned classifier is influenced by the pairwise moment divergence between the target domain and each source domain. Therefore, to harness the full potential of historical data and minimize target empirical errors, our objective is to minimize the distribution difference between T historical timestamps and the target timestamp \(T+1\). We modify Eq. (7) to align multiple timestamps, which can be expressed as

$$\begin{aligned}&\min _{\{\varvec{\psi }^{(t)}\}_{t=1}^{T+1}} \sum _{t=1}^T\Vert \frac{1}{N^{(t)}}\varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})^\top \varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})\nonumber \\&\quad -\frac{1}{N^{(T+1)}}\varvec{\psi }^{(T+1)}(\bar{\Theta }^{(T+1)})^\top \varvec{\psi }^{(T+1)}(\bar{\Theta }^{(T+1)})\Vert _\textbf{F}^2, \end{aligned}$$
(8)

where \(\varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})^\top \varvec{\psi }^{(t)}(\bar{\Theta }^{(t)})=\sum _{n=1}^{N^{(t)}}\varvec{\psi }^{(t)}(\textbf{x}_{n,\bar{\theta }}^{(t)})^{\otimes p}\), when \(p=2\).

Previous research typically utilized symmetric transformations to align within a subspace, which can result in the degradation of feature structures. Therefore, to preserve the semantic characteristics of features, we employ a series of asymmetric transformations. Specifically, we map the distribution of historical timestamps to the target timestamp for alignment. We assume that \(\varvec{\psi }^{(t)}:\bar{\Theta }^{(t)} \rightarrow \bar{\Theta }^{(t)}A^{(t)}\), where \(A^{(t)}\in \mathbb {R}^{L\times L}\) acts as a linear transformation. Here, \(A^{(T+1)}=\mathcal {I}\) functions as an identity mapping, which maps the target timestamp to itself, while \(\{A^{(t)}\}_{t=1}^T\) maps the respective timestamps to the target timestamp. Our objective can be redefined as

$$\begin{aligned} \min _{\{A^{(t)}\}_{t=1}^{T}} \sum _{t=1}^T\Vert {A^{(t)}}^\top C^{(t)}A^{(t)}-{A^{(T+1)}}^\top C^{(T+1)}A^{(T+1)}\Vert _\textbf{F}^2, \end{aligned}$$
(9)

where \(C^{(t)}\) is the symmetric matrix defined as \(\frac{1}{N^{(t)}}{\bar{\Theta }^{(t)^\top }}\bar{\Theta }^{(t)}\). Recognizing the symmetry of \(C^{(t)}\), we apply singular value decomposition (SVD) to obtain \(C^{(t)}=U^{(t)}\Sigma ^{(t)}{U^{(t)^\top }}\). The most significant r singular values are denoted by \(\Sigma _{[1:r]}^{(t)}\). Their corresponding left singular vectors are represented as \(U_{[1:r]}^{(t)}\). Additionally, \({\Sigma ^{(t)+}}\) denotes the Moore-Penrose pseudoinverse of \(\Sigma ^{(t)}\). Consequently, the optimal transformation is given by

$$\begin{aligned} \hat{A}^{(t)}=(U^{(t)}{{\Sigma ^{(t)+}}}^{\frac{1}{2}}{U^{(t)^\top }})(U_{[1:r_t]}^{(T+1)}{\Sigma _{[1:r_t]}^{(T+1)}}^{\frac{1}{2}}{U_{[1:r_t]}^{(T+1)^\top }}), \nonumber \\ \end{aligned}$$
(10)

where \(r_t=\min (r_{C^{(t)}},r_{C^{(T+1)}})\), and \(r_{C^{(t)}}\), \(r_{C^{(T+1)}}\) denote the ranks of \(C^{(t)}\) and \(C^{(T+1)}\), respectively. Utilizing the optimal transformation \(\hat{A}^{(t)}\), we can obtain the well-aligned features \(\hat{\Theta }^{(t)} = \bar{\Theta }^{(t)}\hat{A}^{(t)}\), which can be used to train a classifier.
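A minimal sketch of this alignment step is given below: each timestamp’s flattened learning-ability embeddings (of dimension \(L = K\cdot M\)) are standardized as in Eq. (6), and the asymmetric map of Eq. (10) is built from a whitening factor on the source covariance followed by a re-coloring factor from the target covariance. Variable names and the rank tolerance are illustrative.

```python
import numpy as np

def standardize(theta_flat):
    """Per-feature centering and scaling of Eq. (6); theta_flat has shape (N, L)."""
    mu = theta_flat.mean(axis=0)
    sigma = theta_flat.std(axis=0) + 1e-12
    return (theta_flat - mu) / sigma

def alignment_map(source, target, tol=1e-10):
    """Asymmetric transformation A^(t) of Eq. (10) for one historical timestamp.

    source, target: standardized feature matrices of shape (N_s, L) and (N_t, L).
    Returns A of shape (L, L); the aligned source features are source @ A.
    """
    def cov_eig(Z):
        C = Z.T @ Z / Z.shape[0]        # second-order moment matrix C^(t)
        U, S, _ = np.linalg.svd(C)      # C is symmetric PSD, so SVD = eigendecomposition
        return U, S

    Us, Ss = cov_eig(source)
    Ut, St = cov_eig(target)
    r = min(int((Ss > tol).sum()), int((St > tol).sum()))   # r_t = min of the two ranks
    # whitening factor: U_s (Sigma_s^+)^{1/2} U_s^T
    s_inv_half = np.zeros_like(Ss)
    s_inv_half[Ss > tol] = 1.0 / np.sqrt(Ss[Ss > tol])
    whiten = Us @ np.diag(s_inv_half) @ Us.T
    # re-coloring factor: U_t[:, :r] Sigma_t[:r]^{1/2} U_t[:, :r]^T
    recolor = Ut[:, :r] @ np.diag(np.sqrt(St[:r])) @ Ut[:, :r].T
    return whiten @ recolor

# Usage sketch, for each historical timestamp t:
#   std_t = standardize(theta_t.reshape(len(theta_t), -1))
#   aligned_t = std_t @ alignment_map(std_t, std_target)
```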

Therefore, our proposed LASA initially obtains a new homogeneous feature space, \(\Theta \), through LAM, addressing the issue of heterogeneity in long-term data. Subsequently, through LTDA, it aligns the distributions across different timestamps, tackling the problem of distribution shift. By integrating these two components, we can adaptively adjust features to accommodate the dynamic characteristics of long-term data. In terms of interpretability, the features derived from LAM are semantically meaningful representations in \(\Theta \), representing students’ learning abilities related to their responses to exercises. This provides a foundation for explaining students’ performances. To preserve the semantic structure within the features, we employ asymmetric transformations in LTDA. Since all timestamps are transformed to align with the target timestamp, these transformed features share the same semantic properties as those of the target timestamp. This ensures that the aligned features, \(\hat{\Theta }\), continue to reside within the same semantic space, thus maintaining their semantic integrity.

Base classifier

To adapt our prediction model to varying numbers of features and distribution shifts, we introduced LASA and obtained robust features \(\hat{\Theta }\), which can be utilized for training classifiers. In the framework proposed in this paper, various machine learning algorithms can be selected as classifiers. Given that SVM is commonly used as a base classifier in diverse Educational Data Mining (EDM) scenarios [40] and domain adaptation methods [41], we opt for SVM as our classifier in this study. Assuming the target of our prediction is the student performance at timestamp \(T+1\), we use the data from the previous T timestamps as the training set, denoted as \(\{(\hat{\Theta }^{(t)}, Y^{(t)})\}_{t=1}^{T}\), with the test set being \((\hat{\Theta }^{(T+1)}, Y^{(T+1)})\). Here, Y represents the labels of the dataset. In real-world predictions, \(Y^{(T+1)}\) is inaccessible. The trained classifier is denoted as f. The principles and training procedure of SVM are described in the literature [42] and are not elaborated here.
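A sketch of this step is shown below: the aligned features of all historical timestamps are stacked into a single training set and an RBF-kernel SVM is tuned with 5-fold cross-validation. The hyperparameter grid is illustrative, not the one used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_base_classifier(aligned_train, train_labels):
    """Train the base SVM on aligned features from all T historical timestamps.

    aligned_train: list of (N_t, L) aligned feature matrices, one per timestamp.
    train_labels:  list of (N_t,) label vectors for the same timestamps.
    """
    X_train = np.vstack(aligned_train)
    y_train = np.concatenate(train_labels)
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}   # illustrative grid
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)             # 5-fold CV as in the paper
    search.fit(X_train, y_train)
    return search.best_estimator_
```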

SHAP-based model interpretation

After obtaining predictions from the trained classifier f for the test set, it is crucial to interpret these results to improve teaching methods and course content. Although traditional methods of assessing feature importance identify significant features, they fail to elucidate the specific impact of a feature on prediction outcomes [16]. The SHAP (Shapley Additive exPlanations) method [43] takes a cue from the Shapley value in cooperative game theory. This method creates an easy-to-understand model based on the Shapley value, allowing us to see both the positive and negative effects of each feature on individual predictions. By using SHAP, we can better understand how the learning ability parameters \(\hat{\Theta }\) contribute to our predictions.

In our study, we assess the impact on predictions when excluding the \(i^{th}\) parameter, \(\textbf{f}_{\hat{\theta }_i}\), from our model. Given a set of parameters \(F_{\hat{\Theta }}\) and a sample feature vector \(x_{\hat{\Theta }}\), we evaluate the importance of \(\textbf{f}_{\hat{\theta }_i}\) by measuring its marginal contribution across all possible subsets of \(F_{\hat{\Theta }}\) that do not include \(\textbf{f}_{\hat{\theta }_i}\). This is achieved by calculating the difference in prediction accuracy with and without \(\textbf{f}_{\hat{\theta }_i}\) in each subset \(S \subseteq F_{\hat{\Theta }} {\setminus } \{\textbf{f}_{\hat{\theta }_i}\}\). The contribution of \(\textbf{f}_{\hat{\theta }_i}\), denoted as \(\phi _{i}\), is defined through the following equation, which represents the weighted average of these differences across all subsets S:

$$\begin{aligned}&\phi _i=\sum _{S\subseteq F_{\hat{\Theta }}\setminus \{\textbf{f}_{\hat{\theta }_i}\}}\frac{\mid S\mid !(\mid F_{\hat{\Theta }}\mid -\mid S\mid -1)!}{\mid F_{\hat{\Theta }}\mid !}\nonumber \\&\qquad [f_{S\cup \{\textbf{f}_{\hat{\theta }_i}\}}(x_{S\cup \{\textbf{f}_{\hat{\theta }_i}\}})-f_{S}(x_{S})]. \end{aligned}$$
(11)

Here, \(f_S(\cdot )\) denotes the classifier’s output using only the feature subset S, and \(x_S\) represents the sample values associated with the parameter subset S.

Consequently, the classifier’s output for sample \(x_{\hat{\Theta }}\) can be expressed as

$$\begin{aligned} f(x_{\hat{\Theta }})=g(x^{\prime })=\phi _0+\sum _{i=1}^{K\cdot M} \phi _i\cdot x_i'. \end{aligned}$$
(12)

Here, g is the explanatory model, and \(x^{\prime }\in \{0,1\}^{K\cdot M}\) is the feature vector indicating the presence or absence of specific features. The size of our parameter matrix is given by \(K\cdot M\). The term \(\phi _i \in \mathbb {R}\) measures the contribution of the \(i^{th}\) feature, often called the Shapley value. Importantly, \(\phi _0\) represents the average prediction when no specific feature information is provided, roughly matching the mean of predictions across the training dataset. The Shapley values offer insights into how each feature influences the prediction, providing a foundation for educational improvements. Section “Interpretability and case study” will present detailed examples, along with relevant educational insights and suggestions.
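In practice, enumerating all feature subsets in Eq. (11) is intractable, so the Shapley values are approximated. The sketch below uses the shap library’s KernelExplainer and assumes the SVM is trained with probability estimates enabled; the background-set size and helper names are illustrative.

```python
import shap  # Shapley Additive exPlanations library

def explain_predictions(clf, X_background, X_test):
    """Approximate the Shapley values of Eq. (11) with SHAP's KernelExplainer.

    clf:          trained classifier exposing predict_proba
                  (e.g., an SVC fitted with probability=True).
    X_background: sample of aligned training features used as the background set.
    X_test:       aligned features of target-timestamp students to explain.
    """
    background = shap.kmeans(X_background, 20)   # summarize background for tractability
    explainer = shap.KernelExplainer(clf.predict_proba, background)
    shap_values = explainer.shap_values(X_test)
    # Depending on the shap version, the result is one array per class or a single
    # 3-D array; entry [i, j] is feature j's contribution for student i.
    return shap_values
```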

Table 4 Summary of Hyperparameters in the LASA

Thus far, we have introduced the interpretable prediction framework for long-term student performance prediction. The hyperparameters of LASA are described in Table 4. The pseudo code of the framework is presented in Algorithm 1.

Algorithm 1

Pseudo code of the interpretable prediction framework with LASA for long-term student performance prediction

Experiments

In this section, we conduct comprehensive experiments to evaluate the proposed LASA method for predicting long-term student performance in real-world settings.

Comparison methods

To demonstrate the superiority of LASA, we compare it with the state-of-the-art student performance prediction models (non-transfer models) and novel DA methods across SsHoUDA, MsHoUDA, SsHeUDA, and MsHeUDA categories as shown in Table 5. ORACLE refers to the scenario where training and testing data are proportionally split from the same year, indicative of the classifier’s performance under IID conditions. Non-transfer models, which ignore distribution shifts in long-term data, provide a baseline for performance comparison. SsHoUDA models merge PCA-reduced data from all historical timestamps as a single source domain. MsHoUDA models consider PCA-processed data from each timestamp as independent source domains, while SsHeUDA models average results from each past timestamp treated as a distinct source domain. Lastly, MsHeUDA models treat the raw data from each timestamp as separate source domains without PCA reduction.

Table 5 The comparison methods

Experimental settings

Data Preprocessing: In our long-term student performance dataset, the question type is multiple-choice. First, we compare each student’s selected options with the correct answers to categorize the responses into three types: correct, incorrect, and unanswered. To represent these outcomes, we employ one-hot encoding instead of numerical coding to avoid the bias introduced by numerical order. We then perform data cleaning by removing students who never participated in answering questions, as these students did not substantially contribute to our data collection, rendering the prediction of their performance meaningless. Additionally, we remove certain anomalous questions that all students failed to answer or answered incorrectly due to external factors. Such instances often involved technical issues during the class that prevented students from answering questions or errors in the answer key, resulting in correct student responses being marked as incorrect. For the task of predicting student performance categories, scores are typically divided based on certain thresholds. Following the recommendation by [46], students’ final exam scores can be categorized into three groups: good, pass, and at-risk, using the following formula:

$$\begin{aligned} \text {Label} = {\left\{ \begin{array}{ll} \text {good} &{} \text {if } \text {rank}(S) \ge Thr_1\times N \\ \text {pass} &{} \text {if } Thr_2\times N \le \text {rank}(S)< Thr_1\times N\\ \text {at-risk} &{} \text {if } \text {rank}(S) < Thr_2\times N \end{array}\right. }. \end{aligned}$$
(13)

Here, \(Thr_1\) and \(Thr_2\) are thresholds that segment student performances. In this study, we focus on identifying at-risk students and we set \(Thr_2\) to 0.3. The term \(\text {rank}(S)\) denotes the rank of a student’s score, counted from the lowest score upward (rank 1 corresponds to the lowest score), so that the at-risk group comprises the bottom \(Thr_2\) fraction of the class. N represents the total number of students.
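A sketch of this labeling rule is given below. It follows the rank convention described above (rank 1 for the lowest score), and the \(Thr_1\) value shown is purely illustrative, since this study only fixes \(Thr_2 = 0.3\).

```python
import numpy as np

def label_students(scores, thr1=0.7, thr2=0.3):
    """Assign good / pass / at-risk labels from final-exam scores via Eq. (13).

    scores: (N,) array of final-exam scores.
    Ranks count from the lowest score upward (rank 1 = lowest score), so the
    at-risk group is the bottom thr2 fraction of the class. The paper fixes
    thr2 = 0.3; the thr1 value here is illustrative.
    """
    n = len(scores)
    order = np.argsort(scores)            # ascending: lowest score first
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(1, n + 1)
    return np.where(rank >= thr1 * n, "good",
           np.where(rank >= thr2 * n, "pass", "at-risk"))
```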

Training Process: For the segmentation of our long-term dataset, we first select a target year for prediction. The response data from this year serve as our test set, while all data from preceding years, including both responses and grade labels, are utilized as the training set. This ensures alignment with realistic scenarios where, during the prediction of student performance for the target year, only past features and labels, along with current target features, are accessible. Regarding the selection of LASA hyperparameters, the optimal number of question types, K, is identified through an analysis of the distance gap observed in the hierarchical dendrogram, as illustrated in Fig. 7(a). Consequently, K is set to 3, allowing us to use the outcomes of agglomerative hierarchical clustering to establish our initial parameters \(\Theta _{(0)}\) and \(\pi _{k,(0)}\). The maximum number of iterations, I, is set to 20 and convergence threshold, \(\epsilon \), is set to 1e-4. We choose the parameters of the SVM classifier by 5-fold cross-validation. For the comparison methods listed in Table 5, we report the optimal results achieved through parameter tuning within the recommended range specified in their original publications, facilitated by cross-validation. For methods requiring PCA, the dimensionality reduction is uniformly set to 30 dimensions. To validate the robustness of our findings, we replicate the experiments 20 times to obtain the average accuracy of all methods on the test set.

In this paper, we focus on predicting at-risk students. We use Accuracy as our evaluation metric to assess model performance, defined by

$$\begin{aligned} \text {Accuracy} = \frac{\mid \{\textbf{x} : \textbf{x} \in \mathcal {T}^{te} \wedge f_{1\sim T}(\textbf{x}) = \textbf{y}\}\mid }{\mid \{\textbf{x} : \textbf{x} \in \mathcal {T}^{te}\}\mid }. \end{aligned}$$
(14)

Here, \(f_{1\sim T}(\textbf{x})\) represents the predicted category label by the algorithm for instance \(\textbf{x}\), \(\textbf{y}\) is the ground truth label for the instance \(\textbf{x}\), and \(\mathcal {T}^{te}\) comprises all the instances in the testing split.

The experiments reported in this study were carried out on a high-performance workstation configured with an Intel Xeon W-3175X CPU, 192GB DDR4 RAM, and an NVIDIA RTX 3090 graphics card, running on a Windows 10 operating system. The software environment was based on Python 3.10, with machine learning models implemented using Scikit-learn 1.1.0 and deep learning models developed and trained via PyTorch 1.13.0.

Performance evaluation on long-term student performance dataset

Table 6 The at-risk student prediction accuracy of the comparison methods and our proposed LASA for the years 2017–2023

We first evaluate LASA and the comparison methods on the Long-term Student Performance Dataset, focusing on predicting at-risk students from 2017 to 2023. Each year from 2017 onward serves in turn as the target prediction year, with the preceding years used as training data. The results, presented in Table 6, highlight LASA’s consistent superiority in accuracy across most tasks. We also apply the Friedman test to establish a ranking of the different methods included in our experiments, as shown in the last column of Table 6. It is evident that LASA ranks first, with a significant margin compared to other methods. Subsequently, we conduct a post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction to adjust p-values for each pairwise comparison of methods, as detailed in Table 8. The outcomes of the post-hoc analysis conclusively demonstrate that our proposed LASA significantly outperforms all other methods in the study, except for PCA-ORACLE, with p-values consistently below 0.05. Although the p-value in the comparison with PCA-ORACLE is above 0.05, our method still ranks higher in terms of average performance and Friedman ranking.

LASA notably outperformed the idealized PCA-ORACLE model by 5.7% and surpassed other leading student performance prediction models by at least 6.8%. When compared to various DA methods, LASA showed an improvement of at least 6.1%. This enhanced performance is attributed to LASA’s unique ability to capture the nuances of student learning through learning ability parameters and effectively leverage historical data for prediction.

Within the Non-transfer models category, both SVM-BSS and SVM-MS were less effective than PCA-ORACLE, highlighting the negative impact of distribution shifts in long-term data. Notably, SVM-MS, which aggregates data from multiple timestamps, does not address distributional differences, leading to potential negative transfer. State-of-the-art EDM methods such as CatBoost-SHAP, DNNMS, and ProbSAP, despite their good performance in short-term data, are less effective in long-term scenarios due to their assumptions of identical distributions, which do not align with the situation of long-term data.

UDA methods, designed to alleviate distribution shift issues, showed varying results. Under SsHoUDA, the MMD method exhibited the poorest performance, likely due to its limitation of using data from a single source domain. In contrast, MEDA’s approach of aligning both marginal and conditional distributions resulted in better outcomes. However, deep learning-based UDA methods such as DANN and D-CORAL underperformed, probably due to negative transfer from merging timestamp data without acknowledging intrinsic differences. MsHoUDA methods consider distributional discrepancies across multiple source domains but face challenges with heterogeneous feature spaces. TWMDA, aiming for finer granularity at the sample level in alignment, might have introduced noise, thereby compromising its performance. Although M3SDA considers aligning each pair of domains, it still relies on PCA to obtain features of the same dimensionality, which may amplify discrepancies between domains and thus result in poor performance. SsHeUDA methods generally showed limited success. These methods rely on a single source domain, and techniques like KCCA and GLG, respectively, require parallel instances and the same number of samples between domains, limiting the utilization of information and indicating their inability to learn effectively from raw data and align distributions for classification purposes.

In the MsHeUDA category, SFERNN represents a noteworthy attempt to address long-term heterogeneous educational data, utilizing data from multiple domains without prior PCA processing. However, its performance is capped by the SFER architecture and lacks interpretability. In contrast, LASA stands out by directly extracting semantically meaningful representations from raw features and aligning distributions effectively in long-term data. It not only achieved the highest accuracy but also outperformed SFERNN by 6.4%, highlighting its efficiency in handling heterogeneous source data while maintaining model interpretability. Overall, LASA’s exceptional performance, enhanced by robust data alignment and interpretability, sets a new benchmark in the field of long-term student performance prediction.

Additionally, we analyze the effectiveness of LASA’s components: LAM and LTDA, through ablation experiments in Sect. “Ablation study”. The results demonstrate that LAM can effectively extract the learning ability features of students, while LTDA can align feature distributions, further enhancing performance, thereby proving the efficacy of LASA.

Performance evaluation with varied timestamp utilization in long-term student data prediction

To further validate the effectiveness of our model in utilizing long-term student data, we evaluated LASA’s performance across varying numbers of timestamps. Specifically, when predicting student performance for a target timestamp given T available timestamps (all data prior to the target timestamp), LASA’s predictive capability is assessed using between 1 and T timestamps. Notably, for each t timestamps used, there are \(C_T^t\) potential scenarios. The average accuracy over these scenarios serves as our metric. We also evaluate the state-of-the-art student performance prediction method, ProbSAP, under the same conditions. The results are depicted in Fig. 6.
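This evaluation protocol can be sketched as follows; predict_fn is an assumed wrapper around the full prediction pipeline (LASA or a baseline such as ProbSAP) that returns the accuracy for a given choice of training years.

```python
import numpy as np
from itertools import combinations

def accuracy_vs_history(predict_fn, history_years, target_year):
    """Average accuracy when using t of the T available historical years (Fig. 6).

    predict_fn(train_years, target_year) -> accuracy is an assumed wrapper around
    the prediction pipeline under evaluation.
    """
    results = {}
    for t in range(1, len(history_years) + 1):
        accs = [predict_fn(list(subset), target_year)
                for subset in combinations(history_years, t)]   # C(T, t) scenarios
        results[t] = float(np.mean(accs))
    return results
```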

Fig. 6

a Performance of ProbSAP with varying amounts of historical data. b Performance of our proposed LASA using different amounts of historical data. The x-axis represents the number of years of historical data used, and the y-axis shows the average accuracy obtained when using a specific amount of historical data. Different curves represent different target prediction years. The results indicate that using more historical data, i.e., long-term data, can increase accuracy. Furthermore, our LASA consistently outperforms ProbSAP, demonstrating its more effective utilization of long-term data

As depicted in Fig. 6b, LASA substantially benefits from leveraging multiple timestamps in its predictions. There is a consistent improvement in accuracy as the number of timestamps increases, which underscores the importance of utilizing long-term data. For certain target timestamps, such as 2020 and 2022, using data from just a single timestamp yields commendable predictive performance. This could be attributed to the close distributions between these target timestamps and past timestamps, indicating minor distribution differences. Nonetheless, even under these conditions, we notice a slight performance gain as more timestamps are incorporated. Specifically, the performance for 2020 and 2022 improved by 2.4% and 3.9%, respectively, which is in line with our expectations. On the other hand, for scenarios like 2021, where a single timestamp does not produce satisfactory results, possibly due to significant distribution differences, the inclusion of additional timestamps boosts the predictive accuracy by up to 7.9%. This improvement could be attributed to LASA’s ability to discern nuanced patterns and internal variations across multiple timestamps with diverse distributions, which might not be apparent when relying solely on one timestamp.

Fig. 7
figure 7

a The hierarchical clustering dendrogram for the exercise data shows the most pronounced separation between clusters at height 2, which leads us to divide the questions into three categories, thereby setting K to 3. b A comparison of average accuracy between LASA initialized randomly and LASA initialized via hierarchical clustering across different values of K shows that initialization with hierarchical clustering yields better performance. Additionally, both initialization methods outperform ProbSAP across a broad range of parameters, demonstrating the robustness of our model. Furthermore, the optimal K value determined from the dendrogram in a yields near-optimal prediction performance, validating the effectiveness of our parameter selection method

Comparing with Fig. 6a, we find that ProbSAP underperforms in 2021, a year characterized by considerable distribution differences, and its performance does not markedly improve even as timestamps are incrementally added. This underperformance can be attributed to the fact that prediction models like ProbSAP rely primarily on short-term data and IID assumptions, so they may falter when confronted with long-term data exhibiting stark distribution differences. Furthermore, for predictions in 2023, ProbSAP’s performance deteriorates as more timestamps are employed, while LASA’s performance remains stable. This suggests that ProbSAP may experience negative transfer due to the introduction of data with considerable distribution differences, whereas LASA can effectively align long-term distributions, mitigating distribution disparities and preventing negative transfer. Moreover, even when LASA relies on data from just one timestamp, it outperforms ProbSAP, further emphasizing the effectiveness of our learning capability modeling. This is because LASA derives more fundamental representations of students’ learning abilities from exercise data.

Sensitivity analysis

In this section, we evaluate the sensitivity of LASA to its parameters. As mentioned earlier, the value of K can be determined using a hierarchical clustering dendrogram obtained by clustering all the exercises. As shown in Fig. 7a, the clearest separation between clusters occurs at height 2, so we draw a cut line at this point. All questions are consequently divided into three categories, and we therefore set the number of question categories K to 3. Additionally, we compare the performance for different values of K, as illustrated in Fig. 7b. We note that around the optimal K value identified by the dendrogram, LASA achieves satisfactory prediction accuracy, indicating the efficacy of our parameter selection method.
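A minimal sketch of this dendrogram-based choice of K, using SciPy, is shown below. The matrix `exercise_features` is a hypothetical placeholder for per-question statistics (e.g., rates of blank, incorrect, and correct responses), and the cut height of 2 mirrors the one read off Fig. 7a; it is illustrative rather than our exact preprocessing code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical per-question feature matrix
# (rows = questions, columns = e.g. blank / incorrect / correct rates).
exercise_features = np.random.rand(30, 3)

# Agglomerative (Ward) clustering and the corresponding dendrogram.
Z = linkage(exercise_features, method="ward")
dendrogram(Z)  # inspect visually to pick a cut height, as in Fig. 7a

# Cutting at height 2 yields the question categories; K is their count.
question_labels = fcluster(Z, t=2.0, criterion="distance")
K = len(np.unique(question_labels))
```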

We also investigated different initialization methods for LASA’s initial iteration parameter values \(\pi _{k,(0)}^{(t)}\) and \(\Theta _{(0)}^{(t)}\), specifically random initialization and initialization via hierarchical clustering. We find that the initial parameters obtained from hierarchical clustering outperform those from random initialization, and both methods yield better performance across a wide range of parameters than the state-of-the-art model, ProbSAP. This observation suggests that our approach offers a flexible parameter selection range. However, as K grows large, the model’s performance diminishes. This decline may be attributed to the fact that, as K increases, the learning ability parameters derived by LASA become less category-specific and more tailored to individual questions, especially when K equals the total number of questions. This shift gradually weakens the semantics, diminishing the model’s robustness, making it more susceptible to noise, and leading to a decrease in performance.
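Returning to the clustering-based initialization above, one plausible scheme is sketched below. It assumes a particular reading of the parameters, namely that \(\pi _{k,(0)}^{(t)}\) are category proportions and \(\Theta _{(0)}^{(t)}\) are per-category response probabilities derived from the dendrogram cut; this is our illustrative interpretation, not the authors’ exact implementation.

```python
import numpy as np

def init_from_clusters(responses, question_labels):
    """Hypothetical clustering-based initialization of pi_(0) and Theta_(0).

    responses       -- (students, questions) matrix for one timestamp,
                       coded 0 = blank, 1 = incorrect, 2 = correct
    question_labels -- 1-indexed category labels from the dendrogram cut
    """
    K = int(question_labels.max())
    # pi_(0): proportion of questions falling into each category.
    pi0 = np.array([(question_labels == k).mean() for k in range(1, K + 1)])
    # Theta_(0): empirical P(blank), P(incorrect), P(correct) per category.
    theta0 = np.zeros((K, 3))
    for k in range(1, K + 1):
        cols = responses[:, question_labels == k]
        theta0[k - 1] = [(cols == v).mean() for v in (0, 1, 2)]
    return pi0, theta0
```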

Regarding the number of iterations for the algorithm, we discuss this in Sect. “Convergence of LASA”, where we note that convergence is generally achieved within 15 iterations. Consequently, we set the maximum number of iterations, I, to 20. Additionally, the sensitivity of classifier hyperparameters is addressed in Sect. “Hyperparameter optimization of the classifier”. The results indicate that using features extracted by the LASA algorithm, the SVM classifier performs consistently well across a wide range of hyperparameters.

Furthermore, our proposed LASA can be adapted to various classifiers. To explore its sensitivity to the choice of base classifier, we combined LASA with a range of machine learning classifiers. As depicted in Fig. 8, LASA consistently delivers strong performance across most classifiers, highlighting its adaptability. The best predictive performance is observed when LASA is combined with classifiers such as SVM and LR. Although there is a slight decrease in performance when combined with classifiers like KNN, LASA still outperforms state-of-the-art models such as SFERNN and ProbSAP. These results demonstrate that LASA extracts robust features, making it relatively insensitive to the type of classifier used.
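In practice, swapping the base classifier amounts to feeding the adapted learning-ability features into a different scikit-learn estimator. The sketch below is illustrative: `Theta_train`/`Theta_test` stand for the features produced by LASA for the source and target years and are assumed to be given, and the hyperparameters are defaults rather than the tuned values used in our experiments.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def compare_base_classifiers(Theta_train, y_train, Theta_test, y_test):
    """Fit several base classifiers on LASA-extracted learning-ability
    features and report accuracy on the target year."""
    classifiers = {
        "SVM": SVC(kernel="rbf"),
        "LR": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(Theta_train, y_train)                        # train on adapted source features
        scores[name] = accuracy_score(y_test, clf.predict(Theta_test))
    return scores
```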

Fig. 8
figure 8

Comparative analysis of LASA’s predictive accuracy when integrated with various base classifiers. We observed that, even when combined with different classifiers, the performance of LASA, including the least performing LASA-KNN variant, still surpasses that of other comparative methods, showcasing LASA’s robustness and effectiveness

Interpretability and case study

In student performance prediction models, interpretable predictions are essential for curriculum improvements and student interventions. Our proposed Learning Ability Modeling component extracts the student’s learning ability embeddings \( \Theta \) from raw exercise data, providing a semantically meaningful representation. We compute the average student learning ability embeddings for each year, as shown in Fig. 9. \(\text {H}\), \(\text {M}\), and \(\text {E}\) represent three distinct types of learning abilities, while \(\text {H}_{\{1,2,3\}}\), \(\text {M}_{\{1,2,3\}}\), and \(\text {E}_{\{1,2,3\}}\) correspond to the specific parameters of each type. Denote \(\{\cdot \}_1\) as \(\{\text {H}_1,\text {M}_1,\text {E}_1\}\); \(\{\cdot \}_2\) and \(\{\cdot \}_3\) follow the same notation. Based on the LAM process, the \(\{\cdot \}_1\) learning ability parameter indicates the probability of not answering questions of the corresponding category, \(\{\cdot \}_2\) represents the likelihood of answering incorrectly, and \(\{\cdot \}_3\) signifies the probability of a correct response. We observe a trend of \(\text {H}_2\), \(\text {M}_2\), \(\text {E}_2\) decreasing and \(\text {H}_3\), \(\text {M}_3\), \(\text {E}_3\) increasing over the years. In terms of semantics, \(\text {H}\) pertains to a student’s ability to tackle challenging problems, exemplifying their capability to understand complex knowledge and apply it comprehensively. Conversely, \(\text {E}\) encapsulates a student’s mastery of straightforward questions, indicating their grasp of basic concepts and attentiveness during lessons. \(\text {M}\), positioned between the two, denotes a student’s proficiency in addressing questions of moderate difficulty, emphasizing their skill in integrating basic concepts. It is noteworthy that there is a pronounced collinearity between \(\{\cdot \}_2\) and \(\{\cdot \}_3\), and their semantics overlap. To enhance interpretability, we therefore omit the \(\{\cdot \}_2\) parameters in subsequent analyses.
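The collinearity noted above follows naturally if the three parameters of each category are read as a probability vector; this is our reading of the description of \(\{\cdot \}_1\), \(\{\cdot \}_2\), and \(\{\cdot \}_3\), not an explicit constraint stated elsewhere:

\[ \text {H}_1 + \text {H}_2 + \text {H}_3 = 1, \qquad \text {M}_1 + \text {M}_2 + \text {M}_3 = 1, \qquad \text {E}_1 + \text {E}_2 + \text {E}_3 = 1, \]

so that, for example, \(\text {E}_2 = 1 - \text {E}_1 - \text {E}_3\). When the non-response probability \(\text {E}_1\) varies little across students, \(\text {E}_2\) is approximately an affine function of \(\text {E}_3\) (and analogously for \(\text {H}\) and \(\text {M}\)), which explains the overlapping semantics and justifies omitting the \(\{\cdot \}_2\) parameters without loss of interpretability.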

Fig. 9
figure 9

This figure illustrates the average learning ability parameter values across different years, segmented into three categories representing distinct types of learning abilities: H, M, and E. Each category is further divided into three specific abilities: \(\{\cdot \}_1\) indicating the probability of not answering, \(\{\cdot \}_2\) for the likelihood of incorrect answers, and \(\{\cdot \}_3\) for the probability of correct answers. We observe that there is a trend of \(\text {H}_2\), \(\text {M}_2\), \(\text {E}_2\) decreasing and \(\text {H}_3\), \(\text {M}_3\), \(\text {E}_3\) increasing. Thus, \(\text {H}\) pertains to a student’s ability to tackle challenging problems, exemplifying their capability to understand complex knowledge and its comprehensive application. Conversely, \(\text {E}\) encapsulates a student’s mastery over straightforward questions, indicating their grasp of basic concepts and attentiveness during lessons. \(\text {M}\), positioned between the two, denotes a student’s proficiency in addressing questions of moderate difficulty, emphasizing their skill in integrating basic concepts

Upon obtaining interpretable and semantically meaningful features, we proceed to identify the core factors influencing student academic performance, following SHAP-based Model Interpretation in LASA. This analysis aids teachers in identifying areas where students require more attention and understanding the reasons behind poor grades, thereby enabling targeted guidance. Taking 2020 as an example, the feature importance ranking and feature contribution are illustrated in Fig. 10.
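A hedged sketch of how such attributions can be produced with the shap library is given below. The classifier, background sample, and feature names are assumptions for illustration, and the conversion of SHAP values to the probability scale shown in Fig. 10 is an additional post-processing step not included here.

```python
import numpy as np
import shap

def shap_importance(clf, Theta_background, Theta_target, feature_names):
    """Model-agnostic SHAP attribution over LASA's learning-ability features.

    clf              -- fitted classifier exposing predict_proba
                        (e.g. an SVM trained with probability=True)
    Theta_background -- reference sample of adapted features
    Theta_target     -- features of the students to be explained
    """
    explainer = shap.KernelExplainer(clf.predict_proba, Theta_background)
    sv = explainer.shap_values(Theta_target)
    # Depending on the shap version, multi-class output is a list per class
    # or a single array with a trailing class dimension; take the "at-risk" class.
    sv_at_risk = sv[1] if isinstance(sv, list) else sv[..., 1]
    # Global importance = mean absolute contribution per feature (cf. Fig. 10a).
    importance = np.abs(sv_at_risk).mean(axis=0)
    return dict(zip(feature_names, importance))
```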

Fig. 10
figure 10

a Feature importance, representing the average absolute contribution of the features. b Contribution of features for each sample prediction, ranked by their importance. c Contribution of feature values in a case where the actual label is at-risk and the predicted label matches. d Feature contributions in a case where the student is correctly predicted as not being at-risk. For c and d, the letters on the left represent the features and their corresponding values. Blue indicates that the feature contributes negatively to the prediction, while red indicates a positive contribution. We have converted the SHAP values to probability magnitudes, where f(x) represents the final probability of being predicted as an at-risk student

From Fig. 10a, b, we find that for at-risk students, learning abilities \(\text {E}\) and \(\text {M}\) have a more pronounced influence on the prediction than learning ability \(\text {H}\). Specifically, as the value of \(\text {E}_3\) decreases, the SHAP value grows larger, pushing the model’s prediction towards the at-risk outcome. A similar trend is observed for \(\text {M}_3\), albeit with less influence than \(\text {E}_3\). This implies that a student’s grasp of basic concepts and their ability to apply combinations of these concepts largely determine whether they receive a low score on the final test. In contrast, a student’s capability to understand and apply complex knowledge does not directly lead to a low score, as evidenced by the minor contribution of learning ability \(\text {H}\) to predicting whether a student is at-risk.

Here, we quantitatively analyze the influence of these learning abilities on student performance through two cases. Figure 10c showcases the contribution of the three types of learning abilities to the prediction results for a student who is actually at-risk and is also predicted as at-risk, using the well-adapted learning ability parameters \(\hat{\Theta }\) obtained after LTDA. We notice that \(\text {E}_3\) for this student is \(-1.775\), \(\text {M}_3\) is \(-1.358\), and \(\text {H}_3\) is \(-1.88\), indicating that the student’s various learning abilities are below average and relatively low. However, we find that \(\text {H}_3\) adds only a 0.03 probability to the prediction of being at-risk, while \(\text {E}_3\) and \(\text {M}_3\) contribute probabilities of 0.42 and 0.17, respectively. This suggests that if students do not pay attention in class and thus have a weaker understanding of basic concepts, they are likely to receive lower scores; however, limited ability to tackle challenging problems does not necessarily mean they will fail the course. Therefore, for at-risk students, teachers can provide materials that help them revisit basic concepts and assign exercises that reinforce the understanding and application of those concepts, thereby enhancing their academic performance and preventing failures. In the second case, the analysis for a student who is not at-risk and is also predicted as such by the model, shown in Fig. 10d, indicates an above-average understanding of basic concepts and ability to apply them in combination. Their \(\text {H}_3\) value is 1.194, indicating a notably higher-than-average ability to understand and apply complex knowledge. Consequently, they performed well in the final exam, with no risk of failing.

In conclusion, for at-risk students, we recommend that teachers pay more attention to their understanding and application of basic concepts. Specifically, in class, after teaching each knowledge point, teachers can provide brief reviews or simple exercises to reinforce the students’ understanding. They can also increase attendance checks to prevent students from missing classes, which would otherwise lead to knowledge gaps. Additionally, providing extracurricular materials or videos can help these students review or re-learn basic concepts and their applications. At the same time, we suggest that these students focus on mastering the basics before diving into complex knowledge.

Discussion

This section discusses the principal findings of the study, as well as its limitations and directions for future research.

Main findings

Prior research has largely focused on short-term data, failing to fully leverage the student data accumulated over the years. Our study introduces the use of long-term data for predicting student performance. However, we have identified two main challenges in utilizing long-term data: heterogeneity, as seen in the yearly change in the number of features, and distribution shift, with yearly feature distribution alterations as displayed in Figs. 3 and 4. To overcome these challenges and fully harness the potential of long-term data, we proposed the LASA algorithm, which, when combined with classifiers and SHAP, forms an interpretable prediction framework.

Experimental comparisons indicate that our method significantly outperforms others, exceeding the state-of-the-art student performance prediction model ProbSAP and the domain adaptation model SFERNN by 6.8% and 6.6% in average accuracy, respectively. The Effectiveness Analysis reveals that even when employing methods designed for short-term data, the use of long-term data does indeed enhance predictive performance, as shown in Fig. 6. This confirms the significance of using long-term data and illustrates that our proposed LASA is more effective because it addresses heterogeneity and distribution shift, aspects not considered by other methods. Furthermore, the Sensitivity Analysis of LASA’s parameters suggests that our method for selecting hyperparameters yields robust results, and LASA itself is resilient to parameter variations.

Finally, an interpretive analysis of the predictive framework incorporating LASA reveals that LASA can extract meaningful features regarding students’ learning abilities and analyze their contributions to prediction outcomes, leading to actionable educational interventions. For example, students performing poorly often struggle with simpler questions, indicating a weaker foundation. Targeted remediation and reinforcement of fundamental knowledge are recommended for these students, whereas recommending integrated problem-solving exercises may not effectively improve their performance.

Limitations and future research

This paper introduces a predictive framework that combines the Learning Ability Self-Adaptive Algorithm (LASA) with a base classifier to harness long-term data for student performance prediction. As it is not an end-to-end method, the parameters of each module are optimized separately, potentially resulting in suboptimal outcomes. Heuristic and evolutionary optimization algorithms present promising approaches for improving this process, with recent developments showing success in optimizing machine learning parameters [47, 48]. Furthermore, employing these techniques for feature selection could also help address heterogeneity in long-term data. Future research will explore these techniques to further enhance the performance of our predictive framework.

Moreover, we note that the distributional differences between historical and recent data can vary significantly, impacting the model’s predictive accuracy. Large disparities may lead to a marked decrease in accuracy. Upcoming studies will seek to quantify these differences, assigning appropriate weights to long-term data to prioritize data with distributions that align more closely with the target year while diminishing the influence of data with greater discrepancies, thus improving the predictive performance and robustness of our model.

Our proposed LASA primarily caters to objective question types, such as multiple-choice questions, and may not perform well on subjective questions, such as those in humanities courses, because it does not consider how to extract features from textual responses. This limitation restricts LASA’s ability to assess complex reasoning and problem-solving: it cannot effectively evaluate higher-order cognitive processes such as critical thinking, creative thinking, or decision-making, which are more evident in subjective question types. However, with advancements in NLP technology and the widespread adoption of large language models (LLMs), we may consider employing LLMs to extract textual features in the future, enabling LASA’s application to a broader range of courses.

Additionally, although LASA has shown promising results in predicting student performance in circuit courses, its effectiveness in other subjects, such as calculus or linear algebra, remains untested due to the lack of long-term data in these areas. Moreover, teaching style differences among instructors may lead to data distribution changes, which could affect the model. To broadly apply our model to various teachers and courses, we plan to collect a wider range of data and develop more advanced methods to extract features and adapt to distribution changes in the future.

Conclusion

This study introduces the Learning Ability Self-Adaptive Algorithm (LASA), a novel approach designed to leverage long-term student data to predict student performance more accurately. By addressing the challenges of heterogeneity and distribution shifts in long-term educational data, LASA significantly improves predictive accuracy over existing state-of-the-art models. Our comprehensive experiments, spanning data from 2016 to 2023, underscore LASA’s ability to adaptively model student learning abilities and align distributions across academic years, thus overcoming the limitations posed by the dynamic nature of educational settings.

The superiority of LASA was demonstrated through extensive comparative analysis, where it consistently outperformed other benchmark models by a notable margin in accuracy. These results underscore the importance of considering the evolving features and distributions of long-term data for student performance prediction. Furthermore, the integration of SHAP-based Model Interpretation into our framework provides valuable insights into the impact of various features on prediction outcomes. This not only enhances the interpretability of our model but also offers actionable guidance for educational interventions aimed at improving student learning outcomes.

Despite the promising results, our study acknowledges limitations, such as the model’s focus on choice-based questions and its application to a single course taught by one instructor. Future work will explore extending LASA to other types of questions and courses, incorporating advanced NLP techniques for feature extraction from subjective assessments, and collecting more diverse long-term data across different instructors and disciplines.

In conclusion, LASA represents a significant step forward in the domain of educational data mining, particularly in the use of long-term student data for performance prediction. Its development aligns with the growing need for adaptive, accurate, and interpretable models in the educational sector, paving the way for more personalized and effective learning experiences.