1 Introduction

Academic performance (AP) is widely recognized as being crucial for lifelong success. Students who excel academically tend to enjoy better health, overall well-being, and higher salaries [62]. Moreover, AP is a proven critical indicator of university dropout rates [63, 79], thus posing a significant challenge for modern economies. The demand for individuals with advanced skills is escalating in order to satisfy labor market requirements and foster productive and equitable environments [88], and high dropout rates are a formidable barrier to these objectives. At a worldwide level, a mere 38% of students enrolled on Bachelor’s degree programs complete their studies within the designated time frame [62]. Although the situation in Spain is less dire, it is still troubling, with 27.4% of students at public universities discontinuing their studies in order to either change majors (12.5%) or exit the Spanish university system entirely (14.9%). These percentages rise to 37.1% in the fields of Engineering and Architecture, with 18.3% leaving the Spanish university system entirely [11]. It is particularly noteworthy that the highest dropout rates occur in the first year of study. These statistics fully justify the scientific community’s considerable interest in two key areas: identifying the factors that can predict academic failure, and developing straightforward strategies based on these factors that will promote the development of flexible and interpretable models that are adaptable to various real-world educational environments [66]. These models should additionally enable the early identification of at-risk student groups so as to facilitate the timely execution of academic interventions [66].

With regard to the factors linked to academic performance, various authors argue that predicting academic dropout necessitates a blend of sociodemographic, economic and academic elements [4, 13, 53, 63, 79]. Achieving this requires cooperation between higher education instructors and institutional data system administrators, a process that can be quite burdensome for instructors.

Some researchers therefore advocate predicting academic performance by evaluating factors that are more readily accessible to academic instructors. The relevant literature emphasizes that cognitive factors such as cognitive abilities and self-regulated learning strategies are crucial [38, 51, 69, 81]. However, recent studies suggest that although cognitive factors account for approximately 25% of the variance in academic performance, the relationship between cognitive abilities and academic success is also shaped by various non-cognitive factors [69]. These include personality, emotional intelligence, motivation, academic engagement, self-efficacy, and grit [1, 3, 6, 28, 46, 58, 72, 92, 93]. A significant challenge regarding some methods in this category is their heavy dependence on subjective or self-reported measures [65], along with the use of extensive questionnaires [12, 23], which may hamper the reliability of the measurements. Moreover, certain proposed models incorporate a wide range of variables [74], which not only complicates the measurement process but also heightens the risk of overfitting, particularly when combined with machine learning techniques and relatively small sample sizes (around 100) [40].

With regard to facilitating streamlined processes for the early identification of student groups that are at risk, the use of machine learning (ML) techniques to tackle academic underperformance by considering both cognitive and non-cognitive factors has grown in popularity in recent years. This trend is attributed to the acknowledgment that the connection between these factors and academic performance (AP) is inherently non-linear [69]. The application of ML has consequently advanced educational research by providing considerable insights into these complex interactions. However, integrating ML models into the university setting can pose significant challenges for instructors (see e.g. [19]), who are required to navigate the application of relatively advanced ML techniques, often within the confines of restricted knowledge, time and resources.

The aim of this paper is twofold. Firstly, we introduce and showcase the potential effectiveness of a concise model that incorporates three key variables: previous academic performance, personality and academic engagement, in order to predict academic success. These variables cover both the cognitive and non-cognitive perspectives, which are, according to current literature, considered essential. The related eight measures (or nine, should the instructor aim to develop a model that is applicable to multiple subjects simultaneously) can be swiftly gathered (within 30 minutes) through the use of a concise questionnaire and a brief examination of the Learning Management System (LMS) log files related to the course. Secondly, we present a comprehensive ML process that will allow instructors to craft their personalized versions of the model, adapted to their unique educational contexts. The straightforwardness of the model, along with a detailed step-by-step guide to the ML process that includes explanatory techniques and cohort analysis, facilitates the detection of at-risk students and the derivation of insightful conclusions, even for instructors with minimal knowledge of ML and in scenarios involving small groups and limited resources. Furthermore, the clarity and interpretability of the resulting ML models equip educators with the means to thoroughly comprehend and trust the predictions generated.

In order to demonstrate the viability of our method, we have applied it to two mathematics courses at a Spanish university. The findings reveal that, in this setting, our approach successfully identified at-risk groups in both subjects just one month into the semester. Equipped with this information, the educators involved in this case study now have a tool that enables them to implement targeted and early interventions so as to reduce the risk of academic failure on these courses.

Fig. 1 Conceptual model: overview

The remainder of this paper is organized as follows: Section 2 introduces the conceptual model and outlines its three key constructs, while Section 3 provides a review of key empirical studies examining the effects of the model constructs on academic performance and the methodologies employed to evaluate these effects. Section 4 describes the ML process that instructors must follow so as to develop their own predictive models, and Section 5 describes the case study developed in order to illustrate the proposed approach. Section 6 provides an in-depth explanation of the implementation of the data collection phase of the ML process, while Section 7 illustrates the steps involved in identifying the optimal algorithm and ML model for any given context. Section 8 explains how to apply explainability techniques in order to improve the interpretability of the results. Section 9 explores the implications of our contributions and findings for the contemporary educational context, along with their limitations, and finally, Section 10 encapsulates the primary conclusions and proposes avenues for future research.

2 Conceptual framework

In order to empirically evaluate how differences in students’ cognitive and non-cognitive profiles can assist in predicting academic performance early in the semester, we adopted the Model of Human Performance (MHP) [16]. This model categorizes the various individual differences that could potentially impact on work productivity into two different dimensions. The first dimension, Capacity, encompasses variables including work experience, educational level and cognitive abilities, while the second dimension, Willingness, includes psychological and emotional traits such as motivation, academic engagement and personality. The model also incorporates a third dimension, Opportunity, which refers to the contextual factors surrounding the work, such as materials, tools and working conditions. The hypothesis driving this research is that the same three factors that, according to the MHP, impact on productivity in work settings are also bound to affect academic productivity. This hypothesis is partly supported by the presence of these same components, albeit structured and named differently, in other theories that are frequently used for the study of academic performance, such as the theory of Self-Regulated Learning (SRL) [64].

We have identified a minimum set of variables for each dimension that, based on existing literature, show promise as regards predicting academic performance at the university level [41, 69, 74] while simultaneously being easy for instructors at university institutions such as ours to collect. The resulting conceptual model is depicted in Fig. 1, in which the focus is on the relationship between individual differences - Capacity and Willingness - and Performance. The Capacity dimension is represented by a single component, prior academic performance (PAP), while the Willingness dimension includes two subcomponents: personality and academic engagement. Opportunity, which in the educational context refers to external factors that enable or prevent a student from achieving academic success effectively, is presumed to impact equally on all students within a specific educational setting. It has consequently not been integrated into the model as a dimension with which to explain individual differences in academic performance.

Each of the theoretical constructs involved in this conceptual model is explored in greater depth below.

2.1 Capacity

According to Reed and Jensen [73], cognitive abilities can be defined as “the skills and processes of the mind necessary to perform a task.” Cognitive skills assist in the process of information gathering, analysis, understanding, processing, and storing in memory for later use in any activity.

There are four contemporary models of intelligence that have received significant attention in the literature related to assessing cognitive abilities [68]: the theory of Multiple Intelligences (MI), the theory of the three strata or the Cattell-Horn-Carroll (CHC) theory, the Successful Intelligence Theory (SIT), and the Verbal-Perceptual Rotation Theory (VPRT). All of these theories begin with a general concept of intelligence, which is understood to be “the potential that facilitates adaptation, learning, planning, problem-solving, abstract reasoning, decision making, understanding of complex ideas, and the creativity of individuals, which has a biological substrate and is not exclusive to the human species” [39]. While all of these theories have associated measurement instruments, their comprehensiveness makes their application challenging in the university context.

Furthermore, the Spanish university entrance system is based on a composite score that reflects PAP. This score is a combination of the student’s high school grade point average (GPA) and the college admission test score. The PAP score serves as a gatekeeper for Spanish university admissions and is acknowledged to reflect the student’s cognitive abilities beyond that which is captured by an intelligence test, owing to the self-regulatory competencies that influence the high school GPA [69]. Moreover, previous studies have demonstrated that PAP is the strongest predictor of university GPA [69, 74]. In this study, we therefore use PAP as a proxy for capacity (see Fig. 1).

2.2 Personality

The American Psychological Association (APA) defines personality as “individual differences in characteristic patterns of thinking, feeling, and behaving”. The research community widely accepts that personality traits are useful for predicting intention, achievement and behavior [9].

The two most representative personality theories are, according to personality psychologists [5, 35], the Big Five (BF) personality model [49] and Eysenck’s Hierarchical Three Factor model (PEN) [33]. Some studies comparing both have concluded that the BF model is more comprehensive and has better measurement reliability than the PEN model [7], thus making it the dominant model in personality research [26, 29, 83] and education [14, 20, 83].

The objective of the BF model is to classify all significant sources of individual personality differences, and it includes five factors: Neuroticism (N), Extraversion (E), Agreeableness (A), Conscientiousness (C), and Openness to experience (O). N reflects the individual’s tendency to experience distress and the cognitive and behavioral styles that result from this. High N scorers experience higher levels of nervous tension, guilt, depression, self-consciousness, frustration, and ineffective coping. E reflects individual differences in positive emotionality, and high E scorers are energetic, optimistic, cheerful, enthusiastic, talkative, dominant, warm, and sociable, while low E scorers are more reserved, quiet, shy, silent, withdrawn, and retiring. A reflects individual differences in friendliness, with high A scorers being modest, caring, nurturing, emotionally supportive, altruistic, and trusting, while low A scorers tend to be more self-centered, spiteful, hostile, jealous, and indifferent to others. C reflects individual differences in the will to achieve and impulsiveness, with high C scorers being neat, diligent, thorough, achievement-oriented, well-organized, and governed by conscience, while low C scorers are not. Finally, O refers to individual differences in intellect traits: high O scorers have aesthetic sensitivity, imagination, curiosity, intellectual interest, a need for variety, unconventional values, perceptiveness, and openness to values, sensations, feelings, and fantasies. However, individuals who attain low scores for the O factor tend to favor conservative values, repress anxiety, and judge in conventional terms. These five components are illustrated in Fig. 1.

2.3 Academic engagement

Academic engagement is characterized by a favorable, persistent, and comprehensive attitude toward work [78]. Research has consistently shown a positive correlation between academic engagement and academic achievement [69, 77].

With regard to assessing academic engagement, the increasing use of information and computing technologies (ICT) in education, and particularly the integration of Learning Management Systems (LMS), provides instructors with a rich ecosystem of data that can provide a clear picture of the quantity and quality of interaction both inside and outside the classroom, along with a sense of the students’ commitment to the subject. System usage has, in particular, been shown to be significantly related to academic performance, explaining around 20% of the variance in students’ final grades [36].

Of the myriad of LMSs available on the market, the Spanish university system employs Moodle, an open-source e-learning environment that allows instructors to gather a range of metrics with which to assess how students approach the course in blended settings. Some examples of these metrics include total time online, total sessions (accesses to the platform), average inter-session interval, the proportion of time spent on learning resources (relative to total time online), the proportion of learning resources accessed, the proportion of activities completed, number of interventions in fora, and number of messages sent to the facilitator [30].

Although any of these measures (or a combination of them) could have been used to provide a picture of student engagement, given our objective of producing a parsimonious model, we opted for a single variable with which to reflect LMS use. The choice of this variable was guided by the following criteria: (a) the variable had to be easy to collect (standardized by subject) in order to preserve the simplicity of the data gathering process; (b) the variable needed to be sufficiently broad to be applicable to various subjects and instructors, and (c) the variable had to represent platform usage, irrespective of the specific purpose. For example, some courses may employ active learning methods such as flipped classrooms, necessitating significant access to resources at home without requiring extended interaction with the platform (in terms of time). Conversely, courses that demand student engagement in forums, wikis, or quizzes through the LMS can result in greater platform usage both within and outside the classroom. These two scenarios could yield very different time metrics, although not necessarily divergent counts of platform accesses, since these are normalized by subject. We consider this to be an indicator of academic engagement that is less dependent on specific teaching strategies than other variables, which makes it more adaptable to changes in the academic environment. Moreover, the dynamic nature of academic engagement suggests the suitability of measuring this number of accesses at different time points. Given the 4-month format of Spanish university courses, we have chosen four moments: T1 (week 4), T2 (week 8), T3 (week 12), and T4 (week 15). All of this is reflected in Fig. 1.

3 Related work

3.1 Key factors influencing academic performance: insights from meta-analyses

As mentioned previously, given the acknowledged significance of AP for lifelong success and socio-economic growth, extensive research has been conducted into the variables that may affect it, including several meta-analyses. This section reviews frequently cited studies that examine the effects of the three factors incorporated into the conceptual model illustrated in Fig. 1 (PAP, personality, and academic engagement) on AP.

In terms of capacity, a meta-analysis by [74] identified a moderate correlation between high school GPA and American College Testing (ACT) scores. Likewise, the findings of [69] indicated that PAP serves as the most significant predictor of university GPA, a conclusion further supported by other studies [27, 44, 87].

With respect to personality, a meta-analysis by [70] on the Five Factor model of personality and academic performance highlighted conscientiousness (one of the Big Five traits) as a key predictor of tertiary GPA, even after taking high school GPA into consideration. This conclusion is corroborated by [74]. Vedel [86] conducted another meta-analysis, reinforcing the view that conscientiousness is the strongest personality predictor of GPA, with a weighted summary effect of .26. The meta-analysis carried out by [69] similarly recognizes conscientiousness, academic self-efficacy and non-cognitive SRL strategies as powerful predictors of variance in AP.

A more recent meta-analysis was conducted by [60] in order to examine the combined impact of cognitive abilities and personality. This analyzed 267 independent samples (N = 413,074) from 228 unique studies. The results indicated that the combined effect of cognitive ability and personality traits can account for 27.8% of the variance in academic performance. Cognitive ability was found to be the most significant predictor, with a relative importance of 64%, while conscientiousness had a relative importance of 28%.

Finally, with regard to academic engagement, data from Learning Management Systems (LMS) have been shown to effectively predict student academic performance with high accuracy. Specifically, metrics such as the frequency and timing of Moodle interactions, along with the types of activities students participate in on Moodle, can be extracted from Moodle logs. These metrics have been identified as being strong predictors of academic performance, accounting for approximately 20% of the variance in students’ final grades [2, 32, 36, 54, 71].

It is noteworthy that the three factors included in our conceptual framework (capacity, personality, and academic engagement) appear to maintain their predictive power over academic outcomes across years [42, 69], and also seem to be valid for different cultures [75].

Fig. 2 Proposed ML process: an overview

3.2 Academic performance analysis techniques: from traditional statistics to machine learning

With regard to the techniques used to analyze the impact of these factors on academic performance, traditional statistical methods such as correlation analyses and simple/multiple linear regression models have been the norm for many years [36, 79]. However, the emergence of LMSs and their associated data, along with the development of advanced ML techniques, have driven the research community to broaden their understanding of the learning environment by exploring and leveraging a wide range of educational data through the use of these innovative techniques. In particular, recent studies [2, 79] have established the critical role played by ML methods in identifying students at risk and predicting dropout rates. Another recent meta-analysis [34] reports that a significant majority of studies employing ML techniques (approximately 90%) utilize supervised learning (with classification accounting for 78% and regression accounting for 12%), while a smaller fraction (10%) applies unsupervised learning. Within the realm of supervised learning, the meta-analysis identifies the most frequently used ML algorithms for these approaches, including Random Forest, Decision Tree, Support Vector Machine, Naive Bayes, Linear Regression, Logistic Regression, Artificial Neural Network (ANN), K-Nearest Neighbor, Gradient Boosted Trees, and eXtreme Gradient Boosting. Another meta-analysis by [10], which encompasses 260 studies focused on predicting student performance within the scope of Educational Data Mining (EDM), states that the most popular data-mining algorithms are ANN and Random Forest. Nevertheless, it is crucial to acknowledge that there is no single “best” algorithm with which to predict academic performance, as different ML algorithms excel in various knowledge domains [79]. Furthermore, the effectiveness of these algorithms may vary according to the application of feature selection and whether the evaluation metrics include only precision or both precision and Kappa scores [32].

4 The machine learning process

A step-by-step illustration of our proposed ML process is provided in Fig. 2. This figure shows the three main phases of the approach: data collection, ML model generation and selection, and post-hoc explainability. By following this process, instructors are able to generate their own ML models, tailored to the specific characteristics of their educational contexts, in order to predict academic performance, identify risk groups and implement early interventions effectively.

The instructor starts by collecting data in the three key areas identified by the conceptual model (see Fig. 1). The set of variables (predictors) that must be considered by any instructor who intends to apply our approach to train the four candidate models (Time 1, Time 2, Time 3 and Time 4, see Fig. 2) includes:

  • The five personality variables (N, E, A, C, and O). In order to measure the personality constructs, the Spanish version of the Big Five Inventory (BFI-44), which has been repeatedly validated by the research community [52], was selected. This questionnaire includes 44 items, divided as follows: N (8 items), E (8 items), A (9 items), C (9 items), and O (10 items).

  • PAP: a measure of previous academic performance. This is a score ranging from 0 to 14, obtained by combining the student’s high school grade point average (GPA), the mandatory college admission test score (CATS_M), and the optional college admission test score (CATS_O). The score is calculated as follows: a maximum of 10 points is achievable by computing \(0.6*GPA + 0.4*CATS\_M\), and the additional 4 points (required in order to attain a total of 14) are obtained from the optional section of the college admission test (see the sketch after this list).

  • Four pairs of LMS-A variables, corresponding to four points in time (T1, T2, T3, and T4) and a distinction between accesses inside the classroom (IA), and outside the classroom (OA). For example, LMS-OA-T3 would correspond to the number of accesses to the Moodle platform made from outside the classroom by each student at the end of week 12.

  • Subject code: if the instructors’ intention is for the model to predict academic performance across various related subjects, they should include the subject code as an additional variable so as to account for differences in difficulty between subjects. This inclusion is crucial because not all subjects, even those within the same domain, share the same level of difficulty, and this variance can significantly influence students’ performance scores.
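As an illustration of the PAP composite described above, the following minimal Python sketch computes the score from its three components. The function name and the clipping of the optional points are our own assumptions; the GPA and mandatory test score are assumed to be on the 0-10 scale used by the Spanish system.

```python
def pap_score(gpa: float, cats_m: float, cats_o_points: float = 0.0) -> float:
    """Composite previous academic performance (PAP) score on a 0-14 scale.

    gpa           -- high school grade point average (0-10)
    cats_m        -- mandatory college admission test score (0-10)
    cats_o_points -- extra points earned in the optional test section (0-4)
    """
    base = 0.6 * gpa + 0.4 * cats_m                  # at most 10 points
    return base + min(max(cats_o_points, 0.0), 4.0)  # optional points capped at 4

# Example: GPA 8.5 and mandatory test 7.0 give a base of 7.9;
# 2.3 optional points raise the PAP score to 10.2.
print(pap_score(8.5, 7.0, 2.3))
```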

Four different ML models are then trained, each corresponding to a different point in time (Time 1 through Time 4).

At each point in time, Explainable Artificial Intelligence (XAI) techniques are applied to the corresponding model in order to understand the features that influence the predictions at that point in time. This step helps demystify the decisions made by the ML model and identify which factors are most important as regards predicting academic risk in the instructor’s context. The instructor can subsequently perform cohort analysis in order to detect patterns or common characteristics among groups of students at risk. This can involve clustering students on the basis of the predictions made by the model and analyzing the characteristics of these clusters. Finally, using the insights gained from the ML model and the XAI and cohort analyses as a basis, the instructor can devise and implement intervention strategies that are applicable at that point in time with the objective of mitigating the identified risks. The objective of these interventions is to support at-risk students and improve their chances of academic success.

At the end of the course, the instructor will have four potential models (T1, T2, T3, T4) with which to elucidate academic performance. It is then time for the instructor to choose the model that best aligns with the unique characteristics of their educational context. This choice is informed by both the predictive accuracy of the model and the timing of the application, thus favoring the earlier detection of at-risk students whenever feasible. After making this choice, in subsequent years, the instructors will need only to deploy the chosen model at the predetermined time to obtain a prediction of academic success that is customized to the distinctive nature of their particular course.

5 Design of case study

In order to demonstrate the feasibility of the approach proposed in this paper, we have applied it to our educational context. This case study can be classified as a cross-sectional study within the observational research category. Observational studies are a type of empirical research in which the independent variables cannot be manipulated (as occurs in experiments or quasi-experiments). They are instead observed, and the researcher draws conclusions based on those observations [15]. When compared to other types of empirical studies, cross-sectional studies have one main disadvantage: they allow neither the establishment of cause-effect relationships nor alternative explanations for the results obtained. However, despite this limitation, they are still valuable in education as they can help confirm assumptions and inform educational actions [21].

5.1 Objectives and definition of context

The main objective of this case study was to illustrate the viability of the approach proposed in this paper, and to generate an ML model and an associated tool that would be capable of predicting, with sufficient precision, students’ academic performance in the context of first and second-year university students enrolled on two mathematical subjects at a Spanish university.

5.2 Research questions

We additionally sought to address the following research questions based on the case study:

  • RQ1: What is the performance improvement ratio of the optimal model, generated using our methodology, when compared to the baseline for student academic performance (AP) in our case study?

  • RQ2: At what point during the semester can, according to our case study findings, at-risk student groups be accurately identified in order to prevent academic failures?

  • RQ3: How do individual features correlate with AP in our case study, and is this relationship consistent across the four proposed models?

5.3 Variables and measurement instruments

The objective of our case study was to discover a model that would be capable of elucidating academic performance across two related courses. We included both the eight mandatory variables (N, E, A, C, O, PAP, LMS-IA and LMS-OA) and subject code as input variables.

The outcome variable (academic performance), meanwhile, refers to the predicted final score obtained by the students on the course.

5.4 Context of case study

The objective of our case study was to collect data from 322 first- and second-year students enrolled on the Computer Engineering program on the Albacete Campus of the University of Castilla-La Mancha in Spain.

Table 1 Participants in the study

As discussed in Section 1, this university period is particularly important because the transition from high school to university can be challenging for students in terms of developing independent learning skills, self-assessment abilities and meeting expectations regarding academic performance [89, 96]. It is consequently essential to identify students who are at risk of failure during this period, as the majority of dropouts occur at this time [50, 63, 79].

The selection of courses for the study was determined as follows:

  • Calculus and Numerical Methods (CNM): first semester of the first year of the degree program. This included 167 students (17 female), with a median age of 18 years and fewer than 16% on their second enrollment.

  • Logic: first semester of the second year of the degree program. This course had 155 students (20 female), with a median age of 19 years and fewer than 8% on their second enrollment.

This selection was made because these subjects are within the same domain (mathematics) and, along with programming courses, tend to pose the greatest challenges for students on the Computer Engineering degree program, where instructors typically observe higher rates of academic failure.

Table 2 Final grade of the participants in CNM and Logic subjects

6 Phase 1: data collection

In this phase, the instructor needs to gather the measures associated with the Personality, PAP and Academic Engagement variables (see Fig. 2).

In our case study, this gathering process started during the first week of the first semester of the course (week 1). During this week, the students completed the BFI-44 questionnaire (Spanish version), together with a question in which they stated their PAP score. Table 1 shows the number of students expected (total enrolled) and the number of students that participated in the study by filling in the BFI-44 questionnaire (total number of participants). We also show the same numbers disaggregated by gender. The students were not warned in advance that a questionnaire was going to be administered, which means we can safely assume that non-participation was unrelated to the study itself.

The Cronbach’s Alpha values calculated for the five scales on the Big Five (BF) questionnaire are the following: Extraversion: 0.807 (8 items); Agreeableness: 0.702 (9 items); Conscientiousness: 0.783 (9 items); Neuroticism: 0.829 (8 items); and Openness: 0.753 (10 items). The fact that all values fall between 0.7 and 0.9 indicates good internal consistency for the scales, without the items being redundant [82].
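Instructors replicating this step can compute these reliability coefficients themselves. The following is a minimal sketch of the standard Cronbach's alpha formula using pandas; the DataFrame layout and column names are illustrative assumptions.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a scale whose items are the DataFrame columns."""
    k = items.shape[1]                         # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# e.g., alpha for the eight Neuroticism items of the BFI-44,
# assuming hypothetical column names N1..N8 in a responses DataFrame:
# print(cronbach_alpha(responses[[f"N{i}" for i in range(1, 9)]]))
```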

Data regarding the number of accesses to the LMS was then gathered by one of the instructors at four points in time during the semester: after one month (week 4), after two months (week 8), after three months (week 12) and just before the exam took place (week 15).

The students on both courses had to perform a set of evaluative activities that would enable them to obtain 45% of their grade. The remaining 55% was calculated on the basis of several partial exams. In the event of the students failing one or more partial exams, they had the opportunity to retake the course by participating in a final exam during the official examination period established by the university, before the beginning of the next semester. Both the activities and the exams (partial and final) were graded by the instructor responsible for the course on a scale of 0 to 10. In order to pass a course, a weighted final mark of 5 out of 10 is required, with at least a 4 out of 10 in the examination part.

6.1 Descriptive statistics

These data can be analyzed using various open-source software tools and libraries.

Descriptive statistics of the final grades (the output measure), disaggregated by subject, can be seen in Table 2. Figure 3 comprises the combined violin and box plot graph that represents the distribution of these same grades disaggregated by subject. This graph demonstrates that both subjects have a wide distribution of grades, implying significant variability in student performance. In CNM, the mean is below the median, indicating a distribution that is slightly skewed towards lower grades. In Logic, the mean is very close to the median, suggesting a more symmetrical distribution of grades. CNM also has a longer lower tail, which implies that there are more students with lower grades. In summary, this distribution seems to suggest that the subject of Logic is easier for students, whose academic performance is generally better than in CNM.
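As one hedged example of how such a plot can be produced with open-source tools, the following sketch uses seaborn; the DataFrame `df` and its column names are our own illustrative assumptions.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to hold one row per student, with a "subject" column
# (CNM/Logic) and a "final_grade" column on the 0-10 scale.
ax = sns.violinplot(data=df, x="subject", y="final_grade", inner="box")
ax.set_ylabel("Final grade (0-10)")
plt.show()
```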

Fig. 3 Combined violin and box plot showing the distribution of final grades in CNM and Logic subjects

Figures 4, 5, and 6 show the combined violin and box plot graph, illustrating the distribution of the input data required by our approach.

Fig. 4 Combined violin and box plot depicting the distribution of the PAP variable

The shape of the violins and the position of the mean in Fig. 4 (PAP variable), suggest that there is no significant difference between the distribution of the entrance grades for CNM and Logic, although CNM has a slightly higher median. This indicates that the mean entrance grades for these two subjects are similar.

Fig. 5 Combined violin and box plot depicting the distribution of the BF personality variables: Agreeableness (A), Conscientiousness (C), Extraversion (E), Openness to Experience (O), and Neuroticism (N)

Fig. 6 Combined violin and box plot depicting the distribution of the academic engagement variables (in-class and out-of-class) at different time points, using robust normalization

Fig. 7 Diagram illustrating the ML-based process utilized in order to extract pertinent information

Figure 5 (Personality) additionally suggests that there are no marked differences between the personality profiles of the CNM and Logic students with respect to the five traits evaluated. The medians and means are quite similar for each trait, which implies that the groups of students in these two subjects are alike in terms of personality.

Finally, Fig. 6 (Academic Engagement) shows the distribution of the variables measuring (a) the number of accesses to the Moodle platform during in-person class hours (LMS-IA), measured at four points in time (T1 to T4), and (b) the number of accesses outside these hours (LMS-OA), recorded at the same four points in time. A robust normalization (\(R_n\)), defined in (1), was applied in order to mitigate the impact of outliers in the distribution of the academic engagement variables (Fig. 6). In this formula, the functions \(q_1\), \(q_2\), and \(q_3\) correspond to the first, second and third quartiles, respectively, computed for the subject under consideration, s.

$$\begin{aligned} R_n(y_i,s) = \frac{ y_i - q_2(y,s)}{q_3(y,s)-q_1(y,s)} \end{aligned}$$
(1)

This normalization makes the measures comparable across different time horizons and subjects. As expected, there is a greater dispersion for the LMS-OA variables, especially in the Logic subject.
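A minimal pandas sketch of this per-subject normalization follows, assuming a DataFrame with one row per student; the column names are illustrative.

```python
import pandas as pd

def robust_normalize(df: pd.DataFrame, col: str,
                     subject_col: str = "subject") -> pd.Series:
    """Robust normalization of (1): (value - median) / IQR, per subject."""
    def _norm(y: pd.Series) -> pd.Series:
        q1, q2, q3 = y.quantile([0.25, 0.50, 0.75])
        return (y - q2) / (q3 - q1)
    return df.groupby(subject_col)[col].transform(_norm)

# e.g., normalize the out-of-class access counts measured at T1:
# df["LMS-OA-T1-norm"] = robust_normalize(df, "LMS-OA-T1")
```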

7 Phase 2: generation and selection of ML model

After collecting the necessary data, the instructors are now ready to employ ML techniques so as to identify the optimal algorithm and ML model tailored to their context of usage (see Fig. 2).

7.1 Selection of ML algorithm

Figure 7 provides a more detailed examination of the activities undertaken in this phase.

7.1.1 Selection of an initial set of machine learning algorithms

The initial suite of ML algorithms incorporated into our methodology, which was obtained from the relevant literature [48], encompasses the most widely-used supervised algorithms highlighted in Section 3, along with several additional algorithms chosen to represent the principal machine learning styles in a comprehensive manner. These include:

  • Baseline: This algorithm computes the average outcome without considering predictors (independent variables).

  • Linear regression: This is a common type of regression used for value prediction, which assumes independence between variables and applies a coefficient to each input variable plus a single intercept term. In this study, we tested different variants employing different coefficient-fitting methods. These variants are:

    • Simple Linear Regression with least squares adjustment [90]

    • LASSO [84] with l1 regularisation

    • Ridge [45] with l2 regularisation

    • Bayesian Ridge [85] with l2 regularisation and noise adjustment

    • Automatic Relevance Determination (ARD) [59] with Gaussian precision and noise adjustment using the evidence maximization technique.

  • Decision tree [18]: This algorithm predicts the value of a sample in a hierarchical manner on the basis of straightforward learning rules. The tree is constructed using the training samples, and only one feature is taken into account for each rule.

  • Random Forest [17]: In order to create a more stable and robust model than a single decision tree, this algorithm builds several decision trees. All individual forecasts are considered in order to calculate the final prediction.

  • AdaBoost (Adaptive Boosting) [37]: This algorithm constructs several linear regression models during the training phase. The algorithm predicts the value of a new sample by taking into account all the predictions of the regressors, weighted according to the weights learned during the training phase.

  • Tree boosting: XGBoost (eXtreme Gradient Boosting) [22] and CatBoost (Category Boosting) [31] are based on the use of multiple decision trees, and boosting is employed in the training phase. They also use differentiable cost functions and gradient descent for weight adjustment, similar to that which occurs with neural networks, but each applies different strategies.

  • Support vector machine [24]: This algorithm consists of two phases. In the first phase, the original data space is translated into a usually higher dimensional space. In the second phase, a hyperplane with which to separate the samples in the new space is found.

  • Neural Network (Multilayer Perceptron) [43]: All the layers in this basic neural network architecture are fully connected.

  • Nearest Neighbors [25]: This algorithm uses a similarity function, commonly the Euclidean distance, to find the k (parameter) closest known samples in the training set. Its prediction is computed on the basis of these closest samples. In our study, we set the values of k to 1, 3, 5, 7, and 9.

For all algorithms, we propose adhering to the default configurations for all model parameters, except in the case of the neighborhood-based algorithm. With this algorithm, experimenting with varying numbers of neighbors resulted in improved performance, especially when dealing with noisy data. Table 3 enumerates the algorithms, the corresponding library functions, and their respective parameters in order to facilitate the reproducibility of our experiments.

Table 3 Algorithms, functions and parameters used in the experiments
Fig. 8 Average RMSE of the 10-fold cross-validation results for PAP, the five personality predictors (personality), and the eight LMS-A variables (academic engagement) using various machine learning algorithms. Lower error values indicate a better performance. The numbers at the end of the bars indicate the RMSE error, and the ratio of improvement relative to the baseline is shown in parentheses as error(algorithm)/error(baseline). The baseline results are highlighted in blue

7.1.2 Selection of the best model using the 10-fold cross-validation technique and pairwise comparisons

The second activity consists of evaluating the set of ML algorithms selected in order to identify that which is optimal, under the assumption that different contexts may yield different optimal algorithms.

We propose to identify the best algorithm by using the k-fold cross-validation technique, a method commonly employed in this field. This technique involves dividing the original sample set into several partitions of a similar size (typically 10) and iterating through them, with one partition serving as the test set and the remaining partitions used as the training set. This generates multiple results (equal to the number of partitions) for each algorithm, which can then be averaged in order to yield a final result or be tested against hypotheses so as to compare algorithms.

The evaluation metric that we propose for the ML models is the root mean square error (RMSE) [67]. This metric, which is defined in (2), utilizes vectors y and \(\hat{y}\) of size n, where y represents the true values and \(\hat{y}\) represents the predicted values. The RMSE is commonly utilized for regression problems, as it (a) utilizes the square of the difference between the true and predicted values (residuals) to further penalize larger differences, and (b) maintains the same units of measurement by applying the square root to the average of these squared residuals.

$$\begin{aligned} RMSE(y,\hat{y}) = \sqrt{ \frac{1}{n} \sum (y_i - \hat{y}_i)^2} \end{aligned}$$
(2)
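As a hedged illustration of this step, the following scikit-learn sketch runs the shared 10-fold partitioning over a representative subset of the algorithms listed in Table 3. It assumes a feature matrix X and a final-grade vector y have already been assembled; the variable names and the fixed random seed are our own choices.

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import (LinearRegression, Lasso, Ridge,
                                  BayesianRidge, ARDRegression)
from sklearn.neighbors import KNeighborsRegressor

# Candidate algorithms with default parameters, as proposed above;
# only the neighborhood-based algorithm varies its k parameter.
models = {
    "Baseline": DummyRegressor(strategy="mean"),
    "Linear": LinearRegression(),
    "LASSO": Lasso(),
    "Ridge": Ridge(),
    "Bayesian Ridge": BayesianRidge(),
    "ARD": ARDRegression(),
    **{f"kNN-{k}": KNeighborsRegressor(n_neighbors=k) for k in (1, 3, 5, 7, 9)},
}

# The same folds are reused for every algorithm, so that the later
# pairwise comparisons are computed over identical sample partitions.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
fold_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    fold_rmse[name] = -scores  # sign flipped: lower RMSE is better
    print(f"{name}: mean RMSE = {fold_rmse[name].mean():.2f}")
```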

To return to our case study, Fig. 8 presents the average RMSE for the outcomes of the 10-fold cross-validation applied separately to the defined predictors: personality (encompassing the five personality traits), academic engagement (represented by the eight LMS-A variables) and capacity (PAP). The findings illustrate that linear approximation algorithms generally yield a superior performance in our context, likely attributable to the linear characteristics of the variables under consideration. Of these, the ARD algorithm stands out as one of the top performers. In this figure, the most favorable RMSE values were observed for PAP (1.54), Personality (1.75), and LMS-A (1.76), with gain-to-baseline ratios of 0.794, 0.903 and 0.906, respectively (the lower the better).

The instructor must now explore the average RMSE of the 10-fold cross-validation outcomes for all the predictors together at each point in time so as to ensure that their combination improves the performance of the model. We propose that this validation method be uniformly applied across all machine learning algorithms utilizing the same sample partitions, thus facilitating the subsequent pairwise comparisons.

In our case study, this exploration is presented in Fig. 9, which combines PAP, personality, and a dynamic evaluation of academic engagement characteristics throughout a four-month period. Once again, the best algorithms in our case study tend to involve linear approximations.

Fig. 9 Average RMSE of the 10-fold cross-validation outcomes for the personality and academic engagement predictors at different time intervals (spaced one month apart) using several machine learning algorithms together. Lower error values indicate a better performance. The numbers at the end of the bars indicate the RMSE error, and the ratio of improvement relative to the baseline is shown in parentheses as error(algorithm)/error(baseline). The baseline results are highlighted in blue

Fig. 10 Pairwise comparison of significance between the RMSE values obtained from the 10-fold cross-validation after applying the Wilcoxon signed-rank test. Green bullets indicate that the row algorithm is significantly better than the column algorithm

The third step when selecting the optimal model involves conducting a pairwise comparison in order to assess the significance of the RMSE generated by the 10-fold cross-validation of the algorithms under consideration for prediction.

Figure 10 illustrates this pairwise comparison within the context of our case study. The pairwise comparisons were carried out using the Wilcoxon signed-rank test [91], a non-parametric statistical test that does not assume a normal distribution of the means, adopting the standard 95% confidence level [80] with Holm’s adjustment [47]. The findings indicate that linear models significantly outperform the others (as denoted by the green bullets).
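A minimal sketch of this testing step follows, reusing the per-fold fold_rmse arrays from the cross-validation sketch above; SciPy and statsmodels are assumed to be available.

```python
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# fold_rmse maps each algorithm name to its array of 10 per-fold RMSE values
pairs = list(combinations(fold_rmse, 2))
raw_p = [wilcoxon(fold_rmse[a], fold_rmse[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p, significant in zip(pairs, adj_p, reject):
    if significant:  # 95% confidence level after Holm's adjustment
        better = a if fold_rmse[a].mean() < fold_rmse[b].mean() else b
        print(f"{a} vs {b}: adjusted p = {p:.3f} ({better} has lower mean RMSE)")
```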

The instructor can now employ all of these results to select the best model. In our case study, ARD has been selected as the best algorithm because, although the differences between all the linear models are not significant, ARD achieved the best average results.

Furthermore, note that the average RMSE values for ARD at the different time intervals are similar: 1.43, 1.42, 1.43 and 1.42 for T1, T2, T3 and T4, respectively. This means that, in the context of our case study, it is safe to make predictions regarding academic performance at T1, because they are very similar to those that would be made at the end of the academic year, T4.

It is also worth noting that the gain over the baseline using this model and a combination of features is around 0.73, which is better than the 0.79 obtained using the PAP alone.

The results achieved thus far can be employed as the basis on which to address RQ1 by noting that the optimal model generated using our methodology yields a gain-to-baseline ratio of 0.73, indicating an approximate increase of 27% in precision when compared to the baseline. Moreover, the applicability of this model as early as four weeks into the semester provides a response to RQ2.

However, RQ3 can be addressed only by carrying out a further analysis.

8 Phase 3: Post-hoc explainability of the ML model selected

Having selected the best model, it is now time for the instructor to apply ML post-hoc explainability techniques [8] (see the last activity in Fig. 7). These techniques can uniformly explain the results, regardless of the particular algorithm, the linear/nonlinear relationships between variables and the learning styles. Their application assists as regards interpreting the results produced by the model and enhances the instructor’s confidence in its predictions.

Two main approaches can be employed in order to achieve a certain level of interpretability and determine the importance of an input variable in the model. The first approach involves a statistical technique based on performing permutations on the input variables [17], establishing a relationship between these modifications and the predictions, and thus estimating the importance of the variable.

The second approach involves building an additional linear model on top of the original model. This approach is exemplified by Shapley values [76] and is based on game theory. In essence, Shapley values quantify the average incremental contribution of a variable when combined with other variables. The Shapley approximation is particularly useful in those scenarios in which each variable contributes unequally to the final outcome (target variable), thus ensuring local accuracy, missingness handling and consistency. It is for these reasons that the Shapley approach has been used in this study.

One limitation of the Shapley approach is that its values are general and its calculation applies to the total values studied. However, recent advances [56, 57] make it possible to extend its application to different contexts, such as groups of samples or individual samples. The SHAP (SHapley Additive exPlanations) [55] tool makes it possible to leverage these advancements.

A scheme of the integration of the SHAP post-hoc explainability process with the ML training process that we advocate in our approach is presented in Fig. 11. This shows how an explainer model is built (using a linear approach) on top of the trained ML model in order to provide an explanation for the prediction.

Fig. 11 General scheme of the post-hoc explainability process that is performed on top of the trained ML model
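A minimal sketch of this integration with the SHAP library follows, assuming the ARD model and the X, y data from the earlier sketches; the plot calls are illustrative, and shap selects a linear explainer automatically for linear models.

```python
import shap
from sklearn.linear_model import ARDRegression

# Train the selected model on the full training data,
# then build the explainer on top of it, as in Fig. 11.
model = ARDRegression().fit(X, y)
explainer = shap.Explainer(model, X)  # X also serves as the background data
shap_values = explainer(X)

# Global view: average impact of each feature on the prediction (cf. Fig. 12)
shap.plots.bar(shap_values)
# Local view: feature contributions for a single student's prediction
shap.plots.waterfall(shap_values[0])
```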

The outcomes of applying this SHAP explainability process to the ARD model chosen for our case study are shown below, and we also show how the results have allowed us to address RQ3.

8.1 RQ3: consistency of feature correlations with AP

For the sake of comparability, Fig. 12 shows the average impact of the chosen set of input variables on AP at T1 and T4.

Fig. 12 Average impact of input factors on AP. The graph at the top represents T1 (week 4), while the graph at the bottom represents T4 (end of the course)

Our results indicate that, in our case study, the same variables are relevant as regards predicting AP at both points in time (PAP, Subject code, Conscientiousness, Openness to Experience, Agreeableness, and the LMS out-of-class access counts). The fact that the same predictors operate at T1 and T4 suggests the consistency of the model over time.

In order to further explore the individual relationships between relevant features and AP, Fig. 13 presents a grid plot depicting the shape of the relationship for each combination. The figure highlights that PAP is the most relevant variable, with a positive relationship on both courses. The Openness to Experience personality variable, meanwhile, has a negative relationship on both courses, meaning that higher Openness to Experience correlates with lower final grades. Moreover, the student’s access to the LMS outside class time (LMS-OA-T1 and LMS-OA-T4) has a positive relationship, with slight differences between CNM and Logic. Agreeableness also has a slight influence, with a negative relationship. Finally, the course itself also has an impact on the grades, with a tendency toward better grades in Logic, which is a second-year subject, than in CNM (a first-year subject), indicating the greater difficulty of the CNM subject.

Fig. 13 Individual impact of the most relevant variables on AP. The sub-figures are ordered from top to bottom on the basis of their average impact value indicated in brackets. The horizontal axis of each sub-figure shows the density of samples in each corresponding area

8.2 Cohort analysis

The analyses conducted thus far have resulted in a customized model with which to predict a student’s final grade at a specific point in time (T1 to T4). In order to enable instructors to attain a greater understanding of the factors that most influence performance in their analytical context, our proposal includes a cohort analysis as the final step. This analysis allows the division of the students for whom the ML model has been customized into n performance groups (n cohorts) on the basis of a set of specific rules, conditions or criteria that define each group.

Our proposal suggests that this group formation should be automated in a manner similar to the way in which decision trees are created (decision tree method). In this automatic cohort formation, the most relevant variables for the samples analyzed are evaluated on the basis of the target variable, and a split is applied to one of them in each step so as to form two partitions. The use of the SHAP tool and the application of the Gini metric (see (3)) to the samples of the partitions being created are proposed for this purpose.

$$\begin{aligned} Gini(y) = 1 - \sum _{i=1}^{n} P(y_i)^2 \end{aligned}$$
(3)

where y is a vector whose elements belong to n distinct classes and \(P(y_i)\) represents the proportion of elements belonging to class \(y_i\); the resulting value is the probability that a randomly chosen element would be misclassified if it were labeled at random according to the class distribution.

The process starts with the total set of samples and divides it into subsets until the maximum number of sets is attained.

The main advantage of this additional step is that the algorithm automatically identifies the optimal cohorts on the basis of logical criteria, and helps uncover rules that might not be obvious a priori.
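A hedged sketch of this cohort formation follows. Note that scikit-learn's regression trees split by variance reduction rather than the Gini metric of (3), so this is an approximation of the described procedure; the DataFrame and its column names are illustrative.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Grow a shallow tree on the most relevant features (here PAP and the
# out-of-class access counts at T1) so that its five leaves form the cohorts.
features = ["PAP", "LMS-OA-T1"]
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=42)
tree.fit(df[features], df["final_grade"])

print(export_text(tree, feature_names=features))  # human-readable cohort rules
df["cohort"] = tree.apply(df[features])           # leaf index = cohort label
print(df.groupby("cohort")["final_grade"].median())
```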

Fig. 14 Cohort rules at T1, represented as a decision tree. Asterisks (**) indicate significant differences (at 95% confidence level) between cohorts (1), (2), (3), (4) and (5) based on the output variable

Returning to our case study, using the ML model at T1 as a basis, we decided to create five cohorts, which, in our educational context, were considered sufficient to differentiate between various profiles while still ensuring that the subsets contained an adequate number of samples.

Table 4 shows the automatically generated rules defining the five specified cohorts for our case study. Similarly, Fig. 14 illustrates these rules as a binary tree. Both representations include descriptors of the groups formed, such as their size, the median of their final grades, and the significance of the differences between them, determined by applying the Mann-Whitney U test [61]. In the first group, individuals with a PAP score exceeding 10.85 achieve the highest final grades, with a median of 8.6. In the case of those who do not meet this criterion, their subject performance is determined by the number of accesses they make outside class time during period T1. If the number of accesses is above 104, they are assigned to groups 2 (median 7.2) or 3 (median 6.0) on the basis of their PAP score (greater or lower than 7.4), whereas if it is below 104, they are assigned to groups 4 (median 5.8) or 5 (median 3.7) on the basis of whether their PAP score is greater or lower than 6.58.

Using the cohorts generated as a starting point, it is possible to identify at-risk groups by selecting those cohorts whose median predicted grade falls below a certain threshold. In our case, and based on the criteria outlined in Table 4 and in Fig. 14, cohorts 4 (PAP \(<=\) 10.85) & (LMS-OA-T1 \(<=\) 104) & (PAP > 6.58) and 5 (LMS-OA-T1 \(<=\) 104) & (PAP \(<=\) 6.58) indicate student groups with a potentially low academic performance, with the latter group being at extreme risk, with a median AP score of 3.7 out of 10.

Table 4 Number of students and median value of the output predictor using cohort rules at T1

8.3 Prediction and analysis tool

Finally, in order to facilitate both individual-level academic performance prediction with explainability and group-level estimation and analysis in the context of our case study, we have created a simple yet functional and effective tool. We have utilized Python Jupyter notebook technology with the Google Colab online service, chosen because it is free to access and requires only a web browser. It is available at this link. This tool is already being used by the instructors of the subjects included in our case study, and it could also serve as an example and a basis for other developments, as its source code is accessible.

9 Discussion

This paper makes two main contributions. Firstly, it proposes a conceptual model that is based on the MHP and includes three factors: personality, academic engagement and capacity. This model is sufficiently simple to allow instructors to collect data quickly and without having to resort to academic authorities, thus increasing its practical utility.

Secondly, it proposes an ML methodology with which to generate tailored models for the early prediction of academic outcomes and the detection of at-risk groups in single or related academic subjects. The process involves three phases: (a) a data gathering phase, (b) a model generation and selection phase and (c) a post-hoc explainability phase. The use of XAI techniques increases the transparency and interpretability of the model selected, thus enabling educators to better understand and trust in their predictions, which can inform the proposal of more effective intervention plans. Moreover, the inclusion of cohort analysis techniques allows group-level estimation and analysis.

In order to illustrate the feasibility of the approach, the paper shows the application of the proposed methodology to a case study in which an ML model was selected and trained with data concerning students enrolled in two different mathematics-related subjects taught in the first and second year of a Computer Engineering degree program. While recent meta-analyses have reported a prediction capacity of up to 25% for cognitive ability [69], the proxy of academic capacity (PAP) of our model has a slightly lower gain-to-baseline (20.6%), using just one easily-collected measure. This gain-to-baseline increases to 26.4% when all the variables are considered together. The context of the case study (mathematics and first and second year subjects) maximizes the practical impact of the model obtained, as they are the years and subjects in which the highest academic failure rates occur in the Spanish system. In this case study, significant insights were obtained from relatively small sample sizes, demonstrating a uniform pattern across two different subjects, despite variations in teaching methods and instructors, within the same academic discipline.

Our case study reveals that, while the non-cognitive variables reported in the related literature can explain up to an additional 25% of the variability in performance, this percentage varies greatly depending on the particular context. In our case study, the non-cognitive components included (academic engagement and personality) explain only an additional 6%, although when analyzed separately they explain 9.4% and 9.7%, respectively (see Fig. 8). One possible explanation for this is that the willingness variables, and particularly personality, which is measured in a manner similar to that employed in the related literature, may have less predictive capability in mathematics subjects than in other fields. Whatever the reason, this fact underscores the need for instructors to develop customized models that are finely tuned to the unique characteristics of their educational environment. The three personality variables that are most strongly associated with academic achievement in our case study are conscientiousness, openness to experience, and agreeableness. These variables have also been identified as the most relevant in related meta-analyses [69, 74, 86]. The positive impact of conscientiousness is expected and easily justifiable, as thoroughness and diligence are widely recognized as having a positive impact on academic outcomes. However, in contrast to other studies, our findings indicate that openness to experience has an inverse relationship with success in our context: students with greater openness to experience perform worse, as shown in Fig. 13. This negative impact suggests that students with higher levels of imagination, aesthetic sensitivity and need for variety may not perform as well in mathematical subjects, despite the fact that these traits are considered positive in other areas and at lower levels of education [70]. Our case study therefore supports the approach of analyzing subjects belonging to different areas of knowledge separately. Finally, agreeableness is the third most influential personality variable in our context, which is consistent with existing meta-analyses [86].

Our cohort analysis shows that academic engagement is the most important willingness-related factor when it comes to classifying students at risk in the context of our case study. Despite its dynamic nature [2], our case study demonstrates that this variable can be useful in certain contexts when measured as early as one month into the semester (T1). This can be justified on the grounds that procrastination is a major impediment to academic success at university [74], meaning that students who engage early in the class dynamics have greater possibilities of success. It is also worth noting that the decision tree derived during the cohort analysis relies solely on the PAP and LMS-OA-T1 variables to determine the groups, indicating that academic capacity and engagement overshadow the impact of personality in the context of this study, which simplifies the identification of students at risk.

Finally, and as occurs in other contexts [71], the case study shows that combining the capacity and willingness variables increases the global classification accuracy of our model when compared to relying exclusively on one type of variable. Our gain-to-baseline result (26.4%) is comparable to the gains shown in a recent meta-analysis of studies that predicted academic performance on the basis of a mixture of cognitive and non-cognitive measures [60]. Moreover, our results highlight the superior predictive power of PAP, with a relative importance of 50.24% (slightly lower than the relative importance of 64% reported for a similar measure in [60]), while personality has a relative importance of 21.24% (also slightly lower than the 28% reported in [60]). These differences can be attributed to the inclusion of academic engagement, which was not considered in the meta-analysis carried out in [60].

9.1 Main limitations of the approach

The case study presented in this paper highlights that the proposed approach has the potential to generate customized ML models with which to predict academic performance. However, it is vital to acknowledge its main limitations. Firstly, the creation of an optimal model necessitates a full academic term for data gathering and analysis in order to identify the most suitable model for the specific context. This model can then assist in identifying at-risk students in future terms, provided that the context of the subject remains relatively unchanged. Secondly, instructors need a fundamental understanding of ML in order to apply this process effectively. Lastly, the dependence on LMS data as a surrogate for academic engagement could restrict its use for those instructors who do not employ these platforms.

9.2 Practical implications

The proposed methodology, which integrates a conceptual framework with a machine learning process in order to generate personalized models for predicting academic performance in higher education, provides significant practical benefits for instructors:

  • Early Identification of Needs: It helps identify students who may require additional support, thus enabling early interventions.

  • Learning Personalization: It allows instructors to tailor educational strategies to the individual needs and characteristics of students, thus improving learning effectiveness.

  • Improvement of Pedagogical Strategies: It provides insights into how different aspects of personality and academic engagement affect performance, thus guiding instructors as regards optimizing their teaching methods.

  • Encouragement of Academic Engagement: By better understanding the influence of engagement on academic performance, instructors can design activities that increase student participation and motivation.

These implications highlight the potential of the proposal to transform higher education, emphasizing personalization and the continuous improvement of the teaching-learning process.

10 Conclusions and future work

In this paper, we have presented a method that includes both a conceptual model and an ML process with which to predict academic performance. We believe that the parsimony of the conceptual model and the clear step-by-step description of the process significantly increase the likelihood of our proposal being adopted in university settings.

Our approach aims to assist instructors in detecting at-risk groups and implementing targeted interventions with which to prevent academic failure as early as possible in the semester, ultimately leading to improved academic outcomes. The use of XAI techniques and cohort analysis allows a more comprehensive understanding of the factors that influence performance in each particular context, leading to more effective risk mitigation strategies.

The application of the approach to a case study involving two mathematics subjects taught in the first and second years demonstrates that the accuracy of the resulting model is consistent with the findings reported in the literature. The model and the associated Colab notebook are currently helping the instructors of these subjects to devise targeted strategies aimed at lowering rates of academic failure, thereby enhancing educational outcomes.

As part of our future work, we plan to apply this method in various new contexts in order to assess the ability of the approach to generate similarly valuable models for other courses and universities. Moreover, we are exploring the incorporation of other easily collectible variables that would enrich the willingness dimension of the conceptual model, such as the admission option (the order in which students chose their current degree program) [79] and the grades they aspire to achieve [74], so as to determine whether these additions enhance the predictive accuracy of the models. Finally, we aim to investigate how our approach compares, in terms of both effectiveness and complexity, with newer approaches such as those that expand user and item features by employing the latent factor space of auxiliary domains. These approaches enrich the feature set with underlying patterns in user-item interactions that may not be directly observable in the current context [94, 95]. One potential advantage of these methods is that, once the latent vectors have been obtained for the students, the feature set of any target domain in which these students participate can be expanded by incorporating these vectors, without the need for any further data collection or analysis. This expanded feature set provides a richer representation of the students, thus enabling more nuanced predictions about student performance or preferences.
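As a pointer to what such feature expansion might involve, the sketch below illustrates the idea using a truncated SVD of a hypothetical student-item interaction matrix from an auxiliary domain. The factorization method, matrix and dimensions are illustrative assumptions; the approaches in [94, 95] may use a different factorization.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # Hypothetical student x item interaction matrix from an auxiliary domain,
    # e.g. counts of interactions with LMS resources in another course.
    rng = np.random.default_rng(0)
    R = rng.poisson(2.0, size=(120, 40)).astype(float)

    # Factorize R to obtain one low-dimensional latent vector per student.
    svd = TruncatedSVD(n_components=8, random_state=0)
    latent_students = svd.fit_transform(R)          # shape: (120, 8)

    # Expand the target-domain feature set with the latent vectors, without
    # any further data collection in the target domain.
    X_target = rng.normal(size=(120, 5))            # placeholder target features
    X_expanded = np.hstack([X_target, latent_students])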