Improved online educational technology and the increasing availability of student data recorded by Learning Management Systems (LMSs) allow for novel methods and opportunities to capture new student learning patterns and behaviours. Learning analytics (LA) allows us to measure, collect, analyse, and report such online learning behaviour of students in a timelier, more detailed, and, to a certain extent, more objective manner than traditional "offline" methods that relied on retrospective self-reports (Beer et al., 2010; Long & Siemens, 2011). Self-regulated learning (SRL) has received special attention in the field due to its high importance for effective online learning in higher education (Khalil et al., 2022). However, students are generally not very good at self-regulating their own learning (Viberg et al., 2020).

To better understand how students self-regulate their learning online, many measurement approaches have been developed. Panadero et al. (2016) argue that the LA field has seen three waves of SRL measurement approaches. During the first wave, most methods were "retrospective", as they required participants to describe SRL-related activities that occurred in the past. Surveys, questionnaires, and interviews fall into this category. These methods mainly view SRL through an aptitude/trait perspective, which assumes SRL is constant across learning activities. The counterargument against such an approach is that SRL is a complex, adapting set of processes and conditions that cannot easily be captured by one set of questions or attributed to any unidimensional scale (Winne & Hadwin, 1998). As a reaction, a second wave of SRL measurements was based entirely on viewing SRL through a state/process lens. Think-aloud protocols, asking students to verbalise their thoughts while learning (Greene & Azevedo, 2009), and online traces, capturing students’ online actions while learning (Panadero et al., 2016), both fall into this category. Studies using such measurements allowed researchers to take a more in-depth look at how students move between various SRL phases, how they set and update goals, how and when they use different learning tactics, and when they change them. Within this wave, we can further distinguish between "obtrusive" measurements, such as think-aloud protocols, and "non-obtrusive" measurements, such as online trace data. While both types of data collection have certain advantages (e.g., trace data collection is more efficient, while self-reported data can be valid if sufficiently tailored to the study context), trace data capture real-time behaviour as it takes place during learning, whereas self-reported measurements can be less accurate; students are not always correct when reporting their strategies (Panadero et al., 2016).

The first two waves have a wide range of fully developed measurements in use today (Panadero et al., 2016). Still, to the best of our knowledge, there are no measurements of the different phases of online SRL that do not rely on self-reported methods and that have been validated and tested on multiple courses. This exploratory paper aims to build a quantitative SRL scale from online indicators validated across courses. Such a scale would be relevant as online indicators can be unobtrusively collected from many students across entire courses without the time and costs involved in self-reported measurements and without burdening students with standard survey measures. However, relying on student clicks to represent certain behaviours, such as SRL, raises validity and reliability issues. Indeed, validity is a growing concern in the LA field (e.g., Milligan, 2018), which we consider in this paper. A fast and reliable method of measuring online SRL, as proposed here, could help determine students’ SRL skills more quickly and accurately, identifying which students need help with their learning.

We continue by presenting a brief literature review of LMS indicators used in the LA field, mainly those used in research in higher education. Some of these indicators will later be used when forming the scale, as described in the methodology. Next, we define the SRL conception that we use and present the SRL model that constitutes our theoretical framework and aim of the scale development, namely the COPES model (Winne & Hadwin, 1998).

1 Background

1.1 Literature review

In recent years, there has been a significant amount of research in the field of learning analytics (LA) on self-regulated learning (SRL). Fincham et al. (2019) analysed patterns in students' study sessions to identify study tactics, which were then used to predict academic performance and the impact of personalised feedback interventions. To create the study sessions, the researchers used indicators based on five features of the learning management system (LMS): formative assessments, summative assessments, video content, reading content, and metacognitive materials (such as engagement with the dashboard and syllabus). Jovanović et al. (2017) identified student learning patterns as they prepared for face-to-face meetings and linked these patterns to final course grades. They used similar indicators but also included access to the schedule and learning objectives page in the metacognitive category. Jovanović et al. (2019) employed the same LMS features but examined generic and context-specific indicators of regularity of engagement and how they differed in predicting course grades. They used indicators such as entropy of weekly use of various learning resources, entropy of weekly assessment outcomes, median absolute deviations, frequency of pattern changes, and number of weeks of high engagement with a particular learning resource. Uzir et al. (2020a, b) identified learning tactics that were then clustered into learning strategies used to predict course grades. They used four indicators related to time management based on important course milestones, such as deadlines for course topic preparation. You (2016) studied online indicators and mid-term grades to predict final course achievement. Like Uzir et al. (2020a, b), they developed indicators such as late submission, which represented students' failure to submit on time, and the first click on a learning material. They concluded that indicators representing SRL are better predictors of academic performance than simple frequency indicators representing online engagement, providing further support for the construction of an SRL online scale.

Other insightful research for selecting clickstream indicators for scale construction is found in predictive learning analytics. To predict students’ academic performance, Li and Tsai (2017) used the online learning materials that students access to identify which ones they spent the most time on, what patterns occur, and which patterns relate to higher grades and motivation. They used the total time and the average time spent by each student on lecture slides, video lectures, shared assignments, and posted messages. Park et al. (2016) identified patterns of online activities in blended courses. While they do use the basic features seen before (announcements, resources, discussion forums, etc.), they add access to wikis or explicit downloads of said resources (different from accessing or clicking them). Zacharis (2015) investigated online activities that show the highest correlations with course grades and form the best predictive model of student success. Of 29 indicators representing more than ten different LMS features, 14 were significant, with four explaining more than 50% of the variance in academic performance: reading and posting messages, content creation contribution, quiz efforts, and the number of files viewed. Conijn et al. (2017) investigated the portability of models predicting student performance across 17 courses. They used 23 indicators across various types, such as total number (e.g., clicks, posts, quizzes started), time (e.g., the largest period of inactivity, time until first click), average (e.g., of clicks or time), and irregularity (e.g., standard deviation of study time or study interval). Predictive modelling showed low overlap among courses regarding the indicators that best predicted final exam grades. Thus, the portability of the prediction models across courses was low.

A final paper we want to highlight is by Quick et al. (2020), which applied factor analysis to various online indicators, with the resulting factors then correlated with established SRL survey scales. While this method does describe a sort of convergent validity test, the paper does not identify clear SRL phases. Also, it does not address the issue of portability across courses. In general, earlier research suffers from several limitations, including the lack of a clear theoretical underpinning, no clear convergent validity check, no clear distinction between the SRL phases, and no consideration of portability. Our research tackles these issues by placing theory and portability at the forefront.

All the presented studies address student engagement to a certain degree. Engagement can be understood as a meta-concept, integrating constructs such as motivation and self-regulation (Beer et al., 2010; Saqr & López-Pernas, 2021). Online engagement, specifically, can manifest through various uses of LMS features, such as online resources and course participation, but also through more complex indicators, such as patterns and regularity of usage (Azcona et al., 2019; Hutain & Michinov, 2022). As found by You (2016), measures of identifiable learning behaviour (e.g., SRL) predict learning achievement better than simple frequency indicators of online engagement. This paper aims to use various types of specific cognitive engagement indicators to identify students' online self-regulation behaviour. We will do this using a well-supported SRL framework adapted specifically for LMS usage. Since online engagement is a broad concept, we plan to use a more detailed model of SRL to help us better capture, select, and differentiate the traces of students' SRL behaviour from other forms of engagement, such as motivation. Such indicators will be used to form an online SRL scale that is valid and portable across courses.

To build a scale, we must first decide on the selection of online indicators. Our review allowed us to explore the field and see what is possible before creating our own indicators and scales. A common limitation of earlier research is that the selection of LA features lacks a clear theoretical rationale. The next section presents the theoretical model that guides the feature selection of our scale.

1.2 The COPES model

Our selection of appropriate clickstream indicators for scaling is guided by educational theory. Self-regulated learning is an umbrella term used to describe the cognitive, motivational, and emotional aspects of learning (Panadero, 2017). Of all the models in the literature, the COPES model (Winne, 2014; Winne & Hadwin, 1998) focuses the most on the cognitive and metacognitive aspects of SRL, which are essential in online learning. This has led researchers to use it extensively in computer-supported learning environments, such as LMSs. Since we aim to build a scale specifically for LMSs based on the fine-grained online behaviour of students, the COPES model will also constitute our theoretical base. The COPES model has four stages (or phases) that students run through recursively when studying: (1) task definition, (2) goal setting and planning (from now on goal-setting), (3) enactment of tactics/strategies (from now on enactment), and (4) adaptation.

Within these four phases, five components operate recursively (Conditions, Operations, Products, Evaluations, and Standards), whose initials give the model its name, COPES. Operations function within Conditions to generate Products. These are then Evaluated relative to Standards, which influence future cycles; for details, see Winne and Hadwin (1998). In this paper, we are interested in the products that result at the end of each phase. During task definition, phase 1, the product is the student's perception of the task at hand, together with its constraints and resources. During goal-setting, phase 2, the product is the set of goals and plans for approaching the study task. During enactment, phase 3, the plans are carried out. During adaptation, phase 4, the student makes changes to the previous three phases for future study sessions. An important aspect of the model is that the phases are weakly sequenced, with students moving freely between them, and recursive, such that any product can influence the shape of the next phase (Winne, 2010; Winne & Hadwin, 1998).

1.3 The aim of the study

Our objective is to build a multidimensional scale for the COPES model that is portable. Thus, the straightforward transfer of the scale between courses while retaining its reliability is one of the main aspects we are interested in and tackle in our research. This so-called “portability” allows the indicators to become an actual “scale”, as other researchers can use it to measure a construct, i.e., online SRL. Yet portability remains one of the biggest problems in the LA field (Conijn et al., 2017). Educators and researchers would find such a data-only online SRL scale easy to form, use, and understand. Our method aims to tackle the portability issue by applying the scale to both similar and dissimilar courses; this comparison is based on the LMS features used during the course.

Finally, we want our scale to measure what it is intended to measure, a characteristic called construct validity (Messick, 1987). The general lack of measurement validity is considered one of the biggest issues to be tackled within the LA field in the near future (Gasevic et al., 2022). Thus, we will correlate the resulting online SRL scales with final student grades and with a separate SRL survey scale. We expect our scale to correlate significantly with student grades. However, we consider the relation with the survey scale more exploratory, since the two measures represent two different aspects of SRL behaviour (Siadaty et al., 2016). Considering these points, our research question is:

  • RQ: How can we create a reliable and valid multidimensional scale measuring online SRL, based on the COPES model, using only trace data? To what extent is it portable across courses?

To answer the research question, we continue with the methodology used to form the scale. The results section presents the dimensions and their correlations with each other and with student grades. Finally, we discuss the findings, current limitations, and directions for future studies.

2 Method

2.1 Research design

This is an exploratory study, since we investigated a new approach to identifying the online SRL phases of the COPES model. It is also primarily a quantitative study, since we formed numeric indicators representing students’ online SRL behaviour and used reliability and correlational analyses to check for convergent validity and possible relations with survey data and academic performance. We opted for this design because we observed a scarcity of studies that effectively bridge online self-regulated learning behaviours with well-defined SRL processes or phases, starting from a robust theoretical framework. Moreover, few LA studies that identify online SRL also triangulate with other data types. While such studies have increased in number in recent years (e.g., Quick et al. (2020) for self-reported data, Cloude et al. (2022) for eye-tracking and physiological responses), they are still in their infancy and, to our knowledge, no study has attempted to triangulate the resulting online COPES phases with students' self-reported data.

2.2 Courses and participants

LMS clickstream data of students was collected through Canvas from four courses at a Dutch university of technology. We chose the courses based on the LMS features they employed. Three courses used the same LMS features; the fourth had a somewhat different structure. Thus, Courses A, B, and S all used discussion forums (where students could ask and answer various questions), announcement boards (primarily used by the educators to update students), schedules, rubrics, study guides, quizzes, lecture slides, gradebooks, and mandatory assignments throughout the courses; for these courses, the number of assignments was three, five, and four, respectively. Course D used similar features; however, it did not use discussion forums, did not have a dedicated schedule page, did not record its lectures, and did not have quizzes; the course had five assignments.

Our final sample for the trace data was N = 757 students with 1,890,625 total clicks across 74,440 learning sessions; on average, students had 98.3 sessions with 25.4 clicks each. Table 1 shows how these were distributed across the four courses. As we can see, Course B had the highest overall online activity. This makes sense, as this course was combined with a mandatory professional skills track, common at the studied university. At the same time, Course D had the lowest activity, which was also expected, as this is a dissimilar course with fewer LMS features for students to access.

Table 1 Information for each of the courses

To check the convergent validity and correlate the online scale with a self-reported scale and grades, we sent out surveys to students from all courses; these surveys also contained the informed consent forms necessary to use their grades and the self-reported scale in our study. In total, n = 173 (22.9%) of the students responded, ranging from 19 (15.6% response rate) in Course S to 65 (35.5% response rate) in Course D. The average age was M = 22.2 (SD = 1.3), with 88 (51%) identifying as female, 84 (48.5%) as male, and 1 (0.5%) as other. The average score for the self-reported SRL scale was M = 4.27 (SD = 0.66, 1–7 Likert scale). For the final grades, the average was M = 6.84 (SD = 0.95), with 10 (5.8%) failing the course; the passing threshold was 5.5.

We started the scale construction by using two similar courses; the third similar course and the fourth dissimilar course were used to check the scales' portability. The four courses were bachelor-level, mandatory, and blended. In the university’s curriculum, a quartile (Q) comprises ten weeks, with a final exam in the ninth or tenth week. The ethical review board and the Learning Analytics committee of the university approved the study.

2.3 Data

2.3.1 Clickstream data pre-processing

All pre-processing and data analysis were done using R (R Core Team, 2022); all the scripts and a short explanation of their usage can be found here: https://zenodo.org/badge/latestdoi/631179167. The data can be made available on reasonable request. A learning session was defined as a series of activities occurring within 30 minutes of each other. While there are no clear rules on the inactivity threshold used for determining sessions, previous studies have used a similar value (e.g., del Valle & Duffy, 2009; Uzir et al., 2020a). The data underwent two cleaning procedures: one at the participant level and one at the variable level. At the participant level, we removed all students with fewer than three sessions to filter out students with a very low online presence. The initial and final samples of students per course can be found in Table 1 above.
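For illustration, a minimal R sketch of this sessionisation and participant-level cleaning could look as follows; the data frame `clicks` and its columns `student_id` and `timestamp` are hypothetical names, not the actual ones used in our scripts.

```r
library(dplyr)

# group clicks into sessions: a new session starts at the first click
# or whenever more than 30 minutes of inactivity have passed
sessionised <- clicks %>%
  arrange(student_id, timestamp) %>%
  group_by(student_id) %>%
  mutate(
    gap_min    = as.numeric(difftime(timestamp, lag(timestamp), units = "mins")),
    session_id = cumsum(is.na(gap_min) | gap_min > 30)
  ) %>%
  ungroup()

# participant-level cleaning: drop students with fewer than three sessions
kept <- sessionised %>%
  group_by(student_id) %>%
  filter(n_distinct(session_id) >= 3) %>%
  ungroup()
```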

At the variable level, we first removed all variables with no clicks (e.g., Course D did not use discussion forums, so all discussion-forum variables were removed for that course). Then, we eliminated variables with very low variance using the nearZeroVar function from the caret package (Kuhn, 2008). Three variables were eliminated for all four courses: the number of clicks on the discussion forum before the start of the course, the number of clicks on the assignment page per session, and the number of clicks on the submission comments. Lastly, we z-score standardised all indicators course-wise to safeguard their generalizability.
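A minimal sketch of this variable-level cleaning, assuming a data frame `indicators` of numeric indicators per student with a `course` column (hypothetical names), could be:

```r
library(caret)

# drop near-constant indicators, as identified by caret::nearZeroVar
num_cols   <- setdiff(names(indicators), "course")
nzv        <- nearZeroVar(indicators[num_cols])              # positions within num_cols
indicators <- indicators[, !(names(indicators) %in% num_cols[nzv])]

# z-score standardise the remaining indicators course-wise
indicators_z <- do.call(rbind, lapply(split(indicators, indicators$course), function(d) {
  keep    <- setdiff(names(d), "course")
  d[keep] <- scale(d[keep])                                  # per-course z-scores
  d
}))
```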

2.3.2 Indicator formation

To form the scales, we identified indicators and associated them with the relevant phases of the COPES model. Since there are no guiding frameworks, we devised a four-step process. First, we took the definitions of each of the four phases of the COPES model and wrote an adapted interpretation specifically for our LMS. The definitions were taken from the Winne and Hadwin (1998) chapter that introduced the model. These interpretations served as a transition step between the definitions and the actual LMS resources we targeted when forming the indicators. For example, task definition is defined as the perception of the features of the task students must carry out (Winne & Hadwin, 1998, p. 285). Adapting this definition to our context, we can think of the student trying to better understand both the course content and the assignments themselves. Thus, the indicators for this phase relate to any course resources that students can access before the official start of the course (a student will access the course at an earlier stage if they plan on better understanding the task/course) or resources that help with studying immediately after they are published (discussion forums, announcements). Information regarding the other phases can be found in Table 2. At this stage, we also decided which time range would be associated with each of the four phases. In short, it does not make sense for all indicators to be calculated using all the clicks available (e.g., task definition represented by clicks taking place after the course exam, or adaptation by clicks before the start of the course). The two dates that defined the three time periods (before, during, and after) were the official start of the course (the first day of the academic quartile) and the last day of the course (the last day of the academic quartile).
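A minimal sketch of restricting clicks to the time window relevant for a phase could look as follows; `kept`, `timestamp`, `course_start`, and `course_end` are the hypothetical names used above, and the exact phase-to-period mapping is given in Table 2.

```r
# split the sessionised clicks into the three periods defined by the quartile dates
before_course <- subset(kept, as.Date(timestamp) <  course_start)  # e.g., task definition
during_course <- subset(kept, as.Date(timestamp) >= course_start &
                              as.Date(timestamp) <= course_end)    # e.g., goal-setting, enactment
after_course  <- subset(kept, as.Date(timestamp) >  course_end)    # e.g., adaptation
```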

Table 2 Definition, interpretation, LMS resources targeted in the formation of the indicators, and data time periods used for the four COPES phases

Second, we connected the indicators to the phases using the indicators from the literature review and the COPES-adapted definitions. As a rule, indicators were allocated to only one phase. Many indicators found in the literature could not be replicated for various reasons, such as data being unavailable or incompatible across different LMSs. Of the 107 indicators we formed, 67 were taken from the literature and 40 were newly created.

Third, based on the list of indicators, we started to create variables. While some indicators were simple frequency indicators (e.g., total number of clicks on announcements), others were aggregates (e.g., average of clicks on announcements per session or standard deviation of the clicks on announcements per session), time-based (e.g., total time spent on the gradebook page), or even time-based aggregates (e.g., average time spent per session on the gradebook page). A complete list of the indicators can be found in Appendix Table 6. As the fourth and final step, principal component and reliability analyses were conducted, as described in the next section.
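To illustrate the different indicator types, a minimal R sketch of deriving a frequency indicator, two per-session aggregates, and a time-based indicator from the sessionised clickstream could look as follows; `kept`, `student_id`, `session_id`, `feature`, and `timestamp` are the hypothetical names introduced above, and the formulas are illustrative rather than our exact definitions.

```r
library(dplyr)

# per-session counts and a rough time-on-page proxy (time until the next click)
per_session <- kept %>%
  arrange(student_id, session_id, timestamp) %>%
  group_by(student_id, session_id) %>%
  mutate(dur_min = as.numeric(difftime(lead(timestamp), timestamp, units = "mins"))) %>%
  summarise(
    ann_clicks     = sum(feature == "announcement"),
    gradebook_time = sum(dur_min[feature == "gradebook"], na.rm = TRUE),
    .groups = "drop"
  )

# student-level indicators of the four types described above
indicators <- per_session %>%
  group_by(student_id) %>%
  summarise(
    ann_clicks_total     = sum(ann_clicks),       # simple frequency indicator
    ann_clicks_mean      = mean(ann_clicks),      # aggregate: average per session
    ann_clicks_sd        = sd(ann_clicks),        # aggregate: SD between sessions
    gradebook_time_total = sum(gradebook_time),   # time-based indicator (minutes)
    .groups = "drop"
  )
```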

2.4 Criteria and analysis

The construction of the scales was carried out per phase. For each phase, a one-dimensional scale was created. Principal component analysis (PCA) was used to determine the number of underlying dimensions and to check the explained variance; ideally, there should be one dimension per phase with a high explained variance. Cronbach's alpha was used to evaluate internal consistency. We used both because it is possible to have high internal consistency even when the items represent more than one dimension. Both Course A and Course B were used in the construction process, with one indicator removed at a time if this led to higher internal consistency. A PCA was conducted after each removal to assess the necessary assumptions and the explained variance. The scale was considered final when no further indicator removal was recommended, that is, when any removal would substantially decrease the alpha or the explained variance for one or both courses. While heterogeneity is an inevitable issue when working with LA data from multiple courses or cohorts (Saqr et al., 2022), we designed this method to account for it and support portability.
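The following is a simplified, automated sketch of this kind of refinement loop, not our exact procedure (in the study, removals were judged case by case alongside the PCA results). It assumes two data frames `phase_A` and `phase_B` with the same candidate indicators of one phase for Courses A and B; the names and the stopping rule are assumptions.

```r
library(psych)

refine <- function(items_A, items_B) {
  repeat {
    a_A <- alpha(items_A, check.keys = FALSE)
    a_B <- alpha(items_B, check.keys = FALSE)
    # alpha.drop gives the alpha obtained when each item is removed in turn
    gain_A <- a_A$alpha.drop$raw_alpha - a_A$total$raw_alpha
    gain_B <- a_B$alpha.drop$raw_alpha - a_B$total$raw_alpha
    gain   <- pmin(gain_A, gain_B)            # improvement must hold in both courses
    if (max(gain) <= 0 || ncol(items_A) <= 3) break   # arbitrary minimum of 3 items
    drop_item <- colnames(items_A)[which.max(gain)]
    items_A <- items_A[, colnames(items_A) != drop_item, drop = FALSE]
    items_B <- items_B[, colnames(items_B) != drop_item, drop = FALSE]
    # in the study, a PCA was rerun here after each removal to re-check assumptions
  }
  list(A = items_A, B = items_B)
}

refined <- refine(phase_A, phase_B)
```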

Once a scale was obtained based on Courses A and B, the same indicators were used to test whether they constituted a reliable scale in Courses S and D as well. Again, we started with the reliability analyses, followed by a PCA. As mentioned, while Course S was almost identical in structure, Course D was quite different; this was done on purpose to verify the scales' generalizability. In the end, we were interested in the stability of the scales' values, both the alphas and the explained variance, when transferring them from Courses A and B to Courses S and D. We also decided that indicators of different types that represent the same aspect of behaviour based on the same LMS resource should not be included in one dimension in the final version of the scale. For example, indicators representing the total number of clicks on articles and the average number of clicks on articles per session will not both fit within the same scale, as they both represent frequency and will be highly correlated.

We checked the necessary assumptions for the PCA using Bartlett's test of sphericity and the Kaiser–Meyer–Olkin (KMO) test of sampling adequacy; we also checked the explained variance of the resulting scale. A KMO above 0.5 was considered acceptable (Kaiser & Rice, 1974), while Bartlett's test had to be significant. We did not impose a strict alpha (α) threshold due to our study's exploratory and novel nature and method. While α = 0.6 is a common threshold in survey design, Taber (2018) argues that this is problematic due to the arbitrary nature of the value, while also arguing that even lower alphas can be useful. We considered a minimum Cronbach's alpha of α = 0.5 acceptable; Nunnally (1978) and Taber (2018) suggest this as a good starting point for new scales. Finally, we calculated the correlations between the first-dimension PCA scores and the scale scores; we wanted high correlations, as this would suggest that future scale scores would not need to be weighted when calculated. We used the alpha function from the psych package (Revelle, 2022) for the scale formation and alpha calculation, the built-in princomp function for the PCA, the bart_spher function from the REdaS package for Bartlett's test of sphericity (Maier, 2015), and the KMO function of the psych package for the Kaiser–Meyer–Olkin test of sampling adequacy (Revelle, 2022).
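A minimal sketch of these checks for one final scale, using the functions named above and assuming `scale_items` is a data frame of the retained (standardised) indicators for one course (hypothetical name), could be:

```r
library(psych)
library(REdaS)

KMO(scale_items)$MSA                     # overall sampling adequacy (want > 0.5)
bart_spher(scale_items)                  # Bartlett's test of sphericity (want p < .05)
alpha(scale_items)$total$raw_alpha       # internal consistency (want >= 0.5)

pca      <- princomp(scale_items, cor = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2) # variance explained by each component
prop_var[1]                              # explained variance of the first dimension

# correlation between first-component scores and the simple average scale score
# (the sign of PCA scores is arbitrary, so a strong negative value is equivalent)
scale_score <- rowMeans(scale_items)
cor(pca$scores[, 1], scale_score)
```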

3 Results

3.1 Final indicators

First, we report the final indicators that form the scales (Table 3) for each one of the phases of the COPES model. We follow up by discussing the indicators that form each scale and the online learning behaviour they represent.

Table 3 COPES phases with the identified indicators and the average and range of Cronbach's alpha reliability values

The task definition scale mainly comprises LMS features that the students could use to gather information: announcements, discussion forums, and assignments, which aligns with our interpretation and the model’s definition. The final scale has seven items (five for Course D, which does not use discussion forums, so the number-of-clicks item and its standard deviation counterpart are not included). Besides the three mentioned LMS features, the time spent on the course page before the official start also proved relevant to include, as a student could access the course material in advance to get a better understanding of the task. The final scale has both the total number of hours and the standard deviation of the hours between sessions, which shows that the consistency of students’ activity before the course’s official start plays a role in SRL.

Goal-setting is defined by six indicators, all focusing on the schedule, rubric, and study guide. Course A has only four indicators because the schedule was not used as an individual feature in that course; thus, both the corresponding average and standard deviation indicators are missing. In line with our interpretation and the COPES definition, students use all these features to assess the course’s agenda and workload, subsequently setting up various study goals. While the scale has six different indicators, they represent only two different types: averages and standard deviations between sessions.

Enactment is one of the most studied phases in the literature. For this phase, we devised indicators of simple tactics, such as the percentage of downloaded materials, and focused on the actual material that students targeted. Where available, the video lectures were the main source of learning for the exam, constituting an indicator on their own. This was followed by the accompanying lecture slides, which were also relevant to the exams and formed their own category. To make the scale generalisable, we decided to categorise all the other materials as either “Other Mandatory Materials” (besides the mentioned video lectures and slides) or “Optional Materials”. It can be safely assumed that most courses have materials that fit these categories. We ended up with seven indicators, of which six were relevant for Course D because it did not record its video lectures; however, the course did provide the accompanying lecture slides. Besides clicking on the described learning materials, which constitute four of the items, students also have the option to download them. This is a problem for LA in general: once materials are downloaded to a student's personal learning space, further tracking of the student's behaviour becomes impossible. Thus, we think knowing how many materials students download as part of their learning tactics is essential. Since it is the most researched phase, Phase 3 had the most created and discarded indicators (Appendix Table 6). Many of these indicators revolved around entropy (e.g., the entropy of course access at the daily level, at the weekly level, or of session duration). Taking inspiration from the literature review, we also formed many new indicators (e.g., the average time between accesses of the same video lecture, the average time between accesses of different video lectures, etc.). Finally, we created indicators inspired by the LA literature revolving around the timeliness of studying, such as those based on critical points of the course (e.g., the difference in the number of sessions between the beginning and the middle (or end) of the course).
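To illustrate the entropy-type indicators mentioned above, a minimal sketch of one possible interpretation (Shannon entropy of a student's weekly course access; the vector of weekly counts is hypothetical) is:

```r
# higher values indicate activity spread more evenly across course weeks
weekly_entropy <- function(weekly_counts) {
  p <- weekly_counts / sum(weekly_counts)
  p <- p[p > 0]                 # 0 * log(0) is treated as 0
  -sum(p * log2(p))
}

weekly_entropy(c(40, 35, 30, 25, 20, 15, 10, 5))  # fairly even spread -> higher entropy
weekly_entropy(c(0, 0, 0, 0, 0, 0, 20, 160))      # last-minute studying -> lower entropy
```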

The adaptation scale contains five items, also for the dissimilar Course D, and is the most varied in the nature of its indicators. Indicators such as the total number of clicks on the course page after the official end align with the definition and our interpretation. Adaptation is the last phase; it represents the students’ reflective behaviour after the exam has taken place, when they may return to the course to identify possible mistakes and adapt for future courses and exams. Similarly, indicators regarding the gradebook and already uploaded submissions reflect the students’ interest in checking their grades or learning products that cannot be changed or deleted, thus leading to possible long-term changes in their SRL processes. Looking at the indicators that were not selected, we can see that most were variants of already selected indicators (gradebooks, interaction with assignments and submissions after deadlines), but also some periods between important time points, such as the time difference between the assignment deadline and the submission. We also included the clicks on submissions or the assignment page after their deadlines, often made by students who want to edit an already submitted assignment, which shows reflection, and the time spent on quizzes, which courses often use to help students reflect on their knowledge.

Frequency-based indicators dominate the resulting scales, as they are part of three out of four phases (all but the goal-setting phase). This shows that while aggregates have potential and should be used, simple clicks are still a solid form of online behaviour representation that should not be overlooked. The second most used type is the standard deviation indicators for task definition and goal-setting, which show regularity of learning behaviour; the other types (time-based, averages, and percentages) are each used once.

3.2 Final scales: reliability and portability across courses

The internal consistency and PCA results for all four courses can be seen in Table 4 below; italics represent what we considered problematic results. These are the statistical results of the scales with their items (= Canvas indicators), as described in Table 3. The scales were built on Courses A and B simultaneously; thus, we expect the best results, namely high alpha and explained variance, in these two courses. Also, we want to see those values remain stable when the scales are transferred to Courses S and D. The first column presents the KMO sampling adequacy, the number of components indicated by the PCA, and the variance explained by the first component. The second column shows Cronbach’s alpha and Pearson’s correlation coefficient between the PCA first-dimension score and the calculated scale score.

Table 4 Reliability measurements of the scales across different courses

Overall, the four scales retain similar values when transferred to both the similar and the dissimilar course, with somewhat lower values for the dissimilar course, as expected. The goal-setting, enactment, and adaptation scales show high internal consistency (with average alphas of α ≈ 0.72, α ≈ 0.83, and α ≈ 0.74, respectively). In contrast, task definition shows a more modest average of α ≈ 0.62, which is still above our criterion of α = 0.50 and even the generally accepted criterion of α = 0.60. The correlations between the first-dimension PCA scores and the scale scores are very high (r > 0.95) in all but one case, the goal-setting scale of Course D. This is a strong argument for using the average scale scores in future applications of the scales. We conclude that we have constructed four reliable scales that maintain their reliability across courses.

Next, we consider the reliability analysis per scale. The enactment scale shows the best results of the four. It demonstrates high stability of the selected items (Canvas indicators) as well as of the reliability and the amount of explained variance. A remarkable finding is that the scale can be transferred to a dissimilar course without losing quality or needing adjustments. Moreover, all PCA analyses suggest one dimension, which is excellent.

The scale measuring adaptation also shows strong results. Again, all courses show very high scale stability, both for the similar and the dissimilar course. The only limitation is that the PCAs suggest two dimensions for two courses. For both courses, the time spent on quizzes and the clicks after the course form one of the dimensions, with the other three indicators forming the second. We argue that the same indicators forming the same dimensions in those courses represent sufficient stability to keep the scales the same. These could indeed represent different dimensions of students' adaptation during SRL, such as reflecting and acting on the reflection, respectively.

Moving on to task definition, we can see that the scale's reliability stays constant when transferred to the similar course but shows small problems when moved to the dissimilar one; this course has two items fewer because of its specific course design. Still, the alpha is higher than the minimum of α = 0.50. However, the KMO value of 0.49 drops just below the KMO = 0.50 criterion; while the difference may seem small, the criterion was already the minimum threshold. Finally, we can see that the PCA suggests two dimensions, and three for Course B. The explained variances vary between 33% and 37%, the lowest of the phases. One explanation could be that, due to the high number of indicators, our scale measuring task definition reflects multiple dimensions, like adaptation. While the high number of items would keep the alpha high, it would also decrease the explained variance of the first dimension.

The scale measuring goal-setting has high explained variances and alphas. However, it encounters a different problem: the reversal of two items. Since Course D does not use the schedule feature, the whole scale relies on items based on the study guide and the rubric features. However, both items built around the rubric (average clicks per session and standard deviation between sessions) correlate negatively with the other scale items, i.e., the two study guide items. A possible explanation for this reversal may be that some courses offer the study guide as a compendium of the various information a student should receive when starting the course, including the assignment rubrics and the course schedules. Course D offered both and removed the schedule feature from the Canvas LMS but did not remove the rubric feature. This led to students who consulted the study guide for the rubric clicking less often on the rubric feature, creating a negative correlation between the two. The PCA shows that the number of dimensions closely follows the three LMS features (rubric, study guide, and schedule), with two dimensions for Course D, as the schedule feature is absent.

3.3 Validity and correlation with course grades

To examine the validity of the newly constructed scales, we correlated them with each other. We also correlated the scales with a previously validated survey scale, the self-regulated learning (SRL) scale from the Motivated Strategies for Learning Questionnaire (MSLQ) manual (Pintrich et al., 1991), which is widely used in the field to measure the overall construct of SRL (e.g., “When reading for this course, I come up with questions to help me focus my reading.”). While the MSLQ scale is an overall measure of SRL, our scales are designed to measure specific dimensions of online SRL within the LMS at a more fine-grained level. As mentioned in the literature review, self-reported data measure inherently different aspects of SRL (static vs. dynamic views); thus, we did not have any hypotheses regarding the correlations between the survey measurement and the formed scales. Additionally, we examined the extent to which the four scales correlated with the students' course grades, which is important, for instance, for identifying students who may need additional educational support.

Table 5 shows that the correlations between three of the SRL dimensions are positive and significant, with effect sizes between ‘medium’ and ‘large’ (medium = 0.30, large = 0.50; Cohen, 1992). These correlations among task definition, enactment, and adaptation support our claim that it is possible to use clickstream indicators to unobtrusively measure SRL in a more fine-grained way than has been done earlier. Only the scores for the fourth dimension (goal-setting) do not correlate with the other dimensions in the expected pattern; these correlations are either small or close to zero. The correlation between goal-setting and enactment is small and, to our surprise, negative. When we remove the problematic Course D and re-run the analysis, the correlation becomes non-significant (r = -0.05, n = 529, p = 0.501). A possible explanation is that students who access too many goal-setting features, such as the schedule and rubric, do so out of confusion or distress. This behaviour may also be reflected in the negative correlation with final grades. Overall, we believe these correlations support one SRL scale with multiple correlated dimensions, each with high reliability that is shown to hold across various courses. Table 5 also shows that none of the four scales correlates significantly with the self-reported SRL survey scale. The result remains non-significant when forming one average of the four phases and correlating it with the survey SRL scale (r = 0.02, n = 175, p = 0.769). As mentioned earlier, these two types of measurements (our formed scales on the one hand and the survey scale on the other) refer to different concepts of SRL, which may explain this lack of significant associations.

Table 5 Correlations between each COPES scale, final course grade, and the SRL MSLQ survey scale over all 4 courses
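A minimal sketch of how such a correlation table can be produced, assuming a data frame `df` with one row per consenting student and hypothetical columns for the four scale scores, the MSLQ SRL score, and the final grade, is:

```r
library(psych)

# pairwise correlations with sample sizes and p-values, as reported in Table 5
corr.test(df[, c("task_definition", "goal_setting", "enactment",
                 "adaptation", "mslq_srl", "grade")],
          use = "pairwise")
```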

We also assessed the scales’ correlations with final course grades. Enactment shows a significant, medium to large positive correlation with the final course grade (r = 0.38, n = 173, p < 0.001). This is in line with our expectations, as in this phase of the COPES model, clear, observable learning behaviour should appear that is likely to produce learning benefits. For example, during this phase, students use all their previously acquired information and interact with the learning material, such as articles and video lectures. The scores measuring students' degree of adaptation correlate negatively with grades. This makes sense as adaptation of learning behaviour is more likely to occur after negative feedback, which is more likely to be provided after a weaker (interim or final) test result. As demonstrated in Table 3, the adaptation indicators also refer to activities after the last course day (beyond the final exam). The lack of a correlation between task definition and grades also makes sense. These indicators refer to activities that took place before or shortly after the start of the course (see Table 3). As proposed in the COPES model, these activities lead to students generating a perception about the task at hand (Winne & Hadwin, 1998), which is, as we argue, likely to influence the grades indirectly via the other phases. The significant negative correlation between goal-setting and grades is unexpected, as we would expect that goal-setting influences grades indirectly via enactment.

The success of our approach is demonstrated when we consider the correlation between the MSLQ survey scale and the final course grade, which is much lower: r = 0.07 (n = 171, p = 0.389). As shown in Table 5, the correlation between enactment and grades is substantially higher (p < 0.001). Furthermore, a linear regression of final course grades on the four dimensions of SRL yields a model with a multiple r = 0.51, explaining 25.5% of the variance in the grades (adjusted R2 = 0.251, F(4, 717) = 61.74, p < 0.001). These findings demonstrate that the scale scores are useful for predicting student grades, much more so than the MSLQ survey scale.
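A minimal sketch of these final analyses, again assuming the hypothetical data frame `df` and column names introduced above, is:

```r
# correlations with the final course grade
cor.test(df$enactment, df$grade)    # reported in the text as r = 0.38, p < .001
cor.test(df$mslq_srl, df$grade)     # reported in the text as r = 0.07, p = .389

# linear regression of grades on the four COPES dimensions
fit <- lm(grade ~ task_definition + goal_setting + enactment + adaptation, data = df)
summary(fit)                        # multiple R^2, adjusted R^2, F-test
sqrt(summary(fit)$r.squared)        # multiple r reported in the text
```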

4 Discussion

Since its conceptualisation in the 1980s, SRL has seen many measurement methods being created, many of them intrusive (Panadero et al., 2016). We demonstrated how to use Learning Analytics to measure students’ online SRL in a way that is valuable for future educational applications. Using an exploratory approach guided by theoretical arguments, we managed to unobtrusively and reliably construct four scales measuring four dimensions of self-regulated learning. Two of these scales, measuring the dimensions of enactment and adaptation, were portable across a similar and a dissimilar blended course. The third scale, measuring the extent to which students engage in task definition, was portable to a similar blended course but diminished somewhat in reliability when transferred to a dissimilar blended course. The fourth scale, measuring the extent of goal-setting activities, was only portable to a similar blended course. Using this fourth scale in a more dissimilar blended course may nevertheless be possible if the researcher or educator has knowledge of the course design features that may cause the reversal of items. To our knowledge, both the feasibility of these unobtrusive measurements and their decent portability represent novel findings in the literature.

Our study further demonstrates the importance of learning context when studying SRL. Instructional conditions are an important part of the context (Gašević et al., 2016) and relate to any outside factors that can guide SRL (Winne & Hadwin, 1998). Such conditions are important to consider when trying to measure and understand SRL. For example, in our study, the LMS included features that allowed students to fulfil many of these conditions, such as social context (discussion forum), grading policies (rubric), temporal structure (schedule), and learning resources (lectures, slides, mandatory, and optional material). While modern LMSs include more and more such features that can help students, this comes with a disadvantage: models that consider them become more complex. We argue that creating an online SRL scale should always consider such features, as they are important for fulfilling the instructional conditions and help create a more complete picture of students’ learning. Our results show that creating a scale that is portable across courses is possible if these conditions are considered and included.

Except for the goal-setting phase, which had a negative correlation with enactment and a non-significant relation with task definition, all the other phases correlated positively and significantly with each other. This supports the view that such phases can be considered dimensions of one overall construct: online SRL in LMSs. It also explains the non-significant correlations with the survey scale, which measures one broad SRL construct initially designed for offline environments. A case in point is that the most validated and widely used survey SRL scale in the literature, the MSLQ SRL scale, correlates less strongly with final course grades than our enactment scale does (r = 0.07 vs. r = 0.38, respectively). This suggests that a purely indicator-based online SRL scale can and should also be used, as it measures a different kind of SRL, but one related to actual accomplishment in the course.

4.1 Limitations and future directions

One of the main limitations of our methodology is the subjectivity involved in interpreting the definitions of the COPES phases. The association between the indicators and the four phases of the COPES model is crucial for the validity of the results but depends to some extent on our interpretation. Since there are no previously formulated survey scales against which to run a convergence analysis, it is not easy to be certain that our indicators represent those phases. One of our main recommendations for future studies is to focus on validating such online-based scales. As more studies adopt this method and create more online scales, convergent validity should become possible and easier to test.

To get a better idea of the generalizability of the scale, further research could add more, and more varied, courses to the scale construction process. We also suggest including dissimilar courses from the beginning, thus ensuring that the indicators that are most vital for each phase are present during the selection.

We focused on basic LMS features (such as discussion forums, announcement boards, etc.) to enable other researchers to create and use our scales easily. While our basic indicators should make such scales easy to transfer, this also has caveats. Self-regulated learning entails complex forms of behaviour involving many processes, as the COPES model shows. Adaptation is a prime example, as the phase itself can refer to the other phases and can stretch beyond one learning task and, timewise, over days, weeks, and months. Our aim of including only basic indicators can help and encourage scale construction but is a necessary simplification of processes that are difficult to capture using our criteria. While our scale performed well statistically, more complex analyses, such as pattern analysis, may better represent intricate parts of the SRL phases and should also be the focus of future research.

We described how we created the scales for the actual products of the four phases of the COPES model. However, the products result only at the end of each cycle within each phase, leaving out many elements of the model; much of what happens occurs within the learner and involves hidden mechanisms that are not entirely known. One could argue that we should have targeted the operations of the COPES model, namely SMART: searching, monitoring, assembling, rehearsing, and translating. In this case, we would have looked for different indicators for each of the SMART operations within each of the four phases, as they differ (searching and monitoring during task definition differ from those during enactment). We agree that this would be an alternative, perhaps even more accurate representation of SRL, but it also creates additional difficulties in finding indicators, as they are even more particular and contextual. In this sense, the proposed scales are a necessary but only initial step towards understanding and properly measuring online SRL. In the same vein, more work needs to be done on task definition and goal-setting. While we acknowledged from the beginning that these would be the most problematic due to their “offline” aspect, we believe more indicators could be found.

5 Conclusion

Our theory-guided but exploratory approach led to the construction of scales that correlate substantially with students’ course grades. This constitutes an essential improvement for identifying students needing additional educational support when compared to the most often used survey scale, the MSLQ-SRL, without burdening students with surveys. When used independently, our scales' predictions will likely not be powerful enough for fine-grained discrimination between different students, and some scale weaknesses will have to be remedied in future revisions. Moreover, universities may want to combine predictive Learning Analytics models with other administrative data with predictive power to identify students' needs in a way that is useful in educational practice. Accordingly, our results should not be regarded as ultimate findings but rather as insights leading to the first tools that demonstrate the scientific relevance of Learning Analytics for the unobtrusive measurement of constructs that otherwise require cumbersome surveys, and its general educational potential.