Analyzing Learner and Instructor Interactions within Learning Management Systems: Approaches and Examples

  • Mimi Recker
  • Ji Eun Lee
Living reference work entry


Institutions of higher education are increasingly turning to learning management systems (LMSs) to help support their instructional missions. An LMS provides features to support various teaching and learning activities and can be used in online, face-to-face, and blended courses. An LMS also automatically records all instructor and student usage activities and stores them in vast repositories collected as a part of natural instructional activity. The increasing availability of these datasets, coupled with emerging “big data” and educational data mining techniques, offers unparalleled opportunities for research on understanding learning and teaching “at scale.” This chapter reviews approaches and processes for analyzing both instructor and student usage data collected by LMSs. The review is organized according to a standard three-phase data mining methodology called knowledge discovery from database (KDD). It also illustrates the KDD process by presenting examples from research by the authors that analyzed instructor and student usage data collected by a widely used LMS, called Canvas. The chapter concludes with recommendations for future research.


Keywords: Educational data mining · Learning management systems · Knowledge discovery from database · Learning · Teaching


Around the world, institutions of higher education are increasingly turning to learning management systems (LMSs) to support their instructional mission (Hawkins & Rudy, 2007; Smith, Lange, & Huston, 2012). In the United States, for example, 99 % of responding institutions recently reported the use of an LMS (Dahlstrom, Brooks, & Bichsel, 2014). At the same time, enrollments in massive online open courses (MOOCs), also supported by an LMS, have soared. Popular LMS examples include Blackboard, Moodle, Edmodo, and Canvas (Browne, Jenkins, & Walker, 2006; Hawkins & Rudy, 2007).

An LMS contains several features to support teaching and learning. Dawson (2008) divided LMS features into four instructional categories based on what they support: (1) administration, (2) assessment, (3) content, and (4) engagement. For example, an LMS contains administration features, such as a course calendar and announcement tools. It provides access to content in the form of course modules or links to online content. It supports assessment through tools that administer quizzes and assignments. Finally, an LMS supports tools for participant engagement, such as discussion and collaboration tools. Together, these features can support many different kinds of courses, in many different disciplines and with different pedagogical approaches. An LMS can also support courses offered in a variety of modalities such as online, face to face, or hybrid.

Popular LMSs are also typically engineered to capture and store the large datasets of user actions as they interact with the system. This means that the clickstreams of every student and instructor can be stored for further analysis. For a typical online class of approximately 25 students, over the course of one semester, this can amount to over 500 LMS actions recorded by the system. And unlike the data in many traditional educational datasets, these data are tall (involving many students), wide (with many observations per student), fine (very frequently collected), long (in duration), and deep (theory relevant) (Koedinger, D’Mello, McLaughlin, Pardos, & Rosé, 2015).

LMSs differ in what data they store; some store total frequency counts of feature use for every user, while others store daily updates of feature use by each user. LMSs also differ in how they store this information. Some store data in flat log files, while others use a relational database, allowing queries to the dataset using an application programmer interface (API).

These vast datasets thus contain a treasure trove of information regarding how different learners access course content and how different instructors choose to design courses to support student learning. Coupled with computational and statistical methods emerging from the field of “big data,” researchers are making great strides in analyzing these datasets to reveal new patterns of learning and teaching at scale (Krumm, Waddington, Teasley, & Lonn, 2014; Siemens & Long, 2011). This new field, the application of data mining methods to education datasets, is called “educational data mining,” or EDM (Baker & Siemens, 2014; Siemens & Long, 2011). Moreover, results from EDM can be used to develop tools and interventions to support learners. This field, typically called “learning analytics,” is also making advances in supporting learning and teaching in innovative ways (Bienkowski, Feng, & Means, 2012; Siemens, 2013; Siemens & Baker, 2012). Though differences exist between these two nascent fields, they both focus on applying innovative computational, statistical, and “big data” mining methods to large sets of learning data in order to help better understand the variety of student learning and instructor teaching patterns, thereby improving the overall educational experience (Siemens & Baker, 2012).

The purpose of this chapter is to review the application of EDM to LMS datasets. This chapter first reviews studies that use EDM for prediction purposes (e.g., learning outcomes) and then for clustering purposes (e.g., clustering similar students). The chapter then applies the three-phase knowledge discovery from database (KDD) framework from the data mining research literature (Han, Kamber, & Pei, 2011; Witten & Frank, 2005) to structure the description of key issues when applying EDM techniques to LMS datasets. Using examples from research, the chapter describes (1) the kinds of LMS data that are collected, (2) the kinds of modeling and analysis approaches used, and (3) approaches for interpreting and applying the results. The chapter concludes with recommendations for researchers in this field.

Applications of EDM to LMS Data

To conduct this review, the research literature was reviewed for studies that extracted and analyzed data from learning management systems used in higher education settings. To conduct this search, Google Scholar was searched using keywords such as “learning management system,” “course management system,” “educational data mining,” and “learning analytics.” The resulting studies were then categorized using Baker and Yacef’s (2009) taxonomy of data mining methods: prediction, clustering, relationship mining, distillation of data for human judgment, and discovery with models. Because of their frequency of use, studies using prediction, clustering, and distillation of data for human judgment were reviewed in more detail.

Since the purpose of this chapter is to review approaches and processes for analyzing LMS data, a particular focus was on methodology. For example, when reviewing prediction studies, the kinds of data mining methods (e.g., multiple regression, logistic regression) and the predictor and outcome variables used in each study were the focus.

The Purpose of EDM and EDM Methods

The purpose of data mining is to extract useful information from “big data” sets in order to improve decision making (Bienkowski et al. 2012). In the field of business, data mining has been increasingly used to discover patterns in data and then predict future trends from the extracted patterns (Romero & Ventura, 2007).

Similarly, the purpose of educational data mining (EDM) is to discover useful information related to learning from large datasets collected from online learning environments (e.g., LMSs). The goal of such work is to use this information to improve learning, instruction, and the design of the online learning systems themselves (Bienkowski et al. 2012; Klosgen & Zytkow, 2002; Romero & Ventura, 2007). In particular, by using EDM, researchers can extract valuable patterns, predict student success or retention, create a model for student learning, and evaluate e-learning processes and systems. In this way, EDM provides useful information not only to students but also to faculty and administrators.

Several methods are used in EDM research. Romero and Ventura (2007) categorized EDM methods into four categories, while Baker and Yacef (2009) classified them into five categories (see Table 1). As shown in Table 1, some of the categories overlap, such as clustering and relationship mining, but there are some differences as well. Romero and Ventura’s taxonomy is more focused on data mining techniques, since it is based on traditional web data mining research, such as clustering, association rule mining, and text mining. However, Baker and Yacef’s categorization of EDM methods is based more on the researcher’s purpose for conducting the analysis. It includes “distillation of data for human judgment” and “discovery with models,” which are not classical data mining methods. This chapter follows Baker and Yacef’s taxonomy to conduct the review on EDM studies.
Table 1

Categories of EDM methods

Romero and Ventura (2007):
  • Statistics and visualization
  • Web mining
    – Clustering, classification, and outlier detection
    – Association rule mining and sequential pattern mining
    – Text mining

Baker and Yacef (2009):
  • Prediction
  • Clustering
  • Relationship mining
  • Distillation of data for human judgment
  • Discovery with models

Of course, EDM can be applied to data collected from any kind of interactive learning environment. This chapter focuses on EDM studies that analyze LMS data collected in higher education settings.

EDM Research Applications to LMS

Romero and Ventura (2007) reviewed 60 studies that used EDM and found that 43 % of them used relationship mining methods, followed by prediction (28 %), and clustering (15 %). In EDM research using LMS data, the most widely used techniques appear to be prediction, clustering, and distillation for human judgment (in particular, visual data analytics). In what follows, Baker and Yacef’s taxonomy (2009) is used to review EDM studies; note that some of the studies used more than one EDM method.

Prediction


The review found that the most widely used technique in EDM research using LMS data is prediction. Prediction refers to developing a model to infer which kinds of student behaviors predict success or failure on some outcome measure (Bienkowski et al. 2012). In prediction studies, statistical methods such as multiple regression and logistic regression are frequently used for developing predictive models. Existing studies focused on prediction can be categorized by the characteristics of the independent (predictor) variables.

First, the system-recorded data directly extracted from the LMS (called LMS tracking variables ) can be used as predictors, such as total time online, number of assignments completed, number of messages posted, and so forth. Macfadyen and Dawson (2010) analyzed LMS usage of 118 students in an online undergraduate biology class. In this study, they excluded variables that were related to assignment scores because these variables contributed to a substantial portion of students’ final grade. After the assignment score variable was excluded, three tracking variables (total number of discussion messages posted, total number of mail messages sent, total number of assignments completed) were found to be statistically significant predictors of student final grade, explaining 33 % of the total variance in student achievement. Similarly, Thakur, Olama, McNair, Sukumar, and Studham (2014) used logistic regression and neural network models to find which tracking variables significantly predicted student performance. Perhaps unsurprisingly, results showed that the number of assignments completed was the strongest predictor of success or failure, followed by the number of quizzes completed, and the number of posts submitted in discussions.
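The regression setup used in these studies can be sketched with ordinary least squares. The data below are invented for illustration; the three predictor columns mirror the tracking variables Macfadyen and Dawson found significant, but the numbers and the resulting fit are purely hypothetical.

```python
import numpy as np

# Hypothetical tracking variables for eight students.
# Columns: discussion messages posted, mail messages sent, assignments completed
X = np.array([
    [12, 5, 10], [3, 1, 4], [20, 8, 12], [7, 2, 6],
    [15, 6, 11], [1, 0, 2], [9, 4, 8], [18, 7, 12],
], dtype=float)
grades = np.array([88, 55, 95, 70, 90, 40, 75, 93], dtype=float)

# Fit grade ~ b0 + b1*x1 + b2*x2 + b3*x3 by ordinary least squares
A = np.column_stack([np.ones(len(X)), X])        # add intercept column
coef = np.linalg.lstsq(A, grades, rcond=None)[0]

# Variance explained (R^2), the statistic the reviewed studies report
pred = A @ coef
ss_res = np.sum((grades - pred) ** 2)
ss_tot = np.sum((grades - grades.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.2f}")  # fits well on this toy data, by construction
```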

Second, new variables can be generated by modifying (combining, transforming, or manipulating) standard LMS tracking variables. For example, Jo, Kim, and Yoon (2015) used LMS log data (total login time, login frequency, and regularity of login intervals) as proxy variables to measure adult learners’ time management strategies. The regularity of the login interval variable was calculated by using the standard deviation of the login interval. The result from multiple regression analysis revealed that regularity of learning interval was the only significant predictor of learning outcome, while total login time and login frequency were not significant predictors. Similarly, Yu and Jo (2014) used the regularity of learning interval as one of the independent variables in their analysis, and results showed that it significantly predicted student final grade.
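The “regularity of login interval” proxy is straightforward to compute as the standard deviation of the gaps between successive logins. A minimal sketch, with an invented function name and sample timestamps:

```python
from datetime import datetime
from statistics import pstdev

def login_regularity(timestamps):
    """Standard deviation (in hours) of gaps between successive logins.
    Smaller values indicate more regular login behavior (the proxy
    described by Jo et al.; the exact computation here is illustrative)."""
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(ts, ts[1:])]
    return pstdev(gaps)

# A student who logs in every 24 hours has perfectly regular intervals:
regular = [datetime(2014, 3, d, 9, 0) for d in range(1, 8)]
print(login_regularity(regular))        # -> 0.0

# Bursty logins produce a large standard deviation:
irregular = [datetime(2014, 3, 1), datetime(2014, 3, 2), datetime(2014, 3, 7)]
print(login_regularity(irregular) > 0)  # True
```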

As another example, Abdous, He, and Yen (2012) investigated the relationship between the kinds of questions students asked in the course forum and their final grade. In this study, cluster analysis was used to classify the themes in students’ questions, extracted from the LMS database. The results showed that there were four major themes in students’ questions: class check-in, deadline/schedule, evaluation/technical, and learning/comprehension. The themes found in students’ online questions were found to be a significant predictor of final grade; in particular, students whose questions were related to learning/comprehension had a higher final grade.

Finally, LMS tracking variables can be combined with other data sources, such as self-reported data and other system-generated data. For example, Tempelaar, Rienties, and Giesbers (2015) used various sources such as learning dispositions (self-reported) and e-tutorial data (formative assessments), as well as LMS tracking data as predictors for academic performance. The results showed that e-tutorial data were the best predictor of academic performance, while most LMS tracking data did not significantly predict student final grade. The one exception was the variable that captured the number of downloads of old exams.

The results of these studies are summarized in Table 2.
Table 2

Methods used and significant predictors of student final grade for studies reviewed

Macfadyen and Dawson (2010)
  Methods: correlation analysis, multiple regression analysis, logistic regression
  Significant predictors of final grade: no. of discussion messages posted; no. of mail messages sent; no. of assessments completed
  Nonsignificant variables: total time online

Thakur et al. (2014)
  Methods: logistic regression, neural network model
  Significant predictors of final grade: no. of assignments completed; no. of quizzes taken; no. of posts in discussions

Jo et al. (2015)
  Methods: correlation analysis, multiple regression analysis
  Significant predictors of final grade: regularity of login interval
  Nonsignificant variables: total login time; login frequency

Yu and Jo (2014)
  Methods: multiple regression analysis
  Significant predictors of final grade: total study time in LMS; interaction with peers; regularity of learning interval; no. of downloads
  Nonsignificant variables: total login frequency; interaction with instructor

Abdous et al. (2012)
  Methods: ordinal logistic regression
  Significant predictors of final grade: online question theme (questions concerned learning/comprehension)
  Nonsignificant variables: no. of student questions; no. of chat messages; login frequency

Tempelaar et al. (2015)
  Methods: hierarchical linear regression
  Significant predictors of final grade: no. of downloads of old exams for practice purposes
  Nonsignificant variables: basic LMS data were not substantial predictors of learning

To summarize the results of these prediction studies in terms of methodology, most of the reviewed studies used conventional statistical methods to fit a model. These included multiple regression and logistic regression. Logistic regression was used particularly frequently because of the ordinal nature of the outcome variable, final grades (typically letter grades). Also, alternative methods were used for non-normally distributed data. For example, Thakur et al. (2014) found that the distribution of student final grades in many courses is not a normal distribution, thus precluding the use of parametric statistical modeling. For this reason, they used the neural network model, which performed better when modeling non-normally distributed data.

In terms of the kinds of predictor variables, the review revealed inconsistent results. For example, the number of mail messages sent was a significant predictor in Macfadyen and Dawson’s study (2010); however, the number of chat messages sent was not a significant predictor in Abdous et al.’s study (2012). Moreover, total study time was a significant predictor in Yu and Jo’s study (2014), but not in others (Jo et al. 2015; Macfadyen & Dawson, 2010). These contradictory results might be due to differences in independent variables used, the study context (course subject, different LMSs), or the kinds of students enrolled in the courses.

Finally, LMS tracking variables themselves might not provide enough information to adequately model student learning outcomes. In some of the studies reviewed, student LMS usage data explained approximately 30–35 % of the total variance in student performance (Jo et al. 2015; Macfadyen & Dawson, 2010). In addition, one of the studies reviewed found that basic LMS usage data were not significant predictors of learning at all (Tempelaar et al. 2015). Thus, in order to create a better predictive model of student learning, data triangulation with other sorts of data is recommended (e.g., Xu & Recker, 2012).

Clustering


Clustering is a data mining technique used to partition a dataset into groups of similar objects, called clusters (Romero, Ventura, & García, 2008). For instance, clustering can be used to group students based on their learning difficulties or interaction patterns (Bienkowski et al. 2012). In e-learning research, a variety of objects can be clustered, including students, courses, and content.

Students are the most common object of clustering in studies using LMS usage data. For example, Romero et al. (2008) applied a clustering algorithm to group 75 students based on their usage activities in the Moodle LMS. They used LMS log data, such as the number of assignments completed and participation in quizzes and discussions, to classify students. Using the K-means clustering algorithm, they found three clusters: very active students (n = 29), active students (n = 22), and non-active students (n = 24). They concluded that this information can be helpful when grouping students for collaborative activities.
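A K-means grouping of this kind can be sketched as follows. The activity counts are synthetic, with three “true” groups planted so the algorithm has structure to recover; this is not the actual Romero et al. dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical per-student counts: [assignments done, quizzes taken, forum posts]
very_active = rng.poisson([12, 10, 30], size=(29, 3))
active      = rng.poisson([8, 6, 10],  size=(22, 3))
inactive    = rng.poisson([2, 1, 1],   size=(24, 3))
X = np.vstack([very_active, active, inactive]).astype(float)

# Partition the 75 students into three activity-level clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)
print(sorted(sizes.tolist()))  # cluster sizes, roughly matching the planted groups
```

In practice, the count features would usually be standardized first so that no single feature (here, forum posts) dominates the Euclidean distances.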

Similarly, Lust, Elen, and Clarebout (2013) grouped students based on their self-reported self-regulation strategies and use of tools within an LMS. They also used the K-means clustering algorithm to classify students and found four different clusters (profiles) that reflected strategy and tool use. In two clusters (self-regulated and deeply oriented students, disorganized students), the students’ tool-use pattern was associated with their strategy use, whereas in the other two clusters (undefined students, inconsistent students), patterns in tool use were not associated with any kind of strategy use.

In another example, Yildiz, Bal, and Gulsecen (2015) also used students’ LMS usage data (frequency of use, quiz score, midterm exam score, etc.) to cluster students in groups. In this study, they compared three different clustering methods to form three fuzzy-based models and used these to find the best approach for estimating student outcomes. They evaluated results from the three clustering algorithms in terms of their accuracy ratios and found that fuzzy c-means had the best result in terms of predicting student academic performance.

A less common clustering approach is to use courses as the object of clustering. For instance, Valsamidis, Kontogiannis, Kazanidis, Theodosiou, and Karakos (2012) used a clustering algorithm to categorize courses based on a quantitative metric (LMS usage rates). They applied the K-means algorithm to classify 39 courses and found two clusters: nine courses with high activity and 30 courses with low activity. In this study, they also proposed a new metric for measuring the quality of a course by using LMS log data. For each course, they computed its “enrichment” (a measure of how many unique pages were viewed by the students) and its “interest” (a measure of how many unique pages were viewed per session). Then, they measured the quality of each course by computing the average for the enrichment and interest values. They investigated the relationship between the clustering results (the quantitative index) and the quality index by using cluster visualization and found that the quantitative index was associated with the quality index: the high-activity courses had higher- and medium-quality index scores, while the low-activity courses had low-quality scores.
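One plausible reading of the two metrics: treating a course’s logs as a list of sessions, “enrichment” counts the unique pages viewed overall and “interest” averages the unique pages viewed per session. The function and sample data below are illustrative, not the authors’ exact formulas.

```python
def course_metrics(sessions):
    """sessions: list of sessions, each a list of page IDs viewed in it.
    Returns (enrichment, interest) as sketched from Valsamidis et al.'s
    descriptions; the precise definitions in the study may differ."""
    all_pages = set(p for s in sessions for p in s)
    enrichment = len(all_pages)                                  # unique pages overall
    interest = sum(len(set(s)) for s in sessions) / len(sessions)  # unique pages/session
    return enrichment, interest

sessions = [["syllabus", "week1", "quiz1"],
            ["week1", "week2"],
            ["week2", "quiz1", "quiz2", "week3"]]
print(course_metrics(sessions))  # -> (6, 3.0)
```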

Finally, the object of clustering can be course content. As described above, Abdous et al. (2012) used NVIVO to manually code for themes in questions students posted in the course forum. They then conducted a hierarchical cluster analysis to categorize themes in the students’ questions. Four emerging clusters were identified: class check-in, deadline/schedule, evaluation/technical, and learning/comprehension.
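A pipeline of this shape can be approximated by vectorizing the question text and applying hierarchical (agglomerative) clustering. The toy questions below merely echo some of the four themes and are not data from the study, and the study itself coded themes manually in NVIVO before clustering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Invented stand-ins for student forum questions
questions = [
    "When is the assignment deadline?",
    "What is the schedule for next week?",
    "I do not understand the concept in module three",
    "Can you explain the reading on regression?",
    "Just checking in before class",
    "Hello, checking in for attendance",
]

# TF-IDF features, then bottom-up merging into three clusters
X = TfidfVectorizer().fit_transform(questions).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels)  # one cluster label per question
```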

These clustering studies are summarized in Table 3.
Table 3

Summary of studies using various clustering methods

Romero et al. (2008)
  Clustering algorithm: K-means
  Clustering objects: student LMS usage (no. of assignments, quiz and forum participation, etc.)
  Results: three clusters (very active students, active students, inactive students)

Lust et al. (2013)
  Clustering algorithm: K-means
  Clustering objects: student (LMS) tool usage; student learning strategies (self-reported data)
  Results: four clusters (disorganized students, self-regulated and deeply oriented students, undefined students, inconsistent students)

Yildiz et al. (2015)
  Clustering algorithms: K-means, fuzzy c-means, subtractive clustering
  Clustering objects: student LMS log data (frequency, quiz, midterm exam, etc.)
  Results: eight clusters (K-means), 9 clusters (fuzzy c-means), 11 clusters (subtractive clustering); fuzzy c-means had the best result

Valsamidis et al. (2012)
  Clustering algorithms: K-means, Markov clustering (MCL)
  Clustering objects: courses (course activity); student LMS usage patterns
  Results: two clusters for course clustering (nine with high activity, 30 with low activity); 27 clusters for student clustering

Abdous et al. (2012)
  Clustering algorithm: hierarchical clustering
  Clustering objects: content (student questions)
  Results: four clusters (check-in, deadline/schedule, evaluation/technical, learning/comprehension)

This review found that the K-means algorithm is the most commonly used method in clustering studies of LMS usage data. Romero et al. (2008) noted that K-means is one of the simplest objective function-based algorithms and also one of the most popular methods used in data mining work. In terms of the clustering object, student LMS usage data were widely used. Future work should also consider clustering instructor usage data, as well as content clustering, in order to derive more useful information about learning and teaching. Finally, cluster interpretation is an important and complex final step. Simple clustering results of students or courses do not reveal much on their own, and researchers need to carefully interpret these in order to derive implications about the student learning process.

Distillation of Data for Human Judgment

When distilling data for human judgment, various techniques are used to represent data in ways that enable humans to quickly and easily understand its features (Bienkowski et al. 2012). This section reviews data visualization, a popular and effective technique.

In e-learning settings, data visualization can be used at either a micro- or macro level. At a microlevel, visualizations can depict analytical results as graphs, scatter plots, heatmaps, etc., to aid in interpretation. For example, Thakur et al. (2014) investigated the stability of student grades in math courses over the course of a semester by examining LMS usage data. In order to detect stability, they created heatmaps in which each block represented the relative grade of students (plotting each student on the x-axis and modules [each week] on the y-axis). They found that the relative grades for freshmen tended to fluctuate during the semester, whereas the relative grades for senior-level students were more constant.
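A heatmap of this sort (students on the x-axis, weekly modules on the y-axis) can be produced with matplotlib. The relative-grade matrix below is simulated to mimic the fluctuating-freshmen versus stable-seniors pattern; it is not Thakur et al.’s data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Simulated relative grades in [0, 1]: rows = students, columns = weekly modules
freshmen = np.clip(rng.normal(0.70, 0.15, size=(10, 14)), 0, 1)  # fluctuating
seniors  = np.clip(rng.normal(0.85, 0.03, size=(10, 14)), 0, 1)  # stable
grades = np.vstack([freshmen, seniors])  # shape (20 students, 14 weeks)

fig, ax = plt.subplots()
# Transpose so students run along the x-axis and modules along the y-axis
im = ax.imshow(grades.T, aspect="auto", cmap="RdYlGn")
ax.set_xlabel("Student")
ax.set_ylabel("Module (week)")
fig.colorbar(im, label="Relative grade")
fig.savefig("grade_heatmap.png")
```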

In another example, Valsamidis et al. (2012) used a visual display to better represent results from a clustering study. In this study, they used a Markov clustering (MCL) algorithm, a combination of cluster analysis and graphical representation. With MCL, the relationships between students were visualized in a 3-D graph; each node represented a student and vectors represented the relationships between students. In this way, the visual display grouped students with similar characteristics, as well as students who were isolated. This visualization helped in interpreting results from the cluster analysis.

At a macro level, dashboards (e.g., student monitoring and tracking systems) can be developed to provide various types of interactive and real-time displays. These dashboards can be embedded in the LMS to provide more up-to-date information to teachers, advisors, administrators, and students (Verbert, Duval, Klerkx, Govaerts, & Santos, 2013). An example of a dashboard application is Course Signals, developed at Purdue University (Arnold & Pistilli, 2012). Using LMS usage data, Course Signals represents student performance as a traffic signal (red for poor and green for good) to instructors and students. Results from introducing this tool suggest a positive impact on student grades and retention.

In a review article, Verbert et al. (2013) analyzed 15 dashboard applications, including Course Signals. Among these, seven applications were targeted at instructors, four were targeted at students, and the remaining four were for both instructors and students. They also reviewed the usefulness of the dashboard applications and found several positive outcomes, including impact on student grades and retention (Arnold & Pistilli, 2012), improvement in self-assessment (Kerly, Ellis, & Bull, 2007), and satisfaction with the course (Kobsa, Dimitrova, & Boyle, 2005).

However, in a critique, Gašević, Dawson, and Siemens (2015) expressed concerns that the design of many of these applications did not incorporate sound instructional design principles, especially regarding student feedback. They noted that in the Course Signals study (Tanes, Arnold, King, & Remnet, 2011), feedback provided by instructors was rarely instructional. Instead, feedback was typically more summative, rather than formative, and thus less useful for students.

In summary, data visualization, especially dashboard applications, can help students, instructors, and researchers better understand student learning patterns and trajectories while also potentially detecting at-risk students. To maximize their effect, their design should be grounded in sound and proven instructional design and learning theory.

The KDD Process

Knowledge discovery from database (KDD) refers to a framework for discovering knowledge in large data collections (Valsamidis et al. 2012). The classical KDD process consists of three phases: preprocessing, data mining, and postprocessing (Romero & Ventura, 2007; Romero et al. 2008). In contrast, Valsamidis et al. (2012) divided the KDD process into five phases: data preprocessing, data transformation, data mining, data visualization, and data interpretation. This chapter follows the classical three-phase approach.

Data preprocessing refers to transforming raw data into an appropriate shape for applying a data mining algorithm (Romero & Ventura, 2007). It encompasses data cleaning (removing unnecessary items such as missing values and outliers), user identification, session identification, data transformation, and enrichment (calculating new attributes from existing data) (Romero et al. 2008; Valsamidis et al. 2012).

Romero et al. (2008) noted that preprocessing LMS data requires less data cleaning and preprocessing work compared to that required in other large datasets, since users and sessions are typically already identified with unique IDs in most LMS datasets. They also stressed that other tasks, such as selecting data (choosing courses that researchers are interested in) and creating a summarization table (a table of required information that is aligned to research objectives), are important steps when preprocessing LMS data.

The second phase of KDD is data mining, which encompasses the core modeling work of the whole KDD process. As noted, five categories of technical methods are widely used in EDM: prediction, clustering, relationship mining, distillation for human judgment, and discovery with models (Baker & Yacef, 2009; Bienkowski et al. 2012).

The third and final phase of KDD is data postprocessing; it encompasses data visualization and data interpretation. Data visualization overlaps with distillation for human judgment and therefore is sometimes included in the data mining phase (Romero & Ventura, 2007; Romero et al. 2008).

Data interpretation is a critical step in the KDD process. Because EDM is a process that uses KDD and not a final goal, it is important to consider how EDM contributes to better understanding of student learning and teaching. For example, Gašević et al. (2015) pointed out that very few EDM studies have contributed to the development of learning theory or teaching practice, even though EDM research has received a great deal of attention. They stressed that EDM or learning analytics should be about learning and thus should have a substantial impact on research and the practice of learning and teaching.

The next section briefly illustrates the KDD process by presenting examples from research by the authors that analyzed instructor and student usage data collected by a widely used LMS, called Canvas. The data come from over 33,000 courses taught over 3 years at a midsized public university in the western United States. More details are available elsewhere (Lee et al. in press).

Data Preprocessing

Data Cleaning

The Canvas system, like other LMSs such as Moodle, logs usage data in a relational database. Therefore, MySQL, which is one of the most popular open-source databases, was used to support the data preprocessing. The Canvas log data consist of 13 tables and 78 columns; the important ones are summarized in Table 4. The Canvas data contain unique database-generated identifiers not only for courses but also for instructors and students. In this way, all user information is anonymized.
Table 4

Some important Canvas data tables and columns

Account:
  • Unique database-generated identifier for the account
  • Full name of the account

Course:
  • Unique database-generated identifier for the course
  • Subject abbreviation for the course (e.g., ENGL, STAT)
  • Long name for the course
  • Start date for the course
  • course_end_date, etc.: end date for the course

Instructor:
  • Unique database-generated identifier for the instructor

Enrollment/user:
  • Unique database-generated identifier for the user
  • The user’s enrollment type for a particular course (teacher, TA, or student)
  • Category of the content item (e.g., quiz, discussion)
  • times_viewed: total number of times the user viewed this content item
  • times_participated: total number of times the user participated with this content item

In order to transform the raw data into an appropriate shape for data mining, the “times_viewed” and “times_participated” columns, which are central to understanding user interactions, were examined first. Many missing values (nulls) were found in the dataset. The cleaning process therefore ensured that nulls were accurately represented by differentiating between meaningful nulls (the activity was not possible due to the course design) and accurate nulls (the feature was present but not used).

Then, courses were selected for further analysis. Data were examined from the most recent semester, spring 2014, which included data from 2,461 courses. Courses with no meaningful data (e.g., missing or low-usage data, missing final grades, or missing course identifiers) were eliminated, as were courses with fewer than five students and courses with fewer than ten instructor/content or student/content interactions. Figure 1 summarizes the data cleaning process. Ultimately, a total of 1,870 courses were included in the analysis.
Fig. 1

The data cleaning process
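The cleaning rules above can be sketched as simple row filters in pandas. The per-course summary table and its column names here are hypothetical:

```python
import pandas as pd

# Hypothetical per-course summary; column names are illustrative.
courses = pd.DataFrame({
    "course_id":        [10, 11, 12, 13],
    "n_students":       [40, 3, 25, 60],
    "n_interactions":   [500, 200, 8, 900],
    "has_final_grades": [True, True, True, False],
})

# Apply the cleaning rules described above: drop courses with missing
# grades, fewer than five students, or fewer than ten interactions.
clean = courses[
    courses["has_final_grades"]
    & (courses["n_students"] >= 5)
    & (courses["n_interactions"] >= 10)
]
```

In the toy data, only course 10 survives all three filters; the other three each violate one rule.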

Creating a Summarization Table

Data cleaning was conducted through MySQL scripting. The database included several columns irrelevant to the analysis (e.g., course_start_date, course_end_date), and the data format was not compatible with analytical tools such as SPSS, R, and Tableau. For this reason, the relevant columns were extracted from the full dataset and exported in CSV format. A summarization table (matrix) was then created containing only the variables relevant for data mining, with one row per user and one column per variable (the number of views and participations for each kind of content; see Table 5).
Table 5

Features logged by Canvas shown in four major categories

- No. of visits to announcement page (navigation page for all announcements)
- No. of visits to roster page (navigation page for people enrolled in the course)
- No. of times enrollment viewed (information for a specific person on the roster)
- No. of times calendar viewed
- No. of times assignment viewed (viewing instructions or reviewing instructions after submission)
- No. of times assignment participated (submission or resubmission)
- No. of times quiz viewed (viewing instructions or viewing previous attempts)
- No. of times quiz participated (submission of quizzes)
- No. of visits to grade page (a student's grade page for a course)
- No. of visits to file page (course navigation page for all files)
- No. of times attachment viewed (downloading or previewing files)
- No. of visits to syllabus page
- No. of visits to topic page (course navigation page for all discussion topics)
- No. of times discussion viewed
- No. of times discussion participated (comments and replies count as participation)
- No. of times wiki viewed (viewing or reloading edits)
- No. of times wiki edited and saved
- No. of times collaboration entered
- No. of visits to conference page (navigation page for all web conferences)

Data transformations, such as converting raw frequencies into proportions or z-scores, were also performed. Different transformation strategies were used depending on the purpose of the analysis and the data mining method; these are described in the next section.
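The summarization step can be sketched with a pandas pivot. The long-format log and its column names below are illustrative stand-ins for the Canvas tables:

```python
import pandas as pd

# Long-format log rows (illustrative), one per user/content-type pair.
log = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "content_type": ["assignment", "quiz", "assignment", "quiz"],
    "times_viewed": [5, 2, 3, 7],
    "times_participated": [1, 2, 0, 4],
})

# One row per user, one column per (measure, content type) pair --
# the summarization matrix exported to CSV for SPSS/R/Tableau.
matrix = log.pivot_table(
    index="user_id",
    columns="content_type",
    values=["times_viewed", "times_participated"],
    aggfunc="sum",
    fill_value=0,
)
# Flatten the MultiIndex columns to names such as "quiz_viewed".
matrix.columns = [f"{c}_{m.split('_')[-1]}" for m, c in matrix.columns]
```

The resulting wide matrix (users in rows, usage variables in columns) is the shape most statistical and visualization tools expect.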

Applying Data Mining to Canvas Data

The second phase of KDD is data mining, and the third phase is postprocessing, which includes data interpretation and visualization. In this section, the data mining and interpretation processes are described together.

Data mining methods employed include prediction by using multinomial logistic regression, clustering by using the expectation maximization (EM) algorithm and hierarchical cluster analysis (HCA), and distillation for human judgment by using heatmaps. A more extensive presentation of results is available elsewhere (Authors, 2015).


Prediction

The first investigation focused on which LMS variables predicted student academic performance. As in other prediction studies, statistical methods were used for prediction. Several methods were considered, including multiple linear regression, hierarchical linear modeling (HLM), and ordinal logistic regression. However, the data violated several assumptions required for these analyses, such as normality of residuals and independence of observations. In addition, students' final grades are ordinal because they are composed of letter grades. For these reasons, multinomial logistic regression was selected to predict the probability of a student's membership in a given final grade category, based on his or her use of LMS features. Thus, the distribution and nature of the model variables strongly influenced the choice of modeling approach.

Before conducting the multinomial logistic regression, the data were transformed into a suitable shape for the analysis. For the predictors, raw frequencies (number of views, number of participations) were converted into proportions of total possible activity in order to control for courses with different levels of activity; each proportion was calculated by dividing the number of student views and participations by the total number of content items posted by the instructor. Some variables (calendar, conference, and collaboration use) were eliminated because fewer than 10 % of the courses used these features, yielding too many nulls. For the dependent variable, final grades were grouped into four bands – highest (A), high (A−, B+, and B), low (B−, C+, C, C−, D+, and D), and lowest (F) – to simplify interpretation. After this transformation, two multinomial logistic regressions were conducted on final grade, one for face-to-face courses and one for online courses.
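A minimal scikit-learn sketch of this kind of model follows. The data are synthetic: the feature meanings, the grade bands, and the relationship built into `y` are fabricated for illustration and are not the chapter's dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative predictors: proportions of possible activity, e.g.
# assignment participation, quiz participation, file views.
X = rng.uniform(0, 1, size=(200, 3))

# Illustrative grade bands 0..3 (lowest..highest), loosely driven by the
# first predictor so the model has some signal to recover.
y = np.clip((X[:, 0] * 4 + rng.normal(0, 0.5, size=200)).astype(int), 0, 3)

# With the default lbfgs solver, scikit-learn fits a multinomial model
# when there are more than two classes.
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)  # one column of probabilities per grade band
```

Each row of `probs` sums to one, giving the predicted probability of membership in each grade band, which is exactly the quantity the chapter's analysis interprets.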

In brief, the results showed that "assignment participation" was the strongest predictor of final grade in face-to-face courses, whereas "quiz participation" was the strongest predictor in online courses. Thus, perhaps not surprisingly, engaging with assignments and tests was positively associated with final grade.


Clustering

In order to group courses that exhibited similar usage patterns, cluster analysis was applied. In this analysis, the clustering object was the course, and both instructor and student usage data served as variables.

For the cluster analysis, the data and variables were first selected. Undergraduate, face-to-face courses (N = 1,040) were chosen from the full spring 2014 semester dataset to secure a large enough sample for the cluster analysis. Pearson correlations were then computed to eliminate redundant (highly overlapping) variables: for each pair of features with a correlation above 0.7, one was removed. The final set contained 7 instructor and 18 student variables.
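One common way to implement such a redundancy filter in pandas is sketched below. The feature matrix is synthetic, and `quiz_v2` is deliberately a near-copy of `quiz_v`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic feature matrix; quiz_v2 is nearly identical to quiz_v.
df = pd.DataFrame({
    "assignment_v": rng.normal(size=100),
    "quiz_v": rng.normal(size=100),
})
df["quiz_v2"] = df["quiz_v"] + rng.normal(scale=0.01, size=100)

# For each feature pair with |r| > 0.7, drop one member of the pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.7).any()]
reduced = df.drop(columns=to_drop)
```

Scanning only the upper triangle ensures each pair is considered once, so exactly one member of every highly correlated pair is removed.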

Then, the expectation maximization (EM) clustering algorithm was applied (Ferguson et al., 2006), which determined the optimal number of clusters to be three.
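EM clustering is commonly implemented as a Gaussian mixture model. The sketch below uses scikit-learn on synthetic data and selects the number of clusters by BIC; the chapter does not state which selection criterion was used, and the three well-separated clusters here are fabricated for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Three synthetic "course usage" clusters in two feature dimensions.
X = np.vstack([
    rng.normal(loc, 0.3, size=(60, 2))
    for loc in ([0, 0], [4, 0], [0, 4])
])

# Fit EM (Gaussian mixture) for k = 1..5 and pick k by BIC (lower is better).
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)
```

On data with clear structure, the BIC minimum recovers the generating number of components; on real usage data the curve is typically flatter and the choice requires more judgment.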

Figure 2 shows the distribution of some of the important features used in each of the three clusters (4 out of a total of 7 instructor variables and 4 out of 18 student variables). In this figure, the red line indicates the median values in each cluster, and a greater dispersion of the blue color indicates greater use of that feature.
Fig. 2

Distribution of use for four instructor and four student features within the three clusters identified by the EM clustering algorithm at the macro level

By examining the median values in a cluster, it is apparent that instructors in clusters B and C were more active than instructors in cluster A in terms of posting assignments, quizzes, discussion topics, and wiki pages. In terms of student activities within each cluster, students in cluster B were the most active users in terms of assignment, quiz, discussion, and wiki participation.

Average and median final grades in each cluster were then examined to see how instructor and student activities were associated with student outcomes. Students in cluster B achieved the highest average and median grades, while students in cluster C outperformed students in cluster A. These results suggest ways in which instructor and student LMS usage is associated with student final grade.

Distillation of Data for Human Judgment

Distillation of data for human judgment is an approach for depicting data in ways that enable humans to quickly identify its characteristics. In order to investigate student patterns of activity at the microlevel, and their relationship with final grade, clustergrams were built (Bowers, 2010). Clustergrams combine hierarchical cluster analysis (HCA) with a heatmap. The heatmap represents each participant's row of data across each of the columns of variables as a color block, ranging from a colder blue for values 3 SD below the mean to a hotter red for values 3 SD above the mean, with zero values in white. As a form of visual analytics, the heatmap enables the human eye to examine the different intensities in patterns across the entire dataset quickly and easily.

In this analysis, two courses offered by the same instructor in both face-to-face (N = 33) and online (N = 36) formats were selected. As these courses were taught by the same instructor and had similar enrollments, it became easier to compare different course modalities. Following the recommendations of the HCA literature, in order to standardize variance, all student data were transformed to z-scores (Bowers, 2010; Romesburg, 1984). Student final grades were coded into five grade categories (A, A−, B, C/D, F) to reduce the complexity of interpretation. The student final grades were not included in the HCA calculations but were presented as the final column in the heatmaps to help visualize how usage patterns relate to a student’s final grade.

HCA was applied to cluster both rows (students) and columns (Canvas features). In this analysis, the Euclidean distance measure was used, which is the most commonly used type when analyzing ratio or interval-scale data. For the clustering algorithm, average linkage was chosen, which defines the distance between two clusters based on “the average distance between all pairs of the two clusters’ members” (Mooi & Sarstedt, 2011, p. 250).
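This procedure can be sketched with SciPy on synthetic data; the student-by-feature matrix below is fabricated, whereas the chapter's actual data are Canvas usage counts:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

rng = np.random.default_rng(3)

# Fabricated student-by-feature usage matrix with two usage profiles.
X = np.vstack([
    rng.normal(0, 1, size=(15, 6)),   # low-usage students
    rng.normal(5, 1, size=(15, 6)),   # high-usage students
])

Xz = zscore(X, axis=0)  # standardize each feature to z-scores

# Average linkage with Euclidean distance, as described above; the same
# call on Xz.T would cluster the columns (features) for the clustergram.
Z = linkage(Xz, method="average", metric="euclidean")
clusters = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters
```

Clustering rows and columns with the same linkage call, then reordering the heatmap by both dendrograms, is what produces the clustergram layout.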

Figure 3 presents the clustergrams for the face-to-face course (left) and the online course (right). In clustergrams, the rows represent data for each student, and the HCA reordered students in terms of the similarity of their LMS usage patterns. The columns represent the Canvas features, and the HCA clustered LMS features in terms of their similarity.
Fig. 3

Clustergrams of the face-to-face course (left) and the online course (right)

When examining the clustergram of the face-to-face course, in terms of students (rows), student clusters with “hotter” colors (higher LMS usage) tended to receive higher final grades, while student clusters with “colder” colors (lower LMS usage) tended to have lower final grades. The online course showed patterns similar to those in the face-to-face course in that students’ LMS usage aligned with their final grades.

For a closer interpretation of the clustergrams, the rows (students) were divided into three overall clusters through visual inspection, and final grades were compared across clusters in both courses. In the face-to-face course, the mean final grade in cluster 1 (M = 3.21, SD = 0.75), with hotter colors, was higher than that in cluster 2 (M = 2.58, SD = 1.24), with colder colors. Similarly, in the online course, the mean final grade in cluster 2 (M = 3.31, SD = 0.67) was higher than that in cluster 3 (M = 1.93, SD = 1.63), and the difference was significant (U = 52.00, p < 0.05). Thus, student clusters within both courses appear to be related to final grades, as has been noted in the past HCA heatmap literature (Bowers, 2010). In this way, the clustergram provides a rich contextual portrait of individual students' interaction patterns, and of how these patterns relate to those of other students and to learning outcomes, rather than simply reporting group averages.
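The reported U statistic corresponds to a Mann–Whitney U test, which SciPy provides. The grade values below are made up for two hypothetical clusters and are not the chapter's data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up final grades (4-point scale) for a "hotter" and a "colder" cluster.
hot = np.array([4.0, 3.7, 3.3, 3.3, 3.0, 3.7, 4.0, 3.3])
cold = np.array([2.0, 1.0, 2.3, 0.0, 1.7, 2.3, 1.0, 2.0])

# Nonparametric two-sided test, suited to small, non-normal grade samples.
u_stat, p_value = mannwhitneyu(hot, cold, alternative="two-sided")
```

A nonparametric test is a reasonable choice here because cluster sizes are small and letter-grade distributions rarely satisfy normality assumptions.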

Several differences were also found between the two course modalities. First, although the same instructor designed both courses, different LMS features were used. The quiz feature was used only in the face-to-face course, while announcements, syllabus, and discussion tool features were used only in the online course. Second, differences in the relationship between student final grade and LMS usage were found. In the face-to-face course, assessment features, such as “assignment_p” and “quiz_v” showed color patterns similar to those for the final grade. In the online course, “wiki_v” and “grades_v” had color patterns similar to those for the final grade.

Conclusions and Recommendations for Future Research

This chapter used the KDD framework to describe approaches and processes for analyzing both instructor and student usage data collected by learning management systems. As described above, KDD consists of three phases: (1) data preprocessing, (2) data mining and modeling, and (3) model evaluation and interpretation (Cooley, Mobasher, & Srivastava, 1997; Han & Kamber, 2006; Romero & Ventura, 2007; Witten & Frank, 2005). Within each phase, prior work and approaches were reviewed, discussing opportunities and challenges.

As other researchers have noted (e.g., Baker & Siemens, 2014), the data preprocessing phase, particularly data cleaning, is often the most difficult and time-consuming. Moreover, this phase often needs to be revisited as data assumptions and research questions change. In addition, an LMS is often designed to capture data that are easily stored, not data that are of most interest to educational researchers. Thus, engaging with LMS engineers early to better understand data availability and formats is recommended.

Finally, having clear research goals and questions is paramount. While these can, of course, be revisited as is natural in the course of research, simply assuming that a bottom-up or atheoretical data mining approach will reveal interesting and groundbreaking results is naïve (Norvig, n.d.). While increased computational power enables computers to iterate quickly through many candidate models, researchers must consider how planned research will address important and meaningful educational questions.

In terms of theory, this chapter contributes toward advancing the field of educational data mining as it applies to understanding and modeling the increasing and voluminous amount of LMS usage data collected in all sectors of education. In terms of application, this chapter describes several existing approaches for distilling results from EDM studies to support and enhance decision making. For example, EDM results can help inform real-time feedback to learners and instructors within visualizations, called dashboards. These dashboards can signal if a particular learner is on a positive or negative trajectory or at risk of failing. Instructors can use this kind of information to provide extra feedback or help. Similarly, learners can use this information to change their learning strategies. EDM results have also been used to inform administrative decision making. For example, the combination of EDM with demographic information available in university databases can support inferences about background characteristics of students (e.g., age, enrollment status, gender) that, in combination with course usage patterns, predict success or failure in particular courses or courses of study. A final important application area of EDM is in the area of course design. EDM can help inform the iterative improvement of course quality, as results help identify more and less successful course design features.

From an ethics standpoint, researchers must also address thorny issues around data privacy. Under most conditions of use, data from human subjects must be collected with informed consent and must be anonymized for scientific or public use. However, LMS data collected from learner and instructor activities are often collected without explicit user consent. Frequently, because of the rapid pace of technological developments, LMS developers have not designed and implemented transparent data standards, policy, and tools to ensure data privacy. Additionally, different stakeholders may have different needs for data about a student, course, or set of courses. For example, it may be reasonable for a student to have access to all of his/her data in a confidential way. Instructors, similarly, may plausibly want full access to data on their current students, but only summary access to their current students’ past performance. Finally, administrators and researchers will want only certain kinds of summary and anonymized access. As such, safeguards and data privacy standards must ensure that data are protected from unauthorized access and tampering.

In 2014, the US Department of Education released guidelines for educational institutions in order to keep parents and students informed about what student data is collected and how it is being used. New federal and state legislation is also being proposed to ensure that student data handled by companies are protected and shared only under stringent conditions. This includes prohibiting educational services from selling data they have collected from students, from using the information to deliver ads to students, and from collating student profiles from data for noneducational purposes. These changes demonstrate the growing public concern about the voluminous amount of digital student data that is collected and analyzed in a sometimes obscure fashion.


References

  1. Abdous, M., He, W., & Yen, C. J. (2012). Using data mining for predicting relationships between online question theme and final grade. Journal of Educational Technology and Society, 15(3), 77–88.
  2. Arnold, K. E., & Pistilli, M. D. (2012). Course signals at Purdue: Using learning analytics to increase student success. In Proceedings of the 2nd international conference on learning analytics and knowledge (pp. 267–270). Vancouver, BC: ACM.
  3. Baker, R., & Siemens, G. (2014). Educational data mining and learning analytics. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (pp. 253–274). New York, NY: Cambridge University Press.
  4. Baker, R. S., & Yacef, K. (2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3–17.
  5. Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing teaching and learning through educational data mining and learning analytics: An issue brief. Washington, DC: US Department of Education.
  6. Bowers, A. J. (2010). Analyzing the longitudinal K-12 grading histories of entire cohorts of students: Grades, data driven decision making, dropping out and hierarchical cluster analysis. Practical Assessment, Research and Evaluation, 15(7), 1–18.
  7. Browne, T., Jenkins, M., & Walker, R. (2006). A longitudinal perspective regarding the use of VLEs by higher education institutions in the United Kingdom. Interactive Learning Environments, 14(2), 177–192.
  8. Cooley, R., Mobasher, B., & Srivastava, J. (1997, November). Web mining: Information and pattern discovery on the World Wide Web. In Proceedings of the ninth IEEE international conference on tools with artificial intelligence (pp. 558–567). IEEE.
  9. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5–32.
  10. Dahlstrom, E., Brooks, D. C., & Bichsel, J. (2014, September). The current ecosystem of learning management systems in higher education: Student, faculty, and IT perspectives. Research report. Louisville, CO: ECAR.
  11. Dawson, S. (2008). A study of the relationship between student social networks and sense of community. Journal of Educational Technology and Society, 11(3), 224–238.
  12. Ferguson, K., Arroyo, I., Mahadevan, S., Woolf, B., & Barto, A. (2006). Improving intelligent tutoring systems: Using expectation maximization to learn student skill levels. In M. Ikeda, K. D. Ashley, & T. W. Chan (Eds.), Lecture notes in computer science: Vol. 4053. Intelligent tutoring systems (pp. 453–462). Berlin, Germany: Springer.
  13. Gašević, D., Dawson, S., & Siemens, G. (2015). Let's not forget: Learning analytics are about learning. TechTrends, 59(1), 64–71.
  14. Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and techniques (3rd ed.). Waltham, MA: Morgan Kaufmann.
  15. Hawkins, B. L., & Rudy, J. A. (2007). EDUCAUSE core data service: Fiscal year 2006 summary report. Boulder, CO: EDUCAUSE.
  16. Jo, I. H., Kim, D., & Yoon, M. (2015). Constructing proxy variables to measure adult learners' time management strategies in LMS. Educational Technology and Society, 18(3), 214–225.
  17. Kerly, A., Ellis, R., & Bull, S. (2007). CALMsystem: A conversational agent for learner modelling. In R. Ellis, T. Allen, & M. Petridis (Eds.), Applications and innovations in intelligent systems XV: Proceedings of AI-2007, 27th SGAI international conference on innovative techniques and applications of artificial intelligence (pp. 89–102). Berlin, Germany: Springer.
  18. Klosgen, W., & Zytkow, J. (2002). Handbook of data mining and knowledge discovery. Oxford, UK: Oxford University Press.
  19. Kobsa, E., Dimitrova, V., & Boyle, R. (2005). Using student and group models to support teachers in web-based distance education. In Proceedings of the 10th international conference on user modeling (pp. 124–133). Berlin, Germany: Springer.
  20. Koedinger, K. R., D'Mello, S., McLaughlin, E. A., Pardos, Z. A., & Rosé, C. P. (2015). Data mining and education. Wiley Interdisciplinary Reviews: Cognitive Science. doi:10.1002/wcs.1350
  21. Krumm, A. E., Waddington, R. J., Teasley, S. D., & Lonn, S. (2014). A learning management system-based early warning system for academic advising in undergraduate engineering. In Learning analytics (pp. 103–119). New York, NY: Springer.
  22. Lee, J. E., Recker, M., Choi, H., Hong, W. J., Kim, N. J., Lee, K., Lefler, M., Louviere, J., & Walker, A. (in press). Applying data mining methods to understand user interactions within learning management systems: Approaches and lessons learned. Journal of Educational Technology Development and Exchange.
  23. Lust, G., Elen, J., & Clarebout, G. (2013). Regulation of tool-use within a blended course: Student differences and performance effects. Computers & Education, 60(1), 385–395.
  24. Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an "early warning system" for educators: A proof of concept. Computers & Education, 54(2), 588–599.
  25. Mooi, E., & Sarstedt, M. (2011). A concise guide to market research: The process, data, and methods using IBM SPSS statistics. Berlin/Heidelberg, Germany: Springer.
  26. Norvig, P. (n.d.). All we want are the facts, ma'am.
  27. Romero, C., & Ventura, S. (2007). Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33(1), 135–146.
  28. Romero, C., Ventura, S., & García, E. (2008). Data mining in course management systems: Moodle case study and tutorial. Computers & Education, 51(1), 368–384.
  29. Romesburg, H. C. (1984). Cluster analysis for researchers. Belmont, CA: Lifetime Learning Publications.
  30. Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, 57(10), 1380–1400.
  31. Siemens, G., & Baker, R. S. J. D. (2012). Learning analytics and educational data mining: Towards communication and collaboration. In Proceedings of the 2nd international conference on learning analytics and knowledge (pp. 252–254). Vancouver, BC: ACM.
  32. Siemens, G., & Long, P. (2011). Penetrating the fog: Analytics in learning and education. Educause Review, 46(5), 30–32.
  33. Smith, V. C., Lange, A., & Huston, D. R. (2012). Predictive modeling to forecast student outcomes and drive effective interventions in online community college courses. Journal of Asynchronous Learning Networks, 16(3), 51–61.
  34. Tanes, Z., Arnold, K. E., King, A. S., & Remnet, M. A. (2011). Using signals for appropriate feedback: Perceptions and practices. Computers & Education, 57(4), 2414–2422.
  35. Tempelaar, D. T., Rienties, B., & Giesbers, B. (2015). In search for the most informative data for feedback generation: Learning analytics in a data-rich context. Computers in Human Behavior, 47, 157–167.
  36. Thakur, G. S., Olama, M. M., McNair, A. W., Sukumar, S. R., & Studham, S. (2014, January). Towards adaptive educational assessments: Predicting student performance using temporal stability and data analytics in learning management systems. In Proceedings of the 20th ACM SIGKDD conference on knowledge discovery and data mining. New York, NY: ACM.
  37. Valsamidis, S., Kontogiannis, S., Kazanidis, I., Theodosiou, T., & Karakos, A. (2012). A clustering methodology of web log data for learning management systems. Journal of Educational Technology and Society, 15(2), 154–167.
  38. Verbert, K., Duval, E., Klerkx, J., Govaerts, S., & Santos, J. L. (2013). Learning analytics dashboard applications. American Behavioral Scientist, 57(10), 1500–1509.
  39. Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann.
  40. Xu, B., & Recker, M. (2012). Teaching analytics: A clustering and triangulation study of digital library user data. Educational Technology and Society Journal, 15(3), 103–115.
  41. Yildiz, O., Bal, A., & Gulsecen, S. (2015). Statistical and clustering based rules extraction approaches for fuzzy model to estimate academic performance in distance education. Eurasia Journal of Mathematics, Science and Technology Education, 11(2), 391–404.
  42. Yu, T., & Jo, I.-H. (2014). Educational technology approach toward learning analytics: Relationship between student online behavior and learning performance in higher education. In Proceedings of the 4th international conference on learning analytics and knowledge (pp. 269–270). Indianapolis, IN: ACM.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Instructional Technology and Learning Sciences, Emma Eccles Jones College of Education and Human Services, Utah State University, Logan, USA
  2. Department of Instructional Technology and Learning Sciences, Utah State University, Logan, USA
