1 Introduction

Modeling a longitudinal process entails dealing with considerable variability over time. The modeling becomes even harder when we use multiple continuous variables to model a single construct. For example, in education research we might model students’ online behavioral engagement through their number of clicks, time spent online, and frequency of interactions [1]. Most human behavioral constructs are an amalgam of interrelated features with complex fluctuations over time. Modeling such processes requires a method that takes into account the multidimensional nature of the examined construct as well as its temporal evolution. Nevertheless, despite the richness of the quantitative data, discrete patterns can be captured, modeled, and traced using appropriate methods. Such discrete patterns represent an archetype or a “state” that is typical of behavior or function [2]. For instance, a combination of frequent consumption of online course resources, long online time, interaction with colleagues, and intense participation in cognitively challenging collaborative tasks can be coined as an “engaged state” [3]. The same student who shows an “engaged state” at one point can transition to few interactions and little time spent online at the next time point, i.e., a “disengaged state”.

Capturing a multidimensional construct as qualitative discrete states has several advantages. First, it avoids information overload, in which the information at hand is hard to process accurately because of its multiplicity and the lack of clarity on how to interpret small changes and variations. Second, it allows an easy way of communicating the information; communicating a state such as “engaged” is easier than reporting the values of several activity variables. Third, it is easier to trace or track. As we are interested in significant shifts over time, fine-grained changes in the activities are less meaningful. We are rather interested in significant shifts between behavioral states, e.g., from engaged to disengaged. Besides, such a shift is also actionable. As [4] puts it, “reliability can sometimes be improved by tuning grain size of data so it is neither too coarse, masking variance within bins, nor too fine-grained, inviting distinctions that cannot be made reliably.” More importantly, capturing the states is more tethered to the reality of human nature and function. In fact, many psychological, physiological, or disease constructs have been described as states with defining criteria, e.g., motivation, depression, or migraine.

Existing methods for longitudinal analysis are often limited to the study of a single variable’s evolution over time [5]. Some examples of such methods are longitudinal k-means [6], group-based trajectory modeling (GBTM) [7], and growth models [8]. However, when multivariate data is the target of analysis, these methods cannot be used. Multivariate methods are usually limited to one step or another of the analysis, e.g., clustering multivariate data into categorical variables (e.g., states) or charting the succession of categories into sequences. The method presented in this chapter provides an ensemble of methods and tools to effectively model, visualize, and statistically analyze the longitudinal evolution of multivariate data. As such, modeling the temporal evolution of latent states, as we propose in this chapter, may not be entirely new and has been performed—at least in part—using several models, algorithms, and software platforms [9,10,11]. For instance, the package lmfa can capture latent states in multivariate data and model their trajectories as well as transition probabilities [12]. Nevertheless, most of the existing methods are concerned with modeling disease states, time-to-event (survival) models, or time-to-failure models [11], or they lack a sequential analysis.

The VaSSTra method described in this chapter makes it possible to summarize multiple variables into states that can be analyzed using sequence analysis across time. Then, using life-event methods, distinct trajectories of sequences that undergo a similar evolution can be analyzed in detail. VaSSTra consists of three steps: (1) capturing the states or patterns (from variables); (2) modeling the temporal process (from states); and (3) capturing the patterns of longitudinal development (similar sequences are grouped in trajectories). As such, the method described in this chapter is a combination of several methods. First, a person-centered method (latent class or latent profile analysis) is used to capture the unobserved “states” within the data. The states are then used to construct a “sequence of states”, where a sequence represents a person’s ordered states for each time point. The construction of a “sequence of states” unlocks the full potential of sequence analysis, visually and mathematically. Later, the longitudinal modeling of sequences is performed using a clustering method to capture the possible trajectories of progression of states. Hence the name of the method: from Variables to States, from States to Sequences, and from Sequences to Trajectories (VaSSTra) [5].

Throughout the chapter, we discuss how to derive states from different variables related to students, how to construct sequences from students’ longitudinal progression of states, and how to identify and study distinct trajectories of sequences that undergo a similar evolution. We also cover some advanced properties of sequences that can help us analyze and compare trajectories. In the next section, we explain the VaSSTra method in detail. Next, we review the existing literature that has used the method. After that, we present a step-by-step tutorial on how to implement the method using a dataset of students’ engagement indicators across a whole program.

2 VaSSTra: From Variables to States, from States to Sequences, from Sequences to Trajectories

In Chap. 10, we went through the basics of sequence analysis in learning analytics [13]. Specifically, we learned how to construct a sequence from a series of ordered student activities in a learning session, which is a very common technique in learning analytics (e.g., [14]). In the sequences we studied, each time point represents a single instantaneous event or action by the students. In this advanced chapter, we take a different approach, where sequences are not built from a series of events but rather from states. Such states represent a certain construct (or cluster of variables) related to students (e.g., engagement, motivation) during a certain period (e.g., a week, a semester, a course). The said states are derived from a series of data variables related to the construct under study over the stipulated time period. Analyzing the sequence of such states over several sequential periods allows us to summarize large amounts of longitudinal information and to study complex phenomena across longer timespans [5]. This approach is known as the VaSSTra method. VaSSTra utilizes a combination of person-based methods (to capture the latent states) along with life events methods to model the longitudinal process. In doing so, VaSSTra effectively leverages the benefits of both families of methods in mapping the patterns of longitudinal temporal dynamics. The method has three main steps that can be summarized as (1) identifying latent States from Variables, (2) modeling states as Sequences, and (3) identifying Trajectories within sequences. The three steps are depicted in Fig. 1 and described in detail below:

Fig. 1
Summary of the three steps of the VaSSTra method: Step 1, from variables to states; Step 2, from states to sequences; Step 3, from sequences to trajectories

  • Step 1. From variables to states: In the first step of the analysis, we identify the “states” within the data using a method that can capture latent or unobserved patterns from multidimensional data (variables). The said states represent a behavioral pattern, function, or construct that can be inferred from the data. For instance, engagement is a multidimensional construct and is usually captured through several indicators, e.g., students’ frequency of activity and time spent online, course activities, cognitive activities, and social interactions. Using an appropriate method, such as person-based clustering in our case, we can derive students’ engagement states for a given activity or course. For instance, the method would classify students who invest significant time, effort, and mental work as “engaged.” Similarly, students who invest little effort and time in studying would be classified as “disengaged.” Such powerful summarization allows us to use the discretized states in further steps. An important aspect of such states is that they are calculated for a specific timespan. Therefore, in our example we could infer students’ engagement states per activity, per week, per lesson, per course, etc. Sometimes, such time divisions are given by design (e.g., lessons or courses), but on other occasions researchers have to establish a time scheme according to the data and research questions (e.g., weeks or days). Computing states for multiple time periods is a necessary step to create time-ordered state sequences and prepare the data for sequence analysis.

  • Step 2. From states to sequences: Once we have a state for each student at each time point, we can construct an ordered sequence of such states for each student. For example, in the scenario mentioned above about measuring engagement states, a sequence of a single student’s engagement states across a six-lesson course would look like the one below. When we convert the ordered states to sequences, we unlock the potential of sequence analysis and life-event methods. We are able to plot the distribution of states at each time point, and to study the individual pathways, the entropy, the mean time spent in each state, etc. We can also estimate how frequently students switch states, and what the likelihood is that they finish their sequence in a “desirable” state (e.g., “engaged”).

Disengaged - Average - Engaged - Engaged - Engaged - Average
  • Step 3. From sequences to trajectories: Our last step is to identify similar trajectories—sequences of states with a similar temporal evolution—using temporal clustering methods (e.g., hidden Markov models or hierarchical clustering). Covariates (i.e., variables that could explain cluster membership) can be added at this stage to help identify why a trajectory has evolved in a certain way. Moreover, sequence analysis can be used to study the different trajectories, and not only the complete cohort. We can compare trajectories according to their sequence properties, or to other variables (e.g., performance).

3 Review of the Literature

The VaSSTra method has been used to study different constructs related to students’ learning (Table 1), such as engagement [3, 15, 16], roles in computer-supported collaborative learning (CSCL) [17], and learning strategies [18, 19]. Several algorithms have been operationalized to identify latent states from students’ online data. Some examples are: Agglomerative Hierarchical Clustering (AHC) [19], Latent Class Analysis (LCA) [3, 15, 16], Latent Profile Analysis (LPA) [17], and mixture hidden Markov models (MHMM) [18].

Table 1 Previous literature in which VaSSTra has been used

Moreover, sequences of states have mostly been used to represent each course in a program [3, 15,16,17,18], but also smaller time spans, such as each learning session in a single course [18, 19]. Different algorithms have also been used to cluster sequences of states into trajectories, including HMM [3, 19], mixture hidden Markov models (MHMM) [15, 17], and AHC [16, 18]. Moreover, besides the basic aspects of sequence analysis discussed in the previous chapter, previous work has explored advanced features of sequence analysis such as survival analysis [3, 16], entropy [3, 17, 18], sequence implication [3, 18], transitions [3, 17, 18], covariates [17], discriminating subsequences [16], and integrative capacity [17]. Other studies have made use of multi-channel sequence analysis [15, 19], which is covered in Chap. 13 [20].

4 VaSSTra with R

In this section we provide a step-by-step tutorial on how to implement VaSSTra with R. To illustrate the method, we will conduct a case study in which we examine students’ engagement throughout all the courses of the first two years of their university studies, using variables derived from their usage of the learning management system.

4.1 The Packages

In order to conduct our analysis we will need several packages besides the basic rio (for reading and saving data in multiple formats), tidyverse (for data wrangling), cluster (for clustering features), and ggplot2 (for plotting). Below is a brief summary of the rest of the packages needed:

  • BBmisc: A package with miscellaneous helper functions [21]. We will use its normalize function to normalize our data across courses to remove the differences in students’ engagement that are due to different course implementations (e.g., larger number of learning resources).

  • tidyLPA: A package for conducting Latent Profile Analysis (LPA) with R [22]. We will use it to cluster students’ variables into distinct clusters or states.

  • TraMineR: As we have seen in Chap. 10 about sequence analysis [13], this package helps us construct, analyze and visualize sequences from time-ordered states or events [23].

  • seqhandbook: This package complements TraMineR by providing extra analyses and visualizations [24].

  • Gmisc: A package with miscellaneous functions for descriptive statistics and plots [25]. We will use it to plot transitions between states.

  • WeightedCluster: A package for clustering sequences using hierarchical cluster analysis [26]. We will use it to cluster sequences into similar trajectories.

The code below imports all the packages that we need. You might need to install them beforehand using the install.packages command:

An R code imports libraries for data manipulation, visualization, clustering, and sequence analysis.
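Based on the package list above, the import block presumably looks like the following sketch:

library(rio)             # reading and saving data in multiple formats
library(tidyverse)       # data wrangling
library(cluster)         # clustering features (agnes)
library(ggplot2)         # plotting (also loaded by tidyverse)
library(BBmisc)          # normalize()
library(tidyLPA)         # latent profile analysis
library(TraMineR)        # constructing, analyzing and visualizing sequences
library(seqhandbook)     # extra sequence analyses and visualizations
library(Gmisc)           # transition plots
library(WeightedCluster) # clustering sequences into trajectories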

4.2 The Dataset

For our analysis, we will use a dataset of students’ engagement indicators throughout eight courses, corresponding to the first two years of a blended higher education program. The dataset is described in detail in the data chapter. The indicators or variables are calculated from students’ log data in the learning management system, and include the frequency (i.e., number of times) with which certain actions have been performed (e.g., view the course lectures, read forum posts), the time spent in the learning management system, the number of sessions, the number of active days, and the regularity (i.e., consistency and investment in learning). These variables represent students’ behavioral engagement indicators. The variables are described in detail in Chapter 2 about the datasets of the book [27], in Chapter 7 about predictive learning analytics [28], and in previous works [3]. Below we use rio’s import function to read the data.

An R code imports a CSV file from a given URL and stores it in the longitudinal_data variable using the import function.
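The original import statement is not reproduced here; a minimal sketch follows, with a placeholder for the dataset URL (the actual location is given in the book’s data chapter) and longitudinal_data as the assumed variable name:

# Placeholder URL: replace with the actual location of the dataset
longitudinal_data <- import("<URL-of-the-engagement-dataset>.csv")
longitudinal_data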

# A tibble: 1,136 x 15
   UserID   CourseID Sequence Freq_Course_View Freq_Forum_Consume
   <chr>    <chr>       <int>            <int>              <int>
 1 D2C5F64E C6107FC4        1              150                251
 2 D2C5F64E 4C3F37F0        2               98                 84
 3 D2C5F64E E54A52A3        3              254                354
 4 D2C5F64E AB7EC624        4              332                825
 5 D2C5F64E B0E95213        5              386                960
 6 D2C5F64E 3DE2A32B        6              261               1026
 7 D2C5F64E D7DF3685        7              250                652
 8 D2C5F64E ECD1AFC8        8              287                697
 9 D7E3D0DC B51E1259        1              186                597
10 D7E3D0DC C473D477        2              241                580
# i 1,126 more rows
# i 10 more variables: Freq_Forum_Contribute <int>, Freq_Lecture_View <int>,
#   Regularity_Course_View <dbl>, Regularity_Lecture_View <dbl>,
#   Regularity_Forum_Consume <dbl>, Regularity_Forum_Contribute <dbl>,
#   Session_Count <int>, Total_Duration <int>, Active_Days <int>,
#   Final_Grade <dbl>

4.3 From Variables to States

The first step in our analysis is to detect latent states from the multiple engagement-related variables in our dataset (e.g., frequency of course views, frequency of forum posts, etc.). For this purpose, we will use LPA, a person-based clustering method, to identify each student’s engagement state in each course. We first need to standardize the variables to account for possible differences in course implementations (e.g., each course has a different number of learning materials and a slightly different duration). This way, the mean value of each indicator will always be 0 regardless of the course. Any value above the mean will always be positive, and any value below will be negative. As such, engagement is measured on the same scale across courses. To standardize the data, we first group it by CourseID using tidyverse’s group_by and then apply the normalize function from the BBmisc package to all the columns that contain the engagement indicators, using mutate_at and specifying the range of columns. If we inspect the data now, we will see that all variables are centered around 0.

An R code groups the longitudinal data by CourseID, standardizes the columns from Freq_Course_View to Active_Days, and stores the result in df.
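A sketch of this step, assuming the column names shown in the tibble above:

df <- longitudinal_data |>
  group_by(CourseID) |>                                # one group per course
  mutate_at(vars(Freq_Course_View:Active_Days),        # all engagement indicators
            ~ normalize(., method = "standardize"))    # mean 0, SD 1 within course
df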

# A tibble: 1,136 x 15
   UserID   CourseID Sequence Freq_Course_View Freq_Forum_Consume
   <chr>    <chr>       <int>            <dbl>              <dbl>
 1 D2C5F64E C6107FC4        1           -1.13              -1.87
 2 D2C5F64E 4C3F37F0        2           -2.03              -2.69
 3 D2C5F64E E54A52A3        3            0.519             -1.24
 4 D2C5F64E AB7EC624        4            1.88               1.33
 5 D2C5F64E B0E95213        5            2.78               2.11
 6 D2C5F64E 3DE2A32B        6            0.603              2.46
 7 D2C5F64E D7DF3685        7            0.434              0.357
 8 D2C5F64E ECD1AFC8        8            0.707              0.707
 9 D7E3D0DC B51E1259        1            0.939              1.22
10 D7E3D0DC C473D477        2            1.55               1.39
# i 1,126 more rows
# i 10 more variables: Freq_Forum_Contribute <dbl>, Freq_Lecture_View <dbl>,
#   Regularity_Course_View <dbl>, Regularity_Lecture_View <dbl>,
#   Regularity_Forum_Consume <dbl>, Regularity_Forum_Contribute <dbl>,
#   Session_Count <dbl>, Total_Duration <dbl>, Active_Days <dbl>,
#   Final_Grade <dbl>

Now, we need to subset our dataset and choose only the variables that we need for clustering. That is, we exclude the metadata about the user and course, and keep only the variables that we believe are relevant to represent the engagement construct.

An R code selects the columns Freq_Course_View, Freq_Forum_Consume, Freq_Forum_Contribute, Freq_Lecture_View, Regularity_Course_View, Session_Count, Total_Duration, and Active_Days from df and stores them in a subset named to_cluster.
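A sketch consistent with the description (note that, because df is still grouped by CourseID, dplyr will retain the grouping column in the subset):

to_cluster <- df |>
  select(Freq_Course_View, Freq_Forum_Consume, Freq_Forum_Contribute,
         Freq_Lecture_View, Regularity_Course_View, Session_Count,
         Total_Duration, Active_Days)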

Before we go on, we must choose a seed so we can obtain the same results every time we run the clustering algorithm. We can now finally use tidyLPA to cluster our data. We try from 1 to 10 clusters and different models that enforce different constraints on the data. For example, model 3 takes equal variances and equal covariances, whereas model 6 takes varying variances and varying covariances. You can find out more about this in the tidyLPA documentation [22]. Be aware that running this step may take a while. For more details about LPA, consult the model-based clustering chapter.

An R code sets a seed for reproducibility, ungroups the to_cluster subset, performs single imputation, and estimates latent profiles with 1 to 10 clusters using models 1, 2, 3, and 6, storing the result in Mclustt.
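A sketch of this step; the seed value is a placeholder, Mclustt is the assumed result name, and dropping CourseID after ungrouping is our assumption (the grouping key retained by select() must not enter the profile estimation):

set.seed(2023)                 # placeholder seed; any fixed value works
Mclustt <- to_cluster |>
  ungroup() |>
  select(-CourseID) |>         # drop the grouping key retained by select()
  single_imputation() |>       # impute missing indicator values
  estimate_profiles(n_profiles = 1:10, models = c(1, 2, 3, 6))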

Once all the possible cluster models and numbers have been calculated, we can calculate several statistics that will help us choose which is the right model and number of clusters for our data. For this purpose, we use the compare_solutions function from tidyLPA and we use the results of calling this function to plot the BIC and the entropy of each model for the range of cluster numbers that we have tried (1–10) (Fig. 2). In the model-based clustering chapter you can find out more details about how to choose the best cluster solution.

Fig. 2
Choosing the number of clusters by plotting statistics: entropy and BIC against the number of classes (1–10) for models 1, 2, 3, and 6

An R code calculates AIC, BIC, and entropy statistics for the Mclustt model and stores the results in cluster_statistics. It then plots entropy versus the number of classes and BIC versus the number of classes for the different models, using lines and points, with a minimal theme and the legend positioned at the bottom.
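A sketch of this step. The text above mentions compare_solutions; here we additionally assume tidyLPA’s get_fit to obtain a per-model table of fit indices for plotting:

# Compare solutions (prints fit indices and suggests a best model)
compare_solutions(Mclustt)

# Fit indices (AIC, BIC, Entropy, ...) for each model/class combination
cluster_statistics <- get_fit(Mclustt)

# Entropy by number of classes, one line per model
ggplot(cluster_statistics, aes(x = Classes, y = Entropy, color = factor(Model))) +
  geom_line() + geom_point() +
  theme_minimal() + theme(legend.position = "bottom")

# BIC by number of classes, one line per model
ggplot(cluster_statistics, aes(x = Classes, y = BIC, color = factor(Model))) +
  geom_line() + geom_point() +
  theme_minimal() + theme(legend.position = "bottom")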

Although, based on the BIC values, Model 6 with 3 classes would be the best fit, the entropy for this model is quite low. Instead, Models 1 and 2 have a higher overall entropy and quite a large fall in BIC when increasing from 2 to 3 classes. Taken together, we choose Model 1 with 3 classes, which shows better separation of clusters (high entropy) and a large drop in BIC value (elbow). We add the cluster assignment back to the data so we can compare the different variables between clusters and use the assignment in the next steps. Now, for each student’s course enrollment, we have a state (i.e., cluster) that represents the student’s engagement during that particular course.

df$State <- Mclustt$model_1_class_3$model$classification

We can plot the mean variable values for each of the three clusters (Fig. 3) to understand what each of them represents:

Fig. 3
Mean value of each variable for each cluster: state 1 shows low values on all indicators, state 2 intermediate values, and state 3 high values

An R code reshapes df into a longer format, converts State into a factor, filters the columns matching those in the to_cluster subset, removes underscores from the column names, calculates means by group, stores the results in long_mean, and creates a grouped bar plot of mean values by state and variable, with a specified color palette and a minimal theme.
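A sketch of this step, with long_mean as in the description and a Brewer palette as a stand-in for the book’s colors:

long_mean <- df |>
  ungroup() |>
  mutate(State = factor(State)) |>
  pivot_longer(Freq_Course_View:Active_Days,
               names_to = "Variable", values_to = "Value") |>
  filter(Variable %in% names(to_cluster)) |>        # keep the clustering variables
  mutate(Variable = gsub("_", " ", Variable)) |>    # remove underscores for labels
  group_by(State, Variable) |>
  summarize(Mean = mean(Value, na.rm = TRUE), .groups = "drop")

ggplot(long_mean, aes(x = State, y = Mean, fill = Variable)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set3") +             # stand-in palette
  theme_minimal()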

We clearly see that the first cluster represents students with low mean levels of all engagement indicators, the second cluster students with average values, and the third cluster students with high values. We can convert the State column of our dataset to a factor to give the clusters appropriate descriptive labels:

An R code defines the engagement levels as disengaged, average, and active, then converts the State variable in df into a factor with the corresponding levels and labels, storing the result in df_named.
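A sketch of the relabeling step (cluster 1 = low values, 2 = average, 3 = high, as interpreted from Fig. 3):

engagement_levels <- c("Disengaged", "Average", "Active")

df_named <- df |>
  mutate(State = factor(State, levels = c(1, 2, 3), labels = engagement_levels))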

4.4 From States to Sequences

In the previous step, we turned a large number of variables representing student engagement in a given course into a single state: Disengaged, Average, or Active. Each student has eight engagement states: one for each course (time point in our data) in the first two years of the program. Since the dataset includes the order of each course for each student, we can construct a sequence of the engagement states throughout all courses for each student. To do that, we first need to transform our data into a wide format, in which each row represents a single student, and each column represents the student’s engagement state in a given course:

An R code sorts df_named by UserID and Sequence, then reshapes it into a wider format using UserID as the identifier column and the Sequence values as new columns, filled with the corresponding State values. The result is stored in clus_seq_df.
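A sketch of the reshaping step; the ungroup() call is our assumption to clear the earlier course grouping:

clus_seq_df <- df_named |>
  ungroup() |>
  arrange(UserID, Sequence) |>
  pivot_wider(id_cols = UserID, names_from = Sequence, values_from = State)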

Now we can use TraMineR to construct the sequence and assign colors to represent each of the engagement states:

An R code defines a color palette for the sequence plot, then creates a sequence object clus_seq from the clus_seq_df data, specifying the sequences in columns 2 to 9, with the engagement levels as the alphabet and the defined color palette.
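A sketch of the sequence definition; the color values are placeholders, and the alphabet is listed in the alphabetical order shown in the output below:

colors <- c("#009933", "#FFB319", "#B3001B")   # placeholder colors: Active, Average, Disengaged

clus_seq <- seqdef(clus_seq_df, var = 2:9,
                   alphabet = c("Active", "Average", "Disengaged"),
                   cpal = colors)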

 [>] 3 distinct states appear in the data:

     1 = Active

     2 = Average

     3 = Disengaged

 [>] state coding:

       [alphabet]  [label]    [long label]

     1  Active      Active     Active

     2  Average     Average    Average

     3  Disengaged  Disengaged Disengaged

 [>] 142 sequences in the data set

 [>] min/max sequence length: 8/8

We can use the sequence distribution plot from TraMineR to visualize the distribution of the states at each time point (Fig. 4). We see that the distribution of states is almost constant throughout the eight courses. The ‘Average’ state takes the largest share, followed by the ‘Active’ state, and the ‘Disengaged’ state is consistently the least common. For more hints on how to interpret the sequence distribution plot, refer to Chapter 10 [13].

Fig. 4
Sequence distribution plot of the course states: the share of each state is roughly constant across the eight courses, with ‘Average’ the most prevalent and ‘Disengaged’ the least

An R code creates a sequence distribution plot from the sequence object clus_seq using seqdplot. It hides the borders, enables layout adjustment, includes a legend arranged in three columns, and sets the legend proportion to 20% of the plot size.
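A sketch matching the description (argument names from TraMineR’s seqplot family):

seqdplot(clus_seq, border = NA, use.layout = TRUE,
         with.legend = TRUE, ncol = 3, legend.prop = 0.2)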

We can also visualize each of the individual students’ sequences of engagement states using a sequence index plot (Fig. 5). In this type of visualization, each horizontal bar represents a single student, and each of the eight colored blocks along the bar represents the students’ engagement states. We can order the students’ sequences according to their similarity for a better understanding. To do this, we calculate the substitution cost matrix (seqsubm) and the distance between the sequences according to this cost (seqdist). Then we use an Agglomerative Nesting Hierarchical Clustering algorithm (agnes) to group sequences together according to their similarity (see Chapter 10 [13]). We may now use the seq_heatmap function of seqhandbook to plot the sequences ordered by their similarity. From this plot, we already sense the existence of students that are mostly active, students that are mostly disengaged, and students that are in-between, i.e., mostly average.

Fig. 5
Sequence index plot of the course states ordered by sequence distance, with the hierarchical clustering dendrogram shown on the left

An R code computes a substitution-cost matrix with a constant method and then derives a dissimilarity matrix using the LCS distance, accounting for missing values. After hierarchical clustering with Ward’s method on the dissimilarity matrix, it produces a sequence heatmap of the clus_seq data, annotated with the clustering from clusterward2.
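A sketch of the four operations described above; dissimilarities and clusterward2 are the assumed object names:

# Substitution-cost matrix with a constant cost of 2
scost <- seqsubm(clus_seq, method = "CONSTANT", with.missing = TRUE)

# LCS distances between all pairs of sequences
dissimilarities <- seqdist(clus_seq, method = "LCS", with.missing = TRUE)

# Agglomerative hierarchical clustering (Ward) on the distances
clusterward2 <- agnes(as.dist(dissimilarities), diss = TRUE, method = "ward")

# Heatmap of the sequences ordered by the clustering dendrogram
seq_heatmap(clus_seq, clusterward2)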

4.5 From Sequences to Trajectories

In the previous step we constructed a sequence of each student’s engagement states throughout eight courses. When plotting these sequences, we observed that there might be distinct trajectories of students that undergo a similar evolution of engagement. In this last step, we use hierarchical clustering to cluster the sequences of engagement states into distinct trajectories with similar engagement patterns. To perform hierarchical clustering we first need to calculate the distance between all the sequences. For more details on the clustering technique, please refer to Chapter 8 [29]. As we have seen in the Sequence Analysis chapter, there are several algorithms to calculate the distance. We choose the LCS metric, implemented in the TraMineR package, which calculates the distance between two sequences based on their longest common subsequence.

dissimLCS <- seqdist(clus_seq, method = "LCS")

 [>] 142 sequences with 3 distinct states

 [>] creating a 'sm' with a substitution cost of 2

 [>] creating 3x3 substitution-cost matrix using 2 as constant value

 [>] 103 distinct sequences

 [>] min/max sequence lengths: 8/8

 [>] computing distances using the LCS metric

 [>] elapsed time: 0.012 secs

Now we can perform the hierarchical clustering. For this purpose, we use the hclust function of the stats package:

clustered <- hclust(as.dist(dissimLCS), method = "ward.D2")

We create partitions for 2 to 10 clusters and plot the cluster statistics to select the most suitable number of clusters (Fig. 6).

Fig. 6
Cluster statistics for the hierarchical clustering across 2–10 clusters: HC falls as the number of clusters grows, while CHsq, HGSD, R2sq, and R2 rise

An R code converts the clustering results into a clustered_range object using the LCS dissimilarity matrix, computing partitions of up to 10 clusters. It then plots the clustered_range object, displaying all cluster statistics, normalized using z-scores based on the median, with a line width of 2.
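A sketch using WeightedCluster’s as.clustrange, consistent with the description:

# Partitions from 2 to 10 clusters, evaluated against the LCS distances
clustered_range <- as.clustrange(clustered, diss = dissimLCS, ncluster = 10)

# All cluster-quality statistics, median-based z-score normalization
plot(clustered_range, stat = "all", norm = "zscoremed", lwd = 2)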

There seems to be a maximum for most statistics at three clusters, so we save the cluster assignment for three clusters in a variable named grouping.

grouping <- clustered_range$clustering$cluster3

Now we can use the variable grouping to plot the sequences for each trajectory using the sequence index plot (Fig. 7):

Fig. 7
Sequence index plot of the course states per trajectory: three panels showing the individual sequences of each trajectory

seqIplot(clus_seq, group = grouping, sortv = "from.start")

In Fig. 7, we see that the first trajectory corresponds to mostly average students, the second one to mostly active students, and the last one to mostly disengaged students. We can rename the clusters accordingly.

An R code assigns the trajectory names (Mostly average, Mostly active, and Mostly disengaged) based on the grouping variable.
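A sketch of the relabeling; the mapping of cluster numbers to labels follows the interpretation of Fig. 7:

trajectories <- factor(grouping, levels = c(1, 2, 3),
                       labels = c("Mostly average", "Mostly active", "Mostly disengaged"))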

We can plot the sequence distribution plot to see the overall distribution of the sequences for each trajectory (Fig. 8).

Fig. 8
Sequence distribution plot of the course states per trajectory: mostly active (n = 41), mostly average (n = 69), and mostly disengaged (n = 32)

seqdplot(clus_seq, group = trajectories)

4.6 Studying Trajectories

There are many aspects of our trajectories that we can study. For example, we can use the mean time plot to compare the time spent in each engagement state for each trajectory; this plot summarizes the state distribution across all time points (Fig. 9). As expected, we see that the mostly active students spend most of their time in an ‘Active’ state, the mostly average students in an ‘Average’ state, and the mostly disengaged students in a ‘Disengaged’ state, although the latter spend quite some time in an ‘Average’ state as well.

Fig. 9
Mean time plot of the course states per trajectory: mostly active (n = 41), mostly average (n = 69), and mostly disengaged (n = 32)

seqmtplot(clus_seq, group = trajectories)

Another very useful plot is the sequence frequency plot (Fig. 10), which shows the most common sequences in each trajectory and the percentage of all sequences that they represent. We see that, for each trajectory, the most common sequence is the one in which all engagement states are equal. The mostly active trajectory has, as expected, sequences dominated by ‘Active’ states, with sparse ‘Average’ states and one ‘Disengaged’ state. The mostly disengaged trajectory shows a similar pattern dominated by disengaged states, with some ‘Average’ and one ‘Active’ state. The mostly average trajectory, although dominated by ‘Average’ states, shows a diversity of shifts to ‘Active’ or ‘Disengaged’ states.

Fig. 10
The 10 most frequent sequences in each trajectory, with their cumulative frequency: mostly active (n = 41), mostly average (n = 69), and mostly disengaged (n = 32)

seqfplot(clus_seq, group = trajectories)

To measure the stability of engagement states for each trajectory at each time point, we can use the between-student (transversal) entropy. Entropy is lowest when all students have the same engagement state at the same time point and highest when the heterogeneity is maximum. We can see that the “Mostly active” and “Mostly disengaged” trajectories have a slightly lower entropy than the “Mostly average” one, which is a sign that the students in the latter trajectory are the least stable (Fig. 11).

Fig. 11
Transversal entropy plot of each trajectory: mostly active (n = 41), mostly average (n = 69), and mostly disengaged (n = 32)

seqHtplot(clus_seq, group = trajectories)

Another interesting aspect to look into is the difference in the most common subsequences among trajectories (Fig. 12). We first search for the most frequent subsequences overall and then compare them among the three trajectories. Interestingly enough, the most frequent subsequence is remaining ‘Active’, and remaining ‘Disengaged’ is number five. Remaining ‘Average’ is not among the top 10 most common subsequences; rather, the subsequences containing the ‘Average’ state always include transitions to other states.

Fig. 12
Most discriminating subsequences per trajectory, colored by the sign and significance of Pearson’s residual

An R code creates an event sequence object from clus_seq, then extracts the frequent subsequences with a minimum support of 0.05 and a maximum length of 2 events. Using the trajectory groupings, it conducts a group comparison using the chi-square method.
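A sketch of this analysis with assumed object names (events, subsequences, discriminating):

# Event sequences derived from the state sequences
events <- seqecreate(clus_seq)

# Frequent subsequences: support of at least 5%, at most 2 events
subsequences <- seqefsub(events, pmin.support = 0.05, max.k = 2)

# Chi-square comparison of subsequence frequencies across trajectories
discriminating <- seqecmpgroup(subsequences, group = trajectories, method = "chisq")
plot(discriminating)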

There are other sequence properties that we may need to compare among the trajectories. The function seqindic calculates sequence properties for each individual sequence (Table 2). Some of these indices need additional information about the sequences, namely, which are the positive and which are the negative states. In our case, we might consider the Active state to be positive and the Disengaged state to be negative. Below we discuss some of the most relevant measures:

  • Trans: Number of transitions. It represents the number of times there has been a change of state. If a sequence maintains the same state throughout its whole length, the value of Trans would be zero; if there were two shifts of state, the value would be 2.

  • Entr: Longitudinal entropy or within-student entropy is a measure of the diversity of the states within a sequence. In contrast with the transversal or between-student entropy that we saw earlier (Fig. 11), which is calculated per time point, longitudinal entropy is calculated per sequence (which, in our case, represents a student’s engagement throughout the courses in a program). Longitudinal entropy is calculated using Shannon’s entropy formula. Sequences that remain in the same state most of the time have a low entropy, whereas sequences that shift states continuously with great diversity have a high entropy.

  • Cplx: The complexity index is a composite measure of a sequence’s complexity based on the number of transitions and the longitudinal entropy. It measures the variety of states within a sequence, as well as the frequency and regularity of transitions between them. In other words, a sequence with a high complexity index is characterized by many different and unpredictable states or events, and frequent transitions between them.

  • Prec: Precarity is a measure of the (negative) stability or predictability of a sequence. It measures the proportion of time that a sequence spends in negative or precarious states, as well as the frequency and duration of transitions between positive and negative states. A sequence with a high precarity index is characterized by a high proportion of time spent in negative or precarious states, and frequent transitions between positive and negative states.

  • Volat: Objective volatility represents the average between the proportion of states visited and the proportion of transitions (state changes). It is a measure of the variability of the states and transitions in a sequence. A sequence with high volatility is characterized by frequent and abrupt changes in the states or events, while a sequence with low volatility has more stable and predictable patterns.

  • Integr: Integrative capacity (potential) is the ability to reach a positive state and then stay in a positive state. Sequences with a high integrative capacity not only include positive states but also manage to stay in such positive states.

Table 2 Sequence indicators
An R code calculates sequence-based indices including the number of transitions, entropy, complexity, precarity, volatility, and integrative capacity, specifying ‘Active’ as the positive state and ‘Disengaged’ as the negative state.
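A sketch of the call; the exact argument lists passed through ipos.args and prec.args are our assumptions based on the description:

indices <- seqindic(clus_seq,
                    indic = c("trans", "entr", "cplx", "prec", "volat", "integr"),
                    ipos.args = list(pos.states = c("Active")),
                    prec.args = list(state.order = c("Active", "Average", "Disengaged")))
indices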

We can compare the distribution of these indices between the different trajectories to study their different properties (Fig. 13). Below is an example for precarity and integrative capacity. We clearly see how the Mostly disengaged trajectory has the highest value of precarity, whereas the Mostly active students have the highest integrative capacity. Beyond a mere visual representation, we could also conduct statistical tests to compare whether these properties differ significantly among trajectories.

Fig. 13
Comparison of sequence indicators between trajectories: precarity is highest for the mostly disengaged trajectory and integrative capacity is highest for the mostly active one

An R code adds the trajectory information to the indices data frame. It then creates boxplots of precarity and integrative capacity grouped by trajectory, with manually set fill colors and no legend.
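A sketch of the comparison plots, reusing the earlier (placeholder) state palette:

indices$Trajectory <- trajectories

ggplot(indices, aes(x = Trajectory, y = Prec, fill = Trajectory)) +
  geom_boxplot() +
  scale_fill_manual(values = colors) +
  theme_minimal() + theme(legend.position = "none")

ggplot(indices, aes(x = Trajectory, y = Integr, fill = Trajectory)) +
  geom_boxplot() +
  scale_fill_manual(values = colors) +
  theme_minimal() + theme(legend.position = "none")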

As we have mentioned, an important aspect of the study of students’ longitudinal evolution is looking at the transitions between states. We can calculate the transitions using seqtrate from TraMineR and plot them using transitionPlot from Gmisc.

transition_matrix <- seqtrate(clus_seq, count = TRUE)

From Table 3 and Fig. 14 we can see that most transitions are from a state to itself (no change). The most unstable state is ‘Average’, with frequent transitions both to ‘Active’ and to ‘Disengaged’. Both ‘Active’ and ‘Disengaged’ have occasional transitions to ‘Average’ but rarely to one another.

Fig. 14
Transition plot between states: each state transitions mostly to itself; ‘Average’ also transitions to ‘Active’ and ‘Disengaged’, while ‘Active’ and ‘Disengaged’ transition occasionally to ‘Average’

Table 3 Transition rate between states
An R code generates a transition plot using the transition matrix computed above. It colors the starting boxes according to the engagement levels, sets the text color, adjusts the text size, and labels the boxes with the engagement levels in reverse order.
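A sketch of the transition plot; the argument values (text size, color order) are assumptions based on the description:

transitionPlot(transition_matrix,
               box_txt = rev(engagement_levels),  # Active, Average, Disengaged
               fill_start_box = colors,           # palette already in that order
               txt_start_clr = "black",
               cex = 1.5,                         # placeholder text size
               new_page = TRUE)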

5 Discussion

In this chapter, we presented the VaSSTra method, a person-centered approach for the longitudinal analysis of complex behavioral constructs over time. In the step-by-step tutorial, we analyzed students’ engagement states throughout all the courses in the first two years of a program. First, we clustered all the indicators of student engagement into three engagement states using model-based clustering: active, average, and disengaged. This step allowed us to summarize eight continuous numerical variables representing students’ online engagement indicators in each course into a single categorical variable (state). Then, we constructed a sequence of engagement states for each student, allowing us to map the temporal evolution of engagement and make use of sequence analysis methods to visualize and investigate such evolution. Lastly, we clustered students’ sequences of engagement states into three different trajectories: a mostly active trajectory dominated by engaged students who are stable over time, a mostly average trajectory with averagely engaged students who often transition to engaged or disengaged states, and a mostly disengaged trajectory with inactive students who fail to catch up and remain disengaged throughout most of the program. As such, VaSSTra offers several advantages over existing longitudinal clustering methods (such as growth models or longitudinal k-means), which are limited to a single continuous variable [30,31,32] instead of taking advantage of the multiple variables in the data. Through its summarizing power, VaSSTra is able to represent complex behaviors captured through several variables using a limited number of states. Moreover, through sequence analysis, we can study how the sequences of such states evolve over time and differ from one another, and whether there are distinct trajectories of evolution.

Several literature reviews of longitudinal studies (e.g., [33]) have highlighted the shortcomings of existing research, such as using variable-centered methods or ignoring the heterogeneity of students’ behavior. Ignoring the longitudinal heterogeneity means mixing trends of different types, e.g., when an increasing trend in one subgroup coexists with a decreasing trend in another. Another limitation of the existing longitudinal clustering methods is that cluster membership cannot vary with time, so each student is assigned to a single longitudinal cluster, which makes it challenging to study variability and transitions.

As we have seen in the literature review section, the VaSSTra method can be adapted to various scenarios beyond engagement, such as collaborative roles, attitudes, achievement, or self-efficacy, and can be used with different time points such as tasks, days, weeks, or school years. The reader should refer to Chapter 8 [29] and Chapter 9 [34] about clustering to learn other clustering techniques that may be more appropriate for transforming different types of variables into states (that is, conducting the first step of VaSSTra). Moreover, in Chapter 10 [13], the basics of sequence analysis are described, including how to cluster sequences into trajectories using different distance measures that might be more appropriate in different situations. The next chapter (Chapter 12) [35] presents Markovian modeling, which constitutes another way of clustering sequences into trajectories according to their state transitions. Lastly, Chapter 13 [20] presents multi-channel sequence analysis, which could be used to extend VaSSTra to study several parallel sequences (of several constructs) at the same time.