
1 Introduction

In the previous two chapters, we have learned about sequence analysis [1, 2] and its relevance to educational research. This chapter presents a closely-related method: Markovian models. Specifically, we focus on a particular type of Markovian model, where the data are assumed to be categorical and observed at discrete time intervals, as per the previous chapters about sequence analysis, although in general Markovian models are not restricted to categorical data. One of the main differences between sequence analysis and Markovian modelling is that the former relies on deterministic data mining, whereas the latter uses probabilistic models [3]. Moreover, sequence analysis takes a more holistic approach by analysing sequences as a whole, whereas Markovian modelling focuses on the transitions between states, their probability, and the reasons (covariates) which explain why these transitions happen.

We provide an introduction and practical guide to the topic of Markovian models for the analysis of sequence data. While we try to avoid advanced mathematical notations, to allow the reader to continue to other, more advanced sources when necessary, we do introduce the basic mathematical concepts of Markovian models. When doing so, we use the same notation as in the R package ‘seqHMM’ [4], which we also use in the examples. In particular, we illustrate first-order Markov models, hidden Markov models, mixture Markov models, and mixture hidden Markov models with applications to synthetic data on students’ collaboration roles throughout a complete study program.

The chapter proceeds to describe the theoretical underpinnings of each method in turn, then showcases each method with code, before presenting some conclusions and further readings. In addition to the aforementioned applications to collaboration roles and achievement sequences, we also provide a demonstration of the utility of Markovian models in another context, namely process mining. In the process mining application, we leverage Markov models and mixture Markov models to explore learning management system logs. Finally, we conclude with a brief discussion of Markovian models in general and provide some recommendations for further reading on advanced topics in this area.

2 Methodological Background

2.1 Markov Model

The simple first-order Markov chain or Markov model (MM) can be used to model transitions between successive states. In the first-order MM, given the current observation, the next observation in the sequence is independent of the past—this is called the Markov property (the order of the MM determines how many previous observations the next observation depends on). For example, when predicting a student’s school success in the fourth year under a first-order model, we only need to consider their success in the third year, while their success in the first and second years gives no additional information for the prediction (see Fig. 1 for an illustration). As such, the model is said to be memoryless.

Fig. 1

Illustration of the Markov model. The nodes \(Y_1\) to \(Y_4\) refer to states at time points 1 to 4, connected in series. The arrows indicate dependencies between successive states

As an example, consider the data described in Table 1, which includes four sequences of length ten. The alphabet—that is, the list of all possible states appearing in the data—consists of two observed states: low achievement (L) and high achievement (H). Here, the individuals are assumed to be independent of one another:

Table 1 Four example sequences of school achievement with individuals A–D across the rows and years 1–10 across the columns

Say t describes the position in the sequence, or in this example, the year (in other words, here t runs from 1 to 10). If we assume that the probability of observing L or H at any given point t depends on the previous observation only, we can estimate the transition probabilities \(a_{LL}\) (from state L to state L), \(a_{LH}\) (L to H), \(a_{HL}\) (H to L), and \(a_{HH}\) (H to H) by calculating the number of observed transitions from each state to all states and scaling these with the total number of transitions from that state. Mathematically, we can write the transition probability \(a_{rs}\) from state r to state s as

$$\displaystyle \begin{aligned} a_{rs}=P(z_t = s\,|\,z_{t-1} = r), \ s,r \in \{L,H\}, \end{aligned}$$

which simply states that the observed state \(z_t\) in year t being L or H depends on which of the two states was observed in the previous year \(t-1\). For example, to compute \(a_{LH}=P(z_t=H\,|\,z_{t-1}=L)\), the probability of transitioning from the origin state L to the destination state H, we divide the twelve observed transitions from state L to state H by 20, which is the total number of transitions from L to any state.
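To make this calculation concrete, the short R sketch below builds four hypothetical achievement sequences (invented for illustration; they are not the exact sequences of Table 1) and estimates the transition probabilities by counting transitions, which is what TraMineR’s seqtrate() function does:

library(TraMineR)

# Four hypothetical ten-year sequences of low (L) and high (H) achievement
achievement <- rbind(
  A = c("L", "H", "H", "H", "L", "H", "H", "H", "H", "H"),
  B = c("H", "H", "H", "L", "H", "H", "H", "H", "L", "H"),
  C = c("L", "L", "H", "H", "H", "L", "H", "H", "H", "H"),
  D = c("H", "L", "H", "H", "H", "H", "L", "H", "H", "H")
)
achievement_seq <- seqdef(achievement, alphabet = c("L", "H"))

# Estimated transition probabilities a_rs: rows are origin states,
# columns are destination states, and each row sums to one
seqtrate(achievement_seq)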

The basic MM assumes that the transition probabilities remain constant in time (this property is called time-homogeneity). This means, for example, that the probability of transitioning from the low-achievement state to the high-achievement state is the same in the ninth year as it was in the second year. We can collect the transition probabilities in a transition matrix (which we call A) which shows all of the possible transition probabilities between each pair of origin and destination states, as illustrated in Table 2. For example, when a student has low achievement in year t, they have a 40% probability of having low achievement in year \(t+1\) and a higher 60% probability of transitioning to high achievement instead, regardless of the year t. Notice that the probabilities in each row must add up to 1 (or 100%).

Table 2 Transition matrix showing the probabilities of transitioning from one state to another (low or high achievement). The rows and columns describe the origin state and the destination state, respectively

Lastly, we need to define probabilities for the starting states of the sequences, i.e., the initial probabilities

$$\displaystyle \begin{aligned} \pi_s=P(z_1 = s), \ s \in \{L,H\}.\end{aligned}$$

In the example, half of the students have low achievement and the other half have high achievement in the first year, so \(\pi _L=\pi _H = 0.5\).

This basic MM is very simple and is often not realistic in the context of educational sciences. We can, however, extend the basic MM in several ways.

First of all, we can include covariates to explain the transition and/or initial probabilities. For example, if we think that transitioning from low to high achievement becomes more challenging as the students get older, we may add time as an explanatory variable to the model, allowing the probability of transitioning from low to high achievement to decrease over time. We could also increase the order of the Markov chain, accounting for longer histories. This may be more realistic, but at the same time increasing the order makes the model considerably more complex, the more so the longer the history considered.

Secondly, one of the most useful extensions is the inclusion of hidden (or latent) states that cannot be observed directly but can be estimated from the sequence of observed states. An MM with time-constant hidden states is typically called the mixture Markov model (MMM). It can be used to find latent subpopulations, or in other words, to cluster sequence data. A model with time-varying hidden states is called the hidden Markov model (HMM), which allows the individuals to transition between the hidden states. Allowing for both time-constant and time-varying hidden states leads to the mixture hidden Markov model (MHMM). Unless otherwise specified, from now on we always use the term hidden states for time-varying hidden states, while time-constant hidden states are referred to as clusters.

2.2 Mixture Markov Model

Consider a common case in sequence analysis where individual sequences are assumed to be clustered into subpopulations such as those with typically high and low achievement. In the introductory sequence analysis chapter, the clustering of sequences was performed based on a matrix of pairwise dissimilarities between sequences. Alternatively, we can use the MMM to group the sequences based on their initial and transition probabilities, for example, into those who tend to stay in and transition to high achievement states and those that tend to stay in and transition to low achievement states, as illustrated in Table 3.

Table 3 Two transition matrices showing the probabilities of transitioning from one state of achievement to another in two clusters of Low achievement and High achievement. The rows and columns describe the origin state and the destination state, respectively

In MMMs, we have a separate transition matrix \(A^k\) for each cluster k (for \(k=1,\ldots ,K\) clusters/subpopulations), and the initial state distribution defines the probabilities to start (and stay) in the hidden states corresponding to a particular cluster. This probabilistic clustering provides group membership probabilities for each sequence; these define how likely it is that each individual is a member of each cluster. We can easily add (time-constant) covariates to the model to explain the probabilities of belonging to each cluster. By incorporating covariates in this way we could, for example, find that being in a high-achievement cluster is predicted by gender or family background. However, we note that this is distinct from the aforementioned potential inclusion of covariates to explain the transition and/or initial probabilities.

An advantage of this kind of probabilistic modelling approach is that we can use traditional model selection methods such as likelihood-based information criteria or cross-validation for choosing the best model. For example, if the number of subpopulations is not known in advance—as is typically the case—we can compare models with different clustering solutions (e.g., those obtained with different numbers of clusters, different subsets of covariates, or different sets of initial probabilities) and choose the best-fitting model with, for example, the Bayesian information criterion (BIC) [5].

2.3 Hidden Markov Model

The HMM can be useful in a number of cases when the state of interest cannot be directly measured or when there is measurement error in the observations. In HMMs, the Markov chain operates at the level of hidden states, which subsequently generate or emit observed states with different probabilities. For example, think of the progression of a student’s ability as a hidden state and school success as the observed state. We cannot measure true ability directly, but we can estimate the student’s progress by their test scores, which are emissions of their ability. There is, however, some uncertainty in how well the test scores represent the student’s true ability. For example, observing low test scores at some point in time does not necessarily mean the student has low ability; they might have scored lower than expected in the test due to other reasons such as being sick at that particular time. Such uncertainty can be reflected in the emission probabilities; for example, in the high-ability state students get high test scores eight times out of ten and low test scores with a 20% probability, while in the low-ability state the students get low test scores nine times out of ten and high test scores with a 10% probability. These probabilities are collected in an emission matrix as illustrated in Table 4.

Table 4 Emission matrix showing the probabilities of each hidden state (low or high ability) emitting each observed state (low or high test scores)

Again, the full HMM is defined by a set of parameters: the initial state probabilities \(\pi _s\), the hidden state transition probabilities \(a_{rs}\), and the emission probabilities of observed states \(b_s(m)\). What is different to the MM is that in the HMM, the initial state probabilities \(\pi _s\) define the probabilities of starting from each hidden state. Similarly, the transition probabilities \(a_{rs}\) define the probabilities of transitioning from one hidden state to another hidden state. The emission probabilities \(b_s(m)\) (collected in an emission matrix B) define the probability of observing a particular state m (e.g., low or high test scores) given the current hidden state s (e.g., low or high ability).

When being in a certain hidden state, observed states occur randomly, following the emission probabilities. Mathematically speaking, instead of assuming the Markov property directly on our observations, we assume that the observations are conditionally independent given the underlying hidden state. We can visualise the HMM as a directed acyclic graph (DAG) as illustrated in Fig. 2. Here Z are the unobserved states (such as ability) which affect the distribution of the observed states Y (test scores). At each time point t, the state \(z_t\) can take one of S possible values (there are two hidden states in the example of low and high ability, so \(S=2\)), which in turn defines how \(Y_t\) is distributed.

Fig. 2

Illustration of the HMM. The nodes \(Z_1\) to \(Z_4\) refer to hidden states at time points 1 to 4, while the nodes \(Y_1\) to \(Y_4\) refer to the corresponding observed states. The arrows indicate dependencies between hidden and/or observed states
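To make the HMM components concrete, the sketch below writes one possible parameterisation in R and simulates sequences from it with seqHMM’s simulate_hmm() function. The emission probabilities correspond to the example above (Table 4); the initial and transition probabilities are illustrative values, not taken from the text.

library(seqHMM)

# Initial probabilities of the hidden states (illustrative)
init <- c("Low ability" = 0.5, "High ability" = 0.5)

# Transition probabilities between hidden states (illustrative)
trans <- matrix(c(0.85, 0.15,
                  0.10, 0.90),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Low ability", "High ability"),
                                c("Low ability", "High ability")))

# Emission probabilities of observed test scores given ability (Table 4)
emiss <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Low ability", "High ability"),
                                c("Low score", "High score")))

# Simulate 100 ten-year sequences of test scores and the underlying abilities
sim <- simulate_hmm(
  n_sequences = 100, initial_probs = init,
  transition_probs = trans, emission_probs = emiss,
  sequence_length = 10
)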

2.4 Mixture Hidden Markov Models

Combining the ideas of both time-constant clusters and time-varying hidden states leads to the concept of the mixture hidden Markov model (MHMM). Here we assume that the population of interest consists of a finite number of subpopulations, each with its own HMM, that is, with its own transition and emission probabilities. For example, we could expect to find underlying groups which behave differently when estimating the progression of ability through the sequence of test scores, such as those that consistently stay on a low-ability or high-ability track (stayers) and those that move between low and high ability (movers). In this case, we need two transition matrices: the stayers’ transition matrix allows for no transitions while the movers’ transition matrix allows for transitioning between low and high ability, as illustrated in Table 5.

Table 5 Two transition matrices showing the probabilities of transitioning from one state of ability to another in two clusters, the Stayers and the Movers. The rows and columns describe the origin state and the destination state, respectively

Similarly, we need two emission matrices that describe how the observed states are related to hidden states, as illustrated in Table 6. In this example, there is a closer match between low/high ability and low/high test scores in the Stayers cluster in comparison to the Movers cluster.

Table 6 Two emission matrices showing the probabilities of each hidden state (low or high ability) emitting each observed state (low or high test scores)

Mathematically, when estimating an MHMM we first fix the number of clusters K and create a joint HMM consisting of K submodels (HMMs). The number of hidden states does not have to be fixed but can vary by submodel, so that the HMMs have more hidden states for some clusters and fewer for others (in our example, because the transition matrix of the Stayers cluster is diagonal, we could also split this cluster into two single-state clusters, one corresponding to low and another to high ability). This can increase the burden of model selection, so often a common number of hidden states is assumed for each cluster for simplicity. In any case, the initial state probabilities of this joint model define how sequences are assigned to different clusters. We estimate this joint model using the whole data and calculate cluster membership probabilities for each individual. The idea of using mixtures of HMMs has appeared in the literature under various names with slight variations, e.g., [6], [7], and [4]. Notably, MHMMs inherit from MMMs the ability to incorporate covariates to predict cluster memberships.

2.5 Multi-Channel Sequences

There are two options for analysing multi-channel (or multi-domain or multi-dimensional) sequence data with Markovian models. The first option is to combine the observed states in different channels into one set of single-channel sequences with an expanded alphabet. This option is simple and works for MMs, HMMs, MMMs, and MHMMs, but can easily lead to complex models, as the size of the expanded alphabet grows quickly with the number of states and channels. The second option, which can only be used with HMMs and MHMMs, is to treat the observed states in each channel as independent given the current hidden state. This can be easily done by defining multiple emission probability matrices, one for each channel. The assumption of conditional independence simplifies the model but is sometimes unrealistic, in which case it is better to resort to the first option and convert the data into single-channel sequences. Both options are discussed further in Chapter 13 [8], a dedicated chapter on multi-channel sequences, where applications of distance-based and Markovian clustering approaches are presented. In this chapter, we henceforth focus on single-channel sequences.

2.6 Estimating Model Parameters

The model parameters, i.e. the elements of the initial probability vectors \(\pi \), transition probability matrices A, and emission probability matrices B, can be estimated from data using various methods. Typical choices are the Baum-Welch algorithm (an instance of the expectation-maximisation, i.e., EM, algorithm) and direct (numerical) maximum likelihood estimation. It is possible to restrict models by setting some parameters to fixed values (typically zeros), for example, to make certain starting states, transitions, or emissions impossible.

After the parameter estimation, in addition to studying the estimated model parameters upon convergence, we can, for example, compute cluster-membership probabilities for each individual and find the most probable paths of hidden state sequences using the Viterbi algorithm [9]. These can be further analysed and visualised for interpretation.

3 Review of the Literature

Markovian methods have been used across several domains in education and have gained renewed interest with the surge in learning analytics and educational data mining. Furthermore, the introduction of specialised R packages (e.g., seqHMM [10]) and software applications (e.g., Mplus [11, 12]) has made it easier to implement Markov models. One of the most common applications of Markovian methods is the clustering of sequence data [13,14,15]. Markov models offer a credible alternative to existing distance-based methods (e.g. optimal matching) and can be used with different sequence types (e.g. multi-channel sequences). Furthermore, Markovian methods offer some advantages in clustering sequential data, such as the inclusion of covariates that can explain why a sequence emerged (e.g., [16]). More importantly, Markovian models are relatively scalable and can be used to cluster large sequence datasets [17]. As Saqr et al. [17] noted, large numbers of sequences are hard to cluster using standard methods such as hierarchical clustering, which is memory-inefficient and hard to parallelise or scale [18, 19]. Furthermore, distance-based clustering methods are limited by the theoretical maximum dimension of a matrix in R, which is 2,147,483,647 elements (i.e., a maximum of roughly 46,340 sequences in a pairwise dissimilarity matrix). In such cases, Markovian methods may be the solution.

Examples of Markovian methods in clustering sequences are plentiful. For example, HMMs have been used to cluster students’ sequences of learning management system (LMS) trace data to detect their patterns of activities, or what the authors referred to as learning tactics and strategies [15]. Another close example is that of López-Pernas and Saqr [20], who used HMMs to cluster multi-channel data of students’ learning strategies across two different tools (an LMS and an automated assessment tool). Other examples include using HMMs to cluster sequences of students’ engagement states [21], sequences of students’ collaborative roles [16], or sequences of self-regulation [13, 14].

Markovian methods are also popular in studying transitions and have therefore been used across several applications and with different types of data. One of the most common usages is what is known as stochastic process mining, which typically uses first-order Markov models to map students’ transitions between learning activities. For example, Matcha et al. [22] used first-order Markov models to study students’ processes of transitions between different learning tactics. Other uses include studying the transitions between tactics of academic writing [23], between self-regulated learning events [24], or within collaborative learning settings [25]. Yet, most of such work has been performed with the pMiner R package [26], which was recently removed from The Comprehensive R Archive Network (CRAN) due to slow updates and incompatibility with existing guidelines. This chapter offers a modern alternative with flexible methods for fitting, plotting, and clustering stochastic process mining models, as well as the possibility to add covariates to understand “why” different transition patterns emerged.

Indeed, transition analysis in general has been a popular use of Markovian models across several studies, for instance, in the analysis of temporal patterns of students’ activities in online learning (e.g., [27]), transitions between latent states [28], or transitions between assignment submission patterns [29].

4 Examples

As a first step, we will import all the packages required for our analyses. We have used most of them throughout the book. Below is a brief summary:

  • qgraph: A package for visualising networks, which can be used to plot transition probabilities [30]. This is used only for the process mining application in Sect. 4.3.

  • rio: A package for reading and saving data files with different extensions [31].

  • seqHMM: A package designed for fitting hidden (latent) Markov models and mixture hidden Markov models for social sequence data and other categorical time series [4, 32].

  • tidyverse: A package that encompasses several basic packages for data manipulation and wrangling [33].

  • TraMineR: As seen in the introductory sequence analysis chapter, this package helps us construct, analyse, and visualise sequences from time-ordered states or events [34].

library(qgraph)
library(rio)
library(seqHMM)
library(tidyverse)
library(TraMineR)

Henceforth, we divide our examples into two parts: the first largely focuses on traditional uses of the seqHMM package to fit Markovian models of varying complexity to sequence data; the second presents a demonstration of Markovian models from the perspective of process mining. We outline the steps involved in using seqHMM in general in Sect. 4.1, demonstrate the application of MMs, HMMs, MMMs, and MHMMs in Sect. 4.2, and explore process mining using Markovian models in Sect. 4.3, leveraging many of the steps and much of the code from the previous two sections. We note that different datasets are used in Sects. 4.2 and 4.3; we begin by importing the data required for Sect. 4.2 and defer the importing of the data used in the process mining application to the later section.

With this in mind, we start by using the import() function from the rio package to import our sequence data. Based on the description of the MHMM in [35], we used the seqHMM package to simulate a synthetic dataset (simulated_data) consisting of students’ collaboration roles (obtained from [36]) on different courses across a whole program. As in the original data, the simulation was based on a two-channel model (collaboration and achievement), but we only use the collaboration sequences in the following examples and leave the multi-channel sequence analysis to Chapter 13 [8]. While not available in the original study, we also simulated each student’s high school grade point average (GPA, for simplicity categorised into three levels), which will be used to predict cluster memberships. Using these data, we show how the seqHMM package can be used to analyse such sequences. We start with the simple MM, and then transition to HMMs and their mixtures. To be able to use the seqHMM functions, we need to convert the imported data to a sequence object using the function seqdef() from the TraMineR package (see Chapter 10 [1] for more information about creating stslist objects). We subsequently assign a colour palette to each state in the alphabet for later visualisations using the function cpal(). Finally, we also extract the covariate information separately (cov_data).

The code imports the simulated data (simulated_data) with import() from a URL and builds the sequence object roles_seq with seqdef(), specifying the sequence variables, the alphabet, and the state names.

 [>] 3 distinct states appear in the data:

     1 = Isolate

     2 = Leader

     3 = Mediator

 [>] state coding:

       [alphabet]  [label]  [long label]

     1  Isolate     Isolate  Isolate

     2  Mediator    Mediator Mediator

     3  Leader      Leader   Leader

 [>] 200 sequences in the data set

 [>] min/max sequence length: 20/20

The colour palette of the states is then assigned with cpal(), and the covariate data cov_data are created by selecting the relevant columns and converting GPA into a factor with levels Low, Middle, and High.
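Since the code itself is not reproduced above, the following sketch shows how this setup might look. The file location, the colour values, and the assumption that the roles occupy the first 20 columns are placeholders rather than the original code; only the state names and the GPA levels are taken from the text.

library(rio)
library(TraMineR)
library(tidyverse)

# Import the simulated data (hypothetical location; replace with the actual URL)
simulated_data <- import("https://example.com/simulated_data.csv")

# Build the sequence object from the role columns (assumed here to be columns 1-20)
roles_seq <- seqdef(simulated_data[, 1:20],
                    alphabet = c("Isolate", "Mediator", "Leader"))

# Assign a colour to each state for later visualisations (colours are placeholders)
cpal(roles_seq) <- c("#FBCE4B", "#F67067", "#5C2262")

# Extract the covariate data: student ID and GPA level as a factor
cov_data <- simulated_data %>%
  select(ID, GPA) %>%
  mutate(GPA = factor(GPA, levels = c("Low", "Middle", "High")))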

4.1 Steps of Estimation

We will first briefly introduce the steps of the analysis with the seqHMM package and then show examples of estimating MMs, HMMs, MMMs, and MHMMs.

4.1.1 Defining the Model Structure

First, we need to create the model object which defines the structure of the model. This can be done by using one of the model building functions of seqHMM. The build functions include build_mm() for constructing the simple MM, build_hmm() for the HMM, build_mmm() for the MMM, and build_mhmm() for the MHMM. The user needs to give the build function the sequence data and the number of hidden states and/or clusters (when relevant). The user can also set restrictions on the models, for example, to forbid some transitions by setting the corresponding transition probabilities to zero. To facilitate the estimation of the parameters of more complex models, the user may also set informative starting values for model parameters.

4.1.2 Estimating the Model Parameters

After defining the model structure, the model parameters need to be estimated. The fit_model() function estimates model parameters using maximum likelihood estimation. The function has several arguments for configuring the estimation algorithms. For simple models the default arguments tend to work well enough, but for more complex models the user should adjust the algorithms. This is because the more parameters the algorithm needs to estimate, the higher the risk of not finding the model with the optimal parameter values (the ones which maximise the likelihood).

In order to reduce the risk of being trapped in a local optimum of the likelihood surface (instead of a global optimum), we advise estimating the model numerous times using different starting values for the parameters. The seqHMM package strives to automate this. One option is to run the EM algorithm multiple times with random starting values for any or all of the initial, transition, and emission probabilities. These are specified in the control_em argument. Although not done by default, this method seems to perform very well as the EM algorithm is relatively fast. Another option is to use a global direct numerical estimation method such as the multilevel single-linkage method. See [4] for more detailed information on model estimation.
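As a minimal sketch of this strategy (the model object name my_model and the number of restarts are arbitrary here), the EM algorithm can be restarted from randomised values as follows:

# Re-estimate a previously built model 50 times from randomised starting values;
# fit_model() keeps the solution with the highest log-likelihood
fit <- fit_model(my_model, control_em = list(restart = list(times = 50)))

# Log-likelihoods of the best estimation rounds; ideally the top values coincide
fit$em_results$best_opt_restart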

4.1.3 Examining the Results

The output of fit_model() contains the estimated model (stored, e.g., in fit_hmm$model) as well as information about the estimation of the model, such as the log-likelihood of the final model (fit_hmm$logLik). The print() method provides information about the estimated model in a written format, while the plot() method visualises the model parameters as a graph. For HMMs and MHMMs, we can calculate the most probable sequence of hidden states for each individual with the hidden_paths() function. Sequences of observed and hidden states can be plotted with the ssplot() function for MMs and HMMs and with the mssplot() function for MMMs and MHMMs. For MMMs and MHMMs, the summary() method automatically computes some features of the models such as standard errors for covariates and prior and posterior cluster membership probabilities for the subjects.

4.2 Markov Models

We now follow the steps outlined above for each model in turn, starting from the most basic Markov model, proceeding through a hidden Markov model and a mixture Markov model, and finally concluding with a mixture hidden Markov model.

4.2.1 Markov Model

We focus on the sequences of collaboration roles, collected in the roles_seq object. The build_mm() function only takes one argument, observations, which should be an stslist object created with the seqdef() function from the TraMineR package as mentioned before. We can build a MM as follows:

markov_model <- build_mm(roles_seq)

For the MM, the build_mm() function estimates the initial probabilities and the transition matrix. Note that the build_mm() function is the only build function that automatically estimates the parameters of the model. This is possible because for the MM the estimation is a simple calculation while for the other types of models the estimation process is more complex. The user can access the estimated parameters by calling markov_model$initial_probs and markov_model$transition_probs or view them by using the print method of the model:

print(markov_model)

Initial probabilities :
 Isolate Mediator   Leader 
   0.375    0.355    0.270 

Transition probabilities :
          to
from       Isolate Mediator Leader
  Isolate   0.4231    0.478 0.0987
  Mediator  0.1900    0.563 0.2467
  Leader    0.0469    0.428 0.5252

We can see that the initial state probabilities are relatively uniform, with a slightly lower probability of starting in the Leader state. In terms of the transition probabilities, the most distinct feature is that it is rare to transition directly from the Leader state to Isolate and vice versa (the estimated probabilities are about 5% and 10%, respectively). It is also more common to drop from Leader to Mediator (43%) than to increase collaboration from Mediator to Leader (25%). Similarly, the probability of moving from Mediator to Isolate is only 19%, but there is a 48% chance of transitioning from Isolate to Mediator.

We can also draw a graph of the estimated model using the plot() method which by default shows the states as pie graphs (for the MM, the pie graphs only consist of one state), transition probabilities as directed arrows, and initial probabilities below each state (see Fig. 3).

Fig. 3

Estimated Markov model shown as pie charts, with the transition probabilities as labelled edges and the initial probabilities below each state

The plot is produced with plot(), adjusting the legend.prop, ncol.legend, edge.label.color, and vertex.label.color arguments.
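A sketch of such a call is shown below; the argument values are chosen for readability and may differ from those used to produce Fig. 3.

plot(markov_model,
     legend.prop = 0.4,             # proportion of the plot area used by the legend
     ncol.legend = 1,               # number of columns in the legend
     edge.label.color = "black",    # colour of the transition probability labels
     vertex.label.color = "black")  # colour of the state labels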

4.2.2 Hidden Markov Models

The structure of an HMM is set with the build_hmm() function. In contrast to build_mm(), other build_*() functions such as build_hmm() do not directly estimate the model parameters. For build_hmm(), in addition to observations (an stslist), we need to provide the n_states argument which tells the model how many hidden states to construct. Using again the collaboration roles sequences, if we want to estimate an HMM with two hidden states, we can write:

set.seed(1)
hidden_markov_model <- build_hmm(observations = roles_seq, n_states = 2)

The set.seed() call ensures that we will always end up with the same exact initial model with hidden states in the same exact order even though we use random values for the initial parameters of the model (which is practical for reproducibility). We are now ready to estimate the model with the fit_model() function. The HMM we want to estimate is simple, so we rely on the default values and again use the print() method to provide information about the estimated model:

fit_hmm <- fit_model(hidden_markov_model)
fit_hmm$model

Initial probabilities :
State 1 State 2 
  0.657   0.343 

Transition probabilities :
         to
from      State 1 State 2
  State 1  0.9089  0.0911
  State 2  0.0391  0.9609

Emission probabilities :
           symbol_names
state_names Isolate Mediator Leader
    State 1  0.4418    0.525 0.0336
    State 2  0.0242    0.478 0.4980

The estimated initial state probabilities show that it is more probable to start from hidden state 1 than from hidden state 2 (66% vs. 34%). The high transition probabilities on the diagonal of the transition matrix indicate that the students typically tend to stay in the hidden state they currently are in. Transition probabilities between the hidden states are relatively low and also asymmetric: it is more likely that students move from state 1 to state 2 than from state 2 to state 1. Looking at the emission matrices, we see that the role of the students in state 2 is mostly Leader or Mediator (emission probabilities are 50% and 48%). On the other hand, state 1 captures more of those occasions where students are isolated or exhibit at most a moderate level of participation (mediators). We can also visualise this with the plot() method of seqHMM (see Fig. 4):

Fig. 4

HMM with two hidden states (pie charts), with transitions between hidden states shown as labelled edges

The plot is again produced with plot(), adjusting legend.prop, ncol.legend, edge.label.color, and vertex.label.color.

The plot mainly shows the same information as the printed output. By default, to simplify the graph, the plotting method combines all states with emission probabilities of less than 5% into one category. This threshold can be changed with the combine.slices argument (setting combine.slices = 0 plots all states).

For simple models, using n_states is sufficient. It automatically draws random starting values that are then used for the estimation of model parameters. However, as parameter estimation of HMMs and mixture models can be sensitive to starting values of parameters, it may be beneficial to provide starting values manually using the initial_probs, transition_probs, and emission_probs arguments. This is also necessary in case we want to define structural zeros for some of these components, e.g., if we want to restrict the initial probabilities so that each sequence starts from the same hidden state, or if we want to set the lower diagonal part of the transition matrix to zero, which means that the model does not allow transitioning back to previous states (this is called a left-to-right model) [4]. It is also possible to mix random and user-defined starting values by using simulate_*() functions (e.g. simulate_transition_probs()) for some of the model components and user-defined values for others.

In the following example we demonstrate estimating a three-state HMM with user-defined starting values for the initial state probabilities and the transition matrix but simulate starting values for the emission matrices. For simulating starting values with simulate_emission_probs(), we need to define the number of hidden states, and the number of observed symbols, i.e., the length of the alphabet of the sequences.

The code sets the seed for randomisation, defines the initial state probability vector (which must sum to one) and a 3 × 3 transition matrix (each row summing to one), simulates the emission probabilities, and builds the HMM.
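A sketch of this setup is shown below. The seed, the numeric starting values, and the object name hidden_markov_model_2 are illustrative choices (the original values are not shown above); they are, however, chosen to be consistent with the description that follows.

set.seed(321)

# Starting values for the initial state probabilities (must sum to one)
init_probs <- c(0.3, 0.4, 0.3)

# Starting values for the 3 x 3 transition matrix (each row sums to one)
trans_probs <- matrix(c(0.80, 0.15, 0.05,
                        0.20, 0.60, 0.20,
                        0.05, 0.15, 0.80),
                      nrow = 3, byrow = TRUE)

# Simulate starting values for the emission probabilities:
# 3 hidden states and 3 observed symbols (the length of the alphabet)
emiss_probs <- simulate_emission_probs(n_states = 3, n_symbols = 3)

# Build (but do not yet estimate) the three-state HMM
hidden_markov_model_2 <- build_hmm(
  observations = roles_seq,
  initial_probs = init_probs,
  transition_probs = trans_probs,
  emission_probs = emiss_probs
)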

Our initial probabilities suggest that it is slightly more likely to start from the second hidden state than the first and the third. Furthermore, the starting values for the transition matrices suggest that staying in hidden states 1 and 3 is more likely than staying in hidden state 2. All non-zero probabilities are, however, mere suggestions and will be estimated with the fit_model() function. We now estimate this model 50 times with the EM algorithm using randomised starting values:

The code sets the seed and re-estimates the model with fit_model(), using the control_em argument with restart and times to request the 50 randomised restarts, storing the result in fit_hmm_2.

We can get the information on the EM estimation as follows:

fit_hmm_2$em_results

$logLik
[1] -3546.155

$iterations
[1] 488

$change
[1] 9.947132e-11

$best_opt_restart
 [1] -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155
 [8] -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155
[15] -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155 -3546.155
[22] -3546.155 -3546.155 -3546.155 -3546.155

The logLik element gives the log-likelihood of the final model. This value has no meaning on its own, but it can be used to compare HMMs estimated from the same data with the same model structure (e.g., when estimating the same model from different starting values). The iterations and change elements give information on the last EM estimation round: how many iterations were used until the (local) optimum was found and what the change in the log-likelihood was at the final step.

The most interesting element is the last one: best_opt_restart shows the log-likelihoods of the 25 (by default) best estimation rounds. We advise always checking these to make sure that the best model was found several times from different starting values: this way we can be fairly certain that we have found the actual maximum likelihood estimates of the model parameters (global optimum). In this case all of the 25 log-likelihood values are identical, meaning that it is likely that we have found the best possible model among all HMMs with three hidden states.

The model is plotted with plot(), adjusting legend.prop, ncol.legend, edge.label.color, vertex.label.color, combine.slices, and trim.

Interpreting the results in Fig. 5 we see that the first hidden state represents about equal amounts of isolate and mediator roles, the second hidden state represents mainly Leaders and some Mediator roles, and the third hidden state represents mainly Mediator roles and partly Leader roles. Interestingly, none of the students start as Mediator/Leader, while of the other two the Isolate/Mediator state is more typical (two thirds). There are no transitions from the first to the second state nor vice versa, and transition probabilities to the third state are considerably higher than away from it. In other words, it seems that the model has two different origin states and one destination state.

Fig. 5

HMM with three hidden states (pie charts, with initial probabilities 0.69, 0.31, and 0), with transitions between hidden states shown as labelled edges

We can visualise the observed and/or hidden state sequences with the ssplot() function. The ssplot() function can take an stslist object or a model object of class mm or hmm (see Fig. 6). Here we want to plot full sequence index plots (type = "I") of both observed and hidden states (plots = "both") and sort the sequences using multidimensional scaling of the hidden states (sortv = "mds.hidden"). See the seqHMM manual and visualisation vignette for more information on the different plotting options.

Fig. 6

Observed and hidden state sequences (n = 200) from the HMM with three hidden states

The code plots a full sequence index plot of both the observed and hidden state sequences, sorts the sequences by the scores of a multidimensional scaling of the hidden states, and sets the x-axis tick labels.
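A sketch of this call, using the options described above (the x-axis labels are assumed to be the 20 positions of the sequences):

ssplot(
  fit_hmm_2$model,
  type = "I",            # full sequence index plot
  plots = "both",        # plot both observed and hidden state sequences
  sortv = "mds.hidden",  # sort sequences by MDS scores of the hidden states
  xtlab = 1:20           # x-axis tick labels
)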

By looking at the sequences, we can see that even though none of the students start in hidden state 3, the majority of them transition there. In the end, most students end up alternating between mediating and leadership roles.

Is the three-state model better than the two-state model? As already mentioned, we can use model selection criteria to test that. To make sure that the three-state model is the best, we also estimate an HMM with four hidden states and then use the Bayesian information criterion for comparing the three models. Because the four-state model is more complex, we increase the number of re-estimation rounds for the EM algorithm to 100.

The code sets the seed for randomisation and builds and estimates an HMM with four hidden states (fit_hmm_3) using 100 randomised EM restarts; the best log-likelihoods are:

 [1] -3534.304 -3534.304 -3534.304 -3534.304 -3534.304 -3534.304 -3534.304
 [8] -3534.304 -3534.304 -3534.304 -3534.304 -3534.304 -3534.304 -3534.305
[15] -3534.305 -3534.306 -3534.308 -3534.310 -3534.332 -3534.335 -3534.335
[22] -3534.335 -3534.336 -3534.337 -3534.337

The best model was found only 13 times out of 101 estimation rounds from randomised starting values. A cautious researcher might be wise to opt for a higher number of estimation rounds for increased certainty, but here we will proceed to calculating the BIC values.

BIC(fit_hmm$model)

[1] 7430.028

BIC(fit_hmm_2$model)

[1] 7208.427

BIC(fit_hmm_3$model)

[1] 7259.37

Generally speaking, the lower the BIC, the better the model. We can see that the three-state model (fit_hmm_2) has the lowest BIC value, so three hidden states is the best choice (at least among HMMs with two to four hidden states).

4.2.3 Mixture Markov Models

The MMM can be defined with the build_mmm() function. Similarly to HMMs, we need to either give the number of clusters with the n_clusters argument, which generates random starting values for the parameter estimates, or give starting values manually as initial_probs and transition_probs. Here we use random starting values:

The code sets the seed for randomisation and defines the model structure with three clusters using build_mmm(), storing the result in mmm.
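A sketch of this step (the seed value is arbitrary):

set.seed(123)

# Define the MMM structure with three clusters and random starting values
mmm <- build_mmm(roles_seq, n_clusters = 3)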

Again, the model is estimated with the fit_model() function:

fit_mmm <- fit_model(mmm)

The results for each cluster can be plotted one at a time (interactively, the default), or in one joint figure. Here we opt for the latter (see Fig. 7). At the same time we also illustrate some other plotting options:

Fig. 7

MMM with three clusters, showing the transitions between the Isolate, Mediator, and Leader states within each cluster

The code plots all clusters at the same time, setting the number of rows and columns for the cluster plots, choosing another layout for the vertices, omitting the pie graphs from the vertices, and setting the state colours.

In Cluster 1 (Fig. 7), we see low probabilities for downward mobility and high probabilities for upward mobility, so this cluster describes leadership trajectories. In Cluster 2, the thickest arrows lead to the mediator and isolate roles, so this cluster describes trajectories with less central roles in collaboration. In Cluster 3, we see the highest transition probabilities for entering the mediator role but also some transitions from mediator to leader, so this cluster describes trajectories with more moderate levels of participation in comparison to Cluster 1. This behaviour is easier to see when visualising the sequences in their most probable clusters: the following code plots the state distribution plot of each cluster (Fig. 8). The plot is interactive, so we need to hit ‘Enter’ on the console to generate each plot. Alternatively, we can specify which cluster we want to plot using the which.plots argument.

Fig. 8

State distribution plots by most probable clusters estimated with the mixture Markov model. (a) Cluster 1 (n = 48). (b) Cluster 2 (n = 87). (c) Cluster 3 (n = 65)

The code produces the state distribution plot for each cluster, setting the y-axis, the legend position, the number of legend columns, and the y-axis label.

We can add covariates to the model to explain the cluster membership probabilities. For this, we need to provide a data frame (argument data) and the corresponding formula (argument formula). In the example we use the data frame cov_data that we created at the beginning of the tutorial, with columns ID and GPA, where the order of the ID variable matches that of the sequence data roles_seq (note that the ID variable is not used in the model building, so the user needs to make sure that both datasets are sorted by ID). We can now use the information about students’ GPA level as a predictor of the cluster memberships.

Numerical estimation of complex models from random starting values may lead to convergence issues and other problems in the estimation (you may, for example, get warnings about the EM algorithm failing). To avoid such issues, giving informative starting values is often helpful. This model is more complex than the model without covariates and estimation from random starting values leads to convergence issues (not shown here). To facilitate model estimation, we use the results from the previous MMM as informative starting values. Here we also remove the common intercept by adding 0 to the formula, which simplifies the interpretation of the covariate effects later (instead of comparing to a reference category, we get separate coefficients for each of the three GPA categories).

The code provides starting values for the initial probabilities and the transition probabilities (taken from the fitted MMM above), the data frame for the covariates (data = cov_data), and the formula for the covariates (formula = ~ 0 + GPA), building the new model mmm_2.
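A sketch of how this model definition might look, assuming the starting values are extracted from the fitted fit_mmm object:

mmm_2 <- build_mmm(
  roles_seq,
  # Starting values for the initial probabilities (from the previous MMM)
  initial_probs = fit_mmm$model$initial_probs,
  # Starting values for the transition probabilities (from the previous MMM)
  transition_probs = fit_mmm$model$transition_probs,
  # Covariate data and formula; adding 0 removes the common intercept
  data = cov_data,
  formula = ~ 0 + GPA
)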

Again, the model is estimated with the fit_model() function. Here we use the EM algorithm with 50 restarts from random starting values:

The model is estimated with the EM algorithm with randomised restarts (control_em = list(restart = list(times = 50))), storing the log-likelihood values from all 50 + 1 estimation rounds in fit_mmm_2.

Warning in fit_model(mmm_2, control_em = list(restart = list(times = 50, :
EM algorithm failed: Estimation of gamma coefficients failed due to singular Hessian.

The model was estimated 50 + 1 times (first from the starting values we provided and then from 50 randomised values). We get one warning about the EM algorithm failing. However, 50 estimation rounds were successful. We can check that the best model was found several times from different starting values (37 times, to be precise):

fit_mmm_2$em_results$best_opt_restart

 [1] -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627
 [8] -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627
[15] -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627
[22] -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627
[29] -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627 -3614.627
[36] -3614.627 -3614.627 -3619.695 -3624.547 -3624.547 -3624.547 -3624.547
[43] -3624.547 -3624.547 -3624.547 -3624.547 -3624.547 -3624.547 -3631.328
[50] -3637.344      -Inf

We can now be fairly certain that the optimal model has been found, and can proceed to interpreting the results. The clusters are very similar to what we found before. We can give the clusters more informative labels and then show state distribution plots in each cluster in Fig. 9:

Fig. 9

State distribution plots by most probable clusters estimated with the mixture Markov model with covariates. (a) Mainly leader (n = 49). (b) Isolate/mediator (n = 87). (c) Mediator/leader (n = 64)

The code first assigns the new cluster names to the model and then plots the state distributions of the clusters with the legend at the bottom.
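A sketch of these two steps; the legend position follows Fig. 9, while any further plotting arguments of the original call are omitted here:

# Rename the clusters according to their interpretation
cluster_names(fit_mmm_2$model) <- c("Mainly leader", "Isolate/mediator",
                                    "Mediator/leader")

# State distribution plots of the most probable clusters, legend at the bottom
mssplot(fit_mmm_2$model, with.legend = "bottom")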

The model summary shows information about parameter estimates of covariates and prior and posterior cluster membership probabilities (these refer to cluster membership probabilities before or after conditioning on the observed sequences, respectively):

The code assigns the model summary to an object and displays its contents:

Covariate effects :
Mainly leader is the reference.

Isolate/mediator :
           Estimate  Std. error
GPALow       1.9221       0.478
GPAMiddle    0.3901       0.314
GPAHigh     -0.0451       0.277

Mediator/leader :
           Estimate  Std. error
GPALow        1.670       0.487
GPAMiddle     0.411       0.312
GPAHigh      -0.667       0.332

Log-likelihood: -3614.627   BIC: 7461.487

Means of prior cluster probabilities :
   Mainly leader Isolate/mediator  Mediator/leader 
           0.244            0.425            0.331 

Most probable clusters :
            Mainly leader  Isolate/mediator  Mediator/leader
count                  49                87               64
proportion          0.245             0.435             0.32

Classification table :
Mean cluster probabilities (in columns) by the most probable cluster (rows)

                 Mainly leader Isolate/mediator Mediator/leader
Mainly leader          0.91758          0.00136          0.0811
Isolate/mediator       0.00081          0.89841          0.1008
Mediator/leader        0.05902          0.10676          0.8342

We will first interpret the information on prior and posterior cluster membership probabilities and then proceed to interpreting the covariate effects. Firstly, the means of the prior cluster probabilities give information on how likely each cluster is in the whole population of students (24% for Mainly leader, 43% for Isolate/mediator, and 33% for Mediator/leader). Secondly, Most probable clusters shows the group sizes and proportions if each student were classified into the cluster for which they have the highest cluster membership probability.

Thirdly, the Classification table shows the mean cluster probabilities (in columns) by the most probable cluster (in rows). We can see that the clusters are fairly crisp (the certainty of the cluster memberships is fairly high) because the membership probabilities are large on the diagonal of the table. The uncertainty of the classification is the highest for the Mediator/leader cluster (among those that had the highest membership probability in that cluster, the average cluster memberships were 83% for the Mediator/leader cluster, 6% for the Mainly leader cluster, and 11% for the Isolate/mediator cluster) and the lowest for the Mainly leader cluster (92% for the Mainly leader cluster, 8% for the Mediator/leader cluster, and 0.1% for the Isolate/mediator cluster).

The part titled Covariate effects shows the parameter estimates for the covariates. Interpretation of the values is similar to that of multinomial logistic regression, meaning that we can interpret the direction and uncertainty of the effect—relative to the reference cluster Mainly leader—but we cannot directly interpret the magnitude of the effects (the magnitudes are on log-odds scale). We can see that individuals with low GPA more often end up in the Isolate/mediator cluster and the Mediator/leader cluster in comparison to the Mainly leader cluster (i.e., the standard errors are small in comparison to the parameter estimates), while individuals with high GPA levels end up in the Mediator/leader cluster less often but are not more or less likely to end up in the Isolate/mediator cluster. For categorical covariates such as our GPA variable, we can also easily compute the prior cluster membership probabilities from the estimates with the following call:

exp(fit_mmm_2$model$coefficients) / rowSums(exp(fit_mmm_2$model$coefficients))

          Mainly leader Isolate/mediator Mediator/leader
GPALow       0.07605453        0.5198587       0.4040868
GPAMiddle    0.25090105        0.3705958       0.3785031
GPAHigh      0.40497185        0.3870997       0.2079285

The matrix shows the levels of the covariate in the rows and the clusters in the columns. Among the high-GPA students, 40% (0.40497) are classified as Mainly leaders, 39% as Isolate/mediators, and 21% as Mediator/leaders. Among middle-GPA students the classification is relatively uniform (25% Mainly leaders, 37% Isolate/mediators, and 38% Mediator/leaders), whereas most of the low-GPA students are classified as Isolate/mediators or Mediator/leaders (52% and 40%, respectively).

The summary object also calculates prior and posterior cluster memberships for each student. We omit them here, for brevity, but demonstrate that they can be obtained as follows:

The code assigns the prior and posterior cluster probabilities from the summary object to separate variables.
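Assuming the summary was stored in an object called summary_mmm_2 (the variable names here are illustrative), the probabilities can be extracted as follows:

summary_mmm_2 <- summary(fit_mmm_2$model)

# Cluster membership probabilities before conditioning on the observed sequences
prior_probs <- summary_mmm_2$prior_cluster_probabilities

# Cluster membership probabilities after conditioning on the observed sequences
posterior_probs <- summary_mmm_2$posterior_cluster_probabilities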

4.2.4 Mixture Hidden Markov Models

Finally, we will proceed to the most complex of the models, the MHMM.

For defining an MHMM, we use the build_mhmm() function. Again, we can use the argument n_states, which is now a vector giving the number of hidden states in each cluster (the length of the vector defines the number of clusters). We will begin by estimating an MHMM with three clusters, each with two hidden states:

The code sets a seed for random number generation, builds the MHMM (mhmm) with build_mhmm(), using GPA from cov_data as the covariate, and attempts to estimate it with fit_model():

Error in fit_model(mhmm): EM algorithm failed: Estimation of gamma coefficients failed due to singular Hessian.

In this case, we get an error message about the EM algorithm failing. This means that the algorithm was not able to find parameter estimates from the random starting values the build_mhmm() function generated and we need to adjust our code.

Starting values for the parameters of the MHMM can be given with the arguments initial_probs, transition_probs, and emission_probs. For the MHMM, these are lists of vectors and matrices, one for each cluster. We use the same number of hidden states (two) for each cluster. We define the initial values for the transition and emission probabilities as well as regression coefficients ourselves. We also restrict the initial state probabilities so that in each cluster every student is forced to start from the same (first) hidden state.

The code sets the initial probabilities, defines the transition probabilities, simulates the emission probabilities, and builds the model structure with build_mhmm().
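A sketch of this model definition is given below. The seed, the transition matrix starting values, and the object name mhmm_2 are illustrative (the fitted object is later called fit_mhmm_2); the emission probabilities are simulated with simulate_emission_probs(), and starting values for the regression coefficients could additionally be supplied via the coefficients argument.

set.seed(1)

# Each cluster starts from its first hidden state (structural zeros elsewhere)
init_probs <- list(c(1, 0), c(1, 0), c(1, 0))

# Starting values for the 2 x 2 transition matrices, one per cluster (illustrative)
trans_probs <- replicate(3,
                         matrix(c(0.9, 0.1,
                                  0.1, 0.9), nrow = 2, byrow = TRUE),
                         simplify = FALSE)

# Simulated starting values for the emission probabilities:
# 2 hidden states, 3 observed symbols, 3 clusters
emiss_probs <- simulate_emission_probs(n_states = 2, n_symbols = 3, n_clusters = 3)

# Build the MHMM with GPA predicting cluster membership (no common intercept)
mhmm_2 <- build_mhmm(
  observations = roles_seq,
  initial_probs = init_probs,
  transition_probs = trans_probs,
  emission_probs = emiss_probs,
  data = cov_data,
  formula = ~ 0 + GPA
)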

Now that we have built the MHMM, we can estimate its parameters:

The code sets the random number generator seed, suppresses warnings during the estimation, and fits the model with fit_model(), assigning the result to fit_mhmm_2; the control parameters for the EM algorithm (the randomised restarts) are defined in the control_em argument.

We can now check how many times the log-likelihood values occurred in the 101 estimations:

table(round(fit_mhmm_2$em_results$best_opt_restart, 2))

    -Inf -3672.25 -3595.82 -3588.58 -3584.14 -3526.42 -3525.06 -3519.53 
      56        2        1        3        1        4        1        2 
 -3519.5 -3519.24 
      15       16 

The best model was found 16 times out of the 101 estimations (the value -Inf corresponds to runs where the estimation failed). The second-best solution has a log-likelihood of -3519.5, which is almost indistinguishable from that of the optimal model (-3519.24).

We will start interpreting the model by looking at the sequence plots in each cluster (see Fig. 10). The function call is interactive. As before, if you only want to plot one cluster, you can use the which.plots argument:

Fig. 10
2 rows of 3 sequence index distribution plots plot observation versus clusters and hidden states versus hidden states, respectively. The sequence distribution plots are labeled cluster 1, n = 64, cluster 2, n = 38, and cluster 3, n = 98.

MHMM estimated sequence distribution plot with hidden states. (a) Cluster 1. (b) Cluster 2. (c) Cluster 3

A programming code uses a function to generate a plot, type, and sort the parameters, along with legend and legend properties.
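
The plotting code itself is not reproduced here; a sketch using seqHMM's mssplot(), where the specific graphical arguments are illustrative rather than the original settings, could be:

# Sequence plots of observed states and hidden state paths for each cluster
mssplot(
  fit_mhmm_2$model,
  plots = "both",        # show observed sequences and hidden state paths
  type = "I",            # sequence index plot
  sortv = "from.start",  # sort sequences by their first states
  legend.prop = 0.4
)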

We can also visualise the model parameters in each cluster (see Fig. 11):

Fig. 11
3 pairs of pie charts titled clusters 1, 2, and 3 have a legend at the bottom. The pie charts on the left and right are labeled 1 and 0, respectively. Each cluster has a link labeled as follows, Cluster 1, 0.013, and 0.097, Cluster 2, 0.14, and Cluster 3, 0.083, and 0.13.

Transitions between states for each trajectory

A programming code defines the different parameters needed to plot the size of the vertex, label color, and edge color of a model.
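
A sketch of the corresponding plotting call, with illustrative graphical arguments (the vertex and edge settings are passed on to the underlying igraph plotting), might be:

# Plot the hidden states (pies of emission probabilities) and transitions of each cluster
plot(fit_mhmm_2$model,
     interactive = FALSE,
     vertex.size = 40,
     vertex.label.color = "black",
     edge.color = "gray")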

Based on the two plots, we can determine that Cluster 1 describes students who start as leaders but then transition to alternating between mediator and leader. Cluster 2 describes students who start by alternating between isolate and mediator roles and then mainly transition to alternating between mediator and leader roles. Cluster 3 describes students who start as alternating between isolate and mediator roles, after which they transition between isolate/mediator and mediator/leader.

A programming code uses an assignment operator to assign the cluster names downward transition, upward transition, and alternating to a model.
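
A sketch of this renaming, using seqHMM's cluster_names() replacement function, is:

# Rename the clusters according to our interpretation of the plots
cluster_names(fit_mhmm_2$model) <- c(
  "Downward transition", "Upward transition", "Alternating"
)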

With summary(fit_mhmm_2$model) we get the parameter estimates and standard errors for the covariates and information about clustering:

summary(fit_mhmm_2$model)

Covariate effects :
Downward transition is the reference.

Upward transition :
           Estimate  Std. error
GPALow       -0.455       0.464
GPAMiddle     0.440       0.310
GPAHigh      -2.743       0.727

Alternating :
           Estimate  Std. error
GPALow       1.3560       0.324
GPAMiddle    0.3461       0.316
GPAHigh      0.0468       0.250

Log-likelihood: -3519.243   BIC: 7237.543

Means of prior cluster probabilities :
Downward transition   Upward transition         Alternating 
              0.302               0.181               0.517 

Most probable clusters :
            Downward transition  Upward transition  Alternating
count                        61                 30          109
proportion                0.305               0.15        0.545

Classification table :
Mean cluster probabilities (in columns) by the most probable cluster (rows)

                    Downward transition Upward transition Alternating
Downward transition             0.95727            0.0267      0.0161
Upward transition               0.03007            0.8037      0.1662
Alternating                     0.00975            0.0962      0.8940

We can see that the prior probabilities of belonging to each cluster are very different: about half of the students can be described as Alternating, while among the rest, a downward transition is more typical (31%). Based on the classification table, the Downward transition cluster is rather crisp, while the other two partly overlap (see the MMM example for more information on interpreting the classification table).

The Covariate effects tables show that, in comparison to the Alternating cluster, students with low GPA are less likely to end up in the Upward or Downward transition clusters, and students with high GPA are less likely to end up in the Upward transition cluster. Again, we can calculate the probabilities of belonging to each cluster by GPA level:

exp(fit_mhmm_2$model$coefficients) / rowSums(exp(fit_mhmm_2$model$coefficients))

          Downward transition Upward transition Alternating
GPALow              0.1813217        0.11502283   0.7036555
GPAMiddle           0.2521406        0.39144399   0.3564154
GPAHigh             0.4734128        0.03048189   0.4961054

The table shows that students with low GPA typically belong to the Alternating cluster (70% probability) while students with high GPA mainly end up in the Downward transition cluster (47%) or the Alternating cluster (50%). Most students with middle GPA end up in the Upward transition cluster (39%), but the probabilities are almost as high for the Alternating cluster (36%) and also fairly high for the Downward transition cluster (25%).

In light of this, it is worth noting that the covariates do not merely explain the uncovered clusters; as part of the model, they drive the formation of the clusters. In other words, an otherwise identical model without the dependence on the GPA covariate may uncover different groupings with different probabilities.

If we are not sure how many clusters or hidden states we expect, or if we wish to investigate different combinations of covariates, we can estimate several models and compare the results with information criteria or cross-validation. Estimating a large number of complex models is, however, very time-consuming. Using prior information for restricting the pool of potential models is useful, and sequence analysis can also be used as a helpful first step [10, 37].

4.3 Stochastic Process Mining with Markovian Models

Process mining is a relatively recent method for the analysis of event-log data (time-stamped logs) which aims to understand the flow and dynamics of the process under study. In education, process mining has been used extensively to analyse learners' online logs collected from Learning Management Systems (LMS), for instance to understand how learners utilise learning resources and transition between learning activities [38, 39]. In this book, we have devoted a full chapter to process mining, where we explain how it can be performed in R [40]. In this chapter, we present a novel approach that we propose for performing stochastic process mining using MMs. While process mining can be performed with different software, techniques, and algorithms, MMs offer a powerful framework with several advantages over the commonly used methods. First, MMs are theoretically aligned with the idea that learning unfolds as transitions from one action to another and that actions are temporally dependent on each other. Second, MMs allow the data to be clustered into groups with similar transition patterns, a possibility not offered by other process mining methods (see the process mining chapter of this book [40]). Third, contrary to other process mining methods, MMs do not require researchers to arbitrarily exclude (or trim) a large part of the data to "simplify" the model. For instance, most process mining analyses require an arbitrary cutoff to trim the data so that the process model remains readable; this trimming significantly affects the resulting model and makes it hard to replicate. Most importantly, MMs come with several fit statistics that we can use to compare models and judge model fit, as we have seen before.

Several R packages can perform stochastic process mining; in this tutorial we rely on the same package we discussed earlier and combine it with a powerful visualisation package that allows us to effectively visualise complex processes. In the next example, we analyse data extracted from learning management system logs and offer a detailed guide to process mining. We also use MMMs to cluster the data into latent patterns of transitions. Given that the traditional plotting functions in seqHMM work best with a relatively small alphabet, we use the R package qgraph for plotting. The qgraph package offers powerful visualisations which make plotting easier and more interpretable, especially for larger models. Furthermore, qgraph allows researchers to use a fixed layout for all the plotted networks so that the nodes can be compared to each other more easily.

Let us now go through the analysis. The next chunk of code imports the prepared sequence data from the sequence analysis chapter. The data belong to a learning analytics course and the events are coded trace logs of students’ actions such as Course view, Instructions, Practicals, Social, etc. Then, we build a sequence object using the function seqdef() from TraMineR.

A programming code imports data from a specified U R L, defines sequence data, and specifies variables from columns 7 to 54.
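
The import code is not shown here; a sketch of the step, in which the URL and object names are placeholders rather than the original ones, could be:

library(TraMineR)

# Import the prepared event-log data; the URL below is a hypothetical placeholder
lms_data <- read.csv("https://example.com/lms_sequence_data.csv")

# Build the sequence object from the state columns (columns 7 to 54)
lms_seq <- seqdef(lms_data, var = 7:54)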

Before proceeding further, it is advisable to visualise the sequences. Figure 12 shows the sequence index plot, sorted according to the first states. The data are much larger than the collaboration roles and achievement sequences analysed previously; there are 9478 observations with an alphabet of 12 states. Unlike in the previous example, the sequence lengths vary considerably. Due to this, shorter sequences contain missing values to fill the empty cells in the data frame. However, there are no internal gaps. When creating the sequence object with the seqdef function, TraMineR allows for distinguishing between real missing values (NA, where the true state is unknown) and technical missing values (void) used to pad the sequences to equal lengths. The seqHMM package is able to account for both types of missing values and treats them slightly differently, for example when calculating the most probable paths of hidden states.

Fig. 12
A sequence index plot of sequence versus time plots some values as follows, group work (41, 7653), assignment (1, 780), practicals (26, 7660), feedback (7, 5590), and course view (1, 3547). Values are estimated.

Sequence index plot for the learning management system logs

A programming code to create a line plot of sequence data over time. The parameters include the type of plot, number of columns, sorting type, proportion of the legend, size of the legend items, border, and labels for the x and y axes.
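
A sketch of the plotting call using seqHMM's ssplot(), with illustrative graphical arguments and the placeholder sequence object from above, is:

# Sequence index plot of the LMS sequences, sorted by their first states
ssplot(
  lms_seq,
  type = "I",             # index plot
  sortv = "from.start",   # sort sequences by their first states
  legend.prop = 0.2,      # proportion of the plotting area reserved for the legend
  cex.legend = 0.8,
  border = NA,
  xlab = "Time",
  ylab = "Sequence"
)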

A simple transition analysis can be performed by estimating and plotting the transition probabilities, for example with the TraMineR package. Yet, this simple approach has drawbacks, and it is advisable to estimate MMs and use their full power. The next code estimates the transition probabilities of the full dataset using the function seqtrate() from the TraMineR package (shown in Table 7).

Table 7 Transition probabilities
A single-line code assigns the result of a function to a variable with an argument.
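
A sketch of this step, where the object name overalltrans is an assumption reused in the later examples, is:

# Estimate the overall transition probabilities between states (Table 7)
overalltrans <- seqtrate(lms_seq)
round(overalltrans, 2)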

As we mentioned earlier, we will use a novel plotting technique that is more suitable for large process models. Below, we plot the transition probabilities with the qgraph() function from the qgraph package (Fig. 13). We use some arguments to improve the process model visualisation. First, we use the argument cut = 0.15 to draw the edges with probabilities below 0.15 with lower thickness and colour intensity. This cut makes the graph easier to read and less crowded, and gives emphasis to the edges that matter. The argument minimum = 0.05 hides the edges below the probability threshold of 0.05. We use edge.labels = TRUE to show the transition probabilities as edge labels. The argument color takes the colour palette of the sequence object obtained with the function cpal(), and the argument curveAll = TRUE ensures that the graph shows curved edges. The "colorblind" theme makes sure that the colours can be distinguished by everyone regardless of colour vision abilities. Lastly, the mar argument adjusts the margins so that the graph fits

Fig. 13

A process map of the course view contains the following, theory, L A types, general, applications, instructions, ethics, feedback, group work, social, practicals, and assignment.

Process map for the overall process

within the figure area.

A programming code generates a transition plot by creating a variable that stores labels extracted from a data source. It assigns labels, sets edge properties, and applies a color theme.
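
A sketch of the qgraph() call with the arguments described above, where the node labels are taken from the alphabet of the sequence object and the margin values are illustrative, is:

library(qgraph)

# Plot the overall process map from the transition matrix
qgraph(
  overalltrans,
  labels = alphabet(lms_seq),  # use the state names as node labels
  cut = 0.15,                  # de-emphasise edges below 0.15
  minimum = 0.05,              # hide edges below 0.05
  edge.labels = TRUE,          # show transition probabilities on the edges
  color = cpal(lms_seq),       # node colours from the sequence colour palette
  curveAll = TRUE,             # curved edges
  theme = "colorblind",
  mar = c(4, 4, 4, 4)          # figure margins
)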

The seqtrate() function only computes the transition probabilities but does not compute the initial probabilities. While it is not difficult to calculate the proportions of starting in each state, we can also estimate a simple Markov model which does the same with a short command. We do so using the build_mm() function as per Sect. 4.2, recalling that the build_mm() function is distinct from build_hmm(), build_mmm(), and build_mhmm() in that it is the only build function that automatically estimates the parameters of the model.

The plotting now includes an extra option called pie = overallmodel$initial_probs, which tells qgraph to use the initial probabilities from the fitted MM as the sizes of the pie charts drawn on the borders of the nodes in Fig. 14. For instance, the pie around Course view covers about half of the circle, corresponding to the initial probability of 0.48 of starting from Course view. Please also note that, apart from these initial probabilities, the graph is otherwise identical to the one based on the seqtrate() estimates.

Fig. 14
A process map of the course view with pie chart borders contains the following, theory, L A types, general, applications, instructions, ethics, feedback, group work, social, practicals, and assignment.

Process map for the overall process with initial probabilities

A programming code for creating and plotting models, and visualizing model transitions based on several parameters. The model also includes a pie chart representing initial probabilities.
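
A sketch of this step, assuming the placeholder names introduced earlier, is:

# Estimate a simple Markov model to obtain the initial probabilities
overallmodel <- build_mm(lms_seq)

# Same process map as before, now with the initial probabilities as pies around the nodes
overallplot <- qgraph(
  overallmodel$transition_probs,
  pie = overallmodel$initial_probs,
  labels = alphabet(lms_seq),
  cut = 0.15, minimum = 0.05, edge.labels = TRUE,
  color = cpal(lms_seq), curveAll = TRUE, theme = "colorblind"
)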

Having plotted the transitions of the full dataset, we can now look for typical transition patterns (i.e., clusters) that are repeated within the data. The procedure is the same as before. In the next example, we use the function build_mmm() to build the model with four clusters as a demonstration. Ideally, researchers should estimate several models and choose the best one based on model selection criteria (such as the BIC) as well as interpretability.

The steps involved in fitting the model are as before; we make use of the function fit_model() to estimate the model. The result of running the code is an MM for each cluster (with distinct initial and transition probabilities). Given the number of sequences in the dataset, their length, and the number of states, the computational burden is larger than for the previous applications in this chapter. For illustrative purposes, instead of repeated EM runs with random starting values, we use a single EM run followed by global optimisation, using the argument global_step = TRUE. One benefit of this global (and local) step in fit_model() over the EM algorithm is the flexibility to define a maximum runtime (in seconds) for the optimisation process (argument maxtime in control_global). This can be valuable for larger problems with a predefined runtime budget (e.g., in a shared computer cluster). Note, however, that relying on the runtime can lead to non-reproducible results, even with a fixed seed, if the optimisation terminates due to the time limit. Finally, we run an additional local optimisation step using the results of the global optimisation, for more accurate results. The last argument, threads = 16, instructs fit_model() to use parallel computing for faster fitting (please customise it according to the number of cores in your computer). As for the starting values, we use the transition probabilities computed from the full data for all clusters, and random values for the initial probabilities.

While in theory many global optimisation algorithms should eventually find the global optimum, in practice there is no guarantee that it is found in limited time. Thus, as earlier, it is advisable to try different global/local optimisation algorithms and/or the EM algorithm with different initial values to make it more likely that the global optimum is found (see [4] for further discussion).

A programming code that creates a seed for number generation, simulates transition probabilities, calculates initial probabilities, and constructs a Hidden Markov Model with specified probabilities, which is then fitted to data.

[1] "NLOPT_SUCCESS: Generic success return value."

[1] -114491.2
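
The code that produced these results is not reproduced above; a sketch of the model building and estimation just described, in which the object names, the random starting values, and the time limit are assumptions, is:

set.seed(1)

# Starting values: the overall transition probabilities for every cluster and
# random initial probabilities
tr_start <- seqtrate(lms_seq)
dimnames(tr_start) <- list(alphabet(lms_seq), alphabet(lms_seq))

n_clusters <- 4
trans_start <- replicate(n_clusters, tr_start, simplify = FALSE)
init_start <- replicate(n_clusters,
                        prop.table(runif(length(alphabet(lms_seq)))),
                        simplify = FALSE)

# Mixture Markov model with four clusters
pm_mmm <- build_mmm(lms_seq,
                    transition_probs = trans_start,
                    initial_probs = init_start)

# One EM run, then global optimisation with a time limit, then a local optimisation step
fit_pm <- fit_model(pm_mmm,
                    em_step = TRUE,
                    global_step = TRUE,
                    control_global = list(maxtime = 3600),
                    local_step = TRUE,
                    threads = 16)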

Before plotting the clusters, let us do some cleanup. First, we extract the transition probabilities of each cluster and assign them to a variable; that way, they are easier to manipulate and work with. In the same way, we can extract the initial probabilities for each cluster.

A programming code that is used to extract the transition and initial probabilities of each cluster from a model.
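
A sketch of this extraction, assuming the fitted object fit_pm from the sketch above, is:

# Transition and initial probabilities of each cluster (lists with one element per cluster)
cluster_trans <- fit_pm$model$transition_probs
cluster_init  <- fit_pm$model$initial_probs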

Plotting the process maps can be performed in the same way as before. However, if we need to compare clusters, it is best to use a unified layout. An average layout can be computed with the function averageLayout(), which takes the transition probabilities of the four clusters as input and creates, as the name implies, an averaged layout. Another option is to use the same layout as the overall plot in the previous example, which can be obtained from the plot object overallplot$layout. This is helpful if you would like to plot the four cluster-specific maps with the same layout as the overall plot (see Fig. 15).

Fig. 15
4 process maps are titled diverse, assignment-oriented, practical-oriented, and group-centered.

Process maps for each cluster

A programming code creates various types of graphs with different layouts and titles. The graphs are titled diverse, assignment-oriented, practical-oriented, and group-centered.
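
A sketch of the layout computation and the cluster-wise plotting, with the cluster titles taken from Fig. 15 and the graphical details treated as illustrative, is:

# Common layout averaged over the four cluster-specific transition matrices
avg_layout <- do.call(averageLayout, cluster_trans)

# Titles assigned after inspecting the clusters (as in Fig. 15)
cluster_titles <- c("Diverse", "Assignment-oriented",
                    "Practical-oriented", "Group-centered")

# Plot the four process maps with the shared layout
par(mfrow = c(2, 2))
for (i in seq_along(cluster_trans)) {
  qgraph(
    cluster_trans[[i]],
    layout = avg_layout,
    labels = alphabet(lms_seq),
    cut = 0.15, minimum = 0.05, edge.labels = TRUE,
    color = cpal(lms_seq), curveAll = TRUE, theme = "colorblind",
    title = cluster_titles[i]
  )
}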

Oftentimes, the researcher is interested in comparing two pre-defined, fixed groups, e.g., high achievers and low achievers, rather than data-driven clusters. In the next example, we compare high and low achievers based on their achievement levels. First, we have to create a separate sequence object for each group. We do this by filtering, but it can be done in other ways; for instance, you can create a sequence object from scratch for each group. The next step is to build the MMs separately for each group.

A programming code filters a dataset based on a value and builds models for the high and low sequences. The data is filtered based on whether it is less than equal to or greater than 2.
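
A sketch of this step, in which the achievement variable and its cutoff (greater than 2 versus less than or equal to 2) are assumptions based on the description above, is:

# Split the sequence data into high and low achievers
high_seq <- lms_seq[lms_data$achievement > 2, ]
low_seq  <- lms_seq[lms_data$achievement <= 2, ]

# Estimate a simple Markov model for each group
high_mm <- build_mm(high_seq)
low_mm  <- build_mm(low_seq)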

Before plotting the groups, let us do some cleaning, as we did before. First, we extract the transition and initial probabilities of each group. We also compute an average layout. Please note that you can reuse the layout from the previous examples if you are comparing the models against each other and need a unified layout. The plotting is the same as before (see Fig. 16).

Fig. 16
2 process maps with pie chart borders are titled high achievers and low achievers.

Process maps for high achievers and low achievers using average layout

A programming code for creating graphs for high and low achievers based on transition and initial probabilities. The plots are customized with various parameters, including color, layout, pie, theme, and title.

We can also create a difference plot (see Fig. 17), that is, a plot of what the low achievers do less (or more) than the high achievers. In this plot, red edges are negative (transitions that low achievers perform less often than high achievers) and blue edges are positive (transitions that they perform more often). As the figure shows, the differences are not that large. In fact, much of the literature comparing high and low achievers uses more extreme thresholds, e.g., the top 25% versus the bottom 25%, or even the top 10% versus the bottom 10%.

Fig. 17
A process map on the left is labeled ethics, applications, assignments, course view, group work, and theory. The process map is surrounded by nodes labeled practicals, general, L A types, feedback, institutions, and social.

Difference between process maps of high achievers and low achievers using average layout

A programming code for creating a differential plot using a function that calculates differences, sets parameters, and assigns colors.
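
A sketch of the difference plot, where the shared layout and object names are assumptions carried over from the sketches above, is:

# Shared layout for the two group-specific transition matrices
avg_layout_groups <- averageLayout(high_mm$transition_probs, low_mm$transition_probs)

# Difference between the transition matrices (low minus high achievers);
# negative edges (red) are transitions low achievers make less often
diff_trans <- low_mm$transition_probs - high_mm$transition_probs

qgraph(
  diff_trans,
  layout = avg_layout_groups,
  labels = alphabet(lms_seq),
  edge.labels = TRUE,
  theme = "colorblind",
  title = "Low achievers vs. high achievers"
)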

5 Conclusions and Further Readings

Markovian models provide a flexible model-based approach for analysing complex sequence data. MMs and HMMs have proven useful in many application areas such as biology and speech recognition, and they can be valuable tools for analysing data in educational settings as well. Their mixture variants allow for the representation of complex systems by combining multiple MMs or HMMs, each capturing different aspects of the underlying processes, thereby enabling probabilistic clustering, information compression (e.g., visualisation of multicategory data from multiple domains), and detection of latent features of sequence data (e.g., extraction of different learning strategies). The ability to incorporate covariates in the case of MMMs and MHMMs makes those models even more powerful, and, as demonstrated, MMs and MMMs are also useful tools in the field of process mining.

The seqHMM package used in the examples supports time-constant covariates for predicting cluster memberships for each individual. In theory, covariates could be used to define transition or emission probabilities as well, leading to subject-specific and possibly time-varying transition and emission probabilities (in the case of time-varying covariates). However, at the time of writing this chapter, these are not supported in seqHMM (this may change in the future). In R, there are at least two other, potentially useful packages: for MMs, the dynamite [41] package supports covariates on the transition probabilities with potentially time-varying effects, whereas LMest [42] supports MMs and HMMs with covariates, and restricted variants of the MHMM where only the initial and transition matrices vary between clusters. Going beyond the R software, some commercial software also offers tools to analyse Markovian models, including latentGold [43] and Mplus [11].

The conditional independence assumption of observations given the hidden states in HMMs can sometimes be unrealistic. In these settings, the so-called double chain MMs can be used [44]. There the current observation is allowed to depend on both the current hidden state and the previous observation. Some restricted variants of such models are implemented in the march package in R [45]. Finally, variable-order MMs extend basic MMs by allowing the order of the MM to vary in time. A TraMineR-compatible implementation of variable-order models can be found in the PST package [46].

We encourage readers to read more about how to interpret the results in the original study where the data for this chapter was drawn from [36]. We also encourage readers to learn more about Markovian models in the context of multi-channel sequence analysis in Chapter 13 [8].