Combining Sequence Analysis and Hidden Markov Models in the Analysis of Complex Life Sequence Data

Life course data often consists of multiple parallel sequences, one for each life domain of interest. Multichannel sequence analysis has been used for computing pairwise dissimilarities and finding clusters in this type of multichannel (or multidimensional) sequence data. Describing and visualizing such data is, however, often challenging. We propose an approach for compressing, interpreting, and visualizing the information within multichannel sequences by finding (1) groups of similar trajectories and (2) similar phases within trajectories belonging to the same group. For these tasks we combine multichannel sequence analysis and hidden Markov modelling. We illustrate this approach with an empirical application to life course data but the proposed approach can be useful in various longitudinal problems.


Introduction
Longitudinal data often consists of multiple parallel sequences that ought to be analyzed jointly.For example, life course data may contain sequences of employment, family formation, and residence.Such data is often referred to as multichannel or multidimensional sequence data.A multichannel approach often gives a simpler representation of the data as opposed to combining states across life domains (the extended alphabet approach); the latter approach rapidly grows the state space as the number of channels and/or states grows.If some data is only partially observed, the multichannel approach also allows for handling data as it is instead of having to make difficult decisions on how to combine observed and unobserved states (Helske and Helske 2018).
Joint analysis of complex multidimensional data poses several challenges.Multichannel sequence analysis (Gauthier et al. 2010) has been the standard tool for the analysis of multichannel sequence data (for empirical applications see, e.g., Eerola and Helske 2016;Müller et al. 2012;Spallek et al. 2014).This approach is simple and fast in computing dissimilarities between sequences, and cluster analysis is often used for grouping similar sequences.Describing and visualizing results is, however, often challenging.
We propose an approach for compressing the information within multichannel sequences and for facilitating the interpretation of such data by finding (1) groups of similar trajectories and (2) similar phases within trajectories belonging to the same group.For the first task we use the standard multichannel sequence analysis approach and for the second task we propose using hidden Markov models (HMMs).With the help of HMMs the data can then be illustrated with a graph showing typical phases within trajectories and the transitions between them and/or shown as simplified (single-channel) trajectories consisting of these typical phases.We illustrate this approach with an empirical application to complex longitudinal life course data but such an approach, and HMMs in general, are useful in various longitudinal problems across disciplines.
Hidden Markov models have been widely used in economics, bioinformatics, and engineering (see, e.g., MacDonald and Zucchini 1997;Durbin et al. 1998;Rabiner 1989), often to study single long sequences such as time series.In social sciences, such models are commonly referred to as latent Markov (chain) models (Wiggins 1955(Wiggins , 1973;;Van de Pol and De Leeuw 1986); typically they have been used for analysing panel data with a few measurement points.In the social science framework, Vermunt et al. (1999) extended the HMM to include individual covariates and Bartolucci et al. (2007) further developed it for multichannel observations.See also Taushanov and Berchtold (2018) in this bundle.
To the best of our knowledge, few papers apply HMMs to multichannel social sequence data and they all consider binary observations.Bartolucci et al. (2007) studied criminal trajectories of 11,400 offenders, applying HMMs to ten-channel data with six time points.Ip et al. (2015) analysed and classified 18-item profiles of food security among 248 Latino farm worker households in the USA for eight time points.Rijmen et al. (2008) studied 12 parallel trajectories of emotions at 63 time points among 32 anorectic patients.Our analysis extends this framework into multichannel data with much longer and multinomial sequences.
The rest of the paper is structured as follows.We start by giving an introduction to HMMs (we assume that the reader is familiar with sequence analysis and refer to the introduction chapter in this book for the less experienced).We then proceed to framing our goals in the context of complex life course data.We continue by describing the data and the empirical analysis and show the results.We conclude with discussing the usefulness of the method, the challenges it poses, and mention some future directions.

Hidden Markov Model
In hidden Markov models, observations are related to a hidden process following a Markov chain.Hidden states can only be detected through the observed sequence(s), as they generate or "emit" observations on varying probabilities.
Let us assume we have multichannel sequence data with N individuals, T timepoints, and C channels and a hidden Markov model with S hidden states.Now z i = (z i1 , z i2 , . . ., z iT ) represents the hidden state sequence for individual i = 1, . . ., N from time 1 to time t and y itc denotes the observation of individual i at time t = 1, . . ., T in channel c = 1, . . ., C.
Figure 1 illustrates the structure of an HMM for two-channel data.The first order Markov assumption states that the probability of transitioning to the hidden state at time t only depends on the hidden state at the previous time point t −1.Here we also assume the same latent structure applies to all channels, i.e., hidden state z it emits observed states y itc in all channels c and observations y it1 , . . ., y itC are assumed conditionally independent given the hidden state z it .
The following probabilities characterize a discrete first-order hidden Markov model for multichannel data: • Initial probability vector π = {π s } of length S, where π s is the probability of starting from the hidden state s: π s = P (z i1 = s); s ∈ {1, . . ., S}.
• Transition probability matrix A = {a sr } of size S ×S, where a sr is the probability of moving from the hidden state s at time t − 1 to the hidden state r at time t: a sr = P (z it = r|z i(t−1) = s); s, r ∈ {1, . . ., S}.Typically, the maximum likelihood estimates of these probabilities are calculated with the Baum-Welch algorithm, i.e., the expectation-maximization (EM) algorithm for HMMs (Baum and Petrie 1966;Rabiner 1989).The most probable path of hidden states for each subject given their observations and the model can be computed using the Viterbi algorithm (Viterbi 1967;Rabiner 1989).Missing observations are handled straightforwardly.When observation y itc is missing, it does not contribute to the estimation of model parameters nor hidden states.See Helske and Helske (2018) for a more extensive presentation on HMMs for multichannel data.

Combining Sequence Analysis and Hidden Markov Models for Complex Life Sequences
For analysing complex life sequence data, we aim to compress the information into two types of components: 1. groups with similar life course patterns and 2. typical life stages within each group.
The first component corresponds to finding clusters or latent classes of individuals who have experienced similar life events in similar order and timing.The other, time-varying components should correspond to life stages during which individuals are more likely to have similar experiences, e.g., observed states within the sequences.These life stages could be either stable episodes between two transitions (e.g., employed and married without children) or characterized by transitions in some of the life domains (e.g., moving between unemployment and short-term jobs).Individuals may, and typically do, go through several different life stages during their life course.SA followed by cluster analysis is a typical strategy for grouping life trajectories.Hidden Markov models, in turn, may be used for finding time-varying latent structures and transitions between them.At first, we use multichannel SA to compute pairwise dissimilarities and then group individuals into clusters.Separate HMMs are then fitted for each cluster.The number and nature of the hidden states are determined independently for each group.
We estimate left-to-right HMMs where transitions to previous hidden states are not allowed.We had several reasons to do this.First, left-to-right models are simpler to estimate since some of the parameters are restricted to zero.Second, due to the nature of the life trajectories, also the observed states tend to show a left-to-right behaviour and many of the HMMs would end up being estimated close to left-toright models anyway.Third, we find that left-to-right models are often easier to interpret in the context of life course: individuals go through different life stages but even if they return to have a similar life stage compared to a previous one -say re-marriage after a divorce -this second life stage comes with a different history compared to the first time.

Data
We illustrate the analysis of complex life sequence data using a subsample of the German National Educational Panel Survey (NEPS) (Blossfeld et al. 2011).We restricted the analysis to the life courses of an age cohort born in 1955-1959.Only individuals who were born in Germany or moved there before the age of 14 are included.
The data consisted of monthly life statuses of 1731 individuals in three life domains (labour market participation, partnerships, and parenthood) from age 15 to age 50.For each individual, there were three parallel sequences of length 434, which made altogether 2,253,762 data points (of which 2,232,730 were observed and 21,032 were missing).Using the monthly time scale also allowed for the detection of smaller fluctuations in life courses, e.g.recurrent transitions between short-term unemployment and employment.

Sequences
The sequences in three life domains were constructed as follows: Labour market participation with 4 states: • No children • Has (had) children (biological, adopted, or foster children) The coding for parenthood was very simple.A practical reason was that this record was available for most individuals, whereas more detailed information was often missing.On the other hand, we can argue that specifically the experience of becoming a parent is relevant as one step in the developmental process into adulthood.
For the latter two life domains, the status of each month was usually determined from the latest event.An exception was made for the rare partnerships that lasted for less than a month; there separation was coded from the following month onward.In a case of multiple records per month in the career domain, the final status was given according to assumed importance: school and vocational training came before employment, which in turn dominated over vocational preparation, unemployment, and other non-employment statuses.
Altogether 306 individuals (17.7%) had some missing information in one or two life domains.Thus, at each time point we had at least some information from each individual.

Analysis
We have little prior knowledge on the structure of the model; hence, how many clusters to choose and how many hidden states to include in each cluster?Since the complexity of these types of life course trajectories varies a lot (e.g., some individuals have no family-related transitions while others have many), we expected the groups to have varying numbers of hidden states.

Sequence Analysis and Clustering
We started by applying multichannel sequence analysis and computed the dissimilarities between the sequences.These were then used in cluster analysis.
The dissimilarities between sequences were determined according to the generalized Hamming distance with user-defined substitution costs (see Table 1).We set the highest cost to be the same in all life domains to give them equal weight.We gave no cost for substituting missing states since we wanted to determine dissimilarity based on the observed trajectories.Regarding the costs within different life domains, our choices were mainly based on how far the states are regarded on the pathway to adulthood and, in terms of labour market participation, also on how close the other states can be regarded to employment which is often the favourable state.The metric compares observed states time point by time point and gives a cost for mismatches.It generally works well in a multichannel problem where timing is important (Studer and Ritschard 2016) and resulted in meaningful clusters with high goodness-of-fit.
We used Ward's clustering method for the Hamming dissimilarities and chose six clustering solutions with 7-12 clusters for further examination.The choice was based on goodness-of-fit statistics, the dendrogram, and the interpretability of the clusters.Ward's method was chosen because it typically produces usable and relatively even-sized clusters compared to most of the other clustering methods (Aassve et al. 2007;Helske et al. 2015).Also, the method is hierarchical (agglomerative), so when two smaller clusters are merged, all other clusters remain the same.This means that among the 7 + 8 + 9 + 10 + 11 + 12 = 57 clusters in the six sets of clustering results, only 7 + 2 + 2 + 2 + 2 + 2 = 17 were unique, resulting in significant decrease in the number of models to be estimated compared to nonhierarchical clustering.

Hidden Markov Models for Clusters
At the next step, we estimated five HMMs with 4-8 or 5-9 hidden states separately for each of the 17 unique clusters-fewer hidden states for clusters with simpler observed trajectories, more for the more complex ones.Since the goal was to find general life stages between adolescence and middle age in a given group, having too few or too many hidden states was not plausible nor interpretable.When increasing the number of hidden states, at some point they lost their distinctive nature (consecutive states had very similar emission probabilities) and/or they were rarely "visited" in the most probable paths of hidden states.
A well-known problem with the HMM estimation is that most of the optimization methods are sensitive to initial estimates of the parameters.In order to reduce the risk of being trapped in a poor local optimum, we estimated the models numerous times with random starting values.We continued re-estimation until we had found the same optimum for at least 100 times (which turned out to be much more than necessary).
For each cluster, we compared the HMMs with a different number of hidden states to find the best model.Bayesian information criterion (BIC) and other information criteria are common choices for comparison of HMMs with different numbers of hidden states.Another common option for model selection is crossvalidation.
We chose to use BIC as it generally selects parsimonious models.Unfortunately, here BIC kept suggesting models with more and more states.We did, however, use BIC as one source of information for choosing the number of hidden states by looking for turning points in BIC after which additional hidden states offered little improvement.In addition to BIC, the choice of the number of hidden states was based on the interpretability of the model and the prevalence of the hidden states in the individual trajectories.

Software
Analyses were conducted with the R software (R Core Team 2015) by using the packages TraMineR (Gabadinho et al. 2011) for sequence analysis, cluster (Maechler et al. 2015) for cluster analysis, and seqHMM (Helske and Helske 2018) for hidden Markov modelling.For the estimation of HMMs we used the automatic re-estimation routine for the EM algorithm provided in the model estimation function.

Results
The number of hidden states per cluster varied between six and eight.The model of eight clusters in the smallest BIC (even the highest likelihood) and was chosen as the best model.We present a few different ways to describe the results: a table showing the most typical transitions in each cluster, a figure illustrating the structure of the HMMs, and a figure of the most probable hidden states, i.e., the trajectories of general life stages for each individual.
Table 2 describes each cluster in terms of some important transitions and states: typical labour market participation (showing the timing of completing education and the type of employment after that), partnership histories (age at first partnership, the type and number of partnerships), and parenthood (the timing of the first child).It also shows the number of individuals in each cluster and the proportion of the sample, as well as the hidden states described with the most important transitions.
Figure 2 illustrates the HMM structure for each of the eight clusters.It shows the HMMs as directed graphs where the pies represent hidden states and the slices show the emission probabilities of observed states within each hidden state (to draw attention to the most prevalent observations, we only show probabilities that are greater than 0.05).The arrows indicate transition probabilities between the hidden states-the thicker the arrow, the higher the probability.
Figure 3 illustrates the most probable hidden state paths.We have assigned similar colours to similar hidden states across clusters.
As an example of how to interpret these figures, let us look at the smallest of the clusters titled Single parents (cluster H).All individuals start from the first hidden  Hidden states are described by the most probable observations

Fig. 3
Most probable hidden state paths between ages 15 and 50 for individuals in eight clusters.Hidden states are described with the most probable observed states showing labour market participation (studying/employed/unemployed/out of the labour market), partnership statuses (single/cohabiting/married/divorced/separated; also partner = cohabiting or married, no partner = divorced/separated from marriage or after cohabitation), and parenthood status (if has had children In general, the clusters were well separated from each other by the timing and occurrence of labour market participation and family states.The two largest clusters composing of half of the respondents were characterized by (mostly) short education and family.The biggest difference was in the timing of partnership and parenthood transitions which occurred either earlier in life (cluster A) or later (cluster B).The third largest cluster (cluster C) mostly consisted of individuals, more often men, who had long education and later family transitions.Another cluster with early family transitions (cluster D) consisted of mostly women and was characterized with a long career break for mostly taking care of children.
Two clusters were characterized by no or very late parenthood.They differed in the timing of the partnerships; the larger cluster (cluster E) had earlier first partnerships while in the smaller cluster (cluster F) partnerships were delayed or omitted altogether.The two smallest clusters consisted of parents living divorced or separated (cluster H) or single parents (cluster G).

Discussion
When analysing complex sequence data with multiple channels, describing and visualizing the data can be a challenge.By combining sequence analysis and hidden Markov models the information in data can be compressed into hidden states (life stages) and clusters (general patterns in life courses).Hidden states were able to capture general life stages that included not only rather stable episodes such as being employed and married with children (e.g., State 7 for cluster A) but also life stages characterized by change, e.g., moving between unemployment and short-term employment (State 3 for cluster F).
We presented two different ways of HMM-based visualizations that give complementary information but could also be shown alone-it is up to the researcher to decide which one is more informative in each case.The HMM graphs show the structure of the hidden states and the transitions between them; also parameter estimates could be easily included in the graph.The most probable paths of hidden states show individual-level information on the approximate prevalence and timing of different life stages.
Despite its usefulness as a data reduction technique, this approach comes with some challenges.A major one is the estimation of several HMMs when the number of hidden states and clusters is unknown.For these challenges, we used a few approaches.In terms of the number of clusters, we used a hierarchical clustering method which reduced the number of models to be estimated compared to nonhierarchical clustering.We then estimated a single model numerous times with randomized starting values to find the one with the highest likelihood, using parallel computation for improved efficiency.
Another issue is that we take the SA clusters as fixed.In reality, there is, of course, a lot of uncertainty which we do not take into account.Also, we do not discuss other trajectory grouping techniques besides SA.To our knowledge, there are not many methods suitable for multichannel sequence data; we experimented with latent class analysis (Collins and Wugalter 1992) which did not lead to satisfactory results.On the other hand, regarding the parameter uncertainty, in theory it is possible to compute asymptotic standard errors from the Hessian matrix obtained from the numerical optimization algorithms, but in practice the underlying asymptions are typically not met (Zucchini and MacDonald 2009).
The mixture hidden Markov model (MHMM) offers a solution to the problem of uncertainty of clustering.In the MHMM, instead of fixing individuals to the clusters defined during the SA step, we could use all data to estimate a mixture of HMMs where each individual belongs to each cluster with some probability (preferably with a large probability for one cluster and a small probability to all others).In a complex setting, SA can be used to determine the range of potential clustering structures.It can also be of aid when setting initial values for the estimation process, which is often essential when using very large models.
Although in theory the MHMM approach allows even more flexibility to the modelling and potential for more interesting ways of inference, there are some practical computational problems in the MHMM methodology.The parameter estimation of HMMs is often very sensitive to initial values, and the computational costs increase rapidly when the number of hidden states grows.These problems are even more prominent in complex MHMMs, especially when the structure of the model (in terms of the number of hidden states and/or clusters) is not known.For this study, we were not able to find stable solutions for MHMMs despite large computational resources available-the multichannel structure, long sequences, and the relatively large number of individuals in our data was too challenging a combination for parameter estimation.Nevertheless, the MHMM can be useful in other settings.It has been successfully used for simpler problems, e.g., for accounting for measurement error and unobserved heterogeneity.
An extention not covered in this paper is the inclusion of external covariates.Personal characteristics and other relevant factors, time-constant as well as timevarying, could be used to predict transition probabilities between life stages.In MHMMs, time-constant covariates may also be used to predict cluster memberships.See, e.g., Vermunt et al. (2008) for a general presentation of such models.
We are currently studying algorithmic variations which can reduce the computational complexity of the MHMM estimation.Further research is also needed regarding model selection and the goodness-of-fit of left-to-right HMMs and MHMMs.Further theoretical and empirical studies are needed for detecting the reasons for the failure of BIC and for discovering selection criteria that are better suited for finding parsimonious HMMs.
Another topic for future research is the potential of hidden Markov models and Markovian models in general as mechanisms of generating social sequence data.
The aim of our study was to describe complex life sequence data and for that goal, the SA-HMM approach gave satisfactory results in a reasonable time.We were able to find meaningful and well-separating clusters and to visualize their complex life course information by using HMM graphs and the most probable paths of life stages for each individual.

Fig. 2
Fig. 2 HMM graphs for the eight clusters A-H.State abbreviations show labour market/partnership/parenthood statuses: ST = studying, EM = employed, OU = Out of the labour market, UN = unemployed; S = single, C = cohabiting, M = married, D = divorced/separated; NC = no children, CH = has child(ren).Hidden states are described by the most probable observations The hidden state at time t is illustrated with z it inside a circle and the observed state at time t in channel c with y itc inside a rectangle.Arrows indicate dependencies between states

Table 1
Substitution costs for Hamming distances in three life domains: labour market participation, partnerships, and parenthood

Table 2
Description of clusters by typical timing of the completion of education, type of employment, the timing, number, and types of typical partnerships, and the timing of parenthood.Hidden states are described with changes in the most probable observations (ordered by prevalence) (out = out of the labour market (not studying), div.= divorced or separated).For all clusters, the first two hidden states are omitted as they are approximately the same: hidden state 1 is studies, single, no children and hidden state 2 is employed, single, no children ).Multiple relevant observed states within a life domain are ordered by emission probabilities.See Fig.2for visualizations of the hidden states in each cluster state (State 1, indicated with light blue in the hidden state paths), a life stage where they are childless singles and mostly studying.For almost all, the next transition is to State 2 (dark blue), moving to employment.A few make a straight transition to State 3 (light pink), a life stage of becoming parents and being out of the workforce.State 4 (darker purple) describes a life stage during which individuals are singles, have children, and are employed.This is the most prevalent life stage for the members of this cluster and many stay there until the end of the follow-up.A few move out of employment, mostly to unemployment (State 5, light purple).During the last life stage, experienced by almost half of the members, individuals form their first partnerships (State 6, yellow).