1 Introduction

Sequence analysis (SA; see Liao et al., 2022; Piccarreta & Studer, 2019; Raab & Struffolino, 2022 for an introduction) is a collection of tools used in social science to simultaneously describe the timing, sequencing, and quantum of processes that are represented by sequences of events (states) experienced over a specific period and observed at regular intervals. Such sequences are regarded as meaningful units of analysis and studied in their entirety. Because longitudinal data have recently become widely available, SA has been extensively used in demography and life-course research and beyond: Its versatility in analysing the evolution of processes over time has been proven by applications in other fields (social policy analysis, democratisation research, electoral participation studies, historical sociology, developmental psychology, mobility and time use research, and study of cultural processes).

In many applications, sequences are clustered to reveal relevant types, reflecting the different structural features of the temporal evolution of a given process. An important issue concerns the assessment of within-cluster homogeneity, the relevance of the identified types, and the detection of sequences with peculiar or uncommon features. In this Research Note, we contribute to the literature by proposing criteria that can be used to inspect clusters internal composition and to identify deviant cases in clusters and therefore in data. Further, we introduce tools to visualise and suitably describe the features of deviant sequences, thus enhancing the description of the data structure.

Clusters typically include some sequences that deviate—sometimes substantially—from the type that their cluster supposedly represents. We distinguish between two types of deviant cases. Structurally peculiar sequences differ from their cluster’s core but are nonetheless similar to each other; they are hence one of the relatively rare types of sequence that could be isolated in dedicated clusters by increasing the number of clusters (even if this might be accomplished only by very detailed and fragmented partitions). By contrast, outlying sequences have very peculiar features and are therefore highly different from all the others in their cluster.

Identifying the different deviant trajectories and their characteristics is relevant for several reasons. First, it offers a better description of the types—meaning that deviating sequences cannot influence substantive interpretations of the typical patterns—and it informs a more conscious and cautious interpretation of the substantive results. Second, it permits a detailed analysis of within-cluster heterogeneity: If clusters include many structurally peculiar sequences, this might indicate that the chosen partition failed to identify all the substantively/theoretically relevant types in the data. Further, deviant sequences might be interesting in their own right from a substantive and/or theoretical viewpoint. Indeed, clustering algorithms typically—and with good reason—prioritise the identification of the most typical patterns, and marginal clusters or outliers can be difficult to identify.

Efforts to detect deviant sequences may be particularly beneficial when clusters enter subsequent regression analyses as dependent or independent variables. This is only suitable if clusters are homogeneous, so that sequences clustered together share common features and are well represented by the same type. Moreover, deviant sequences might also differ more from others in their clusters with respect to covariate levels when a strong association exists between sequences and covariates: this could influence and bias the regression’s results.

To illustrate our procedure and demonstrate its added value, we analyse Italian women’s employment trajectories and their link with the respondents’ maternal participation in the labour market across geographical areas.

2 Identifying Deviant Sequences

In the following, we introduce criteria to identify and analyse deviant sequences.Footnote 1 These differ from others in their cluster with respect to the most relevant characteristics of the process signified by the cluster itself or, in other words, by the type that summarises the features of sequences allocated to the cluster.

2.1 Identifying Cluster-Dissimilar Sequences

A first intuitive approach to detecting deviant sequences relates to the extent of their dissimilarity from their cluster. For the generic \(i\)th sequence, this can be evaluated based on its average dissimilarity from all the others in its cluster, \(a_i\), or on its dissimilarity, \(m_i ,\) from its cluster’s representative sequence. Here, we look at the medoid, that is, the case with the minimal average distance from all the others. Thus, for a sequence \(s_i\) in a cluster \(C_g\) including \(n_g\) cases summarised by the medoid \(\tilde{s}_g\) we consider:

$$\begin{aligned} & a_i = \sum \limits_{j \in C_g ,j \ne i} d(s_i ,s_j ){/}(n_g - 1) \\ & m_i = d(s_i ,\tilde{s}_g ), \\ \end{aligned}$$

where \(d(s_i ,s_j )\) indicates the (properly measured) dissimilarity between two sequences. To evaluate individual deviations in relative terms, we compare them with the means calculated on all the cases, \(\overline{a} = \sum_i {(a_i {/}n)}\) and \(\overline{m} = \sum_i {(m_i {/}n)}\). As for all the criteria proposed to flag extremes, deviations could be compared to a threshold, which may coincide with the percentile of the deviations (chosen based on the “expected” proportion of extreme cases). Since thresholds will (or should) also depend on the amount of heterogeneity in data, we adopt an exploratory approach and suggest analysing which sequences are identified as dissimilar from their cluster as the threshold varies.

2.2 Identification of Noisy Sequences

Another approach to detecting deviant sequences is based on the criterion used in the DBSCAN clustering algorithm (Ester et al., 1996) to classify cases into core, border and noise. Specifically, core cases are those in ‘dense’ regions that have a number of neighbours (cases located within a pre-defined distance from them) that exceed a specific threshold. Border cases are non-core cases located in the neighbourhood of a core case. Isolated cases that are neither core nor border cases are defined as noise. These definitions require a specification of the radius of the neighbourhood and the minimum number of cases it should include. We adopt a data-driven approach, and consider the dissimilarity of each sequence from its 5th nearest neighbour (NN) in its cluster, \(d_i^{{\text{NN}}5}\), and a high percentile—e.g., the 95th, \(\overline{d}_{0.95}^{{\text{NN}}5}\)—of the distribution of the \(d_i^{{\text{NN}}5}\)’s in the entire sample. Therefore, 95% of the sequences in the sample have at least 5 NN (in their cluster) within a distance lower than or equal to \(\overline{d}_{0.95}^{{\text{NN}}5}\). We define core sequences as those having at least 5 sequences in their cluster within a distance \(\overline{d}_{0.95}^{{\text{NN}}5}\), and border and noise sequences accordingly. Researchers need to choose the number of NN and the percentile of the distances to use as a threshold. Nonetheless, because case classification and neighbourhood diameter depend on the behaviour of all the cases—due to the reliance on percentiles—the chosen number of NN has a limited impact on the identification of noise cases.

2.3 Visualising Deviant Sequences

Sequences and clusters are usually explored using the sequence index plot (Scherer, 2001), where each sequence is represented by a set of horizontally stacked bars coloured based on the state visited at each time point. To enhance visualisation, individual sequences are arranged on the vertical axis based on their similarity (see Piccarreta & Lior, 2010). This plot might fail to accurately describe the patterns in large sets of data, because sequences are over-plotted to make them fit into the graphic window: visualising and drawing substantive conclusions on clusters’ features can therefore be difficult. Alternative plots introduced in the literature (Fasang & Liao, 2014; Gabadinho et al., 2011; Müller et al., 2008; Piccarreta, 2012) limit over-plotting by only displaying sequences representative of their clusters, thus excluding the deviant sequences we are mostly interested in from the plot.

To explore the features of both regular and deviant clusters’ sequences, we propose to associate each cluster with a set of flagged index plots, one dedicated to regular cases and the other/s dedicated to sequences deviating according to one (or more) criterion or according to the severity of the deviation. This enables a close inspection of irregular sequences.

3 Illustrative Application

The increase in women’s labour market participation from the 1950/1960s is one of the most crucial changes in the post-war Italian labour market, although female participation remains among the lowest in the EU-27 (around 50% in 2019; Eurostat, 2020), with large disparities across the country. We focus on the employment trajectories of Italian women who transitioned from school to work between late 1970s and early 1990s. Different cohorts were differently exposed to labour market deregulation, but the ‘access’ to a stable employment trajectory remained associated with individual characteristics and parental background (Raitano & Vona, 2018; Struffolino & Raitano, 2020). Besides the structural constraints discouraging women—and particularly mothers—from participating in the labour market (e.g., lack of accessible childcare, residual parental leave for fathers, persistent wage gaps), maternal employment has proved to be a driver of women’s employment and intergenerational mobility, over and above parental education. Mothers’ employment can influence their daughters’ behaviour via social learning—by providing actual behavioural examples (Bandura, 1977), and by changing gender attitudes (Moen et al., 1997). Empirical evidence on single countries (e.g., Di Pietro & Urwin, 2003 for Italy) and cross-country analyses (e.g., McGinn et al., 2019) has shown that adult daughters of employed mothers are more likely to be employed. These studies focus on point-in-time responses, for example, the respondent’s employment status at age 35. Here, we instead consider a long time span over individuals’ life courses to identify specific forms of inequality in labour market trajectories.

3.1 Data and Methods

We used data from the ‘Multi-purpose Survey on Households: Families and Social Subjects’ carried out in 2009 by the Italian National Statistical Office (ISTAT) and focused on 4323 women born between 1959 and 1974 whose monthly work activities were tracked from age 16 to 35. We distinguish between the following states: education (Edu),Footnote 2 joblessness (Jless, including both unemployment and inactivity), full-time (FT), and part-time (PT) work, further broken down according to the type of contract: permanent (Perm), temporary (Temp), self-employment (Self), or dependent self-employment (DSelf).Footnote 3

First, we applied cluster analysis to identify distinctive employment trajectories. We assessed dissimilarities between sequences using OMA (Abbott, 1990; Abbott & Forrest, 1986) with substitution costs between two states being inversely proportional to the transition frequencies from one state to the other, and insertion-deletion costs equal to 1. We extracted clusters using partitioning around medoids (PAM, Kaufman & Rousseeuw, 2005) and the Ward’s agglomerative algorithms and compared the quality of partitions with a varying number of clusters using a battery of criteria (point biserial correlation, Hubert’s gamma, Hubert’s C coefficient, average silhouette width, pseudo R2 and pseudo F statistic; Hennig & Liao, 2010; Studer, 2013). Most of the criteria (Fig. 3, Appendix) pointed to 4 clusters and indicated that the PAM algorithm performed better.

We identified deviant cases based on three criteria: (a) average dissimilarity from other cases in their cluster more than 1.5 times the general average; (b) dissimilarity from own cluster’s medoid more than 1.5 times the general average; and (c) identification as noisy based on the \(\overline{d}_{0.95}^{{\text{NN}}5}\)-criterion.

We analysed cluster composition by distinguishing between regular and deviant sequences. We then used multinomial regression models to relate trajectories—and specifically cluster membership—to the interaction between maternal employment status (employed or unemployed/inactive) when the respondent was 15 years old and the geographical macro-area of residence (North-East, North-West, Centre, South, or Islands) at the time of interview, controlling for birth cohort (1959–1964, 1965–1969, or 1970–1974), highest parental education level (no education, primary education, at least lower secondary education), and whether at least one parent had tertiary education (see Table 1 in the Appendix for the covariate distribution in the sample). Specifically, we analysed whether and to what extent including or excluding the identified deviant sequences affected the regression’s results.Footnote 4

3.2 Results

Figure 1 displays the flagged index plots for the four-cluster partition. For each cluster (column), the top plot reports cases flagged as regular by all the three criteria (possibly overlapping) used to identify deviant cases. Note that the clusters’ regular sequences exhibit very distinguishable features, with long-term permanence in FT-Perm (C1), PT-Perm (C2), FT-Self (C3), and Jless (C4). People enter these dominant states after a variable period of education: half of the cases in C4 left school before 18. In contrast, C3 includes a high proportion of women who prolonged education, possibly experiencing relatively long joblessness before becoming self-employed.

Fig. 1
figure 1

Source: Multi-purpose Survey on Households: Families and Social Subjects 2009. (Color figure online)

Flagged index plots for the 4-clusters partition extracted using PAM clustering algorithm. For each cluster (column), the top plot reports cases flagged as regular by all the criteria used to identify deviant cases. For each cluster, we report subplots displaying sequences flagged as deviant, being dissimilar from their cluster [according to their average dissimilarity from their cluster—(a)—or to their dissimilarity from the cluster’s medoid—(b)], or being identified as noisy—(c). In each plot, cases are arranged on the vertical axis based on the Traveling salesperson problem solver (TSP, Gutin & Punnen, 2007) seriation algorithm (Hahsler & Hornik, 2011; Hahsler et al., 2008), starting from a randomly chosen case finds the shortest path (in terms of dissimilarities) to connect all the cases.

Besides the plots displaying regular sequences, for each cluster we report subplots displaying sequences flagged as deviant, being dissimilar from their cluster (according to their average dissimilarity from their cluster or to their dissimilarity from the cluster’s medoid), or being identified as noisy. The sets of cases dissimilar from their cluster (Fig. 1, rows a and b) mostly include structurally peculiar sequences dominated by employment arrangements less diffused among Italian women in the relevant age range, in particular FT and PT combined with Temp and DSelf. Despite their common features, such sequences are scattered among the four clusters, mainly based on the length of education.

The sets of sequences flagged as noisy (Fig. 1, row c) include sequences sharing similar—even if atypical—patterns that are dominated by extremely rare states (particularly, PT-DSelf) as well as turbulent sequences with very uncommon characteristics (e.g., high numbers of transitions or rare transitions). These cases do not share common or relevant (in terms of frequency) patterns and can therefore be regarded as outliers with irregular or unstructured deviations from the cluster.

While some deviation from the regular core sequences is expected, because several factors affect individuals’ behaviours (e.g., preferences, resources, constraints), not all the degrees of structural deviations are acceptable. Thus, the allocation to clusters of sequences with structural differences from the core ones should be justified/explained based on some substantive considerations. Indeed, this might affect the substantive interpretation of the typology deduced from the clusters. Regarding the outliers, researchers should consider if they can be regarded as random deviations from the cluster types or rather as sequences that do not belong to clusters.

Identifying peculiar patterns enables researchers to describe the data in more detail; moreover, deviant sequences might be interesting in their own right from a substantive and/or theoretical point of view. However, it is not necessarily easy or possible to identify such sequences by increasing the number of clusters, since clustering algorithms typically—and reasonably—prioritise the identification of the most typical patterns.

Indeed, we also explored higher-order partitions and in particular (based on the criteria used to assess clusters’ quality) the PAM nine-clusters solution (Fig. 4, Appendix). While identifying two clusters, including the structurally peculiar sequences dominated by FT-Temp and by prolonged education, respectively, this solution offers a more accurate description of the sequences dominated by the most frequent states, namely FT-Perm and JLess; the less frequent deviant sequences are not detected and isolated. Interestingly, we applied the PAM algorithm to group the sequences flagged as deviant in the four-cluster solution (Fig. 1, rows a, b, and c), and extracted 11 clusters (Fig. 6 and Table 2, Appendix). Some of these ‘deviant’ clusters remained highly heterogeneous but we detected some small clusters of trajectories dominated by non-standard employment, which could not be identified even by increasing the number of clusters (for example, 15 as displayed in Fig. 5, Appendix) extracted from the entire set of cases. Thus, in addition to being able to explore and address within-cluster heterogeneity, our suggested procedure also enabled an efficient analysis and characterisation of both regular types and rare or less frequent types or sub-types.

As mentioned above, the identification of deviant sequences can be particularly important when clusters are related to covariates via multinomial logistic regression. Indeed, if an actual relation exists between covariates and clusters, structurally deviant sequences might be characterised by structural differences in their covariates’ values, and this might affect the significance and/or the magnitude of regression coefficients. For example, if deviant sequences that possess the opposite covariate characteristics to those of the core sequences are included in a cluster, this can lower the significance of coefficients; by contrast, deviant sequences with extreme covariate values can magnify the relevance of a covariate that is not related to given clusters.

Drawing on our data, Fig. 2 reports the coefficients of the models estimated based on the entire sample and on the sub-samples obtained by filtering out deviant sequences identified using different criteria. We focused on the coefficients for maternal employment (M.empl). When we estimated the model using the entire sample (in black), having had a mother who worked when the respondent was 15 increased the probability of being in cluster C1:FT-Perm and in cluster C3:FT-Self rather than in cluster C4:Jless. Excluding the different types of deviant sequences (in red, blue, and green) did not change these conclusions in terms of significance, although the magnitude of the coefficients was sensitive to the exclusion of cluster-dissimilar sequences. Note that, for the interaction between maternal employment and geographical macro-area, both the magnitude and the significance of coefficients were sensitive to the exclusion of the deviant sequences. Specifically, compared to women in the Centre (reference category), daughters of employed mothers in the North-West (NW:M.Empl) and in the South (S:M.Empl) were only less likely to be in C3:FT-Self versus C4:Jless when structurally peculiar sequences were excluded from the sample (confidence intervals in red and in green in Fig. 2). On the contrary, for the daughters of employed mothers, for those in the North-West (NW:M.Empl) compared to those in the Centre, the difference in the probability of being in C1:FT-Perm versus C4:Jless was no longer significant when structurally peculiar sequences were excluded from the sample. In both cases, we see changes in the magnitude of the coefficients: in the first case from − 0.5969 to − 1.0436 (NW:M.Empl) and from − 0.5219 to − 1.4220 (S:M.Empl), in the second case from − 0.4247 to − 0.2842 (tables with the regression coefficients available from the authors). These results are particularly interesting given that removing the deviant sequences could lower the coefficients’ significance due to the reduced sample size; thus, in this case, deviant sequences obscured the direction and size of the effects. For researchers seeking to gather some insights about the mechanism underlying these findings, a first step would involve contrasting the baseline characteristics of the women who had regular versus (different types of) deviant sequences. Figure 7 in the Appendix reports the covariate distributions for the detailed set of types and sub-types identified using our procedures.

Fig. 2
figure 2

Beta-coefficients of a multinomial logistic regression model for cluster membership (reference category: C4:JLess) and 95% confidence intervals. Regression is first applied to the entire sample (black) and then to the sub-samples obtained by filtering out different types of deviant sequences: (red) sequences dissimilar from their cluster—i.e. sequences with an average dissimilarity from the others (in their cluster) or from their cluster’s medoid higher than 1.5 times the corresponding general average (blue) noisy sequences identified based on \(\overline{d}_{0.95}^{{\text{NN}}5}\) (green) both sequences dissimilar from their cluster and noisy sequence. Source: Multi-purpose Survey on Households: Families and Social Subjects 2009. Legend for covariates: Int. = intercept; macro-area of residence at the age of 15: North-East (R.NE), North-West (R.NW), Centre (reference), South (R.S), and Islands (R.I); birth cohort: 1959–1964 (reference), 1965–1969 (C.65–69), 1970–1975 (C.70–75); working status of the mother when the respondent was 15: employed (M.Emp), not employed (reference); highest parental education level: no education (reference), primary education (P.Prim), at least lower secondary education (P.LowSec+); at least one parent with tertiary education (dummy variable, P.Tert). (Color figure online)

4 Conclusions

SA is increasingly employed in different fields to describe longitudinal processes represented as sequences of categorical states. Many applications aim at simplifying the complexity of large sets of individual sequences and use cluster analysis to identify types representing different empirical realisations of the studied temporal process. Evaluating within-cluster homogeneity is crucial in SA to accurately describe the types. To address this issue, we introduced criteria to identify deviant sequences whose characteristics differ from the type process summarised by their cluster. We also introduce graphic tools to describe and contrast the features of regular and deviant cases, thus enhancing the description of the data structure. Besides allowing a more detailed analysis of clusters’ internal composition, our proposals enable an efficient identification and qualification of peculiar sequences that might be interesting in their own right from a substantive and theoretical point of view. As shown by our illustrative application, this cannot necessarily be accomplished by increasing the number of clusters, because clustering algorithms typically prioritise the identification of the most common patterns in data.

We elaborate on the possible role of deviant sequences in the (very typical) case when the relation between sequences and covariates is analysed using cluster membership as a response variable in a regression framework. In such cases, sequences in the same cluster are assumed to be fully consistent with the cluster type. Nonetheless, cases with deviant sequences might also differ from the other sequences in their clusters with respect to the covariate levels, especially when a strong association exists between sequences and covariates. This could influence and bias regression results. The role of deviant cases in determining the significance and magnitude of regression coefficients is demonstrated by our motivating application, focused on the link between maternal employment and daughters’ employment trajectories of Italian women across geographical areas. Our results suggest that maternal employment has an effect over and above parental education on the probability of daughters’ permanent employment over the life course. However, the significance of the interaction between maternal employment and geographical macro-area—which reflects different opportunities in the local labour market for female employment in general and for mother’s employment in particular—is sensitive to the exclusion of the deviant sequences from the clusters—and especially of structurally peculiar sequences. Thus, such employment trajectories are different from their cluster’s core, both in their structural characteristics (e.g., the states occurring over time) and the baseline characteristics of the individuals experiencing them. To conclude, researchers should carefully consider within-cluster homogeneity, the characteristics of regular and deviant sequences, and the possible role of the latter in subsequent analytical steps.