Previous research
There has long been a general interest in matching survey results with administrative data. These efforts have focused on increasing the accuracy of interviewee characteristics such as educational attainment (e.g. Adriaans et al. 2020) and income (e.g. Kreiner et al. 2015; Valet et al. 2019), counteracting the recall bias of isolated life-course events such as retirement (Korbmacher 2014), or decreasing the measurement error associated with entire dimensions of the life course such as union histories (Kreyenfeld and Bastin 2016) and employment biographies (e.g. Huber and Schmucker 2009; Kreuter et al. 2010; Wahrendorf et al. 2019).
For instance, Huber and Schmucker (2009) matched a year’s worth of survey and administrative employment data on a monthly basis in order to compare employment biographies and found high levels of agreement between data originating from the two sources: data from only 5 percent of respondents featured deviations. The authors conducted a multivariate analysis, which demonstrated that a higher number of state transitions, for instance between employment and unemployment, increases the probability of a mismatch between the two types of sources. Age, education and income also have a significant effect. The results show that the probability of mismatches decreases until the age of 45 and increases afterwards. Moreover, having no educational degree or belonging to the middle-income category was linked to a smaller probability of diverging sequences. Other variables like sex, nationality and further training had no significant effect. However, the one-year biographical episode under study was rather short and lay in the recent past at the time of the retrospective survey.
Wahrendorf et al. (2019) examined the differences between self-reported employment histories from the Heinz Nixdorf Recall Study (conducted in three German cities in the Ruhr area) and administrative data from the German Institute for Employment Research; they note that the size and composition of the two data sources are different. They report high and constant levels of agreement over time with an average of only 4 years of diverging information per person for an observation period of 36 years, which corresponds to a median level of agreement of 89 percent. They find that self-reported employment sequences are less complex than those originating from administrative data, which confirms that respondents tend to oversimplify their biographies during surveys (Manzoni et al. 2010). Using sequence analysis, Wahrendorf et al. (2019) find larger differences for women and people working in the tertiary sector. These effects vanish in multivariate analyses that take into account transitions and the years spent in part-time work and non-employment. This leads the authors to conclude that women, who often work in the tertiary sector, have more complex sequences with frequent status changes and more part-time work or non-employment (Widmer and Ritschard 2009) and therefore display greater disparities between the two types of data sources. However, the classification into only three employment states is rather coarse; for instance, Wahrendorf et al. (2019) combine unemployment and childcare into a single category of non-employment, which prevents a detailed look at gender-based differences.
Sources of mismatch: misrepresentation and non-response
Why might survey and administrative data on the life history of the same person yield different information? Possible sources of mismatch are reporting errors in survey or administrative data, as well as diverging measurement concepts and scopes in the two types of data. For instance, survey respondents may provide inaccurate information or no information at all. Likewise, recall bias is a central source of inaccuracy in retrospective life-course interviews: respondents may not recall sections of their life history or may misremember the nature of life-course episodes, their time frame or their temporal order (e.g. Korbmacher 2014; Schröder 2011; Wagner and Philip 2019). Alternatively, they may withhold or misrepresent sensitive aspects of their past such as (un-)employment spells, earnings or partnership history due to social desirability bias (Krumpal 2013; Valet et al. 2019). The survey design, interview mode and interviewer characteristics may exacerbate or mitigate the likelihood and severity of these biases (Kreuter et al. 2008; Kühne 2018; West and Blom 2017).
Administrative data are generally considered to be more robust to misrepresentations and non-response than surveys (Kreuter et al. 2010) since they are gathered within the framework of administrative processes using a systematic approach. However, inaccuracies may occur through measurement error, as well as missing documentation and diverging measurement constructs associated with the scope and purpose of the data (Groen 2012). Depending on the type of administrative data and how it is collected and processed, the nature or exact timing of an event may be inaccurately reported, for instance in employers’ social security notifications, although this is unlikely (Abowd and Stinson 2011; Kreuter et al. 2010). The convergence between survey and administrative data is contingent on the (dis-)similarity and compatibility of measurement concepts in both sources, such as how they define employment states. Further, legislative and institutional changes may produce mismatches through the (ex post) redefinition of measurement concepts in administrative sources (Mika 2009). Finally, depending on their scope and purpose, administrative data may lack coverage of certain individual characteristics or parts of the survey population, thus effectively producing non-responses for these items (Korbmacher and Czaplicki 2013; Sakshaug et al. 2017).
Dissimilarity measures in sequence analysis
Our analysis depicts respondents’ life trajectories as sequences. A sequence is defined as an ordered list of items (Brzinsky-Fay et al. 2006), in this case units of time (e.g. 1 year), categorized as different states (e.g. employment, education, etc.). Several consecutive items that share the same state are called an episode or spell. Sequence analysis uses distance measures to quantify the level of inconsistency between two sequences. Calculating these distance measures is often seen as a first step: the resulting distance matrix, which contains the distances between all given sequences, is then used, for example, to cluster the sequences (Brzinsky-Fay and Kohler 2010). In this analysis we only calculate the distance between each person's two sequences (originating from the survey and administrative data) because we are interested in the level of inconsistency between the two sources, not between the respondents. In a later step we use this measure of dissimilarity per person to investigate the level of dissimilarity between different groups, and to examine the determinants influencing the degree of divergence between the two sources. We employ three different distance measures for the pairwise comparison of individual sequences from administrative and survey data.
First, we calculate the Hamming distance, which depicts the difference between two sequences by simply counting the number of non-matching items (Hamming 1950; Studer and Ritschard 2016). The resulting distance value denotes the number of years in which self-reported states differ from administrative states. This measure is very sensitive to the correct timing—i.e. the exact year in which a state appears. In our case of retrospective interviewing, it can be difficult for respondents to remember the correct timing of an event, so a measure that is more forgiving in this regard may be appropriate.
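The counting rule can be illustrated with a minimal Python sketch. The function names and the five-year toy sequences are hypothetical and serve only as illustration; they are not taken from the data analysed here.

```python
from typing import Sequence


def hamming_distance(survey: Sequence[str], admin: Sequence[str]) -> int:
    """Count the positions (years) at which the two state sequences differ."""
    if len(survey) != len(admin):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(s != a for s, a in zip(survey, admin))


def normalized_hamming(survey: Sequence[str], admin: Sequence[str]) -> float:
    """Divide by the maximum possible distance, i.e. the sequence length."""
    if not survey:
        return 0.0
    return hamming_distance(survey, admin) / len(survey)


# Hypothetical toy sequences (EM = employment, ST = school or training):
# the survey places the transition into employment one year too late.
survey = ["ST", "ST", "EM", "EM", "EM"]
admin = ["ST", "EM", "EM", "EM", "EM"]

print(hamming_distance(survey, admin))    # 1
print(normalized_hamming(survey, admin))  # 0.2
```

A single misdated transition thus costs one unit for every year by which it is displaced, which is the timing sensitivity described above.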
The second distance measure we use is optimal matching (OM), which describes the difference between two sequences as the minimal cost of transforming one into the other. Three operations are allowed, each with a specified cost: insertion (inserting an element into a specific position) and deletion (deleting an element from a certain position), which are subsumed under the term indel, as well as substitution (changing one element into another) (Abbott and Forrest 1986; Abbott and Hrycak 1990; Studer and Ritschard 2016). We use the standard costs of 2 for each substitution and 1 for each indel operation. Because its alignment operations can absorb time shifts in the sequences, OM is less sensitive to correct timing than the Hamming measure.
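The minimal transformation cost can be computed with the usual edit-distance dynamic programme. The following Python sketch is one possible implementation under the cost setting stated above (indel 1, substitution 2); the function name and the three-element toy sequences are hypothetical.

```python
from typing import Sequence


def om_distance(x: Sequence[str], y: Sequence[str],
                indel: float = 1.0, sub: float = 2.0) -> float:
    """Minimal total cost of turning x into y with indels and substitutions."""
    n, m = len(x), len(y)
    # d[i][j] holds the minimal cost of transforming x[:i] into y[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel          # delete the first i elements of x
    for j in range(1, m + 1):
        d[0][j] = j * indel          # insert the first j elements of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            keep_or_sub = d[i - 1][j - 1] + (0.0 if x[i - 1] == y[j - 1] else sub)
            delete = d[i - 1][j] + indel
            insert = d[i][j - 1] + indel
            d[i][j] = min(keep_or_sub, delete, insert)
    return d[n][m]


# A one-year shift is resolved by one insertion and one deletion ...
print(om_distance(["A", "B", "C"], ["C", "A", "B"]))  # 2.0
# ... whereas entirely different states require three substitutions.
print(om_distance(["D", "D", "D"], ["C", "A", "B"]))  # 6.0
```

Both toy pairs differ in all three positions, so their Hamming distance is identical, while OM separates the shifted sequence from the unrelated one.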
Although OM is the most widely used distance measure for sequence analysis in the social sciences (Halpin 2010) and has previously been used to examine similar research questions (Huber and Schmucker 2009; Wahrendorf et al. 2019), it has been criticized for being sociologically meaningless (Wu et al. 2000). Halpin (2010) notes that OM was designed to assess discrete-time sequences, yet life trajectory data is continuous in time. He argues that trajectories should be considered sequences of spells. Moreover, previous research has shown very high correlations between Hamming and OM distances for life-course data (Halpin 2010; Wahrendorf et al. 2019). A range of refined measures has been proposed to address some of the critiques of OM (e.g. Elzinga 2003; Hollister 2009; Halpin 2010; Lesnard 2010; Elzinga and Wang 2012).
Third, we use one such refined, context-sensitive measure, OMspell, which calculates OM on spells (several consecutive years in the same state) rather than on single years, and is therefore sensitive to the duration spent in distinct successive states (Studer and Ritschard 2016). It introduces a correction factor function for the spell length with a weighting factor, the so-called expansion cost. The indel and substitution costs introduced for OM are extended as follows (Studer and Ritschard 2016):
$$ c_{I}^{s}\left(a_{t}\right) = c_{I}\left(a\right) + \delta\left(t - 1\right) $$
$$ \gamma^{s}\left(a_{t_{1}}, b_{t_{2}}\right) = \begin{cases} \delta\left|t_{1} - t_{2}\right| & \text{if } a = b, \\ \gamma\left(a, b\right) + \delta\left(t_{1} + t_{2} - 2\right) & \text{otherwise} \end{cases} $$
In this formula, \(c_{I}^{s}\) denotes the spell indel costs and \(\gamma^{s}\) the spell substitution cost of a given spell \(a_{t}\), which depicts state a during t years. \(\delta\) denotes the expansion cost. Past studies have successfully used this measure for sequence analysis (e.g. Lee et al. 2017; Squires et al. 2017). As before, we use the standard cost of 2 for substitution \(\gamma \left( {a,b} \right)\), 1 for indel operations \(c_{I}\) and 0.5 for expansion cost \(\delta\).
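Substituting these standard costs into the two definitions gives the concrete cost functions used here. The two evaluated cases (a three-year spell, and spells of 9 and 11 years in the same state) are illustrative values chosen only to show the arithmetic.

$$ c_{I}^{s}\left(a_{t}\right) = 1 + 0.5\left(t - 1\right), \qquad \gamma^{s}\left(a_{t_{1}}, b_{t_{2}}\right) = \begin{cases} 0.5\left|t_{1} - t_{2}\right| & \text{if } a = b, \\ 2 + 0.5\left(t_{1} + t_{2} - 2\right) & \text{otherwise} \end{cases} $$

$$ c_{I}^{s}\left(a_{3}\right) = 1 + 0.5 \cdot 2 = 2, \qquad \gamma^{s}\left(a_{9}, a_{11}\right) = 0.5 \cdot \left|9 - 11\right| = 1 $$

Inserting or deleting a one-year spell therefore costs exactly the plain indel cost of 1, and stretching or shrinking an existing spell by one year costs 0.5.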
We compare different measures of dissimilarity to assess their ability to capture and emphasize different dimensions of distinction between the two sequences and to investigate which aspects are meaningful in the context of our research question. As stated above, the naive Hamming distance measure is very sensitive to the point in time when the state appears. For example, a respondent may report a state a year earlier than the administrative data due to recall errors. He or she may report the sequence A–B–C, while the administrative data state C–A–B. This pair of sequences has the same Hamming distance as a pair in which the respondent had reported a further (totally different) state for every year, such as D–D–D. The information he or she provided in the first case may be inaccurate concerning the exact timing, but from a social science perspective it is more similar to the administrative sequence than the second case (Halpin 2010). OM, however, allows for slight time shifts through alignment via indel operations, which results in lower costs and therefore a smaller distance value.
However, OM does not take the context into account. For instance, it makes no distinction between misreporting an item that is part of a two-year episode and misreporting an item that is part of a ten-year spell, which might be less consequential from a sociological point of view (Halpin 2010). OMspell also includes the length of the spells and thus the structure of the sequences. It prefers expanding or compressing existing spells before inserting new spells for alignment. By calculating OM between sequences of spells and therefore taking episodes/spells (consecutive years spent in the same state) as the unit of analysis, it accounts for the continuous character of life trajectory data. Figure 1 depicts two example sequences from the data used in this study, with their respective Hamming, OM and OMspell distances.
In sequence 1, a short period of employment at the beginning of the observation period is followed by some years spent in school or training (ST) and then by a very long employment (EM) spell for the rest of the period. The sequences from both sources look similar but are not identical. The differences are found in the first six years. The sequence from the administrative data for the first 6 years is: EM-ST-EM-ST-ST-ST, while the sequence from survey data is: EM-EM-ST-ST-ST-EM. The remaining 24 years are coded EM in both sources. In total, three years at age 22, 23 and 26 differ. The Hamming distance therefore takes a value of 3, normalized to 0.1. The OM-based distance, which considers that these three divergences can be resolved by one insertion and one deletion operation, takes a value of 2, normalized to 0.033, which is a much smaller distance than that calculated using the Hamming method. Hamming can take a maximum value of 30 if every single year differs, while OM and OMspell have a (theoretical) maximum of 60. For OMspell, spells as a whole count as units for indel and substitution operations, and substitutions of the same state can be expanded to different lengths, i.e. to include several elements. Therefore, based on OMspell, two compression operations with a cost of 0.5 each and two insertion operations with a cost of 1 each are needed to convert the survey into the admin sequence. This results in an OMspell value of 3, normalized to 0.05, which is slightly higher than OM but still considerably lower than the Hamming measurement.
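The normalized values follow from dividing each raw distance by its maximum over the 30-year observation period (30 for Hamming, 60 for OM and OMspell). As a quick check, with \(d_{H}\), \(d_{OM}\) and \(d_{OMspell}\) introduced here purely as shorthand for the three normalized distances of sequence 1:

$$ d_{H} = \frac{3}{30} = 0.1, \qquad d_{OM} = \frac{1 + 1}{60} \approx 0.033, \qquad d_{OMspell} = \frac{2 \cdot 0.5 + 2 \cdot 1}{60} = \frac{3}{60} = 0.05 $$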
Sequence 2 contains a total of 8 years’ difference between the administrative and survey data, resulting in a Hamming distance of 8, normalized to 0.267. In the first half of the observation period, there is a shift in the transition from childcare (CH) to employment (EM), which in the survey data occurs two years earlier. The remaining six differences are found at the end of the period. While the administrative data show several changes between Sick or Disabled (SD), return to Employment and finally to Retirement (RT) status, the survey data show a one-time change to SD. Using the OM approach, the survey data sequence can be transformed into the admin sequence by inserting two CH items and one SD item, and deleting two SD and one EM. These six indel operations have a cost of 6. Furthermore, three SD years have to be substituted with RT years, which costs 3*2. This results in a total OM distance of 12, normalized to 0.2. Calculating the OMspell cost starts with substituting the nine-year CH spell in the survey data with an 11-year episode of the same state. Since this is only an expansion of an existing spell, the cost results from the expansion factor 0.5 times the difference in spell length, i.e., a cost of 1. A similar calculation applies to the following compressions of the EM spell from 15 to 12 years and the SD spell from six years to one year. Subsequently, a two-year EM spell, a one-year SD spell and a three-year RT spell have to be inserted. The cost of each insertion is the insertion cost (1) plus the weighting factor (0.5) times the spell length minus 1, i.e. 1 + 0.5*(3–1) for the three-year RT episode. This results in an OMspell distance of 9.5, normalized to 0.158. In this case OMspell is smaller than OM because it favors the expansion or compression of existing spells over the insertion of new spells. OM scores the same regardless of whether the inserted CH items show the same state as previous or following spells. 
OMspell rewards the fact that the considered sequences have the same spell structure in the first two-thirds—CH, EM, SD.
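The arithmetic for sequence 2 can be retraced with a short Python sketch. The spell lists below are an assumption: they are reconstructed from the verbal description (spell lengths and operations), not read from the underlying data, and the helper names are hypothetical. The sketch sums the costs of the operations named in the text; it does not search for the optimal alignment itself.

```python
from itertools import chain

INDEL, DELTA = 1.0, 0.5  # standard indel and expansion costs from the text


def expand(spells):
    """Turn [(state, years), ...] into a year-by-year list of states."""
    return list(chain.from_iterable([state] * years for state, years in spells))


def resize_cost(years_from, years_to):
    """Cost of expanding or compressing an existing spell of the same state."""
    return DELTA * abs(years_from - years_to)


def insert_cost(years):
    """Cost of inserting a new spell of the given length."""
    return INDEL + DELTA * (years - 1)


# Assumed spell structure of sequence 2 (reconstructed from the description).
survey_spells = [("CH", 9), ("EM", 15), ("SD", 6)]
admin_spells = [("CH", 11), ("EM", 12), ("SD", 1), ("EM", 2), ("SD", 1), ("RT", 3)]

survey, admin = expand(survey_spells), expand(admin_spells)
hamming = sum(s != a for s, a in zip(survey, admin))

omspell = (
    resize_cost(9, 11)     # expand CH from 9 to 11 years
    + resize_cost(15, 12)  # compress EM from 15 to 12 years
    + resize_cost(6, 1)    # compress SD from 6 years to 1 year
    + insert_cost(2)       # insert two-year EM spell
    + insert_cost(1)       # insert one-year SD spell
    + insert_cost(3)       # insert three-year RT spell
)

print(hamming, round(hamming / 30, 3))   # 8 0.267
print(omspell, round(omspell / 60, 3))   # 9.5 0.158
```

Under this reconstruction the three resizing operations account for 5 of the 9.5 cost units, which shows how much of the divergence OMspell attributes to misjudged spell lengths rather than to missing spells.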