The continuous sample of working lives: improving its representativeness

Pérez-Salamero González, Juan Manuel; Regúlez-Castillo, Marta; Vidal-Meliá, Carlos

doi:10.1007/s13209-017-0154-0

Original Article
Open access
Published: 03 March 2017

Volume 8, pages 43–95, (2017)
SERIEs

Abstract

This paper studies the representativeness of the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. We examine several CSWL waves (2005–2013) and show that the sample is not representative of the population with a pension income. We then develop a methodology to draw a large dataset from the CSWL that is much more representative of the retired population in terms of pension type, gender and age. This procedure also makes it possible for users to choose between goodness of fit and subsample size. In order to illustrate the practical significance of our methodology, the paper also contains an application in which we generate a large subsample distribution from the 2010 CSWL. The results are striking: with a very small reduction in the size of the original CSWL, we significantly reduce errors in estimating pension expenditure for 2010, with a p value greater than or equal to 0.999.

1 Introduction

Selecting a representative sample from the population is a very important factor in quantitative research, given that results obtained from a wrongly selected sample that is not properly in tune with the object of study cannot be generalized to the whole population. Moreover, a sample that is smaller than appropriate may not have enough diversity to enable significant differences or associations potentially present in the target population to be identified. It is fundamental to make the right choice and decide on the right way to deal with the dataset, always making sure that it is the best one for the purposes of the research. Hence, it is important to check that the sample selected is representative of the target population in terms of demographic characteristics as well as any others that can affect the results of the study to be conducted.^{Footnote 1}

The Continuous Sample of Working Lives (CSWL) is a random sample (RS) of around 1.2 million people, i.e. 4% of the reference population. It contains administrative data on working lives, which provide the basis for this sample taken from Spanish Social Security records, and comprises anonymized microdata with detailed information on individuals. Izquierdo et al. (2009) point out that this database provides a unique dataset with very rich information about labour market histories and personal characteristics, such as nationality, date and country of birth (or province if Spanish), gender and place of residence when the individual first entered the Social Security system, along with additional information about the composition of the household and labour market variables.

The first wave covers people who had an economic relationship with the Social Security system in 2004, and provides the entire working history (employment, unemployment and retirement) of the sample population. The sample is updated every year using information from the variables selected from the Social Security system dating back to when computerized records began, and from other administrative data sources which record additional information on individuals. Apart from the details given by the institutions responsible for generating the CSWL,^{Footnote 2} the data available to researchers date from 2004 to 2015.^{Footnote 3}

The sample reference population^{Footnote 4} is defined as individuals who have had some connection (through contributions, pensions or unemployment benefits) to the Social Security system at any time during the year of reference. This population of reference makes it possible to select people who on a particular date in that year had no relationship with Social Security, but who did at some time before or after that date in the same year. This means that the CSWL will contain details about a person who may have had various relationships with Social Security (unemployed, unaffiliated, contributor, pensioner), unlike other datasets that only show a single relationship because they refer to the situation on one specific date in the year.

Individuals are selected from the reference population each year according to whether their identification codes contain certain randomly-generated figures in the correct positions. Each year individuals who figured in the previous version of the CSWL and continue to have a relationship with Social Security remain in the selection, while new individuals are incorporated if they meet the requirement of having identification codes containing the randomly-generated figures described above. The detailed information available on the individuals selected includes work trajectories (from 1967), contribution bases (from 1980) and/or pensions (from 1996), as long as it is contained in the Social Security administrative records. Those individuals who for any reason have no connection to Social Security in a particular year will not figure in the CSWL.

The CSWL is a dataset that has been used in a considerable number of studies, especially on labour economics (Treviño et al. 2008; Benavides et al. 2010; Vall Castello 2012; Bonhomme and Hospido 2013, 2016; Solé et al. 2013; Agliari et al. 2014; Arranz and García-Serrano 2014; Barra et al. 2014 and Nagore García and van Soest 2016) and the Spanish public pension system (Antón Pérez et al. 2007; Argimón et al. 2007; Boado-Penas et al. 2008; Moral Arce et al. 2008; Vidal Meliá et al. 2009; Cairó Blanco 2010; Peinado Martínez 2011; Devesa et al. 2012; Domínguez Fabián et al. 2012; Meneu Gaya and Encinas Goenechea 2012; Vicente Merino et al. 2012; Conde Ruiz and González 2013, and Vegas Sánchez et al. 2013). Other studies give detailed descriptions of its characteristics, advantages and limitations.^{Footnote 5}

Other countries such as the USA and Germany have similar databases but they use stratification. For example, the US Social Security Administration (SSA) has been compiling a Continuous Work History Sample (CWHS) since the late 1930s, which is a one-percent stratified cluster probability sample of all possible Social Security numbers. The population, or sampling frame, from which the CWHS cases are selected consists of the 1 billion possible nine-digit Social Security numbers (SSNs). These digits represent the geographical area of each number allocated, a group for the date of issue and a random serial number (Smith 1989). The numbers are thus stratified geographically by place of application for SSN and chronologically by the process of allocation of numbers within each stratum. The information source is the so-called MEF (Master Earnings File; Olsen and Hudson 2009), which is used to determine pensions for retirement, permanent disability and widowhood in the US public Social Security system (OASDI).

Similarly in Germany (Himmelreicher and Stegmann 2008), there is the so-called sample of insured persons and their insurance accounts (Versicherungskontenstichprobe, VSKT), which provides longitudinal data that have a high potential for analyses of employment biographies and pension claims in old age. These data are process-produced, contain very large samples and allow for differentiated analyses of a variety of social groups. The VSKT was initially sampled in 1983 as a stratified^{Footnote 6} random sample with disproportional selection probabilities and since then has continued as a panel containing monthly information on the individuals included in the sample. It represents 1% of the contributing population.

One important branch of research in pension systems is the problem of global ageing and the sustainability of public pension systems, referred to hereafter as PPS. As mentioned above, the CSWL is one of the datasets considered for the purposes of this research in Spain. In order to analyze Spanish PPS, it is necessary to have information on relevant variables such as age and gender for each year of reference as considered in previous studies. These last two factors are essential for correctly estimating life expectancy, so any study that makes estimates of future benefits should select a sample that is representative of the population in terms of age and gender as well as type of pension.

With all this in mind, the first objective of the paper is to analyse whether the information given by the CSWL on benefit recipients makes up the best replica of the study population that researchers can have. To answer this question of how representative the CSWL is of the population of pensioners, it is important to carry out a statistical analysis to determine whether there are significant differences between the CSWL sample distribution and the population distribution. Even though the researcher does not have the entire population, the distributions of the population of pensioners organized by age, gender and type of pension on the last day of each year are available. Therefore it is possible to carry out a test in order to check whether the CSWL has the same distribution as the population of pensioners. Moreover, given that the CSWL is not a stratified sample, it is advisable to check whether it covers the correct proportion of the population in each stratum to be considered in the study of the pensioner population categorized by age, gender and type of pension.

In this paper we conduct such an analysis for the CSWL waves for 2005–2013 and confirm that there is a lack of representativeness in most years. It has long been known that a stratified random sample (SR) enables a more efficient selection to be made when one of the variables of interest presents great variability, as is the case with the ages of pensioners across the different types of pension. Given these results, the first question that comes to mind is: why not use a random sample with stratification to obtain a dataset that better represents the population of pensioners?

The answer is clear: researchers do not have all the data on the population, so they cannot extract such a sample. They would have to use a stratified random sample (SR) contained in the CSWL, but with a considerably smaller size, which would entail a loss of richness in the information on pensioners’ working lives. Hence it is important and advisable to develop other procedures in order to obtain larger subsamples with less reduction in the total number of pensioners than in the SR subsample contained in the CSWL, and at the same time to make the original CSWL more representative in terms of pension benefits.
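The size penalty of a strictly proportional subsample can be illustrated with a minimal sketch. It assumes that the constant of proportionality is capped by the worst-covered stratum, so that no stratum asks for more records than the CSWL holds. The function name, the stratum labels and all counts are hypothetical and serve only as illustration; they are not taken from the INSS or CSWL data.

```python
from fractions import Fraction
from math import floor


def proportional_subsample(population, sample):
    """Largest strictly proportional allocation that fits inside the sample.

    population and sample map each stratum to a record count.
    Exact fractions avoid floating-point surprises when rounding down.
    """
    # The proportionality constant is limited by the worst-covered stratum.
    q = min(Fraction(sample[h], population[h]) for h in population)
    # Round down so that every stratum stays within the available records.
    alloc = {h: floor(q * population[h]) for h in population}
    return q, alloc


# Hypothetical counts: three strata of a population and of a 4% style sample.
population = {"retirement": 5000, "disability": 1000, "widowhood": 2000}
sample = {"retirement": 210, "disability": 30, "widowhood": 85}

q, alloc = proportional_subsample(population, sample)
kept = sum(alloc.values())
print(f"q = {float(q):.4f}, kept {kept} of {sum(sample.values())} records")
```

In this toy case a single underrepresented stratum drags the whole subsample down, which is the loss of information that motivates looking for larger subsamples.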

Hence the second objective of the paper is to provide researchers with a novel methodology for the design of a dataset on pensioners by making the CSWL more representative of the population in terms of type of pension, gender, and age, and by trying to miss as few pension records as possible so as not to overlook diversity in working lives.

Subsample selection is done by finding a feasible solution to a nonlinear optimization problem (NLP)^{Footnote 7} using mixed integer nonlinear programming with just one real non-negative decision variable, the constant of proportionality, q, in a stratified sample design with proportional allocation. Maximizing q is equivalent to maximizing the size of the subsample and is subject to constraints implied by the fact that the number of pensioners to be included in each cohort of the subsample has to be a natural number (non-negative integer) and that the subsample obtained must be included in the population as well as in the CSWL. The methodology applied uses a goodness of fit test—Pearson’s chi-square test—in order to make the subsample selected more representative of the population, providing p values close to 1. This methodology enables us to obtain quite a large dataset included in the CSWL which is much more representative of the pensioner population in terms of type of pension, gender, and age, as would be the case with an SR. In addition, this procedure enables users to choose between goodness of fit quality and subsample size.
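The trade-off between subsample size and goodness of fit can be sketched as follows. This is only one possible reading of the idea and not the optimization problem solved in the paper: a plain grid search over q stands in for the mixed integer nonlinear program, each stratum is assumed to take the rounded proportional target capped at the records available in the CSWL, and a maximum admissible chi-square statistic (`max_stat`) stands in for a minimum p value. All names and counts are hypothetical.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic for paired observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def allocate(q, population, sample):
    """Rounded proportional target per stratum, capped at the sample count."""
    return [min(c, round(q * n)) for n, c in zip(population, sample)]


def largest_subsample(population, sample, max_stat, steps=1000):
    """Grid search for the biggest allocation whose fit stays acceptable."""
    total_pop = sum(population)
    q_hi = max(c / n for n, c in zip(population, sample))
    best = None
    for i in range(1, steps + 1):
        q = q_hi * i / steps
        alloc = allocate(q, population, sample)
        size = sum(alloc)
        if size == 0:
            continue
        # Expected counts if the subsample mirrored the population exactly.
        expected = [size * n / total_pop for n in population]
        stat = chi_square(alloc, expected)
        if stat <= max_stat and (best is None or size > best["size"]):
            best = {"q": q, "size": size, "alloc": alloc, "stat": stat}
    return best


# Hypothetical strata counts (population, then the available sample).
population = [5000, 1000, 2000]
sample = [210, 30, 85]

result = largest_subsample(population, sample, max_stat=0.5)
print(result)
```

Lowering `max_stat` pushes the allocation towards the strictly proportional case, while raising it keeps more records at the cost of a worse fit.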

Finally, in order to illustrate the gains obtained with the selection of the subsample, the methodology is applied to the CSWL for 2010,^{Footnote 8} to gauge the improvement in the estimate of total pension expenditure in different cohorts taking into account age, gender and type of pension, even though the main objective is not this but to obtain the subsample itself for use in any subsequent analysis of the Spanish public pension system. Given that the lack of representativeness of the CSWL has also been found in other years, our findings suggest that the same procedure might relevantly be applied to select subsamples in the other waves of the CSWL.

The structure of the rest of the paper is as follows: Sect. 2 analyses the representativeness of the CSWL for the years from 2005 to 2013 with respect to pension benefits. Section 3 sets out the distribution by type of pension, gender, and age of a hypothetical CSWL using SR sampling and the distribution of a subsample obtained by SR sampling using the original CSWL. In this section we show the importance of stratification as a sort of backtesting. We check whether stratification matters by looking at the total expenditure estimated using a stratified random sample. Section 4 details the criteria used for subsample selection and the results obtained. The paper ends with conclusions, pointers for future research and two appendices: the first shows all the tables and graphs with the estimates of the total expenditure deviations for 2010, and the second (online) extends the analysis of the goodness of fit of the CSWL to the population (INSS) for the whole period 2005–2013 and summarizes the problem statement whose solution provides the distribution of the number of pensioners in large subsamples which also represent the population better than the CSWL itself.

2 Analysis of the goodness of fit of the CSWL to the population (2005–2013)

In this section we analyse how well the CSWL pension data distribution fits the population distribution of pensioners at December 31st for the years 2005 to 2013^{Footnote 9} by age, gender, and type of pension. The data are available in the statistical reports of the National Social Security Institute (INSS), though it is important to stress that the population from which the sample is drawn comprises all those individuals who have been registered or have received some kind of contributory pension from the Social Security system at any time during the year of reference, regardless of how long they were in that situation. It does not therefore coincide with the figure for the population at December 31st each year or indeed with the population of pensioners, but is larger. However, in our study there is a process of post-stratification of the CSWL with all pensions registered (current) at December 31st being grouped by cohorts for age, gender and type of pension as of that date. We do not add all the pensions that were registered at some time during the year but only those which were recorded as currently registered on December 31st, just as the INSS statistical report for each year does. The composition of the pensioners population is obtained from INSS (2006–2007), INSS (2008–11) and INSS (2012–2014). Pensions deregistered during the year due to the death of the recipient or because the recipient ceased to meet the requirements for receiving the pension are not considered on December 31st. The statistics on the total number of pensioners in the INSS statistical report consider only those pensions recorded as currently registered on December 31st, so a comparison between the distributions of pensioners (by type of pension, gender and age) in the CSWL as of December 31st and the INSS makes sense, because the moment in time considered is the same.

In short, the aim of this analysis is to determine whether there are any statistically significant differences in the weights of the cohorts between the sample and the population using a goodness of fit test.
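A minimal sketch of such a test is given here, assuming Pearson's chi-square statistic with expected counts equal to the sample size times the population share of each cohort. The p value uses the standard closed-form upper tail of the chi-square distribution for integer degrees of freedom; that identity is a textbook result and not something taken from the paper. The function names and the counts are hypothetical.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer gamma terms.
    total = erfc(sqrt(half))
    term = exp(-half) * sqrt(half) / gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total


def goodness_of_fit(sample_counts, population_counts):
    """Pearson chi-square test of sample cohorts against population shares."""
    n = sum(sample_counts)
    total_pop = sum(population_counts)
    expected = [n * p / total_pop for p in population_counts]
    stat = sum((o - e) ** 2 / e for o, e in zip(sample_counts, expected))
    df = len(sample_counts) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical cohort counts: population first, then the sample.
population = [5000, 1000, 2000]
sample = [210, 30, 85]

stat, df, p_value = goodness_of_fit(sample, population)
print(f"chi-square = {stat:.4f}, df = {df}, p = {p_value:.4f}")
```

A p value close to 1 indicates cohort weights that are almost identical to those of the population, whereas a small p value signals a statistically significant difference.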

To conduct the test it is necessary to post-stratify the CSWL once the sample records with no information on gender or date of birth have been deleted. The main theoretical reason why the population of pensioners is stratified by type of pension, gender and age is that life expectancy differs from one pensioner to another depending on the type of benefit received (retirement, permanent disability, widow(er)’s, orphan’s and family responsibilities),^{Footnote 10} on whether the pensioner is a man or a woman, and on the pensioner's age. So in order to make accurate forecasts when analysing the sustainability of the Spanish public pension system, in which life expectancy plays a crucial role, those differences have to be taken into account. Therefore it is very important to have a sample of individuals that adequately represents the population of pensioners, taking into account these variables.

Another practical reason is that the information available to us about the population of pensioners is the distribution of this population organized by age, gender and type of pension. Moreover, the age cohorts considered in our analysis are also given in the format in which the information is disclosed by the INSS.

Once the data on pensions from the CSWL have been post-stratified at December 31st by type of pension, gender, and age cohorts, we perform a preliminary analysis comparing the distribution obtained from the CSWL with that of the population for 2005–2013. We use the equivalent table from the statistical report of the Spanish National Social Security Institute (INSS), once those population records that provide no information on gender or date of birth have been deleted. We thus compare the number of pensions in each cohort of the sample with the same cohort of the population, to check for differences with respect to the 4% of the population that the sample should in theory represent.

The ratio between the number of pensioners by cohort in the CSWL and the respective cohort in the population is calculated. This is a first approach to detect differences that, given the framework, may be considered significant in practice (Wang 1993). Table 1 shows the percentage of the population represented for each cohort, pointing out those cohorts underrepresented with percentages of less than 3% and those overrepresented with percentages greater than 5% for 2010. Similar tables for other years are provided in the Online Appendix. Mere observation of the tables with these calculations suffices to detect the presence of the same kind of problem of overrepresentation and underrepresentation for certain cohorts in almost all the waves analysed.
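The cohort-by-cohort comparison can be sketched as below, using the 3% and 5% thresholds mentioned above around the theoretical 4% sampling rate. The function name, the cohort labels and the counts are hypothetical and do not reproduce any figure from Table 1.

```python
def coverage_flags(sample_counts, population_counts, low=0.03, high=0.05):
    """Share of each population cohort present in the sample, with a flag."""
    flags = {}
    for cohort, pop in population_counts.items():
        share = sample_counts.get(cohort, 0) / pop
        if share < low:
            label = "under"   # fewer than 3% of the cohort sampled
        elif share > high:
            label = "over"    # more than 5% of the cohort sampled
        else:
            label = "ok"
        flags[cohort] = (share, label)
    return flags


# Hypothetical cohorts keyed by (type of pension, gender, age band).
population = {
    ("retirement", "men", "65-69"): 1000,
    ("retirement", "women", "65-69"): 1000,
    ("orphan", "women", "0-4"): 1000,
}
sample = {
    ("retirement", "men", "65-69"): 40,
    ("retirement", "women", "65-69"): 60,
    ("orphan", "women", "0-4"): 20,
}

flags = coverage_flags(sample, population)
for cohort, (share, label) in flags.items():
    print(cohort, f"{share:.1%}", label)
```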

Table 1 Percentages of pensions in the CSWL out of the total INSS population by age, 2010.
