1 Introduction

Selecting a representative sample from the population is a very important factor in quantitative research, given that results obtained from a wrongly selected sample that is not properly in tune with the object of study cannot be generalized to the whole population. Moreover, a smaller-than-appropriate sample may not have enough diversity to enable significant differences or associations potentially present in the target population to be identified. It is fundamental to make the right choice and decide on the right way to deal with the dataset, always making sure that it is the best one for the purposes of the research. Hence, it is important to check that the sample selected is representative of the target population in terms of demographic characteristics as well as any others that can affect the results of the study to be conducted.Footnote 1

The Continuous Sample of Working Lives (CSWL) is a random sample (RS) of around 1.2 million people, i.e. 4% of the reference population. It contains administrative data on working lives, which provide the basis for this sample taken from Spanish Social Security records, and comprises anonymized microdata with detailed information on individuals. Izquierdo et al. (2009) point out that this database provides a unique dataset with very rich information about labour market histories and personal characteristics, such as nationality, date and country of birth (or province if Spanish), gender and place of residence when the individual first entered the Social Security system, along with additional information about the composition of the household and labour market variables.

The first wave covers people who had an economic relationship with the Social Security system in 2004, and provides the entire working history (employment, unemployment and retirement) of the sample population. The sample is updated every year using information from the variables selected from the Social Security system dating back to when computerized records began, and from other administrative data sources which record additional information on individuals. Apart from the details given by the institutions responsible for generating the CSWL,Footnote 2 the data available to researchers date from 2004 to 2015.Footnote 3

The sample reference populationFootnote 4 is defined as individuals who have had some connection (through contributions, pensions or unemployment benefits) to the Social Security system at any time during the year of reference. This population of reference makes it possible to select people who on a particular date in that year had no relationship with Social Security, but who did at some time before or after that date in the same year. This means that the CSWL will contain details about a person who may have had various relationships with Social Security (unemployed, unaffiliated, contributor, pensioner), unlike other datasets that only show a single relationship because they refer to the situation on one specific date in the year.

Individuals are selected from the reference population each year according to whether their identification codes contain certain randomly-generated figures in the correct positions. Each year individuals who figured in the previous version of the CSWL and continue to have a relationship with Social Security remain in the selection, while new individuals are incorporated if they meet the requirement of having identification codes containing the randomly-generated figures described above. The detailed information available on the individuals selected includes work trajectories (from 1967), contribution bases (from 1980) and/or pensions (from 1996), as long as it is contained in the Social Security administrative records. Those individuals who for any reason have no connection to Social Security in a particular year will not figure in the CSWL.

The CSWL is a dataset that has been used in a considerable number of studies, especially on labour economics (Treviño et al. 2008; Benavides et al. 2010; Vall Castello 2012; Bonhomme and Hospido 2013, 2016; Solé et al. 2013; Agliari et al. 2014; Arranz and García-Serrano 2014; Barra et al. 2014 and Nagore García and van Soest 2016) and the Spanish public pension system (Antón Pérez et al. 2007; Argimón et al. 2007; Boado-Penas et al. 2008; Moral Arce et al. 2008; Vidal Meliá et al. 2009; Cairó Blanco 2010; Peinado Martínez 2011; Devesa et al. 2012; Domínguez Fabián et al. 2012; Meneu Gaya and Encinas Goenechea 2012; Vicente Merino et al. 2012; Conde Ruiz and González 2013, and Vegas Sánchez et al. 2013). Other studies give detailed descriptions of its characteristics, advantages and limitations.Footnote 5

Other countries such as the USA and Germany have similar databases, but they use stratification. For example, the US Social Security Administration (SSA) has been compiling a Continuous Work History Sample (CWHS) since the late 1930s, which is a one-percent stratified cluster probability sample of all possible Social Security numbers. The population, or sampling frame, from which the CWHS cases are selected consists of the 1 billion possible nine-digit Social Security numbers (SSNs). These digits represent the geographical area of each number allocated, a group for the date of issue and a random serial number (Smith 1989). The numbers are thus stratified geographically by place of application for SSN and chronologically by the process of allocation of numbers within each stratum. The information source is the Master Earnings File (MEF; Olsen and Hudson 2009), which is used to determine pensions for retirement, permanent disability and widowhood in the US public Social Security system (OASDI).

Similarly in Germany (Himmelreicher and Stegmann 2008), there is the so-called sample of insured persons and their insurance accounts (Versicherungskontenstichprobe, VSKT), which provides longitudinal data that have a high potential for analyses of employment biographies and pension claims in old age. These data are process-produced, contain very large samples and allow for differentiated analyses of a variety of social groups. The VSKT was initially sampled in 1983 as a stratifiedFootnote 6 random sample with disproportional selection probabilities and since then has continued as a panel containing monthly information on the individuals included in the sample. It represents 1% of the contributing population.

One important branch of research in pension systems is the problem of global ageing and the sustainability of public pension systems, referred to hereafter as PPS. As mentioned above, the CSWL is one of the datasets considered for the purposes of this research in Spain. In order to analyze Spanish PPS, it is necessary to have information on relevant variables such as age and gender for each year of reference, as considered in previous studies. These last two factors are essential for correctly estimating life expectancy, so any study that makes estimates of future benefits should select a sample that is representative of the population in terms of age and gender as well as type of pension.

With all this in mind, the first objective of the paper is to analyse whether all the information given by the CSWL on the benefit recipients makes up the best replica of the study population that researchers can have. To solve this question of how representative the CSWL is of the population of pensioners, it is important to carry out a statistical analysis to determine whether there are significant differences between the CSWL sample distribution and the population distribution. Even though the researcher does not have the entire population, the distributions of the population of pensioners organized by age, gender and type of pension on the last day of each year are available. Therefore it is possible to carry out a test in order to check whether the CSWL has the same distribution as the population of pensioners. However, given that the CSWL is not a stratified sample, it is advisable to check whether it covers the correct proportion of the population in each stratum to be considered in the study of the pensioner population categorized by age, gender and type of pension.

In this paper we conduct such an analysis for the CSWL waves for 2005–2013 and confirm that there is a lack of representativeness in most years. It has long been known that a stratified random sample (SR) enables a more efficient selection to be made when one of the variables of interest presents great variability, as is the case with the ages of pensioners in the different types of pension. Given these results, the first question that comes to mind is: why not use a random sample with stratification to obtain a dataset that better represents the population of pensioners?

The answer is clear: researchers do not have all the data on the population, so they cannot extract such a sample. They would have to use a stratified random sample (SR) contained in the CSWL, but with a considerably smaller size, which would entail a loss of richness in the information on pensioners’ working lives. Hence it is important and advisable to develop other procedures in order to obtain larger subsamples with less reduction in the total number of pensioners than in the SR subsample contained in the CSWL, and at the same time to make the original CSWL more representative in terms of pension benefits.

Hence the second objective of the paper is to provide researchers with a novel methodology for the design of a dataset on pensioners by making the CSWL more representative of the population in terms of type of pension, gender, and age, and by trying to miss as few pension records as possible so as not to overlook diversity in working lives.

Subsample selection is done by finding a feasible solution to a nonlinear optimization problem (NLP)Footnote 7 using mixed integer nonlinear programming with just one real non-negative decision variable, the constant of proportionality, q, in a stratified sample design with proportional allocation. Maximizing q is equivalent to maximizing the size of the subsample and is subject to constraints implied by the fact that the number of pensioners to be included in each cohort of the subsample has to be a natural number (non-negative integer) and that the subsample obtained must be included in the population as well as in the CSWL. The methodology applied uses a goodness of fit test—Pearson’s chi-square test—in order to make the subsample selected more representative of the population, providing p values close to 1. This methodology enables us to obtain quite a large dataset included in the CSWL which is much more representative of the pensioner population in terms of type of pension, gender, and age, as would be the case with an SR. In addition, this procedure enables users to choose between goodness of fit quality and subsample size.

Finally, in order to illustrate the gains obtained with the selection of the subsample, the methodology is applied to the CSWL for 2010,Footnote 8 to gauge the improvement in the estimate of total pension expenditure in different cohorts taking into account age, gender and type of pension, even though the main objective is not this but to obtain the subsample itself for use in any subsequent analysis of the Spanish public pension system. Given that the lack of representativeness of the CSWL has also been found in other years, our findings suggest that the same procedure might relevantly be applied to select subsamples in the other waves of the CSWL.

The structure of the rest of the paper is as follows: Sect. 2 analyses the representativeness of the CSWL for the years from 2005 to 2013 with respect to pension benefits. Section 3 sets out the distribution by type of pension, gender, and age of a hypothetical CSWL using SR sampling and the distribution of a subsample obtained by SR sampling using the original CSWL. In this section we show the importance of stratification as a sort of backtesting. We check whether stratification matters by looking at the total expenditure estimated using a stratified random sample. Section 4 details the criteria used for subsample selection and the results obtained. The paper ends with conclusions, pointers for future research and two appendices: the first shows all the tables and graphs with the estimates of the total expenditure deviations for 2010, and the second (online) extends the analysis of the goodness of fit of the CSWL to the population (INSS) for the whole period 2005–2013 and summarizes the problem statement whose solution provides the distribution of the number of pensioners in large subsamples which also represent the population better than the CSWL itself.

2 Analysis of the goodness of fit of the CSWL to the population (2005–2013)

In this section we analyse how well the CSWL pension data distribution fits the population distribution of pensioners at December \(31{\mathrm{st}}\) for the years 2005 to 2013Footnote 9 by age, gender, and type of pension. The data are available in the statistical reports of the National Social Security Institute (INSS), though it is important to stress that the population from which the sample is drawn comprises all those individuals who have been registered or have received some kind of contributory pension from the Social Security system at any time during the year of reference, regardless of how long they were in that situation. It does not therefore coincide with the figure for the population at December \(31{\mathrm{st}}\) each year or indeed with the population of pensioners, but is larger. However, in our study there is a process of post-stratification of the CSWL with all pensions registered (current) at December \(31{\mathrm{st}}\) being grouped by cohorts for age, gender and type of pension as of that date. We do not add all the pensions that were registered at some time during the year but only those which were recorded as currently registered on December \(31{\mathrm{st}}\), just as the INSS statistical report for each year does. The composition of the pensioners population is obtained from INSS (2006–2007), INSS (2008–11) and INSS (2012–2014). Pensions deregistered during the year due to the death of the recipient or because the recipient ceased to meet the requirements for receiving the pension are not considered on December \(31{\mathrm{st}}\). The statistics on the total number of pensioners in the INSS statistical report consider only those pensions recorded as currently registered on December \(31{\mathrm{st}}\), so a comparison between the distributions of pensioners (by type of pension, gender and age) in the CSWL as of December \(31{\mathrm{st}}\) and the INSS makes sense, because the moment in time considered is the same.

In short, the aim of this analysis is to determine whether there are any statistically significant differences in the weights of the cohorts between the sample and the population using a goodness of fit test.

To conduct the test it is necessary to carry out a post-stratification of the CSWL once the sample records with no information on gender or date of birth have been deleted. The main theoretical reason why the population of pensioners is stratified by type of pension, gender and age is that life expectancy differs depending on the type of benefit received (retirement, permanent disability, widow(er)’s, orphan’s and family responsibilities),Footnote 10 on whether the pensioner is a man or a woman, and on the pensioner’s age. So in order to make accurate forecasts when analysing the sustainability of the Spanish public pension system, in which life expectancy plays a crucial role, those differences have to be taken into account. Therefore it is very important to have a sample of individuals that adequately represents the population of pensioners, taking into account these variables.

Another practical reason is that the information available to us about the population of pensioners is the distribution of this population organized by age, gender and type of pension. Moreover, the age cohorts considered in our analysis are also given in the format in which the information is disclosed by the INSS.

Once the data on pensions from the CSWL has been post-stratified at December \(31{\mathrm{st}}\) by type of pension, gender, and age cohorts, we perform a preliminary analysis comparing the distribution obtained from the CSWL with that of the population for 2005–2013. We use the equivalent table from the statistical report of the Spanish National Social Security Institute (INSS), once those population records that provide no information on gender or date of birth have been deleted. We thus compare the number of pensions in each cohort of the sample with the same cohort of the population, to check for differences with respect to the 4% of the population that the sample should in theory represent.

The ratio between the number of pensioners by cohort in the CSWL and the respective cohort in the population is calculated. This is a first approach to detect differences that, given the framework, may be considered significant in practice (Wang 1993). Table 1 shows the percentage of the population represented for each cohort, pointing out those cohorts underrepresented with percentages of less than 3% and those overrepresented with percentages greater than 5% for 2010. Similar tables for other years are provided in the Online Appendix. Mere observation of the tables with these calculations suffices to detect the presence of the same kind of problem of overrepresentation and underrepresentation for certain cohorts in almost all the waves analysed.
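The cohort-by-cohort comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `coverage_flags`, the cohort labels and all counts are hypothetical and are not taken from Table 1. The 3% and 5% thresholds are the ones used in the text.

```python
def coverage_flags(sample_counts, population_counts, low=0.03, high=0.05):
    """Compare sample and population counts cohort by cohort.

    Returns {cohort: (ratio, label)}, where ratio is the share of the
    population cohort present in the sample and label is "under", "ok",
    "over", "no population" or "empty".
    """
    result = {}
    for cohort, pop in population_counts.items():
        n = sample_counts.get(cohort, 0)
        if pop == 0:
            # The sample contains records for a cohort that is empty in
            # the population (or both are empty): no ratio can be computed.
            result[cohort] = (None, "no population" if n else "empty")
            continue
        ratio = n / pop
        if ratio < low:
            label = "under"
        elif ratio > high:
            label = "over"
        else:
            label = "ok"
        result[cohort] = (ratio, label)
    return result


# Made-up counts for illustration only.
population = {"60-64": 10000, "65-69": 20000, "70-74": 15000, "15-19": 0}
sample = {"60-64": 250, "65-69": 810, "70-74": 900, "15-19": 3}

flags = coverage_flags(sample, population)
for cohort, (ratio, label) in flags.items():
    print(cohort, ratio, label)
```

Cohorts with no pensions in the population are reported separately because no percentage can be computed for them.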

Table 1 Percentages of pensions in the CSWL out of the total INSS population by age, 2010.

Analysing the results for the said ratios, it can be concluded that there are cases in all the years where the figure exceeds 5% or fails to reach 3% of the population, i.e. where it deviates from the percentage of the population (4%) represented by the CSWL, with some cohorts being considerably overrepresented in relative terms. In all the years considered the CSWL also contains age cohorts for some types of pension that present outliers, while in the population those cohorts have no pensions.

The main mismatches are found in permanent disability and widow(er)’s pensions, and to a lesser extent retirement pensions, in the case of men. The mismatches are greatest in 2005, 2006, 2008, 2009, 2010 and 2011, and less significant in 2007, 2012 and 2013, given that the number of cohorts more than one fourth away from 4% is smaller and where such differences do exist they are smaller. Hence for the years where the differences are greater, the statistical test is expected to provide results that support the existence of statistically significant differences not due exclusively to sample size.

Pearson’s chi-squared test \((\chi ^{2})\) is considered as a test of goodness of fit to check whether the sample follows the same distribution as the population of pensioners as of \(31{\mathrm{st}}\) December. Goodness of fit tests usually have a given hypothesis as to the theoretical distribution for the population, which they test using the data observed in the sample. In the case of the CSWL the distribution of the population of pensioners by age, gender, and type of pension is known from the statistical report of the INSS (INSS, 2006–2014).

We use Pearson’s chi-square test, given by:

$$\begin{aligned} {\chi }^{2}=\sum _{i=1}^k \frac{\left( {O_i -E_i } \right) ^{2}}{E_i} \end{aligned}$$
(1)

in which \(O_i\) represents the values observed in the CSWL and \(E_i\) the expected or theoretical values, i.e. those obtained from the distribution of the population of pensioners in each of the years considered, 2005–2013, as given by the INSS.

For this test the expected frequency for each cohort needs to be calculated by gender and type of pension. That is, for a given gender and type of pension we calculate the relative frequency as the ratio of the number of pensions in each age cohort to the total pensions for the same gender and type of pension in the population.

To calculate the expected or theoretical values we use the following definition of the relative distribution of expected frequency for population \(\hat{f}\):

$$\begin{aligned} \hat{f}_{i,\,j,\,k} =\frac{N_{i,\,j,\,k}}{\sum _{i=1}^{18} N_{i,j,\,k} }=\frac{N_{i,\,j,\,k} }{N_{j,k}} \end{aligned}$$
(2)

Using this relative frequency, the expected values, i.e. the absolute expected frequencies \(\hat{e}\) in the subsample, are:

$$\begin{aligned} E_i =\hat{e}_{i,\,j,\,k} =\hat{f}_{i,\,j,\,k} \cdot n_{j,k} \end{aligned}$$
(3)

where i is the index for the 18 cohorts into which the variable “age” has been divided; j is the index corresponding to “gender” (male, female); k is the index for the 5 types of pension (permanent disability, retirement, widow(er)’s, orphan’s and family responsibilities); \(N_{i,\,j,\,k}\) is the number of pension benefits in the population in age cohort i for gender j and pension type k; \(N_{j,k}\) is the number of pension benefits in the population per type of pension k and gender j; and \(n_{j,k}\) is the number of pension benefits in the CSWL for pension type k and gender j.
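As a minimal sketch of how Eqs. (1) to (3) fit together for a single gender and type of pension, the following Python fragment computes the expected counts from the population distribution and then the chi-square statistic. The function names and the three-cohort counts are hypothetical. Skipping cells with a zero expected count is a simplification made here, not a step described in the text.

```python
def expected_counts(pop_by_age, n_sample):
    """Eqs. (2)-(3): E_i = (N_i / sum of N_i) * n for one gender and pension type."""
    total = sum(pop_by_age)
    return [n_sample * N_i / total for N_i in pop_by_age]


def pearson_chi2(observed, expected):
    """Eq. (1): sum of (O_i - E_i)^2 / E_i over cells with E_i > 0."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Illustrative numbers only: three age cohorts instead of 18.
population = [5000, 15000, 30000]   # N_{i,j,k} for fixed j and k
observed = [30, 50, 120]            # sample counts O_i for the same cohorts

n_jk = sum(observed)                         # n_{j,k}
expected = expected_counts(population, n_jk)
chi2 = pearson_chi2(observed, expected)

print(expected)
print(round(chi2, 4))
```

With these numbers the population shares are 10%, 30% and 60%, so the expected counts for a sample of 200 are 20, 60 and 120.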

For large samples, as is the case with the CSWL, which covers more than 300,000 pensioners in every available year, it is very unlikely that the sample will be a perfect fit to the population, so the test statistic will show a rejection of the hypothesis that the sample and the population have the same distribution and conclude that the differences between the two distributions are statistically significant. Those differences could be magnified because of the large size of the sample. To overcome this possible error in the interpretation of the test results (pointed out by Berkson 1938; Wang 1993 and Lin et al. 2013 among others) it is important to ensure that the differences found are not due to the large size of the sample, so there is evidence not only of statistical differences but also of practical ones.
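A short derivation makes this dependence on sample size explicit. As a simplifying assumption for illustration, write the observed counts as \(O_i = n\,p_i\) and the expected counts as \(E_i = n\,\pi_i\), where \(p_i\) and \(\pi_i\) (symbols introduced only for this sketch) are the sample and population proportions in cell i and n is the sample size. Substituting into Eq. (1) gives:

$$\begin{aligned} {\chi }^{2}=\sum _{i=1}^k \frac{\left( {n\,p_i -n\,\pi _i } \right) ^{2}}{n\,\pi _i}=n\sum _{i=1}^k \frac{\left( {p_i -\pi _i } \right) ^{2}}{\pi _i} \end{aligned}$$

For any fixed discrepancy between the two sets of proportions, the statistic therefore grows linearly with n, whereas the critical value depends only on the degrees of freedom. A discrepancy that is too small to matter in practice can thus still lead to rejection when n is very large.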

According to Wilkinson (1999), statistical significance refers to whether the effect observed is larger than would be expected by chance, i.e. can the null hypothesis that there is no effect be rejected? This is what is typically addressed by p values. Practical significance is about whether we should care, i.e. whether the effect is useful in an applied context. Two groups will almost never be exactly the same if thousands or millions of people are tested. That does not mean that every difference is of interest. This is usually associated with effect size measures [e.g. Cohen’s d, (Cohen 1988)].

In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity. An effect size calculated from data is a descriptive statistic that expresses the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. Thus effect sizes complement inferential statistics such as p values. Sample-based effect sizes are distinguished from test statistics used in hypothesis testing in that they estimate the strength of an apparent relationship rather than assigning a significance level reflecting whether the relationship could be due to chance. The effect size does not determine the significance level or vice-versa. Given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero.

The best measures of association for the chi-square test are \({\upvarphi }\) (or Cramér’s \(\upvarphi \)) and Cramér’s V. \({\upvarphi }\) is related to the point-biserial correlation coefficient and Cohen’s d, and estimates the extent of the relationship between two variables in a 2\(\,\times \,\)2 table. Cramér’s V may be used with variables that have more than two levels. \({\upvarphi }\) is computed as the square root of the chi-square statistic divided by the sample size:

$$\begin{aligned} {\upvarphi }=\sqrt{\frac{{\chi }^{2}}{N}} \end{aligned}$$
(4)

Similarly, Cramér’s V is computed by taking the square root of the chi-square statistic divided by the product of the sample size and the length of the minimum dimension minus one:

$$\begin{aligned} \hbox {V}=\sqrt{\frac{{\chi }^{2}}{N\cdot \left( {k-1} \right) }} \end{aligned}$$
(5)

where N is the sample size and k is the number of regrouped cohorts (not to be confused with the pension-type index k used in Eqs. 2 and 3).
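The following Python sketch computes both effect-size measures from Eqs. (4) and (5) and illustrates why they are useful with large samples. The counts are hypothetical. Multiplying every count by 100 multiplies the chi-square statistic by 100 but leaves Cramér’s V unchanged.

```python
import math


def chi2_stat(observed, expected):
    """Pearson chi-square statistic, Eq. (1)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def phi(chi2, n):
    """Eq. (4): square root of chi-square over the sample size."""
    return math.sqrt(chi2 / n)


def cramers_v(chi2, n, k):
    """Eq. (5): square root of chi-square over n * (k - 1)."""
    return math.sqrt(chi2 / (n * (k - 1)))


# Illustrative counts for three cohorts (not real data).
observed = [30, 50, 120]
expected = [20.0, 60.0, 120.0]
chi2_small = chi2_stat(observed, expected)
v_small = cramers_v(chi2_small, sum(observed), len(observed))

# Same proportions, sample 100 times larger.
scale = 100
observed_big = [scale * o for o in observed]
expected_big = [scale * e for e in expected]
chi2_big = chi2_stat(observed_big, expected_big)
v_big = cramers_v(chi2_big, sum(observed_big), len(observed_big))

print(round(chi2_small, 4), round(v_small, 4))
print(round(chi2_big, 4), round(v_big, 4))
```

The test statistic alone would suggest a much stronger rejection for the larger sample, while the effect size shows that the underlying discrepancy is the same.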

Once the hypothesis has been tested in the case of the fit of the distribution by age for each gender and type of pension, to determine whether the differences found are statistically significant, the size of the effect is estimated using Cramér’s V, the results of which enable the preliminary analysis of practical significance to be completed.

Depending on the values of a contingency coefficient such as Cramér’s V it is possible to determine the size of the effect (Cohen 1988), and it can be used to help decide whether or not those differences which are statistically significant are due to the large size of the sample.

Table 2 Results of the chi-squared test for 2010.

The results of the test by type of pension and gender for 2010 are summarized in Table 2, whereas the results for the whole period can be found in the Online Appendix (Tables 1.9–1.18). It can be observed that they are almost equivalent in most years, which means that the hypothesis that the CSWL has the same distribution as the population in most pension benefits (permanent disability, retirement and widow(er)s) is rejected. However, in pension benefits for retirement the size of the effect is negligible, so the differences detected can be attributed to the large size of the sample. The causes of these differences can be attributed to the sample design (simple random sampling), to administrative errors and to a reclassification of pensioners older than 65 with permanent disability benefits, who are considered as disabled in the CSWL but as retirement pension beneficiaries in the official population statistics.

In the versions of the CSWL for 2007, 2012, and 2013, the differences found in the distribution of pensioners receiving permanent disability benefits are due to the large size of the sample, given that the size of the effect is negligible. This does not happen in the case of pension benefits for widow(er)s. It has also been detected in those years that the code assigned to most pensioners over 65 on permanent disability benefits is changed to retirement benefits, when in the previous years they continued to be classed as receiving permanent disability benefits. This explains the better fit in 2007, 2012 and 2013. It is worth noting in particular what happens after 2007: the same coding errors appear in the pensions for permanent disability for pensioners older than 65.

Hence it is concluded that the fit is not good for some types of pension benefit given that some cohorts are over or underrepresented in the CSWL with respect to the actual population of pensioners, and contain a number of pensions clearly higher or lower than the figure expected depending on the proportion of the reference population represented by the CSWL (4%). The results seem to suggest that for 2005, 2006, 2008, 2009, 2010 and 2011 the CSWL does not fit the distribution of the population well in terms of type of pension, gender, and age for two types of pension benefit: permanent disability and widowhood. The mismatch is greater in the former, and the poor fit is not attributable solely to the large sample size. In other years the null hypothesis that the CSWL has the same distribution as the population cannot be rejected given the size of the sample, but there is room for improvement in the cases where the null hypothesis cannot be rejected, as will be shown in the next section.

These results must be taken into account when making forecasts on the sustainability of the Spanish public pension system using the CSWL, given that we find that for most years and some types of pension benefit it does not correctly represent the distribution by age of the contributory pensions in the system whose sustainability is to be analysed. It is concluded that the significance of the differences detected goes beyond a single year given that they are found for a considerable number of waves of the CSWL.

Wang (1993) advocates asking whether these differences are important or significant in practice too. The answer to this question differs from case to case and according to the experience of each research team, as there is no statistic for measuring significance in practice. Hence some researchers may consider an error of 1% to be important while for others it is negligible, depending on the context and goal of each study.

To estimate the significance in practice of the differences in the distribution of the number of pensions by cohorts, the total annual expenditure on pensions by cohorts could be estimated using the CSWL and compared with the estimate obtained using the population, which is known. In addition, if the estimate provided by a sample obtained using SR can be compared, it would reveal how much margin for improvement there is in these estimates and hence the significance of the differences found. We seek to answer these questions below. In particular, if one of the objectives is to use the data from the CSWL to make forecasts about the sustainability of the Spanish public pension system, it is advisable to have more representative subsamples based on using an SR, the characteristics of which are described in the following section.

3 Checking whether stratification matters

When deciding what sampling design is most appropriate for studying pension benefits and pension expenditure, it is important to know whether it is relevant to divide the population into levels and groups. To show the real importance of stratification for the case of the CSWL, in this section a sort of backtesting is carried out. This is a process widely used in finance, demographics and insurance among other fields. In our case we want to test our methodology on prior time periods. Instead of applying the methodology for the time period forward, in which case its effectiveness could take years to check, our procedure is applied to relevant past data in order to gauge its usefulness.

We have the information on the distribution of pensioners organized by age, gender and type of pension and we know the number of pensions and the mean pension expenditure in each group, so in this section we check whether stratification matters by looking at the total expenditure estimated using a stratified random sample. Given that we do not have the entire population, we obtain the distribution of pensions of a hypothetical stratified random sample extracted from the population and estimate the total annual pension expenditure by cohort at \(31{\mathrm{st}}\) December 2010. This is then compared with the CSWL to check for any improvement in the estimate and forecasts for pension expenditure for 2010 using the hypothetical sample obtained from the population by stratified random sampling (SR) with respect to the CSWL. An improvement can be expected because we find that for most years and some types of pension benefit the CSWL does not correctly represent the distribution by age of the contributory pensions in the system.

It is clear that stratification is indeed relevant because, for example, the age of pensioners is a variable that has an important influence when it comes to forecasting expenditure on pensions. As stated above, these variables are also important in analysing the sustainability of the public pension system, where pensioners’ life expectancy plays a very important role.

One of the main goals of stratification is to give a better cross-section of the population so as to increase relative precision. There are several reasons to use stratified random sampling. Stratification ensures adequate representation of various groups of the population which may be of interest or importance. The use of stratification increases the accuracy with which a characteristic of a population can be estimated. It is achieved by dividing a heterogeneous population into sub-populations, each of which is homogeneous within itself. When there are extreme values in the population, stratification is more powerful because individual strata will be more homogeneous and separate estimates obtained from them can be combined into a precise estimate for the whole population by taking a relatively smaller sample (Singh and Chaudhary 1986).

Once the strata or groups to be considered are established, the next step is to decide the sample size in each stratum, i.e. the method of allocation. In this study we have opted for proportional allocation. This allocation was originally proposed by Bowley (1926), and it is very common in practice because of its simplicity. When the only information available is \({ N}_{i}\), the total number of units in the i-th stratum, as is the case here because the number of pensioners in the population in each age cohort by gender and type of pension is known, a given sample of size n is allocated in the different strata in proportion to their sizes, i.e. in the i-th stratum:

$$\begin{aligned} n_i /n=N_i /N\rightarrow n_i =n\,\left( N_i /N \right) =n \,\left( {w_{i}} \right) =N_i \left( {n/N} \right) \end{aligned}$$
(6)

This means that the sampling fraction is the same in all strata and coincides with the overall sampling fraction, whose value is the constant of proportionality, \(q=n/N\).
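
Proportional allocation as in Eq. (6) can be illustrated with a minimal Python sketch. The function name and the stratum sizes below are hypothetical and chosen only for illustration; they are not data from the CSWL or the INSS.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a sample of size n across strata in proportion to N_i (Eq. 6).

    stratum_sizes maps a stratum label to its population size N_i.
    Returns the (unrounded) allocations n_i and the sampling fraction q = n/N.
    """
    N = sum(stratum_sizes.values())
    q = n / N  # overall sampling fraction, identical in every stratum
    allocation = {label: n * N_i / N for label, N_i in stratum_sizes.items()}
    return allocation, q


# Hypothetical strata, for illustration only.
sizes = {"A": 5000, "B": 3000, "C": 2000}
allocation, q = proportional_allocation(sizes, 400)
```

In this toy case every stratum is sampled at the same rate q, so the allocations sum to n before any rounding to whole units.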

The technique of post-stratification consists of classifying the population and the selected sample into a given number of strata or groups after selection of the sample as a simple random sample. This is done here by taking the CSWL (simple random sample) and grouping the data by type of pension, gender and age. The problem of post-stratification was discussed originally by Hansen et al. (1953), and the technique is normally applied when the value of the variable used to build the strata is not known but might be known after the simple random sample has been obtained. This is not the case with the CSWL, given that the variables used to build the strata or cohorts (type of pension, gender and age) are known. Following Tryfos (1996), considering the selection of a simple random sample of individuals, the post stratified estimator for the population mean of the variable of interest Y once the sampled individuals have been classified into the M groups is:

$$\begin{aligned} \bar{Y}_{{ps}} = w_{{1}} \,\bar{Y}_{1} + w_{2} \bar{Y}_{2} + \cdots + w_{M} \,\,\bar{Y}_{M} \end{aligned}$$
(7)

where \(w_{i} =N_i /N\), and \(\bar{Y}_{i}\) is the mean of the variable Y of the sampled individuals in group i. It is calculated exactly like the stratified estimator \(\bar{Y}_{S}\) but is based on the results of a simple random sample rather than a stratified one. The weights \(w_{i}\) are assumed to be known. When the size of the simple random sample, n, is large, the proportion of sampled elements that fall into a given group can be expected to be approximately equal to the proportion of elements of that group or stratum in the population, that is, \(n_i /n\approx N_i /N\). So when n is large, the post-stratified estimator based on a simple random sample can be expected to behave like a stratified estimator based on a proportional stratified sample.
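
The post-stratified estimator in Eq. (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the group labels and the sample values are hypothetical, and the weights \(w_i\) are taken as known, as in the text.

```python
from collections import defaultdict


def post_stratified_mean(sample, weights):
    """Post-stratified mean (Eq. 7): sum over groups of w_i * (sample mean in group i).

    sample  : iterable of (group, y) pairs drawn by simple random sampling
    weights : dict mapping group -> w_i = N_i / N (assumed known)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, y in sample:
        totals[group] += y
        counts[group] += 1
    missing = [g for g in weights if counts[g] == 0]
    if missing:
        raise ValueError(f"no sampled units in groups: {missing}")
    return sum(w * totals[g] / counts[g] for g, w in weights.items())


# Hypothetical data: two groups with known population weights of 0.5 each.
sample = [("m", 10.0), ("m", 14.0), ("f", 8.0), ("f", 8.0), ("f", 11.0)]
weights = {"m": 0.5, "f": 0.5}
y_ps = post_stratified_mean(sample, weights)
```

Here the group means are 12 and 9, so the post-stratified mean is 10.5, whereas the unweighted sample mean is 10.2 because the sample over-represents group "f" relative to its assumed population weight.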

These considerations on the data from the population are important in studying the average age and average pension of the population relative to the same data in the age cohorts or strata of the CSWL and for the subsequent post-stratification. However, taking into account that the average ages and the average pensions of all the population are known for each type of pension, gender and age cohort, the interest is focused on whether \(n_i /n\approx N_i /N\), even though the CSWL size may be considered large. The first section of the paper has questioned whether the proportions in the strata or cohorts of the CSWL are similar to the corresponding ones in the population. Table 3 presents the distribution by type of pension, gender and age that a sample should have using proportional stratified sampling if it is to be representative of the population of pensioners with a constant of proportional allocation of, approximately, \(q=3.992\%\). This constant of proportional allocation is the result of dividing the number of pension benefits at December \(31{\mathrm{st}}\) contained in the CSWL, \(n^{CSWL}=349,169\) (source: authors' own calculations based on the CSWL for 2010), by the total number of pensions in the population taken from INSS (2011), \(N^{INSS}=8,747,470\). The results in Table 3 total 3 pensions more than in the CSWL because of rounding.

To correctly estimate the dimension of the problems found in our analysis it is worth highlighting the differences in the relative importance of each type of pension: retirement, permanent disability and death. For example, in 2010 (see Table 3) retirement pensions account for 59.48% of the total number of beneficiaries. Gender really matters: for men the figure is 78.43% whereas for women it is only 41.50%. Widow(er)s account for 26.31% of the total number of pensions, but for women the figure is 47.72%, which is even higher than for retirement pensions. In the case of men the weight of this contingency is much lower at just 3.73% of the total number of beneficiaries.

Next we estimate total annual pension expenditure by cohort at \(31{\mathrm{st}}\) December 2010 for the population and for the CSWL (Table 4). To isolate the effect of the differences in the distribution of average pension amounts between the CSWL and the actual population (INSS), we estimate total expenditure for each cohort by taking the average pension published in the 2010 INSS statistical report. To obtain pension expenditure for the population (INSS), we multiply the number of pensions by the average pension for each cohort and by the coefficient (between 12 and 14) for adjusting the total amount of pensions at December \(31{\mathrm{st}}\) to recognized expenditure in financial year 2010 on each type of pension, as shown in the fourth column (No. INSS adjusted payments) in Table 4.

To obtain figures for the seventh column in the table [CSWL expenditure (year)], for each cohort we multiply the number of pensions by the average pension, by the raising factor (the ratio of population to sample size, 1 / q) and by the same coefficient that adjusts the monthly pension to the pension recognized as expenditure in 2010. For the case of the SR we proceed in the same way as with the CSWL.
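
The calculation described in the two preceding paragraphs can be sketched as follows. The function name and all figures are hypothetical; only the structure (number of pensions times average pension times the adjustment coefficient, raised by 1/q for a sample) follows the text.

```python
def annual_expenditure(n_pensions, avg_monthly_pension, payments_coef, q=1.0):
    """Estimated annual expenditure for one cohort.

    n_pensions          : number of pensions in the cohort
    avg_monthly_pension : average pension for the cohort (INSS figure in the text)
    payments_coef       : coefficient (between 12 and 14) adjusting to annual expenditure
    q                   : sampling fraction; 1/q is the raising factor (q=1 for the population)
    """
    return n_pensions * avg_monthly_pension * payments_coef / q


# Hypothetical cohort: 10,000 pensions in the population, 410 in a 4% sample.
population_exp = annual_expenditure(10_000, 800.0, 14.0)
sample_exp = annual_expenditure(410, 800.0, 14.0, q=0.04)
estimation_error = sample_exp - population_exp
```

Because the same average pension and coefficient are used on both sides, the estimation error in this sketch comes only from the difference between the raised sample count (410/0.04) and the population count, which is the effect the comparison is meant to isolate.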

Table 3 Distribution of pensions at 31/12/2010 in a hypothetical sample extracted from the population using stratified random sampling.
Table 4 Estimate of total expenditure on pensions from expenditure at 31st December.

In Appendix 1, Tables 12 to 21 show the expenditure calculated per cohort using the CSWL and the hypothetical SR sample for each type of pension, together with Figs. 1, 2 and 3 based on these tables. With them we seek to show the distortion in the estimation of pension expenditure for each type of contingency, gender and age cohort caused by using RS instead of SR. Graphically, the improvement in the estimate of total pension expenditure for all types of pensions and for both men and women can be seen, as expected. Under SR the fit is worse in those cohorts where the sample is small owing to the loss of elements, which produces greater differences in the forecast of total expenditure for those cohorts.

Table 5 shows the improvement indicators in the estimate of total pension expenditure using the hypothetical SR sample obtained from the population of pensioners. It shows large reductions, relative to the CSWL, in the various measures of error in the estimate of total pension expenditure, taking the population of pensioners as the benchmark.

Table 5 Improvement indicators in the estimate of total pension expenditure in % in the hypothetical SR sample with respect to the CSWL.

The indicators used to measure the improvement are defined as follows:

R_SRMQE: reduction in the square root of the mean quadratic error.

R_SDAV: reduction in the sum of the differences in absolute value of estimated expenditure by cohort.

MDC: maximum difference in the estimate of expenditure by cohort as a percentage of total expenditure by gender and type of pension for the case of the SR sample.

CSWL_MDC: maximum difference in the estimate of expenditure by cohort as a percentage of total expenditure by gender and type of pension for the case of the original CSWL.

CE \(\le \) 0.01%: percentage of cohorts with an error in the estimate of expenditure of less than 0.01% of total expenditure, in absolute value, for the case of the SR sample.

CSWL_CE \(\le \) 0.01%: percentage of cohorts with an error in the estimate of expenditure of less than 0.01% of total expenditure, in absolute value, for the case of the original CSWL.
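
As a rough illustration of how indicators of this kind can be computed, the following Python sketch works on per-cohort expenditure figures. Several points are assumptions rather than the authors' definitions: the error in a cohort is taken as the estimate minus the population figure, "total expenditure" is taken as the population total, and a reduction is expressed as 100 × (1 − new/base). All names and numbers are hypothetical.

```python
import math


def error_measures(estimated, population, tol_pct=0.01):
    """Error measures for per-cohort expenditure estimates against population values."""
    diffs = [e - p for e, p in zip(estimated, population)]
    total = sum(population)  # assumption: benchmark total is the population total
    abs_pct = [100 * abs(d) / total for d in diffs]
    return {
        "srmqe": math.sqrt(sum(d * d for d in diffs) / len(diffs)),  # root mean quadratic error
        "sdav": sum(abs(d) for d in diffs),                          # sum of absolute differences
        "mdc": max(abs_pct),                                         # max difference, % of total
        "ce": 100 * sum(1 for a in abs_pct if a <= tol_pct) / len(diffs),  # % cohorts within tol
    }


def reduction_pct(base, new):
    """Percentage reduction of an error measure relative to a baseline (assumed form)."""
    return 100 * (1 - new / base)


# Hypothetical per-cohort expenditure figures.
population = [100.0, 200.0, 300.0, 400.0]
cswl_est = [110.0, 190.0, 320.0, 400.0]
sr_est = [101.0, 199.0, 300.0, 400.0]

cswl = error_measures(cswl_est, population)
sr = error_measures(sr_est, population)
r_srmqe = reduction_pct(cswl["srmqe"], sr["srmqe"])
r_sdav = reduction_pct(cswl["sdav"], sr["sdav"])
```

With these toy figures the SR estimates reduce the sum of absolute differences by 95% and lower the maximum cohort difference from 2% to 0.1% of the total.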

The SR sample obtained seems to be the best at representing the population of pensioners by age, gender and type of pension, with the larger one being better. The aim of this research is to extract a large subsample obtained from the CSWL, with a view to improving its representativeness, and to compare the reductions in the various measures of error of the estimate of total pension expenditure with those resulting from the hypothetical SR sample obtained from the population (which researchers cannot access).

However, the hypothetical SR sample seems not to be contained in the CSWL. For a research team to obtain a sample of data on pensions such as the one that would result from the hypothetical subsample using SR obtained from the population, but using the CSWL instead, the constant of proportionality q would have to be reduced.

Table 6 Distribution of pensions at 31-12-2010 in a subsample selected from the CSWL using SR sampling.

We have developed a procedure in Excel VBA (Visual Basic for Applications) to reduce the constant of proportionality \(q=n/N\), using stratified sampling with proportional allocation to obtain the largest subsample contained in the CSWL that could be extracted from the population if available. The result is a sample size of 25% of the original CSWL sample, up to 87,472 pension benefits representing 0.99999% of the population of pensioners. This is a considerable loss in pension data on working lives (previous contributions as well as benefits). Table 6 shows the distribution of this subsample. In addition, this reduction gives a slightly worse fit than the hypothetical sample obtained from the population using SR sampling, as can be seen in Table 7.
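
The idea of reducing q until the proportional allocation fits inside the CSWL can be sketched in Python. This is not the authors' VBA procedure, whose details are not given here; it is one conservative reading in which q is set to the smallest ratio of CSWL count to population count over all cohorts, so that the truncated allocation never exceeds what the CSWL contains. The cohort labels and counts are hypothetical.

```python
from fractions import Fraction


def largest_contained_q(pop_counts, sample_counts):
    """Conservative q = min_i(c_i / N_i), so that trunc(q * N_i) <= c_i in every cohort.

    pop_counts    : cohort -> N_i (population count)
    sample_counts : cohort -> c_i (count available in the larger sample, e.g. the CSWL)
    Exact rational arithmetic avoids floating-point surprises when truncating.
    """
    return min(
        Fraction(sample_counts.get(k, 0), N_i)
        for k, N_i in pop_counts.items()
        if N_i > 0
    )


def truncated_allocation(pop_counts, q):
    """Proportional allocation with truncation to whole pensions in each cohort."""
    return {k: int(q * N_i) for k, N_i in pop_counts.items()}


# Hypothetical cohorts.
pop = {"a": 1000, "b": 500, "c": 250}
cswl = {"a": 45, "b": 18, "c": 12}

q_max = largest_contained_q(pop, cswl)
subsample = truncated_allocation(pop, q_max)
```

In this toy case cohort "b" is the binding one (18/500), so the subsample keeps 63 of the 75 available units. A single under-represented cohort is enough to pull q down for every cohort, which is consistent with the large loss of size reported above.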

Table 7 Improvement indicators in the estimate of total pension expenditure in % of the SR subsample (SRSS) with respect to the CSWL.

In short, backtesting is carried out in this section to show the real importance of stratification for the case of the CSWL. We show, looking at the estimate of total expenditure, that with SR sampling a better fit can be obtained by age cohorts, gender and type of pension. To obtain an SR sample from the available CSWL rather than from the population, the size of the subsample has to be reduced to 25% of the original. In the following section we apply a procedure to select large SR subsamples, thus improving the fit of the CSWL to the population of pensioners, bearing in mind that this improvement will not be as great as that obtained using the hypothetical but unfeasible SR sample from the population, but it may at least be as good as the improvement offered by the SR subsample of 87,472 pensions extracted from the CSWL using stratification. The methodology developed has the property of allowing the user to choose the relationship between the desired goodness of fit to the population and the size of the subsample.

4 Selection of a large subsample distribution from the CSWL: results

In this section we explain the criteria for the design of a large subsample to be selected using proportional allocation stratification from the 2010 CSWL data to improve its representativeness with respect to the number of pension benefits in the population. We explain the procedure developed for this and the distribution of pensions in the subsample selected from the 2010 CSWL with this procedure, which improves the fit with a high p value without losing as many records as the SR subsample contained in the CSWL. In addition, to illustrate the practical significance of our findings, we apply the results to estimate total expenditure by age, gender and type of pension and compare them with those obtained using the hypothetical SR sample and the SR subsample, in each case with respect to the CSWL.

The aim is to find a subsample for the more general case of fitting the distribution of the CSWL to the pensioner population by age, gender and type of pension.

The selection criteria for finding a subsample with the necessary characteristics are the following:

  (a) It must be more representative of the population under study. The procedure should therefore include a goodness of fit test on the distribution of the number of pensioners by age, gender and type of pension that takes into account the associated p values.

  (b) The total number of pensioners needs to be relatively high so as to be bigger than the number that would result from a stratified sample from the CSWL, approximately 1% of the population of pensioners. Hence the requirement is to maximize subsample size, with 1% of the pensioner population being the lower limit.

  (c) The subsample obtained must be included in the population as well as in the CSWL. These two requirements might seem obvious, but constraints have to be introduced to avoid the outliers found in the CSWL but not in the population, i.e. those cohorts for which the number of pensions is greater than zero in the CSWL but zero in the population of pensioners. It is also important to have a number of pensioners in each cohort of the subsample that is lower than or equal to that of the corresponding cohort of the population.

The details of the method used to find the subsample design are explained in the Online Appendix. This is a nonlinear programming problem with just one real non-negative decision variable: the constant of proportionality, \(q=n/N\). When the only information is \({N}_{i}\), i.e. the number of elements in each stratum, the allocation or assignment to each stratum in the sample is proportional to the size of the stratum, i.e. \(n_i =\left( {\frac{n}{N}} \right) N_i =q\,N_i\).

Maximizing q is equivalent to maximizing the size of the subsample. Without taking into account the integer constraints we have:

$$\begin{aligned} \mathop {\hbox {Max}}\limits _{q} \left\{ q \right\} \rightarrow \max _q \left\{ {q\cdot N^{{INSS}}} \right\} \rightarrow \max _q \left\{ {n^{SUB}\left( q \right) } \right\} \end{aligned}$$
(8)

If the integer constraints with respect to the number of pensions in any cohort are taken into account, what we are really considering when we maximize q is the maximization of \(\hat{q}\), the adjusted constant of proportionality, so this is what is finally considered:

$$\begin{aligned}&\mathop {\hbox {Max}}_q \left\{ q \right\} \leftrightarrow \max _q \left\{ {\hat{q}} \right\} \nonumber \\&\quad \hbox {with }\hat{q}=\frac{\sum \nolimits _{k=1}^5 \sum \nolimits _{j=1}^2 \sum \nolimits _{i=1}^{18} n_{i,j,k}^{SUB} \left( q \right) }{\sum \nolimits _{k=1}^5 \sum \nolimits _{j=1}^2 \sum \nolimits _{i=1}^{18} N_{i,j,k}^{{INSS}} \left( q \right) } \nonumber \\&\qquad \quad \quad \quad =\frac{\sum \nolimits _{k=1}^5 \sum \nolimits _{j=1}^2 \sum \nolimits _{i=1}^{18} Trunc\left[ {q\cdot N_{i,j,k}^{INSS} \left( q \right) } \right] }{\sum \nolimits _{k=1}^5 \sum \nolimits _{j=1}^2 \sum \nolimits _{i=1}^{18} N_{i,j,k}^{INSS} \left( q \right) } \end{aligned}$$
(9)
Table 8 Summary of results.

In order to make the selected subsample more representative of the population, a goodness of fit test has to be considered. The procedure takes into account that the value of the test for the subsample to be selected is such that it does not reject the null hypothesis (the subsample has the same distribution as the pensioner population at \(31{\mathrm{st}}\) December 2010), by contrast with the alternative hypothesis (the subsample does not have the same distribution as the pensioner population at \(31{\mathrm{st}}\) December 2010). The goodness of fit test used is Pearson’s chi-squared test as explained in Sect. 2 of the paper.
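
The two ingredients discussed above, the adjusted constant \(\hat{q}\) of Eq. (9) and Pearson's chi-squared statistic for the fit of a subsample to the population distribution, can be sketched in Python. This is an illustration under assumptions, not the procedure of the Online Appendix: the function names and cohort counts are hypothetical, and expected counts are taken as the subsample size times the population proportion of each cohort.

```python
def adjusted_q(pop_counts, q):
    """Eq. (9): truncate q * N_i in each cohort and divide the total by N."""
    sub = {k: int(q * N_i) for k, N_i in pop_counts.items()}
    return sub, sum(sub.values()) / sum(pop_counts.values())


def pearson_chi2(observed, pop_counts):
    """Pearson statistic: sum over cohorts of (O - E)^2 / E, with E = n * N_i / N."""
    n = sum(observed.values())
    N = sum(pop_counts.values())
    stat = 0.0
    for k, N_i in pop_counts.items():
        expected = n * N_i / N
        obs = observed.get(k, 0)
        if expected == 0:
            if obs > 0:
                raise ValueError(f"cohort {k!r} is empty in the population")
            continue  # empty in both: contributes nothing
        stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical cohorts.
pop = {"a": 1000, "b": 500, "c": 250}
sub, q_hat = adjusted_q(pop, 0.0375)

chi2_proportional = pearson_chi2({"a": 40, "b": 20, "c": 10}, pop)
chi2_unbalanced = pearson_chi2({"a": 45, "b": 18, "c": 12}, pop)
```

The p value follows from the chi-squared distribution with (number of cohorts − 1) degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(stat, df)` returns it. A subsample that is exactly proportional gives a statistic of zero and hence a p value of 1, which is why SR designs attain the highest p values.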

Table 9 Distribution of pensions at 31-12-2010 in a subsample selected from the CSWL with a p value of \(\ge 0.999\)

Table 8 shows the results of applying the above procedure to the CSWL. A large subsample design is generated, improving the fit to the distribution of the population of pensioners, with high p values of the test. There are feasible solutions with sizes ranging from 93.074% of the CSWL, associated with a minimum p value of 0.8, to 70.87% of the CSWL with a p value of 0.9999.

It has been verified that the subsample obtained is included in the population and in the CSWL. Therefore, given the values of the goodness of fit test and the fact that the subsample is far larger than the one obtained using stratified sampling, which amounts to about 1% of the population, it meets the objective of finding bigger subsamples that are more representative than the CSWL.

As an example, Table 9 shows the theoretical distribution of the subsample that provides a goodness of fit test with a p value of \(\ge 0.999\).

Tables 22, 23, 24, 25 and 26 in Appendix 1 show the differences in the estimated total expenditure on pensions between the population and a large subsample obtained for a p value greater than or equal to 0.999 for each type of pension, gender and age cohort. A major reduction in errors can be seen in the estimation of pension expenditure with a very small reduction in the size of the original CSWL.

Table 10 Comparison of results of the goodness of fit test: p value.
Table 11 Improvement indicators in the estimate of total pension expenditure in % with respect to the CSWL: subsample (SS) with p value \(\ge \) 0.999, SR subsample (SRSS) and SR sample (SRS).

Table 10 shows the p values for the 10 cases considered (5 types of pension by 2 genders) for the goodness of fit test for pensions in the subsample obtained using this procedure (SS) compared to the pensions in the CSWL, along with the subsample design selected from the CSWL using stratification (SR subsample, Table 6) and the hypothetical sample extracted from the population using stratified random sampling (SR sample, Table 3). As expected, the values in the subsample obtained by the procedure designed are lower than those obtained with the SR sample and subsample, but the differences with the latter are almost non-existent. It is therefore possible to find subsamples contained in the CSWL and in the population that have a better fit than the CSWL and many more observations than would be provided by a stratified random sample taken from the CSWL. Overall, the distribution of total pensions fits the population according to Pearson’s goodness of fit test with a p value of 1, as in the SR sample and subsample. More important for the improvement in forecasts of pension expenditure are the differences found by type of pension and gender.

Table 11 shows the improvement indicators in the estimate of total pension expenditure using the subsample obtained with a p value greater than 0.999 as well as those that use the SR hypothetical sample and subsample with respect to the CSWL.

It can be seen that the subsample (SS) obtained by the procedure greatly reduces the errors in the estimation of total expenditure in each cohort of the CSWL considered and that in many cases it improves on the reductions given by the SR subsample (SRSS) contained in the CSWL, which was 1% of the population. Obviously it does not reach the results obtained with the best possible sample of similar size to the CSWL, which would be obtained using an SR sampling from the population.

The estimation errors in pension expenditure by contingency, gender and age cohort are much smaller than those of the original CSWL both for the SR subsample of 87,472 benefits and for the subsample obtained with a p value greater than 0.999, which numbers 302,907 benefits; those of the SR subsample are somewhat lower than those of the subsample obtained by the proposed procedure. However, this last subsample is 3.462 times bigger, which is a great advantage for forecasting given the diversity in working lives for any subsequent analysis of the sustainability of public pension systems.

5 Summary, conclusions and future research

The CSWL is a set of anonymized microdata taken from Spanish Social Security records. The availability of this data has marked a turning point for studies on the Spanish public pension system since it has given researchers access to very valuable information about individuals that enables them to examine in depth numerous aspects of the pension system that had previously been ignored. However, the CSWL is obtained using a simple random sample (RS), so its fit to the population in terms of age, gender and type of pension is worse than would be obtained using stratified random sampling (SR) with proportional allocation. In this paper we examine how representative the CSWL data is of pension benefits. The results show that it does not fit the distribution of the population of pensioners by age and gender in two types of benefit: permanent disability and widower’s benefits for the case of men; there is also a mismatch to a lesser extent regarding retirement pensions. These mismatches are bigger for 2005, 2006, 2008, 2009, 2010 and 2011, and smaller in 2007, 2012 and 2013 due to the correction of a coding error in permanent disability benefits for persons over 65. It is hard to understand why the error reappears after the correction in 2007.

We check the effects of this poor fit of the CSWL to the pensioner population in terms of pension benefits by estimating the annual total pension expenditure by cohorts obtained from the CSWL and the figure that would have resulted from a hypothetical SR. Given that researchers cannot access full data on the population, they are unable to obtain an SR sample and must resort to an SR subsample smaller than the CSWL and contained within it. The problem is that the reduction in size is considerable, so richness in pension types is lost.

For the reasons indicated, it can be said that it is advisable to use other procedures to select large size subsamples, so the representativeness of the original CSWL is improved with less reduction in the number of pensions. With the methodology developed, large subsamples can be obtained that pass the \({\chi }^{2}\) test giving p values near 1, so it can be concluded not only that it is feasible to find a large dataset selected from the CSWL that better represents the population of pensioners but also that it is possible to draw up a procedure for extracting subsamples from the CSWL with a good fit to the population of interest by type of pension, gender and age.

Apart from the problems described, the CSWL is a powerful dataset whose size enables representative large subsamples to be extracted which emulate the whole population of pensioners. This is taken into account in formulating and solving optimization problems to obtain feasible solutions (contained in the population and in the CSWL) that more or less meet the goodness of fit requirements in terms of a p value chosen by the researcher, with sizes ranging from 70 to 93% of the original.

According to our findings, it can be said that studies conducted using subsamples of data on pension benefits from the CSWL obtained using SR sampling with proportional allocation taking into account the known distribution of the population should not give rise to doubt as to how well the CSWL represents the population. However, those which use data on pension benefits selected from the CSWL without checking that they are representative may produce conclusions based on age cohorts which are overrepresented or underrepresented in the subsamples with respect to their real weight in the population.

Last but not least, we would like to emphasize that this research meets the conditions for reproducibility. Boyland (2016) establishes the guidelines for a paper to be recognized as “reproducible”: the most important are that the data used in the analysis should be accessible to other researchers and that the algorithms or methods of analysis should be specified in the manuscript in sufficient detail to allow the results to be reproduced. This is our case: any research team can replicate the procedure for selecting more representative large subsamples from the CSWL.

A further line of research directly related to the results of this paper would be to consider, in addition to the number of pensions, another variable in the design and selection procedure for large subsamples, such as the amount of the pensions selected. The objective would be to find a dataset that fits the population as well as possible, taking into account not only the number of pensions but also the average pension amount per cohort of the population, the data for which is also known.