1 Introduction

Health care policymakers have long been concerned with health care financing arrangements (e.g., per capita premiums, taxes, contributions from social security) and the effects of these arrangements on the recipients of health care. In light of rising health care expenditures, the effects of financing arrangements on income distribution have recently attracted attention from policymakers. Because the various financing arrangements have different implications for an individual’s balance between payments made to and health care received from health care insurance, an analysis of redistributive effects is worthwhile.

The early literature on redistribution in health care has mainly focused on individual financing arrangements and on whether their redistributive effect is progressive or regressive with respect to income (e.g., Doorslaer et al. 1999). One limitation of this approach is that it is unclear how to aggregate the redistributive effects of the separate financing arrangements into an overall effect; simply adding up the separate effects is not sensible because the financing arrangements are interdependent. Another weakness is the reliance on mainly aggregate-level data, which makes an examination of redistributive effects for subpopulations impossible.

To overcome these limitations, researchers have adopted a microsimulation approach; e.g., Grabka (2004) and Drabinski (2004). Microsimulation models are useful in redistribution analysis because they enable the simulation of policy effects on a sample of economic agents (e.g., individuals or households) at the individual level. The overall analysis then comprises an evaluation of the consequences induced by a policy or a policy reform on indicators of activity or welfare for each individual agent in the underlying microdata (Bourguignon and Spadaro 2006; Spielauer 2011).Footnote 1 For a recent survey on the application of microsimulation models in health care research, we refer to Schofield et al. (2018).

Microsimulation studies are typically based on survey samples (e.g., households) on top of which the simulation runs. We are interested in studying sample averages of the redistributive effects by breakdown variables (gender, age, income group, etc.). Thus, a baseline survey of good quality (i.e., free of outliers and measurement errors) is indispensable for obtaining reliable simulations because outliers and other data imperfections tend to bias the estimates.

In the early days of microsimulation, researchers were often satisfied if their simulation model ran and approximately tracked observed data. Data quality and sound statistical inference have received microsimulation modelers’ attention at least since the paper of Klevmarken (2002). Much of the research in this area has been devoted to alignment (also known as calibration or benchmarking) methods that attempt to align estimated characteristics (e.g., mean or total) with known population values; see Creedy and Tuckwell (2004) and references therein. These alignment methods do not explicitly address survey errors such as systematic under-representation of particular groups of agents or individuals; instead, they correct discrepancies between the baseline survey and known population values by reweighting the survey data. In general, the methods have proved successful in improving simulation accuracy for a wide range of applications.

When alignment cannot be achieved with standard methods, Myck and Najsztub (2015) show that calibrating or reweighting the data sequentially over several stages may be beneficial. The authors demonstrate the effectiveness of sequential calibration for a household survey that suffers from under-representation of high-income groups. In our application, we encounter the same problem: high-income households are under-represented in the baseline survey compared with data from tax registers. Although calibration corrects the under-representation problem, it cannot align estimated average income in the right tail of the distribution with known population values without distorting the empirical distribution. Thus, there is a tradeoff: either average income is aligned but the empirical distribution is severely distorted, or vice versa. The problem is rooted in the inability of the (standard) calibration method to cope with skewed, heavy-tailed distributions.

The first purpose of this paper is the introduction of a parametric Pareto model to describe the right tail of the income distribution. With the help of the tail model, we adjust the sample distribution such that average income in the top income bracket is aligned with known values from tax data. Our key contribution is a new method based on order statistics from the Pareto model; this contribution is an extension of our earlier model (Schoch et al. 2013; Müller and Schoch 2014a).

The second goal of the paper also concerns the treatment of skewed, heavy-tailed, outlier-prone distributions. In this case, however, the baseline survey has fortunately been enriched with individual data on health care costs through record linkage. Thus, modeling cost data is unnecessary because the true cost data are available. Unfortunately, the heavy-tailed population distribution in conjunction with the baseline survey’s small sampling fraction makes standard estimation procedures very unreliable. We address this problem and propose robust estimation and alignment methods that cope with skewed, heavy-tailed distributions. Although the combination of survey data with other sources through record linkage has been investigated by, for example, Lohr and Raghunathan (2017) and Thompson (2019), the specific problem studied here has not been addressed.

To facilitate the methodological discussion, we apply the methods and techniques to our microsimulation model on compulsory health care insurance in Switzerland.Footnote 2

The remainder of the paper is organized as follows. In Sect. 2, we provide background information on compulsory health care insurance in Switzerland. In Sect. 3, we explain the microsimulation model. In Sect. 4, we discuss how the Pareto tail model for income can correct the under-representation of low- and high-income groups and adjust for nonresponse bias. In Sect. 5, we study the problem of outliers that result from record linkage of heavy-tailed population distributions to the baseline survey. Finally, in Sect. 6, we conclude by discussing the major findings.

In Appendix A, we describe the microsimulation model briefly; Appendix B provides an introduction to the Deville–Särndal calibration method.

2 Institutional setting of compulsory health insurance

Basic compulsory health care insurance (CHI) in Switzerland is a package of insurance benefits that must be offered by any insurance provider to any person without selection.Footnote 3 In particular, all insurance contracts that qualify for CHI must not be subject to health assessments or similar gatekeepers to inhibit enrolment in an insurance plan. Any type of price discrimination or positive risk selection with respect to an individual’s age, gender, or health condition is prohibited.Footnote 4 CHI is compulsory for all permanent residents in Switzerland. Hence, each individual is obliged to purchase a CHI contract from one of the 56 insurance providers who qualified for CHI in 2016 (BAG 2018). Family members are insured individually. CHI is not sponsored by employers. Individuals are free to choose and change their insurer and/or insurance contract once per year, but they must sign on with an insurer operating in their canton.Footnote 5 As a result, the provision of health care is heavily decentralized, and cantons exercise great control over health care (Crivelli et al. 2006).Footnote 6

Table 1 Franchise, premium rebates, and maximum deductible for adults (in CHF; 2016)

The benefits of CHI are identical for all insured persons throughout the country in the event of illness, accident, and maternity. Although the benefits are identical for all insured persons, CHI offers a set of insurance plans—among which individuals are free to choose—with different financing. The set of plans consists of a heavily regulated basic insurance (franchise ordinaire, CHF 300 deductible) and five special insurance plans that rebate the premium in exchange for greater financial liability (a higher degree of cost sharing through higher deductibles when individuals first incur costs; Table 1) or for accepting a limited choice of providers (managed-care arrangements).

CHI premiums are unrelated to earnings but are levied as per capita premiums. To mitigate the regressive effect of the premiums, eligible low-income individuals are entitled to premium reductions or subsidies (individuelle Prämienverbilligung). The subsidies are co-financed by the cantons and the federal government, but eligibility criteria, subsidy amounts, and payout procedures differ by canton.

2.1 Redistributive effects in the system of compulsory health care

Compulsory health care is financed through mixed sources. From the perspective of individuals seeking health care, insurance providers are the main source of reimbursement for basic health care expenditures. Reimbursements cover a portion of health care costs, and the insured pay the remainder of the incurred costs through cost sharing (the amount depends on the insurance policy) and out-of-pocket payments (OOP).

From a citizen’s point of view, individuals contribute to the total health care expenditures in two ways: As health care insurance holders, they finance the system through premium payments (and cost sharing); as taxpayers, they establish the financial basis for health care providers in the cantons (e.g., hospitals) and social insurances (old-age and invalidity, means-tested supplementary benefits, and premium reductions). Fig. 1 shows the major financial flows in the system.

Fig. 1 Major financial flows in compulsory health care (source: Schoch et al. 2013)

Contributions to and financial aids from the system differ greatly in order of magnitude (see balance sheet representation of CHI in Table 2). More importantly, the various financial sources have different implications for a household’s balance between payments made to and financial support received from the system and hence for redistributive effects. Based on theoretical reasoning, we know that the (flat-rate) premium payments exercise a fairly strong regressive effect (i.e., the financial burden decreases in relative terms with growing income). The regressive effect is somewhat mitigated for low-income adults by premium reductions but tends to affect middle-class families. Taxes, by contrast, exert a progressive effect with respect to income: Households in high-income brackets contribute a disproportionally large share to total health care expenditures. Although theoretical reasoning may provide crude insights into the redistributive effect of a single financial element (e.g., taxes), it cannot demonstrate how the different financing elements interact and what redistributive net effect results. Notably, an analysis of aggregate data also cannot accomplish this. The availability of micro-level household (and personal) data is indispensable for studying redistributive effects in detail.

Table 2 Balance sheet representation of compulsory health insurance (in CHF; 2016)

3 Microsimulation model

A household- or person-level dataset that contains data on all relevant financial elements of the CHI system (i.e., taxes, insurance premiums, etc.) is not available. Therefore, we must use simulation-based approaches or data combination techniques (e.g., record linkage) to study redistributive effects.

Our baseline survey dataset is the 2016 edition of the European Statistics on Income and Living Conditions (SILC; Swiss Federal Statistical Office), which refers to the permanent resident population living in private households.Footnote 7 SILC is designed as a household survey and provides a rich set of sociodemographic and income-related variables.Footnote 8 The sampling design of SILC 2016 is a stratified random sample with proportional allocation; stratification is along the seven major regions (BFS 2016). SILC 2016 has a sample size of 17 880 individuals who live in 7761 households. In relative terms, the sampling fraction is approximately 0.2% of the Swiss resident population. As a consequence of proportional sample allocation, the realized sample sizes for small cantons are small (e.g., 20 households in Uri and 14 in Appenzell Innerrhoden). Because of the very small sample sizes, canton-specific investigations require the application of small-area estimation methods. We do not address canton-level estimation; see Schoch et al. (2013) for further details.

3.1 A static microsimulation approach

In this paper, we focus on a static microsimulation model. However, the survey-related methodological issues we address concern dynamic simulation models to the same extent. The main purpose of static analysis is to simulate the distributional incidence of current policies and the impact of policy changes on individuals and households. Static models have no temporal dimension; instead, they focus on distributions and outcomes at a particular point in time (in our case, the year 2016). Moreover—and in contrast to dynamic models—individual and household characteristics and behaviors are considered exogenous in static microsimulation (Li et al. 2014).Footnote 9

From a methodological perspective, the following two techniques are available to enrich the baseline survey data with supplementary individual- or household-level data:

(i) microsimulation,
(ii) record linkage (at the level of individuals).

When studying only the distributional incidence of current policies, technique (ii) is preferred because it augments the baseline data with observed data. However, linking data from auxiliary sources to survey data presents methodological difficulties (see Sect. 5). Moreover, in the vast majority of incidence analyses, record linkage is technically infeasible or prohibited by data-protection laws or both. In these cases, or when we want to investigate policy changes or counterfactual policy scenarios, microsimulation is the only feasible technique.

Regarding CHI simulation, Fig. 2 shows all variables that must be included in the SILC baseline survey for distributional analysis. In our earlier incidence analysis (Schoch et al. 2013), all listed variables were simulated for each individual or household in the sample (see Appendix A for a model overview). In the current model, the insurance-related variables (e.g., premium; see variables left of arrow “A” in Fig. 2) could be taken from a recently established register on compulsory health care, maintained by the Swiss Federal Office of Public Health.Footnote 10 The remaining variables (see arrow “B”) are subject to microsimulation at the level of individuals or households.

Fig. 2 Simulation or data combination strategies to enrich the SILC 2016 survey with financial data related to compulsory health care insurance

3.2 Finite population inference

Inference in microsimulation models is in principle no different from (ordinary) inferential statistics, but inference aspects have often been neglected. Researchers have often been satisfied if their simulation model runs and approximately tracks observed data (Klevmarken 2002). The insufficient attention to statistical inference is undesirable and unjustified because standard software allows for the routine computation of sampling variances (Figari et al. 2014).

To address statistical inference in microsimulation models, we first note that design-based inferenceFootnote 11 is the relevant mode of inference for survey-based microsimulation because the model builds on top of baseline survey data. Second, we follow Klevmarken (2002) to distinguish between two modes of simulation:

(i) simulation based on a set of deterministic rules,
(ii) model-based simulation (stochastic).

Simulation based on a set of deterministic rules is nonstochastic by design; here, stochastic refers to the notion of a super-population model in the sense of Godambe and Thompson (1986). That is, we assume—in principle—that we can perform simulation by applying deterministic rules to observed variables. For illustration purposes, we consider the following example: given pre-tax income and relevant socioeconomic variables, tax payments can be computed for each individual in the sample by using a set of rules that describe the taxation regime. From the perspective of sampling theory, the simulated tax payments are regarded as constants. The only stochastic element is induced by the sampling design, which is not affected by simulation. Consequently, we can estimate the total of a simulated variable by the Horvitz–Thompson estimator.Footnote 12 Statistical inference then refers to the sampling distribution of the estimated total under the sampling design in use. This approach to inference is certainly useful when the rules underlying simulation are assumed to be deterministic or at least predominantly deterministic.
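To make the design-based reasoning concrete, the following minimal Python sketch (with hypothetical data and a stylized tax rule that is not the Swiss schedule) applies a deterministic rule to sampled incomes and estimates the population total with the Horvitz–Thompson estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
pi = np.full(n, 0.002)                  # assumed inclusion probabilities (0.2% sampling fraction)
pretax_income = rng.lognormal(mean=11.0, sigma=0.7, size=n)   # hypothetical incomes

def tax_rule(income):
    """Stylized deterministic taxation rule (illustrative only)."""
    return 0.15 * np.maximum(income - 30_000, 0.0)

simulated_tax = tax_rule(pretax_income)       # constants from the sampling perspective
weights = 1.0 / pi                            # design weights
ht_total = np.sum(weights * simulated_tax)    # Horvitz-Thompson estimator of the total
```

The only randomness in the estimated total stems from the sampling design, exactly as argued above.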

Inference for model-based simulation is far more intricate because the (super-population) model introduces an additional stochastic element into the simulation. This additional randomness accounts for uncertainty that is integral to the statistical model. When the parameters that characterize the simulation model can be estimated from a different dataset, we may attempt to incorporate model uncertainty from the estimation exercise into our simulations.Footnote 13

Regarding our CHI microsimulation model, the two most important variables for CHI financing volume—and subject to microsimulation—are taxes (24.4% share of total finances) and premium reductions (5.0% share). These variables are of a predominantly deterministic nature in the aforementioned sense. Hence, standard sampling inference applies. Regarding means-tested benefits, our earlier model (Schoch et al. 2013) included means-tested benefits as a separate financing instrument. In the current model, only premium reductions financed through means-tested supplementary benefits are considered.Footnote 14 Their share is 5.3% of total financial flows in CHI. More importantly, the simulated values are of predominantly deterministic nature.

The last financial element subject to simulation is OOP, which cannot be deduced from a set of deterministic rules. Instead, OOP depend on individual behavior, perception of health risks (e.g., self-assessed health condition, prevalence of chronic conditions), household composition, endowment of resources, and limitations because of financial constraints. Therefore, stochastic models or heuristicsFootnote 15 must be applied for simulation purposes, which implies that inferential statistics cannot relate only to randomization inference. However, because the contribution of OOP to the system is virtually negligible (share of 0.5%), we neglect the model-based contribution to statistical uncertainty. This approach incurs some error; however, the amount of uncertainty not properly accounted for is negligible.

3.3 Unbiasedness of estimates from the baseline survey

So far, we have implicitly assumed that the baseline survey dataset provides unbiased (or nearly unbiased) estimates of population characteristics. Under a broader perspective, we define the total survey error as the difference between the population characteristic and the sample-based estimate of that characteristic. The total survey error is a measure of quality and can be further subdivided into sampling error and nonsampling error. The sampling error is under the control of the survey statistician. Nonsampling errors are virtually unpredictable and difficult to control. They refer to the entire survey process and comprise the following types of errors: specification errors, measurement errors, sampling frame errors, nonresponse errors, and processing errors; see, e.g., Biemer and Lyberg (2003, Chap. 2).

Next, we assume that the specification, measurement, sampling frame, and processing errors are negligible. Therefore, the nonresponse error becomes the focus. We do not claim that all error components other than nonresponse errors are absent; we only point out that nonresponse dominates total survey error. Regarding our baseline survey, SILC 2016, we can provide verified reasons that substantiate the negligibility assumption.Footnote 16

In the presence of nonresponse, survey estimates tend to be biased. As a direct consequence, all simulation models built on top of the baseline survey data are—as a rule—at risk of generating simulated values whose estimated population characteristics are also biased; cf. Myck and Najsztub (2015).Footnote 17 How can we tell that the baseline survey is at risk of producing biased results? Although we cannot answer this question for the entire dataset, we can study individual variables. Of particular importance are variables that serve as inputs for the simulation, for instance, household income. For each variable, we can check whether certain estimated characteristics (e.g., mean or total) are aligned or benchmarked with their known population values. When the characteristics are not properly aligned, we may calibrate the sampling weights such that alignment with the population is achieved using the calibration method of Deville and Särndal (1992); see Klevmarken (2002) or Creedy and Tuckwell (2004) for a discussion of the method in microsimulation.Footnote 18
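For readers unfamiliar with the Deville–Särndal approach (treated in Appendix B), the following sketch shows the linear (chi-square distance) special case of calibration in Python; the data, function name, and auxiliary totals are hypothetical, and the full method additionally allows other distance functions and bounded weights.

```python
import numpy as np

def linear_calibration(w, X, totals):
    """Chi-square-distance (linear) calibration: returns weights
    w_cal_i = w_i * (1 + x_i' lam), where lam solves the calibration
    equations sum_i w_cal_i * x_i = totals (known population totals)."""
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    A = X.T @ (w[:, None] * X)                   # sum_i w_i x_i x_i'
    lam = np.linalg.solve(A, totals - X.T @ w)   # calibration equations
    return w * (1.0 + X @ lam)

# Hypothetical usage: calibrate on a constant (population size N) and one auxiliary variable x
# X = np.column_stack([np.ones(n), x]); w_cal = linear_calibration(w, X, np.array([N, t_x]))
```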

Two further points are notable. First, statistical inference for a variable of interest is considerably more difficult if that variable has been directly subject to calibration; the original Deville–Särndal method only covers the case where calibration is conducted with respect to auxiliary variables, not the variable of interest. Second, calibration or reweighting cannot always completely remove the bias, as we explore in Sect. 4.

4 Correcting for nonresponse bias in the baseline survey

In survey research, we distinguish between unit and item nonresponse. Unit nonresponse refers to households (or individuals) who do not participate in the survey because of explicit refusal or unavailability. Item nonresponse occurs when some of the sampled households who agreed to participate in the survey refuse to answer specific questions (see e.g., Groves and Couper 1998, Chap. 1).

When considering income-related nonresponse, strong empirical evidence has been presented that item nonresponse is more pronounced for households in the tails of the income distribution (Biewen 2001). Frick and Grabka (2005) draw the same conclusion for the German Socio-Economic Panel. They demonstrate that households’ propensity not to answer income-related questions is nearly twice as high in the top income decile as for a median-income household. Consequently, differential or selective item nonresponse can lead to biased estimates.

Although survey teams undertake great efforts to avoid or correct for unit nonresponse, it is practically unavoidable. More importantly, when survey compliance is correlated with the variables of interest, there are serious concerns about biases in survey-based inference for these variables, as demonstrated in a theoretical model by Korinek et al. (2006). The authors also provide substantial empirical evidence that unit nonresponse is indeed income dependent. That is, Korinek et al. (2006) find a significantly negative income effect on survey compliance: survey response probability decreases with increasing income. Thus, sample estimates of income characteristics tend to be heavily downward biased. Consequently, we must reckon with biased simulation results because income and other variables possibly affected by nonresponse enter microsimulation as model input.

4.1 Empirical evidence of under-representation in the tails

Unlike surveys, tax registers are not limited by under- or over-representation. Thus, tax register data are a trusted benchmark against which we can compare proportions, means, or totals estimated from surveys, to detect potential nonresponse bias and other survey-related errors.

Fig. 3 Share of households by income brackets (in CHF 1000) for tax data and the SILC 2016 survey (source: SILC and ESTV (2017), Normal- und Sonderfälle)

When comparing estimated shares of households from the 2016 SILC survey against aggregated tax data (ESTV 2017) by income brackets, we find that the estimated shares of households in both tails of the income distribution are noticeably under-represented.Footnote 19 Fig. 3 illustrates this finding for the case of married couples; similar patterns of under-representation are found for taxable entities other than married couples (not shown). Fig. 3 also shows that the estimated shares of households in the lower half of the income distribution tend to be slightly over-represented in SILC. The under-representation of the top income bracket is not unique to the Swiss SILC data; see, e.g., Törmälehto (2017), who presents empirical evidence of under-representation in EU-SILC 2012 for all European and associated countries. However, the degree of under-representation varies considerably between countries.

4.1.1 Alignment by calibration and reweighting

We attempt to calibrate the weights of the SILC sample data such that the frequency distribution of households by income brackets is aligned with the income distribution resulting from the tax register. We easily achieve this objective when the household shares by income bracket are considered calibration targets (among other totals and proportions); see Schoch et al. (2013) for more details. However, this approach makes the sampling weights dependent on the income variable, which implies that statistical inference becomes technically much more challenging (unless we neglect the dependency introduced through calibration). Myck and Najsztub (2015) propose a different but closely related approach: Calibrate the weights over several stages on variables from administrative records to correct the under-representation of high-income groups. In our application, the indirect calibration method of Myck and Najsztub (2015) was inferior.Footnote 20

Although calibration aligns the estimated household shares by income brackets with known values, we continue to observe an anomaly in the top income bracket: Estimated average income in the top income bracket is only CHF 315 096 (after calibration) and therefore too small by approximately 25% compared with the value reported in the tax register data (i.e., CHF 394 370; ESTV 2017). Consequently, this underestimate of average income implies—based on progressive taxation—downward biased results for the simulation of taxes (and other simulated variables). What can we do to rectify the anomaly? Does it help to run another round of calibration, but with average income as the calibration target?

The calibration method is not appropriate to overcome the underlying problem, which manifested itself as an underestimate of average income. The underlying problem is that too few high- and ultra-high-income households were included in the sample for mainly two reasons: (i) these households are rare, and the SILC sampling design did not oversample this special group; and (ii) survey compliance decreases with increasing income (see the aforementioned discussion). These findings are substantiated when we compare the estimated frequencies of ultra-high-income households in SILC with the results in Foellmi and Martínez (2017). Thus, we should—loosely speaking—add some high- and ultra-high-income households to the sample to correct for the deficiency. Calibration and similar reweighting techniques do not succeed because they only modify the “importance” of existing households in the sample. Even worse, these methods can distort the observed income distribution in their attempt to align the estimated sample mean in the top income bracket with the known population value.

Indeed, our numerical analysis (Schoch et al. 2013) shows that calibration tends to increase the weights of observations with high incomes in the top income bracket. Although weight adjustment ensures that the sample average in the top income fulfills the benchmark, it overemphasizes high-income households whose income is still small compared with the households that should have been in the sample in greater number. Since our primary interest is not average income but the entire income distribution (for simulation purposes), any distortion of the distribution is problematic; hence, reweighting methods are not a viable option.

4.2 Pareto tail modeling

The complete tax-data distribution of income is unavailable to us. Therefore, we cannot use it to adjust the SILC income distribution in the right tail. In the absence of empirical data, we thus assume that the right tail of the income distribution can be described by a parametric Pareto distribution. With the help of the tail model, we adjust the sample distribution such that average income in the top income bracket (i.e., above CHF 200 000) is aligned with the known value from the tax data.

The Pareto tail assumption has proved productive in many applications; for example, Dell et al. (2007) show that a Pareto tail model describes top incomes in Switzerland well; see also Foellmi and Martínez (2017). The assumption has also been beneficial in robust statistics; see, e.g., Cowell and Victoria-Feser (2007) and Alfons et al. (2013).

To fix notation, we let the income of household \(i\) be represented by the random variable \(X_{i}\) (\(i=1,\ldots,n\)), which is defined on the positive real line. Let \(\{X_{i},i\geq 1\}\) denote a sequence of independent and identically distributed random variables with cumulative distribution function \(F\). Many of the empirically studied parametric income distributions (e.g., Singh–Maddala, Dagum, and Generalized Beta) have heavy tails. In particular, their tails decay according to a power law, \(P\left(X\geq x\right)\sim L(x)\cdot x^{-(\theta+1)}\) as \(x\rightarrow\infty\), where \(\theta> 0\) is a parameter and \(L(x)\) denotes a regularly varying function (Kleiber and Kotz 2003, Chap. 3.3). The tail behavior of such income distributions can be described by a Pareto distribution

$$F_{\theta}(x)=1-\left(\frac{x}{x_{0}}\right)^{-\theta}\qquad(x\geq x_{0}),$$
(1)

where \(x_{0}> 0\) is a threshold and \(\theta> 0\) is the shape parameter of the Pareto distribution. The corresponding density function is given by \(f_{\theta}\left(x\right)=\theta x_{0}^{\theta}/x^{\theta+1}\) (for \(x> x_{0}\)) and is shown in Fig. 4 for some values of the parameter \(\theta\) (the threshold \(x_{0}\) is kept fixed at \(x_{0}=1\) for the sake of comparison). We observe that smaller values of \(\theta\) decrease the density at \(x_{0}\) and simultaneously imply a heavier tail.

Fig. 4 Pareto density function as a function of \(x\) for three values of the shape parameter \(\theta\) (the threshold \(x_{0}\) is kept fixed at \(x_{0}=1\))

4.2.1 Parametrization of the Pareto tail model

To use the Pareto tail model, we must determine or estimate the model parameters from tax data. The threshold \(x_{0}\) is fixed at CHF 200 000 because this marks the beginning of the top income bracket. To determine the shape parameter \(\theta\), we use average income as published by the federal tax authority; see ESTV (2017). Next, we relate the empirical average to the expected value of a Pareto random variable \(X\) with law \(X\sim F_{\theta}(x)\). Under this law, the expected value conditional on \(\theta\) is (Kleiber and Kotz 2003, p. 71)

$$\mathbb{E}_{\theta}(X)=\frac{\theta x_{0}}{\theta-1}\qquad(\text{for}\;\theta> 1).$$
(2)

Putting the empirical average in place of the expected value and substituting CHF 200 000 for \(x_{0}\), we can solve Equation (2) for \(\theta\), which gives \(\theta=\bar{x}/(\bar{x}-x_{0})\), where \(\bar{x}\) denotes the empirical average. Furthermore, since average income in the top income bracket (from tax register data) is known for each canton, we compute canton-specific parameter estimates (the threshold \(x_{0}\) is the same for all cantons; Table 3). The estimated shape parameters show great variation among the cantons: from 1.42 (canton SZ) to 2.72 (canton JU). For these cantons, the 99% income quantile under the Pareto tail assumption is, respectively, CHF 5.12 million (SZ) and CHF 1.09 million (JU).
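As a sanity check, the following short Python sketch reproduces this arithmetic with the figures quoted in the text; the nation-wide top-bracket average of CHF 394 370 (ESTV 2017) is used purely for illustration.

```python
x0 = 200_000.0                               # threshold: start of the top income bracket

def pareto_shape(mean_top, x0=x0):
    # Solve E_theta(X) = theta * x0 / (theta - 1) for theta (requires mean_top > x0)
    return mean_top / (mean_top - x0)

def pareto_quantile(p, theta, x0=x0):
    # Inverse of F_theta(x) = 1 - (x / x0)^(-theta)
    return x0 * (1.0 - p) ** (-1.0 / theta)

print(round(pareto_shape(394_370.0), 2))            # implied nation-wide shape: ~2.03
print(round(pareto_quantile(0.99, 1.42) / 1e6, 2))  # canton SZ: ~5.12 million CHF
print(round(pareto_quantile(0.99, 2.72) / 1e6, 2))  # canton JU: ~1.09 million CHF
```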

Table 3 Estimated shape parameters \(\theta\) by canton

4.2.2 Incorporating the Pareto tail assumption into the simulation model

Because the Pareto model is only used for tail modeling, all incomes below the threshold of CHF 200 000 are left unchanged, and estimation for the lower part of the distribution refers to the empirical distribution. Regarding the right tail of the distribution, three approaches are worth considering:

(i) imputation of randomly drawn observations from the Pareto model;
(ii) semi-parametric estimation; and
(iii) imputation of expected order statistics from the Pareto model.

In approach (i), we replace the observed incomes above the threshold \(x_{0}\) with randomly drawn values from the Pareto tail model \(F_{\theta}\) in (1). This approach has been used, for example, by Alfons et al. (2013) in robust statistics; see also Törmälehto (2017) for an application to EU-SILC. Usually, the empirical mean of the imputed observations is not perfectly aligned with the expected value. However, alignment can be achieved by scaling the values slightly. In our earlier model, we used this method with canton-specific parameter values; see Schoch et al. (2013). The major advantage of method (i) is that it generates a corrected income variable that can then be used in the simulation as if it were the original variable. A disadvantage is that the households are assigned randomly drawn income values that may not be related to their originally observed income. Thus, a relatively poor household can be turned into a high-income household (and vice versa). This is normally not an issue, unless the simulated results are to be studied for fine-grained subpopulations.
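A minimal sketch of method (i) in Python, with an illustrative shape parameter of \(\theta=2\) (an assumption; the paper uses canton-specific estimates): tail incomes are replaced by inverse-CDF draws from (1) and then rescaled so that their mean matches the model expectation.

```python
import numpy as np

rng = np.random.default_rng(42)
x0, theta, n_tail = 200_000.0, 2.0, 271          # threshold, assumed shape, tail sample size

draws = x0 * rng.uniform(size=n_tail) ** (-1.0 / theta)   # inverse-CDF sampling from (1)
target_mean = theta * x0 / (theta - 1.0)                  # E_theta(X), Eq. (2)
aligned_draws = draws * target_mean / draws.mean()        # slight rescaling to align the mean
```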

The second approach is a semi-parametric estimation method inspired by Cowell and Victoria-Feser (2007). It operates directly at the estimation stage and, so to speak, skips the imputation stage. Denote by \(F(x)\) the entire income distribution, which is defined as a mixture distribution,

$$F(x)=\begin{cases}F_{n}(x)&\text{if}\;x<x_{0},\\ F_{n}(x_{0})+\big\{1-F_{n}(x_{0})\big\}\cdot F_{\theta}(x)&\text{if}\;x\geq x_{0},\end{cases}$$
(3)

with the empirical distribution function \(F_{n}(x)=\sum_{i\in s}w_{i}\mathbb{1}\{x_{i}\leq x\}/\sum_{i\in s}w_{i}\), where summation is over all elements in the sample \(s\), \(w_{i}\) is the sampling weight, and \(\mathbb{1}\{\cdot\}\) denotes the indicator function. Any characteristic of interest (e.g., arithmetic mean) that can be expressed as a statistical functional \(T:G\rightarrow\mathbb{R}_{+}\) of a distribution function \(G\) can be computed at the distribution defined in (3). For instance, the (weighted) sample mean—computed at an arbitrary distribution \(G\)—can be expressed as a statistical functional \(T(G)=\int x\mathrm{d}G(x)\) where integration is over the positive real line. When \(T\) is computed at \(F\) defined in (3), we obtain

$$T(F)=\frac{1}{\sum_{i\in s}w_{i}}\bigg(\sum_{i\in s}w_{i}x_{i}\mathbb{1}\{x_{i}\leq x_{0}\}+\frac{\theta x_{0}}{\theta-1}\sum_{i\in s}w_{i}\mathbb{1}\{x_{i}> x_{0}\}\bigg),$$
(4)

which highlights that \(T(F)\) is a weighted average of the empirical mean for incomes below threshold \(x_{0}\) and the expected value in the right tail under the Pareto model. We observe that this method does not explicitly replace or impute incomes in the baseline dataset.Footnote 21
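A sketch of the semi-parametric mean in Eq. (4), assuming the shape parameter \(\theta\) has already been determined as above (variable and function names are illustrative):

```python
import numpy as np

def semiparametric_mean(y, w, x0, theta):
    """Weighted mean under the mixture distribution (3): empirical part below
    the threshold x0, Pareto expectation above it (cf. Eq. 4); requires theta > 1."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    below = y <= x0
    pareto_mean = theta * x0 / (theta - 1.0)       # E_theta(X)
    total = np.sum(w[below] * y[below]) + pareto_mean * np.sum(w[~below])
    return total / np.sum(w)
```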

Fig. 5 Percentage change between observed income and imputed income (against household income quantiles; application of method (iii))

The third method is new, according to our review of the literature. For ease of discussion, we neglect the canton-specific tail models and work with a nation-wide model only. Let \(X_{1:n},\ldots,X_{n:n}\) denote the \(n\) order statistics (i.e., observations sorted in ascending order) of the observed income variable in the right tail (i.e., for \(x> x_{0}\)). Under the Pareto model in (1), the expected value of the \(k\)-th order statistic is (David and Nagaraja 2003)

$$\mathbb{E}_{\theta}(X_{k:n})=\frac{x_{0}n!}{(n-k)!}\cdot\frac{\Gamma(n-k+1-\theta^{-1})}{\Gamma(n+1-\theta^{-1})}=:\mu_{k:n}\qquad(\text{for}\;1\leq k\leq n),$$
(5)

where \(\Gamma\) denotes the Gamma function.Footnote 22 For the imputation approach, we replace all empirical income order statistics \(X_{1:n},\ldots,X_{n:n}\) in the baseline data by the expected values \(\mu_{1:n},\ldots,\mu_{n:n}\) (under the Pareto tail model). This method has several advantages over the other two approaches. First, the arithmetic mean of the imputed \(\mu_{i:n}\)’s (\(i=1,\ldots,n\)) is equal to the (overall) expected value under the Pareto model defined in (1), that is, the mean of the imputed observations is automatically aligned with the benchmark from tax data.Footnote 23 Second, the imputation strategy preserves the households’ income ranks. A relatively poor household is not turned into a high-income household (and vice versa). Lastly, the changes in income generated by imputation are small; to observe this, we computed the percentage change in income between the empirical and the imputed value for all 271 households in the top income bracket (Fig. 5). Observe that the changes are displayed by income quantiles. For incomes below the third quartile, the changes are less than 21.5 percentage points. The largest change in income for an individual household is an increase of 239.4%, which reflects the fact that households with especially high incomes were under-represented in the original data.
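The expected order statistics in Eq. (5) can be computed stably on the log scale; the sketch below (function name is ours) returns \(\mu_{1:n},\ldots,\mu_{n:n}\), which then replace the sorted tail incomes rank by rank.

```python
import numpy as np
from scipy.special import gammaln

def pareto_order_stat_means(n, theta, x0):
    """Expected order statistics mu_{k:n} under the Pareto tail model, Eq. (5),
    computed via log-gamma functions for numerical stability (requires theta > 1)."""
    k = np.arange(1, n + 1)
    log_mu = (np.log(x0)
              + gammaln(n + 1) - gammaln(n - k + 1)     # log of n! / (n - k)!
              + gammaln(n - k + 1 - 1.0 / theta)
              - gammaln(n + 1 - 1.0 / theta))
    return np.exp(log_mu)

# Imputation: sort the tail incomes in ascending order and replace the k-th
# smallest observation by mu_{k:n}; the households' income ranks are preserved.
```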

4.3 Empirical illustration

As an illustration of the methods, we simulated average tax payments in favor of the system of CHI. Tax payments include federal, cantonal, and municipality taxes and are simulated from pre-tax income (and other variables). In Fig. 6, we show average tax payments in favor of CHI for households in different income brackets, once with and once without the Pareto tail correction. Tax payments in the top income bracket are substantially underestimated when the correction is not applied. The correction method used in Fig. 6 is method (iii), that is, imputation by expected order statistics from the Pareto model. However, the two other methods yield similar results (not shown) because the display uses a rather coarse income bracketing. When we specify smaller income brackets (e.g., in 0.5% steps), the imputations of method (i) show a nonsmooth behavior in the right tail, which is undesirable.

Fig. 6 Average tax payments in favor of the health care system by 11 types of households from different income brackets; tax payments are computed with and without Pareto correction

5 Register data and record linkage (with heavily skewed data)

Individual health care cost data are typically not available from household surveys because interviewees do not know the amount of costs they incurred in a calendar year. This was the case in our earlier simulation model (Schoch et al. 2013); therefore, we simulated individual health care costs.Footnote 24 The major difficulty in this modeling exercise was the replication of the outlier-prone, heavily right-skewed and zero-inflated distribution of the cost data. Zero-inflation occurs because the majority of individuals did not use any health care-related services; hence, no costs were incurred. By contrast, medical treatment for a few people incurred tremendous costs (outliers).

As we pointed out in Sect. 3, insurance-related data (i.e., premium, franchise, and health care costs) are now available from a register on compulsory health care. Moreover, the Swiss Federal Statistical Office linked the register data to the 2016 SILC survey through record linkage. Thus, we can avoid modeling the cost data because the true cost data are available. It cannot get any better than this, right?

Unfortunately, linking register data to an existing survey dataset is insufficient to guarantee good results. The nature of the baseline survey is not affected by record linkage, that is, the baseline survey still covers only a small, randomly selected part of the underlying population (\(\approx\) 0.2% sampling fraction). A sampling fraction of 0.2% implies—under the simplifying assumption of simple random sampling without replacement—that on average each person receives a sampling weight of approximately 476.Footnote 25 Thus, each sampled person is said to represent approximately 476 individuals in the population.

Moreover, we may think of linking health care costs to the survey as if we had sampled directly from the very skewed population distribution of costs. As a result, we obtain a sample that shows high sampling variability. Even more problematic is the analysis of such data for breakdowns or domains of interest (e.g., breakdown by gender or age group) because outlying values tend to be more influential in smaller samples. For instance, if a person in a subpopulation has incurred a huge amount of health care costs (e.g., several hundred thousand CHF), that individual’s value represents (under our simplified calculation) the values of 476 individuals in the population and therefore exerts a tremendous influence on the subpopulation’s distribution of health care costs. The compound effect of an outlying observation and a large weight can completely ruin an estimate. Thus, we clearly cannot leave such extreme data or outliers untreated.

Fig. 7 Slicing the baseline survey data: the top plane or slice shows the relation between the variables (health care) cost and age (group). The population means (and totals) of health care costs are known for the marginal distribution by age group but not for the other breakdown variables (income, etc.)

For the time being, an outlier (or extreme value) shall mean an atypical and/or influential observation in the sample (a more formal outlier definition will be given later). Also, for ease of discussion, we consider the specific situation shown in Fig. 7. The figure shows a schematic representation of the (health care) cost data in the baseline survey, cut into slices by breakdown variables (age, income, etc.). The top plane shows the slice \(\texttt{cost}\sim\texttt{age}\) (group). This slice is special for two reasons. First, the variable age (group) is used in the microsimulation as a breakdown variable to study the redistributive effects by age. Second, the population means of health care costs by age group are known (from administrative CHI data). Hence, no estimation is required for the analysis of health care cost by age group (unless we are interested in a characteristic other than the arithmetic mean). Yet this setting enables us to adjust the cost data (or, equivalently, the sampling weights) such that the weighted sample means (by age group) are aligned with the known population values. The adjusted data then allow us to obtain (presumably) more accurate estimates of average health care costs for other breakdown variables whose population cost averages are not known (e.g., income; see Fig. 7), compared with not adjusting the data in the first place (i.e., not utilizing the auxiliary information of the top slice). Such methods rely on the calibration principle of Deville and Särndal (1992, p. 376) that “weights that perform well for the auxiliary variables also should perform well for the study variable” (under the assumption that the study variable is correlated with the auxiliary variables).

It is crucial that the alignment methods applied at the top slice (to stick with our visual metaphor in Fig. 7) work properly; otherwise, alignment issues propagate to other slices and distort the estimates there. Thus, we must prevent an alignment problem in one place from turning into an estimation problem in another. For that matter, it is of crucial importance how an alignment method achieves its goal in the presence of outliers. The naive alignment approach scales the cost data (or weights) by \(\bar{y}/\hat{\bar{y}}\), where \(\bar{y}\) and \(\hat{\bar{y}}\) denote, respectively, the population mean and the weighted sample mean of health care cost (for some age group), and thereby ensures that alignment is achieved for this breakdown variable. However, outliers in the data may exercise a huge impact on \(\hat{\bar{y}}\) and thus on the scaling factor. To see this, consider the age group of women aged 41–45 years. For this group, the population mean of cost is \(\bar{y}=3102\) CHF. The sample mean amounts to \(\hat{\bar{y}}=5421\) CHF, which is an overestimate by approximately 75% because of (most notably) one heavy outlier; see Fig. 7. The resulting scaling factor is approximately 0.57, which implies that all observations—even the “good” ones—(or their sampling weights) are heavily shrunken. Such heavy shrinkage may cause disastrous underestimation for other breakdown variables. This problematic behavior is not limited to the naive method; even more sophisticated alignment methods are not immune to it. In fact, any alignment method that is based on non-robustly estimated characteristics will be influenced by outliers. This is also the case for the (traditional) Deville–Särndal calibration method (Duchesne 1999).
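The effect of a single outlier on the naive scaling factor is easy to reproduce; the numbers below are hypothetical and merely chosen so that the distortion is of the same order of magnitude as in the example above.

```python
import numpy as np

y = np.array([2_500.0] * 99 + [300_000.0])   # hypothetical costs, one heavy outlier
ybar_pop = 3102.0                            # known population mean (women aged 41-45)
naive_factor = ybar_pop / y.mean()           # ~0.57: every observation is shrunk heavily
```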

Before we address alignment and estimation methods that can cope with outliers, it is helpful to formalize our definition of outliers.

5.1 Representative and nonrepresentative outliers

Compared with “classical” statistics, outliers are a different concept in design-based survey sampling. In the sampling context, outliers are extreme values selected from the population under study that deviate from the bulk of data. Following Chambers (1986), distinguishing representative from nonrepresentative outliers is helpful. Representative outliers are extreme but correct values and are thought to represent other population units similar in value. A nonrepresentative outlier is an atypical or extreme observation whose value is either deemed erroneous or unique in the sense that there is no other unit like it.Footnote 26

Furthermore, and in contrast to classical statistics, we also must consider the sampling weights because design-based estimators are functions of the weights and the observed values. Depending on the type of estimator, observations not considered outliers (e.g., situated in the bulk of the data) can still heavily influence the estimate because of their large sampling weight. We call such observations influential values (Lee 1995). Conversely, outliers well separated from the majority of observations are not necessarily influential when they have small weights. The problem worsens if large values have large sampling weights.

Outliers and influential values are typically dealt with in two separate steps: detection followed by treatment. Another option is the application of robust estimation techniques (e.g., \(M\)-estimators; see below), which combine the steps of detection and treatment. All techniques aim to avoid untreated outliers and influential values because these can heavily compromise the variance–bias profile—or equivalently the mean square error (MSE)—of the estimator of interest. Leaving erroneous outliers untreated thus implies biased estimates and inflated variance of the estimator. In the case of representative and nonrepresentative outliers, the situation is more complicated because the outliers’ influence on the MSE depends on the sample size (Hulliger 1995; Lee 1995). If the sampling fraction or the sample size is large, the problem is less troublesome. When the sample size is small, however, and hence—as a rule—the variance is the dominant factor in the MSE, small biases introduced through robustification (e.g., reducing values or shrinking weights) can be worthwhile if the variance can be significantly reduced. Thus, for small samples, there is a tradeoff between variance and bias. However, in some cases, the introduced bias can be substantial and may render a robust procedure grossly inefficient. This phenomenon occurs all the more as the sample size increases because the variance decreases, but the bias typically does not. As a result, the bias tends to dominate the MSE for large samples.Footnote 27

5.2 Robust estimation and alignment methods

In contrast to our discussion on Pareto tail modeling for income, we have no comparable parametric model for health care cost because empirical and theoretical evidence on the distributional shape of health care costs is scarce (compared with the well-studied Pareto assumption in income research). Thus, we adopt robust non-parametric methods.

In what follows, \(y_{i}\) (\(i\in s\)) denotes the variable of interest (health care cost). The goal is to obtain \(y\)-totals (or means) as weighted linear statistics of the sample data, which are

(i) outlier resistant (robust), and
(ii) if applicable, aligned with known population totals (or means).

We deliberately speak of weighted linear statistics, not estimators, in order to cover both estimation and alignment methods. That is, when a statistic is used to estimate an unknown population parameter or characteristic, it is called an estimator. In alignment methods, by contrast, the population parameter or characteristic of interest is a known quantity; therefore, the device used to achieve alignment is not called an estimator. We use the term aligned value to denote the sample-based weighted linear statistic (e.g., weighted mean) that is based on the modified observations or weights. Clearly, the aligned value is equal to the known population quantity (if alignment was successful).

For estimation methods, we demand only that requirement (i) is met (i.e., outlier robustness, see enumeration above), whereas for alignment methods, both requirements (i) and (ii) must be fulfilled. By way of illustration, consider the visual metaphor of the data slices in Fig. 7. Since the population means are known for the top slice (i.e., \(\texttt{cost}\sim\texttt{age}\)), estimation is pointless and we focus only on outlier resistant alignment. For all other slices, the goal is robust estimation of the \(y\)-total or -mean (taking the modified sampling weights or observations into account that have been obtained at the top slice).

To fix notation, let \(w_{i}\) denote the sampling weight (\(i\in s\)). We denote by \(w_{i}^{*}\) outlier resistant weights (that are possibly adjusted to meet alignment goals), and which are defined as \(w_{i}^{*}=u_{i}w_{i}\), where the \(u_{i}\)’s are factors to downweight outliers and achieve alignment. We will discuss the choice of the \(u_{i}\)’s later. By the identity

$$\sum_{i\in s}w_{i}^{*}y_{i}=\sum_{i\in s}w_{i}y_{i}^{*}$$
(6)

we see that the estimated \(y\)-total can equivalently be represented with the help of modified observations \(y_{i}^{*}=y_{i}u_{i}\).Footnote 28 Also, we may regard the \(y_{i}^{*}\)’s as imputed values which are free from outliers and ensure (together with the \(w_{i}\)’s) that the alignment goals are achieved (granted that alignment goals were imposed). More importantly, we have the freedom to work, in the later course of the simulation, with the tuples \((w_{i}^{*},y_{i})\), \((w_{i},y_{i}^{*})\), or directly with the \(u_{i}\)’s.

Next, we address three methods to compute the \(u_{i}\)’s (and thus the \(y_{i}^{*}\)’s or \(w_{i}^{*}\)’s).

5.2.1 Robust estimation

For the further course of discussion, it is helpful to focus on robust \(M\)-estimators in the context of finite population estimation (although these estimators do not seek alignment with known population values). We restrict attention to the robust Horvitz–Thompson (HT) estimator of Hulliger (1995) because it is outlier resistant and it can be written as a weighted linear estimator.

Let \(\psi\) denote the Huber \(\psi\)-function defined as \(\psi(x,k)=\min\{{k,\max}{(-k,x)}\}\) for \(x\in\mathbb{R}\), where \(k> 0\) is a robustness tuning constant; we let \(\widehat{\sigma}\) be a preliminary robust estimate of scale, for example, the interquartile range of the cost data \(y_{i}\). The robust estimator of the weighted mean is the solution \(\widehat{\mu}_{k}\) of the estimating equation (Hulliger 1995)

$$\sum_{i\in s}w_{i}\psi\left(\frac{y_{i}-\mu}{\widehat{\sigma}},k\right)=0.$$
(7)

The tuning constant \(k\) determines the amount of robustness we want to achieve.Footnote 29 Estimator \(\widehat{\mu}_{k}\) can be expressed as a weighted estimator,

$$\widehat{\mu}_{k}=\frac{\sum_{i\in s}w_{i}u_{i}y_{i}}{\sum_{i\in s}w_{i}u_{i}}\qquad\text{where}\quad u_{i}=\frac{\psi(e_{i},k)}{e_{i}}\qquad\text{with}\quad e_{i}=\frac{y_{i}-\widehat{\mu}_{k}}{\widehat{\sigma}},$$
(8)

and can thus be brought into the form of (6). The \(u_{i}\) take values in the interval \([0,1]\).
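A minimal Python sketch of the weighted Huber \(M\)-estimator in Eqs. (7)–(8), solved by an iteratively reweighted fixed-point iteration; for simplicity, the preliminary scale is the unweighted interquartile range of the data (the paper only requires some robust scale estimate), and the function names are ours.

```python
import numpy as np

def huber_psi(x, k):
    """Huber psi-function: psi(x, k) = min(k, max(-k, x))."""
    return np.clip(x, -k, k)

def robust_weighted_mean(y, w, k, n_iter=50, tol=1e-8):
    """Weighted Huber M-estimator of the mean, Eqs. (7)-(8)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    q75, q25 = np.percentile(y, [75, 25])
    sigma = q75 - q25                       # preliminary robust scale (unweighted IQR)
    mu = np.average(y, weights=w)           # starting value: weighted sample mean
    for _ in range(n_iter):
        e = (y - mu) / sigma
        u = np.where(e == 0.0, 1.0, huber_psi(e, k) / e)    # downweighting factors in [0, 1]
        mu_new = np.sum(w * u * y) / np.sum(w * u)          # Eq. (8)
        if abs(mu_new - mu) < tol * sigma:
            break
        mu = mu_new
    return mu, u
```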

5.2.2 Robust adaptive \(M\)-estimator with an alignment penalty

In this paragraph, it is assumed that the population \(y\)-mean, \(\bar{y}=\sum_{i\in U}y_{i}/N\), is a known quantity (\(U\) denotes the set of population indices). Observe that the estimator \(\widehat{\mu}_{k}\), which is defined as the solution to the estimating equation in (7), does not impose alignment goals. As a result, \(\widehat{\mu}_{k}\) may differ considerably from \(\bar{y}\). In order to incorporate the auxiliary information that \(\bar{y}\) is known, we propose to compute an adaptive \(M\)-estimator that minimizes an approximate estimate of the mean squared error of \(\widehat{\mu}_{k}\),

$$\widehat{\mathrm{MSE}}(\widehat{\mu}_{k})=\widehat{\mathrm{var}}(\widehat{\mu}_{k})+\big(\widehat{\mu}_{k}-\bar{y}\big)^{2},$$
(9)

where \(\widehat{\mathrm{var}}\) denotes the estimated variance. Observe that the squared bias term on the r.h.s. of (9) is evaluated with respect to the known population mean \(\bar{y}\). The squared bias works like an alignment penalty that penalizes estimates that deviate too much from \(\bar{y}\). Formally, we seek the \(M\)-estimator which minimizes (9) on the set of tuning constants \(\{k:k\in\mathbb{R}_{+}\}\). The optimal estimator is \(\widehat{\mu}_{k_{\text{opt}}}\), where

$$k_{\text{opt}}=\underset{k\in\mathbb{R}_{+}}{\mathrm{arg\;min}}\;\;\widehat{\mathrm{MSE}}(\widehat{\mu}_{k}).$$
(10)

The proposed estimator is inspired by the minimum estimated risk estimator in Hulliger (1995); our method differs from Hulliger’s insofar as he defines the squared bias as \((\widehat{\mu}_{k}-\widehat{\mu})^{2}\), where \(\widehat{\mu}\) is the weighted sample mean. For ease of reference, we call the estimator \(\widehat{\mu}_{k_{opt}}\) with \(k_{opt}\) defined in (10) the minimum risk \(M\)-estimator (MRM). Although the MRM estimator is not explicitly aligned or benchmarked with \(\bar{y}\), it often coincides with \(\bar{y}\) (or is at least close to the benchmark); see the empirical illustration below. Furthermore, deviations of \(\widehat{\mu}_{k_{opt}}\) from \(\bar{y}\) are unproblematic (or even intended) provided that the MSE of \(\widehat{\mu}_{k_{opt}}\) is considerably smaller than the MSE of any competing estimator. That is, we deliberately relax the alignment requirement slightly as long as the gains in MSE outweigh the incurred bias.

In the presence of outliers and influential values, the MRM estimator tends to be superior in terms of MSE compared with competing methods (see below). However, it can be heavily biased when the population mean \(\bar{y}\) is much larger than \(\widehat{\mu}_{k_{opt}}\); in that case, the squared bias dominates the MSE, and the MRM estimator does not achieve any gains in MSE over the weighted sample mean (yet the MRM estimator is never inferior to the weighted sample mean).
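The MRM idea can be sketched as a grid search over the tuning constant, reusing the robust_weighted_mean function from the previous sketch; the variance estimate below is a crude with-replacement linearization approximation (an assumption for illustration, not the variance estimator used in the paper).

```python
import numpy as np

def mrm_estimator(y, w, ybar_known, k_grid=None):
    """Minimum risk M-estimator (MRM): choose the tuning constant k that
    minimizes var-hat(mu_k) + (mu_k - ybar_known)^2, cf. Eqs. (9)-(10)."""
    if k_grid is None:
        k_grid = np.geomspace(0.5, 20.0, 40)
    best = None
    for k in k_grid:
        mu_k, u = robust_weighted_mean(y, w, k)     # sketch from Sect. 5.2.1
        a = w * u
        z = a * (y - mu_k) / np.sum(a)              # linearized contributions
        mse_hat = np.sum(z ** 2) + (mu_k - ybar_known) ** 2
        if best is None or mse_hat < best[0]:
            best = (mse_hat, k, mu_k)
    return best[2], best[1]                         # estimate and optimal k
```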

5.2.3 Robust self-calibration

In this paragraph, we introduce a robust calibration method that explicitly ensures alignment (under the assumption that the (sub-) population quantities \(\bar{y}\) and \(N\) are known).Footnote 30 To this end, we follow Duchesne (1999), who proposed a robustification of the (traditional) calibration method of Deville and Särndal (1992). In practice, the traditional calibration (see Appendix B) is used to re-weight a vector of auxiliary variables, say, \(\boldsymbol{x}_{i}\in\mathbb{R}^{p}\) (\(i\in s\))—not the variable of interest, \(y_{i}\)—such that the sample \(x\)-totals are aligned with their population values. Our approach, however, seeks calibration or alignment directly for the study variable \(y_{i}\). Therefore, we call the method (robust) self-calibration.

We follow Duchesne (1999) and fix a set of tuples of constants \(\{(q_{i},r_{i}):i\in s\}\). The choice of the constants will be discussed later.Footnote 31 Next, we define—still following Duchesne (1999)—a set of weights \(\{v_{i}:i\in s\}\) and consider minimizing the distance function \(\sum_{i\in s}(v_{i}-r_{i})^{2}/q_{i}\) subject to alignment or calibration constraints (s.t.c.). This choice of distance function is problematic because the resulting weights \(v_{i}\) can be negative. In order to restrict the calibrated weights \(v_{i}\) to the interval \([L,U]\), where \(L\) and \(U\) are pre-determined boundaries (\(0\leq L<U<\infty\)), we consider instead the following minimization problem

$$\min\frac{1}{2}\sum_{i\in s}h(v_{i},q_{i},r_{i})\qquad\text{s.t.c.}\qquad\left[\begin{matrix}\sum_{i\in s}v_{i}\\ \sum_{i\in s}v_{i}y_{i}\end{matrix}\right]=\left[\begin{matrix}N\\ \sum_{i\in U}y_{i}\end{matrix}\right],$$
(11)

where minimization is with respect to the \(v_{i}\)’s, and

$$h(v_{i},q_{i},r_{i})=\begin{cases}\displaystyle{\frac{(v_{i}-r_{i})^{2}}{q_{i}}}&\text{if}\;v_{i}\in[L,U],\\ \infty&\text{otherwise}.\end{cases}$$
(12)

The distance function in (12) is due to Duchesne (1999), and it is a slight modification of Case 7 in Deville and Särndal (1992). We impose two calibration constraints; see the r.h.s. of (11). Observe that our second constraint is specified with respect to the study variable, \(y_{i}\), not an auxiliary variable (this marks the major difference from the proposal of Duchesne 1999). Together, the two constraints ensure that the Hajek estimator of the \(y\)-mean, \(\sum_{i\in s}v_{i}y_{i}/\sum_{i\in s}v_{i}\), is aligned with the population \(y\)-mean.Footnote 32

The choice of the constants \((q_{i},r_{i})\) is of great importance in order to achieve robustness. We take \((q_{i},r_{i})=(u_{i}w_{i},u_{i}w_{i})\) for all \(i\in s\), where \(u_{i}=\psi(e_{i},k_{opt})/e_{i}\) with \(e_{i}=(y_{i}-\widehat{\mu}_{k_{opt}})/\widehat{\sigma}\) and \(\widehat{\mu}_{k_{opt}}\) is the \(M\)-estimator with \(k_{opt}\) defined in (10). Observe that this choice implies that \(q_{i}=r_{i}\) (\(i \in s\)), which is sensible and easy to compute but may not be the best specification possible. That is to say, it can sometimes be advantageous to take the constants to be \((w_{i}u_{i},w_{i}u_{i}^{\prime})\), where \(u_{i}^{\prime}=\psi(e_{i},k^{\prime})/e_{i}\) with \(k^{\prime}\) other than \(k_{opt}\). However, this approach poses the difficulty of choosing the tuning constant \(k^{\prime}\). We stick with the choice \(q_{i}=r_{i}\) because of its simplicity, and then we solve (11) to get the calibrated weights \(v_{i}\) (\(i\in s\)).

Later in the simulation, we are free to work with the tuples \((v_{i},y_{i})\) or \((w_{i},y_{i}^{*})\), where \(y_{i}^{*}=y_{i}u_{i}^{*}\) with \(u_{i}^{*}=v_{i}/w_{i}\), or we may store the \(u_{i}^{*}\)'s in the baseline survey for future use (\(i\in s\)).
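As an illustration, the following sketch solves the minimization problem (11)–(12) with a general-purpose solver. It is not the implementation used in the paper: the function and argument names are placeholders, and a dedicated calibration routine would typically be more efficient than `scipy.optimize.minimize`.

```python
import numpy as np
from scipy.optimize import minimize

def self_calibrate(y, w, u, N, y_total, L=0.0, U=None):
    """Robust self-calibration sketch: minimize (1/2) sum (v_i - r_i)^2 / q_i
    subject to sum v_i = N and sum v_i * y_i = y_total, with v_i in [L, U];
    here q_i = r_i = w_i * u_i, where the u_i are the downweights from the
    MRM fit (tuning constant k_opt)."""
    y = np.asarray(y, float)
    r = np.asarray(w, float) * np.asarray(u, float)   # r_i = q_i = w_i u_i
    q = r

    def distance(v):
        return 0.5 * np.sum((v - r) ** 2 / q)

    constraints = (
        {"type": "eq", "fun": lambda v: np.sum(v) - N},
        {"type": "eq", "fun": lambda v: np.sum(v * y) - y_total},
    )
    res = minimize(distance, x0=r, method="SLSQP",
                   bounds=[(L, U)] * y.size, constraints=constraints)
    if not res.success:
        raise RuntimeError("calibration failed: " + res.message)
    return res.x   # calibrated weights v_i

# The adjustment factors for storage in the baseline survey are then
# u_star = self_calibrate(...) / w.
```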

Fig. 8 Estimates (or aligned values) of average health care cost for women (by age group) for several estimation and alignment methods; the known population means are shown as a thick grey line; source: BAG (2017)

5.3 Empirical illustration

We study the empirical performance of the three methods for estimation and alignment of health care cost by age group (cf. top slice in Fig. 7). Strictly speaking, estimation is not needed because the population means (by age group) are known quantities. Therefore, we are mainly concerned with whether the methods achieve alignment. Fig. 8 shows the aligned values or estimates of average cost by age group for several methods. The known population means are shown as a thick grey line. From the visual display, we observe that the weighted sample mean overestimates the population mean for women in the age group 41–45 years by approximately 75% (i.e., CHF 5421 vs. 3102) because of a few outliers. A similar behavior, albeit less pronounced, is apparent for the age groups 26–30, 46–50, and 51–55 years. The estimates of the minimum risk \(M\)-estimator (MRM) are robust against outliers and influential values, and the estimates coincide with (or are at least close to) the population means in the age groups below 54 years. For the age groups above 54 years, however, the MRM estimator underestimates the population means quite noticeably. The reason for this behavior lies in the nature of the method. As an \(M\)-estimator, the method works by downweighting outlying observations; yet, for the age groups above 54 years, the method would have to react by up-weighting, which it is incapable of doing by design.Footnote 33 The robust self-calibration method produces values that are perfectly aligned with the known population means (as expected). If alignment is the only method selection criterion, we prefer robust self-calibration over the other methods.

Table 4 Estimates (by age group) of the relative mean square error (relMSE) for the methods robust self-calibration (self-cal), minimum risk \(M\)-estimator (MRM), and the weighted sample mean (avg); relMSE is computed with respect to method self-cal; see text for further explanations

For a comprehensive assessment of the estimation/alignment methods, we shall also study the methods' MSE. To fix notation, let \(\widehat{\mu}\) denote a generic estimator or alignment method. We estimate the MSE of \(\widehat{\mu}\) by \(\widehat{\mathrm{var}}(\widehat{\mu})+(\widehat{\mu}-\bar{y})^{2}\). For the weighted sample mean (Hajek estimator) and the MRM method, we use standard (approximate) variance calculation procedures to compute \(\widehat{\mathrm{var}}(\widehat{\mu})\); see e.g. Särndal et al. (1992, p. 182). Since robust self-calibration is not an estimation method, the aligned means have zero variance.Footnote 34 However, we shall nevertheless compute an approximate variance estimate for the robust self-calibration method. The variance estimator mimics the variance of the Hajek estimator, but it neglects the fact that the calibrated sampling weights depend on the \(y_{i}\)'s. As a result, the approximate variance estimator tends to underestimate the true variance.Footnote 35
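For concreteness, a simple way to compute such MSE and relMSE figures could look as follows. The with-replacement variance formula is a crude stand-in for the (design-based) variance estimators referred to above and need not coincide with them.

```python
import numpy as np

def hajek_variance(y, w, mu_hat):
    """Crude with-replacement approximation to the variance of a
    Hajek-type mean; it ignores the finite population correction and
    the joint inclusion probabilities of the actual sampling design."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w**2 * (y - mu_hat)**2) / np.sum(w)**2

def rel_mse(mu_hat, var_hat, mu_ref, var_ref, ybar):
    """relMSE of a method relative to a reference method (here: self-cal),
    with each MSE estimated as variance plus squared bias with respect to
    the known population mean, cf. Eq. (9)."""
    return (var_hat + (mu_hat - ybar)**2) / (var_ref + (mu_ref - ybar)**2)
```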

Table 4 shows the relative MSE (relMSE) for the methods in Fig. 8. The relMSE is the ratio of an estimator's MSE to the MSE of robust self-calibration. Values smaller (larger) than 1.0 indicate that the method under study is more (less) efficient than robust self-calibration. First, we note that the weighted sample mean (avg, Hajek estimator) is extremely inefficient compared with all other methods. Second, when \(avg<\bar{y}\) (see last column of Table 4), the MRM estimator is as inefficient as method avg. In these cases, the estimate \(\widehat{\mu}_{k_{\text{opt}}}\) is equal to avg because the penalty term (squared bias in the MSE, see Eq. 9) dominates the MSE and pulls the estimate onto avg. The MRM estimator could, in principle, escape from this trap by downweighting small observations more than large outliers; however, it is not capable of doing so in our application. By contrast, MRM is more efficient than the robust self-calibration method in all cases where \(avg>\bar{y}\). For some age groups, the gains in efficiency over self-cal are considerable (partly because we have tuned self-cal rather conservatively).Footnote 36 Third, although method self-cal does not achieve the most efficient estimate/aligned value for every single age group, it clearly shows the best ensemble efficiency (i.e., mean or total efficiency over all age groups). That is, self-cal achieves a fairly good compromise. Moreover, when alignment is of key importance to the microsimulation modeler, self-cal is the preferred method because it ensures alignment at reasonable efficiency. For simulations with small sample sizes (not the case in our application), efficiency considerations become more important than perfect alignment; hence, MRM is a good choice.

Table 5 Estimated relative variance (relVAR) of the weighted sample mean (Hajek estimator) of health care costs by household type, where the weights are taken from robust self-calibration (self-cal), minimum risk \(M\)-estimator (MRM), or just the sampling weights (method avg); relVAR is computed with respect to method self-cal

Next, we address the robust estimation problem when the population means are unknown (see slices other than the top slice in Fig. 7). This time we consider estimation of average health care cost by household type. Clearly, we cannot use alignment methods. In principle, we could estimate average cost by a robust estimator of the Hajek mean for each category of the variable household type. There is nothing wrong with this approach, except that it does not incorporate the auxiliary information from the alignment exercise at the top slice (to stick with the visual metaphor). In other words, this approach does not utilize the calibration principle of Deville and Särndal (1992, p. 376) that “weights that perform well for the auxiliary variables also should perform well for the study variable”. Thus, we estimate the average costs by household type with the Hajek-type estimator \(\sum_{i\in s}w_{i}u_{i}y_{i}/\sum_{i\in s}w_{i}u_{i}\), where the \(u_{i}\)'s depend on the method under consideration. For method avg, we have \(u_{i}\equiv 1\); for MRM and self-cal, we take the \(u_{i}\)'s that have been generated in the previous alignment exercise. We cannot examine how close an estimate is to the population value because the latter is unknown. Therefore, we focus our discussion on the efficiency of the methods, measured by the variance of the estimators. We computed the relative variances (relVAR) with respect to method self-cal. Thus, values smaller (larger) than 1.0 indicate superior (inferior) efficiency compared with method self-cal; see Table 5.
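A sketch of this estimator, assuming the linked data are held in a pandas data frame with illustrative column names, is given below.

```python
import pandas as pd

def hajek_by_group(df, y="cost", w="weight", u="u", by="household_type"):
    """Hajek-type group means sum(w_i u_i y_i) / sum(w_i u_i) within each
    category of the breakdown variable. For method avg, column u is
    identically 1; for MRM and self-cal, it holds the factors generated
    in the preceding alignment exercise. Column names are placeholders."""
    wu = df[w] * df[u]
    num = (wu * df[y]).groupby(df[by]).sum()
    den = wu.groupby(df[by]).sum()
    return num / den
```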

The extreme outlier that we have already encountered in the alignment exercise (cost \(\sim\) age group; see also Fig. 7) shows up in the household type “families with two or more children”, and it inflates the estimated variance of the weighted sample mean (avg). The two other methods are robust against the outlier(s). Also, we see from Table 5 that the MRM estimator has a smaller variance than method self-cal in households with children (and a larger variance in households without children). This pattern is caused by the alignment methods and then “imported” to the current situation. That is, the MRM estimator was superior (with a few exceptions) in terms of efficiency for age groups below 54 years (see Table 4). This effect carries over to the current estimation problem because individuals in households with children (i.e., parents) typically fall into age brackets below 54 years; as a result, relVAR is lower. Since self-cal and MRM are so close in terms of relative variance, it is hard to prefer one method over the other. However, if we take up the discussion of the previous paragraph, we may favor method self-cal if we value alignment (at the top slice) more than efficiency (and vice versa). Notably, in very small samples, efficiency considerations become more important and thus MRM is preferred over method self-cal.

6 Conclusion

The credibility of microsimulation modeling among the research community and policymakers depends on the availability of high-quality baseline surveys and the application of sound statistical methods. In this paper, we addressed two potential quality issues that both relate to skewed heavy-tailed distributions.

First, we reviewed how the presence of unit nonresponse can lead to biased simulations and estimates. In our application, we found that households in the top income bracket (and, to a lesser extent, households in the lowest income bracket) are significantly under-represented in the baseline survey, compared with tax register data. Notably, we discovered that too few high- and ultra-high-income households were included in the sample because, as the literature shows, survey compliance decreases with increasing income. Other survey-related errors may have contributed to the under-representation of the top income bracket. Altogether, the estimate of average income underestimates the known population average. Because taxation is progressive, underestimation of average income implies downward-biased results for the simulation of taxes (and possibly other simulated variables). Although the Deville–Särndal calibration eliminated under-representation of the top income group, it could not achieve alignment of estimated average income in the right tail of the distribution with known population values without distorting the empirical distribution. The problem is rooted in the inability of the calibration method to cope with skewed heavy-tailed distributions. To overcome the problem, we introduced a parametric Pareto model to describe the right tail of the income distribution. With the help of the tail model, we adjusted the sample income distribution in the tail such that average income in the top income bracket was aligned with known values. As a result, income data from the adjusted sample are more representative of the population distribution in terms of the first moment and with respect to tail probabilities. Our method of imputing expected order statistics from the Pareto distribution in place of the empirical order statistics has two major advantages over random imputation: the ranks of the observed household incomes are preserved, and the differences between observed and imputed values are small (except for the highest order statistics).
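A rough sketch of such a tail adjustment is given below. It is not the paper's exact procedure: it approximates the expected Pareto order statistics by the Pareto quantile function evaluated at the plotting positions \(r/(n+1)\), and it assumes that the threshold `x_min` and the shape parameter `alpha` have already been estimated.

```python
import numpy as np

def impute_pareto_tail(income, alpha, x_min):
    """Replace the observed incomes above the threshold x_min by
    approximate expected order statistics of a Pareto(alpha, x_min)
    distribution (quantiles at plotting positions r/(n+1)). The ranks
    of the observations are preserved because the imputed values are
    monotone in the ranks."""
    income = np.asarray(income, dtype=float)
    out = income.copy()
    tail = np.where(income >= x_min)[0]
    n = tail.size
    ranks = np.argsort(np.argsort(income[tail])) + 1   # 1 = smallest tail value
    p = ranks / (n + 1)                                # plotting positions
    out[tail] = x_min * (1.0 - p) ** (-1.0 / alpha)    # Pareto quantiles
    return out
```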

Under-representation of the top income bracket is a common issue of household surveys and is not limited to the Swiss SILC survey. This claim is substantiated by, for instance, the analysis of Törmälehto (2017) who presents empirical evidence of under-representation for 31 countries in the 2012 EU-SILC data, and the theoretical arguments in Korinek et al. (2006). Since sample surveys in general have difficulties in capturing top incomes, our method can be a useful tool for microsimulation modelers working with survey income data.

Fig. 9 Redistributive effects: Balance between payments made to and benefits received from CHI by household income bracket (see text for further explanations)

The second contribution of the paper also refers to the treatment of skewed heavy-tailed distributions. Here, we are concerned with variables from an outlier-prone, skewed population distribution that have been added to the baseline survey by record linkage. In our empirical application, individual health care costs from register data have been linked to the baseline survey. Because the baseline survey is a random sample with a small sampling fraction, the sampling weights (i.e., the inverse of the sample inclusion probabilities) are relatively large. An outlying observation in the cost data together with a large sampling weight can thus heavily influence or even ruin a sample estimate of the mean, total, or any similar characteristic. In contrast to our discussion on Pareto tail modeling for income, we have no comparable parametric model for health care cost; therefore, we adopt robust non-parametric methods.

In terms of methods, we distinguish between estimation and alignment methods for health care costs (by breakdown variables like age, income or household type). Alignment methods seek modifications of the data or the sampling weights such that the sample characteristics (e.g., mean or total) are aligned with known population values; hence, no estimation is required (unless we are interested in characteristics other than the ones that were benchmarked). However, the population characteristics of health care costs are known only for some breakdown variables. In our application, health care costs are known by age group, but not for other breakdown variables like \(\texttt{cost}\sim\texttt{household type}\). Therefore, we cannot impose alignment goals for average health care costs by household type. Instead, appealing to the calibration principle of Deville and Särndal (1992, p. 376) that “weights that perform well for the auxiliary variables also should perform well for the study variable”, we seek alignment for \(\texttt{cost}\sim\texttt{age group}\) and then use the modified observations (or weights) for the analysis of \(\texttt{cost}\sim\texttt{household type}\).

Alignment and estimation methods are required to be outlier resistant. When non-robust alignment methods are applied to achieve alignment for one breakdown variable (e.g., \(\texttt{cost}\sim\texttt{age group}\)), the cost data or weights are at risk of being distorted in the presence of outliers, which in turn may cause biased estimates for other breakdown variables (e.g., \(\texttt{cost}\sim\texttt{household type}\)). Thus, we must prevent an alignment problem in one place from turning into an estimation problem in another. Any method that is based on non-robustly estimated sample-based characteristics (namely, the naive alignment method and the Deville–Särndal calibration method) is not protected against the presence of outliers. Therefore, we have proposed two alignment methods that are outlier resistant: robust self-calibration (self-cal) and the minimum risk \(M\)-estimator of the mean (MRM). The latter method is inspired by Hulliger (1995).

Our empirical analysis shows that the method self-cal achieves alignment with known population characteristics at reasonable levels of efficiency (mean squared error, MSE) in the presence of outliers. In contrast, the weighted sample average is heavily influenced by outliers and is very inefficient. The MRM estimator does not impose explicit alignment goals and still produces estimates that are very close to the known population values, with one exception: the MRM estimate is not even close to the benchmark when the sample mean is considerably smaller than the known population mean (formally, \(\widehat{\bar{y}}<\bar{y}\)). Apart from this case, the MRM estimator is superior in terms of MSE. That being said, we prefer method self-cal over MRM when the sample size is relatively large for the following reasons: Self-cal achieves the alignment goals and its ensemble efficiency (i.e., total or mean efficiency over all categories of a breakdown variable, e.g., household type) is superior; in other words, self-cal achieves a good efficiency compromise. If, however, the sample size is small, efficiency considerations become more important. Hence, we favor the MRM estimator when \(\widehat{\bar{y}}>\bar{y}\) because it exhibits gains in MSE over self-cal, and we suggest self-cal for the cases where \(\widehat{\bar{y}}<\bar{y}\). Our methods are broadly applicable to outlier-prone, skewed data whenever alignment goals must be met.

To illustrate the impact of the discussed methods, we study the redistributive effects in CHI by household income. Fig. 9 shows a comparison of average payments made to the system (taxes, premiums, OOP) with average financial aid (e.g., premium reductions) and average health care benefits received from CHI by income bracket. Payments and benefits are equivalized by the EUROSTAT equivalence scaleFootnote 37 to ensure comparability. For simplicity, Fig. 9 does not contain confidence intervals. We observe from the display that households above the 40–50% income bracket are net payers (see balance/saldo). It is also noteworthy that households in the top income bracket make a major financial contribution to the system, mainly through taxes. If the Pareto tail adjustment of the income distribution were omitted, tax payments in the top income bracket would be significantly underestimated. Fig. 9 shows further interesting patterns, which will be discussed elsewhere. We refer the reader to Schoch et al. (2013), where we study other breakdowns (e.g., gender, household composition) and more sophisticated measures of the redistribution effects (e.g., Gini coefficient).