Introduction

Measures of inequality are widely used to study income, welfare, and poverty issues. They can also be helpful to analyze the efficiency of a tax policy or to measure the level of social stratification and polarization. They are most frequently applied to dynamic comparisons (comparing inequality across time). The Gini concentration coefficient based on the Lorenz curve is the most widely used measure of income inequality. The Zenga point concentration measure, based on the Zenga curve, has recently received some attention in the literature.

The true values of income inequality coefficients are usually unknown and they can only be estimated on the basis of sample data coming from household budget surveys. Estimators of concentration coefficients are usually nonlinear, thus their standard errors cannot be obtained easily. The methods of variance estimation that can solve this problem include: various replication techniques, Taylor expansion, and parametric procedures based on income distribution models.

The main objective of the paper is to use survey data to analyze income inequality in Poland by socio-economic group, by means of selected concentration measures and their decomposition. This approach can further be used to assess the relative economic affluence of one subpopulation with respect to another and to estimate stratification indices. To complete the analysis, some variance estimation techniques that can be used to estimate the standard errors of the Gini and Zenga inequality measures are also presented and applied.

Estimation of Income Concentration Measures

Of the many income concentration measures, the Gini index is the most popular, mainly due to its good statistical properties and straightforward economic interpretation. The Gini index of inequality can be defined as twice the area between the Lorenz curve and the line of equal shares. The Lorenz curve is shown in Fig. 1. The line at 45° represents perfect equality of incomes, and the area between this line and the Lorenz curve is called the concentration area. The Gini index can be expressed as follows:

Fig. 1 Lorenz concentration curve

$$ G = 2\int\limits_0^1 {\left( {p - L(p)} \right)dp} $$
(1)

where:

L(p):

the Lorenz function

p = F(y):

cumulative distribution function of income.

The Gini index can also be interpreted as an average gain to be expected if an economic unit had a choice between its own income and that of another economic unit selected at random, relative to the mean income. One can estimate the value of the Gini index from the survey data using the following formula:

$$ \widehat{G} = \frac{{2\sum\limits_{{i = 1}}^n {\left( {{w_i}{y_{{(i)}}}\sum\limits_{{j = 1}}^i {{w_j}} } \right) - \sum\limits_{{i = 1}}^n {{w_i}{y_{{(i)}}}} } }}{{\left( {\sum\limits_{{i = 1}}^n {{w_i}} } \right)\sum\limits_{{i = 1}}^n {{w_i}{y_{{(i)}}}} }} - 1 $$
(2)

where:

\( y_{(i)} \) :

household incomes in non-descending order,

\( w_i \) :

survey weight of the i-th economic unit, and

\( \sum\limits_{{j = 1}}^i {{w_j}} \) :

rank of the i-th economic unit in the n-element sample.
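
The estimator (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `weighted_gini` is hypothetical, and the correction term is computed with squared weights, which coincides with the second sum in (2) when all weights equal one and keeps the result unchanged when the weights are rescaled.

```python
def weighted_gini(incomes, weights=None):
    """Survey-weighted Gini estimate in the spirit of formula (2)."""
    if weights is None:
        weights = [1.0] * len(incomes)
    # Sort units by income (non-descending), carrying the weights along.
    pairs = sorted(zip(incomes, weights))
    total_w = sum(w for _, w in pairs)
    total_wy = sum(w * y for y, w in pairs)
    cum_w = 0.0        # running sum of weights = weighted rank of the unit
    rank_term = 0.0
    corr_term = 0.0
    for y, w in pairs:
        cum_w += w
        rank_term += w * y * cum_w
        # Assumption: squared weights; equals sum(w * y) for unit weights.
        corr_term += w * w * y
    return (2.0 * rank_term - corr_term) / (total_w * total_wy) - 1.0
```

With equal incomes the function returns zero, and a frequency weight of 3 attached to one household gives the same value as listing that household three times.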

The total Gini ratio calculated for a population of size n divided into k subpopulations can be decomposed as follows (Dagum 1997):

$$ \begin{array}{l} G = G_w + G_b + G_t \\ G_w = \sum\limits_{j = 1}^k G_j p_j s_j \\ G_b = \sum\limits_{j = 2}^k \sum\limits_{h = 1}^{j - 1} G_{jh}\left( p_j s_h + p_h s_j \right) D_{jh} \\ G_t = \sum\limits_{j = 2}^k \sum\limits_{h = 1}^{j - 1} G_{jh}\left( p_j s_h + p_h s_j \right)\left( 1 - D_{jh} \right) \end{array} $$
(3)

The component \( G_w \) is the contribution of within-group inequality to the Gini index and \( G_b \) is the contribution of net between-group inequality, while \( G_t \) denotes the contribution of the overlapping of subpopulations, also called transvariation. The terms \( p_j \) and \( s_j \) denote the population and income shares of the j-th subpopulation, respectively, and \( G_{jh} \) is the Gini ratio between the j-th and h-th subpopulations. The term \( D_{jh} \), called either the economic distance ratio or REA, plays a crucial role in the decomposition (3), and can be regarded as a measure of the relative economic affluence of the j-th subpopulation with respect to the h-th subpopulation.
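
The decomposition (3) can be illustrated with a short, unweighted Python sketch. It rests on an assumption that should be stressed: the formula for the economic distance ratio is not reproduced here, so the code uses the form usually attributed to Dagum (1997), namely the difference between gross economic affluence and transvariation divided by their sum, which reduces to the absolute difference of group means divided by the mean absolute difference between the two groups. All function names are hypothetical, and incomes are assumed to be positive.

```python
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def mean_abs_diff(a, b):
    """Average |x - y| over all pairs (x in a, y in b)."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def dagum_decomposition(groups):
    """Return (G_w, G_b, G_t) for a list of income lists (unweighted)."""
    n = sum(len(g) for g in groups)
    mu = sum(sum(g) for g in groups) / n
    p = [len(g) / n for g in groups]            # population shares p_j
    s = [sum(g) / (n * mu) for g in groups]     # income shares s_j
    # Within-group part: G_j = mean absolute difference / (2 * group mean).
    g_w = sum(mean_abs_diff(g, g) / (2 * mean(g)) * p[j] * s[j]
              for j, g in enumerate(groups))
    g_b = g_t = 0.0
    for h, j in combinations(range(len(groups)), 2):   # all pairs h < j
        delta = mean_abs_diff(groups[j], groups[h])
        g_jh = delta / (mean(groups[j]) + mean(groups[h]))
        # Assumed definition of D_jh (see the text above this block).
        gap = abs(mean(groups[j]) - mean(groups[h]))
        d_jh = gap / delta if delta > 0 else 0.0
        share = g_jh * (p[j] * s[h] + p[h] * s[j])
        g_b += share * d_jh
        g_t += share * (1.0 - d_jh)
    return g_w, g_b, g_t
```

When the income ranges of two groups do not overlap, the assumed distance ratio equals one and the transvariation component vanishes; in every case the three components add up to the Gini index of the pooled sample.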

Another interesting measure of income inequality based on a concentration curve was proposed by Zenga (1990). It is called a point concentration measure, as it is sensitive to changes of inequality in each part (point) of a population (Kleiber and Kotz 2003). The Zenga synthetic inequality index Z can be expressed as the area below the Zenga curve \( Z_p \), which is based on the relation between income and population quantiles (see Fig. 2):

Fig. 2 Zenga concentration curve

$$ Z = \int\limits_0^1 {{Z_p}dp} $$
(4)

The area below the \( Z_p \) curve, which represents the concentration area, is equal to 1 in the case of perfect concentration and takes the value 0 when all incomes are equal. Unlike the Lorenz curve, the Zenga curve has no forced behavior, so it can take various shapes depending on the underlying income distribution model.

The commonly used nonparametric estimator of the Zenga index (4) was introduced by Aly and Hervas (1999) and can be expressed by the following equation:

$$ \widehat{Z} = 1 - \frac{1}{{n\bar{y}}}\left[ {{y_{{1:n}}} + \sum\limits_{{j = 1}}^{{n - 1}} {{y_{{\left\langle {\frac{{\sum\nolimits_{{i = 1}}^j {{y_{{i:n}}}} }}{{\bar{y}}}} \right\rangle :n}}}} } \right] $$
(5)

where:

\( y_{i:n} \) :

i-th order statistic in the n-element sample based on weighted data, and

\( \bar{y} \) :

sample arithmetic mean.

The estimator (5) has been proven to be consistent and asymptotically normally distributed.
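
A minimal unweighted sketch of the estimator (5) is given below. Two points are assumptions rather than statements from the text: the angle brackets in the index are read as rounding up to the nearest integer (restricted to the range 1 to n), and survey weights are ignored. The function name `zenga_index` is hypothetical.

```python
import math

def zenga_index(incomes):
    """Unweighted sketch of the nonparametric Zenga estimator (5)."""
    y = sorted(incomes)          # order statistics y_{1:n} <= ... <= y_{n:n}
    n = len(y)
    mean_y = sum(y) / n
    total = y[0]                 # leading term y_{1:n}
    cum = 0.0
    for j in range(1, n):        # j = 1, ..., n - 1
        cum += y[j - 1]          # sum of the j smallest incomes
        # Assumption: <a> is the smallest integer >= a, kept within 1..n.
        k = min(max(math.ceil(cum / mean_y), 1), n)
        total += y[k - 1]
    return 1.0 - total / (n * mean_y)
```

Under these assumptions the function returns 0 when all incomes are equal and 1 when a single unit holds all the income.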

Estimation of Standard Errors

The precision of estimation is important for statistical inference, which should be part of every statistical analysis based on random samples. Reliable estimates of estimation errors are needed to apply inferential methods, in particular to test statistical hypotheses and construct confidence intervals. The following remarks underline the importance of the problem discussed in this section:

  • The precision of an estimator \( T_n \) is usually discussed in terms of its sampling variance \( D^2(T_n) \) or its standard error, which is simply the square root of the variance.

  • In many cases, the exact value of the sampling variance is unknown, because it depends on unknown population quantities.

  • After survey data have been obtained, however, an estimate of the variance \( {\widehat{D}^2}(\widehat{\theta }) \) can be calculated.

  • For most income concentration measures (Gini and Zenga indices among them), explicit variance estimators are theoretically complicated: it is hard to derive general mathematical formulas for nonlinear statistics, especially when the sampling design is complex.

Thus, many approximate techniques for variance estimation can be used to obtain standard errors of income inequality measures, including (Wolter 2003):

  • Taylor linearization technique,

  • Random groups method,

  • Balanced Half Samples (BHS), also called Balanced Repeated Replication (BRR),

  • Jackknife technique,

  • Bootstrapping,

  • Parametric approach based on maximum likelihood theory,

  • Generalized Variance Function (GVF), first applied in the Current Population Survey (CPS) in 1947.

In the context of inequality measures, Taylor linearization, the jackknife, and the bootstrap are the variance estimation methods most often applied (see: Verma and Betti 2005; Davidson 2009; Kordos and Zięba 2010).

The Taylor linearization technique approximates the nonlinear estimator \( T_n \) by a pseudoestimator g(Y), which is a linear function of sample observations. It is based on the first-order Taylor expansion around the parameter θ, neglecting the remainder term:

$$ g(Y) \approx g(\theta ) + \sum\limits_{{i = 1}}^k {g_i^{\prime }(\theta )\,} ({Y_i} - {\theta_i}) $$
(6)

We often use the variance of the linearized statistic g(Y) as the approximation of the true estimator variance:

$$ {D^2}\left[ {g(Y)} \right] \approx \sum\limits_{{i = 1}}^k {{{\left[ {g_i^{\prime }(\theta )} \right]}^2}V({Y_i})} + 2\sum\limits_{{i > j}} {g_i^{\prime }(\theta )g_j^{\prime }(\theta )Cov({Y_i},{Y_j})} $$
(7)

where:

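\( g_i^{\prime }(\theta ) \) :

first partial derivative of the function g with respect to its i-th argument, evaluated at θ,

\( V({Y_i}) \) :

variance of the random variable \( Y_i \), and

\( Cov({Y_i},{Y_j}) \) :

covariance between the variables \( Y_i \) and \( Y_j \).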
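
As a worked illustration, the linearized variance can be computed for the ratio of two estimated quantities, g(θ) = θ₁/θ₂. The sketch below evaluates the usual first-order approximation as a double sum over all pairs (i, j), which is the same as summing squared derivatives times variances and adding twice the cross terms for i > j. The function names and the numerical inputs are made up for the example.

```python
def linearized_variance(gradient, covariance):
    """First-order approximation: sum_i sum_j g_i * g_j * Cov(Y_i, Y_j)."""
    k = len(gradient)
    return sum(gradient[i] * gradient[j] * covariance[i][j]
               for i in range(k) for j in range(k))

def ratio_gradient(theta1, theta2):
    """Partial derivatives of g(theta) = theta1 / theta2."""
    return [1.0 / theta2, -theta1 / theta2 ** 2]

# Illustrative (made-up) inputs: two estimates and their covariance matrix.
grad = ratio_gradient(2.0, 4.0)
cov = [[0.04, 0.01],
       [0.01, 0.09]]
var_ratio = linearized_variance(grad, cov)
std_error = var_ratio ** 0.5
```

For nonlinear statistics such as the Gini index the same recipe applies once the partial derivatives (or the corresponding linearized variable) have been derived.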

The jackknife technique was originally developed by Quenouille to reduce the bias of an estimator in a finite-population context. The jackknife method starts by partitioning the original sample into L dependent groups of equal size. Next, for each group, the pseudovalue \( T_l \) is calculated from the full-sample estimator \( T_n \) and the estimator \( T_{(l)} \) based on the data that remain after omitting the l-th group:

$$ {T_l} = L{T_n} - (L - 1){T_{{(l)}}} $$
(8)

The jackknife variance estimator is defined as:

$$ \widehat{D}_J^2({T_n}) = \frac{1}{{L(L - 1)}}\sum\limits_{{l = 1}}^L {{{({T_l} - {T_Q})}^2}} $$
(9)

where:

\( T_{(l)} \) :

the value of T based only on the data that remain after omitting the l-th group,

\( T_Q \) :

jackknife estimator of \( \theta \) defined as the simple arithmetic mean of the pseudovalues, and

L :

number of jackknife groups.
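
Formulas (8) and (9) translate almost directly into code. The sketch below is a simplified illustration: it forms the L groups from consecutive blocks of the data (so the observations are assumed to be in random order), drops any leftover observations so that the groups have equal size, and ignores the survey design. The name `jackknife_variance` and the `estimator` argument are hypothetical; any function mapping a list of observations to a number can be passed.

```python
def jackknife_variance(data, estimator, n_groups):
    """Delete-a-group jackknife: returns (T_Q, variance estimate)."""
    size = len(data) // n_groups
    data = list(data)[:size * n_groups]    # equal group sizes
    t_full = estimator(data)               # T_n on the whole sample
    pseudo = []
    for l in range(n_groups):
        kept = data[:l * size] + data[(l + 1) * size:]
        t_minus_l = estimator(kept)        # T_(l): l-th group omitted
        # Pseudovalue, formula (8).
        pseudo.append(n_groups * t_full - (n_groups - 1) * t_minus_l)
    t_q = sum(pseudo) / n_groups           # mean of the pseudovalues
    # Variance estimator, formula (9).
    var = sum((t - t_q) ** 2 for t in pseudo) / (n_groups * (n_groups - 1))
    return t_q, var
```

For the sample mean with one observation per group, the pseudovalues are the observations themselves and the result reduces to the familiar sample variance divided by n.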

The bootstrap method, like the jackknife method, was introduced outside the field of survey sampling as a means of obtaining approximate variance estimates and confidence intervals. After drawing a series of N independent resamples (called bootstrap samples) by a design identical to the one by which the sample was drawn from the population, we calculate the estimators \( T_k^{ * } \), k = 1, …, N. The bootstrap variance estimator is defined as:

$$ \widehat{D}_B^2({T_n}) = \frac{1}{{N - 1}}\sum\limits_{{k = 1}}^N {(T_k^{ * } - {{\overline T }^{ * }}} {)^2}. $$
(10)
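
A minimal sketch of the estimator (10) follows. It makes a simplifying assumption that differs from the procedure described above: resamples are drawn by simple random sampling with replacement, not by replicating a two-stage stratified design. The function name, the default number of replicates, and the fixed seed are choices made only for this illustration.

```python
import random

def bootstrap_variance(data, estimator, n_boot=1000, seed=0):
    """Bootstrap variance of estimator(data), formula (10)."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    data = list(data)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        # Assumption: simple random resampling with replacement.
        resample = rng.choices(data, k=n)
        stats.append(estimator(resample))
    mean_star = sum(stats) / n_boot          # mean of the bootstrap estimates
    return sum((t - mean_star) ** 2 for t in stats) / (n_boot - 1)
```

The square root of the returned value is the bootstrap standard error; passing a Gini or Zenga estimator as `estimator` gives the corresponding standard error under the simplified resampling scheme.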

Provided that the probability distribution of a variable of interest Y can be approximated by a theoretical distribution model, the method of variance estimation based on maximum likelihood theory can also be used. Let us assume that:

  • an inequality measure of interest can be expressed as a function g(θ) of the parameters θ of an income distribution model given by a density function f(y, θ),

  • the density function is well fitted to data, and

  • the ML (maximum likelihood) estimates \( T_n \) of the parameters θ can be obtained.

According to classical estimation theory, ML estimators are asymptotically unbiased and normally distributed, with variances given by the Cramér-Rao bound. The asymptotic variance of the ML estimator of an inequality measure g(θ) takes the form:

$$ {D^2}\left[ {g(\widehat{\theta })} \right] = {\left[ {\frac{{\partial g(\theta )}}{{\partial \theta }}} \right]^T}{\bf I}_{\theta }^{{ - 1}}\left[ {\frac{{\partial g(\theta )}}{{\partial \theta }}} \right] $$
(11)

where \( {\bf I}_{\theta } \) denotes the Fisher information matrix.
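
To show how formula (11) works in the simplest case, the sketch below uses a one-parameter Pareto model with known scale instead of the Dagum model applied in this paper; the swap is made only because the Pareto case has closed forms. For that model the Gini index is g(α) = 1/(2α − 1) for α > 1, and the Fisher information for α in a sample of size n is n/α². The function names are hypothetical.

```python
import math

def pareto_alpha_mle(incomes, y_min):
    """ML estimate of the Pareto shape when the scale y_min is known."""
    return len(incomes) / sum(math.log(y / y_min) for y in incomes)

def pareto_gini_with_se(alpha_hat, n):
    """Gini index 1 / (2*alpha - 1) and its standard error via (11)."""
    gini = 1.0 / (2.0 * alpha_hat - 1.0)
    dg_dalpha = -2.0 / (2.0 * alpha_hat - 1.0) ** 2   # derivative of g
    inv_fisher = alpha_hat ** 2 / n                   # inverse of n / alpha^2
    variance = dg_dalpha * inv_fisher * dg_dalpha     # scalar case of (11)
    return gini, math.sqrt(variance)
```

With several parameters, as in the Dagum model, the derivative becomes a gradient vector and the inverse Fisher information becomes a matrix, but the structure of the calculation stays the same.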

Application

The calculations are based on data from the Polish Household Budget Survey (HBS) for the years 2006 and 2008. In 2006 the randomly selected sample covered 37,508 households, i.e., approximately 0.3 % of the total number of households, while in 2008 the total sample size was 37,584. The samples were selected by two-stage stratified sampling with unequal inclusion probabilities for primary sampling units. In order to maintain the relation between the structure of the surveyed population and the socio-demographic structure of the total population, data obtained from the HBS were weighted with the structure of households by number of persons and class of locality coming from the Population and Housing Census 2002. The basic analysis presented in the paper was conducted after dividing the overall sample by socio-economic group, constructed according to the exclusive or main source of maintenance.

First, according to formulas (2), (3), and (5), the estimates of the Gini and Zenga inequality measures were calculated and the Gini index was decomposed into between-group and within-group inequality. Then, the estimates of their standard errors were obtained using two variance estimation methods: bootstrapping and the parametric approach. The Gini and Zenga coefficients were also estimated for the entire population. As a theoretical distribution model, the Dagum type-I function was used (see: Dagum 1977).

Table 1 shows the estimates of the Gini and Zenga coefficients together with their standard errors calculated by means of the bootstrap method. The number of bootstrap replicates was N = 5000. The values of the Zenga index for socio-economic groups in Poland vary from 0.25 to 0.49, while the Gini coefficients take values from 0.29 to 0.43. Thus the Zenga coefficient seems to be more sensitive than the Gini coefficient to differences between family incomes. The standard errors are noticeably higher for the Zenga coefficient, usually 3–6 % of the estimated values, whereas the relative dispersion of the Gini index is usually 1–5 %. Additionally, Figs. 3 and 4 show that, despite the relatively small number of repetitions, the distributions of both inequality statistics can be approximated by normal density curves.

Table 1 Estimated values of Gini and Zenga inequality measures by socio-economic group and bootstrap estimates of their standard errors (first row: 2006, second row: 2008)

Fig. 3 Bootstrap distribution of Gini index estimator (N = 5000)

Fig. 4 Bootstrap distribution of Zenga index estimator (N = 5000)

Tables 2 and 3 contain the results of the Gini index decomposition by socio-economic group. In 2008, intragroup inequality (that is, the within-group component \( G_w \)) accounted for 32 % of the overall inequality in Poland. The within-group component reflects the inner polarization of the groups, that is, the marked differences in average income between managers and blue-collar workers within the group of employees, between entrepreneurs and the others within the group of self-employed, or between retirees and pensioners within the fourth group. Table 2 also helps to answer the question of how much particular groups contribute to the overall inequality. Because of their very small income and population shares, income disparities among the self-employed account for only 0.6 % of total inequality, while the contribution of farmers is even smaller, at 0.5 %. The group with the highest share (24 %) in the overall Gini index is the group of employees.

Table 2 Income inequality decomposition by subpopulations in 2008 (socio-economic groups)
Table 3 Average family income and economic distance ratios for socio-economic groups in Poland in 2006

The net between-groups component \( G_b \) contributes 43 % of the total Gini coefficient. The highest value of the economic distance ratio was observed between the non-earned sources group and the self-employed (D = 0.88); that is, the economic situation of the self-employed is 88 % better than that of households living on non-earned sources (see: Table 3). The transvariation component \( G_t \), describing the overlapping of the subpopulations, accounts for the remaining 24 % of the total income inequality in Poland.

Table 4 presents the results of the estimation obtained using the parametric approach. The theoretical income distribution model was the Burr type-III function, also called the Dagum distribution. The model parameters λ, β, δ were estimated using the maximum likelihood method, which requires solving a nonlinear system of equations. The maximum likelihood estimates of the Gini and Zenga coefficients and the ML estimates of their standard errors follow regularities similar to those observed for the bootstrap approach presented in Table 1.

Table 4 Parametric estimates of the Gini and Zenga inequality measures and their standard errors based on the Dagum model parameters

Concluding Remarks

The paper considered the problem of efficient estimation of inequality indices on the basis of random samples, including the measurement of inequality within and between subpopulations. Reliable estimates of inequality indices are usually available only at the national level, whereas this paper presents detailed results for socio-economic groups. They can be helpful in identifying the sources of income inequality and poverty in Poland.

The results of the calculations presented in the paper reveal that the level of income inequality in Poland is high, as compared with many other European countries, especially for some socio-economic groups. The main component of income inequality in Poland, when measured by the Gini index, is economic disparity between socio-economic groups. The high value of the overlapping component suggests that the socio-economic groups are not separated perfectly, so they cannot be regarded as strata.

In general, inequality estimation was more efficient when the Gini index was applied, which resulted in smaller estimation errors. On the other hand, the synthetic Zenga measure seemed more sensitive to slight changes of income inequality within the groups of households. Thus, both inequality coefficients, accompanied by measures of their precision, can be regarded as useful tools in income distribution analysis.