First, a brief review of the Dagum model for fitting economic data is provided. Then the inequality curves and indices are presented, together with computational details. Finally, the notion of relative distribution is defined, which provides a powerful tool for comparing the genders.
Density Estimation
The Dagum distribution was chosen to model the income data. The choice rests on well-known economic foundations and on a large and diverse body of empirical evidence supporting this model as an excellent candidate for fitting observed income distributions (Kleiber and Kotz 2003). Recall that F belongs to the Dagum family if its density function is given by
$$f(x;a,b,p)=\frac{ap x^{ap-1}}{b^{ap}\left[1+\left( \frac{x}{b}\right)^{a}\right]^{p+1}},\quad x>0,$$
for some a, b, p > 0. Notice that the first moment of a Dagum distribution is finite if and only if a > 1. Therefore, the inequality measures considered in this paper (and introduced in the following subsection) are defined under this condition. The parameters a and p are shape parameters, while b is a scale parameter. This model allows for various degrees of positive skewness and leptokurtosis. Moreover, it is flexible enough to be unimodal, as suits income distributions, or zeromodal, as suits wealth distributions. The shape parameters are related to inequality, Lorenz dominance and first-order stochastic dominance. For example, let \(F_{1}=F(\cdot;a_{1},b_{1},p_{1})\) and \(F_{2}=F(\cdot;a_{2},b_{2},p_{2})\) be two Dagum distributions. Then the necessary and sufficient conditions for Lorenz dominance (i.e. non-intersecting Lorenz curves) are \(a_{1}p_{1}\leq a_{2}p_{2}\) and \(a_{1}\leq a_{2}\). For more details on this distribution, in the framework of economic size distributions, see Kleiber and Kotz (2003, chap. 6.3) and references therein. Finally, the cumulative distribution function (CDF) of the Dagum distribution is given by
$$F(x;a,b,p)=\left[1+\left( \frac{x}{b}\right)^{-a}\right]^{-p},\quad x>0.$$
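As an illustrative sketch (not part of the original analysis), the density and CDF can be coded directly, assuming the standard parameterization in which a is the inner exponent and p the outer exponent of the CDF. The function names and the parameter values in the usage note are hypothetical; the quantile function is obtained by inverting the CDF and is used here only for simulation.

```python
import numpy as np

def dagum_pdf(x, a, b, p):
    """Dagum density f(x; a, b, p) for x > 0."""
    x = np.asarray(x, dtype=float)
    return a * p * x ** (a * p - 1) / (b ** (a * p) * (1 + (x / b) ** a) ** (p + 1))

def dagum_cdf(x, a, b, p):
    """Dagum CDF F(x) = [1 + (x/b)^(-a)]^(-p) for x > 0."""
    x = np.asarray(x, dtype=float)
    return (1 + (x / b) ** (-a)) ** (-p)

def dagum_quantile(u, a, b, p):
    """Inverse CDF: the x solving F(x) = u, for 0 < u < 1."""
    u = np.asarray(u, dtype=float)
    return b * (u ** (-1 / p) - 1) ** (-1 / a)

def dagum_sample(n, a, b, p, seed=0):
    """Draw n values by inverse-CDF sampling (seed is an arbitrary choice)."""
    rng = np.random.default_rng(seed)
    return dagum_quantile(rng.uniform(size=n), a, b, p)
```

For instance, `dagum_sample(1000, a=3.0, b=2.0, p=0.8)` would produce a synthetic sample with a finite mean, since a > 1.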
Given an independent and identically distributed sample x1, x2, ... , xn, the likelihood equations for the Dagum family are given by
$$ \left\{\begin{aligned} \frac{n}{a}+p\sum\limits_{i=1}^{n}\ln\left( \frac{x_{i}}{b}\right)-(p+1)\sum\limits_{i=1}^{n}\frac{\ln\left( \frac{x_{i}}{b}\right)}{1+\left( \frac{b}{x_{i}}\right)^{a}}&=0\\ np-(p+1)\sum\limits_{i=1}^{n}\frac{1}{1+\left( \frac{b}{x_{i}}\right)^{a}}&=0\\ \frac{n}{p}+a\sum\limits_{i=1}^{n}\ln\left( \frac{x_{i}}{b}\right)-\sum\limits_{i=1}^{n}\ln\left[1+\left( \frac{x_{i}}{b}\right)^{a}\right]&=0. \end{aligned}\right. $$
(1)
However, no explicit solution to this system is known (e.g., Kleiber 2008). The system was therefore solved numerically: an effective package for Dagum estimation is available on the CRAN R package repository (CRAN 2018), and its maximum likelihood (ML) optimization yielded a very good model fit.
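The estimates in this paper come from an R package; the following Python sketch is only meant to illustrate what solving system (1) numerically involves. It assumes SciPy's Nelder-Mead optimizer, synthetic data, and hypothetical function names and starting values. The three expressions in `dagum_score` are the left-hand sides of system (1); the first and third are the partial derivatives of the log-likelihood with respect to a and p, while the second is the derivative with respect to b multiplied by −b/a.

```python
import numpy as np
from scipy.optimize import minimize

def dagum_loglik(theta, x):
    """Dagum log-likelihood for theta = (a, b, p) and positive data x."""
    a, b, p = theta
    n = x.size
    return (n * np.log(a) + n * np.log(p) + (a * p - 1) * np.log(x).sum()
            - n * a * p * np.log(b) - (p + 1) * np.log1p((x / b) ** a).sum())

def dagum_score(theta, x):
    """Left-hand sides of system (1), in the order they appear there."""
    a, b, p = theta
    n = x.size
    lx = np.log(x / b)
    w = 1.0 / (1.0 + (b / x) ** a)
    eq1 = n / a + p * lx.sum() - (p + 1) * (lx * w).sum()
    eq2 = n * p - (p + 1) * w.sum()
    eq3 = n / p + a * lx.sum() - np.log1p((x / b) ** a).sum()
    return np.array([eq1, eq2, eq3])

def fit_dagum(x, start):
    """Maximise the log-likelihood over log-parameters so a, b, p stay positive."""
    def nll(log_theta):
        return -dagum_loglik(np.exp(log_theta), x)
    res = minimize(nll, np.log(start), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 5000})
    return np.exp(res.x)

# Synthetic data (hypothetical parameter values), drawn by inverse-CDF sampling.
rng = np.random.default_rng(1)
a0, b0, p0 = 3.0, 2.0, 0.8
u = rng.uniform(size=2000)
x = b0 * (u ** (-1 / p0) - 1) ** (-1 / a0)

theta_hat = fit_dagum(x, start=(2.0, np.median(x), 1.0))
```

If the optimizer has converged, `dagum_score(theta_hat, x)` should be close to zero in all three components, which is one way to check a numerical solution against system (1).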
Inequality Curves and Indices
Let Z ≥ 0 be a random variable representing gross or net incomes as well as taxes. Let F(z) be its cumulative distribution function (CDF), μ = E(Z) its mean, and \(F^{-1}(p)=\inf\{z : F(z)\geq p\}\) its quantile function for 0 < p < 1.
The Lorenz curve \(\{(p, L_{F}(p)) \mid p\in[0,1]\}\) plots the cumulative share of Z, say \(L_{F}(p)\),
$$ L_{F}(p)=\frac{{{\int}_{0}^{p}}F^{-1}(s)ds}{{{\int}_{0}^{1}}F^{-1}(s)ds}=\frac{1}{\mu}{{\int}_{0}^{p}}F^{-1}(s)ds $$
(2)
versus the cumulative share of the population, p. In the ideal case of perfect equality (that is, a society in which all people have the same income), the share of incomes equals the share of the population, so that \(L_{F}(p)=p\) for all 0 < p < 1. In this case, the Lorenz curve is the diagonal line from (0, 0) to (1, 1). On the other hand, the lower the share of income \(L_{F}(p)\) held by the share p of income earners, the higher the inequality. In the ideal case of perfect inequality (that is, a society in which all people but one have an income of nil), the share of incomes is zero until the last earner is included, so that \(L_{F}(p)=0\) for 0 ≤ p < 1 and \(L_{F}(1)=1\). In this setting, it is very natural to express the degree of inequality through the deviation of the actual Lorenz curve from the diagonal line. The Gini index (Gini 1914) is twice the area between the equality line and the Lorenz curve:
$$ G_{F}=2{{\int}_{0}^{1}}(p-L_{F}(p))dp. $$
(3)
The Gini index can be rewritten as
$$ G_{F}= 2{{\int}_{0}^{1}}\frac{\mu-\overset{-}{\mu}(p)}{\mu} p dp, \quad \text{where} \quad \overset{-}{\mu}(p)=\frac{1}{p}{{\int}_{0}^{p}}F^{-1}(s)ds. $$
(4)
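A quick numerical check of the equivalence between (3) and (4) can be sketched as follows. The exponential distribution with mean 1 is an assumed example (it is not used in the paper), chosen because its quantile function is explicit; the function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

MU = 1.0  # mean of the assumed example distribution

def quantile(s):
    """Quantile function of the exponential distribution with mean 1."""
    return -np.log1p(-s)

def lorenz(p):
    """L_F(p) as in equation (2), by numerical integration."""
    return quad(quantile, 0.0, p)[0] / MU

def lower_mean(p):
    """Mean income of the poorest share p of the population."""
    return quad(quantile, 0.0, p)[0] / p

# Equation (3): twice the area between the diagonal and the Lorenz curve.
gini_area = quad(lambda p: 2.0 * (p - lorenz(p)), 0.0, 1.0)[0]

# Equation (4): weighted relative gaps between mu and the lower mean.
gini_weighted = quad(lambda p: 2.0 * p * (MU - lower_mean(p)) / MU, 0.0, 1.0)[0]
```

Both values should agree with each other and with 0.5, the known Gini index of the exponential distribution, up to quadrature error.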
Three issues arise when using the Gini index. First, the weight p gives the lowest importance to the more critical comparisons. Second, the weight p gives more emphasis to the less informative comparisons. Third, the groups to which μ and \(\overset {-}{\mu }(p)\) refer overlap. These considerations gave rise to many attempts to modify the Gini index (see Greselin (2014) for a review).
Observing the noticeable increase in disparities between less fortunate and more fortunate individuals, Zenga (2007) introduced a new inequality curve, \(Z_{F}(p)\), obtained by contrasting the average income of the poorest 100p% of earners, that is \(\overset {-}{\mu }(p)\), with the amount held, on average, by the richest earners, i.e. the remaining 100(1 − p)% of the population, that is
$$ \overset{+}{\mu}(p)=\frac{1}{1-p}{{\int}_{p}^{1}}F^{-1}(s)ds. $$
(5)
Therefore, Zenga defined the inequality curve \(\{(p, Z_{F}(p)) \mid p\in[0,1]\}\), where
$$ Z_{F}(p)=\frac{\overset{+}{\mu}(p)-\overset{-}{\mu}(p)}{\overset{+}{\mu}(p)} \quad \text{for} \quad 0<p<1. $$
(6)
The methodology proposed by Zenga keeps in mind that the notions of poor and rich are relative to each other and summarizes, in a single measure, the amount of inequality in the population. A measure of economic inequality can be defined as the area beneath the Zenga curve:
$$ Z_{F}={{\int}_{0}^{1}}Z_{F}(p) dp. $$
It is worth recalling that the Zenga index follows the axiomatic approach. It obeys the Pigou-Dalton transfer principle (Dalton 1920; Pigou 1920), it is scale invariant, and it has the desired properties of anonymity and decomposability.
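To make definitions (5) and (6) concrete, the sketch below evaluates the Zenga curve and index for an assumed example, the exponential distribution with mean 1 (not a distribution used in the paper). The function names and the grid size are hypothetical choices. The outer integral uses a midpoint rule so that p stays strictly inside (0, 1), where the curve is defined.

```python
import numpy as np
from scipy.integrate import quad

def quantile(s):
    """Quantile function of the assumed example: exponential with mean 1."""
    return -np.log1p(-s)

def lower_mean(p):
    """Mean income of the poorest share p of the population."""
    return quad(quantile, 0.0, p)[0] / p

def upper_mean(p):
    """Mean income of the richest share 1 - p, as in equation (5)."""
    return quad(quantile, p, 1.0)[0] / (1.0 - p)

def zenga_curve(p):
    """Z_F(p) as in equation (6), for 0 < p < 1."""
    up = upper_mean(p)
    return (up - lower_mean(p)) / up

# Area beneath the curve, approximated on K midpoints of (0, 1).
K = 400
grid = (np.arange(K) + 0.5) / K
zenga_index = float(np.mean([zenga_curve(p) for p in grid]))
```

In this example the curve stays well above zero over the whole interval, so the resulting index is larger than the Gini index of the same distribution (0.5).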
The next aim is to compare the two indices:
$$ G_{F}=2{{\int}_{0}^{1}}(p-L_{F}(p))dp= 2{{\int}_{0}^{1}}\frac{\mu-\overset{-}{\mu}(p)}{\mu} p dp, $$
and
$$ Z_{F}={{\int}_{0}^{1}}Z_{F}(p)dp= {{\int}_{0}^{1}}\frac{\overset{+}{\mu}(p)-\overset{-}{\mu}(p)}{\overset{+}{\mu}(p)} dp. $$
Both indices are relative measures of inequality, but the latter, compared to the former, has the following interesting features. First, the Zenga index gives the same weight to each comparison along the entire distribution. Second, the groups to which \(\overset {-}{\mu }(p)\) and \(\overset {+}{\mu }(p)\) refer in the definition of \(Z_{F}\) are exhaustive and disjoint. Finally, the curve \(Z_{F}(p)\) has no forced values at the endpoints, nor is it constrained to any particular shape on the interval [0, 1], whereas the Lorenz curve \(L_{F}(p)\) is fixed at its endpoints, non-decreasing and convex. Turning back to the expressions of the inequality measures in the Dagum model, the Lorenz curve and the Gini index have the following form (Dagum 1977):
$$ L(t;a,b,p)=B\left( t^{1/p};p+\frac{1}{a};1-\frac{1}{a}\right),\quad 0<t<1 $$
(7)
and
$$ G(p,a,b)=\frac{\varGamma(p)\varGamma(2p+1/a)}{\varGamma(2p)\varGamma(p+1/a)}-1. $$
(8)
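In Equation (7), B(z; α; β) denotes the CDF of a beta distribution with shape parameters α and β, evaluated at z (that is, the regularized incomplete beta function), while Γ(·) in Equation (8) denotes the gamma function. Substituting the Lorenz curve into the formula for Zenga’s index yields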
$$ Z(p,a,b)={{\int}_{0}^{1}}\frac{t-B\left( t^{1/p};p+\frac{1}{a};1-\frac{1}{a}\right)}{t\left[1-B\left( t^{1/p};p+\frac{1}{a};1-\frac{1}{a}\right)\right]}dt. $$
(9)
As noted at the beginning of this section, the Lorenz curve and the two inequality measures are defined if and only if a > 1.
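Equations (7) to (9) can be evaluated with standard special functions. The sketch below assumes SciPy, whose `betainc` is the regularized incomplete beta function (the beta CDF) and whose `gammaln` is the log-gamma function; the function names, the grid size and the parameter values in the usage note are hypothetical. The integral in (9) is approximated with a midpoint rule, which avoids evaluating the integrand at the endpoints.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betainc, gammaln

def dagum_lorenz(t, a, p):
    """Equation (7): beta CDF with shapes (p + 1/a, 1 - 1/a) at t^(1/p); needs a > 1."""
    return betainc(p + 1.0 / a, 1.0 - 1.0 / a, np.asarray(t, dtype=float) ** (1.0 / p))

def dagum_gini(a, p):
    """Equation (8), computed on the log scale to avoid overflow in the gamma terms."""
    log_ratio = (gammaln(p) + gammaln(2 * p + 1 / a)
                 - gammaln(2 * p) - gammaln(p + 1 / a))
    return float(np.exp(log_ratio) - 1.0)

def dagum_gini_numeric(a, p):
    """Equation (3) applied to the Lorenz curve (7), as a cross-check of (8)."""
    return quad(lambda t: 2.0 * (t - dagum_lorenz(t, a, p)), 0.0, 1.0)[0]

def dagum_zenga(a, p, k=20000):
    """Equation (9), approximated by a midpoint rule on k points of (0, 1)."""
    t = (np.arange(k) + 0.5) / k
    lor = dagum_lorenz(t, a, p)
    return float(np.mean((t - lor) / (t * (1.0 - lor))))
```

With p = 1 the Dagum distribution reduces to the log-logistic (Fisk) distribution, for which `dagum_gini(a, 1.0)` simplifies to 1/a; comparing `dagum_gini` with `dagum_gini_numeric` for other values of a > 1 and p gives a further consistency check between (7) and (8).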
Finally, since the aim is to apply these measures to survey data, estimators of the aforementioned curves and indices are needed. Given the random sample \((X_{1},\ldots,X_{n})\), the empirical Lorenz curve is given by the points \(\{(i/n, L_{n}(i/n))\}\):
$$ L_{n}\left( \frac{i}{n}\right)=\frac{{\int}_{0}^{i/n}F_{n}^{-1}(s)ds}{{{\int}_{0}^{1}}F_{n}^{-1}(s)ds}=\frac {{\sum}_{j=1}^{i} X_{(j)}} {{\sum}_{j=1}^{n} X_{(j)}}, $$
(10)
where \(X_{(j)}\) denotes the j-th order statistic, so that the denominator equals \(n\bar {X}\), with \(\bar {X}\) the sample mean. \(L_{n}(p)\) is the share of the total amount of income owned by the least fortunate p × 100% of the sample. The Gini index, evaluated from the sample, is
$$ G_{n}=\frac{2}{n-1}\displaystyle{\sum\limits_{i=1}^{n-1} \left\{ \frac{i}{n} - \frac {{\sum}_{j=1}^{i} X_{(j)}} {{\sum}_{j=1}^{n} X_{(j)}} \right\}}. $$
(11)
The empirical Zenga measure (Greselin and Pasquazzi 2009) can be obtained by replacing the population CDF F with its empirical counterpart \(\widehat {F}_{n}\):
$$ {Z}_{n}=\frac{1}{n-1} \sum\limits_{i=1}^{n-1} \frac{\displaystyle {\frac{1}{n-i}\sum\limits_{j=i+1}^{n} X_{(j)}}-\frac{1}{i}\sum\limits_{j=1}^{i}X_{(j)} }{ \displaystyle \frac{1}{n-i}\sum\limits_{j=i+1}^{n} X_{(j)}} . $$
(12)
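The estimators (10) to (12) translate almost line by line into array operations. The following is a minimal sketch assuming NumPy and strictly positive incomes (so that no upper mean is zero); the function names are hypothetical.

```python
import numpy as np

def empirical_lorenz(x):
    """Equation (10): L_n(i/n) for i = 1, ..., n."""
    xs = np.sort(np.asarray(x, dtype=float))
    return np.cumsum(xs) / xs.sum()

def empirical_gini(x):
    """Equation (11): average gap between i/n and L_n(i/n), i = 1, ..., n-1."""
    n = len(x)
    i = np.arange(1, n)
    return float(2.0 / (n - 1) * np.sum(i / n - empirical_lorenz(x)[:-1]))

def empirical_zenga(x):
    """Equation (12): average relative gap between upper and lower means."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    csum = np.cumsum(xs)
    i = np.arange(1, n)
    lower = csum[:-1] / i                     # mean of the i smallest incomes
    upper = (csum[-1] - csum[:-1]) / (n - i)  # mean of the n - i largest incomes
    return float(np.mean((upper - lower) / upper))
```

As a hand-checkable case, the sample (1, 2, 3, 4) gives cumulative shares 0.1, 0.3, 0.6, 1, hence a Gini estimate of 1/3 and a Zenga estimate of 73/126 ≈ 0.579; a sample of identical incomes gives zero for both.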
Relative Distributions
Let \(Y_{0}\) be a random variable representing a measurement for a population (e.g., income, wealth, hourly wages, etc.). In the following, the population that generated \(Y_{0}\) is called the reference population (in the present case, the male income distribution can be set as the reference). Denote the CDF of \(Y_{0}\) by \(F_{0}(y)\). Suppose that a measurement Y from a different population is also observed. The population that generated Y is called the comparison population, and Y is assumed to have CDF F(y). Typically, Y is the measurement for a separate group (in the present case, the female income distribution), but in other cases it can be the same group at a later period.
The objective is to study the differences between the comparison distribution and the reference distribution. Unless explicitly mentioned, both F and \(F_{0}\) are assumed to be absolutely continuous with common support. The relative distribution of Y to \(Y_{0}\) is defined (Handcock and Morris 2006) as the distribution of the random variable
$$ R=F_{0}(Y). $$
It is worth remarking that R is obtained from Y by transforming the latter by the CDF of \(Y_{0}\), namely \(F_{0}\). This has also been called the grade transformation (Ćwik and Mielniczuk 1989). If Y and \(Y_{0}\) have identical distributions, then the random variable R is uniformly distributed on the unit interval. The deviation from the uniform pattern can be interpreted as the gap between the distributions. While this transformation is not widely used or understood in the social sciences, it is a very useful one because R measures the relative rank of Y compared to \(Y_{0}\).
As a random variable, R has both a CDF and a probability density function. In particular, R has CDF G, such that \(G(p)=F(F_{0}^{-1}(p))\) for 0 ≤ p ≤ 1. The relative CDF, G(p), can be interpreted as the proportion of the comparison group whose attribute lies below the pth quantile of the reference group. Note that even though the relative CDF is explicitly scaled in terms of quantiles, the implicit unit of comparison is the value of the attribute on the original measurement scale, with \(y_{p} = F^{-1}_{0}(p)\) representing the cut point, so that \(G(p)=F(y_{p})\).
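The relative CDF \(G(p)=F(F_{0}^{-1}(p))\) can be illustrated with two Dagum distributions, and the grade transformation with an empirical reference CDF. The sketch below assumes the standard Dagum CDF and its inverse; all function names and parameter values are hypothetical and are not estimates from the data.

```python
import numpy as np

def dagum_cdf(y, a, b, p):
    """Standard Dagum CDF [1 + (y/b)^(-a)]^(-p)."""
    return (1 + (np.asarray(y, dtype=float) / b) ** (-a)) ** (-p)

def dagum_quantile(u, a, b, p):
    """Inverse of the Dagum CDF, for 0 < u < 1."""
    return b * (np.asarray(u, dtype=float) ** (-1 / p) - 1) ** (-1 / a)

def relative_cdf(r, comparison, reference):
    """G(r) = F(F0^{-1}(r)), with F and F0 Dagum with parameters (a, b, p)."""
    cut_points = dagum_quantile(r, *reference)  # quantiles of the reference group
    return dagum_cdf(cut_points, *comparison)

def relative_data(y, reference_sample):
    """Empirical grade transformation: share of the reference sample <= each y."""
    ref = np.sort(np.asarray(reference_sample, dtype=float))
    return np.searchsorted(ref, y, side="right") / ref.size

r = np.linspace(0.05, 0.95, 19)
reference = (3.0, 2.0, 0.8)    # hypothetical reference group
comparison = (3.0, 1.5, 0.8)   # same shape, smaller scale
g_same = relative_cdf(r, reference, reference)
g_shifted = relative_cdf(r, comparison, reference)
```

When the two distributions coincide, `g_same` reproduces r (the uniform case); when the comparison group has a smaller scale parameter, `g_shifted` lies above r, meaning that more than a share r of the comparison group falls below the rth quantile of the reference group.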