On the Negative Bias of the Gini Coefficient due to Grouping

The Gini coefficient is a measure of statistical dispersion that is commonly used as a measure of inequality of income, wealth or opportunity. Empirical research has shown that the coefficient may have a nonnegligible downward bias when data are grouped. It is unknown under which grouping conditions the downward bias occurs. In this note it is shown that the Gini coefficient strictly decreases if the data are partitioned into equal sized groups.


Introduction
The Gini coefficient (Gini, 1912) is a measure of statistical dispersion that is commonly used in various scientific disciplines, including economics, sociology, health science and engineering. It is commonly used to quantify inequality of wealth, income and opportunity, and inequality in education between countries (Sen, 1977). The coefficient can be defined in various different ways (Jasso, 1979; Yitzhaki, 1998; Ceriani and Verme, 2012). Here, we define the coefficient as half the relative mean difference of the values of a frequency distribution (Damgaard and Weiner, 2000). The Gini coefficient of the real numbers $x_1, x_2, \ldots, x_n$ is
$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} |x_i - x_j|}{2n\sum_{i=1}^{n} x_i}. \tag{1}$$
Formula (1) is equal to half the mean absolute difference between every possible pair of values, divided by the mean value. The Gini coefficient measures the dispersion among the values of the frequency distribution. If all values are positive the coefficient produces values between 0 and 1. The coefficient has value 0 if all the values are equal. In this case there is perfect equality. Values near 1 express high inequality among the values. Furthermore, the coefficient is invariant under multiplication by a positive constant. Moreover, different frequency distributions may have the same value of the Gini coefficient.
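As a minimal sketch of this definition, the function below computes the coefficient as the sum of absolute differences over all ordered pairs, divided by 2n times the total of the values. The function name `gini` and the sample values are illustrative assumptions, and the check for a positive total is a choice made for this sketch rather than part of the definition.

```python
def gini(values):
    """Gini coefficient: sum of |x_i - x_j| over all ordered pairs,
    divided by 2 * n * sum(x)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        # Assumption of this sketch: require a positive total so the ratio is meaningful.
        raise ValueError("need at least one value and a positive total")
    pair_sum = sum(abs(a - b) for a in values for b in values)
    return pair_sum / (2 * n * total)


print(round(gini([5, 5, 5, 5]), 3))      # all values equal: 0.0
print(round(gini([1, 2, 3, 4]), 3))      # 20 / 80 = 0.25
print(round(gini([10, 20, 30, 40]), 3))  # 0.25 again: invariant under positive scaling
```

The double loop is quadratic in n, which is adequate for the small examples in this note.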
In many applications the Gini coefficient is estimated from grouped data with 5 to 30 categories instead of the microdata (Gastwirth, 1972; Abounoori and McCloughan, 2003). For example, income or tax statistics are often grouped for confidentiality reasons (Van Ourti and Clarke, 2011). Empirical research has shown that the Gini coefficient may have a nonnegligible downward bias when data are grouped (Lerman and Yitzhaki, 1989; Davies and Shorrocks, 1989; Kwok, 2010). Conversely, Kwok (2010) noted that the Gini coefficient increases if a combined household is split into several smaller households or people living alone. Thus, the Gini coefficient may produce different results for income when the units of analysis are individuals instead of households (Deininger and Squire, 1996). Therefore, in interpreting the Gini coefficient the demographic structure of a country or region should be taken into account.
The Gini coefficient may decrease if the data are grouped, but it may also increase. For example, for x = {1, 1, 2, 2} we have G = .167. Indeed, adding the first and second value of x we obtain x = {2, 2, 2} and G = 0. However, if we add the second and third value of x we obtain x = {1, 3, 2} and G = .222, whereas if we add the third and fourth value of x we get x = {1, 1, 4} and G = .333. The latter two cases show that the Gini-value may also increase if the data are grouped. Specific grouping conditions under which the downward bias of the Gini coefficient occurs have not been formulated. New insights into the properties of the coefficient with respect to data grouping are therefore welcomed.
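The four values quoted in this example can be checked by hand. Writing the coefficient as the sum of $|x_i - x_j|$ over all ordered pairs, divided by $2n$ times the total of the values, the arithmetic is:

```latex
\begin{aligned}
x = \{1, 1, 2, 2\}: &\quad G = \frac{8}{2 \cdot 4 \cdot 6} = \frac{8}{48} \approx .167, \\
x = \{2, 2, 2\}:    &\quad G = \frac{0}{2 \cdot 3 \cdot 6} = 0, \\
x = \{1, 3, 2\}:    &\quad G = \frac{8}{2 \cdot 3 \cdot 6} = \frac{8}{36} \approx .222, \\
x = \{1, 1, 4\}:    &\quad G = \frac{12}{2 \cdot 3 \cdot 6} = \frac{12}{36} \approx .333.
\end{aligned}
```

In all three grouped cases the total stays 6 while n drops from 4 to 3, so the direction of the change is determined by how the sum of pairwise differences changes.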
In this research note, it is proved that the Gini coefficient strictly decreases if the values of a frequency distribution are partitioned into equal sized groups and the combined values are analyzed. An immediate consequence is that, vice versa, the Gini coefficient increases if the units are split into equally sized parts. A theorem that formalizes these statements is presented in the next section. An example and discussion are presented in the last section.

A Theorem
In this section, we show (Theorem 1) that the Gini coefficient strictly decreases if the data are partitioned into equal sized groups.
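Suppose that $n = dm$ for positive integers $d$ and $m$, and that the values $x_1, x_2, \ldots, x_n$ are partitioned into $m$ groups of size $d$. Let $x_{ik}$ denote the $i$-th value of group $k$, for $i = 1, \ldots, d$ and $k = 1, \ldots, m$, and let $s_k = \sum_{i=1}^{d} x_{ik}$ denote the sum of the values in group $k$. The Gini coefficient of the sums is
$$G_s = \frac{\sum_{k=1}^{m}\sum_{\ell=1}^{m} |s_k - s_\ell|}{2m\sum_{k=1}^{m} s_k}.$$
Theorem 1. If there is at least one group in which not all values are equal to one another, then $G > G_s$.
Proof. Repeated application of the triangle inequality to the sum $s_k - s_\ell = \sum_{i=1}^{d} (x_{ik} - x_{i\ell})$ yields
$$\sum_{i=1}^{d} |x_{ik} - x_{i\ell}| \geq |s_k - s_\ell|. \tag{5}$$
In the absolute difference $|x_{ik} - x_{i\ell}|$, the value $x_{ik}$ of group $k$ is compared to the value $x_{i\ell}$ from group $\ell$. The value $x_{ik}$ can also be compared to one of the $d - 1$ other values $x_{j\ell}$ in group $\ell$. Thus, we have $d$ variants of inequality (5) such that in each variant a value of group $k$ is compared to a different value in group $\ell$. If we sum these $d$ variants of (5) we obtain
$$\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}| \geq d\,|s_k - s_\ell|. \tag{6}$$
Summing the right-hand side of (6) over all combinations of $k$ and $\ell$ with $k \neq \ell$, we obtain the identity
$$d\sum_{k \neq \ell} |s_k - s_\ell| = d\sum_{k=1}^{m}\sum_{\ell=1}^{m} |s_k - s_\ell|, \tag{7}$$
since $s_k - s_\ell = 0$ if $k = \ell$. However, summing the left-hand side of (6) over all combinations of $k$ and $\ell$ with $k \neq \ell$ yields
$$\sum_{k \neq \ell}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}| = \sum_{k=1}^{m}\sum_{\ell=1}^{m}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}| - \sum_{k=1}^{m}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{jk}|. \tag{8}$$
The triple summation in (8) is only equal to zero in the rare case that, in each group, all values are equal to one another. If we exclude this very particular case, equality (8) implies the strict inequality
$$\sum_{k=1}^{m}\sum_{\ell=1}^{m}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}| > \sum_{k \neq \ell}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}|. \tag{9}$$
Combining the left-hand side of (9) with the identity
$$\sum_{k=1}^{m}\sum_{\ell=1}^{m}\sum_{i=1}^{d}\sum_{j=1}^{d} |x_{ik} - x_{j\ell}| = \sum_{i=1}^{n}\sum_{j=1}^{n} |x_i - x_j| \tag{10}$$
and the right-hand side of (9) with identity (7), it follows that summing (6) over all combinations of $k$ and $\ell$ with $k \neq \ell$ yields
$$\sum_{i=1}^{n}\sum_{j=1}^{n} |x_i - x_j| > d\sum_{k=1}^{m}\sum_{\ell=1}^{m} |s_k - s_\ell|. \tag{11}$$
Dividing both sides of (11) by $2n\sum_{i=1}^{n} x_i$, which is positive when the values have a positive total, and using the identities $n = dm$ and $\sum_{i=1}^{n} x_i = \sum_{k=1}^{m} s_k$ on the right-hand side of the result, we obtain the strict inequality $G > G_s$, which completes the proof.

An Example and Discussion
Table 1 presents the income in Australian dollars from 2013 of twenty-four individuals. The numbers were made freely available by the Australian Government (http://data.gov.au). The particular numbers in Table 1 are the twenty-four top numbers on income of the 2012-13 Individual sample file. Six people did not have an income in 2013. The maximum income is 192669, the average income is 45066, and the total income for the twenty-four individuals is 1081585. For Table 1 we have G = .540.
Table 2 presents the Gini-values that are obtained by grouping the data in Table 1. The first line corresponds to the case of no grouping (or twenty-four groups). The second line with twelve groups corresponds to the case in which the first two incomes are grouped (67848 + 50335), the second two values are grouped (14537 + 37495), and so on. The third line with eight groups corresponds to the case where the first three values are grouped (67848 + 50335 + 14537), and so on. The bottom line with two groups corresponds to the case where the first and last twelve incomes are grouped.
Table 2 shows that, for the data in Table 1, the Gini coefficient tends to decrease when the number of groups becomes smaller. Theorem 1 applies to two partitions that are nested. For two numbers from the first column of Table 2, if the bottom number is a divisor of the top number, then the two partitions are nested, and the partition associated with the top number is finer than the partition associated with the bottom number. Consider, for example, the sequence starting with 24 to 12 to 6 to 3 groups. In each step the values are partitioned into groups of equal size. Furthermore, the associated G-values strictly decrease (from .540 to .303 to .226 to .193). Another illustration of Theorem 1 is the sequence from 24 to 8 to 4 to 2 groups. Again, the associated G-values strictly decrease (from .540 to .253 to .161 to .138).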
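The nested equal-size grouping behind Theorem 1 can be sketched in a few lines. The sketch assumes the Gini coefficient is computed as the sum of absolute pairwise differences divided by 2n times the total of the values. The helper names `gini` and `group_sums` are hypothetical, and the twelve income values are made up for illustration; they are not the Table 1 data.

```python
def gini(values):
    """Sum of |x_i - x_j| over all ordered pairs, divided by 2 * n * sum(x)."""
    n = len(values)
    pair_sum = sum(abs(a - b) for a in values for b in values)
    return pair_sum / (2 * n * sum(values))


def group_sums(values, d):
    """Sum consecutive blocks of d values; d must divide len(values)."""
    if len(values) % d != 0:
        raise ValueError("group size must divide the number of values")
    return [sum(values[i:i + d]) for i in range(0, len(values), d)]


# Hypothetical incomes for twelve units (illustrative only, not from Table 1).
incomes = [12, 0, 35, 7, 50, 3, 21, 9, 0, 44, 16, 28]

# Nested chain of partitions: 12 units -> 6 groups of two -> 3 groups of four.
chain = {12: incomes}
chain[6] = group_sums(incomes, 2)
chain[3] = group_sums(chain[6], 2)

for m, vals in chain.items():
    print(m, "groups:", round(gini(vals), 3))
```

Each step in the chain merges equal-sized groups, so the printed values should strictly decrease, unless every group happens to contain identical values. For instance, grouping {1, 1, 2, 2} into pairs gives {2, 4}, which has the same Gini coefficient as the ungrouped data.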
Theorem 1 is also applicable to income and wealth distribution tables (see, e.g. Kerbo, 2000). These tables typically summarize the income and wealth frequency distributions using a number of quantiles. Quantiles divide a frequency distribution into equal groups, each containing the same fraction of the total population. If two sets of quantiles are nested, Theorem 1 tells us that the set with higher granularity (higher number of quantiles) will have a higher Gini coefficient. For example, five 20% quantiles (low granularity) will yield a lower Gini coefficient than twenty 5% quantiles taken from the same distribution. Hence, it is important in applications that the Gini-value is reported together with the proportions of the quantiles used for measurement.
Finally, a limitation of Theorem 1 is that the units of analysis must be partitioned into equal sized groups. As demonstrated in the introduction, if the units are partitioned into groups of different sizes, it depends on the data at hand whether the Gini coefficient increases or decreases. On the other hand, the theorem puts no restrictions on which values are combined. Furthermore, some of the values are allowed to be negative or zero.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.