“Why Call It Equality?” Revisited: An Extended Critique of the EIGE Gender Equality Index

In this paper, we review the methodology of one of the most comprehensive indices of gender equality, the Gender Equality Index by the European Institute for Gender Equality (EIGE). Building on Permanyer’s (J Eur Soc Policy, 25(4):414–430, 2015) critical analysis, we offer an extended critique of the EIGE’s current methodology, focusing on four interrelated issues: (a) the lack of transparency around the methodological decisions and the concomitant implicit theorising, (b) the continuing over-contribution of the ‘correcting coefficient’ to the index such that it predominantly captures achievement levels rather than gender gaps, (c) problems with the verification process and use of Principal Component Analysis, (d) issues arising from the aggregation and weighting of index components. Our analysis shows that in addition to the use of the correcting coefficient, other methodological choices (such as the use of ratios and geometric means) result in an unjustified penalisation of lower-GDP countries, reinforcing biased assumptions about gender equality progress in more affluent countries vis-à-vis lower-GDP countries in the sample. We call for greater transparency around theory, method and the relationship between the two while also proposing methodological improvements. These changes would bring the EIGE index closer to fulfilling its undoubted potential to provide a nuanced understanding of gender equality levels in the European Union and effectively inform policy development toward social change.


Introduction
The European Institute for Gender Equality's (EIGE) index is the pre-eminent measure of gender equality levels in Europe. In 2017, the index was revised in response to a critique made by Permanyer (2015) that the EIGE index primarily measured 'achievement' rather than 'equality' due to the application of a 'correcting coefficient'. The dilemma of measuring 'pure' gender gaps versus including a measure of overall achievement levels (i.e. women's and men's combined values) is a critical point of contention in the gender equality index scholarship. We seek to contribute to this literature as well as the broader methodological debate around the measurement of gender equality, using the EIGE index as a case study. The aim of this paper is twofold: (i) to evaluate whether the methodological changes the EIGE applied in response to Permanyer's critique have addressed the problem and (ii) to examine the methodology of the index more broadly to assess whether it is currently fit for its designated purpose.
The paper is divided as follows. We first introduce the historical context in which the EIGE index emerged before reviewing its methodology. Following this, we present Permanyer's critique and the modifications undertaken by EIGE in response. In the final section, we offer an extended analysis of the current methodology, focusing on four interrelated issues: (a) the lack of transparency about the methodological decisions and the concomitant implicit theorising, (b) the continuing over-contribution of the correcting coefficient to the index such that it still predominantly captures achievement levels rather than gender gaps, (c) problems with the verification process and the use of Principal Component Analysis, (d) issues arising from the way the index's components are aggregated and weighted. We show that in addition to the impact of the correcting coefficient, other methodological choices (such as the use of ratios and geometric means) also result in penalisation of lower-GDP countries. We call for greater transparency around theory, method and the relationship between the two while also proposing methodological improvements. These changes would bring the EIGE index closer to fulfilling its undoubted potential to provide a nuanced understanding of gender equality levels in the European Union (EU) and inform policy development.

Background: Emergence of Gender Equality Indices
Over the past 40 years, promoting gender equality has become a central aim of international political agendas. Contemporaneously, quantitative indicators have gained importance in national and international governance across the policy spectrum. Notably, the 1995 Beijing Platform for Action, developed in the context of the United Nations' Fourth World Conference on Women, identified twelve priority areas in which progress in terms of gender equality should be made. To track global developments in these areas over time, data collection efforts have increased, leading to the generation of gender-differentiated indicators.
Since 1995, we have witnessed a proliferation of international gender equality indices. This development has occurred in four waves (see Schmid (2021) for a detailed discussion), starting with the introduction of the Gender-related Development Index (GDI) and the Gender Empowerment Measure (GEM) by the United Nations Development Programme (UNDP) in 1995. Two limitations of these pioneering indices were quickly identified. Firstly, as Hirway and Mahadevia (1996) and Walby (2005) pointed out, the GDI and GEM's focus was limited to education, income and health, disregarding other important dimensions of gender equality, such as employment, unpaid work and violence. Secondly, it was noted that both indices "do not measure gender equality as such" (Dijkstra, 2002, p. 203) but conflate relative levels (i.e. between women and men) with absolute levels (i.e. women and men combined) of achievement in a country, leading to an unjust penalisation of poorer countries (see Bardhan and Klasen (1999), Dijkstra (2002), Permanyer (2008), Bericat (2012)). It was argued that the GDI and GEM thus measure gender-sensitive development, yet Schüler (2006) shows that they were frequently misinterpreted as direct measures of gender inequality.
The second wave saw the emergence of gender equality indices such as those developed by White (1997), Forsythe et al. (2000), Dijkstra and Hanmer (2000), Permanyer (2008) and Klasen and Schüler (2011). These focused on measuring gender equality independently from absolute achievement levels. However, due to a lack of data availability, the same three dimensions as the GDI continued to be used. The third wave aimed to expand the substantive scope of such indices to offer a more comprehensive picture of gender inequality from a global perspective, for instance, the indices by Dijkstra (2002), Social Watch (2005), Hausmann et al. (2006), Dijkstra and Hanmer (2000) and Dilli et al. (2019). Finally, the fourth wave has shifted focus away from global indices to offer a more detailed picture of region-specific gender equality contexts.
As part of this fourth wave, Plantenga et al. (2003) were commissioned by the European Commission (EC) to conduct a feasibility study for an index of gender equality levels across the EU. This resulted in the development of the European Union Gender Equality Index (EUGEI), which sought to support the development of policy recommendations and monitor their effectiveness over time (Plantenga et al., 2009). Drawing on Fraser (1997), Plantenga et al. (2003) assert that gender equality necessitates the transformation of both men's and women's lives, involving a redistribution of paid and unpaid time, economic resources and political power. To capture this notion, the EUGEI structures gender equality into five dimensions: Paid Work, Money, Decision-making Power, Knowledge and Time (Plantenga et al., 2003, 2009). Following the feasibility study, the EC instructed the newly established European Institute for Gender Equality (EIGE) to construct a gender equality index for the EU as part of the Strategy for Equality between Women and Men (2010–2015) (European Commission, 2006, p. 12). In 2013, the EIGE published the first version of the index. Its structure mirrored the EUGEI whilst adding a sixth domain on Health. The core index uses 31 indicators measuring value-bound gender gaps (0 for absolute inequality; 1 for absolute equality), modified by the overall level of achievement, as described in more detail below.
The establishment of a gender equality index for the European community is a significant step in the EU's commitment to gender equality. It is one of the most comprehensive indices of gender equality to date, drawing on a broad range of gender-differentiated indicators. The EIGE must be lauded for having navigated and integrated the viewpoints of multiple actors from across the EU (including the EIGE's Working Group on the Gender Equality Index, Management Board and Experts' Forum) as well as those set by the EC in the index development. As noted by Humbert and Hubert (2021), the index also masters a considerable technical challenge by maximising the use of available data sources, combining data from the EU statistical offices and other EU agencies. As such, it is amongst the most detailed gender equality indices to date and pre-eminent in the European context.

Overview of the Methodology of the EIGE Index
Along with the introduction of the index itself in 2013, EIGE documented the initial methodology (De Bonfils et al., 2013) that was updated in 2017 (Barbieri et al., 2017). This section outlines the methodology based on both these reports and then presents Permanyer's critique of the EIGE's 'correcting coefficient'.

Indicator Selection and Processing
As stated in the 2017 methodological report, the choice of indicators is driven by theoretical considerations: firstly, that they capture relevant dimensions of gender equality relating to the concept of equal sharing of assets and resources and, secondly, that they represent individual-level outcome indicators. Further, indicators were selected on the basis that they are disaggregated by sex and meet the following criteria (Barbieri et al., 2017, p. 9):
• Harmonised at EU level and thereby comparable between Member States;
• Accessible, updated on a regular basis, punctual and comparable over time;
• Accurate, measuring in a reliable way the phenomenon it intends to measure, and sensitive to change;
• Comprehensive and easily interpretable, intuitive and sufficiently simple to be unambiguously interpreted in practice;
• With less than 10 percent missing data points.
Multivariate analyses were run to compare the structure of the data with the conceptual framework and to inform decision-making about which indicators made the final list. First, correlation analysis using Pearson's correlation coefficient was conducted, followed by Principal Component Analyses (PCA) to identify factors and components (Barbieri et al., 2017, pp. 49-51). This is reported to confirm the six domains of the index, with 14 sub-domains and 31 indicators, although "[t]wo sub-domains were split in the statistical structure. Within the domain of work, the sub-domain of segregation was merged with quality of work. In the domain of Health, the sub-domain of behaviour was split in two in the measurement framework, which was finally solved by aggregating them with a half weight each in a single sub-domain." (Barbieri et al., 2017, p. 18)

In two cases, the indicators are processed so that all indicators are conceptually consistent (i.e. a higher value always indicates more equality). This is the case for the indicator 'population at risk of poverty', where 20 percent of people at risk of poverty is equivalent to 80 percent not at risk of poverty. In the case of the indicator 'income quintile share', the inverse (1/value) is calculated. The S80/S20 income quintile share compares the 20 percent of the population with the highest income with the 20 percent with the lowest, while its inverse, S20/S80, compares the same percentages such that the higher the share, the greater the equality.

To account for differences between the Member States in population size and structure, the indicators are transformed into ratios by dividing the indicator of interest by the relevant reference population (p. 10). Where data were missing, different types of imputation were used (Barbieri et al., 2017, p. 11):
• Variables with only one year available for all countries, or available for only a limited number of years. This information is used for subsequent years so that it avoids showing changes in time. [...]
• Variables with a missing value for a certain country in a certain year, but with data available for other years. Missing data are imputed with data from the closest year.
• Variables with a missing EU-28 average. The average (non-weighted) of the 28 country values is imputed.
• Variables with a missing value for a certain country in all years. The missing value is imputed using the expectation-maximisation (EM) algorithm available in SPSS.
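The deterministic rules above admit a compact sketch. This is our reading of the report's description, not EIGE's actual (unpublished) code, and all function and variable names are ours:

```python
# Hedged sketch of the deterministic imputation rules described in the 2017
# report: carry a single available year forward / copy from the closest year /
# use the unweighted EU-28 average. Illustrative only.
from math import isnan

def impute_closest_year(values_by_year):
    """values_by_year: dict mapping year -> value (NaN where missing).
    Fills each missing year from the closest available year; with only one
    year available, that value is carried to all other years."""
    available = {y: v for y, v in values_by_year.items() if not isnan(v)}
    filled = {}
    for year in values_by_year:
        if year in available:
            filled[year] = available[year]
        elif available:
            closest = min(available, key=lambda y: abs(y - year))
            filled[year] = available[closest]
        else:
            filled[year] = float("nan")  # left for the EM algorithm
    return filled

def impute_eu_average(country_values):
    """Missing EU-28 aggregate: non-weighted mean of the 28 country values."""
    return sum(country_values) / len(country_values)
```

On this reading, a country observed only in one year receives that value in every year, which is why the report notes that the procedure "avoids showing changes in time".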

Calculation of Final Metric: Combining the Gender Gaps and Correcting Coefficient
To calculate the index, a final metric was defined consisting of two components: the gender gap metric Υ(X_it) and the correcting coefficient α(X_it). Barbieri et al. (2017, p. 12) note that "the calculation is carried out for the variable X for the i-th country in the period t in order to obtain the percentage that women (X^w_it) represents over the average of the two values of women and men (X^a_it)"; that is, the gap component is based on the ratio X^w_it / X^a_it, where X^a_it = (X^w_it + X^m_it) / 2. The non-weighted average between men and women is used instead of the (weighted) total value of the indicator to avoid extreme values that lie outside of 0 and 1 (Barbieri et al., 2017, p. 12).
Further, the index counts gaps that run to the advantage of women as well as those to their disadvantage, to avoid instances where women's adverse outcome in one indicator is compensated by men's adverse outcome in another. For this reason, the absolute values of the gaps are used. The gender gaps are then reversed for interpretability by subtracting the gap Υ(X_it) from 1, so that a value of 1 represents complete equality while 0 represents complete inequality (Barbieri et al., 2017, p. 13).
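On our reading of the report, the reversed gap for one indicator can be sketched as follows (the function name and symbols are ours):

```python
# Sketch of the gap metric as described in the report: the women's value as a
# share of the non-weighted average of women's and men's values, with the
# absolute deviation from parity reversed so that 1 = complete equality.
def reversed_gender_gap(women, men):
    average = (women + men) / 2    # non-weighted average of the two values
    share = women / average        # percentage women represent of the average
    return 1 - abs(share - 1)      # reverse: 1 = equality, 0 = inequality
```

Because the absolute deviation is used, a gap to women's advantage scores the same as an equally sized gap to their disadvantage: `reversed_gender_gap(60, 40)` and `reversed_gender_gap(40, 60)` both return 0.8.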
After calculating the gender gaps, a correcting coefficient α(X_it) is applied to some indicators to account for country differences in overall (i.e. population-level) achievement levels on each indicator. The EIGE decided to combine the two in order to ensure that "a good score is the reflection of both low gender gaps and high levels of achievement." (Barbieri et al., 2017, p. 13) They argue that the conceptual importance of this is demonstrated by the example of the 2008 financial crisis, as a result of which gender gaps narrowed due to a sharper decrease in men's employment rates while living and working conditions deteriorated overall (De Bonfils et al., 2013, p. 9). Combining the two is argued to account for socio-economic changes in the gender equality analyses. Humbert and Hubert (2021) note that capturing the levels of achievement in conjunction with gender gaps follows the EU's gender mainstreaming principle, making the index particularly valuable for EU policy use. The combination of the gender gap and the correcting coefficient is intended to ensure that "Member States with similar gender gaps are treated differently if their levels of achievement differ. The higher the level of achievement, the lower the correcting of the gender gap." (Barbieri et al., 2017, p. 8) In the 2013 and 2015 versions of the index, the correcting coefficient was calculated using the quotient of the distance between the indicator value for each country and a benchmark reflecting the highest value achieved by any country. As discussed in more detail below, Permanyer (2015) demonstrated that the correcting coefficient had a stronger effect on the country scores than the actual gender gap. So, overall achievement levels of a country weighed more heavily than gender differences, which in turn undeservedly penalised low-GDP countries. As a result, Barbieri et al. (2017) modified the correcting coefficient formula "to increase the contribution of the gender gaps" (p. 13).
The modified correcting coefficient α(X_it) is calculated by dividing the total value of a country (X^T_it) by a fixed benchmark, obtained by selecting the highest country score on the particular indicator achieved in 2005, 2010, 2012 or 2015, and then squaring the resulting ratio: α(X_it) = (X^T_it / benchmark)^2. Thus, rather than taking a new benchmark each year, the modified coefficient now holds the benchmark constant over time. The squaring of the ratio (Barbieri et al., 2017, p. 13) is not justified, nor are the theoretical implications of this choice discussed.
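As we reconstruct it from the report (EIGE's exact rescaling steps are omitted, and the function names are ours), the modified coefficient and its application can be sketched as:

```python
# Hedged sketch of the modified correcting coefficient: the squared ratio of a
# country's total value to a fixed benchmark (the highest value any country
# achieved in 2005, 2010, 2012 or 2015), multiplied into the equality score.
def correcting_coefficient(total, benchmark):
    return (total / benchmark) ** 2

def corrected_score(equality, total, benchmark):
    """Combine a reversed gap (equality, in [0, 1]) with the coefficient."""
    return equality * correcting_coefficient(total, benchmark)
```

Two countries with an identical reversed gap of 0.9 end up far apart if one sits at the benchmark (0.9 × 1² = 0.9) and the other at half of it (0.9 × 0.5² = 0.225); squaring the ratio makes the penalty for distance from the benchmark steeper than a linear coefficient would.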

Aggregation and Weighting
To "reduce subjectivity" and make it easier to decide between the possible aggregation and weighting methods, the first edition (2013) of the index adopted a multi-modelling principle: a set of potential indices were computed (a total of 3636 indices) in order to select the most robust one (Barbieri et al., 2017, p. 14). Since the methodology was changed for the 2017 version of the index, the robustness analysis was carried out again. The 2017 report states that the robustness analysis confirmed the use of equal weights in the aggregation of the indicators at sub-domain level, and the sub-domain scores at the domain level.
To aggregate the indicators, the arithmetic mean is used to obtain the 14 sub-domains since it "allows full compensability, and thus has the potential to offset a poor performance in some variables by a sufficiently large advantage in other variables" (De Bonfils et al., 2013, p. 49). The geometric mean, "which decrease[s] this potential compensatory effect", is then used to aggregate to the domain level and to produce overall country scores (2013, p. 49). The EIGE states that this combination follows "a gradually compensatory aggregation method [...] mean[ing] that the compensation allowed is higher within the aggregation at indicator level, where the arithmetic average is always considered. However, it becomes gradually less compensatory within sub-domain and domain level, where only geometric or harmonic means are allowed." (De Bonfils et al., 2013, p. 49) Why this is a preferable approach over the alternatives is not further elaborated. An Analytical Hierarchy Process (AHP), carried out once in 2013, was used to produce the weights for the top-level aggregation. The AHP is based on ordinal pair-wise comparison of domains: "experts were first asked to make pair-wise comparisons of domain, and secondly, to assign a strength of preference to the selected domain on the scale from 1 (equal importance of domains) to 9 (the most important domain)". The relative weights assigned to each domain by experts are reported in Barbieri et al. (2017, p. 16).
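The aggregation chain described above can be sketched as follows (the function names are ours; the weights in the final step would come from the AHP, and any numbers used below are illustrative, not EIGE's):

```python
# Sketch of the reported aggregation chain: arithmetic mean from indicators to
# sub-domains (fully compensatory), geometric mean from sub-domains to domains,
# and a weighted geometric mean across the six domains using the AHP weights.
from math import prod

def subdomain_score(indicator_scores):
    return sum(indicator_scores) / len(indicator_scores)   # arithmetic mean

def domain_score(subdomain_scores):
    return prod(subdomain_scores) ** (1 / len(subdomain_scores))  # geometric mean

def index_score(domain_scores, ahp_weights):
    assert abs(sum(ahp_weights) - 1) < 1e-9   # weights assumed to sum to 1
    return prod(d ** w for d, w in zip(domain_scores, ahp_weights))
```

The switch from arithmetic to geometric means at the higher levels is what makes the method "gradually less compensatory": a low sub-domain score depresses a geometric mean more than it would an arithmetic one.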

Summary of Permanyer's Critique
The EIGE's combination of gender gap with a correcting coefficient in the final metric is intended to result in high country scores representing both low gender gaps and high achievement levels. However, Permanyer (2015) showed that the final country scores on the 2013 index are largely driven by the correcting coefficient: across all countries and indicators, the percent contribution of the equality component (i.e. gender gaps) fell below 50 percent in 72.2 percent of cases. Further, high heterogeneity across indicators and countries was found: in one instance, the equality component contributes 80 percent to the indicator scores, while falling as low as 7 percent in another instance. On average across all countries and indicators, the equality component contributes a mere 31.6 percent to the index scores (Permanyer, 2015, p. 421).
In light of these findings, Permanyer (2015, p. 427) noted that: "the GEI values are basically determined by differences between countries in average achievement levels of women and men rather than by gender differences within them, a result that is somewhat disturbing for an index of gender equality." Further, he notes that countries with lower GDP levels are penalised as a consequence given the association with lower achievement levels, yet these might not necessarily be related to gender discriminatory practices or norms. With a nod to the GDI, Permanyer suggests either renaming the index 'Gender-related Achievement Index' or dropping the correcting coefficient altogether to measure gender equality directly.
In response, the EIGE kept the correcting coefficient but, as mentioned above, introduced modifications in 2017 to harmonise the contributions of the two components across indicators and countries. To re-estimate the contributions of each component, they used Permanyer's procedure. The 2017 methodological report does not discuss the different alternatives that were considered and tested for modifying the correcting coefficient. What they do document is that as a result of the modification, "the mean contribution of the gender equality component varies across the countries slightly, but remains between 49.2% and 70.5% in 2015. Between two and four countries have a contribution slightly below 50%, depending on the year." (Barbieri et al., 2017, p. 37) Since these modifications, no further changes to the methodology have been reported.

Critique of the Current EIGE Methodology
In this section, we offer an analysis of the current EIGE methodology. This includes revisiting the Permanyer critique discussed above but also raises a wider set of concerns about the index and its methodological construction. Specifically, we cover four issues:
1. The lack of transparency about methodological decision-making and concomitant implicit theorising
2. The contribution of the correcting coefficient to the index
3. The verification process and the use of PCA
4. Aggregation and weighting issues
We will consider each of these in turn.

Lack of Transparency and Implicit Theories
One of our central concerns with the EIGE methodology is that the available reports lack transparency about both methodological choices and the resulting implications for the underlying theory. It is important to note at this point that we do not take issue with the EIGE index's conceptual framework per se. Our primary issue lies with the EIGE's methodological choices that impact upon the theory in a way that requires conceptual justification. Thus, the methodology of the index construction produces a strong theoretical position and yet that position is neither made explicit nor justified.
To examine the EIGE's underlying reasoning that seemingly guides the methodological choices, we develop three presuppositions intended to make this reasoning explicit. The central tenet of the index can be understood as something akin to Presupposition 1.
Presupposition 1
To understand variation in the difference between women's and men's outcomes, one cannot simply measure the differences in those outcomes and compare them; one must correct those differences for overall levels of achievement.

This is a strong theoretical position that requires additional elaboration for coherence and to explain how it does not represent double-think. First, the presupposition embedded in the notion of 'correction' is that there is something that needs correcting, i.e. something that is, in some sense, wrong. However, the methodological reports shed little light on why gender gaps alone are insufficient. The closest the reports come to a justification is that the purpose of the correcting coefficient is to compare the performance of each country with the best performer in the EU-28: in a particular indicator, the further the score of a country diverges from the level of the best performer, the more the score will be adjusted (Barbieri et al., 2017, p. 13). Yet, if one wants to compare gaps, then one would simply examine the size of these gaps. We therefore conclude that, when referring to comparing to the best performer, they are not talking about comparing the size of gaps. Crucially, the justification does not explain what is wrong that needs 'correcting'.
Even if we accept the presupposition that the gender gap data do need correcting, a further question arises: is combining the gender gap numbers with levels of achievement the right way to make that correction? It is hard to imagine any analogous circumstance where combining an absolute measure of something with a difference would be a good mechanism for correcting errors in the differential. What makes this solution even more untenable is when the same underlying data are used to calculate the absolute levels as to calculate the differences, as is the case here.
Finally, we question the EIGE's decision to correct the differential by multiplying the two numbers together. The impact of such a data transformation on the pattern of errors is empirically dependent on the relationship between the errors and the absolute values, yet we have been given no model or theory to explain EIGE's expectations about that relationship.
On one level, the above points are simply semantic. Had the EIGE taken Permanyer's advice and changed the name of the index (and also stopped using the term 'correcting'), then these arguments would become a matter of historical curiosity. But since EIGE chose not to do that, we can only assume that EIGE must hold to presupposition 1.
Even if the combination of absolute and relative measures were a plausible approach, we are still faced with an additional factor muddying the waters: the correcting coefficient is selectively applied to the index. Of the 31 indicators, it is applied to only 21, a fact that is easily lost in the interpretation of the scores. For the remaining ten indicators, then, the EIGE deems the attainment of the maximum achievement level undesirable. For example, it is not applied to two indicators in the domain of Time measuring frequency of involvement in domestic work and care for dependents, since: "what matters are gender inequalities rather than the level of involvement in these activities, which may depend on other factors, such as fertility rates or cultural traditions of eating out rather than cooking at home. In this case, it is difficult to argue that 100% of women or men should spend time in caring and/or house-work activities (following the principle that the higher the value, the better is the situation)" (Barbieri et al., 2017, p. 14).
Yet, this assertion might also be made about the 21 corrected indicators. For example, the EIGE corrects the indicator on full-time equivalent employment rates, which by implication means that achieving the maximum level is desirable. At the same time, however, the indicator measuring workers' engagement in voluntary work is also corrected. Achieving the maximum level in both of these indicators is not only improbable; it is also not clear why it would be necessary for achieving gender equality.
It has been argued (for example, Humbert and Hubert 2021) that the combination of absolute and relative levels is central for the index to facilitate gender mainstreaming: pushing for higher achievement levels while ensuring that gender equality lies at the core of policy strategies. However, we argue that the current construction of the index in fact undermines gender mainstreaming as the inclusion of an unrelated aspect marginalises gender in a measure explicitly constructed to capture gender inequality. This is especially the case given that the achievement component still predominantly drives the scores of the EIGE index, as we show below. Further, it seems to us that ensuring gender mainstreaming would mainly be relevant if the index were primarily aimed at measuring achievement levels. In that case, the index should indeed be sensitive to gender differences, but be labelled 'Gender-related Achievement Index', as previously proposed by Permanyer (2015).

Presupposition 2
To measure a gap between women and men, use the ratio rather than the difference.
A second issue is that the EIGE uses ratios rather than differences to calculate gender gaps (see Presupposition 2). These 'gaps' are calculated by dividing the value for women by the unweighted average of the values for women and men. Ratios have variable sensitivity to differences between the values of women and men depending on overall levels, producing particularly low values when overall levels are low, even if the absolute gender difference is small. For example, holding the absolute difference constant at 4: 10/14 ≈ 0.71; 30/34 ≈ 0.88; 60/64 ≈ 0.94; 80/84 ≈ 0.95. The key point here is that the overall achievement level is already taken into account in how the EIGE calculates the gaps, even before the correcting coefficient is applied.
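The variable sensitivity can be seen directly by holding the absolute difference fixed and varying the overall level (a minimal illustration using a plain women-to-men ratio):

```python
# With the absolute gender difference held constant at 4 points, the ratio
# improves purely because the overall level rises.
def ratio(women, men):
    return women / men

for women in (10, 30, 60, 80):
    men = women + 4   # constant absolute difference of 4
    print(f"{women}/{men} = {ratio(women, men):.3f}")
# prints 0.714, 0.882, 0.938 and 0.952: the same 4-point gap looks far worse
# at low achievement levels than at high ones.
```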
A third element of implicit theorising is the notion of compensability. When selecting the method for aggregating from the indicator level up to the domain level, the EIGE evaluates the options on the basis of how much a weakness in one component is compensated for by a strength elsewhere. The underlying assumption here is that full compensability may not be desirable. This is a theoretical position, not simply a statistical one, as they seem to imply in the document. The underlying theoretical position is represented by Presupposition 3.

Presupposition 3
When aggregating indices of gender inequality we need to ensure that strength in one domain does not fully compensate for weakness in another domain and that therefore the weakest index is the main driver for the overall index score.
This, as we shall see later when we discuss weighting in more detail, has some unintended consequences. To illustrate the effect: in Table 1, country A would be said to be doing better than country B if one uses the geometric mean to generate the index.
Arguments can be made for and against non-compensability but, in essence, the lowest value is being weighted higher. Therefore, we would argue that equal weighting and full compensability should be the default and that variance from full compensability should be justified theoretically on the basis of what the index is trying to achieve. The EIGE methodological report does not do this.
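The effect of Presupposition 3 can be made concrete with two invented country profiles that share the same arithmetic mean:

```python
# Under the geometric mean, a weak domain drags the aggregate down more than
# under the arithmetic mean, so the lowest value effectively carries extra
# weight. The profiles below are invented for illustration.
from math import prod

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return prod(xs) ** (1 / len(xs))

country_a = [0.70, 0.70]   # balanced profile
country_b = [0.90, 0.50]   # unbalanced profile, same arithmetic mean

assert abs(arithmetic_mean(country_a) - arithmetic_mean(country_b)) < 1e-12
assert geometric_mean(country_a) > geometric_mean(country_b)
```

Both profiles average 0.70 arithmetically, but the balanced country scores 0.70 geometrically while the unbalanced one scores about 0.67: the weak domain is implicitly up-weighted.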
The final issue to cover here is the treatment of missingness. The 2017 methodological report states: "In order to work with a complete dataset, the imputation of some missing values needed to be carried out. An imputation is a mathematical procedure which allows the estimation of a data point when it is not available." (Barbieri et al., 2017, p. 11). They then describe four techniques, three of which are not mathematical but simply deterministic data edits. The fourth, expectation maximisation (EM), is a statistical technique. This would be a semantic quibble if the document were clear about what had actually been done, yet regrettably this information is not provided. In particular, we are not informed about the level of missingness per indicator, per country or overall. The only information given is that indicators were only considered if they had no more than 10 percent of missing data points. We are not told which indicators were excluded because of missingness.

Further, we are given no details of the procedure for the EM algorithm that was employed where countries had missing values for a given indicator for all years. The EM algorithm has parameters that need to be set, and these will impact the outcome. It is likely the SPSS defaults have been used, but this is neither explained nor justified. There is also a critical choice of which indicators are included in the modelling: were all variables used to impute each variable, and were all years used? Finally, an assumption of EM is that the values are missing completely at random (MCAR). Little's test for MCAR is provided by SPSS as part of the standard EM output, but it is not reported. Given that the algorithm has been employed precisely for those data where an indicator was missing for all years for a given country, an assumption of MCAR seems implausible.

Barbieri et al. (2017) do carry out a robustness analysis. In principle, this is a source of strength.
However, we again find a lack of transparency in outlining the methodological steps. The most detailed description can be found in the 2013 report (De Bonfils et al., 2013). Here they state: "The robustness analysis involved combining all possible sources of variations ([100] simulations of imputed data, all weights and aggregation alternatives). Altogether, this resulted in the computation of 3636 sets of scores, which corresponds to the overall index distribution of all possible scenarios generated." However, there is no apparent combination of scenarios that gives rise to 3636 sets of scores, not least because 100 is not a factor of 3636. Thus, what has been done here is only partially reported and therefore not reproducible.
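The arithmetic can be checked directly: a full cross of 100 imputation simulations with some whole number of weighting and aggregation alternatives would have to yield a multiple of 100 score sets, which 3636 is not.

```python
# 100 imputation simulations crossed with k weighting/aggregation
# alternatives would give 100 * k score sets; 3636 is not such a number.
simulations, reported = 100, 3636
print(reported % simulations)  # 36, so 100 cannot be a factor of 3636
```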

The Contribution of the Correcting Coefficient
For this analysis, we replicated the EIGE indices from 2017 (the first index after the methodological modifications) and 2020 (the most recent at the time of submission) using the data available on the EIGE website and following the methodology described in the report summarised above. 6 To approximate the percent contributions of the equality (i.e. gender gap) and achievement (i.e. correcting coefficient) components, we follow Permanyer's (2015) five-step procedure. The contribution of the achievement component (a) to the final metric (Γ) was approximated using C_a = 100 * ln(a)∕ln(Γ∕100); the contribution of the equality component (e) to the overall scores was estimated using C_e = 100 * ln(e)∕ln(Γ∕100), and the two were rescaled to total 100 when combined (see the Excel files Electronic Supplementary Material (ESM) 1 and ESM 2 for the detailed calculation).
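The decomposition can be sketched as follows. For illustration we assume the score factorises as Γ = 100 · e · a; this is a simplification of the EIGE's actual formula, and the function and variable names are ours:

```python
import math

def contributions(equality, achievement):
    """Percent contributions of the equality (e) and achievement (a)
    components, using the log decomposition described above and assuming,
    for illustration only, a score of the form Gamma = 100 * e * a."""
    gamma = 100 * equality * achievement
    denom = math.log(gamma / 100)
    c_a = 100 * math.log(achievement) / denom
    c_e = 100 * math.log(equality) / denom
    total = c_a + c_e
    # Rescale so the two contributions total exactly 100 percent.
    return 100 * c_e / total, 100 * c_a / total

# A narrow gender gap (e = 0.9) paired with low achievement (a = 0.5):
c_e, c_a = contributions(equality=0.9, achievement=0.5)
print(round(c_e, 1), round(c_a, 1))  # 13.2 86.8
```

In this hypothetical case, most of the score is driven by the achievement component even though the gender gap is small, which is the pattern the analysis below finds in the real index.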
We find that the scores of the index in both 2017 (using 2015 data) and 2020 (using 2018 data) are still predominantly driven by differences in achievement levels rather than gender gaps. Figures 1 and 2 display the contribution of the equality component per indicator for each country; the contribution varies considerably, from very high for some indicators (e.g. one indicator for Sweden for both years) to as low as 0 percent (e.g. in the case of the Voluntary Work indicator in the Netherlands for both years).
These findings call into question the modifications the EIGE made to the correcting coefficient formula in 2017, as Permanyer's (2015) critique has not been addressed. The EIGE's claim that the modifications narrowed the mean contribution range of the equality component to between 49.2 percent and 70.5 percent 7 per country in the EIGE 2017 index (Barbieri et al., 2017, p. 37) holds only if the ten indicators to which no correction is applied are included. This is misleading, however: the equality contribution for these ten indicators is structurally 100 percent, which skews the average contributions. When they are excluded, the contribution of the equality component to the country scores drops to a range of 25.1 to 56.5 percent (mean 34.4) for 2017 and 26.2 to 54.0 percent (mean 36.0) for 2020.

The Verification Process and the Use of PCA
Barbieri et al. (2017) use Principal Component Analysis (PCA) to verify the index's latent structure. Although they are not alone in doing this, PCA is not a statistical method and is not intended for model verification; it is an exploratory technique that provides a mathematical description of the data by reducing the number of dimensions (see for example Suhr (2006)). To test a theory, Confirmatory Factor Analysis (CFA) should normally be used. CFA answers the question "is the proposed model a fit to the data?" - precisely the question that the EIGE wishes to address. 8 Putting aside the issue of using PCA at all, we also note errors in implementation. A detailed analysis of this and an exploration of alternative approaches can be found in the supporting document ESM 6. The practical conclusion is that orthodox choices about the number of factors and type of rotation lead to different conclusions about the structure of the domains and the overall index. We also find that the PCA yields different results with the EIGE's 2018 data than with the 2010 data.
This in itself is only questionable if one is using PCA for verification; but since this is how the EIGE uses it, the fact that the proposed theoretical structure does not hold up over time is problematic. Papadimitriou and Del Sorbo (2020) find in their 2020 audit of the EIGE's index that its structure is coherent on the basis of correlation and PCA analyses. Whilst we concur that their results do indicate some conceptual coherence, the issue is that, as with Barbieri et al. (2017) themselves, they are using exploratory tools to carry out an evaluation. We tested the 'domain to index' structure on the 2018 EIGE data using CFA and found that model fit was not good: RMSEA = 0.241 and CFI = 0.867, although SRMR = 0.080 is acceptable (see supporting document ESM 5 for more detail). For the 2010 data, the situation is even more complicated. In this case, a single-factor model fits the data well, but with two domains (Work and Knowledge) loading with the opposite sign to the others, as shown in Table 2. From a theoretical perspective this is clearly an issue.
Our own exploratory analysis using PCA (see supporting document ESM 6) indicates that the extraction of two factors produces a more comfortable fit to the data over time (2010, 2015 and 2018 data). In summary, extraction of a single factor is, at best, only possible if we allow counter-theoretical factor loadings and variation of loadings between data years (i.e. no temporal equivalence).

Weighting and Aggregation
The manner in which weighting and aggregation methods are used is another cause for concern. Barbieri et al. (2017) indicate that expert weighting was tested and then used for aggregating the domain level to the overall index. In principle, this is a reasonable approach, but beyond the statement that the Analytic Hierarchy Process was used, it is not at all clear what the process was. Specifically:
1. How many experts were consulted?
2. Which countries were represented in the expert panel?
3. What was the variability in the expert judgements? Only the means of the eventual weights are documented in the report.
4. The report indicates that only 60% of the expert weights were retained; how did this affect the representativeness of the eventual sample?
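For readers unfamiliar with the Analytic Hierarchy Process, its core mechanics can be sketched as follows: weights are derived as the principal eigenvector of a matrix of pairwise expert comparisons, here obtained by power iteration. The comparison matrix below is hypothetical and purely illustrative, not the EIGE panel's actual judgements:

```python
# Hypothetical pairwise comparisons for three domains: domain 1 is judged
# twice as important as domain 2 and four times as important as domain 3.
# Entry M[i][j] is the relative importance of domain i over domain j.
M = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

def ahp_weights(matrix, iters=50):
    """Principal eigenvector of a positive comparison matrix, found by
    power iteration and normalised to sum to one."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        s = sum(w)
        w = [x / s for x in w]   # renormalise so the weights sum to one
    return w

w = ahp_weights(M)
print([round(x, 3) for x in w])  # [0.571, 0.286, 0.143]
```

Because the final weights depend entirely on the individual comparisons and on how they are aggregated across experts, the unanswered questions above matter for reproducibility.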
The report provides us with the means of the eventual weights. These are reported to fifteen decimal places. Although not wrong, it suggests a level of precision at odds with the subjective nature of the process reported. Taking the expert weighting at face value, a more critical issue is the interaction between the stated weights and the use of geometric mean for aggregation. Variation in the index distributions within and between the domains means that the effective weights employed are very different to the reported expert weights, as shown in Table 3.
To calculate the effective weights we used the following procedure (see ESM 1 and ESM 2 for more detail):
1. Calculate the overall index score per country, removing each of the domains in turn.
2. From step 1, calculate the leverage that each domain has over the score of each country (i.e. the absolute difference between the overall score with and without that domain).
3. Calculate the arithmetic mean of the leverage values derived in step 2 for each domain.
4. Divide each of the domain means derived in step 3 by the sum of those means to arrive at the effective weights.
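The steps above can be sketched in Python. The domain scores and expert weights below are hypothetical, illustrative numbers only; the real computation uses the EIGE's country-level data:

```python
import math

# Hypothetical domain scores (rows: countries, columns: domains) and
# expert weights; illustrative numbers only, not the EIGE's data.
scores = [
    [80.0, 60.0, 90.0],
    [70.0, 75.0, 65.0],
    [55.0, 85.0, 72.0],
]
weights = [0.4, 0.35, 0.25]

def weighted_geomean(values, w):
    # Weighted geometric mean, with weights renormalised to sum to one.
    return math.exp(sum(wi * math.log(v) for v, wi in zip(values, w)) / sum(w))

def effective_weights(scores, weights):
    n = len(weights)
    leverages = [[] for _ in range(n)]
    for row in scores:
        full = weighted_geomean(row, weights)           # full overall score
        for d in range(n):
            without = weighted_geomean(                 # step 1: drop domain d
                [v for i, v in enumerate(row) if i != d],
                [w for i, w in enumerate(weights) if i != d],
            )
            leverages[d].append(abs(full - without))    # step 2: leverage
    means = [sum(l) / len(l) for l in leverages]        # step 3: mean leverage
    total = sum(means)
    return [m / total for m in means]                   # step 4: normalise

eff = effective_weights(scores, weights)
```

Even in this small example, the effective weights need not match the nominal expert weights, because a domain's leverage depends on the distribution of scores, not just on its assigned weight.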
In general, the choice of the geometric mean to aggregate at the sub-domain and domain levels has theoretical implications that require justification. Applying the geometric mean does not just 'reduce compensability', but affects different values in disparate ways, as exemplified in Tables 4 and 5. In Table 4, we see two (hypothetical) countries and two domains. It seems counter-intuitive that country A is deemed to be doing worse than country B overall. This distortion is even more marked if we imagine a transition to Table 5. Country B makes a significant (20 point) improvement in Domain 2 and country A makes a more modest (10 point) improvement in the same domain. Yet, the overall score for country A increases far more than that of country B, so much so that country A now overtakes country B. A smaller net improvement has resulted in a bigger improvement in the overall score. This is, of course, simply a property of the geometric mean: variation in the numbers being averaged leads to lower means. Theoretically, that variability is likely to be related to achievement levels, and the use of the geometric mean is therefore likely to further amplify the effect embodied in the correcting coefficient.
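The dynamic can be reproduced with hypothetical figures in the spirit of Tables 4 and 5 (the numbers below are ours, not those of the paper's tables):

```python
import math

def geomean(a, b):
    return math.sqrt(a * b)

# Starting point (a Table 4-style scenario): country A has the higher
# arithmetic mean (50 vs 34) but more variance across the two domains,
# so its geometric mean is lower than country B's.
a_before, b_before = geomean(90.0, 10.0), geomean(30.0, 38.0)
# Transition (a Table 5-style scenario): A improves Domain 2 by 10
# points while B improves it by 20 points.
a_after, b_after = geomean(90.0, 20.0), geomean(30.0, 58.0)

print(a_before < b_before)  # True: A starts below B
print(a_after > b_after)    # True: A overtakes despite the smaller gain
```

A's 10-point gain falls on its weakest domain, where the geometric mean is most sensitive, so the smaller absolute improvement produces the larger movement in the aggregate.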

An Alternative Index
To evaluate the impact of the use of the correcting coefficient and geometric aggregation on the index, we constructed an alternative index using the data from the 2020 index. This index was constructed with the same domain and sub-domain structure as the EIGE's, but without the correcting coefficient and with the use of arithmetic averaging at all levels. We retained the expert weights used by the EIGE. For simplicity's sake, we call this index the Alternative EIGE (A-EIGE). The exploration of the difference between this alternative index and the EIGE's can be found in supporting document ESM 3. The headline result is that the EIGE's use of the geometric mean and the correcting coefficient creates a penalty 9 for countries with lower GDP per capita relative to the A-EIGE. This difference also has a geopolitical aspect, with Southern and Eastern countries being penalised relative to Western and Northern ones.

Discussion
In this paper we have reviewed the methodology of one of the key indices of gender equality, the Gender Equality Index developed by the European Institute for Gender Equality (EIGE). We first outlined the methodology of the index and the critique voiced by Permanyer (2015), namely that the scores are predominantly driven by differences in achievement levels rather than gender gaps, which prompted the EIGE to modify their methodology. We then offered an extended critique of the EIGE's current methodology focused on four interrelated issues: (a) the lack of transparency about the methodological decisions and the concomitant implicit theorising, (b) the continuing over-contribution of the correcting coefficient to the index, (c) problems with the verification process they have employed, and (d) issues arising from the aggregation and weighting of the components of the index. The first two issues highlight inconsistencies between the methodological choices and the underlying theory. While we do not take issue with the explicit conceptual framework per se, we argue that many of the theoretical implications of the methodological choices are left implicit and underdeveloped in the EIGE's reporting. Every index embeds, to some extent, a political and theoretical model that drives the selection of indicators and methodological choices. Especially when dealing with an index that measures a phenomenon as complex as gender equality, the underlying theory of gender equality should be evident and consistent with the chosen methodology. Unfortunately, we believe this is not the case for the EIGE index in its current form.
One element of our critique has focused on the limitations of continuing to include absolute achievement levels alongside gender gaps. Yet, we acknowledge that critiques can also be raised regarding an exclusive focus on gaps, as it might suggest that the index should be constructed around a gender equality model built on 'sameness', where men are taken as the standard to which women should be elevated. In the 'sameness vs difference' debate, it has been pointed out that such an androcentric approach falls short of increasing gender equality, as patriarchal structures remain unchallenged (for a discussion see Fraser (1997) and Squires (2000)). For instance, as Plantenga and Remery (2013, p. 36) put it: "It is by no means obvious that women's position is strengthened by having to work as many hours in paid employment as men." However, the inclusion of absolute achievement levels does not resolve this tension. Since absolute achievement levels reflect the population-weighted average of women's and men's values on a given indicator, countries with higher overall achievement levels will often have higher achievement levels for men. This is especially true for indicators related to employment and income. In these cases, the male standard is still taken as the reference point to some extent when achievement levels are taken into consideration. Moreover, allowing the maximum to be defined by the highest-performing country effectively means that the cultural and political model of this high-performing country is the favoured gender equality model. Not only is this arbitrary, it is also a culturally biased gender equality model that penalises countries with lower GDP levels. It would make more sense if the benchmarks chosen in the index corresponded to specified policy aims of the EU (for example, setting the benchmark for the indicator on employment rates to 70% to correspond to the Lisbon Strategy).
If the point of the index is to encourage gender-sensitive development across the EU, we understand that rewarding countries with small gender differences but lower levels of 'development' (i.e. achievement levels) runs counter to the socio-political agenda of the EU. However, there seems to be a tension between the stated aims of the index and the reality of its political use.
Moreover, the current approach might undermine the index's ability to effectively inform policy development toward greater gender equality. Since its introduction, Sweden (followed by Denmark) has consistently scored highest on the EIGE index (Barbieri et al., 2020), findings which support the longstanding idealisation of and orientation toward the Swedish model. However, since the 2008 financial crisis, the Swedish welfare state has undergone significant processes of retrenchment and privatisation, argued to undermine its commitment to gender and social equality (Daly, 2020). In addition, when we consider violence against women (currently not measured in the core EIGE index), the potential existence of a 'Nordic Paradox' (i.e. where the prevalence is higher in supposedly more egalitarian Nordic countries) would further caution against their idealisation. 10 Still, Sweden currently turns out to be the favoured gender equality model in the EIGE's index. This highlights the importance of combining qualitative and longitudinal analyses with quantitative indicators for governance, especially when used to benchmark or rank countries (Verloo and Vleuten, 2009; Minto et al., 2020; Razavi, 2019). More clarity on the theoretical and methodological foundations of the index can also help reduce the risk of its findings being misinterpreted or 'co-opted' to accommodate an economic and employment growth rationale that might in fact undermine gender (and social) equality.
In general, the choice of methods has an impact on the meaning of the index created. The lack of transparency about the methods chosen and the incoherence of the methods with the stated aims of the index need to be addressed. We note that many of the methodological choices (the use of ratios, geometric means and the correcting coefficient) penalise the lower-GDP countries in the sample and therefore tend to reinforce biased assumptions about gender equality progress in the more affluent countries. This is likely not the intention, but it is the effect. Greater clarity about theory, method and the relationship between the two is therefore needed.
We recognise the great potential of the EIGE index in that it combines a broad range of relevant gender equality indicators for the European Union and, as such, forms one of the most comprehensive gender equality indices to date. As the literature shows (see for example Schmid (2021), Dijkstra (2002), Permanyer (2010), Bettio et al. (2013)), the limitations identified in this paper are not specific to the EIGE index and speak to the complexity of developing gender equality indices. By taking the EIGE index as a case study, we aim to contribute to the debate around the conceptual and methodological difficulties in measuring gender inequality using compound indicators. Further, we hope to point to ways in which the EIGE index can be improved. We believe it to be critical that the EIGE maximises its potential for understanding gendered outcomes across the EU, given the importance of measuring and monitoring gender equality levels for accountability and governance.
We believe the EIGE index could be substantially improved by disentangling the two components, gender gaps and achievement levels, to create separate indices. We do not deny that achievement levels are socially and politically relevant just as gender gaps are, but the current conflation of the two different measures represents questionable methodological practice. Separating out the two components would instead allow the index to provide a more nuanced understanding of gender equality levels across the European Union, which can then more effectively inform policy development. The Regional Gender Equality Monitor (Norlén et al., 2019) from the Joint Research Centre, the European Commission's science and knowledge service, could offer inspiration here.
Beyond this, it is difficult to be prescriptive about other methodological choices (e.g. the aggregation method), as they all interact and, until the purpose of the index is clear, any choice of method will remain arbitrary. We encourage the EIGE to strive for that clarity in order for this eminent index to fulfil its undoubted potential.