Gender wage difference estimation at quantile levels using sample survey data

Anastasiade-Guinand, Mihaela-Cătălina; Matei, Alina; Tillé, Yves

doi:10.1007/s11749-023-00885-8

Original Paper
Open access
Published: 19 September 2023

TEST, Volume 32, pages 1392–1433 (2023)

Mihaela-Cătălina Anastasiade-Guinand¹,
Alina Matei ORCID: orcid.org/0000-0002-2630-5633² &
Yves Tillé²

Abstract

This paper is motivated by the growing interest in estimating gender wage differences in official statistics. The wage of an employee is hypothetically a reflection of her or his characteristics, such as education level or work experience. It is possible that men and women with the same characteristics earn different wages. Our goal is to estimate the differences between wages at different quantiles, using sample survey data within a superpopulation framework. To do this, we use a parametric approach based on conditional distributions of the wages as a function of some auxiliary information, as well as a counterfactual distribution. We show in our simulation studies that the use of auxiliary information well correlated with the wages reduces the variance of the counterfactual quantile estimates compared to those of the competitors. Since wage distributions are, in general, heavy-tailed, it is of interest to model wages using heavy-tailed distributions such as the GB2 distribution. We illustrate the approach using this distribution and the wages for men and women using simulated and real data from the Swiss Federal Statistical Office.

1 Introduction

This paper is motivated by the growing interest in estimating the wage differences between men and women in official statistics. Applications in official statistics deal with random samples drawn from finite populations. Estimation is usually carried out in the so-called design-based framework, where only the samples are random, while the variables collected from them are fixed. Thus, inference in finite populations may be very different from that usually used in classical statistics. In order to accommodate some theory from econometrics, we consider a so-called superpopulation framework, assuming that our finite population is a random sample drawn from an infinite population. Next, the finite population is divided into two groups or subpopulations: men and women.

It is possible that men and women with the same characteristics earn different wages. The wages of men and women are modeled separately using a parametric model. Conditional on some characteristics, we assume that the conditional wage distribution of each individual in a group follows a given theoretical distribution with unknown parameters. Our goal is to capture the shape of the wage distributions and to go beyond the mean differences provided by the regression approach of Blinder (1973) and Oaxaca (1973), which is widely used by the world’s national statistical offices. We do this by determining the estimator of the differences between the gender wages at different quantiles. Following, for instance, Melly (2006) and Chernozhukov et al. (2013), we extend to quantiles the classical decomposition method of Blinder (1973) and Oaxaca (1973) for the mean, using the concept of counterfactual distribution. (For an overview, see Fortin et al. 2011.) The counterfactual distribution is estimated by combining the parameters of one group with the characteristics of the other group. This is done in order to estimate what the former group would earn if they had the characteristics of the latter. We follow this guideline to estimate the wages of women as if they had the same characteristics as men. This leads to the estimation of differences between the gender wages conditionally on fixed covariates at different quantiles. We use a conditional distribution approach similar to the one used by Biewen and Jenkins (2005). First, we estimate the parameters of the distribution of each individual given their characteristics. Next, the marginal wage distribution is fitted based on the individual wage distributions. The parametric approach used in this paper has already been suggested in several papers in the decomposition literature, for instance, Biewen and Jenkins (2005), Van Kerm (2013) and Van Kerm et al. (2016).

The novelty of this paper is twofold. First, we use the parametric approach from a survey sampling perspective, underlining the specific framework used in design-based inference. While the main goal is to model wages by using heavy-tailed distributions and survey weights, we also use the parametric approach with a quite different interest, which is specific to survey sampling and was not previously investigated: the approach uses auxiliary information and, if this information is well correlated with the wages, the variance of the quantile counterfactual estimates may be reduced compared to those of some competitors. We use two parametric methods to estimate quantiles, by assuming a given theoretical distribution of conditional wages of men and women given their characteristics, respectively. While the first parametric method used is similar to one used by Biewen and Jenkins (2005), the second one is new and is introduced in this paper. The second method has the advantage of allowing an easy construction of confidence intervals of a quantile. Secondly, we want to illustrate the quantile decomposition of wages using data from official statistics and heavy-tailed distributions other than the log-normal distribution usually used in this domain (see, for instance, Leythienne and Ronkowski 2018).

Motivated by a flexible way to model income and wage distributions, we fit in our examples a generalized beta distribution of the second kind (hereafter, GB2 distribution) to conditional wages. Following the work of Thurow (1970), who considered that “the beta distribution seems the most flexible” distribution to capture income changes, McDonald (1984) introduced the GB2 distribution to model income distributions. McDonald (1984), Bandourian et al. (2002) and McDonald and Ransom (2008) showed that the GB2 distribution provides a good fit for income. The GB2 distribution can be used to fit either positively or negatively skewed distributions and is a generalization of several distributions, such as the log-normal, the exponential or the Fisk distributions (Kleiber and Kotz 2003; McDonald 1984; McDonald and Xu 1995; McDonald and Butler 1990). This distribution is already well covered in the literature (see, for instance, Kleiber and Kotz 2003; Graf et al. 2011). We illustrate the two parametric methods using the GB2 distribution, with parameters estimated through maximum pseudo-likelihood, when survey weights and characteristics are associated with sampled employees, by expressing the scale parameter of a GB2 distribution as a function of their characteristics. We show in the “Appendix” how to estimate the standard errors of the estimated parameters in a GB2 regression model, using a sandwich estimator and a parametric bootstrap approach.

This paper is structured as follows: In Sect. 2, we present the general setup and recall the classical decomposition method of Blinder (1973) and Oaxaca (1973), making the bridge with the estimation in the context of survey data. In Sect. 3, we discuss the concept of counterfactual wage distribution and show a decomposition method at quantiles’ level. The two parametric methods to estimate quantiles are presented in Sects. 4 and 5. The two methods are applied to the gender wage distributions and the counterfactual wage distribution, respectively. The Monte Carlo simulation results given in Sect. 6.2 show the methodological interest of using auxiliary information in the quantile counterfactual estimation. We illustrate the decomposition method at quantiles’ level in Sect. 6.3, by assuming that the conditional wage distributions for women and men each follow a GB2 distribution. The data used were obtained from the Swiss Federal Statistical Office and come from the Swiss survey on the structure of earnings in 2012. We draw our conclusions in Sect. 7.

2 Setup

Consider a finite population of employees with the labels $U=\{1, 2, \ldots , N\}$. From this population, we randomly select a sample S of size n, without replacement. The sample is selected through a sampling design $p(s) = \Pr (S=s), \forall s \subseteq U$. It is assumed that the sampling design is noninformative. To each unit $k\in S$, a survey weight $w_k$ is associated. These weights can be equal to the inverse of the inclusion probabilities or can be more complicated weights, like calibration weights. The set U is divided into two subsets with labels corresponding to men and women, denoted by $U_M$ and $U_F$, respectively, such that $U_M \cup U_F = U$ and $U_M \cap U_F = \emptyset $. Similarly, the sample S is divided into two random subsamples of men and women, denoted by $S_M = S \cap U_M$ and $S_F = S \cap U_F$, respectively. We denote these subsamples as $S_g\subseteq U_g, g \in \{M,F\}$, with $n_M$ and $n_F$ being the number of employees in the subsamples, respectively, such that $n_M+n_F =n$.

We work in a superpopulation framework and assume that the finite population is a random sample drawn from an infinite population. Let Y be the variable wage. First, we consider that Y is a random variable generated by a distribution model $\xi $ in the infinite population. Next, the finite population $\{ Y_1, Y_2,\ldots , Y_N\}$ is randomly generated from the model $\xi $, where $Y_k$ is the variable wage associated with each $k\in U$. We assume that the estimation process refers to the infinite population parameter, and is executed in the design-based approach, considering, however, that $Y_k$ associated with unit $k\in U$ is random (see Särndal et al. 1992, p. 516, Case 4).

We also assume that a linear regression model that relates the logarithm of Y to some covariates $X_1, X_2, \ldots , X_c$ holds. The covariates are the same in each $U_g, g \in \{M,F\}$, but for coherence with the subsets’ notation we denote by $X_{1, g}, X_{2, g}, \dots , X_{c, g}$ the covariates in group $g \in \{M,F\}$. For each unit $k\in U_g$, $g \in \{M,F\}$, the wage is denoted by $Y_{k, g}$ and the c covariates are stored in the vector

$$\begin{aligned} \textbf{X}_{k, g} = (1, X_{1k, g}, X_{2k, g}, \dots ,X_{ck, g} )^\top . \end{aligned}$$

(1)

One realization of $\textbf{X}_{k, g}$ is denoted by $\textbf{x}_{k, g} = (1, x_{1k, g}, x_{2k, g}, \dots ,x_{ck, g} )^{\top }$. The last c elements of the vector $\textbf{x}_{k, g}$ represent realizations of variables $X_{1, g}, X_{2, g}, \dots , X_{c, g}$, respectively, $g \in \{M,F\}$. In what follows, we also denote by $y_k$ a realization of $Y_k, k\in U$ and use $\textbf{X}_g=(X_{1, g}, X_{2, g}, \ldots , X_{c, g})$, $g \in \{M,F\}$, with $\textbf{x}_g$ one realization of $\textbf{X}_g$.

2.1 The Blinder–Oaxaca-type decomposition method

We use what is called in econometrics a decomposition method. The general idea of decomposition methods is to divide the difference between wages of men and women into two components: the first one is the part due to the difference in characteristics between them, and thus, it can be explained, while the second one cannot. Starting with Blinder (1973) and Oaxaca (1973), many decomposition methods have been proposed, not only to decompose wage means, but also wage densities; for an overview, see Fortin et al. (2011).

Assume that the superpopulation is divided into two subsuperpopulations from which the subsets $U_g, g\in \{M,F\}$, are drawn, respectively. In each subsuperpopulation ${\text {SUP}}_g$, a linear relationship is assumed to hold between the available characteristics and the logarithm of the wage. A linear regression model is fitted separately in each subsuperpopulation ${\text {SUP}}_g$ with $g\in \{M,F\}$:

$$\begin{aligned} \textrm{log}(Y_{k, g})=\textbf{X}_{k, g}^\top \varvec{\beta }_g+\varepsilon _{k, g}, k\in {\text {SUP}}_g, \end{aligned}$$

(2)

where $\varepsilon _{k, g}\sim N(0, \sigma ^2_g)$ are independent and identically distributed (iid), $\varvec{\beta }_g$ represents the vector of regression coefficients and $\sigma ^2_g$ is the variance of $\textrm{log}(Y_{k, g})\mid \textbf{X}_{k, g}$ in ${\text {SUP}}_g$. The regression coefficients ${\varvec{\beta }}_g$ are called the group wage structure or the returns on characteristics, and they represent the contribution of each characteristic to the logarithm of the wage.

By using Model (2), one obtains the conditional expectation $E({\widetilde{Y}}_g \mid \textbf{X}_g=\textbf{x}_g)=\textbf{x}_g^\top \varvec{\beta }_g$ and the unconditional expectation

$$\begin{aligned} E({\widetilde{Y}}_g)=E\left( E\big ({\widetilde{Y}}_g \mid \textbf{X}_g\big )\right) =E(\textbf{X}_{g})\varvec{\beta }_g+E(\varepsilon _{g})=E(\textbf{X}_{g})\varvec{\beta }_g, \end{aligned}$$

where ${\widetilde{Y}}_g$ represents $\textrm{log}(Y_g)$, $Y_g$ is the random variable wage in group g, $\varepsilon _{g}$ is the random variable error term in the same group, and $\textbf{X}_g$ and $\varepsilon _{g}$ are independent.

The difference between the expectations of the logarithm of wages of the two groups (a Blinder–Oaxaca-type decomposition) can be written as

$$\begin{aligned} \begin{aligned} \Delta&=E({\widetilde{Y}}_M)-E({\widetilde{Y}}_F)\\&=E\left( E\big ({\widetilde{Y}}_M \mid \textbf{X}_M\big )\right) -E\left( E\big ({\widetilde{Y}}_F \mid \textbf{X}_F\big )\right) \\&=\left( E(\textbf{X}_{M})-E(\textbf{X}_{F})\right) {\varvec{\beta }}_F +E(\textbf{X}_{M})\left( {\varvec{\beta }}_M-{\varvec{\beta }}_F\right) . \end{aligned} \end{aligned}$$

(3)
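
The last equality in Expression (3) can be checked by adding and subtracting the term $E(\textbf{X}_{M}){\varvec{\beta }}_F$. A short worked version, using only the notation introduced above, is:

$$\begin{aligned} \Delta&=E(\textbf{X}_{M}){\varvec{\beta }}_M-E(\textbf{X}_{F}){\varvec{\beta }}_F\\&=E(\textbf{X}_{M}){\varvec{\beta }}_M-E(\textbf{X}_{M}){\varvec{\beta }}_F+E(\textbf{X}_{M}){\varvec{\beta }}_F-E(\textbf{X}_{F}){\varvec{\beta }}_F\\&=\left( E(\textbf{X}_{M})-E(\textbf{X}_{F})\right) {\varvec{\beta }}_F+E(\textbf{X}_{M})\left( {\varvec{\beta }}_M-{\varvec{\beta }}_F\right) . \end{aligned}$$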

The term $E(\textbf{X}_{M}){\varvec{\beta }}_F$ that appears in Expression (3) is called the women’s counterfactual mean of the logarithm of wage. We interpret it as the mean of the logarithm of wage of women if they had the same average characteristics as men and if their return on characteristics remained unchanged. This counterfactual exercise is also found in Fortin et al. (2011). Women’s counterfactual distribution of logarithm of wage is obtained by using the characteristics of men ($\textbf{X}_M$) and the wage structure of women ($\varvec{\beta }_F$).

The difference between the averages of the logarithm of wages of the groups in Expression (3) contains two elements: an explained part, also called the composition effect, $(E(\textbf{X}_{M})-E(\textbf{X}_{F})){\varvec{\beta }}_F$, and an unexplained part, or the structure effect, $E(\textbf{X}_{M})({\varvec{\beta }}_M-{\varvec{\beta }}_F)$. The former encompasses differences in characteristics between the two groups. The latter is the difference in the returns on characteristics between the two groups, the part that is not attributable to objective factors (Oaxaca 1973; Blinder 1973). This is sometimes called “discrimination”; however, the term is not unanimously accepted. Popli (2013) comments that “the unexplained wage gap, which is often termed discrimination, includes the effect of labor market discrimination, unobservable variables and omitted variables.” If ${\varvec{\beta }}_M={\varvec{\beta }}_F$, the structure effect is 0.

At $U_g$ level, $E(\textbf{X}_{g})$ is reduced to a finite mean ${\overline{\textbf{X}}}_{g}=\sum _{k\in U_g} \textbf{X}_{k, g} /N_g$, and the regression coefficients are given by

$$\begin{aligned} \varvec{\beta }_g = \left( \sum _{k \in U_g} \textbf{X}_{k, g} \textbf{X}_{k, g}^\top \right) ^{-1} \sum _{k \in U_g} \textbf{X}_{k, g} {\widetilde{Y}}_{k, g}, \quad g\in \{M,F\}, \end{aligned}$$

(4)

where ${\widetilde{Y}}_{k, g}=\textrm{log}(Y_{k, g}), k\in U_g$. The vector $\varvec{\beta }_g$ can be consistently estimated from the subsamples $S_g$ by

$$\begin{aligned} {\widehat{\varvec{\beta }}}_g = \left( \sum _{k \in S_g} w_k\textbf{x}_{k, g} \textbf{x}_{k, g}^\top \right) ^{-1} \sum _{k \in S_g} w_k\textbf{x}_{k, g} {\widetilde{y}}_{k, g}, \, g\in \{M,F\}, \end{aligned}$$

(5)

where ${\widetilde{y}}_{k, g}$ is the realization of ${\widetilde{Y}}_{k, g}, k\in S_g$.

The difference $\Delta $ can be estimated at the sample level by

$$\begin{aligned} {\widehat{\Delta }} = \big (\widehat{{\overline{\textbf{X}}}}_M -\widehat{{\overline{\textbf{X}}}}_F\big )^{\top }{\widehat{\varvec{\beta }}}_F+ \widehat{{\overline{\textbf{X}}}}_M^{\top }\big ({\widehat{\varvec{\beta }}}_M-{\widehat{\varvec{\beta }}}_F\big ), \end{aligned}$$

(6)

where $\widehat{{\overline{\textbf{X}}}}_g=\sum _{k\in S_g} w_k\textbf{x}_{k, g}/\sum _{k\in S_g} w_k$ represents the estimator of ${\overline{\textbf{X}}}_{g}$.
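
As an illustration, Expressions (5) and (6) can be sketched in a few lines of Python with NumPy. This is a minimal sketch, not the authors' implementation: the function names (`weighted_ols`, `weighted_mean`, `blinder_oaxaca`) are hypothetical, and each design matrix is assumed to contain a leading column of ones, as in Expression (1).

```python
import numpy as np

def weighted_ols(X, y_log, w):
    """Survey-weighted least squares coefficients, as in Expression (5)."""
    XtW = X.T * w                      # each column k of X.T is scaled by w_k
    return np.linalg.solve(XtW @ X, XtW @ y_log)

def weighted_mean(X, w):
    """Hajek-type estimator of the mean vector of characteristics."""
    return (w[:, None] * X).sum(axis=0) / w.sum()

def blinder_oaxaca(X_M, y_M, w_M, X_F, y_F, w_F):
    """Return (explained, unexplained) parts of Expression (6).

    X_M, X_F: design matrices with a leading column of ones (assumption).
    y_M, y_F: wages (not logs); w_M, w_F: survey weights.
    """
    beta_M = weighted_ols(X_M, np.log(y_M), w_M)
    beta_F = weighted_ols(X_F, np.log(y_F), w_F)
    xbar_M = weighted_mean(X_M, w_M)
    xbar_F = weighted_mean(X_F, w_F)
    explained = (xbar_M - xbar_F) @ beta_F      # composition effect
    unexplained = xbar_M @ (beta_M - beta_F)    # structure effect
    return explained, unexplained
```

Because each model contains an intercept, the two parts returned by this sketch add up to the difference between the weighted means of the log wages of men and women.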

Computing ${\widehat{\varvec{\beta }}}_M-{\widehat{\varvec{\beta }}}_F$ allows us to estimate the unexplained part at the mean level, using a log model of the wages. We are interested in estimating it at the quantile level, using a more general framework that extends the log model.

3 Quantiles’ decomposition

On the superpopulation level, let $F^{(Y_F\mid \textbf{X}_F)}(.)$ and $F^{(Y_M\mid \textbf{X}_M)}(.)$ be the cumulative distribution functions (CDFs) of the conditional wage distributions of women and men, with respect to the characteristics $\textbf{X}_F$ and $\textbf{X}_M$, respectively. We also denote by $F^{\textbf{X}_F}(.)$ and $F^{\textbf{X}_M}(.)$ the CDFs of distributions corresponding to $\textbf{X}_F$ and $\textbf{X}_M$, respectively.

Recall that a counterfactual distribution is an artificial distribution, defined “as the result of either a change in the distribution of a set of covariates X that determine the outcome variable of interest Y, or as a change in the relationship of the covariates with the outcome, i.e., a change in the conditional distribution of Y given X” (Chernozhukov et al. 2013). We construct a counterfactual wage distribution as the distribution resulting from the change in the distribution of covariates. We build the counterfactual wage distribution of women using the characteristics of men. It can be interpreted as the wage distribution of women if they had the characteristics of men. This is done in order to compare the observed and the counterfactual wage distributions to measure the effects of the change on quantiles’ levels.

Let $F^{C}(.)$ be the CDF of the counterfactual distribution of women. Following Chernozhukov et al. (2013), the CDF at the point $y\in {\mathcal {Y}}_F$, where ${\mathcal {Y}}_F$ is the support of women’s wages, is defined as

$$\begin{aligned} F^{C}(y) = \int _{{\mathcal {X}}_M} F^{(Y_F \mid \textbf{X}_F)} (y \mid \textbf{x}) dF^{\textbf{X}_M}(\textbf{x}), \end{aligned}$$

(7)

where ${\mathcal {X}}_M$ is the support of $\textbf{X}_M$. The counterfactual wage distribution is well defined if the support of $\textbf{X}_F$ (${\mathcal {X}}_F$) includes the support of $\textbf{X}_M$: ${\mathcal {X}}_M\subseteq {\mathcal {X}}_F$.

The counterfactual wage is a potential wage of a woman if she matches the characteristics of a man. Expression (7) assumes that to each woman one can match the characteristics of a man. Under the assumption that ${{\mathcal {X}}}_M={{\mathcal {X}}}_F$, DiNardo et al. (1996) re-expressed the counterfactual distribution given in Expression (7) as

$$\begin{aligned} F^{{C}}(y) = \int _{{{\mathcal {X}}}_F} F^{(Y_F\mid \textbf{X}_F)}(y\mid \textbf{x}) \psi (\textbf{x}) {\textrm{d}} F^{\textbf{X}_F}(\textbf{x}), \end{aligned}$$

(8)

where $\psi (\textbf{x})={\textrm{d}} F^{\textbf{X}_M} (\textbf{x})/{\textrm{d}} F^{\textbf{X}_F} (\textbf{x})$. DiNardo et al. (1996) rewrite the $\psi (.)$ factor as

$$\begin{aligned} \psi (\textbf{x}_k)=\psi _k=\frac{P(G_k=1\mid \textbf{x}_k)/P(G_k=1)}{P(G_k=0\mid \textbf{x}_k)/P(G_k=0)}, \end{aligned}$$

(9)

where $G_k = 1$ if individual k is a man and $G_k = 0$ otherwise and $\textbf{x}_k$ is the vector of observed characteristics for individual k. The parameter $\psi (\textbf{x}_k)$ can be estimated by using a probit or a logistic regression model (DiNardo et al. 1996) or by calibration (Anastasiade and Tillé 2017); for the calibration method in survey sampling, see Deville and Särndal (1992). The difference between the two methods is discussed by Anastasiade and Tillé (2017).
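
To make Expression (9) concrete, the following sketch estimates $\psi _k$ for the women in a pooled sample with a weighted logistic regression fitted by Newton's method. It is only an illustration under stated assumptions: the function names are hypothetical, `X` is assumed to include a column of ones, `g` is the 0/1 indicator $G_k$, and $P(G_k=1)$ is estimated by the weighted share of men. The calibration alternative of Anastasiade and Tillé (2017) is not shown.

```python
import numpy as np

def fit_logistic(X, g, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression of g on X by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (g - p))                 # weighted score
        hess = (X.T * (w * p * (1.0 - p))) @ X     # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def reweighting_factors(X, g, w):
    """Estimate psi_k of Expression (9) for the women (g == 0).

    X: pooled design matrix with an intercept column (assumption).
    g: 1.0 for men, 0.0 for women; w: survey weights.
    """
    beta = fit_logistic(X, g, w)
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # estimate of P(G=1 | x_k)
    p_men = np.sum(w * g) / np.sum(w)              # estimate of P(G=1)
    psi = (p / (1.0 - p)) / (p_men / (1.0 - p_men))
    return psi[g == 0]
```

With a single binary covariate, the model is saturated and the reweighted women reproduce the men's distribution of that covariate exactly, which gives a simple sanity check.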

The classical decomposition on the mean level (Blinder 1973; Oaxaca 1973) is re-expressed at quantile level (Melly 2006; Chernozhukov et al. 2013) as

$$\begin{aligned} \Delta _{(\alpha )}=Q_{(\alpha )}^{M}-Q_{(\alpha )}^{F}, \end{aligned}$$

with $\alpha \in (0,1)$, where ${Q}_{(\alpha )}^{M}$ and $Q_{(\alpha )}^{F}$ represent the quantile of order $\alpha $ of the men and women wage distribution, respectively.

The change at quantile level is rewritten as

$$\begin{aligned} \Delta _{(\alpha )}=Q_{(\alpha )}^{M}-Q_{(\alpha )}^{F} =\left( Q_{(\alpha )}^{M}-Q_{(\alpha )}^{C}\right) +\left( Q_{(\alpha )}^{C}-Q_{(\alpha )}^{F}\right) , \end{aligned}$$

where $Q_{(\alpha )}^{C}$ represents the quantile of order $\alpha $ of the counterfactual distribution. The difference $Q_{(\alpha )}^{M}-Q_{(\alpha )}^{C}$ is interpreted here as the unexplained part at the $\alpha $ quantile level. Estimation of $\Delta _{(\alpha )}$ results in a quantile estimation problem. To estimate the quantiles ${Q}_{(\alpha )}^{M}$ and ${Q}_{(\alpha )}^{F}$, we apply the methods shown in Sect. 4, while methods to estimate ${Q}_{(\alpha )}^{C}$ are given in Sect. 5.

4 Quantile estimation in finite populations

For simplicity of notation, the index g is suppressed in this section.

Let Y be a random variable defined over a superpopulation. In the classical statistical framework, we assume a joint distribution of $(Y, \textbf{X})$ and denote by $F^Y(.)$ and $F^\textbf{X}(.)$ the marginal CDF of Y and $\textbf{X}$, respectively. We also assume that $Y \mid \textbf{X}= \textbf{x}\sim D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })$, where $D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })$ is a distribution with parameters $h(\textbf{x}^{\top } \varvec{\beta })$ and $\varvec{\delta }$. Note that h is a continuous function, and the first parameter is expressed using some characteristics $\textbf{x}$ and some other parameters $\varvec{\beta }$. The marginal distribution of Y has the CDF

$$\begin{aligned} F^{Y}(y)= \int F_{D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })} (y\mid \textbf{x}) {\textrm{d}}F{^\textbf{X}}(\textbf{x}), \end{aligned}$$

where $F_{D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })}(.\mid \textbf{x})$ is the CDF of the distribution $D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })$.

Given $\textbf{X}=\textbf{x}$, at the U level, the parameters $h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta }$ are replaced by $h(\textbf{x}^{\top }\varvec{\beta }_N), \varvec{\delta }_N$, respectively, where $h(\textbf{x}^{\top }\varvec{\beta }_N), \varvec{\delta }_N$ are parameters computed on U. For instance, if a model similar to the one provided by Expression (2) holds, D is the log-normal distribution, h(.) is the exponential function, $\varvec{\beta }_N$ is similarly defined as in Expression (4), and $\varvec{\delta }_N$ is $\sigma _N^2$, the variance of the error term.

Conditional on U and given $\textbf{X}=\textbf{x}$, the CDF of the distribution of Y is expressed at the U level using the following mixture distribution

$$\begin{aligned} F^{Y}_N(y)=\frac{1}{N}\sum _{k\in U} F_{D(h(\textbf{x}_k^{\top } \varvec{\beta }_N), \varvec{\delta }_N)}(y \mid \textbf{x}_k ), {\text { for any }} y\in {\mathcal {R}}. \end{aligned}$$

(10)

Note that here we assume that $F^{Y}_N(y)$ is a model CDF. This is in contrast to the approach given in the topic-related literature (Särndal et al. 1992, p.197), where the estimand is the finite population empirical distribution function $F^Y_{\textrm{emp}}(y)$ given by

$$\begin{aligned} F_{\textrm{emp}}^Y(y)=\frac{\sum _{k \in U} I(y_k\le y)}{N}, \end{aligned}$$

(11)

where I(.) is the indicator function, with $I(y_k\le y)=1$ if $y_k\le y$, 0 otherwise. The CDF estimation in finite populations usually concerns the estimation of $F^Y_{\textrm{emp}}(.)$ and not of $F^Y_N(.)$. In order to compare the usual approach used in survey sampling and the parametric approach, our interest here is in estimating $F^Y(.)$, because both $F^Y_{\textrm{emp}}(.)$ and $F^Y_N(.)$ are estimators of $F^Y(.)$.

At the sample level, $F_{\textrm{emp}}^Y(y)$ is estimated by

$$\begin{aligned} {\widehat{F}}_{\textrm{emp}}^Y(y)=\frac{\sum _{k \in S} w_k I(y_k\le y)}{\sum _{k \in S} w_k}, \end{aligned}$$

while the quantile ${Q}_{(\alpha )}$ of the distribution of Y is estimated by

$$\begin{aligned} {\widehat{Q}}_{(\alpha ), \textrm{emp}} = \left[ {\widehat{F}}^Y_{\textrm{emp}}(\alpha )\right] ^{-1}, \end{aligned}$$

(12)

where $\left[ {\widehat{F}}^Y_{\textrm{emp}}(.)\right] ^{-1}$ denotes the inverse of ${\widehat{F}}^Y_{\textrm{emp}}(.)$.
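
The weighted empirical CDF and its inverse in Expression (12) can be sketched as follows. The function names are hypothetical, and the inverse is taken here as the generalized inverse (the smallest observed value whose weighted CDF reaches $\alpha $), which is one common convention and an assumption of this sketch.

```python
import numpy as np

def weighted_ecdf(y, w, t):
    """Weighted empirical CDF evaluated at the point t."""
    return np.sum(w * (y <= t)) / np.sum(w)

def weighted_quantile(y, w, alpha):
    """Smallest observed y whose weighted empirical CDF is >= alpha."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    cdf = np.cumsum(w_sorted) / np.sum(w_sorted)
    idx = np.searchsorted(cdf, alpha, side="left")
    # Guard against rounding leaving the last CDF value just below 1
    return y_sorted[min(idx, len(y_sorted) - 1)]
```

For example, with wages `[10, 20, 30, 40]` and equal weights, the estimated median under this convention is 20; giving the last unit a weight of 3 moves it to 30.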

For $F_N^Y(.)$, the parameters $\gamma _{k, N}=h(\textbf{x}_{k}^{\top } \varvec{\beta }_N)$ and $\varvec{\delta }_N$ in Expression (10) are estimated, respectively, by ${\widehat{\gamma }}_{k, N}=h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}}_N)$ and ${\widehat{\varvec{\delta }}}_N$, where both estimators are computed on the sample, using a weighted approach with weights $w_k$. The quantile estimation is done using two methods. The first method (denoted as “Method 1”) is based on the estimator of $F^{Y}_N(y)$ given in Expression (13), and it is similar to the one used by Biewen and Jenkins (2005). As an alternative to Method 1, we propose in this paper a second one (denoted as “Method 2”) which is a simulation method.

1. Method 1: $F^{Y}_N(y)$ is estimated on a sample S using a Hájek-type estimator as
$$\begin{aligned} {\widehat{F}}^{Y}_N(y)=\sum _{k\in S} w_k F_{D({\widehat{\gamma }}_{k,N}, {\widehat{\varvec{\delta }}}_N)}(y\mid \textbf{x}_k)/\sum _{k\in S}{w_k}. \end{aligned}$$
(13)
Next, the quantile ${Q}_{(\alpha )}$ of the distribution of Y is estimated by
$$\begin{aligned} {\widehat{Q}}_{(\alpha )}= \left[ {\widehat{F}}^Y_N(\alpha )\right] ^{-1}. \end{aligned}$$
In many cases, $\left[ {\widehat{F}}^Y_N(.)\right] ^{-1}$ is computed using a numerical method.
2. Method 2: If the inverse function of ${\widehat{F}}^{Y}_N(.)$ cannot be computed (e.g., lack of monotonicity of ${\widehat{F}}^{Y}_N$) or the numerical method is slow, we introduce and use the following Monte Carlo method based on parametric bootstrap:
  (a) Generate a large number m of n independent draws from the distribution $D(h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}}_N), {\widehat{\varvec{\delta }}}_N)$, $k\in S$, respectively. A matrix M of dimension $m\times n$ of such draws is obtained. Each element $(i, k), i=1, \dots , m, k=1, \dots n$ in M is the realization $y_{i, k}$ of a random variable $Y_{i, k}$ with $Y_{i, k}\sim D(h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}_{N}}), {\widehat{\varvec{\delta }}_{N}});$ given $\textbf{x}$ all the random variables $Y_{i, k}$ are independent.
  (b) Associate with each element $(i, k), i=1, \dots , m, k=1, \dots , n$ of M the weight $w_k, k \in S$ and compute the empirical weighted quantile of order $\alpha \in [0, 1]$
  $$\begin{aligned} {\widehat{Q}}^{(i)}_{\alpha , \textrm{emp}}=\left[ {\widehat{F}}^{Y}_{\textrm{emp}, i} (\alpha )\right] ^{-1}, \end{aligned}$$
  where ${\widehat{F}}^{Y}_{\textrm{emp}, i} (y)=\sum _{k=1}^{n} w_k I(y_{i, k}\le y)/\sum _{k=1}^{n} w_k$.
  (c) For each $\alpha \in [0, 1]$, compute the mean of ${\widehat{Q}}^{(i)}_{\alpha , \textrm{emp}}, i=1, \dots , m;$ this mean represents an estimator of the quantile of order $\alpha $ of the distribution with the CDF given in Expression (10).
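
Method 1 can be sketched for the log-normal special case mentioned above (D log-normal, h the exponential function). This is a hedged illustration rather than the GB2 implementation used in the paper: the function names are hypothetical, `mu[k]` stands for $\textbf{x}_k^{\top }{\widehat{\varvec{\beta }}}_N$, `sigma` for the estimated error standard deviation, and the numerical inversion is done by bisection on the log scale, which is one possible choice of numerical method.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def normal_cdf(z):
    """Standard normal CDF, vectorized over NumPy arrays."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def mixture_cdf(y, mu, sigma, w):
    """Expression (13) for log-normal D: weighted mean of conditional CDFs."""
    return np.sum(w * normal_cdf((math.log(y) - mu) / sigma)) / np.sum(w)

def method1_quantile(alpha, mu, sigma, w, tol=1e-10):
    """Invert the mixture CDF at alpha in (0, 1) by bisection on log(y)."""
    lo = np.min(mu) - 10.0 * sigma     # CDF is essentially 0 here
    hi = np.max(mu) + 10.0 * sigma     # CDF is essentially 1 here
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(math.exp(mid), mu, sigma, w) < alpha:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))
```

Bisection relies on the mixture CDF being nondecreasing, which holds here because every conditional log-normal CDF is nondecreasing and the weights are positive.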

Method 2 allows an easy construction of an approximate $100\times (1-\gamma )\%$ confidence interval $(\gamma \in (0,1))$ for a quantile using the method of percentile bootstrap confidence intervals. Conditional on the estimated parameters, each row i of the matrix M provides one estimate ${\widehat{Q}}^{(i)}_{\alpha , \textrm{emp}}$ of the quantile of order $\alpha $, so that the m rows yield a set of independent estimates. Next, the empirical quantiles of order $100\times (\gamma /2)\%$ and $100\times (1-\gamma /2)\%$ of these estimates are computed. They form the lower and upper bounds of an approximate $100\times (1-\gamma )\%$ confidence interval of a quantile of order $\alpha $. Monte Carlo simulation results (not shown here) indicate coverage rates close to $95\%$ for this method ($\gamma =0.05$).
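
Steps (a) to (c) of Method 2 and the percentile interval can be sketched as follows, again for the log-normal special case and not for the GB2 model of the paper. The function names, the default values of `m` and `seed`, and the use of `mu[k]` for $\textbf{x}_k^{\top }{\widehat{\varvec{\beta }}}_N$ are assumptions of this illustration.

```python
import numpy as np

def weighted_quantile(y, w, alpha):
    """Smallest y whose weighted empirical CDF is >= alpha."""
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    idx = min(np.searchsorted(cdf, alpha, side="left"), len(y) - 1)
    return y[order][idx]

def method2_quantile(alpha, mu, sigma, w, m=1000, gamma=0.05, seed=0):
    """Monte Carlo quantile estimate with a percentile interval.

    Returns (estimate, lower bound, upper bound).
    """
    rng = np.random.default_rng(seed)
    n = len(mu)
    # Step (a): row i of M is one simulated sample, M[i, k] ~ LogNormal(mu[k], sigma)
    M = np.exp(mu + sigma * rng.standard_normal((m, n)))
    # Step (b): one weighted quantile per simulated sample (per row)
    q = np.array([weighted_quantile(M[i], w, alpha) for i in range(m)])
    # Step (c): average the m quantiles; percentile bounds from the same m values
    lower, upper = np.quantile(q, [gamma / 2.0, 1.0 - gamma / 2.0])
    return q.mean(), lower, upper
```

The interval is conditional on the estimated parameters, so in this sketch it reflects only the simulation variability around them.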

Remark 1

Method 2 can be improved if the CDF is estimated using all $m\times n$ simulated outcomes as follows

$$\begin{aligned} \sum _{i=1}^m\left( \sum _{k=1}^n w_kI(y_{i,k}\le y)/\sum _{k=1}^n w_k\right) /m. \end{aligned}$$

This CDF estimator can be inverted to obtain the quantile estimation at level $\alpha $. Step (c) in Method 2 is no longer necessary. The same remark applies to the algorithm given in Sect. 5. Thus, one can improve the quantile estimator precision by using $m\times n$ outcomes instead of m. However, the computation of an approximate $100\times (1-\gamma )\%$ confidence interval of a quantile is no longer possible. We use in our results the first version of Method 2.

5 Quantile estimation of the counterfactual distribution

We are interested in estimating the quantiles of the counterfactual distribution. This is necessary in order to compare them with the estimated quantiles of the unconditional wage distributions of women and of men, respectively, as underlined in Sect. 3.

The empirical counterfactual CDF defined at the $U_F$ level can be written as

$$\begin{aligned} F^{C}_{\textrm{emp}}(y)=\frac{\sum _{k \in U_F} \psi _k I(y_k\le y)}{\sum _{k\in U_F} \psi _k}. \end{aligned}$$

(14)

The weighted versions of the methods of DiNardo et al. (1996) and Anastasiade and Tillé (2017) use the estimated empirical counterfactual CDF defined by

$$\begin{aligned} {\widehat{F}}^{C}_{\textrm{emp}}(y)=\frac{\sum _{k \in S_F} {\widehat{\psi }}_k w_k I(y_k\le y)}{\sum _{k\in S_F} {\widehat{\psi }}_k w_k}, \end{aligned}$$

where ${\widehat{\psi }}_k$ is an estimator of $\psi _k$ given in Expression (9). Next, both methods estimate the $\alpha $-quantile $Q_{(\alpha )}^{C}$ of the counterfactual distribution using

$$\begin{aligned} {\widehat{Q}}_{(\alpha ), \textrm{emp}}^{C} =\left[ {\widehat{F}}^{C}_{\textrm{emp}} (\alpha )\right] ^{-1}. \end{aligned}$$

(15)
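
Combining Expression (15) with the quantile decomposition of Sect. 3 gives a simple empirical version of the explained and unexplained parts. The sketch below assumes that the reweighting factors ${\widehat{\psi }}_k$ for the women are already available; the function names are hypothetical.

```python
import numpy as np

def weighted_quantile(y, w, alpha):
    """Smallest y whose weighted empirical CDF is >= alpha."""
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    idx = min(np.searchsorted(cdf, alpha, side="left"), len(y) - 1)
    return y[order][idx]

def empirical_decomposition(y_M, w_M, y_F, w_F, psi_F, alpha):
    """Return (unexplained, explained) parts of the gap at level alpha.

    unexplained = Q_M - Q_C and explained = Q_C - Q_F, where Q_C is the
    counterfactual quantile of Expression (15) with weights psi_k * w_k.
    """
    q_M = weighted_quantile(y_M, w_M, alpha)
    q_F = weighted_quantile(y_F, w_F, alpha)
    q_C = weighted_quantile(y_F, psi_F * w_F, alpha)
    return q_M - q_C, q_C - q_F
```

When all ${\widehat{\psi }}_k$ equal 1, the counterfactual quantile coincides with the women's quantile and the explained part is zero.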

An alternative approach is to use the following model counterfactual CDF at the $U_F$ level

$$\begin{aligned} F^C_N(y)=\frac{1}{N_C} \sum _{k \in U_F} \psi _k F_{D(h({\textbf{x}}_{k, F}^{\top } \varvec{\beta }_F), \varvec{\delta }_F)}(y \mid \textbf{x}_{k, F}), \end{aligned}$$

(16)

where $N_C=\sum _{k\in U_F} \psi _k$ and $\varvec{\beta }_F$, $\varvec{\delta }_F$ are parameters of the distribution D(.) defined on $U_F$. We estimate $F^C_N(y)$ by

$$\begin{aligned} {\widehat{F}}^{C}_N(y)=\frac{\sum _{k \in S_F} {\widehat{\psi }}_k w_kF_{D(h({\textbf{x}}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)}(y \mid \textbf{x}_{k, F})}{\sum _{k \in S_F} {\widehat{\psi }}_k{w}_k}, \end{aligned}$$

where ${\widehat{\varvec{\delta }}}_F$ and ${\widehat{\varvec{\beta }}}_F$ are computed on $S_F$ and ${\widehat{\psi }}_k$ is computed, for instance, with the help of the calibration approach of Anastasiade and Tillé (2017), using the raking method; this represents a nonparametric estimation of ${\psi }_k$ in contrast to the method of DiNardo et al. (1996) which uses a logistic regression model. Next, the estimator of ${Q}_{(\alpha )}^{C}$ is given by

$$\begin{aligned} {\widehat{Q}}_{(\alpha )}^{C}=\left[ {\widehat{F}}^{C}_N(\alpha )\right] ^{-1}. \end{aligned}$$

(17)
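The inversion in Expression (17) can be sketched numerically. The Python example below is a minimal sketch under stated assumptions: the conditional distribution D is taken to be log-normal with location $\beta _0+\beta _1 x_k$ (a choice made here only because its CDF is available in the standard library), the weights stand for ${\widehat{\psi }}_k w_k$, and the inverse is found by bisection. All function names are hypothetical.

```python
import math


def lognormal_cdf(y, mu, sigma):
    """CDF of a log-normal distribution with log-scale mean mu and sd sigma."""
    if y <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(y) - mu) / (sigma * math.sqrt(2.0))))


def model_cdf(y, x_values, weights, beta0, beta1, sigma):
    """Weighted average of the conditional CDFs, as in the estimator of F^C_N."""
    total = sum(weights)
    acc = sum(wt * lognormal_cdf(y, beta0 + beta1 * x, sigma)
              for x, wt in zip(x_values, weights))
    return acc / total


def model_quantile(alpha, x_values, weights, beta0, beta1, sigma):
    """Invert the model CDF at alpha by bisection (0 < alpha < 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while model_cdf(hi, x_values, weights, beta0, beta1, sigma) < alpha:
        hi *= 2.0  # expand the bracket until it contains the quantile
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if model_cdf(mid, x_values, weights, beta0, beta1, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two units with x = 0 and x = 2 and equal weights: by symmetry the median
# of the mixture on the log scale is 1, so the median wage is e.
q_half = model_quantile(0.5, [0.0, 2.0], [1.0, 1.0], 0.0, 1.0, 1.0)
```

Because the model CDF is a smooth mixture of continuous CDFs, any bracketing root finder can replace the bisection step.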

If the inverse function of ${\widehat{F}}^C_N(.)$ cannot be computed or its numerical approximation is slow, the following Monte Carlo method based on parametric bootstrap similar to the one given in Sect. 4 is used:

1. Generate a large number m of $n_F$ independent draws from the distribution $D(h(\textbf{x}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)$, $k\in S_F$, respectively. A matrix of dimension $m\times n_F$ of such draws is obtained. Each element $(i, k), i=1, \dots , m, k=1, \dots , n_F$ in this matrix is the realization $y_{i, k}$ of a random variable $Y_{i, k}$ with $Y_{i, k}\sim D(h(\textbf{x}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)$; given $\textbf{x}_F$ all the random variables $Y_{i, k}$ are independent.
2. Associate with each element $(i, k), i=1,\dots , m, k=1, \dots , n_F$ the weight ${\widehat{\psi }}_k w_k, k \in S_F$ and compute the empirical weighted quantile of order $\alpha \in [0, 1]$ of the counterfactual wage distribution by
$$\begin{aligned} {\widehat{Q}}^{(i), C}_{(\alpha )}=\left[ {\widehat{F}}_{\textrm{emp}, i}^C(\alpha )\right] ^{-1}, \end{aligned}$$
where ${\widehat{F}}_{\textrm{emp}, i}^C(y)=\sum _{k=1}^{n_F} {\widehat{\psi }}_k w_k I(y_{i, k}\le y)/\sum _{k=1}^{n_F} {\widehat{\psi }}_kw_k$.
3. For each $\alpha \in [0, 1]$, compute the mean of the ${\widehat{Q}}^{(i), C}_{(\alpha )}, i=1, \dots , m;$ this mean represents an estimate of the quantile of order $\alpha $ of the counterfactual wage distribution.
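The three steps above can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: the conditional distribution D is assumed to be log-normal with location $\beta _0+\beta _1 x_k$ so that only the standard library is needed, the weights stand for ${\widehat{\psi }}_k w_k$, and the function names and example inputs are hypothetical.

```python
import math
import random


def weighted_quantile(values, weights, alpha):
    """Smallest value whose weighted empirical CDF is at least alpha."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum / total >= alpha:
            return v
    return pairs[-1][0]


def bootstrap_quantile(x_values, weights, beta0, beta1, sigma, alpha, m, rng):
    """Monte Carlo estimate of the counterfactual alpha-quantile."""
    run_quantiles = []
    for _ in range(m):
        # Step 1: one independent draw per sampled unit k (row i of the matrix).
        y_i = [math.exp(beta0 + beta1 * x + sigma * rng.gauss(0.0, 1.0))
               for x in x_values]
        # Step 2: weighted empirical quantile of row i, weights psi_k * w_k.
        run_quantiles.append(weighted_quantile(y_i, weights, alpha))
    # Step 3: average the m quantiles.
    return sum(run_quantiles) / m


# Hypothetical inputs: covariates of 300 sampled women and weights
# psi_k * w_k with w_k = 1 and psi_k = exp(4.5 - x_k).
rng = random.Random(42)
x_s = [rng.gauss(5.0, 1.0) for _ in range(300)]
wts = [math.exp(4.5 - x) for x in x_s]
q_med = bootstrap_quantile(x_s, wts, 1.10, 1.0, 1.0, 0.5, m=100, rng=rng)
```

The weights are computed once and reused in every run, which is the point made in Remark 2 about their being fixed within the algorithm.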

Remark 2

1.
The method to compute ${\widehat{Q}}^{(i), C}_{(\alpha )}$ uses random weights ${\widehat{\psi }}_k w_k, k \in S_F$. Its computation is reliable because ${\widehat{\psi }}_k w_k, k \in S_F$ are fixed in each run of the algorithm.
2.
The reweighting factor $\psi _k$ does not allow the computation of the wage variable corresponding to the counterfactual distribution given in Expression (7) (the variable $\psi _k Y_k^F$ has a different CDF), but only the estimation of some of its parameters.
3.
As for gender wage quantile estimation, the use of auxiliary information $\textbf{x}_{k, F}$ in estimating $F^{C}_N(y)$ is expected to reduce the variance of the estimator given in Expression (17), compared to that of the estimator given in Expression (15). In Sect. 6.2, we show some Monte Carlo results that support the variance reduction of the two parametric methods compared to the other two competitors.

6 Application using the GB2 distribution

6.1 The GB2 regression model

We illustrate the two parametric methods to estimate the structure and composition effects at the quantile level using the GB2 distribution. The GB2 distribution is characterized by four parameters, namely a, b, p and q. McDonald and Xu (1995) and Kleiber and Kotz (2003) use the following probability density function of a ${\text {GB}}2(a, b, p, q)$ distribution

$$\begin{aligned} f(y; a,b,p,q) = \dfrac{\mid a\mid y^{ap-1}}{b^{ap} \textrm{B}(p,q)\left[ 1+\left( \dfrac{y}{b}\right) ^a\right] ^{p+q}}, y>0, \end{aligned}$$

(18)

where $\textrm{B}(p, q)$ represents the function ${\text {Beta}}(p, q)$ with arguments p and q. Using the notation of Graf et al. (2011) and Graf and Nedyalkova (2015), Equation (18) is rewritten as

$$\begin{aligned} f(y; a,b,p,q) = \dfrac{a\left( \dfrac{y}{b}\right) ^{ap-1}}{b\textrm{B}(p,q)\left[ 1+\left( \dfrac{y}{b}\right) ^a\right] ^{p+q}}, y>0. \end{aligned}$$

(19)

The parameters a, p and q are shape parameters and b is the scale parameter (Kleiber and Kotz 2003). All of them are strictly positive. The peak of the distribution is controlled by a, while the other two shape parameters control for the left and the right tail, respectively. The GB2 distribution can be positively or negatively skewed, depending upon the values of p and q.
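To make the two forms of the density concrete, the following Python sketch evaluates Expression (19) on the log scale and compares it with a direct evaluation of Expression (18). The function names are hypothetical, and the parameter values in the example are chosen only for illustration.

```python
import math


def gb2_pdf(y, a, b, p, q):
    """GB2 density in the form of Expression (19), computed on the log scale."""
    if y <= 0.0:
        return 0.0
    log_ratio = math.log(y) - math.log(b)      # log(y / b)
    z = a * log_ratio                          # log((y / b)^a)
    # Numerically stable log(1 + exp(z)).
    if z > 0.0:
        log1pexp = z + math.log1p(math.exp(-z))
    else:
        log1pexp = math.log1p(math.exp(z))
    log_beta = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    log_f = (math.log(a) + (a * p - 1.0) * log_ratio - math.log(b)
             - log_beta - (p + q) * log1pexp)
    return math.exp(log_f)


def gb2_pdf_eq18(y, a, b, p, q):
    """Direct evaluation of Expression (18), for comparison."""
    beta_pq = math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))
    return (abs(a) * y ** (a * p - 1.0)
            / (b ** (a * p) * beta_pq * (1.0 + (y / b) ** a) ** (p + q)))


# Midpoint-rule check that the density integrates to about 1 on (0, 20].
h = 0.001
total_mass = h * sum(gb2_pdf((i + 0.5) * h, 8.0, 1.0, 0.5, 0.9)
                     for i in range(20000))
```

For $a>0$ the two expressions agree, since $(y/b)^{ap-1}/b = y^{ap-1}/b^{ap}$.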

We borrow from McDonald and Butler (1990) the idea of changing the scale parameter b, by expressing it as a function of the observed characteristics of the employees. The framework can also be expressed as a regression model

$$\begin{aligned} \textrm{log}(Y_k) = \textbf{X}_k^{\top } \varvec{\beta }+ \textrm{log}(\varepsilon _k), \end{aligned}$$

(20)

where $\varepsilon _k \sim {\text {GB}}2(a,1,p,q)$. As $\varepsilon _k \sim {\text {GB}}2(a,1,p,q)$, we have that $Y_k\mid \textbf{X}_k=\textbf{x}_k \sim {\text {GB}}2(a, \textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }), p, q)$ (see McDonald and Butler 1990). Since $\varepsilon _k$ follows a GB2 distribution, we refer to the model in Eq. (20) as a GB2 regression model.
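The scale property used here can be checked by a change of variable; the short derivation below is our own sketch, with $b_k$ and $f_{Y_k\mid \textbf{x}_k}$ introduced only for this illustration. Writing $b_k=\textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta })$, Eq. (20) gives $Y_k=b_k\varepsilon _k$, so that

$$\begin{aligned} f_{Y_k\mid \textbf{x}_k}(y) = \dfrac{1}{b_k} f\left( \dfrac{y}{b_k}; a,1,p,q\right) = \dfrac{a\left( \dfrac{y}{b_k}\right) ^{ap-1}}{b_k\textrm{B}(p,q)\left[ 1+\left( \dfrac{y}{b_k}\right) ^a\right] ^{p+q}}, y>0, \end{aligned}$$

which is Expression (19) with $b=b_k$. The covariates therefore act only on the scale parameter, while a, p and q are common to all units.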

In each group $g\in \{M, F\}$, we assume that the conditional wage of $k\in U_g$, $Y_k\mid \textbf{X}_{k, g}=\textbf{x}_k \sim {\text {GB}}2(a_g, \textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g), p_g, q_g)$. Thus, for each $k\in U_g$, Expression (19) becomes

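$$\begin{aligned} f(y; a_g, \textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g), p_g, q_g) = \dfrac{a_g\left( \dfrac{y}{\textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g)}\right) ^{a_gp_g-1}}{\textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g)\textrm{B}(p_g,q_g)\left[ 1+\left( \dfrac{y}{\textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g)}\right) ^{a_g}\right] ^{p_g+q_g}}, y>0. \end{aligned}$$

(21)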

We use the maximum pseudo-likelihood method to fit GB2 regression models using survey weights. The sandwich estimator (Huber 1967; Freedman 2006; Graf et al. 2011) or parametric bootstrap can be used to estimate the standard errors of the estimated parameters. We describe in “Appendix” the entire approach.

Biewen and Jenkins (2005) suggested expressing all four parameters of the GB2 distribution as functions of the observed characteristics. However, they note that “there would be too many parameters to be estimated, and variance calculations for the statistics of interest are rather complicated. Also we found that estimation often led to numerical problems.” We also note that if the survey weights are skewed, the estimation of the parameters is numerically complicated. In order to avoid all these problems, we express in our examples only the scale parameter as a function of the observed characteristics.

6.2 Monte Carlo studies

Monte Carlo simulation was used to show the performances of the two parametric methods when the quantiles of the counterfactual wage distribution are estimated. Three settings have been employed as follows:

Setting 1, where we generate a conditional wage distribution for women, $Y_{k, F} = \textrm{exp}[1.10+X_{k, F}+\varepsilon _{k, F}]$, where $\varepsilon _{k, F} \sim N(0, 1)$ are iid, $X_{k, F} \sim N(5, 1)$, $k=1, \dots N_F$, with $N_F=50{,}000$. The covariate for the men is $X_{k, M} \sim N(4, 1)$, iid, $k=1, \dots N_M$, with $N_M=N_F$. The correlation between $\textrm{log}(Y_F)$ and $X_F$ is about 0.70.
Setting 2, where we generate a conditional wage distribution for women, $Y_{k, F} = \textrm{exp}[1.44+0.15X_{k, F}+\textrm{log}(\varepsilon _{k, F})]$, where $\varepsilon _{k, F} \sim {\text {GB}}2(8,1,0.50,0.90)$ are iid, $X_{k, F} \sim {\text {Gamma}}(9, 2)$, $k=1, \dots N_F$, with $N_F=50{,}000$. The covariate for the men is $X_{k, M} \sim {\text {Gamma}}(10, 2)$, iid, $k=1, \dots N_M$, with $N_M=N_F$. The correlation between $\textrm{log}(Y_F)$ and $X_F$ is about 0.60.
Setting 3 is similar to Setting 2, using the same $N_F, X_{k, F}, X_{k, M}$ and $\varepsilon _{k, F}$, but $Y_{k, F} = \textrm{exp}[1.44+0.07X_{k, F}+\textrm{log}(\varepsilon _{k, F})], k=1, \dots N_F$. The correlation between $\textrm{log}(Y_F)$ and $X_F$ is about 0.30.
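The data-generating step of Setting 1 can be sketched as follows, together with a generator for the GB2 errors of Settings 2 and 3. This is an illustrative Python sketch with hypothetical names and an arbitrary seed. The GB2 generator relies on the standard representation $b\,(G_1/G_2)^{1/a}$ with independent $G_1\sim {\text {Gamma}}(p, 1)$ and $G_2\sim {\text {Gamma}}(q, 1)$; the Gamma covariates of Settings 2 and 3 are not generated here, because the parametrization of ${\text {Gamma}}(9, 2)$ (rate or scale) is not restated in this section.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def draw_gb2(a, b, p, q, rng):
    """One GB2(a, b, p, q) draw via a ratio of two unit-scale Gamma draws."""
    g1 = rng.gammavariate(p, 1.0)
    g2 = rng.gammavariate(q, 1.0)
    return b * (g1 / g2) ** (1.0 / a)


rng = random.Random(2023)
N = 50_000

# Setting 1: women's covariate and wages, men's covariate.
x_f = [rng.gauss(5.0, 1.0) for _ in range(N)]
log_y_f = [1.10 + x + rng.gauss(0.0, 1.0) for x in x_f]
y_f = [math.exp(v) for v in log_y_f]
x_m = [rng.gauss(4.0, 1.0) for _ in range(N)]

corr_setting1 = pearson(log_y_f, x_f)  # theoretical value 1 / sqrt(2)

# Errors of Settings 2 and 3.
eps = [draw_gb2(8.0, 1.0, 0.50, 0.90, rng) for _ in range(1000)]
```

In Setting 1 the covariate and the error have the same variance, which gives a theoretical correlation of $1/\sqrt{2}\approx 0.71$, in line with the value of about 0.70 reported above.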

At the superpopulation level, the counterfactual distribution uses the factor ${\psi }(x)={\textrm{d}}F^{X_M}(x)/{\textrm{d}}F^{X_F}(x)$. For Setting 1, ${F}^{(Y_F\mid X_F)}(y\mid x)$ is the CDF of the log-normal distribution with parameters $\varvec{\mu }=\textbf{x}_F ^{\top }\varvec{\beta }_1$ and $\sigma ^2=1$, where $\varvec{\beta }_1=(1.10, 1)'$. For Setting 2, ${F}^{(Y_F\mid X_F)}(y\mid x)$ is the CDF of the distribution ${\text {GB}}2(8, \textrm{exp}[\textbf{x}_F ^{\top }\varvec{\beta }_2], 0.50, 0.90)$, with $\varvec{\beta }_2=(1.44, 0.15)'$; Setting 3 is analogous, with the slope 0.07 in place of 0.15. For all settings, the quantile $Q_{(\alpha )}^C$ is computed using the inverse of $F^C(\alpha )$ given in Expression (8). $F^C(.)$ is used because two different CDFs are employed at the finite population level, given, respectively, by Expressions (14) and (16). $F^C(\alpha )$ is computed using Monte Carlo integration, with 10,000,000 runs; its inverse at the point $\alpha $ is computed using a numerical method.
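For Setting 1, the factor $\psi (x)$ and the counterfactual quantiles admit simple closed forms, which can serve as a cross-check. The Python sketch below is our own illustration, not the Monte Carlo integration described above, and its function names are hypothetical. It assumes that the counterfactual distribution integrates the women's conditional CDF against the men's covariate distribution, as the factor $\psi (x)$ suggests. Under that assumption $\psi (x)=\textrm{exp}(4.5-x)$ and the counterfactual log-wage is $N(5.10, 2)$.

```python
import math
import random
from statistics import NormalDist

MEN_X = NormalDist(4.0, 1.0)     # covariate distribution for men
WOMEN_X = NormalDist(5.0, 1.0)   # covariate distribution for women
STD_NORMAL = NormalDist()


def psi_setting1(x):
    """Density ratio dF^{X_M}(x) / dF^{X_F}(x); equals exp(4.5 - x)."""
    return MEN_X.pdf(x) / WOMEN_X.pdf(x)


def counterfactual_quantile_setting1(alpha):
    """Closed-form quantile: log-wage 1.10 + X + eps, X ~ N(4, 1), eps ~ N(0, 1)."""
    return math.exp(5.10 + math.sqrt(2.0) * STD_NORMAL.inv_cdf(alpha))


# Reweighting women's covariates by psi should recover the men's mean of 4.
rng = random.Random(11)
x_f = [rng.gauss(5.0, 1.0) for _ in range(50_000)]
psi = [psi_setting1(x) for x in x_f]
reweighted_mean = sum(p * x for p, x in zip(psi, x_f)) / sum(psi)

q_median = counterfactual_quantile_setting1(0.5)  # exp(5.10)
```

No comparable closed form is derived here for Settings 2 and 3, where numerical integration remains necessary.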

We use r runs and draw in each one a random sample of women and men, respectively. In Setting 1, the number of runs equals 10,000, and in Settings 2 and 3, due to the time-consuming process of fitting a GB2 distribution, we use only 1000 runs. In Setting 1, we select samples of women and men, respectively, by simple random sampling without replacement, with sample sizes $n_F=n_M=1000$. In Settings 2 and 3, we employ systematic sampling with unequal probabilities for both samples with $n_F=n_M=10{,}000$, where the inclusion probabilities are proportional to $x_F$ and $x_M$, respectively; these two settings are close to the framework used by the application given in Sect. 6.3.
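Systematic sampling with unequal probabilities, as used in Settings 2 and 3, can be sketched as follows. This Python sketch is an assumption-laden illustration: the units are taken in their given order (the ordering used in the simulations is not stated here), the size variable and the function name are hypothetical, and all inclusion probabilities are required to be below 1.

```python
import random


def systematic_pps_sample(sizes, n, rng):
    """Systematic sample of fixed size n with probabilities proportional to sizes.

    Returns the selected indices and the full list of inclusion probabilities.
    """
    total = sum(sizes)
    pi = [n * s / total for s in sizes]
    if max(pi) >= 1.0:
        raise ValueError("an inclusion probability is >= 1; reduce n")
    next_point = rng.random()   # random start in [0, 1)
    cum = 0.0
    sample = []
    for k, p in enumerate(pi):
        cum += p
        # Unit k is selected when a selection point falls in its interval.
        if next_point < cum:
            sample.append(k)
            next_point += 1.0
    return sample, pi


# Hypothetical population of 1000 units with a positive size variable.
rng = random.Random(7)
sizes = [rng.uniform(1.0, 3.0) for _ in range(1000)]
sample, pi = systematic_pps_sample(sizes, 100, rng)

# Design weights w_k = 1 / pi_k for the sampled units.
weights = [1.0 / pi[k] for k in sample]
```

With inclusion probabilities proportional to the covariate, units with large covariate values are over-represented in the sample and receive small weights, which is why the weighted estimators above are needed.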

In each run of the Monte Carlo simulation, we computed the quantiles of order 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95% and 99%, respectively, of the counterfactual wage distribution using the estimators given by Methods 1 and 2, the method of Anastasiade and Tillé (2017) (with raking calibration; hereafter, the calibration method) and the weighted version of the method of DiNardo et al. (1996) (hereafter, weighted DFL).

For each generic estimator ${\widehat{Q}}_{(\alpha )}^C$ of $Q_{(\alpha )}^C$, the following Monte Carlo measures were used:

the Monte Carlo relative bias (in percentages)
$$\begin{aligned} {\text {RB}}_{{\textrm{MC}}}({\widehat{Q}}_{(\alpha )}^C)=100\times \left( E_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)-Q_{(\alpha )}^C\right) /Q_{(\alpha )}^C, \end{aligned}$$
where $E_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)=\sum _{i=1}^r {\widehat{Q}}_{i, (\alpha )}^C/r$, and ${\widehat{Q}}_{i, (\alpha )}^C$ is the quantile estimator of $Q_{(\alpha )}^C$ computed in the ith run;
the Monte Carlo variance
$$\begin{aligned} {\text {Var}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=\frac{1}{r-1} \sum _{i=1}^r \left[ {\widehat{Q}}_{i, (\alpha )}^C-E_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )\right] ^2; \end{aligned}$$
the Monte Carlo root mean square error (RMSE)
$$\begin{aligned} {\text {RMSE}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=\left[ {\text {Var}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big ) +\left( B_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )\right) ^2\right] ^{1/2}, \end{aligned}$$
where $B_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)=E_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)-Q_{(\alpha )}^C$;
the Monte Carlo coefficient of variation (in percentages)
$$\begin{aligned} {\text {CV}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=100\times \left( {\text {Var}}_{\textrm{MC}} \big ({\widehat{Q}}_{(\alpha )}^C\big )\right) ^{1/2}/E_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big ). \end{aligned}$$
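The four measures can be computed from the r simulated estimates of one quantile as in the following Python sketch; the function name and the toy numbers are hypothetical.

```python
import math


def monte_carlo_measures(estimates, true_value):
    """Relative bias (%), variance, RMSE and CV (%) of r simulated estimates."""
    r = len(estimates)
    mean = sum(estimates) / r                        # E_MC
    bias = mean - true_value                         # B_MC
    var = sum((e - mean) ** 2 for e in estimates) / (r - 1)
    return {
        "rb_pct": 100.0 * bias / true_value,
        "var": var,
        "rmse": math.sqrt(var + bias ** 2),
        "cv_pct": 100.0 * math.sqrt(var) / mean,
    }


# Toy example: three runs, true quantile equal to 10.
measures = monte_carlo_measures([9.0, 11.0, 13.0], 10.0)
```

In this toy example the mean is 11, so the relative bias is 10%, the variance is 4, the RMSE is $\sqrt{5}$ and the coefficient of variation is $100\times 2/11\approx 18.2\%$.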

For the estimators corresponding to Methods 1 and 2, we estimated the parameters of the women’s wage distribution at each run using the corresponding weights of women selected in the women’s sample, as well as the estimated factor $\psi _k$ given by the method of Anastasiade and Tillé (2017) with raking calibration. The latter was also used to compute in each run the calibration estimator for each quantile of the counterfactual distribution. Similarly, the factor $\psi _k$ for the weighted DFL method was estimated in each run. We used a weighted logistic regression to compute $P(G_k=1 \mid x_k)$ and $P(G_k=0 \mid x_k)$, while $P(G_k=1)$ and $P(G_k=0)$ were estimated by weighted means $\sum _{k\in S_g} w_k/\sum _{k\in S} w_k, g\in \{M, F\};$ see Expression (9). All the results were computed in R (R Core Team 2022). The weighted empirical quantiles were computed using the function wtd.quantile from the R package Hmisc (Harrell 2022), while the inverse of a CDF at the point $\alpha $ was computed using the R base function uniroot. We used 1000 bootstrap runs in Method 2.
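The weighted DFL step can be sketched as follows. This Python sketch is not the R code used for the results. It assumes the usual form of the DFL factor, namely the estimated odds of belonging to the men's group given $x_k$ multiplied by the ratio of the weighted group shares, and it assumes the coding $G_k=1$ for men; Expression (9) is not restated in this section, so both points should be checked against it. The function names, the Newton-Raphson fit with one covariate and the simulated data are hypothetical.

```python
import math
import random


def fit_weighted_logit(x, g, w, iters=25):
    """Weighted logistic regression of g on (1, x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, gi, wi in zip(x, g, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = wi * (gi - p)        # weighted score contribution
            v = wi * p * (1.0 - p)       # weighted information contribution
            g0 += resid
            g1 += resid * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def dfl_factor(x_women, b0, b1, share_f, share_m):
    """psi_k = [P(M | x_k) / P(F | x_k)] * [P(F) / P(M)] for each woman."""
    return [math.exp(b0 + b1 * x) * share_f / share_m for x in x_women]


# Simulated samples in the spirit of Setting 1, with equal weights.
rng = random.Random(3)
x_m = [rng.gauss(4.0, 1.0) for _ in range(3000)]
x_f = [rng.gauss(5.0, 1.0) for _ in range(3000)]
x_all = x_m + x_f
g_all = [1] * len(x_m) + [0] * len(x_f)     # assumed coding: 1 = men
w_all = [1.0] * len(x_all)

b0_hat, b1_hat = fit_weighted_logit(x_all, g_all, w_all)
w_m, w_f = sum(w_all[:len(x_m)]), sum(w_all[len(x_m):])
share_m, share_f = w_m / (w_m + w_f), w_f / (w_m + w_f)
psi_hat = dfl_factor(x_f, b0_hat, b1_hat, share_f, share_m)
```

For these simulated data the true log-odds are $4.5-x$, so the fitted slope should be close to $-1$ and ${\widehat{\psi }}_k$ close to $\textrm{exp}(4.5-x_k)$.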

All the estimators used are biased with respect to the sampling design. The values of the Monte Carlo relative bias in percentages are shown in Tables 1, 5 and 9 for the three settings, while the values of the Monte Carlo variance are given in Tables 2, 6 and 10, respectively; the Monte Carlo root mean square errors are reported in Tables 3, 7 and 11, respectively. The Monte Carlo coefficients of variation are given in Tables 4, 8 and 12, respectively. We note that Method 1 and Method 2 provide very close values of the Monte Carlo measures in all three settings.

The estimator of Anastasiade and Tillé (2017) using calibration is used in Figs. 1, 2 and 3 as a benchmark in order to visualize the behavior of the other estimators at different quantiles. Since Method 1 and Method 2 provide almost identical results, only the results of Method 1 are shown in Figs. 1, 2 and 3. Figure 1 shows the ratio between the Monte Carlo bias $B_{\textrm{MC}}$ obtained by using Method 1, the weighted DFL and that of the calibration method for each of the quantiles of order 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95% and 99%. Figure 2 provides the ratio between the Monte Carlo variance of Method 1, the weighted DFL and that of the calibration method for each quantile. Similarly, Fig. 3 shows the ratio of the Monte Carlo RMSEs.

In Setting 1, the two parametric methods show a smaller value of the Monte Carlo relative bias than the weighted DFL and the calibration estimator at each quantile. The two methods also provide a substantial reduction of the Monte Carlo variance at each quantile, and a good behavior with respect to the RMSE over the calibration and weighted DFL estimators (see Fig. 1). The estimators obtained by the two parametric methods also display a smaller coefficient of variation than the reweighting estimators at all quantiles as shown in Table 4.

For Setting 2, the Monte Carlo expectations of the estimated parameters of the GB2 distribution are, for a, $\beta _0$, $\beta _1$, p and q, respectively, 8.02, 1.44, 0.15, 0.50 and 0.91, showing that the estimates are approximately unbiased under the sampling design for large sample sizes. In Setting 2, the two parametric methods result in estimators that have a lower Monte Carlo variance than the calibration and the weighted DFL estimators at almost all quantiles (see Fig. 2). As in Setting 1, the Monte Carlo coefficients of variation of the estimators obtained using the two parametric methods are smaller than those of the two reweighting methods (see Table 8). The parametric methods sometimes show a larger bias and root mean square error than the calibration estimator, but provide a reduction of the Monte Carlo variance at each quantile (except for the quantile of order 20%; see also Fig. 2). Compared to Setting 1, note that the correlation between $\textrm{log}(Y_F)$ and $X_F$ is weaker (0.60 compared to 0.70).

Setting 3 shows a smaller correlation between $\textrm{log}(Y_F)$ and $X_F$ (about 0.30) compared to Setting 2; this is similar to the correlation between the logarithm of women's wages and age in the application given in Sect. 6.3. This correlation reduction is visible in the behavior of the Monte Carlo variance and relative bias of the two parametric methods. The parametric methods still provide a reduction of the Monte Carlo variance for most quantiles (except for the quantiles of order 30%, 80% and 95%; see also Fig. 3) compared to the two competitors. The value of the Monte Carlo relative bias of the two parametric methods is larger than in Setting 2 for the quantiles of order 20% and 70%. Despite the lower correlation between $\textrm{log}(Y_F)$ and $X_F$, the shapes of the Monte Carlo RMSE of the two parametric methods are similar to the ones provided by Setting 2; see the last plot in Figs. 1 and 2, respectively. The Monte Carlo coefficients of variation of Method 1 and Method 2 also show reduced values compared to the other two methods (see Table 12).

Table 1 Setting 1: Monte Carlo relative bias (in %) of the four estimators of the counterfactual wage quantiles
