1 Introduction

This paper is motivated by the growing interest in estimating the wage differences between men and women in official statistics. Applications in official statistics deal with random samples drawn from finite populations. Estimation is usually made in what one calls the design-based framework, where only the samples are random, while the variables collected from them are fixed. Thus, inference in finite populations may be very different from the one usually used in the classical statistics. In order to accommodate some theory from econometrics, we consider what one calls a superpopulation framework, assuming that our finite population is a random sample drawn from an infinite population. Next, the finite population is divided in two groups or subpopulations: men and women.

It is possible that men and women with the same characteristics earn different wages. The wages of men and women are modeled separately using a parametric model. Conditional on some characteristics, we assume that the conditional wage distribution of each individual into a group follows a given theoretical distribution with unknown parameters. Our goal is to capture the shape of the wage distributions and to go beyond the mean differences provided by the regression approach of Blinder (1973) and Oaxaca (1973), which is widely used by the world’s national statistical offices. We do this by determining the estimator of the differences between the gender wages at different quantiles. Following, for instance, Melly (2006) and Chernozhukov et al. (2013), we extend to quantiles the classical decomposition method of Blinder (1973) and Oaxaca (1973) for the mean, using the concept of counterfactual distribution. (For an overview, see Fortin et al. 2011.) The counterfactual distribution is estimated by putting together the parameters of one group and the characteristics of the latter group. This is done in order to estimate what the former group would earn, if they had the characteristics of the other group. We follow this guideline to estimate the wages of women as if they had the same characteristics of men. This leads to the estimation of differences between the gender wages conditionally on fixed covariates at different quantiles. We use a conditional distribution approach similar to the one used by Biewen and Jenkins (2005). First, we estimate the parameters of the distribution of each individual given their characteristics. Next, the marginal wage distribution is fitted based on the individual wage distributions. The parametric approach used in this paper has already been suggested in several papers in the decomposition literature, as, for instance, Biewen and Jenkins (2005), Van Kerm (2013) and Van Kerm et al. (2016).

The novelty of this paper is twofold. First, we use the parametric approach from a survey sampling perspective, underling the specific framework used in the design-based inference. While the main goal is to model wages by using heavy-tailed distributions and survey weights, we also use the parametric approach with a quite different interest. This is specific to survey sampling and was not previously investigated: This approach uses auxiliary information; if this is well correlated with the wages, the variance of the quantile counterfactual estimates may be reduced compared to those of some competitors. We use two parametric methods to estimate quantiles, by assuming a given theoretical distribution of conditional wages of men and women given their characteristics, respectively. While the first parametric method used is similar to one used by Biewen and Jenkins (2005), the second one is new and is introduced in this paper. The second method has the advantage of allowing an easy construction of confidence intervals of a quantile. Secondly, we want to illustrate the quantile decomposition of wages using data from official statistics and heavy-tailed distributions other than the log-normal distribution usually used in this domain (see, for instance, Leythienne and Ronkowski 2018).

Motivated by a flexible way to model income and wage distributions, we fit in our examples a generalized beta distribution of the second kind (hereafter, GB2) distribution to conditional wages. Following the work of Thurow (1970), who considered that “the beta distribution seems the most flexible” distribution to capture income changes, McDonald (1984) introduced the GB2 distribution to model income distributions. McDonald (1984), Bandourian et al. (2002) and McDonald and Ransom (2008) showed that the GB2 distribution provides a good fit for income. The GB2 distribution can be used to fit either positively or negatively skewed distributions and is a generalization of several distributions, such as the log-normal, the exponential or the Fisk distributions (Kleiber and Kotz 2003; McDonald 1984; McDonald and Xu 1995; McDonald and Butler 1990). This distribution is already well covered in the literature (see, for instance, Kleiber and Kotz 2003; Graf et al. 2011). We illustrate the two parametric methods using the GB2 distribution, with parameters estimated through maximum pseudo-likelihood, when survey weights and characteristics are associated with sampled employees, by expressing the scale parameter of a GB2 distribution as a function of their characteristics. We show in “Appendix” how to estimate the standard errors of the estimated parameters in a GB2 regression model, using a sandwich estimator and a parametric bootstrap approach.

This paper is structured as follows: In Sect. 2, we present the general setup and recall the classical decomposition method of Blinder (1973) and Oaxaca (1973), making the bridge with the estimation in the context of survey data. In Sect. 3, we discuss the concept of counterfactual wage distribution and show a decomposition method at quantiles’ level. The two parametric methods to estimate quantiles are presented in Sects. 4 and 5. The two methods are applied for the gender wage distributions and the counterfactual wage distribution, respectively. The Monte Carlo simulation results given in Sect. 6.2 show the methodological interest to use auxiliary information in the quantile counterfactual estimation. We illustrate the decomposition method at quantiles’ level in Sect. 6.3, by assuming that the conditional wage distribution for women and men follow, respectively, a GB2 distribution. The data used were obtained from the Swiss Federal Statistical Office and are issued from the Swiss survey on the structure of earnings in 2012. We draw our conclusions in Sect. 7.

2 Setup

Consider a finite population of employees with the labels \(U=\{1, 2, \ldots , N\}\). From this population, we randomly select a sample S of size n, without replacement. The sample is selected through a sampling design \(p(s) = \Pr (S=s), \forall s \subseteq U\). It is assumed that the sampling design is noninformative. To each unit \(k\in S\), a survey weight \(w_k\) is associated. These weights can be equal to the inverse of the inclusion probabilities or can be more complicated weights, like calibration weights. The set U is divided in two subsets with labels corresponding to men and women, denoted by \(U_M\) and \(U_F\), respectively, such that \(U_M \cup U_F = U\) and \(U_M \cap U_F = \emptyset \). Similarly, the sample S is divided into two random subsamples of men and women, denoted by \(S_M = S \cap U_M\) and \(S_F = S \cap U_F\), respectively. We denote these subsamples as \(S_g\subseteq U_g, g \in \{M,F\}\), with \(n_M\) and \(n_F\) being the number of employees in the subsamples, respectively, such that \(n_M+n_F =n\).

We work in a superpopulation framework and assume that the finite population is a random sample drawn from an infinite population. Let Y be the variable wage. First, we consider that Y is a random variable generated by a distribution model \(\xi \) in the infinite population. Next, the finite population \(\{ Y_1, Y_2,\ldots , Y_N\}\) is randomly generated from the model \(\xi \), where \(Y_k\) is the variable wage associated with each \(k\in U\). We assume that the estimation process refers to the infinite population parameter, and is executed in the design-based approach, considering, however, that \(Y_k\) associated with unit \(k\in U\) is random (see Särndal et al. 1992, p. 516, Case 4).

We also assume that a linear regression model that relates the logarithm of Y to some covariates \(X_1, X_2, \ldots , X_c\) holds. The covariates are the same in each \(U_g, g \in \{M,F\}\), but for coherence with the subsets’ notation we denote by \(X_{1, g}, X_{2, g}, \dots , X_{c, g}\) the covariates in group \(g \in \{M,F\}\). For each unit \(k\in U_g\), \(g \in \{M,F\}\), the wage is denoted by \(Y_{k, g}\) and the c covariates are stored in the vector

$$\begin{aligned} \textbf{X}_{k, g} = (1, X_{1k, g}, X_{2k, g}, \dots ,X_{ck, g} )^\top . \end{aligned}$$
(1)

One realization of \(\textbf{X}_{k, g}\) is denoted by \(\textbf{x}_{k, g} = (1, x_{1k, g}, x_{2k, g}, \dots ,x_{ck, g} )^{\top }\). The last c elements of the vector \(\textbf{x}_{k, g}\) represent realizations of variables \(X_{1, g}, X_{2, g}, \dots , X_{c, g}\), respectively, \(g \in \{M,F\}\). In what follows, we also denote by \(y_k\) a realization of \(Y_k, k\in U\) and use \(\textbf{X}_g=(X_{1, g}, X_{2, g}, \ldots , X_{c, g})\), \(g \in \{M,F\}\) ,with \(\textbf{x}_g\) one realization of \(\textbf{X}_g\).

2.1 The Blinder–Oaxaca-type decomposition method

We use what is called in econometrics a decomposition method. The general idea of decomposition methods is to divide the difference between wages of men and women in two elements: the first one is the part due to the difference in characteristics between them, and thus, it can be explained, while the second one is not. Starting with Blinder (1973) and Oaxaca (1973), many decomposition methods have been proposed, not only to decompose wage means, but also wage densities; for an overview, see Fortin et al. (2011).

Assume that the superpopulation is divided in two subsuperpopulations from where the subsets \(U_g, g=\{M,F\}\) are drawn, respectively. In each subsuperpopulation \({\text {SUP}}_g\), a linear relationship is suitable between the characteristics that are available and the logarithm of the wage. A linear regression model is fitted separately in each subsuperpopulation \({\text {SUP}}_g\) with \(g\in \{M,F\}\)

$$\begin{aligned} \textrm{log}(Y_{k, g})=\textbf{X}_{k, g}^\top \varvec{\beta }_g+\varepsilon _{k, g}, k\in {\text {SUP}}_g, \end{aligned}$$
(2)

where \(\varepsilon _{k, g}\sim N(0, \sigma ^2_g)\) are independent and identically distributed (iid), \(\varvec{\beta }_g\) represents the vector of regression coefficients and \(\sigma ^2_g\) is the variance of \(\textrm{log}(Y_{k, g})\mid \textbf{X}_{k, g}\) in \({\text {SUP}}_g\). The regression coefficients \({\varvec{\beta }}_g\) are called the group wage structure or the returns on characteristics, and they represent the contribution of each characteristic to the logarithm of the wage.

By using Model (2), one obtains the conditional expectation \(E({\widetilde{Y}}_g \mid \textbf{X}_g=\textbf{x}_g)=\textbf{x}_g^\top \varvec{\beta }_g\) and the unconditional expectation

$$\begin{aligned} E({\widetilde{Y}}_g)=E\left( E\big ({\widetilde{Y}}_g \mid \textbf{X}_g\big )\right) =E(\textbf{X}_{g})\varvec{\beta }_g+E(\varepsilon _{g})=E(\textbf{X}_{g})\varvec{\beta }_g, \end{aligned}$$

where \({\widetilde{Y}}_g\) represents \(\textrm{log}(Y_g)\), \(Y_g\) is the random variable wage in group g, \(\varepsilon _{g}\) is the random variable error term in the same group, and \(\textbf{X}_g\) and \(\varepsilon _{g}\) are independent.

The difference between the conditional expectations of the logarithm of wages of two groups (it is a Blinder–Oaxaca-type decomposition) can be written as

$$\begin{aligned} \begin{aligned} \Delta&=E({\widetilde{Y}}_M)-E({\widetilde{Y}}_F)\\&=E\left( E\big ({\widetilde{Y}}_M \mid \textbf{X}_M\big )\right) -E\left( E\big ({\widetilde{Y}}_F \mid \textbf{X}_F\big )\right) \\&=\left( E\left( \textbf{X}_{M})-E(\textbf{X}_{F}\right) \right) {\varvec{\beta }}_F +E(\textbf{X}_{M})\left( {\varvec{\beta }}_M-{\varvec{\beta }}_F\right) . \end{aligned} \end{aligned}$$
(3)

The term \(E(\textbf{X}_{M}){\varvec{\beta }}_F\) that appears in Expression (3) is called the women’s counterfactual mean of the logarithm of wage. We interpret it as the mean of the logarithm of wage of women if they had the same average characteristics as men and if their return on characteristics remained unchanged. This counterfactual exercise is also found in Fortin et al. (2011). Women’s counterfactual distribution of logarithm of wage is obtained by using the characteristics of men (\(\textbf{X}_M\)) and the wage structure of women (\(\varvec{\beta }_F\)).

The difference between the average of the logarithm of wages of the groups in Expression (3) contains two elements: an explained part, also called the composition effect (\(E(\textbf{X}_{M})-E(\textbf{X}_{F})){\varvec{\beta }}_F\), and an unexplained part, or the structure effect \(E(\textbf{X}_{M})({\varvec{\beta }}_M-{\varvec{\beta }}_F)\). The former encompasses differences in characteristics between the two groups. The latter is the difference in the returns on characteristics between the two groups, the part that is not attributable to objective factors (Oaxaca 1973; Blinder 1973). This is sometimes called “discrimination”; however, the term is no unanimously accepted. Popli (2013) comments that “the unexplained wage gap, which is often termed discrimination, includes the effect of labor market discrimination, unobservable variables and omitted variables.” If \({\varvec{\beta }}_M={\varvec{\beta }}_F\), this term is 0.

At \(U_g\) level, \(E(\textbf{X}_{g})\) is reduced to a finite mean \({\overline{\textbf{X}}}_{g}=\sum _{k\in U_g} \textbf{X}_{k, g} /N_g\), and the regression coefficients are given by

$$\begin{aligned} \varvec{\beta }_g = \left( \sum _{k \in U_g} \textbf{X}_{k, g} \textbf{X}_{k, g}^\top \right) ^{-1} \sum _{k \in U_g} \textbf{X}_{k, g} {\widetilde{Y}}_{k, g}, \quad g\in \{M,F\}, \end{aligned}$$
(4)

where \({\widetilde{Y}}_{k, g}=\textrm{log}(Y_{k, g}), k\in U_g\). The vector \(\varvec{\beta }_g\) can be consistently estimated from the subsamples \(S_g\) by

$$\begin{aligned} {\widehat{\varvec{\beta }}}_g = \left( \sum _{k \in S_g} w_k\textbf{x}_{k, g} \textbf{x}_{k, g}^\top \right) ^{-1} \sum _{k \in S_g} w_k\textbf{x}_{k, g} {\widetilde{y}}_{k, g}, \, g\in \{M,F\}, \end{aligned}$$
(5)

where \({\widetilde{y}}_{k, g}\) is the realization of \({\widetilde{Y}}_{k, g}, k\in S_g\).

The difference \(\Delta \) can be estimated at the sample level by

$$\begin{aligned} {\widehat{\Delta }} = \big (\widehat{{\overline{\textbf{X}}}}_M -\widehat{{\overline{\textbf{X}}}}_F\big )^{\top }{\widehat{\varvec{\beta }}}_F+ \widehat{{\overline{\textbf{X}}}}_M^{\top }\big ({\widehat{\varvec{\beta }}}_M-{\widehat{\varvec{\beta }}}_F\big ), \end{aligned}$$
(6)

where \(\widehat{{\overline{\textbf{X}}}}_g=\sum _{k\in S_g} w_k\textbf{x}_{k, g}/\sum _{k\in S_g} w_k\) represents the estimator of \({\overline{\textbf{X}}}_{g}\).

Estimating \({\widehat{\varvec{\beta }}}_M-{\widehat{\varvec{\beta }}}_F\) allows us to estimate the unexplained part at the mean level, using a log model of the wages. We are interested to estimate it at the quantiles’ level, using a more general framework that extends the log model.

3 Quantiles’ decomposition

On the superpopulation level, let \(F^{(Y_F\mid \textbf{X}_F)}(.)\) and \(F^{(Y_M\mid \textbf{X}_M)}(.)\) be the cumulative distribution functions (CDFs) of the conditional wage distributions of women and men, with respect to the characteristics \(\textbf{X}_F\) and \(\textbf{X}_M\), respectively. We also denote by \(F^{\textbf{X}_F}(.)\) and \(F^{\textbf{X}_M}(.)\) the CDFs of distributions corresponding to \(\textbf{X}_F\) and \(\textbf{X}_M\), respectively.

Recall that a counterfactual distribution is an artificial distribution, defined “as the result of either a change in the distribution of a set of covariates X that determine the outcome variable of interest Y, or as a change in the relationship of the covariates with the outcome, i.e., a change in the conditional distribution of Y given X” (Chernozhukov et al. 2013). We construct a counterfactual wage distribution as the distribution resulting from the change in the distribution of covariates. We build the counterfactual wage distribution of women using the characteristics of men. It can be interpreted as the wage distribution of women if they had the characteristics of men. This is done in order to compare the observed and the counterfactual wage distributions to measure the effects of the change on quantiles’ levels.

Let \(F^{C}(.)\) be the CDF of the counterfactual distribution of women. Following Chernozhukov et al. (2013), the CDF in the point \(y\in {\mathcal {Y}}_F\), where \({\mathcal {Y}}_F\) is women’s wage support is defined as

$$\begin{aligned} F^{C}(y) = \int _{{\mathcal {X}}_M} F^{(Y_F \mid \textbf{X}_F)} (y \mid \textbf{x}) dF^{\textbf{X}_M}(\textbf{x}), \end{aligned}$$
(7)

where \({\mathcal {X}}_M\) is the support of \(\textbf{X}_M\). The counterfactual wage distribution is well defined if the support of \(\textbf{X}_F\) (\({\mathcal {X}}_F\)) includes the support of \(\textbf{X}_M\): \({\mathcal {X}}_M\subseteq {\mathcal {X}}_F\).

The counterfactual wage is a potential wage of a woman if she matches the characteristics of a man. Expression (7) assumes that to each woman one can match the characteristics of a man. Under the assumption that \({{\mathcal {X}}}_M={{\mathcal {X}}}_F\), DiNardo et al. (1996) re-expressed the counterfactual distribution given in Expression (7) as

$$\begin{aligned} F^{{C}}(y) = \int _{{{\mathcal {X}}}_F} F^{(Y_F\mid \textbf{X}_F)}(y\mid \textbf{x}) \psi (\textbf{x}) {\textrm{d}} F^{\textbf{X}_F}(\textbf{x}), \end{aligned}$$
(8)

where \(\psi (\textbf{x})={\textrm{d}} F^{\textbf{X}_M} (\textbf{x})/{\textrm{d}} F^{\textbf{X}_F} (\textbf{x})\). DiNardo et al. (1996) rewrite the \(\psi (.)\) factor as

$$\begin{aligned} \psi (\textbf{x}_k)=\psi _k=\frac{P(G_k=1\mid \textbf{x}_k)/P(G_k=1)}{P(G_k=0\mid \textbf{x}_k)/P(G_k=0)}, \end{aligned}$$
(9)

where \(G_k = 1\) if individual k is a man and \(G_k = 0\) otherwise and \(\textbf{x}_k\) is the vector of observed characteristics for individual k. The parameter \(\psi (\textbf{x}_k)\) can be estimated by using a probit or a logistic regression model (DiNardo et al. 1996) or by calibration (Anastasiade and Tillé 2017); for the calibration method in survey sampling, see Deville and Särndal (1992). The difference between the two methods is discussed by Anastasiade and Tillé (2017).

The classical decomposition on the mean level (Blinder 1973; Oaxaca 1973) is re-expressed at quantile level (Melly 2006; Chernozhukov et al. 2013) as

$$\begin{aligned} \Delta _{(\alpha )}=Q_{(\alpha )}^{M}-Q_{(\alpha )}^{F}, \end{aligned}$$

with \(\alpha \in (0,1)\), where \({Q}_{(\alpha )}^{M}\) and \(Q_{(\alpha )}^{F}\) represent the quantile of order \(\alpha \) of the men and women wage distribution, respectively.

The change at quantile level is rewritten as

$$\begin{aligned} \Delta _{(\alpha )}=Q_{(\alpha )}^{M}-Q_{(\alpha )}^{F} =\left( Q_{(\alpha )}^{M}-Q_{(\alpha )}^{C}\right) +\left( Q_{(\alpha )}^{C}-Q_{(\alpha )}^{F}\right) , \end{aligned}$$

where \(Q_{(\alpha )}^{C}\) represents the quantile of order \(\alpha \) of the counterfactual distribution. The difference \(Q_{(\alpha )}^{M}-Q_{(\alpha )}^{C}\) is interpreted here as the unexplained part at the \(\alpha \) quantile level. Estimation of \(\Delta _{(\alpha )}\) results in a quantile estimation problem. To estimate the quantiles \({Q}_{(\alpha )}^{M}\) and \({Q}_{(\alpha )}^{F}\), we apply the methods shown in Sect. 4, while methods to estimate \({Q}_{(\alpha )}^{C}\) are given in Sect. 5.

4 Quantile estimation in finite populations

For simplicity of notation, the index g is suppressed in this section.

Let Y be a random variable defined over a superpopulation. In the classical statistical framework, we assume a joint distribution of \((Y, \textbf{X})\) and denote by \(F^Y(.)\) and \(F^\textbf{X}(.)\) the marginal CDF of Y and \(\textbf{X}\), respectively. We also assume that \(Y \mid \textbf{X}= \textbf{x}\sim D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })\), where \(D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })\) is a distribution with parameters \(h(\textbf{x}^{\top } \varvec{\beta })\) and \(\varvec{\delta }\). Note that h is a continuous function, and the first parameter is expressed using some characteristics \(\textbf{x}\) and some other parameters \(\varvec{\beta }\). The marginal distribution of Y has the CDF

$$\begin{aligned} F^{Y}(y)= \int F_{D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })} (y\mid \textbf{x}) {\textrm{d}}F{^\textbf{X}}(\textbf{x}), \end{aligned}$$

where \(F_{D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })}(.\mid \textbf{x})\) is the CDF of the distribution \(D(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta })\).

Given \(\textbf{X}=\textbf{x}\), at the U level, the parameters \(h(\textbf{x}^{\top } \varvec{\beta }), \varvec{\delta }\) are replaced by \(h(\textbf{x}^{\top }\varvec{\beta }_N), \varvec{\delta }_N\), respectively, where \(h(\textbf{x}^{\top }\varvec{\beta }_N), \varvec{\delta }_N\) are parameters computed on U. For instance, if a model similar to the one provided by Expression (2) holds, D is the log-normal distribution, h(.) is the exponential function, \(\varvec{\beta }_N\) is similarly defined as in Expression (4), and \(\varvec{\delta }_N\) is \(\sigma _N^2\), the error term.

Conditional on U and given \(\textbf{X}=\textbf{x}\), the CDF of the distribution of Y is expressed at the U level using the following mixture distribution

$$\begin{aligned} F^{Y}_N(y)=\frac{1}{N}\sum _{k\in U} F_{D(h(\textbf{x}_k^{\top } \varvec{\beta }_N), \varvec{\delta }_N)}(y \mid \textbf{x}_k ), {\text { for any }} y\in {\mathcal {R}}. \end{aligned}$$
(10)

Note that here we assume that \(F^{Y}_N(y)\) is a model CDF. This is in contrast to the approach given in the topic-related literature (Särndal et al. 1992, p.197), where the estimand is the finite population empirical distribution function \(F^Y_{\textrm{emp}}(y)\) given by

$$\begin{aligned} F_{\textrm{emp}}^Y(y)=\frac{\sum _{k \in U} I(y_k\le y)}{N}, \end{aligned}$$
(11)

where I(.) is the indicator function, with \(I(y_k\le y)=1\) if \(y_k\le y\), 0 otherwise. The CDF estimation in finite populations usually concerns the estimation of \(F^Y_{\textrm{emp}}(.)\) and not of \(F^Y_N(.)\). In order to compare the usual approach used in survey sampling and the parametric approach, the interest is here to estimate \(F^Y(.)\), because both \(F^Y_{\textrm{emp}}(.)\) and \(F^Y_N(.)\) are estimators of \(F^Y(.)\).

At the sample level, \(F_{\textrm{emp}}^Y(y)\) is estimated by

$$\begin{aligned} {\widehat{F}}_{\textrm{emp}}^Y(y)=\frac{\sum _{k \in S} w_k I(y_k\le y)}{\sum _{k \in S} w_k}, \end{aligned}$$

while the quantile \({Q}_{(\alpha )}\) of the distribution of Y is estimated by

$$\begin{aligned} {\widehat{Q}}_{(\alpha ), \textrm{emp}} = \left[ {\widehat{F}}^Y_{\textrm{emp}}(\alpha )\right] ^{-1}, \end{aligned}$$
(12)

where \(\left[ {\widehat{F}}^Y_{\textrm{emp}}(.)\right] ^{-1}\) denotes the inverse of \({\widehat{F}}^Y_{\textrm{emp}}(.)\).

For \(F_N^Y(.)\), the parameters \(\gamma _{k, N}=h(\textbf{x}_{k}^{\top } \varvec{\beta }_N)\) and \(\varvec{\delta }_N\) in Expression (10) are estimated, respectively, by \({\widehat{\gamma }}_{k, N}=h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}}_N)\) and \({\widehat{\varvec{\delta }}}_N\), where both estimators are computed on the sample, using a weighted approach with weights \(w_k\). The quantile estimation is done using two methods. The first method (denoted as “Method 1”) is based on the estimator of \(F^{Y}_N(y)\) given in Expression (13), and it is similar to the one used by Biewen and Jenkins (2005). As an alternative to Method 1, we propose in this paper a second one (denoted as “Method 2”) which is a simulation method.

  1. 1.

    Method 1 \(F^{Y}_N(y)\) is estimated on a sample S using a Hájek-type estimator as

    $$\begin{aligned} {\widehat{F}}^{Y}_N(y)=\sum _{k\in S} w_k F_{D({\widehat{\gamma }}_{k,N}, {\widehat{\varvec{\delta }}}_N)}(y\mid \textbf{x}_k)/\sum _{k\in S}{w_k}. \end{aligned}$$
    (13)

    Next, the quantile \({Q}_{(\alpha )}\) of the distribution of Y is estimated by

    $$\begin{aligned} {\widehat{Q}}_{(\alpha )}= \left[ {\widehat{F}}^Y_N(\alpha )\right] ^{-1}. \end{aligned}$$

    In many cases, \(\left[ {\widehat{F}}^Y_N(.)\right] ^{-1}\) is computed using a numerical method.

  2. 2.

    Method 2 If the inverse function of \({\widehat{F}}^{Y}_N(\alpha )\) cannot be computed (e.g., lack of monotony of \({\widehat{F}}^{Y}_N\)) or the numerical method is slow, we introduce and use the following Monte Carlo method based on parametric bootstrap:

    1. (a)

      Generate a large number m of n independent draws from the distribution \(D(h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}}_N), {\widehat{\varvec{\delta }}}_N)\), \(k\in S\), respectively. A matrix M of dimension \(m\times n\) of such draws is obtained. Each element \((i, k), i=1, \dots , m, k=1, \dots n\) in M is the realization \(y_{i, k}\) of a random variable \(Y_{i, k}\) with \(Y_{i, k}\sim D(h(\textbf{x}_{k}^{\top } {\widehat{\varvec{\beta }}_{N}}), {\widehat{\varvec{\delta }}_{N}});\) given \(\textbf{x}\) all the random variables \(Y_{i, k}\) are independent.

    2. (b)

      Associate with each element \((i, k), i=1, \dots , m, k=1, \dots , n\) of M the weight \(w_k, k \in S\) and compute the empirical weighted quantile of order \(\alpha \in [0, 1]\)

      $$\begin{aligned} {\widehat{Q}}^{(i)}_{\alpha , \textrm{emp}}=\left[ {\widehat{F}}^{Y}_{\textrm{emp}, i} (\alpha )\right] ^{-1}, \end{aligned}$$

      where \({\widehat{F}}^{Y}_{\textrm{emp}, i} (y)=\sum _{k=1}^{n} w_k I(y_{i, k}\le y)/\sum _{k=1}^{n} w_k\).

    3. (c)

      For each \(\alpha \in [0, 1]\), compute the mean of \({\widehat{Q}}^{(i)}_{\alpha , emp} (Y), i=1, \dots , m;\) this mean represents an estimator of the quantile of order \(\alpha \) of the distribution with the CDF given in Expression (10).

Method 2 allows an easy construction of an approximate \(100\times (1-\gamma )\%\) confidence interval \((\gamma \in (0,1))\) for a quantile using the method of percentile bootstrap confidence intervals. Conditional to the estimated parameters, each column of the previous matrix provides a set of independent estimates \({\widehat{Q}}^{(i)}_\alpha (Y)\) of a given quantile \(\alpha \). Next, the empirical quantiles of order \(100\times (\gamma /2)\%\) and \(100\times (1-\gamma /2)\%\) are computed. They form the lower and upper bounds of an approximate \(100\times (1-\gamma )\%\) confidence interval of a quantile of order \(\alpha \). Monte Carlo simulation results (not shown here) indicate coverage rates close to \(95\%\) for this method (\(\gamma =0.05\)).

Remark 1

Method 2 can be improved if the CDF is estimated using all \(m\times n\) simulated outcomes as follows

$$\begin{aligned} \sum _{i=1}^m\left( \sum _{k=1}^n w_kI(y_{i,k}\le y)/\sum _{k=1}^n w_k\right) /m. \end{aligned}$$

This CDF estimator can be inverted to obtain the quantile estimation at level \(\alpha \). Step (c) in Method 2 is no longer necessary. The same remark applies to the algorithm given in Sect. 5. Thus, one can improve the quantile estimator precision by using \(m\times n\) outcomes instead of m. However, the computation of an approximate \(100\times (1-\gamma )\%\) confidence interval of a quantile is no longer possible. We use in our results the first version of Method 2.

5 Quantile estimation of the counterfactual distribution

We are interested to estimate the quantiles of the counterfactual distribution. This is necessary for a comparison between them and the estimated quantiles of the unconditional distribution of wage of women and those of the men, respectively, as underlined in Sect. 3.

The empirical counterfactual CDF defined at the \(U_F\) level can be written as

$$\begin{aligned} F^{C}_{\textrm{emp}}(y)=\frac{\sum _{k \in U_F} \psi _k I(y_k\le y)}{\sum _{k\in U_F} \psi _k}. \end{aligned}$$
(14)

The weighted version of DiNardo et al. (1996) and Anastasiade and Tillé (2017) methods uses the estimated empirical counterfactual CDF defined by

$$\begin{aligned} {\widehat{F}}^{C}_{\textrm{emp}}(y)=\frac{\sum _{k \in S_F} {\widehat{\psi }}_k w_k I(y_k\le y)}{\sum _{k\in S_F} {\widehat{\psi }}_k w_k}, \end{aligned}$$

where \({\widehat{\psi }}_k\) is an estimator of \(\psi _k\) given in Expression (9). Next, both methods estimate the \(\alpha \)-quantile \(Q_{(\alpha )}^{C}\) of the counterfactual distribution using

$$\begin{aligned} {\widehat{Q}}_{(\alpha ), \textrm{emp}}^{C} =\left[ {\widehat{F}}^{C}_{\textrm{emp}} (\alpha )\right] ^{-1}. \end{aligned}$$
(15)

An opposite approach is to use the following model counterfactual CDF at the \(U_F\) level

$$\begin{aligned} F^C_N(y)=\frac{1}{N_C} \sum _{k \in U_F} \psi _k F_{D(h({\textbf{x}}_{k, F}^{\top } \varvec{\beta }_F), \varvec{\delta }_F)}(y \mid \textbf{x}_{k, F}), \end{aligned}$$
(16)

where \(N_C=\sum _{k\in U_F} \psi _k\) and \(\varvec{\beta }_F\), \(\varvec{\delta }_F\) are parameters of the distribution D(.) defined on \(U_F\). We estimate \(F^C_N(y)\) by

$$\begin{aligned} {\widehat{F}}^{C}_N(y)=\frac{\sum _{k \in S_F} {\widehat{\psi }}_k w_kF_{D(h({\textbf{x}}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)}(y \mid \textbf{x}_{k, F})}{\sum _{k \in S_F} {\widehat{\psi }}_k{w}_k}, \end{aligned}$$

where \({\widehat{\varvec{\delta }}}_F\) and \({\widehat{\varvec{\beta }}}_F\) are computed on \(S_F\) and \({\widehat{\psi }}_k\) is computed, for instance, with the help of the calibration approach of Anastasiade and Tillé (2017), using the raking method; this represents a nonparametric estimation of \({\psi }_k\) in contrast to the method of DiNardo et al. (1996) which uses a logistic regression model. Next, the estimator of \({Q}_{(\alpha )}^{C}\) is given by

$$\begin{aligned} {\widehat{Q}}_{(\alpha )}^{C}=\left[ {\widehat{F}}^{C}_N(\alpha )\right] ^{-1}. \end{aligned}$$
(17)

If the inverse function of \({\widehat{F}}^C_N(.)\) cannot be computed or its numerical approximation is slow, the following Monte Carlo method based on parametric bootstrap similar to the one given in Sect. 4 is used:

  1. 1.

    Generate a large number m of \(n_F\) independent draws from the distribution \(D(h(\textbf{x}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)\), \(k\in S_F\), respectively. A matrix of dimension \(m\times n_F\) of such draws is obtained. Each element \((i, k), i=1, \dots , m, k=1, \dots n_F\) in this matrix is the realization \(y_{i, k}\) of a random variable \(Y_{i, k}\) with \(Y_{i, k}\sim D(h(\textbf{x}_{k, F}^{\top } {\widehat{\varvec{\beta }}}_F), {\widehat{\varvec{\delta }}}_F)\); given \(\textbf{x}_F\) all the random variables \(Y_{i, k}\) are independent.

  2. 2.

    Associate with each element \((i, k), i=1,\dots , m, k=1, \dots , n_F\) the weight \({\widehat{\psi }}_k w_k, k \in S_F\) and compute the empirical weighted quantile of order \(\alpha \in [0, 1]\) of the counterfactual wage distribution by

    $$\begin{aligned} {\widehat{Q}}^{(i), C}_{(\alpha )}=\left[ {\widehat{F}}_{\textrm{emp}, i}^C(\alpha )\right] ^{-1}, \end{aligned}$$

    where \({\widehat{F}}_{\textrm{emp}, i}^C(y)=\sum _{k=1}^{n_F} {\widehat{\psi }}_k w_k I(y_{i, k}\le y)/\sum _{k=1}^{n_F} {\widehat{\psi }}_kw_k\).

  3. 3.

    For each \(\alpha \in [0, 1]\), compute the mean of the \({\widehat{Q}}^{(i), C}_{(\alpha )}, i=1, \dots , m;\) this mean represents an estimate of the quantile of order \(\alpha \) of the counterfactual wage distribution.

Remark 2

  1. 1.

    The method to compute \({\widehat{Q}}^{(i), C}_{(\alpha )}\) uses random weights \({\widehat{\psi }}_k w_k, k \in S_F\). Its computation is reliable because \({\widehat{\psi }}_k w_k, k \in S_F\) are fixed in each run of the algorithm.

  2. 2.

    The reweighting factor \(\psi _k\) does not allow the computation of the wage variable corresponding to the counterfactual distribution given in Expression (7) (the variable \(\psi _k Y_k^F\) has a different CDF), but only the estimation of some of its parameters.

  3. 3.

    As for gender wage quantile estimation, the use of auxiliary information \(\textbf{x}_{k, F}\) in estimating \(F^{C}_N(y)\) is expected to reduce the variance of the estimator given in Expression (17), compared to that of the estimator given in Expression (15). In Sect. 6.2, we show some Monte Carlo results that sustain the variance reduction of the two parametric methods compared to the other two competitors.

6 Application using the GB2 distribution

6.1 The GB2 regression model

We illustrate the two parametric methods to estimate the structure and composition effects at quantiles’ level using the GB2 distribution. The GB2 distribution is characterized by four parameters, namely abp and q. McDonald and Xu (1995) and Kleiber and Kotz (2003) use the following probability density function of a \({\text {GB}}2(a, b, p, q)\) distribution

$$\begin{aligned} f(y; a,b,p,q) = \dfrac{\mid a\mid y^{ap-1}}{b^{ap} \textrm{B}(p,q)\left[ 1+\left( \dfrac{y}{b}\right) ^a\right] ^{p+q}}, y>0, \end{aligned}$$
(18)

where \(\textrm{B}(p, q)\) represents the function \({\text {Beta}}(p, q)\) with arguments p and q. Using the notation of Graf et al. (2011) and Graf and Nedyalkova (2015), Equation (18) is rewritten as

$$\begin{aligned} f(y; a,b,p,q) = \dfrac{a\left( \dfrac{y}{b}\right) ^{ap-1}}{b\textrm{B}(p,q)\left[ 1+\left( \dfrac{y}{b}\right) ^a\right] ^{p+q}}, y>0. \end{aligned}$$
(19)

The parameters ap and q are shape parameters and b is the scale parameter (Kleiber and Kotz 2003). All of them are strictly positive. The peak of the distribution is controlled by a, while the other two shape parameters control for the left and the right tail, respectively. The GB2 distribution can be positively or negatively skewed, depending upon the values of p and q.

We borrow from McDonald and Butler (1990) the idea of changing the scale parameter b, by expressing it as a function of the observed characteristics of the employees. The framework can also be expressed as a regression model

$$\begin{aligned} \textrm{log}(Y_k) = \textbf{X}_k^{\top } \varvec{\beta }+ \textrm{log}(\varepsilon _k), \end{aligned}$$
(20)

where \(\varepsilon _k \sim {\text {GB}}2(a,1,p,q)\). As \(\varepsilon _k \sim {\text {GB}}2(a,1,p,q)\), we have that \(Y_k\mid \textbf{X}_k=\textbf{x}_k \sim {\text {GB}}2(a, \textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }), p, q)\) (see McDonald and Butler 1990). Since \(\varepsilon _k\) follows a GB2 distribution, we refer to the model in Eq. (20) as a GB2 regression model.

In each group \(g\in \{M, F\}\), we assume that the conditional wage of \(k\in U_g\), \(Y_k\mid \textbf{X}_{k, g}=\textbf{x}_k \sim {\text {GB}}2(a_g, \textrm{exp}(\textbf{x}_k^{\top }\varvec{\beta }_g), p_g, q_g)\). Thus, for each \(k\in U_g\), Expression (19) becomes

(21)

We use the maximum pseudo-likelihood method to fit GB2 regression models using survey weights. The sandwich estimator (Huber 1967; Freedman 2006; Graf et al. 2011) or parametric bootstrap can be used to estimate the standard errors of the estimated parameters. We describe in “Appendix” the entire approach.

Biewen and Jenkins (2005) suggested to express all the four parameters of the GB2 distribution as a function of the observed characteristics. However, they note that “there would be too many parameters to be estimated, and variance calculations for the statistics of interest are rather complicated. Also we found that estimation often led to numerical problems.” We also note that if the survey weights are skewed, the estimation of the parameters is numerically complicated. In order to avoid all these problems, we express in our examples only the scale parameter as a function of the observed characteristics.

6.2 Monte Carlo studies

Monte Carlo simulation was used to show the performances of the two parametric methods when the quantiles of the counterfactual wage distribution are estimated. Three settings have been employed as follows:

  • Setting 1, where we generate a conditional wage distribution for women, \(Y_{k, F} = \textrm{exp}[1.10+X_{k, F}+\varepsilon _{k, F}]\), where \(\varepsilon _{k, F} \sim N(0, 1)\) are iid, \(X_{k, F} \sim N(5, 1)\), \(k=1, \dots N_F\), with \(N_F=50{,}000\). The covariate for the men is \(X_{k, M} \sim N(4, 1)\), iid, \(k=1, \dots N_M\), with \(N_M=N_F\). The correlation between \(\textrm{log}(Y_F)\) and \(X_F\) is about 0.70.

  • Setting 2, where we generate a conditional wage distribution for women, \(Y_{k, F} = \textrm{exp}[1.44+0.15X_{k, F}+\textrm{log}(\varepsilon _{k, F})]\), where \(\varepsilon _{k, F} \sim {\text {GB}}2(8,1,0.50,0.90)\) are iid, \(X_{k, F} \sim {\text {Gamma}}(9, 2)\), \(k=1, \dots N_F\), with \(N_F=50{,}000\). The covariate for the men is \(X_{k, M} \sim {\text {Gamma}}(10, 2)\), iid, \(k=1, \dots N_M\), with \(N_M=N_F\). The correlation between \(\textrm{log}(Y_F)\) and \(X_F\) is about 0.60.

  • Setting 3 is similar to Setting 2, using with the same \(N_F, X_{k, F}, X_{k, M}\) and \(\varepsilon _{k, F}\), but \(Y_{k, F} = \textrm{exp}[1.44+0.07X_{k, F}+\textrm{log}(\varepsilon _{k, F})], k=1, \dots N_F\). The correlation between \(\textrm{log}(Y_F)\) and \(X_F\) is about 0.30.

At the superpopulation level, the counterfactual distribution uses the factor \({\psi }(x)={\textrm{d}}F^{X_M}(x)/{\textrm{d}}F^{X_F}(x)\). For Setting 1, \({F}^{(Y_F\mid X_F)}(y\mid x)\) is the CDF of the log-normal distribution with parameters \(\varvec{\mu }=\textbf{x}_F ^{\top }\varvec{\beta }_1\) and \(\sigma ^2=1\), where \(\varvec{\beta }_1=(1.10, 1)'\). For Setting 2, \({F}^{(Y_F\mid X_F)}(y\mid x)\) is the CDF of the distribution \({\text {GB}}2(8, \textrm{exp}[\textbf{x}_F ^{\top }\varvec{\beta }_2], 0.50, 0.90)\), with \(\varvec{\beta }_2=(1.44, 0.15)'\); for setting 3, we have a similar situation. For all settings, the quantile \(Q_{(\alpha )}^C\) is computed using the inverse of \(F^C(\alpha )\) given in Expression (8). \(F^C(.)\) is used, because two different CDF are employed at the finite population level given, respectively, by Expressions (14) and (16). \(F^C(\alpha )\) is computed using Monte Carlo integration, with 10,000,000 runs; its inverse at the point \(\alpha \) is computed using a numerical method.

We use r runs and draw in each one a random sample of women and men, respectively. In Setting 1, the number of runs equals 10,000, and in Settings 2 and 3, due to the time-consuming process of fitting a GB2 distribution, we use only 1000 runs. In Setting 1, we select samples of women and men, respectively, by simple random sampling without replacement, with sample sizes \(n_F=n_M=1000\). In Settings 2 and 3, we employ systematic sampling with unequal probabilities for both samples with \(n_F=n_M=10{,}000\), where the inclusion probabilities are proportional to \(x_F\) and \(x_M\), respectively; these two settings are close to the framework used by the application given in Sect. 6.3.

In each run of the Monte Carlo simulation, we computed the quantiles of order 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95% and 99%, respectively, of the counterfactual wage distribution using the estimators given by Methods 1 and 2, the method of Anastasiade and Tillé (2017) (with raking calibration; hereafter, the calibration method) and the weighted version of the method of DiNardo et al. (1996) (hereafter, weighted DFL).

For each generic estimator \({\widehat{Q}}_{(\alpha )}^C\) of \(Q_{(\alpha )}^C\), the following Monte Carlo measures were used:

  • the Monte Carlo relative bias (in percentages)

    $$\begin{aligned} {\text {RB}}_{{\textrm{MC}}}({\widehat{Q}}_{(\alpha )}^C)=100\times \left( E_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)-Q_{(\alpha )}^C\right) /Q_{(\alpha )}^C, \end{aligned}$$

    where \(E_{\textrm{MC}}({\widehat{Q}}_{(\alpha )})=\sum _{i=1}^r {\widehat{Q}}_{i, (\alpha )}^C/r\), and \({\widehat{Q}}_{i, (\alpha )}^C\) is the quantile estimator of \(Q_{(\alpha )}^C\) computed in the ith run;

  • the Monte Carlo variance

    $$\begin{aligned} {\text {Var}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=\frac{1}{r-1} \sum _{i=1}^r \left[ {\widehat{Q}}_{i, (\alpha )}^C-E_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )\right] ^2; \end{aligned}$$
  • the Monte Carlo root mean square error (RMSE)

    $$\begin{aligned} {\text {RMSE}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=\left[ {\text {Var}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big ) +\left( B_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )\right) ^2\right] ^{1/2}, \end{aligned}$$

    where \(B_{\textrm{MC}}({\widehat{Q}}_{(\alpha )}^C)=E_{\textrm{MC}}({\widehat{Q}}_\alpha ^C)-Q_{(\alpha )}^C\),

  • the Monte Carlo coefficient of variation (in percentages)

    $$\begin{aligned} {\text {CV}}_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big )=100\times \left( {\text {Var}}_{\textrm{MC}} \big ({\widehat{Q}}_{(\alpha )}^C\big )\right) ^{1/2}/E_{\textrm{MC}}\big ({\widehat{Q}}_{(\alpha )}^C\big ). \end{aligned}$$

For the estimators corresponding to Methods 1 and 2, we estimated the parameters of the women’s wage distribution at each run using the corresponding weights of women selected in the women’s sample, as well as the estimated factor \(\psi _k\) given by the method of Anastasiade and Tillé (2017) with raking calibration. The latter was also used to compute in each run the calibration estimator for each quantile of the counterfactual distribution. Similarly, the factor \(\psi _k\) for the weighted DFL method was estimated in each run. We used a weighted logistic regression to compute \(P(G_k=1 \mid x_k)\) and \(P(G_k=0 \mid x_k)\), while \(P(G_k=1)\) and \(P(G_k=0)\) were estimated by weighted means \(\sum _{k\in S_g} w_k/\sum _{k\in S} w_k, g\in \{M, F\};\) see Expression (9). All the results were computed in R Core Team (2022). The weighted empirical quantiles were computed using the function wtd.quantile from the R package Hmisc (Harrell 2022), while the inverse of a CDF at the point \(\alpha \) was computed using the R base function uniroot. We used 1000 bootstrap runs in Method 2.

All the used estimators are biased with respect to the sampling design. The values of the Monte Carlo relative bias in percentages are shown in Tables 1, 5 and 9 for the three settings, while the values of the Monte Carlo variance are given in Tables 2, 6 and 10, respectively; the Monte Carlo root mean square errors are reported in Tables 3, 7 and 11, respectively. The Monte Carlo coefficients of variation are given in Tables 4, 8 and 12, respectively. We note that Method 1 and Method 2 provide very close values of the Monte Carlo measures in all three settings.

The estimator of Anastasiade and Tillé (2017) using calibration is used in Figs. 1, 2 and 3 as a benchmark in order to visualize the behavior of the other estimators at different quantiles. Since Method 1 and Method 2 provide almost identical results, only the results of Method 1 are shown in Figs. 1, 2 and 3. Figure 1 shows the ratio between the Monte Carlo bias \(B_{\textrm{MC}}\) obtained by using Method 1, the weighted DFL and that of the calibration method for each of the quantiles of order 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95% and 99%. Figure 2 provides the ratio between the Monte Carlo variance of Method 1, the weighted DFL and that of the calibration method for each quantile. Similarly, Fig. 3 shows the ratio of the Monte Carlo RMSEs.

In Setting 1, the two parametric methods show a smaller value of the Monte Carlo relative bias than the weighted DFL and the calibration estimator at each quantile. The two methods also provide a substantial reduction of the Monte Carlo variance at each quantile, and a good behavior with respect to the RMSE over the calibration and weighted DFL estimators (see Fig. 1). The estimators obtained by the two parametric methods also display a smaller coefficient of variation than the reweighting estimators at all quantiles as shown in Table 4.

For Setting 2, the Monte Carlo expectation of the estimated parameters of the GB2 distribution are for a, \(\beta _0\), \(\beta _1\), p and q, respectively, 8.02, 1.44, 0.15, 0.50 and 0.91, showing that we provide approximately unbiased estimates under the sampling design, for large sample sizes. In Setting 2, the two parametric methods result in estimators that have a lower Monte Carlo variance than the calibration and the weighted DFL estimators almost at all quantiles (see Fig. 2). Like in Setting 1, the value of the Monte Carlo coefficient of variation of the estimators obtained using the two parametric methods are smaller than of those using the last two methods (see Table 8). The parametric methods sometimes show a larger bias and relative mean square error than the calibration estimator, but provide a reduction of the Monte Carlo variance at each quantile (except for the quantile of order 20%; see also Fig. 2). Compared to Setting 1, note that the correlation between \(\textrm{log}(Y_F)\) and \(X_F\) is less important (0.60 compared to 0.70).

Setting 3 shows a smaller correlation between \(\textrm{log}(Y_F)\) and \(X_F\) (about 0.30) compared to Setting 2; this is similar to the correlation between the logarithm of women wage and age in the application given in Sect. 6.3. This correlation reduction is visible in the behavior of the Monte Carlo variance and relative bias of the two parametric methods. Thus, the parametric methods still provide a reduction of the Monte Carlo variance for most of quantiles (except for the quantiles of order 30%, 80% and 95%; see also Fig. 3) compared to the two competitors. The value of the Monte Carlo relative bias of the two parametric methods is more important than in Setting 2 for the quantiles of order 20% and 70%. Despite the lower correlation between \(\textrm{log}(Y_F)\) and \(X_F\), the shapes of the Monte Carlo RMSE of the two parametric methods are similar to the ones provided by Setting 2; see the last plot in Figs. 1 and 2, respectively. The Monte Carlo coefficients of variation of Method 1 and Method 2 also show reduced values compared to the other two methods (see Table 12).

Table 1 Setting 1: Monte Carlo relative bias (in %) of the four estimators of the counterfactual wage quantiles
Table 2 Setting 1: Monte Carlo variance of the four estimators of the counterfactual wage quantiles
Table 3 Setting 1: Monte Carlo RMSE of the four estimators of the counterfactual wage quantiles
Table 4 Setting 1: Monte Carlo coefficient of variation (in %) of the four estimators of the counterfactual wage quantiles
Table 5 Setting 2: Monte Carlo relative bias (in %) of the four estimators of the counterfactual wage quantiles
Table 6 Setting 2: Monte Carlo variance of the four estimators of the counterfactual wage quantiles
Table 7 Setting 2: Monte Carlo RMSE of the four estimators of the counterfactual wage quantiles
Table 8 Setting 2: Monte Carlo coefficient of variation (in %) of the four estimators of the counterfactual wage quantiles
Table 9 Setting 3: Monte Carlo relative bias (in %) of the four estimators of the counterfactual wage quantiles
Table 10 Setting 3: Monte Carlo variance of the four estimators of the counterfactual wage quantiles (results multiplied by 100)
Table 11 Setting 3: Monte Carlo RMSE of the four estimators of the counterfactual wage quantiles
Table 12 Setting 3: Monte Carlo coefficient of variation (in %) of the four estimators of the counterfactual wage quantiles
Fig. 1
figure 1

Setting 1, upper panel: ratio between the Monte Carlo bias obtained by using Method 1, the weighted DFL method and that of the calibration method for each quantile; middle: ratio between the Monte Carlo variance obtained by using Method 1 and that of the calibration method for each quantile; lower panel: ratio between the Monte Carlo RMSE obtained by using Method 1 and that of the calibration method for each quantile. In each panel, the horizontal line drawn at level 1 on the y-axis corresponds to the calibration method. Method 1 and Method 2 provide almost identical ratios

Fig. 2
figure 2

Setting 2, upper panel: ratio between the Monte Carlo bias obtained by using Method 1, the weighted DFL method and that of the calibration method for each quantile; middle: ratio between the Monte Carlo variance obtained by using Method 1 and that of the calibration method for each quantile; lower panel: ratio between the Monte Carlo RMSE obtained by using Method 1 and that of the calibration method for each quantile. In each panel, the horizontal line drawn at level 1 on the y-axis corresponds to the calibration method. Method 1 and Method 2 provide almost identical ratios

Fig. 3
figure 3

Setting 3, upper panel: ratio between the Monte Carlo bias obtained by using Method 1, the weighted DFL method and that of the calibration method for each quantile; middle: ratio between the Monte Carlo variance obtained by using Method 1 and that of the calibration method for each quantile; lower panel: ratio between the Monte Carlo RMSE obtained by using Method 1 and that of the calibration method for each quantile. In each panel, the horizontal line drawn at level 1 on the y-axis corresponds to the calibration method. Method 1 and Method 2 provide almost identical ratios

6.3 Application to real data

A real dataset with information collected during the Swiss survey on earnings in 2012 by the Swiss Federal Statistical Office is used to illustrate the methods. All the cases where there is missing information are removed. We also removed observations where the monthly wage is less than 1000 CHF for a full-time job, because we consider them to be data collection errors. The employees have worked at least one hour during the month of October 2012 in the private sector and are between 18 and 64 years old. The modeled variable is the standardized hourly wage. Standardized refers to the fact that the hourly wages are reported as if all the employees worked full-time. Finally, we use a sample of 144,753 employees: 66,181 women and 78,572 men. The wages of women range between 5 and 149.76 CHF, while those of men between 5 and 299.53 CHF. Figure 4 shows the estimated wage densities of men and women, respectively.

Fig. 4
figure 4

Application to real data: estimated wage densities of men (dashed) and women (solid)

A GB2 regression model was fitted separately for men and women, using the maximum pseudo-likelihood method with survey weights. We used four explanatory variables in the models: the age of the employee (that is a proxy for professional experience, since this information was absent), the education level (8 ordinal categories, with the first category being the most important), the professional position (4 ordinal categories, with the first category being the most important) and the economic sector (38 categories, the first one being the tobacco industry, where the median of the wages is the largest one in each group). The correlation between the logarithm of the wage and age is about 0.30 for both groups, respectively; this value is similar to the one used in Setting 3 given in Sect. 6.2.

In each group, there are 104 parameters to estimate (52 coefficients and 52 standard errors). They are reported in Table 15 in Appendix for the women’s sample and men’s sample, respectively. The standard errors are estimated using the sandwich estimator. Age has a positive effect on the wages for both groups. The first levels for education, professional position and economic sector are taken as reference categories in the GB2 models. The estimated coefficients associated with each category of the three covariates show negative values, as expected.

Figure 5 shows the histogram of the GB2 residuals and the corresponding P–P plot using the estimated parameters of the GB2 distribution fitted on the women’s sample. The P–P plot indicates a good agreement with the \({\text {GB}}2({\widehat{a}}_F, 1, {\widehat{p}}_F, {\widehat{q}}_F)\) distribution. A similar plot was obtained on the men’s sample.

Fig. 5
figure 5

Application to real data. GB2 regression model fitted on the women’s sample: histogram of the residuals (left panel) and P–P plot (right panel)

In addition to the previous plot, to test the goodness of fit of the GB2 distribution, we used a nonparametric bootstrap version of the Kolmogorov–Smirnov test (Meintanis and Swanepoel 2007), because the parameters ap and q are estimated. Recall that in GB2 regression, the residuals should follow the \({\text {GB}}2(a, 1, p, q)\) distribution. We implemented the test by exploiting the relationship between the \({\text {GB}}2(a, b, p, q)\) and the \({\text {Beta}}(p,q)\) distributions: If \(Z\sim {\text {GB}}2(a, b, p, q)\), then \((Z/b)^a/(1+(Z/b)^a)\sim {\text {Beta}}(p, q)\). Thus, using the transformation \(Z^{{\widehat{a}}_F}/(1+Z^{{\widehat{a}}_F})\) (with \(b_F=1\)), we tested if the transformed version of the residuals follow the \({\text {Beta}}({\widehat{p}}_F,{\widehat{q}}_F)\) distribution. The test statistic corresponding to the Kolmogorov–Smirnov test for beta distribution is available in majority of statistical software. However, the p-value of the test should be approximated by bootstrap because the parameters ap and q are estimated. The test was applied on the women’s sample using 2500 bootstrap replicates. The approximate p-value of the test was 0.56. A similar value was obtained for the residuals obtained on the men’s sample.

In order to compare the GB2 regression with the log-normal model usually used in official statistics, we show in Fig. 6 the QQ plots of the standardized residuals of the log-linear model (with the log of the wages as dependent variable and the same characteristics as covariates) and the residuals of the GB2 model, respectively. Both models are fitted on the women’s sample. We observe an important improvement of the QQ plot when we fit a GB2 model compared to the log-linear model. A similar result was obtained on the men’s sample.

Fig. 6
figure 6

Application to real data. Women’s sample: QQ plot of the standardized residuals for the log-linear model (left panel) and QQ plot of the residuals for the GB2 model (right panel)

The quantiles of the counterfactual wage distribution were estimated using, respectively, the calibration method and weighted DFL. The factor \(\psi _k\) was estimated like in Sect. 6.2: using raking ratio for the calibration method and a weighted logistic regression for the weighted DFL. The empirical distributions of the estimated \(\psi _k\) for both methods are very similar and positively skewed (the skewness coefficients are, respectively, 4.81 for weighted DFL and 5.01 for calibration), also due to skewed survey weights (the skewness coefficient equals to 3.56 on the overall sample). Figure 7 shows the boxplots of estimated \(\psi _k\) for both methods on the logarithmic scale.

The estimated quantiles of the counterfactual distribution provided by the calibration method and weighted DFL are compared with the estimated quantiles of the wage distribution of men, and with those of the wage distribution of women; they have been computed using Expression (12). Table 13 summarizes the results.

Fig. 7
figure 7

Application to real data. Boxplots of the estimated \(\textrm{log}(\psi _k)\) using the weighted DFL (left) and calibration method (right). The logarithmic scale is used for a better visualization because both distributions are positively skewed

Method 1 and Method 2 have also been applied. In Table 14, we show the corresponding results: the estimated quantiles of men’s wages, of the counterfactual wage distribution computed and finally, those of women’s wages using Method 1 and Method 2. These quantiles were computed using the approach described in Sect. 4 (for the gender wages) and Sect. 5 (for the counterfactual distribution). As in Sect. 6.2, we used 1000 bootstrap runs in Method 2.

Tables 13 and 14 provide the estimated quantiles of the counterfactual distribution for the four estimators (calibration, weighted DFL, Method 1, Method 2). They are within the range of the estimated gender quantiles. The calibration method and the weighted DFL provide similar results; the results for Method 1 and Method 2 are almost identical. The four methods provide similar values of the quantile estimators of order 2% up to 90% for the counterfactual distribution. More important differences are noted at the lowest and highest levels: 1% and 99%. At these levels, the values of the estimators are, respectively, 14.42.and 93.99 for the calibration method, and 14.25 and 95.27 for the weighted DFL. On the other hand, Method 1 shows, respectively, the values 17.95 and 78.79, and Method 2, 17.94 and 78.85.

Using all four methods, the differences of type \({\widehat{Q}}_{(\alpha )}^M-{\widehat{Q}}_{(\alpha )}^C\) are positive, showing an important unexplained gender wage gap at each quantile (see also Fig. 8; since the results are very similar for the calibration method and the weighted DFL, and, respectively, for Method 1 and Method 2, for a better visualization only the results of the calibration method and Method 1 are plotted). We note that the most important difference (approximately − 10) between Method 1 and the calibration method is shown at the quantile of level 99%. The differences \({\widehat{Q}}_{(\alpha )}^M-{\widehat{Q}}_{(\alpha )}^C\) also impact the ratio between the estimated value of the unexplained part and the estimated difference between the gender wage quantiles at level \(\alpha \), \({\widehat{Q}}_{(\alpha )}^M-{\widehat{Q}}_{(\alpha )}^F\); see also Fig. 9 (since the results are very similar for the calibration method and the weighted DFL, and, respectively, for Method 1 and Method 2, for a better visualization only the results of the calibration method and Method 1 are plotted). The calibration method and the weighted DFL show approximate ratios between 70% and 92%, while Method 1 and Method 2 reduce the ratio range: 79% to 90%. The shapes of the ratio curves provided by Method 1 and by the calibration method are very different as shown in Fig. 9. Method 1 provides a smoother curve than the calibration method, mainly due to the parametric approach used. On the other hand, the ratios of the calibration method are impacted by the skewness of the estimated \(\psi _k\) factor and thus impact the shape of their curve.

Table 13 Application to real data. Estimated quantiles of gender wage distributions and of the counterfactual wage distribution computed using the calibration method and weighted DFL
Table 14 Application to real data. Estimated quantiles of gender wage distributions and of the counterfactual wage distribution computed using the two parametric methods
Fig. 8
figure 8

Application to real data. The unexplained part of the decomposition at each quantile estimated using Method 1 (filled triangle) and the calibration method (filled square)

Fig. 9
figure 9

Application to real data. Ratio of the unexplained part and the difference between the estimated gender quantiles (Method 1—filled triangle and calibration method—filled square). The scale of ratio is expressed in % on the y-axis

7 Discussion and conclusions

Our aim is to estimate the unexplained part of the wage gap between men and women at different quantiles of the wage distributions, within the usual context of official statistics, where random samples are selected from finite populations. To do this, we fit parametric models on the conditional wage distributions, using some auxiliary information. The parametric approach can be applied for any conditional distribution of the wages. We illustrate it with a GB2 distribution by modeling wages with covariates and using survey weights. This approach extends the classical framework of a log-normal model of the wages usually used in official statistics. The GB2 distribution includes many distributions as special or limiting cases (for example, the log-normal distribution), and it is expected that its use provides a good fit for wage distributions, which are usually heavy-tailed.

Strictly speaking, the estimators corresponding to Methods 1 and 2 are design-based, even though they use an underlying model between the variable of interest and the covariates. As for all design-based estimators, the variance reduction is expected when an important correlation between the variable of interest and the covariates is detected. In the first Monte Carlo simulation (Setting 1), the correlation between the logarithm of the women wage and the covariate is larger compared to Settings 2 and 3 (0.70 versus, respectively, 0.60 and 0.30). This difference could explain that the variance reduction of the parametric methods compared to the calibration method is less important in Settings 2 and 3 than in Setting 1. Nevertheless, the parametric approach provides in all our simulation studies a good behavior in terms of Monte Carlo variance and coefficient of variation compared to the other two competitors.

We also note that there are different ways to estimate a counterfactual distribution as provided in the econometrics literature. The choice of the method used in this paper is determined by the standard use of reweighting estimators in the design-based approach.

We provided an example of GB2 regression model using real data from the Swiss Federal Statistical Office. Compared to the usual log-normal model, the GB2 model showed a better fit for the conditional wage distributions. The unexplained part of the wage differences was estimated using a parametric approach and compared to the results obtained using the methods of DiNardo et al. (1996) and Anastasiade and Tillé (2017). The GB2 approach provides smooth results compared to the other two methods, which are impacted by the skeweness of the estimated \(\psi _k\) factor. The parametric model used for the conditional distribution “regularizes” the estimates, but impose additional restrictions in the form of a parametric conditional distribution. We used in our computations a reweighting factor involved in the construction of the counterfactual distribution which is estimated using the calibration approach, and not a logistic regression; this represents a nonparametric estimation of this factor. Thus, finally, only the model relating the variable of interest and the covariates should be roughly correct, because the use of the weights in the design-based approach may protect from a misspecification of the model.