1 Introduction

Random coefficient models have been used extensively in time series, cross-section and panel regressions. Nicholls and Pagan (1985) consider the estimation of the first and second moments of the random coefficient \(\beta _{i}\) and the error term \(u_{i}\) in a linear regression model. In a seminal paper, Beran and Hall (1992) establish conditions for identifying and estimating the distribution of \(\beta _{i}\) and \(u_{i}\) nonparametrically. The baseline linear univariate regression in Beran and Hall (1992) has been extended in a nonparametric framework by Beran (1993), Beran and Millar (1994), Beran et al. (1996), Hoderlein et al. (2010), Hoderlein et al. (2017) and Breunig and Hoderlein (2018), to name just a few. Hsiao and Pesaran (2008) survey random coefficient models in linear panel data models.

In some econometric applications, for example Hausman (1981), Hausman and Newey (1995) and Foster and Hahn (2000), the main interest is to estimate the consumer surplus distribution based on a linear demand system in which the coefficient associated with the price is random. In such settings, the distribution of the random coefficients is needed to compute the consumer surplus function, and nonparametric estimation is more general and flexible, and thus well suited for the purpose. On the other hand, parametric models may be favored in applications in which the implied economic meaning of the distribution of the random coefficients is of interest. Examples include estimation of the return to education (Lemieux 2006b, c) and the labor supply equation (Bick et al. 2022).

In this paper, we consider a linear regression model with a random coefficient \(\beta _{i}\) that is assumed to follow a categorical distribution, i.e., \(\beta _{i}\) has a discrete support \(\left\{ b_{1},b_{2},\ldots ,b_{K}\right\} \), and \(\beta _{i}=b_{k}\) with probability \(\pi _{k}\). The discretization of the support of the random coefficient \( \beta _{i}\) naturally corresponds to the interpretation that each individual belongs to a certain category, or group, k with probability \(\pi _{k}\). Compared to a nonparametric distribution with continuous support, assuming a categorical distribution allows us not only to model the heterogeneous responses across individuals but also to interpret the results with sharper economic meaning. As we will illustrate in the empirical application in Sect. 6, it is hard to clearly interpret the distribution of returns to education without imposing some form of parametric restrictions.

In addition, with the categorical distribution imposed, the identification and estimation of the distribution of \(\beta _{i}\) do not rely on identically distributed error terms \(u_{i}\) and regressors \({{\textbf {w}}}_i\), as shown in Sects. 2 and 3. Heterogeneously generated errors can be allowed, which is important in many empirical applications. To the best of our knowledge, this is the first identification result in a linear random coefficient model that does not require a strict IID setting.

The identification of the distribution of \(\beta _{i}\) is established in this paper via the identification of the moments of \(\beta _{i}\). This coincides with the identification condition in Beran and Hall (1992), namely that the distribution of \(\beta _{i}\) is uniquely determined by its moments, which are assumed to exist up to an arbitrary order. Since under our setup the distribution of \(\beta _{i}\) is parametrically specified, the moments of \(\beta _{i}\) exist and can be derived explicitly. The parameters of the assumed categorical distribution can then be uniquely determined by a system of equations in terms of the moments, as in Theorem 2. The parameters of the categorical distribution are then estimated consistently by the generalized method of moments (GMM). The estimation procedure based on moment conditions is similar in spirit to Ahn et al. (2001, 2013), in which Peter Schmidt and coauthors study panel data models with interactive effects where they allow for the time effects to vary across individual units. Compared to alternative nonparametric random coefficient models, the standard GMM estimation is easy to implement, and the identified categorical structure has a clear economic interpretation.

Using Monte Carlo (MC) simulations, we find that moments of the random coefficients can be estimated reasonably accurately, but large samples are required for estimation of the parameters of the underlying categorical distributions. Our theoretical and MC results also suggest that our method is suitable when the number of heterogeneous coefficients and the number of categories are small (2 or 3). As the number of categories rises, the burden of identifying the parameters of the categorical distribution from the moments also rises rapidly. The quality of identification also deteriorates because we need to rely on ever higher moments to identify a larger number of categories, and the information content of the moments tends to decline with their order.

The proposed method is also illustrated by providing estimates of the distribution of returns to education in the USA by gender and educational level, using the May and Outgoing Rotation Group (ORG) supplements of the Current Population Survey (CPS) data. Comparing the estimates obtained over the sub-periods 1973–1975 and 2001–2003, we find that rising between-group heterogeneity is largely due to rising returns to education in the case of individuals with postsecondary education, while within-group heterogeneity has been rising in the case of individuals with high school education or less.

Related Literature This paper draws mainly upon the literature on random coefficient models. As already mentioned, the main body of the recent literature is focused on nonparametric identification and estimation. Following Beran and Hall (1992), Beran (1993) and Beran and Millar (1994) extend the model to a linear semi-parametric model with a multivariate setup and propose a minimum distance estimator for the unknown distribution. Foster and Hahn (2000) extend the identification results in Beran and Hall (1992) and apply the minimum distance estimator to gasoline consumption data to estimate the consumer surplus function. Beran et al. (1996) and Hoderlein et al. (2010) propose kernel density estimators based on the inverse Radon transform in linear models.

In addition to linear models, Ichimura and Thompson (1998) and Gautier and Kitamura (2013) incorporate random coefficients in binary choice models. Gautier and Hoderlein (2015) and Hoderlein et al. (2017) consider triangular models with random coefficients allowing for causal inference. Matzkin (2012) and Masten (2018) discuss the identification of random coefficients in simultaneous equation models. Breunig and Hoderlein (2018) propose a general specification test in a variety of random coefficient models. Random coefficients are also widely studied in panel data models; see, for example, Hsiao and Pesaran (2008) and Arellano and Bonhomme (2012).

The rest of the paper is organized as follows: Sect. 2 establishes the main identification results. The GMM estimation procedure is proposed and discussed in Sect. 3. An extension to a multivariate setting is considered in Sect. 4. Small sample properties of the proposed estimator are investigated in Sect. 5, using Monte Carlo techniques under different regressor and error distributions. Section 6 presents and discusses our empirical application to the return to education. Section 7 provides some concluding remarks and suggestions for future work. Technical proofs are given in “Appendix A.1.”

Notations Largest and smallest eigenvalues of the \( p\times p\) matrix \({{\textbf {A}}}=\left( a_{ij}\right) \) are denoted by \(\lambda _{\max }\left( {{\textbf {A}}}\right) \) and \(\lambda _{\min }\left( {{\textbf {A}}} \right) \), respectively, and its spectral norm by \(\left\| {{\textbf {A}}} \right\| =\lambda _{\max }^{1/2}\left( {{\textbf {A}}}^{\prime }{{\textbf {A}}} \right) \); \({{\textbf {A}}}\succ 0\) means that \({{\textbf {A}}}\) is positive definite, \(\text {vech}\left( {{\textbf {A}}}\right) \) denotes the vectorization of the distinct elements of \({{\textbf {A}}}\), and \({{\textbf {0}}}\) denotes a zero matrix (or vector). For \({{\textbf {a}}}\in {\mathbb {R}}^{p}\), \(\textrm{diag}\left( {{\textbf {a}}} \right) \) represents the diagonal matrix with diagonal elements \( a_{1},a_{2},\ldots ,a_{p}\). For random variables (or vectors) u and v, \( u\perp v\) denotes that u is independent of v. We use c (C) to denote some small (large) positive constant. For a differentiable real-valued function \(f\left( \varvec{\theta }\right) \), \(\nabla _{\varvec{\theta } }f\left( \varvec{\theta }\right) \) denotes the gradient vector. The operator \( \rightarrow _{p}\) denotes convergence in probability, and \(\rightarrow _{d}\) convergence in distribution. The symbols O(1) and \(O_{p}(1)\) denote asymptotically bounded deterministic and random sequences, respectively.

2 Categorical random coefficient model

We suppose that the cross-section observations, \(\left\{ y_{i},x_{i}, {{\textbf {z}}}_{i}\right\} _{i=1}^{n}\), follow the categorical random coefficient model

$$\begin{aligned} y_{i}=x_{i}\beta _{i}+{{\textbf {z}}}_{i}^{\prime }\varvec{\gamma }+u_{i}, \end{aligned}$$
(2.1)

where \(y_{i},x_{i}\in {\mathbb {R}}\), \({{\textbf {z}}}_{i}\in {\mathbb {R}}^{p_z},\) and \(\beta _{i}\in \left\{ b_{1},b_{2},\ldots ,b_{K}\right\} \) admits the following K-categorical distribution,

$$\begin{aligned} \beta _{i}= {\left\{ \begin{array}{ll} b_{1}, &{} \text {w.p. }\pi _{1}, \\ b_{2}, &{} \text {w.p. }\pi _{2}, \\ \vdots &{} \vdots \\ b_{K}, &{} \text {w.p. }\pi _{K}, \end{array}\right. } \end{aligned}$$
(2.2)

where w.p. denotes “with probability,” \(\pi _{k}\in \left( 0,1\right) \), \( \sum _{k=1}^{K}\pi _{k}=1\), \(b_{1}<b_{2}<\cdots <b_{K}\), \(\varvec{\gamma }\in {\mathbb {R}}^{p_z}\) is homogeneous, and \({{\textbf {z}}}_{i}\) could include an intercept term as its first element. It is assumed that \(\beta _{i}\perp {{\textbf {w}}}_i = \left( x_{i},{{\textbf {z}}}_{i}^{\prime }\right) ^{\prime }\), and the idiosyncratic errors \(u_{i}\) are independently distributed with mean 0.
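To fix ideas, the following minimal sketch simulates data from (2.1)–(2.2) for a two-category example. It is for illustration only: the function name, the Gaussian designs for \(x_{i}\), \({{\textbf {z}}}_{i}\) and \(u_{i}\), and the parameter values are our own assumptions, not part of the model.

```python
import numpy as np

def simulate_categorical_rc(n, b, pi, gamma, seed=0):
    """Draw (y_i, x_i, z_i) from y_i = x_i*beta_i + z_i'gamma + u_i,
    where beta_i = b_k with probability pi_k (categorical distribution)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                              # regressor with a random coefficient
    z = np.column_stack([np.ones(n),                    # intercept as first element of z_i
                         rng.normal(size=(n, len(gamma) - 1))])
    beta = rng.choice(b, size=n, p=pi)                  # beta_i ~ K-categorical distribution
    u = rng.normal(size=n)                              # mean-zero idiosyncratic error
    y = x * beta + z @ gamma + u
    return y, x, z, beta

# K = 2 example: beta_i = 0.5 w.p. 0.3 and beta_i = 1.5 w.p. 0.7
y, x, z, beta = simulate_categorical_rc(
    n=10_000, b=[0.5, 1.5], pi=[0.3, 0.7], gamma=np.array([1.0, 0.2]))
```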

Remark 1

The model can be extended to allow \({{\textbf {x}}}_{i},\varvec{\beta }_{i}\in {\mathbb {R}}^{p}\), with \(\varvec{\beta }_{i}\) following a multivariate categorical distribution, at the cost of more complicated notation. We consider possible extensions in Sect. 4.

Remark 2

Since we consider a pure cross-sectional setting, the key assumption that \( \beta _{i}\) and \(x_{i}\) are independently distributed cannot be relaxed. Allowing \(\beta _{i}\) to vary with \(\textbf{w}_{i}\), without any further restrictions, is tantamount to assuming \(y_{i}\) is a general function of \(\textbf{w}_{i}\), in effect rendering the specification nonparametric.

Remark 3

The number of categories, K, is assumed to be fixed and known. The conditions \( \sum _{k=1}^{K}\pi _{k}=1\), \(b_{1}<b_{2}<\cdots <b_{K},\) and \(\pi _{k}\in \left( 0,1\right) \) together are sufficient for the existence of K categories. For example, if \(b_{k}=b_{k^{\prime }}\), then we can merge categories k and \(k^{\prime }\), and the number of categories reduces to \( K-1\). Similarly, if \(\pi _{k}=0\) for some k, then category k can be deleted, and the number of categories is again reduced to \( K-1\). Information criteria can be used to determine K, but this will not be pursued in this paper. Model specification tests could also be considered. See, for example, Andrews (2001) and Breunig and Hoderlein (2018).

In the rest of this section, we focus on the model (2.1) and establish the conditions under which the distribution of \(\beta _{i}\) is identified.

2.1 Identifying the moments of \(\beta _{i}\)

Assumption 1

  1. (a)

    (i) \(u_{i}\) is distributed independently of \({{\textbf {w}}} _{i}=\left( x_{i},{{\textbf {z}}}_{i}^{\prime }\right) ^{\prime }\) and \(\beta _{i} \). (ii) \(\sup _{i}E\left( \left| u_{i}^{r}\right| \right) <C\), \(r=1,2,\ldots ,2K-1\). (iii) \(n^{-1}\sum _{i=1}^{n} u_i^4 = O_p(1) \).

  2. (b)

    (i) Let \({{\textbf {Q}}}_{n,ww}=n^{-1}\sum _{i=1}^{n}{{\textbf {w}}}_{i} {{\textbf {w}}}_{i}^{\prime }\), and \({{\textbf {q}}}_{n,wy}=n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_{i}y_{i}\). Then \(\left\| E\left( {{\textbf {Q}}} _{n,ww}\right) \right\|<C<\infty \), and \(\left\| E\left( {{\textbf {q}}}_{n,wy}\right) \right\|<C<\infty \), and there exists \(n_{0}\in {\mathbb {N}}\) such that for all \(n\ge n_{0}\),

    $$\begin{aligned} 0<c<\lambda _{\min }\left( {{\textbf {Q}}}_{n,ww}\right)<\lambda _{\max }\left( {{\textbf {Q}}}_{n,ww}\right)<C<\infty . \end{aligned}$$

    (ii) \(\sup _{i} E\left( \left\| {{\textbf {w}}}_{i}\right\| ^{r}\right)<C<\infty \), \(r=1,2,\ldots ,4K-2\). (iii) \(n^{-1} \sum _{i=1}^{n} \left\| {{\textbf {w}}}_{i} \right\| ^{4} = O_p(1)\).

  3. (c)

    \(\left\| {{\textbf {Q}}}_{n,ww}-E \left( {{\textbf {Q}}} _{n,ww}\right) \right\| =O_p\left( n^{-1/2}\right) \), \(\left\| {{\textbf {q}}}_{n,wy}- E \left( {{\textbf {q}}}_{n,wy}\right) \right\| =O_p\left( n^{-1/2}\right) \), and

    $$\begin{aligned} E \left( {{\textbf {Q}}}_{n,ww}\right) =n^{-1}\sum _{i=1}^{n}E \left( {{\textbf {w}}}_{i}{{\textbf {w}}}_{i}^{\prime }\right) \succ 0. \end{aligned}$$
  4. (d)

    \(\left\| E \left( {{\textbf {Q}}}_{n,ww}\right) -{{\textbf {Q}}} _{ww}\right\| =O\left( n^{-1/2}\right) \), \(\left\| E \left( {{\textbf {q}}}_{n,wy}\right) -{{\textbf {q}}}_{wy}\right\| =O\left( n^{-1/2}\right) \), where \({{\textbf {q}}}_{wy} = \lim \limits _{n \rightarrow \infty } E \left( {{\textbf {q}}}_{n, wy} \right) \), \({{\textbf {Q}}}_{ww} = \lim \limits _{n \rightarrow \infty } E \left( {{\textbf {Q}}}_{n, ww} \right) \) and \({{\textbf {Q}}}_{ww}\succ 0\).

Remark 4

Part (a) of Assumption 1 relaxes the assumption that \(u_{i}\) is identically distributed, and allows for heterogeneously generated errors. For identification of the distribution of \(\beta _{i}\), we require \(u_{i}\) to be distributed independently of \( {{\textbf {w}}}_{i}\) and \(\beta _{i}\), which rules out conditional heteroskedasticity. However, estimation and inference involving \(E \left( \beta _{i}\right) \) and \(\varvec{\gamma }\) can be carried out in the presence of conditionally heteroskedastic errors, as shown in Theorem 3. Parts (c) and (d) of Assumption 1 relax the condition that \({{\textbf {w}}}_{i}\) is identically distributed across i. As we proceed, only \(\beta _{i}\), whose distribution is of interest, is assumed to be IID across i; \({{\textbf {w}}}_{i}\) and \(u_{i}\) are not required to be identically distributed over i.

Remark 5

The high-level conditions in Assumption 1, concerning the convergence in probability of averages such as \({{\textbf {Q}}}_{n,ww}=n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_{i}{{\textbf {w}}}_{i}^{\prime }\), can be verified under weak cross-sectional dependence. Let \(f_{i}=f\left( {{\textbf {w}}}_{i},\beta _{i},u_{i}\right) \) be a generic function of \({{\textbf {w}}}_{i}\), \(\beta _{i}\) and \(u_{i}\). Assume that \(\sup _{i}E\left( f_{i}^{2}\right) <C\), and \(\sup _{j}\sum _{i=1}^{n}\left| \textrm{cov} \left( f_{i},f_{j}\right) \right| <C\), for some fixed \(C<\infty \). Then,

$$\begin{aligned} \textrm{var}\left( \frac{1}{n}\sum _{i=1}^{n}f_{i}\right) \le \frac{1}{n^{2}} \sum _{i=1}^{n}\sum _{j=1}^{n}\left| \textrm{cov}\left( f_{i},f_{j}\right) \right| \le \frac{1}{n}\sup _{j}\sum _{i=1}^{n}\left| \textrm{cov} \left( f_{i},f_{j}\right) \right| \le \frac{C}{n}. \end{aligned}$$

By Chebyshev’s inequality, for any \(\varepsilon >0\), we can choose \(M_{\varepsilon }>\sqrt{C/\varepsilon }\) such that

$$\begin{aligned} \Pr \left( \sqrt{n}\left| \frac{1}{n}\sum _{i=1}^{n}\left[ f_{i}-\textrm{E }\left( f_{i}\right) \right] \right| >M_{\varepsilon }\right) \le \frac{ n\textrm{var}\left( n^{-1}\sum _{i=1}^{n}f_{i}\right) }{C}\varepsilon \le \varepsilon , \end{aligned}$$

i.e., \(n^{-1}\sum _{i=1}^{n}\left[ f_{i}-E\left( f_{i}\right) \right] =O_{p}\left( n^{-1/2}\right) \).
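As a quick numerical check of this rate, under our own choice of independent but non-identically distributed \(f_{i}\), the scaled deviation \(\sqrt{n}\left| n^{-1}\sum _{i=1}^{n}\left[ f_{i}-E\left( f_{i}\right) \right] \right| \) should remain stable as n grows; a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (1_000, 4_000, 16_000):
    reps = 500
    scales = 1.0 + 0.5 * np.sin(np.arange(n))       # heterogeneous standard deviations
    draws = rng.normal(size=(reps, n)) * scales     # independent, non-identical f_i, E(f_i) = 0
    dev = np.abs(draws.mean(axis=1))                # |n^{-1} sum_i [f_i - E(f_i)]|
    print(n, round(np.sqrt(n) * dev.mean(), 3))     # roughly constant in n => O_p(n^{-1/2})
```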

Denote \(\varvec{\phi }_{i}=\left( \beta _{i},\varvec{\gamma }^{\prime }\right) ^{\prime }\) and \(\varvec{\phi }=E\left( \varvec{\phi } _{i}\right) =\left( E\left( \beta _{i}\right) ,\varvec{\gamma } ^{\prime }\right) ^{\prime }\). Consider the moment condition,

$$\begin{aligned} E\left( {{\textbf {w}}}_{i}y_{i}\right) =E\left( {{\textbf {w}}}_{i} {{\textbf {w}}}_{i}^{\prime }\right) \varvec{\phi }, \end{aligned}$$
(2.3)

and sum (2.3) over i

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}E\left( {{\textbf {w}}}_{i}y_{i}\right) =\left[ \frac{1}{n}\sum _{i=1}^{n}E\left( {{\textbf {w}}}_{i}{{\textbf {w}}} _{i}^{\prime }\right) \right] \varvec{\phi }. \end{aligned}$$
(2.4)

Letting \(n\rightarrow \infty \), \(\varvec{\phi }\) is identified by

$$\begin{aligned} \varvec{\phi } = {{\textbf {Q}}}_{ww} ^{-1} {{\textbf {q}}}_{wy}, \end{aligned}$$
(2.5)

under Assumption 1.

Assumption 2

Let \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime } \varvec{\gamma }\).

  1. (a)

    \(\left| n^{-1}\sum _{i=1}^{n}E\left( {\tilde{y}} _{i}^{r}x_{i}^{s}\right) -\rho _{r,s}\right| =O\left( n^{-1/2}\right) ,\) and \(\left| \rho _{r,s}\right| <\infty ,\) for \(r,s=0,1,\ldots ,2K-1\).

  2. (b)

    \(\left| n^{-1}\sum _{i=1}^{n}E\left( u_{i}^{r}\right) -\sigma _{r}\right| =O\left( n^{-1/2}\right) ,\) and \( \left| \sigma _{r}\right| <\infty \), for \(r=2,3,\ldots ,2K-1\).

  3. (c)

    \(n^{-1}\sum _{i=1}^{n}\left[ \textrm{var}(x_{i}^{r})-\left( \rho _{0,2r}-\rho _{0,r}^{2}\right) \right] =O\left( n^{-1/2}\right) \) where \( \rho _{0,2r}-\rho _{0,r}^{2}>0,\) for \(r=2,3,\ldots ,2K-1\).

Remark 6

The above assumption allows for a limited degree of heterogeneity of the moments. As an example, let \(E\left( u_{i}^{r}\right) =\sigma _{ir}\) and denote the heterogeneity of the \(r\)th moment of \(u_{i}\) by \(e_{ir}=\sigma _{ir}-\sigma _{r}\). Then

$$\begin{aligned} \left| n^{-1}\sum _{i=1}^{n}E\left( u_{i}^{r}\right) -\sigma _{r}\right| \le n^{-1}\sum _{i=1}^{n}\left| e_{ir}\right| , \end{aligned}$$

and condition (b) of Assumption 2 is met if \(\sum _{i=1}^{n}\left| e_{ir}\right| =O(n^{\alpha _{r}})\) with \(\alpha _{r}<1/2\). \(\alpha _{r}\) measures the degree of heterogeneity, with \(\alpha _{r}=1\) representing the highest degree of heterogeneity. A similar idea is used by Pesaran and Zhou (2018) in their analysis of poolability in panel data models.

Theorem 1

Under Assumptions 1 and 2, \( E\left( \beta _{i}^{r}\right) \) and \(\sigma _{r}\), \(r=2,3,\ldots ,2K-1\) are identified.

Proof

For \(r=2,\ldots ,2K-1\),

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}\right)&=E\left( x_{i}^{r}\right) E\left( \beta _{i}^{r}\right) +E\left( u_{i}^{r}\right) +\sum _{q=2}^{r-1}\left( {\begin{array}{c}r\\ q\end{array}}\right) E\left( x_{i}^{r-q}\right) E\left( u_{i}^{q}\right) E\left( \beta _{i}^{r-q}\right) , \end{aligned}$$
(2.6)
$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}x_{i}^{r}\right)&=E\left( x_{i}^{2r}\right) E\left( \beta _{i}^{r}\right) +E\left( x_{i}^{r}\right) E\left( u_{i}^{r}\right) +\sum _{q=2}^{r-1}\left( {\begin{array}{c}r \\ q\end{array}}\right) E\left( x_{i}^{2r-q}\right) E\left( u_{i}^{q}\right) E\left( \beta _{i}^{r-q}\right) . \end{aligned}$$
(2.7)

where \(\left( {\begin{array}{c}r\\ q\end{array}}\right) =\frac{r!}{q!(r-q)!}\) are binomial coefficients, for nonnegative integers \(q\le r\).

Summing over i, letting \(n\rightarrow \infty \), and using parts (a) and (b) of Assumption 2, we obtain

$$\begin{aligned}&\rho _{0,r}E\left( \beta _{i}^{r}\right) +\sigma _{r} =\rho _{r,0}-\sum _{q=2}^{r-1}\left( {\begin{array}{c}r\\ q\end{array}}\right) \rho _{0,r-q}\sigma _{q}E\left( \beta _{i}^{r-q}\right) , \end{aligned}$$
(2.8)
$$\begin{aligned}&\rho _{0,2r}E\left( \beta _{i}^{r}\right) +\rho _{0,r}\sigma _{r} =\rho _{r,r}-\sum _{q=2}^{r-1}\left( {\begin{array}{c}r\\ q\end{array}}\right) \rho _{0,2r-q}\sigma _{q}E \left( \beta _{i}^{r-q}\right) . \end{aligned}$$
(2.9)

Derivation details are relegated to “Appendix A.1.” By part (c) of Assumption 2, the matrix \( \begin{pmatrix} \rho _{0,r} &{} 1 \\ \rho _{0,2r} &{} \rho _{0,r} \end{pmatrix} \) is invertible for \(r=2,3,\ldots ,2K-1\). As a result, we can sequentially solve (2.8) and (2.9) for \(E\left( \beta _{i}^{r}\right) \) and \(\sigma _{r}\), for \(r=2,3,\ldots ,2K-1\). \(\square \)
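The sequential solution in the proof can be mimicked on sample analogs of \(\rho _{r,s}\). The sketch below assumes \({\tilde{y}}_{i}\) is available (in practice \(\varvec{\gamma }\) is replaced by \(\hat{\varvec{\gamma }}\); see Sect. 3) and that \(E\left( \beta _{i}\right) \) has already been estimated; the function and variable names are ours.

```python
import numpy as np
from math import comb

def solve_moments(y_tilde, x, K, m1):
    """Sequentially solve (2.8)-(2.9) for m_r = E(beta_i^r) and sigma_r,
    r = 2,...,2K-1, using sample analogs of rho_{r,s}."""
    rho = lambda r, s: np.mean(y_tilde**r * x**s)      # sample analog of rho_{r,s}
    m = {0: 1.0, 1: m1}                                # moments of beta_i (m_1 estimated separately)
    sig = {0: 1.0, 1: 0.0}                             # sigma_0 = 1, sigma_1 = 0
    for r in range(2, 2 * K):
        lhs = np.array([[rho(0, r),     1.0],
                        [rho(0, 2 * r), rho(0, r)]])   # invertible by Assumption 2(c)
        rhs = np.array([rho(r, 0), rho(r, r)])
        for q in range(2, r):                          # strip known lower-order terms
            c = comb(r, q) * sig[q] * m[r - q]
            rhs -= c * np.array([rho(0, r - q), rho(0, 2 * r - q)])
        m[r], sig[r] = np.linalg.solve(lhs, rhs)       # solve (2.8)-(2.9) for (m_r, sigma_r)
    return m, sig
```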

2.2 Identifying the distribution of \(\beta _{i}\)

Beran and Hall (1992, Theorem 2.1, p. 1972) prove the identification of the distribution of the random coefficient, \(\beta _{i}\), in a canonical model without covariates, \(z_{i}\), under the condition that the distribution of \(\beta _{i}\) is uniquely determined by its moments. We show that identification of the moments of \(\beta _i\) holds more generally when \(x_i\) and \( u_i\) are not identically distributed, and that the distribution of \(\beta _i\) is identified if it follows a categorical distribution. Note that under (2.2),

$$\begin{aligned} E\left( \beta _{i}^{r}\right) =\sum _{k=1}^{K}\pi _{k}b_{k}^{r},\;r=0,1,2,\ldots ,2K-1, \end{aligned}$$
(2.10)

with \(E\left( \beta _{i}^{r}\right) \) identified under Assumptions 1 and 2. To identify \(\varvec{\pi } =\left( \pi _{1},\pi _{2},\ldots ,\pi _{K}\right) ^{\prime }\) and \({{\textbf {b}}} =\left( b_{1},b_{2},\ldots ,b_{K}\right) ^{\prime }\), we need to verify that the system of 2K equations in (2.10) has a unique solution if \( b_{1}<b_{2}<\cdots <b_{K}\), and \(\pi _{k}\in \left( 0,1\right) \). In the proof, we construct a linear recurrence relation and make use of the corresponding characteristic polynomial.

Theorem 2

Consider the random coefficient regression model (2.1), suppose that Assumptions 1 and 2 hold. Then \(\varvec{\theta }=\left( \varvec{\pi }^{\prime },{{\textbf {b}}}^{\prime }\right) ^{\prime }\) is identified subject to \(b_{1}<b_{2}<\cdots <b_{K}\) and \(\pi _{k}\in \left( 0,1\right) \), for all \(k=1,2,\ldots ,K\).

Proof

We motivate the key idea of the proof in the special case where \(K=2,\) and relegate the proof of the general case to the “Appendix A.1.” Let \(b_{1}=\beta _{L}\), \(b_{2}=\beta _{H}\), \(\pi _{1}=\pi \) and \(\pi _{2}=1-\pi \). Note that

$$\begin{aligned} E\left( \beta _{i}\right)&=\pi \beta _{L}+\left( 1-\pi \right) \beta _{H}, \end{aligned}$$
(2.11)
$$\begin{aligned} E\left( \beta _{i}^{2}\right)&=\pi \beta _{L}^{2}+\left( 1-\pi \right) \beta _{H}^{2}, \end{aligned}$$
(2.12)
$$\begin{aligned} E\left( \beta _{i}^{3}\right)&=\pi \beta _{L}^{3}+\left( 1-\pi \right) \beta _{H}^{3}, \end{aligned}$$
(2.13)

and \(E\left( \beta _{i}^{k}\right) \), \(k=1,2,3\), are identified. \( \left( \pi ,\beta _{L},\beta _{H}\right) \) can be identified if the system of Eqs. (2.11)–(2.13) has a unique solution. By (2.11),

$$\begin{aligned} \pi =\frac{\beta _{H}-E\left( \beta _{i}\right) }{\beta _{H}-\beta _{L}},\; \text {and}\; 1-\pi =\frac{E\left( \beta _{i}\right) -\beta _{L}}{\beta _{H}-\beta _{L}}. \end{aligned}$$
(2.14)

Plugging (2.14) into (2.12) and (2.13) yields

$$\begin{aligned} E\left( \beta _{i}\right) \left( \beta _{L}+\beta _{H}\right) -\beta _{L}\beta _{H}&=E\left( \beta _{i}^{2}\right) , \end{aligned}$$
(2.15)
$$\begin{aligned} E\left( \beta _{i}^2\right) \left( \beta _L + \beta _H\right) - E\left( \beta _i\right) \beta _{L}\beta _{H}&=E\left( \beta _{i}^{3}\right) . \end{aligned}$$
(2.16)

Denote \(\beta _{L+H} = \beta _{L} + \beta _{H}\) and \(\beta _{LH} = \beta _{L}\beta _{H}\), and write (2.15) and (2.16) in matrix form,

$$\begin{aligned} {{\textbf {M}}} {{\textbf {D}}} {{\textbf {b}}}^{*} = {{\textbf {m}}}, \end{aligned}$$
(2.17)

where

$$\begin{aligned} {{\textbf {M}}} = \begin{pmatrix} 1 &{} E\left( \beta _i\right) \\ E\left( \beta _i\right) &{} E\left( \beta _i^2 \right) \end{pmatrix}, \, {{\textbf {D}}} = \begin{pmatrix} -1 &{} 0 \\ 0 &{} 1 \end{pmatrix},\, {{\textbf {b}}}^{*} = \begin{pmatrix} \beta _{LH} \\ \beta _{L+H} \end{pmatrix}, \;\text {and}\; {{\textbf {m}}} = \begin{pmatrix} E\left( \beta _i^2\right) \\ E\left( \beta _i^3\right) \end{pmatrix}. \end{aligned}$$

Under the conditions \(0< \pi < 1\) and \(\beta _H > \beta _L\),

$$\begin{aligned} \det \left( {{\textbf {M}}} \right) = \textrm{var}\left( \beta _{i}\right) = E\left( \beta _{i}^{2}\right) -E\left( \beta _{i}\right) ^{2} = \pi \left( 1-\pi \right) \left( \beta _H - \beta _L \right) ^2 > 0. \end{aligned}$$

As a result, we can solve (2.17) for \(\beta _{L+H}\) and \( \beta _{LH}\) as

$$\begin{aligned} \beta _{L+H}&=\frac{E\left( \beta _{i}^{3}\right) -E \left( \beta _{i}\right) E\left( \beta _{i}^{2}\right) }{\textrm{var }\left( \beta _i \right) }, \end{aligned}$$
(2.18)
$$\begin{aligned} \beta _{LH}&=\frac{E\left( \beta _{i}\right) E\left( \beta _{i}^{3}\right) -E\left( \beta _{i}^{2}\right) ^{2}}{\textrm{ var}\left( \beta _i \right) }. \end{aligned}$$
(2.19)

\(\beta _{L}\) and \(\beta _{H}\) are solutions to the quadratic equation,

$$\begin{aligned} \beta ^{2}-\beta _{L+H}\beta +\beta _{LH}=0. \end{aligned}$$
(2.20)

We can verify that \(\Delta =\beta _{L+H}^{2}-4\beta _{LH}>0\) by direct calculation using (2.18) and (2.19). Simplifying \(\Delta \) in terms of \(E\left( \beta _{i}^{k}\right) \) and then plugging in (2.11), (2.12) and (2.13),

$$\begin{aligned} \Delta&= \frac{ \left[ E\left( \beta _{i}^{3}\right) - E \left( \beta _{i}\right) E\left( \beta _{i}^{2}\right) \right] ^{2} -4 \textrm{var}\left( \beta _i \right) \left[ E\left( \beta _{i}\right) E\left( \beta _{i}^{3}\right) - E\left( \beta _{i}^{2}\right) ^{2} \right] }{ \left[ \textrm{var}\left( \beta _i \right) \right] ^{2} } \\&= \left( \beta _H - \beta _L \right) ^2 > 0. \end{aligned}$$

Then, we obtain the unique solutions,

$$\begin{aligned} \beta _{L}&=\frac{1}{2}\left( \beta _{L+H}-\sqrt{\beta _{L+H}^{2}-4\beta _{LH}}\right) , \end{aligned}$$
(2.21)
$$\begin{aligned} \beta _{H}&=\frac{1}{2}\left( \beta _{L+H}+\sqrt{\beta _{L+H}^{2}-4\beta _{LH}}\right) , \end{aligned}$$
(2.22)

and \(\pi \) can be determined by (2.14) correspondingly. \(\square \)
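The proof delivers a closed-form map from the first three moments of \(\beta _{i}\) to \(\left( \beta _{L},\beta _{H},\pi \right) \). A minimal sketch of this map (our own code, following (2.14) and (2.18)–(2.22)) is:

```python
import numpy as np

def categories_from_moments(m1, m2, m3):
    """Recover (beta_L, beta_H, pi) from E(beta_i), E(beta_i^2), E(beta_i^3);
    requires var(beta_i) > 0 (the K = 2 case of Theorem 2)."""
    var = m2 - m1**2                            # det(M) in the proof
    b_sum = (m3 - m1 * m2) / var                # beta_L + beta_H, eq. (2.18)
    b_prod = (m1 * m3 - m2**2) / var            # beta_L * beta_H, eq. (2.19)
    delta = b_sum**2 - 4 * b_prod               # equals (beta_H - beta_L)^2 > 0
    b_L = 0.5 * (b_sum - np.sqrt(delta))        # eq. (2.21)
    b_H = 0.5 * (b_sum + np.sqrt(delta))        # eq. (2.22)
    pi = (b_H - m1) / (b_H - b_L)               # eq. (2.14)
    return b_L, b_H, pi

# Exact moments of beta_i in {0.5, 1.5} with pi = 0.3 recover the parameters:
m1, m2, m3 = (0.3 * 0.5**r + 0.7 * 1.5**r for r in (1, 2, 3))
print(categories_from_moments(m1, m2, m3))      # -> (0.5, 1.5, 0.3)
```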

Remark 7

The key identifying assumption in (2.2) is the strict ordering \(b_{1}<b_{2}<\cdots <b_{K}\), which ensures that \(b_{k}\ne b_{k^{\prime }}\) for \(k\ne k^{\prime }\) (ruling out permutations of the category labels), together with \(0<\pi _{k}<1\), so that the distribution of \(\beta _{i}\) does not degenerate. When \(K=2\), the conditions \(b_{1}<b_{2}<\cdots <b_{K}\), and \(\pi _{k}\in \left( 0,1\right) \), are equivalent to \(\textrm{var}\left( \beta _{i}\right) =\pi _{1}\left( 1-\pi _{1}\right) \left( b_{2}-b_{1}\right) ^{2}>0\). In other words, not surprisingly, the categorical distribution of \(\beta _{i}\) is identified only if \(\textrm{var} \left( \beta _{i}\right) >0\).

In practice, a test for \({\mathbb {H}}_{0}:\textrm{var}\left( \beta _{i}\right) =0\) is possible, by noting that \(\textrm{var}\left( \beta _{i}\right) =0\) is equivalent to

$$\begin{aligned} \kappa ^{2}=\frac{E\left( \beta _{i}\right) ^{2}}{E\left( \beta _{i}^{2}\right) }=1, \end{aligned}$$

where \(\kappa ^{2}\) is well defined as long as \(\beta _{i}\not \equiv 0\). One important advantage of basing the test of slope homogeneity on \(\kappa ^{2}\) rather than on \(\textrm{var}\left( \beta _{i}\right) =0\) is that \(\kappa ^{2}\) is scale-invariant. \(E\left( \beta _{i}\right) \) and \(E\left( \beta _{i}^{2}\right) \) are identified as in Sect. 2.1, and their consistent estimation does not require \(\textrm{var}\left( \beta _{i}\right) >0\). Consequently, in principle it is possible to test slope homogeneity by testing \({\mathbb {H}}_{0}:\kappa ^{2}=1\). However, the problem becomes much more complicated when there are more than two categories and/or more than one regressor under consideration. A full treatment of testing slope homogeneity in such general settings is beyond the scope of the present paper.
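Given estimates of the first two moments of \(\beta _{i}\), the statistic \(\kappa ^{2}\) is immediate to compute; a one-line sketch (our own illustration, using the exact moments of the \(K=2\) example above) follows.

```python
def kappa_squared(m1, m2):
    """Scale-invariant homogeneity measure kappa^2 = E(beta_i)^2 / E(beta_i^2);
    kappa^2 = 1 if and only if var(beta_i) = 0."""
    return m1**2 / m2

print(kappa_squared(1.2, 1.65))   # ~0.873 < 1, so var(beta_i) > 0 in the K = 2 example
```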

Remark 8

Note that in the special case of the proof of Theorem 2 where \(K=2\), \(\beta _{L+H}=\beta _{L}+ \beta _{H}\) and \(\beta _{LH}=\beta _{L}\beta _{H}\) correspond to \( b_{1}^{*}\) and \(b_{2}^{*}\), and (2.17) is the same as (A.1.6) when \(K = 2\). This special case illustrates the identification procedure: identify \(\left( b_{k}^{*} \right) _{k=1}^{K}\) from the moments of \(\beta _{i}\), then solve for \( \left( b_{k}\right) _{k=1}^{K}\), and finally identify \(\left( \pi _{k} \right) _{k=1}^{K}\).

3 Estimation

In this section, we propose a generalized method of moments estimator for the distributional parameters of \(\beta _i\). To reduce the complexity of the moment equations, we first obtain a \(\sqrt{n}\)-consistent estimator of \(\varvec{\gamma }\) and consider the estimation of the distribution of \(\beta _i\) by replacing \(\varvec{\gamma }\) by \({{\hat{\varvec{\gamma }}}}\).

3.1 Estimation of \(\varvec{\gamma }\)

Let \(\varvec{\phi }=\left( E\left( \beta _{i}\right) ,{\varvec{{\gamma }}}^{\prime }\right) ^{\prime }\) and \(v_{i}=\beta _{i}-E\left( \beta _{i}\right) \). Using the notation in Assumption 1, (2.1) can be written as

$$\begin{aligned} y_{i}={{\textbf {w}}}_{i}^{\prime }\varvec{\phi }+\xi _{i}, \end{aligned}$$
(3.1)

where \(\xi _{i}=u_{i}+x_{i}v_{i}\). Then, \(\varvec{\phi }\) can be estimated consistently by \(\hat{\varvec{\phi }}={{\textbf {Q}}}_{n,ww}^{-1}{{\textbf {q}}} _{n,wy} \) where \({{\textbf {Q}}}_{n,ww}\) and \({{\textbf {q}}}_{n,wy}\) are defined in Assumption 1.

Assumption 3

\(\left\| n^{-1}\sum _{i=1}^{n}E \left( {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime \xi _i^2\right) - {{\textbf {V}}}_{w\xi } \right\| = O\left( n^{-1/2} \right) \), \({{\textbf {V}}}_{w\xi }\succ 0, \) and

$$\begin{aligned} \left\| \frac{1}{n}\sum _{i=1}^{n} {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime \xi _i^2 - \frac{1}{n}\sum _{i=1}^{n}E\left( {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime \xi _i^2\right) \right\| = O_p\left( n^{-1/2} \right) . \end{aligned}$$
(3.2)

Remark 9

As in the case of Assumption 1, the high-level condition (3.2) can be shown to hold under weak cross-sectional dependence, assuming that elements of \({{\textbf {w}}}_{i} {{\textbf {w}}}_{i}^{\prime }\xi _{i}^{2}\) are cross-sectionally weakly correlated over i. See Remark 5.

Theorem 3

Under Assumption 1, \(\hat{\varvec{\phi }}\) is a consistent estimator for \(\varvec{\phi }\). In addition, under Assumptions 1 and 3, as \(n\rightarrow \infty \),

$$\begin{aligned} \sqrt{n}\left( \hat{\varvec{\phi }} - \varvec{\phi } \right) \rightarrow _{d} N\left( {{\textbf {0}}}, {{\textbf {V}}}_\phi \right) , \end{aligned}$$
(3.3)

where \({{\textbf {V}}}_{\phi } = {{\textbf {Q}}}_{w w}^{-1} {{\textbf {V}}}_{w\xi } {{\textbf {Q}}} _{w w}^{-1}. \) \({{\textbf {V}}}_{\phi }\) is consistently estimated by

$$\begin{aligned} \hat{{{\textbf {V}}}}_{\phi } = {{\textbf {Q}}}_{n,ww}^{-1}\hat{{{\textbf {V}}}}_{w\xi } {{\textbf {Q}}}_{n,ww}^{-1}\rightarrow _p {{\textbf {V}}}_{\phi }, \end{aligned}$$

as \(n\rightarrow \infty \), where \(\hat{{{\textbf {V}}}}_{w\xi } = n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime {\hat{\xi }}_{i}^2\), and \({\hat{\xi }}_i = y_i - {{\textbf {w}}}_i^\prime \hat{\varvec{\phi }}\).

The proof of Theorem 3 is provided in Sect. S.2 in the online supplement.
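A compact implementation of \(\hat{\varvec{\phi }}\) and the robust variance estimator of Theorem 3 might look as follows; this is a sketch under our own naming conventions.

```python
import numpy as np

def estimate_phi(y, x, z):
    """phi_hat = Q_{n,ww}^{-1} q_{n,wy} with the sandwich variance of Theorem 3."""
    n = len(y)
    w = np.column_stack([x, z])                 # w_i = (x_i, z_i')'
    Q = w.T @ w / n                             # Q_{n,ww}
    q = w.T @ y / n                             # q_{n,wy}
    phi = np.linalg.solve(Q, q)                 # estimate of (E(beta_i), gamma')'
    xi = y - w @ phi                            # xi_hat_i = y_i - w_i' phi_hat
    V_wxi = (w * (xi**2)[:, None]).T @ w / n    # V_hat_{w xi}
    Q_inv = np.linalg.inv(Q)
    V_phi = Q_inv @ V_wxi @ Q_inv               # V_hat_phi = Q^{-1} V_{w xi} Q^{-1}
    se = np.sqrt(np.diag(V_phi) / n)            # standard errors of phi_hat
    return phi, se
```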

3.2 Estimation of the distribution of \(\beta _i\)

Denote the moments of \(\beta _{i}\) on the left-hand side of (2.10) by

$$\begin{aligned} {{\textbf {m}}}_{\beta }&=(m_{1},m_{2},\ldots ,m_{2K-1})^{\prime }=\left[ E \left( \beta _{i}^{r}\right) \right] _{r=1}^{2K-1}\in \Theta _{m}\subset \left\{ {{\textbf {m}}}_{\beta }\in {\mathbb {R}}^{2K-1}:m_{r}\ge 0,\text { }r\text { is even}\right\} , \end{aligned}$$

and note that

$$\begin{aligned} {{\textbf {m}}}_{\beta }=\left( \begin{array}{c} m_{1} \\ m_{2} \\ \vdots \\ m_{2K-1} \end{array} \right) =\left( \begin{array}{cccc} b_{1} &{} b_{2} &{} \cdots &{} b_{K} \\ b_{1}^{2} &{} b_{2}^{2} &{} \cdots &{} b_{K}^{2} \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ b_{1}^{2K-1} &{} b_{2}^{2K-1} &{} \cdots &{} b_{K}^{2K-1} \end{array} \right) \left( \begin{array}{c} \pi _{1} \\ \pi _{2} \\ \vdots \\ \pi _{K} \end{array} \right) , \end{aligned}$$
(3.4)

so in general we can write \({{\textbf {m}}}_{\beta }\triangleq h\left( {\varvec{{\theta }}}\right) ,\) where \(\varvec{\theta }=\left( \varvec{\pi }^{\prime }, {{\textbf {b}}}^{\prime }\right) ^{\prime }\in \Theta \), and \(\varvec{\theta }\) can be uniquely determined in terms of \({{\textbf {m}}}_{\beta }\) by Theorem 2. To estimate \(\varvec{\theta }\), we consider moment conditions following a similar procedure as in Sect. 2 and propose a generalized method of moments (GMM) estimator.

We consider the following moment conditions:

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}\right) =\sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) E\left( x_{i}^{r-q}\right) E\left( u_{i}^{q}\right) m_{r-q}, \end{aligned}$$

and

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}x_{i}^{s_{r}}\right) =\sum _{q=0}^{r}\left( {\begin{array}{c} r\\ q\end{array}}\right) E\left( x_{i}^{r-q+s_{r}}\right) E\left( u_{i}^{q}\right) m_{r-q}, \end{aligned}$$
(3.5)

where \(E\left( u_{i}\right) =0\), \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}} _{i}^{\prime }\varvec{\gamma }\), \(r=1,2,\ldots ,2K-1\), and \(s_{r}=0,1,\ldots ,S-r \). Here S is a user-chosen tuning parameter satisfying \(S>2K-1\), so that the highest-order moment of \(x_{i}\) included is at most S.

Let \(\sigma _{0}=1\) and \(\sigma _{1}=0\) so that \(\sigma _{r}\) is well defined for \(r=0,1,\ldots ,2K-1\). Summing (3.5) over i and rearranging terms,

$$\begin{aligned} 0&=\sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) \left[ \frac{1}{n}\sum _{i=1}^{n}E \left( x_{i}^{r-q+s_{r}}\right) E\left( u_{i}^{q}\right) \right] m_{r-q}-\frac{1}{n}\sum _{i=1}^{n}E\left( {\tilde{y}} _{i}^{r}x_{i}^{s_{r}}\right) \nonumber \\&=\sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) \left[ \frac{1}{n}\sum _{i=1}^{n}E \left( x_{i}^{r-q+s_{r}}\right) \right] \sigma _{q}m_{r-q} - \frac{1}{n} \sum _{i=1}^{n}E\left( {\tilde{y}}_{i}^{r}x_{i}^{s_{r}}\right) +\delta _{n}^{(r,s_{r})}, \end{aligned}$$
(3.6)

where

$$\begin{aligned} \delta _{n}^{(r,s_{r})} =\sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) \left[ \frac{1}{n} \sum _{i=1}^{n}E\left( x_{i}^{r-q+s_{r}}\right) \left[ E \left( u_{i}^{q}\right) -\sigma _{q}\right] \right] m_{r-q} = O\left( n^{-1/2} \right) , \end{aligned}$$

as shown in the proof of Theorem 1.

Letting \(n\rightarrow \infty \) in (3.6),

$$\begin{aligned} \sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) \rho _{0,r-q+s_{r}}\sigma _{q}m_{r-q}-\rho _{r,s_{r}} = 0, \end{aligned}$$
(3.7)

by Assumption 2. We stack the left-hand side of (3.7) over \(r=1,2,\ldots ,2K-1\) and \(s_{r}=0,1,\ldots ,S-r\), and substitute \({{\textbf {m}}}_\beta = h\left( \varvec{\theta } \right) \) to obtain \( {{\textbf {g}}}_0\left( \varvec{\theta }, \varvec{\sigma }, \varvec{\gamma } \right) \).

To implement the GMM estimation, we replace \({\tilde{y}}_{i}\) by \(\hat{\tilde{y}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime }\hat{\varvec{\gamma }}\), and \(\rho _{r,s_{r}}\) by \(n^{-1}\sum _{i=1}^{n}\hat{{\tilde{y}}}_{i}^{r}x_{i}^{s_{r}}\). Noting that \({{\textbf {m}}}_{\beta }=h\left( \varvec{\theta }\right) \), denote the sample version of the left-hand side of (3.7) by

$$\begin{aligned} {\hat{g}}_{n}^{(r,s_{r})}\left( \varvec{\theta },\varvec{\sigma },\hat{{\varvec{{\gamma }}}}\right) =\frac{1}{n}\sum _{i=1}^{n}{\hat{g}}_{i}^{(r,s_{r})}\left( \varvec{\theta },\varvec{\sigma },\hat{\varvec{\gamma }}\right) , \end{aligned}$$
(3.8)

where

$$\begin{aligned} {\hat{g}}_{i}^{\left( r,s_{r}\right) }\left( \varvec{\theta },\varvec{\sigma }, \hat{\varvec{\gamma }}\right) =\sum _{q=0}^{r}\left( {\begin{array}{c}r\\ q\end{array}}\right) x_{i}^{r-q+s_{r}} \sigma _{q}\left[ h\left( \varvec{\theta }\right) \right] _{r-q}-\hat{\tilde{ y}}_{i}^{r}x_{i}^{s_{r}}, \end{aligned}$$

and \(\varvec{\sigma }=\left( \sigma _{2},\sigma _{3},\ldots ,\sigma _{2K-1}\right) ^{\prime }\). Stacking the equations in (3.8) over \( r=1,2,\ldots ,2K-1\) and \(s_{r}=0,1,\ldots ,S-r\) (\(S>2K-1\)), in vector notation we have

$$\begin{aligned} {{\hat{\varvec{{g}}}}}_{n}\left( \varvec{\theta },\varvec{\sigma },\hat{{\varvec{{\gamma }}}}\right) =\frac{1}{n}\sum _{i=1}^{n}{\hat{{\varvec{g}}}}_{i}\left( \varvec{\theta },\varvec{\sigma },\hat{\varvec{\gamma }}\right) . \end{aligned}$$
(3.9)

Given \(\hat{\varvec{\gamma }}\), the GMM estimator of \(\left( \varvec{\theta } ^{\prime },\varvec{\sigma }^{\prime }\right) ^{\prime }\) is now computed as

$$\begin{aligned} \left( \hat{\varvec{\theta }}^{\prime },\hat{\varvec{\sigma }}^{\prime }\right) ^{\prime }=\arg \min _{\varvec{\theta }\in \Theta ,\varvec{\sigma } \in {\mathcal {S}}}{\hat{\Phi }}_{n}\left( \varvec{\theta },\varvec{\sigma },\hat{ \varvec{\gamma }}\right) , \end{aligned}$$

where \({\hat{\Phi }}_{n}={\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta },{\varvec{{\sigma }}},\hat{\varvec{\gamma }}\right) ^{\prime }{{\textbf {A}}}_{n}{\hat{\varvec{{g}}}}_{n}\left( \varvec{\theta },\varvec{\sigma },\hat{\varvec{\gamma }}\right) \), and \({{\textbf {A}}}_{n}\) is a positive definite matrix. We follow the GMM literature in using the following choice of \({{\textbf {A}}}_{n}\),

$$\begin{aligned} {\hat{{\varvec{A}}}}_{n}=\left[ \frac{1}{n}\sum _{i=1}^{n}{\hat{{\varvec{g}}}} _{i}\left( \tilde{\varvec{\theta }},\tilde{\varvec{\sigma }},\hat{{\varvec{{\gamma }}}}\right) {\hat{{\varvec{g}}}}_{i}\left( \tilde{\varvec{\theta }},\tilde{\varvec{\sigma }},\hat{\varvec{\gamma }}\right) ^{\prime }-{{\bar{{\varvec{g}}}}} _{n}{\bar{{\varvec{g}}}}_{n}^{\prime }\right] ^{-1}, \end{aligned}$$
(3.10)

where \({\bar{{\varvec{g}}}}_{n}=\frac{1}{n}\sum _{i=1}^{n}{\hat{{\varvec{g}}}} _{i}\left( \tilde{\varvec{\theta }},\tilde{\varvec{\sigma }},\hat{{\varvec{{\gamma }}}}\right) \), and \(\tilde{\varvec{\theta }}\) and \(\tilde{{\varvec{{\sigma }}}}\) are preliminary estimators.
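The two-step GMM procedure can be sketched as follows for the univariate model with K categories. The parameterization of \(\varvec{\theta }\) (the first \(K-1\) probabilities, with \(\pi _{K}\) implied), the optimizer, and the starting values are our own choices; only the moment functions follow (3.8) and the weighting matrix follows (3.10).

```python
import numpy as np
from math import comb
from scipy.optimize import minimize

def h(theta, K):
    """m_r = sum_k pi_k b_k^r for r = 0,...,2K-1, theta = (pi_1..pi_{K-1}, b_1..b_K)."""
    pi = np.append(theta[:K - 1], 1.0 - np.sum(theta[:K - 1]))
    b = theta[K - 1:]
    return np.array([np.sum(pi * b**r) for r in range(2 * K)])

def g_all(eta, y_tilde, x, K, S):
    """n x q matrix whose rows are the stacked moment functions g_i of (3.8)."""
    theta, sig_free = eta[:2 * K - 1], eta[2 * K - 1:]
    m = h(theta, K)
    sig = np.concatenate(([1.0, 0.0], sig_free))        # sigma_0 = 1, sigma_1 = 0
    cols = []
    for r in range(1, 2 * K):
        for s in range(S - r + 1):                      # s_r = 0,...,S-r
            fit = sum(comb(r, q) * x**(r - q + s) * sig[q] * m[r - q]
                      for q in range(r + 1))
            cols.append(fit - y_tilde**r * x**s)
    return np.column_stack(cols)

def gmm(y_tilde, x, K, S, eta0):
    """Two-step GMM: identity weights first, then A_n from (3.10)."""
    def obj(eta, A):
        gbar = g_all(eta, y_tilde, x, K, S).mean(axis=0)
        return gbar @ A @ gbar
    A1 = np.eye(g_all(eta0, y_tilde, x, K, S).shape[1])
    eta1 = minimize(obj, eta0, args=(A1,), method="Nelder-Mead").x    # preliminary step
    A2 = np.linalg.inv(np.cov(g_all(eta1, y_tilde, x, K, S), rowvar=False))
    return minimize(obj, eta1, args=(A2,), method="Nelder-Mead").x
```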

Assumption 4

Denote the true values of \(\varvec{\theta }\), \(\varvec{\sigma }\) and \(\varvec{\gamma }\) by \(\varvec{\theta }_{0}\), \( \varvec{\sigma }_0\) and \(\varvec{\gamma }_0 \).

  1. (a)

    \(\Theta \) and \(\mathcal {S}\) are compact. \(\varvec{\theta }_{0}\in \textrm{int}\left( \Theta \right) \) and \(\varvec{\sigma }_0 \in \textrm{int} \left( {\mathcal {S}} \right) \).

  2. (b)

    \({{\textbf {A}}}_{n}\rightarrow _{p}{{\textbf {A}}}\) as \(n\rightarrow \infty \), where \({{\textbf {A}}}\) is some positive definite matrix.

  3. (c)

     

    $$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\left[ \hat{{\tilde{y}}}_{i}^{r}x_{i}^{s_{r}}-E\left( {\tilde{y}}_{i}^{r}x_{i}^{s_{r}}\right) \right] =O_{p}\left( n^{-1/2}\right) , \end{aligned}$$

    for \(r=0,1,2,\ldots ,2K-1\), \(s_{r}=0,1,\ldots ,S-r,\) and \(S>2K-1.\)

Remark 10

Parts (a) and (b) of Assumption 4 are standard regularity conditions in the GMM literature. Part (c), together with Assumption 2, provides high-level regularity conditions which allow us to generalize the usual IID assumption and nest the IID data generating process as a special case. The sample analog terms in (c) involve \(\hat{{\tilde{y}}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime }\hat{{\varvec{{\gamma }}}}\), instead of the infeasible \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}} _{i}^{\prime }\varvec{\gamma }\). The \(\sqrt{n}\)-consistency of \(\hat{{\varvec{{\gamma }}}}\) shown in Theorem 3 ensures that replacing \({\tilde{y}}_{i}\) by \(\hat{{\tilde{y}}}_{i}\) does not alter the convergence rate.

Theorem 4

Let \(\varvec{\eta }=\left( \varvec{\theta }^{\prime }, \varvec{\sigma }^{\prime }\right) ^{\prime }\) and \({\varvec{{\eta }}}_{0}=\left( \varvec{\theta }_{0}^{\prime },\varvec{\sigma }_{0}^{\prime }\right) ^{\prime }\). Under Assumptions 1, 2, and 4, \(\hat{ \varvec{\eta }}\rightarrow _{p}\varvec{\eta }_{0}\) as \(n\rightarrow \infty \).

The proof of Theorem 4 is provided in “Appendix A.1.”

Assumption 5

Follow the notations as in Assumption 4 and in addition denote \({{\textbf {G}}}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) =\nabla _{\left( \varvec{\theta }^{\prime }, \varvec{\sigma }^{\prime }\right) ^{\prime }} {{\textbf {g}}}_{0}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma } \right) \), \({{\textbf {G}}}_{0}={{\textbf {G}}}\left( \varvec{\theta }_{0},{\varvec{{\sigma }}}_{0},\varvec{\gamma }_{0}\right) \), \({{\textbf {G}}}_{\gamma }\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) =\nabla _{\varvec{\gamma }}{{\textbf {g}}}_{0}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) \), \({{\textbf {G}}}_{0, \gamma }={{\textbf {G}}}_{\gamma } \left( \varvec{\theta }_{0},\varvec{\sigma }_{0},\varvec{\gamma }_{0}\right) \).

  1. (a)

    \(\sqrt{n}{\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta }_{0},{\varvec{{\sigma }}}_{0},\varvec{\gamma }_0\right) \rightarrow _{d}\varvec{\zeta }\sim N\left( 0,{{\textbf {V}}}\right) \) as \(n\rightarrow \infty \).

  2. (b)

    \({{\textbf {G}}}_{0}^{\prime }{{\textbf {A}}}{{\textbf {G}}}_{0}\succ 0\).

Remark 11

In Assumption 5, part (a) is a high-level condition required to ensure the asymptotic normality of \({\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta }_{0},\varvec{\sigma }_{0},\varvec{\gamma }_0\right) \), which can be verified by the Lindeberg central limit theorem under low-level regularity conditions. Part (b) of Assumption 5 represents the full-rank condition on \({{\textbf {G}}}_{0}\), required for identification of \(\varvec{\theta }_{0}\) and \(\varvec{\sigma }_{0}\).

By Theorem 3, we have \(\sqrt{n}\left( \hat{ \varvec{\gamma }} - \varvec{\gamma } \right) \rightarrow _d \varvec{\zeta }_\gamma \sim N({{\textbf {0}}}, {{\textbf {V}}}_\gamma )\), where \({{\textbf {V}}}_\gamma \) is the block of \({{\textbf {V}}}_\phi \) corresponding to \(\varvec{\gamma }\), since \(\varvec{\gamma }\) is a subvector of \(\varvec{\phi }\). The following theorem shows the asymptotic normality of the GMM estimator \(\hat{\varvec{\eta }}\).

Theorem 5

Under Assumptions 1, 34 and 5,

$$\begin{aligned} \sqrt{n}\left( \hat{\varvec{\eta }}-\varvec{\eta }_{0}\right) \rightarrow _{d}\left( {{\textbf {G}}}_{0}^{\prime }{{\textbf {A}}}{{\textbf {G}}}_{0}\right) ^{-1} {{\textbf {G}}}_{0}^{\prime }{{\textbf {A}}}\left( \varvec{\zeta }+{{\textbf {G}}}_{0, \gamma }\varvec{\zeta }_{\gamma }\right) , \end{aligned}$$

as \(n\rightarrow \infty \).

The proof of Theorem 5 is provided in “Appendix A.1.”

Remark 12

In practice, we estimate the variance of the asymptotic distribution of \( \hat{\varvec{\eta }}\) by

$$\begin{aligned} \hat{{{\textbf {V}}}}_{\eta } = \left( \hat{{{\textbf {G}}}}^\prime \hat{{{\textbf {A}}}}_n \hat{{{\textbf {G}}}} \right) ^{-1} \hat{{{\textbf {G}}}}^\prime \hat{{{\textbf {A}}}}_n \hat{{{\textbf {V}}}}_{\zeta } \hat{{{\textbf {A}}}}_n^\prime \hat{{{\textbf {G}}}} \left( \hat{{{\textbf {G}}}}^\prime \hat{{{\textbf {A}}}}_n \hat{{{\textbf {G}}}} \right) ^{-1}, \end{aligned}$$
(3.11)

where \(\hat{{{\textbf {G}}}} = \nabla _{\left( \varvec{\theta }^{\prime },\varvec{\sigma }^{\prime }\right) ^{\prime }} \hat{{{\textbf {g}}}}_{n}\left( \hat{ \varvec{\theta }}, \hat{\varvec{\sigma }}, {\hat{\varvec{\gamma }}} \right) \), \(\hat{ {{\textbf {A}}}}_n\) is given by (3.10), and

$$\begin{aligned} \hat{{{\textbf {V}}}}_\zeta = \frac{1}{n} \sum _{i=1}^{n} \varvec{\psi }_{n,i} \varvec{\psi }_{n,i}^\prime , \end{aligned}$$

where

$$\begin{aligned} \varvec{\psi }_{n,i} = \hat{{{\textbf {g}}}}_i\left( \hat{\varvec{\theta }}, \hat{ \varvec{\sigma }}, {\hat{\varvec{\gamma }}} \right) + \nabla _{\varvec{\gamma }}\hat{ {{\textbf {g}}}}_{n}\left( \hat{\varvec{\theta }}, \hat{\varvec{\sigma }}, {\hat{\varvec{\gamma }}}\right) {{\textbf {L}}} {{\textbf {Q}}}_{n,ww}^{-1}\left( {{\textbf {w}}}_i{\hat{\xi }}_i \right) , \end{aligned}$$

and \({{\textbf {L}}} = \begin{pmatrix} {{\textbf {0}}}_{p_z\times 1}&{{\textbf {I}}}_{p_z} \end{pmatrix} \) is the loading matrix that selects \(\varvec{\gamma }\) out of \(\varvec{\phi }\).

4 Multiple regressors with random coefficients

One important extension of the regression model (2.1) is to allow for multiple regressors with random coefficients having a categorical distribution. With this in mind, consider

$$\begin{aligned} y_{i}={{\textbf {x}}}_{i}^{\prime }\varvec{\beta }_{i}+{{\textbf {z}}}_{i}^{\prime } \varvec{\gamma }+u_{i}, \end{aligned}$$
(4.1)

where the \(p\times 1\) vector of random coefficients, \(\varvec{\beta }_{i}\in {\mathbb {R}}^{p}\), follows the multivariate categorical distribution

$$\begin{aligned} \textrm{Pr}\left( \beta _{i1}=b_{1k_{1}},\beta _{i2}=b_{2k_{2}},\ldots ,\beta _{ip}=b_{pk_{p}}\right) =\pi _{k_{1},k_{2},\ldots ,k_{p}}, \end{aligned}$$
(4.2)

with \(k_{j}\in \left\{ 1,2,\ldots ,K\right\} \), \(b_{j1}<b_{j2}<\cdots <b_{jK} \), and

$$\begin{aligned} \sum _{k_{1},k_{2},\ldots ,k_{p}\in \left\{ 1,2,\ldots ,K\right\} }\pi _{k_{1},k_{2},\ldots ,k_{p}}=1. \end{aligned}$$

As in Sect. 2, \(\varvec{\gamma }\in {\mathbb {R}} ^{p_{z}}\), \({{\textbf {w}}}_{i}=\left( {{\textbf {x}}}_{i}^{\prime },{{\textbf {z}}} _{i}^{\prime }\right) ^{\prime }\), \(\varvec{\beta }_{i}\perp {{\textbf {w}}}_{i}\), \(u_{i}\perp {{\textbf {w}}}_{i}\), and \(u_{i}\) are independently distributed over i with mean 0.

Example 1

Consider the simple case with \(p = 2\) and \(K = 2\). For \(j = 1, 2\), denote the two categories by \(\left\{ L, H \right\} \). The probabilities of the four possible combinations of realized \(\varvec{\beta } _i\) are summarized in Table 1, where \(\pi _{LL} + \pi _{LH} + \pi _{HL} + \pi _{HH} = 1\).

Table 1 Distribution of \({\varvec{{\beta }}}_i\) with \(p = 2\) and \(K = 2\)

$$\begin{aligned} \begin{array}{c|cc} &{} \beta _{i2}=b_{2L} &{} \beta _{i2}=b_{2H} \\ \hline \beta _{i1}=b_{1L} &{} \pi _{LL} &{} \pi _{LH} \\ \beta _{i1}=b_{1H} &{} \pi _{HL} &{} \pi _{HH} \end{array} \end{aligned}$$

We first identify the moments of \(\varvec{\beta }_{i}\). As in Sect. 2, \(\varvec{\phi }=\left( E\left( \varvec{\beta } _{i}\right) ^{\prime },\varvec{\gamma }^{\prime }\right) ^{\prime }\) is identified by

$$\begin{aligned} \varvec{\phi } = {{\textbf {Q}}}_{ww} ^{-1} {{\textbf {q}}}_{wy}, \end{aligned}$$
(4.3)

under Assumption 1. We now consider the identification of the higher-order moments of \(\varvec{\beta } _{i}\) up to the finite order \(2K-1\).

Since \(\varvec{\gamma }\) is identified as in (4.3), we treat it as known and let \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime } \varvec{\gamma }\). For \(r=2,3,\ldots ,2K-1\), consider the moment conditions

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}\right)&=E\left[ \left( {\textbf { x}}_{i}^{\prime }\varvec{\beta }_{i}+u_{i}\right) ^{r}\right] \nonumber \\&=E\left[ \left( {{\textbf {x}}}_{i}^{\prime }\varvec{\beta } _{i}\right) ^{r}\right] +E\left( u_{i}^{r}\right) +\sum _{s=2}^{r-1} \left( {\begin{array}{c}r\\ s\end{array}}\right) E\left[ \left( {{\textbf {x}}}_{i}^{\prime }\varvec{\beta } _{i}\right) ^{r-s}\right] E\left( u_{i}^{s}\right) . \end{aligned}$$
(4.4)

Note that \({{\textbf {x}}}_{i}^{\prime }\varvec{\beta }_{i}=\sum _{j=1}^{p}\beta _{ij}x_{ij}\), and

$$\begin{aligned} E\left[ \left( \sum _{j=1}^{p}\beta _{ij}x_{ij}\right) ^{r}\right] =\sum _{\sum _{j=1}^{p}q_{j}=r}\left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) E\left( \prod _{j=1}^{p}x_{ij}^{q_{j}}\right) E\left( \prod _{j=1}^{p}\beta _{ij}^{q_{j}}\right) , \end{aligned}$$

where \(\left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) =\frac{r!}{q_{1}!q_{2}!\cdots q_{p}!}\), for nonnegative integers r, \(q_{1}\), \(\ldots \), \(q_{p}\) with \( r=\sum _{j=1}^{p}q_{j}\), denotes the multinomial coefficient. We stack \( \prod _{j=1}^{p}x_{ij}^{q_{j}}\) with \({{\textbf {q}}}\in \left\{ {{\textbf {q}}}\in \left\{ 0,1,\ldots ,r\right\} ^{p}:\sum _{j=1}^{p}q_{j}=r\right\} \) in vector form by defining

$$\begin{aligned} \varvec{\tau }_{r}\left( {{\textbf {x}}}_{i}\right) =\left[ \varphi \left( {{\textbf {x}}}_{i},{{\textbf {q}}}_{1}\right) ,\varphi \left( {{\textbf {x}}}_{i},{{\textbf {q}}}_{2}\right) ,\ldots ,\varphi \left( {{\textbf {x}}}_{i},{{\textbf {q}}}_{\nu _{r}}\right) \right] ^{\prime }, \end{aligned}$$

where \(\varphi \left( {{\textbf {x}}}_{i},{{\textbf {q}}}\right) =\prod _{j=1}^{p}x_{ij}^{q_{j}}\) and \(\nu _{r}=\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) \) is the number of distinct monomials of degree r in the variables \( x_{i1},x_{i2},\ldots ,x_{ip}\). Similarly,

$$\begin{aligned} \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) =\left[ \varphi \left( \varvec{\beta }_{i},{{\textbf {q}}}_{1}\right) ,\varphi \left( \varvec{\beta } _{i},{{\textbf {q}}}_{2}\right) ,\ldots ,\varphi \left( \varvec{\beta }_{i}, {{\textbf {q}}}_{\nu _{r}}\right) \right] ^{\prime }, \end{aligned}$$

where \(\varphi \left( \varvec{\beta }_{i},{{\textbf {q}}}\right) =\prod _{j=1}^{p}\beta _{ij}^{q_{j}}\).

Example 2

Consider \(p = 2\) and \(r = 2\); then we have

$$\begin{aligned} \varvec{\tau }_2\left( {{\textbf {x}}}_i \right)&= \left( x_{i1}^2, x_{i1}x_{i2}, x_{i2}^2 \right) ^\prime , \\ \varvec{\tau }_2\left( \varvec{\beta }_i \right)&= \left( \beta _{i1}^2, \beta _{i1}\beta _{i2}, \beta _{i2}^2 \right) ^\prime , \end{aligned}$$

and

$$\begin{aligned}&E\left[ \left( x_{i1}\beta _{i1} + x_{i2}\beta _{i2} \right) ^2 \right] = E\left( x_{i1}^2\right) E\left( \beta _{i1}^2\right) + 2 E\left( x_{i1}x_{i2}\right) E\left( \beta _{i1}\beta _{i2} \right) + E\left( x_{i2}^2 \right) E\left( \beta _{i2}^2\right) \\&\quad = \left[ E\left( x_{i1}^2\right) , E\left( x_{i1}x_{i2} \right) , E\left( x_{i2}^2 \right) \right] \textrm{diag}\left[ \left( 1, 2, 1 \right) ^\prime \right] \left[ E\left( \beta _{i1}^2\right) , E\left( \beta _{i1}\beta _{i2}\right) , E\left( \beta _{i2}^2 \right) \right] ^\prime \\&\quad = E\left[ \varvec{\tau }_2\left( {{\textbf {x}}}_i \right) \right] ^\prime \varvec{\Lambda }_2 E\left[ \varvec{\tau }_2\left( \varvec{\beta }_i \right) \right] , \end{aligned}$$

where \(\varvec{\Lambda }_2 = \textrm{diag}\left[ \left( 1, 2, 1 \right) ^\prime \right] \).
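Enumerating the exponent vectors \({{\textbf {q}}}\), the monomial vectors \(\varvec{\tau }_{r}\left( \cdot \right) \) and the coefficient matrices \(\varvec{\Lambda }_{r}\) is mechanical; a sketch (with helper names of our own choosing) is:

```python
import numpy as np
from itertools import product
from math import factorial

def exponent_vectors(p, r):
    """All q in {0,...,r}^p with q_1 + ... + q_p = r, in a fixed order."""
    return [q for q in product(range(r + 1), repeat=p) if sum(q) == r]

def tau(v, r):
    """tau_r(v): vector of all degree-r monomials prod_j v_j^{q_j}."""
    v = np.asarray(v, dtype=float)
    return np.array([np.prod(v**np.array(q)) for q in exponent_vectors(len(v), r)])

def Lambda(p, r):
    """Diagonal nu_r x nu_r matrix of multinomial coefficients r!/(q_1!...q_p!)."""
    coefs = [factorial(r) / np.prod([factorial(qj) for qj in q])
             for q in exponent_vectors(p, r)]
    return np.diag(coefs)

# Example 2 (p = 2, r = 2): three monomials and Lambda_2 = diag(1, 2, 1)
print(exponent_vectors(2, 2))                # [(0, 2), (1, 1), (2, 0)]
print(tau(np.array([2.0, 3.0]), 2))          # [9. 6. 4.] at (x1, x2) = (2, 3)
print(np.diag(Lambda(2, 2)))                 # [1. 2. 1.]
```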

Then, the moment condition (4.4) can be written as

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}\right)&=E\left[ \varvec{\tau } _{r}\left( {{\textbf {x}}}_{i}\right) \right] ^{\prime }\varvec{\Lambda }_{r} E\left[ \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) \right] +E\left( u_{i}^{r}\right) \nonumber \\&\quad \quad \quad +\sum _{s=2}^{r-1}\left( {\begin{array}{c}r\\ s\end{array}}\right) E\left[ \varvec{\tau }_{r-s}\left( {{\textbf {x}}}_{i}\right) \right] ^{\prime }\varvec{\Lambda } _{r-s}E\left[ \varvec{\tau }_{r-s}\left( \varvec{\beta }_{i}\right) \right] E\left( u_{i}^{s}\right) , \end{aligned}$$
(4.5)

where \(\varvec{\Lambda }_{r}=\textrm{diag}\left[ \left[ \left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) \right] _{\sum _{j=1}^{p}q_{j}=r}\right] \) is the \(\nu _{r}\times \nu _{r}\) diagonal matrix of multinomial coefficients. We further consider the moment conditions

$$\begin{aligned} E\left( {\tilde{y}}_{i}^{r}\varvec{\tau }_{r}\left( {{\textbf {x}}} _{i}\right) \right)&=E\left[ \varvec{\tau }_{r}\left( {{\textbf {x}}} _{i}\right) \varvec{\tau }_{r}\left( {{\textbf {x}}}_{i}\right) ^{\prime }\right] \varvec{\Lambda }_{r}E\left[ \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) \right] +E\left[ \varvec{\tau }_{r}\left( {{\textbf {x}}} _{i}\right) \right] E\left( u_{i}^{r}\right) \nonumber \\&\quad \quad \quad +\sum _{s=2}^{r-1}\left( {\begin{array}{c}r\\ s\end{array}}\right) E\left[ \varvec{\tau }_{r}\left( {{\textbf {x}}}_{i}\right) \varvec{\tau }_{r-s}\left( {{\textbf {x}}} _{i}\right) ^{\prime }\right] \varvec{\Lambda }_{r-s}E\left[ \varvec{\tau }_{r-s}\left( \varvec{\beta }_{i}\right) \right] E \left( u_{i}^{s}\right) , \end{aligned}$$
(4.6)

for \(r=2,3,\ldots ,2K-1\). Equations (4.5) and (4.6) reduce to (2.6) and (2.7) when \(p=1\).

Assumption 6

  1. (a)

    \(\left\| n^{-1}\sum _{i=1}^{n}E\left( {\tilde{y}}_{i}^{r} \varvec{\tau }_{s}\left( {{\textbf {x}}}_{i}\right) \right) -\varvec{\rho } _{r,s}\right\| =O\left( n^{-1/2}\right) ,\) and \(\left\| \varvec{\rho } _{r,s}\right\| <\infty \), \(r,s=0,1,\ldots ,2K-1\).

  2. (b)

    \(\left\| n^{-1}\sum _{i=1}^{n}E\left[ \varvec{\tau } _{r}\left( {{\textbf {x}}}_{i}\right) \varvec{\tau }_{s}\left( {{\textbf {x}}} _{i}\right) ^{\prime }\right] -\varvec{\Xi }_{r,s}\right\| =O\left( n^{-1/2}\right) ,\) and \(\left\| \varvec{\Xi }_{r,s}\right\| <\infty \), \(r,s=0,1,\ldots ,2K-1\).

  3. (c)

    \(\left| n^{-1}\sum _{i=1}^{n}E\left( u_{i}^{r}\right) -\sigma _{r}\right| =O\left( n^{-1/2}\right) ,\) and \(\left| \sigma _{r}\right| <\infty \) for \(r=2,3,\ldots ,2K-1\).

  4. (d)

    \(\left\| n^{-1}\sum _{i=1}^{n} \left[ \textrm{var} \left( \varvec{\tau }_r \left( {{\textbf {x}}}_{i} \right) \right) - \left( \varvec{\Xi } _{r,r} - \varvec{\rho }_{0, r}\varvec{\rho }_{0, r}^\prime \right) \right] \right\| = O(n^{-1/2})\), where \(\varvec{\Xi }_{r,r} - \varvec{\rho }_{0, r} \varvec{\rho }_{0, r}^\prime \succ 0\) for \(r=2,3,\ldots , 2K-1\).

Theorem 6

For any \({{\textbf {q}}} \in \left\{ {{\textbf {q}}}\in \left\{ 0, 1, \ldots ,r \right\} ^{p}: \sum _{j=1}^p q_j = r\right\} \) and \(r = 2, 3,\ldots , 2K-1\), \(E \left( \prod _{j=1}^{p} \beta _{ij}^{q_{j}}\right) \) and \(\sigma _r\) are identified under Assumptions 1 and 6.

Proof

For \(r=2,3,\ldots ,2K-1\), summing (4.5) and (4.6) over i and going through the same steps as in the proof of Theorem 1, by Assumptions 6(a) to (c) we have (as \( n\rightarrow \infty \))

$$\begin{aligned} \varvec{\rho }_{0,r}^{\prime }\varvec{\Lambda }_{r}E\left[ \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) \right] +\sigma _{r}&=\varvec{\rho }_{r,0}-\sum _{s=2}^{r-1}\left( {\begin{array}{c}r\\ s\end{array}}\right) \varvec{\rho }_{0,r-s}^{\prime }\varvec{\Lambda }_{r-s}E\left[ \varvec{\tau }_{r-s}\left( \varvec{\beta } _{i}\right) \right] \sigma _{s}, \end{aligned}$$
(4.7)
$$\begin{aligned} \varvec{\Xi }_{r,r}\varvec{\Lambda }_{r}E\left[ \varvec{\tau } _{r}\left( \varvec{\beta }_{i}\right) \right] +\varvec{\rho }_{0,r}\sigma _{r}&=\varvec{\rho }_{r,r}-\sum _{s=2}^{r-1}\left( {\begin{array}{c}r\\ s\end{array}}\right) \varvec{\Xi }_{r,r-s} \varvec{\Lambda }_{r-s}E\left[ \varvec{\tau }_{r-s}\left( {\varvec{{\beta }}}_{i}\right) \right] \sigma _{s}. \end{aligned}$$
(4.8)

Note that

$$\begin{aligned} {{\textbf {M}}}_{r}= \begin{pmatrix} \varvec{\Xi }_{r,r} &{} \varvec{\rho }_{0,r} \\ \varvec{\rho }_{0,r}^{\prime } &{} 1 \end{pmatrix} \begin{pmatrix} \varvec{\Lambda }_{r} &{} {{\textbf {0}}} \\ {{\textbf {0}}} &{} 1 \end{pmatrix}, \end{aligned}$$

is invertible since \(\det \left( {{\textbf {M}}}_{r}\right) =\det \left( \varvec{\Xi }_{r,r}-\varvec{\rho }_{0,r}\varvec{\rho }_{0,r}^{\prime }\right) \det \left( \varvec{\Lambda }_{r}\right) >0,\) for \(r=2,3,\ldots ,2K-1\), by Assumption 6(d). As a result, we can sequentially solve (4.7) and (4.8) for \(E\left[ \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) \right] \) and \(\sigma _{r}\), for \(r=2,3,\ldots ,2K-1\). \(\square \)

We now move from the moments of \(\varvec{\beta }_{i}\) to the distribution of \(\varvec{\beta }_{i}\). We first focus on the marginal probabilities obtained from (4.2) by averaging out the effects of all coefficients other than \(\beta _{ij}\); namely, we initially focus on identification of \(\lambda _{jk}=\Pr \left( \beta _{ij}=b_{jk}\right) \), for \(k=1,2,\ldots ,K\) and \(j=1,2,\ldots ,p\).

Remark 13

Focusing on the marginal distribution of \(\beta _{i}\) is similar to focusing on estimation of partial derivatives in the context of nonparametric estimation, where the curse of dimensionality applies. Consider the nonparametric regression of \(y_{i}\) on \({{\textbf {x}}}_{i}=\left( x_{i1},x_{i2},\ldots ,x_{ip}\right) ^{\prime }\),

$$\begin{aligned} y_{i}=F\left( x_{i1},x_{i2},\ldots .x_{ip}\right) +u_{i}. \end{aligned}$$

If \(F\left( x_{i1},x_{i2},\ldots ,x_{ip}\right) \) is a homogeneous function (of degree \(1/\mu \)), then by Euler's theorem

$$\begin{aligned} y_{i}=\sum _{j=1}^{p}\left( \mu \frac{\partial F\left( \cdot \right) }{ \partial x_{ij}}\right) x_{ij}+u_{i}, \end{aligned}$$

and under certain conditions we can treat \(\mu \frac{\partial F\left( \cdot \right) }{\partial x_{ij}}\equiv \beta _{ij}\).

By Theorem 6, \(E\left( \beta _{ij}^{r}\right) \) is identified for \(r=1,2,\ldots ,2K-1\) under Assumptions 1 and 6. By (4.2), we have equations

$$\begin{aligned} E\left( \beta _{ij}^{r}\right) =\sum _{k=1}^{K}\lambda _{jk}b_{jk}^{r}, \end{aligned}$$
(4.9)

\(r=0,1,\ldots ,2K-1\), which is of the same form as (2.10) and (3.4). To identify \(\varvec{\lambda }_{j}=\left( \lambda _{j1},\lambda _{j2},\ldots ,\lambda _{jK}\right) ^{\prime }\) and \({{\textbf {b}}} _{j}=\left( b_{j1},b_{j2},\ldots ,b_{jK}\right) ^{\prime }\), we can verify that the system of 2K equations in (4.9) has a unique solution if \(b_{j1}<b_{j2}<\cdots <b_{jK}\) and \(\lambda _{jk}\in \left( 0,1\right) \). The following corollary is a direct application of Theorem 2.

Corollary 7

Consider the model (4.1) and suppose that Assumptions 1 and 6 hold. Then, the parameters \(\varvec{\theta }_j = \left( \varvec{\lambda }_{j}^\prime , {{\textbf {b}}}_{j}^\prime \right) ^\prime \) of the marginal distribution of \(\varvec{\beta }_i\) with respect to \(\beta _{ij}\) are identified subject to \(b_{j1}< b_{j2}< \cdots < b_{jK}\) and \(\lambda _{jk} \in \left( 0, 1 \right) \), for \(j = 1, 2, \ldots , p\).

The problem of identification and estimation of the joint distribution of \( \varvec{\beta }_{i}\) is subject to the curse of dimensionality. We have \( K^{p}-1\) probability weights, \(\pi _{k_{1},k_{2},\ldots ,k_{p}}\), to be identified in addition to the pK categorical coefficients \(b_{jk}\) that are identified by Corollary 7. The number of parameters increases rapidly with p. Even in the simplest case with \(K=2\), the total number of unknown parameters is \(2p+2^{p}-1\), which grows exponentially in p.

Note that the marginal probabilities \(\lambda _{jk}\) are related to the joint distribution by

$$\begin{aligned} \lambda _{jk}=\sum _{k_{1},\ldots ,k_{j-1},k_{j+1},\ldots ,k_{p}\in \left\{ 1,2,\ldots ,K\right\} }\pi _{k_{1},k_{2},\ldots ,k_{j-1},k,k_{j+1},\ldots ,k_{p}}, \end{aligned}$$
(4.10)

\(k=1,2,\ldots ,K\) and \(j=1,2,\ldots ,p\). The number of linearly independent equations in (4.10) is \(pK-(p-1)\).
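The rank deficiency is easy to confirm numerically. A small R check (ours) builds the aggregation matrix mapping the joint probabilities to the marginals for the illustrative case \(p=3\), \(K=2\):

```r
# Rows of A are the indicator vectors behind (4.10): one row per (j, k) pair,
# one column per joint category (k_1, ..., k_p). Its rank is pK - (p - 1).
p <- 3; K <- 2
grid <- as.matrix(expand.grid(rep(list(1:K), p)))   # K^p joint categories
A <- do.call(rbind, lapply(1:p, function(j)
  t(sapply(1:K, function(k) as.numeric(grid[, j] == k)))))
dim(A)       # 6 x 8, i.e., pK rows and K^p columns
qr(A)$rank   # 4 = pK - (p - 1)
```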

Example 3

Consider the same setup as in Example 1 with \(p = 2\) and \(K = 2\). The marginal probabilities are given by

$$\begin{aligned}&\lambda _{1L} = \Pr \left( \beta _{i1} = b_{1L} \right) = \pi _{LL} + \pi _{LH},\nonumber \\&\qquad \lambda _{1H} = \Pr \left( \beta _{i1} = b_{1H} \right) = 1 - \lambda _{1L} = \pi _{HL} + \pi _{HH}, \nonumber \\&\lambda _{2L} = \Pr \left( \beta _{i2} = b_{2L} \right) = \pi _{LL} + \pi _{HL},\nonumber \\&\qquad \lambda _{2H} = \Pr \left( \beta _{i2} = b_{2H} \right) = 1 - \lambda _{2L} = \pi _{LH} + \pi _{HH} . \end{aligned}$$
(4.11)

Note that any equation in (4.11) can be expressed as a linear combination of the other three equations, for example, \(\lambda _{2H} = \lambda _{1L} + \lambda _{1H} - \lambda _{2L}\).

The equations corresponding to the cross-moments, \(E\left( \prod _{j=1}^{p}\beta _{ij}^{q_{j}}\right) \), are

$$\begin{aligned} E\left( \prod _{j=1}^{p}\beta _{ij}^{q_{j}}\right) =\sum _{k_{1},k_{2},\ldots ,k_{p}\in \left\{ 1,2,\ldots ,K\right\} }\left( \prod _{j=1}^{p}b_{jk_{j}}^{q_{j}}\right) \pi _{k_{1},k_{2},\ldots ,k_{p}}, \end{aligned}$$
(4.12)

for \({{\textbf {q}}}\in \left\{ {{\textbf {q}}}\in \left\{ 0,1,\ldots ,r-1\right\} ^{p}:\sum _{j=1}^{p}q_{j}=r\right\} \), \(r=2,\ldots ,2K-1\). The linear system (4.12) has

$$\begin{aligned} \sum _{r=1}^{2K-1}\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) -p(2K-1) \end{aligned}$$

equations. The total number of equations in (4.10) and (4.12) that can be utilized to identify the joint probabilities is then \(C_{r}=\sum _{r=1}^{2K-1}\left( {\begin{array}{c} r+p-1\\ p-1\end{array}}\right) -pK\), which is smaller than the number of joint probabilities \( K^{p}-1\) for large p. When \(K=2\), \(C_{r}<K^{p}-1\) for \(p\ge 7\).
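The crossover point is easily verified numerically; a short R check (ours) of \(C_{r}\) against \(K^{p}-1\) for \(K=2\):

```r
# C_r = sum_{r=1}^{2K-1} choose(r + p - 1, p - 1) - pK, compared with K^p - 1.
count_eq <- function(p, K = 2) sum(choose(1:(2 * K - 1) + p - 1, p - 1)) - p * K
t(sapply(5:8, function(p) c(p = p, C_r = count_eq(p), joint = 2^p - 1)))
# C_r first falls below 2^p - 1 at p = 7 (105 < 127)
```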

Identification and estimation of the joint distribution of \(\varvec{\beta } _{i}\) in the general setting will not be pursued in this paper due to the curse of dimensionality. Instead, we consider empirically relevant special cases in which identification of the joint distribution of \( \varvec{\beta }_{i}\) can be readily established. We first consider small p and K, in particular \(p=2\) and \(K=2\) as in Example 1.

Example 4

Consider the same setup as in Example 1 with \(p=2\) and \(K=2\). In addition to (4.11), consider the cross-moment,

$$\begin{aligned} E\left( \beta _{i1}\beta _{i2}\right) =b_{1L}b_{2L}\pi _{LL}+b_{1L}b_{2H}\pi _{LH}+b_{1H}b_{2L}\pi _{HL}+b_{1H}b_{2H}\pi _{HH}. \end{aligned}$$
(4.13)

Writing (4.11) and (4.13) in matrix form, we have

$$\begin{aligned} {{\textbf {B}}}\varvec{\pi }=\varvec{\lambda }, \end{aligned}$$

where

$$\begin{aligned} {{\textbf {B}}}= \begin{pmatrix} 1 &{} 1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 0 \\ b_{1L}b_{2L} &{} b_{1L}b_{2H} &{} b_{1H}b_{2L} &{} b_{1H}b_{2H} \end{pmatrix},\,\varvec{\pi }= \begin{pmatrix} \pi _{LL} \\ \pi _{LH} \\ \pi _{HL} \\ \pi _{HH} \end{pmatrix},\,\varvec{\lambda }= \begin{pmatrix} \lambda _{1L} \\ \lambda _{1H} \\ \lambda _{2L} \\ E\left( \beta _{i1}\beta _{i2}\right) \end{pmatrix}. \end{aligned}$$

Note that \(E\left( \beta _{i1}\beta _{i2}\right) \) is identified by Theorem 6, \(b_{jk_{j}}\) and \(\lambda _{jk_{j}}\) are identified by Corollary 7, and the matrix \({{\textbf {B}}}\) is invertible given that \(b_{1L}<b_{1H}\) and \(b_{2L}<b_{2H}\) (see “Appendix A.1”). As a result, the joint probabilities, \(\varvec{\pi },\) are identified.
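A numerical sketch of this inversion in R (ours, with illustrative parameter values):

```r
# Recover the joint probabilities pi from the identified marginals and the
# cross moment E(beta_i1 * beta_i2), as in Example 4.
b1 <- c(L = 1.0, H = 2.0); b2 <- c(L = 0.5, H = 1.5)  # illustrative supports
pi_true <- c(LL = 0.2, LH = 0.3, HL = 0.1, HH = 0.4)
B <- rbind(c(1, 1, 0, 0),                             # lambda_1L
           c(0, 0, 1, 1),                             # lambda_1H
           c(1, 0, 1, 0),                             # lambda_2L
           c(b1["L"] * b2["L"], b1["L"] * b2["H"],
             b1["H"] * b2["L"], b1["H"] * b2["H"]))   # E(beta_i1 * beta_i2)
lambda <- B %*% pi_true   # right-hand sides, identified by Theorem 6 and Corollary 7
solve(B, lambda)          # recovers pi_true, since b_1L < b_1H and b_2L < b_2H
```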

Remark 14

The argument in Example 4 applies to identification of the joint distribution of \(\left( \beta _{ij}, \beta _{i,j^\prime } \right) ^\prime \) for \( j\ne j^\prime \) when \(p > 2\) and \(K = 2\).

5 Finite sample properties using Monte Carlo experiments

We examine the finite sample performance of the categorical coefficient estimator proposed in Sect. 3 by Monte Carlo experiments.

5.1 Data generating processes

We generate \(y_{i}\) as

$$\begin{aligned} y_{i}=\alpha +x_{i}\beta _{i}+z_{i1}\gamma _{1}+z_{i2}\gamma _{2}+u_{i}, \text { for }i=1,2,\ldots ,n, \end{aligned}$$
(5.1)

with \(\beta _{i}\) distributed as in (2.2) with \(K=2,\) and the parameters \(\pi ,\beta _{L}\) and \(\beta _{H}\).Footnote 5

We draw \(\beta _{i}\) for each individual i independently by setting \(\beta _{i}=\beta _{L}\) with probability \(\pi \) and \(\beta _{i}=\beta _{H}\) with probability \(1-\pi \), through a sequence of independent Bernoulli draws. We consider two sets of parameters in all DGPs, denoted as the high variance and low variance parametrizations, respectively:

$$\begin{aligned} \left( \pi ,\beta _{L},\beta _{H},E\left( \beta _{i}\right) , \textrm{var}\left( \beta _{i}\right) \right) = {\left\{ \begin{array}{ll} \left( 0.5,1,2,1.5,0.25\right) &{} \left( high\,variance\right) \\ \left( 0.3,0.5,1.345,1.0915,0.15\right) &{} \left( low\,variance\right) \end{array}\right. }. \end{aligned}$$
(5.2)

We have \(\beta _{H}/\beta _{L}=2\) under the high variance parametrization and \(\beta _{H}/\beta _{L} = 2.69\) under the low variance parametrization; the latter is motivated by the estimates in our empirical illustration in Sect. 6.Footnote 6 The values of \(E(\beta _{i})\) and \(\textrm{var}\left( \beta _{i}\right) \) are obtained by noting that \(E(\beta _{i})=\pi \beta _{L}+(1-\pi )\beta _{H}\) and \(\textrm{var}\left( \beta _{i}\right) =\pi (1-\pi )(\beta _{H}-\beta _{L})^{2}\). The remaining parameters are set as \(\alpha =0.25\) and \(\varvec{\gamma }=\left( 1,1\right) ^{\prime }\) across DGPs.

We generate the regressors and the error terms as follows.

DGP 1 (Baseline) We first generate \({\tilde{x}}_{i}\sim \text {IID}\chi ^{2}(2)\), and then set \(x_{i}=({\tilde{x}}_{i}-2)/2\) so that \( x_{i}\) has mean zero and unit variance. The additional regressors, \(z_{ij}\), for \(j=1,2\), with homogeneous slopes are generated as

$$\begin{aligned} z_{i1}=x_{i}+v_{i1}\text { and }z_{i2}=z_{i1}+v_{i2}, \end{aligned}$$

with \(v_{ij}\sim \text {IID }N\left( 0,1\right) \), for \(j=1,2\). This ensures that the regressors are sufficiently correlated. The error term, \(u_{i}\), is generated as \(u_{i}=\sigma _{i}\varepsilon _{i}\), where \(\sigma _{i}^{2}\) are generated as \(0.5(1+\text {IID}\chi ^{2}(1))\), and \(\varepsilon _{i}\sim \text {IID}N(0,1)\). Note that \(\varepsilon _{i}\) and \(\sigma _{i}^{2}\) are generated independently, and \(E(u_{i}^{2})=1\).

DGP 2 (Categorical x) This setup deviates from the baseline DGP by allowing the distribution of \(x_{i}\) to differ across i. Accordingly, we generate \(x_{i}=\left( {\tilde{x}}_{1i}-2\right) /2\) where \( {\tilde{x}}_{1i}\sim \text {IID}\chi ^{2}\left( 2\right) \), for \(i=1,2,\ldots ,\lfloor n/2\rfloor \), and \(x_{i}=\left( {\tilde{x}}_{2i}-2\right) /4\) where \( {\tilde{x}}_{2i}\sim \text {IID}\chi ^{2}\left( 4\right) \), for \(i=\lfloor n/2\rfloor +1,\ldots ,n\). The additional regressors, \(z_{ij}\), for \(j=1,2\), with homogeneous slopes are generated as

$$\begin{aligned} z_{i1}=x_{i}+v_{i1}\text { and }z_{i2}=z_{i1}+v_{i2}, \end{aligned}$$

with \(v_{ij}\sim \text {IID }N\left( 0,1\right) \), for \(j=1,2\). The error term \(u_{i}\) is generated the same as in DGP 1.

DGP 3 (Categorical u) We generate \(x_{i}\) and \({{\textbf {z}}} _{i}\) the same as in DGP 1, but allow the error term \(u_{i}\) to have a heterogeneous distribution over i. For \(i=1,2,\ldots ,\lfloor n/2\rfloor \), we set \(u_{i}=\sigma _{i}\varepsilon _{i},\) where \(\sigma _{i}^{2}\sim \text {IID}\chi ^{2}\left( 2\right) \) and \(\varepsilon _{i}\sim \text {IID} N(0,1)\), and for \(i=\lfloor n/2\rfloor +1,\ldots ,n\), we set \(u_{i}=\left( {\tilde{u}}_{i}-2\right) /2\), where \({\tilde{u}}_{i}\sim \text {IID}\chi ^{2}\left( 2\right) \).
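For concreteness, a minimal R sketch (ours; the full computational algorithm is described in the online supplement) of one draw from DGP 1 under the high variance parametrization, with the DGP 2 and DGP 3 modifications noted in comments:

```r
set.seed(1)
n <- 5000
pi_ <- 0.5; bL <- 1; bH <- 2                 # high variance parametrization in (5.2)
alpha <- 0.25; gamma <- c(1, 1)
beta <- ifelse(runif(n) < pi_, bL, bH)       # independent Bernoulli draws of beta_i
x  <- (rchisq(n, df = 2) - 2) / 2            # mean zero, unit variance
z1 <- x + rnorm(n); z2 <- z1 + rnorm(n)      # correlated additional regressors
sig2 <- 0.5 * (1 + rchisq(n, df = 1))        # E(u_i^2) = 1
u  <- sqrt(sig2) * rnorm(n)
y  <- alpha + x * beta + z1 * gamma[1] + z2 * gamma[2] + u   # eq. (5.1)
# DGP 2: for i > n/2, draw x_i = (rchisq(., df = 4) - 2) / 4 instead.
# DGP 3: for i > n/2, draw u_i = (rchisq(., df = 2) - 2) / 2 instead,
#        and use sig2 ~ chi-square(2) for i <= n/2.
```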

We investigate the finite sample performance of the estimator proposed in Sect. 3 across DGPs 1 to 3 under the low variance and high variance scenarios.Footnote 7 Details of the computational algorithm used to carry out the Monte Carlo experiments (and the empirical results that follow) are given in Sect. S.5 of the online supplement. An accompanying R package is available at https://github.com/zhan-gao/ccrm.

5.2 Summary of the MC results

Table 2 Bias, RMSE and size of the least squares estimator \(\hat{\varvec{\phi }}\)
Fig. 1 Empirical power functions for the least squares estimator \(\hat{\varvec{\phi }}\) with the high variance parametrization (\(\textrm{var}\left( \beta _i \right) = 0.25\)). Notes: The data generating process is (5.1) with the high variance parametrization described in (5.2). “Baseline,” “Categorical x” and “Categorical u” refer to DGPs 1 to 3 as in Sect. 5.1. Power is calculated as \(R^{-1}\sum _{r=1}^R {{\textbf {1}}}\left[ \left| {\hat{\theta }}^{(r)} - \theta _\delta \right| / {\hat{\sigma }}_{{\hat{\theta }}}^{(r)} > \textrm{cv}_{0.05} \right] \) across \(R = 5000\) replications, for \(\theta _\delta \) in a symmetric neighborhood of the true parameter \(\theta _0\), where \({\hat{\theta }}^{(r)}\) is the estimate at the r-th replication, \({\hat{\sigma }}_{{\hat{\theta }}}^{(r)}\) is its estimated standard error, and \(\textrm{cv}_{0.05} = \Phi ^{-1}\left( 0.975 \right) \) with \(\Phi \left( \cdot \right) \) the cumulative distribution function of the standard normal distribution

Fig. 2 Empirical power functions for the least squares estimator \(\hat{\varvec{\phi }}\) with the low variance parametrization (\(\textrm{var}\left( \beta _i \right) = 0.15\)). Notes: The data generating process is (5.1) with the low variance parametrization described in (5.2). “Baseline,” “Categorical x” and “Categorical u” refer to DGPs 1 to 3 as in Sect. 5.1. Power is calculated as \(R^{-1}\sum _{r=1}^R {{\textbf {1}}}\left[ \left| {\hat{\theta }}^{(r)} - \theta _\delta \right| / {\hat{\sigma }}_{{\hat{\theta }}}^{(r)} > \textrm{cv}_{0.05} \right] \) across \(R = 5000\) replications, for \(\theta _\delta \) in a symmetric neighborhood of the true parameter \(\theta _0\), where \({\hat{\theta }}^{(r)}\) is the estimate at the r-th replication, \({\hat{\sigma }}_{{\hat{\theta }}}^{(r)}\) is its estimated standard error, and \(\textrm{cv}_{0.05} = \Phi ^{-1}\left( 0.975 \right) \) with \(\Phi \left( \cdot \right) \) the cumulative distribution function of the standard normal distribution

Table 3 Bias, RMSE and size of the GMM estimator for the distributional parameters of \(\beta _i\)
Fig. 3 Empirical power functions for the GMM estimator of the distributional parameters of \(\beta _i\) with the high variance parametrization (\(\textrm{var}\left( \beta _i \right) = 0.25\)). Notes: The data generating process is (5.1) with the high variance parametrization described in (5.2). “Baseline,” “Categorical x” and “Categorical u” refer to DGPs 1 to 3 as in Sect. 5.1. The model is estimated with \(S = 4\), the highest order of moments of \(x_i\) used in estimation. Power is calculated as \(R^{-1}\sum _{r=1}^R {{\textbf {1}}}\left[ \left| {\hat{\theta }}^{(r)} - \theta _\delta \right| /{\hat{\sigma }}_{{\hat{\theta }}}^{(r)} > \textrm{cv}_{0.05} \right] \) across \(R = 5000\) replications, for \(\theta _\delta \) in a symmetric neighborhood of the true parameter \(\theta _0\), where \({\hat{\theta }}^{(r)}\) is the estimate at the r-th replication, \({\hat{\sigma }}_{{\hat{\theta }}}^{(r)}\) is its estimated standard error, and \(\textrm{cv}_{0.05} = \Phi ^{-1}\left( 0.975 \right) \) with \(\Phi \left( \cdot \right) \) the cumulative distribution function of the standard normal distribution

Fig. 4 Empirical power functions for the GMM estimator of the distributional parameters of \(\beta _i\) with the low variance parametrization (\(\textrm{var}\left( \beta _i \right) = 0.15\)). Notes: The data generating process is (5.1) with the low variance parametrization described in (5.2). “Baseline,” “Categorical x” and “Categorical u” refer to DGPs 1 to 3 as in Sect. 5.1. The model is estimated with \(S = 4\), the highest order of moments of \(x_i\) used in estimation. Power is calculated as \(R^{-1}\sum _{r=1}^R {{\textbf {1}}}\left[ \left| {\hat{\theta }}^{(r)} - \theta _\delta \right| /{\hat{\sigma }}_{{\hat{\theta }}}^{(r)} > \textrm{cv}_{0.05} \right] \) across \(R = 5000\) replications, for \(\theta _\delta \) in a symmetric neighborhood of the true parameter \(\theta _0\), where \({\hat{\theta }}^{(r)}\) is the estimate at the r-th replication, \({\hat{\sigma }}_{{\hat{\theta }}}^{(r)}\) is its estimated standard error, and \(\textrm{cv}_{0.05} = \Phi ^{-1}\left( 0.975 \right) \) with \(\Phi \left( \cdot \right) \) the cumulative distribution function of the standard normal distribution

For each sample size \(n = 100\), 1000, 2000, 5000, 10,000 and 100,000, we run 5000 replications for DGP 1 (baseline), DGP 2 (categorical x) and DGP 3 (categorical u) with the high variance and low variance parametrizations, as set out in (5.2).

We first investigate the finite sample performance of \(\hat{\varvec{\phi }}\), as an estimator of \(\varvec{\phi }=\left( E\left( \beta _{i}\right) ,\varvec{\gamma }^{\prime }\right) ^{\prime }\). Bias and root mean squared errors (RMSE) for estimation of \(E\left( \beta _{i}\right) \), \(\gamma _{1}\) and \(\gamma _{2}\), as well as the size of tests of the null values at the 5 percent nominal level, are reported in Table 2. In addition, we plot the associated empirical power functions in Figs. 1 and 2, for the cases of high and low \(\textrm{var}(\beta _{i})\). The results show that \(\hat{\varvec{\phi }}\) has very good small sample properties, with small bias and RMSEs and with size very close to the nominal value of 5 percent across all DGPs and parametrizations, even when the sample size is relatively small. The power of the tests increases steadily as the sample size increases.
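The size and power entries follow the formula given in the notes to Figs. 1 and 2; a compact R sketch (ours), where `est` and `se` stand for hypothetical vectors of replication-specific estimates and estimated standard errors:

```r
# Rejection frequency of the two-sided 5 percent test over a grid of null
# values theta_delta; evaluating at theta_delta = theta_0 gives empirical size.
power_curve <- function(est, se, theta_grid, cv = qnorm(0.975)) {
  sapply(theta_grid, function(theta_delta)
    mean(abs(est - theta_delta) / se > cv))
}
```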

Then, we turn to the GMM estimator for the distributional parameters of \( \beta _{i}\) proposed in Sect. 3.2. The bias, RMSE and the test size based on the asymptotic distribution given in Theorem 5, for \(\pi \), \(\beta _{L}\) and \(\beta _{H}\), are reported in Table 3. The empirical power functions are reported in Figs. 3 and 4. The reported results are based on \(S=4\), where S \((>2K-1=3)\) denotes the highest order of moments of \(x_{i}\) included in estimation.Footnote 8

The upper panel of Table 3 reports the results for the high variance parametrization and the lower panel for the low variance parametrization, as set out in (5.2). For all parameters and under all DGPs, the bias and RMSE decline steadily with the sample size, as predicted by Theorem 4, and confirm the robustness of the GMM estimates to the heterogeneity in the regressor and the error processes. But for a given sample size, the relative precision of the estimates depends on the variability of \(\beta _{i}\), as characterized by the true value of \(\textrm{var}(\beta _{i})\). The estimates are more precise under the high variance parametrization than under the low variance parametrization. This is to be expected since, unlike \(E(\beta _{i})\), the distributional parameters are only identified if \(\textrm{var}(\beta _{i})>0\). As shown in (2.18) and (2.19) for the current case of \(K=2\), \(\textrm{var}(\beta _{i})\) enters the denominator when we recover the distributional parameters from the moments of \(\beta _{i}\). When \(\textrm{var}(\beta _{i})\) is small, estimation errors in the moments of \(\beta _{i}\) can be amplified in the estimation of \(\pi \), \(\beta _{L}\) and \(\beta _{H}\). Conversely, the larger the variance, the more precisely \(\pi \), \(\beta _{H}\) and \(\beta _{L}\) can be estimated for a given n.Footnote 9 The size and power also depend on the parametrization. With both the high variance and low variance parametrizations, we achieve correct size and reasonable power when n is quite large (\(n=\)100,000). We plot the empirical power functions for \(n\ge 5000\) for \(\pi \), \(\beta _{H}\) and \(\beta _{L}\), since the size is far above 5 percent for smaller values of n, and power comparisons are not meaningful in such cases.

Remark 15

Note that GMM estimators of the moments of \(\beta _{i}\), namely \({{\textbf {m}}}_{\varvec{\beta }}\), can be obtained using the moment conditions in (3.7), and the transformations \({{\textbf {m}}}_{\varvec{\beta }}=h\left( \varvec{\theta }\right) \) in (3.4) are required only to derive the estimators of \(\varvec{\theta }\), the parameters of the underlying categorical distribution. The Monte Carlo results in Sect. S.3.2 of the online supplement show that \({{\textbf {m}}}_{\varvec{\beta }}\) can be accurately estimated with relatively small sample sizes. In the estimation of both \({{\textbf {m}}}_{\varvec{\beta }}\) and \(\varvec{\theta }\), the same set of moment conditions is included, so the estimation of the distributional parameters \(\varvec{\theta }\) essentially relies on the relation \(\varvec{\theta }=h^{-1}\left( {{\textbf {m}}}_{\varvec{\beta }}\right) \). Sampling uncertainty in the estimation of \({{\textbf {m}}}_{\varvec{\beta }}\), particularly in the higher-order moments, is potentially amplified through the inverse transformation \(h^{-1}\), which involves matrix inversion; this causes difficulties in the estimation of, and inference on, \(\varvec{\theta }\) when sample sizes are small. This is analogous to the problem of estimating a precision matrix from an estimated covariance matrix. In practice, estimation of the categorical parameters is recommended for applications where the sample size is relatively large; otherwise, it is advisable to focus on estimates of the lower-order moments of \(\beta _{i}\).
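The amplification can be made concrete with the \(K=2\) inversion sketched after Corollary 7. In the following R experiment (ours), the same small perturbation of the second and third moments produces larger relative errors in \(\left( \pi ,\beta _{L},\beta _{H}\right) \) under the low variance parametrization of (5.2):

```r
inv2 <- function(m) {                        # compact K = 2 moment inversion
  v <- m[2] - m[1]^2
  g <- (m[3] - 3 * m[1] * m[2] + 2 * m[1]^3) / v^1.5
  lam <- (1 + g / sqrt(4 + g^2)) / 2
  d <- sqrt(v / (lam * (1 - lam)))
  c(pi = lam, b_L = m[1] - (1 - lam) * d, b_H = m[1] + lam * d)
}
mom <- function(th) sapply(1:3, function(r) th[1] * th[2]^r + (1 - th[1]) * th[3]^r)
eps <- c(0, 0.002, -0.002)                   # small error in m_2 and m_3
inv2(mom(c(0.5, 1.0, 2.0)) + eps)            # high variance design: mild distortion
inv2(mom(c(0.3, 0.5, 1.345)) + eps)          # low variance design: larger relative errors
```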

6 Heterogeneous return to education: an empirical application

Since the pioneering work by Becker (1962, 1964) on the effects of investments in human capital, estimating returns to education has been one of the focal points of labor economics research. In his seminal contribution, Mincer (1974) models the logarithm of earnings as a function of years of education and years of potential labor market experience (age minus years of education minus six), which can be written in the generic form:

$$\begin{aligned} \log \text {wage}_{i}=\alpha _{i}+\beta _{i}\text {edu}_{i}+\phi \left( {{\textbf {z}}}_{i}\right) +\varepsilon _{i}, \end{aligned}$$
(6.1)

as in Heckman et al. (2018, Eq. (1)), where \({{\textbf {z}}}_{i}\) includes labor market experience and other relevant control variables. The above wage equation, also known as the “Mincer equation”, has become the workhorse of empirical work on estimating the return to education. In the most widely used specification of the Mincer equation (6.1),

$$\begin{aligned} \phi \left( {{\textbf {z}}}_{i}\right) =\rho _{1}\text {exper}_{i}+\rho _{2}\text { exper}_{i}^{2}+\tilde{{{\textbf {z}}}}_{i}^{\prime }\tilde{\varvec{\gamma }}, \end{aligned}$$

where \(\tilde{{{\textbf {z}}}}_{i}\) is the vector of control variables other than potential labor market experience.

Along with the advancement of empirical research on this topic, there has been a growing awareness of the importance of heterogeneity in individual cognitive and non-cognitive abilities (Heckman 2001) and of its significance for explaining the observed heterogeneity in the return to education. Accordingly, it is important to allow the parameters of the wage equation to differ across individuals. In Eq. (6.1), we allow \(\alpha _{i}\) and \(\beta _{i}\) to differ across individuals, but assume that \(\phi \left( {{\textbf {z}}}_{i}\right) \) can be approximated by a nonlinear function of experience and other control variables with homogeneous coefficients.

Specifically, following Lemieux (2006b, 2006c), we also allow for time variation in the parameters of the wage equation and consider the following categorical coefficient model over a given cross-section sample indexed by tFootnote 10:

$$\begin{aligned} \log \text {wage}_{it}=\alpha _{it}+\beta _{it}\text {edu}_{it}+\rho _{1t} \text {exper}_{it}+\rho _{2t}\text {exper}_{it}^{2}+\tilde{{{\textbf {z}}}} _{it}^{\prime }\tilde{\varvec{\gamma }}_{t}+\varepsilon _{it}, \end{aligned}$$
(6.2)

where the return to education follows the categorical distribution,

$$\begin{aligned} \beta _{it}= {\left\{ \begin{array}{ll} b_{tL} &{} \text {w.p. }\pi _{t}, \\ b_{tH} &{} \text {w.p. }1-\pi _{t}, \end{array}\right. } \end{aligned}$$

and \(\tilde{{{\textbf {z}}}}_{it}\) includes gender, marital status and race. \( \alpha _{it}=\alpha _{t}+\delta _{it}\), where \(\delta _{it}\) is a mean-zero random variable assumed to be distributed independently of \(\text {edu}_{it}\) and \({{\textbf {z}}}_{it}=\left( \text {exper}_{it},\text {exper}_{it}^{2}\text {, } \tilde{{{\textbf {z}}}}_{it}^{\prime }\right) ^{\prime }\). Let \(u_{it}=\varepsilon _{it}+\delta _{it}\), and write (6.2) as

$$\begin{aligned} \log \text {wage}_{it}=\alpha _{t}+\beta _{it}\text {edu}_{it}+\rho _{1t}\text { exper}_{it}+\rho _{2t}\text {exper}_{it}^{2}+\tilde{{{\textbf {z}}}}_{it}^{\prime } \tilde{\varvec{\gamma }}_{t}+u_{it}. \end{aligned}$$
(6.3)
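A hypothetical R sketch (ours; the variable and data frame names are made up for illustration) of the first-step least squares fit of (6.3), which delivers the estimate of \(E\left( \beta _{it}\right) \) but not the distributional parameters:

```r
# The coefficient on edu estimates E(beta_it); recovering (pi_t, b_tL, b_tH)
# requires the second-step GMM of Sect. 3.2.
fit <- lm(log(wage) ~ edu + exper + I(exper^2) + female + married + race,
          data = cps_subsample)   # cps_subsample: hypothetical data frame
coef(fit)["edu"]                  # estimate of E(beta_it)
```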

The correlation between \(\alpha _{it}\) and \(\text {edu}_{it}\) in (6.1) is the source of “ability bias” (Griliches 1977). Given the pure cross-sectional nature of our analysis, we do not allow for the endogeneity arising from “ability bias” or dynamics. To allow for nonzero correlations between \(\alpha _{it}\), edu\(_{it}\) and \({{\textbf {z}}} _{it}\), a panel data approach is required, which has its own challenges, as education and experience variables tend to be very slow moving (if at all) for many individuals in the panel. Time delays between changes in education and experience and the wage outcomes also further complicate the interpretation of the mean estimates of \(\beta _{it}\), which we shall be reporting. To partially address the possible dynamic spillover effects, we provide estimates of the distribution of \(\beta _{it}\) using cross-sectional data from two different sample periods, and investigate the extent to which the distribution of the return to education has changed over time, by gender and by the level of educational achievement.Footnote 11

We estimate the categorical distribution of the return to education in (6.3) using the May and Outgoing Rotation Group (ORG) supplements of the Current Population Survey (CPS) data, as in Lemieux (2006b, 2006c).Footnote 12 We pool observations from 1973 to 1975 for the first sample period, \( t=\left\{ 1973{-}1975\right\} \), and observations from 2001 to 2003 for the second sample period, \(t=\left\{ 2001{-}2003\right\} \). Following Lemieux (2006b), we consider subsamples of those with 12 or fewer years of education, “high school or less,” and those with more than 12 years of education, “postsecondary education,” as well as the combined sample. We also present results by gender. The summary statistics are reported in Table 4. As is to be expected, mean log wages are higher for those with postsecondary education (for both males and females), with the number of years of schooling and experience rising by about one year across the two sub-period samples. There are also important differences across males and females, and across the two educational groupings, which we hope to capture in our estimation.

Table 4 Summary statistics of the May and outgoing rotation group (ORG) supplements of the current population survey (CPS) data across two periods, 1973–1975 and 2001–2003, by years of education and gender
Table 5 Estimates of the distribution of the return to education across two periods, 1973–1975 and 2001–2003, by years of education and gender
Table 6 Estimates of \({\varvec{{\gamma }}}\) associated with control variables \({{\textbf {z}}}_i\) with specification (6.2) across two periods, 1973–1975 and 2001–2003, by years of education and gender, which complements Table 5

We treat the cross-section observations in the two sample periods, \( t=\left\{ 1973{-}1975\right\} \) and \(\left\{ 2001{-}2003\right\} \), as repeated cross sections, rather than as a panel, since the data in these two periods do not cover the same individuals and represent random samples from the population of wage earners in the two periods. It should also be noted that the sample sizes \((n_{t})\), although quite large, are much larger during \( \left\{ 2001{-}2003\right\} \), which could be a factor when we come to compare estimates from the two sample periods. For example, for the combined male and female sample, \(n_{73-75}=\) 111,632 as compared to \(n_{01-03}=\) 511,819, a difference that becomes more pronounced when we consider the number of observations in the postsecondary/female category, which rises from 12,882 in the first period to 100,007 in the second period.

We report estimates of \(\pi _{t}\), \(\beta _{L,t}\) and \(\beta _{H,t}\), as well as the corresponding means and standard deviations (denoted by s.d.(\({\hat{\beta }} _{it}\))) of the return to education (\(\beta _{it}\)) for \(t=\left\{ 1973{-}1975\right\} \) and \(\left\{ 2001{-}2003\right\} \). For a given \(\pi _{t}\), the ratio \(\beta _{H,t}/\beta _{L,t}\) provides a measure of within-group heterogeneity and allows us to augment information on changes in the mean with changes in the distribution of the return to education. The estimates of the distribution of the return to education (\(\beta _{it}\)) are summarized in Table 5, with the estimation results for the control variables (such as experience, experience squared, and other individual-specific characteristics) reported in Table 6.

As can be seen from Table 5, estimates of \( \mathrm {s.d.}\left( \beta _{it}\right) \) are strictly positive for all subgroups, except for the “high school or less” group during the first sample period. For this group during the first period, the estimate of \(\mathrm {s.d.}\left( \beta _{it}\right) \) for the male subsample is zero, \(\pi \) is not identified, and we obtain identical estimates for \(\beta _{L}\) and \(\beta _{H}\). For this subsample, the associated estimates and their standard errors are shown as unavailable (n/a). In the case of the female subsample, as well as the combined male and female sample, where the estimates of s.d.(\({\hat{\beta }}_{it}\)) are close to zero and \(\pi \) is poorly estimated, only the mean of the return to education is informative. In the case of the samples where the estimates of \( \mathrm {s.d.}\left( \beta _{it}\right) \) are strictly positive, the estimate of the ratio \(\beta _{H,t}/\beta _{L,t}\) provides a good measure of the within-group heterogeneity of the return to education. The estimates of \(\beta _{H,t}/\beta _{L,t}\) lie between 1.50 and 2.79, with the high estimate obtained for females with high school or less education during \(\left\{ { 2001{-}2003}\right\} \), and the low estimate for females with postsecondary education during the same period.

As our theory suggests, the mean return to education, \(E\left( \beta _{it}\right) \), is very precisely estimated, and inferences involving it tend to be robust to conditional error heteroskedasticity. The results in Table 5 show that estimates of \(E\left( \beta _{it}\right) \) have increased from \(t=\left\{ 1973{-}1975\right\} \) to \(t=\left\{ 2001{-}2003\right\} \), regardless of gender or educational grouping. The postsecondary educational group shows larger increases in the estimates of \(E\left( \beta _{it}\right) \) than those with high school or less. Estimates of \(E\left( \beta _{it}\right) \) increase by 36 percent for the postsecondary group, while the estimates of the mean return to education rise only by around 5 percent for those with high school or less. This result holds for both genders. Comparing the mean returns across the two educational groups, we find that the mean return to education of individuals with postsecondary education is 45 percent higher than that of those with high school or less in the \(\{1973{-}1975\}\) period, and this gap increases to 87 percent in the second period, \(\left\{ 2001{-}2003\right\} \). Similar patterns are observed in the subsamples by gender. The estimates suggest rising between-group heterogeneity, which is mainly due to the increasing returns to education for the postsecondary group.

Turning to within-group heterogeneity, we focus on the estimates of \(\beta _{H,t}/\beta _{L,t}\) and first note that, over the two periods, within-group heterogeneity has been rising mainly for those with high school or less, for both males and females. For the combined male and female sample and for the male subsample, there is little evidence of within-group heterogeneity in the first period \(\left\{ 1973{-}1975\right\} \). However, for the second period \(\left\{ 2001{-}2003\right\} \), we find a sizeable degree of within-group heterogeneity, with \(\beta _{H,t}/\beta _{L,t}\) estimated to be around 2.41 and \(\text {s.d.}\left( \beta _{it}\right) \approx 0.03\). For the female subsample with high school or less, little evidence of heterogeneity was found for the first period; the estimate of \(\beta _{H,t}/\beta _{L,t}\) rises to 2.79 in the second sample period, corresponding to a commensurate rise in \(\text {s.d.}\left( \beta _{it}\right) \) to 0.032. The pattern of within-group heterogeneity is very different for those with postsecondary education. For this group, we in fact observe a slight decline in the estimates of \(\beta _{H,t}/\beta _{L,t}\) for both genders over the two sample periods.

Overall, our estimates of return to education and the within and between group comparisons are in line with the evidence of rising wage inequality documented in the literature (Corak 2013).

7 Conclusion

In this paper, we consider random coefficient models for repeated cross sections in which the random coefficients follow categorical distributions. Identification is established by expressing the moments of the random coefficients in terms of the moments of the underlying observations. We propose a two-step generalized method of moments (GMM) procedure to estimate the parameters of the categorical distributions. The consistency and asymptotic normality of the GMM estimators are established without the IID assumption typically made in the literature. The small sample properties of the proposed estimator are investigated by means of Monte Carlo experiments; the estimator is shown to be robust to heterogeneously generated regressors and errors, although relatively large samples are required to estimate the parameters of the underlying categorical distributions. This is largely due to the highly nonlinear mapping between the parameters of the categorical distribution and the higher-order moments of the coefficients. The problem is likely to become more pronounced with a larger number of categories and coefficients.

In the empirical application, we apply the model to study the evolution of the return to education over two sub-periods, also considered in the literature by Lemieux (2006b). Our estimates show that the mean (ex post) return to education has risen from 1973–1975 to 2001–2003, mainly for individuals with postsecondary education, and this result is robust across genders. We also find evidence of within-group heterogeneity for the high school or less educational group, as compared to those with postsecondary education.

In our model specification, the number of categories, K, is treated as a tuning parameter and assumed to be known. An information criterion, as in Bonhomme and Manresa (2015) and Su et al. (2016), could be considered to determine K. Further investigation of models with multiple regressors subject to parameter heterogeneity is also warranted. These and other related issues are topics for future research.