Abstract
This paper proposes a linear categorical random coefficient model, in which the random coefficients follow parametric categorical distributions. The distributional parameters are identified based on a linear recurrence structure of moments of the random coefficients. A generalized method of moments (GMM) estimation procedure is proposed, similar in spirit to the one employed by Peter Schmidt and his coauthors to address heterogeneity in time effects in panel data models. Using Monte Carlo simulations, we find that moments of the random coefficients can be estimated reasonably accurately, but large samples are required for the estimation of the parameters of the underlying categorical distribution. The utility of the proposed estimator is illustrated by estimating the distribution of returns to education in the USA by gender and educational levels. We find that rising heterogeneity between educational groups is mainly due to the increasing returns to education for those with postsecondary education, whereas within-group heterogeneity has been rising mostly in the case of individuals with high school or less education.
1 Introduction
Random coefficient models have been used extensively in time series, cross-section and panel regressions. Nicholls and Pagan (1985) consider the estimation of first and second moments of the random coefficient \(\beta _{i}\) and the error term \(u_{i}\) in a linear regression model. In a seminal paper, Beran and Hall (1992) establish conditions for identifying and estimating the distribution of \(\beta _{i}\) and \(u_{i}\) nonparametrically. The baseline linear univariate regression in Beran and Hall (1992) has been extended in a nonparametric framework by Beran (1993), Beran and Millar (1994), Beran et al. (1996), Hoderlein et al. (2010), Hoderlein et al. (2017) and Breunig and Hoderlein (2018), to name just a few. Hsiao and Pesaran (2008) survey random coefficient models in linear panel data models.
In some econometric applications, for example Hausman (1981), Hausman and Newey (1995) and Foster and Hahn (2000), the main interest is to estimate the consumer surplus distribution based on a linear demand system where the coefficient associated with the price is random. In such settings, the distribution of the random coefficients is needed when computing the consumer surplus function, and nonparametric estimation is more general, flexible and suitable for the purpose. On the other hand, parametric models may be favored in applications in which the implied economic meaning of the distribution of the random coefficients is of interest. Examples include estimation of the return to education (Lemieux 2006b, c) and the labor supply equation (Bick et al. 2022).
In this paper, we consider a linear regression model with a random coefficient \(\beta _{i}\) that is assumed to follow a categorical distribution, i.e., \(\beta _{i}\) has a discrete support \(\left\{ b_{1},b_{2},\ldots ,b_{K}\right\} \), and \(\beta _{i}=b_{k}\) with probability \(\pi _{k}\). The discretization of the support of the random coefficient \( \beta _{i}\) naturally corresponds to the interpretation that each individual belongs to a certain category, or group, k with probability \(\pi _{k}\). Compared to a nonparametric distribution with continuous support, assuming a categorical distribution allows us not only to model the heterogeneous responses across individuals but also to interpret the results with sharper economic meaning. As we will illustrate in the empirical application in Sect. 6, it is hard to clearly interpret the distribution of returns to education without imposing some form of parametric restrictions.
In addition, with the categorical distribution imposed, the identification and estimation of the distribution of \(\beta _{i}\) do not rely on identically distributed error terms \(u_{i}\) and regressors \({{\textbf {w}}}_i\), as shown in Sects. 2 and 3. Heterogeneously generated errors can be allowed, which is important in many empirical applications. To the best of our knowledge, this is the first identification result in a linear random coefficient model without a strict IID setting.
The identification of the distribution of \(\beta _{i}\) is established in this paper based on the identification of the moments of \(\beta _{i}\), which coincides with the identification condition in Beran and Hall (1992) that the distribution of \(\beta _{i}\) is uniquely determined by its moments, which are assumed to exist up to an arbitrary order. Since under our setup the distribution of \(\beta _{i}\) is parametrically specified, the moments of \(\beta _{i}\) exist and can be derived explicitly. The parameters of the assumed categorical distribution can then be uniquely determined by a system of equations in terms of the moments, as in Theorem 2. The parameters of the categorical distribution are then estimated consistently by the generalized method of moments (GMM). The estimation procedure based on moment conditions is similar in spirit to that of Ahn et al. (2001, 2013), in which Peter Schmidt and coauthors study panel data models with interactive effects where they allow for the time effects to vary across individual units. Compared to alternative nonparametric random coefficient models, the standard GMM estimation is easy to implement, and the identified categorical structure has a clear economic interpretation.
Using Monte Carlo (MC) simulations, we find that moments of the random coefficients can be estimated reasonably accurately, but large samples are required for estimation of the parameters of the underlying categorical distributions. Our theoretical and MC results also suggest that our method is suitable when the number of heterogeneous coefficients and the number of categories are small (2 or 3). As the number of categories rises, the burden on identification from the moments to the parameters of the categorical distribution also rises rapidly. The quality of identification also deteriorates as we need to rely on higher and higher moments to identify a larger number of categories, since the information content of the moments tends to decline with their order.
The proposed method is also illustrated by providing estimates of the distribution of returns to education in the USA by gender and educational levels, using the May and Outgoing Rotation Group (ORG) supplements of the Current Population Survey (CPS) data. Comparing the estimates obtained over the subperiods 1973–1975 and 2001–2003, we find that rising between-group heterogeneity is largely due to rising returns to education in the case of individuals with postsecondary education, while within-group heterogeneity has been rising in the case of individuals with high school or less education.
Related Literature This paper draws mainly upon the literature on random coefficient models. As already mentioned, the main body of the recent literature is focused on nonparametric identification and estimation. Following Beran and Hall (1992), Beran (1993) and Beran and Millar (1994) extend the model to a linear semiparametric model with a multivariate setup and propose a minimum distance estimator for the unknown distribution. Foster and Hahn (2000) extend the identification results in Beran and Hall (1992) and apply the minimum distance estimator to gasoline consumption data to estimate the consumer surplus function. Beran et al. (1996) and Hoderlein et al. (2010) propose kernel density estimators based on the inverse Radon transformation in linear models.
In addition to linear models, Ichimura and Thompson (1998) and Gautier and Kitamura (2013) incorporate random coefficients in binary choice models. Gautier and Hoderlein (2015) and Hoderlein et al. (2017) consider triangular models with random coefficients allowing for causal inference. Matzkin (2012) and Masten (2018) discuss the identification of random coefficients in simultaneous equation models. Breunig and Hoderlein (2018) propose a general specification test in a variety of random coefficient models. Random coefficients are also widely studied in panel data models, for example Hsiao and Pesaran (2008) and Arellano and Bonhomme (2012).
The rest of the paper is organized as follows: Sect. 2 establishes the main identification results. The GMM estimation procedure is proposed and discussed in Sect. 3. An extension to a multivariate setting is considered in Sect. 4. Small sample properties of the proposed estimator are investigated in Sect. 5, using Monte Carlo techniques under different regressor and error distributions. Section 6 presents and discusses our empirical application to the return to education. Section 7 provides some concluding remarks and suggestions for future work. Technical proofs are given in “Appendix A.1.”
Notations Largest and smallest eigenvalues of the \(p\times p\) matrix \({{\textbf {A}}}=\left( a_{ij}\right) \) are denoted by \(\lambda _{\max }\left( {{\textbf {A}}}\right) \) and \(\lambda _{\min }\left( {{\textbf {A}}}\right) \), respectively, and its spectral norm by \(\left\Vert {{\textbf {A}}}\right\Vert =\lambda _{\max }^{1/2}\left( {{\textbf {A}}}^{\prime }{{\textbf {A}}}\right) \). \({{\textbf {A}}}\succ 0\) means that \({{\textbf {A}}}\) is positive definite, \(\text {vech}\left( {{\textbf {A}}}\right) \) denotes the vectorization of the distinct elements of \({{\textbf {A}}}\), and \({{\textbf {0}}}\) denotes a zero matrix (or vector). For \({{\textbf {a}}}\in {\mathbb {R}}^{p}\), \(\textrm{diag}\left( {{\textbf {a}}}\right) \) represents the diagonal matrix with diagonal elements \(a_{1},a_{2},\ldots ,a_{p}\). For random variables (or vectors) u and v, \(u\perp v\) denotes that u is independent of v. We use c (C) to denote a small (large) positive constant. For a differentiable real-valued function \(f\left( \varvec{\theta }\right) \), \(\nabla _{\varvec{\theta }}f\left( \varvec{\theta }\right) \) denotes its gradient vector. The operator \(\rightarrow _{p}\) denotes convergence in probability, and \(\rightarrow _{d}\) convergence in distribution. The symbols O(1) and \(O_{p}(1)\) denote asymptotically bounded deterministic and random sequences, respectively.
2 Categorical random coefficient model
We suppose that the single cross-section of observations, \(\left\{ y_{i},x_{i}, {{\textbf {z}}}_{i}\right\} _{i=1}^{n}\), follows the categorical random coefficient model
where \(y_{i},x_{i}\in {\mathbb {R}}\), \({{\textbf {z}}}_{i}\in {\mathbb {R}}^{p_z},\) and \(\beta _{i}\in \left\{ b_{1},b_{2},\ldots ,b_{K}\right\} \) admits the following K-categorical distribution,
w.p. denotes “with probability,” \(\pi _{k}\in \left( 0,1\right) \), \( \sum _{k=1}^{K}\pi _{k}=1\), \(b_{1}<b_{2}<\cdots <b_{K}\), \(\varvec{\gamma }\in {\mathbb {R}}^{p_z}\) is homogeneous and \({{\textbf {z}}}_{i}\) could include an intercept term as its first element. It is assumed that \(\beta _{i}\perp {{\textbf {w}}}_i = \left( x_{i},{{\textbf {z}}}_{i}^{\prime }\right) ^{\prime }\), and the idiosyncratic errors \(u_{i}\) are independently distributed with mean 0.
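For concreteness, the data-generating process in (2.1)–(2.2) can be sketched in a few lines of Python. This is a minimal illustration only: the support points, probabilities, error scale and the choice of a single intercept as \({{\textbf {z}}}_i\) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, b=(1.0, 2.0), pi=(0.3, 0.7), gamma=(0.5,)):
    """Draw a sample from y_i = beta_i * x_i + z_i' gamma + u_i, where beta_i
    follows a K-categorical distribution independent of (x_i, z_i)."""
    x = rng.normal(size=n)
    z = np.ones((n, 1))                    # intercept as the only z-regressor
    beta = rng.choice(b, size=n, p=pi)     # categorical random coefficient
    u = rng.normal(scale=0.5, size=n)      # independent errors with mean zero
    y = beta * x + z @ np.array(gamma) + u
    return y, x, z, beta

y, x, z, beta = simulate(100_000)
```

With these hypothetical values, \(E(\beta_i)=0.3\times 1+0.7\times 2=1.7\), which the sample mean of the drawn coefficients reproduces closely.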
Remark 1
The model can be extended to allow \({{\textbf {x}}}_{i},\varvec{\beta }_{i}\in {\mathbb {R}}^{p}\), with \(\varvec{\beta }_{i}\) following a multivariate categorical distribution, though with more complicated notations. We will consider possible extensions in Sect. 4.
Remark 2
Since we consider a pure crosssectional setting, the key assumption that \( \beta _{i}\) and \(x_{i}\) are independently distributed cannot be relaxed. Allowing \(\beta _{i}\) to vary with \(\textbf{w}_{i}\), without any further restrictions, is tantamount to assuming \(y_{i}\) is a general function of \(\textbf{w}_{i}\), in effect rendering a nonparametric specification.
Remark 3
The number of categories, K, is assumed to be fixed and known. Conditions \( \sum _{k=1}^{K}\pi _{k}=1\), \(b_{1}<b_{2}<\cdots <b_{K},\) and \(\pi _{k}\in \left( 0,1\right) \) together are sufficient for the existence of K categories. For example, if \(b_{k}=b_{k^{\prime }}\), then we can merge categories k and \(k^{\prime }\), and the number of categories reduces to \(K-1\). Similarly, if \(\pi _{k}=0\) for some k, then category k can be deleted, and the number of categories is again reduced to \(K-1\). Information criteria can be used to determine K, but this will not be pursued in this paper. Model specification tests could also be considered. See, for example, Andrews (2001) and Breunig and Hoderlein (2018).
In the rest of this section, we focus on the model (2.1) and establish the conditions under which the distribution of \(\beta _{i}\) is identified.
2.1 Identifying the moments of \(\beta _{i}\)
Assumption 1

(a)
(i) \(u_{i}\) is distributed independently of \({{\textbf {w}}} _{i}=\left( x_{i},{{\textbf {z}}}_{i}^{\prime }\right) ^{\prime }\) and \(\beta _{i} \). (ii) \(\sup _{i}E\left( \left| u_{i}\right| ^{r}\right) <C\), \(r=1,2,\ldots ,2K-1\). (iii) \(n^{-1}\sum _{i=1}^{n} u_i^4 = O_p(1) \).

(b)
(i) Let \({{\textbf {Q}}}_{n,ww}=n^{-1}\sum _{i=1}^{n}{{\textbf {w}}}_{i} {{\textbf {w}}}_{i}^{\prime }\), and \({{\textbf {q}}}_{n,wy}=n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_{i}y_{i}\). Then \(\left\Vert E\left( {{\textbf {Q}}} _{n,ww}\right) \right\Vert<C<\infty \), \(\left\Vert E\left( {{\textbf {q}}}_{n,wy}\right) \right\Vert<C<\infty \), and there exists \(n_{0}\in {\mathbb {N}}\) such that for all \(n\ge n_{0}\),
$$\begin{aligned} 0<c<\lambda _{\min }\left( {{\textbf {Q}}}_{n,ww}\right)<\lambda _{\max }\left( {{\textbf {Q}}}_{n,ww}\right)<C<\infty . \end{aligned}$$(ii) \(\sup _{i} E\left( \left\Vert {{\textbf {w}}}_{i}\right\Vert ^{r}\right)<C<\infty \), \(r=1,2,\ldots ,4K-2\). (iii) \(n^{-1} \sum _{i=1}^{n} \left\Vert {{\textbf {w}}}_{i} \right\Vert ^{4} = O_p(1)\).

(c)
\(\left\Vert {{\textbf {Q}}}_{n,ww}-E \left( {{\textbf {Q}}} _{n,ww}\right) \right\Vert =O_p\left( n^{-1/2}\right) \), \(\left\Vert {{\textbf {q}}}_{n,wy}-E \left( {{\textbf {q}}}_{n,wy}\right) \right\Vert =O_p\left( n^{-1/2}\right) \), and
$$\begin{aligned} E \left( {{\textbf {Q}}}_{n,ww}\right) =n^{-1}\sum _{i=1}^{n}E \left( {{\textbf {w}}}_{i}{{\textbf {w}}}_{i}^{\prime }\right) \succ 0. \end{aligned}$$ 
(d)
\(\left\Vert E \left( {{\textbf {Q}}}_{n,ww}\right) -{{\textbf {Q}}} _{ww}\right\Vert =O\left( n^{-1/2}\right) \), \(\left\Vert E \left( {{\textbf {q}}}_{n,wy}\right) -{{\textbf {q}}}_{wy}\right\Vert =O\left( n^{-1/2}\right) \), where \({{\textbf {q}}}_{wy} = \lim \limits _{n \rightarrow \infty } E \left( {{\textbf {q}}}_{n, wy} \right) \), \({{\textbf {Q}}}_{ww} = \lim \limits _{n \rightarrow \infty } E \left( {{\textbf {Q}}}_{n, ww} \right) \), and \({{\textbf {Q}}}_{ww}\succ 0\).
Remark 4
Part (a) of Assumption 1 relaxes the assumption that \(u_{i}\) is identically distributed, and allows for heterogeneously generated errors. For identification of the distribution of \(\beta _{i}\), we require \(u_{i}\) to be distributed independently of \( {{\textbf {w}}}_{i}\) and \(\beta _{i}\), which rules out conditional heteroskedasticity. However, estimation and inference involving \(E \left( \beta _{i}\right) \) and \(\varvec{\gamma }\) can be carried out in the presence of conditional error heteroskedasticity, as shown in Theorem 3. Parts (c) and (d) of Assumption 1 relax the condition that \({{\textbf {w}}}_{i}\) is identically distributed across i. As we proceed, only \(\beta _{i}\), whose distribution is of interest, is assumed to be IID across i; \({{\textbf {w}}}_{i}\) and \(u_{i}\) are not required to be identically distributed over i.
Remark 5
The high-level conditions in Assumption 1, concerning the convergence in probability of averages such as \({{\textbf {Q}}}_{n,ww}=n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_{i}{{\textbf {w}}}_{i}^{\prime }\), can be verified under weak cross-sectional dependence. Let \(f_{i}=f\left( {{\textbf {w}}}_{i},\beta _{i},u_{i}\right) \) be a generic function of \({{\textbf {w}}}_{i}\), \(\beta _{i}\) and \(u_{i}\).^{Footnote 1} Assume that \(\sup _{i}E\left( f_{i}^{2}\right) <C\), and \(\sup _{j}\sum _{i=1}^{n}\left| \textrm{cov} \left( f_{i},f_{j}\right) \right| <C\), for some fixed \(C<\infty \). Then,
By Chebyshev’s inequality, for any \(\varepsilon >0\), there exists \(M_{\varepsilon }>\sqrt{C/\varepsilon }\) such that
i.e., \(n^{-1}\sum _{i=1}^{n}\left[ f_{i}-E\left( f_{i}\right) \right] =O_{p}\left( n^{-1/2}\right) \).
Denote \(\varvec{\phi }_{i}=\left( \beta _{i},\varvec{\gamma }^{\prime }\right) ^{\prime }\) and \(\varvec{\phi }=E\left( \varvec{\phi } _{i}\right) =\left( E\left( \beta _{i}\right) ,\varvec{\gamma } ^{\prime }\right) ^{\prime }\). Consider the moment condition,
and sum (2.3) over i
Let \(n\rightarrow \infty \), then \(\varvec{\phi }\) is identified by
under Assumption 1.
Assumption 2
Let \({\tilde{y}}_{i}=y_{i}{{\textbf {z}}}_{i}^{\prime } \varvec{\gamma }\).

(a)
\(\left| n^{-1}\sum _{i=1}^{n}E\left( {\tilde{y}} _{i}^{r}x_{i}^{s}\right) -\rho _{r,s}\right| =O\left( n^{-1/2}\right) ,\) and \(\left| \rho _{r,s}\right| <\infty ,\) for \(r,s=0,1,\ldots ,2K-1\).

(b)
\(\left| n^{-1}\sum _{i=1}^{n}E\left( u_{i}^{r}\right) -\sigma _{r}\right| =O\left( n^{-1/2}\right) ,\) and \( \left| \sigma _{r}\right| <\infty \), for \(r=2,3,\ldots ,2K-1\).

(c)
\(n^{-1}\sum _{i=1}^{n}\left[ \textrm{var}(x_{i}^{r})-\left( \rho _{0,2r}-\rho _{0,r}^{2}\right) \right] =O\left( n^{-1/2}\right) \), where \( \rho _{0,2r}-\rho _{0,r}^{2}>0,\) for \(r=2,3,\ldots ,2K-1\).
Remark 6
The above assumption allows for a limited degree of heterogeneity of the moments. As an example, let \(E\left( u_{i}^{r}\right) =\sigma _{ir}\) and denote the heterogeneity of the \(r^{th}\) moment of \(u_{i}\) by \(e_{ir}=\sigma _{ir}-\sigma _{r}\). Then
and condition (b) of Assumption 2 is met if \(\sum _{i=1}^{n}\left| e_{ir}\right| =O(n^{\alpha _{r}})\) with \(\alpha _{r}<1/2\). Here \(\alpha _{r}\) measures the degree of heterogeneity, with \(\alpha _{r}=1\) representing the highest degree of heterogeneity. A similar idea is used by Pesaran and Zhou (2018) in their analysis of poolability in panel data models.
Theorem 1
Under Assumptions 1 and 2, \( E\left( \beta _{i}^{r}\right) \) and \(\sigma _{r}\), \(r=2,3,\ldots ,2K-1\), are identified.
Proof
For \(r=2,\ldots ,2K-1\),
where \(\left( {\begin{array}{c}r\\ q\end{array}}\right) =\frac{r!}{q!(r-q)!}\) are binomial coefficients, for nonnegative integers \(q\le r\).
Sum over i, then by parts (a) and (b) of Assumption 2,
Derivation details are relegated to “Appendix A.1.” By part (c) of Assumption 2, the matrix \( \begin{pmatrix} \rho _{0,r} &{} 1 \\ \rho _{0,2r} &{} \rho _{0,r} \end{pmatrix} \) is invertible for \(r=2,3,\ldots ,2K-1\). As a result, we can sequentially solve (2.8) and (2.9) for \(E\left( \beta _{i}^{r}\right) \) and \(\sigma _{r}\), for \(r=2,3,\ldots ,2K-1\). \(\square \)
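The sequential solution in the proof of Theorem 1 can be illustrated numerically. The sketch below is a stylized check using exact population moments, under hypothetical assumptions: \(x_i\sim N(0,1)\), \(u_i\sim N(0,s^2)\), a two-point \(\beta_i\), and the convention that at each step r the two equations pair \(E(\tilde{y}_i^r)\) with \(E(\tilde{y}_i^r x_i^r)\), so that the unknowns \(E(\beta_i^r)\) and \(\sigma_r\) are recovered by inverting the \(2\times 2\) matrix displayed above.

```python
import numpy as np
from math import comb

def dfact(m):
    # double factorial, with dfact(m) = 1 for m <= 0
    return 1 if m <= 0 else m * dfact(m - 2)

# Hypothetical population: x ~ N(0,1), u ~ N(0, s^2), beta on {1, 2}
b, pi, s = np.array([1.0, 2.0]), np.array([0.3, 0.7]), 0.5
K = 2
rho = lambda m: 0.0 if m % 2 else float(dfact(m - 1))   # rho_{0,m} = E(x^m)
sig = lambda r: 0.0 if r % 2 else s**r * dfact(r - 1)   # sigma_r = E(u^r)
mb  = lambda q: float(pi @ b**q)                        # true E(beta^q)

def Ey(r, sr):
    # E(ytilde^r x^s), ytilde = beta*x + u, via the binomial expansion
    return sum(comb(r, q) * mb(q) * rho(q + sr) * sig(r - q) for q in range(r + 1))

# Step r = 1: E(ytilde * x) = E(beta) E(x^2), so E(beta) is recovered first
m_hat = {0: 1.0, 1: Ey(1, 1) / rho(2)}
s_hat = {0: 1.0, 1: 0.0}

# Steps r = 2, ..., 2K-1: solve a 2x2 system for E(beta^r) and sigma_r,
# subtracting the terms already known from previous steps
for r in range(2, 2 * K):
    known0 = sum(comb(r, q) * m_hat[q] * rho(q) * s_hat[r - q] for q in range(1, r))
    known1 = sum(comb(r, q) * m_hat[q] * rho(q + r) * s_hat[r - q] for q in range(1, r))
    A = np.array([[rho(r), 1.0], [rho(2 * r), rho(r)]])
    m_hat[r], s_hat[r] = np.linalg.solve(A, [Ey(r, 0) - known0, Ey(r, r) - known1])
```

The recovered values match the true moments, e.g. \(E(\beta_i^2)=0.3\times 1+0.7\times 4=3.1\) and \(\sigma_2=s^2=0.25\).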
2.2 Identifying the distribution of \(\beta _{i}\)
Beran and Hall (1992, Theorem 2.1, p. 1972) prove the identification of the distribution of the random coefficient, \(\beta _{i}\), in a canonical model without covariates, \(z_{i}\), under the condition that the distribution of \(\beta _{i}\) is uniquely determined by its moments. We show that the identification of the moments of \(\beta _i\) holds more generally, when \(x_i\) and \( u_i\) are not identically distributed, and that the distribution of \(\beta _i\) is identified if it follows a categorical distribution. Note that under (2.2),
with \(E\left( \beta _{i}^{r}\right) \) identified under Assumption 1. To identify \(\varvec{\pi } =\left( \pi _{1},\pi _{2},\ldots ,\pi _{K}\right) ^{\prime }\) and \({{\textbf {b}}} =\left( b_{1},b_{2},\ldots ,b_{K}\right) ^{\prime }\), we need to verify that the system of 2K equations in (2.10) has a unique solution if \( b_{1}<b_{2}<\cdots <b_{K}\), and \(\pi _{k}\in \left( 0,1\right) \). In the proof, we construct a linear recurrence relation and make use of the corresponding characteristic polynomial.
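The role of the characteristic polynomial can be previewed with a small numerical check: for a categorical \(\beta_i\) with K support points, the moments satisfy a K-term linear recurrence whose coefficients are those of \(\prod_{k=1}^{K}(z-b_k)\), since every support point is a root of that polynomial. The support points and probabilities below are hypothetical, with \(K=3\).

```python
import numpy as np

# Hypothetical 3-category distribution
b = np.array([0.5, 1.0, 2.0])
pi = np.array([0.2, 0.5, 0.3])

m = lambda r: float(pi @ b**r)      # E(beta^r)

# Characteristic polynomial prod_k (z - b_k) = z^3 + c2 z^2 + c1 z + c0
_, c2, c1, c0 = np.poly(b)

# Each b_k satisfies b^3 = -(c2 b^2 + c1 b + c0); taking expectations of
# beta^r * beta^3 gives m_{r+3} = -(c2 m_{r+2} + c1 m_{r+1} + c0 m_r).
for r in range(4):
    assert np.isclose(m(r + 3), -(c2 * m(r + 2) + c1 * m(r + 1) + c0 * m(r)))
```

Inverting this recurrence, given the moments, is exactly what delivers the support points and probabilities in Theorem 2.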
Theorem 2
Consider the random coefficient regression model (2.1), suppose that Assumptions 1 and 2 hold. Then \(\varvec{\theta }=\left( \varvec{\pi }^{\prime },{{\textbf {b}}}^{\prime }\right) ^{\prime }\) is identified subject to \(b_{1}<b_{2}<\cdots <b_{K}\) and \(\pi _{k}\in \left( 0,1\right) \), for all \(k=1,2,\ldots ,K\).
Proof
We motivate the key idea of the proof in the special case where \(K=2,\) and relegate the proof of the general case to the “Appendix A.1.” Let \(b_{1}=\beta _{L}\), \(b_{2}=\beta _{H}\), \(\pi _{1}=\pi \) and \(\pi _{2}=1\pi \). Note that
and \(E\left( \beta _{i}^{k}\right) \), \(k=1,2,3\), are identified. \( \left( \pi ,\beta _{L},\beta _{H}\right) \) can be identified if the system of Eqs. (2.11)–(2.13) has a unique solution. By (2.11),
Plug (2.14) into (2.12) and (2.13),
Denote \(\beta _{L+H} = \beta _{L} + \beta _{H}\) and \(\beta _{LH} = \beta _{L}\beta _{H}\), and write (2.15) and (2.16) in matrix form,
where
Under the conditions \(0< \pi < 1\) and \(\beta _H > \beta _L\),
As a result, we can solve (2.17) for \(\beta _{L+H}\) and \( \beta _{LH}\) as
\(\beta _{L}\) and \(\beta _{H}\) are solutions to the quadratic equation,
We can verify that \(\Delta =\beta _{L+H}^{2}-4\beta _{LH}>0\) by direct calculation using (2.18) and (2.19). Simplifying \(\Delta \) in terms of \(E\left( \beta _{i}^{k}\right) \) and then plugging in (2.11), (2.12) and (2.13),
Then, we obtain the unique solutions,
and \(\pi \) can be determined by (2.14) correspondingly. \(\square \)
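The steps of the proof for \(K=2\) translate directly into a short computation: solve a linear system for \(\beta_{L+H}\) and \(\beta_{LH}\), take the roots of the implied quadratic, and back out \(\pi\) from the first moment. A sketch with hypothetical parameter values (the linear system below uses the two-point recurrence \(m_{r+1}=\beta_{L+H}\,m_r-\beta_{LH}\,m_{r-1}\)):

```python
import numpy as np

# True parameters of a hypothetical two-point distribution
pi, bL, bH = 0.3, 1.0, 2.0
m1 = pi * bL + (1 - pi) * bH          # E(beta)
m2 = pi * bL**2 + (1 - pi) * bH**2    # E(beta^2)
m3 = pi * bL**3 + (1 - pi) * bH**3    # E(beta^3)

# Two-point moments obey m_{r+1} = s*m_r - p*m_{r-1}, with s = bL + bH and
# p = bL*bH (characteristic polynomial z^2 - s z + p), so:
#   m2 = s*m1 - p,   m3 = s*m2 - p*m1
s, p = np.linalg.solve([[m1, -1.0], [m2, -m1]], [m2, m3])

# bL and bH are the roots of z^2 - s z + p = 0; Delta > 0 when var(beta) > 0
delta = np.sqrt(s**2 - 4 * p)
bL_hat, bH_hat = (s - delta) / 2, (s + delta) / 2
pi_hat = (bH_hat - m1) / (bH_hat - bL_hat)   # from m1 = pi*bL + (1-pi)*bH
```

The determinant of the linear system is \(m_2-m_1^2=\textrm{var}(\beta_i)>0\), which is exactly the non-degeneracy condition discussed below.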
Remark 7
The key identifying assumption in (2.2) is the assumed existence of the strict ordinal relation \(b_{1}<b_{2}<\cdots <b_{K}\), so that \(b_{k}\) and \(b_{k^{\prime }}\) are distinct for \(k\ne k^{\prime }\), and \(0<\pi _{k}<1\), so that the distribution of \(\beta _{i}\) does not degenerate. When \(K=2\), the conditions \(b_{1}<b_{2}<\cdots <b_{K}\), and \(\pi _{k}\in \left( 0,1\right) \), are equivalent to \(\textrm{var}\left( \beta _{i}\right) =\pi _{1}\left( 1-\pi _{1}\right) \left( b_{2}-b_{1}\right) ^{2}>0\). In other words, not surprisingly, the categorical distribution of \(\beta _{i}\) is identified only if \(\textrm{var} \left( \beta _{i}\right) >0\).
In practice, a test for \({\mathbb {H}}_{0}:\textrm{var}\left( \beta _{i}\right) =0\) is possible, by noting that \(\textrm{var}\left( \beta _{i}\right) =0\) is equivalent to
where \(\kappa ^{2}\) is well defined as long as \(\beta _{i}\not \equiv 0\). One important advantage of basing the test of slope homogeneity on \(\kappa ^{2}\) rather than on \(\textrm{var}\left( \beta _{i}\right) =0\) is that \(\kappa ^{2}\) is scale-invariant. \(E\left( \beta _{i}\right) \) and \(E\left( \beta _{i}^{2}\right) \) are identified as in Sect. 2.1, and their consistent estimation does not require \(\textrm{var}\left( \beta _{i}\right) >0\). Consequently, in principle it is possible to test slope homogeneity by testing \({\mathbb {H}}_{0}:\kappa ^{2}=1\). However, the problem becomes much more complicated when there are more than two categories and/or more than one regressor under consideration. A full treatment of testing slope homogeneity in such general settings is beyond the scope of the present paper.
Remark 8
Note that in the special case of the proof of Theorem 2 where \(K=2\), \(\beta _{L+H}=\beta _{L}+ \beta _{H}\) and \(\beta _{LH}=\beta _{L}\beta _{H}\) corresponds to \( b_{1}^{*}\) and \(b_{2}^{*}\), and (2.17) is the same as (A.1.6) when \(K = 2\). This special case illustrates the procedure of identification: identify \(\left( b_{k}^{*} \right) _{k=1}^{K}\) by the moments of \(\beta _{i}\), then solve for \( \left( b_{k}\right) _{k=1}^{K}\) and finally identify \(\left( \pi _{k} \right) _{k=1}^{K}\).
3 Estimation
In this section, we propose a generalized method of moments estimator for the distributional parameters of \(\beta _i\). To reduce the complexity of the moment equations, we first obtain a \(\sqrt{n}\)consistent estimator of \(\varvec{\gamma }\) and consider the estimation of the distribution of \(\beta _i\) by replacing \(\varvec{\gamma }\) by \({{\hat{\varvec{\gamma }}}}\).
3.1 Estimation of \(\varvec{\gamma }\)
Let \(\varvec{\phi }=\left( E\left( \beta _{i}\right) ,{\varvec{{\gamma }}}^{\prime }\right) ^{\prime }\), \(v_{i}=\beta _{i}-E\left( \beta _{i}\right) \), and, using the notation in Assumption 1, (2.1) can be written as
where \(\xi _{i}=u_{i}+x_{i}v_{i}\). Then, \(\varvec{\phi }\) can be estimated consistently by \(\hat{\varvec{\phi }}={{\textbf {Q}}}_{n,ww}^{-1}{{\textbf {q}}} _{n,wy} \), where \({{\textbf {Q}}}_{n,ww}\) and \({{\textbf {q}}}_{n,wy}\) are defined in Assumption 1.
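This first step is ordinary least squares of \(y_i\) on \({{\textbf {w}}}_i=(x_i,{{\textbf {z}}}_i^{\prime})^{\prime}\), which recovers \(E(\beta_i)\) and \(\varvec{\gamma}\). A minimal simulation sketch (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
z = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta = rng.choice([1.0, 2.0], size=n, p=[0.3, 0.7])    # E(beta) = 1.7
gamma = np.array([0.5, -1.0])
u = rng.normal(scale=0.5, size=n)
y = beta * x + z @ gamma + u

w = np.column_stack([x, z])                            # w_i = (x_i, z_i')'
phi_hat, *_ = np.linalg.lstsq(w, y, rcond=None)        # = Q_{n,ww}^{-1} q_{n,wy}
```

The composite error \(\xi_i=u_i+x_iv_i\) is heteroskedastic by construction, but \(\hat{\varvec{\phi}}\) remains consistent for \((E(\beta_i),\varvec{\gamma}^{\prime})^{\prime}=(1.7,0.5,-1.0)^{\prime}\) here.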
Assumption 3
\(\left\Vert n^{-1}\sum _{i=1}^{n}E \left( {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime \xi _i^2\right) - {{\textbf {V}}}_{w\xi } \right\Vert = O\left( n^{-1/2} \right) \), \({{\textbf {V}}}_{w\xi }\succ 0, \) and
Remark 9
As in the case of Assumption 1, the high-level condition (3.2) can be shown to hold under weak cross-sectional dependence, assuming that elements of \({{\textbf {w}}}_{i} {{\textbf {w}}}_{i}^{\prime }\xi _{i}^{2}\) are cross-sectionally weakly correlated over i. See Remark 5.
Theorem 3
Under Assumption 1, \(\hat{\varvec{\phi }}\) is a consistent estimator for \(\varvec{\phi }\). In addition, under Assumptions 1 and 3, as \(n\rightarrow \infty \),
where \({{\textbf {V}}}_{\phi } = {{\textbf {Q}}}_{ww}^{-1} {{\textbf {V}}}_{w\xi } {{\textbf {Q}}}_{ww}^{-1}\). \({{\textbf {V}}}_{\phi }\) is consistently estimated by
as \(n\rightarrow \infty \), where \(\hat{{{\textbf {V}}}}_{w\xi } = n^{-1}\sum _{i=1}^{n} {{\textbf {w}}}_i{{\textbf {w}}}_i^\prime {\hat{\xi }}_{i}^2\), and \({\hat{\xi }}_i = y_i - {{\textbf {w}}}_i^\prime \hat{\varvec{\phi }}\).
The proof of Theorem 3 is provided in Sect. S.2 in the online supplement.
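The sandwich variance estimator of Theorem 3 amounts to a heteroskedasticity-robust covariance matrix for the least squares step. The sketch below is self-contained, with a hypothetical design; it simply implements \(\hat{{{\textbf {V}}}}_{\phi }={{\textbf {Q}}}_{n,ww}^{-1}\hat{{{\textbf {V}}}}_{w\xi }{{\textbf {Q}}}_{n,ww}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)
z = np.ones((n, 1))                               # intercept only (hypothetical)
beta = rng.choice([1.0, 2.0], size=n, p=[0.3, 0.7])
y = beta * x + 0.5 * z[:, 0] + rng.normal(scale=0.5, size=n)

w = np.column_stack([x, z])
Q = w.T @ w / n                                   # Q_{n,ww}
phi_hat = np.linalg.solve(Q, w.T @ y / n)         # hat(phi) = Q^{-1} q_{n,wy}
xi_hat = y - w @ phi_hat                          # residuals hat(xi)_i
V_wxi = (w * xi_hat[:, None]**2).T @ w / n        # n^{-1} sum w_i w_i' xi_i^2
Qinv = np.linalg.inv(Q)
V_phi = Qinv @ V_wxi @ Qinv                       # sandwich variance estimate
se = np.sqrt(np.diag(V_phi) / n)                  # standard errors of hat(phi)
```

Because \(\xi_i=u_i+x_iv_i\) is conditionally heteroskedastic whenever \(\textrm{var}(\beta_i)>0\), the robust form is needed; the plain homoskedastic formula would be invalid here.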
3.2 Estimation of the distribution of \(\beta _i\)
Denote the moments of \(\beta _{i}\) on the right-hand side of (2.10) by
and note that
so in general we can write \({{\textbf {m}}}_{\beta }\triangleq h\left( {\varvec{{\theta }}}\right) ,\) where \(\varvec{\theta }=\left( \varvec{\pi }^{\prime }, {{\textbf {b}}}^{\prime }\right) ^{\prime }\in \Theta \), and \(\varvec{\theta }\) can be uniquely determined in terms of \({{\textbf {m}}}_{\beta }\) by Theorem 2. To estimate \(\varvec{\theta }\), we consider moment conditions following a similar procedure as in Sect. 2 and propose a generalized method of moments (GMM) estimator.
We consider the following moment conditions:
and
where \(E\left( u_{i}\right) =0\), \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}} _{i}^{\prime }\varvec{\gamma }\), \(r=1,2,\ldots ,2K-1\), and \(s_{r}=0,1,\ldots ,S-r\), where S is a user-specified tuning parameter, chosen such that the highest-order moment of \(x_{i}\) included is at most S, with \(S>2K-1\).^{Footnote 2}
Let \(\sigma _{0}=1\) and \(\sigma _{1}=0\) so that \(\sigma _{r}\) is well defined for \(r=0,1,\ldots ,2K-1\). Sum (3.5) over i and rearrange terms,
where
as shown in the proof of Theorem 1.
Letting \(n\rightarrow \infty \) in (3.6),
by Assumption 2. We stack the left-hand side of (3.7) over \(r=1,2,\ldots ,2K-1\) and \(s_{r}=0,1,\ldots ,S-r\), and transform \({{\textbf {m}}}_\beta = h\left( \varvec{\theta } \right) \) to obtain \( {{\textbf {g}}}_0\left( \varvec{\theta }, \varvec{\sigma }, \varvec{\gamma } \right) \).
To implement the GMM estimation, we replace \({\tilde{y}}_{i}\) by \(\hat{{\tilde{y}}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime }\hat{\varvec{\gamma }}\), and \(\rho _{r,s_{r}}\) by \(n^{-1}\sum _{i=1}^{n}\hat{{\tilde{y}}}_{i}^{r}x_{i}^{s_{r}}\). Noting that \({{\textbf {m}}}_{\beta }=h\left( \varvec{\theta }\right) \), denote the sample version of the left-hand side of (3.7) by
where
and \(\varvec{\sigma }=\left( \sigma _{2},\sigma _{3},\ldots ,\sigma _{2K-1}\right) ^{\prime }\). Stacking the equations in (3.8) over \( r=0,1,\ldots ,2K-1\) and \(s_{r}=0,1,\ldots ,S-r\) (\(S>2K-1\)), in vector notation we have
Given \(\hat{\varvec{\gamma }}\), the GMM estimator of \(\left( \varvec{\theta } ^{\prime },\varvec{\sigma }^{\prime }\right) ^{\prime }\) is now computed as
where \({\hat{\Phi }}_{n}={\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta },{\varvec{{\sigma }}},\hat{\varvec{\gamma }}\right) ^{\prime }{{\textbf {A}}}_{n}{\hat{\varvec{{g}}}}_{n}\left( \varvec{\theta },\varvec{\sigma },\hat{\varvec{\gamma }}\right) \), and \({{\textbf {A}}}_{n}\) is a positive definite matrix. We follow the GMM literature using the following choice of \({{\textbf {A}}}_{n}\),
where \({\bar{{\varvec{g}}}}_{n}=\frac{1}{n}\sum _{i=1}^{n}{\hat{{\varvec{g}}}} _{i}\left( \tilde{\varvec{\theta }},\tilde{\varvec{\sigma }},\hat{{\varvec{{\gamma }}}}\right) \), and \(\tilde{\varvec{\theta }}\) and \(\tilde{{\varvec{{\sigma }}}}\) are preliminary estimators.
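To fix ideas, the sketch below constructs the stacked population moment conditions for \(K=2\) and \(S=4\) under a hypothetical design (standard normal \(x_i\), normal errors, \(\varvec{\gamma}\) concentrated out), and verifies that they vanish at the true parameters and are violated away from them. In practice the population expectations are replaced by sample averages and \({\hat{\Phi }}_{n}\) is minimized numerically; this is only a population-level illustration of the moment construction.

```python
import numpy as np
from math import comb

def dfact(m):
    # double factorial, with dfact(m) = 1 for m <= 0
    return 1 if m <= 0 else m * dfact(m - 2)

rho = lambda m: 0.0 if m % 2 else float(dfact(m - 1))   # E(x^m), x ~ N(0,1)
K, S = 2, 4
theta0 = (0.3, 1.0, 2.0)      # true (pi_1, b_1, b_2), hypothetical
sigma0 = (0.25, 0.0)          # true (sigma_2, sigma_3): u ~ N(0, 0.25)

def moments(theta):
    pi1, b1, b2 = theta
    return lambda q: pi1 * b1**q + (1 - pi1) * b2**q    # E(beta^q) under theta

def g(theta, sigma):
    """Stacked population moment conditions: E(ytilde^r x^s) generated at the
    true parameters, minus the value implied by the candidate (theta, sigma)."""
    m0, m = moments(theta0), moments(theta)
    sg0 = {0: 1.0, 1: 0.0, 2: sigma0[0], 3: sigma0[1]}
    sg = {0: 1.0, 1: 0.0, 2: sigma[0], 3: sigma[1]}
    out = []
    for r in range(1, 2 * K):
        for s in range(0, S - r + 1):
            true_mom = sum(comb(r, q) * m0(q) * rho(q + s) * sg0[r - q]
                           for q in range(r + 1))
            model_mom = sum(comb(r, q) * m(q) * rho(q + s) * sg[r - q]
                            for q in range(r + 1))
            out.append(true_mom - model_mom)
    return np.array(out)

A = np.eye(len(g(theta0, sigma0)))               # weight matrix (identity here)
Phi = lambda th, sg: g(th, sg) @ A @ g(th, sg)   # GMM objective g' A g

assert np.allclose(g(theta0, sigma0), 0.0)       # conditions hold at the truth
assert Phi((0.5, 1.0, 2.0), sigma0) > 0.0        # and fail away from it
```

With \(K=2\) and \(S=4\) this yields \(4+3+2=9\) moment conditions for the five unknowns \((\pi_1,b_1,b_2,\sigma_2,\sigma_3)\), so the system is overidentified.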
Assumption 4
Denote the true values of \(\varvec{\theta }\), \(\varvec{\sigma }\) and \(\varvec{\gamma }\) by \(\varvec{\theta }_{0}\), \( \varvec{\sigma }_0\) and \(\varvec{\gamma }_0 \).

(a)
\(\Theta \) and \(\mathcal {S}\) are compact. \(\varvec{\theta }_{0}\in \textrm{int}\left( \Theta \right) \) and \(\varvec{\sigma }_0 \in \textrm{int} \left( {\mathcal {S}} \right) \).

(b)
\({{\textbf {A}}}_{n}\rightarrow _{p}{{\textbf {A}}}\) as \(n\rightarrow \infty \), where \({{\textbf {A}}}\) is some positive definite matrix.

(c)
$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\left[ \hat{{\tilde{y}}}_{i}^{r}x_{i}^{s_{r}}-E\left( {\tilde{y}}_{i}^{r}x_{i}^{s_{r}}\right) \right] =O_{p}\left( n^{-1/2}\right) , \end{aligned}$$
for \(r=0,1,2,\ldots ,2K-1\), \(s_{r}=0,1,\ldots ,S-r,\) and \(S>2K-1.\)
Remark 10
Parts (a) and (b) of Assumption 4 are standard regularity conditions in the GMM literature. Part (c), together with Assumption 2, provides high-level regularity conditions which allow us to generalize the usual IID assumption and nest the IID data-generating process as a special case. The sample analog terms in (c) include \(\hat{{\tilde{y}}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime }\hat{\varvec{\gamma }}\), instead of the infeasible \({\tilde{y}}_{i}=y_{i}-{{\textbf {z}}}_{i}^{\prime }\varvec{\gamma }\). The \(\sqrt{n}\)-consistency of \(\hat{\varvec{\gamma }}\) shown in Theorem 3 ensures that replacing \({\tilde{y}}_{i}\) by \(\hat{{\tilde{y}}}_{i}\) does not alter the convergence rate.
Theorem 4
Let \(\varvec{\eta }=\left( \varvec{\theta }^{\prime }, \varvec{\sigma }^{\prime }\right) ^{\prime }\) and \({\varvec{{\eta }}}_{0}=\left( \varvec{\theta }_{0}^{\prime },\varvec{\sigma }_{0}^{\prime }\right) ^{\prime }\). Under Assumptions 1, 2, and 4, \(\hat{ \varvec{\eta }}\rightarrow _{p}\varvec{\eta }_{0}\) as \(n\rightarrow \infty \).
The proof of Theorem 4 is provided in “Appendix A.1.”
Assumption 5
Follow the notations as in Assumption 4 and in addition denote \({{\textbf {G}}}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) =\nabla _{\left( \varvec{\theta }^{\prime }, \varvec{\sigma }^{\prime }\right) ^{\prime }} {{\textbf {g}}}_{0}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma } \right) \), \({{\textbf {G}}}_{0}={{\textbf {G}}}\left( \varvec{\theta }_{0},{\varvec{{\sigma }}}_{0},\varvec{\gamma }_{0}\right) \), \({{\textbf {G}}}_{\gamma }\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) =\nabla _{\varvec{\gamma }}{{\textbf {g}}}_{0}\left( \varvec{\theta },\varvec{\sigma },\varvec{\gamma }\right) \), \({{\textbf {G}}}_{0, \gamma }={{\textbf {G}}}_{\gamma } \left( \varvec{\theta }_{0},\varvec{\sigma }_{0},\varvec{\gamma }_{0}\right) \).

(a)
\(\sqrt{n}{\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta }_{0},{\varvec{{\sigma }}}_{0},\varvec{\gamma }_0\right) \rightarrow _{d}\varvec{\zeta }\sim N\left( 0,{{\textbf {V}}}\right) \) as \(n\rightarrow \infty \).

(b)
\({{\textbf {G}}}_{0}^{\prime }{} {\textbf {AG}}_{0}\succ 0\).
Remark 11
In Assumption 5, part (a) is the high-level condition required to ensure the asymptotic normality of \({\hat{{\varvec{g}}}}_{n}\left( \varvec{\theta }_{0},\varvec{\sigma }_{0},\varvec{\gamma }_0\right) \), which can be verified by the Lindeberg central limit theorem under low-level regularity conditions. Part (b) of Assumption 5 represents the full-rank condition on \({{\textbf {G}}}_{0}\), required for identification of \(\varvec{\theta }_{0}\) and \(\varvec{\sigma }_{0}\).
By Theorem 3, we have \(\sqrt{n}\left( \hat{\varvec{\gamma }}-\varvec{\gamma }_{0}\right) \rightarrow _{d}\varvec{\zeta }_{\gamma }\sim N(0,{{\textbf {V}}}_{\gamma })\). The following theorem shows the asymptotic normality of the GMM estimator \(\hat{\varvec{\eta }}\).
Theorem 5
Under Assumptions 1, 3, 4 and 5,
as \(n\rightarrow \infty \).
The proof of Theorem 5 is provided in “Appendix A.1.”
Remark 12
In practice, we estimate the variance of the asymptotic distribution of \(\hat{\varvec{\eta }}\) by
where \(\hat{{{\textbf {G}}}} = \nabla _{\left( \varvec{\theta }^{\prime },\varvec{\sigma }^{\prime }\right) ^{\prime }} \hat{{{\textbf {g}}}}_{n}\left( \hat{\varvec{\theta }}, \hat{\varvec{\sigma }}, {\hat{\varvec{\gamma }}} \right) \), \(\hat{{{\textbf {A}}}}_n\) is given by (3.10), and
where
and \({{\textbf {L}}} = \begin{pmatrix} {{\textbf {0}}}_{p_z\times 1}&{{\textbf {I}}}_{p_z} \end{pmatrix} \) is the loading matrix that selects \(\varvec{\gamma }\) out of \(\varvec{\phi }\).
4 Multiple regressors with random coefficients
One important extension of the regression model (2.1) is to allow for multiple regressors with random coefficients that follow categorical distributions. With this in mind, consider
where the \(p\times 1\) vector of random coefficients, \(\varvec{\beta }_{i}\in {\mathbb {R}}^{p}\) follows the multivariate distribution^{Footnote 3}
with \(k_{j}\in \left\{ 1,2,\ldots ,K\right\} \), \(b_{j1}<b_{j2}<\cdots <b_{jK} \), and
As in Sect. 2, \(\varvec{\gamma }\in {\mathbb {R}} ^{p_{z}}\), \({{\textbf {w}}}_{i}=\left( {{\textbf {x}}}_{i}^{\prime },{{\textbf {z}}} _{i}^{\prime }\right) ^{\prime }\), \(\varvec{\beta }_{i}\perp {{\textbf {w}}}_{i}\), \(u_{i}\perp {{\textbf {w}}}_{i}\), and \(u_{i}\) are independently distributed over i with mean 0.
Example 1
Consider the simple case with \(p = 2\) and \(K = 2\). For \(j = 1, 2\), denote two categories as \(\left\{ L, H \right\} \). The probabilities of four possible combinations of realized \(\varvec{\beta } _i\) are summarized in Table 1, where \(\pi _{LL} + \pi _{LH} + \pi _{HL} + \pi _{HH} = 1\).
We first identify the moments of \(\varvec{\beta }_{i}\). As in Sect. 2, \(\varvec{\phi }=\left( E\left( \varvec{\beta } _{i}\right) ^{\prime },\varvec{\gamma }^{\prime }\right) ^{\prime }\) is identified by
under Assumption 1. We now consider the identification of the higher-order moments of \(\varvec{\beta }_{i}\) up to the finite order \(2K-1\).
Since \(\varvec{\gamma }\) is identified as in (4.3), we treat it as known and let \({\tilde{y}}_{i}^{r}=\left( y_{i}-{{\textbf {z}}}_{i}^{\prime }\varvec{\gamma }\right) ^{r}\). For \(r=2,3,\ldots ,2K-1\), consider the moment conditions
Note that \({{\textbf {x}}}_{i}^{\prime }\varvec{\beta }_{i}=\sum _{j=1}^{p}\beta _{ij}x_{ij}\), and
where \(\left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) =\frac{r!}{q_{1}!q_{2}!\cdots q_{p}!}\), for nonnegative integers r, \(q_{1}\), \(\ldots \), \(q_{p}\) with \(r=\sum _{j=1}^{p}q_{j}\), denotes the multinomial coefficient. We stack \(\prod _{j=1}^{p}x_{ij}^{q_{j}}\) with \({{\textbf {q}}}\in \left\{ {{\textbf {q}}}\in \left\{ 0,1,\ldots ,r\right\} ^{p}:\sum _{j=1}^{p}q_{j}=r\right\} \) in vector form by denoting^{Footnote 4}
where \(\varphi \left( {{\textbf {x}}}_{i},{{\textbf {q}}}\right) =\prod _{j=1}^{p}x_{ij}^{q_{j}}\) and \(\nu _{r}=\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) \) is the number of distinct monomials of degree r in the variables \(x_{i1},x_{i2},\ldots ,x_{ip}\). Similarly,
where \(\varphi \left( \varvec{\beta }_{i},{{\textbf {q}}}\right) =\prod _{j=1}^{p}\beta _{ij}^{q_{j}}\).
Example 2
With \(p = 2\) and \(r = 2\), we have
and
where \(\varvec{\Lambda }_2 = \textrm{diag}\left[ \left( 1, 2, 1 \right) ^\prime \right] \).
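As a quick numerical sanity check of this expansion, the following sketch (with illustrative values for \({{\textbf {x}}}_{i}\), \(\varvec{\beta }_{i}\) and r, not taken from the paper) verifies the multinomial identity \(\left( {{\textbf {x}}}_{i}^{\prime }\varvec{\beta }_{i}\right) ^{r}=\sum _{{{\textbf {q}}}}\left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) \prod _{j}x_{ij}^{q_{j}}\beta _{ij}^{q_{j}}\) and the monomial count \(\nu _{r}=\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) \):

```python
import math
from itertools import product

def multinomial(r, q):
    """Multinomial coefficient r!/(q_1! ... q_p!)."""
    out = math.factorial(r)
    for qj in q:
        out //= math.factorial(qj)
    return out

def expansion(x, beta, r):
    """Sum over q in {0,...,r}^p with |q| = r of
    multinomial(r, q) * prod_j x_j^{q_j} * prod_j beta_j^{q_j}."""
    qs = [q for q in product(range(r + 1), repeat=len(x)) if sum(q) == r]
    total = sum(multinomial(r, q)
                * math.prod(xj ** qj for xj, qj in zip(x, q))
                * math.prod(bj ** qj for bj, qj in zip(beta, q))
                for q in qs)
    return total, len(qs)

x, beta, r = (0.7, -1.3), (0.5, 2.0), 3   # illustrative values
lhs = sum(xj * bj for xj, bj in zip(x, beta)) ** r
rhs, n_monomials = expansion(x, beta, r)
```

The same enumeration order of \({{\textbf {q}}}\) can be reused to build \(\varvec{\tau }_{r}(\cdot )\) and \(\varvec{\Lambda }_{r}\) consistently.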
Then, the moment condition (4.4) can be written as
where \(\varvec{\Lambda }_{r}=\textrm{diag}\left[ \left[ \left( {\begin{array}{c}r\\ {{\textbf {q}}}\end{array}}\right) \right] _{\sum _{j=1}^{p}q_{j}=r}\right] \) is the \(\nu _{r}\times \nu _{r}\) diagonal matrix of multinomial coefficients. We further consider the moment conditions
\(r=2,3,\ldots ,2K-1\). Equations (4.5) and (4.6) reduce to (2.6) and (2.7) when \(p=1\).
Assumption 6

(a)
\(\left\| n^{-1}\sum _{i=1}^{n}E\left( {\tilde{y}}_{i}^{r}\varvec{\tau }_{s}\left( {{\textbf {x}}}_{i}\right) \right) -\varvec{\rho }_{r,s}\right\| =O\left( n^{-1/2}\right) \), and \(\left\| \varvec{\rho }_{r,s}\right\| <\infty \), \(r,s=0,1,\ldots ,2K-1\).

(b)
\(\left\| n^{-1}\sum _{i=1}^{n}E\left[ \varvec{\tau }_{r}\left( {{\textbf {x}}}_{i}\right) \varvec{\tau }_{s}\left( {{\textbf {x}}}_{i}\right) ^{\prime }\right] -\varvec{\Xi }_{r,s}\right\| =O\left( n^{-1/2}\right) \), and \(\left\| \varvec{\Xi }_{r,s}\right\| <\infty \), \(r,s=0,1,\ldots ,2K-1\).

(c)
\(\left| n^{-1}\sum _{i=1}^{n}E\left( u_{i}^{r}\right) -\sigma _{r}\right| =O\left( n^{-1/2}\right) \), and \(\left| \sigma _{r}\right| <\infty \) for \(r=2,3,\ldots ,2K-1\).

(d)
\(\left\| n^{-1}\sum _{i=1}^{n}\left[ \textrm{var}\left( \varvec{\tau }_{r}\left( {{\textbf {x}}}_{i}\right) \right) -\left( \varvec{\Xi }_{r,r}-\varvec{\rho }_{0,r}\varvec{\rho }_{0,r}^{\prime }\right) \right] \right\| =O\left( n^{-1/2}\right) \), where \(\varvec{\Xi }_{r,r}-\varvec{\rho }_{0,r}\varvec{\rho }_{0,r}^{\prime }\succ 0\) for \(r=2,3,\ldots ,2K-1\).
Theorem 6
For any \({{\textbf {q}}}\in \left\{ {{\textbf {q}}}\in \left\{ 0,1,\ldots ,r\right\} ^{p}:\sum _{j=1}^{p}q_{j}=r\right\} \) and \(r=2,3,\ldots ,2K-1\), \(E\left( \prod _{j=1}^{p}\beta _{ij}^{q_{j}}\right) \) and \(\sigma _{r}\) are identified under Assumptions 1 and 6.
Proof
For \(r=2,3,\ldots ,2K-1\), summing (4.5) and (4.6) over i and following the same steps as in the proof of Theorem 1, by Assumptions 6(a) to (c) we have (as \(n\rightarrow \infty \))
Note that
is invertible since \(\det \left( {{\textbf {M}}}_{r}\right) =\det \left( \varvec{\Xi }_{r,r}-\varvec{\rho }_{0,r}\varvec{\rho }_{0,r}^{\prime }\right) \det \left( \varvec{\Lambda }_{r}\right) >0\), for \(r=2,3,\ldots ,2K-1\), by Assumption 6(d). As a result, we can sequentially solve (4.7) and (4.8) for \(E\left[ \varvec{\tau }_{r}\left( \varvec{\beta }_{i}\right) \right] \) and \(\sigma _{r}\), for \(r=2,3,\ldots ,2K-1\). \(\square \)
We now move from the moments of \(\varvec{\beta }_{i}\) to the distribution of \(\varvec{\beta }_{i}\). We first focus on the identification of the marginal probabilities obtained from (4.2) by averaging out the effects of the other coefficients except for \(\beta _{ij}\), namely we initially focus on identification of \(\lambda _{jk}=\Pr \left( \beta _{ij}=b_{jk}\right) \), for \(k=1,2,\ldots ,K,\) and \(j=1,2,\ldots ,p\).
Remark 13
Focusing on the marginal distribution of \(\beta _{i}\) is similar to focusing on the estimation of partial derivatives in the context of nonparametric estimation, where the curse of dimensionality applies. Consider the regression of \(y_{i}\) on \({{\textbf {x}}}_{i}=\left( x_{i1},x_{i2},\ldots ,x_{ip}\right) ^{\prime }\),
If \(F\left( x_{i1},x_{i2},\ldots ,x_{ip}\right) \) is a homogeneous function of degree \(1/\mu \), then
and under certain conditions we can treat \(\mu \frac{\partial F\left( \cdot \right) }{\partial x_{ij}}\equiv \beta _{ij}\).
By Theorem 6, \(E\left( \beta _{ij}^{r}\right) \) is identified for \(r=1,2,\ldots ,2K-1\) under Assumptions 1 and 6. By (4.2), we have the equations
\(r=0,1,\ldots ,2K-1\), which is of the same form as (2.10) and (3.4). To identify \(\varvec{\lambda }_{j}=\left( \lambda _{j1},\lambda _{j2},\ldots ,\lambda _{jK}\right) ^{\prime }\) and \({{\textbf {b}}}_{j}=\left( b_{j1},b_{j2},\ldots ,b_{jK}\right) ^{\prime }\), we can verify that the system of 2K equations in (4.9) has a unique solution if \(b_{j1}<b_{j2}<\cdots <b_{jK}\) and \(\lambda _{jk}\in \left( 0,1\right) \). The following corollary is a direct application of Theorem 2.
Corollary 7
Consider the model (4.1) and suppose that Assumptions 1 and 6 hold. Then the parameters \(\varvec{\theta }_{j}=\left( \varvec{\lambda }_{j}^{\prime },{{\textbf {b}}}_{j}^{\prime }\right) ^{\prime }\) of the marginal distribution of \(\varvec{\beta }_{i}\) with respect to \(\beta _{ij}\) are identified subject to \(b_{j1}<b_{j2}<\cdots <b_{jK}\) and \(\lambda _{jk}\in \left( 0,1\right) \), for \(j=1,2,\ldots ,p\).
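For \(K=2\), the mapping from the first \(2K-1=3\) moments to the distributional parameters can be inverted in closed form. The sketch below is our own derivation from standard two-point moment algebra (it is not the paper's system (4.9), which is not reproduced here), with illustrative parameter values:

```python
import math

def two_point_from_moments(m1, m2, m3):
    """Recover (pi, b_L, b_H), with pi = Pr(beta_i = b_L) and b_L < b_H,
    from the raw moments m_r = E(beta_i^r); requires var(beta_i) > 0."""
    var = m2 - m1 ** 2                    # central second moment
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3  # central third moment
    sd = math.sqrt(var)
    skew = mu3 / sd ** 3                  # equals (2*pi - 1)/sqrt(pi*(1 - pi))
    pi = 0.5 + skew / (2 * math.sqrt(skew ** 2 + 4))
    b_L = m1 - sd * math.sqrt((1 - pi) / pi)
    b_H = m1 + sd * math.sqrt(pi / (1 - pi))
    return pi, b_L, b_H

# moments implied by pi = 0.3, b_L = 0.05, b_H = 0.10 (illustrative)
m1, m2, m3 = (0.3 * 0.05 ** r + 0.7 * 0.10 ** r for r in (1, 2, 3))
pi_hat, bL_hat, bH_hat = two_point_from_moments(m1, m2, m3)
```

The ordering restriction \(b_{L}<b_{H}\) pins down which support point receives probability \(\pi \), mirroring the role of \(b_{j1}<b_{j2}\) in the corollary.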
The problem of identification and estimation of the joint distribution of \(\varvec{\beta }_{i}\) is subject to the curse of dimensionality. We have \(K^{p}-1\) probability weights, \(\pi _{k_{1},k_{2},\ldots ,k_{p}}\), to be identified in addition to the pK categorical coefficients \(b_{jk}\) that are identified by Corollary 7. The number of parameters increases rapidly with p. Even in the simplest case with \(K=2\), the total number of unknown parameters is \(2p+2^{p}-1\), which grows exponentially.
Note that the marginal probabilities \(\lambda _{jk}\) are related to the joint distribution by
\(k=1,2,\ldots ,K\) and \(j=1,2,\ldots ,p\). The number of linearly independent equations in (4.10) is \(pK-(p-1)\).
Example 3
Consider the same setup as in Example 1 with \(p = 2\) and \(K = 2\). The marginal probabilities are obtained by
Note that any equation in (4.11) can be expressed as a linear combination of the other three equations, for example \(\lambda _{2H} = \lambda _{1L} + \lambda _{1H} - \lambda _{2L}\).
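The count of linearly independent equations can be checked numerically: stack the 0-1 rows mapping the \(K^{p}\) joint probabilities to the pK marginals and compute the matrix rank, which equals \(pK-(p-1)\). This is a sketch with our own ordering of the joint cells:

```python
from itertools import product
import numpy as np

def marginal_map(p, K):
    """Matrix whose row (j, k) sums the joint probabilities
    pi_{k_1,...,k_p} over all cells with k_j = k, i.e. the linear map
    from the K^p joint probabilities to the pK marginals lambda_{jk}."""
    cells = list(product(range(K), repeat=p))
    A = np.zeros((p * K, K ** p))
    for j in range(p):
        for k in range(K):
            for col, cell in enumerate(cells):
                if cell[j] == k:
                    A[j * K + k, col] = 1.0
    return A

# rank = pK - (p - 1) for each (p, K) tried
ranks = {(p, K): np.linalg.matrix_rank(marginal_map(p, K))
         for (p, K) in [(2, 2), (3, 2), (2, 3), (3, 3)]}
```

For \(p=2\), \(K=2\) this gives rank 3, matching Example 3's observation that one of the four equations in (4.11) is redundant.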
The equations corresponding to the cross-moments, \(E\left( \prod _{j=1}^{p}\beta _{ij}^{q_{j}}\right) \), are
for \({{\textbf {q}}}\in \left\{ {{\textbf {q}}}\in \left\{ 0,1,\ldots ,r-1\right\} ^{p}:\sum _{j=1}^{p}q_{j}=r\right\} \), \(r=2,\ldots ,2K-1\). The linear system (4.12) has
equations. Then the total number of equations in (4.10) and (4.12) that can be utilized to identify the joint probabilities is \(C=\sum _{r=1}^{2K-1}\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) -pK\), which is smaller than the number of joint probabilities, \(K^{p}-1\), for large p. When \(K=2\), \(C<K^{p}-1\) for \(p\ge 7\).
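Reading the count of usable equations as \(\sum _{r=1}^{2K-1}\left( {\begin{array}{c}r+p-1\\ p-1\end{array}}\right) -pK\) (minus signs restored from the garbled display), a short enumeration confirms that, for \(K=2\), the count first falls below \(K^{p}-1\) at \(p=7\):

```python
from math import comb

def n_equations(p, K):
    """Monomial moment equations of degree 1..2K-1, net of the pK
    equations already used to pin down the marginal distributions."""
    return sum(comb(r + p - 1, p - 1) for r in range(1, 2 * K)) - p * K

K = 2
first_short = next(p for p in range(2, 20)
                   if n_equations(p, K) < K ** p - 1)
print(first_short)  # -> 7
```

For instance, \(p=6\) gives 71 equations against \(2^{6}-1=63\) unknowns, while \(p=7\) gives 105 against \(2^{7}-1=127\).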
Identification and estimation of the joint distribution of \(\varvec{\beta } _{i}\) in the general setting will not be pursued in this paper due to the curse of dimensionality. Instead, we consider special cases, that are empirically relevant, in which identification of the joint distribution of \( \varvec{\beta }_{i}\) can be readily established. We first consider small p and K, in particular \(p=2\) and \(K=2\) as in Example 1.
Example 4
Consider the same setup as in Example 1 with \(p=2\) and \(K=2\). In addition to (4.11), consider the crossmoment,
Writing (4.11) and (4.13) in matrix form, we have
where
Note that \(E\left( \beta _{i1}\beta _{i2}\right) \) is identified by Theorem 6, \(b_{jk_{j}}\) and \(\lambda _{jk_{j}}\) are identified by Corollary 7, and the matrix \({{\textbf {B}}}\) is invertible given that \(b_{1L}<b_{1H}\) and \(b_{2L}<b_{2H}\) (see “Appendix A.1”). As a result, the joint probabilities, \(\varvec{\pi }\), are identified.
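A minimal sketch of Example 4's linear system, with hypothetical category values and joint probabilities. The exact stacking of \({{\textbf {B}}}\) is given in the paper's appendix and may be ordered differently; here we use two marginal equations, the adding-up constraint, and the cross moment:

```python
import numpy as np

# hypothetical category values (assumptions for illustration)
b1L, b1H = 1.0, 2.0   # support of beta_{i1}
b2L, b2H = 3.0, 5.0   # support of beta_{i2}

# true joint probabilities over the cells (LL, LH, HL, HH)
pi_true = np.array([0.1, 0.2, 0.3, 0.4])

# quantities identified in earlier steps: two marginals and the
# cross moment E(beta_{i1} beta_{i2})
lam_1L = pi_true[0] + pi_true[1]   # Pr(beta_{i1} = b1L)
lam_2L = pi_true[0] + pi_true[2]   # Pr(beta_{i2} = b2L)
cross = pi_true @ np.array([b1L*b2L, b1L*b2H, b1H*b2L, b1H*b2H])

# one way to stack the system B pi = d
B = np.array([
    [1.0, 1.0, 0.0, 0.0],                  # lambda_{1L}
    [1.0, 0.0, 1.0, 0.0],                  # lambda_{2L}
    [1.0, 1.0, 1.0, 1.0],                  # probabilities sum to 1
    [b1L*b2L, b1L*b2H, b1H*b2L, b1H*b2H],  # cross moment
])
d = np.array([lam_1L, lam_2L, 1.0, cross])
pi_hat = np.linalg.solve(B, d)
```

The cross-moment row is what breaks the linear dependence among the marginal equations, which is why \(b_{1L}<b_{1H}\) and \(b_{2L}<b_{2H}\) are needed for invertibility.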
Remark 14
The argument in Example 4 is applicable for identification of the joint distribution of \(\left( \beta _{ij}, \beta _{i,j^\prime } \right) ^\prime \) for \( j\ne j^\prime \) when \(p > 2\) and \(K = 2\).
5 Finite sample properties using Monte Carlo experiments
We examine the finite sample performance of the categorical coefficient estimator proposed in Sect. 3 by Monte Carlo experiments.
5.1 Data generating processes
We generate \(y_{i}\) as
with \(\beta _{i}\) distributed as in (2.2) with \(K=2,\) and the parameters \(\pi ,\beta _{L}\) and \(\beta _{H}\).^{Footnote 5}
We draw \(\beta _{i}\) for each individual i independently by setting \(\beta _{i}=\beta _{L}\) with probability \(\pi \) and \(\beta _{i}=\beta _{H}\) with probability \(1-\pi \), through a sequence of independent Bernoulli draws. We consider two sets of parameters in all DGPs, denoted as the high variance and the low variance parametrizations, respectively,
\(\beta _{H}/\beta _{L}=2\) for the high variance parametrization and \(\beta _{H}/\beta _{L}=2.69\) for the low variance parametrization, the latter motivated by the estimates in our empirical illustration in Sect. 6.^{Footnote 6} The values of \(E(\beta _{i})\) and \(\textrm{var}\left( \beta _{i}\right) \) are obtained noting that \(E(\beta _{i})=\pi \beta _{L}+(1-\pi )\beta _{H}\) and \(\textrm{var}\left( \beta _{i}\right) =\pi (1-\pi )(\beta _{H}-\beta _{L})^{2}\). The remaining parameters are set as \(\alpha =0.25\) and \(\varvec{\gamma }=\left( 1,1\right) ^{\prime }\) across DGPs.
We generate the regressors and the error terms as follows.
DGP 1 (Baseline) We first generate \({\tilde{x}}_{i}\sim \text {IID}\chi ^{2}(2)\), and then set \(x_{i}=({\tilde{x}}_{i}-2)/2\) so that \(x_{i}\) has mean zero and unit variance. The additional regressors, \(z_{ij}\), \(j=1,2\), with homogeneous slopes are generated as
with \(v_{ij}\sim \text {IID }N\left( 0,1\right) \), for \(j=1,2\). This ensures that the regressors are sufficiently correlated. The error term, \(u_{i}\), is generated as \(u_{i}=\sigma _{i}\varepsilon _{i}\), where \(\sigma _{i}^{2}\) are generated as \(0.5(1+\text {IID}\chi ^{2}(1))\), and \(\varepsilon _{i}\sim \text {IID}N(0,1)\). Note that \(\varepsilon _{i}\) and \(\sigma _{i}^{2}\) are generated independently, and \(E(u_{i}^{2})=1\).
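A sketch of DGP 1 in code. The displayed formula for \(z_{ij}\) is not reproduced in the text above, so a simple linear-in-\(x_{i}\) stand-in is used, and \(\pi \), \(\beta _{L}\), \(\beta _{H}\) are set to illustrative values rather than those in (5.2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
pi, beta_L, beta_H = 0.5, 0.5, 1.0         # illustrative, not the (5.2) values
alpha, gamma = 0.25, np.array([1.0, 1.0])

# x_i: standardized chi-squared(2), so mean 0 and unit variance
x = (rng.chisquare(2, n) - 2.0) / 2.0

# z_{ij}: correlated with x_i (stand-in form; the paper's displayed
# formula for z_{ij} is not reproduced here)
z = 0.5 * x[:, None] + rng.standard_normal((n, 2))

# categorical random coefficient via independent Bernoulli draws
beta = np.where(rng.random(n) < pi, beta_L, beta_H)

# heteroskedastic error: sigma_i^2 = 0.5*(1 + chi2(1)), so E(u_i^2) = 1
sigma2 = 0.5 * (1.0 + rng.chisquare(1, n))
u = np.sqrt(sigma2) * rng.standard_normal(n)

y = alpha + x * beta + z @ gamma + u
```

DGPs 2 and 3 modify only the generation of \(x_{i}\) and \(u_{i}\), respectively, so the same scaffold applies with those lines swapped out.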
DGP 2 (Categorical x) This setup deviates from the baseline DGP and allows the distribution of \(x_{i}\) to differ across i. Accordingly, we generate \(x_{i}=\left( {\tilde{x}}_{1i}-2\right) /2\), where \({\tilde{x}}_{1i}\sim \text {IID}\chi ^{2}\left( 2\right) \), for \(i=1,2,\ldots ,\lfloor n/2\rfloor \), and \(x_{i}=\left( {\tilde{x}}_{2i}-2\right) /4\), where \({\tilde{x}}_{2i}\sim \text {IID}\chi ^{2}\left( 4\right) \), for \(i=\lfloor n/2\rfloor +1,\ldots ,n\). The additional regressors, \(z_{ij}\), \(j=1,2\), with homogeneous slopes are generated as
with \(v_{ij}\sim \text {IID }N\left( 0,1\right) \), for \(j=1,2\). The error term \(u_{i}\) is generated the same as in DGP 1.
DGP 3 (Categorical u) We generate \(x_{i}\) and \({{\textbf {z}}}_{i}\) as in DGP 1, but allow the error term \(u_{i}\) to have a heterogeneous distribution over i. For \(i=1,2,\ldots ,\lfloor n/2\rfloor \), we set \(u_{i}=\sigma _{i}\varepsilon _{i}\), where \(\sigma _{i}^{2}\sim \text {IID}\chi ^{2}\left( 2\right) \) and \(\varepsilon _{i}\sim \text {IID}N(0,1)\), and for \(i=\lfloor n/2\rfloor +1,\ldots ,n\), we set \(u_{i}=\left( {\tilde{u}}_{i}-2\right) /2\), where \({\tilde{u}}_{i}\sim \text {IID}\chi ^{2}\left( 2\right) \).
We investigate the finite sample performance of the estimator proposed in Sect. 3 across DGPs 1 to 3 under the low variance and high variance scenarios.^{Footnote 7} Details of the computational algorithm used to carry out the Monte Carlo experiments (and the empirical results that follow) are given in Sect. S.5 of the online supplement. An accompanying R package is available at https://github.com/zhangao/ccrm.
5.2 Summary of the MC results
For each sample size, \(n=100\), 1000, 2000, 5000, 10,000 and 100,000, we run 5000 replications of the experiments for DGP 1 (baseline), DGP 2 (categorical x) and DGP 3 (categorical u), with the high variance and low variance parametrizations as set out in (5.2).
We first investigate the finite sample performance of \(\hat{\varvec{\phi }}\) as an estimator of \(\varvec{\phi }=\left( E\left( \beta _{i}\right) ,\varvec{\gamma }^{\prime }\right) ^{\prime }\). Bias and root mean squared errors (RMSE) for the estimation of \(E\left( \beta _{i}\right) \), \(\gamma _{1}\) and \(\gamma _{2}\), as well as the size of tests of the null values at the 5 percent nominal level, are reported in Table 2. In addition, we plot the associated empirical power functions in Figs. 1 and 2, for the cases of high and low \(\textrm{var}(\beta _{i})\). The results show that \(\hat{\varvec{\phi }}\) has very good small sample properties, with small bias and RMSEs and size very close to the nominal value of 5 percent across all DGPs and parametrizations, even when the sample size is relatively small. The power of the tests increases steadily as the sample size increases.
We then turn to the GMM estimator of the distributional parameters of \(\beta _{i}\) proposed in Sect. 3.2. The bias, RMSE and test size based on the asymptotic distribution given in Theorem 5, for \(\pi \), \(\beta _{L}\) and \(\beta _{H}\), are reported in Table 3. The empirical power functions are plotted in Figs. 3 and 4. The reported results are based on \(S=4\), where S \((>2K-1=3)\) denotes the highest order of moments of \(x_{i}\) included in the estimation.^{Footnote 8}
The upper panel of this table reports the results for the high variance parametrization and the lower panel those for the low variance parametrization, as set out in (5.2). For all parameters and under all DGPs, the bias and RMSE decline steadily with the sample size, as predicted by Theorem 4, confirming the robustness of the GMM estimates to heterogeneity in the regressor and error processes. But for a given sample size, the relative precision of the estimates depends on the variability of \(\beta _{i}\), as characterized by the true value of \(\textrm{var}(\beta _{i})\): the estimates under the high variance parametrization are more precise than those under the low variance parametrization. This is to be expected since, unlike \(E(\beta _{i})\), the distributional parameters are only identified if \(\textrm{var}(\beta _{i})>0\). As shown in (2.18) and (2.19) for the current case of \(K=2\), \(\textrm{var}(\beta _{i})\) enters the denominator when we recover the distributional parameters from the moments of \(\beta _{i}\). When \(\textrm{var}(\beta _{i})\) is small, estimation errors in the moments of \(\beta _{i}\) can be amplified in the estimation of \(\pi \), \(\beta _{L}\) and \(\beta _{H}\); conversely, the larger the variance, the more precisely \(\pi \), \(\beta _{H}\) and \(\beta _{L}\) can be estimated for a given n.^{Footnote 9} The size and power also depend on the parametrization. Under both the high variance and low variance parametrizations, we achieve correct size and reasonable power only when n is quite large (\(n=\) 100,000). We plot the empirical power functions for \(\pi \), \(\beta _{H}\) and \(\beta _{L}\) only for \(n\ge 5000\), since the size is far above 5 percent for smaller values of n, and power comparisons are not meaningful in such cases.
Remark 15
Note that GMM estimators of the moments of \(\beta _{i}\), namely \({{\textbf {m}}}_{\varvec{\beta }}\), can be obtained using the moment conditions in (3.7), and the transformations \({{\textbf {m}}}_{\varvec{\beta }}=h\left( \varvec{\theta }\right) \) in (3.4) are required only to derive the estimators of \(\varvec{\theta }\), the parameters of the underlying categorical distribution. The Monte Carlo results in Sect. S.3.2 of the online supplement show that \({{\textbf {m}}}_{\varvec{\beta }}\) can be estimated accurately with relatively small sample sizes. In the estimation of both \({{\textbf {m}}}_{\varvec{\beta }}\) and \(\varvec{\theta }\), the same set of moment conditions is included, so the estimation of the distributional parameters \(\varvec{\theta }\) essentially relies on the relation \(\varvec{\theta }=h^{-1}\left( {{\textbf {m}}}_{\varvec{\beta }}\right) \). Sampling uncertainty in the estimation of \({{\textbf {m}}}_{\varvec{\beta }}\), particularly in the higher-order moments, is potentially amplified through the inverse transformation \(h^{-1}\), which involves matrix inversion; this causes the difficulties in the estimation of, and inference on, \(\varvec{\theta }\) when sample sizes are small. This is analogous to the problem of estimating a precision matrix from an estimated covariance matrix. In practice, estimation of the categorical parameters is recommended for applications where the sample size is relatively large; otherwise it is advisable to focus on estimates of the lower-order moments of \(\beta _{i}\).
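The amplification described in this remark can be illustrated for \(K=2\): perturb \(E(\beta _{i}^{3})\) by a small \(\delta \) and compare the induced error in the recovered \(\pi \) under a high-variance and a low-variance design. The two-point inversion below is a sketch based on standard moment algebra, and the parameter values are our own, not those of (5.2):

```python
import math

def pi_from_moments(m1, m2, m3):
    """h^{-1} for K = 2: map raw moments of beta_i to pi = Pr(beta_i = b_L)
    via the mean, variance and skewness (standard two-point algebra)."""
    var = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    skew = mu3 / var ** 1.5
    return 0.5 + skew / (2 * math.sqrt(skew ** 2 + 4))

def pi_error(pi, b_L, b_H, delta=1e-6):
    """Error in the recovered pi when E(beta_i^3) is off by delta."""
    m1, m2, m3 = (pi * b_L ** r + (1 - pi) * b_H ** r for r in (1, 2, 3))
    return abs(pi_from_moments(m1, m2, m3 + delta) - pi)

err_high = pi_error(0.5, 1.0, 2.0)  # var(beta_i) = 0.25
err_low = pi_error(0.5, 1.0, 1.1)   # var(beta_i) = 0.0025
```

Because the skewness divides the third central moment by \(\textrm{var}(\beta _{i})^{3/2}\), the same perturbation produces an error roughly three orders of magnitude larger in the low-variance design.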
6 Heterogeneous return to education: an empirical application
Since the pioneering work by Becker (1962, 1964) on the effects of investments in human capital, estimating returns to education has been one of the focal points of labor economics research. In his classic contribution, Mincer (1974) models the logarithm of earnings as a function of years of education and years of potential labor market experience (age minus years of education minus six), which can be written in the generic form:
as in Heckman et al. (2018, Eq. (1)), where \({{\textbf {z}}}_{i}\) includes labor market experience and other relevant control variables. The above wage equation, also known as the “Mincer equation,” has become the workhorse of empirical work on estimating the return to education. In the most widely used specification of the Mincer equation (6.1),
where \(\tilde{{{\textbf {z}}}}_{i}\) is the vector of control variables other than potential labor market experience.
Along with the advancement of empirical research on this topic, there has been a growing awareness of the importance of heterogeneity in individual cognitive and noncognitive abilities (Heckman 2001) and their significance for explaining the observed heterogeneity in return to education. Accordingly, it is important to allow the parameters of the wage equation to differ across individuals. In Eq. (6.1), we allow \(\alpha _{i}\) and \(\beta _{i}\) to differ across individuals, but assume that \(\phi \left( {{\textbf {z}}}_{i}\right) \) can be approximated as nonlinear functions of experience and other control variables with homogeneous coefficients.
Specifically, following Lemieux (2006b, 2006c) we also allow for time variations in the parameters of the wage equation and consider the following categorical coefficient model over a given crosssection sample indexed by t^{Footnote 10}:
where the return to education follows the categorical distribution,
and \(\tilde{{{\textbf {z}}}}_{it}\) includes gender, marital status and race. \(\alpha _{it}=\alpha _{t}+\delta _{it}\), where \(\delta _{it}\) is a mean-zero random variable assumed to be distributed independently of \(\text {edu}_{it}\) and \({{\textbf {z}}}_{it}=\left( \text {exper}_{it},\text {exper}_{it}^{2},\tilde{{{\textbf {z}}}}_{it}^{\prime }\right) ^{\prime }\). Let \(u_{it}=\varepsilon _{it}+\delta _{it}\), and write (6.2) as
The correlation between \(\alpha _{it}\) and \(\text {edu}_{it}\) in (6.1) is the source of “ability bias” (Griliches 1977). Given the pure cross-sectional nature of our analysis, we do not allow for the endogeneity arising from “ability bias” or from dynamics. To allow for nonzero correlations between \(\alpha _{it}\), \(\text {edu}_{it}\) and \({{\textbf {z}}}_{it}\), a panel data approach is required, which has its own challenges, as the education and experience variables tend to be very slow moving (if at all) for many individuals in the panel. Time delays between changes in education and experience and the wage outcomes further complicate the interpretation of the mean estimates of \(\beta _{it}\) which we shall be reporting. To partially address possible dynamic spillover effects, we provide estimates of the distribution of \(\beta _{it}\) using cross-sectional data from two different sample periods, and investigate the extent to which the distribution of the return to education has changed over time, by gender and the level of educational achievement.^{Footnote 11}
We estimate the categorical distribution of the return to education in (6.3) using the May and Outgoing Rotation Group (ORG) supplements of the Current Population Survey (CPS) data, as in Lemieux (2006b, 2006c).^{Footnote 12} We pool observations from 1973 to 1975 for the first sample period, \(t=\left\{ 1973{-}1975\right\} \), and observations from 2001 to 2003 for the second sample period, \(t=\left\{ 2001{-}2003\right\} \). Following Lemieux (2006b), we consider subsamples of those with 12 years of education or less, “high school or less,” and those with more than 12 years of education, “postsecondary education,” as well as the combined sample. We also present results by gender. The summary statistics are reported in Table 4. As is to be expected, mean log wages are higher for those with postsecondary education (for both males and females), with the number of years of schooling and experience rising by about one year across the two subperiod samples. There are also important differences between males and females, and between the two educational groupings, which we hope to capture in our estimation.
We treat the cross-section observations in the two sample periods, \(t=\left\{ 1973{-}1975\right\} \) and \(\left\{ 2001{-}2003\right\} \), as repeated cross sections rather than as a panel, since the data in these two periods do not cover the same individuals and represent random samples from the population of wage earners in the two periods. It should also be noted that the sample sizes \((n_{t})\), although quite large, are much larger during \(\left\{ 2001{-}2003\right\} \), which could be a factor when we come to compare estimates from the two sample periods. For example, for males and females combined, \(n_{73{-}75}=\) 111,632 as compared to \(n_{01{-}03}=\) 511,819, a difference which becomes more pronounced when we consider the number of observations in the postsecondary/female category, which rises from 12,882 in the first period to 100,007 in the second period.
We report estimates of \(\pi _{t}\), \(\beta _{L,t}\) and \(\beta _{H,t}\), as well as the corresponding means and standard deviations (denoted by s.d.(\({\hat{\beta }}_{it}\))) of the return to education (\(\beta _{it}\)) for \(t=\left\{ 1973{-}1975\right\} \) and \(\left\{ 2001{-}2003\right\} \). For a given \(\pi _{t}\), the ratio \(\beta _{H,t}/\beta _{L,t}\) provides a measure of within-group heterogeneity and allows us to augment information on changes in the mean with changes in the distribution of the return to education. The estimates of the distribution of the return to education (\(\beta _{it}\)) are summarized in Table 5, with the estimation results for the control variables (such as experience, experience squared, and other individual-specific characteristics) reported in Table 6.
As can be seen from Table 5, estimates of \(\mathrm {s.d.}\left( \beta _{it}\right) \) are strictly positive for all subgroups, except for the “high school or less” group during the first sample period. For this group during the first period the estimate of \(\mathrm {s.d.}\left( \beta _{it}\right) \) for the male subsample is zero, \(\pi \) is not identified, and we obtain identical estimates for \(\beta _{L}\) and \(\beta _{H}\). For this subsample, the associated estimates and their standard errors are shown as unavailable (n/a). In the case of the female subsample, as well as the combined male and female sample, where the estimates of s.d.(\({\hat{\beta }}_{it}\)) are close to zero and \(\pi \) is poorly estimated, only the mean of the return to education is informative. In the case of the samples where the estimates of \(\mathrm {s.d.}\left( \beta _{it}\right) \) are strictly positive, the estimate of the ratio \(\beta _{H,t}/\beta _{L,t}\) provides a good measure of the within-group heterogeneity of the return to education. The estimates of \(\beta _{H,t}/\beta _{L,t}\) lie between 1.50 and 2.79, with the high estimate obtained for females with high school or less education during \(\left\{ 2001{-}2003\right\} \), and the low estimate obtained for females with postsecondary education during the same period.
As our theory suggests, the mean estimates of the return to education, \(E\left( \beta _{it}\right) \), are very precisely estimated, and inferences involving them tend to be robust to conditional error heteroskedasticity. The results in Table 5 show that estimates of \(E\left( \beta _{it}\right) \) have increased from \(t=\left\{ 1973{-}1975\right\} \) to \(t=\left\{ 2001{-}2003\right\} \), regardless of gender or educational grouping. The postsecondary educational group shows larger increases in the estimates of \(E\left( \beta _{it}\right) \) than those with high school or less: estimates of \(E\left( \beta _{it}\right) \) increase by 36 percent for the postsecondary group, while the estimates of the mean return to education rise only by around 5 percent in the case of those with high school or less. This result holds for both genders. Comparing mean returns across the two educational groups, we find that the mean return to education of individuals with postsecondary education is 45 percent higher than that of those with high school or less in the \(\left\{ 1973{-}1975\right\} \) period, but this gap increases to 87 percent in the second period, \(\left\{ 2001{-}2003\right\} \). Similar patterns are observed in the subsamples by gender. The estimates suggest rising between-group heterogeneity, mainly due to the increasing returns to education for the postsecondary group.
Turning to within-group heterogeneity, we focus on the estimates of \(\beta _{H,t}/\beta _{L,t}\) and first note that, over the two periods, within-group heterogeneity has been rising mainly in the case of those with high school or less, for both males and females. For the combined male and female sample and the male subsample, there is little evidence of within-group heterogeneity in the first period, \(\left\{ 1973{-}1975\right\} \). However, for the second period, \(\left\{ 2001{-}2003\right\} \), we find a sizeable degree of within-group heterogeneity, with \(\beta _{H,t}/\beta _{L,t}\) estimated to be around 2.41 and \(\text {s.d.}\left( \beta _{it}\right) \approx 0.03\). For the female subsample with high school or less, little evidence of heterogeneity is found in the first period, but the estimate of \(\beta _{H,t}/\beta _{L,t}\) increases to 2.79 in the second sample period, corresponding to a commensurate rise in \(\text {s.d.}\left( \beta _{it}\right) \) to 0.032. The pattern of within-group heterogeneity is very different for those with postsecondary education. For this group, we in fact observe a slight decline in the estimates of \(\beta _{H,t}/\beta _{L,t}\), by gender, over the two sample periods.
Overall, our estimates of the return to education and the within- and between-group comparisons are in line with the evidence of rising wage inequality documented in the literature (Corak 2013).
7 Conclusion
In this paper, we consider random coefficient models for repeated cross sections in which the random coefficients follow categorical distributions. Identification is established using moments of the random coefficients expressed in terms of the moments of the underlying observations. We propose a two-step generalized method of moments procedure to estimate the parameters of the categorical distributions. The consistency and asymptotic normality of the GMM estimators are established without the IID assumption typically made in the literature. Small sample properties of the proposed estimator are investigated by means of Monte Carlo experiments and shown to be robust to heterogeneously generated regressors and errors, although relatively large samples are required to estimate the parameters of the underlying categorical distributions. This is largely due to the highly nonlinear mapping between the parameters of the categorical distribution and the higher-order moments of the coefficients. This problem is likely to become more pronounced with a larger number of categories and coefficients.
In the empirical application, we apply the model to study the evolution of returns to education over two subperiods also considered in the literature by Lemieux (2006b). Our estimates show that mean (ex post) returns to education rose from 1973–1975 to 2001–2003 mainly in the case of individuals with postsecondary education, and this result is robust across genders. We also find evidence of rising within-group heterogeneity for the high school or less educational group, as compared to those with postsecondary education.
In our model specification, the number of categories, K, is treated as a tuning parameter and assumed to be known. An information criterion, as in Bonhomme and Manresa (2015) and Su et al. (2016), to determine K could be considered. Further investigation of models with multiple regressors subject to parameter heterogeneity is also required. These and other related issues are topics for future research.
Notes
\(f_{i}\) is assumed to be a scalar; the analysis can be applied element-by-element to a matrix, for example \({{\textbf {w}}}_{i}{{\textbf {w}}}_{i}^{\prime }\).
For identification, we require the moments of \(x_{i}\) to exist up to order \(4K-2\). S can take values between 2K and \(4K-2\). In practice, the choice of S affects the trade-off between bias and efficiency.
We assume the number of categories K is homogeneous across \(j=1,2,\ldots ,p \). This is for notational simplicity, and can be readily generalized to allow for \(K_{j}\ne K_{j^{\prime }}\) without affecting the main results.
For \({{\textbf {x}}}\in {\mathbb {R}}^{p}\), note that \(\varvec{\tau }_{0}\left( {{\textbf {x}}}\right) =1\), \(\varvec{\tau }_{1}\left( {{\textbf {x}}}\right) ={{\textbf {x}}}\) and \(\varvec{\tau }_{2}\left( {{\textbf {x}}}\right) =\textrm{vech}\left( {{\textbf {x}}}{{\textbf {x}}}^{\prime }\right) \).
A Monte Carlo experiment with \(K=3\) is relegated to Sect. S.3.5 in the online supplement.
The estimates for \(\beta _H / \beta _L\) in our empirical analysis range from 1.50 to 2.79.
We can consider a DGP with conditional heteroskedasticity, in which we follow the baseline DGP but generate the error term as \(u_{i}=x_{i}\varepsilon _{i}\), where \(\varepsilon _{i}\sim N(0,1)\). The least squares estimator for \(\varvec{\phi }\) remains valid in this setup in terms of estimation and inference, whereas the GMM estimator for the distributional parameters \(\varvec{\theta }\) breaks down. This is to be expected, since only the first moment of \(\beta _{i}\) is identified under conditional heteroskedasticity. The results are available on request.
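The point that only the mean coefficient survives conditional heteroskedasticity can be illustrated with a minimal simulation (an assumed design for exposition, not the paper's Monte Carlo setup): under \(u_{i}=x_{i}\varepsilon _{i}\), \(E\left( y|x\right) =E\left( \beta \right) x\) still holds, so least squares recovers \(E\left( \beta _{i}\right) \), but \(E\left( y^{2}|x\right) =\left[ E\left( \beta ^{2}\right) +E\left( \varepsilon ^{2}\right) \right] x^{2}\) confounds \(E\left( \beta ^{2}\right) \) with the error variance.

```python
import numpy as np

# Minimal simulation (assumed design, for exposition only): with
# u_i = x_i * eps_i, E(y | x) = E(beta) x still holds, so least squares
# recovers the mean coefficient, but E(y^2 | x) = (E(beta^2) + 1) x^2
# confounds E(beta^2) with the error variance.
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.5, 2.0, size=n)
beta = rng.choice([1.0, 2.5], p=[0.3, 0.7], size=n)  # categorical beta_i
u = x * rng.normal(size=n)                           # conditional heteroskedasticity
y = beta * x + u

beta_ols = (x @ y) / (x @ x)
print(beta_ols)   # approx E(beta_i) = 0.3*1 + 0.7*2.5 = 2.05
```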
We also tried estimation based on a larger number of moments (using \(S=5\) and \(S=6\)). In our Monte Carlo designs, adding more moments does not seem to add much to the precision of the estimates and could be counterproductive when n is not sufficiently large. The results are available in Sect. S.3.1 in the online supplement.
Section S.3.4 in the online supplement presents parametrization with \(\textrm{var}\left( \beta _i \right) = 6.35\) and 18.95, which further confirms the pattern that the larger the variance the more precisely \(\pi \), \(\beta _{H}\) and \(\beta _{L}\) can be estimated for a given n.
Some investigators have suggested including higher powers of the experience variable in the wage equation. Lemieux (2006a), for example, proposes using a quartic rather than a quadratic function. As a robustness check we also provide estimation results with quartic experience specification in Sect. S.4 in the online supplement.
The data are retrieved from https://www.openicpsr.org/openicpsr/project/116216/version/V1/view.
References
Ahn SC, Lee YH, Schmidt P (2001) GMM estimation of linear panel data models with time-varying individual effects. J Econom 101(2):219–255
Ahn SC, Lee YH, Schmidt P (2013) Panel data models with multiple time-varying individual effects. J Econom 174(1):1–14
Andrews DWK (2001) Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica 69(3):683–734
Arellano M, Bonhomme S (2012) Identifying distributional characteristics in random coefficients panel data models. Rev Econ Stud 79(3):987–1020
Becker GS (1962) Investment in human capital: a theoretical analysis. J Polit Econ 70(5, Part 2):9–49
Becker GS (1964) Human capital: a theoretical and empirical analysis, with special reference to education. The University of Chicago Press, Chicago
Beran R (1993) Semiparametric random coefficient regression models. Ann Inst Stat Math 45(4):639–654
Beran R, Hall P (1992) Estimating coefficient distributions in random coefficient regressions. Ann Stat 20(4):1970–1984
Beran R, Millar PW (1994) Minimum distance estimation in random coefficient regression models. Ann Stat 22(4):1976–1992
Beran R, Feuerverger A, Hall P (1996) On nonparametric estimation of intercept and slope distributions in random coefficient regression. Ann Stat 24(6):2569–2592
Bick A, Blandin A, Rogerson R (2022) Hours and wages. Q J Econ 137:1901–1962
Bonhomme S, Manresa E (2015) Grouped patterns of heterogeneity in panel data. Econometrica 83(3):1147–1184
Breunig C, Hoderlein S (2018) Specification testing in random coefficient models. Quant Econ 9(3):1371–1417
Corak M (2013) Income inequality, equality of opportunity, and intergenerational mobility. J Econ Perspect 27(3):79–102
Foster A, Hahn J (2000) A consistent semiparametric estimation of the consumer surplus distribution. Econ Lett 69(3):245–251
Gautier E, Hoderlein S (2015) A triangular treatment effect model with random coefficients in the selection equation. Working Paper. arXiv:1109.0362
Gautier E, Kitamura Y (2013) Nonparametric estimation in random coefficients binary choice models. Econometrica 81(2):581–607
Griliches Z (1977) Estimating the returns to schooling: some econometric problems. Econometrica 45(1):1–22
Hausman JA (1981) Exact consumer’s surplus and deadweight loss. Am Econ Rev 71(4):662–676
Hausman JA, Newey WK (1995) Nonparametric estimation of exact consumers surplus and deadweight loss. Econometrica 63(6):1445–1476
Heckman JJ (2001) Micro data, heterogeneity, and the evaluation of public policy: nobel lecture. J Polit Econ 109(4):673–748
Heckman JJ, Humphries JE, Veramendi G (2018) Returns to education: the causal effects of education on earnings, health, and smoking. J Polit Econ 126(S1):S197–S246
Hoderlein S, Klemelä J, Mammen E (2010) Analyzing the random coefficient model nonparametrically. Econom Theor 26(3):804–837
Hoderlein S, Holzmann H, Meister A (2017) The triangular model with random coefficients. J Econom 201(1):144–169
Hsiao C, Pesaran MH (2008) Random coefficient models. In: Mátyás L, Sevestre P (eds) The econometrics of panel data, chapter 6. Springer, Berlin, pp 185–213
Ichimura H, Thompson TS (1998) Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution. J Econom 86(2):269–295
Lemieux T (2006a) The “Mincer equation” thirty years after Schooling, Experience, and Earnings. In: Grossbard S (ed) Jacob Mincer: a pioneer of modern labor economics, chapter 11. Springer, New York, pp 127–145
Lemieux T (2006b) Postsecondary education and increasing wage inequality. Working Paper No. 12077, National Bureau of Economic Research
Lemieux T (2006c) Postsecondary education and increasing wage inequality. Am Econ Rev 96(2):195–199
Masten MA (2018) Random coefficients on endogenous variables in simultaneous equations models. Rev Econ Stud 85(2):1193–1250
Matzkin RL (2012) Identification in nonparametric limited dependent variable models with simultaneity and unobserved heterogeneity. J Econom 166(1):106–115
Mincer J (1974) Schooling, experience, and earnings. National Bureau of Economic Research, New York. ISBN 0870142658
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. In: Engle RF, McFadden DL (eds) Handbook of econometrics, volume 4, chapter 36. Elsevier, Amsterdam, pp 2111–2245
Nicholls D, Pagan A (1985) Varying coefficient regression. In: Hannan EJ, Krishnaiah PR, Rao MM (eds) Handbook of statistics, volume 5, chapter 16. Elsevier, Amsterdam, pp 413–449
Pesaran MH, Zhou Q (2018) To pool or not to pool: revisited. Oxford Bull Econ Stat 80(2):185–217
Rosen K (2006) Discrete mathematics and its applications, 6th edn. McGraw-Hill Education, New York
Su L, Shi Z, Phillips PC (2016) Identifying latent structures in panel data. Econometrica 84(6):2215–2264
Acknowledgements
We would like to thank Timothy Armstrong, Hidehiko Ichimura, Esfandiar Maasoumi, Geert Ridder, Ron Smith and Hayun Song for helpful comments, and two anonymous referees for constructive comments and suggestions.
Funding
Open access funding provided by SCELC, Statewide California Electronic Library Consortium
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
The authors did not receive support from any organization for the submitted work. This article does not contain any studies with human participants or animals performed by any of the authors.
Proofs
We include proofs and technical details in this section.
Proof of Theorem 1
Summing (2.6) over i and rearranging terms,
Note that
and
by Assumptions 1(b) and 2(b). Taking \(n\rightarrow \infty \) on both sides of (A.1.1) then yields (2.8). Similar steps applied to (2.7) give (2.9). \(\square \)
Proof of Theorem 2
Let \(m_{r}=E\left( \beta _{i}^{r}\right) \), \(r=1,2,\ldots ,2K-1\), which are taken as known. We show that
\(r=0,1,2,\ldots ,2K-1\), has a unique solution \(\varvec{\theta }=\left( \varvec{\pi }^{\prime },{{\textbf {b}}}^{\prime }\right) ^{\prime }\), with \(b_{1}<b_{2}<\cdots <b_{K}\) and \(\pi _{k}\in \left( 0,1\right) \) imposed.
Let
be the polynomial with K distinct roots \(b_{1}\), \(b_{2}\), \(\ldots \), \(b_{K}\). Note that for each k, \(\left( b_{k}^{r}\right) _{r=0}^{2K-1}\) satisfies the linear homogeneous recurrence relation,
for \(r=0,1,\ldots ,K-1\), since q is the characteristic polynomial of the linear recurrence relation (A.1.4) and \(b_{k}\) is a root of q (Rosen 2006, Chapter 5.2). Since \(\left( m_{r}\right) _{r=0}^{2K-1}\) is a linear combination of \(\left( b_{1}^{r}\right) _{r=0}^{2K-1}\), \(\left( b_{2}^{r}\right) _{r=0}^{2K-1}\), \(\ldots \), \(\left( b_{K}^{r}\right) _{r=0}^{2K-1}\) by (A.1.2), \(\left( m_{r}\right) _{r=0}^{2K-1}\) also satisfies the linear recurrence relation (A.1.4), i.e.,
for \(r=0,1,\ldots ,K-1\). (A.1.5) is a linear system of K equations in terms of \(\left( b^{*}_{k}\right) _{k=1}^{K}\). In matrix form,
where
\({{\textbf {D}}} = \textrm{diag}\left( \left( -1\right) ^{K-1},\left( -1\right) ^{K-2},\ldots ,1\right) \), \({{\textbf {b}}}^{*}=\left( b^{*}_{K}, b^{*}_{K-1}, \ldots , b^{*}_{1} \right) ^{\prime }\), and \({{\textbf {m}}} = \left( m_{K}, m_{K+1}, \ldots , m_{2K-1} \right) ^{\prime }\).
Denote \(\varvec{\psi }_{k}=\left( 1,b_{k},b_{k}^{2},\ldots ,b_{k}^{K-1}\right) ^{\prime }\) and \({\varvec{\Psi }}=\left( \varvec{\psi }_{1},\varvec{\psi }_{2},\ldots ,\varvec{\psi }_{K}\right) \). Then
and \({{\textbf {M}}}=\sum _{k=1}^{K}\pi _{k}{{\textbf {M}}}_{k}=\varvec{\Psi }\textrm{diag}\left( \varvec{\pi }\right) \varvec{\Psi }^{\prime }\). Note that \(\varvec{\Psi }^{\prime }\) is a Vandermonde matrix, so \(\det \left( \varvec{\Psi }\right) =\prod _{1\le k<k^{\prime }\le K}\left( b_{k^{\prime }}-b_{k}\right) >0\) since \(b_{1}<b_{2}<\cdots <b_{K}\).
since \(\pi _{k}\in \left( 0,1\right) \) for any k. Then we can identify \(\left( b^{*}_{k}\right) _{k=1}^{K}\) from \(\left( m_{r}\right) _{r=0}^{2K-1}\) in (A.1.6); hence the characteristic polynomial is determined, and we can identify \(\left( b_{k}\right) _{k=1}^{K}\) by (A.1.3).
Since both \(\left( b_{k}\right) _{k=1}^{K}\) and \(\left( m_{r}\right) _{r=1}^{2K-1}\) are identified, the first K equations of (A.1.2) can be written as
and \(\varvec{\pi }\) is identified by inverting the Vandermonde matrix \( \varvec{\Psi }^{\prime }\), which completes the proof. \(\square \)
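The constructive steps of the proof — solve the linear system for the recurrence coefficients, obtain the support points as the roots of the characteristic polynomial, then invert the Vandermonde system for the probabilities — can be sketched numerically. The function below is an illustrative implementation under assumed example moments, not code from the paper:

```python
import numpy as np

# Illustrative implementation of the proof's steps for recovering a
# K-category distribution of beta_i from its moments m_0, ..., m_{2K-1}:
#   (i)   solve a linear (Hankel) system for the recurrence coefficients,
#   (ii)  take the roots of the characteristic polynomial as supports b_k,
#   (iii) invert a Vandermonde system for the probabilities pi_k.
def recover_categorical(moments, K):
    m = np.asarray(moments, dtype=float)            # m_0, ..., m_{2K-1}
    # (i) sum_{j=0}^{K-1} c_j m_{r+j} = m_{r+K} for r = 0, ..., K-1
    H = np.column_stack([m[j:j + K] for j in range(K)])
    c = np.linalg.solve(H, m[K:2 * K])
    # (ii) q(x) = x^K - c_{K-1} x^{K-1} - ... - c_0; real roots expected
    b = np.sort(np.roots(np.concatenate(([1.0], -c[::-1]))).real)
    # (iii) Vandermonde system: sum_k pi_k b_k^r = m_r, r = 0, ..., K-1
    V = np.vander(b, K, increasing=True).T
    pi = np.linalg.solve(V, m[:K])
    return pi, b

# Example with K = 2: beta = 1 w.p. 0.3 and beta = 2.5 w.p. 0.7
pi0, b0 = np.array([0.3, 0.7]), np.array([1.0, 2.5])
m = [float(np.sum(pi0 * b0**r)) for r in range(4)]
pi_hat, b_hat = recover_categorical(m, K=2)
print(pi_hat, b_hat)   # approx [0.3 0.7] and [1.  2.5]
```

In practice the moments are estimated rather than known, so the mapping's nonlinearity (roots of an estimated polynomial) explains the large samples needed for precise estimation of \(\varvec{\theta }\).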
Proof of Theorem 4
Denote
where we stack the left-hand side of (3.7) and transform \({{\textbf {m}}}_\beta = h\left( \varvec{\theta } \right) \) to get \({{\textbf {g}}}_0\left( \varvec{\theta }, \varvec{\sigma }, \varvec{\gamma } \right) \). We suppress the argument \({\hat{\varvec{\gamma }}}\) and denote \(\varvec{\eta } = \left( \varvec{\theta }^\prime , \varvec{\sigma }^\prime \right) ^\prime \) for notational simplicity, and proceed by verifying the conditions of Newey and McFadden (1994, Theorem 2.1). Theorem 2 provides the identification result, which together with the positive definiteness of \({{\textbf {A}}}\) verifies that \(\Phi _{0}\left( \varvec{\eta }, \varvec{\gamma } \right) \) is uniquely minimized to 0 at \(\varvec{\eta }_{0}\). The compactness of the parameter space holds by Assumption 4(a). Note that \({{\textbf {g}}}_0\left( \varvec{\eta }, \varvec{\gamma }\right) \) is a polynomial in \(\varvec{\eta }\), hence continuous in \(\varvec{\eta }\), and is bounded on \(\Theta \times \mathcal {S}\). We next verify the uniform convergence condition. The additive terms in \(\hat{{{\textbf {g}}}}_n\left( \varvec{\eta }, \hat{\varvec{\gamma }}\right) - {{\textbf {g}}}_0\left( \varvec{\eta },\varvec{\gamma }\right) \) are of the form \(H_{n,1} h^{\left( r, q\right) }\left( \varvec{\eta }\right) \) or \(H_{n,2}\), where
\(h^{\left( r,q\right) }\left( \varvec{\eta }\right) \) is a polynomial in \(\varvec{\eta }\), and
\(H_{n,1} = O_{p}\left( n^{-1/2}\right) \) and \(H_{n,2} = O_{p}\left( n^{-1/2}\right) \) by Assumptions 2(a) and 4(c).
By the compactness of \(\Theta \times {\mathcal {S}}\), \(\sup _{\varvec{\eta }\in \Theta \times {\mathcal {S}}}\left| h^{\left( r,q\right) }\left( \varvec{\eta }\right) \right| <C<\infty \) for some positive constant C. By the triangle inequality, we have
as \(n\rightarrow \infty \). Following the proof of Newey and McFadden (1994, Theorem 2.1),
By (A.1.7) and the boundedness of \({{\textbf {g}}}_0\), \(\sup _{\varvec{\eta }\in \Theta \times {\mathcal {S}}}\left| {\hat{\Phi }}_{n}\left( {\varvec{\eta }},\hat{\varvec{\gamma }}\right) -\Phi _{n}\left( {\varvec{\eta }},\varvec{\gamma }\right) \right| \rightarrow _{p}0\), which completes the proof. \(\square \)
Proof of Theorem 5
We denote \(\varvec{\eta }=\left( \varvec{\theta }^{\prime },\varvec{\sigma }^{\prime }\right) ^{\prime }\) for notational simplicity. The first-order condition, \(\nabla _{\varvec{\eta }}\hat{{{\textbf {g}}}}_{n}\left( \hat{\varvec{\eta }},\hat{\varvec{\gamma }}\right) ^{\prime }{{\textbf {A}}}_{n}\hat{{{\textbf {g}}}}_{n}\left( \hat{\varvec{\eta }},\hat{\varvec{\gamma }}\right) ={{\textbf {0}}}\), holds with probability approaching 1. Denote \(\hat{{{\textbf {G}}}}\left( \varvec{\eta },\varvec{\gamma }\right) =\nabla _{\varvec{\eta }}\hat{{{\textbf {g}}}}_{n}\left( \varvec{\eta },\varvec{\gamma }\right) \). Expanding \(\hat{{{\textbf {g}}}}_{n}\left( {\hat{\varvec{\eta }}},\hat{\varvec{\gamma }}\right) \) in the first-order condition around \(\varvec{\eta }_{0}\), we have
where \(\bar{\varvec{\eta }}\) lies between \(\hat{\varvec{\eta }}\) and \(\varvec{\eta }_{0}\), and \(\bar{\varvec{\gamma }}\) lies between \(\hat{\varvec{\gamma }}\) and \(\varvec{\gamma }_{0}\). Note that by term-by-term convergence, we have \(\hat{{{\textbf {G}}}}\left( \hat{\varvec{\eta }},\hat{\varvec{\gamma }}\right) ,\hat{{{\textbf {G}}}}\left( \bar{\varvec{\eta }},\hat{\varvec{\gamma }}\right) \rightarrow _{p}{{\textbf {G}}}_{0}\) and \(\nabla _{\varvec{\gamma }}\hat{{{\textbf {g}}}}_{n}\left( \varvec{\eta }_{0},\bar{\varvec{\gamma }}\right) \rightarrow _{p}\nabla _{\varvec{\gamma }}{{\textbf {g}}}_{0}\left( \varvec{\eta }_{0},\varvec{\gamma }_{0}\right) ={{\textbf {G}}}_{0,\gamma }\). By Assumption 4(b), \({{\textbf {A}}}_{n}\rightarrow _{p}{{\textbf {A}}}\). By Assumptions 5(a) and (b) and Slutsky's theorem,
which completes the proof. \(\square \)
Further details for Example 4
We need to verify the invertibility of the matrix
The span of the first three rows of \({{\textbf {B}}}\) is
\(\left( b_{1L}b_{2L},b_{1L}b_{2H},b_{1H}b_{2L},b_{1H}b_{2H}\right) ^{\prime }\notin {\mathcal {S}}\) is equivalent to \(b_{1H}b_{2H}-b_{1H}b_{2L}\ne b_{1L}b_{2H}-b_{1L}b_{2L}\). This can be verified by
given that \(b_{1L}<b_{1H}\) and \(b_{2L}<b_{2H}\) hold. \(\square \)
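The inequality above rests on the factorization \(b_{1H}b_{2H}-b_{1H}b_{2L}-\left( b_{1L}b_{2H}-b_{1L}b_{2L}\right) =\left( b_{1H}-b_{1L}\right) \left( b_{2H}-b_{2L}\right) >0\). A quick numerical spot-check (illustrative random values, not taken from the paper):

```python
import numpy as np

# Spot-check of the factorization (illustrative random draws):
# b1H*b2H - b1H*b2L - (b1L*b2H - b1L*b2L) = (b1H - b1L) * (b2H - b2L),
# which is strictly positive whenever b1L < b1H and b2L < b2H.
rng = np.random.default_rng(1)
for _ in range(1000):
    b1L, b2L = rng.normal(size=2)
    b1H = b1L + rng.uniform(0.1, 2.0)
    b2H = b2L + rng.uniform(0.1, 2.0)
    lhs = b1H * b2H - b1H * b2L - (b1L * b2H - b1L * b2L)
    rhs = (b1H - b1L) * (b2H - b2L)
    assert np.isclose(lhs, rhs) and rhs > 0
print("factorization holds on 1000 random draws")
```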
Gao, Z., Pesaran, M.H. Identification and estimation of categorical random coefficient models. Empir Econ 64, 2543–2588 (2023). https://doi.org/10.1007/s00181023024020