Abstract
We are interested in the estimation of a parameter \(\theta \) that maximizes a certain criterion function depending on an unknown, possibly infinite-dimensional nuisance parameter h. A common estimation procedure consists in maximizing the corresponding empirical criterion, in which the nuisance parameter is replaced by a nonparametric estimator. In the literature, this research topic, commonly referred to as semiparametric M-estimation, has received a lot of attention in the case where the criterion function satisfies certain smoothness properties. In certain applications, these smoothness conditions are, however, not satisfied. The aim of this paper is therefore to extend the existing theory on semiparametric M-estimators, in order to cover non-smooth M-estimators as well. In particular, we develop ‘high-level’ conditions under which the proposed M-estimator is consistent and has an asymptotic limit. We also check these conditions for a specific example of a semiparametric M-estimator coming from the area of classification with missing data.
References
Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. J. Heckman & E. E. Leamer (Eds.), Handbook of econometrics, 6B, Chapter 76. North Holland: Elsevier.
Chen, X., Fan, Y. (2006). Estimation of copula-based semiparametric time series models. Journal of Econometrics, 130, 307–335.
Chen, X., Liao, Z. (2014). Sieve M-inference on irregular parameters. Journal of Econometrics, 182, 70–86.
Chen, X., Pouzo, D. (2009). Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics, 152, 46–60.
Chen, X., Linton, O., Van Keilegom, I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71, 1591–1608.
Cheng, G., Shang, Z. (2015). Joint asymptotics for semi-nonparametric regression models under partially linear structure. Annals of Statistics, 43, 1351–1390.
De Backer, M., El Ghouch, A., Van Keilegom, I. (2018). An adapted loss function for censored quantile regression. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2018.1469996.
Ding, Y., Nan, B. (2011). A sieve \(M\)-theorem for bundled parameters in semiparametric models, with application to the efficient estimation in a linear model for censored data. Annals of Statistics, 39, 3032–3061.
Escanciano, J., Jacho-Chavez, D., Lewbel, A. (2014). Uniform convergence of weighted sums of non- and semi-parametric residuals for estimation and testing. Journal of Econometrics, 178, 426–443.
Escanciano, J., Jacho-Chavez, D., Lewbel, A. (2016). Identification and estimation of semiparametric two step models. Quantitative Economics, 7, 561–589.
Goldenshluger, A., Zeevi, A. (2004). The Hough transform estimator. Annals of Statistics, 32, 1908–1932.
Groeneboom, P., Wellner, J. A. (1992). Information bounds and nonparametric maximum likelihood estimation. Basel: Birkhäuser.
Groeneboom, P., Jongbloed, G., Wellner, J. A. (2001). Estimation of a convex function: Characterizations and asymptotic theory. Annals of Statistics, 29, 1653–1698.
Horowitz, J. (2009). Semiparametric and nonparametric methods in econometrics. New York: Springer.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single index models. Journal of Econometrics, 58, 71–120.
Ichimura, H., Lee, S. (2010). Characterization of the asymptotic distribution of semiparametric \(M\)-estimators. Journal of Econometrics, 159, 252–266.
Kim, J., Pollard, D. (1990). Cube root asymptotics. Annals of Statistics, 18, 191–219.
Kosorok, M. R. (2008). Introduction to empirical processes and semiparametric inference. New York: Springer.
Koul, H. L., Müller, U. U., Schick, A. (2012). The transfer principle: A tool for complete case analysis. Annals of Statistics, 40, 3031–3049.
Kristensen, D., Salanié, B. (2017). Higher-order properties of approximate estimators. Journal of Econometrics, 198, 189–208.
Ma, S., Kosorok, M. R. (2005). Robust semiparametric M-estimation and the weighted bootstrap. Journal of Multivariate Analysis, 96, 190–217.
Mammen, E., Rothe, C., Schienle, M. (2016). Semiparametric estimation with generated covariates. Econometric Theory, 32, 1140–1177.
Mohammadi, L., Van de Geer, S. (2005). Asymptotics in empirical risk minimization. Journal of Machine Learning Research, 6, 2027–2047.
Müller, U. U. (2009). Estimating linear functionals in nonlinear regression with responses missing at random. Annals of Statistics, 37, 2245–2277.
Pérez-González, A., Vilar-Fernández, J. M., González-Manteiga, W. (2009). Asymptotic properties of local polynomial regression with missing data and correlated errors. Annals of the Institute of Statistical Mathematics, 61, 85–110.
Polonik, W., Yao, Q. (2000). Conditional minimum volume predictive regions for stochastic processes. Journal of the American Statistical Association, 95, 509–519.
Radchenko, P. (2008). Mixed-rates asymptotics. Annals of Statistics, 36, 287–309.
Van de Geer, S. A. (2000). Empirical processes in M-estimation. New York: Cambridge University Press.
Van der Vaart, A. W., Wellner, J. A. (1996). Weak convergence and empirical processes: With applications in statistics. New York: Springer.
Van der Vaart, A. W., Wellner, J. A. (2007). Empirical processes indexed by estimated functions. IMS Lecture Notes-Monograph Series, 55, 234–252.
Acknowledgements
The authors would like to thank Xiaohong Chen, Guang Cheng, Oliver Linton, Michael Kosorok, Bin Nan, Bodhi Sen and Jon Wellner for stimulating discussions and helpful comments that improved the quality of the paper.
Research supported by the European Research Council (2016–2021, Horizon 2020/ERC Grant Agreement No. 694409), and by IAP research network Grant No. P7/06 of the Belgian government (Belgian Science Policy).
Appendix: Proofs
In this Appendix, we prove the asymptotic results, namely the consistency, the rate of convergence and the asymptotic distribution of our M-estimator \(\widehat{\theta }\).
Proof of Theorem 1
Our aim is to show that
Indeed, the result we want to obtain is a direct consequence of (11) and assumption (A2). It is easy to show that assumptions (A3) and (A4) imply that
since \(\widehat{\theta }\) belongs by construction to \(\Theta \). Consider the following decomposition:
This together with (12) leads to the following inequality:
Now, the quantity \((1+o_{P^*}(1))\) on the left-hand side in the above inequality is positive on a set \(A_n\) whose outer probability tends to one when n tends to infinity. On \(A_n\), a reformulation of the previous inequality gives:
Assumptions (A3) and (A5) imply that
and assumption (A1) gives that
It now follows directly from (13)–(15) that
\(\square \)
Proof of Theorem 2
Let \(\xi _n\) be the \(O_{P^*}(r_n^{-2})\)-quantity involved in assumption (B4). We introduce the sets
and observe that \(\Theta \backslash \{\theta _0\}=\cup _{j=1}^{+\infty }S_{j,n}\). Our aim is to prove that for any \(\epsilon >0\) there exists \(\tau _\epsilon >0\) such that
for n sufficiently large. From now on, we work with an arbitrary fixed positive value of \(\epsilon \). For any \(\delta ,\,\delta _1,\,M,\,K,\,K'>0\), we obtain the following bound using assumption (B4):
where \(A_n= \{r_n|W_n| \le K',\,|\beta _n|\le \frac{C}{2},\,d_\mathcal {H}(\widehat{h},h_0)\le \frac{\delta _1}{v_n}\}\). Indeed, we can write
Assumption (B1) implies that for all \(\delta >0\) there exists \(n_\epsilon \) such that
for n larger than \(n_\epsilon \). Then, by definition of \(\xi _n\) and \(W_n\) and because of (B1), there exist three positive constants \(\delta _1\), \(K_\epsilon \) and \(K'_\epsilon \) such that
for n larger than some \(n_1 \in \mathbb {N}\). We fix \(\delta <\delta _0\) and suppose that \(n\ge \max (n_0,n_1,n_\epsilon )\) to get that assumptions (B2) and (B3) are fulfilled on all \(S_{j,n}\) such that \(2^j\le \delta r_n\).
Now, it follows directly from assumption (B3) that for each fixed j such that \(2^j\le \delta r_n\) one has for all \(\theta \in S_{j,n}\):
Consequently, we obtain the following inequality:
Now, there exists \(M_\epsilon \) such that for all \(j\ge M_\epsilon \) one gets
Consequently, if \(M\ge M_\epsilon \), using assumption (B2) and Chebyshev’s inequality we have that
Finally, since \(\alpha <2\), the series \(\sum _{j \ge M} 2^{j(\alpha -2)}\) converges and hence there exists \(M'_\epsilon \ge M_\epsilon \) such that
This finishes the proof showing (16) with \(\tau _\epsilon =2^{M'_\epsilon }\). \(\square \)
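The convergence of the series used in the last step of the proof of Theorem 2 can be made explicit. The following is a short worked computation; the constant \(c>0\) is a hypothetical placeholder for the constants in the bounds, not a quantity taken from the paper. Since \(\alpha <2\) gives \(0<2^{\alpha -2}<1\), the tail of the geometric series has a closed form:

```latex
\[
\sum_{j\ge M} 2^{j(\alpha-2)}
  = \frac{2^{M(\alpha-2)}}{1-2^{\alpha-2}}
  \;\xrightarrow[M\to\infty]{}\; 0,
\qquad\text{so}\qquad
c\sum_{j\ge M} 2^{j(\alpha-2)} \le \epsilon
\quad\text{whenever}\quad
M \ge \frac{1}{2-\alpha}\,
      \log_2\!\Bigl(\frac{c}{\epsilon\,(1-2^{\alpha-2})}\Bigr).
\]
```

Under this simplification, any such M that is also at least \(M_\epsilon \) could play the role of \(M'_\epsilon \).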
Proof of Theorem 3
The first step of the proof consists in showing the weak convergence of the process \(\gamma \mapsto r_n^2B_n(\theta _0+\frac{\gamma }{r_n},\widehat{h})\). This is shown in Lemma 1 (given below).
The remainder of the proof relies on arguments similar to those used to prove the Argmax theorem in Van der Vaart and Wellner (1996). First note that E is a \(\sigma \)-compact metric space since \(E=\cup _{i=1}^\infty \mathcal {K}_i\) with \(\mathcal {K}_i=\{\gamma \in E : \Vert \gamma \Vert \le a_i\}\) for any positive sequence \((a_i)_{i\in \mathbb {N}^*}\) tending to infinity.
We then deduce from assumption (C9) together with Lemmas 2 and 3 that almost all paths of the limiting process \(\gamma \mapsto \Lambda (\gamma )+\mathbb {G}(\gamma )\) attain their supremum at a unique point \(\gamma _0\), following ideas similar to those used in the parametric case (see Theorem 3.2.10 in Van der Vaart and Wellner 1996). Assume now that \(\gamma _0\) is measurable. By the Portmanteau theorem, the weak convergence of \(r_n(\widehat{\theta }-\theta _0)\) to \(\gamma _0\) is equivalent to the following statement:
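For reference, the closed-set form of the Portmanteau theorem can be written in the notation of this proof as follows. This is a sketch of the standard statement, not a copy of the paper's display; the use of the outer probability \(P^*\) on the left-hand side is an assumption made for consistency with the rest of the Appendix.

```latex
\[
\limsup_{n\to\infty}
P^*\bigl(r_n(\widehat{\theta}-\theta_0)\in C\bigr)
\;\le\;
P(\gamma_0\in C)
\qquad\text{for every closed set } C\subset E .
\]
```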
Let C be an arbitrary closed subset of E and fix \(\epsilon >0\). The random variable \(\gamma _0\) is tight because it takes values in E, which is \(\sigma \)-compact. Combining this tightness and the first part of (C1), it is possible to find \(K_\epsilon >0\) and hence a compact set \(\mathcal {K}_\epsilon :=\{\gamma : \Vert \gamma \Vert \le K_\epsilon \}\) such that
It follows easily from (20) that
Now, using Lemma 1 and assumption (C8) we obtain
by Slutsky’s lemma and Portmanteau’s theorem. On the other hand, for every open set G containing \(\gamma _0\), we have:
This together with (22) leads to
Consequently, it follows from (21) that for all \(\epsilon >0\),
Since (24) holds for all \(\epsilon >0\), it also holds for \(\epsilon =0\). The result now follows from the Portmanteau theorem. \(\square \)
We end this section with three lemmas that were needed in the proof of Theorem 3.
Lemma 1
For all \(K>0\), let \(\mathcal {K}=\{\gamma \in E : \Vert \gamma \Vert \le K\}\) be a compact subset of E. Then, under the assumptions of Theorem 3, for any such \(\mathcal {K}\), the process \(\gamma \mapsto r_n^2B_n(\theta _0+\frac{\gamma }{r_n},\widehat{h})\) converges weakly to the process \(\gamma \mapsto \Lambda (\gamma )+\mathbb {G}(\gamma )\) in \(\ell ^\infty (\mathcal {K})\). Moreover, almost all paths of the limiting process are continuous (uniformly on every compact \(\mathcal {K}\)) with respect to \(\Vert \cdot \Vert \).
Proof
The weak convergence of the process \(\gamma \mapsto r_n^2B_n(\theta _0+\frac{\gamma }{r_n},\widehat{h})\) in \(\ell ^\infty (\mathcal {K})\) follows directly from Slutsky’s theorem and Lemmas 2 and 3. On the other hand, \(\Vert \cdot \Vert \) makes \(\mathcal {K}\) totally bounded (since it is compact) and \(\gamma \mapsto r_n^2B_n(\theta _0+\frac{\gamma }{r_n},h_0)+r_nW_n(\gamma )\) is asymptotically uniformly \(\Vert \cdot \Vert \)-equicontinuous in probability, asymptotically tight, and it converges weakly to \(\gamma \mapsto \Lambda (\gamma )+\mathbb {G}(\gamma )\) in \(\ell ^\infty (\mathcal {K})\) (see proof of Lemma 3). Thus, almost all paths of the limiting process are uniformly \(\Vert \cdot \Vert \)-continuous on \(\mathcal {K}\) (see Theorem 1.5.7 in Van der Vaart and Wellner 1996). Moreover, because E may be covered by a countable sequence of such compact sets, almost all paths of the limiting process are \(\Vert \cdot \Vert \)-continuous on E. \(\square \)
Lemma 2
Let \(\mathcal {K}=\{\gamma \in E : \Vert \gamma \Vert \le K\}\). Then, under the assumptions of Theorem 3, for all \(\gamma \in \mathcal {K}\), there exist \(\xi _{0,n},\xi _{1,n},\xi _{2,n}\), such that \(\sup _{\gamma \in \mathcal {K}}|\xi _{j,n}|=o_{P^*}(1), j=0,1,2,\,\) and
Proof
Let us introduce the following notations:
with \(\theta =\theta _0+\gamma /r_n\).
Because the compact set \(\mathcal {K}\) is bounded and \(\theta _0\) belongs to the interior of \(\Theta \), there exists \(n_{\mathcal {K}}\) such that for all \(n\ge n_{\mathcal {K}}\) and for all \(\gamma \in \mathcal {K}\), the point \(\theta _0+\frac{\gamma }{r_n}\) is in \(\Theta \). It follows that, for all \(\gamma \in \mathcal {K}\) and all \(n\ge n_{\mathcal {K}}\),
This can be reformulated as
Then, use assumptions (C1) and (C7) to get
Combining (26) and (27), we obtain
with
It can be easily shown that \(\sup _{\gamma \in \mathcal {K}}|\xi _{j,n}(\gamma )|=o_{P^*}(1)\) for \(j=0,1,2\) using assumptions (C3) and (C7). \(\square \)
Lemma 3
Let \(\mathcal {K}=\{\gamma \in E : \Vert \gamma \Vert \le K\}\). Then, under the assumptions of Theorem 3, the process \(\gamma \mapsto r^2_nB_n(\theta _0+\frac{\gamma }{r_n},h_0)+r_nW_n(\gamma )\) is asymptotically tight, asymptotically uniformly equicontinuous with respect to \(\Vert \cdot \Vert \) on \(\mathcal {K}\), and it converges weakly to the process \(\gamma \mapsto \Lambda (\gamma )+\mathbb {G}(\gamma )\) in \(\ell ^\infty (\mathcal {K})\).
Proof
The main idea of this proof consists in writing the process \(T_n:\gamma \mapsto r_n^2B_n(\theta _0+\frac{\gamma }{r_n},h_0)+r_nW_n(\gamma )\) as the sum of two processes \(T_{1,n}:\gamma \mapsto r_n^2(B_n(\theta _0+\frac{\gamma }{r_n},h_0)-B(\theta _0+\frac{\gamma }{r_n},h_0))\) and \(T_{2,n}:\gamma \mapsto r_n^2B(\theta _0+\frac{\gamma }{r_n},h_0)+r_nW_n(\gamma )\), and studying the properties of \(T_{1,n}\) and \(T_{2,n}\) separately. In some specific cases, however, it may be possible to establish the weak convergence of \(T_n\) without this decomposition. Let us first note that assumption (C7) implies that for n sufficiently large (only depending on \(\mathcal {K}\)) so that \(\theta _0+\frac{\mathcal {K}}{r_n} \subset \Theta \), the processes \(T_{1,n}\) and \(T_{2,n}\) take values in \(\ell ^\infty (\mathcal {K})\).
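The decomposition described above can be checked term by term. The following short worked identity uses only the definitions of \(T_{1,n}\) and \(T_{2,n}\) just given; the centering term \(r_n^2B(\theta _0+\gamma /r_n,h_0)\) cancels:

```latex
\begin{align*}
T_{1,n}(\gamma)+T_{2,n}(\gamma)
 &= r_n^2\Bigl(B_n\bigl(\theta_0+\tfrac{\gamma}{r_n},h_0\bigr)
      -B\bigl(\theta_0+\tfrac{\gamma}{r_n},h_0\bigr)\Bigr)
    + r_n^2B\bigl(\theta_0+\tfrac{\gamma}{r_n},h_0\bigr)
    + r_nW_n(\gamma)\\
 &= r_n^2B_n\bigl(\theta_0+\tfrac{\gamma}{r_n},h_0\bigr)+r_nW_n(\gamma)
  = T_n(\gamma).
\end{align*}
```

Assuming, as the notation suggests, that \(B\) is the population counterpart of \(B_n\), the term \(T_{1,n}\) isolates the centered empirical fluctuation at \(h_0\), while \(T_{2,n}\) collects the population term and the contribution \(r_nW_n\).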
The process \(T_{1,n}\) does not depend on the estimation of the nuisance parameter. Hence, following ideas similar to those of the parametric case, we get from assumptions (C4), (C5) and (C10) the asymptotic uniform equicontinuity of \(T_{1,n}\) with respect to \(\Vert \cdot \Vert \) on \(\mathcal {K}\) (as a by-product of the proof of Theorem 2.11.9 in Van der Vaart and Wellner 1996). On the other hand, for n large enough, \(\theta _0+\gamma /r_n \in \Theta \) (see the proof of Lemma 2). Assume now that n is large enough and use assumption (C7) to conclude that for all \(0<\delta \le \delta _1\),
where \(b_n\le \sup _{\gamma ,\gamma '\in \mathcal {K}}|r_n^2(o(\frac{\Vert \gamma \Vert ^2}{r_n^2})+o(\frac{\Vert \gamma '\Vert ^2}{r_n^2}))| \rightarrow 0\) as n tends to infinity, and \(\alpha _n=O_{P^*}(1)\) uniformly over \(\delta \le \delta _1\). Let \(\epsilon \) and \(\eta \) be arbitrary positive constants. It is clear that, for any \(0<\delta \le \delta _1\) and any positive constant K, (29) leads to
Finally choose \(K_\eta \) such that the last term is smaller than \(\eta \), and take \(\delta \le \delta _1\wedge (\frac{\epsilon }{2K_\eta })^\frac{1}{\tau }\). It then follows that \(T_{2,n}\) is asymptotically uniformly equicontinuous in probability with respect to \(\Vert \cdot \Vert \) on \(\mathcal {K}\).
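The choice of \(\delta \) above can be read as a simple arithmetic check. The sketch below assumes that the relevant term in the bound has the form \(K_\eta \delta ^\tau \) with \(\tau >0\); this form is an assumption, since the display itself is not reproduced here.

```latex
\[
\delta \le \Bigl(\frac{\epsilon}{2K_\eta}\Bigr)^{1/\tau}
\quad\Longrightarrow\quad
K_\eta\,\delta^{\tau}
\;\le\; K_\eta\cdot\frac{\epsilon}{2K_\eta}
\;=\; \frac{\epsilon}{2}.
\]
```

Under this reading, the other half of \(\epsilon \) is left to absorb the term \(b_n\), which tends to zero as n tends to infinity.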
Hence, the same is also true for the process \(T_n\), since it is the sum of two such processes. The asymptotic tightness and hence the weak convergence of \(T_n\) to \(\Lambda +\mathbb {G}\) in \(\ell ^\infty (\mathcal {K})\) now follow from Theorems 1.5.7 and 1.5.4 in Van der Vaart and Wellner (1996), together with assumption (C9) and the fact that \(\mathcal {K}\) is totally bounded with respect to the \(\Vert \cdot \Vert \)-norm (since it is compact). Moreover, using Addendum 1.5.8 in the same book, almost all paths of the limiting process on \(\mathcal {K}\) are uniformly continuous with respect to \(\Vert \cdot \Vert \). \(\square \)
About this article
Cite this article
Delsol, L., Van Keilegom, I. Semiparametric M-estimation with non-smooth criterion functions. Ann Inst Stat Math 72, 577–605 (2020). https://doi.org/10.1007/s10463-018-0700-y
DOI: https://doi.org/10.1007/s10463-018-0700-y