Bias-corrected quantile regression estimation of censored regression models

In this paper, an extension of the indirect inference methodology to semiparametric estimation is explored in the context of censored regression. Motivated by the weak small-sample performance of the censored regression quantile estimator proposed by Powell (J Econom 32:143–155, 1986a), two- and three-step estimation methods were introduced for the estimation of the censored regression model under a conditional quantile restriction. While those stepwise estimators have been proven to be consistent and asymptotically normal, their finite-sample performance greatly depends on the specification of an initial estimator that selects the subsample to be used in the subsequent steps. In this paper, an alternative semiparametric estimator is introduced that does not involve a selection procedure in the first step. The proposed estimator is based on the indirect inference principle and is shown to be consistent and asymptotically normal under appropriate regularity conditions. Its performance is demonstrated and compared to existing methods by means of Monte Carlo simulations.


Introduction
The censored regression model has been studied and extensively used in a wide range of the applied economics literature. To estimate the parameters of censored regression models, the maximum likelihood estimator (MLE) is usually employed under the assumption that the error terms have distribution functions of a known parametric form (e.g., that the error terms are normally distributed). Contrary to the least squares estimator in the linear regression model, the MLE is not robust to departures from the parametric assumptions about the error-term distribution. If the employed assumptions do not hold, the MLE is in general inconsistent (cf. Arabmazar and Schmidt 1982; Brännäs and Laitila 1989). The same also applies to Heckman's two-step estimator (see Jonsson 2012).
To relax the strong assumptions of the MLE, several semiparametric estimators have been introduced in the econometric literature. Relying on very weak identification assumptions, Powell (1984, 1986a) proposed the censored least absolute deviation (CLAD) and censored regression quantile (CRQ) estimators by imposing the restriction that the conditional quantile of the error term is zero. These consistent and asymptotically normal estimators have been applied in many contexts (e.g., Fahr 2004; Melenberg and van Soest 1996) and, furthermore, have been extended in many directions, including random censoring (Honore et al. 2002; Portnoy 2003) as well as panel-data models (Honore 1992; Campbell and Honore 1993). In practice, the CRQ estimator is very appealing due to its robustness to misspecification of the error-term distribution and of the form of heteroskedasticity.
On the other hand, CRQ is difficult to compute exactly since its objective function is non-differentiable, non-convex, and, more importantly, exhibits multiple local minima. The results of algorithms searching for local minima (see Fitzenberger 1997b for an overview) depend on the choice of the starting points, while algorithms searching for the global minimum (e.g., Fitzenberger and Winker 2007) can become relatively demanding in terms of computation time, especially if many regressors are involved. In Monte Carlo experiments, Fitzenberger (1997a) demonstrated a more severe drawback of the CRQ estimator in small samples than the mean-biasedness and inefficiency documented already by Paarsch (1984) and Moon (1989): the CRQ estimator exhibits a heavy-tailed asymmetric distribution in small samples. These observations, made also by Fitzenberger and Winker (2007), for instance, are supported by larger differences between the expected and median values of CRQ in some simple settings and, as argued by Khan and Powell (2001), can be attributed to the fact that the CRQ parameter estimates play two roles: while they are estimates of the regression parameters, they also determine which observations have positive conditional quantiles and can thus enter the minimization of the CRQ objective function. (Note that these unfavorable finite-sample properties are shared to some extent also by some alternatives to CLAD such as the symmetrically censored least squares of Powell (1986b). Other alternative estimators, such as those by Horowitz (1986) and Honore and Powell (1994), do not exhibit such heavy-tailed finite-sample distributions, but require the error terms and the explanatory variables to be independent, which is a rather strong assumption.)
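The computational difficulty is easy to visualize. The following sketch (ours, with a purely illustrative scalar-regressor design) evaluates the CRQ objective on simulated censored data and exhibits the flat region of the criterion: whenever the index x_i β is negative for all observations, the objective no longer depends on β, so a local search started there receives no information.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n = 0.5, 200
x = rng.uniform(0.5, 2.0, n)                         # positive regressor (illustration)
y = np.maximum(0.0, 1.0 * x + rng.normal(0, 1, n))   # censored at 0, true beta = 1

def rho(z, tau):
    """Quantile check function rho_tau(z) = {tau - I(z <= 0)} z."""
    return (tau - (z <= 0)) * z

def crq_objective(beta):
    """CRQ objective: sum_i rho_tau(y_i - max{0, x_i * beta})."""
    return rho(y - np.maximum(0.0, x * beta), tau).sum()

# For any beta <= 0, max{0, x_i * beta} = 0 for every i (all x_i > 0), so the
# objective is completely flat there: one source of spurious local "minima".
flat_values = [crq_objective(b) for b in (-3.0, -1.0, -0.1)]
good_value = crq_objective(1.0)      # objective near the true parameter value
```

The flat values coincide exactly, while the objective near the true parameter is strictly smaller, which is the behavior described above.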
Khan and Powell (2001) highlighted the inherent property of CLAD and CRQ that causes their poor small-sample performance, namely the joint identification of the observations entering the objective function and of the quantile regression line. Since then, several stepwise estimation procedures have been proposed, for example, by Khan and Powell (2001) and Chernozhukov and Hong (2002). These methods first select a subset of observations to identify the quantile regression line and then apply the quantile regression (QR) to the selected observations. The first step can be achieved, for example, by a nonparametric selection procedure as in Buchinsky and Hahn (1998) and Khan and Powell (2001). Although asymptotically nearly equivalent to an 'oracle' QR estimator, the selection procedure in the first step comes at a cost in finite samples. Alternatively, Chernozhukov and Hong (2002) developed a three-step estimation method that involves a parametric first step to circumvent the "curse-of-dimensionality" problem posed by nonparametric selection procedures. Its performance in small samples does not, however, improve upon the two-step estimators in simple regression models.
The finite-sample performance of the stepwise estimators does not seem substantially better than that of CLAD in the studies of Khan and Powell (2001), Chernozhukov and Hong (2002), and most recently Tang et al. (2011). This might be due to the relatively low precision of the initial nonparametric fit, for instance, but also due to a comparison with CLAD estimates based on favorable local minima rather than the global minimum. To improve upon the stepwise estimators, we introduce an alternative semiparametric estimator for the censored regression model under a conditional quantile restriction. Contrary to the existing methods, we apply the linear regression QR estimator to all data (rather than to a preselected subsample) and then correct its bias caused by censoring. For the bias correction, indirect inference (II), which was suggested by Gouriéroux et al. (1993), is used. The indirect inference methodology is a simulation-based technique primarily used for estimating the parameters of correctly specified but computationally intractable models, but it can be employed as a bias correction method too (e.g., Gouriéroux et al. 2000, 2010).
Implementing the standard II approach requires knowledge of the error-term distribution at least up to a parametric form. To exploit only the conditional quantile restriction, we propose a new II methodology based on simulating the values of the error terms from a semiparametrically estimated distribution. The proposed II estimator is based on the standard linear QR (which allows for linear, quadratic, and polynomial functions of regressors) for two reasons. First, linear QR has desirable properties such as convexity of the objective function and a reasonably small variance in small samples. Second and more importantly, the properties of linear QR are known even under model misspecification (see Angrist et al. 2006) and can be used to construct a nonparametrically estimated error distribution for the II simulations and subsequent bias correction. Hence, the proposed bias-corrected QR procedure can be shown to be consistent and asymptotically normal. Its benefits in small samples are demonstrated by means of Monte Carlo simulations.
The remainder of the paper is organized as follows. Section 2 presents a review of the relevant estimation methods for the censored regression model and a brief overview of the indirect inference methodology. In Sect. 3, the proposed bias-corrected QR estimator is described in detail. The asymptotic properties of the indirect estimator are discussed in Sect. 4, and the results of the Monte Carlo experiments are presented in Sect. 5. Proofs are given in the appendices.

Estimation of censored regression model and indirect inference
The censored regression model and some relevant estimators are introduced in Sect. 2.1 and the indirect inference concept is described in Sect. 2.2.

Censored regression model
Let us define the censored regression model. First, the data are supposed to be a random sample of size n ∈ N originating from a latent linear regression model

y_i* = x_i^T β_0 + ε_i, i = 1, ..., n, (1)

where y_i* ∈ R is the latent dependent variable, x_i ∈ R^k is the vector of explanatory variables, β_0 represents the k-dimensional parameter vector, and ε_i is the unobserved error term with its conditional τ-quantile, τ ∈ (0, 1), being zero: q_τ(ε_i|x_i) = 0. The observed responses y_i equal y_i* censored from below at some cut-off point cp_i:

y_i = max{cp_i, y_i*}. (2)

We consider here only the case of fixed censoring with a known cut-off point cp_i ≡ cp, and without loss of generality, cp = 0. An extension to random censoring is possible by the procedure of Honore et al. (2002). The CRQ estimator is an extension of the classical linear QR to the censored regression model under a conditional quantile restriction. Since the conditional quantile function of y_i in (2) is simply max{0, x_i^T β_0}, Powell (1986a) proposed the CRQ estimator β_n^CRQ:

β_n^CRQ = arg min_{β ∈ B} Σ_{i=1}^n ρ_τ(y_i − max{0, x_i^T β}), (3)

where B is a compact parameter space, ρ_τ(z) = {τ − I(z ≤ 0)}z with τ ∈ (0, 1), and I(·) denotes the indicator function. Note that CRQ can be interpreted as applying the linear QR estimator to the observations x_i with x_i^T β_0 > 0: only observations with x_i^T β_0 > 0 carry information to identify and estimate the conditional quantiles of y_i* given x_i. Observations with x_i^T β_0 ≤ 0 carry the information that the conditional quantile of y_i* is negative and that the conditional quantile of y_i equals zero, but they do not identify the conditional quantile of y_i* at x_i, and their contributions to the objective function (3) are independent of β in a neighborhood of β_0. This then leads to a heavy-tailed small-sample distribution of CRQ.
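The kinked form of the conditional quantile function can be checked directly by simulation. The sketch below (ours, with an illustrative design) verifies that the empirical τ-quantile of y_i equals max{0, x_i^T β_0} both for a positive and a negative index:

```python
import numpy as np

rng = np.random.default_rng(1)
tau, n = 0.3, 200_000
z = rng.normal(size=n)
eps = z - np.quantile(z, tau)        # enforce the restriction q_tau(eps) = 0

def empirical_cond_quantile(xval, beta0=1.0):
    """Empirical tau-quantile of y = max{0, x*beta0 + eps} at a fixed x."""
    y = np.maximum(0.0, xval * beta0 + eps)
    return np.quantile(y, tau)

q_pos = empirical_cond_quantile(1.5)    # positive index: quantile = 1.5
q_neg = empirical_cond_quantile(-1.5)   # negative index: quantile = 0
```

The second case illustrates why observations with a negative index are uninformative: their conditional quantile is pinned at the censoring point regardless of β_0.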
To eliminate this property, Khan and Powell (2001) proposed a two-step estimation method. In the first step, the observations with x_i^T β_0 > 0 are determined by an initial semiparametric or nonparametric estimation, and in the second step, the standard QR estimation is conducted on the selected observations. Nevertheless, the finite-sample results of Khan and Powell (2001) do not seem to show a substantial advantage over the CRQ estimator in terms of mean or median squared errors, possibly due to an imprecise selection of observations in the first step; alternatively, using a local rather than a global optimization algorithm for CRQ could have played a role.
Later, Chernozhukov and Hong (2002) proposed a semiparametric three-step estimator of the censored regression model under a conditional quantile restriction. The initial subset of observations with x_i^T β_0 > 0 is selected by a parametric binary-choice model (e.g., logit), and QR is used in the subsequent steps to obtain not only consistent estimates, but also a more precise selection of the observations with x_i^T β_0 > 0. Their finite-sample results are, however, not substantially better than those of the two-step procedures: while having a smaller mean bias in small samples, the three-step estimates often exhibit larger mean squared errors (cf. Tang et al. 2011, Sect. 5).
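The stepwise logic can be sketched as follows. This is our stylized implementation, not the cited procedures: a kernel estimate of P(y_i > 0 | x_i) serves as the first-step selector, a small trimming constant is added, and the second-step linear QR is solved exactly through its linear-programming formulation; the bandwidth and all constants are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_reg(X, y, tau):
    """Linear QR computed exactly via its standard LP formulation."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])      # y = X theta + u+ - u-
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    return linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs").x[:k]

rng = np.random.default_rng(2)
tau, n = 0.5, 500
x = rng.uniform(-1.0, 2.0, n)
y = np.maximum(0.0, -0.5 + x + rng.normal(0, 0.5, n))  # beta0 = (-0.5, 1)

# Step 1: kernel (Nadaraya-Watson) estimate of P(y > 0 | x); keep observations
# whose estimated non-censoring probability exceeds 1 - tau plus a trimming term.
h = 0.2                                                # illustrative bandwidth
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
p_hat = (K * (y > 0)).sum(axis=1) / K.sum(axis=1)
keep = p_hat > 1 - tau + 0.05

# Step 2: ordinary linear QR on the selected subsample only.
X = np.column_stack([np.ones(n), x])
theta = quantile_reg(X[keep], y[keep], tau)
```

On the selected subsample the conditional quantile function is linear, so the second-step QR estimates the parameters of the latent model.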

Parametric indirect inference
Our strategy for estimating the censored regression model will differ from the existing ones in that QR will be applied to all observations and its bias due to censoring will be corrected by means of indirect inference (II). In this section, we therefore describe the general principle of (parametric) II introduced by Gouriéroux et al. (1993) and discuss how II can be applied as a bias correction method following Gouriéroux et al. (2000). Consider a general model, for example, (1)-(2):

y_i = r(x_i, ε_i; β_0), i = 1, ..., n, (4)

where y_i represents the response variable, r is a known function, x_i ∈ R^k is the vector of explanatory variables with a distribution function G_0(·), β_0 ∈ B ⊂ R^k is the parameter vector, and ε_i is the unobserved error term with a known conditional distribution function F_0(·|x_i) (a generalization to a nonparametrically estimated distribution function will follow in Sect. 3).
To implement II, an instrumental criterion Q_n, which is a function of the observations {y_i, x_i}_{i=1}^n and of an auxiliary parameter vector θ ∈ Θ ⊂ R^q, q ≥ k, has to be defined (e.g., linear QR applied to censored data). This criterion is minimized to estimate the auxiliary parameter vector:

θ_n = arg min_{θ ∈ Θ} Q_n({y_i, x_i}_{i=1}^n; θ)

(please note that the dependence on the explanatory variables {x_i}_{i=1}^n is kept implicit as we do not consider simulating values of x_i, but work conditionally on the observed {x_i}_{i=1}^n; see Gouriéroux et al. 1993 for details). The data-generating process is then fully determined by F_0 and β_0, and the instrumental criterion Q_n is assumed to converge asymptotically to a non-stochastic limit that has a unique minimum θ_0. Evaluating this limit at any F(·|x_i) and β leads to the definition of the binding function b(F, β), which implies that θ_0 = b(F_0, β_0). Under some regularity assumptions (see Assumptions A1-A4 in Gouriéroux et al. 1993), θ_n is a consistent estimator of θ_0. Provided that b(F_0, β) is known and one-to-one, a consistent estimate β_n of β_0 would be defined as β_n = b^{-1}(F_0, θ_n) (F_0 is traditionally assumed to be fully known; auxiliary parameters of the error distribution have to be a part of the parameter vector θ). Since the binding function is often difficult to compute, Gouriéroux et al. (1993) defined a simulation-based procedure to estimate the parameter β_0.
In the II procedure, values of the error terms are simulated from F_0(·|x_i), assuming this distribution is fully known. Then, for any given β, one can generate S sets of simulated paths {y^1(β), ..., y^S(β)} using model (4) conditionally on x_i, s = 1, ..., S. From these simulated samples, S auxiliary estimates can be computed:

θ_n^s(β) = arg min_{θ ∈ Θ} Q_n({y_i^s(β), x_i}_{i=1}^n; θ), s = 1, ..., S.

Under appropriate conditions, θ_n^s(β) tends asymptotically to b(F_0, β), which allows us to define the indirect inference estimator in the following way:

β_n^II = arg min_{β ∈ B} [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)]^T Ω [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)],

where Ω is a positive definite weighting matrix. This estimator can be shown to be consistent and asymptotically normal (see Propositions 1 and 3 in Gouriéroux et al. 1993). As in GMM estimation, the choice of Ω does not affect the asymptotic distribution of the estimator if dim(β) = dim(θ), and its choice will thus be asymptotically irrelevant.
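A minimal parametric II example (ours, with all design choices illustrative) may clarify the mechanics: the auxiliary criterion is a deliberately biased OLS fit on censored data, and II recovers the structural slope by matching the auxiliary estimates computed on simulated samples.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n, S, beta0 = 400, 10, 1.0
x = rng.uniform(-1.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
y = np.maximum(0.0, beta0 * x + rng.normal(0, 1, n))   # observed censored data

def aux(yv):
    """Auxiliary criterion: OLS fit (intercept, slope) on the censored data.
    Censoring makes it a biased estimator of (0, beta0)."""
    return np.linalg.lstsq(X, yv, rcond=None)[0]

theta_hat = aux(y)

# S fixed error paths, re-used for every candidate beta (common random
# numbers), so the II objective is a deterministic function of beta.
eps_sim = rng.normal(0, 1, (S, n))

def ii_objective(beta):
    theta_sim = np.mean([aux(np.maximum(0.0, beta * x + e)) for e in eps_sim],
                        axis=0)
    d = theta_hat - theta_sim           # match real and simulated aux estimates
    return float(d @ d)                 # Omega = identity (q = 2 >= k = 1)

beta_ii = minimize_scalar(ii_objective, bounds=(0.1, 3.0), method="bounded").x
```

Even though the auxiliary slope is severely attenuated by the censoring, the II calibration over β recovers a value close to the structural parameter.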
To argue that II can be used as a bias correction technique, note that β can represent the parameter value in the original model (4) (e.g., the censored regression (1)-(2)) and θ the parameter value of the auxiliary biased criterion (e.g., linear QR applied to censored data). The binding function b(F_0, β) then maps the parameter values β to the biased estimates θ, and its inverse β_n^II = b^{-1}(F_0, θ_n) maps the biased estimates back to the parameters of the original model; see Gouriéroux et al. (2000) for details.


Bias-corrected quantile regression
In this section, we describe the proposed estimator of the censored regression model under a conditional quantile restriction. As the linear quantile regression is used as an instrumental criterion and the distribution of ε_i is unknown, a crucial ingredient of the procedure is the behavior of QR under misspecification (this is true even if quadratic or polynomial regression functions are used, due to the kink of the true quantile function). Angrist et al. (2006) characterize the QR vector under misspecification as a minimizer of a weighted mean-squared approximation to the true conditional quantile function, assuming the almost-sure existence of the conditional density of the dependent variable, or equivalently, of the error term. As this result does not directly apply to the censored regression model, we modify their result to accommodate the fact that the error distribution is not continuous. In particular, we characterize the QR estimates under misspecification in terms of the distribution function of the error term rather than its density as Angrist et al. (2006) did.
Let us first introduce the necessary notation. The conditional quantile function of the dependent variable y_i is max{0, x_i^T β_0}. For any quantile index τ ∈ (0, 1), the QR vector is defined by

θ_0 = arg min_{θ ∈ Θ} E[ρ_τ(y_i − x_i^T θ)].

Further, let u_i = y_i − max{0, x_i^T β_0}, let F_u(u|x_i) and F_y(y|x_i) be the conditional distribution functions of u_i and y_i, respectively, and let f_u(u|x_i) and f_y(y|x_i) denote the corresponding conditional densities whenever they exist. For a continuously distributed latent error ε_i in model (1), f_u(u|x_i) and f_y(y|x_i) exist and will be used only for u_i > −max{0, x_i^T β_0} and y_i > 0, respectively.
Theorem 1 states that the linear QR vector depends on the weighting function w(x_i, β_0, θ_0), which in turn is a function of the distribution function F_u(·|x_i). Thus, for any other error distribution for which the weighting function remains unchanged, the linear QR yields the same vector θ_0.
Next, we consider Theorem 1 in the censored regression model with errors ε_i ~ F_ε(·|x_i). To facilitate semiparametric estimation that does not require a complete estimate of F_ε(·|x_i), we will construct another distribution of errors ε̃_i that always belongs to the same parametric family of distributions and that, by Theorem 1, results in the same QR fit as F_ε(·|x_i).
First, note that the real and simulated responses share the same censoring point. As the censoring points of u_i and ũ_i are identical (conditionally on x_i), we essentially have to match the two latent continuous distributions of ε_i and ε̃_i. We will show that the distribution of ε̃_i can be a normal one,

ε̃_i ~ N(μ_τ · σ(x_i; β_0), σ²(x_i; β_0)),

where μ_τ = −Φ^{-1}(τ) so that the conditional τ-quantile of ε̃_i is zero (and analogously for ũ_i and ỹ_i). The scale σ(x_i; β_0) therefore has to be chosen so that

F_y(x_i^T θ_0 | x_i) = Φ_τ(x_i^T (θ_0 − β_0) / σ(x_i; β_0)) (11)

for x_i^T β_0 > 0. The definition of σ(x_i; β_0) then follows:

σ(x_i; β_0) = x_i^T (θ_0 − β_0) / Φ_τ^{-1}(F_y(x_i^T θ_0 | x_i)), (12)

where Φ and Φ_τ are the distribution functions of N(0, 1) and N(μ_τ, 1), respectively (note that (12) leads to σ(x_i; β_0) > 0 whenever F_y(x_i^T θ_0 | x_i) ≠ τ). By Theorem 1, the data simulated with the scale (12) yield the same linear QR vector as the real data generated under ε_i ~ F_ε(·|x_i), and the biases of the linear QR estimates in the censored regression model (1)-(2), both with the original data distribution ε_i ~ F_ε(·|x_i) and with the data generated from ε̃_i ~ N(μ_τ · σ(x_i; β_0), σ²(x_i; β_0)), will be equal.
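The normal family used here preserves the quantile restriction for any scale: with μ_τ = −Φ^{-1}(τ), the value implied by the requirement that N(μ_τ, 1) have τ-quantile zero, the τ-quantile of N(μ_τ·σ, σ²) is σ(μ_τ + Φ^{-1}(τ)) = 0. A quick numerical check (ours):

```python
import numpy as np
from scipy.stats import norm

tau = 0.3
mu_tau = -norm.ppf(tau)       # the tau-quantile of N(mu_tau, 1) is then zero

# For any sigma > 0, the tau-quantile of N(mu_tau*sigma, sigma^2) equals
# sigma * (mu_tau + Phi^{-1}(tau)) = 0, so q_tau(eps|x) = 0 is preserved
# whatever value sigma(x_i; beta_0) takes.
quantiles = [norm.ppf(tau, loc=mu_tau * s, scale=s) for s in (0.5, 1.0, 3.0)]
```

This is why only the scale σ(x_i; β_0), and not the whole error distribution, has to be matched to the data.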
For the case of F_y(x_i^T θ_0 | x_i) = τ, i.e., x_i^T (θ_0 − β_0) → 0 for x_i^T β_0 > 0, we consider the limit of (12) and define σ(x_i; β_0) as this limit. If the bias correction of linear QR is to be performed by II, we can simulate the sets of error terms {ε̃^1, ..., ε̃^S} from N(μ_τ · σ(x_i; β), σ²(x_i; β)) instead of the original data distribution and calibrate over β ∈ B (provided that σ(x_i; β) is known). However, β is not identified in this case because Eq. (11) holds for any value β substituted for β_0 if definition (12) is used at that β. To achieve identification, β_0 in (12) either has to be replaced by an initial estimate, or the denominator in (12) has to be a function of β as well instead of being equal to its true value at β_0. We consider the latter strategy to achieve good performance even in very small samples. It is well known that the identification of the parameter vector in (1)-(2) under a conditional quantile restriction relies on the observations with positive values of the index, x_i^T β_0 > 0 (Powell 1984, 1986a), since only these observations identify the conditional quantile of the latent response. We exploit this fact and define σ(x_i; β) by the two-case modification of (12) whose denominator also depends on β (definition (14)); the analog of Eq. (11) using σ(x_i; β) will then hold only at β ≡ β_0 and the parameter vector β can be identified (see Lemma 1 for details). Further, as x_i^T (θ_0 − β) → 0, (14) becomes

σ(x_i; β) = φ_τ(0) / f_y(x_i^T θ_0 | x_i), (15)

where φ_τ(·) is the density function of Φ_τ(·) (note that the limit is the same for both cases of (14)).
With the definitions (14) and (15) of σ(x_i; β), which assume knowledge of the true conditional distribution function F_y(·|x_i) at x_i^T θ_0, we can define the infeasible indirect inference (III) estimator β_n^III by

β_n^III = arg min_{β ∈ B} [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)]^T Ω [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)], (16)

(Footnote 1: For x_i^T β ≤ 0, σ(x_i; β) might take a negative value. However, note that the data-generating process for the simulated data as well as the identification of β are invariant to the sign of σ(x_i; β).)
where θ_n^s(β) = arg min_{θ ∈ Θ} Q_n({y_i^s(β), x_i}_{i=1}^n; θ) is the auxiliary QR estimate computed from the sth sample simulated with errors drawn from N(μ_τ · σ(x_i; β), σ²(x_i; β)), s = 1, ..., S, and Ω is a positive definite weighting matrix. For the sake of brevity, these distributions N(μ_τ · σ(x_i; β), σ²(x_i; β)) will be referred to as F_ε(β) within the binding function, and the corresponding density will be denoted f_ε(β). The corresponding quantities for the response variable are F_y(β) and f_y(β).
To define a feasible indirect inference estimator, the simulated error distribution defined by σ(x_i; β) has to be estimated by N(μ_τ · σ_n(x_i; β), σ²_n(x_i; β)) using an estimate σ_n(x_i; β). Denoting θ_n the linear QR estimate for the original data and F_y,n(·|x_i) a nonparametric estimate of F_y(·|x_i), we define σ_n(x_i; β) as the sample analog of (14) with θ_0 and F_y replaced by θ_n and F_y,n (definition (17)). Since the denominators in (17) might take the value 0, we again have to extend the definition (17): for a given β and any sequence c_n → 0, we refer to the event that the denominator in (17) falls below c_n in absolute value as the "zero-denominator" event ZD_{i,n}(θ_n, β). If ZD_{i,n}(θ_n, β) occurs, then we use instead of (17) the linearly interpolated values (definition (18)). Alternatively, one can also use a straightforward analog of (15) and define

σ_n(x_i; β) = φ_τ(0) / f_y,n(x_i^T θ_n | x_i), (19)

where f_y,n(·|x_i) is an estimate of the conditional density function f_y(·|x_i). As this requires an additional nonparametric estimator, we rely on definition (18). The corresponding theoretical results in Sect. 4 are, however, valid also for (19) if the uniform convergence of f_y,n(·|x_i) to f_y(·|x_i) is imposed.
Having an estimate σ_n(x_i; β) defined by (17)-(18) (or by (17) and (19)), the feasible indirect inference (FII) estimator β_n^FII can be defined as

β_n^FII = arg min_{β ∈ B} [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)]^T Ω [θ_n − S^{-1} Σ_{s=1}^S θ_n^s(β)], (20)

where θ_n^s(β) = arg min_{θ ∈ Θ} Q_n({y_i^s(β), x_i}_{i=1}^n; θ) is now computed from samples simulated with the estimated error distribution N(μ_τ · σ_n(x_i; β), σ²_n(x_i; β)). Finally, let us remark that the proposed estimator β_n^FII does not perform a selection procedure as is done in Khan and Powell (2001) and Chernozhukov and Hong (2002); that is, the proposed estimation method is applied to all observations in the sample. Furthermore, our estimation procedure corrects the downward bias of linear QR caused by the censoring of the dependent variable and can thus be considered a bias-correction method. The bias-correction procedure is based, similarly to the two-step estimators, on nonparametric estimates. Even though the bias correction does not seem to be overly sensitive to the (lack of) precision of these nonparametric estimates, it could benefit from using some dimension-reduction technique (e.g., Xia et al. 2002) to estimate σ(x_i; β) on a lower-dimensional space in models with a large number of explanatory variables, especially discrete ones. Finally, the linear QR implicitly covers also quadratic and polynomial models since x_i can contain both the values of covariates and their powers. In the simulations (see Sect. 5), we used for simplicity only the auxiliary model linear in variables, as the quadratic auxiliary model, which can better approximate the true quantile function, did not perform substantially better than the linear one.
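The feasible procedure can be sketched numerically. The code below (ours) is a stylized single-covariate version: the auxiliary linear QR is solved exactly as a linear program, F_y(·|x_i) is estimated by a kernel smoother, and sigma_n follows our reading of definition (17) with a crude clipping guard standing in for the zero-denominator treatment (18). The bandwidth, the guard constants, and the exact form of the scale estimate are therefore illustrative assumptions rather than the paper's definitions.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import norm

def quantile_reg(X, y, tau):
    """Linear QR computed exactly through its standard LP formulation."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    return linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs").x[:k]

rng = np.random.default_rng(5)
tau, n, S = 0.5, 300, 3
mu_tau = -norm.ppf(tau)                 # zero tau-quantile for N(mu_tau, 1)
x = rng.uniform(-1.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
beta0 = np.array([-0.5, 1.0])
y = np.maximum(0.0, X @ beta0 + rng.normal(0, 1, n))

theta_hat = quantile_reg(X, y, tau)     # auxiliary step: linear QR on ALL data

# Kernel estimate of F_y(.|x_i), evaluated at the fitted values x_i' theta_n.
h = 0.3                                 # illustrative bandwidth
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
F_hat = (K * (y[None, :] <= (X @ theta_hat)[:, None])).sum(axis=1) / K.sum(axis=1)

def sigma_n(beta):
    """Estimated scale of the simulated errors: our reading of (17), with a
    crude guard replacing the zero-denominator treatment (18)."""
    den = mu_tau + norm.ppf(np.clip(F_hat, 1e-3, 1 - 1e-3))
    den = np.where(np.abs(den) < 0.05, 0.05, den)
    return np.abs(X @ (theta_hat - beta) / den) + 1e-6   # sign irrelevant (fn. 1)

eps_std = rng.normal(0, 1, (S, n))      # common random numbers across beta

def fii_objective(beta):
    """Distance between the real-data QR fit and the average QR fit on S
    samples simulated with N(mu_tau*sigma, sigma^2) errors at this beta."""
    sig = sigma_n(beta)
    theta_sim = np.mean([quantile_reg(X, np.maximum(0.0, X @ beta +
                                      (mu_tau + e) * sig), tau)
                         for e in eps_std], axis=0)
    d = theta_hat - theta_sim
    return float(d @ d)
```

Minimizing fii_objective over β ∈ B (by a grid or a derivative-free optimizer) yields the FII estimate; no observation is discarded at any stage.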

Large sample properties
In this section, the asymptotic properties of the indirect inference estimators for the censored regression model, β_n^III and β_n^FII, are derived. As our main result, we prove that β_n^III and β_n^FII are asymptotically equivalent and asymptotically normally distributed. Let us first introduce the conditions required for establishing the consistency and asymptotic normality of the III estimator.
A.1 The parameter spaces Θ and B are compact subsets of R^k and the true parameter vector β_0 lies in B.
A.2 The population QR parameter θ_0 = b(F_0, β_0) is the unique minimizer of the limiting QR objective function for the observed data.
A.3 The observations (x_i, y_i) are independent and identically distributed with finite second moments. The support X of x_i is assumed to be compact. Moreover, the index x_i^T θ_0 is continuously distributed, that is, there is at least one continuously distributed explanatory variable with θ_0j ≠ 0.
A.4 The τth conditional quantile of ε_i is zero. The error term ε_i has the conditional distribution F_ε(t|x_i) with the conditional density function f_ε(t|x_i), which is uniformly bounded both in t and x_i, positive on its support, and uniformly continuous with respect to t and x_i.
A.5 The matrices J, J̃, and J_crq, that is, the Jacobian matrices of the QR first-order conditions in the real data and in the simulated data and their CRQ counterpart, together with the corresponding variance matrices Σ and Σ̃, are assumed to be finite and positive definite.
A.6 The binding function b(F, β) is one-to-one in β at F_0 = F_y(β_0). Moreover, b(F, β) is assumed to be continuous in β and F (with respect to the supremum norm) at β_0 and F_0.
A.7 The data are subject to censoring, that is, P(y_i = 0) > 0. Moreover, the index x_i^T β is continuously distributed for all β ∈ B.
Let us provide a few remarks regarding the necessity of these assumptions. Assumptions A.1, A.2, and A.3 are essential for establishing the consistency and asymptotic normality of the QR estimates θ_n, as argued in Angrist et al. (2006), as well as the consistency and asymptotic normality of β_n^III. As shown in Angrist et al. (2006), the compactness of the support X, which is typically achieved by trimming in semiparametric estimation and is also required by Khan and Powell (2001), can be relaxed to the existence of finite (2+δ)th moments of x_i. In our proofs of the asymptotic properties of β_n^III, relaxing the compactness of X would additionally require max_{i≤n} ‖x_i‖ = o_p(n^α) for some 0 < α < 1/2 (this is an assumption closely related to the existence of finite (2+δ)th moments; see Čížek 2006, Proposition 2.1). Moreover, we assume the existence of one continuous explanatory variable.
Next, Assumption A.4 is the standard assumption in quantile regression models (e.g., Powell 1986a), although the density function f_ε(t|x_i) is usually assumed to be positive only in a neighborhood of 0. Given the misspecification of the linear QR, it is convenient to assume a non-zero density everywhere, as f_ε(t|x_i) is evaluated at any t = x_i^T θ_0. Concerning Assumption A.5, it contains the usual full-rank conditions employed in censored and quantile regression models and is necessary for the identification of the parameter vectors; see, for example, Khan and Powell (2001). In the case of J, J̃, and J_crq, it rules out collinearity among the explanatory variables restricted to the regions with x_i^T θ_0 > 0 and x_i^T β_0 > 0, respectively (e.g., collinearity could arise if x_i with a bounded support contains a dummy variable d_i with such a small coefficient that x_i^T β_0 < 0 whenever d_i = 1). Further, the first part of Assumption A.7 is imposed to simplify the proofs of the consistency and asymptotic normality of the proposed estimator: it rules out data without any censoring. The results remain valid even if there is no censoring, although some proofs would slightly differ. The second part of Assumption A.7 just formalizes the continuous-regressor Assumption A.3.
Finally, Assumption A.6 is the standard assumption necessary for defining the indirect inference estimator: the population QR estimates θ(β) and θ(β′) for data simulated from the censored regression model with parameters β and β′ should differ if β ≠ β′. Note though that we require the link function to be one-to-one only at the distribution F_0 = F_y(β_0) corresponding to the true parameter values β_0. This should, however, not be very restrictive in practice. On the one hand, the uniqueness of the population QR estimates θ_0 and θ(β) for some β requires, apart from the full-rank conditions in A.5, that the distribution of the responses along the linear regression function is continuous with strictly positive conditional densities at x_i^T θ_0 for x_i^T θ_0 > 0 (cf. Assumption A.5). For the simulated data (x_i^T θ_0, y_i^s(β)), the continuity follows from the presence of at least one continuously distributed explanatory variable (Assumptions A.3 and A.7) and from the latent errors being Gaussian with uniformly bounded variances (see Appendix 2), which follows primarily from the compactness of the parameter and variable spaces (Assumptions A.1 and A.3). For the real data, which follow an unknown distribution, we however have to impose the uniqueness of θ_0 (Assumption A.2). On the other hand, the one-to-one link function also means that any change in β results in a change of θ(β). This holds trivially in the model without any censoring, as the data-generating process defined by β then leads to the population QR estimates θ = b(F_0, β) = β. If censoring is present, b(F_0, β) ≠ β in general, but the link function is still usually one-to-one: any change in β, for example an increase of the intercept, is reflected by a corresponding change in the censored population data, for example by shifting the non-censored population data up.
This is a direct consequence of the error distribution defining F_0 being independent of β and of its conditional Gaussian density given x_i being everywhere positive: the population data therefore always contain some non-censored responses at any x_i.
Although it is difficult to provide a more specific general condition which ensures that the link function is a bijection, one can detect the violation of this condition at least locally. Due to the local identification condition under model misspecification, which is represented here by the full rank of matrix D (cf. Theorem 3.1 of White 1982), singularity or near singularity of the estimated derivative of the link function will indicate that Assumption A.6 is violated. If one suspects that Assumption A.6 is violated, a possible approach is to enhance the auxiliary model in such a way that it better approximates the true model (the closer the auxiliary model is to the true model, the less likely Assumption A.6 is violated). In particular for the censored regression model, we can use, for example, quadratic functions of regressors and fit an auxiliary quadratic QR model rather than the linear one.
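This local diagnostic is straightforward to implement: estimate the derivative of the link function by finite differences on a large simulated sample (with common random numbers) and inspect its condition number. The sketch below is ours; an OLS auxiliary criterion is used purely to keep the illustration fast, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 20_000
x = rng.uniform(-1.0, 2.0, N)
X = np.column_stack([np.ones(N), x])
eps = rng.normal(0, 1, N)          # fixed draws: common random numbers

def binding(beta):
    """Monte Carlo binding function: auxiliary OLS fit on data simulated
    from the censored model at parameter beta (errors held fixed)."""
    y = np.maximum(0.0, X @ beta + eps)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def link_jacobian(beta, delta=0.05):
    """Central finite-difference estimate of the link-function derivative."""
    D = np.empty((2, 2))
    for j in range(2):
        e = np.zeros(2); e[j] = delta
        D[:, j] = (binding(beta + e) - binding(beta - e)) / (2 * delta)
    return D

D = link_jacobian(np.array([-0.5, 1.0]))
cond = np.linalg.cond(D)   # near-singularity would flag a violation of A.6
```

A very large condition number of the estimated derivative would indicate that the auxiliary model should be enriched, e.g., by quadratic terms, as suggested above.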
These assumptions are sufficient to derive the asymptotic distribution of the infeasible estimator. For the sake of simplicity of some proofs, we will additionally assume that the conditional error distribution F ε (·|x i ) has an infinite support (see Appendix 1 for details), but the stated results are valid in the general case as well.
Theorem 2 Let τ ∈ (0, 1) be a quantile, Ω be a non-singular k × k matrix, and S ∈ N be a fixed number of simulated samples. Under Assumptions A.1-A.7, β_n^III is a consistent estimator of β_0 and it is asymptotically normal,

√n (β_n^III − β_0) → N(0, V(S)), V(S) = D^{-1}[J^{-1} Σ J^{-1} + S^{-1} J̃^{-1} Σ̃ J̃^{-1} + (1 − S^{-1}) J̃^{-1} K J̃^{-1} − 2 J^{-1} K J̃^{-1}](D^T)^{-1}, (21)

in distribution as n → ∞.

The asymptotic variance matrix V(S) of β_n^III derived in Theorem 2 consists of several parts. First, the matrices Σ and Σ̃ are the variances of the QR first-order conditions in the real and simulated data, respectively. Next, J and J̃ are the corresponding Jacobian matrices defined in Assumption A.5. Finally, the matrix K characterizes the unconditional covariance between the real and simulated data.
The next theorem shows that the feasible estimator β_n^FII is asymptotically equivalent to the infeasible one β_n^III provided that one extra assumption holds: the conditional distribution function and its nonparametric estimates, which are used in (17) to define σ_n(x_i; β), have to be smooth functions of the data. Additionally, the nonparametric estimate F_y,n(z|x_i) has to be consistent and to converge at a faster rate than the sequence c_n = O(n^{-k_0}), k_0 > 0, used in the definition of σ_n(x_i; β). This is, however, not a constraint as k_0 is arbitrary.

A.8 For any compact sets T ⊂ R and X ⊂ R^k, sup_{t ∈ T, x ∈ X} |F_y,n(t|x) − F_y(t|x)| = o_p(c_n); moreover, F_y,n(t|x_i) is monotonic in t and F_y,n(t|x_i) = 0 for t < 0 and any x_i ∈ X.
A.9 The conditional distribution functions F_y(z|x_i = x) are piecewise Lipschitz functions in x for any z ∈ R.
Assumption A.8 is satisfied by many bias-corrected estimators of conditional distribution functions. Assumption A.9 on the conditional distribution function then states explicitly a minimum requirement that facilitates consistent estimation and hence the validity of Assumption A.8, although stronger assumptions on the smoothness of F_y(z|x_i) are usually employed (cf. Li and Racine 2008). Theorem 3 shows that the feasible and infeasible II estimates, β_n^FII and β_n^III, are asymptotically equivalent, and consequently, the asymptotic variance-covariance matrix of β_n^FII is given by (21). All elements of the matrix V(S) can be readily estimated in practice (after replacing θ_0 by θ_n), as we explain now. First, let us note that, as a by-product of constructing the estimate σ_n(x_i; β_n^FII) of σ(x_i; β_0) in the FII estimation procedure, we obtain along with the FII estimate β_n^FII also the estimates θ_n, F_y,n(x_i^T θ_n | x_i), and f_y,n(x_i^T θ_n | x_i) of θ_0, F_y(x_i^T θ_0 | x_i), and f_y(x_i^T θ_0 | x_i), respectively, for all i = 1, ..., n; see formulas (18) and (19). Moreover, all these estimates are consistent by Lemmas 5, 2, and 3. Another by-product of the estimation procedure are the S simulated samples (x_i, y_i^s(β_n^FII))_{i=1}^n used to compute the S auxiliary estimates θ_n^s(β_n^FII) in (20). Since the random vectors (x_i, y_i), i = 1, ..., n, are independent and identically distributed, most elements of the asymptotic variance matrix V(S) = D^{-1}[J^{-1} Σ J^{-1} + S^{-1} J̃^{-1} Σ̃ J̃^{-1} + (1 − S^{-1}) J̃^{-1} K J̃^{-1} − 2 J^{-1} K J̃^{-1}](D^T)^{-1} can be consistently estimated by their sample analogs due to the law of large numbers (cf. Assumption A.5).

Theorem 3 Let the assumptions of Theorem 2 be satisfied. If Assumptions A.8 and A.9 additionally hold, the feasible and infeasible estimators β_n^FII and β_n^III are asymptotically equivalent.
The only exception is the matrix D = ∂b(F_{y(β_0)}, β_0)/∂β^T, which represents the derivative of the link function. As this derivative is defined in terms of the simulated model with y_i^s(β) being drawn from N(x_i^T β, σ_n(x_i, β)) given x_i and censored at 0, we can draw an arbitrarily large sample representing the whole population and evaluate the link function and its derivative numerically, as suggested by Gouriéroux et al. (1993). In particular, let δ > 0, an integer N ≫ n, and x̃_k and ε̃_k be randomly drawn from the empirical distribution of {x_i}_{i=1}^n and from N(0, 1), respectively, for k = 1, …, N. Denoting θ̃_N(β) the auxiliary QR estimate obtained for the data (x̃_k, max{0, x̃_k^T β + ε̃_k σ_n(x̃_k, β)})_{k=1}^N, the derivative of the link function can be estimated by the matrix D̂_n = (d̂_{ij,n})_{i,j=1}^{k,k} with the ij-th element equal to d̂_{ij,n} = [θ̃_{i,N}(β_n^FII + δe_j) − θ̃_{i,N}(β_n^FII)]/δ, where θ̃_{i,N}(β) denotes the ith element of θ̃_N(β) and e_j = (0, …, 0, 1, 0, …, 0)^T is the vector with the jth element equal to 1 and all other elements equal to 0. Note though that, given the discussion in Sect. 5.2, it can be advisable to use a better approximation of the estimator's variance in small samples than the asymptotic distribution; for example, the bootstrap.
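To make the finite-difference step concrete, the following sketch evaluates the link function on a large simulated sample and differentiates it numerically. Two loud assumptions keep it dependency-light: the auxiliary QR estimate θ̃_N(β) is replaced by a naive OLS slope (a biased stand-in of the same flavor), and the conditional scale σ_n(x, β) is held fixed at 1; the name `link_derivative` is ours, not the paper's.

```python
import numpy as np

def link_derivative(beta, alpha=0.0, N=200_000, delta=1e-2, seed=0):
    """One-sided finite-difference estimate of db(beta)/dbeta, where the
    link (binding) function b maps the structural slope to the auxiliary
    estimate computed on data censored at zero.  The auxiliary estimator
    here is a naive OLS slope, a stand-in for the QR estimate in the text."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=N)
    eps = rng.standard_normal(N)  # common random numbers for both evaluations

    def b_of(b):
        # Link function: auxiliary (OLS) slope on one large censored sample.
        y = np.maximum(0.0, alpha + b * x + eps)
        X = np.column_stack([np.ones_like(x), x])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    return (b_of(beta + delta) - b_of(beta)) / delta
```

With roughly 50% censoring, the derivative comes out well below 1, reflecting the attenuation that the bias correction has to invert; common random numbers across the two evaluation points keep the finite difference stable.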

Monte Carlo simulations
Although we characterized the asymptotic properties of the proposed bias-corrected QR estimator, it is primarily aimed at improving the finite-sample performance of existing estimators. To analyze the benefits of the bias correction performed by means of indirect inference, this method is now compared with existing estimators by means of Monte Carlo simulations. The simulation setting is described in Sect. 5.1 and the results are discussed in Sect. 5.2.

Simulation design
The data-generating process is similar to the one considered by Khan and Powell (2001): y_i = max{0, α + x_i^T β + ε_i}, with the slope parameters equal to 1 and the intercept α chosen in each sample so that the censoring level is always the same; unless stated otherwise, the censoring level equals 50%.
The k regressors x_{1i}, …, x_{ki} are uniformly distributed on [−√3, √3], but the results are qualitatively similar across other regressor distributions (e.g., x_i being normally distributed). Further, we focus on the median regression case τ = 0.5. The error term ε_i can therefore follow various distributions with the median equal to zero, such as the normal N(0, σ_x), Student t_d, and double exponential DExp(λ) ones.
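As a concrete illustration of this design, the following sketch (assumptions: k = 1 regressor, standard normal errors; the function name `simulate_sample` is ours) draws one sample and calibrates the intercept α within the sample so that the target censoring level is hit:

```python
import numpy as np

def simulate_sample(n, beta=1.0, censoring=0.5, rng=None):
    """Draw one sample in the spirit of the simulation design:
    x uniform on [-sqrt(3), sqrt(3)], standard normal errors, and the
    intercept alpha calibrated within the sample so that the target
    fraction of observations is censored at zero."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
    eps = rng.standard_normal(n)
    latent = beta * x + eps
    # y_i = max(0, alpha + beta*x_i + eps_i) is censored iff alpha < -latent_i,
    # so taking alpha as the `censoring`-quantile of -latent fixes the level.
    alpha = np.quantile(-latent, censoring)
    y = np.maximum(0.0, alpha + latent)
    return x, y, alpha

x, y, alpha = simulate_sample(200, rng=42)
```

Calibrating α per sample, rather than fixing it, is what keeps the censoring level constant across replications and error distributions.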
For this data-generating process, we consider the following estimators: (i) the standard Tobit MLE constructed for normal homoscedastic errors; (ii) the CLAD estimator; (iii) the two-step LAD of Khan and Powell (2001) based on their three initial estimators: the maximum score estimator (2S-MSC), the Nadaraya–Watson estimator of the propensity score (2S-NW), and the conditional quantile estimator (2S-LQR); (iv) the 'infeasible' LAD (IFLAD), defined as the QR estimator applied only to data points with α + βx_i ≥ 0; (v) the three-step estimator of Chernozhukov and Hong (2002) based on the initial logit estimator (3S-LOG); (vi) the proposed QR estimator with bias corrected by indirect inference (FII); and (vii) the corresponding infeasible indirect inference (III) estimator, which does not estimate the conditional error distribution but 'knows' the true one.
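The indirect inference principle behind FII and III can be sketched in a few lines. The sketch below is a deliberately simplified, hypothetical variant: one regressor, a known intercept, homoscedastic standard normal errors (so it mimics the infeasible III setting, where the error distribution is 'known'), a naive OLS slope as the auxiliary estimator in place of the linear QR, and a grid search in place of Nelder–Mead; the names `aux_slope` and `ii_estimate` are ours.

```python
import numpy as np

def aux_slope(x, y):
    # Auxiliary estimator: naive OLS slope on the censored data.
    # (The paper's auxiliary estimator is the naive linear QR; OLS is a
    #  dependency-free stand-in that is biased in the same direction.)
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def ii_estimate(x, y, alpha, S=20, grid=None, seed=0):
    """Grid-search indirect inference: pick the structural slope b whose
    simulated censored samples reproduce the (biased) auxiliary estimate
    obtained on the real data."""
    if grid is None:
        grid = np.linspace(0.4, 1.6, 61)
    theta_hat = aux_slope(x, y)
    rng = np.random.default_rng(seed)
    # Common random numbers across the grid keep the objective smooth.
    eps = rng.standard_normal((S, x.size))
    best_b, best_gap = None, np.inf
    for b in grid:
        ys = np.maximum(0.0, alpha + b * x + eps)  # S simulated samples
        theta_sim = np.mean([aux_slope(x, ys[s]) for s in range(S)])
        gap = abs(theta_hat - theta_sim)
        if gap < best_gap:
            best_b, best_gap = b, gap
    return best_b
```

Starting from the biased naive fit on the censored data, the procedure picks the structural slope whose simulated samples reproduce that biased fit, thereby undoing the censoring bias without ever selecting a subsample.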
The QR estimates were in all cases computed by the Barrodale and Roberts (BR) algorithm as implemented in the R package "quantreg." The same package was also used for computing CLAD by an adapted BR algorithm; given that it only finds local minima, we searched for the global minimum by an exhaustive search of all elemental subsets if k = 1, and due to the infeasibility of this in higher-dimensional models, started the adapted BR algorithm from the naive QR estimate if k > 1. For the indirect-inference-based methods, we use the Nelder–Mead simplex method with multiple starting points as the optimization algorithm; the number of simulated samples is S = 50 by default. Further, many of the considered methods depend on initial nonparametric estimators of the conditional mean, conditional quantile, and conditional distribution and density functions. The nonparametric estimators considered here are those by Racine and Li (2004) for the conditional mean and distribution function estimation, by Li and Racine (2008) for the conditional quantile estimation, and by Hall et al. (2004) for the conditional density estimation; we use their implementation in the R package "np", which also includes the bandwidth choice by least-squares cross-validation. The estimation and the bandwidth choice were based in all cases on the Gaussian kernel (the results are, however, insensitive to the kernel choice).
Finally, the bias-corrected QR estimator using the estimated values of the conditional distribution function can sometimes exhibit multiple minima in very small samples (in such cases, there are usually two minima found irrespective of the number of starting points). This concerns only the feasible estimator, but not the infeasible one: given the consistency of the nonparametric estimators, of the proposed II method, and the identification result in Lemma 1, the multiple minima disappear as the sample size increases (as we indeed observed in the simulation study). In the presence of multiple local minima, we simply use the average of the found local minima as the final estimate. The results for all methods are obtained using 1000 simulations for k ∈ {1, 3, 5} variables and sample sizes n = 50, 100, and 200, and are summarized using the bias and root mean squared error (RMSE) of the slope estimates (the total absolute bias is reported if k > 1).
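The reported summaries can be reproduced by a small helper (a sketch; `summarize` is our name, and the joint-RMSE convention for k > 1 is our assumption): it returns the (total absolute) bias and the RMSE of the slope estimates over the replications.

```python
import numpy as np

def summarize(estimates, true_beta):
    """Bias and RMSE over Monte Carlo replications.  `estimates` has one
    row per replication; for k > 1 slopes the total absolute bias is
    reported, matching the convention used in the tables."""
    est = np.asarray(estimates, dtype=float)
    if est.ndim == 1:
        est = est[:, None]              # k = 1: one column of estimates
    err = est - np.broadcast_to(np.asarray(true_beta, dtype=float),
                                est.shape[1:])
    bias = err.mean(axis=0)             # per-parameter bias
    total_abs_bias = np.abs(bias).sum() # total absolute bias for k > 1
    rmse = np.sqrt((err ** 2).sum(axis=1).mean())
    return total_abs_bias, rmse
```

Summing the absolute per-parameter biases (rather than averaging them) is what makes the total bias grow with k when each coordinate carries a similar bias, as discussed for 2S-NW and 2S-LQR below.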

Simulation results
Let us first discuss the number S of simulated samples and its influence on the estimates for standard normal errors, k = 1, and n = 100. As Table 1 documents, the II estimators are consistent for any fixed and finite S: the III estimator performs equally well for any S = 5, …, 250, and one could expect similar performance of FII in large samples once the bias of the nonparametric estimation becomes small. At n = 100, however, FII based on an increasing number S of simulated samples shows a diminishing negative finite-sample bias and an increasing RMSE (the RMSE does not increase further for S > 250). While the trade-off is not large and we choose S = 50 here, S should be set larger if one is concerned about the bias and can be set smaller if the RMSE is important. A similar trade-off (a larger negative bias connected to a lower variance of estimates) will also be observed later for some existing estimators.
The first comparison of estimators is obtained for ε_i ∼ N(0, 1), k = 1, and n = 50, 100, and 200; see Table 2 (in the tables, the values in brackets represent the standard deviations based on the asymptotic distribution of the II estimators, and the results for the infeasible estimators are in italics). The MLE serves as a parametric benchmark and, given normality, performs best in all cases. First, CLAD exhibits large biases and RMSEs in small samples, especially with n = 50 observations; this is due to the heavy right tail of the CLAD distribution. Next, the existing two-step estimators exhibit relatively large RMSEs, which are however smaller than those of CLAD (except for 2S-MSC), and negative biases, which vary with the choice of the initial estimator. The fact that 2S-LQR performs better than the infeasible IFLAD in terms of RMSE is related to the difference in definitions: IFLAD uses data points with α + βx_i ≥ 0, whereas the two-step estimators rely on data points with α + βx_i ≥ c (here c = 0.05 as in Khan and Powell 2001). Increasing c reduces the bias but increases the variance of the estimates (especially for 2S-LQR), and vice versa. In comparison, the three-step estimator 3S-LOG has usually larger RMSEs than 2S-NW or 2S-LQR, but a rather small finite-sample bias compared to all other semiparametric methods.
Looking at the two infeasible estimators, III always exhibits a smaller RMSE than IFLAD, although the difference decreases with an increasing sample size. It is also interesting to note that IFLAD systematically exhibits a larger negative bias, whereas III leads to a smaller, but positive bias (or almost zero bias for n = 200). This is reflected in the performance of the proposed FII estimator, which exhibits smaller biases and RMSEs than any of the existing semiparametric methods. One can also notice that, even though FII generally exhibits a smaller bias than the methods of Khan and Powell (2001), the bias of FII is negative, in contrast to the bias of III. Finally, let us mention the variance of the II estimates implied by the asymptotic distribution in Theorem 2. While the simulated and asymptotic RMSEs approach each other in the case of III, the asymptotic standard deviations differ substantially from the simulated ones in the case of the feasible estimator FII. Given that the small-sample variance of FII additionally depends on the number S of simulated samples in the opposite way to that implied by Theorem 2 (see Table 1), it is not advisable to use the asymptotic distribution in small samples.
The next set of results is obtained for five different error distributions; see Table 3 for k = 1 and n = 100. The first three distributions are homoscedastic (N(0, 1), t_5, and DExp(1)), whereas the remaining two distributions are heteroscedastic: N(0, c·e^{0.75x}) in the case of "positive" heteroscedasticity and N(0, c·e^{−0.75x}) in the case of "negative" heteroscedasticity (c is always chosen so that the unconditional variance equals 1). The "Gaussian" MLE serves again as a parametric benchmark and, in the case of homoscedasticity, performs very well across the tested (symmetric and unimodal) distributions. The performance of the semiparametric estimators under homoscedasticity does not substantially differ from the case with normal errors. Besides 2S-MSC, the semiparametric estimators possess smaller RMSEs than CLAD. The two-step estimators, however, exhibit relatively large negative biases compared to the three-step estimator and the II estimator. The FII estimator is thus preferable in all cases. The comparison changes substantially once the heteroscedastic errors are considered. The MLE estimates are now severely biased. In the "positive" heteroscedasticity scenario, CLAD exhibits a large bias and RMSE. The existing two- and three-step estimators thus provide better estimates than CLAD in terms of RMSEs, and moreover, their biases increase (becoming less negative or even positive). The best performance can be attributed to 2S-LQR, followed by FII.

Table 3: The bias and root mean squared error of all estimators in a model with the standard normal, Student, and double exponential errors; n = 100. The results for the infeasible estimators are in italics.

Table 4: The absolute bias and root mean squared error of all estimators in a model with the standard normal errors and k = 1, 3, and 5 explanatory variables; n = 100. The results for the infeasible estimators are in italics.
In the case of the "negative" heteroscedasticity, the differences among the semiparametric estimators are much smaller, and CLAD becomes the best-performing estimator due to the fact that the observations with conditional median above zero now have a very low variance (cf. Khan and Powell 2001). Note that the proposed FII is again the second best, now after CLAD.
Furthermore, let us consider a model with standard normal errors and multiple regressors, k = 1, 3, 5; in Table 4, the sample size n = 100 is used irrespective of the dimension k. Let us first note that the results obtained for k = 1 in the previous experiments for CLAD and 2S-MSC are not fully comparable to the results for k = 3 and k = 5, as the former (k = 1) search for global minima while the latter (k > 1) correspond to the local minima obtained by starting from the linear QR estimates. This of course reduces the variability of the corresponding estimates and, in this experimental design, makes CLAD and 2S-MSC competitive. Nevertheless, one can observe that the semiparametric methods based on some initial k-dimensional nonparametric smoothing tend to increase their RMSEs with k faster than the methods based purely on one-dimensional (non)parametric estimation. A large part of this effect is related to the increase in the total absolute bias: see the biases of 2S-NW and 2S-LQR, which are similar for each individual parameter irrespective of k, so that the total bias of the k estimated parameters grows with k. The II bias-correction procedure is not affected by this, and FII thus performs substantially better than 2S-NW and 2S-LQR for larger k. On the other hand, the performance of FII is matched by the three-step method of Chernozhukov and Hong (2002) at k = 5 and also by the "local-minimum" CLAD.
Finally, let us briefly mention how the estimators are affected by the amount of censoring in the data. Using standard normal errors, k = 1, and n = 100, we have so far studied performance under the assumption that 50% of observations are censored. If the fraction of censored observations decreases towards zero, the asymptotic theory predicts that all considered semiparametric estimates will converge to the simple linear QR estimates. This is confirmed by the simulation results in Table 5: at lower levels of censoring, and in particular for 25% censored observations, the RMSEs of all semiparametric estimators are very close to each other (with the exception of 2S-MSC). Altogether, all semiparametric alternatives to CLAD perform better than CLAD, although the differences are likely to be small for very large data sets. The proposed FII estimator performs equally well in large samples and is in many cases preferable to the existing semiparametric methods in small and moderate samples.

Conclusion
We proposed a new estimation method for censored regression models that, contrary to existing methods, relies on the linear QR estimates for the whole sample and applies a bias-correction technique to obtain consistent estimates. For the bias correction, the indirect inference technique is applied and extended so that it allows sampling from a nonparametrically estimated distribution function. The consistency and asymptotic distribution of the proposed estimator were derived and shown to be first-order independent of the initial nonparametric estimates of the auxiliary error distribution. Finally, one of the important benefits of this estimation approach is its small-sample performance, as demonstrated by means of Monte Carlo simulations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix 1: Proof of Theorem 1
Proof of Theorem 1 The proof is almost identical to the proof of Theorem 2 in Angrist et al. (2006). We need to prove that the solution of (22) is equal to the solution of (23). Since the objective function in (23) is convex, any fixed point θ = θ̄ is a solution of the corresponding first-order condition. On the other hand, the first-order condition for (22) is given in the proof of Theorem 2 in Angrist et al. (2006). By the law of iterated expectations, D(θ) can be rewritten using the definition of w(·, ·). Because θ_0 is the unique solution of (22), it also uniquely solves (23) since the objective function in (23) is convex in θ. Therefore, θ = θ_0 = θ̄ solves both (22) and (23).

Appendix 2: Auxiliary lemmas
For the rest of the proofs, we introduce the necessary notation. The norms ‖·‖ and ‖·‖_∞ refer to the Euclidean norm on R^d and to the supremum norm in functional spaces, respectively. The δ-neighborhood of a vector t ∈ R^d is denoted U(t, δ) = {t′ ∈ R^d : ‖t′ − t‖ < δ}. The probability distribution and density functions of N(−μ_τ, 1) are denoted Φ_τ and φ_τ, respectively. Additionally, recall that ρ_τ(z) = {τ − I(z ≤ 0)}z and its derivative is denoted ϕ_τ(z) = τ − I(z ≤ 0).
Next, the notation […] is used, assuming that w_i ∼ P. For easier reading, we also use a simplified notation for the simulated distributions: F(β) = F_{y(β)} and F̂(β) = F_{ŷ(β)}.
For the sake of simplicity of some proofs, we will additionally assume that the conditional error distribution F_ε(·|x_i) has (uniformly) an infinite support in order to guarantee that sup_{x∈X} F_ε(K|x) < 1 for any K < ∞, and by Assumption A.3, that sup_{x∈X} sup_{θ∈Θ} F_y(x^T θ|x) < K_F < 1. Consequently, the conditional variance σ(x_i; β_0) defined in (14)-(15) is everywhere positive at the true β_0, and given the compactness of B, Θ, and X and Assumption A.7, σ(x_i; β_0) > C_σ > 0 for all x_i ∈ X. If the limit expression (15) and Assumption A.4 are taken into account, one can observe that the variance function is also bounded from above: σ(x_i; β) < K_σ for any β ∈ B and all x_i ∈ X.
First, the following identification result is derived.

Lemma 1 Under Assumptions A.1-A.5 and A.7, b(F̂(β_0), β_0) = b(F̂(β_1), β_1) implies β_0 = β_1.
Proof First, note that under the listed assumptions, the mapping b(F̂_0, ·) is one-to-one. Since it is one-to-one, the vector β_0, which satisfies b(F̂_0, β_0) = b(F_y, β_0) = θ_0 by Theorem 1, uniquely solves the QR moment condition [see Eq. (26)]. As F_y(t|x_i) = 0 for any t < 0 and P(x_i^T θ_0 = 0) = 0 by Assumption A.7, (27) can be rewritten accordingly. Next, consider the QR moment condition for the simulated data {ŷ_i^s(β_1)}. Because the censored distribution F_{ŷ(β_1)}(t|x_i) = 0 for all t < 0, we can again rewrite it in the same way. By substituting (14), where θ_0 is replaced by b(F̂_0, β_0), and using identity (28), (30) can be simplified. By Powell (1986a), under Assumptions A.1-A.5, […]. Thus, we must have β_0 = β_1 because J_crq is positive definite by Assumption A.5.
The further auxiliary results presented in this section are proved in the discussion paper of Čížek and Sadikoglu (2014).
Theorem 4 Under Assumptions A.1-A.5 and A.7, it holds that: 1. […] as n → ∞ for any θ ∈ Θ; 2. θ_n is a consistent estimator of θ_0, θ_n →_P θ_0 as n → ∞; 3. for any sequence θ_n, […]. Additionally, suppose that, for sample size n ∈ N, the data are independently and identically distributed according to probability distributions P_n, which satisfy Assumptions A.1-A.5 and A.7 uniformly in n. Denoting […], as n → ∞ for some δ > 0 and some distribution P_0 which satisfies Assumptions A.1-A.5 and A.7, then for any sequence θ_n, […]. Proof See Čížek and Sadikoglu (2014, Theorem 5).
To prove consistency, note that θ_n → θ_0 = b(F_y, β_0) by Theorem 4. Similarly, for any s = 1, …, S, θ_n^s(β_0) → b(F̂(β_0), β_0) = b(F_y, β_0) = θ_0 and θ_n^s(β) → b(F̂(β), β) ≠ θ_0 for β ≠ β_0 since, for a given β ∈ B, the data (ŷ_i^s(β), x_i)_{i=1}^n also satisfy the assumptions of Theorem 4. The same holds also for Σ_{s=1}^S θ_n^s(β_0)/S, as the limits of θ_n^s(β) are independent of s and S is finite. The III criterion (16) is thus a strictly convex function in θ_n and Σ_{s=1}^S θ_n^s(β)/S, which converges for any β to […]. Hence, by Assumption A.6, any minimizer β_n^III of (16) satisfies Σ_{s=1}^S θ_n^s(β_n^III)/S → θ_0 in probability as n → ∞ (Newey and McFadden 1994, Theorem 2.7). As the link function b is a one-to-one continuous mapping (Assumption A.6) and the parameter space B is compact (Assumption A.1), β_n^III has to converge in probability to β_0, which is the unique minimum of (42) (cf. the proof of Theorem 1 in Gouriéroux et al. 1993).
The proof of the asymptotic normality of β_n^III is similar to the proof of Proposition 3 in Gouriéroux et al. (1993), which is, however, given for a twice continuously differentiable instrumental criterion. By taking the derivative of the optimization problem (16) with respect to β, the first-order condition is obtained almost surely:

(1/S) Σ_{s=1}^S [∂θ_n^s(β_n^III)^T/∂β] {θ_n − (1/S) Σ_{s=1}^S θ_n^s(β_n^III)} = 0.
Consequently, we have, due to the full-rank Assumption A.6, that […]. Recall that D denotes ∂b(F̂(β_0), β_0)/∂β^T. Subtracting (44) from (52) yields […]. To prove the theorem, we have to show that the difference Σ_{s=1}^S θ_n^s(β_0)/S − Σ_{s=1}^S θ̂_n^s(β_0)/S is asymptotically negligible in probability. The estimates θ_n^s(β_0) and θ̂_n^s(β_0) are obtained for data y_i = max{x_i^T β_0 + ε_i^s, 0}, where ε_i^s ∼ N(μ_τ σ(x_i), σ(x_i)) and σ(x_i) represents the conditional variance functions σ(x_i; β_0) and σ_n(x_i; β_0), respectively. Thus, ε_i^s(β_0)|x_i follows a normal distribution with mean μ_τ σ(x_i; β_0) and variance σ(x_i; β_0), and (x_i, ε_i^s(β_0)) follows their joint distribution P_0. On the other hand, ε̂_i^s(β_0)|x_i is characterized by a different conditional distribution, namely a normal distribution with mean μ_τ σ_n(x_i; β_0) and variance σ_n(x_i; β_0); the joint distribution of (x_i, ε̂_i^s(β_0)) is denoted P_n (it has the same marginal distribution of x_i as P_0). We have already established that both θ_n^s(β_0) and θ̂_n^s(β_0) are consistent estimators of θ_0. As the variance functions at β_0 are bounded (see the introduction of Appendix 2), the censored data simulated from P_0 and P_n satisfy the assumptions of Theorem 4 if condition (32) is verified. As all terms of {r(y_i, x_i, θ) − r(y_i, x_i, θ′)}² in condition (32), with their expectations varying with P_n, have the form C_2(θ, θ′, θ_0)ρ_τ(y − x^T θ)ρ_τ(y − x^T θ′), C_1(θ, θ′, θ_0)ρ_τ(y − x^T θ)I(y ≤ x^T θ_0), or C_0(θ, θ′, θ_0)I(y ≤ x^T θ_0)I(y ≤ x^T θ_0) for some deterministic functions C_0, C_1, C_2 and some θ, θ′ ∈ U(θ_0, δ), Lemma 4 implies the validity of condition (32). Consequently, we can write, using the asymptotic linearity result of Theorem 4 for any s = 1, …, S, […] the functions g and g′, respectively. This, however, directly follows from Lemma 4, as x_i has a compact support by Assumption A.3.