Improving precipitation forecasts using extreme quantile regression

Aiming to estimate extreme precipitation forecast quantiles, we propose a nonparametric regression model that features a constant extreme value index. Using local linear quantile regression and an extrapolation technique from extreme value theory, we develop an estimator for conditional quantiles corresponding to extreme high probability levels. We establish uniform consistency and asymptotic normality of the estimators. In a simulation study, we examine the performance of our estimator on finite samples in comparison with a method assuming linear quantiles. On a precipitation data set in the Netherlands, these estimators have greater predictive skill compared to the upper member of ensemble forecasts provided by a numerical weather prediction model.


Introduction
Extreme precipitation events cause large economic damages, when large amounts of water cannot be properly drained. For example, water boards in the Netherlands, flooding might occur if anticipatory water management is not applied. Accurate predictions are therefore vital for taking preventive measures such as pumping the water out of the system.
The current approach of forecasting relies on deterministic forecasts obtained by numerical weather prediction (NWP) models, [11]. These models rely on non-linear differential equations from physics describing the flow in the atmosphere. Together with an initial condition of the atmosphere, the NWP models are used to forecast precipitation, among other weather quantities.
The uncertainty in these types of forecasts is attributed to uncertainty in the initial condition and in the physical parametrizations in the model itself. An ensemble prediction system quantifies the uncertainty due to these two factors by applying small perturbations to the original quantities and running the NWP model multiple times subsequently. An ensemble forecast is to be viewed as a sample from the distribution of the predicted variable, where uncertainties in initial condition and model parametrizations are taken into account. Therefore, it is natural to consider the empirical distribution function of the ensemble forecast as an estimator of the distribution of the predicted variable, in this paper precipitation.
While the NWP ensemble prediction systems are rather skilful in forecasting precipitation for relatively short lead times, skill quickly decreases as lead time increases. Using upper ensemble members for forecasting extreme precipitation appears to be most challenging, due to the large spatial and temporal uncertainties of precipitation forecasts. Most methods that have been proposed to post-process forecasts are instead focussed on the bulk of the conditional distribution, see [18].
For the upper ensemble members there are two serious problems. First, the upper ensemble members tend to be not well calibrated, i.e. not reliable [3], especially for large amounts of precipitation, cf. Figure 3. Second, the highest probability level of the extreme precipitation forecast is limited by the number of ensemble members, which is typically not large due to computational costs. In the ensemble prediction system of the European Centre for Medium-Range Weather Forecasts (ECMWF), which we consider in our case study, the system generates 51 ensemble members. Thus, the largest probability level is given by 51 52 . In this paper, we aim to develop a post-processing approach for predicting extreme precipitation quantiles. More precisely, we focus on the problem of estimating the tail of the conditional distribution F Y |X , with X a precipitation forecast by the NWP model and Y the observed precipitation. We are interested in the function x → Q Y |X (τ |x) for τ close to one, where Q Y |X denotes the conditional quantile function.
Several estimators have already been proposed to estimate extreme conditional quantiles. All these estimators have a similar structure consisting of two steps. First, the quantile function Q Y |X is estimated for moderately high probability levels τ . In the second step, these estimated quantiles are used to extrapolate to obtain estimators of extreme conditional quantiles.
For the first step, general quantile estimation techniques are used. Examples are linear quantile regression in [17] and [16], k-nearest neighbour in [1] and inverse of empirical conditional distribution functions smoothed in the covariates in [8,4], and [5]. For the second step two 'types' of approaches can be distinguished. First, a local approach, where an extreme quantile estimator is applied to a sequence of estimated quantiles for moderately high probability levels attained from the first step. This method is used in [ . The second type, where the exceedances above a threshold estimated in the first step are used to fit a generalized Pareto distribution, was introduced in [6]. An application of the result of [6] to precipitation data is discussed in [2], where a generalized Pareto distribution is fitted to the exceedances above an estimated linear quantile. They showed skilful short-range forecasts of extreme quantiles.
Most methods allow for a varying extreme value index depending on the covariates. The estimators of extreme value indices in such models are generally subject to high variability. In the context of weather forecasting, this may lead to inconsistent forecasts over the covariates. After carefully considering the trade-off between the generality of the model and the efficiency of the estimation, we propose an additive model with a constant extreme value index for all covariates cf. (2.1). Our empirical study shows that this is a reasonable assumption for predicting extreme precipitation quantiles in the dataset we consider. In our two step procedure, we first estimate a nonstationary threshold by local linear quantile regression and then extrapolate to extreme quantiles based on the exceedances of this threshold.
The scientific contribution of this paper is fourfold. First, we the propose a model that achieves a good balance between generality and estimation efficiency and it fits the feature of post-processing data sets. Second, we derive asymptotic properties of the estimators, by first showing uniform consistency of local linear quantile regression, using a uniform Bahadur representation for the quantile estimator. Moreover, we establish asymptotic normality of the estimators of the extreme value index as well as the extreme conditional quantiles. Third, we address the issue such as selection of the bandwidth and tuning parameters, which is highly relevant from the application point of view. Fourth, our procedure yields skilful prediction outperforming the upper ensemble member based on validation. Besides, our procedure can extrapolate to an extreme probability level that goes beyond the empirical quantile associated with the upper ensemble member.
The outline of the paper is as follows: Section 2 we present out proposed model and develop the estimating procedures. The asymptotic properties of the estimator are studied in Section 3. In Section 4 we propose a data driven approach for bandwidth selection. We show with a detailed simulation study in Section 5 the finite sample performance of our estimator and compare it with existing methods. Finally in Section 6 we apply our esti-mator to a dataset of precipitation observations and ensemble forecasts in the Netherlands. The part for the theoretical results are provided in the appendix.

Model and Estimation
We aim to estimate the conditional tail quantiles of Y given X, namely Q Y |X (τ |·) for τ close to one. To this end, we assume that there exists a τ c ∈ (0, 1) such that where r is a smooth continuous function and Q denotes the quantile function of an error variable , which is independent of X. In order to make the model identifiable, it is assumed that Q (τ c ) = 0. As a result, Q Y |X (τ c |x) = r(x). Moreover, we assume that the distribution of has a heavy right tail, that is there exists γ > 0 such that, where γ is the extreme value index of . It is important to note that this additive structure is only assumed for probability levels τ exceeding τ c , which allows us to model the tail of the conditional distribution without assuming structure for τ < τ c . On one hand, the quantile curve x → Q Y |X (τ |x) for any τ ≥ τ c has the same shape as r. On the other hand, the distance between the two quantile curves, that is Q Y |X (τ 1 |x) − Q Y |X (τ 2 |x) for any τ 1 > τ 2 ≥ τ c , is determined by Q only and thus does not depend on x. We will refer to our model as the Common Shape Tail (CST) model.
We remark that various types of additive structures have been proposed in recent studies on modeling extremes with covariates. In [17], a linear structure is assumed for r, where two scenarios are considered: the slope of the linear function is a nonparametric function of τ or it is constant. The latter scenario is a special case of our model. In [16], a linear structure is assumed for the conditional quantile function after the power transformation. In both papers, r is estimated by linear quantile regression. In [14], a nonparametric location-scale representation is assumed and local linear mean regression is used to estimate the conditional quantile called α-CVaR in that paper, where the existence of the fourth moment of the error variable is required. This requirement implies an upper bound on the extreme value index: γ < 1 4 . Let (X 1 , Y 1 ), . . . , (X n , Y n ) denote i.i.d. paired observations satisfying (2.1). Based on this random sample, we construct a two step estimation procedure for Q Y |X (τ n |·), where for asymptotics, τ n → 1 as n → ∞. We shall estimate r and Q (τ n ) respectively in each of the two steps.
First, for the estimation of r we choose to follow the local linear quantile regression approach studied in [19]. An obvious advantage of the quantile regression approach is that it does not impose a constraint on the moments of the conditional distribution. Let h = h n denote the bandwidth. In a window of size 2h around a fixed point x, we approximate the function linearly: The function r and its derivative are estimated by the solution of the following minimization problem: (r n (x),r n (x)) = arg min where ρ τ (u) = u(τ − I(u < 0)) is the quantile check function, cf.
[12] and K a symmetric probability density function with [−1, 1] as support. Second, for the estimation of Q (τ n ), we consider the residuals defined by e i = Y i −r n (X i ), i = 1, . . . , n. Using the representation of Y i = Q Y |X (U i |X i ), with {U i , i = 1, . . . , n} i.i.d. uniform random variables, and the model assumption (2.1), the residuals permit a more practical expression as below.
Denote the order statistics of the residuals by e 1,n ≤ . . . ≤ e n,n . Let k n be an intermediate sequence depending on n such that k n → ∞ and k n /n → 0 as n → ∞. Then a Hill estimator of the extreme value index is given bŷ log e n−i+1,n e n−kn,n .
The intuitive argument behind this estimator is that {e n−i,n , i = 0, . . . , k n } are asymptotically equivalent to the upper order statistics of a random sample from the distribution of , i.e. for some δ > 0, max i=0,...,kn |e n−i,n − Q (U n−i,n )| = o p (n −δ ); see the proof of Theorem 3.2 in the Appendix. For the same reason, we use the well known Weissman estimator of Q (τ n ) based on the upper residuals: Combining the estimator of r(x) given by (2.3) and the estimator of Q (τ n ) given by (2.5), we obtain the estimator of the conditional tail quantile: (2.6) By construction, this estimator of the conditional tail quantile is continuous in x. We shall refer to our estimator as CST-estimator.

Asymptotic Properties
In this section, we present the asymptotic properties of the estimators obtained in Section 2. We begin with uniform consistency ofr n in (2.3). We first state the assumptions with respect to our model (2.1). Let g denote the density of X, f Y |X (·|x) denote the conditional density of Y given X = x and c denote an arbitrary finite constant.
A1 The support of g is given by Theorem 3.1. Letr n be the estimator defined in (2.3). Choose K a symmetric Lipschitz continuous probability density function supported on [−1, 1] This theorem quantifies the direct estimation error made in the first step of our procedure. Note that the "error" made in the first step is transmitted to the second step by the definition of the residuals. Thus, the uniform consistency ofr is important for deriving the asymptotic property ofQ Y |X (τ n |·) not only becauser is a constructing part ofQ Y |X (τ n |·), but it also influences the asymptotic behavior ofQ (τ n ).
Remark 3.1. Although many studies have been devoted to the non-parametric quantile regression, to the best of our knowledge, there is no existing result on the uniform consistency forr n for an additive model. In [13], a general uniform Bahadur representation is obtained for local polynomial estimators of M-regression for a multivariate additive model. A local linear quantile regression is one of the M-regression and thus is included in the estimators considered in that paper. Corollary 1 in [13] is our starting point for deriving the uniform consistency ofr n .
For the asymptotic normality ofγ n , we assume that Q satisfies the following condition, which is a second order strengthening of (2.2). A4 There exist γ > 0, < 0 and an eventually positive or negative function As a consequence, |A(t)| is regularly varying with index .
Theorem 3.2. Let the conditions of Theorem 3.1 and A4 be satisfied. Let k n → ∞ and k n /n → 0, n n −(δ+γ) → 0 as n → ∞, with δ from Theorem 3.1. Then Remark 3.2. When deriving asymptotic properties for extreme statistics, it typically requires some regular conditions on k n , the number of tail observations used in the estimation when the sample size is n. For the original Hill estimator, which is based on i.i.d observations, the asymptotic normality is proved under Assumption A4 and √ k n A(1 − k n /n) → λ ∈ R. The other two conditions on k n , that is, lim n→∞ √ k n A(Q (1 − k n /n)) = 0 and lim n→∞ k γ+1 n n −(δ+γ) = 0 are used to make sure that the upper order residuals behave similarly to the upper order statistics of a random sample from the distribution of . Suppose one chooses k n = n α for 0 < α < min 2 2 −1 , 2γ 2γ −1 , δ+γ γ+1 , it satisfies all the conditions on k n . So in theory, there exists a wide range of choices for a proper k n . In practice, it often becomes challenging especially when the sample size is not large. We propose a practical way of choosing k n in Section 5.
The asymptotic normality ofQ Y |X (τ n |x) defined in (2.6) is now given below. To simplify notation, we denote with p n = 1 − τ n . Theorem 3.3. Let the conditions of Theorem 3.2 be satisfied. Assume Remark 3.3. The condition np n = o(k n ) guarantees that the conditional quantile is an extreme one. It gives the upper bound for p n . And the con- gives the lower bound on p n , which limits the range of extrapolation. Clearly p n = O(n −1 ) satisfies both conditions. The asymptotic normality holds even for some p n < 1 n , which means it is beyond the range of the available data. In the weather forecast context, predicting the amount of precipitation so extreme that it never occurred during the observed period is also feasible. The assumption lim n→∞ √ knp γ n n δ log kn npn = 0 is a technical condition we use to guarantee that the error made in the first step does not contribute to the limit distribution.
The proofs for Theorems 3.1, 3.2 and 3.3 are provided in the Appendix.

Bandwidth selection
The selection of the bandwidth is a crucial step in local linear quantile regression cf. (2.3). The bandwidth controls the trade-off between the bias and variance of the estimator. Increasing the bandwidth h decreases the variance, but tends to increase the bias due to larger approximation errors in the local linear expansion.
In [19], the authors propose to estimate the optimal bandwidth for quantile regression by rescaling the optimal bandwidth for mean regression. There is a rich literature on bandwidth selection for mean regression. However, in our setting this approach is not satisfactory because the scaling factor is difficult to estimate and it also assumes the existence of the first moment, i.e. it limits to the case γ < 1.
Instead we adopt a bootstrap approach, similar to the one proposed in [1] to estimate the global optimal bandwidth with respect to the mean integrated squared error (MISE), i.e., Let B denote the number of bootstrap samples. The bootstrap samples . . , B are sampled with replacement from the original n data pairs. The optimal bandwidth is estimated by minimizing the bootstrap estimatorŜ(h) of S(h), which is given by the objective function in (4.1).
where h 0 is an initial bandwidth chosen by visual inspection andQ h,j Y |X (τ c |x) denotes the estimate of the conditional quantile function based on the j-th bootstrap sample. In practice, the integral is approximated using numerical integration.
Two alternative approaches were attempted. First, a bootstrap approach, fixing the covariates X and sampling for each covariate level an uniform random variable U . For values of U ≥ τ c a positive residual e is sampled and In the case U < τ c a local linear quantile estimate is obtained at the covariate level X with bandwidth h 0 at probability level U . The bandwidth is then estimated by the solution of the minimization in (4.1). Second, a leave-one-out cross validation approach that minimizes the quantile loss function is used to obtain the estimator of the optimal bandwidth: whereQ h,−i Y |X denotes the conditional quantile estimate with bandwidth h and leaving out the ith observation. Intuitively, the cross validation approach is attractive as it is much faster compared to the bootstrap approach and it is based on the idea of scoring the quantile curve with the same scoring function used for estimation. Yet, based on a simulation study, the direct bootstrap procedure performed significantly better compared to these alternative approaches. This is in accordance with the conclusions drawn in [1].

Simulation
In this section, the finite sample performance of the CST-estimator is assessed using a detailed simulation study. A comparison is made with two estimators from the literature, the estimator proposed in [17] and the estimator proposed in [4]. Both methods utilize a two step procedure. In [17], the first step consists of estimating a sequence of linear quantile curves for moderately high probability levels, using quantile regression. The estimator in [4] proposes, for the first step, a conditional empirical distribution function smoothed in the covariates. A sequence of moderately high quantiles are obtained from the left continuous inverse of this estimated distribution function. In the second step, both methods use a Hill estimator of the extreme value index based on the estimated quantiles. Extrapolation to the extreme quantiles is done by a Weissman type estimator, similar to the one in (2.5).
The data for the simulation are drawn from the model in (5.1). We choose X uniformly distributed in [−1, 1] and independently, follows from a generalized Pareto distribution with γ = 0.25, or a Student t 1 distribution. For the function σ, we consider two cases: σ(x) = 1 and σ(x) = 4+x 4 . Note that for σ(x) = 1, our model assumption (2.1) is satisfied with τ c = 0. For σ(x) = 4+x 4 , our model assumption is not satisfied since the distribution of the additive noise depends on x. which allows us to study the robustness of the model assumptions Y = r(X) + σ(X) . (5.1) We consider three choices for the function r: linear, nonlinear monotone and a more wiggly function, Performance is compared for two sample sizes : n = 500 and n = 2500. The estimation of the quantile curves x → Q Y |X (τ |x) with τ = 0.99 and τ = 0.995 is assessed with an empirical estimator of the mean integrated squared error: 1 Y |X (τ |x) denotes the estimate based on i-th sample. The integral is approximated by numerical method. Tables 1 and 2 report the estimated MISE for different models and different methods.
For the CST estimator, we choose τ c = 0.5 while the model holds for any τ c ≥ 0. The value of k is typically chosen by inspection at the point where the Hill plot, i.e (k,γ(k)), becomes stable. For the simulations it is not possible choose k by inspection for each of the 500 data sets. We propose an automated procedure to detect the smallest k from which on the value ofγ(k) become stable. More precisely, define the following local variance Then, we choose k c = min{k : v q (k) ≤ t p }, where the threshold t p depends on the overall variability ofγ(k), that is t p = min 1≤k≤n−q v q (k) + p max 1≤i,j≤n−q (v q (i) − v q (j)) with p ∈ (0, 1). For the simulation study, we choose q = n/50 and p = 0.1 An example of this method with Student-t errors is shown in Figure 1.
For the estimator in [17], it is proposed to choose k = [3.5n 1/3 ] where [.] denotes the integer part. Additionally, the probability sequence for which the linear quantile curves are estimated is given by, n−k n , · · · , n−3 n . The estimator allows for varying extreme value indices as well as a constant extreme value index. A constant extreme value index is used as this is assumed in our setting. We refer to this estimator as the linear estimator. The model assumption for this method is satisfied only when r = r 1 , the linear case.
In [4], the bandwidth is selected via a cross-validation approach detailed in that paper. Furthermore, it is argued that nine moderately high quantiles are considered sufficient for extrapolation, i.e. quantiles corresponding to exceeding probabilities αn 1 , . . . αn 9 . and α n = c log n n . Based on our simulation results we choose c = 5 when is from the generalized Pareto distribution and c = 12 when is from the Student-t 1 distribution. From this point on we will refer to this estimator as the kernel estimator. The model assumptions of this estimator are satisfied in all scenarios considered here.
For generalized Pareto errors, the mean integrated squared errors are shown in Table 1. For the case σ(x) = 1, the CST estimator performs best, as expected, since the data follow the model assumption (2.1). For the case σ(x) = 4+x 4 a similar conclusion can be drawn for n = 500. Though, for a sample size of 2500 the linear estimator does slightly better.
For Student t 1 errors, the results are shown in Table 2. For sample size n = 500, the CST estimator has smaller MISE for τ = 0.99 and larger MISE for τ = 0.995, in comparison with the linear estimator. For a larger sample size n = 2500, the CST estimator outperforms the other two methods.
The kernel estimator is outperformed by the other two estimators in all settings although it is the only method that the model assumption is fulfilled for all settings. The procedure does not assume any structure in the data and it allows for varying extreme value indices, which requires to estimate the extreme value index locally by using a very limited amount of observations. As a consequence, the functionγ(x) fluctuates heavily and it further creates large errors in the quantile extrapolation. From the simulation result, it is clear that this method suffers severely from lack of efficiency for the sample sizes considered here.

Post-processing extreme precipitation
Our dataset consists of observations and ECMWF ensemble forecasts of daily accumulated precipitation at eight meteorological stations spread across the Netherlands (de Bilt, De Kooy, Twente, Eelde, Leeuwarden, Beek, Schiphol and Vlissingen). The data in this study is for the warm half year, namely 15th of April until 15th of October, in the years 2011 till 2017. Defining the lead time as the time between initialization of the ensemble run and the end of the day at 00UTC for which the forecast is valid. We consider lead times from 24 hours up till 240 hours with 12 hour increments. For each lead time and location the number of observations is about 1287.
For fixed lead time and location, an ensemble forecast consists of 51 exchangeable members, which can be seen as a sample from the distribution of precipitation, where the uncertainty in the initial condition and model parametrizations are accounted for. As a result, quantile estimates for probability levels i 52 , for 1 ≤ i ≤ 51, are given by the order statistics of the ensemble forecast. In practice, it is known that the upper ensemble member is not well calibrated in the sense that it leads to underestimation of the extremes, see Figure 3. This is partly caused by a representatively error, because the forecast is a grid-cell average and the observation is a station (point)value. Statistical post-processing can correct this and other systematic errors [18]. For long lead times it the upper ensemble member loses all predictive skill as can be seen from Figure 2. We show that, by applying the CST estimator, we can calibrate the upper ensemble member and obtain more skilful forecasts for short and long lead times. To relate to the notation of Section 2, we denote the daily accumulated precipitation by Y and the upper ensemble member by X. The upper ensemble member then corresponds to the τ = 51 52 quantile of Y |X.
For each lead time and each of the eight locations, we estimate the CST model as in (2.1) with τ c = 0.95. The bandwidth h is determined by the bandwidth selection method described in Section 4 and k chosen by the selection procedure detailed in Section 5, with p = 0.1 and q = 5. Additional to choosing X as the upper ensemble member we have considered other ensemble members and trimmed means of the ensemble members, though the upper ensemble member showed best performance.
The predictive performance of a quantile estimatorQ i (τ ) can be quantified by the quantile verification score and visualized by the quantile reliability diagram, which are discussed in detail in [3]. The quantile verification score is defined as QVS τ (Q) = n i=1 ρ τ (Y i −Q i (τ )), where ρ τ is the quantile check function. The score is always positive, where low scores represent good predictive power and high scores bad predictive power. In [3] it is shown that the score can be decomposed in three components: uncertainty, reliability and resolution, where only the last two depend on the estimator itself. A reliable or calibrated forecast has the same distribution as the underlying distribution that is estimated.
The quantile reliability diagram visualizes the reliability of the forecast quantile. By creating equally sized bins with respect to the forecast quantile and then graphing the empirical quantile of the corresponding observations in the bin against the mean forecast quantile in the bin. For the forecast to be reliable these points should lie on the line y = x.
It is natural to compare the predictive performance of a quantile estimator to some reference quantile estimatorQ ref . For this we take the climatological empirical quantiles as the reference method, i.e. the empirical quantiles of the sample Y i , 1 ≤ i ≤ n. Note that this is the simplest estimate we can obtain without making use of a numerical weather prediction model. The The validation is carried out using a seven-fold cross validation, where, in every iteration, one year is left out of the model estimation and used as the independent validation sample. In Figure 2 the QVSS is shown as a function of lead time. The bands are obtained by calculating the QVSS for each location separately. The graph on the left shows the performance of the CST estimator in red and the ensemble in blue for τ = 51 52 . It can be observed that for short and long lead times the CST estimator has higher skill than the ensemble. In Figure 3 two quantile reliability diagrams are shown, for 36 hours lead time on the left and 192 hours lead time on the right. In the figure it can be seen that the CST estimator shows better reliability than the ensemble for all but the lowest precipitation values at lead time of 36 hours. For a lead time of 192 hours the CST estimator shows better reliability for all precipitation values.
The CST estimator is also able to extrapolate further into the tail, where the ensemble has no members. For the τ = 0.995 quantile the QVSS is graphed against lead time on the right hand side of Figure 2. This clearly shows that for lead times, up to 5 days, the CST estimator obtains skilful forecasts. To conclude, we have shown that the CST estimator has more skill than the upper ensemble member for short and long lead times and even obtains skilful quantile estimates for higher quantiles than are available from the ensemble.
Then, the triangle inequality leads to The last equality follows from the fact that r is uniformly bounded by Assumption A1.
Next, we show that, there exists a δ C ∈ (0, 1 2 − δ h ) such that sup . Then for any x, y ∈ [a, b], by the triangle inequality and the Lipschitz continuity of K, we have Note that the constant c does not depend on i, that is, the Lipschitz continuity is uniform in i for all T i 's. Consequently, it follows from that |ψ τ (u)| ≤ 1 that, Let M n = n δ C +2δ h log n and {I i = (t i , t i+1 ], i = 1, . . . , M n } be a partition of (a, b], where t i+1 − t i = b−a Mn . Then for t ∈ I i , or equivalently, Therefore, for n sufficiently large, where the third inequality is due to that c(b−a) Mnh 2 n < 1 2 n −δ C for n sufficiently large. Next, we apply Hoeffding's inequality to bound P i . Define For each i and n, {W n,i,j , 1 ≤ j ≤ n} is a sequence of i.i.d random variables. And with probability one, |W n,j,i | ≤ sup −1≤u≤1 K(u) sup a≤x≤b C n,i (x) =: c 3 . Moreover, E (W n,j,i ) = 0 because E(ψ τc ( j )) = 0 and X j and j are independent. Thus, by Hoeffding's inequality, Note that 1 − 2δ h − 2δ C > 0 by the choice of δ C . Thus, for n → ∞, Hence, (A.2) is proved. Now by choosing δ = δ C , we obtain via (A.1) that, due to that δ h ∈ ( 1 5 , 1 2 ) and δ C < 1 2 − δ h .

A.2. Proof of Theorem 3.2
The proof follows a similar line of reasoning as that of Theorem 2.1 in [17]. The uniform consistency ofr n given in Theorem 3.1 plays a crucial role.
Then {U i , i = 1, . . . , n} constitute i.i.d. random variables from a standard uniform distribution. We prove that with probability tending to one, e n−j,n for j = 0, . . . , k n can be decomposed as follows, e n−j,n = Q (U i(j) ) + r(X i(j) ) −r n (X i(j) ) for j = 0, . . . k n , where i(j) is the index function defined as e i(j) = e n−j,n . In view of (2.4), it is sufficient to prove that with probability tending to one, U i(j) > τ c jointly for all j = 0, . . . , k n . Define another index function,ĩ(j) by Uĩ (j) = U n−j,n . Then it follows for n large enough, |r n (x) − r(x)| < 0 = P (e n−kn,n < V n ) = 1 − P(e n−kn,n ≥ V n ) where the second equality follows from that Q Y |X (τ c |X i(j) ) = r(X i(j) ) and the last equality follows from (2.4) and the fact that U n−kn,n > τ c for n large enough. Then, lim n→∞ P ∪ kn j=0 {U i(j) < τ c } = 0 follows from Q (U n−kn,n ) → ∞ and V n = o p (1) as n → ∞. Hence, (A.3) is proved.
Next, we show that lim n→∞ P ∩ kn j=0 {e n−j,n = Q (U n−j,n ) + r(X i(j) ) −r n (X i(j) )} = 1, (A.4) that is the ordering of k largest residuals is determined by the ordering of U i 's. In view of (A.3), it is sufficient to show that with probability tending to one, min 1≤i≤kn (Q (U n−i+1,n ) − Q (U n−i,n )) ≥ 2 max 1≤i≤kn |r(X i(j) ) −r n (X i(j) |. (A.5) By the second order condition given in (3.1) and Theorem 2.3.9 in [7], for any small δ 1 , δ 2 > 0, and n large enough, where the third equality follows from that min 1≤i≤kn Therefore we can conclude, by the assumption that k γ+1 n n −γ−δ → 0. We remark that the proof for Theorem 2.1 in [17] isn't completely rigorous, namely, the proof for (S.1) in the supplementary material of that paper is not right. We fix the problem while proving (A.9), which is an analogue to (S.1).