Threshold selection and trimming in extremes

We consider removing lower order statistics from the classical Hill estimator in extreme value statistics, and compensating for it by rescaling the remaining terms. Trajectories of these trimmed statistics as a function of the extent of trimming turn out to be quite flat near the optimal threshold value. For the regularly varying case, the classical threshold selection problem in tail estimation is then revisited, both visually via trimmed Hill plots and, for the Hall class, also mathematically via minimizing the expected empirical variance. This leads to a simple threshold selection procedure for the classical Hill estimator which circumvents the estimation of some of the tail characteristics, a problem which is usually the bottleneck in threshold selection. As a by-product, we derive an alternative estimator of the tail index, which assigns more weight to large observations, and works particularly well for relatively lighter tails. A simple ratio statistic routine is suggested to evaluate the goodness of the implied selection of the threshold. We illustrate the favourable performance and the potential of the proposed method with simulation studies and real insurance data.


Introduction
The use of Pareto-type tails has been shown to be important in several areas of risk management, such as computer science, insurance and finance. In social sciences and linguistics the model is referred to as Zipf's law. This model corresponds to the max-domain of attraction of a generalized extreme value distribution with a positive extreme value index (EVI) ξ:

1 − F(x) = x^{−1/ξ} ℓ(x),  (1)

where ℓ denotes a slowly varying function at infinity. Since the appearance of the paper of Hill (1975), in which the EVI estimator

H_k = (1/k) Σ_{i=1}^k log(X_{n−i+1,n}/X_{n−k,n})  (3)

was proposed, with X_{n,n} ≥ X_{n−1,n} ≥ · · · ≥ X_{1,n} denoting the order statistics of a random sample from F, the literature on the estimation of ξ > 0 and of other tail quantities such as extreme quantiles and tail probabilities has grown enormously. We refer to Embrechts et al. (2013), Beirlant et al. (2004), de Haan and Ferreira (2007), and Gomes and Guillou (2015) for detailed discussions and reviews of these estimation problems. Next to the proposal of numerous estimators, focus has gradually shifted to selection methods for k and to the construction of bias-reduced estimators whose plots of estimates are, as a function of k, as stable as possible. Indeed, plots of estimators of ξ as a function of k that are consistent under the large semi-parametric model Eq. 1 are hard to interpret. In the case of the Hill estimator some authors speak of Hill horror plots. While it has frequently been suggested to choose a 'stable' area (see for instance (Drees et al. 2000) and (De Sousa and Michailidis 2004)), such a stable part is often absent or hard to find. Sometimes more than one stable section is present, as in some insurance applications which we will discuss later. The typical available guidelines for the choice of k in the implementation of EVI estimators depend strongly on the properties of the tail itself, so that k needs to be chosen adaptively from the data.
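As a point of reference for what follows, the classical Hill estimator is easily computed; the sketch below (names are ours) evaluates it over a grid of k, which is exactly the data behind a Hill plot. For an exact Pareto tail the resulting curve is roughly flat around the true ξ.

```python
import numpy as np

def hill(sample, k):
    """Hill estimator of xi based on the k largest order statistics."""
    x = np.sort(np.asarray(sample, dtype=float))   # X_{1,n} <= ... <= X_{n,n}
    return float(np.mean(np.log(x[-k:]) - np.log(x[-k - 1])))

# Exact Pareto tail P(X > x) = x^{-2}, i.e. xi = 0.5, via inverse transform
rng = np.random.default_rng(1)
sample = rng.uniform(size=5000) ** -0.5
estimates = [hill(sample, k) for k in range(10, 1001, 10)]
```

For distributions that are only Pareto-like in the far tail, the same curve drifts with k, which is the threshold selection problem discussed above.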
This problem can be compared with the choice of a bandwidth parameter in density estimation. It is typically suggested that the optimal value of k should be the one that minimizes the mean-squared error (MSE). However, this optimum depends on the sample size, the unknown value of ξ, as well as on the nature of the slowly varying function ℓ, as was first described in Hall et al. (1985). Bootstrap methods were proposed in Hall (1990), Draisma et al. (1999), Danielsson et al. (2001), and Gomes and Oliveira (2001). Beirlant et al. (1996) and Beirlant et al. (2002) derived regression diagnostic methods on a Pareto quantile plot.
Other selection procedures can be found in Drees and Kaufmann (1998) and Guillou and Hall (2001). Possible heuristic choices are provided in Gomes and Pestana (2007), Gomes et al. (2008), and Beirlant et al. (2011). Recent proposals rooted in goodness-of-fit approaches are found in Bader et al. (2018), Drees et al. (2020), and Schneider et al. (2019). Almost all authors consider the adaptive choice of k for the Hill estimator. In this paper we consider trimming of the Hill estimator, omitting some of the lower order statistics among X_{n−k+1,n}, . . . , X_{n,n}, which leads to statistics of the type

T_{b,k} = Σ_{i=1}^b c_i(b, k) log(X_{n−i+1,n}/X_{n−k,n})  (4)

for some 1 ≤ b ≤ k and suitable constants c_i(b, k). Kernel-type statistics of this kind have previously been proposed (cf. Csörgő et al. 1985) as estimators of ξ. However, the implementation of the optimal kernel is not an easy task, nor is it our focus in this paper. Instead, we propose a special form of the kernel that leads to an identity which aids in the threshold estimation problem. In Section 2 we derive the coefficients c_i(b, k) which make T_{b,k} unbiased when ℓ is constant and when we force the coefficients c_i(b, k) = c(b, k) not to depend on i. We present a novel lower-trimmed Hill plot which provides significant graphical support for the estimation of ξ, as we illustrate with both simulations and real-world data. We also provide mathematical evidence that, as a function of b, the variability of the T_{b,k} statistics is lower than that in the Hill plot. In Section 3, we examine the asymptotic behaviour of T_{b,k} in Eq. 4 under the general model Eq. 1. The asymptotic expected empirical variance of T_{b,k} is shown to be less sensitive to the tail parameter ξ than the asymptotic mean-squared error (AMSE) of the usual Hill estimator Eq. 3.
We identify a link between the two corresponding optimal k-choices which makes it possible to bypass the specification of ξ and other characteristics of the tail behavior when identifying the optimal threshold for the classical Hill estimate, and the resulting procedure turns out to be simple to implement in practice. Subsequently, we study the estimator T_k obtained by averaging the trimmed Hill estimators over b = 1, . . . , k. This latter estimator naturally assigns more weight to the larger observations, the weights changing only moderately when k is increased. Furthermore, the specification of these weights is independent of the distribution F. Note that, in contrast, earlier criteria for reweighting terms in the Hill estimator (such as e.g. Csörgő et al. (1985) in terms of kernel estimates, see also (Beirlant et al. 2002, Sec. 3)) had to rely heavily on the tail parameter ξ. In Section 4 we then present a simple ratio statistic as a tool to evaluate the goodness of the selection of k. Section 5 confirms the good performance of the proposed methods using simulations, where T_k turns out to outperform the classical Hill estimator in almost all cases. Note that our approach eventually suggests a fully automated procedure for threshold selection, also in the absence of knowledge about, or assumptions on, the tail characteristics. Section 6 illustrates this favourably on a set of real-life motor third party liability insurance data. All proofs are deferred to the Appendix. We would like to emphasize that the approach proposed in this paper suggests a general procedure that can in principle also be applied to other estimators in extreme value analysis.

Derivation
Assume first, for simplicity, that we have independent and identically distributed (i.i.d.) exact Pareto random variables X_1, X_2, . . . , X_n, with tail given by

1 − F(x) = (x/σ)^{−1/ξ}, x ≥ σ > 0,  (5)

and that we are interested in robust estimation of the tail index ξ. A main tool used throughout the paper is the well-known Rényi representation, which states (in the second distributional equality below) that for the order statistics of a random sample X_1, . . . , X_n from the distribution Eq. 5 one has, for k ≤ n, with Y_{i,k} = X_{n−i+1,n}/X_{n−k,n} (i = 1, . . . , k),

(log Y_{1,k}, . . . , log Y_{k,k}) =_d (E_{k,k}, . . . , E_{1,k}) =_d (Σ_{j=1}^k E*_j/j, . . . , Σ_{j=k}^k E*_j/j).  (6)

Here, E_{k,k} ≥ · · · ≥ E_{1,k} are the order statistics of an i.i.d. exponential sample E_1, . . . , E_k with mean ξ, and E*_1, . . . , E*_k is another independent i.i.d. exponential sample with mean ξ. Bhattacharya et al. (2019) recently proposed linear estimators of the form

Σ_{i=k_0+1}^k c_{k_0,k}(i) log(X_{n−i+1,n}/X_{n−k,n})

in order to trim the upper order statistics in outlier-contaminated samples, where the constants c_{k_0,k}(i) are chosen so that the resulting estimator of ξ is unbiased. For fixed k_0, k, the problem can then be recast as that of finding suitable weights δ_i on the exponential spacings. Using the Rényi representation Eq. 6 and solving some elementary linear equations, they derived δ_i = 1/r for i < r, and δ_r = (k − r + 1)/r. This led them to the so-called trimmed Hill estimator, which is shown to be quite useful in outlier detection under Eq. 1. In a similar way, but for a different purpose, we investigate in this paper trimming from the left. Concretely, we consider estimators of the form

T_{b,k} = Σ_{i=1}^b c_i(b, k) log(X_{n−i+1,n}/X_{n−k,n}),  (7)

where the c_i(b, k) are constants to be determined. As above, we would like to find suitable weights γ_i on the exponential spacings. Setting q = k − b + 1, the Rényi representation Eq. 6 then yields a corresponding set of linear equations, where we use the notation j ∨ q = max{j, q}. Unfortunately, this set of equations has no solution (for j ≤ q the left-hand side cannot remain constant in j). Instead, we choose unbiasedness and equal coefficients c_i(b, k) = c(b, k) as the defining equations (8) and (9). The solution of Eqs. 8 and 9 is given by

c(b, k) = [b(1 + Σ_{j=b+1}^k j^{−1})]^{−1}.  (10)

Plugging Eq. 10 into Eq. 7, we then arrive at the following definition of a lower-trimmed Hill statistic T_{b,k}:

T_{b,k} = [b(1 + Σ_{j=b+1}^k j^{−1})]^{−1} Σ_{i=1}^b log(X_{n−i+1,n}/X_{n−k,n}),  (11)

where we use the convention Σ_{j=k+1}^k j^{−1} := 0. Note that T_{b,k} can be considered as a trimmed pseudo-likelihood estimator of ξ under the strict Pareto model Eq. 5. Indeed, under Eq. 5 the exceedances Y_{i,k} (i = 1, . . . , k) are distributed as the order statistics of a sample of size k from Eq. 5 with σ = 1. The trimmed likelihood, conditioning on the exceedances larger than Y_{b+1,k}, then leads to the likelihood equation (12). From Eq. 6 it follows that E(log Y_{b+1,k}^{1/ξ}) = Σ_{j=b+1}^k j^{−1}, from which the estimator T_{b,k} follows after formal substitution of log Y_{b+1,k}^{1/ξ} by its expected value in Eq. 12.
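Under our reading of the (partly garbled) definition, T_{b,k} rescales the sum of the b largest log-exceedances by b(1 + Σ_{j=b+1}^k 1/j); this normalisation makes T_{k,k} coincide with the classical Hill estimator and makes T_{b,k} unbiased under an exact Pareto tail. A minimal sketch (function name ours):

```python
import numpy as np

def trimmed_hill(sample, b, k):
    """Lower-trimmed Hill statistic T_{b,k}: keep only the b largest
    log-exceedances over X_{n-k,n}, rescaled to be unbiased under an
    exact Pareto tail.  For b = k the inner sum is empty and the
    statistic reduces to the classical Hill estimator."""
    x = np.sort(np.asarray(sample, dtype=float))
    logs = np.log(x[-b:]) - np.log(x[-k - 1])          # top b exceedances
    norm = b * (1.0 + sum(1.0 / j for j in range(b + 1, k + 1)))
    return float(logs.sum() / norm)
```

Averaging such evaluations over repeated exact Pareto samples recovers the true ξ for any trimming level b, which is the unbiasedness used throughout this section.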

A lower-trimmed Hill plot
The so-called Hill plots, in which T_{k,k} is plotted as a function of k, are probably the most popular starting tool in extreme value analysis for Pareto-type tails. However, the difficulties involved in interpreting these plots and searching for 'horizontal' or 'stable' parts that indicate the start of a tail region resembling a pure Pareto tail, if such parts are available at all, constitute a serious obstacle in practical applications. In the literature, two lines of research have been developed to remedy these problems: adaptive selection of k along some criterion such as minimization of the MSE, and bias reduction, searching for estimators that are more stable and hence reduce the drift due to bias when k is taken too large. Concerning the line of research on bias reduction we refer to recent papers using ridge regression methods in Buitendag et al. (2019) or Mean-of-Order-p estimators in Gomes et al. (2016), and the many other proposals cited in those papers. Both approaches, adaptive selection and bias reduction, can provide extra insight in a given case and complement each other. For instance, with increasing k, bias-reduced estimators often start deviating from the original Hill plot at or around MSE-optimal k levels. Moreover, next to the selection of an appropriate threshold, stable bias-reduced methods based on extended Pareto models can provide models that fit well on a larger set of top data, see for instance (Papastathopoulos and Tawn 2013) and the references therein.
While in the next sections we propose an adaptive selection method for k, we also suggest plotting the statistic T_{b,k} defined above for b = 1, . . . , k and different k, as it is by construction unbiased under Eq. 5 for any b ≤ k. Analogously to the Hill plot, we exploit the second degree of freedom and plot, for selected values of k, T_{b,k} as a function of b. That is, the plot is constructed by overlaying the trajectories for a selection of k values. The lower variance of these trajectories comes from the fact that the normalizing order statistic is fixed, and hence non-constant behaviour is easier to identify visually than in the classical Hill plot. As a particular consequence, the value of k that makes the tail resemble a pure Pareto tail is easier to determine, by examining when the trajectories start to be constant, hence indicating zones with reduced bias.
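The trajectories behind such a lower-trimmed Hill plot can be assembled in a vectorised way; the sketch below (names are ours) computes (T_{1,k}, . . . , T_{k,k}) for several k, and any plotting library can then overlay these arrays against b. On an exact Pareto sample the trajectories are flat around the true ξ, which is precisely what the plot exploits.

```python
import numpy as np

def lth_trajectory(sample, k):
    """Return the trajectory (T_{1,k}, ..., T_{k,k}) for one value of k."""
    x = np.sort(np.asarray(sample, dtype=float))
    logs = np.log(x[-k:][::-1]) - np.log(x[-k - 1])   # largest first
    csum = np.cumsum(logs)                            # sum of the top b terms
    # tail[b-1] = sum_{j=b+1}^k 1/j, with the empty sum 0 for b = k
    tail = np.append(np.cumsum(1.0 / np.arange(k, 1, -1))[::-1], 0.0)
    b = np.arange(1.0, k + 1)
    return csum / (b * (1.0 + tail))

rng = np.random.default_rng(2)
pareto = rng.uniform(size=3000) ** -0.5               # exact Pareto, xi = 0.5
trajectories = {k: lth_trajectory(pareto, k) for k in (100, 300, 500)}
```

The right end-point of each trajectory is the corresponding classical Hill estimate, matching the overlay description above.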
The following proposition provides mathematical evidence for the above observations.

Proposition 2.1 As a function of the number b of order statistics being used, in the exact Pareto case (5) the estimator T_{b,k} has lower variance than the classical Hill estimator T_{b,b}. More precisely,
As an illustration, we now compare the performance of these lower-trimmed Hill (LTH) plots for Pareto, near-Pareto and spliced Pareto distributions. The latter is defined through its cumulative distribution function (c.d.f.)

F(x) = 1 − x^{−1/ξ_0} for 1 ≤ x ≤ c, and F(x) = 1 − c^{−1/ξ_0} (x/c)^{−1/ξ} for x > c,  (13)

for c ≥ 1 and r > −1/ξ_0, which is the c.d.f. of a Pareto random variable with tail index ξ_0 up to some splicing point c, continuously pasted with the c.d.f. of a Pareto random variable with another tail index ξ = (1/ξ_0 + r)^{−1} thereafter. Splicing models (also sometimes referred to as composite models) are for instance popular in reinsurance modelling, cf. (Albrecher et al. 2017, Ch. 4).
• Student-t distribution with 10 degrees of freedom. The absolute value function was applied to the sample.
• Log-gamma with log-shape parameter 3/2 and log-rate parameter 1, i.e. with density f(x) = (log x)^{1/2} x^{−2}/Γ(3/2), x > 1.

The LTH plots together with the usual Hill plots are shown in the top panels of Figs. 1-5. Reduced-bias plots based on Mean-of-Order-p_0 estimators are added in Figs. 2-5. We restrict ourselves here to p_0 = −1 (different choices were also considered, but did not yield substantial differences). The LTH plots are made for a selection of k, from 1 to 1000 in steps of 50 (1, 51, 101, . . .), as a function of the lower trimming b. Recall that b ≤ k, so the lines have different domains on the x-axis. Observe that the right end-point of each of the overlaid lines corresponds to the respective point in the Hill plot. The bottom panels of Figs. 1-5 display two ways of measuring the stability of these trajectories, through their empirical variance and their slope. For the spliced distribution in Fig. 2, observe how the LTH estimator becomes horizontal as a function of b when k is close to the (rank of the) splicing point. For smaller k, the plot then looks similar to the exact Pareto case. Loosely speaking, the slopes of the lines are a useful visual tool for detecting the number of upper order statistics k after which a Pareto tail is feasible. The cases of the Student-t (Fig. 3) and Burr (Fig. 4) distributions show the problem of a large bias for the Hill estimator throughout, where the regime of a Pareto tail is only reached at the most extreme quantiles, and stable k areas are not really available. The Hill plot is roughly a monotone function of the number of order statistics, while the bias-reduced estimator already departs from the Hill plot at small values of k. The k levels indicated by the LTH variance and slope plots confirm that conclusion. Finally, the log-gamma distribution (Fig. 5), with a logarithmic slowly varying function, is known to be a difficult case for extreme value analysis: the Hill plot does show a stable area up to k around 500, but the estimates there still exhibit a large bias.
The reduced-bias estimates follow the Hill estimates for most of the plot up to k around 500, and this range is also indicated by the LTH slope plot.
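The spliced Pareto model can be simulated by inverse transform; the sketch below assumes the continuous-pasting form of the c.d.f. (13) described above (a Pareto(ξ_0) body up to c, a Pareto tail of index ξ = 1/(1/ξ_0 + r) beyond it), and the function name is ours.

```python
import numpy as np

def rsplice(n, xi0, r, c, rng):
    """Inverse-transform sample from the spliced Pareto model:
    Pareto(xi0) on [1, c], continuously pasted with a Pareto tail of
    index xi = 1/(1/xi0 + r) beyond the splicing point c."""
    xi = 1.0 / (1.0 / xi0 + r)
    u = rng.uniform(size=n)
    body = (1.0 - u) ** -xi0                           # plain Pareto branch
    tail = c * ((1.0 - u) * c ** (1.0 / xi0)) ** -xi   # rescaled Pareto tail
    return np.where(u <= 1.0 - c ** (-1.0 / xi0), body, tail)

rng = np.random.default_rng(3)
x = rsplice(50_000, xi0=0.5, r=1.0, c=2.0, rng=rng)    # tail index 1/3
```

In an LTH plot of such a sample the trajectories flatten once k drops below the rank of the splicing point, which is the behaviour described for Fig. 2.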
In the next sections we will develop inferential and data-driven selection and estimation tools on the basis of these graphical tools.

Regularly varying tails
We now move from the simple Pareto sample to the general Fréchet domain of attraction, with tails of the form Eq. 1. Denote by Q the quantile function associated with F. Assumptions on the rate of convergence of the above limit make it possible to obtain explicit results concerning asymptotic properties of the lower-trimmed Hill estimator. Hence, we impose the second order condition (14) for some regularly varying function Q_0 with index p < 0.

Theorem 3.1 Under the model Eq. 1 and the second order condition Eq. 14, T_{b,k} as defined in Eq. 11 satisfies the following asymptotic distributional identity (15), for n, k, n/k → ∞, where E_1, . . . , E_k are i.i.d. standard exponential random variables, and where we use the notation

Distribution of the average
Define the average of the T_{b,k} across b as

T_k = (1/k) Σ_{b=1}^k T_{b,k},  (16)

so that the asymptotic bias terms can be recognized directly from the representation in Theorem 3.1. To ease notation, let us introduce the constants

Theorem 3.2 The average T_k as defined in Eq. 16 satisfies, under model Eq. 1 and the second order condition Eq. 14, the following asymptotic distributional identity (18), for n, k, n/k → ∞, where Ei(·) denotes the exponential integral.
Equipped with the representations in terms of exponential variables obtained in Theorems 3.1 and 3.2, we now set out to analyze the mean of the empirical variance of T_{b,k} as a function of b.
Theorem 3.3 The mean of the empirical variance of {T_{b,k}; 1 ≤ b ≤ k} satisfies, under model Eq. 1 and the second order condition Eq. 14, the following asymptotic identity (21), for n, k, n/k → ∞, where C = 0.502727 and f is a function of p only. Notice that both C and f are universal, i.e. they do not depend on the underlying distribution. A plot of f as a function of p is given in Fig. 6.

Optimal k in the Hall class
We now make a further assumption within the regularly varying class, in order to obtain an explicit form of Q_0. Concretely, we assume the Hall class, cf. Hall (1982), which satisfies the property (20). An immediate consequence is then an explicit expression for Q_0(n/k), up to a factor (1 + o(1)). Recall that the classical Hill estimator for this class has an AMSE of the form ξ²/k plus a squared bias term, which is minimized for k*_0 as given in Eq. 22 (see e.g. (Beirlant et al. 2004, p. 125)). In a similar way, the minimizer of Eq. 21 is simply the value k* given in Eq. 23. Hence from Eqs. 22 and 23 we obtain a simple expression (24) for the optimal threshold k*_0 of the Hill estimator in terms of k*.
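In the canonical case p = −1 the link between the two optimal thresholds reduces to a single universal numerical factor, 2.62421, the value used in the data analysis of Section 6 (for other p the factor is a function of p only). A sketch, with a function name of our own choosing:

```python
def hill_k_from_variance_k(k_star, factor=2.62421):
    """Map the minimizer k* of the expected empirical variance to the
    AMSE-optimal threshold k*_0 of the classical Hill estimator,
    using the p = -1 numerical factor 2.62421."""
    return max(1, round(k_star / factor))
```

For the insurance data of Section 6 this maps the variance minimizer 222 to the Hill threshold 85.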

Interpretation of T k as a weighted Hill estimator
Observe that, for fixed k, T_k can be rewritten as a weighted sum of the log-exceedances log Y_{i,k} with weights θ_i/k, so that one can interpret the estimator T_k as a modification of the classical Hill estimator that uses different weights for different order statistics. It is not hard to see that the correction factors θ_i admit a simple asymptotic approximation. Figure 7 (left) highlights the accuracy of this approximation for k = 100 across different values of i, and also illustrates the fact that the largest data point receives a weight of almost 2 in this case, whereas from the 20th-largest observation onwards the weight is lower than for the classical Hill estimator, and the weight diminishes further for smaller data points. Note that, as k increases, the weight θ_1 of the largest observation grows above any bound, but extremely slowly: Figure 7 (right) illustrates that even for a value as large as k = 10000, θ_1 is still below 2.4.
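Under our reconstruction of T_{b,k}, the correction factors come out as θ_i = Σ_{b=i}^k [b(1 + Σ_{j=b+1}^k 1/j)]^{−1}; the sketch below (names are ours) computes them and numerically verifies both the weighted-Hill identity and the monotone decay of the weights.

```python
import numpy as np

def theta_weights(k):
    """Correction factors theta_i such that, for fixed k,
    T_k = (1/k) * sum_b T_{b,k} = (1/k) * sum_i theta_i * log Y_{i,k}
    (under our reading of the T_{b,k} normalisation)."""
    # tail[b-1] = sum_{j=b+1}^k 1/j, empty sum 0 for b = k
    tail = np.append(np.cumsum(1.0 / np.arange(k, 1, -1))[::-1], 0.0)
    b = np.arange(1.0, k + 1)
    inv_norm = 1.0 / (b * (1.0 + tail))        # 1 / (b (1 + sum_{j>b} 1/j))
    return np.cumsum(inv_norm[::-1])[::-1]     # theta_i = sum_{b >= i} ...

theta = theta_weights(100)
```

The weights decrease strictly in i, with the largest observation receiving close to twice the uniform Hill weight for k = 100, in line with Fig. 7 (left).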

A ratio statistic
Once a k* has been selected, it is important to be able to statistically assess whether the remaining upper tail differs significantly from that of a pure Pareto. In order to recognize whether a Pareto tail has been reached, we have seen that flatness of the trimmed Hill trajectories is the key feature; this suggests monitoring ratio statistics R_{b,k*}, quantities which we expect to be close to one. Although these statistics are not i.i.d., so that test sizes have to be calibrated using Monte Carlo simulation, an advantage which carries over to the present setting is that they do not depend on ξ. Indeed, by the order statistics property of the Poisson process, R_{b,k*} can be represented in terms of Γ_m = Σ_{i=1}^m E_i, where E_i, i = 1, 2, . . ., is an i.i.d. sequence of unit-rate exponential random variables. This invariance with respect to the parameter ξ permits to assess the goodness of selection of a threshold k* as follows:

(1) Simulate the R_{b,k*} statistics N_MC times.
(2) For fixed α ∈ (0, 1), find the empirical α/2 and 1 − α/2 quantiles corresponding to each of the b = 2, . . . , k* − 1 samples, and call them (q_1, q_2)_2, . . . , (q_1, q_2)_{k*−1}.
(3) Compute the proportion α_r of simulated trajectories that fall outside at least one of these intervals.
(4) The proportion α_r is the global level of the test, and the algorithm stops when it is close enough to a pre-specified level (typically 0.05). If α_r is larger (smaller) than the pre-specified level, go to Step (2) and decrease (increase) α.
(5) Plot the R_{b,k*}, b = 2, . . . , k* − 1, from the data, together with the last set of quantiles (q_1, q_2)_2, . . . , (q_1, q_2)_{k*−1}.

It is also a good idea, for visualization, to plot the standardized version (R_{b,k*} − q_{1,b})/(q_{2,b} − q_{1,b}), which for a pure Pareto tail is expected by construction to lie (as a trajectory) between 0 and 1 in 100(1 − α_r)% of the cases. Here, we have used the notation (q_1, q_2)_b = (q_{1,b}, q_{2,b}).
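The calibration loop can be sketched as follows, assuming the reading R_{b,k} = T_{b,k}/T_{k,k} of the ratio statistic (our notation), which is free of ξ by the representation above; thanks to this invariance it suffices to simulate from the standard Pareto. Function and variable names are ours.

```python
import numpy as np

def lth_traj(sample, k):
    x = np.sort(np.asarray(sample, dtype=float))
    logs = np.log(x[-k:][::-1]) - np.log(x[-k - 1])
    tail = np.append(np.cumsum(1.0 / np.arange(k, 1, -1))[::-1], 0.0)
    return np.cumsum(logs) / (np.arange(1.0, k + 1) * (1.0 + tail))

def calibrate_bands(k, n_mc=500, level=0.05, seed=7):
    """Monte Carlo calibration: simulate R_{b,k} trajectories under a
    pure Pareto tail (xi = 1 w.l.o.g.) and bisect on the pointwise alpha
    until the global exit rate matches the prescribed level."""
    rng = np.random.default_rng(seed)
    R = np.empty((n_mc, k - 2))
    for m in range(n_mc):                          # step (1)
        t = lth_traj(rng.uniform(size=k + 1) ** -1.0, k)
        R[m] = t[1:k - 1] / t[-1]                  # b = 2, ..., k-1
    lo, hi = 0.0, level
    for _ in range(25):                            # steps (2)-(4)
        a = 0.5 * (lo + hi)
        q1 = np.quantile(R, a / 2, axis=0)
        q2 = np.quantile(R, 1 - a / 2, axis=0)
        rate = float(np.mean(~np.all((R >= q1) & (R <= q2), axis=1)))
        if rate > level:
            hi = a                                 # bands too narrow
        else:
            lo = a
    return q1, q2, rate
```

The returned bands can then be overlaid on the observed R_{b,k*} trajectory as in step (5); the pointwise level found by the bisection is necessarily smaller than the global level, since the trajectory must stay inside all bands simultaneously.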
Example 4.1 For the Burr sample of Fig. 4, we compare taking k* = 326 and k* = 600 in the plots of Fig. 8. The first number, k* = 326, is precisely the one that minimizes the expected empirical variance, according to the parameters of the Burr sample and to formula Eq. 23, with p chosen to be −1. The number of Monte Carlo simulations was in each case N_MC = 10000, and the significance level α = 0.05. Observe how the fit is good for k* = 326, but lies outside the bands for k* = 600.
Remark 4.2 This approach can only be considered as a selection procedure in itself if the corresponding sequential testing is adjusted to have the correct size. In other words, if the above algorithm is used multiple times to choose k, the rejection probability will exceed the desired level α. An alternative is to feed sequential values of k into the algorithm, which makes the routine computationally highly intensive. Hence, we presently recommend it solely as a goodness-of-selection evaluation.

Simulations
We perform a simulation study based on common distributions which belong to the Hall class Eq. 20. We consider simulating N_sim = 1000 times from the following distributions, with sub-cases of varying sample size and parameters:

• The Burr distribution, with tail given by 1 − F(x) = (η/(η + x^τ))^λ, x > 0, which implies by Taylor expansion that ξ = 1/(λτ) and p = −1/λ. We consider for n = 100, 500 the two sets of parameters η = 1, λ = 2, τ = 1/2; and η = 3/2, λ = 1/2, τ = 2.
• The Student-t distribution with m degrees of freedom. The tail can be expressed in terms of hypergeometric and Gamma functions. Since this distribution is symmetric around zero, we take the absolute values of the data, which preserves the tail behaviour. We have that ξ = 1/m and p = −2/m. We consider, for n = 100, 500, the two choices m = 2, 10.
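Sampling from the Burr distribution is a one-liner by inverse transform, assuming the standard parametrisation 1 − F(x) = (η/(η + x^τ))^λ (the form consistent with the stated parameter sets); a sketch, with names of our own:

```python
import numpy as np

def rburr(n, eta, lam, tau, rng):
    """Inverse-transform sample from the Burr tail
    1 - F(x) = (eta / (eta + x^tau))^lam, for which xi = 1/(lam*tau)."""
    u = 1.0 - rng.uniform(size=n)              # survival probability in (0, 1]
    return (eta * (u ** (-1.0 / lam) - 1.0)) ** (1.0 / tau)

rng = np.random.default_rng(5)
x = rburr(100_000, eta=1.0, lam=2.0, tau=0.5, rng=rng)   # xi = 1, p = -1/2
```

The empirical survival function of such a sample matches the Burr tail function, which is a quick way to sanity-check the parametrisation.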
For each sample we evaluate the Hill estimator H_k = T_{k,k} and the averaged trimmed estimator T_k at three particular choices of k. Note that these threshold choices are designed for the Hill estimator, but will turn out to be sensible for the latter estimator as well.
(i) We use the popular procedure of Guillou and Hall (2001) as a benchmark for finding the optimal choice of k, and denote the resulting tail estimators by H_{k̂_GH} and T_{k̂_GH}. This threshold selection procedure has been compared (both in Guillou and Hall (2001) itself and in Beirlant et al. (2002)) to alternatives like Danielsson et al. (2001) and Drees and Kaufmann (1998). We also refer to Schneider et al. (2019) for a recent paper which was developed independently around the same time as the present article.
(ii) An estimator k̂*_0 of k*_0 from Eq. 22 is obtained as follows. Motivated by Eq. 21, we compute k̂* as the minimizer of the empirical variance of the trimmed Hill estimator as a function of b (the search beginning at 1/5 of the sample size, to avoid degeneracies), and use Eq. 24 to set k̂*_0. Observe that while we still have to input p, prior knowledge of ξ and D is no longer needed. We choose p = −1 as the canonical choice.
(iii) As in (ii), but using the true value of p, in order to quantify how a potential misspecification of p through the canonical choice p = −1 affects the estimators. Results obtained when using an estimator of p, such as the one of Fraga Alves et al. (2003), rather than a fixed value, indicated that the additional variability introduced by a further estimation step does not lead to improvements at smaller sample sizes. This confirms the observations made for instance in Drees and Kaufmann (1998) and Beirlant et al. (2002).
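The selection rule of item (ii), as we read it, can be sketched as follows: for each candidate k, compute the empirical variance of the trajectory {T_{b,k}: b ≤ k}, search for the minimizer k̂* from n/5 onwards, and rescale by the p = −1 factor 2.62421 (the value used in Section 6). Names are ours.

```python
import numpy as np

def lth_traj(sample, k):
    x = np.sort(np.asarray(sample, dtype=float))
    logs = np.log(x[-k:][::-1]) - np.log(x[-k - 1])
    tail = np.append(np.cumsum(1.0 / np.arange(k, 1, -1))[::-1], 0.0)
    return np.cumsum(logs) / (np.arange(1.0, k + 1) * (1.0 + tail))

def select_k(sample, factor=2.62421):
    """Step (ii): minimise over k the empirical variance of the trimmed
    Hill trajectory, searching from n/5 upwards, then rescale by the
    p = -1 factor to obtain the Hill threshold k*_0."""
    n = len(sample)
    start = max(2, n // 5)
    variances = [float(np.var(lth_traj(sample, k))) for k in range(start, n - 1)]
    k_star = start + int(np.argmin(variances))
    return k_star, max(1, round(k_star / factor))
```

No estimate of ξ or D enters anywhere, which is the practical advantage stressed above.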
We then plot the bias, variance and MSE of each resulting estimator as a function of k.
The results are given in Figs. 9 and 10 for the Burr case; Figs. 11 and 12 for the Fréchet case; Figs. 13 and 14 for the GPD case; and Figs. 15 and 16 for the Student-t case. We observe that the behaviour is very similar for the four families (which is not uncommon in this context, cf. (Beirlant et al. 2002, p.178)).
For the Hill estimator, we notice that our method fares well against the benchmark. The misspecification of the second order parameter p does not play a substantial role, except perhaps for the most biased case of the Student-t distribution with 10 degrees of freedom. The same behaviour is observed within the three T-estimators. When comparing Hill against T-estimators, the latter improve the bias and MSE for nearly all k, and in most cases also the variance (except for very heavy tails (ξ ≥ 1) and small values of k).

Fig. 9 Burr distribution, parameters η = 1, λ = 2, τ = 1/2. Top: Violin plots for n = 100, 500 of the estimators H_{k̂_GH}, H_{k̂*_0,p=−1}, H_{k̂*_0,p=−1/λ}, T_{k̂_GH}, T_{k̂*_0,p=−1}, T_{k̂*_0,p=−1/λ}. Bottom: diagnostics of T_k (blue) and H_k (red) as a function of k

Fig. 10 Burr distribution, parameters η = 3/2, λ = 1/2, τ = 2. Top: Violin plots for n = 100, 500 of the estimators H_{k̂_GH}, H_{k̂*_0,p=−1}, H_{k̂*_0,p=−1/λ}, T_{k̂_GH}, T_{k̂*_0,p=−1}, T_{k̂*_0,p=−1/λ}. Bottom: diagnostics of T_k (blue) and H_k (red) as a function of k

Fig. 11 Fréchet distribution, parameter α = 1. Top: Violin plots for n = 100, 500 of the estimators H_{k̂_GH}, H_{k̂*_0,p=−1}, T_{k̂_GH}, T_{k̂*_0,p=−1}. Bottom: diagnostics of T_k (blue) and H_k (red) as a function of k
Remarkably, the estimator T_{k̂*_0,p=−1}, where the canonical p = −1 is used, is highly competitive against the Hill estimator, especially so for ξ ≤ 1. This is not a contradiction, since the optimality of the Hill estimator refers to choices of k within the class of H_k, whereas the T_k estimators span a different class (visible in the weighting interpretation of Section 3.3), and when k is optimized w.r.t. AMSE in that class, even better performance may be feasible; this, however, is not the subject of the present paper.

Fig. 12 Fréchet distribution, parameter α = 1/2. Top: Violin plots for n = 100, 500 of the estimators H_{k̂_GH}, H_{k̂*_0,p=−1}, T_{k̂_GH}, T_{k̂*_0,p=−1}. Bottom: diagnostics of T_k (blue) and H_k (red) as a function of k

Fig. 13 GPD distribution, parameters γ = 1/2, σ = 2. Top: Violin plots for n = 100, 500 of the estimators H_{k̂_GH}, H_{k̂*_0,p=−1}, H_{k̂*_0,p=−γ}, T_{k̂_GH}, T_{k̂*_0,p=−1}, T_{k̂*_0,p=−γ}. Bottom: diagnostics of T_k (blue) and H_k (red) as a function of k

Insurance data
Let us now consider a real-life insurance data set consisting of 837 motor third party liability (MTPL) insurance claims from the period 1995-2010 that was studied intensively in Albrecher et al. (2017) (where it is referred to as "Company A"). These data are right-censored, and were also analyzed recently by combining survival analysis techniques and expert information in Bladt et al. (2019). Here, we work with the claim amounts as given. In Fig. 18 we depict the lower-trimmed Hill plots and the usual Hill plot, together with the empirical variance. We have also included T_k, which follows the trajectory of the usual Hill plot rather closely, and the mean-of-order −1 bias-reduced estimator. As a preliminary observation, notice that the k-area at which the Hill plot and its bias-reduced version visually start to differ is roughly the same as the one where the lower-trimmed Hill trajectories start flattening out, which serves as a good sanity check. As in the simulation studies of Section 5, in order to avoid degeneracies, we only look at candidates for the minimizer to the right of n/5, which corresponds to 167 in this case. The minimum empirical variance is then obtained for k̂* = 222. Using the canonical choice p = −1, we have that k̂*_0 = 222/2.62421 ≈ 85. Note that for the same choice p = −1, using the prior eyeballed estimate ξ ≈ 0.5 based on the Burr-like Hill plot, we can deduce that λ = −1/p = 1, and then D = 1/τ = −ξ = −0.5. We thus get by Eq. 22 the ad-hoc sample fraction k_ah = 112 (which might be considered the classical choice of the threshold in this case). Notice, however, that the latter estimate is only available heuristically, since one needs a first estimate of ξ to estimate ξ itself, and one also has to make distributional assumptions on the data. We include this estimate here simply as a naive solution that requires no further statistical procedures beyond looking at the Hill plot.
The corresponding estimates of ξ can then be computed for each of these threshold choices. The simulation studies of Section 5 may suggest the third of the above numbers to be the most reliable estimate, since there is no way of quantifying the eyeball-aided procedure involving k_ah. However, the 95% confidence interval for the first estimate is (0.400, 0.616), suggesting that for a one-sample analysis it is difficult to make a definitive statement on the statistical superiority of any of the four estimates. Further, the ratio statistic test in Fig. 19 suggests that for both thresholds the sample is Pareto in the tail (with only a slight issue for the two largest observations). The takeaway is that, roughly speaking, we are able to reconfirm in an automated statistical way what can be deduced by looking at a Hill plot (together with its bias-reduced variants) and guessing the distribution of the sample.
In (Albrecher et al. 2017, p. 99), a splicing point was suggested for this data set at around k = 20, based on expert opinion. A semi-automated option using our method for detecting this splicing point would be to replace the left limit k = 167 by a very small number (in this case k = 4 is chosen, after visual inspection of the erratic nature of the empirical variance for the first three values), and then to apply our method, which leads to the detection of the minimum variance at k = 38 (clearly visible in Fig. 18). Under the assumption p = −1 this then leads to k ≈ 14 as a suggested splicing point. We would like to point out that the identification of the splicing point matters in insurance practice, since the different resulting distributional assumptions on either side have an effect on the location of extreme quantiles and on risk management in general, including the capital requirements for solvency purposes. The possible existence of natural splicing points can also be argued from a causal perspective, as different degrees of inspection scrutiny may be applied below and above certain claim levels.
As a side remark, in the present data set the ultimates for the highest claims have a certain degree of intrinsic uncertainty (as they are just estimates of the final closed claim size), and a more systematic way to approach this particular situation would be to combine the trimming of the Hill estimator from below and above, but the latter is not the focus of the present paper.

Conclusion
In this paper, we showed that trimming the Hill estimator from the left can lead to favorable properties in connection with the expected empirical variance of tail index estimators in extreme value statistics. For the Hall class, we established asymptotic results on the behavior of this expected empirical variance, which allow us to develop a guideline for the choice of the optimal threshold in the tail index estimation problem. It turns out that there is an intrinsic link between this optimal threshold and the classical optimal threshold for the Hill estimator. Since in the trimming context the identification of the optimal threshold is much less sensitive to the tail characteristics (it only depends on the p-parameter in the Hall class, not on D nor on the tail index ξ), this link allows one to circumvent the classical problem in threshold selection for the Hill estimator. As a by-product, by suitable averaging we developed a novel tail index estimator which assigns a non-uniform weight to each observation in a natural way, relies on fewer assumptions on the tail characteristics, is simple to implement and outperforms the classical Hill estimator in most cases, as illustrated in simulation studies. In addition, the technique was applied to a real-life insurance data set that had previously been studied by other techniques. Note also that the proposed selection principle for k based on the variance of the lower-trimmed Hill plots can be applied to any estimator of ξ for which the asymptotic mean squared error can be written as M_1 ξ²/k + M_2(p) Q_0(n/k)² with M_1 > 0 and M_2(p) > 0 depending only on p. We conclude by noting that the approach taken in this paper is in principle also applicable to the potential improvement of tail index estimators other than the Hill estimator.
Further possible directions of future research include the combination of left trimming with right trimming in situations with possible outliers, as well as the consideration of possibly censored data.
Appendix

which gives the second identity.
Proof of Theorem 3.1. We first express T_{b,k} in terms of Y_{1,n} < · · · < Y_{n,n}, the order statistics of a standard Pareto sample (the ξ = 1 case). Then, from the second order condition Eq. 14 we obtain, for A = Y_{n−k,n} and x = Y_{n−i+1,n}/Y_{n−k,n}, as k, n, n/k → ∞, an expansion into two terms. By the Rényi representation Eq. 6 of exponential order statistics, the first term is distributed in terms of i.i.d. standard exponential random variables E_1, E_2, . . . , E_k. For the second term, by convergence to uniform random variables and a Riemann integral approximation, and since (1 − 1/Y_{n−k,n}) is a uniform order statistic, we further get that Q_0(Y_{n−k,n})/Q_0(n/k) →_P 1.
Putting the three pieces together then establishes Eq. 15.

Proof of Theorem 3.2.
With the shortened notation, and by exchanging the order of summation, we can write the resulting expression as a term (1 + o_p(1)) plus Q_0(n/k) c_{k,p} (1 + o_p(1)).
Again, by Riemann integration we obtain the corresponding limit. Putting the pieces together then indeed yields Eq. 18.

Proof of Theorem 3.3. Let us first decompose each summand by writing
and subsequently consider each term separately. From Eq. 27 we obtain the factor (1 + o_p(1)).