Abstract
The distributed Hill estimator is a divide-and-conquer algorithm for estimating the extreme value index when data are stored across multiple machines. In applications, estimates based on the distributed Hill estimator can be sensitive to the choice of the number of exceedance ratios used in each machine, and a substantial asymptotic bias may arise even when this number is chosen at a low level. We overcome this potential drawback by designing a bias correction procedure for the distributed Hill estimator that adheres to the setup of distributed inference. The resulting asymptotically unbiased distributed estimator is applicable to distributed stored data and, at the same time, inherits all known advantages of bias correction methods in extreme value statistics.
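For concreteness, a minimal sketch of the divide-and-conquer estimator described above (the function names and the exact-Pareto test data are our own illustration): each machine computes a Hill (1975) estimator from its k largest observations, and the machine-wise estimates are averaged.

```python
import math
import random

def hill(sample, k):
    """Hill estimator: average log-excess over the (k+1)-th largest observation."""
    xs = sorted(sample, reverse=True)
    return sum(math.log(xs[i] / xs[k]) for i in range(k)) / k

def distributed_hill(machines, k):
    """Divide and conquer: average the per-machine Hill estimators."""
    return sum(hill(sample, k) for sample in machines) / len(machines)

# Exact Pareto data with extreme value index gamma = 0.5:
# X = U^(-gamma) with U ~ Uniform(0, 1)
random.seed(1)
gamma, m, n, k = 0.5, 20, 1000, 100
machines = [[random.random() ** -gamma for _ in range(n)] for _ in range(m)]
estimate = distributed_hill(machines, k)   # close to gamma = 0.5
```

For exact Pareto data there is no approximation bias; the bias issue the paper addresses arises for distributions that are only Pareto-like in the tail.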
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Code availability
The code for the simulation study is available upon request.
References
Alves, M.F., Gomes, M.I., de Haan, L.: A new class of semi-parametric estimators of the second order parameter. Port. Math. 60(2), 193–214 (2003)
Cai, J.J., de Haan, L., Zhou, C.: Bias correction in extreme value statistics with index around zero. Extremes 16(2), 173–201 (2012)
Chen, L., Li, D., Zhou, C.: Distributed inference for extreme value index. Biometrika, to appear (2021). https://doi.org/10.1093/biomet/asab001
Danielsson, J., de Haan, L., Peng, L., de Vries, C.G.: Using a bootstrap method to choose the sample fraction in tail index estimation. J. Multivar. Anal. 76(2), 226–248 (2001)
de Haan, L., Ferreira, A.: Extreme Value Theory: An Introduction. Springer, New York (2006)
de Haan, L., Mercadier, C., Zhou, C.: Adapting extreme value statistics to financial time series: dealing with bias and serial dependence. Finance Stochast. 20(2), 321–354 (2016)
Dekkers, A.L., Einmahl, J.H., de Haan, L.: A moment estimator for the index of an extreme-value distribution. Ann. Stat. 17(4), 1833–1855 (1989)
Drees, H., Ferreira, A., de Haan, L.: On maximum likelihood estimation of the extreme value index. Ann. Appl. Probab. 14(3), 1179–1201 (2004)
Einmahl, J.H., de Haan, L., Zhou, C.: Statistics of heteroscedastic extremes. J. R. Stat. Soc. Ser. B Stat Methodol. 78(1), 31–51 (2016)
Fan, J., Wang, D., Wang, K., Zhu, Z.: Distributed estimation of principal eigenspaces. Ann. Stat. 47(6), 3009–3031 (2019)
Gomes, M.I., de Haan, L., Peng, L.: Semi-parametric estimation of the second order parameter in statistics of extremes. Extremes 5(4), 387–414 (2002)
Gomes, M.I., de Haan, L., Rodrigues, L.H.: Tail index estimation for heavy-tailed models: accommodation of bias in weighted log-excesses. J. R. Stat. Soc. Ser. B Stat Methodol. 70(1), 31–52 (2008)
Gomes, M.I., Pestana, D.: A simple second-order reduced bias’ tail index estimator. J. Stat. Comput. Simul. 77(6), 487–502 (2007)
Guillou, A., Hall, P.: A diagnostic for selecting the threshold in extreme value analysis. J. R. Stat. Soc. Ser. B Stat Methodol. 63(2), 293–305 (2001)
Hill, B.M.: A simple general approach to inference about the tail of a distribution. Ann. Stat. 3(5), 1163–1174 (1975)
Li, R., Lin, D.K., Li, B.: Statistical inference in massive data sets. Appl. Stoch. Model. Bus. Ind. 29(5), 399–409 (2013)
Smith, R.L.: Estimating tails of probability distributions. Ann. Stat. 15(3), 1174–1207 (1987)
Volgushev, S., Chao, S.-K., Cheng, G.: Distributed inference for quantile regression processes. Ann. Stat. 47(3), 1634–1662 (2019)
Zhou, C.: Existence and consistency of the maximum likelihood estimator for the extreme value index. J. Multivar. Anal. 100(4), 794–815 (2009)
Acknowledgements
We thank the editors and three referees for their helpful comments and suggestions. Liujun Chen and Deyuan Li’s research is partially supported by National Natural Science Foundation of China grants 11971115 and 71661137005.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Appendix
1.1 Proofs
1.1.1 Preliminaries
Lemma 1
Let \(Y, Y_1,\dots , Y_n\) be i.i.d. Pareto (1) random variables with distribution function \(1-1/y, \ y\ge 1\). Let \(Y^{(1)} \ge \cdots \ge Y^{(n)}\) be the order statistics of \(\left\{ Y_1,\dots ,Y_n \right\}\). Let f be a function such that \(\text {Var} \left\{ f(Y) \right\} <\infty\). Then for any \(k\ge 1\),
where \(Y_1^*, Y_2^*,\ldots , Y_k^*\) are i.i.d. Pareto (1) random variables. Moreover,
is independent of \(Y^{(k+1)}\) and asymptotically normally distributed with mean zero and variance \(\text {Var} \left\{ f(Y) \right\}\) as \(n \rightarrow \infty\), provided that \(k=k(n) \rightarrow \infty\) and \(k/n \rightarrow 0\).
Proof of Lemma 1
This lemma follows directly from Lemma 3.2.3 in de Haan and Ferreira (2006), together with the fact that \(\log Y\) follows a standard exponential distribution. \(\square\)
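As a sanity check (our own illustration, not part of the proof), the representation in Lemma 1 can be verified by simulation: the log-ratios \(\log \left( Y^{(i)}/Y^{(k+1)} \right)\), \(i=1,\dots ,k\), should behave like the order statistics of k i.i.d. standard exponential variables, so their average should be close to 1.

```python
import math
import random

random.seed(0)
n, k, reps = 2000, 50, 400
acc = 0.0
for _ in range(reps):
    # Pareto (1) sample via inverse transform: Y = 1/U with U ~ Uniform(0, 1)
    ys = sorted((1.0 / random.random() for _ in range(n)), reverse=True)
    # average log-ratio over the top k order statistics; its expectation is
    # the mean of k i.i.d. standard exponentials, i.e. 1
    acc += sum(math.log(ys[i] / ys[k]) for i in range(k)) / k
mean_log_ratio = acc / reps   # close to 1
```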
Lemma 2
Let \(Y_1,\dots ,Y_n\) be i.i.d. Pareto (1) random variables and \(Y^{(1)}\ge \cdots \ge Y^{(n)}\) be the order statistics of \(\left\{ Y_1,\dots ,Y_n \right\}\). Then for any \(\rho <0\),
where \(g(k,n,\rho )\) is defined in (3). Moreover, if k is a fixed integer, then \(g(k,n,\rho ) \rightarrow k^{\rho }\Gamma (k-\rho +1)/\Gamma (k+1)\) as \(n \rightarrow \infty\). If k is an intermediate sequence, i.e., \(k\rightarrow \infty\) and \(k/n \rightarrow 0\) as \(n \rightarrow \infty\), then,
Proof of Lemma 2
We first handle the case when k is a fixed integer. By Stirling’s formula,
as \(x\rightarrow \infty\), we have that, as \(n \rightarrow \infty\),
which leads to
Next, we handle the case when k is an intermediate sequence. By Stirling’s formula, we have that, as \(n \rightarrow \infty\),
By Taylor’s formula and direct calculation, we obtain that, as \(n\rightarrow \infty\),
and
It follows that, as \(n \rightarrow \infty\),
\(\square\)
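The fixed-k limit \(k^{\rho }\Gamma (k-\rho +1)/\Gamma (k+1)\) in Lemma 2 itself tends to 1 as \(k\rightarrow \infty\), which can be checked numerically on the log scale (our own illustration; for \(\rho =-1\) the expression reduces exactly to \((k+1)/k\)).

```python
import math

def limit_constant(k, rho):
    # k^rho * Gamma(k - rho + 1) / Gamma(k + 1), evaluated via log-gamma
    # to avoid overflow for large k
    return math.exp(rho * math.log(k) + math.lgamma(k - rho + 1) - math.lgamma(k + 1))

# for rho = -1 the values are (k + 1)/k: 1.1, 1.01, 1.001, decreasing toward 1
vals = [limit_constant(k, rho=-1.0) for k in (10, 100, 1000)]
```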
Lemma 3
Let \(Y_1,\dots ,Y_n\) be i.i.d. Pareto (1) random variables and \(Y^{(1)}\ge \cdots \ge Y^{(n)}\) be the order statistics of \(\left\{ Y_1,\dots ,Y_n \right\}\). Define for \(\rho <0\),
Then, the following results hold.
-
(i)
For fixed k, \(\mathbb {E}(Z_k^a)<\infty\), for \(a=1,2,3,4\). Moreover, \(\mathbb {E}\left( Z_k^2 \right) -\left\{ \mathbb {E}\left( Z_k \right) \right\} ^2>0\).
-
(ii)
For intermediate k, i.e., \(k = k(n)\rightarrow \infty , k/n \rightarrow 0\) as \(n\rightarrow \infty\), and \(a=1,2,3,4\),
$$\begin{aligned} \mathbb {E}\left( Z_k^a \right) = \frac{1}{(1-\rho )^a}\left\{ 1+\frac{a(a-1)}{2(1-2\rho )}\frac{1}{k}+O\left(k^{-2}\right) \right\} . \end{aligned}$$
Proof of Lemma 3
By Lemma 1, we have that,
where \(Y_1^*,\dots ,Y_k^*\) are i.i.d. Pareto (1) random variables. Denote \(T_i=\left\{ (Y_i^*)^{\rho }-1 \right\} /\rho\) for \(i=1,\dots ,k\), and \(Z_k= k^{-1}\sum _{i=1}^k T_i\). Then \(T_i\), \(i=1,\dots ,k\), are i.i.d. with the generalized Pareto distribution function \(F(t) = 1-(1+\rho t)^{-1/\rho }\). Thus, we have that for \(a=1,2,3,4\),
First, we handle the case when k is fixed. The result is immediate since \(kZ_k\) is a finite sum of i.i.d. generalized Pareto random variables with shape parameter \(\rho <0\); such variables are bounded (supported on \([0,-1/\rho ]\)) and hence have finite moments of all orders, and the variance is strictly positive since \(T_i\) is non-degenerate.
Next, we handle the case when k is an intermediate sequence. For \(a=1\), we have \(\mathbb {E}(Z_k)=\mathbb {E}(T_i)=(1-\rho )^{-1}\).
For \(a=2\), we have that,
For \(a=3\), we have that
The term \(\mathbb {E} \left( Z_k^4 \right)\) can be handled in a similar way as that for handling \(\mathbb {E}\left( Z_k^3 \right)\). \(\square\)
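The expansion in Lemma 3 (ii) can be checked exactly for \(a=3\): using \(\mathbb {E}(Y^{j\rho })=1/(1-j\rho )\), the moments of \(T\) are available in closed form, and expanding the cube of the sum gives \(\mathbb {E}(Z_k^3)\) exactly. A sketch with our own helper names:

```python
import math

def moment_T(rho, a):
    # E[T^a] for T = (Y^rho - 1)/rho with Y ~ Pareto (1), via the binomial
    # theorem and E[Y^{j*rho}] = 1/(1 - j*rho) for rho < 0
    return sum(math.comb(a, j) * (-1) ** (a - j) / (1 - j * rho)
               for j in range(a + 1)) / rho ** a

def exact_EZk3(rho, k):
    # E[Z_k^3] for Z_k the mean of k i.i.d. copies of T: expand the cube
    # of the sum into the k*(k-1)*(k-2) index patterns
    m1, m2, m3 = (moment_T(rho, a) for a in (1, 2, 3))
    return (k * m3 + 3 * k * (k - 1) * m2 * m1
            + k * (k - 1) * (k - 2) * m1 ** 3) / k ** 3

rho, k = -0.5, 1000
expansion = (1 - rho) ** -3 * (1 + 3 / ((1 - 2 * rho) * k))  # Lemma 3(ii), a = 3
# the exact moment and the expansion agree up to the O(k^{-2}) remainder
assert abs(exact_EZk3(rho, k) - expansion) < 1e-6
```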
Lemma 4
Assume that the distribution function F satisfies the third order condition (4). Then there exist two functions \(A_0(t)\sim A(t)\) and \(B_0(t)=O\left\{ B(t) \right\}\) as \(t\rightarrow \infty\), such that for any \(\delta >0\) there exists \(t_0 = t_0(\delta )>0\) such that for all \(t\ge t_0\) and \(tx\ge t_0\),
Proof of Lemma 4
This lemma follows from applying Theorem B.3.10 in de Haan and Ferreira (2006) to the function \(f(t):=\log U(t)-\gamma \log t\). \(\square\)
1.2 Proofs for Section 3
Recall that \(U=\left\{ 1/(1-F) \right\} ^{\leftarrow }\). Then \(X{\mathop {=}\limits ^{d}}U(Y)\), where Y follows the Pareto (1) distribution. Since we have i.i.d. observations \(\left\{ X_1,\dots ,X_N \right\}\), we can write \(X_i{\mathop {=}\limits ^{d}}U(Y_i)\), where \(\left\{ Y_1,\dots ,Y_N \right\}\) is a random sample of Y. Recall that the N observations are stored in m machines with n observations each. For machine j, let \(Y_j^{(1)} \ge \cdots \ge Y_j^{(n)}\) denote the order statistics of the n Pareto (1) distributed variables corresponding to the n observations in this machine. Then \(M_j^{(i)}{\mathop {=}\limits ^{d}}U(Y_j^{(i)}), i=1,\dots ,n, j=1,\dots ,m\).
Proof of Proposition 1
We intend to replace t and tx in Lemma 4 by n/k and \(Y_j^{(i)}, i=1,\dots ,k+1, j=1,\dots ,m\), respectively. For this purpose, we introduce the set
By Lemma S.2 in the supplementary material of Chen et al. (2021), we have that for any \(t_0>1\), if condition (2) holds, then \(\lim _{N\rightarrow \infty }\mathbb {P}\left( \mathcal {F}_{t_0} \right) =1\). Then, we can apply the intended replacement to get that, as \(N\rightarrow \infty\),
where the \(o_P(1)\) term is uniform for all \(1\le i\le k+1\) and \(1\le j\le m\). By applying (11) twice for a general i and \(i=k+1\) and the inequality \(x^{\rho \pm \delta }/y^{\rho \pm \delta }\le (x/y)^{\rho \pm \delta }\) for any \(x,y>0\), we get that as \(N \rightarrow \infty\),
By taking the average across i and j, we obtain that
First, we handle \(I_1\). By Lemma 1, we have that,
where \(Y_{i}^{j,*}, i=1,\dots ,k,\ j=1,\dots ,m\), are i.i.d. Pareto (1) random variables. The central limit theorem yields that, as \(N \rightarrow \infty\), \(I_1 = \gamma P_N^{(1)} +o_P(1),\) where \(P_N^{(1)}\sim N(0,1)\).
For \(I_2\), write \(\delta _{j,n}=\left( k Y_{j}^{(k+1)} / n\right) ^{\rho }(k\rho )^{-1}\sum _{i=1}^k \left\{ \left( Y_{j}^{(i)} / Y_{j}^{(k+1)} \right) ^{\rho }-1 \right\}\). Then we have that \(I_2=\sqrt{km}A_0(n/k)m^{-1}\sum _{j=1}^m \delta _{j,n}\), where \(\delta _{j,n}, j=1,\dots ,m\) are i.i.d. random variables.
We are going to show that, as \(N \rightarrow \infty\),
If k is fixed, (13) follows directly from Lemma 3 (i) and the Lyapunov central limit theorem for triangular arrays.
Next, we handle the case when k is an intermediate sequence. In this case, in order to apply the Lyapunov central limit theorem with fourth moments, we need to calculate \(\text {Var}\left( \delta _{j,n} \right)\) and \(\mathbb {E}[\left\{ \delta _{j,n}-\mathbb {E}\left( \delta _{j,n} \right) \right\} ^4 ]\). Denote \(m_{n}^{(a)}:=\mathbb {E}\left\{ \left( \delta _{j,n} \right) ^a \right\}\) for \(a=1,2,3,4\). By Lemma 1, we have that,
First, we calculate \(\text {Var}\left( \delta _{j,n} \right)\). By Lemma 3, we have that,
where in the last step we used the fact that, as \(n\rightarrow \infty\), \(g(k,n,\rho )\rightarrow 1\) and \(g(k,n,2\rho )\rightarrow 1\). By Lemma 2, we have that, as \(n \rightarrow \infty\),
Hence, as \(n\rightarrow \infty\), \(\text {Var}\left( \delta _{j,n} \right) =k^{-1}(1-\rho )^{-2}\left( \left( 1-2\rho \right) ^{-1}+\rho ^2 \right) +o(k^{-1})\).
Next, we calculate \(\mathbb {E}[\left\{ \delta _{j,n}-\mathbb {E}\left( \delta _{j,n} \right) \right\} ^4 ]\). By Lemma 2 and Lemma 3, we have that, for \(a=3,4\), as \(N\rightarrow \infty\),
Note that,
By direct calculation, all terms of order \(k^{-1}\) and \(n^{-1}\) cancel out. Thus, as \(N \rightarrow \infty\), \(\mathbb {E}[\left\{ \delta _{j,n}-\mathbb {E}\left( \delta _{j,n} \right) \right\} ^4 ] =O(k^{-2})\). Combining the results for \(\text {Var}(\delta _{j,n})\) and \(\mathbb {E}[\left\{ \delta _{j,n}-\mathbb {E}\left( \delta _{j,n} \right) \right\} ^4 ]\), we conclude that the sequence \(\left\{ \delta _{j,n} \right\} _{j=1}^m\) satisfies Lyapunov’s condition. Then (13) follows by the central limit theorem. Applying (13), we obtain that, as \(N\rightarrow \infty\),
For \(I_3\), by the weak law of large numbers for triangular arrays, we have that, as \(N \rightarrow \infty\),
where the last equality follows by the condition \(\sqrt{km}A(n/k)B(n/k)=O(1)\).
For \(I_4\), by similar arguments as for \(I_3\), we obtain that, as \(N\rightarrow \infty\), \(I_4{\mathop {\rightarrow }\limits ^{P}} 0\). Combining \(I_1,I_2,I_3\) and \(I_4\), we have proved (i).
Next, we handle \(R_k^{(2)}\). By (12), we obtain that, as \(N\rightarrow \infty ,\)
For \(I_5\), by Lemma 1, we have that
The central limit theorem yields that as \(N \rightarrow \infty\), \(I_5 = \gamma ^2 P_N^{(2)}+o_P(1)\), where \(P_N^{(2)}\sim N(0,20)\). In addition, the covariance of \(P_N^{(1)}\) and \(P_N^{(2)}\) is equal to the covariance of \(\log Y_i^{j,*}\) and \(\left( \log Y_i^{j,*} \right) ^2\), where \(Y_{i}^{j,*}\) follows the Pareto (1) distribution. Hence, \(\text {Cov}(P_N^{(1)},P_N^{(2)})=4.\)
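Since \(\log Y_i^{j,*}\) is standard exponential with \(\mathbb {E}\left\{ (\log Y)^a \right\} =a!\), the limit variance and covariance above reduce to factorial arithmetic, which can be checked directly (our own illustration):

```python
from math import factorial

# moments of a standard exponential: E[(log Y)^a] = a!
EX = [factorial(a) for a in range(5)]
var_sq = EX[4] - EX[2] ** 2        # Var{(log Y)^2} = 24 - 4 = 20
cov_1_2 = EX[3] - EX[1] * EX[2]    # Cov{log Y, (log Y)^2} = 6 - 2 = 4
assert (var_sq, cov_1_2) == (20, 4)
```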
For \(I_6\), we write \(I_6 = 2\sqrt{km}A_0(n/k)m^{-1}\sum _{j=1}^m \eta _{j,n}\), where
are i.i.d. random variables for \(j=1,\dots ,m\). We can verify Lyapunov’s condition for the sequence \(\left\{ \eta _{j,n} \right\} _{j=1}^m\) following similar steps as those for \(\left\{ \delta _{j,n} \right\} _{j=1}^m\). Then, by applying the central limit theorem and Lemma 2, we obtain that
By the weak law of large numbers for triangular arrays, we have that
and
Combining the results for \(I_5,I_6,I_7\) and \(I_8\), we have proved (ii).
Finally, we handle \(R_k^{(3)}\). Also, by (12), we have that
By similar steps as those for handling the four terms \(I_5, I_6, I_7\) and \(I_8\), we can show that \(I_9 = \gamma ^3 P_N^{(3)}+o_P(1)\), where \(P_N^{(3)}\sim N(0,684)\), \(\text {Cov}(P_N^{(1)},P_N^{(3)})=18\) and \(\text {Cov}(P_N^{(2)},P_N^{(3)})=108\) (since \(\mathbb {E}\left\{ (\log Y)^a \right\} =a!\) for standard exponential \(\log Y\), this covariance equals \(5!-2!\,3!=108\)). Moreover,
which yields (iii). \(\square\)
Proof of Theorem 1
Applying Proposition 1 with \(k=k_{\rho }\), we have that, as \(N \rightarrow \infty\),
As a consequence, we have that, as \(N \rightarrow \infty\),
It follows that, as \(N\rightarrow \infty\),
and
By the condition (5), the dominating terms in the two expressions above are
respectively. Therefore, as \(N \rightarrow \infty\),
It follows that as \(N\rightarrow \infty\),
Theorem 1 is thus proved by applying the delta method. \(\square\)
Proof of Theorem 2
By Proposition 1, as \(N \rightarrow \infty\), \(R_{k_n}^{(1)}\) has the following asymptotic expansion:
which leads to
Together with the asymptotic expansion of \(R_{k_n}^{(2)}\), we have that, as \(N\rightarrow \infty\),
Thus, as \(N\rightarrow \infty\),
The relation \(k_n/k_{\rho }\rightarrow 0\) implies that \(A(n/k_n)/A(n/k_{\rho })\rightarrow 0\) as \(N \rightarrow \infty\). Thus, by Theorem 1, we have that, as \(N \rightarrow \infty\),
Together with the consistency of \(\hat{\rho }_{k_{\rho },\tau }\) and \(R_{k_n}^{(1)}\), we have that, as \(N \rightarrow \infty\),
Combining with Proposition 1, we obtain that, as \(N\rightarrow \infty\),
\(\square\)
Cite this article
Chen, L., Li, D. & Zhou, C. Adapting the Hill estimator to distributed inference: dealing with the bias. Extremes 25, 389–416 (2022). https://doi.org/10.1007/s10687-022-00440-y