Extremal clustering in non-stationary random sequences

It is well known that the distribution of extreme values of a strictly stationary sequence differs from that of an independent and identically distributed sequence in that extremal clustering may occur. The extent to which extremal clustering may occur is measured by the extremal index. Here we consider non-stationary sequences subject to suitable long range dependence restrictions. We first consider the case of dependent but identically distributed random variables. We find that the limiting distribution of appropriately normalized sample maxima depends on a parameter that measures the average extremal clustering of the sequence and generalizes the result of O'Brien (1987) for the extremal index. Based on this new representation we derive the asymptotic distribution for the times between consecutive extreme observations and use this to construct moment and likelihood based estimators for parameters that measure the potential for extremal clustering at a particular location in the sequence. We also discuss extensions to the case of different marginal distributions.


Introduction
Extreme value theory for strictly stationary sequences has been extensively studied, initiated in the works of Watson (1954), Berman (1964), Loynes (1965), and continued by Leadbetter (1974, 1983) and O'Brien (1987) amongst others. One of the key findings in this line of research is that, unlike in independent and identically distributed sequences where extreme values tend to occur in isolation, stationary sequences possess an intrinsic potential for clustering of extremes, i.e., several extreme values may be observed in quick succession. Understanding the extremal clustering characteristics of a random process is critical in many applications where a run of extreme values may have serious consequences. For example, if a sequence consists of daily temperatures at some fixed location then a cluster of extremes may correspond to a heatwave.
The extent to which extremal clustering may occur is naturally measured for stationary sequences by a parameter θ ∈ [0, 1] known as the extremal index. Suppose that the stationary sequence of random variables X_1, X_2, . . . has marginal distribution function F and let M_n = max{X_1, . . . , X_n}. In the special case where the random variables in the sequence are independent then, for any u ∈ R, P(M_n ≤ u) = F(u)^n, whereas in general, provided a suitable long range dependence restriction is satisfied, for large n, P(M_n ≤ x_n) = F(x_n)^{nθ}. Here, and in what follows, x_n is a sequence of real numbers converging to x_F = sup{x ∈ R : F(x) < 1} in such a way that nP(X_1 > x_n) → τ > 0 as n → ∞. Leadbetter (1983) showed that θ^{−1} is the limiting mean cluster size, and Hsing et al. (1988) further developed this result by considering the point process of exceedance times above a large threshold. In particular, Hsing et al. (1988) found that the point process of normalized exceedance times {i/n : X_i > x_n} converges weakly as n → ∞ to a compound Poisson process on [0, 1]. The compound nature of the Poisson process arises from the fact that the cluster positions follow a homogeneous Poisson process, each of which is marked by an independent realization from the so-called cluster size distribution, the mean of which is θ^{−1}. The cluster size distribution π is defined by π(j) = lim_{n→∞} π_n(j), for j ∈ N, where

π_n(j) = P( Σ_{i=1}^{p_n} 1(X_i > x_n) = j | Σ_{i=1}^{p_n} 1(X_i > x_n) > 0 )   (1.1)

and p_n = o(n). The stated result about the limiting mean cluster size can then be expressed as θ^{−1} = Σ_{j=1}^{∞} j π(j). In a more general non-stationary setting, one may expect the cluster size distribution to vary with time. Although we will not consider a point process approach here, the results presented in the following sections implicitly confirm such an expectation.
Another characterization of θ that links it to the extremal clustering properties of a stationary sequence can be found in O'Brien (1987). Defining M_{j,k} = max{X_i : j + 1 ≤ i ≤ k}, O'Brien (1987) found that the distribution function of M_n satisfies

P(M_n ≤ x_n) − F(x_n)^{nθ_n} → 0, as n → ∞,   (1.2)

where

θ_n = P(M_{1,p_n} ≤ x_n | X_1 > x_n),   (1.3)

for p_n = o(n), and, provided the limit exists, θ_n → θ as n → ∞. This result clarifies that smaller values of θ are indicative of a larger degree of extremal clustering, since the conditional probability in (1.3) is small when an exceedance of a large threshold is likely to soon be followed by another exceedance. Early attempts at estimating θ were based on associating θ^{−1} with the limiting mean cluster size. Different methods for identifying clusters gave rise to different estimators, well known examples being the runs and blocks estimators (Smith and Weissman, 1994). For the runs estimator, a cluster is identified as being initialized when a large threshold is exceeded and ends when a fixed number, known as the run length, of non-exceedances occur. The extremal index is then estimated by the ratio of the number of identified clusters to the total number of exceedances. A difficulty that arises when using this estimator is its sensitivity to the choice of run length (Hsing, 1991).
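To make the runs estimator concrete, the following is a minimal sketch (the function name and interface are illustrative, not from the literature): clusters are separated whenever at least `run_length` non-exceedances fall between two exceedances, and θ is estimated by the ratio of clusters to exceedances.

```python
import numpy as np

def runs_estimator(x, u, run_length):
    """Runs estimator of the extremal index: a cluster ends once at least
    `run_length` consecutive non-exceedances of the threshold u occur."""
    exceed = np.flatnonzero(np.asarray(x) > u)  # indices of exceedances
    if exceed.size == 0:
        return np.nan
    # gap g between consecutive exceedances means g - 1 non-exceedances;
    # a new cluster starts when g - 1 >= run_length, i.e. g > run_length
    gaps = np.diff(exceed)
    n_clusters = 1 + int(np.sum(gaps > run_length))
    return n_clusters / exceed.size
```

The sensitivity to the run length mentioned above is visible directly: increasing `run_length` merges clusters and decreases the estimate.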
The problem of cluster identification was carefully studied by Ferro and Segers (2003), who considered the distribution of the time between two exceedances of a large threshold. They found that the limiting distribution of appropriately normalized interexceedance times is indexed by θ. In particular, for a given threshold u ∈ R, they define the random variable T(u) = min{n ∈ N : X_{n+1} > u | X_1 > u}, and find that as n → ∞, F̄(x_n)T(x_n) converges in distribution to a mixture of a point mass at zero and an exponential distribution with mean θ^{−1}. Thus, by computing theoretical moments of this limiting distribution and comparing them with their empirical counterparts, they construct their so-called intervals estimator.
Motivated by the fact that many real world processes are non-stationary, in this paper we are led to investigate the effect on the extremal clustering properties of a random sequence when the stationarity assumption is dropped. Previous statistical works that invoke the concept of extremal clustering in a non-stationary sequence include Süveges (2007) and Coles et al. (1994). Süveges (2007) uses the likelihood function introduced by Ferro and Segers (2003) for the extremal index together with smoothing methods to capture non-stationarity in a time series of temperature measurements, whereas in a similar application, Coles et al. (1994) use a Markov model together with simulation techniques to estimate the extremal index within different months.
Early works that develop the extreme value theory for non-stationary sequences, with emphasis on the asymptotic distribution of sample maxima, include Hüsler (1983, 1986). Hüsler (1983) focuses on the case of sequences with a common marginal distribution and Hüsler (1986) considers the more general case where the margins may differ. While neither paper discusses extremal clustering or statistical inference, Hüsler (1986) discusses the difficulties of generalising the extremal index to non-stationary sequences.
In this paper we see that the notion of a time varying extremal index may be defined for a class of non-stationary sequences in a way that naturally extends the theory for stationary sequences. In the simple case of an identically distributed but not necessarily stationary sequence X_1, X_2, . . . with common marginal distribution function F, we find that, under similar assumptions to those in O'Brien (1987),

P(M_n ≤ x_n) − F(x_n)^{nγ_n} → 0, as n → ∞,   (1.4)

where

γ_n = n^{−1} Σ_{i=1}^{n} P(M_{i,i+p_n} ≤ x_n | X_i > x_n).   (1.5)

Thus, we find in this setting that the limiting distribution of the sample maximum at large thresholds is characterized by a parameter γ = lim_{n→∞} γ_n (provided the limit exists). By analogy with equation (1.3), we see from (1.5) that γ may be regarded as the average of local extremal indices, each of which measures the potential for extremal clustering at a particular location in the sequence.
We will see that these local extremal indices can be estimated using the intervals estimator of Ferro and Segers (2003) which is a function of the observed times between consecutive exceedances of a high threshold. In the case where the sequence is stationary, so that all terms in the summation (1.5) are equal, the formula for γ n reduces to θ n in (1.3). The result stated in equations (1.4) and (1.5) is reminiscent of the main result in de Haan (2015). In de Haan (2015), independent but non-identically distributed random variables with proportional tails are considered, a regime known as heteroscedastic extremes (see also Einmahl et al. (2016)). In such a setting, de Haan (2015) finds that the asymptotic distribution of appropriately normalized sample maxima is characterized by the Cesàro mean of the tail proportionality constants. We deduce a more general version of this result in Section 5 where we allow for dependence in the sequence.
The structure of the paper is as follows. Section 2 defines the notation and assumed mixing condition used throughout the paper and states the main theoretical results regarding the asymptotic distribution of the sample maxima and normalized interexceedance times in the case of a common marginal distribution. Section 3 discusses approaches to parameter estimation using the result from Section 2 on the distribution of the interexceedance times. Section 4 considers the estimation problem for two simple non-stationary Markov sequences with periodic dependence structures. Section 5 discusses extensions of the main results to sequences with different marginal distributions and Section 6 gives the proofs of the main theoretical results.

Notation and definitions
Unless otherwise stated, we will assume that all random variables in the sequence X_1, X_2, . . . have common marginal distribution F with upper endpoint x_F = sup{x ∈ R : F(x) < 1}, though we do not assume stationarity. We discuss the consequence of dropping the common marginal assumption in Section 5. We write F̄ = 1 − F. In addition to the definitions for M_n and M_{j,k} given in the introduction, we define M(A) = max{X_i : i ∈ A}, where A is an arbitrary set of positive integers, and denote the number of elements in the set A by |A|. We also refer to a set of consecutive integers as an interval. We denote asymptotic equivalence of the real valued functions f and g by f ∼ g, which means that lim_{x→∞} f(x)/g(x) = 1. For u ∈ R, the σ-algebra generated by the events {X_i > u : j_1 ≤ i ≤ j_2} is denoted by F_{j_1,j_2}(u), where j_1, j_2 ∈ N.
In all results in this section, we use the standard technique of block-clipping, see for example Section 10.2.1 in Beirlant et al. (2004), to divide the n observations into blocks of alternating sizes p_n and q_n, where p_n = o(n) and q_n = o(p_n). Specifically, we define intervals A_i by

A_i = {(i − 1)(p_n + q_n) + 1, . . . , i p_n + (i − 1) q_n}   (2.1)

for i = 1, 2, . . . , r_n, where r_n = ⌊n/(p_n + q_n)⌋, with A*_i denoting the separating interval of length q_n that follows A_i. We assume the asymptotic independence of maxima (AIM) mixing condition of O'Brien (1987), which restricts long range dependence and gives approximate independence of the random variables M(A_i) and M(A_j), i ≠ j.
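The blocking scheme in (2.1) can be written out directly; the short function below is an illustrative sketch (using 1-based indices, as in the text) rather than code from the paper.

```python
def blocks(n, p, q):
    """Big blocks A_i of equation (2.1): each has length p and consecutive
    blocks are separated by a gap of length q (the small blocks A*_i)."""
    r = n // (p + q)  # r_n = floor(n / (p_n + q_n))
    return [list(range((i - 1) * (p + q) + 1, i * p + (i - 1) * q + 1))
            for i in range(1, r + 1)]
```

For example, `blocks(10, 3, 1)` gives the two big blocks {1, 2, 3} and {5, 7}-style alternation: {1, 2, 3} and {5, 6, 7}, with positions 4 and 8, 9, 10 left to the gaps and remainder.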
Definition 2.1 (AIM(x_n)). The sequence {X_n}_{n=1}^∞ is said to have asymptotic independence of maxima relative to the sequence x_n ∈ R (AIM(x_n)) if there exists a sequence q_n of positive integers with q_n = o(n) such that, for any two intervals I_1 = {i_1, . . . , i_j} and I_2 = {i_j + q_n + 1, . . . , i_j + q_n + k} separated by q_n,

α_{n,q_n}(x_n) = max |P(M(I_1 ∪ I_2) ≤ x_n) − P(M(I_1) ≤ x_n) P(M(I_2) ≤ x_n)| → 0 as n → ∞,

where the maximum is taken over all positive integers i_1, i_j and k such that |I_1| ≥ q_n, |I_2| ≥ q_n and i_j + q_n + k ≤ n.

Remark:
In the following results we abbreviate the mixing coefficients α_{n,q_n}(x_n) in Definition 2.1 to α_n.

Asymptotic distribution of M n
Before giving the main result on the asymptotic distribution of M_n, we restate Lemma 3.1 from O'Brien (1987), amended to allow for non-stationarity.
Lemma 2.1. Let {X_n}_{n=1}^∞ have AIM(x_n) and let q_n and α_n be as in Definition 2.1. Set r_n = ⌊n/(p_n + q_n)⌋, and let {A_i}_{i=1}^{r_n} be as in equation (2.1) with

p_n = o(n), nα_n = o(p_n) and q_n = o(p_n).   (2.2)

Then

P(M_n ≤ x_n) = ∏_{i=1}^{r_n} P(M(A_i) ≤ x_n) + o(1).   (2.5)

Remarks:
If we can find a sequence q_n such that the conditions in Definition 2.1 hold, then we can automatically find a sequence p_n such that (2.2) holds, for example, by taking p_n = {n max(q_n, nα_n)}^{1/2}. Moreover, if s_n is a sequence of positive integers such that p_n = o(s_n) and s_n = o(n), then Lemma 2.1 holds with p_n replaced by s_n and r_n replaced by t_n = ⌊n/(s_n + q_n)⌋.

In the special case where the sequence is stationary, so that all terms in the product in (2.5) are equal, we have that P(M_{p_n} > x_n) = o(P(M_{s_n} > x_n)). To see this, suppose that P(M_n ≤ x_n) → L ∈ (0, 1) as n → ∞. Then P(M_{p_n} > x_n) ∼ 1 − L^{1/r_n} ∼ −r_n^{−1} log L, and similarly P(M_{s_n} > x_n) ∼ 1 − L^{1/t_n} ∼ −t_n^{−1} log L, and so the stated result is a consequence of the fact that t_n = o(r_n). By the stationarity of the sequence, it follows that if I_{1,n} and I_{2,n} are any two subsets of {1, 2, . . . , n} of lengths p_n and s_n respectively, then P(M(I_{1,n}) > x_n) = o(P(M(I_{2,n}) > x_n)).

In Theorem 2.1 below, we will require a weak version of this to hold in the non-stationary setting, and this is stated in equation (2.6). This condition, loosely speaking, says that we are much more likely to observe an exceedance of a large threshold within a large set of observations than within a small set. One way that such a condition may fail to hold is if the dependence in the sequence is too strong. For example, in the contrived case of complete dependence where X_j = X_1 for j ≥ 2, we have that P(M(I_1) > x_n) = P(M(I_2) > x_n) = P(X_1 > x_n) regardless of the sizes of I_1 and I_2. Of course, such a situation is excluded by the mixing condition in Definition 2.1; however, it would seem that Definition 2.1 does not imply (2.6) in the general non-stationary case. An obvious sufficient condition for (2.6) to be satisfied is that the terms in the product in (2.5) are equal, or at least asymptotically equivalent, so that the same argument as in the stationary case holds.
We may construct non-stationary sequences with periodicity in the dependence structure that satisfy this property. For example, suppose that X_1 ∼ N(0, 1) and, for n ≥ 1, let X_{n+1} = ρ_n X_n + (1 − ρ_n²)^{1/2} Z_n, where Z_n ∼ N(0, 1) independently of X_n and ρ_n is a periodic sequence, i.e., there is an h ∈ N such that ρ_n = ρ_{n+h} for all n. In such a case, M_{a,a+b} and M_{c,c+b} are identically distributed when c − a is a multiple of h. Non-stationary Markov models of this type are considered further in Section 4.
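A sequence of this form is easy to simulate; the sketch below (function name and interface are ours, for illustration) constructs the chain so that every X_t is standard normal by design.

```python
import numpy as np

def periodic_ar1(n, rho, seed=0):
    """Simulate X_{t+1} = rho_t X_t + (1 - rho_t^2)^{1/2} Z_t with
    Z_t ~ N(0, 1), where rho is a periodic sequence of lag-one
    correlations (period h = len(rho)); every X_t is N(0, 1)."""
    rng = np.random.default_rng(seed)
    h = len(rho)
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(n - 1):
        r = rho[t % h]
        x[t + 1] = r * x[t] + np.sqrt(1.0 - r**2) * rng.standard_normal()
    return x
```

Because the innovation variance is 1 − ρ_t², the standard normal margin is preserved exactly at every position, while the strength of local dependence varies with period h.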
We can now state the main result.

Theorem 2.1. Let {X_n}_{n=1}^∞ have AIM(x_n) and common marginal distribution function F, let p_n and q_n satisfy (2.2), and suppose that condition (2.6) holds. Then

P(M_n ≤ x_n) − F(x_n)^{nγ_n} → 0, as n → ∞,   (2.7)

where γ_n = n^{−1} Σ_{i=1}^{n} P(M_{i,i+p_n} ≤ x_n | X_i > x_n), and consequently, if nF̄(x_n) → τ > 0 and γ_n → γ as n → ∞, then

P(M_n ≤ x_n) → exp(−τγ).   (2.9)

Interexceedance times

Ferro and Segers (2003) provided a method for estimating the extremal index of a stationary sequence without the need for identifying independent clusters of extremes. This was achieved by considering the distribution of the time between two exceedances of a threshold u. Specifically, for a fixed u they define the random variable T(u) by

T(u) = min{n ∈ N : X_{n+1} > u | X_1 > u},   (2.10)
and investigate the distribution of T(u) as u approaches x_F. A simple normalization of T(u) gives rise to a non-degenerate limiting distribution. In particular, Ferro and Segers (2003) show that as n → ∞, F̄(x_n)T(x_n) converges in distribution to a mixture of a point mass at zero (with probability 1 − θ) and an exponential random variable with mean θ^{−1} (with probability θ). The mixture arises from the fact that the interexceedance times can be classified into two categories: within cluster and between cluster times. The mass at zero stems from the fact that the within cluster times, which tend to be small relative to the between cluster times, are dominated by the factor F̄(x_n), which goes to zero. In the stationary case, conditioning on the event X_1 > u in equation (2.10) defining T(u) may be replaced with X_i > u for any i ∈ N (then X_{n+1} must be replaced by X_{n+i}) without affecting the distribution of T(u). In the non-stationary case we must allow for the possibility that the distribution of interexceedance times may vary with location in the sequence. Thus we consider, for each i ∈ N and threshold u, the random variable T_i(u) defined by

T_i(u) = min{n ∈ N : X_{i+n} > u | X_i > u}.   (2.11)

We find that the distribution of F̄(x_n)T_i(x_n) converges as n → ∞ to a mixture of a mass at zero (with probability 1 − θ_i) and an exponential random variable with mean γ^{−1} (with probability θ_i), where θ_i is the extremal index at time point i (defined below). As in Ferro and Segers (2003), a slightly stronger mixing condition is required to derive this result than was needed for Theorem 2.1. We define the mixing coefficients

α*_{n,q}(u) = max_{1 ≤ l ≤ n−q} sup |P(B | A) − P(B)|,

where the supremum is over all A ∈ F_{1,l}(u) with P(A) > 0 and B ∈ F_{l+q,n}(u). We also use the notation θ_{i,n} = P(M_{i,i+p_n} ≤ x_n | X_i > x_n) and γ_n = n^{−1} Σ_{i=1}^{n} θ_{i,n}. The result on the limiting distribution of the interexceedance times is now given.
Theorem 2.2. Let {X_n}_{n=1}^∞ be a sequence of random variables with common marginal distribution F and suppose that nF̄(x_n) → τ. Suppose (2.6) holds and γ_n → γ ∈ [0, 1] and θ_{i,n} → θ_i ∈ [0, 1] as n → ∞. If there is a sequence of positive integers q_n = o(n) such that α*_{cn,q_n}(x_n) → 0 as n → ∞ for all c > 0, then for t > 0,

P(F̄(x_n)T_i(x_n) > t) → θ_i exp(−γt), as n → ∞.   (2.13)

Remark: For any c > 0 we have α_{cn,q_n}(x_n) ≤ α*_{cn,q_n}(x_n), where α and α* are the mixing coefficients from Theorems 2.1 and 2.2 respectively. Consequently, under the assumptions of Theorem 2.2, α_{cn,q_n}(x_n) → 0 as n → ∞ for all c > 0. If we define p_n = {cn max(q_n, cnα_{cn,q_n})}^{1/2} then p_n = o(cn), cnα_{cn,q_n} = o(p_n), q_n = o(p_n), and we may apply Theorem 2.1 to the first cn terms of the sequence to get P(M_{cn} ≤ x_n) = F(x_n)^{cnγ} + o(1).

Introduction
In this section we consider moment and likelihood estimators for the θ_i and γ based on the limiting distribution of normalized interexceedance times given in Theorem 2.2. We first show that the intervals estimator of Ferro and Segers (2003) may be used to estimate the θ_i and then consider likelihood based estimation along the lines of Süveges (2007). For simplicity, in the likelihood formulation, we shall assume a finite number of parameters to infer, say θ_1, . . . , θ_d.

Moment based estimators
The limiting result in Theorem 2.2 implies that the first two moments of F̄(u)T_i(u) satisfy

E{F̄(u)T_i(u)} = θ_i/γ + o(1),  E[{F̄(u)T_i(u)}²] = 2θ_i/γ² + o(1).

If we assume that the threshold is chosen to be suitably large, so that the o(1) terms can be dropped, we can solve these two equations to get the unknown parameters in terms of the moments of the normalized interexceedance times as

θ_i = 2[E{F̄(u)T_i(u)}]² / E[{F̄(u)T_i(u)}²],  γ = 2E{F̄(u)T_i(u)} / E[{F̄(u)T_i(u)}²].   (3.1)

A complication that arises in the non-stationary setting is that, since θ_i is defined via a probability conditional on the event X_i > u, if X_i does not exceed the threshold u then we will have no interexceedance times relevant to estimating θ_i. This problem does not arise in the stationary case, where every interexceedance time may be used to estimate the extremal index θ.
In order to estimate the θ_i, it is natural to assume that they are structured in some way, e.g., periodic or piecewise constant. Such a situation may arise, for example, through a periodicity in the dependence structure of the sequence. In this way, we obtain a set of interexceedance times t_1^{(i)}, . . . , t_{n_i}^{(i)} that can be used for estimating θ_i. Equation (3.1) suggests the estimator

θ̂_i = 2(Σ_{j=1}^{n_i} t_j^{(i)})² / {n_i Σ_{j=1}^{n_i} (t_j^{(i)})²},   (3.2)

in which the factors of F̄(u) cancel. Note that, while equation (3.1) also suggests an estimator for γ, this is based only on the interexceedances relevant to estimating θ_i and also requires an estimate of F̄(u). One possibility is to obtain d such estimates and take the mean of these as the estimate of γ. However, this estimator need not respect the relation γ = d^{−1} Σ_{i=1}^{d} θ_i, a consequence of the fact that we dropped the o(1) terms when solving the first two moment equations for F̄(u)T_i(u) for θ_i and γ.
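Under a periodic structure with period d, collecting the interexceedance times relevant to each θ_i amounts to grouping the observed times by the phase of the exceedance that starts each interval. A possible helper (ours, with 0-based phases, purely for illustration) is:

```python
import numpy as np

def interexceedance_times_by_phase(x, u, period):
    """Group observed interexceedance times by the phase (position modulo
    `period`) of the exceedance starting each interval, assuming a
    periodic structure theta_1, ..., theta_d with d = period."""
    exceed = np.flatnonzero(np.asarray(x) > u)
    times = [[] for _ in range(period)]
    for a, b in zip(exceed[:-1], exceed[1:]):
        times[int(a) % period].append(int(b - a))
    return times
```

Each list `times[i]` then plays the role of the sample t_1^{(i)}, . . . , t_{n_i}^{(i)} described above; as noted in the text, some lists may be empty at high thresholds.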
In the examples that we consider later, we estimate γ using the mean of the estimates for the θ i parameters.
We now investigate the bias of the estimator θ̂_i in equation (3.2). We note from (2.13) that for n ∈ N we have

P(T_i(x_n) > n) ≈ θ_i exp{−γF̄(x_n)n},

which motivates consideration of the positive integer valued random variable T defined by

P(T > n) = θ_i p^{γn}, n ∈ N,

where p ∈ (0, 1) and θ_i, γ ∈ (0, 1], and we may identify p with F(x_n). In a similar manner to Ferro and Segers (2003), we find that

E(T − 1) = θ_i p^γ/(1 − p^γ),  E{(T − 1)(T − 2)} = 2θ_i p^{2γ}/(1 − p^γ)².   (3.5)

A Taylor expansion of the right hand side of equation (3.5) about p = 1 shows that the moment estimator (3.2) has a non-vanishing first order bias. On the other hand, since

2{E(T − 1)}² / E{(T − 1)(T − 2)} = θ_i,

this motivates the estimator

θ̂_i = 2{Σ_{j=1}^{n_i}(t_j^{(i)} − 1)}² / [n_i Σ_{j=1}^{n_i}(t_j^{(i)} − 1)(t_j^{(i)} − 2)],   (3.8)

whose first order bias is zero. This estimator forms the key component of the intervals estimator of Ferro and Segers (2003), which we can use to estimate θ_i.
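The intervals estimator of Ferro and Segers (2003) combines the two moment-matching forms above; a minimal sketch, including their truncation at one and the fallback used when all observed gaps are small, might look like:

```python
import numpy as np

def intervals_estimator(times):
    """Intervals estimator of Ferro and Segers (2003) applied to the
    interexceedance times T_1, ..., T_N of a high threshold."""
    t = np.asarray(times, dtype=float)
    if t.max() <= 2:
        # no gap exceeds 2: use the uncorrected moment-based form
        theta = 2 * t.sum() ** 2 / (t.size * np.sum(t ** 2))
    else:
        # bias-corrected form based on (T - 1)(T - 2)
        theta = 2 * np.sum(t - 1) ** 2 / (t.size * np.sum((t - 1) * (t - 2)))
    return min(1.0, theta)
```

In the non-stationary setting of this section, the same function would simply be applied separately to the interexceedance times collected for each position i.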

Maximum likelihood estimation
Theorem 2.2 also allows for the construction of the likelihood function for the vector of unknown parameters. This is an attractive approach due to the modelling possibilities that become available, e.g., parameterizing θ_i as a function of i, smoothing spline interpolation methods or the use of Bernstein polynomials. However, as discussed in Ferro and Segers (2003) in the stationary case, problems arise with maximum likelihood estimation due to the uncertainty in how to assign the interexceedance times to the components of the limiting mixture distribution. Since the asymptotically valid likelihood is used as an approximation at some subasymptotic threshold u, all observed normalized interexceedance times are strictly positive. Assigning all interexceedance times to the exponential part of the limiting mixture means that they are all being classified as between cluster times. This is tantamount to exceedances of a large threshold occurring in isolation, and so the maximum likelihood estimator based on this (typically misspecified) likelihood converges to 1 in distribution regardless of the true underlying value of θ.
A partial solution to this problem was proposed in Süveges (2007) for a certain class of sequences having the property that the limiting extremal clusters consist only of runs of consecutive exceedances. This contrasts with the more general situation where, within clusters, some observations may fall below the threshold. For a stationary sequence, the technical condition required for the model of consecutive exceedances within clusters is D^{(2)}(x_n), due to Chernick et al. (1991). This condition states that nP(X_1 > x_n, X_2 ≤ x_n, M_{2,p_n} > x_n) → 0, where p_n = o(n), i.e., the probability of again exceeding the threshold x_n after dropping below it within a cluster falls to zero (faster than n^{−1}) as n → ∞. For such a sequence, the conditional probability characterization of θ in O'Brien (1987) reduces to θ = lim_{n→∞} P(X_2 ≤ x_n | X_1 > x_n). A diagnostic plot was suggested by Süveges (2007) to assess the plausibility of this assumption from a given realization of a stationary process, and it is found that maximum likelihood estimation outperforms the intervals estimator, in the sense of lower root mean squared error, for sequences satisfying this assumption.
If we were to make a similar assumption in the non-stationary case, e.g., nP(X_i > x_n, X_{i+1} ≤ x_n, M_{i,i+p_n} > x_n) → 0 for each i, so that the consecutive exceedances model for clusters is accurate, we obtain the likelihood function as

L_i(θ_1, . . . , θ_d) = ∏_{j=1}^{n_i} (1 − θ_i)^{1(t_j^{(i)} = 1)} {θ_i γ F̄(x_n) exp(−γF̄(x_n) t_j^{(i)})}^{1(t_j^{(i)} > 1)},   (3.10)

where {t_j^{(i)}}_{j=1}^{n_i} are the interexceedance times for estimating θ_i. The full log-likelihood is then

ℓ(θ_1, . . . , θ_d) = Σ_{i=1}^{d} Σ_{j=1}^{n_i} [1(t_j^{(i)} = 1) log(1 − θ_i) + 1(t_j^{(i)} > 1){log θ_i + log γ + log F̄(x_n) − γF̄(x_n) t_j^{(i)}}],

where γ = d^{−1} Σ_{i=1}^{d} θ_i, and in practice F̄(x_n) must be replaced with an estimate. Unlike in the stationary case, the likelihood equations do not have a closed form solution, essentially due to the dependence of γ on all the θ_i; however, equation (3.11) is easily optimized numerically provided d is not too large. If d is large, it is more natural to parameterize θ_i in terms of a small number of parameters which we may estimate by maximum likelihood, or to consider nonparametric estimation along the lines of Einmahl et al. (2016).
We may generalise this idea and assign all interexceedance times no greater than some value k to the zero part of the mixture, so that the corresponding expression for L_i becomes

L_i(θ_1, . . . , θ_d) = ∏_{j=1}^{n_i} (1 − θ_i)^{1(t_j^{(i)} ≤ k)} {θ_i γ F̄(x_n) exp(−γF̄(x_n) t_j^{(i)})}^{1(t_j^{(i)} > k)}.   (3.12)

In the stationary setting the technical condition required is nP(X_1 > x_n, M_{1,k+1} ≤ x_n, M_{k+1,p_n} > x_n) → 0 (Chernick et al., 1991). Selection of an appropriate value of k is equivalent to the selection of the run length for the runs estimator, and this problem is considered in Süveges and Davison (2010). However, in a non-stationary setting, where the clustering characteristics of the sequence may change in time, the appropriate value of k may also be time varying, so that k should be replaced with k_i in equation (3.12). If too small a value is selected for k_i then some of the interexceedance times may be wrongly assigned to the exponential component of the likelihood, leading to an overestimate of θ_i, whereas if k_i is selected to be too large then we tend to underestimate θ_i.
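As a sketch of the numerical optimization, assuming the mixture form described above (times no greater than k contribute mass 1 − θ_i, larger times an exponential density with rate γF̄(u), and γ taken as the mean of the θ_i), one might write:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, times_by_i, fbar, k=0):
    """Negative log-likelihood for (theta_1, ..., theta_d); a sketch of
    the mixture model in the text, with fbar an estimate of Fbar(u)."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta <= 0.0) or np.any(theta > 1.0):
        return np.inf                      # keep the optimizer inside (0, 1]
    gamma = theta.mean()                   # gamma = d^{-1} * sum_i theta_i
    ll = 0.0
    for th, t in zip(theta, times_by_i):
        t = np.asarray(t, dtype=float)
        n_zero, big = int(np.sum(t <= k)), t[t > k]
        if n_zero:
            if th == 1.0:
                return np.inf              # theta_i = 1 puts no mass at zero
            ll += n_zero * np.log(1.0 - th)
        ll += big.size * np.log(th * gamma * fbar) - gamma * fbar * big.sum()
    return -ll

def fit_thetas(times_by_i, fbar, k=0):
    """Numerically maximize the likelihood over (theta_1, ..., theta_d)."""
    d = len(times_by_i)
    res = minimize(neg_log_lik, x0=np.full(d, 0.5),
                   args=(times_by_i, fbar, k), method="Nelder-Mead")
    return res.x
```

As noted in the text, no closed form solution exists because γ couples all the θ_i, but direct numerical optimization is straightforward for moderate d.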

Introduction
In this section we consider two simple examples of non-stationary Markov sequences with a periodic dependence structure and common marginal distributions. In the first example, no limiting extremal clustering occurs at any position in the sequence, so that θ_i = 1 for each i, in contrast to the second example where θ_i < 1 for each i. We note that for sufficiently well behaved stationary Markov processes, mixing conditions much stronger than those considered in Section 2 hold (Athreya and Pantula, 1986). For example, for the autoregressive sequence X_{n+1} = ρX_n + ε_n, |ρ| < 1, with {ε_n}_{n=1}^∞ independent and identically distributed as N(0, 1 − ρ²), Theorems 1 and 2 of Athreya and Pantula (1986) give that the mixing conditions of Section 2 hold for any sequence q_n such that q_n → ∞ and q_n = o(n), for any x_n. One would naturally expect that making small changes to the dependence structure of a stationary Markov process, such as letting ρ in the previous example vary along the sequence, would not significantly alter the mixing properties of the sequence. For the two examples that follow this is indeed the case; see for example Bradley (2005).

Normal autoregressive model
Stationary sequences X_1, X_2, . . . , where each X_i is a standard normal random variable, are extensively studied in Chapter 4 of Leadbetter et al. (1983). It is shown there that if the lag n autocorrelation ρ(n) satisfies ρ(n) log n → 0 as n → ∞, a condition originally due to Berman (1964), then the extremal index θ of the sequence equals one and thus no limiting extremal clustering occurs. For example, the autoregressive process X_{n+1} = ρX_n + ε_n, where ε_n ∼ N(0, 1 − ρ²), has extremal index one provided |ρ| < 1, since the lag n correlation is ρ^n. This is a special case of a more general result that says that a stationary asymptotically independent Markov sequence has an extremal index of one (Smith, 1992). We say that the sequence X_1, X_2, . . . with common marginal distribution function F is asymptotically independent at lag k if P(X_{n+k} > u | X_n > u) → 0 as u → x_F, and asymptotically independent if it is asymptotically independent at all lags.
Non-stationary normal sequences are considered in Chapter 6 of Leadbetter et al. (1983), although the emphasis is on the limiting distribution of the sample maximum rather than clustering. Here, we consider a simple non-stationary autoregressive model X_{n+1} = ρ_n X_n + ε_n with ε_n ∼ N(0, 1 − ρ_n²), so that X_n ∼ N(0, 1) for each n, and specify a periodic lag one correlation function ρ_{n+1} = 0.5 + 0.25 sin(2πn/7) for n ≥ 0. We note that ρ(n) log n → 0 as n → ∞ and the sequence is asymptotically independent. Applying Theorem 6.3.4 of Leadbetter et al. (1983), and comparing the non-stationary sequence to an independent standard normal sequence, we deduce that P(M_n ≤ x_n) − Φ(x_n)^n → 0 as n → ∞, where Φ is the standard normal distribution function, and thus conclude that γ = 1 and θ_i = 1 for i = 1, . . . , 7.
We simulated 1000 realizations of this sequence of length 10^4 and, for each realization, estimated θ_1, . . . , θ_7 and γ for a range of high thresholds, using both the intervals estimator and maximum likelihood with k in equation (3.12) equal to zero and one. We then repeated this procedure for sequences of length 10^5. We found that maximum likelihood with k = 0 performed best, as measured by root mean squared error in γ, at all but one combination of threshold and sample size. The one exception was for a sample size of 10^4 and the threshold equal to the 0.99 theoretical quantile, for which the expected total number of exceedances is 100, giving approximately 14 interexceedance times to estimate each θ_i. However, in this case the root mean squared errors for γ when using the intervals estimator and maximum likelihood with k = 0 only differed in the third decimal place. Maximum likelihood with k = 1 typically performed slightly poorer than the intervals estimator. This is not surprising, as we will assign several interexceedance times to the zero component of the likelihood whereas, asymptotically, they all belong to the exponential component. Table 1 shows the 0.025 and 0.975 quantiles of the sampling distributions of the maximum likelihood estimators with k = 0. In the table, u = q_p denotes the threshold that has probability p of being exceeded at each time point, i.e., P(X_i > q_p) = p. We note that although the true value of each θ_i is 1, so that no extremal clustering occurs in the limit as u → x_F, clustering may occur at subasymptotic levels. Moreover, there will tend to be more subasymptotic clustering in the sequence at positions with a corresponding larger lag one autocorrelation (larger ρ_i).
This point has been thoroughly discussed in the context of stationary sequences and estimation of the extremal index (Ancona-Navarrete and Tawn, 2000; Eastoe and Tawn, 2012), and leads to the notion of a subasymptotic or threshold based extremal index. In the non-stationary setting we may define a threshold based version of θ_i. Specifically, for a fixed threshold u and positive integer m, we define

θ_i(u, m) = P(M_{i,i+m} ≤ u | X_i > u),   (4.1)

which provides a measure of the potential for clustering above the threshold u at position i. The probability in equation (4.1) can be evaluated by integration of the joint distribution of (X_i, . . . , X_{i+m}), which in this case is multivariate normal, normalized by P(X_i > u). We performed this calculation for i equal to 3 and 6, for various combinations of u and m, using the technique for evaluating multivariate normal probabilities given in Genz (1992) and implemented in the R package mvtnorm. The results are presented in Table 2. The reason for considering i equal to 3 and 6 is that these positions have respectively the highest and lowest associated lag one autocorrelations, namely ρ_3 = 0.74 and ρ_6 = 0.26, and consequently correspond to the positions where there is greatest and least subasymptotic clustering potential. This feature is clearly seen in Table 2, which shows that θ_3(u, m) < θ_6(u, m) for each combination of u and m. Another feature of the subasymptotic extremal indices that is clear from Table 2 is that there is less extremal clustering as the threshold increases, i.e., θ_i(u_1, m) ≤ θ_i(u_2, m) when u_1 ≤ u_2, so that θ_i(u, m) is a monotone function of u for fixed m. In fact, for the normal autoregressive model we have, for fixed m, θ_i(u, m) → 1 as u → x_F for each i, and we can use results from Ledford and Tawn (1996, 1997) to make this more precise. First, note that

P(X_{i+1} > u | X_i > u) ≤ 1 − θ_i(u, m) ≤ Σ_{n=1}^{m} P(X_{i+n} > u | X_i > u).   (4.3)

By the asymptotic independence of the sequence, all conditional probabilities in (4.3) tend to zero as u → x_F, which verifies that θ_i(u, m) → 1 as u → x_F for fixed m.
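The calculation behind Table 2 can be sketched in Python, using the identity θ_i(u, m) = {P(X_{i+1}, . . . , X_{i+m} ≤ u) − P(X_i, . . . , X_{i+m} ≤ u)}/P(X_i > u) and scipy's multivariate normal CDF (which is based on Genz-type algorithms, similar to mvtnorm). The function and its 0-based indexing convention are ours, for illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def theta_threshold(i, u, m, rho):
    """Threshold-based extremal index theta_i(u, m) of equation (4.1) for
    the periodic normal AR(1) model; rho[t] = cor(X_t, X_{t+1}) (0-based,
    period len(rho))."""
    h = len(rho)
    cov = np.eye(m + 1)  # covariance of (X_i, ..., X_{i+m})
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            # cor(X_{i+a}, X_{i+b}) is the product of the intervening
            # lag-one correlations
            r = float(np.prod([rho[(i + t) % h] for t in range(a, b)]))
            cov[a, b] = cov[b, a] = r
    upper = np.full(m + 1, float(u))
    p_all = multivariate_normal(np.zeros(m + 1), cov).cdf(upper)
    p_rest = multivariate_normal(np.zeros(m), cov[1:, 1:]).cdf(upper[1:])
    return (p_rest - p_all) / norm.sf(u)
```

With all lag-one correlations zero this reduces to Φ(u)^m, which provides a simple sanity check on the implementation.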
Table 1: 0.025 and 0.975 quantiles of the sampling distributions for θ_1, . . . , θ_7, γ, in the normal autoregressive model using maximum likelihood. These are based on 1000 realizations of the process for different sample sizes n and thresholds u.

n      u       θ_1       θ_2       θ_3       θ_4       θ_5       θ_6       θ_7       γ
10^4   q_0.10  .50, 1.0  .50, 1.0  .50, 1.0  .50, 1.0  .50, 1.0  .50, 1.0  .50, 1.0  .83, .97
10^4   q_0.05  .47, 1.0  .49, 1.0  .49, 1.0  .49, 1.0  .48, 1.0  .49, 1.0  .50, 1.0  .83, .96
10^4   q_0.01  .43, 1.0  .38, 1.0  .39, 1.0  .40, 1.0  .43, 1.0  .40, 1.0  .37, 1.0  .79, .94
10^5   q_0.10  .61, 1.0  .61, 1.0  .61, 1.0  .61, 1.0  .61, 1.0  .61, 1.0  .61, 1.0

Applying the results of Ledford and Tawn (1996, 1997) for the joint tails of bivariate normal random vectors, we obtain the rate at which the conditional probabilities in (4.3) tend toward zero:

P(X_{i+n} > q_p | X_i > q_p) = L_n(p) p^{(1−ρ_{i,i+n})/(1+ρ_{i,i+n})},

where P(X_i > q_p) = p, L_n is slowly varying at zero, and ρ_{i,i+n} = cor(X_i, X_{i+n}) = ∏_{j=0}^{n−1} ρ_{i+j} is the correlation between X_i and X_{i+n}. Consequently, we have for n ≥ 2, P(X_{i+n} > q_p | X_i > q_p) = o{P(X_{i+1} > q_p | X_i > q_p)} as p → 0, so that the bounds in (4.3) tend to zero at the same rate, and so

1 − θ_i(q_p, m) ∼ P(X_{i+1} > q_p | X_i > q_p)

as p → 0.

Bivariate logistic dependence
Consider a stationary Markov sequence $X_1, X_2, \ldots$ with unit Fréchet margins, i.e. $F(x) = e^{-1/x}$ for $x > 0$, and $(X_n, X_{n+1})$ following a bivariate logistic distribution with joint distribution function
$$P(X_n \le x,\, X_{n+1} \le y) = \exp\{-(x^{-1/\alpha} + y^{-1/\alpha})^{\alpha}\}, \qquad 0 < \alpha \le 1. \qquad (4.6)$$
The parameter $\alpha$ in equation (4.6) controls the strength of dependence between adjacent terms in the sequence, with $\alpha = 1$ corresponding to independence and $\alpha \to 0$ giving complete dependence. Such a sequence exhibits asymptotic dependence provided $\alpha < 1$; in particular, $\lim_{u\to\infty} P(X_{n+1} > u \mid X_n > u) = 2 - 2^{\alpha}$. By exploiting the Markov structure of the sequence, precise calculation of $\theta$ can be achieved using the numerical methods described in Smith (1992), where it is found, for example, that the sequence with $\alpha = 1/2$ has $\theta = 0.328$. The case $\alpha = 1/2$ is also considered in Süveges (2007) and Süveges and Davison (2010), where the problem of estimating $\theta$ using maximum likelihood is considered. Based on diagnostic plots, Süveges (2007) concludes that the $D^{(2)}(x_n)$ condition is not satisfied for this sequence, and moreover, that the maximum likelihood estimator for $\theta$ based on a run length of $k = 1$ has a bias of around 20%. Süveges and Davison (2010) find that a more suitable run length is $k = 5$, in which case the maximum likelihood estimator for $\theta$ has lower root mean squared error than the intervals estimator. Naturally, smaller values of $\alpha$ will tend to be associated with larger values of the run length $k$, though the precise nature of this relation is unclear.
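The limit $P(X_{n+1} > u \mid X_n > u) \to 2 - 2^{\alpha}$ can be checked numerically from the joint distribution function (4.6). A minimal sketch (our own, not from the paper), using `expm1` to avoid catastrophic cancellation at large $u$:

```python
import math

def cond_exceed(u, alpha):
    """P(X_{n+1} > u | X_n > u) for the bivariate logistic with unit Frechet margins:
    joint df F(x, y) = exp(-(x^{-1/alpha} + y^{-1/alpha})^alpha), so F(u, u) = exp(-2^alpha / u)."""
    a, b = 1.0 / u, 2.0 ** alpha / u
    joint = -2.0 * math.expm1(-a) + math.expm1(-b)  # P(X > u, Y > u) = 1 - 2 e^{-1/u} + F(u, u)
    return joint / (-math.expm1(-a))                # divide by P(X > u) = 1 - e^{-1/u}

# For alpha = 1/2 the limit is 2 - 2^{1/2} ~ 0.586; alpha = 1 gives asymptotic independence.
```

At $u = 10^4$ the ratio is already within $10^{-3}$ of the limiting value for $\alpha = 1/2$, while for $\alpha = 1$ it is of order $1/u$, consistent with asymptotic independence.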
We consider a non-stationary version of such a sequence, so that again $X_1, X_2, \ldots$ are unit Fréchet and the joint distribution of $(X_n, X_{n+1})$ is given by
$$P(X_n \le x,\, X_{n+1} \le y) = \exp\{-(x^{-1/\alpha_n} + y^{-1/\alpha_n})^{\alpha_n}\}. \qquad (4.7)$$
Non-stationarity is induced by specifying a non-constant sequence $\alpha_n$ of dependence parameters. We again consider a periodic dependence structure, specified by the sequence $\alpha_{n+1} = 0.5 + 0.25\sin(2\pi n/7)$ for $n \ge 0$. Note that although we have specified the same parametric form for the dependence parameters $\alpha$ as in the previous example for $\rho$, the two parameters are not directly comparable.
As in the previous example, we simulated 1000 realizations of this process, of lengths $10^4$ and $10^5$. Table 3 shows the 0.025 and 0.975 quantiles of the sampling distributions for the intervals estimators of $\theta_1, \ldots, \theta_7, \gamma$ at a range of different thresholds. Although we do not know the true parameter values, presumably the true values of $\theta_6$ and $\theta_3$ are the smallest and largest values respectively, since positions 6 and 3 correspond to the greatest and least extremal dependence, as measured through $\alpha_n$. The non-overlapping nature of the intervals for $\theta_3$ and $\theta_6$ in Table 3 gives evidence in favour of $\theta_6 < \theta_3$, the one exception being for sample size $10^4$ and a threshold equal to the 0.99 theoretical quantile, in which case there are too few exceedances to estimate the parameters with any precision.
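The intervals estimator referred to above is, in the stationary setting, that of Ferro and Segers (2003); our position-specific estimators adapt it to pooled exceedances at each location. A sketch of the stationary version (our own implementation, not the paper's code), checked on two simulated series whose extremal indices are known: an iid unit Fréchet series ($\theta = 1$) and the moving maximum $Y_i = \max(Z_i, Z_{i+1})$ ($\theta = 1/2$):

```python
import numpy as np

def intervals_estimator(x, u):
    """Intervals estimator of the extremal index (Ferro and Segers, 2003)."""
    times = np.flatnonzero(x > u)     # exceedance times of threshold u
    T = np.diff(times).astype(float)  # interexceedance times
    if T.size == 0:
        return 1.0
    if T.max() <= 2:
        est = 2 * T.sum() ** 2 / (T.size * np.sum(T ** 2))
    else:
        est = 2 * np.sum(T - 1) ** 2 / (T.size * np.sum((T - 1) * (T - 2)))
    return min(1.0, est)

rng = np.random.default_rng(0)
z = 1.0 / -np.log(rng.random(100_001))  # iid unit Frechet by inversion
y = np.maximum(z[:-1], z[1:])           # moving maximum: theta = 1/2
th_iid = intervals_estimator(z, np.quantile(z, 0.95))
th_mm = intervals_estimator(y, np.quantile(y, 0.95))
```

The estimator requires no declustering parameter such as a run length $k$, which is the property exploited in the comparisons reported in Table 3.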
Finally, we note that the maximum likelihood estimator for $\theta_i$ based on a run length of $k = 1$ is biased toward $2^{\alpha_i} - 1$, which would be the true value if the consecutive exceedances model for the clusters were valid. Naturally, larger values of $k$ give rise to smaller estimates, and a value of $k = 5$ was found to behave similarly to the intervals estimator in the regions of the sequence with higher extremal dependence (lower $\alpha$). These features can be seen in Figure 1, which shows the median value of the 1000 estimates for each parameter using the intervals estimator and maximum likelihood with $k$ equal to 1 and 5, for a sample size of $10^5$ and threshold equal to the 0.95 theoretical quantile. It is plausible that different values of $k$ may be appropriate at different positions of the sequence, due to the differing strengths of extremal dependence throughout the sequence, though one would imagine that in a real-world data set any changes in the dependence structure would be slower and more subtle.

Introduction
In this section, we discuss the consequences for Theorem 2.1 of dropping the assumption of common marginal distributions. We see that the limiting distribution of the sample maximum at high thresholds is again characterized by local extremal indices, though not in general by γ (see equation (5.3)). We consider our results in the context of heteroscedastic extremes, a type of non-stationarity that has been at the centre of recent research (Einmahl et al., 2016).

The general case
A careful reading of the proof of Theorem 2.1 shows that equation (2.7) holds even if we drop the assumption of common marginal distributions, though some of the assumptions need modification. This is a consequence of the fact that the validity of Lemma 2.1 depends on the mixing condition and not on the marginal distributions. However, condition (2.4) is meaningless when the margins are different. In the presence of different marginal distributions we may assign a different sequence of thresholds to each time point, i.e., for each $i \in \mathbb{N}$ we have a corresponding sequence $x_{n,i}$. Condition (2.4) is then replaced by the assumption
$$\lim_{n\to\infty} \max\{n\bar F_i(x_{n,i}) : 1 \le i \le n\} = \tau > 0, \quad \text{or} \quad \limsup_{n\to\infty} \max\{n\bar F_i(x_{n,i}) : 1 \le i \le n\} < \infty.$$
Subsequently, assumption (2.6) is replaced by $P(\cap_{i\in I_{1,n}} \{X_i > x_{n,i}\}) = o(P(\cap_{i\in I_{2,n}} \{X_i > x_{n,i}\}))$. The proof of Theorem 2.1, which is given in the case of common margins, holds with only notational modification in this more general case. In this general case, equation (2.7) becomes
$$P\Big(\bigcap_{j=1}^{n}\{X_j \le x_{n,j}\}\Big) = \exp\Big(-\sum_{j=1}^{n} \bar F_j(x_{n,j})\,\theta_j\Big) + o(1), \qquad (5.3)$$
where $\theta_j = \lim_{n\to\infty} P\big(\bigcap_{k=j+1}^{j+p_n}\{X_k \le x_{n,k}\} \mid X_j > x_{n,j}\big)$. As a special case, consider the situation where there are heteroscedastic extremes, i.e., changes in the scale of extremes, as in Einmahl et al. (2016). That is, assume there is a sequence $\{c_j\}$ of positive real numbers and a reference distribution function $F$ such that for all $j \in \mathbb{N}$, $\lim_{n\to\infty} \bar F_j(x_{n,j})/\bar F(x_{n,j}) = c_j$. Assume for simplicity that we may use the same sequence of thresholds at each time point, that is $x_{n,j} = x_n$ for each $j \in \mathbb{N}$, and assume the sequence $\{c_j\theta_j\}_{j=1}^{\infty}$ is Cesàro summable with $n^{-1}\sum_{j=1}^{n} c_j\theta_j \to \theta_c$ as $n \to \infty$. Then equation (5.3) in this case gives
$$P(M_n \le x_n) = \exp\big(-n\bar F(x_n)\,\theta_c\big) + o(1)$$
as $n \to \infty$, from which we see that the limiting distribution of the sample maxima is characterised by the parameter $\theta_c$, which is formed from a mixture of the scedasis and clustering sequences $c_j$ and $\theta_j$ respectively.
In the case where θ j = 1 for each j, which happens for example for independent sequences, we generalise the result of de Haan (2015).
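The independent case $\theta_j = 1$ admits a quick numerical illustration (our own sketch, with an assumed periodic scedasis-type scale sequence): take independent $X_j$ that are Fréchet with scale $c_j$, so that $\bar F_j(x)/\bar F(x) \to c_j$ for the unit Fréchet reference $F$, and compare the empirical $P(M_n \le x_n)$ with $\exp(-\tau\theta_c)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, reps = 1000, 1.0, 5000
c = 0.5 + 0.25 * np.sin(2 * np.pi * np.arange(1, n + 1) / 7)  # assumed scale sequence
x_n = n / tau      # threshold chosen so n * Fbar(x_n) -> tau for the unit Frechet reference
theta_c = c.mean() # theta_j = 1 for independent X_j, so theta_c is the Cesaro mean of c_j

# X_j Frechet with scale c_j: P(X_j <= x) = exp(-c_j / x), simulated by inversion.
X = c / -np.log(rng.random((reps, n)))
est = np.mean(X.max(axis=1) <= x_n)
# est should be close to exp(-tau * theta_c)
```

Here the agreement is exact up to Monte Carlo error, since $P(M_n \le x_n) = \exp(-\sum_j c_j / x_n)$ for independent Fréchet variables.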

Auxiliary results
In this section we state and prove a few lemmas that are required in the proof of Theorem 2.1. With the exception of Lemma 6.5, these are generalizations of Theorem 1.5.1 from Leadbetter et al. (1983).

Comment:
The case $\tau = \infty$ can be dealt with similarly, though we will have no need for this case here.
Lemma 6.2. Let $\{t_n\}$ be a sequence of positive integers such that $t_n \to \infty$ and let $\{a_n\}$ be a non-negative sequence such that $a_n \to 0$ and $t_n a_n$ is bounded above as $n \to \infty$. Then
$$(1 - a_n)^{t_n} - e^{-t_n a_n} \to 0 \quad \text{as } n \to \infty. \qquad (6.4)$$
Proof. This follows from Lemma 6.1 by considering subsequences along which $t_n a_n$ converges.
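As a purely illustrative sanity check of (6.4), take $t_n = n$ and $a_n = \lambda/n$, so that $t_n a_n = \lambda$ is bounded, and watch the gap shrink:

```python
import math

def gap(n, lam=2.0):
    """|(1 - a_n)^{t_n} - exp(-t_n a_n)| with t_n = n and a_n = lam / n (an assumed example)."""
    a = lam / n
    return abs((1.0 - a) ** n - math.exp(-n * a))

# The gap decreases toward zero as n grows, consistent with Lemma 6.2.
```

For $\lambda = 2$ the gap is of order $1/n$, since $(1 - \lambda/n)^n = e^{-\lambda}(1 + O(1/n))$.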
(6.5) $\Longrightarrow$ (6.6): Conversely, suppose that (6.5) holds. After negating and exponentiating, it is sufficient to show that the last product in expression (6.8) converges to 1. Rewrite this product as $\prod_{i=1}^{t_n}(1 + R_n^{(i)})$ and, writing $R_n^{\max} = \max_{1\le i\le t_n} R_n^{(i)}$ and $R_n^{\min} = \min_{1\le i\le t_n} R_n^{(i)}$, observe that both $t_n R_n^{\max} \to 0$ and $t_n R_n^{\min} \to 0$ as $n \to \infty$, since $t_n R_n^{(i)} \to 0$ for each $i$. Then
$$\big(1 + R_n^{\min}\big)^{t_n} \le \prod_{i=1}^{t_n}\big(1 + R_n^{(i)}\big) \le \big(1 + R_n^{\max}\big)^{t_n}, \qquad (6.9)$$
and by Lemma 6.1 we have $(1 + R_n^{\min})^{t_n} \to e^0 = 1$ and $(1 + R_n^{\max})^{t_n} \to e^0 = 1$. Hence, from (6.9) we see that $\prod_{i=1}^{t_n}\big(1 + R_n^{(i)}\big) \to 1$ as required.
Lemma 6.4. Let $\{t_n\}$ be a sequence of positive integers such that $t_n \to \infty$ and let $\{a_n^{(1)}\}_{n=1}^{\infty}, \{a_n^{(2)}\}_{n=1}^{\infty}, \ldots$ be non-negative sequences such that for each $i \in \mathbb{N}$, $a_n^{(i)} \to 0$ and $t_n a_n^{(i)}$ is bounded above as $n \to \infty$. Then
$$\prod_{i=1}^{t_n}\big(1 - a_n^{(i)}\big) - \exp\Big(-\sum_{i=1}^{t_n} a_n^{(i)}\Big) \to 0 \quad \text{as } n \to \infty.$$
Proof. This follows from Lemma 6.3 by considering subsequences along which $\sum_{i=1}^{t_n} a_n^{(i)}$ converges.

Proof of Theorem 2.1.
In addition to the notation defined in Section 2.1, we also define $X_j^i = X_{(i-1)(p_n+q_n)+j}$ and $M_{j,k}^i = \max\{X_{j+1}^i, \ldots, X_k^i\}$. Now we note that
$$\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n) = \sum_{i=1}^{r_n}\sum_{j=1}^{p_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n) + o(1), \qquad (6.13)$$
since the difference between the two sums satisfies
$$\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n) - \sum_{i=1}^{r_n}\sum_{j=1}^{p_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n) = \sum_{i=1}^{r_n}\sum_{j=p_n+1}^{p_n+q_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n) + o(1) \le r_n q_n P(X_1 > x_n) + o(1) \le \frac{q_n}{p_n+q_n}\, n P(X_1 > x_n) + o(1) \to 0,$$
so that (6.12) gives
$$P(M_n \le x_n) \le \exp\Big(-\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n)\Big) + o(1). \qquad (6.14)$$
Now we prove the reverse inequality of (6.14). Following O'Brien (1987), introduce a new sequence $s_n = o(n)$, which plays the role of $p_n$, such that $p_n = o(s_n)$, and let $t_n = n/(s_n + q_n)$, which now plays the role of $r_n$; note that $t_n = o(r_n)$ and that the definitions in (6.11) and (2.1) are modified by replacing $p_n$ with $s_n$. Then for $i = 1, \ldots, t_n$,
$$P(M_{0,s_n}^i > x_n) \le P(M_{0,s_n-p_n}^i > x_n,\, M_{s_n-p_n,s_n}^i \le x_n) + P(M_{s_n-p_n,s_n}^i > x_n) \le \sum_{j=1}^{s_n-p_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n) + p_n P(X_1 > x_n) \le \sum_{j=1}^{s_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n)\,(1 + o(1)) + o(1),$$
where the penultimate inequality follows from the fact that $\{M_{0,s_n-p_n}^i > x_n,\, M_{s_n-p_n,s_n}^i \le x_n\} \subseteq \bigcup_{j=1}^{s_n-p_n} \{X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n\}$. Consequently,
$$P(M_{0,s_n}^i > x_n) \le \sum_{j=1}^{s_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n)\,(1 + o(1)) + o(1). \qquad (6.15)$$
Now, with
$$a_n^{(i)} = \sum_{j=1}^{s_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n)\,(1 + o(1)),$$
we note that
$$t_n a_n^{(i)} \le \frac{n}{s_n+q_n}\, s_n P(X_1 > x_n)(1 + o(1)) \le A\,\frac{s_n}{s_n+q_n}$$
for some positive constant $A$, since $P(X_1 > x_n) = O(1/n)$. Consequently, since $s_n/(s_n+q_n) \to 1$, $t_n a_n^{(i)}$ is bounded above, and applying Lemma 6.4 to (6.15) together with Lemma 6.5 we get
$$P(M_n \le x_n) \ge \exp\Big(-\sum_{i=1}^{t_n}\sum_{j=1}^{s_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n)\Big) + o(1). \qquad (6.16)$$
A similar argument to that used to show (6.13) gives
$$\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n) = \sum_{i=1}^{t_n}\sum_{j=1}^{s_n} P(X_j^i > x_n,\, M_{j,j+p_n}^i \le x_n) + o(1), \qquad (6.17)$$
so that (6.16) becomes
$$P(M_n \le x_n) \ge \exp\Big(-\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n)\Big) + o(1), \qquad (6.18)$$
and so (6.14) and (6.18) together prove (2.7). Also, since
$$\exp\Big(-\sum_{j=1}^{n} P(X_j > x_n,\, M_{j,j+p_n} \le x_n)\Big) = \Big\{\exp\big(-nP(X_1 > x_n)\big)\Big\}^{\gamma_n}$$
with $\gamma_n = n^{-1}\sum_{j=1}^{n} P(M_{j,j+p_n} \le x_n \mid X_j > x_n)$, this also gives (2.8).

Proof of Theorem 2.2.
We first note that, since both $q_n = o(n)$ and $\alpha_n = \alpha_{n,q_n}(x_n) \to 0$, we can find a sequence of positive integers $p_n = o(n)$ such that $n\alpha_n = o(p_n)$ and $q_n = o(p_n)$, so that the conditions of Theorem 2.1 are satisfied. Let $t > 0$ and write $k_n = \lfloor t/\bar F(x_n) \rfloor \sim tn/\tau$, so that for sufficiently large $n$ we have $k_n > p_n + q_n$. Now, for auxiliary purposes, note that, by the mixing condition,
$$P(M_{i,i+k_n} \le x_n \mid X_i > x_n) = P(M_{i+p_n+q_n,i+k_n} \le x_n) + o(1).$$
In a similar way, since $\{M_{i+k_n} \le x_n\} \subseteq \{M_{i+p_n+q_n,i+k_n} \le x_n\}$,
$$P(M_{i+p_n+q_n,i+k_n} \le x_n) - P(M_{i+k_n} \le x_n) = P(M_{i+p_n+q_n} > x_n,\, M_{i+p_n+q_n,i+k_n} \le x_n) \le (i + p_n + q_n)P(X_1 > x_n) \to 0,$$
so that $P(M_{i+p_n+q_n,i+k_n} \le x_n) = P(M_{i+k_n} \le x_n) + o(1)$.
Now we can derive the limiting distribution of $\bar F(x_n)T_i(x_n)$. We have
$$P(\bar F(x_n)T_i(x_n) > t) = P(T_i(x_n) > k_n) = P(X_{i+1} \le x_n, \ldots, X_{i+k_n} \le x_n \mid X_i > x_n) = P(M_{i,i+k_n} \le x_n \mid X_i > x_n) = P(M_{i+k_n} \le x_n) + o(1). \qquad (6.19)$$
Now $i + k_n \sim k_n \sim nt/\tau$, so that $\alpha^*_{k_n,q_n}(x_n) \to 0$, and so we may apply Theorem 2.1 to the first $k_n$ observations to get
$$P(M_{i+k_n} \le x_n) = P(M_{k_n} \le x_n) + o(1) = F(x_n)^{k_n\gamma} + o(1) = F^n(x_n)^{\gamma t/\tau} + o(1) \to e^{-\gamma t},$$
so that (6.19) gives the result.