Multivariate peaks over thresholds models

Multivariate peaks over thresholds modelling based on generalized Pareto distributions has up to now only been used in a few, mostly two-dimensional, situations. This paper contributes theoretical understanding, models which can respect physical constraints, inference tools, and simulation methods to support routine use, with an aim towards higher dimensions. We derive a general point process model for extreme episodes in data, and show how conditioning the distribution of extreme episodes on threshold exceedance gives four basic representations of the family of generalized Pareto distributions. The first representation is constructed on the real scale of the observations. The second one starts with a model on a standard exponential scale which is then transformed to the real scale. The third and fourth representations are reformulations of a spectral representation proposed in Ferreira and de Haan (Bernoulli 20(4), 1717–1737, 2014). Numerically tractable forms of densities and censored densities are found and provide tools for flexible parametric likelihood inference. New simulation algorithms, explicit formulas for probabilities and conditional probabilities, and conditions which make the conditional distribution of weighted component sums generalized Pareto are derived.

threshold and in catchments where the threshold is not exceeded are important. Additionally, the inclusion of undershoots increases the amount of information that can be used for inference. Just as in one dimension, the natural model is that extreme episodes occur according to a Poisson process and that overshoots and undershoots (or undershoots larger than a censoring threshold) jointly follow a multivariate GP distribution.
The aim of this paper is to contribute probabilistic understanding, physically motivated models, likelihood tools, and simulation methods, all of which are needed for multivariate PoT modelling of extreme episodes via multivariate GP distributions. Specifically, the key contributions are: new representations of GP distributions conducive to model construction; density formulas for each of these representations; new properties of multivariate GP distributions; and simulation tools. Many of these results are oriented towards enabling improved statistical modelling, but here we restrict ourselves to a probabilistic study. A companion paper  addresses practical modelling aspects.
We begin by deriving the basic properties of the class of multivariate GP distributions. We then pursue the following program: (i) to exhibit the possible point process limits of extreme episodes in data; (ii) to show how conditioning on threshold exceedances transforms the distribution of the extreme episodes to GP distributions, and to use this to find physically motivated representations of the multivariate GP distributions; and (iii) to derive likelihoods and censored likelihoods for the representations in (ii).
In part (ii) of the program, we develop four representations. The first one is in the same units as the observations, i.e., on the real scale, and in the second one the model is built on a standard exponential scale and then transformed to the real observation scale. The third is a spectral representation proposed in Ferreira and de Haan (2014), and the fourth one is a simple reformulation of this representation aimed at aiding model construction. A useful, and to us surprising, discovery is that it is possible to derive the density also for the fourth representation, and that this density in fact is simpler than the densities for the first two representations. The importance of (iii) is that likelihood inference makes it possible to incorporate covariates, e.g. temporal or spatial trends, in a flexible and practical way.
The insights and results obtained in carrying out this program, we believe, will lead to new models, new computational techniques, and new ways to make the necessary compromises between modelling realism and computational tractability which together will make routine use possible, also in dimensions higher than two. The limiting factor is the number of parameters rather than the number of variables. The models mentioned in Example 3 may be a case in point. The formulas for probabilities, conditional probabilities and conditional densities given in Sections 5 and 6, together with the discovery that weighted sums of components of GP distributions conditioned to be positive also have a GP distribution, add to the usefulness of the methods. Simulation of GP distributions is needed for several reasons, including computation of the probabilities of complex dangerous events and goodness of fit checking. The final contribution of this paper is a number of simulation algorithms for multivariate GP distributions.
The multivariate GP distributions were introduced in Tajvidi (1996), Beirlant et al. (2004, Chapter 8), and Rootzén and Tajvidi (2006); see also Falk et al. (2010, Chapter 5). A closely related approximation was used in Smith et al. (1997). The literature on applications of multivariate PoT modelling is rather sparse (Brodin and Rootzén 2009; Michel 2009; Aulbach et al. 2012). Some earlier papers use point process models which are closely related to the PoT/GP approach (Coles and Tawn 1991; Joe et al. 1992). Other papers consider nonparametric or semiparametric rank-based PoT methods focusing on the dependence structure but largely ignoring modelling the margins (de Haan et al. 2008; Einmahl et al. 2012; Einmahl et al. 2016). However, the GP approach has the advantages that it provides complete models for the threshold excesses, that it can use well-established model checking tools, and that, compared to the point process approach, it leads to more natural parametrizations of trends in the Poisson process which governs the occurrence of extreme episodes.
There is an important literature on modelling componentwise, perhaps yearly, maxima with multivariate generalized extreme value (GEV) distributions: for a survey in the spatial context see Davison et al. (2012). However, componentwise maxima may occur at different times for different components, and in many situations the focus is on the PoT structure: extremes which occur simultaneously. Additionally, likelihood inference for GEV distributions is complicated by a lack of tractable analytic expressions for high-dimensional densities, so that inference often is much easier, and perhaps more efficient, in GP models; see Huser et al. (2015) for a survey and an extensive comparison. The most important special case of GP models is that in which all variables can be simultaneously extreme and there is no mass placed on hyperplanes (see Section 2 for details of the support); this is a typical modelling assumption. Further comment on the situation of asymptotic independence, where this does not hold, is made in Section 8, as well as in Kiriliouk et al. (2016).
Section 2 derives and exemplifies the basic properties of the GP cumulative distribution functions (cdf-s). In Section 3 we develop a point process model of extreme episodes, and Section 4 shows how conditioning on exceeding high thresholds leads to three basic representations of the GP distributions. Section 5 exhibits the fourth representation and derives densities and censored likelihoods, while Section 6 gives formulas for probabilities and conditional probabilities in GP distributions. Finally, Section 7 contributes simulation algorithms for multivariate GP distributions and Section 8 discusses parametrization issues and gives a concluding overview.

Multivariate generalized Pareto distributions
This section first briefly recalls and adapts existing theory for multivariate GEV distributions, and then derives a number of the basic properties of GP distributions.
Throughout we use notation as follows. The maximum and minimum operators are denoted by the symbols ∨ and ∧, respectively. Bold symbols denote d-variate vectors. For instance, α = (α_1, . . . , α_d) and 0 = (0, . . . , 0) ∈ R^d. Operations and relations involving such vectors are meant componentwise, with shorter vectors being recycled if necessary. For instance ax + b = (a_1x_1 + b_1, . . . , a_dx_d + b_d), x ≤ y if x_j ≤ y_j for j = 1, . . . , d, and t^γ = (t^{γ_1}, . . . , t^{γ_d}). If F is a cdf then we write F̄ = 1 − F for its tail function, and also write F for the probability distribution determined by the cdf. That X ∼ F means that X has distribution F, and →_d denotes convergence in distribution. The symbol 1 is the indicator function: 1_A equals 1 on the set A and 0 otherwise.
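The componentwise conventions above map directly onto numpy broadcasting, which is convenient for experimenting with the formulas in this paper. A minimal sketch (numpy and the particular vectors are our choices, not the paper's):

```python
import numpy as np

# Componentwise conventions of the text, mirrored by numpy broadcasting:
# ax + b acts coordinate by coordinate, scalars are "recycled" across the
# d components, and t**gamma means (t**gamma_1, ..., t**gamma_d).
a = np.array([2.0, 0.5, 1.0])
b = np.array([1.0, -1.0, 0.0])
x = np.array([3.0, 4.0, 5.0])
gamma = np.array([0.3, 0.0, -0.2])

y = a * x + b                    # (a_1 x_1 + b_1, ..., a_d x_d + b_d)
print(y)                         # [7. 1. 5.]

t = 2.0                          # scalar t recycled componentwise
print(t ** gamma)                # (2**0.3, 2**0.0, 2**-0.2)

# x <= y holds only if it holds in every component; the maximum and minimum
# operators act componentwise.
print(bool(np.all(x <= y)))
print(np.maximum(x, 0.0))        # x with negative components replaced by 0
```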
Below we repeatedly use that if X is a d-dimensional vector with P(X ≰ u) > 0 and s > 0 then (2.1)

Background: multivariate generalized extreme value distributions
Throughout, G denotes a d-variate GEV distribution, so that in particular G has non-degenerate margins. The class of GEV distributions has the following equivalent characterizations, see e.g. Beirlant et al. (2004): (M1) It is the class of limit distributions of location-scale normalized maxima, i.e., the distributions which are limits

G(x) = lim_{n→∞} F^n(a_n x + b_n),   (2.2)

of normalized maxima of independent and identically distributed (i.i.d.) vectors X_1, X_2, . . . ∼ F, for some a_n > 0 and b_n; and (M2) It is the class of max-stable distributions, i.e., distributions such that taking maxima of i.i.d. vectors from the distribution only leads to a location-scale change of the distribution. By (M1) the class of GEV distributions is closed under location and scale changes.
The marginal distribution functions, G_1, . . . , G_d, of G may be written as

G_j(x) = exp{−(1 + γ_j(x − μ_j)/α_j)^{−1/γ_j}},   (2.3)

for x ∈ R such that α_j + γ_j(x − μ_j) > 0, with the case γ_j = 0 interpreted as the limit exp{−e^{−(x−μ_j)/α_j}}. We will use this parametrization throughout. The parameter range is (γ_j, μ_j, α_j) ∈ R × R × (0, ∞). Define

σ = α − γμ and Ĩ_j = {x ∈ R; α_j + γ_j(x − μ_j) > 0}.   (2.4)

Then G_j is supported by the interval Ĩ_j, while G is supported by a (subset of) the rectangle Ĩ_1 × · · · × Ĩ_d. The lower and upper endpoints of G_j are denoted by η_j ∈ R ∪ {−∞} and ω_j ∈ R ∪ {+∞}, respectively. One may alternatively write the condition x ∈ Ĩ_1 × · · · × Ĩ_d as γx + σ > 0.
PoT models are determined by the difference between the thresholds and the location parameters of the observations, and not by their individual values. Hence, it does not entail any loss of generality to shift the location parameters {μ i } to make the assumption 0 < G(0) < 1 hold. We will often use the stronger condition that σ > 0, i.e., that σ j > 0 for all j ∈ {1, . . . , d}. By Eq. 2.4, this is equivalent to assuming that 0 is in the interior of the support of every one of the d margins G 1 , . . . , G d , i.e., that η j < 0 < ω j and thus 0 < G j (0) < 1 for all j ∈ {1, . . . , d}. This is an additional restriction only for γ j < 0: if γ j = 0, then σ j = α j > 0, while if γ j > 0 then G(0) > 0 implies η j < 0 and thus σ j = −γ j η j > 0.
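The identity underlying this argument, σ_j = α_j − γ_j μ_j = −γ_j η_j for γ_j > 0 with η_j = μ_j − α_j/γ_j, can be checked directly; the sketch below uses arbitrary parameter values:

```python
# For gamma_j > 0 the lower endpoint is eta_j = mu_j - alpha_j/gamma_j, and
# sigma_j = alpha_j - gamma_j*mu_j equals -gamma_j*eta_j, so eta_j < 0 is
# equivalent to sigma_j > 0.  Parameter values below are arbitrary.
def sigma(alpha, gamma, mu):
    return alpha - gamma * mu

def lower_endpoint(alpha, gamma, mu):
    return mu - alpha / gamma          # finite only for gamma != 0

cases = [(1.0, 0.5, -0.3), (2.0, 0.25, 1.0), (0.7, 1.5, -2.0), (1.0, 0.5, 3.0)]
for alpha, gamma, mu in cases:
    eta = lower_endpoint(alpha, gamma, mu)
    s = sigma(alpha, gamma, mu)
    assert abs(s - (-gamma * eta)) < 1e-12     # sigma_j = -gamma_j * eta_j
    assert (eta < 0) == (s > 0)                # 0 interior to support iff sigma_j > 0
print("sigma_j = -gamma_j * eta_j verified")
```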
An easy argument shows that G is max-stable if and only if for each t > 0 there exist scale and location vectors a_t ∈ (0, ∞)^d and b_t ∈ R^d such that G^t(a_t x + b_t) = G(x) (Resnick 1987, Equation (5.17)). It follows from Eq. 2.3 that these parameters are given by

a_t = t^γ and b_t = σ(t^γ − 1)/γ.   (2.5)

Further, there exists a measure ν such that G(x) = exp{−ν({y; y ≰ x})} (Resnick 1987, Proposition 5.8). The measure ν is called the intensity measure because, by (M1), the limit of the expected number of location-scale normalized points, say a_n^{−1}(X_i − b_n), i ∈ {1, . . . , n}, in a Borel set A which is bounded away from η and such that ν(∂A) = 0, is equal to ν(A). The intensity measure ν determines the limit distribution of the sequence of point processes ∑_{i=1}^n δ_{a_n^{−1}(X_i − b_n)}, see Section 3.

Generalized Pareto distributions
Let G be a GEV distribution with 0 < G(0) < 1 and let ν be the corresponding intensity measure. Then 0 < ν({y; y ≰ 0}) < ∞, so that we can define a probability measure supported by the set {y; y ≰ 0} by restricting the intensity measure ν to that set and normalizing it. The result is the generalized Pareto (GP) distribution associated to G. Its cdf H may be expressed as

H(x) = ( log G(x ∧ 0) − log G(x) ) / log G(0),   (2.6)

see Beirlant et al. (2004, Chapter 8) and Rootzén and Tajvidi (2006). If a GEV cdf G and a GP cdf H satisfy (2.6), then we say that they are associated and write H ↔ G. For completeness, we prove (2.6) in the Appendix. For points x ∈ [−∞, ∞)^d with x ≥ η and x_j = η_j for some j, the value of H(x) is determined by right-hand continuity. Below it is shown that η is determined by the values of H(x) for x ≥ 0.
The probability that the j -th component, j ∈ {1, . . . , d}, exceeds zero is equal to 1 − H j (0) = log G j (0)/ log G(0), which is positive if and only if G j (0) < 1, that is, when σ j = α j − γ j μ j > 0. Since G(0) < 1 implies that G j (0) < 1 for some but not necessarily all j , the GP family includes distributions for which one (or several) of the components never exceed their threshold, so that the support of that component lies in [−∞, 0]. This could be useful in some modelling situations, but still, the situation of main interest is when all components have a positive probability of being an exceedance, or equivalently when H j (0) < 1 for all j ∈ {1, . . . , d}, or, again equivalently, when σ > 0.
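These marginal exceedance identities are easy to check numerically. The sketch below builds H from a GEV cdf G through the association formula of Rootzén and Tajvidi (2006), H(x) = (log G(x ∧ 0) − log G(x))/log G(0), taking as an arbitrary test case a bivariate G with independent standard Gumbel margins, and verifies that 1 − H_1(0) = log G_1(0)/log G(0) = 1/2:

```python
import math

# G(x, y) = exp(-e^{-x} - e^{-y}): independent standard Gumbel margins
# (our test case).  The associated GP cdf is built via
#     H(x) = (log G(x ^ 0) - log G(x)) / log G(0),
# with ^ the componentwise minimum.
def log_G(x, y):
    return -math.exp(-x) - math.exp(-y)

def H(x, y):
    num = log_G(min(x, 0.0), min(y, 0.0)) - log_G(x, y)
    return num / log_G(0.0, 0.0)

# Marginal exceedance probability: 1 - H_1(0) = log G_1(0) / log G(0).
big = 50.0                                  # proxy for +infinity in the margin
lhs = 1.0 - H(0.0, big)
rhs = (-1.0) / log_G(0.0, 0.0)              # log G_1(0) / log G(0, 0) = 1/2
assert abs(lhs - rhs) < 1e-12
print(lhs)   # 0.5
```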
Similarly to the characterizations (M1) and (M2) of the GEV distributions, the class of GP distributions H such that H_j(0) < 1 for all j ∈ {1, . . . , d} has the following characterizations (Rootzén and Tajvidi 2006). The functions s_t, u_t in the characterizations are assumed to be continuous, and additionally u_t is assumed increasing.
(T1) The GP distributions are limits of distributions of threshold excesses: Let X ∼ F. If there exist scaling and threshold functions s_t ∈ (0, ∞)^d and u_t ∈ R^d with F(u_t) < 1 and F(u_t) → 1 as t → ∞, such that

P( s_t^{−1}(X − u_t) ≤ x | X ≰ u_t ) → H⁺(x), t → ∞,

for some cdf H⁺ with nondegenerate margins, then the function H⁺(x); x > 0, can be uniquely extended to a GP cdf H(x); x ∈ R^d. (T2) The GP distributions are threshold-stable: Let X ∼ H where H has nondegenerate margins on R₊. If there exist scaling and threshold functions s_t ∈ (0, ∞)^d and u_t ∈ R^d, with u_1 = 0 and H(u_t) → 1 as t → ∞, such that

P( s_t^{−1}(X − u_t) ≤ x | X ≰ u_t ) = H(x) for all t ≥ 1,

then H is a GP cdf. We use the term "threshold-stable" for property (T2) in analogy with the terms "sum-stable" and "max-stable". A distribution is sum- or max-stable if the sum or maximum, respectively, of independent variables with this distribution has the same distribution, up to a location-scale change. Analogously, a distribution is threshold-stable if conditioning on the exceedance of suitable higher thresholds leads to distributions which, up to scale changes, are the same as the original distribution. This property is illustrated in Fig. 1, with u_t and s_t as given below in Theorem 1(viii).
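In one dimension threshold stability is an explicit identity for the GP survival function, which can be verified directly; the parameter values below are arbitrary:

```python
# Threshold stability in one dimension: if X ~ GP(sigma, gamma) with survival
# function (1 + gamma*x/sigma)^(-1/gamma), then (X - u | X > u) is again GP,
# with scale sigma + gamma*u.  A direct numerical check of the identity.
def gp_surv(x, sigma, gamma):
    return (1.0 + gamma * x / sigma) ** (-1.0 / gamma)

sigma, gamma = 1.0, 0.3
for u in [0.5, 1.0, 3.0]:
    for x in [0.1, 0.7, 2.0]:
        cond = gp_surv(u + x, sigma, gamma) / gp_surv(u, sigma, gamma)
        assert abs(cond - gp_surv(x, sigma + gamma * u, gamma)) < 1e-12
print("conditional excess over u is GP(sigma + gamma*u, gamma)")
```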
If (T1) holds we say that F belongs to the (threshold) domain of attraction of H . In contrast to the limit in (M1), different threshold functions can lead to limits which are not location-scale transformations of one another. A cdf F is in a domain of attraction for maxima if and only if it is in a threshold domain of attraction.
The GP distribution H is supported by the set {x; x_j ≥ η_j for all j, and x_j > 0 for some j}.
This proves the intuitively appealing result that H + j is a one-dimensional GP distribution, and shows that σ , γ and then also η are determined by the values of H (x) for x ≥ 0.
If J is a non-empty subset of {1, . . . , d} then H_J denotes the corresponding margin of H, and H⁺_J is the conditional distribution of (X_j : j ∈ J) given that max_{j∈J} X_j > 0, see Eq. 2.1 above. Recall that G and H are said to be associated, H ↔ G, if they satisfy (2.6).
Theorem 1 Let G be a GEV with margins (2.3) and suppose H ↔ G.
(iv) If u ≥ 0 and P(X ≰ u) > 0 for X ∼ H, then the conditional distribution of X − u given that X ≰ u is a GP distribution with the same shape parameter γ and σ replaced by σ + γu. (v) A finite or infinite mixture of GP distributions with the same σ and γ is a GP distribution.
(viii) If σ > 0, the scaling and threshold functions in the (T2) characterization of GP distributions may be taken as s_t = t^γ and u_t = σ(t^γ − 1)/γ, for t ≥ 1. (ix) The parameters γ and σ are identifiable from H.
In words, Theorem 1(i) says that conditional margins of GP distributions are GP, but that marginal distributions of a GP distribution are typically not GP. For instance, if H is a two-dimensional GP cdf, then H⁺_1 is a one-dimensional GP cdf (given by (2.9)), but typically H_1 is not. Intuitively, the reason is that the conditioning event implicit in H_1(x) also includes the possibility that it is the second component, rather than the first one, that exceeds its threshold. Theorem 1(ii)−(v) also establishes closure properties of the class of GP distributions. By (vi) and (vii) a GEV distribution specifies the associated GP distribution, and conversely a GP distribution specifies a curve of associated GEV distributions in the space of distribution functions. Regarding (vii), note that a GEV distribution G such that 0 < G_j(0) < 1 for all j is determined by its values for x ≥ 0 (proof in the Appendix). Finally, (viii) identifies the affine transformations which leave H unchanged, and (ix) establishes identifiability of the marginal parameters.

Proof
(i) Let 0̃ denote x̃ for the special case when x = 0. Inserting (2.3) into the equation above for J = {j}, together with a straightforward calculation, proves (2.9), and hence completes the proof of the first assertion.
If H_J(0) = 0, then H_J = H⁺_J and it follows from the first assertion that H_J is a GP distribution function. Further, GP distributions are supported by {y; y ≰ 0} and hence if H_J(0) > 0 then H_J is not a GP cdf. This proves the second assertion. (ii) If G is a GEV cdf then, for s > 0, the map x ↦ G(x/s) is a GEV cdf too, and the result then follows from Eq. 2.6. (iii) Proceed as in the proof of (i), but in the first step instead using the conditional distributions. (iv) Suppose the GP cdf-s H_n, with marginal parameters σ_n, γ_n, converge in distribution to a cdf H̃ with nondegenerate margins. Using standard converging subsequence arguments it follows from marginal convergence that there exist σ > 0 and γ such that σ_n → σ and γ_n → γ. Define u_{n,t} and s_{n,t} from H_n as in (viii). Then, since H_n is a GP cdf we have, using first (T2) and (viii), and then the continuous mapping theorem, that the (T2) relation for H_n converges to the corresponding relation for H̃ as n → ∞.
Since H_n →_d H̃ it follows that H̃ satisfies (T2) and hence is a GP cdf. (v) We only prove that a mixture of two GP cdf-s with the same σ and γ is a GP cdf too, using Theorem 4 below (the proof of that theorem does not use the result we are proving now). The proof for arbitrary finite mixtures is the same, and the result for infinite mixtures then follows by taking limits of finite mixtures and using (iv). Let H_1 and H_2 be GP cdf-s with the same marginal parameters σ, γ and let p ∈ (0, 1). By Theorem 4 and Eq. 4.2 there exist cdf-s representing H_1 and H_2 in the form (4.2). Straightforward calculation shows that pc_1 + (1 − p)c_2 = 1/∫_0^∞ F̄(t^γ σ/γ) dt, so that H̃(x) satisfies Eq. 4.2 and hence is a GP cdf.
(vi) The first assertion follows from Eq. 2.6. Choose t_1 so that t_1(− log G(0)) = − log G*(0). Then H ↔ G and H ↔ G* together imply that G* = G^{t_1} for x ≥ 0. Since a GEV cdf with σ > 0 is determined by its values for x ≥ 0, see the Appendix, this completes the proof. (vii) The first part follows from Eq. 2.6, and that this determines G again follows from the Appendix. (viii) By the proofs of (iii) and (vi) the conditional distribution of s_t^{−1}(X − u_t) given X ≰ u_t, with s_t = t^γ and u_t = σ(t^γ − 1)/γ, is again H. (ix) By assertion (vi), the GPD H determines the curve of GEV-s G_t for t > 0. It suffices to show that γ_j(t) and σ_j(t) do not depend on t. But this follows by straightforward calculations from the max-stability property G^t(a_t x + b_t) = G(x), with a_t and b_t as in Eq. 2.5. Example 1 below exhibits two-dimensional GP distributions with positive mass on certain lines, and the first part of Example 2 provides a cdf where the second assertion in (i) of Theorem 1 comes into play. In contrast to scale transformations, it seems likely that if σ > 0 then a non-trivial location transformation of a GP cdf is never a GP cdf. The second part of Example 2 shows one of the exceptional cases where the support of one of the components is contained in (−∞, 0) and where a location transformation of a GP distribution does give another GP distribution.
Example 1 This example rectifies the one on pages 1726–1727 in Ferreira and de Haan (2014). Let G(x, y) = exp{−1/(x + 1) − 1/(y + 1)} for (x, y) ∈ (−1, ∞)^2, the distribution of two independent unit Fréchet random variables shifted to have lower endpoints η_1 = η_2 = −1. The corresponding multivariate generalized Pareto distribution is given by (2.11), and H is the distribution function of a random vector (X, Y) which places positive mass on the lines x = −1 and y = −1. If we modify the example by choosing Gumbel rather than Fréchet margins, so that G(x, y) = exp(−e^{−x} − e^{−y}) for (x, y) ∈ R^2, then the GP cdf H is the cdf of the corresponding vector on the Gumbel scale. If instead G(x, y) = exp(−e^{−x∧y}), the cdf of a completely dependent Gumbel pair, we identify H as the distribution of the random pair (T, T), where P(T > t) = e^{−t}, t > 0; see Fig. 2, middle panel. It follows that, e.g., H_1(0) = 0 and hence in this example H_1 = H⁺_1. As a variation of the example let G(x, y) = exp[−e^{−x∧(y+μ)}] be the cdf of (Z, Z − μ), with Z standard Gumbel and μ > 0. The corresponding GP cdf is

H(x, y) = e^{−(x∧0)∧(y+μ)} − e^{−x∧(y+μ)},   (2.13)

and H is the cdf of (T, T − μ) with T standard exponential. Now, for −μ < ν the location-transformed cdf H(x, y + ν) equals e^{−(x∧0)∧(y+ν+μ)} − e^{−x∧(y+ν+μ)}, which is the same as Eq. 2.13 but with μ replaced by ν + μ > 0. Hence H(x, y + ν) is also a GP cdf. The support of H is shown in the right-hand panel of Fig. 2.
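A short numerical check of the Fréchet part of this example, building H from G via the association formula; the "big" proxy for +∞ is an implementation shortcut:

```python
# Example 1 numerically: G(x, y) = exp{-1/(x+1) - 1/(y+1)} on (-1, inf)^2
# (independent unit Fréchet margins with lower endpoint -1), with H built
# from G via H(x) = (log G(x ^ 0) - log G(x)) / log G(0).
def log_G(x, y):
    return -1.0 / (x + 1.0) - 1.0 / (y + 1.0)

def H(x, y):
    return (log_G(min(x, 0.0), min(y, 0.0)) - log_G(x, y)) / log_G(0.0, 0.0)

big = 1e9                                   # proxy for +infinity
# Each component exceeds its threshold with probability 1/2 ...
assert abs((1.0 - H(0.0, big)) - 0.5) < 1e-8
# ... and H_1 is flat on (-1, 0], so the sub-threshold mass of the first
# component sits on the line x = -1 (positive mass on a line).
for x in [-0.9, -0.5, -0.1]:
    assert abs(H(x, big) - 0.5) < 1e-8
print("P(X > 0) = 1/2; mass 1/2 on the line x = -1")
```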

Point processes of extreme episodes
The first step in our program for PoT inference is to specify a point process model for extreme episodes. This model exhibits extreme episodes as a product process obtained by multiplying a random vector, the "shape" vector, with a random quantity, the "intensity" of the episode. (For γ = 0, the model instead is a sum.) This is parallel to models commonly used for max-stable processes, see e.g. Schlather (2002). In Section 4 below we obtain basic and physically interpretable representations of the GP distributions by conditioning the product process of extreme episodes on threshold exceedance.
In this and subsequent sections we assume that σ > 0. Let X_1, X_2, . . . be i.i.d. random vectors with cdf F and marginal cdf-s F_1, . . . , F_d, and let a_n, b_n be as in Eq. 2.2. Further, let η be the vector of lower endpoints of the limiting GEV distribution, see Eq. 2.4 and the sentences right below it. We consider weak limits of the point processes N_n = ∑_{i=1}^n δ_{a_n^{−1}(X_i − b_n)}, where the intervals I_j are taken as [η_j, ∞] or [−∞, ∞] or [−∞, ω_j] according to whether γ_j > 0 or γ_j = 0 or γ_j < 0, and set S̄_γ = I_1 × · · · × I_d and S_γ = S̄_γ \ {η}.
The limit point process is specified as follows: Let 0 < T_1 < T_2 < . . . be the points of a Poisson process on [0, ∞) with unit intensity and let (R_i)_{i≥1} be independent copies of a random vector R which satisfies Condition 2 below. Further assume that the vectors (R_i)_{i≥1} are independent of (T_i)_{i≥1}, and define the point process

P_r = ∑_{i≥1} δ_{R_i/T_i^γ − σ/γ},   (3.1)

where, by convention, R_{i,j}/T_i^0 − σ_j/0 is interpreted to mean R_{i,j} − σ_j log T_i. The condition on R is as follows.

Condition 2 The components of the random vector
Let F_R be the cdf of R. For γ_j ≠ 0, the moment restriction in Condition 2 can be seen to be equivalent to requiring that 0 < ∫_0^∞ P(R_j > t^{γ_j} x_j) dt < ∞, if x_j ∈ (0, ∞) and γ_j > 0 or if x_j ∈ (−∞, 0) and γ_j < 0. For γ_j = 0, the moment condition is instead equivalent to 0 < E(e^{R_j/σ_j}) < ∞. It in turn follows that the moment condition implies that 0 < ∫_0^∞ F̄_R(t^γ(x + σ/γ)) dt < ∞, if the components x_j of x are as above, and where we have used the convention that t^0(x_j + σ_j/0) should be replaced by σ_j log t + x_j.

Theorem 3 Suppose F satisfies (2.2). Then, for some R which satisfies Condition 2,

N_n →_d P_r on S_γ, as n → ∞.   (3.2)

Conversely, for any P_r given by Eq. 3.1 there exist a GEV cdf G and a_n > 0 and b_n, with 0 < G_j(b_{n,j}) < 1 and G_j(b_{n,j}) → 1 for j = 1, . . . , d, such that Eq. 3.2 holds for F = G.
for μ, α, γ given by Eq. 2.3, so that the marginal cdf-s of Y* are standard Fréchet. It follows as in Theorem 5 of Penrose (1992) (see also de Haan and Ferreira (2006) and Schlather (2002)) that there exists a random vector R* ∈ [0, ∞)^d with E(R*_j) < ∞ such that Y* has the same distribution as sup_{i≥1} R*_i/T_i, where the random vectors R*_i are i.i.d. copies of R* and independent of the unit rate Poisson process (T_i)_{i≥1}. Reversing the transformation which led from Y to Y*, it follows that Y has the same distribution as sup_{i≥1} (R_i/T_i^γ − σ/γ), with the R_i satisfying Condition 2, and where throughout expressions should be interpreted as specified after Eq. 3.1 for γ_j = 0. For ν the intensity measure of P_r, we have G(x) = exp{−ν({y : y ≰ x})}. By standard reasoning, convergence in distribution of a_n^{−1}(⋁_{i=1}^n X_i − b_n) implies that n P(a_n^{−1}(X_1 − b_n) ∈ · ) converges vaguely to ν( · ) on S_γ. By Theorem 5.3 of Resnick (2007) this proves (3.2).
Conversely, given P_r, define the cdf G by G(x) = exp{−ν({y; y ≰ x})}. Straightforward calculations as in Schlather (2002, Theorem 2) show that G is max-stable. Hence there exist sequences a_n and b_n with the stated properties such that for independent random vectors X_1, X_2, . . . with common distribution G, the distribution of a_n^{−1}(⋁_{i=1}^n X_i − b_n) is equal to G too. By the first part of the proof, this proves (3.2).
In the proof of Theorem 3 we obtained part of the following result, which we record here for completeness.

Any R which satisfies Condition 2 defines a GEV cdf G through

G(x) = exp{ − ∫_0^∞ F̄_R(t^γ(x + σ/γ)) dt },   (3.3)

and to any GEV cdf there exists an R which satisfies this equation. Here we use the convention that t^0(x_j + σ_j/0) should be replaced by σ_j log t + x_j.
Proof Writing ν for the intensity measure of P_r we have ν({y; y ≰ x}) = ∫_0^∞ F̄_R(t^γ(x + σ/γ)) dt. The right-hand side of Eq. 3.3 is therefore equal to the probability that P_r has no points in the set {y; y ≰ x}. The result now follows from the proof of Theorem 3.
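Equation 3.3 can be illustrated by simulation. The sketch below takes R with independent standard exponential components (an arbitrary choice satisfying Condition 2) and γ, σ as in the Fig. 3 illustration, evaluates the integral in Eq. 3.3 numerically, and compares exp{−ν({y; y ≰ x})} with the Monte Carlo frequency of realizations of P_r having no point y with y ≰ x:

```python
import numpy as np

rng = np.random.default_rng(7)

# R has independent standard exponential components (an arbitrary choice
# satisfying Condition 2); gamma, sigma as in the Fig. 3 illustration.
gamma = np.array([0.3, 0.4])
sigma = np.array([0.5, 0.5])
x = np.array([0.5, 0.5])
c = x + sigma / gamma                    # F_R-bar is evaluated at t^gamma * c

# nu({y; y !<= x}) = int_0^inf P(R_1 > c_1 t^g1 or R_2 > c_2 t^g2) dt,
# evaluated by the trapezoidal rule on a long grid.
t = np.linspace(1e-9, 500.0, 400_001)
u1, u2 = c[0] * t ** gamma[0], c[1] * t ** gamma[1]
integrand = np.exp(-u1) + np.exp(-u2) - np.exp(-u1 - u2)
nu = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

# Monte Carlo: simulate P_r with arrival times truncated to [0, K] and count
# how often it has no point y with y !<= x; by Eq. 3.3 this frequency
# should approach exp(-nu).
K, reps, no_point = 200.0, 3000, 0
for _ in range(reps):
    n = rng.poisson(K)
    T = rng.uniform(0.0, K, size=n)      # arrival times (order irrelevant here)
    R = rng.exponential(size=(n, 2))
    X = R / T[:, None] ** gamma - sigma / gamma
    if not np.any((X[:, 0] > x[0]) | (X[:, 1] > x[1])):
        no_point += 1

assert abs(no_point / reps - np.exp(-nu)) < 0.05
print(round(float(np.exp(-nu)), 3))
```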

Representations of multivariate GP distributions
This section contains the second step in the program for PoT inference. We show how conditioning on threshold exceedances in the point process (3.1) gives four widely useful representations of the class of multivariate GP distributions. The first representation, (R), is on the real scale and corresponds to the point process P_r in Eq. 3.1, with points obtained as products of shape vectors and intensity variables. In the second representation, (U), the basic model is constructed on a standard scale and then transformed to the real scale. The third representation, (S), is equivalent to the spectral representation in Ferreira and de Haan (2014). A fourth representation, (T), which is a variation of the (S) representation, is introduced in Section 5.
In the literature the standard scale is chosen as one of the following: Pareto scale, γ = 1, uniform scale, γ = −1, or exponential scale, γ = 0. Here we choose the γ = 0 scale because of the simple additive structure it leads to. For all four representations, it is straightforward to switch from one scale to another one.
To understand the GP representation (R) we first approximate P_r by a truncated point process P̃_r where {T_i} is replaced by a unit rate Poisson process on [0, K], and compute the conditional probability that a point of P̃_r lies in {y; y ≤ x, y ≰ 0} given that P̃_r has a point in {y; y ≰ 0}. By the assumptions in Condition 2, the limit as K → ∞ of this expression,

H_R(x) = ∫_0^∞ [ F̄_R(t^γ(σ/γ + x ∧ 0)) − F̄_R(t^γ(σ/γ + x)) ] dt / ∫_0^∞ F̄_R(t^γ σ/γ) dt,   (4.2)

exists (cf. the discussion just before Theorem 3), and it is also immediate that H_R(∞) = 1, so that H_R is a cdf on [−∞, ∞)^d. If γ_j = 0 then t^{γ_j}(x_j + σ_j/γ_j) should be interpreted to mean x_j + σ_j log t. We write GP_R(σ, γ, F_R) for the cdf (4.2) and call it the (R) representation. Theorem 4 shows that the class of such cdf-s is the same as the class of all GP cdf-s with σ > 0.
Heuristically, for simplicity assuming that γ_j ≠ 0, j = 1, . . . , d, the calculations above proceed by equating a_n^{−1}(X − b_n) with R/T^γ − σ/γ, so that extremes of X asymptotically have the form a_n R/T^γ + b_n − a_n σ/γ. Setting b = b_n − a_n σ/γ and noting that R satisfies Condition 2 if and only if a_n R satisfies Condition 2, the intuition is that, asymptotically, extremes of X have the form

X_∞ = b + R/T^γ,   (4.3)

for some constant b and a random vector R which satisfies Condition 2. The interpretation is that the vector R is the shape of the extreme episode, say a storm, and that T^{−γ} is the intensity of the storm. Here T represents a pseudo random variable with an improper uniform distribution on (0, ∞). Although such an X_∞ therefore does not have a proper distribution, one can verify that the cdf H_R in Eq. 4.2 is derived as if it were the conditional distribution of X_∞ − u, given that X_∞ ≰ u, for σ/γ = u − b, i.e., as the formal conditional distribution of R/T^γ − σ/γ given that R/T^γ − σ/γ ≰ 0. In statistical applications one would assume that u is large enough to make it possible to use the cdf H_R as a model for threshold excesses. The parameters of R and the parameters σ and γ are then estimated from the observed threshold excesses. The heuristic interpretation in case one or more of the γ_j equals 0 is the same, one only has to write R_j − σ_j log T instead of R_j/T^0 − σ_j/0. Figure 3 illustrates how the multivariate GP distribution is derived from the Poisson process representation. Each realization of the Poisson process (3.1) yields a small, Poisson distributed, number of points in the region {x; x ≰ 0}. The expected number of such points is E{ ⋁_{j=1}^d (γ_j R_j/σ_j)^{1/γ_j} }, where if γ_j = 0 the j-th component is to be interpreted as e^{R_j/σ_j}, and thus depends on the distribution of R and the parameters σ, γ.
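This formal construction suggests a naive simulation check: replace the improper uniform T by Uniform(0, K) for large K, keep the draws with R/T^γ − σ/γ ≰ 0, and compare a conditional margin with the one-dimensional GP predicted by the theory above. The distribution of R and all tuning constants below are our choices:

```python
import numpy as np

rng = np.random.default_rng(11)

# Approximate sample from H_R: T ~ Uniform(0, K) for large K stands in for
# the improper uniform on (0, inf); R has independent standard exponential
# components (an arbitrary choice satisfying Condition 2).  Conditionally on
# X_j > 0, the j-th component should be one-dimensional GP(sigma_j, gamma_j).
gamma = np.array([0.3, 0.4])
sigma = np.array([0.5, 0.5])
K, N = 300.0, 1_500_000

T = rng.uniform(0.0, K, size=N)
R = rng.exponential(size=(N, 2))
X = R / T[:, None] ** gamma - sigma / gamma
exceed = (X[:, 0] > 0) | (X[:, 1] > 0)      # the event "X !<= 0"
X = X[exceed]                                # kept draws

# Conditional margin check at q = 0.3 for the first component:
q = 0.3
emp = float(np.mean(X[X[:, 0] > 0, 0] > q))
theo = (1.0 + gamma[0] * q / sigma[0]) ** (-1.0 / gamma[0])
assert abs(emp - theo) < 0.04
print(round(theo, 3), round(emp, 3))
```

The tolerance absorbs both Monte Carlo error and the small bias from truncating T at K.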
Defining U by (σ/γ)e^{γU} = R, where we use the convention that if γ_j = 0 then the j-th component is given by σ_j U_j = R_j, we can write (4.2) as

H_U(x) = { ∫_0^∞ [ F̄_U(γ^{−1} log(1 + γ(x ∧ 0)/σ) + log t) − F̄_U(γ^{−1} log(1 + γx/σ) + log t) ] dt } / ∫_0^∞ F̄_U(log t) dt,   (4.4)

for x such that γx/σ + 1 > 0, and where F_U is the cdf of U. Here we assume that 0 < E(e^{U_j}) < ∞ for j = 1, . . . , d, which is equivalent to assuming that R satisfies Condition 2. We write GP_U(σ, γ, F_U) for the cdf defined by Eq. 4.4 and call it the (U) representation. The intuition parallel to Eq. 4.3 is that H_U is the formal conditional distribution of (σ/γ)(e^{γ(U − log T)} − 1) given that (σ/γ)(e^{γ(U − log T)} − 1) ≰ 0 or, equivalently, given that U − log T ≰ 0. For later use we note that a GP_U(1, 0, F_U) vector X^0 has the cdf

H_0(x) = ∫_0^∞ [ F̄_U(x ∧ 0 + log t) − F̄_U(x + log t) ] dt / ∫_0^∞ F̄_U(log t) dt,   (4.5)

and that a general GP_U vector X is obtained from X^0 through the transformation

X = (σ/γ)(e^{γX^0} − 1).   (4.6)

Now suppose that U = S, where the vector S satisfies ⋁_{j=1}^d S_j = 0, so that, in particular, F̄_S(log t) = 1_{{0 < t < 1}}. Inserting this into Eq. 4.5 then gives the cdf

H_0(x) = ∫_0^∞ [ F̄_S(x ∧ 0 + log t) − F̄_S(x + log t) ] dt = ∫_0^∞ e^{−v} F_S(x − v) dv,   (4.7)

where the second equality follows from the change of variable log t = −v. Further, using the transformation (4.6) which connects (4.5) with Eq. 4.4, it follows that for general σ, γ one obtains the cdf

H_S(x) = ∫_0^∞ e^{−v} F_S(γ^{−1} log(1 + γx/σ) − v) dv.   (4.8)

We write GP_S(σ, γ, F_S) for the cdf (4.8), and call it the (S) representation.

Fig. 3 Deriving the GP from the Poisson process representation. Left: two-dimensional illustrations of "shape vectors" R and the 1000 largest "intensities" T^{−γ} for γ = (0.3, 0.4). Centre: points X = R/T^γ − σ/γ, where σ = (0.5, 0.5), against index, with a horizontal line at zero. Right: points X_2 versus X_1 with exceedances of zero in at least one coordinate highlighted.
The last expression in Eq. 4.7 can be given an interpretation in terms of random variables: it is the cdf of S + E, where E is a standard exponential variable which is independent of S, and then Eq. 4.8 is the cdf of (σ/γ)(e^{γ(S+E)} − 1). This is the Ferreira and de Haan (2014) spectral representation transformed to the exponential scale.

Theorem 4 The classes of cdf-s defined by Eqs. 4.2, 4.4, and 4.8 are all equal to the class of all GP distributions with σ > 0. For each class the conditional marginal distributions are given by Eq. 2.9.
Proof The assertion for the GP R (σ , γ , F R ) distributions follows from combining Eq. 3.3 with Eq. 2.6.
By definition, the class of GP U (σ , γ , F U ) cdf-s is the same as the class of GP R (σ , γ , F R ) cdf-s, and thus the same conclusion holds for the GP U (σ , γ , F U ) cdf-s. Since GP S (σ , γ , F S ) cdf-s are GP U (σ , γ , F U ) cdf-s, it follows that they are GP distributions.
To prove the full statement about the class GP_S(σ, γ, F_S) we first note that by the construction of the GP_S(σ, γ, F_S) cdf-s, it is enough to prove that the statement holds for GP distributions with γ = 0 and σ = 1. However, then the result follows by combining (T2) with the discrete version of de Haan and Ferreira (2006), Theorem 2.1, translated to the exponential scale (i.e., with their W replaced by e^{S+E}), and with ω_0 = 1.
The last assertion follows by straightforward calculation. As an example we prove it for the GP_R(σ, γ, F_R) class for the case γ_j ≠ 0; j = 1, . . . , d. Let F_j be the marginal distribution of the j-th component of R. It follows from Eqs. 2.10 and 4.2 that H⁺_j, the distribution of the j-th component of H_R conditioned to be positive, is given by

H⁺_j(x) = { ∫_0^∞ [ F̄_j(t^{γ_j} σ_j/γ_j) − F̄_j(t^{γ_j}(σ_j/γ_j + x)) ] dt } / ∫_0^∞ F̄_j(t^{γ_j} σ_j/γ_j) dt = 1 − (1 + γ_j x/σ_j)^{−1/γ_j},

where the second equality follows from making a change of variables from t(γ_j x/σ_j + 1)^{1/γ_j} to t in the numerator.
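The change-of-variables step can be confirmed numerically: with R_j standard exponential (an arbitrary choice), both integrals are Gamma integrals, and their ratio collapses to the one-dimensional GP cdf:

```python
import math

# With R_j ~ Exp(1) (our choice), int_0^inf exp(-a t^g) dt = Gamma(1 + 1/g) / a^(1/g),
# and the ratio of integrals in the proof collapses to
# H_j^+(x) = 1 - (1 + gamma_j x / sigma_j)^(-1/gamma_j).
def I(a, g):                       # int_0^inf exp(-a * t**g) dt
    return math.gamma(1.0 + 1.0 / g) / a ** (1.0 / g)

sigma_j, gamma_j = 0.5, 0.3
for x in [0.2, 1.0, 4.0]:
    num = I(sigma_j / gamma_j, gamma_j) - I(sigma_j / gamma_j + x, gamma_j)
    den = I(sigma_j / gamma_j, gamma_j)
    closed = 1.0 - (1.0 + gamma_j * x / sigma_j) ** (-1.0 / gamma_j)
    assert abs(num / den - closed) < 1e-12
print("H_j^+ is one-dimensional GP(sigma_j, gamma_j)")
```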
It may be noted that since F R , F U , and F S are cdf-s then also H R , H U , and H S are cdf-s, so that in contrast to Eq. 2.6 the Eqs. 4.2, 4.4, and 4.8 hold for all x ∈ [−∞, ∞) d , subject to the provision that Eqs. 4.4 and 4.5 only apply for γ x + σ > 0.
The distributions of the random vectors R and U are not uniquely determined by the corresponding GP distributions H R and H U in Eqs. 4.2 and 4.4, respectively. The next proposition is a generalization of Theorem 1 (vi).

Proposition 1 Suppose that the random variable Z is strictly positive, has finite mean, and is independent of R or U. Then GP_R(σ, γ, F_{Z^γ R}) = GP_R(σ, γ, F_R) and GP_U(σ, γ, F_{U + γ log Z}) = GP_U(σ, γ, F_U).
Proof We only prove the assertion for R, since the one for U follows from it. Replacing F_R by F_{Z^γ R} in the numerator and denominator of Eq. 4.2 yields, after an application of Fubini's theorem and a change of variables, a factor E(Z) coming out in front of the integrals in both the numerator and denominator. Upon simplification, the random variable Z is seen to have had no effect on H_R.
Usually one would let the model for U include free location parameters for each component, and the model for R a free scale parameter for each component, in order to let data determine the relative sizes of the components. However, as one consequence of the proposition, one should then, e.g., fix the location parameter for one of the components of U , or fix the sum of the components, to ensure parameter identifiability. Similarly, if γ j = 0 for j = 1, . . . , d and if the model for R includes a free scale parameter for each component, then one should, e.g., fix one of these scale parameters.

Densities
To find the densities for the (R) and (U) representations, we assume that R and U have densities with respect to Lebesgue measure on R^d. For the (S) representation, we make the assumption that S is obtained from a vector T by setting S = T − max_{1≤j≤d} T_j. We write GP_T(σ, γ, F_T) for these distributions, write H_T for the cdf-s, and call it the (T) representation. Clearly, the class of GP_T distributions is the same as the class of GP_S distributions, and hence is equal to the class of all GP distributions with σ > 0. The densities for the (R) and (U) representations are just as would be obtained if the R and U cdf-s were continuously differentiable and interchange of differentiation and integration was allowed. However, they in fact do not require any assumptions beyond absolute continuity with respect to d-dimensional Lebesgue measure.
The densities are positive for γ x + σ > 0 (and x ≰ 0), and the densities are 0 otherwise. If γ_j = 0 then for h_R the expressions t^{γ_j}(x_j + σ_j/γ_j) should be replaced by x_j + σ_j log t. For h_U and h_T, if γ_j = 0, the expressions (1/γ_j) log(γ_j x_j/σ_j + 1) should be replaced by their limits x_j/σ_j.
Hence, by Eq. 4.2 and using Fubini's theorem, we conclude that Eq. 5.1 holds for σ = γ = 1. The proof of the general form of Eq. 5.1 differs from this case only in bookkeeping details, and is omitted.
To prove (5.2), recall that H_U = H_R if γ = 0, σ = 1, so that (5.1) applies. Writing H̃ for the corresponding cdf, the general cdf H_U is obtained as H_U(x) = H̃((1/γ) log(γ x/σ + 1)), and Eq. 5.2 then follows by a chain rule type argument. We again only prove (5.3) for the case σ = 1, γ = 0, and with x ≰ 0; also here the extension to the general case is a chain rule argument. Starting from Eq. 4.7, using first Fubini's theorem, then a change of variables from t e^{max_{1≤j≤d} s_j} to t and Fubini's theorem, and finally a change of variables from s to s + log t and Fubini's theorem, proves that Eq. 5.3 holds for γ = 0 and σ = 1.
In some cases, the integrals in Eqs. 5.1, 5.2 and 5.3 can be computed explicitly; see the examples below. Otherwise the one-dimensional integrals allow for fast numerical computation as soon as one can compute densities and distribution functions of R or U efficiently. Either way, this can make full likelihood inference possible, also in high dimensions.
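For illustration, the sketch below computes the (R)-representation density by one-dimensional quadrature in a toy bivariate model with independent standard exponential components of R and γ = σ = (1, 1). The integral structure assumed for Eq. 5.1 (numerator ∫_0^∞ f_R(t^γ(x + σ/γ)) t^{γ_1+⋯+γ_d} dt, denominator ∫_0^∞ {1 − F_R(t^γ σ/γ)} dt) is our reading of the formula; the closed form used as a check, h_R(x) = (4/3)(x_1 + x_2 + 2)^{−3} on {x ≥ −1, x ≰ 0}, follows from it by direct integration.

```python
import numpy as np
from scipy import integrate

# Toy setup: d = 2, R_j independent standard exponentials, gamma = sigma = (1, 1).
gamma = np.array([1.0, 1.0])
sigma = np.array([1.0, 1.0])

def f_R(y):
    # joint density of independent Exp(1) components
    return np.exp(-y.sum()) if np.all(y > 0) else 0.0

def F_R(y):
    # joint cdf of independent Exp(1) components
    return np.prod(1.0 - np.exp(-np.maximum(y, 0.0)))

def h_R(x):
    """Numerical (R)-representation density, assuming the structure
    num(x) = int_0^inf f_R(t^gamma * (x + sigma/gamma)) t^(sum gamma) dt,
    den    = int_0^inf [1 - F_R(t^gamma * sigma/gamma)] dt."""
    x = np.asarray(x, dtype=float)
    if np.all(x <= 0):                     # support: at least one positive component
        return 0.0
    y = x + sigma / gamma
    if np.any(y <= 0):
        return 0.0
    num = integrate.quad(lambda t: f_R(t**gamma * y) * t**gamma.sum(), 0, np.inf)[0]
    den = integrate.quad(lambda t: 1.0 - F_R(t**gamma * (sigma / gamma)), 0, np.inf)[0]
    return num / den

# For this toy model the integrals are explicit:
# h_R(x) = (4/3) * (x1 + x2 + 2)**(-3) on {x >= -1, max(x) > 0}.
x = np.array([0.5, -0.2])
print(h_R(x), (4 / 3) * (x.sum() + 2) ** (-3))
```

Whatever the dimension d, only the two one-dimensional quadratures are needed, which is what makes likelihood evaluation cheap.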

Censored likelihood
Sometimes one does not trust the GP distribution to fit the excesses well on the entire set x 0. Then, instead of using a full likelihood obtained as a product of the densities in Theorem 5, one can use a censored likelihood which is based on the values of the excesses which are larger than some censoring threshold v = (v 1 , . . . , v d ). This idea was introduced for multivariate extremes in Smith et al. (1997), and has since become a standard approach to inference. Huser et al. (2015) explore the merits of this and other approaches via simulation.
Write D = {1, . . . , d}, and let C ⊂ D be the set of indices corresponding to the components which are censored, i.e., which do not exceed their censoring threshold v_j. Then, using the notation x_A = {x_j; j ∈ A} and writing h for h_R, h_U, or h_T, the likelihood contribution of a censored observation is given by Eq. 5.4. For certain models, the |C|-dimensional integral in Eq. 5.4 can be avoided, which is advantageous from a practical perspective.
Example 3 The simplest situation is when the components of the shape vector R are mutually independent. This could, e.g., be a model for wind speeds over a small area, perhaps a wind farm, with T^{−γ} representing the intensity of the average geostrophic wind and with the components of R representing random wind variations caused by local turbulence.
Let f_j be the density function of R_j, the j-th component of R, let F_j be the corresponding cdf, write y_j = x_j + σ_j/γ_j, and assume that v_j ≤ 0 for j ∈ C. The integral which appears in the numerator in Eq. 5.4, for h = h_R from Eq. 5.1, then factorizes over components, with censored components contributing through the F_j and uncensored ones through the f_j, and quick numerical computation of both the numerator and denominator integrals is typically possible.
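A sketch of this factorization in a toy case (d = 2, R_j independent standard exponential, γ = σ = (1, 1), component 2 censored at v_2 ≤ 0). The assumed structure is that, inside the single integral over t, a censored component contributes its cdf F_j evaluated at the censoring point and an uncensored one its density f_j; the code checks this reading against direct integration of the joint numerator over the censored coordinate.

```python
import numpy as np
from scipy import integrate

# Toy: d = 2, R_j independent Exp(1), gamma = sigma = (1, 1); component 2 censored at v2.
f = lambda u: np.exp(-u) * (u > 0)                                 # Exp(1) density
F = lambda u: np.maximum(1.0 - np.exp(-np.maximum(u, 0.0)), 0.0)   # Exp(1) cdf

def censored_num(x1, v2):
    # assumed factorized numerator: uncensored comp. 1 via f, censored comp. 2 via F
    y1 = x1 + 1.0
    return integrate.quad(lambda t: f(t * y1) * t * F(t * (v2 + 1.0)), 0, np.inf)[0]

def direct_num(x1, v2):
    # same quantity by integrating the full bivariate numerator over x2 in (-1, v2]
    y1 = x1 + 1.0
    inner = lambda x2: integrate.quad(
        lambda t: f(t * y1) * f(t * (x2 + 1.0)) * t**2, 0, np.inf)[0]
    return integrate.quad(inner, -1.0, v2)[0]

print(censored_num(0.5, -0.3), direct_num(0.5, -0.3))
```

The two computations agree by Fubini's theorem, and in this toy model the factorized integral even has the closed form 1/(y_1 + 1)^2 − 1/(y_1 + v_2 + 2)^2.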
Sometimes these integrals can also be computed analytically, and similarly for the corresponding integrals for h_U and h_T. As a simple example, consider (5.3) with γ = 0 and σ = 1 and with the components of T having independent standard Gumbel distributions with cdf exp{−e^{−x}}. Then, with c the number of elements of C, i.e., the number of censored components, and abbreviating 1_{{x_{D\C} > 0}} to 1_{D\C}, the censored likelihood contribution has an explicit form.

Whilst the previous example is a theoretical illustration, the class of GP distributions obtained by letting R (or U) have independent components with parametrized marginal distributions does make for a large and flexible class of models. It includes, for example, the GP distributions associated with the commonly used logistic and negative logistic max-stable distributions. For this and further examples, see Kiriliouk et al. (2016).

Further examples
We illustrate two further constructions with tractable densities. The first is a toy example to exhibit the idea of building process knowledge into a model. The second is a variation on existing extreme value models based on lognormal distributions.
Example 4 An extreme flow episode in a river network consisting of two tributaries which join to form the main river could be modeled as R/T^γ = (R_1/T^γ, R_2/T^γ, (R_1 + R_2 + E)/T^γ), with γ > 0, so that R_3 = R_1 + R_2 + E. Here the first component corresponds to flow in tributary number one, the second component to flow in tributary number two, and the third component to flow in the main river. The simplest model is that R_1, R_2, E are independent and have a standard exponential distribution. Then f_R(y) = 1_{{0 < y_1, 0 < y_2, y_1 + y_2 < y_3}} e^{−y_3} (5.5). Assuming in addition that σ = (σ, σ, σ), the integral ∫_0^∞ F̄_R(t^γ σ/γ) dt can be computed explicitly, since max_j R_j = R_3 is a sum of three exponential variables and thus has a gamma distribution. It follows from Eqs. 5.1 and 5.5 with y = x + σ/γ that h_R(x) = 1_{{x ≰ 0, −σ/γ ≤ x, y_1 + y_2 ≤ y_3}} γ^{−1} 2 (σ/γ)^{1/γ} (x_3 + σ/γ)^{−3−1/γ}.
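The closed form above can be cross-checked numerically against the integral structure we assume for Eq. 5.1; since max_j R_j = R_3 and R_3 ~ Gamma(3, 1), the denominator reduces to a one-dimensional integral of the gamma survival function. (The precise form of the numerator and denominator integrals is our reading of Eq. 5.1.)

```python
import numpy as np
from scipy import integrate

# Example 4 river network: R = (R1, R2, R1 + R2 + E) with R1, R2, E iid Exp(1),
# scalar gamma > 0 and sigma = (s, s, s).
g, s = 0.5, 1.0

def f_R(y):
    # joint density of (R1, R2, R1 + R2 + E): exp(-y3) on {y1 > 0, y2 > 0, y1 + y2 < y3}
    y1, y2, y3 = y
    return np.exp(-y3) if (y1 > 0 and y2 > 0 and y1 + y2 < y3) else 0.0

def h_R_numeric(x):
    # assumed integral structure of Eq. 5.1 (numerator and denominator as 1-d integrals)
    x = np.asarray(x, dtype=float)
    y = x + s / g
    if x.max() <= 0 or np.any(y <= 0):
        return 0.0
    num = integrate.quad(lambda t: f_R(t**g * y) * t**(3 * g), 0, np.inf)[0]
    # max_j R_j = R3 ~ Gamma(3, 1): exceedance probability is its survival function
    surv3 = lambda c: np.exp(-c) * (1 + c + c**2 / 2)
    den = integrate.quad(lambda t: surv3(t**g * s / g), 0, np.inf)[0]
    return num / den

def h_R_closed(x):
    # closed form from the example, with y = x + sigma/gamma
    x = np.asarray(x, dtype=float)
    y = x + s / g
    ok = x.max() > 0 and np.all(y > 0) and y[0] + y[1] <= y[2]
    return (2 / g) * (s / g) ** (1 / g) * (x[2] + s / g) ** (-3 - 1 / g) if ok else 0.0

x = (-1.0, -0.9, 0.5)
print(h_R_numeric(x), h_R_closed(x))
```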
Example 5 Lognormal distributions have been used in max-stable modelling, e.g., in Huser and Davison (2013), and as point process models in Wadsworth and Tawn (2014), and are an important class of models. As an example, in the (R) representation, suppose that 0 ≤ γ and that F_R(x) = Φ(log x), where Φ is the cdf of a d-dimensional normal distribution with mean μ and nonsingular covariance matrix Σ. Write φ for the corresponding density and let A = Σ^{−1} be the precision matrix.
Making the change of variables from log t to t and completing the square, the integral can be evaluated. The integral in the denominator can be expressed as a sum of d terms, each of which involves a (d − 1)-dimensional normal cdf; see Huser and Davison (2013). However, if d is large then this expression is cumbersome. Inference methods for similar high-dimensional models are explored in de Fondeville and Davison (2016).

Probabilities and conditional probabilities
Equations 4.2, 4.4, and 4.8 give probabilities of rectangles for GP distributions, on the real scale. In this section they are generalized to expressions for probabilities of general sets and for conditional probabilities. Below, we only consider GP R models. It is straightforward to derive the corresponding formulas for the other representations.
Let F = {y; y ≰ 0}, set A = {y; y ≤ x}, and for a, b ∈ R^d and a set B ⊂ R^d write a(B + b) for the set {a(y + b); y ∈ B}. As is easily checked, Eq. 6.1 holds. Now, if in the derivation of Eq. 4.2 the special set A defined above is replaced by a general set A ⊂ R^d, the result is still the same (6.2). A proof that Eq. 6.2 holds for any set A is immediate: using Fubini's theorem it is seen that the right-hand side of the equation is a probability distribution as a function of A, and since it agrees with the distribution H_R on sets of the form {y; y ≤ x}, the two distributions are equal. The intuition is that H_R(A) is the (formal) conditional probability of the event {R/T^γ − σ/γ ∈ A} given the event {R/T^γ − σ/γ ≰ 0}.
Let the random vector X have the distribution H_R in Eq. 4.2. Then P[X ∈ A | X ∈ B] = H_R(A ∩ B)/H_R(B), and hence Eq. 6.2 can also be used to find conditional probabilities. Further, assuming continuity, Eq. 5.1 determines the conditional densities. For instance, writing f_{·|X_1=x} for the conditional density of (X_2, . . . , X_d) given that X_1 = x, we find, for x > 0, the expression in Eq. 6.3. By further integration, it follows that Eq. 6.4 holds.

Example 6 In Example 4, extreme flow episodes in the two river tributaries are modelled using (R_1/T^γ, R_2/T^γ) with R_1 and R_2 independent standard exponential variables and with γ > 0. Suppose X ∼ H, where H is the GP distribution obtained from (R_1/T^γ, R_2/T^γ), and let s > 0. Since R_1 + R_2 has a gamma distribution, it is straightforward to evaluate Eq. 6.2 to find the distribution of the sum of the flows in the two tributaries, given in Eq. 6.5. Similar computations using Eq. 6.3 give the corresponding conditional formulas, for x_1, x_2 > 0, with c_2 = (1 + γ)(x_1 + σ_1/γ)^{1+1/γ}(γ x_1 + σ_1 + σ_2)^{−2−1/γ} and c_3 = (x_1 + σ_1/γ)^{1+1/γ}(γ x_1 + σ_1 + σ_2)^{−1−1/γ}. Hence, dividing (6.5) by the same expression with s set to zero, we find that the sum conditioned to be positive has a GP distribution with the same shape parameter as the marginal distributions but with a larger scale parameter. The conditional distribution of X_2 given that X_1 = x > 0, conditioned to be positive, has a GP distribution with a smaller shape parameter, γ/(1 + γ).
Many of the results in Example 6 hold more generally. For instance, the conditional GP distribution of sums holds as soon as the marginals have the same shape parameter. The intuition is simple: Suppose the GP_R distribution has been obtained from the vector (R_1/T^γ − σ_1/γ, . . . , R_d/T^γ − σ_d/γ) by (formal) conditioning on at least one of the components being positive. Then a weighted sum of the components equals R/T^γ − σ/γ, for R = Σ_{j=1}^d a_j R_j and σ = Σ_{j=1}^d a_j σ_j, with coefficients a_1, . . . , a_d. According to the GP_R representation, provided a_1, . . . , a_d ≥ 0, the distribution of R/T^γ − σ/γ conditioned to be positive is a one-dimensional GP distribution with parameters γ and σ. Further, that a sum is positive implies that at least one component is positive, and hence first conditioning on at least one component being positive, and then conditioning on the sum being positive gives the same result as conditioning directly on the sum being positive. Thus the one-dimensional GP distribution holds for component sums in GP distributions. Similar reasoning can be applied to, e.g., joint distributions of several weighted sums and several components. Here, we only prove the one-dimensional result.
Proposition 2 Let X be a GP random vector with common shape parameter γ for all d margins and with scale parameter σ > 0, and if γ ≤ 0 additionally assume that P[Σ_{j=1}^d a_j X_j > 0] > 0. Then, for a ∈ [0, ∞)^d \ {0}, the conditional distribution of the weighted sum Σ_{j=1}^d a_j X_j given that it is positive is generalized Pareto with shape parameter γ and scale parameter σ = Σ_{j=1}^d a_j σ_j.
Proof Since P[Σ_{j=1}^d a_j X_j > 0] > 0 holds automatically if γ > 0 and σ > 0, this condition is satisfied for all values of γ. Let A_x = {y ∈ R^d; Σ_{j=1}^d a_j y_j > x} and, as above, define R = Σ_{j=1}^d a_j R_j. Then, for x > 0 and with F = {y; y ≰ 0} as above, A_x ∩ F = A_x, and [for γ = 0 using the convention that t^0(x + σ/0) means x + σ log t] the numerator in Eq. 6.1 for A = A_x can be evaluated explicitly, so that the result follows from Eq. 6.1; the last equality follows from making the change of variables from t(1 + x/σ)^{1/γ} to t in the numerator.
Example 1 exhibits a situation where the component sum in a GP distribution is identically equal to −∞ and hence the assumption P[X 1 + X 2 > 0] > 0 is not satisfied.
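Proposition 2 can be checked by Monte Carlo. The sketch below simulates a standardized GP vector (σ = 1, γ = 0) through the construction X_0 = E + T − max_{1≤j≤d} T_j underlying Eq. 4.7, with T taken Gaussian purely for illustration, and compares the weighted sum, conditioned to be positive, with the exponential distribution the proposition predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400_000, 3
a = np.array([1.0, 2.0, 3.0])            # nonnegative weights

# Standardized GP vector (sigma = 1, gamma = 0) via X0 = E + T - max_j T_j,
# with T Gaussian here only for illustration.
T = rng.normal(size=(n, d))
E = rng.exponential(size=n)
X0 = E[:, None] + T - T.max(axis=1, keepdims=True)

S = X0 @ a                               # weighted component sum
S_pos = S[S > 0]

# Proposition 2, gamma = 0 case: S | S > 0 should be exponential
# with scale sum_j a_j * sigma_j = 6.
print(S_pos.mean())                      # should be close to 6
```

The same experiment with common γ ≠ 0 margins (via the transformation (4.6) applied componentwise) checks the general statement.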

Simulation
In this section we outline four methods for sampling from multivariate GP distributions. For Methods 1 to 3 we focus on simulation of a GP vector X_0 with σ = 1 and γ = 0, since a vector X with general σ and γ is obtained at once from the vector X_0 through (4.6). Furthermore, using the connection between GP_U and GP_R distributions, GP_R vectors may be obtained by simulating GP_U vectors, and vice versa. Throughout we assume that simulation of U from F_U and T from F_T is possible. Recall the relation S = T − max_{1≤j≤d} T_j which was used to define the (T) representation. The first method follows immediately from Eq. 4.7.
Method 1: simulation from the (T) representation. Simulate a vector T ∼ F_T and an independent variable E ∼ Exp(1), and set X_0 = E + T − max_{1≤j≤d} T_j.
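A minimal implementation of Method 1, together with the transformation (4.6) to general (σ, γ); the choice of Gumbel-distributed T is arbitrary and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rgp_T(n, sigma, gamma, rT):
    """Method 1: simulate from GP_T(sigma, gamma, F_T).
    rT(n) must return an (n, d) sample from F_T."""
    T = rT(n)
    E = rng.exponential(size=(n, 1))
    X0 = E + T - T.max(axis=1, keepdims=True)   # standardized: sigma = 1, gamma = 0
    # transformation (4.6) to general parameters, with the gamma_j = 0 limit handled
    g_safe = np.where(gamma == 0.0, 1.0, gamma)
    return np.where(gamma == 0.0, sigma * X0, sigma * np.expm1(gamma * X0) / g_safe)

sigma = np.array([0.5, 1.0])
gamma = np.array([0.3, 0.0])
X = rgp_T(100_000, sigma, gamma, lambda n: rng.gumbel(size=(n, 2)))
# In the standardized model max_j X0_j = E > 0, and (4.6) preserves signs,
# so at least one component always exceeds zero.
print((X.max(axis=1) > 0).mean())
```

As a further sanity check, a component conditioned to be positive should follow the one-dimensional GP(σ_j, γ_j) distribution of Eq. 2.9; for the second component above that is a standard exponential.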
Simulation from the (R) and (U) representations is less direct. We propose three methods: rejection sampling, MCMC sampling, and approximate simulation using (4.1). The idea in Methods 2 and 3 is to use an appropriate change of measure so that Method 1 can be used to simulate from the (T) representation. If in the GP_T(1, 0, F_T) density one replaces T by T_0, where T_0 has the density given in Eq. 7.1, then one obtains the GP_U(1, 0, F_U) density. Thus, if one can simulate T_0 vectors, then these give GP_U(1, 0, F_U) vectors via Method 1.
Method 2: simulation of T_0 via rejection sampling. Let ϕ be a probability density function which satisfies f_{T_0}(x) ≤ Kϕ(x) for some constant K > 0. Draw a candidate vector T_0^c from ϕ, accept the candidate with probability f_{T_0}(T_0^c)/{Kϕ(T_0^c)}, and repeat otherwise. Use the accepted vector as input T in Method 1.
The acceptance probability is 1/K, and thus it is advantageous to find a ϕ such that K is not too large. In high dimensions however, such a ϕ might be difficult to find.
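A concrete instance of Method 2, under the assumption (our reading of Eq. 7.1) that f_{T_0}(x) ∝ e^{max_j x_j} f_T(x): for T bivariate standard normal, the bound e^{max x} ≤ e^{x_1} + e^{x_2} turns Kϕ into an equal-weight mixture of unit-variance normals with shifted means, and the acceptance probability simplifies to e^{max x}/(e^{x_1} + e^{x_2}).

```python
import numpy as np

rng = np.random.default_rng(2)

def rT0_reject(n):
    """Method 2 sketch: exact draws from f_T0 ∝ exp(max_j x_j) * phi(x1) * phi(x2)
    (assumed form of Eq. 7.1), phi the standard normal density.
    Envelope: exp(max x) <= exp(x1) + exp(x2), i.e. an equal-weight mixture of
    N(e_j, I) proposals; acceptance prob is exp(max x) / (exp(x1) + exp(x2))."""
    out = []
    while len(out) < n:
        m = n - len(out)
        x = rng.normal(size=(m, 2))
        j = rng.integers(0, 2, size=m)
        x[np.arange(m), j] += 1.0          # shift the mean of a random coordinate
        accept = rng.random(m) < np.exp(x.max(axis=1)) / np.exp(x).sum(axis=1)
        out.extend(x[accept])
    return np.array(out[:n])

# Use the T0 draws as input T in Method 1 to get GP_U(1, 0, F_U) vectors:
T0 = rT0_reject(200_000)
X0 = rng.exponential(size=(len(T0), 1)) + T0 - T0.max(axis=1, keepdims=True)
# sanity check: a component conditioned to be positive should be standard exponential
x1 = X0[:, 0][X0[:, 0] > 0]
print(x1.mean())                           # should be close to 1
```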
Method 3: simulation of T_0 via MCMC. Use a standard Metropolis-Hastings algorithm to simulate a Markov chain with stationary distribution (7.1). At iteration i, draw a candidate vector T_0^c from the density f_T and accept the candidate with probability min{1, exp(max_{1≤j≤d} T_{0,j}^c − max_{1≤j≤d} T_{0,j}^{(i−1)})}, where T_0^{(i−1)} is the current state of the chain. If the candidate is not accepted, then the previous state of the chain is repeated. After a suitable burn-in time, values of the chain should represent dependent samples from Eq. 7.1; the draws can be thinned to produce approximately independent replicates. Use the simulated values of the chain as inputs T to Method 1. Alternative proposal distributions could be used with appropriate modification of the acceptance probability; for details see e.g. Chib and Greenberg (1995).
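A sketch of Method 3 for the same target, again assuming f_{T_0}(x) ∝ e^{max_j x_j} f_T(x): with an independence proposal from f_T the f_T factors cancel, so the acceptance probability reduces to min{1, exp(max_j T^c_{0,j} − max_j T^{(i−1)}_{0,j})}.

```python
import numpy as np

rng = np.random.default_rng(3)

def rT0_mcmc(n, rT, burn=500, thin=5):
    """Method 3 sketch: independence Metropolis-Hastings targeting
    f_T0 ∝ exp(max_j x_j) f_T(x) (assumed form of Eq. 7.1), proposal f_T.
    The f_T factors cancel in the Metropolis ratio, leaving
    acceptance prob min{1, exp(max x_cand - max x_current)}."""
    cur = rT(1)[0]
    out = []
    for i in range(burn + thin * n):
        cand = rT(1)[0]
        if np.log(rng.random()) < cand.max() - cur.max():
            cur = cand
        if i >= burn and (i - burn) % thin == 0:
            out.append(cur)
    return np.array(out)

rT = lambda m: rng.normal(size=(m, 2))     # T bivariate standard normal, for illustration
T0 = rT0_mcmc(10_000, rT)
# Feed the (approximately independent, thinned) draws into Method 1:
X0 = rng.exponential(size=(len(T0), 1)) + T0 - T0.max(axis=1, keepdims=True)
x1 = X0[:, 0][X0[:, 0] > 0]
print(x1.mean())                           # should be close to 1
```

The burn-in and thinning values are illustrative; in practice they should be chosen from standard MCMC diagnostics.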
By Eq. 4.1, an approximate way to simulate X ∼ GP R (σ , γ , F R ) is as follows.
Method 4: approximate simulation from the (R) representation. Choose a large K > 0.
In this algorithm, the probability to keep a simulated R/T^γ value is approximately K^{−1} ∫_0^∞ F̄_R(t^γ σ/γ) dt, so one has to simulate approximately K/∫_0^∞ F̄_R(t^γ σ/γ) dt values of R/T^γ to get one X-value. Hence a large K, which ensures that the approximating distribution H^{(K)} is close to H, leads to longer computation times, and a compromise has to be made. As a guide to the compromise it is often possible, e.g. for Gaussian or log-Gaussian processes, to compute, analytically or numerically, sharp bounds for the approximation errors.
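A sketch of Method 4, with the algorithm read as: draw T uniform on (0, K) and R ∼ F_R, and keep X = R/T^γ − σ/γ whenever at least one component is positive. The check below uses the fact that each margin of a GP vector, conditioned to be positive, is one-dimensional GP(σ_j, γ_j).

```python
import numpy as np

rng = np.random.default_rng(4)

def rgp_R_approx(n, sigma, gamma, rR, K=2000.0):
    """Method 4 sketch (assumed reading of the algorithm): T ~ Uniform(0, K),
    R ~ F_R, keep X = R / T**gamma - sigma/gamma if some component is positive.
    Larger K shrinks the truncation error at the cost of more proposals."""
    out = []
    while len(out) < n:
        m = 500_000
        t = rng.uniform(0.0, K, size=m)
        R = rR(m)
        X = R / t[:, None] ** gamma - sigma / gamma
        out.extend(X[(X > 0).any(axis=1)])
    return np.array(out[:n])

sigma = np.array([1.0, 1.0])
gamma = np.array([0.3, 0.3])
rR = lambda m: rng.exponential(size=(m, 2))   # independent exponential shape vector
X = rgp_R_approx(2000, sigma, gamma, rR)
# each margin conditioned to be positive is approximately GP(sigma_j, gamma_j)
x1 = X[:, 0][X[:, 0] > 0]
print(x1.mean())                              # GP(1, 0.3) mean = 1/(1 - 0.3) ≈ 1.43
```

The low acceptance rate visible here (of order 10^{-4} for K = 2000 in this toy model) illustrates the compromise discussed above.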
To summarize, Method 1 is simplest, but only produces GP T (σ , γ , F T ) vectors. Method 2 and Method 3 provide ways to simulate vectors T 0 from distribution (7.1), which can then be inserted into Method 1 to simulate from the GP U (σ , γ , F U ) and GP R (σ , γ , F R ) distributions. Method 4 is as simple to program as Method 1 and produces i.i.d. vectors, but, similarly to Method 3, only approximates the target distribution.

Conclusion
This paper studies the probability theory underlying peaks over thresholds modelling of multivariate data using generalized Pareto distributions. We first derive basic properties of the multivariate GP distribution, including behaviour under conditioning; scale change; convergence in distribution; mixing; and connections with generalized extreme value distributions. The main results are a point process limit result which gives a general and concrete description of the behaviour of extreme episodes; new representations of the cdf-s of multivariate GP distributions, motivated by and derived from the point process result; expressions for likelihoods and censored likelihoods; formulas for probabilities and conditional probabilities of general sets; and algorithms for random sampling from multivariate GP distributions. Throughout, the results are illustrated by examples.
We provided four different representations of GP distributions, labelled (R), (S), (T), and (U). Computationally, the (T) densities are simplest, and simulation from the (T) representation also is simpler than simulation from the other representations. On the other hand, it seems impractical to use the (T) representation for prediction or spatial modelling, since taking lower-dimensional margins of it does not simply lead to the proper lower-dimensional (T) representations, and since a d-dimensional (T) representation does not include any prescription for how to extend it to a (d + 1)-dimensional one. The (S), (T), and (U) representations allow for smooth transitions from positive to negative γ_j, in contrast to the (R) representation. In some situations, however, requirements of realistic physical modelling can nevertheless lead to the use of the (R) representation.
Peaks over thresholds modelling of extremes of a random vector Y first selects a suitable level u and then models the distribution of the over- and undershoots, X = Y − u, conditional on the occurrence of at least one overshoot, by a GP distribution. Of course, this GP model also models the conditional distribution of the original vector Y, since Y = X + u. Modelling issues which are not treated include choice of the level u, perhaps as a function of covariates like time, and modelling of the Poisson process which governs the occurrence of extreme episodes.
A further practical issue, which is outside the scope of the current paper, is that of asymptotic independence of extremes. In the event that the limiting probability of joint occurrence of extremes, conditional upon at least one extreme component, is zero, multivariate GP distributions will typically not represent the best models. Asymptotic independence is usually manifested in practice by the threshold stability properties of multivariate GP distributions not holding. Diagnostics based on these stability properties are presented in Kiriliouk et al. (2016).
The paper gives a basis for understanding and modelling of extreme episodes. We believe it will contribute to the solution of many different and important risk handling problems. However, it still is an early excursion into new territory, and much research remains to be done. Important challenges include incorporating temporal dependence and developing methods for prediction of the unfolding of extreme episodes.