Multivariate peaks over thresholds models

Multivariate peaks over thresholds modelling based on generalized Pareto distributions has up to now only been used in a few, mostly two-dimensional, situations. This paper contributes theoretical understanding, physically based models, inference tools, and simulation methods to support routine use, with an aim at higher dimensions. We derive a general point process model for extreme episodes in data, and show how conditioning the distribution of extreme episodes on threshold exceedance gives three basic representations of the family of generalized Pareto distributions. The first representation is constructed on the real scale of the observations. The second one starts with a model on a standard exponential scale which is then transformed to the real scale. The third is a reformulation of a spectral representation proposed in A. Ferreira and L. de Haan [Bernoulli 20 (2014) 1717--1737]. Numerically tractable forms of densities and censored densities are found and give tools for flexible parametric likelihood inference. New simulation algorithms, explicit formulas for probabilities and conditional probabilities, and conditions which make the conditional distribution of weighted component sums generalized Pareto are also derived.


Introduction
Peaks over thresholds (PoT) modelling was introduced in the hydrological literature. The philosophy is simple: extreme events, perhaps extreme water levels, are often quite different from ordinary everyday behavior, and ordinary behavior then has little to say about extremes, so that only other extreme events give useful information about future extreme events. To make this idea operational, one defines an extreme event as a value, say a water level, which exceeds some high threshold, and only uses the sizes of the excesses over this threshold, the "peaks over the threshold", for statistical inference. This idea was given a theoretical foundation by combining it with asymptotic arguments motivating that the natural model is that exceedances occur according to a Poisson process and that excess sizes follow a generalized Pareto (GP) distribution (Balkema and de Haan, 1974; Pickands, 1975; Davison and Smith, 1990).
Since then thousands of papers have used one-dimensional PoT models (though often not under this name), in areas ranging from earth and atmosphere science to finance, see e.g. Kyselý et al. (2010), Katz et al. (2002), McNeil et al. (2015), and the method has been presented in a number of books. As just one example, the book by Coles (2001) gives a useful practical introduction to the method.
However, often it is not just one extreme event which is important, but an entire extreme episode. In the 2005 flooding of New Orleans caused by Hurricane Katrina, more than 50 dykes were breached. However, many others held, and the damage was to a large extent determined by which dykes broke and which held (Wikipedia, 2016b). Extreme rain can lead to devastating landslides, which can be caused by one day with very extreme rainfall, or by smaller, but still extreme, rain amounts during two or more consecutive days (Guzzetti et al., 2007). The 2003 heat-wave in central Europe is estimated to have killed between 25 000 and 70 000 people. Many deaths, however, were not caused by one extremely hot day, but rather by a sequence of very high minimum nightly temperatures which led to increasing fatigue, and eventually to death (Wikipedia, 2016a). These, and very many other important societal problems underline the importance of statistical methods which can handle multivariate extreme episodes, such as the water levels at many different locations during a flood, or the minimum night temperatures during a number of consecutive days.
The aim of this paper is to contribute to the probabilistic understanding, physically based models, likelihood tools, and simulation methods which are needed for multivariate PoT modelling of extreme episodes, also in high dimensions. Using the same philosophy as for extreme events in one dimension, PoT modelling of extreme episodes proceeds by choosing a high threshold for each component of the episode, and then considering an episode extreme if at least one component exceeds its threshold. One then only models the differences between the values of the components and their respective thresholds. However, in the multivariate case both the heights of the peaks over the thresholds and the depths of the troughs under the thresholds are modeled. For instance, in a rainfall episode affecting a number of catchments, both the amount of rain in the catchments where rainfall exceeds the threshold and the amount in catchments where the threshold is not exceeded are important. Additionally, the inclusion of troughs, or undershoots, increases the amount of information that can be used for inference.
Just as in one dimension, the natural model is that extreme episodes occur according to a Poisson process and that overshoots and undershoots (or undershoots larger than a censoring threshold) jointly follow a multivariate GP distribution. Below we derive the basic properties of this class of distributions. We then pursue the following program: (i) to model the possible asymptotic forms of extreme episodes in data; (ii) to show how conditioning on threshold exceedances transforms the distribution of the extreme episodes to GP distributions, and to use this to find physically motivated and interpretable representations of the multivariate GP distributions; and (iii) to derive likelihoods and censored likelihoods for the representations in ii).
In part ii) of the program, we develop three representations. The first one is in the same units as the observations, i.e., on the real scale, and in the second one the model is built on a standard exponential scale and then transformed to the real observation scale. The third is a spectral representation proposed in Ferreira and de Haan (2014).
A useful, and to us surprising, discovery is that it is possible to derive the density for the Ferreira-de Haan representation, and that this density in fact is simpler than the densities for the other two representations. The importance of iii) is that likelihood inference makes it possible to incorporate covariates, e.g. temporal or spatial trends, in the modelling in a flexible and practical way.
The insights and results obtained in carrying out this program, we believe, will lead to new models, new computational techniques, and new ways to make the necessary compromises between modeling realism and computational tractability which together will make routine use possible, also in high dimensions. The models mentioned in Example 5.2 may be a case in point. The formulas for probabilities, conditional probabilities and conditional densities given in Section 5.3, together with the discovery that weighted sums of components of GP distributions conditioned to be positive also have a GP distribution, add to the usefulness of the methods.
Simulation of GP distributions is needed for several reasons, including computation of the probabilities of complex dangerous events and goodness of fit checking. The final contribution of this paper is a number of simulation algorithms for multivariate GP distributions.
The multivariate GP distributions were introduced in Tajvidi (1996), Beirlant et al. (2004, Chapter 8), and Rootzén and Tajvidi (2006); see also Falk et al. (2010, Chapter 5). The literature on applications of multivariate PoT modelling is rather sparse (Brodin and Rootzén, 2009; Michel, 2009; Aulbach et al., 2012). Some earlier papers use point process models which are closely related to the PoT/GP approach (Coles and Tawn, 1991; Joe et al., 1992). Other papers consider nonparametric or semiparametric rank-based PoT methods focusing on the dependence structure but largely ignoring modelling of the margins (de Haan et al., 2008; Einmahl et al., 2012, 2016). However, the GP approach has the advantages that it provides complete models for the threshold excesses, that it can use well-established model checking tools, and that it leads to more natural parametrizations of trends in the Poisson process which governs the occurrence of extreme episodes. There is an important literature on modelling componentwise, perhaps yearly, maxima with multivariate generalized extreme value (GEV) distributions. A survey in the spatial context is given in Davison et al. (2012). Componentwise maxima are sometimes mandated by the practical problem under study, or by the fact that only data on componentwise maxima are available. However, the componentwise maxima may occur at different times for different components, and in many situations focus is on the structure modeled by the PoT approach: extremes which occur simultaneously. Additionally, likelihood inference for GEV distributions is complicated by the fact that analytic expressions for high-dimensional densities typically are not available, so that inference often is much easier, and perhaps more efficient, in GP models; see Huser et al. (2015) for a survey and an extensive comparison.
Section 2 recalls some basic facts about multivariate GEV and GP distributions, and derives and exemplifies the basic properties of the class of GP cumulative distribution functions (cdf-s). In Section 3 we derive a point process model of the structure of extreme episodes, and Section 4 shows how conditioning on exceeding high thresholds leads to three basic representations of the GP distributions. In Section 5 we derive the densities and censored densities which correspond to the three representations, discuss some parametrization issues, and give formulas for probabilities and conditional probabilities in GP distributions. Finally, Section 6 contributes simulation algorithms for multivariate GP distributions and Section 7 gives a concluding overview of results and future challenges.

Multivariate generalized Pareto distributions
This section first briefly recalls and adapts parts of existing theory for multivariate GEV and GP distributions, then derives a number of basic properties of GP distributions, and ends with a brief comment on work by Falk and coauthors.
We throughout use notation as follows. The maximum and minimum operators are denoted by the symbols ∨ and ∧, respectively. Bold symbols denote d-variate vectors. For instance, α = (α_1, . . . , α_d) and 0 = (0, . . . , 0) ∈ R^d. Operations and relations involving such vectors are meant componentwise, with shorter vectors being recycled if necessary. For instance, ax + b = (a_1x_1 + b_1, . . . , a_dx_d + b_d). If F is a cdf then we write F̄ = 1 − F for its tail function, and we also write F for the probability distribution determined by the cdf. The expression X ∼ F means that X has distribution F, and →_d denotes convergence in distribution. The symbol 1 is the indicator function: 1_A equals 1 on the set A and 0 otherwise.

Background: multivariate generalized extreme value distributions
Throughout, G denotes a d-variate GEV distribution. The class of GEV distributions has the following equivalent characterizations, see e.g. Beirlant et al. (2004).
(M1) It is the class of limit distributions of maxima: let X_1, X_2, . . . be an i.i.d. sequence with cdf F. If, for some scaling and location sequences a_n ∈ (0, ∞)^d and b_n ∈ R^d,
\[ a_n^{-1}\Big( \bigvee_{i=1}^n X_i - b_n \Big) \overset{d}{\longrightarrow} G, \qquad n \to \infty, \tag{2.2} \]
where G has non-degenerate margins, then G is a GEV distribution. Conversely all GEV distributions may be obtained in this way.
(M2) It is the class of max-stable distributions: let X_1, X_2, . . . be i.i.d. with cdf G which has non-degenerate margins. If there are scaling and location sequences a_n ∈ (0, ∞)^d and b_n ∈ R^d such that
\[ a_n^{-1}\Big( \bigvee_{i=1}^n X_i - b_n \Big) \overset{d}{=} X_1, \]
then G is a GEV distribution. Conversely all GEV distributions have this property.
If (M1) holds then F is said to be in the (maximum) domain of attraction of G, and we write F ∈ D(G). It follows from (M1) that the class of GEV distributions is closed under location and scale changes.
The marginal distribution functions G_1, . . . , G_d of G may be written as
\[ G_j(x) = \exp\Big\{ - \Big( 1 + \gamma_j \frac{x - \mu_j}{\alpha_j} \Big)^{-1/\gamma_j} \Big\}, \qquad x \in I_j, \tag{2.3} \]
where I_j is the interval on which 1 + γ_j(x − μ_j)/α_j > 0. We will use this parametrization throughout. The case γ_j = 0 is to be interpreted as the limit γ_j → 0, that is, G_j(x) = exp{−e^{−(x−μ_j)/α_j}} with I_j = R, while G is supported by a (subset of) the rectangle I_1 × · · · × I_d. The lower and upper endpoints of G_j are denoted by η_j ∈ R ∪ {−∞} and ω_j ∈ R ∪ {+∞}, respectively. One may alternatively write the condition x ∈ I_1 × . . . × I_d as γx + σ > 0, where σ = α − γμ.
An easy argument shows that G is max-stable if and only if to each t > 0 there exist scale and location vectors a_t ∈ (0, ∞)^d and b_t ∈ R^d such that G(a_t x + b_t)^t ≡ G(x) (Resnick, 1987, Equation (5.17)). It follows from (2.3) that these parameters are given by
\[ a_t = t^{\gamma}, \qquad b_t = \sigma\, \frac{t^{\gamma} - 1}{\gamma}, \tag{2.5} \]
with b_{t,j} = σ_j log t if γ_j = 0. To a GEV distribution G we can associate a Borel measure ν on ∏_{j=1}^d [η_j, ∞) \ {η} by the formula ν({y : y ≰ x}) = − log G(x) for x ∈ [−∞, ∞)^d, with the convention that − log(0) = ∞ (Resnick, 1987, Proposition 5.8). The measure ν is called the intensity measure because, by the domain-of-attraction condition (2.2), the limit of the expected number of points a_n^{-1}(X_i − b_n), i ∈ {1, . . . , n}, in a ν-smooth Borel set A is equal to ν(A). The intensity measure ν determines the limit distribution of the sequence of point processes Σ_{i=1}^n δ_{a_n^{-1}(X_i − b_n)}, see Section 3.
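As a small numerical illustration (our own sketch, not part of the paper's development), the following Python snippet checks the max-stability relation G(a_t x + b_t)^t = G(x) for a single margin parametrized as in (2.3), with a_t and b_t as in (2.5); the parameter values are arbitrary:

```python
import numpy as np

# Sketch (ours): check G(a_t x + b_t)^t = G(x) for one GEV margin, gamma != 0.
mu, alpha, gamma = 1.0, 2.0, 0.3
sigma = alpha - gamma * mu                 # sigma_j = alpha_j - gamma_j mu_j

def G(x):
    # GEV margin as in (2.3)
    return np.exp(-(1.0 + gamma * (x - mu) / alpha) ** (-1.0 / gamma))

t, x = 7.0, 1.5
a_t = t ** gamma                           # a_t = t^gamma
b_t = sigma * (t ** gamma - 1.0) / gamma   # b_t = sigma (t^gamma - 1)/gamma, (2.5)
print(G(a_t * x + b_t) ** t, G(x))         # the two values coincide
```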

Generalized Pareto distributions
Let G be a GEV distribution with 0 < G(0) < 1 and let ν be the corresponding intensity measure. Then 0 < ν({y : y ≰ 0}) < ∞, so that we can define a probability measure supported on the set {y : y ≰ 0} by restricting the intensity measure ν to that set and normalizing it. The result is the generalized Pareto (GP) distribution associated to G. For its cumulative distribution function, H, we obtain the expression
\[ H(x) = \frac{\log G(x \wedge 0) - \log G(x)}{\log G(0)}, \qquad x > \eta, \tag{2.6} \]
see Beirlant et al. (2004, Chapter 8) and Rootzén and Tajvidi (2006). If a GEV cdf G and a GP cdf H satisfy (2.6), then we say that they are associated and write H ↔ G.
For completeness, we have added a proof of (2.6) in the appendix. For points x ∈ [−∞, ∞)^d such that x ≥ η and x_j = η_j for some j, the value of H(x) is determined by right-hand continuity: H(x) = lim_{h ↓ 0} H(x + h). The margins of H are denoted by H_1, . . . , H_d.
The GP distributions defined by (2.6) constitute the class of limit distributions of threshold excesses for random vectors. The probability that component j ∈ {1, . . . , d} exceeds zero is equal to 1 − H j (0) = log G j (0)/ log G(0), which is positive if and only if G j (0) < 1, that is, when σ j = α j − γ j µ j > 0. Since G(0) < 1 implies that G j (0) < 1 for some but not necessarily all j, the GP family includes distributions for which one (or several) of the components never exceed their threshold, so that the support of that component lies in [−∞, 0]. This could be useful in some modelling situations, but still, the situation which is of main interest is when all components have a positive probability of being an exceedance, or equivalently when H j (0) < 1 for all j ∈ {1, . . . , d}, or, again equivalently, when σ > 0.
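To illustrate (2.6) concretely, the following small Python sketch (ours; the logistic dependence model with standard Gumbel margins is just an assumed example) computes the bivariate GP cdf H associated with a GEV cdf G:

```python
import numpy as np

theta = 0.5  # logistic dependence parameter, 0 < theta <= 1 (assumed example)

def log_G(x, y):
    # Bivariate logistic GEV with standard Gumbel margins:
    # log G(x, y) = -(e^{-x/theta} + e^{-y/theta})^theta
    return -(np.exp(-x / theta) + np.exp(-y / theta)) ** theta

def H(x, y):
    # GP cdf associated with G via (2.6):
    # H(x) = {log G(x ^ 0) - log G(x)} / log G(0)
    num = log_G(min(x, 0.0), min(y, 0.0)) - log_G(x, y)
    return num / log_G(0.0, 0.0)

print(H(0.0, 0.0))    # 0.0: no mass below the threshold
print(H(1.0, 2.0))    # probability of the rectangle (-inf, 1] x (-inf, 2]
print(H(50.0, 50.0))  # close to 1
```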
Completely parallel to the class of GEV distributions, the class of GP distributions H such that H j (0) < 1 for all j ∈ {1, . . . , d} has the following equivalent characterizations (Rootzén and Tajvidi, 2006). The threshold functions u t in the characterizations are assumed to be continuous and increasing.
(T1) It is the class of limit distributions of threshold excesses: let X ∼ F. If there exist scaling and threshold functions s_t ∈ (0, ∞)^d and u_t ∈ R^d, with F(u_t) < 1 and F(u_t) → 1 as t → ∞, such that
\[ P\big( s_t^{-1}(X - u_t) \le x \mid X \nleq u_t \big) \to H(x), \qquad t \to \infty, \]
where H has nondegenerate margins and H_j(0) < 1 for all j ∈ {1, . . . , d}, then H is a GP distribution. Conversely, all GP distributions H such that H_j(0) < 1 for all j ∈ {1, . . . , d} may be obtained in this way.
(T2) It is the class of threshold-stable distributions: let X ∼ H where H has nondegenerate margins and H_j(0) < 1 for all j ∈ {1, . . . , d}. If there exist scaling and threshold functions s_t ∈ (0, ∞)^d and u_t ∈ R^d, with H(u_t) → 1 as t → ∞, such that
\[ P\big( s_t^{-1}(X - u_t) \le x \mid X \nleq u_t \big) = H(x), \]
then H is a GP distribution. Conversely, all GP distributions H for which H_j(0) < 1 for all j ∈ {1, . . . , d} have this property.
In Rootzén and Tajvidi (2006) it is additionally shown that in (T1) and (T2), it is enough to assume that the convergence holds for x > 0.
If (T1) holds we say that F belongs to the (threshold) domain of attraction of H with threshold function u t and write F ∈ D(H). In contrast to the situation for GEV distributions, different threshold functions can lead to limits which are not location-scale transformations of one another. A cdf F is in a domain of attraction for maxima if and only if it is in a threshold domain of attraction.
Recall the notation I j = (η j , ω j ) for the interval supporting G j . The GP distribution H is supported by the set [η, ω] \ [η, 0] = {x ∈ R d : ω j ≥ x j ≥ η j for all j, and x j > 0 for some j}.
The GP distribution H may assign positive mass to the hyperplanes {y : y j = η j }, even if η j = −∞; see Examples 2.3 and 2.4 below.
For a non-empty subset S of {1, . . . , d}, let H_S denote the corresponding |S|-variate marginal distribution of H. Further, let H_S^+ denote H_S conditioned to have at least one positive component; this presupposes that σ_j > 0 for some j ∈ S, where σ_j = α_j − μ_jγ_j as before. By Proposition 2.1(i) below, if σ > 0, then H_j^+ := H^+_{\{j\}}, the j-th marginal distribution of H conditioned to be positive, has cdf
\[ H_j^+(x) = 1 - \Big( 1 + \frac{\gamma_j x}{\sigma_j} \Big)^{-1/\gamma_j}, \qquad x > 0, \tag{2.7} \]
with the interpretation H_j^+(x) = 1 − e^{−x/σ_j} if γ_j = 0. This proves the intuitively appealing result that H_j^+ is a one-dimensional GP distribution, and also shows that σ and γ are parameters of the marginal distributions.
In the following proposition we give the general version of this result and collect a number of other properties of GP distributions. If S is a non-empty subset of {1, . . . , d}, then H_S^+ is the conditional distribution of (X_j : j ∈ S) given that max_{j∈S} X_j > 0; see (2.1) above.
Recall that G and H are said to be associated, H ↔ G, if they satisfy (2.6).
By Proposition 2.1(iii), for u ≥ 0 with H(u) < 1, the conditional distribution of X − u given that X ≰ u is a GP distribution with the same shape parameter γ and with σ replaced by σ + γu. In words, Proposition 2.1(i) says that conditional margins of GP distributions are GP, but that marginal distributions of GP distributions are typically not GP. For instance, if H is a two-dimensional GP cdf, then H_1^+ is a one-dimensional GP cdf [given by (2.7)], but typically H_1 is not. Intuitively, the reason is that the conditioning event implicit in H_1(x) also includes the possibility that it is the second component, rather than the first one, that exceeds its threshold.
Proposition 2.1(i)−(iii) and Proposition 2.2 establish various closure properties of the class of GP distributions. By (iv) and (v) of Proposition 2.1, a GEV distribution specifies the associated GP distribution and, conversely, a GP distribution specifies a curve of associated GEV distributions in the space of distribution functions. By (iv) it follows that, for a GEV G and a GP H with H ↔ G, and for a_t and b_t given by (2.5), it then also holds that H ↔ G(a_t( · ) + b_t). Regarding (v), note that a GEV distribution G such that 0 < G_j(0) < 1 for all j ∈ {1, . . . , d} is determined by the values of G(x) for x ≥ 0 (proof in the appendix). Finally, assertion (vi) identifies the affine transformations which leave H unchanged, and (vii) establishes identifiability of the marginal parameters.
Proof of Proposition 2.1. (i) Inserting (2.3) into the equation above for S = {j}, together with a straightforward calculation, proves (2.7), and hence completes the proof of the first assertion.
If H S (0) = 0, then H S = H + S and it follows from the first assertion that H S is a GP distribution function. If instead H S (0) > 0 then it follows from (2.6) that H S is not a GP distribution. This proves the second assertion.
(ii) If G is a GEV cdf then, for s > 0, the map x → G(x/s) is a GEV cdf too, and the result then follows from (2.6). (iii) Proceeding as in the proof of (i), but in the first step instead using that (x + u) ∧ u = x ∧ 0 + u, shows that the conditional distribution of X − u given that X ≰ u is the asserted GP distribution. (iv) The first assertion follows from (2.6). Using the first assertion, we may without loss of generality assume that − log G(0) = − log G̃(0), and then (2.6) shows that log G(x ∧ 0) − log G(x) = log G̃(x ∧ 0) − log G̃(x), and in particular, G(x) = G̃(x) for x ≥ 0. Since a GEV cdf with σ > 0 is determined by its values for x ≥ 0, see the appendix, this completes the proof.
(v) The first part follows from (2.6), and that this determines G again follows from the appendix.
(vi) This follows from the proofs of (iii) and (iv). (vii) By assertion (iv), the GP cdf H determines the curve of GEV-s G_t for t > 0. It suffices to show that γ_j(t) and σ_j(t) do not depend on t. But this follows by straightforward calculations from the max-stability property with a_t and b_t as in (2.5).
Proof of Proposition 2.2. Convergence in distribution in R^d implies convergence of the marginal distributions, and using standard converging subsequence arguments it follows from marginal convergence that there exist σ > 0 and γ such that σ_n → σ and γ_n → γ. Define u_{n,t} and s_{n,t} from H_n as in Proposition 2.1(vi). Then, since H_n is a GP cdf we have, using first (T2) and (vi), and then the continuous mapping theorem, that the threshold-stability relation in (T2) carries over to the limit; since H_n →_d H it follows that H satisfies (T2) and hence is a GP cdf.
In Examples 2.3 and 2.4 below we exhibit two-dimensional GP distributions with positive mass on certain lines, and Example 2.5 provides a distribution where the second assertion in (i) of Proposition 2.1 comes into play. In contrast to the case of scale transformations, it seems likely that if σ > 0 then a non-trivial location transformation of an MGP distribution is never a GP distribution. Example 2.6 shows one of the exceptional cases where the support of one of the components is contained in (−∞, 0) and where a location transformation of a GP distribution does give another GP distribution.
Example 2.4. If we modify Example 2.3 by choosing Gumbel margins rather than Fréchet margins, so that G(x, y) = exp(−e^{−x} − e^{−y}) for (x, y) ∈ R^2, then the corresponding MGP distribution H is the distribution function of a random pair (X, Y) which equals (E, −∞) or (−∞, E) with probability 1/2 each, where E is a standard exponential variable. The support of H is given by the union of two half-lines through −∞, that is, the sets {(x, −∞) : x ≥ 0} and {(−∞, y) : y ≥ 0}.
Example 2.5. Let G(x, y) = exp[−1/{(x ∧ y) + 1}] for (x, y) ∈ (−1, ∞)^2, the distribution of the random pair (Z, Z), where Z is a unit Fréchet random variable with lower endpoint given by −1. The corresponding MGP distribution H is given by
\[ H(x, y) = \frac{1}{(x \wedge y \wedge 0) + 1} - \frac{1}{(x \wedge y) + 1}. \tag{2.10} \]
We identify H as the distribution of the random pair (T, T), where P(T > t) = 1/(1 + t) for t > 0. The support of H is given by the diagonal {(t, t) : 0 < t < ∞}, see the middle panel of Figure 1. It follows that, e.g., H_1(0) = 0 and hence in this example H_1 = H_1^+ is itself a one-dimensional GP distribution.
Example 2.6. As a variation of Example 2.5, let G(x, y) = exp[−e^{−x∧(y+µ)}] for (x, y) ∈ R^2 be the distribution of the random vector (Z, Z − µ), with Z a standard Gumbel variable, and with µ > 0. Then the corresponding GP cdf is H(x, y) = e^{−(x∧0)∧((y∧0)+µ)} − e^{−x∧(y+µ)}. For ν such that ν + µ > 0, the GP cdf corresponding to G(x, y + ν) is of the same form, but with µ replaced by ν + µ > 0. Hence also H(x, y + ν) is a GP cdf. The support of H is shown in the right-hand panel of Figure 1.

Point processes of extremes
The first step in our program for PoT inference is to specify a model for the distribution of extreme episodes in data on the real, physical scale. In this section we do this through a point process model of extreme episodes. We motivate the model by proving that the class of point processes in it coincides with the class of limits of location and scale normalized point processes of extreme values. Let X_1, X_2, . . . be a sequence of independent random vectors with cdf F and marginal cdf-s F_1, . . . , F_d. Let a_n ∈ (0, ∞)^d and b_n ∈ R^d be scale and location sequences such that 0 < F_j(b_{n,j}) < 1 and F_j(b_{n,j}) → 1 as n → ∞, for all j = 1, . . . , d. We consider weak limits of the point processes
\[ N_n = \sum_{i=1}^n \delta_{a_n^{-1}(X_i - b_n)}, \]
where δ_x denotes a point mass at x. Further, let I_j = (0, ∞), (−∞, ∞), or (−∞, 0) according to whether γ_j > 0 or γ_j = 0 or γ_j < 0, and set S̄_γ = Ī_1 × · · · × Ī_d and S_γ = I_1 × · · · × I_d. We use the following representation, (R), of the limiting point process, where the letter 'R' indicates that the model is constructed on the real scale, i.e., on the space where the observations live.
(R) Let T_1, T_2, . . . be the points of a homogeneous Poisson process on [0, ∞) with unit intensity, let (R_i)_{i≥1} be independent copies of a random vector R which are independent of (T_i)_{i≥1}, where R takes values in S̄_γ and satisfies 0 < E(|R_j|^{1/γ_j}) < ∞ if γ_j ≠ 0 and 0 < E(e^{R_j}) < ∞ if γ_j = 0, and define the point process
\[ P_r = \sum_{i \ge 1} \delta_{R_i / T_i^{\gamma}}, \tag{3.1} \]
where, by convention, the j-th component R_{i,j}/T_i^{γ_j} is replaced by R_{i,j} − log T_i if γ_j = 0. Let F_R be the cdf of R. For γ_j ≠ 0 the moment condition in (R) can be seen to be equivalent to requiring that 0 < ∫_0^∞ P(R_j > t^{γ_j} x) dt < ∞, for x ∈ I_j. For γ_j = 0 the moment condition instead is equivalent to 0 < ∫_0^∞ P(R_j > t^{0} x) dt < ∞, where we have used the convention that t^{0}x should be replaced by log t + x.
In the theorem below, the arrow '→ d ' denotes weak convergence of point processes.
Theorem 3.1. Suppose that F ∈ D(G) for some G. Then there exists an R which satisfies (R) and sequences a n > 0 and b n , with 0 < F j (b n,j ) < 1 and F j (b n,j ) → 1 for each j = 1, . . . , d, such that N n → d P r on S γ , as n → ∞. (3.2) Conversely, for any point process P r given by (3.1) there exist a GEV cdf G and sequences a n > 0 and b n , with 0 < G j (b n,j ) < 1 and G j (b n,j ) → 1 for each j = 1, . . . , d, such that (3.2) holds for F = G.
Proof. The result is closely related to earlier results by S.I. Resnick, so we only give a brief outline of the proof. By choosing b_n suitably we may without loss of generality assume that a_n^{-1}(∨_{i=1}^n X_i − b_n) converges in distribution to a GEV cdf G, the margins of which have support I_j, or equivalently that the parameters satisfy γ_jμ_j = α_j for all j = 1, . . . , d.
Let Y ∼ G and define Y* by Y*_j = |Y_j|^{1/γ_j} if γ_j ≠ 0 and Y*_j = e^{Y_j} if γ_j = 0, so that Y* has a GEV cdf with margins given by (2.3) with µ = α and γ = 1. It follows as in Theorem 5 of Penrose (1992) (see also de Haan and Ferreira (2006) and Schlather (2002)) that there exists a random vector R* ∈ [0, ∞)^d with E(R*_j) < ∞ such that Y* has the same distribution as sup_{i≥1} R*_i/T_i, where the random vectors R*_i are i.i.d. copies of R* and independent of the homogeneous Poisson process (T_i)_{i≥1}. Reversing the transformation which led from Y to Y*, it follows that Y has the same distribution as sup_{i≥1} R_i/T_i^γ, with R_i as in (R). For ν the intensity measure of P_r, we have G(x) = exp{−ν({y : y ≰ x})}. By standard reasoning, the weak convergence of a_n^{-1}(∨_{i=1}^n X_i − b_n) to G is equivalent to F ∈ D(G), and this in turn implies that nP(a_n^{-1}(X_1 − b_n) ∈ · ) converges vaguely to ν( · ). By Theorem 5.3 of Resnick (2007) this proves (3.2).
Conversely, given the point process P_r in (3.1), define the cdf G by G(x) = exp{−ν({y : y ≰ x})}, x ∈ S_γ, with ν the intensity measure of P_r. The j-th marginal distribution, G_j, of G may be obtained as the zero-probability of the marginal Poisson point process (R_{i,j}/T_i^{γ_j} : i ≥ 1) for the interval (x, ∞), so that (see the discussion just after Condition (R)), for x strictly between the two endpoints of I_j,
\[ G_j(x) = \exp\Big\{ -\int_0^\infty P(R_j > t^{\gamma_j} x)\, dt \Big\}. \]
Furthermore, easy calculations, such as for instance the ones in Schlather (2002, Theorem 2), show that G is max-stable. It follows that there exist sequences a_n and b_n with the stated properties such that for independent random vectors X_1, X_2, . . . with common distribution G, the distribution of a_n^{-1}(∨_{i=1}^n X_i − b_n) is equal to G too. By the first part of the theorem, this proves (3.2).
In the proof of Theorem 3.1 we obtained part of the following result, which we record here for completeness. The proof is the same as the proof of the last displayed equation in the previous proof, so we omit it. Let F R j denote the j-th marginal distribution of F R .
Corollary 3.2. The function
\[ G(x) = \exp\Big\{ -\int_0^\infty \bar F_R\big(t^{\gamma} x\big)\, dt \Big\}, \qquad x \in S_\gamma, \]
is a GEV cdf which satisfies γµ = α, and to any GEV cdf for which γµ = α there exists an R which satisfies this equation. The marginal distributions are
\[ G_j(x) = \exp\Big\{ -\int_0^\infty \bar F_{R_j}\big(t^{\gamma_j} x\big)\, dt \Big\}, \qquad x \in I_j, \]
for j = 1, . . . , d.
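The marginal formula can be checked by simulation. The following Monte Carlo sketch (ours; the unit-exponential R_j and the shape value are arbitrary assumptions) compares the empirical cdf of sup_i R_{i,j}/T_i^{γ_j}, computed from a Poisson process truncated to [0, U], with the formula of Corollary 3.2:

```python
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(1)
g, U, n_rep, x = 0.5, 1000.0, 2000, 2.0   # shape, truncation, replications

def sim_max():
    n = rng.poisson(U)                     # unit-rate Poisson process on [0, U]
    T = rng.uniform(0.0, U, n)
    R = rng.exponential(1.0, n)            # R_j ~ Exp(1), an assumed example
    return np.max(R / T**g)

sims = np.array([sim_max() for _ in range(n_rep)])
# Corollary 3.2: G_j(x) = exp{-int_0^inf P(R_j > t^g x) dt}
#              = exp{-E(R_j^{1/g}) x^{-1/g}}  for g > 0
print(np.mean(sims <= x), np.exp(-gamma_fn(1.0 + 1.0 / g) * x ** (-1.0 / g)))
```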

Representations of multivariate GP distributions
This section contains the second step in the program for PoT inference. We show how conditioning on a threshold exceedance in the point process (3.1) gives three widely useful representations of the class of multivariate GP distributions. These representations give new intuition for modelling and lead to tractable expressions for densities. In this and subsequent sections we assume that σ > 0.
The first representation, (R), directly corresponds to the point process P_r in (3.1). In the second representation, (R_s), the basic model is constructed on a standardized scale and then transformed to the scale of the observations. The third representation, (R̃_s), is equivalent to the representation in Ferreira and de Haan (2014), and is obtained by adding the assumption that the random vector S in the (R_s) representation satisfies ∨_{j=1}^d S_j = 0.
In the literature the standard modelling scale is typically chosen as one of the following three: the standard Pareto scale corresponding to γ = 1, the standard Weibull scale corresponding to γ = −1, or the standard Gumbel/exponential scale where γ = 0. Here we choose the last alternative, γ = 0, because of the simple additive structure which the model then acquires. It is straightforward to translate the (R_s) and (R̃_s) representations to representations based on the standard Pareto or standard Weibull scales.
The points in the limit process for (R) have the simplest and perhaps most intuitive structure which, informally, describes extreme events as a random profile, R_i, divided componentwise by random scaling variables T_i^{γ_j}, or, if γ_j = 0, shifted by subtracting a random location variable log T_i. In the (R_s) representation, the general model is instead obtained using a componentwise inverse Box-Cox transformation of the (R) model on the standard Gumbel/exponential scale, and the distribution of this limit point process then depends smoothly on the shape parameter vector γ. The (R̃_s) model leads to the simplest expressions for likelihoods and for simulation but is a bit further away from physical interpretation.
For the (R s ) representation we make the following assumption.
(R_s) Let S be a d-dimensional random vector taking values in [−∞, ∞)^d such that 0 < E(e^{S_j}) < ∞ for j = 1, . . . , d, let S_1, S_2, . . . be independent copies of S, and let (T_i)_{i≥1} be as in (R), and be independent of (S_i)_{i≥1}. Define the point process P_s by
\[ P_s = \sum_{i \ge 1} \delta_{\frac{1}{\gamma}\left( e^{\gamma(S_i - \log T_i)} - 1 \right)}. \]
If γ = 0 then P_s has the simple additive form P_s = Σ_{i≥1} δ_{S_i − log T_i}, which is obtained as the limit of P_s as γ → 0. Further, if γ_j ≠ 0, then setting R_{i,j} = exp(γ_j S_{i,j})/γ_j it is seen that P_s is obtained by moving all the points in the corresponding process P_r componentwise a distance 1/γ_j, and hence that a point process convergence theorem similar to Theorem 3.1 holds also for P_s. We next obtain the GP representation based on (R), Equation (4.2) below, through a limiting argument, use this argument to make a basic intuitive interpretation of the representation, and finally, in Theorem 4.1, prove that all GP distributions are obtained through this representation.
For this, we first replace P_r by a truncated point process P̃_r where the unit rate Poisson process {T_i} on R_+ is replaced by a unit rate Poisson process {T̃_i} on a bounded interval [0, U]. Recalling the standard representation of a Poisson process as a Poisson distributed number of Unif[0, U] variables, a typical point in P̃_r has the form R/T̃^γ with T̃ ∼ Unif[0, U]. Now, by (2.1), and using that P(R/T̃^γ ≤ x) = U^{-1} ∫_0^U F_R(t^γ x) dt, the conditional distribution of R/T̃^γ − σ/γ, given that R/T̃^γ ≰ σ/γ, is
\[ H^{(U)}(x) = \frac{\int_0^U \big\{ F_R\big(t^{\gamma}(x + \sigma/\gamma)\big) - F_R\big(t^{\gamma}(x \wedge 0 + \sigma/\gamma)\big) \big\}\, dt}{\int_0^U \bar F_R\big(t^{\gamma}\, \sigma/\gamma\big)\, dt}. \tag{4.1} \]
By the assumptions in (R), the limit as U → ∞ of this expression,
\[ H_r(x) = \frac{\int_0^\infty \big\{ F_R\big(t^{\gamma}(x + \sigma/\gamma)\big) - F_R\big(t^{\gamma}(x \wedge 0 + \sigma/\gamma)\big) \big\}\, dt}{\int_0^\infty \bar F_R\big(t^{\gamma}\, \sigma/\gamma\big)\, dt}, \tag{4.2} \]
exists (cf. the discussion just before Theorem 3.1), and it is also immediate that H_r(∞) = 1, so that H_r is a cdf on [−∞, ∞)^d. Here, if γ_i = 0 then t^{γ_i}(x_i + σ_i/γ_i) should be interpreted to mean x_i/σ_i + log t.
Theorem 4.1 below shows that the class of cdf-s given by (4.2) is the same as the class of all GP cdf-s. In the following we write GP_r(σ, γ, F_R) for the distributions defined by (4.2). It should be noted that the location normalization made in Theorem 3.1 to set the support of the point process to S_γ does not restrict the generality of (4.2): an exceedance of σ/γ by R/T^γ is the same as an exceedance of σ̃/γ by R/T^γ + c if σ̃ = σ + cγ, for any vector c of constants.
The intuition is that a typical point in P_r has the form X = R/T^γ, where T represents a pseudo random variable with an improper uniform distribution on (0, ∞). Although such an X therefore does not have a proper distribution, one can verify that (4.2) can be derived as if it were the conditional distribution of X − σ/γ, given that X ≰ σ/γ. In Section 5.3 below it is seen that probabilities and conditional probabilities of general events in the distribution (4.2) may be derived in the same way. Using parallel reasoning for the limit process P_s, the (formal) conditional cdf of (σ/γ)(e^{γ(S − log T)} − 1), given that at least one of its components is positive, is
\[ H_s(x) = \frac{\int_0^\infty \big\{ F_S\big(\tfrac{1}{\gamma}\log(\tfrac{\gamma}{\sigma}x + 1) + \log t\big) - F_S\big(\tfrac{1}{\gamma}\log(\tfrac{\gamma}{\sigma}(x \wedge 0) + 1) + \log t\big) \big\}\, dt}{\int_0^\infty \bar F_S(\log t)\, dt} \tag{4.3} \]
for x such that (γ/σ)x + 1 > 0, and where F_S is the cdf of S. We write GP_s(σ, γ, F_S) for the distribution defined by (4.3). Again, in Theorem 4.1 it is proved that H_s indeed is a distribution function, and that the class of distribution functions (4.3) equals the class of GP cdf-s. For later use, note that in this construction, a general GP_s vector is obtained from a GP_s(1, 0, F_S) vector X_0 through the transformation
\[ X = \frac{\sigma}{\gamma}\big( e^{\gamma X_0} - 1 \big). \tag{4.4} \]
For σ = 1 and γ = 0 the cdf-s H_s and H_r are the same, and are given by
\[ H(x) = \frac{\int_0^\infty \big\{ F_S(x + \log t) - F_S(x \wedge 0 + \log t) \big\}\, dt}{\int_0^\infty \bar F_S(\log t)\, dt}. \tag{4.5} \]
Suppose now that S = S̃, where S̃ satisfies ∨_{j=1}^d S̃_j = 0, and that σ = 1. It is straightforward to see that then F_S̃(x + log t) − F_S̃(x ∧ 0 + log t) = 1_{\{0 ≤ t < 1\}} F_S̃(x + log t), and, in particular, that F̄_S̃(log t) = 1_{\{0 ≤ t < 1\}}. Inserting this into (4.5) then gives the cdf
\[ H_{\tilde s}(x) = \int_0^1 F_{\tilde S}(x + \log t)\, dt = \int_0^\infty e^{-v} F_{\tilde S}(x - v)\, dv, \tag{4.6} \]
where the second equality follows from the change of variable log t = −v. Further, using the transformation (4.4), which connects (4.5) with (4.3), it follows more generally that if S = S̃, where ∨_{j=1}^d S̃_j = 0, then for general σ, γ one obtains the cdf
\[ H_{\tilde s}(x) = \int_0^\infty e^{-v} F_{\tilde S}\Big( \tfrac{1}{\gamma} \log\big( \tfrac{\gamma}{\sigma} x + 1 \big) - v \Big)\, dv. \tag{4.7} \]
We write GP_s(σ, γ, F_S̃) for the distribution (4.7), and call it the (R̃_s) representation.
In Section 5 below, the random vector S̃ will be constructed from a general random vector S by setting S̃ = S − ∨_{j=1}^d S_j. Further, the last expression in (4.6) can be given an interpretation in terms of random variables: it is the cdf of S̃ + E, where E is a standard exponential variable which is independent of S̃, and then (4.7) is the cdf of (σ/γ)(e^{γ(S̃+E)} − 1). This is the representation derived in Ferreira and de Haan (2014), translated to the exponential scale.
Theorem 4.1. Each of the classes GP_r(σ, γ, F_R), GP_s(σ, γ, F_S), and GP_s(σ, γ, F_S̃) coincides with the class of GP distributions with σ > 0.
Proof. By Corollary 3.2, to any GEV cdf G with γµ = α there exists a random vector R which satisfies (R) and for which G_0(x) = exp{−∫_0^∞ F̄_R(t^γ x) dt}, for x ∈ S_γ. The general GEV distribution is thus obtained from a location transformation of G_0, and by a suitable choice of σ may be written as G(x + σ/γ) = exp{−∫_0^∞ F̄_R(t^γ(x + σ/γ)) dt}. Further, the requirement that 0 < G_0(σ/γ) < 1 can be seen to be equivalent to σ > 0. Inserting this into (2.6) shows that the class of cdf-s GP_r(σ, γ, F_R) is equal to the class of all GP distributions with σ > 0.
By definition, the class of GP_s(σ, γ, F_S) cdf-s is the same as the class of GP_r(σ, γ, F_R) cdf-s, and hence the same conclusion holds for the GP_s(σ, γ, F_S) cdf-s. Since the GP_s(σ, γ, F_S̃) cdf-s are GP_s(σ, γ, F_S) cdf-s, it follows that they are GP distributions.
To prove the full statement about the class GP_s(σ, γ, F_S̃) we first note that, by the construction of the GP_s(σ, γ, F_S̃) cdf-s, it is enough to prove that the statement holds for GP distributions with γ = 0 and σ = 1. However, then the result follows by combining (T2) with the discrete version of de Haan and Ferreira (2006), Theorem 2.1, translated to the exponential scale (i.e. with W replaced by e^{S̃+E}), and with ω_0 = 1.
It is straightforward to verify that the conditional marginal distributions have the form (2.7). As an example we show the calculations for the GP_r(σ, γ, F_R) class for the case γ_j ≠ 0, j = 1, . . . , d. Let F_j be the marginal distribution of the j-th component of R. It follows from (2.8) and (4.2) that the distribution of the j-th component of H_r conditional on being positive, denoted H_j^+, is given by
\[ H_j^+(x) = \frac{\int_0^\infty \big\{ \bar F_j\big(t^{\gamma_j} \sigma_j/\gamma_j\big) - \bar F_j\big(t^{\gamma_j}(x + \sigma_j/\gamma_j)\big) \big\}\, dt}{\int_0^\infty \bar F_j\big(t^{\gamma_j} \sigma_j/\gamma_j\big)\, dt} = 1 - \Big( 1 + \frac{\gamma_j x}{\sigma_j} \Big)^{-1/\gamma_j}, \qquad x > 0, \]
where the second equality follows from making the change of variables t(\tfrac{\gamma_j}{\sigma_j} x + 1)^{1/\gamma_j} → t in the numerator.
It may again be noted that since F_R, F_S, and F_S̃ are cdf-s, then also H_r, H_s, and H_s̃ are cdf-s, so that in contrast to equation (2.6) the equations (4.2), (4.3), and (4.7) hold for all x ∈ (−∞, ∞)^d, subject to the extra provision that (4.3) and (4.7) only apply for γx + σ > 0.

Likelihoods, censored likelihoods, and conditional probabilities for multivariate generalized Pareto distributions
The central aim of this paper is to provide the basis for likelihood inference in multivariate peaks over thresholds modelling. In this section we find the densities and censored likelihoods which are needed for this, discuss parametrization issues, and give formulas for probabilities and conditional probabilities of general sets.

Densities and censored densities
To derive the densities for the (R) and (R_s) models we use the assumption that the random vectors R and S have a density with respect to Lebesgue measure on R^d, and for the (R̃_s) model we assume that S̃ is obtained as S̃ = S − ∨_{j=1}^d S_j, where S has a density with respect to Lebesgue measure on R^d. The forms of the densities for (R) and (R_s) are just as would be obtained if the R and S cdf-s were continuously differentiable and interchange of differentiation and integration were allowed. However, they in fact do not require any assumptions beyond absolute continuity with respect to d-dimensional Lebesgue measure.
The support of the vector S̃ is contained in the (d − 1)-dimensional subset {s ∈ R^d : ∨_{j=1}^d s_j = 0} and hence S̃ does not have a density with respect to Lebesgue measure on R^d. The requirement made in (R_s) of a finite exponential moment is not necessary for an S used to construct S̃, since S̃ always has finite exponential moments, so in (R̃_s) the only requirement on S is that it has a density.
Theorem 5.1. Suppose σ > 0. If F_R has density f_R with respect to Lebesgue measure on R^d, then H_r has the density h_r given below, and if F_S has density f_S with respect to Lebesgue measure on R^d, then H_s and H_s̃ have the densities h_s and h_s̃ below:
\[ h_r(x) = \frac{\int_0^\infty f_R\big(t^{\gamma}(x + \sigma/\gamma)\big)\, t^{\gamma_1 + \cdots + \gamma_d}\, dt}{\int_0^\infty \bar F_R\big(t^{\gamma}\sigma/\gamma\big)\, dt}, \tag{5.1} \]
\[ h_s(x) = \frac{\int_0^\infty f_S\big(\tfrac{1}{\gamma}\log(\tfrac{\gamma}{\sigma}x + 1) + \log t\big)\, dt}{E\big(e^{\vee_{j=1}^d S_j}\big)\, \prod_{j=1}^d (\sigma_j + \gamma_j x_j)}, \tag{5.2} \]
\[ h_{\tilde s}(x) = \frac{e^{-\vee_{j=1}^d \frac{1}{\gamma_j}\log(\frac{\gamma_j}{\sigma_j} x_j + 1)} \int_0^\infty f_S\big(\tfrac{1}{\gamma}\log(\tfrac{\gamma}{\sigma} x + 1) + \log t\big)\, t^{-1}\, dt}{\prod_{j=1}^d (\sigma_j + \gamma_j x_j)}, \tag{5.3} \]
for x ≰ 0 with γx + σ > 0, and where the densities are 0 otherwise. If γ_j = 0 then for h_r the argument t^{γ_j}σ_j/γ_j in F̄_R should be replaced by log t and the argument t^{γ_j}(x_j + σ_j/γ_j) in f_R should be replaced by x_j/σ_j + log t. Similarly, for h_s and h_s̃, if γ_j = 0, the expressions (1/γ_j) log((γ_j/σ_j)x_j + 1) should be replaced by their limits x_j/σ_j.
Proof. We first prove (5.1) for σ = γ = 1. The change of variables y = t(z + 1), together with (4.2) and Fubini's theorem, proves that (5.1) holds for σ = γ = 1. The proof of the general form of (5.1) only differs from this case in bookkeeping details, and is omitted.
To prove (5.2), recall that H_s = H_r if γ = 0, σ = 1, so that it follows from (5.1) that the density then is ∫_0^∞ f_S(x + log t) dt / E(e^{∨_{j=1}^d S_j}). Writing H̃ for the corresponding cdf, the general cdf H_s is obtained as H_s(x) = H̃((1/γ) log((γ/σ)x + 1)) and (5.2) then follows by a chain rule type argument.
We again only prove (5.3) for the case σ = 1, γ = 0, and with x ≰ 0. Also here the extension to the general case is a straightforward chain rule argument. By (4.6), H_s̃ is the cdf of S̃ + E. Hence, using first Fubini's theorem, then the change of variables t e^{∨_{j=1}^d s_j} → t, then Fubini's theorem and the change of variables s → s + log t, and then Fubini's theorem once more, we obtain the density e^{−∨_{j=1}^d x_j} ∫_0^∞ f_S(x + log t) t^{−1} dt. This proves that (5.3) holds for γ = 0 and σ = 1.
In (R̃_s) the distribution on {s ∈ R^d : ∨_{j=1}^d s_j = 0} of S̃ is obtained by starting with an absolutely continuous distribution on R^d and then subtracting the maximal component. It is also possible to avoid the detour via R^d and obtain a GP variable directly from a vector S̃ which has a density on {s ∈ R^d : ∨_{j=1}^d s_j = 0}. Also for this construction it is possible to find the density of X, by arguments as above. We omit the details.
In some cases the integrals in the numerators of (5.1)–(5.3) can be computed explicitly, see the examples below. When this is not possible, the one-dimensional integrals in the densities allow for fast numerical computation as soon as one can compute densities and distribution functions of R (or S) efficiently. Either way, this can make full likelihood inference possible, also in high dimensions.
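As an illustration of such numerical computation, the following Python sketch (ours) evaluates h_r by one-dimensional quadrature in the simple assumed case where R has independent unit-exponential components and a common shape γ > 0:

```python
import numpy as np
from scipy import integrate

def h_r(x, sigma, gamma):
    """Numerical h_r via (5.1); R with independent Exp(1) components, gamma > 0."""
    x = np.asarray(x, float)
    sigma = np.asarray(sigma, float)
    if not np.any(x > 0) or np.any(gamma * x + sigma <= 0):
        return 0.0
    y = x + sigma / gamma
    d = x.size
    # numerator of (5.1): f_R(t^gamma y) t^{d gamma}, with f_R(r) = exp(-sum r_j)
    num = integrate.quad(lambda t: np.exp(-t**gamma * y.sum()) * t**(d * gamma),
                         0.0, np.inf)[0]
    # denominator of (5.1): P(R not<= t^gamma sigma/gamma)
    den = integrate.quad(
        lambda t: 1.0 - np.prod(-np.expm1(-t**gamma * sigma / gamma)),
        0.0, np.inf)[0]
    return num / den

print(h_r([0.5, 1.0], sigma=[1.0, 1.0], gamma=0.5))
```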
Sometimes one doesn't trust the GP distribution to hold on the entire set x ≰ 0. Then, instead of using a full likelihood obtained as a product of the densities in Theorem 5.1, one can use a censored likelihood which is based on the values of the excesses which exceed some censoring threshold v = (v_1, . . . , v_d). Write D = {1, . . . , d}, and let C ⊂ D be the set of indices which correspond to the components which are censored, i.e. which do not exceed their censoring threshold v_j. Then, using the notation x_A = {x_j : j ∈ A} and writing h for h_r, h_s or h_s̃, the likelihood contribution of the censored observation is
\[ h_C(x_{D \setminus C}) = \int_{\prod_{j \in C} (-\infty, v_j]} h(x)\, dx_C. \tag{5.4} \]
Example 5.2. The simplest situation is when the components of R (or S) are independent. This could e.g. be a model for windspeeds over a small area, perhaps a windmill park, with T^γ representing the average (geostrophic) wind and the components being "nuggets": random wind variations caused by local turbulence. Let f_j be the density function of R_j, the j-th component of R, let F_j be the corresponding cdf, write y_j = x_j + σ_j/γ_j, and assume that v_j ≤ 0, j ∈ C. The integral which appears in the numerator in (5.4) for h = h_r can then be written as
\[ \int_0^\infty \prod_{j \in D \setminus C} f_j\big(t^{\gamma_j} y_j\big)\, t^{\gamma_j} \prod_{j \in C} F_j\big(t^{\gamma_j} (v_j + \sigma_j/\gamma_j)\big)\, dt. \]
Here both integrands typically can be computed quickly, and then also numerical computation of h_C will be fast; a sketch of this computation is given below.
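A sketch of the censored computation (ours, with the same assumed independent-exponential model as above; the helper name censored_contrib is hypothetical):

```python
import numpy as np
from scipy import integrate

def censored_contrib(x, v, sigma, gamma):
    """Censored likelihood contribution (5.4) for h = h_r; R with independent
    Exp(1) components, common gamma > 0; components with x_j <= v_j censored."""
    x, v, sigma = (np.asarray(a, float) for a in (x, v, sigma))
    cens = x <= v
    arg = np.where(cens, v, x) + sigma / gamma   # y_j, or v_j + sigma_j/gamma_j

    def integrand(t):
        u = t**gamma * arg
        dens = np.exp(-u[~cens].sum()) * t**(gamma * (~cens).sum())
        cdfs = np.prod(-np.expm1(-u[cens]))      # F_j at the censored components
        return dens * cdfs

    num = integrate.quad(integrand, 0.0, np.inf)[0]
    den = integrate.quad(
        lambda t: 1.0 - np.prod(-np.expm1(-t**gamma * sigma / gamma)),
        0.0, np.inf)[0]
    return num / den

print(censored_contrib(x=[0.7, -0.9], v=[0.0, -0.5], sigma=[1.0, 1.0], gamma=0.5))
```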
In a number of cases it is also possible to compute these integrals explicitly, and similarly for the corresponding integrals for h = h_s and h_s̃. As a simple example consider (5.3) with γ = 0, σ = 1 and with the components of S having independent standard Gumbel distributions with cdf exp{−e^{−x}}, and consider h_s̃. Then, writing c for the number of elements in D \ C, i.e. for the number of uncensored components, and abbreviating A = Σ_{j∈D\C} e^{−x_j} + Σ_{j∈C} e^{−v_j}, one obtains
\[ h_C(x_{D \setminus C}) = (c - 1)!\, e^{-\vee_{j \in D \setminus C} x_j}\, e^{-\sum_{j \in D \setminus C} x_j}\, A^{-c}. \]
Example 5.3. An extreme flow episode in a river network consisting of two tributaries which join to form the main river could be modeled as R/T^γ = (R_1/T^γ, R_2/T^γ, (R_1 + R_2 + E)/T^γ), with γ > 0, so that R_3 = R_1 + R_2 + E. Here the first component corresponds to flow in tributary one, the second component corresponds to flow in tributary two, and the third component is flow in the main river. The very simplest model is that R_1, R_2, E are independent and have a standard exponential distribution. Then
\[ f_R(y) = e^{-y_3}, \qquad 0 < y_1,\ 0 < y_2,\ y_1 + y_2 < y_3, \tag{5.5} \]
and, additionally assuming that σ = (σ, σ, σ), ∫_0^∞ P(R_3 > t^γ σ/γ) dt = (γ/σ)^{1/γ} Γ(3 + 1/γ)/2, since R_3 is a sum of three exponential variables, and thus has a gamma distribution. It follows from (5.1) and (5.5), with y = x + σ/γ, that h_r(x) = 1_{\{x ≰ 0,\ −σ/γ ≤ x,\ y_1 + y_2 ≤ y_3\}}\, γ^{−1} 2 (σ/γ)^{1/γ} (x_3 + σ/γ)^{−3−1/γ}.
Example 5.4. Variants of log-Gaussian distributions have been used in max-stable modelling, e.g. in Huser and Davison (2013), and as a point process model in Wadsworth and Tawn (2014), and they provide an important class of models. As an example, in the (R) model suppose that 0 ≤ γ and that F_R(x) = Φ(log x), where Φ is the d.f. of a d-dimensional normal distribution with mean vector µ and nonsingular covariance matrix Σ. Write φ for the corresponding density and let A = Σ^{−1} be the precision matrix. Then, writing y = log(x + σ/γ) − µ, making the change of variables log t → t, and completing the square, the integral in the numerator of (5.1) can be evaluated explicitly as a one-dimensional Gaussian integral. The integral in the denominator can be expressed as a sum of d components, each of which is a d − 1 dimensional Gaussian cdf, see Huser and Davison (2013). However, if d is large then this expression is cumbersome. It is an interesting challenge to develop methods to compute the denominator efficiently for large d.

Parametrization
Each of the models GP_r(σ, γ, F_R), GP_s(σ, γ, F_S) and GP_s(σ, γ, F_S̃) contains all the GP cdf-s with σ > 0, so which one to use for likelihood inference is a choice of parametrization which is guided by questions of modelling realism and computational efficiency. In this section we discuss a number of issues connected with the parametrization of GP distributions.
The vectors R and S are not uniquely determined by the corresponding GP distributions, as is seen in the following result, which is a generalization of Proposition 2.1(iv). It follows that if the model for F_S includes free location parameters for each component, then the model is overparametrized, and one should, e.g., fix the location parameter for one of the components to zero. Similarly, if γ_i ≠ 0 for i = 1, . . . , d and if the model for F_R includes a free scale parameter for each component, then one should, e.g., fix one of these scale parameters to 1.
Proposition 5.5. Suppose that the random variable Z is strictly positive and has finite mean. Then GP_s(σ, γ, F_{S+log Z}) = GP_s(σ, γ, F_S), and GP_r(σ, γ, F_{Z^γ R}) = GP_r(σ, γ, F_R).
Proof. The first assertion is an immediate consequence of the second one, so we only prove the second. Conditioning on Z and using Fubini's theorem and the change of variables t/c → t shows that the numerator of (4.2) computed for Z^γ R equals E(Z) times the numerator computed for R. The same argument shows that the two denominators differ by the same factor E(Z), and inserting these two equations into (4.2) proves the proposition.
Computationally the (R̃_s) densities typically are the simplest, and the (R_s) densities may also have some computational advantage over the (R) densities, since the integral in the denominator of the (R_s) densities does not depend on the parameters σ, γ. On the other hand, understanding of how well a model describes a physical situation is best gained on the real physical scale, and often requirements of realistic physical modelling lead to use of the (R) model.
One can gain insight into the realism of an (R_s) model by rewriting it as an (R) model. Using the intuition described after the derivation of (4.2), exceedances of 0 by (σ/γ)(e^{γ(S−log T)} − 1) are the same as exceedances of σ/γ by (σ/γ)e^{γS}/T^γ, which in turn are the same as exceedances of σ/γ by sign(γ)e^{γS+log(σ/|γ|)}/T^γ. Thus, if one sets R equal to sign(γ)e^{γS+log(σ/|γ|)} then, heuristically, the (R) and (R_s) models are the same. It is easy to prove that this in fact is correct, i.e. that this choice of R makes the right-hand side of (4.2) equal to the right-hand side of (4.3), by checking the inequalities componentwise, separately for the cases γ_j > 0, γ_j = 0 and γ_j < 0, and using the standard conventions for the case γ_j = 0.
As one consequence of this, if the parametric model for S doesn't include location parameters, then e^{log(σ/|γ|)} = σ/|γ| determines the relative sizes of the components of the extreme episode. However, σ is strongly influenced by the choice of thresholds and γ is determined by the tails of the marginal distributions, and hence it doesn't seem likely that σ/|γ| will catch the relative component sizes well; it thus often is desirable to have the model for S include location parameters.
A main issue is how estimates and confidence intervals for quantities like high quantiles perform when one or more of the γ j -s could be either positive or negative. For the (R s ) and (Rs) model the transition from positive to negative γ-s is smooth, and such estimates and confidence intervals will work automatically.
The distributions of marginal excesses in (R) models are the same as in the (R s ) models, and hence inference about γ and σ again will work for γ j -s which could be both positive and negative. However, in (R) models one may have to restrict the allowed range of γ, e.g. to γ > 0 to obtain useful estimates and confidence intervals for functions of components, such as high quantiles of maxima or sums of components.
One example where this transition from positive to negative γ-s is not smooth in the (R) model is the river network in Example 5.3, where it isn't even clear how to define the model in a useful way for negative γ-s. However, this model cannot be constructed at all as an (R s ) model.
As another example, if one uses a location-scale model for S in the (R_s) model, so that S = s_s S* + m_s, where the model for S* usually would include additional parameters, then the (R) model R = sign(γ)e^{s_r S* + m_r} is just a reparametrization γ s_s ↔ s_r and γ m_s + log(σ/|γ|) ↔ m_r of the (R_s) model. Hence maximum likelihood estimators, say of high quantiles of functions of the components, are the same in both models, and, e.g., delta method confidence intervals are asymptotically the same. However, non-asymptotically, if confidence intervals in the (R) model include both negative and positive values for some γ_i, then confidence intervals for functions of components may become unreasonable. Hence, again, in the (R) model, it may be safest to make the signs of the components of γ a modelling assumption.

Probabilities and conditional probabilities
Equations (4.2), (4.3), and (4.7) give the probabilities of rectangles for GP distributions. In this section they are generalized to give expressions for probabilities of general sets, and expressions for conditional probabilities. We only consider the GP_r models, since such expressions are usually needed on the real scale.
Let F = {y : y ≰ 0}, set A = {y : y ≤ x}, and for a, b ∈ R^d and a set B ⊂ R^d write aB + b for the set {ay + b : y ∈ B}. As is easily checked,
\[ H_r(x) = \frac{\int_0^\infty P\big( R/t^{\gamma} - \sigma/\gamma \in A \cap F \big)\, dt}{\int_0^\infty P\big( R/t^{\gamma} - \sigma/\gamma \in F \big)\, dt}. \]
Now, if in the derivation of (4.2) the special A defined above is replaced by a general set A ⊂ R^d, the result still is the same,
\[ H_r(A) = \frac{\int_0^\infty P\big( R/t^{\gamma} - \sigma/\gamma \in A \cap F \big)\, dt}{\int_0^\infty P\big( R/t^{\gamma} - \sigma/\gamma \in F \big)\, dt}. \tag{5.6} \]
A proof that (5.6) holds for any set A is immediate: using Fubini's theorem it is seen that the right-hand side of the equation is a probability distribution, as a function of A, and since it agrees with the distribution H_r on sets of the form {y : y ≤ x}, the two distributions are equal. The intuition is that H_r(A) is the (formal) conditional probability of the event {R/T^γ − σ/γ ∈ A} given the event {R/T^γ − σ/γ ≰ 0}.
Let the random vector X ∼ H_r have the distribution (4.2). Then P(X ∈ A | X ∈ B) = H_r(A ∩ B)/H_r(B), and hence (5.6) can also be used to find conditional probabilities. Further, assuming continuity, (5.1) determines the conditional densities. E.g., writing f_{|X_1=x} for the conditional density of (X_2, . . . , X_d) given that X_1 = x, we get, for x > 0, that
\[ f_{\,|\,X_1 = x}(x_2, \ldots, x_d) = \frac{\int_0^\infty f_R\big(t^{\gamma}(\tilde x + \sigma/\gamma)\big)\, t^{\gamma_1 + \cdots + \gamma_d}\, dt}{\int_0^\infty f_{R_1}\big(t^{\gamma_1}(x + \sigma_1/\gamma_1)\big)\, t^{\gamma_1}\, dt}, \qquad \tilde x = (x, x_2, \ldots, x_d), \]
and by integrating it further follows that conditional probabilities given X_1 = x can be computed as ratios of one-dimensional integrals. The one-dimensional integrals in these equations can often be computed numerically. Below we give a simple example where they are obtained analytically.
Many of the results in the example hold more generally. E.g., the conditional GP distribution of sums holds as soon as the margins have the same shape parameter. The intuition is simple: suppose the GP_r distribution has been obtained from the vector (R_1/T^γ − σ_1/γ, . . . , R_d/T^γ − σ_d/γ) by (formal) conditioning on at least one of the components being positive. Then a weighted sum of the components equals R̄/T^γ − σ̄/γ, for R̄ = Σ_{j=1}^d a_j R_j and σ̄ = Σ_{j=1}^d a_j σ_j. According to the GP_r representation, provided a_1, . . . , a_d ≥ 0, the distribution of R̄/T^γ − σ̄/γ conditioned to be positive is a one-dimensional GP distribution with parameters γ and σ̄. Further, that a sum is positive implies that at least one component is positive, and hence first conditioning on at least one component being positive, and then conditioning on the sum being positive, gives the same result as conditioning directly on the sum being positive. Thus the one-dimensional GP distribution holds for component sums in GP distributions. Similar reasoning can be applied to, e.g., joint distributions of several weighted sums and several components. However, we only prove the one-dimensional result.
Proposition 5.7. Let X be a GP random vector with common shape parameter γ for all d margins and with scale parameter σ > 0, and for γ ≤ 0 additionally assume that P(Σ_{j=1}^d a_j X_j > 0) > 0. Then, for a ∈ [0, ∞)^d \ {0}, the conditional distribution of the weighted sum Σ_{j=1}^d a_j X_j, given that it is positive, is generalized Pareto with shape parameter γ and scale parameter σ̄ = Σ_{j=1}^d a_j σ_j.
Proof. Since P(Σ_{j=1}^d a_j X_j > 0) > 0 holds automatically if γ > 0 and σ > 0, this condition is satisfied for all values of γ. Let A_x = {y ∈ R^d : Σ_{j=1}^d a_j y_j > x} and as above define R̄ = Σ_{j=1}^d a_j R_j. Then, for x > 0 and with F = {y : y ≰ 0} as above, A_x ∩ F = A_x, and (for γ = 0 using the standard convention that t^0(x + σ/0) means x/σ + log t) the numerator in (5.6) for A = A_x is
\[ \int_0^\infty P\big( \bar R > t^{\gamma} (x + \bar\sigma/\gamma) \big)\, dt, \]
and hence by (5.6)
\[ P\Big( \sum_{j=1}^d a_j X_j > x \,\Big|\, \sum_{j=1}^d a_j X_j > 0 \Big) = \frac{\int_0^\infty P\big( \bar R > t^{\gamma}(x + \bar\sigma/\gamma) \big)\, dt}{\int_0^\infty P\big( \bar R > t^{\gamma}\, \bar\sigma/\gamma \big)\, dt} = \Big( 1 + \frac{\gamma x}{\bar\sigma} \Big)^{-1/\gamma}, \]
where the last equality follows from making the change of variables t(1 + γx/σ̄)^{1/γ} → t in the numerator.
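A quick Monte Carlo check of Proposition 5.7 (our sketch; γ = 0, σ = 1 and a standard normal S are arbitrary assumptions) can be based on the (R̃_s) representation, simulating X as in Method 1 of Section 6:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000
a = np.array([0.5, 1.0, 2.0])              # nonnegative weights

# X ~ GP with gamma = 0, sigma = (1,...,1), via X = S~ + E (Method 1, Section 6)
S = rng.normal(size=(n, d))
X = S - S.max(axis=1, keepdims=True) + rng.exponential(size=(n, 1))

w = X @ a
w = w[w > 0]                               # condition on the sum being positive
# Proposition 5.7: w | w > 0 is GP(gamma = 0), i.e. exponential with mean sum(a)
print(w.mean(), a.sum())                   # both approximately 3.5
print(np.quantile(w, 0.9), -a.sum() * np.log(0.1))  # 90% quantiles agree
```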
Example 2.4 exhibits a situation where the sum of the components of a GP distribution is identically equal to −∞ and where the assumption P (X 1 + X 2 > 0) > 0 hence is not satisfied.

Simulation
In this section we briefly outline four different methods for sampling from multivariate GP distributions. For Methods 1–3 we focus on simulation of a GP vector X_0 with σ = 1, γ = 0; X with general σ, γ is obtained from the vector X_0 through (4.4). Furthermore, using the connection between GP_s and GP_r distributions, simulated GP_r vectors may be obtained by simulating GP_s vectors, and vice versa. Throughout this section we assume that simulation of S from f_S is possible. Recall the relation S̃ = S − ∨_{j=1}^d S_j.
Method 1 (simulation from the (R̃_s) model). This is immediate using (4.6): simulate a vector S from f_S and an independent variable E ∼ Exp(1) and set X_0 = E + S − max_{1≤j≤d} S_j; a minimal sketch is given below.
Simulation from the (R) and (R_s) models is less direct. We develop three methods for this: rejection sampling, MCMC sampling, and approximate simulation using (4.1).
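A minimal implementation sketch of Method 1 combined with the transformation (4.4) (ours; the independent standard Gumbel components of S are just an assumed example):

```python
import numpy as np

rng = np.random.default_rng(42)

def rGP_method1(n, sigma, gamma, d):
    """Method 1: draws from GP_s(sigma, gamma, F_S~); S_j i.i.d. standard Gumbel."""
    S = rng.gumbel(size=(n, d))
    E = rng.exponential(size=(n, 1))
    X0 = E + S - S.max(axis=1, keepdims=True)   # X0 ~ GP_s(1, 0, F_S~), by (4.6)
    with np.errstate(divide="ignore", invalid="ignore"):
        X = np.where(gamma == 0.0, sigma * X0,  # limit sigma_j X0_j if gamma_j = 0
                     sigma / gamma * np.expm1(gamma * X0))   # transformation (4.4)
    return X

X = rGP_method1(5, sigma=np.array([1.0, 2.0]), gamma=np.array([0.1, 0.0]), d=2)
print(X)   # every row has at least one positive component
```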
The main idea behind Methods 2 and 3 is to use an appropriate change of measure so that Method 1 can be used to simulate from the (R_s) model. The GP_s(1, 0, F_S̃) density is
\[ h_{\tilde s}(x) = e^{-\vee_{j=1}^d x_j} \int_0^\infty f_S(x + \log t)\, t^{-1}\, dt. \]
If in this equation one replaces S by S_0, where S_0 has density
\[ f_{S_0}(x) = \frac{e^{\vee_{j=1}^d x_j} f_S(x)}{E\big( e^{\vee_{j=1}^d S_j} \big)}, \tag{6.1} \]
then the resulting density is exactly the GP_s(1, 0, F_S) density. Thus, if one can simulate S_0 vectors, these give GP_s(1, 0, F_S) vectors via Method 1.
Method 2 (simulation of S_0 via rejection sampling). Assume that ϕ is a bounding density which satisfies f_{S_0}(x) ≤ Kϕ(x) for some constant K. Then one draws a candidate vector S_0^c ∼ ϕ, accepts the candidate with probability f_{S_0}(S_0^c)/{Kϕ(S_0^c)}, and repeats until the required number of draws have been accepted. The probability of acceptance is 1/K, and thus it is advantageous to find a ϕ such that K is not large. In high dimensions, however, such a K might be difficult to find, and the following method may be preferable.
Method 3 (simulation of S_0 via MCMC). This consists of using a standard Metropolis–Hastings algorithm to simulate from a Markov chain with stationary distribution (6.1). At iteration i, one draws a candidate vector S_0^c from the density f_S and accepts it with probability min{1, exp(∨_{j=1}^d S^c_{0,j} − ∨_{j=1}^d S^{i−1}_{0,j})}, where S_0^{i−1} is the current state of the chain. Alternative proposal distributions could be used with appropriate modification of the acceptance probability; for details see e.g. Chib and Greenberg (1995).
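An implementation sketch of Method 3 (ours; the standard normal f_S is an assumed example, and the independence proposal matches the acceptance probability above):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_S0(n_iter, d):
    """Independence Metropolis-Hastings with target (6.1), proposals from f_S."""
    out = np.empty((n_iter, d))
    s = rng.normal(size=d)                   # initial state, drawn from f_S
    for i in range(n_iter):
        cand = rng.normal(size=d)            # candidate from f_S (standard normal)
        if np.log(rng.uniform()) < cand.max() - s.max():  # log acceptance ratio
            s = cand
        out[i] = s
    return out

S0 = sample_S0(10_000, d=3)
# Plugging S0 into Method 1, X0 = E + S0 - max_j S0_j, yields GP_s(1, 0, F_S) draws.
```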
Method 4 (approximate simulation from the (R) model). Equation (4.1) suggests an approximate way to simulate a vector X from the GP_r(σ, γ, F_R) distribution: one chooses a suitably large U, simulates T̃ ∼ Unif[0, U] and an independent R ∼ F_R, and if R/T̃^γ ≰ σ/γ one sets X = R/T̃^γ − σ/γ; if not, one repeats the procedure. In this algorithm, the probability of keeping a simulated R/T̃^γ value is
\[ \frac{1}{U} \int_0^U \bar F_R\big( t^{\gamma}\, \sigma/\gamma \big)\, dt, \]
so one has to simulate approximately U/∫_0^∞ F̄_R(t^γ σ/γ) dt values of R/T̃^γ to get one X-value. Hence, a large U, which ensures that H^{(U)} approximates H well, leads to longer computation times, and a compromise has to be made. As a guide to the compromise, it is often possible to compute, analytically or numerically, sharp bounds for the approximation errors. Sometimes, as for Gaussian or log-Gaussian processes, one in particular can use existing sharp bounds for P(∨_{i=1}^d |R_i|^{1/γ_i} > t).
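An implementation sketch of Method 4 (ours; the independent unit-exponential components of R, γ > 0, and the value of U are assumptions, with U trading accuracy against computing time):

```python
import numpy as np

rng = np.random.default_rng(3)

def rGP_method4(n, sigma, gamma, d, U=100.0):
    """Approximate GP_r(sigma, gamma, F_R) draws via (4.1); R_j i.i.d. Exp(1)."""
    draws = []
    while len(draws) < n:
        T = rng.uniform(0.0, U)              # T~ ~ Unif[0, U]
        R = rng.exponential(size=d)
        X = R / T**gamma - sigma / gamma
        if np.any(X > 0):                    # keep only threshold exceedances
            draws.append(X)
    return np.array(draws)

X = rGP_method4(1000, sigma=1.0, gamma=0.5, d=3)
print(X.shape, (X > 0).any(axis=1).all())    # (1000, 3) True
```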
To summarize, Method 1 is the simplest, but can only be used to simulate GP_s(σ, γ, F_S̃) vectors in a straightforward manner. Method 2 and Method 3 provide ways to simulate vectors S_0 from the distribution (6.1), which can then be inserted into Method 1 to simulate from the GP_s(σ, γ, F_S) distribution. Method 4 is as simple as Method 1 to implement and produces i.i.d. vectors, but, similarly to Method 3, only approximates the target distribution. Which method to use will depend on the specific problem at hand, and on available analytic, programming, and computer resources.

Conclusion
This paper studies the probability theory underlying peaks over thresholds modelling of multivariate data using generalized Pareto distributions. We first derive basic properties of the multivariate GP distribution, including behavior under conditioning, scale change, and convergence in distribution, and provide the connections with multivariate generalized extreme value distributions. The main results are a point process limit result which gives a general and concrete description of the behavior of extreme episodes in data; three representations of the cdf-s of multivariate GP distributions, motivated by and derived from the point process result; expressions for likelihoods and censored likelihoods; formulas for probabilities and conditional probabilities of general sets; and algorithms for random sampling from multivariate GP distributions. Throughout, the general results are illustrated by examples.
Peaks over thresholds modeling of extremes of random vectors Y first selects a suitable level u and then models the distribution of the over- and undershoots, X = Y − u, conditional on the occurrence of at least one overshoot, by a GP distribution. Of course, this GP model at once also gives a model for the original Y-s, since Y = X + u. Remaining modelling components which are not the topic of this paper include the choice of the level u, perhaps as a function of covariates like time, and modelling of the Poisson process which governs the occurrence of extreme episodes. Specifically, quantile regression provides an appealing approach to choosing u.
The paper gives a basis for understanding and likelihood modelling of extreme episodes, also in high dimensions. We believe it will contribute to the solution of different and important risk handling problems. However, it still is an early excursion into new territory, and much research remains to be done. As examples, the inclusion of time series dependence in the models and the development of methods for prediction of extremes are important challenges.