1 Introduction

Estimating extremes of environmental phenomena such as wind speed, sea-surge and rainfall plays an important role in structural design. Models from classical extreme value theory, such as the generalised extreme value (GEV) or generalised Pareto (GP) distributions, supply limiting models for the tail behaviour of such variables and provide a general template for modelling and extrapolation. The aim of most practical applications is the estimation of the event we might expect to see, on average, once every r years: the so-called r-year return level, commonly denoted by \(z_{r}\) (with estimate \({\hat{z}}_{r}\)). Over the last three decades or so, pragmatic solutions to circumvent departures from the ideal of independent and identically distributed observations on extremes have been developed; see for example, Davison and Smith (1990). Some of the most commonly-used solutions result in sample size reduction; for example, the use of filtering schemes to avoid issues of temporal dependence (e.g. peaks over thresholds, or POT), or using only those extremes from within a particular calendar unit to avoid problems associated with seasonal variability. Such a reduction can result in extremely wide confidence intervals for \(z_{r}\), sometimes giving confidence bounds that are implausible for the variable being studied. The aim of some recent work, then, has been to investigate the use of methods that maximise the number of extremes pressed into use; see for example, Eastoe and Tawn (2012) and Fawcett and Walshaw (2012, 2016), the latter illustrating methods that can substantially reduce return level estimation uncertainty relative to methods such as POT.

More recently, much focus has been given to the extension, and practical application, of the theory of multivariate extremes, often motivated by the need to account for spatial dependence between extremes. Davison et al. (2012) provide a comprehensive coverage of the development of models for extremes occurring spatially: again, the aim is for the estimation of return levels, albeit on maps over a spatial grid. The increasing sophistication of models to address issues such as temporal dependence, seasonal variability and trend, coupled with an increase in dimensionality required of an analysis which is—for example—spatial in flavour, often makes inference within a maximum likelihood setting difficult. Moreover, the complexities that such modelling issues bring are often more naturally handled within a Bayesian framework (e.g. the specification of prior distributions for seasonal effects within a random effects model; see for example, Fawcett and Walshaw 2006a).

The incorporation of external sources of information through the prior distribution is an obvious element of appeal for any analyst of extremes working with scarce data. Also appealing is the natural extension within the Bayesian framework to prediction; as Fawcett and Walshaw (2016) discuss, an estimate of the r-year posterior predictive return level, \(z_{r,{\textsf {pred}}}\), provides practitioners with a point summary capturing estimation uncertainty. It is surprising, then, that there are not more examples of Bayesian inference for extremes in the literature. Certainly, it is our experience that few practitioners will perform analyses within a Bayesian setting.

Coles and Powell (1996) provide a solid review of Bayesian inference for extremes up to that date. Since then, Coles and Tawn (1996) and Smith and Walshaw (2003) have investigated the merits of expertly-elicited priors whereas Beirlant et al. (2004) and Eugenia Castellanos and Cabras (2007) have considered objective priors for the GEV and GP models. Various authors have used the Bayesian paradigm to exploit meteorological structure in their data via hierarchical models for extremes—for example, Fawcett and Walshaw (2006a), Sang and Gelfand (2009, 2010) and Davison et al. (2012). Smith (1999) compares predictive inference under the Bayesian and frequentist paradigms and Coles and Tawn (1996) give some informal comparisons between predictive return levels and estimates based solely on the posterior distribution for \(z_{r}\). Fawcett and Walshaw (2006a, 2008, 2016) demonstrate predictive inference for return levels of wind speed and sea-surge extremes, recommending \({\hat{z}}_{r, {\textsf {pred}}}\) as the most convenient, and useful, representation of a return level for practitioners. However, no published work supports this through a formal investigation into the performance of the predictive return level. The main contribution of this paper, then, is to explore the properties of \({\hat{z}}_{r,{\textsf {pred}}}\). In particular, we focus on a comparison of the exceedance probabilities of \({\hat{z}}_{r,{\textsf {pred}}}\) to their intended values \(r^{-1}\), and the general performance of \({\hat{z}}_{r,{\textsf {pred}}}\) relative to other estimative summaries obtained directly from the posterior distribution for \(z_{r}\).

This paper is organised as follows. In Sect. 2 we give some practical motivation for this work, including some results from an analysis of wind speed extremes at a location in the southwest of the UK. This section will include a primer in extreme value techniques and associated modelling procedures for readers who might be unfamiliar with this area, with a particular focus on recently-proposed methods for handling temporal dependence; the use of Bayesian methods for extremes will be discussed by illustration. In Sect. 3 we discuss the aims and design of our simulation study for investigating the performance of the predictive return level, followed by a detailed discussion of our findings from this study. We conclude with some general comments and areas for future work in Sect. 4.

2 Practical motivation and modelling

In this section, we introduce the wind speed data we use throughout the paper. We then give a brief overview of the basic methods for modelling extremes on such processes, including some general background on Bayesian sampling and specific details relating to the posterior predictive return level. Some results are then presented comparing estimative and predictive return levels for our wind speed series.

2.1 Data

Figure 1 shows boxplots, and a plot of each observation against its lag 1 counterpart, for a series of hourly gust wind speed maxima observed at Yeovilton in southwest England between January 1st 2003 and December 31st 2012 (inclusive). The boxplots reveal clear seasonal variability in the wind speed extremes, and there is also significant first-order autocorrelation (persisting above monthly-varying high thresholds). Estimates of \(z_{r}\) or \(z_{r, {\textsf {pred}}}\) based on fitting an appropriate model to the wind speed extremes might be used to inform the design of a new structure. For example, the British Standards Institution (BSI) uses estimates of the 50-year wind speed to produce contour maps displaying the strength requirements for new structures. Similarly, the Office for Nuclear Regulation (ONR) recommends that structures at nuclear sites in the UK be built to protect against the 10,000-year return level associated with variables to which these structures might be vulnerable. We argue that a Bayesian approach to return level inference can improve the estimation procedure, with the potential to reduce estimation uncertainty (through the incorporation of prior knowledge) and, in practical terms, the ability to provide practitioners with a single point summary that incorporates uncertainty due to model estimation through the predictive return level.

Fig. 1

Boxplots (by month, left) and each observation plotted against its lag 1 counterpart (right) for a series of hourly gust wind speed maxima observed at Yeovilton between January 1st 2003 and December 31st 2012 (inclusive). The blue lines in the first plot correspond to high thresholds that have been chosen to identify observations as extreme

2.2 Statistical modelling

2.2.1 The basics

Let \(\{X_{n}\}\) denote a stationary sequence of random variables with common distribution function (d.f.) F, and let \(M_{n}={\mathrm{max}}\{X_{1}, \ldots , X_{n}\}\). It is typically the case that, as \(n \rightarrow \infty \),

$$\begin{aligned} {\mathrm{Pr}}(M_{n} \le x) \approx F^{n\theta }(x), \end{aligned}$$
(1)

where \(\theta \in (0,1]\) is known as the extremal index; e.g. Leadbetter and Rootzén (1988). As \(\theta \rightarrow 0\) there is increasing dependence in the extremes of the process; for an independent process, \(\theta =1\). In practice, F is unknown, and very small discrepancies in estimates of F obtained from observed data can lead to rather substantial discrepancies for \(F^{n\theta }\). Initially concerned with the independent case (i.e. \(\theta =1\)), classical extreme value theory sought families of limiting models for \(F^{n}\) for large n. This leads to the GEV distribution (e.g. Jenkinson 1955), with d.f.

$${\mathscr {G}}(y) = \left\{\begin{array}{ll} {\mathrm{exp}}\left[ -\left( 1+\xi (y-\mu )/\varsigma \right)^{-1/\xi }\right] , &\quad \xi \ne 0; \\ {\mathrm{exp}}\left[ -{\mathrm{exp}}\left( -(y-\mu )/\varsigma \right) \right], &\quad \xi =0, \end{array} \right.$$
(2)

defined on \(\{y{:}\,1+\xi (y-\mu )/\varsigma >0\}\), where \(-\,\infty< \mu < \infty \), \(\varsigma >0\) and \(-\,\infty< \xi <\infty \) are location, scale and shape parameters (respectively). The GEV can be used to model a set of block maxima \(\{M_{\tau }\}\) with block length \(\tau \); the calendar year is often used for \(\tau \), giving rise to an annual maxima analysis.
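As a minimal sketch of (2), the GEV d.f. can be evaluated directly; the function name, the tolerance used to detect the \(\xi =0\) branch, and the handling of points outside the support are our own choices, not from the paper:

```python
import math

def gev_cdf(y, mu, sigma, xi, eps=1e-9):
    """Evaluate the GEV d.f. (2) at y; returns 0 or 1 outside the support."""
    if abs(xi) < eps:  # Gumbel limit, xi = 0
        return math.exp(-math.exp(-(y - mu) / sigma))
    t = 1 + xi * (y - mu) / sigma
    if t <= 0:  # outside {y : 1 + xi*(y - mu)/sigma > 0}
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1 / xi))
```

The two branches agree in the limit, so evaluating with a very small \(\xi \) reproduces the \(\xi =0\) case to high accuracy.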

Pickands (1975) showed that for large u the distribution of \((X-u)|X>u\) is approximately GP with d.f.

$${\mathscr {H}}(y) = \left\{ \begin{array}{ll} 1-\left( 1+\xi y/\sigma \right) ^{-1/\xi }, &\quad \xi \ne 0; \\ 1-{\mathrm{exp}}\left[ -y/\sigma \right] , &\quad \xi =0, \end{array} \right.$$
(3)

defined on \(\{y{:}\,y>0\,{\mathrm{and}}\,(1+\xi y/\sigma )>0\}\), where \(\sigma =\varsigma + \xi (u-\mu )\) and \(\xi \) are the GP scale and shape parameters (respectively). The GP distribution, being the limiting distribution for excesses over a high threshold u, provides a natural way of modelling extremes of time series such as our wind speed data. Modelling extremes in this way can be less wasteful than a block maxima approach using the GEV, since more extremes are usually pressed into use. Thus, in this paper we will focus on the use of the GP distribution as a model for excesses over a high threshold.
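The GP d.f. (3) and its inverse are equally simple to code. A sketch follows (helper names are ours; the reparameterisation \(\sigma =\varsigma + \xi (u-\mu )\) is left to the caller, so both functions work on excesses \(y=x-u\)):

```python
import math

def gp_cdf(y, sigma, xi, eps=1e-9):
    """GP d.f. (3) for an excess y = x - u > 0 over the threshold u."""
    if y <= 0:
        return 0.0
    if abs(xi) < eps:  # exponential limit, xi = 0
        return 1 - math.exp(-y / sigma)
    t = 1 + xi * y / sigma
    if t <= 0:  # beyond the upper end-point (xi < 0)
        return 1.0
    return 1 - t ** (-1 / xi)

def gp_quantile(p, sigma, xi, eps=1e-9):
    """Inverse of gp_cdf: the excess exceeded with probability 1 - p."""
    if abs(xi) < eps:
        return -sigma * math.log(1 - p)
    return sigma / xi * ((1 - p) ** (-xi) - 1)
```

The quantile function inverts the d.f. exactly, which is convenient both for simulation and for the return level calculations of Sect. 2.2.3.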

2.2.2 Practicalities

Using the GP distribution to model threshold excesses, the linearity of \(E[X-u|X>u]\) in u can be exploited in a mean residual life plot (MRL plot; see Coles 2001, Ch. 4) to help find a suitably high threshold u for the classification of extremes. To maximise estimation precision, Fawcett and Walshaw (2016) suggest making use of all excesses over u, despite the obvious temporal dependence often present. Specifically, they propose fitting (3) by adopting one of the following strategies:

  1.

    Parametric modelling of dependence

    As in Smith et al. (1997) and Fawcett and Walshaw (2006a), where appropriate assume a first-order Markov structure for the temporal evolution of extremes over u; that is, assume the following likelihood for \({\varvec{\psi }}\):

    $$\begin{aligned} L(\varvec{\psi }) = \prod _{i=1}^{n-1}f(x_{i},x_{i+1};\varvec{\psi })\Bigg / \prod _{i=2}^{n-1}f(x_{i}; \varvec{\psi }), \end{aligned}$$
    (4)

    where \(\varvec{\psi }\) is a parameter vector containing marginal and dependence parameter(s) and f, as appropriate, denotes a joint or marginal density function. Appealing to bivariate extreme value theory, transformation from GP to standard Fréchet margins gives a range of models to use for the dependence of consecutive extremes, the most commonly-used being the logistic family with d.f.:

    $$\begin{aligned} G(x_{i},x_{i+1}) = {\mathrm{exp}}\left\{ -\left( x_{i}^{-1/\alpha }+x_{i+1}^{-1/\alpha }\right) ^{\alpha }\right\} ; \end{aligned}$$
    (5)

    here, independence and complete dependence are attained when \(\alpha =1\) and \(\alpha \rightarrow 0\) respectively. Differentiation of (5), with careful censoring when either one or both of \((x_{i}, x_{i+1})\) lies sub-threshold, gives pairwise contributions to the numerator in (4); univariate contributions to the denominator are given through (3). The polynomial relationship:

    $$\begin{aligned} \theta \approx 0.013 - 0.092\alpha +1.833 \alpha ^2-0.756 \alpha ^3, \end{aligned}$$
    (6)

    as constructed in Fawcett and Walshaw (2012) and discussed in the Appendix, can then be used—after estimation of \(\varvec{\psi }=(\sigma ,\xi ,\alpha )^{{\textsf {T}}}\)—to provide the fitted distribution for the right-hand-side of (1), using (3) as a model for \(F^{n}\). Within this class of models for asymptotic dependence, other models can be used—for example, the bilogistic model, which allows for asymmetry in the dependence structure between \((x_{i},x_{i+1})\) through the inclusion of an additional dependence parameter \(\beta \) (\(0<\alpha , \beta <1\)); see Coles (2001, Ch. 8) for more details. Indeed, Fawcett and Walshaw (2012) suggest polynomial expressions for the extremal index based on the fitted values of the dependence parameters here, too.

    Of course it might be that, for the dependence structure, asymptotic independence is more appropriate; that is,

    $$\begin{aligned} \chi = \lim _{z\rightarrow z^{*}}{\mathrm{Pr}}(X_{i+1}>z|X_{i}>z), \end{aligned}$$

    where \(z^{*}\) is the upper limit of the support of the marginal distribution, takes the value zero (in the case of asymptotic dependence, \(\chi >0\)). Here, a standard time series model such as a Gaussian \({{\textit{AR}}}(1)\) process can be used in place of the models we have outlined for consecutive variables that are asymptotically dependent (a marginal transformation being used to convert to GP form for observations exceeding the threshold). In this case the limiting extremal index is \(\theta =1\), although Ancona-Navarrete and Tawn (2000) derive penultimate approximations for \(\theta (u_{p})\), a threshold-dependent extremal index with threshold \(u_{p}\) set at the quantile of probability p; for example, for an \({{\textit{AR}}}(1)\) process, \(\theta (u_{0.95}) \approx 0.711\) and \(\theta (u_{0.99}) \approx 0.855\). As with the approximation in (6), in the Appendix we also construct approximations for \(\theta (u_{p})\) for an \({{\textit{AR}}}(1)\) process with dependence parameter A, and threshold \(u_{p}\). The crucial censoring device employed when either one or both of \((x_{i},x_{i+1})\) lies sub-threshold (explained above in the context of the models used for asymptotic dependence) is also used in the application of an \({{\textit{AR}}}(1)\) process.

    We note here that, within our description of models for asymptotic dependence, formal tests are available for selecting the most suitable model for first-order dependence; see for example, Coles (2001, Ch. 8). We also note that in both the asymptotically dependent and asymptotically independent cases it is straightforward to investigate the merits of higher-order dependence (e.g. by invoking d-variate extreme value models (see for example, Coles and Tawn 1991), or an \({\textit{AR}}(d)\) process, to model dependence between d consecutive values in the process).

  2.

    Direct estimation of the extremal index

    Here, initially ignore dependence and proceed by fitting the GP distribution to all excesses over u to approximate \(F^{n}\) in (1). Then estimate the extremal index directly to adjust for extremal dependence and hence complete the right-hand-side in (1). Fawcett and Walshaw (2016) make various recommendations for the extremal index estimator that should be used under this approach, but a simulation study shows that the estimator of Ferro and Segers (2003), given by

    $$\begin{aligned} {\bar{\theta }} = {\mathrm{min}}\left( 1,\frac{2\left\{ \sum _{i=1}^{K-1}(T_{i}-a)\right\} ^{2}}{(K-1)\sum _{i=1}^{K-1}(T_{i}-b)(T_{i}-c)}\right) , \end{aligned}$$
    (7)

    where \(T_{i}\) are the \(K-1\) inter-arrival times between our K threshold excesses (\(a=b=c=0\) if the largest inter-arrival time is no greater than 2, and \(a=b=1\) and \(c=2\) otherwise), strikes a good balance between accuracy and precision while remaining simple to use.
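Both routes to the extremal index above, the polynomial approximation (6) under the logistic model and the intervals estimator (7), reduce to a few lines of code. In this sketch the function names are ours, and exceedance times are indexed in units of observations:

```python
import numpy as np

def theta_from_alpha(alpha):
    """Polynomial approximation (6) to the extremal index under the
    logistic dependence model with parameter alpha."""
    return 0.013 - 0.092 * alpha + 1.833 * alpha ** 2 - 0.756 * alpha ** 3

def ferro_segers_theta(exceed_times):
    """Intervals estimator (7) of Ferro and Segers (2003), given the
    (observation-index) times at which the threshold was exceeded."""
    T = np.diff(np.sort(exceed_times))  # the K-1 inter-arrival times
    K1 = len(T)
    if T.max() <= 2:                    # a = b = c = 0
        num = 2 * (T.sum()) ** 2
        den = K1 * (T ** 2).sum()
    else:                               # a = b = 1, c = 2
        num = 2 * ((T - 1).sum()) ** 2
        den = K1 * ((T - 1) * (T - 2)).sum()
    return min(1.0, num / den)
```

Widely spaced exceedances give an estimate capped at 1 (no clustering), while a mixture of short within-cluster and long between-cluster gaps pulls the estimate below 1.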

In the presence of seasonal variability, Fawcett and Walshaw (2016) recommend a seasonal piecewise approach to modelling, with a unique GP model for extremes within each season. Of course, this should only be attempted where there is confidence that it is the same physical mechanism generating extremes at different times of the year, with seasonal variability in the extremes arising merely from a change in the scale of this mechanism. This assumption seems reasonable for wind speeds in temperate climates (e.g. the UK), where it is usually the same alternating sequence of anticyclones and depressions leading to most of the storms that occur throughout the year. For example, assuming either (1) or (2) above to capture temporal dependence, estimates of \((\sigma _{m},\xi _{m},\theta _{m}), m=1, \ldots , M,\) might be obtained for each season m, the analysis perhaps being simplified by assuming a common shape parameter \(\xi \) or extremal index \(\theta \) across all seasons (where appropriate). Anecdotal evidence in (for example) Walshaw (1994) and Fawcett and Walshaw (2006a) indicates there are no real gains to be made, in terms of return level inference, by allowing the GP parameters to vary smoothly through time.

Though not a feature of our data in Fig. 1, trends in extremes can be modelled by imposing a time dependence on the GP scale parameter, i.e. \(\sigma =\mathrm{exp}\{\beta _{0}+\beta _{1}t\}\), \(t=1, 2, \ldots ,\) where t indexes time. More generally, a dependence on covariates can be induced by writing the parameters in the form \(h({\varvec{X}}^{\textsf {T}}\varvec{\beta })\), where h is a specified function, \(\varvec{\beta }\) is a vector of parameters and \({\varvec{X}}\) is a model vector. For applications of generalised additive models (GAMs) to extremes, see (for example) Yee and Stephenson (2007) and Chavez-Demoulin and Davison (2005).

2.2.3 Return levels

Inversion of the right-hand-side of (1), assuming a GP model for threshold excesses, gives the following expression for the r-year return level:

$$z_{r} = \left\{ \begin{array}{ll} u+\sigma \xi ^{-1}\left[ \left( \lambda _{u}^{-1}w_{r}\right) ^{-\xi }-1\right] , &\quad \xi \ne 0; \\ u-\sigma {\mathrm{log}}\left( \lambda _{u}^{-1}w_{r}\right) , &\quad \xi =0, \end{array} \right.$$
(8)

where \(w_{r}=1-\left[ 1-(rn_{y})^{-1}\right] ^{1/\theta }\), \(\lambda _{u}\) is the rate of threshold exceedance and \(n_{y}\) is the (average) number of observations per year. Under approach (1), as outlined in Sect. 2.2.2, \((\sigma , \xi , \theta )\) in Eq. (8) can be replaced with their maximum likelihood estimates/Bayesian estimates to obtain estimates of \(z_{r}\), with an assessment of uncertainty being made through standard errors/posterior standard deviations and (profile-likelihood) confidence intervals/credible intervals, respectively, in the usual way. Under approach (2), \((\sigma , \xi )\) can be replaced with their maximum likelihood or Bayesian estimates and \(\theta \) replaced with an estimate obtained via Eq. (7), with a bootstrap procedure as proposed in Fawcett and Walshaw (2012) enabling the incorporation of uncertainty in estimates of \(\theta \) into estimates of \(z_{r}\).
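Equation (8) translates directly into code. A sketch, with \(w_{r}\) as defined above (the function name and the tolerance for the \(\xi =0\) branch are our own choices):

```python
import math

def return_level(r, u, sigma, xi, lam_u, theta, n_y, eps=1e-9):
    """r-year return level from Eq. (8): GP threshold model with
    exceedance rate lam_u and extremal index theta."""
    w = 1 - (1 - 1 / (r * n_y)) ** (1 / theta)  # w_r as defined in the text
    if abs(xi) < eps:  # xi = 0 branch
        return u - sigma * math.log(w / lam_u)
    return u + sigma / xi * ((w / lam_u) ** (-xi) - 1)
```

As expected, the return level increases with the return period r, and the \(\xi \ne 0\) branch reproduces the \(\xi =0\) branch in the limit.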

To recombine seasonally-varying parameters when estimating return levels, assuming independence between seasons we can solve the following for \(x=z_{r}\):

$$\begin{aligned} \prod _{m=1}^{M} H_{m}(x)^{n_{m}\theta _{m}} = 1-r^{-1}, \end{aligned}$$
(9)

where \(H_{m}\) is the GP d.f. in season m with parameter set \((\lambda _{u_{m}}, \sigma _{m}, \xi _{m})\), and \(\theta _{m}\)/\(n_{m}\) are the extremal index/number of observations in season m. Of course, as discussed earlier, inference can be simplified if we assume a constant shape or dependence across all seasons.
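Solving (9) requires only a one-dimensional root search. A minimal bisection sketch follows, assuming each seasonal d.f. enters through the usual rate-adjusted tail \(F_{m}(x) \approx 1 - \lambda _{u_{m}}\left( 1+\xi _{m}(x-u_{m})/\sigma _{m}\right) ^{-1/\xi _{m}}\) for x above the seasonal threshold; the parameter-tuple ordering is our own convention:

```python
import math

def seasonal_return_level(r, seasons, lo, hi, tol=1e-8):
    """Solve Eq. (9), prod_m F_m(x)^(n_m*theta_m) = 1 - 1/r, by bisection.

    seasons: iterable of (u, sigma, xi, lam_u, n_m, theta_m) per season;
    [lo, hi] must bracket the root, with lo above every seasonal threshold.
    """
    def log_annual_cdf(x):
        total = 0.0
        for u, sigma, xi, lam, n_m, theta in seasons:
            if abs(xi) > 1e-9:
                tail = lam * (1 + xi * (x - u) / sigma) ** (-1 / xi)
            else:  # xi = 0 (exponential) case
                tail = lam * math.exp(-(x - u) / sigma)
            total += n_m * theta * math.log(1 - tail)
        return total

    target = math.log(1 - 1 / r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_annual_cdf(mid) < target:  # annual d.f. too small: go higher
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Working on the log scale avoids underflow when the per-season d.f. is raised to large powers \(n_{m}\theta _{m}\).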

2.3 Bayesian inference for wind speed extremes

After performing investigations into the dependence structure of our wind speed extremes, such as those described in Fawcett and Walshaw (2006b), we conclude that a first-order Markov structure, assuming asymptotic dependence according to the logistic model (Eq. 5), is appropriate: diagnostics such as the \(\chi \)/\({\bar{\chi }}\) plots discussed in Coles (2001, Ch. 8) implied asymptotic dependence, and neither alternative dependence models nor higher-order Markov structures improved on the logistic fit to consecutive pairs of extremes. Thus, we adopt approach (1) as outlined in Sect. 2.2.2 for handling dependence between consecutive observations. Given the seasonal variability observed in the wind speed extremes, and our earlier discussion in Sect. 2.2.2, we adopt a seasonal piecewise approach to modelling. Specifically, following discussions about the UK wind climate in Walshaw (1994), we use the calendar month as our seasonal unit, assuming stationarity of wind speed extremes within each month. Here, MRL plots have been used for the selection of monthly-varying thresholds.

Following the recommendations of Fawcett and Walshaw (2016) we adopt a fully Bayesian approach to inference, using Markov chain Monte Carlo (MCMC) techniques to draw approximate samples from the marginal posteriors for \((\sigma _{m}, \xi _{m}, \theta _{m})\), \(m=1, \ldots , 12\), and hence \(z_{r}\) through Eq. (9). Details on MCMC techniques are now extensively published (e.g. Gamerman and Lopes 2006); Sect. 2.3.2 gives more specific information about the algorithm we employ.

2.3.1 Prior specification

Generally, we work with a re-parameterised GP scale:

$$\begin{aligned} \eta ={{\mathrm{log}}}(\sigma -\xi u). \end{aligned}$$

This re-parameterisation gives a scale that is threshold-independent (unlike \(\sigma \)); working with the natural logarithm retains the positivity of this parameter in our MCMC sampling scheme. Based on work in Fawcett and Walshaw (2016), we adopt an informative prior specification for the parameter vector \((\eta _{m}, \xi _{m}, \alpha _{m})^{{\textsf {T}}}\) based on observations on wind speed extremes made at a nearby location. Specifically, we use:

$$\begin{aligned} (\eta _{m},\xi _{m}) \sim N_{24}\left( {\varvec{\mu }},{\varvec{\varSigma }}\right) \quad {\mathrm{and}} \quad \alpha _{m} \sim {\mathrm{Beta}}(10,19), \end{aligned}$$

\(m=1,\ldots , 12\). The components of the mean vector \(\varvec{\mu }\) are chosen to closely match our beliefs about the most likely values of \((\eta _m,\xi _m)\) based on our study of wind speeds at the nearby location; we specify values for \({\textsf {cov}}(\eta _{m},\xi _{m})\) according to our beliefs about the covariances between these parameters at the nearby location, scaled (albeit rather crudely) to reflect our uncertainties about differences between monthly wind speeds at the two locations. We choose our priors for \(\alpha _{m}\) for similar reasons and, as is often the case, we assume independence between the marginal and dependence parameters. Of course, given the re-parameterisation to a threshold-independent scale parameter, more objective priors could be used in the absence of any such external information.

2.3.2 MCMC sampling

We set initial values for all parameters to their prior means, using a simple Metropolis update to give successive draws

$$\begin{aligned} (\eta _{m}^{[j]},\xi _{m}^{[j]}, \alpha _{m}^{[j]}), \quad j=1, \ldots , 10^5, \end{aligned}$$

after thinning to every tenth iteration. Within each Metropolis step we use a random walk update to generate candidate values for each of the parameters, tuning the innovation variances to optimise the efficiency of our sampler. Convergence is assessed by starting each chain at multiple new initial values and observing the trace plots. No formal MCMC diagnostics are employed, although checks such as the Gelman–Rubin convergence diagnostic, and effective sample size computations (see for example, Gamerman and Lopes 2006), could be employed here.
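The sampler just described can be sketched for a single generic parameter. This toy version targets a standard normal log-posterior as a stand-in for the actual GP/logistic posterior; all names, the seed and the step size are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2023)

def log_post(eta):
    """Stand-in log-posterior (standard normal); replace with the model's."""
    return -0.5 * eta ** 2

def rw_metropolis(log_post, init, n_iter, step, thin=10):
    """Random-walk Metropolis with Gaussian innovations, thinned draws."""
    cur, lp = init, log_post(init)
    draws = []
    for j in range(1, n_iter + 1):
        cand = cur + step * rng.normal()          # random-walk proposal
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:  # Metropolis accept/reject
            cur, lp = cand, lp_cand
        if j % thin == 0:                         # thin to every tenth draw
            draws.append(cur)
    return np.array(draws)
```

Tuning `step` (the innovation standard deviation) targets a moderate acceptance rate; running several such chains from dispersed initial values supports the trace-plot convergence checks described above.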

Equation (6) is used to obtain a sample from the posterior for the extremal index \(\theta _{m}\), after which a posterior sample for \(z_{r}\) is obtained on substitution of successive draws for the GP parameters and the extremal index into Eq. (9).

2.3.3 Prediction

As discussed earlier, one of the advantages of a Bayesian analysis of extremes is the natural extension to prediction via the posterior predictive distribution. If Y denotes a future extreme of our wind speed series, then we can write

$$\begin{aligned} {\mathrm{Pr}}\left\{ Y\le y| {\varvec{x}}\right\} = \int _{\varvec{\varPsi }}{\mathrm{Pr}}\left\{ Y \le y|\varvec{\psi }\right\} \pi \left( \varvec{\psi }|{\varvec{x}}\right) d\varvec{\psi } \end{aligned}$$
(10)

for the predictive distribution of our extremes, where \({\varvec{x}}\) represents past observations, \(\varvec{\psi }\) is a generic parameter vector and \(\pi (\varvec{\psi }|{\varvec{x}})\) is the posterior density for \(\varvec{\psi }\). Solving

$$\begin{aligned} {\mathrm{Pr}}\left\{ Y \le z_{r,{\textsf {pred}}}|{\varvec{x}}\right\} = 1 - r^{-1} \end{aligned}$$
(11)

for \(z_{r, {\textsf {pred}}}\) therefore gives an estimate of the r-year return level that captures uncertainty in parameter estimation. Although (10) is analytically intractable, it can be approximated since we have estimated the posterior distribution using MCMC. Regarding the sample \(\varvec{\psi }^{(1)}, \ldots , \varvec{\psi }^{(S)}\) as realisations from the stationary distribution \(\pi (\varvec{\psi }|{\varvec{x}})\), we have

$$\begin{aligned} {\mathrm{Pr}}\left\{ Y \le z_{r, {\textsf {pred}}}|{\varvec{x}}\right\} \approx \frac{1}{S}\sum _{j=1}^{S}{\mathrm{Pr}}\left\{ Y \le z_{r, {\textsf {pred}}}|\varvec{\psi }^{[j]} \right\} , \end{aligned}$$
(12)

which we can set equal to \(1-r^{-1}\) and solve for \(z_{r,{\textsf {pred}}}\) using a numerical solver. In our analysis of wind speed extremes, we have \(\varvec{\psi } = (\eta _{m}, \xi _{m}, \theta _{m})^{{\textsf {T}}}\), \(m=1, \ldots , 12\).
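Given the posterior draws, (12) can be solved with a simple bisection. In this sketch `annual_cdf(z, psi)` stands for \({\mathrm{Pr}}\{Y \le z|\varvec{\psi }\}\) on the annual scale, and all names are ours:

```python
import numpy as np

def predictive_return_level(r, draws, annual_cdf, lo, hi, tol=1e-8):
    """Solve (1/S) * sum_j annual_cdf(z, psi_j) = 1 - 1/r for z, as in (12).

    draws: posterior sample psi_1, ..., psi_S; [lo, hi] must bracket the root.
    """
    target = 1 - 1 / r

    def monte_carlo_cdf(z):
        # Monte Carlo average (12) over the posterior draws
        return np.mean([annual_cdf(z, psi) for psi in draws])

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if monte_carlo_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Averaging the d.f. over dispersed posterior draws thickens the predictive tail, so greater posterior uncertainty pushes \(z_{r, {\textsf {pred}}}\) upwards, consistent with the behaviour discussed in Sect. 2.3.4.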

2.3.4 Some results

Table 1 shows some estimative return levels for the wind speed extremes; that is, some point summaries from the posterior distributions for \(z_{r}\), for some specific r. Also shown are summaries of the spread of these posteriors via 95% credible intervals. Accompanying these estimative return level summaries are their predictive counterparts \({\hat{z}}_{r, {\textsf {pred}}}\). Figure 2 shows plots of both \({\hat{z}}_{r}\) and \({\hat{z}}_{r, {\textsf {pred}}}\) over a range of values for r (on the usual logarithmic scale for these plots to magnify results for long-range return periods; posterior means are shown for \({\hat{z}}_{r}\), along with the 95% credible intervals). It is clear from both Table 1 and Fig. 2 that designing a structure to withstand the extremes of wind speed as suggested by the estimative return levels could result in under-protection (especially when using the posterior mode), relative to the predictive estimates. This is more apparent for long return periods: recall from Sect. 2.1 that the ONR in the UK currently recommends that nuclear structures be protected against the 10,000-year event. Indeed, although not the case here, studies often report the predictive return level lying beyond even the 95% credible upper bound for \(z_{r}\); as an example, see the second block of results in Table 1 for another wind speed location in the Peak District of Central England.

Table 1 Posterior means, modes and 95% credible intervals for return levels \(z_{r}\), with estimates of the corresponding predictive return levels \(z_{r, {\textsf {pred}}}\)
Fig. 2

Means (blue) taken from the MCMC samples of the posterior distributions for return levels \(z_{r}\) across a range of return periods r, with 95% credible intervals (outer light blue shaded area). Also shown, in red, are corresponding estimates of the predictive return levels \(z_{r, {\textsf {pred}}}\)

As discussed in Fawcett and Walshaw (2016), the predictive return level estimate might be preferred since it provides the practitioner with a single point summary that encapsulates uncertainty in parameter estimation. However, open questions remain about the quantity \(z_{r, {\textsf {pred}}}\). For example, how do exceedance probabilities of \(z_{r, {\textsf {pred}}}\) compare to the intended values \(r^{-1}\) (on an annual scale)? How do these probabilities compare to those under an estimative approach for \(z_{r}\)? Given results in, for example, Coles and Tawn (1996) and Fawcett and Walshaw (2016), we might expect \(z_{r,{\textsf {pred}}}\) to give exceedance probabilities considerably lower than \(r^{-1}\); implicit in the predictive return level is the allowance for uncertainty in parameter estimation, resulting in higher return level estimates and correspondingly lower exceedance probabilities. But is this really the case, and if so, can these discrepancies be quantified and could they result in substantial over-protection? At the very least, practitioners should be aware of these discrepancies, should they choose to work with \(z_{r, {\textsf {pred}}}\). Are there any advantages of using \(z_{r,{\textsf {pred}}}\) as opposed to using some other point summary from the posterior for \(z_{r}\), such as the upper end-point of the 95% credible interval or perhaps some other quantile? We aim to answer these questions, and more, in the simulation study in Sect. 3.

3 Simulation study

3.1 Study design

The first stage of our simulation study requires the simulation of a stationary reference series \({{\mathbf{y}}}^{({\mathrm{Ref}})}\) of length N years with \(N_{y}\) observations per year, where N is very large. We assume a first-order Markov structure, and we consider a range of models for the dependence between neighbouring extremes in \({{\mathbf{y}}}^{({\mathrm{Ref}})}\). Specifically, under the assumption of asymptotic dependence we use the logistic and bilogistic models as discussed in Sect. 2.2.2, covering a range of temporal dependencies through specific choices for \(\alpha \)/\((\alpha , \beta )\). To account for scenarios in which asymptotic independence might be a more plausible assumption, we also obtain \({{\mathbf{y}}}^{({\mathrm{Ref}})}\) from an AR(1) process with lag 1 autocorrelation A, again covering a range of temporal dependencies through specific choices for A. Marginally, our reference series are primarily GP-distributed with scale and shape \((\sigma , \xi )\) giving scale and shape \((\sigma ^{*} = \sigma +\xi u, \xi ^{*}=\xi )\) for excesses over some threshold u; see Coles (2001, Ch. 4). However, we also consider chains \({\mathbf{y}}^{({\mathrm{Ref}})}\) from distributions in one of the domains of attraction of the GP distribution.
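As a sketch of the asymptotically independent case, an AR(1) reference series with the stated stationary margin \(N(0, (1-A^{2})^{-1})\) can be simulated as follows; the marginal transformation of threshold excesses to GP form is omitted, and the seed and function name are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1_series(n_obs, A):
    """Gaussian AR(1) with unit-variance innovations, started from the
    stationary marginal distribution N(0, 1/(1 - A^2))."""
    z = np.empty(n_obs)
    z[0] = rng.normal(0.0, (1 - A ** 2) ** -0.5)  # stationary start
    for t in range(1, n_obs):
        z[t] = A * z[t - 1] + rng.normal()
    return z
```

A long simulated series should reproduce both the stationary variance \((1-A^{2})^{-1}\) and the lag 1 autocorrelation A to within Monte Carlo error.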

Now, at each replication \(\ell \) in our simulation study, \(\ell =1, \ldots , L\), we simulate a stationary series \({{\mathbf{y}}}^{(\ell )}\) of length n years, with \(n_{y}\) observations per year, n being typical of the record lengths we might usually work with for environmental extremes. As with the reference dataset, we assume a first-order Markov structure according to models for asymptotic dependence and asymptotic independence, with the same marginal assumptions as before. For each series \({{\mathbf{y}}}^{(\ell )}\) we perform a full MCMC procedure, primarily using objective (and, where available, conjugate) priors. For example, for the \({\textit{AR}}(1)\) process with lag 1 autocorrelation A, we have \({{\mathbf{y}}}^{(\ell )}|\mu , \tau \sim N(\mu =0, \tau ^{-1}=(1-A^{2})^{-1})\). We thus assume the conjugate prior specification

$$\begin{aligned} \mu |\tau \sim N\left( 0, \frac{1}{c\tau }\right) \quad {\mathrm{and}} \quad \tau \sim Ga(g, h), \end{aligned}$$

yielding

$$\begin{aligned} \mu |\tau ,{{\mathbf{y}}}^{(\ell )} \sim N\left( \frac{nn_{y} \bar{{\mathbf{y}}}^{\ell }}{c+nn_{y}}, \frac{1}{(c+nn_{y})\tau }\right) \end{aligned}$$

and

$$\begin{aligned} \tau |{{\mathbf{y}}}^{(\ell )} \sim Ga\left( g+\frac{nn_{y}}{2}, h+\frac{cnn_{y}\left( \bar{{\mathbf{y}}}^{\ell }\right) ^{2}}{2(c+nn_{y})}+\frac{nn_{y}s^{2}}{2}\right) , \end{aligned}$$

where \(\bar{{\mathbf{y}}}^{\ell }\) and \(s^{2}\) are the mean and variance (respectively) of the simulated series \({{\mathbf{y}}}^{(\ell )}\), \(c=10^{-1}\) and \(g = h = 10^{-3}\). This enables posterior inferences to be made on the autoregressive parameter A, and hence the extremal index \(\theta \), via the polynomial approximation constructed in the Appendix. In the case of asymptotic dependence, for example, we use the logistic model with dependence parameter \(\alpha \): assuming the prior \(\alpha \sim U(0,1)\), we perform Metropolis–Hastings sampling, as outlined in Sect. 2.3.2, to obtain draws from the posterior for \(\alpha \) using the likelihood in Eq. (4), and hence for the extremal index \(\theta \) via Eq. (6). In the case of asymptotic dependence/independence we then transform the margins from standard Fréchet/Normal, respectively, to GP with scale and shape \((\sigma , \xi )\), before performing a full MCMC procedure on excesses over a range of u.
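In the AR(1) case the conjugacy means no Metropolis step is required: \(\tau \) can be sampled directly from its Gamma posterior and \(\mu |\tau \) from its Normal posterior, with each draw of \(\tau \) mapped to the autocorrelation via \(\tau = 1-A^{2}\) (unit innovation variance, so the stationary variance is \((1-A^{2})^{-1}\)). A sketch using the standard Normal–Gamma conjugate results and the hyperparameter values stated above; the function name and seed are our own:

```python
import numpy as np

def posterior_draws_ar1(y, S=10_000, c=0.1, g=1e-3, h=1e-3, seed=1):
    """Exact draws from the conjugate Normal-Gamma posterior of (mu, tau),
    mapping each tau to the AR(1) lag 1 autocorrelation A via
    tau = 1 - A^2 (stationary variance 1/(1 - A^2), unit innovations)."""
    rng = np.random.default_rng(seed)
    m = y.size                       # m = n * n_y observations in total
    ybar, s2 = y.mean(), y.var()
    # tau | y ~ Ga(g + m/2, h + c m ybar^2 / (2(c + m)) + m s2 / 2)
    rate = h + c * m * ybar**2 / (2 * (c + m)) + m * s2 / 2
    tau = rng.gamma(g + m / 2, 1.0 / rate, size=S)  # numpy scale = 1/rate
    # mu | tau, y ~ N(m ybar / (c + m), 1 / ((c + m) tau))
    mu = rng.normal(m * ybar / (c + m), 1.0 / np.sqrt((c + m) * tau))
    A = np.sqrt(np.clip(1.0 - tau, 0.0, None))      # posterior draws for A
    return mu, tau, A
```

Each draw of A can then be converted to a draw of the extremal index \(\theta \) via the polynomial approximation in the Appendix.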

The procedures outlined above yield S iterations after burn-in, giving approximate samples \(\varvec{\sigma }^{*(\ell )}\) and \(\varvec{\xi }^{*(\ell )}\), \(\ell = 1, \ldots , L\), of length S from the posterior distributions of the GP scale and shape \(\sigma ^{*}\) and \(\xi ^{*}\), as well as approximate samples from the posteriors of the dependence parameters (i.e. \(\alpha \) or A) and hence samples \(\varvec{\theta }^{(\ell )}\) from the posterior of the extremal index; see Sect. 2.2.2. At each replication \(\ell \), via Eq. (8) we also obtain posterior samples \(\varvec{z_{r}}^{(\ell )}\) of the r-year return level for a range of return periods r. From these draws we can obtain the posterior mean \({\bar{z}}_{r}^{(\ell )}\), the posterior mode \({\dot{z}}_{r}^{(\ell )}\) and the posterior 95% credible interval upper bound \(z_{r,{\textsf {upper}}}^{(\ell )}\); we also obtain the predictive return level \(z_{r, {\textsf {pred}}}^{(\ell )}\) via Eq. (12), essentially giving sampling distributions of length L for each of these return level summaries. Defining p as the proportion of annual maxima in \({{\mathbf{y}}}^{({\mathrm{Ref}})}\) exceeding each of \({\bar{z}}_{r}^{(\ell )}\), \({\dot{z}}_{r}^{(\ell )}\), \(z_{r,{\textsf {upper}}}^{(\ell )}\) and \(z_{r, {\textsf {pred}}}^{(\ell )}\), \(\ell =1, \ldots , L\), then yields sampling distributions for \(p_{{\bar{z}}_{r}}\), \(p_{{\dot{z}}_{r}}\), \(p_{z_{r, {\textsf {upper}}}}\) and \(p_{z_{r, {\textsf {pred}}}}\), respectively. Beyond the sampling distributions for each of the return level summaries themselves, of particular interest are comparisons between each of the proportions \(p_{*}\) and the intended exceedance probabilities \(r^{-1}\) (the * subscript used here to denote generically any one of our estimators for \(r^{-1}\)).
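The exceedance proportion for a single return level estimate can be sketched as follows (a helper of our own, assuming the reference series length is a multiple of the number of observations per year):

```python
import numpy as np

def exceedance_proportion(y_ref, n_per_year, z_hat):
    """Proportion of annual maxima in the reference series exceeding the
    return level estimate z_hat (one of z-bar, z-dot, z-upper or z-pred)."""
    annual_maxima = np.asarray(y_ref).reshape(-1, n_per_year).max(axis=1)
    return float(np.mean(annual_maxima > z_hat))
```

Applying this to the estimate from each of the L replications gives the sampling distribution of \(p_{*}\).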

3.2 Parameters in our study

In our study, we use \(N=10^{5}\) and \(N_{y} = 365.25 \times 24\), in line with having hourly measurements on our variable. We use \(\alpha =\{0.3, \ldots , 0.9\}\) for the logistic model; for the bilogistic model we fix \(\alpha \) at 0.5 and use \(\beta =\{0.3, \ldots , 0.9\}\); for the \({\textit{AR}}(1)\) process we use \(A=\{0.2, \ldots , 0.8\}\).

Marginally, we hold \(\sigma \) constant at unity (i.e. \(\sigma =1\)) but consider a range of tail behaviours through the GP shape parameter \(\xi \), where \(\xi = \{-0.4, -0.1, 0, 0.1, 0.3\}\), yielding \(\sigma ^{*} = \xi u + 1\) and \(\xi ^{*}=\xi \), where we use \(u =\{u_{0.9}, u_{0.95}, u_{0.99}\}\). We perform \(L=1000\) replications, within which we simulate chains \({{\mathbf{y}}}^{(\ell )}\) of length \(n=50\) years with \(n_{y}=N_{y}\); for each chain, we perform MCMC with \(S=10{,}000\) iterations after an appropriate burn-in discard. For each combination of \((\sigma ^{*}, \xi ^{*}, \theta )\) across all dependence models considered, we perform small MCMC pilot runs in a bid to select suitable values for the MCMC tuning parameters before running the full simulation study, aiming for acceptance rates of 20–30%.
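The thresholds \(u_{p}\) and adjusted scales \(\sigma ^{*}\) follow in closed form from the GP quantile function; a small sketch (function name ours) under the study's marginal settings:

```python
import numpy as np

def gp_threshold_and_scale(p, xi, sigma=1.0):
    """Marginal GP quantile u_p and adjusted scale sigma* = sigma + xi * u_p
    for excesses over u_p (sigma is held at 1 throughout the study)."""
    if xi == 0.0:
        u = -sigma * np.log(1.0 - p)              # exponential limit, xi -> 0
    else:
        u = (sigma / xi) * ((1.0 - p) ** (-xi) - 1.0)
    return u, sigma + xi * u

u, sigma_star = gp_threshold_and_scale(0.95, -0.4)
```

For \(\xi \ne 0\) this reduces to \(\sigma ^{*} = \sigma (1-p)^{-\xi }\), so the excess scale shrinks as the threshold quantile increases when \(\xi < 0\).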

3.3 Results

In this section we present the findings of our simulation study, focusing on comparisons between the predictive return level and the other summaries obtained directly from the posterior distribution for \(z_{r}\). Specifically, we give attention to the return level exceedance probabilities associated with the different Bayesian estimators for \(z_{r}\), and we look at how these compare to the intended values \(r^{-1}\). We consider the cases of asymptotic dependence (Sect. 3.3.1) and independence (Sect. 3.3.2), but also the effects of model mis-specification on return level inference when asymptotic dependence/independence is incorrectly assumed (Sect. 3.3.3). We investigate the effects of prior specification on estimates of return level exceedance probabilities (Sect. 3.3.4), relative to those discussed in Sects. 3.3.1 and 3.3.2; specifically, we look at the effects of using informative priors on marginal and dependence parameters, and the effects of mis-chosen informative priors (unless otherwise stated, all results in other sections lean on objective prior specifications). For information, we compare the performance of the predictive return level under our approach, which uses information on all threshold excesses, with that obtained under the commonly-used POT approach (Sect. 3.3.5). We also assess the effects of using chains that are drawn marginally from distributions within the domain of attraction of the GP distribution, rather than directly from the GP distribution itself (Sect. 3.3.6). At the end of this section we give some general comments on the sensitivity of our comparisons between the different estimators of \(r^{-1}\) to the marginal structure of the simulated chains (Sect. 3.3.7).

3.3.1 Asymptotic dependence

One arm of the study: logistic dependence structure with \(\xi =-0.4\) and \(u=u_{0.95}\)

Figure 3 shows sampling distribution means, and 95% confidence intervals, for \({{\mathrm{log}}}(1+rp_{*})\). The horizontal dotted lines are at \({{\mathrm{log}}}2 = {{\mathrm{log}}}[{\mathbb {E}}(r p_{*})+1] \ge {\mathbb {E}}[{{\mathrm{log}}}(1+rp_{*})]\), which holds by Jensen's inequality when \(p_{*}\) is unbiased for \(r^{-1}\), in effect giving a theoretical upper bound for the means of our sampling distributions for \({{\mathrm{log}}}(1+rp_{*})\). The target of each of our estimators is a probability close to zero; moreover, over-estimation of these probabilities would arise from under-estimation of the corresponding return levels, perhaps resulting in under-protection from a safety point-of-view if such estimates were to be used as design parameters. Thus, over-estimation of \(r^{-1}\) by \(p_{*}\) might be seen as more costly than under-estimation, but the root mean squared error (RMSE), given by

$$\begin{aligned} \sqrt{L^{-1}\sum _{\ell =1}^{L}\left( p^{(\ell )}_{*}-r^{-1}\right) ^{2}}, \end{aligned}$$

punishes under- and over-estimation equally. Thus, linear-exponential errors—or linex errors (e.g. Zellner 1986)—given by

$$\begin{aligned} {\mathrm{exp}}\left\{ d(p_{*}-r^{-1})\right\} - d(p_{*}-r^{-1})-1, \quad d>0, \end{aligned}$$

can be used to impose an asymmetric error favouring under-estimation of \(r^{-1}\). Table 2 therefore reports the mean linex error (MLE) in each component of our simulation study, along with the standard RMSE for comparison. Both of these error measures in Table 2 accompany the estimated bias for each estimator \(p_{*}\). For these particular results the simulated chains display asymptotic dependence according to the bivariate logistic model for consecutive pairs in the process; marginally, \(\xi =-0.4\) and \(u=u_{0.95}\).
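The three error summaries reported in Table 2 can be sketched as follows; the linex tilt parameter d is not specified here, so the default below is purely illustrative:

```python
import numpy as np

def error_summaries(p_star, r, d=100.0):
    """Bias, RMSE and mean linex error (MLE) of L sampled estimates p_star
    of the target exceedance probability 1/r; d > 0 penalises
    over-estimation of 1/r more heavily than under-estimation."""
    err = np.asarray(p_star, dtype=float) - 1.0 / r
    bias = err.mean()
    rmse = np.sqrt(np.mean(err**2))
    mle = np.mean(np.exp(d * err) - d * err - 1.0)
    return bias, rmse, mle
```

The asymmetry is easy to check: with d > 0, an over-estimate of \(r^{-1}\) incurs a larger linex error than an under-estimate of the same magnitude.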

Fig. 3

Sampling distribution means (bullets) and 95% confidence intervals (vertical lines, running between the sampling distribution 2.5 and 97.5% quantiles) for \({{\mathrm{log}}}(1+rp_{*})\), using (1) direct summaries from the return level posterior distribution (blue; solid \(=\) posterior mean, dashed \(=\) posterior mode, dot-dashed \(=\) posterior 95% credible upper bound) and (2) the posterior predictive return level (red). Here, the simulated data are constructed with asymptotic dependence according to a bivariate logistic model with dependence \(\alpha \). The horizontal dotted line is at \({{\mathrm{log}}}2\), representing the maximum of \({\mathbb {E}}[{{\mathrm{log}}}(1+rp_{*})]\). Marginally, \(\xi =-0.4\) and \(u=u_{0.95}\)

Table 2 Estimated biases (\(\times 100\)), root mean squared errors (RMSE, \(\times 100\)) and mean linex errors (MLE, \(\times 100\)), for each of our four estimators of the return level exceedance probability \(r^{-1}\)

The superiority of the predictive return level over the most commonly-used posterior summary—the posterior mean—is obvious, especially for longer return periods. For example, Table 2 and Fig. 3 show that the predictive return level yields exceedance probabilities \(p_{z_{r, {\textsf {pred}}}}\) that become increasingly accurate and precise as the return period r increases, especially as the extremal dependence in the series weakens (i.e. as \(\alpha \rightarrow 1\)). In comparison, the return level posterior mean is (at best) on a par with the predictive return level, in terms of its associated exceedance probabilities, when \(r=10\); for longer return periods the bias of these estimated exceedance probabilities is noticeably larger than those produced by the predictive return level (increasingly so as r increases), as is our uncertainty in these estimates. Where both the predictive return level and the return level posterior mean lead to exceedance probabilities that over-estimate \(r^{-1}\), there is usually a smaller bias in \(p_{z_{r, {\textsf {pred}}}}\) than in \(p_{{\bar{z}}_{r}}\). For most strengths of dependence, and for larger return periods, the sampling distribution means for \({\mathrm{log}}(1+rp_{*})\) are within their range (i.e. \(\le {{\mathrm{log}}}\,2\)) when \(p_{*} = p_{z_{r, {\textsf {pred}}}}\), and certainly on more occasions than when \(p_{*}=p_{{\bar{z}}_{r}}\). As we might expect, \(r^{-1}\) is often under-estimated when using the return level posterior 95% credible upper bound; especially for shorter-range return periods, we might expect \(z_{r, {\textsf {upper}}}\) to over-estimate \(z_{r}\), leading to too-small values for \(p_{z_{r, {\textsf {upper}}}}\). The return level posterior mode consistently produces estimates \(p_{{\dot{z}}_{r}}\) that are too large; these are always substantially larger than those produced by the other three estimators, and with the largest uncertainty, casting doubt on the value of the posterior mode as a useful summary of the return level posterior distribution.

As with Fig. 3, the top row of Fig. 5 shows sampling distribution means for our four estimates of \(r^{-1}\), but now across a fine grid of return periods r for some fixed values of the logistic dependence parameter. Here, we choose \(\alpha =0.3\), \(\alpha = 0.5\) and \(\alpha = 0.9\), representing fairly strong extremal dependence (similar to that observed in our wind speed extremes), moderate extremal dependence and near-independent extremes, respectively. We see that estimates based on the return level posterior mean and the predictive return level consistently over-estimate \(r^{-1}\) when \(\alpha =0.3\), but with estimates based on the predictive return level always being substantially less biased than those based on the posterior mean, especially for longer return periods. Interestingly, for this level of dependence estimates of \(r^{-1}\) based on the return level 95% credible upper bound are closest to the intended exceedance probability, an observation supported by the results in Table 2. As the extremal dependence weakens, we see an even closer agreement between our estimated exceedance probabilities based on the predictive return level and the intended exceedance probabilities \(r^{-1}\), with estimates using the return level posterior mean consistently displaying a larger bias in our plots in Fig. 5. In agreement with the results shown in Fig. 3, the plots in Fig. 5 show that across all values of \(r\ge 100\), estimated exceedance probabilities based on the return level posterior mode are always the most biased, with substantial over-estimation of the exceedance probability.

To put these results into a practical context, recall from Sect. 2.1 the ONR's use in the UK of the 10,000-year return level estimate as a design requirement for structures at nuclear sites. For our wind speed data, monthly estimates (posterior means) of the logistic dependence parameter \(\alpha \) are around 0.3. Focusing on the final plot in Fig. 3, and the first plot in Fig. 5, we see the much smaller bias in estimates of \(r^{-1}\) produced by the predictive return level than by the return level posterior mean, and with greater precision in the predictive estimates; however here, for this long-range return period, the 95% credible upper bound for \(z_{r}\) produces the best estimate of \(r^{-1}\). An over-estimate of \(r^{-1}\) could result in significant under-protection (as this arises from an under-estimate of \(z_{r}\)), and we note here that—relative to the estimates of \(r^{-1}\) based on the predictive return level and the return level 95% credible upper bound—those based on the return level posterior mean are substantially over-estimated (and with more uncertainty).

Other arms of the study: main findings

Here, we report some findings from other arms of the study in which the simulated chains displayed asymptotic dependence. Largely, the general direction of the results already reported was replicated in other arms. For instance, retaining the logistic model but changing the marginal shape parameter still resulted in return level exceedance probabilities more in line with the intended values \(r^{-1}\) when using the predictive return level compared to the return level posterior mean, with substantially smaller biases for return periods of practical interest and much smaller values of RMSE/MLE. As an example, with \(\xi =0.1\) and \(\alpha =0.3\) or 0.5, biases incurred by \(p_{{\bar{z}}_{r}}\) were always larger than those incurred by \(p_{z_{r, {\textsf {pred}}}}\) (around five times larger for \(r=10{,}000\)), and with consistently narrower 95% confidence intervals for the latter; when \(\alpha =0.9\), the outperformance of \(z_{r, {\textsf {pred}}}\) relative to \({\bar{z}}_{r}\) was even more marked. One difference to note is in estimates of \(r^{-1}\) using the return level posterior mode: for simulated chains with increasingly heavy tails (e.g. when positive values for \(\xi \) were used), we observed smaller biases than those reported in Figs. 3, 5 and Table 2 (for which \(\xi =-0.4\)), perhaps indicating that this summary would be more useful for very positively skewed data.

Switching to the bilogistic model for consecutive extremes, allowing for asymmetry in the dependence structure, did not result in noticeable deviations from the results discussed so far, for combinations of dependence parameters \(\alpha \) and \(\beta \) resulting in similar levels of observed extremal dependence as given by the dependence parameter \(\alpha \) in the logistic model. This might suggest a robustness of our findings across different dependence structures within an overall framework of asymptotic dependence. Similarly, our results were consistent across the other two threshold levels considered (\(u_{0.9}\)/\(u_{0.99}\), the 90/99% marginal quantiles respectively).

3.3.2 Asymptotic independence

One arm of the study: \({\textit{AR}}(1)\) with \(\xi = -0.4\) and \(u=u_{0.95}\)

Figure 4 shows the same information as Fig. 3 but now for the case of asymptotic independence where our simulated chains are \({\textit{AR}}(1)\) processes with lag 1 autocorrelation A; as in Sect. 3.3.1, marginally \(\xi =-0.4\) and \(u=u_{0.95}\). The bottom half of Table 2 reports the estimated bias, RMSE and MLE for our four estimators of \(r^{-1}\). The superiority of the predictive return level relative to estimates obtained using the return level posterior mean is obvious, and more apparent than in the previous section when considering series with asymptotic dependence. For example, the sampling distribution means for \({\mathrm{log}}(1+rp_{*})\) are all within their range (i.e. \(\le {{\mathrm{log}}}\,2\)) when \(p_{*} = p_{z_{r, {\textsf {pred}}}}\), regardless of the return period r and strength of dependence A; this is not the case when \(p_{*} = p_{{\bar{z}}_{r}}\). As in the case of asymptotic dependence, we also note greater precision in estimates based on the predictive return level, with narrower 95% confidence intervals (substantially so for larger return periods). The results reported in the bottom half of Table 2 confirm this, with smaller biases typically being observed for \(p_{z_{r, {\textsf {pred}}}}\) than for \(p_{{\bar{z}}_{r}}\) and much smaller values of RMSE/MLE. As in Sect. 3.3.1, estimates based on the return level posterior mode perform most poorly, with estimates based on the return level 95% credible upper bound seemingly performing well (though not as well as those based on the predictive return level) for some large values of r. These results are supported by the plots in the bottom row of Fig. 5, in which we see estimates of the intended exceedance probability \(r^{-1}\) based on the predictive return level being consistently less biased than all the others, for the three levels of dependence we focus upon (\(A=0.7\), \(A=0.5\) and \(A=0.3\), representing reasonably strong, moderate and weak dependence, respectively).

Fig. 4

Sampling distribution means (bullets) and 95% confidence intervals (vertical lines, running between the sampling distribution 2.5% and 97.5% quantiles) for \({{\mathrm{log}}}(1+rp_{*})\), using (1) direct summaries from the return level posterior distribution (blue; solid \(=\) posterior mean, dashed \(=\) posterior mode, dot-dashed \(=\) posterior 95% credible upper bound) and (2) the posterior predictive return level (red). Here, the simulated data are constructed with asymptotic independence according to an \({\textit{AR}}(1)\) process with lag 1 autocorrelation A. The horizontal dotted line is at \({{\mathrm{log}}}2\), representing the maximum of \({\mathbb {E}}[{{\mathrm{log}}}(1+rp_{*})]\). Marginally, \(\xi =-0.4\) and \(u=u_{0.95}\)

Other arms of the study: main findings

We report similar findings for other arms of the study in which the simulated chains display asymptotic independence according to an \({\textit{AR}}(1)\) process, but with different values of \(\xi \) or different thresholds being used. In all cases, estimates of the intended exceedance probability \(r^{-1}\) had smallest bias, and RMSE/MLE, when based on the predictive return level, and especially so for long return periods. Estimates based on the return level posterior mean and posterior mode were consistently too large.

3.3.3 Mis-specification of dependence

We now investigate the effects of mis-specifying the dependence structure on our four estimates of the return level exceedance probability \(r^{-1}\). Specifically, at each iteration \(\ell \), \(\ell = 1, \ldots , 1000\), we simulate \({\mathbf{y}}^{\ell }\) from an \({\textit{AR}}(1)\) process with lag 1 autocorrelation A; inference then proceeds by assuming asymptotic dependence and fitting the logistic/bilogistic model to consecutive pairs in the process, and then obtaining estimates of \(r^{-1}\) using \(p_{{\bar{z}}_{r}}\), \(p_{{\dot{z}}_{r}}\), \(p_{z_{r, {\textsf {upper}}}}\) and \(p_{z_{r, {\textsf {pred}}}}\) in the way we describe in Sect. 3.1. Conversely, we also simulate \({\mathbf{y}}^{\ell }\) with asymptotic dependence via the logistic/bilogistic models, inference then proceeding assuming asymptotic independence through the fitting of an \({\textit{AR}}(1)\) process. In reality, the precise form of dependence structure is unknown; diagnostic checks such as the \(\chi /{\bar{\chi }}\)-plots discussed in Coles (2001, Ch. 8) can be used to help assess the nature of the dependence present, although their interpretation can be difficult. Thus, the aim of this part of the simulation study is to investigate our four estimators of the return level exceedance probability under an incorrect specification of dependence structure, something that could easily occur in an analysis of real data when we attempt to press all threshold excesses into use.

Figure 6 (top row) shows plots similar to those in Fig. 5. The simulated data exhibit asymptotic independence through an \({\textit{AR}}(1)\) structure with lag 1 autocorrelation A, but asymptotic dependence is incorrectly assumed with dependence structure for consecutive pairs according to the logistic model with parameter \(\alpha \). Compared to the results shown in the bottom row of plots in Fig. 5, in which the correct form of dependence was assumed, we see a larger bias in estimates of \(r^{-1}\) with \(p_{z_{r, {\textsf {pred}}}}\) across all values of lag 1 autocorrelation A; however, the predictive return level still clearly outperforms both the posterior mean and mode in terms of the associated exceedance probabilities it yields and their proximity to the intended values \(r^{-1}\). Although not reported here, values of the RMSE/MLE were consistently smaller for \(p_{z_{r, {\textsf {pred}}}}\) than the other three estimates associated with the return level posterior distribution. For the opposite case of mis-specification in terms of the dependence structure—that is, when the data were simulated to exhibit asymptotic dependence but an \({\textit{AR}}(1)\) process was assumed—we observed an increase in the bias of estimated exceedance probabilities associated with the return level posterior mean, mode and 95% credible upper bound, but with the predictive return level yielding estimates of \(r^{-1}\) close to the intended values.

Fig. 5

Sampling distribution means for estimates of the return level exceedance probabilities \(r^{-1}\), using (1) direct summaries from the return level posterior distribution (blue; solid \(=\) posterior mean, dashed \(=\) posterior mode, dot-dashed \(=\) posterior 95% credible upper bound), and (2) the posterior predictive return level (red). Simulated data constructed with: asymptotic dependence according to a bivariate logistic model with dependence \(\alpha \) (top row); asymptotic independence according to an \({\textit{AR}}(1)\) process with lag 1 autocorrelation A (bottom row). The black dashed line represents the target exceedance probability \(r^{-1}\). Marginally, \(\xi =-0.4\) and \(u=u_{0.95}\)

3.3.4 Prior specification

All results discussed so far have assumed an objective (and, where possible, conjugate) prior specification for both dependence and marginal components of our simulated series. However, in keeping with our wind speed data analysis in Sect. 2.3, for some arms of the study we also adopt informative priors. For example, consider the case of asymptotic dependence under the logistic model with dependence parameter \(\alpha \). To emulate our approach to prior specification in Sect. 2.3.1, at each replication \(\ell \) a pair of stationary series \(({{\mathbf{y}}}^{\ell }, {{\mathbf{y}}}^{\dagger \ell })\) is simulated, each series in this pair being drawn from the same GP distribution marginally and having the same dependence structure. Maximum likelihood estimates of the marginal and dependence parameters for \({{\mathbf{y}}}^{\dagger \ell }\), and the corresponding elements of their covariance matrix, are then used to inform the prior specification for the marginal and dependence parameters for \({\mathbf{y}}^{\ell }\). Specifically, we assume that \((\eta ^{*}={{\mathrm{log}}}(\sigma ^{*}-\xi ^{*}u), \xi ^{*}) \sim N_{2}(\varvec{\mu }, \varvec{\varSigma })\) and \(\alpha \sim {\mathrm{Beta}}(a,b)\), with sensible choices for \(\varvec{\mu }\), \(\varvec{\varSigma }\), a and b based on our analysis of \({\mathbf{y}}^{\dagger \ell }\) (as opposed to an objective specification using independent N(0, v) priors for \(\eta ^{*}\) and \(\xi ^{*}\) with large v, and a U(0, 1) prior for \(\alpha \), as used in Sect. 3.3.1).
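One simple way to turn the maximum likelihood output for \({\mathbf{y}}^{\dagger \ell }\) into a Beta prior for \(\alpha \) is moment matching; the helper below is our own illustration of this device, not necessarily the exact elicitation used in the study:

```python
def beta_prior_from_mle(alpha_hat, se):
    """Choose Beta(a, b) hyperparameters whose mean and variance match a
    maximum likelihood estimate alpha_hat and its standard error se
    (a moment-matching sketch; requires se^2 < alpha_hat(1 - alpha_hat))."""
    v = se**2
    if v >= alpha_hat * (1.0 - alpha_hat):
        raise ValueError("variance too large for a Beta distribution")
    k = alpha_hat * (1.0 - alpha_hat) / v - 1.0
    return alpha_hat * k, (1.0 - alpha_hat) * k   # (a, b)
```

For example, an estimate \({\hat{\alpha }} = 0.3\) with standard error 0.05 gives a Beta prior with mean 0.3 and standard deviation close to 0.05, concentrating the prior around the information in \({\mathbf{y}}^{\dagger \ell }\).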

Using informative priors based on \({\mathbf{y}}^{\dagger \ell }\) usually resulted in no obvious change in the accuracy of the return level exceedance probabilities obtained using our four Bayesian estimates of return levels, relative to those obtained under an assumption of objective priors (although occasionally noticeable reductions in bias were observed under the informative prior specification). However, as expected, estimates were appreciably more precise, with much smaller values of RMSE/MLE for estimates obtained using all four of our Bayesian posterior summaries (particularly so for those based on the predictive return level). As examples, when \(\alpha =0.5\) using the logistic model for series with asymptotic dependence, we see from Table 2 that: (1) the bias, RMSE and MLE (\(\times 100\)) for \(r^{-1}=1/1000\) are 0.379, 1.871 and 0.018, respectively, for estimates based on the predictive return level—assuming informative priors based on \({{\mathbf{y}}}^{\dagger }\) gives corresponding values of 0.371, 1.342 and 0.009; (2) the bias, RMSE and MLE (\(\times 100\)) for \(r^{-1}=1/10{,}000\) are 0.134, 1.049 and 0.006, respectively, for estimates based on the return level posterior mean—assuming informative priors based on \({\mathbf{y}}^{\dagger \ell }\) gives corresponding values of 0.037, 0.447 and 0.001.

To investigate the effects of a mis-chosen prior, for some arms of the study (‘strong’ dependence only—i.e. \(\alpha =0.3\) and \(A=0.9\) for the logistic model and \({\textit{AR}}(1)\) processes respectively) we also allow \(\varvec{\mu }\), \(\varvec{\varSigma }\), a and b to be informed by a maximum likelihood analysis of simulated chains \({{\mathbf{y}}}^{\ddagger \ell }_{i}\), \(i=1, \ldots , 5\); unlike \({{\mathbf{y}}}^{\dagger \ell }\), these chains have means/variances and temporal dependencies which are (increasingly) dissimilar to those in \({{\mathbf{y}}}^{\ell }\). Specifically, we set:

$$\begin{aligned}&\bar{{\mathbf{y}}}^{\ddagger \ell }_{i} = (0.5i+1)\times \bar{{\mathbf{y}}}^{\ell } \quad {\mathrm{and}}\\&{\mathrm{s.d.}}\,({\mathbf{y}}^{\ddagger \ell }_{i}) = (-\,0.2i+1.1)\times {\mathrm{s.d.}}\,({\mathbf{y}}^{\ell }) \end{aligned}$$

and, depending on whether the simulated chains exhibit asymptotic dependence or asymptotic independence,

$$\begin{aligned}&\alpha _{{\mathbf{y}}^{\ddagger \ell }_{i}} = 0.1i+0.4 \quad \mathrm{or}\\&A_{{\mathbf{y}}^{\ddagger \ell }_{i}} = -0.1i+0.8 \end{aligned}$$

respectively, meaning that the series \({\mathbf{y}}_{5}^{\ddagger \ell }\) is the most dissimilar to \({\mathbf{y}}^{\ell }\) (hence leading to the most ill-informed prior specification). Informative priors based on \({\mathbf{y}}^{\ddagger \ell }\) resulted in some increases in the estimated bias, RMSE and MLE of our return level exceedance probabilities, relative to those based on \({\mathbf{y}}^{\dagger \ell }\), especially when using the most dissimilar series on which to base our prior specifications (\({\mathbf{y}}^{\ddagger \ell }_{4}\) and \({\mathbf{y}}^{\ddagger \ell }_{5}\)). Here, biases in estimates of \(r^{-1}\) were notably larger than those using the objective priors or the informative priors based on \({\mathbf{y}}^{\dagger \ell }\) (but least so for estimates based on the predictive return level and for the larger return periods), and values of the RMSE/MLE were always larger than those using informative priors based on \({\mathbf{y}}^{\dagger \ell }\) (again, least so for estimates based on the predictive return level, especially for return periods \(r=1000\) and \(r=10{,}000\)). Informative priors based on the least dissimilar series (\({\mathbf{y}}_{1}^{\ddagger \ell }\) and \({\mathbf{y}}_{2}^{\ddagger \ell }\)) yielded very similar results to those based on \({\mathbf{y}}^{\dagger \ell }\).
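The scalings above amount to a small deterministic schedule; a sketch (names ours) making the i-th mis-chosen target explicit:

```python
def mischosen_targets(i, ybar, sd, asymptotic_dependence=True):
    """Target mean, standard deviation and dependence parameter for the
    deliberately dissimilar series indexed by i = 1, ..., 5; i = 5 gives
    the most ill-informed prior specification."""
    mean_i = (0.5 * i + 1.0) * ybar
    sd_i = (-0.2 * i + 1.1) * sd
    # logistic alpha (asymptotic dependence) or AR(1) autocorrelation A
    dep_i = 0.1 * i + 0.4 if asymptotic_dependence else -0.1 * i + 0.8
    return mean_i, sd_i, dep_i
```

At i = 5 the targets are a mean 3.5 times that of \({\mathbf{y}}^{\ell }\), a standard deviation one tenth of it, and \(\alpha = 0.9\) (or A = 0.3), the weakest dependence in the schedule.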

3.3.5 Comparisons with POT

In this part of the simulation study we investigate the performance of our four methods for estimating the return level exceedance probability \(r^{-1}\) when a standard declustering scheme is employed to filter out a set of independent threshold excesses. Under a POT procedure, a cluster of extremes over some high threshold u is deemed to have terminated once at least \(\kappa \) consecutive sub-threshold observations have been made; from each cluster identified in this way the maximum is then carried forward into the analysis, the GP distribution being used as a model for the set of cluster peak excesses. Although in practice this is a commonly-used procedure to circumvent the problems of temporal dependence, as Fawcett and Walshaw (2012, 2016) discuss, not only is it wasteful of data (often leading to infeasibly wide credible intervals for quantities such as return levels) but parameter and return level estimates can be extremely sensitive to the choice of \(\kappa \). Thus, we do not recommend a POT analysis at all, and we favour an approach as detailed in Sect. 2.2.2 of this paper and used so far in this simulation study. However, we include some results based on declustered data here for information and comparison purposes.
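The declustering scheme described above (often called runs declustering) is straightforward to sketch; this is our own minimal implementation:

```python
import numpy as np

def runs_decluster(y, u, kappa):
    """Runs declustering: a cluster of exceedances of u terminates once
    kappa consecutive sub-threshold values are observed; return the
    cluster maxima (the peaks carried forward into a POT analysis)."""
    peaks, current, gap = [], None, kappa
    for v in y:
        if v > u:
            current = v if current is None else max(current, v)
            gap = 0
        else:
            gap += 1
            if current is not None and gap >= kappa:
                peaks.append(current)
                current = None
    if current is not None:        # close any open cluster at the series end
        peaks.append(current)
    return np.array(peaks)
```

Sensitivity to \(\kappa \) is then easy to probe by re-running with different termination intervals, since a larger \(\kappa \) merges neighbouring clusters and discards more data.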

Figure 6 (bottom row) shows sampling distribution means for our estimates of \(r^{-1}\) across a range of return periods r, having declustered our simulated series \({\mathbf{y}}^{\ell }\) at each iteration \(\ell = 1, \ldots , 1000\) using \(\kappa =5\) and \(\kappa =20\). As before, in separate arms of the study we simulate series exhibiting asymptotic dependence and asymptotic independence. However, since the aim is to eliminate dependence between extremes, we assume the extremal index \(\theta \approx 1\) for our cluster peak excesses, and we bypass the stage in our analysis where we estimate the dependence parameter(s). Thus, the aim here is to compare results based on declustered data to those from Sects. 3.3.1 and 3.3.2, in which all threshold excesses were used and the dependence structure estimated; we can also investigate the sensitivity of our estimators of \(r^{-1}\) to the choice of declustering interval \(\kappa \). Regardless of the declustering interval used, the predictive return level consistently yields exceedance probabilities closer to the intended \(r^{-1}\) across the full range of return periods considered, with both the posterior means and modes resulting in relatively over-estimated exceedance probabilities. When declustering, all posterior summaries yield exceedance probabilities that are more biased than those obtained having pressed all extremes into use; see the top row of Fig. 5 for a comparison.

Fig. 6

Sampling distribution means for estimates of the return level exceedance probabilities \(r^{-1}\), using (1) direct summaries from the return level posterior distribution (blue; solid \(=\) posterior mean, dashed \(=\) posterior mode, dot-dashed \(=\) posterior 95% credible upper bound), and (2) the posterior predictive return level (red). Simulated data constructed with: asymptotic independence according to an \({\textit{AR}}(1)\) process with lag 1 autocorrelation A, but when fitting, asymptotic dependence assumed according to a bivariate logistic model (top row); asymptotic dependence according to a bivariate logistic model with dependence \(\alpha \), but dependence filtered using runs declustering with cluster termination interval \(\kappa =5\) and \(\kappa =20\) (bottom row; results using \(\kappa =20\) giving the higher curve each time). Marginally, \(\xi =-0.4\) and \(u=u_{0.95}\)

3.3.6 Marginal domain of attraction assumption

So far, our simulated chains have always been drawn from a GP distribution marginally, which is the limiting distribution for excesses over a high threshold. In practice, our threshold excesses will in fact arise from a distribution in one of the domains of attraction (DoA) of the GP distribution; see for example, Coles (2001, Ch. 3). Thus, for both asymptotically dependent and independent extremes we also simulate chains \({\mathbf{y}}^{\ell }\) with Weibull, Fréchet and Uniform margins (representing, respectively, models from the Gumbel, Fréchet and Weibull DoA). Switching from GP margins to distributions in one of the DoA of the GP distribution did not seem to have any real effect on our estimators for \(r^{-1}\), relative to the results shown in Figs. 3, 4 and 5 and Table 2. Similarly, switching between the three DoA did not reveal anything over and above the differences we observed when changing the value of the shape parameter under a GP marginal assumption (see the “Other arms of the study: main findings” discussions in Sects. 3.3.1, 3.3.2). For example, after allowing for what we might reasonably expect from sampling variability, the results shown in Table 2 were in line with analogous results using chains that had been marginally transformed to Uniform (Table 2 shows results for \(\xi =-0.4\), giving Weibull-type tails with a finite upper endpoint).
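Such a marginal switch can be achieved via the probability integral transform, which changes the margins while leaving the dependence (copula) structure of the chain untouched. A minimal Python sketch; the particular Fréchet and Weibull forms below are illustrative choices of ours, since the exact distributions used in each arm are a matter of study design.

```python
import numpy as np

def gp_cdf(x, sigma, xi):
    """CDF of the generalised Pareto distribution (threshold 0)."""
    if abs(xi) < 1e-9:                   # exponential limit as xi -> 0
        return 1.0 - np.exp(-x / sigma)
    return 1.0 - np.maximum(1.0 + xi * x / sigma, 0.0) ** (-1.0 / xi)

def transform_margins(x, sigma, xi, target="uniform"):
    """Map GP-margined values to a new marginal distribution via the
    probability integral transform; the copula is unchanged."""
    u = gp_cdf(x, sigma, xi)
    if target == "uniform":
        return u
    if target == "frechet":              # unit Frechet quantile function
        return -1.0 / np.log(u)
    if target == "weibull":              # e.g. Weibull(shape=2, scale=1)
        return (-np.log(1.0 - u)) ** 0.5
    raise ValueError(target)
```

Because the transform is monotone, threshold exceedances of the original chain map to threshold exceedances of the transformed chain, so the temporal dependence structure of the extremes is preserved.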

3.3.7 Marginal structure: general remarks

The results reported in Table 2 and Figs. 3, 4, 5 and 6 compare our four return level summaries for simulated chains with relatively short, bounded tails (the GP marginals here have \(\xi =-0.4\)). As we discuss throughout Sect. 3.3, similar findings were obtained across most other parameters in our study design. However, comparisons in some arms of the simulation study were clearly being influenced by the marginal shape parameter \(\xi \). As might be expected, for much heavier-tailed margins the resulting posterior distribution for \(z_{r}\) was substantially more right-skewed, resulting in larger biases for estimates of \(r^{-1}\) based on the return level posterior mean and the return level 95% credible upper bound. For these arms of the study, estimates of \(r^{-1}\) based on the posterior mode (\(p_{{\dot{z}}_{r}}\)) out-performed the other estimative summaries, although estimates based on the predictive return level were generally less biased still, and had the smallest error. Generally, as the value of \(\xi \) increases the performance of \(p_{{\bar{z}}_{r}}\) and \(p_{z_{r, {\textsf {upper}}}}\) deteriorates in terms of estimated bias, RMSE and MLE, but \(p_{z_{r, {\textsf {pred}}}}\) retains the accuracy and precision observed in Table 2 and Figs. 3, 4, 5 and 6. Comparisons between our estimators do not appear to be sensitive to the scale of the underlying GP distribution. Although the GP marginal scale \(\sigma \) is held constant at unity, as discussed in Sect. 3.2 excesses over u have a threshold- and shape-dependent scale \(\sigma ^{*}\). For arms of the study in which \(\xi \) was constant but \(\sigma ^{*}\) varied, we did not see any real departure from the general findings reported in Table 2 and Figs. 3, 4, 5 and 6. 
In short: comparisons between our estimators are more sensitive to shape than to scale of the underlying GP distribution, but estimates of \(r^{-1}\) produced by the predictive return level seem relatively robust to changes in scale and shape.
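The threshold- and shape-dependent scale \(\sigma ^{*}\) referred to above follows from the standard threshold-stability property of the GP distribution (see e.g. Coles 2001, Ch. 4): if excesses of a threshold \(u_{0}\) follow a GP\((\sigma , \xi )\) distribution, then excesses of a higher threshold \(u > u_{0}\) are GP with the same shape and scale

$$\begin{aligned} \sigma ^{*} = \sigma + \xi (u - u_{0}), \end{aligned}$$

so that, for \(\xi \ne 0\), moving the threshold changes the effective scale of the excesses even when \(\sigma \) itself is held fixed; we note this here as the natural reading of the \(\sigma ^{*}\) of Sect. 3.2, which defines it precisely.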

4 Conclusions

4.1 General summary

In this paper we have discussed the merits of a Bayesian approach to inference on environmental extremes, and the natural extension to prediction such an inferential framework offers. In our experience, practitioners often find the standard reporting of return level estimates—a point estimate with some measure of uncertainty (e.g. a maximum likelihood estimate with standard error/95% confidence interval, or, within a Bayesian setting, the posterior mean and standard deviation/95% credible interval)—difficult to work with in practice. Certainly, as we discuss in this paper, standard approaches such as POT analyses can yield estimates of return levels with extremely and unrealistically wide confidence/credible intervals, sometimes giving bounds that lie beyond the physical constraints of the variable being studied. Although Bayesian credible intervals have a more intuitive interpretation than frequentist confidence intervals (i.e. providing the stated probability coverage), our experience suggests that practitioners would prefer to work with a single point summary in which estimation uncertainty has been properly accounted for. For this reason, Fawcett and Walshaw (2016) recommend the posterior predictive return level estimate as the most appropriate posterior summary to feed back to practitioners.

We build on earlier work presented in Fawcett and Walshaw (2016) in which an estimation strategy that attempts to maximise precision is outlined. Our recommended approach is to model all excesses over a threshold with the GP distribution, accounting for temporal dependence through estimation of the extremal index. Where extremes vary seasonally, we recommend a piecewise seasonal approach to modelling (where appropriate), pressing threshold excesses from all seasons into use; other features, such as trends, can be simply captured through linear modelling of the GP scale parameter. We advocate a Bayesian approach to analysis, in which precision can be further increased through the specification of informative prior distributions for the GP parameters and from which predictive inference is neatly handled.
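To make the trend recommendation concrete, the corresponding likelihood is easy to write down. The sketch below is a minimal version of our own, using a log link for the scale (an illustrative choice to keep \(\sigma _{t}>0\); the trend could equally be modelled linearly on the natural scale, as the precise parameterisation is an implementation detail).

```python
import numpy as np

def gp_nll_trend(params, excesses, t):
    """Negative log-likelihood for GP threshold excesses with a trend
    in scale: log sigma_t = beta0 + beta1 * t, common shape xi."""
    beta0, beta1, xi = params
    sigma = np.exp(beta0 + beta1 * t)
    z = 1.0 + xi * excesses / sigma
    if np.any(z <= 0):                  # outside the GP support
        return np.inf
    if abs(xi) < 1e-9:                  # exponential limit as xi -> 0
        return np.sum(np.log(sigma) + excesses / sigma)
    return np.sum(np.log(sigma) + (1.0 + 1.0 / xi) * np.log(z))
```

Within a Bayesian analysis this likelihood is simply combined with priors on \((\beta _{0}, \beta _{1}, \xi )\), so the trend parameters are handled in the same MCMC scheme as the rest of the model.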

The main contribution of our work in this paper is to assess the performance of the posterior predictive return level relative to what we refer to as estimative return levels—standard point estimates taken directly from the return level posterior distribution. We do this through a large scale simulation study, in which data with various dependence structures, and tail behaviours, are simulated. We compare posterior predictive inferences for return levels to their estimative counterparts within the recommended modelling framework in Fawcett and Walshaw (2016), in which all excesses are modelled, but also within a more commonly-adopted POT modelling procedure. For a range of return periods r, on a fine scale, and across a range of temporal dependencies in the simulated data, we compare exceedance probabilities for return level summaries—specifically, the return level posterior mean, posterior mode, and posterior 95% credible upper bound—to those obtained from the posterior predictive return level, and to their expected values \(r^{-1}\). Our general findings are that, for most commonly-observed levels of temporal dependence and for both asymptotically dependent and independent extremes, the posterior predictive return level has exceedance probabilities much more in line with what we would expect to see than do the standard estimative posterior summaries (e.g. posterior mean/mode). In our simulation study, the posterior predictive return level also yields estimates of exceedance probabilities with much higher precision than the corresponding exceedance probabilities obtained from the estimative summaries. We believe the findings presented throughout Sect. 3 of this paper lend firm justification for the adoption of the posterior predictive return level as the best return level summary for practitioners, whether the modelling framework of Fawcett and Walshaw (2016) is adopted or a simple POT analysis is used. 
Further, if all excesses are used as in Fawcett and Walshaw (2016), but an incorrect assumption regarding the dependence structure is made, the posterior predictive return level still yields exceedance probabilities more in line with what we would expect, compared to the other estimative summaries. The superiority of the predictive return level also seems to hold under informative prior specification/mis-specification, and across different marginal assumptions.
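The scoring underlying these comparisons is simple to reproduce: under a known GP tail, the exceedance probability of any candidate return level summary is available in closed form. The sketch below (function names our own; a simplified stand-in that ignores the extremal index and any seasonal structure) evaluates the per-observation exceedance probability, whose annual analogue is approximately \(n_{y}\) times larger, i.e. close to \(r^{-1}\) for large r.

```python
import numpy as np

def gp_exceed_prob(z, u, sigma, xi, lam):
    """Per-observation P(X > z) under a GP tail above u,
    where lam = P(X > u); valid for z >= u."""
    t = np.maximum(1.0 + xi * (z - u) / sigma, 0.0)
    return lam * t ** (-1.0 / xi)

def true_return_level(r, n_y, u, sigma, xi, lam):
    """The level exceeded with per-observation probability 1/(r * n_y),
    i.e. roughly once every r years with n_y observations per year."""
    p = 1.0 / (r * n_y)
    return u + (sigma / xi) * ((p / lam) ** (-xi) - 1.0)
```

A return level summary \({\hat{z}}_{r}\) is then scored by evaluating `gp_exceed_prob` at \({\hat{z}}_{r}\) under the true simulation parameters and comparing the result with its target.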

4.2 Further thoughts

One of the practical advantages of the posterior predictive return level, as we discuss throughout this paper, is the incorporation of estimation uncertainty into a single point estimate, perhaps to be used to aid structural design. Although the results of our simulation study in Sect. 3.3 go some way to indicate the superiority of the predictive return level relative to more standard point summaries from the return level posterior distribution, it might be useful for such point estimates to take account of the consequences of error. Indeed, in a machine learning or Bayesian decision theoretic context (e.g. Berger 2010), the aim is to choose the decision function \(\delta ({\varvec{x}})\) which minimises the a posteriori expected loss for some model parameter \(\psi \):

$$\begin{aligned} \int _{\varPsi }L(\psi , \delta ({\varvec{x}})) \pi (\psi |{\varvec{x}})d\psi . \end{aligned}$$
(13)

In Eq. (13), L represents a loss function: \(L(\psi , \delta ({\varvec{x}})) = (\psi -\delta ({\varvec{x}}))^{2}\) gives squared errors, although as we discuss in Sect. 3.3.1, for estimates of \(r^{-1}\) we might rather use linex errors, since over- and under-estimation might not be equally serious. In a predictive setting, it is necessary to have a predictive version of Eq. (13). Conditioning on the observed (\({\varvec{x}}\)) and averaging over the unknowns (e.g. parameter(s) \(\psi \) and future observations \({\varvec{y}}\)) gives

$$\begin{aligned} \int _{Y}L({\varvec{y}}, \delta ({\varvec{x}}))f_{Y}({\varvec{y}}|{\varvec{x}})d{\varvec{y}}, \end{aligned}$$
(14)

where \(f_{Y}({\varvec{y}}|{\varvec{x}})\) is the posterior predictive density for \({\varvec{y}}\). The optimal decision \(\delta ({\varvec{x}})\) can then be seen as the action that minimises this predictive a posteriori expected loss. From an inference point-of-view, \(\delta ({\varvec{x}})\) is a function whose output \({\hat{\psi }}\) is an estimate of \(\psi \).

Obviously, this sort of approach for formally taking account of the consequences of error in our estimators for \(r^{-1}\) will be highly sensitive to the choice of loss function. For example, it can be shown that the posterior mean minimises Eq. (13) when L returns squared errors, and the posterior median when L returns absolute errors. Although we outline a rationale for using linex errors for our problem, to penalise over-estimation of \(r^{-1}\) more heavily than under-estimation, we feel that more work is needed to determine the suitability of linex errors here, and more generally a linex loss function for use in minimising Eqs. (13) and (14).
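Given MCMC samples from the posterior of \(\psi \), the minimiser of Eq. (13) has a simple Monte Carlo form under each of these losses; in particular, for linex loss \(L(\psi , d) = \exp \{a(d-\psi )\} - a(d-\psi ) - 1\) the Bayes estimator is the standard Zellner-type form \(-(1/a)\log E[\exp (-a\psi )\,|\,{\varvec{x}}]\). A sketch (function name hypothetical):

```python
import numpy as np

def bayes_estimate(samples, loss="squared", a=1.0):
    """Monte Carlo Bayes estimators from posterior samples:
    squared-error loss  -> posterior mean,
    absolute-error loss -> posterior median,
    linex loss L(psi, d) = exp(a(d - psi)) - a(d - psi) - 1
                        -> d = -(1/a) * log E[exp(-a * psi)]."""
    s = np.asarray(samples)
    if loss == "squared":
        return s.mean()
    if loss == "absolute":
        return np.median(s)
    if loss == "linex":
        return -np.log(np.mean(np.exp(-a * s))) / a
    raise ValueError(loss)
```

With \(a > 0\) the linex estimate lies below the posterior mean (by Jensen's inequality), reflecting the heavier penalty on over-estimation; the size of a controls how asymmetric that penalty is, which is precisely the tuning question left open above.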

The contribution of parameter uncertainty to the predictive return level can be estimated by comparing \({\hat{z}}_{r, {\textsf {pred}}}\) to what we call naïve return level estimates. Figure 7 shows, for one arm of our simulation study, sampling distribution means for the predictive return level alongside sampling distribution means for this naïve estimator. Here, at each replication in the simulation study, rather than account for parameter uncertainty via Eq. (12), we fix each of our marginal and dependence parameters at its posterior mean; we substitute these means directly into Eq. (8) (of course, we could fix the model parameters at some other posterior summary, or indeed their likelihood modes). Thus, the difference between the solid and dashed lines in the plots in Fig. 7 can be seen as the average contribution to \(z_{r, {\textsf {pred}}}\) of the implicit allowance for uncertainty in parameter estimation. The results are shown for asymptotically dependent chains simulated according to a bivariate logistic model for consecutive pairs in the series, for three strengths of dependence; however, discrepancies of similar magnitude were observed for other arms of the study (although for heavier-tailed chains the naïve estimator was more sensitive to the choice of posterior summary used to fix the model parameters). In a real data context, plots of \({\bar{z}}_{r}\) against \({\hat{z}}_{r, {\textsf {pred}}}\) can be used to reveal such contributions to the predictive return level.
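The two calculations can be contrasted directly given posterior samples of the GP parameters. The sketch below is a schematic stand-in for Eqs. (8) and (12) (again ignoring the extremal index and dependence parameters): the predictive version solves, by bisection, for the level whose exceedance probability averaged over the posterior sample equals the target, while the naïve version plugs posterior means into the closed-form return level.

```python
import numpy as np

def exceed_prob(z, u, sigma, xi, lam):
    """P(X > z) for a GP tail with rate lam = P(X > u); the small
    clamp avoids 0 ** (negative power) for positive-shape samples."""
    t = np.maximum(1.0 + xi * (z - u) / sigma, 1e-12)
    return lam * t ** (-1.0 / xi)

def predictive_return_level(p, u, sig_samp, xi_samp, lam, lo, hi):
    """Solve mean_i P(X > z | sigma_i, xi_i) = p by bisection on [lo, hi]:
    the exceedance probability is averaged over the posterior sample."""
    f = lambda z: np.mean(exceed_prob(z, u, sig_samp, xi_samp, lam)) - p
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:          # still exceeding too often: move up
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def naive_return_level(p, u, sig_samp, xi_samp, lam):
    """Plug posterior means straight into the GP return level formula."""
    s, x = np.mean(sig_samp), np.mean(xi_samp)
    return u + (s / x) * ((p / lam) ** (-x) - 1.0)
```

When the posterior is degenerate the two estimators coincide; any gap between them is therefore attributable to the spread of the posterior sample, which is the quantity Fig. 7 summarises.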

Fig. 7

Sampling distribution means for predictive return levels \(z_{r, {\textsf {pred}}}\) (solid lines) and “naïve” return levels (dashed lines). Simulated data are constructed with asymptotic dependence according to the bivariate logistic model with dependence \(\alpha \). Marginally, \(\xi =-0.4\) and \(u=u_{0.95}\)