Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing

Statistical methods are proposed to select homogeneous locations when analyzing spatial block maxima data, such as in extreme event attribution studies. The methods are based on classical hypothesis testing using Wald-type test statistics, with critical values obtained from suitable parametric bootstrap procedures and corrected for multiplicity. A large-scale Monte Carlo simulation study finds that the methods are able to accurately identify homogeneous locations, and that pooling the selected locations improves the accuracy of subsequent statistical analyses. The approach is illustrated with a case study on precipitation extremes in Western Europe. The methods are implemented in an R package that allows easy application in future extreme event attribution studies.


Introduction
Extreme event attribution studies on precipitation extremes are typically motivated by the occurrence of an extreme event that causes major impacts such as damage to infrastructure and agriculture, or even fatalities; see, for instance, van der Wiel et al. (2017); van Oldenborgh et al. (2017); Otto et al. (2018); Kreienkamp et al. (2021). A key task for attributing the event to anthropogenic climate change consists of a statistical analysis of available observational data products at the location of interest.

Section 2 presents the mathematical concept of the proposed methods, starting with a detailed description of the underlying model assumptions and a strategy for the detection of a possible pooling region in Section 2.1. It is recommended to all readers. In Sections 2.2 and 2.3, mathematical details on the applied estimators and test statistics are given; they may be skipped by non-expert readers. Next, the ideas of the bootstrap procedures that allow one to draw samples from the distribution under the null hypothesis are explained. Again, this part may be skipped by non-statisticians. Section 2.5 goes into detail about the detection strategy for possible pooling regions and the treatment of the related multiple testing problem. This section is of interest to all readers who want to apply the methods, since it provides details on how to process a set of p-values obtained from testing multiple hypotheses. Next, Section 3 gives the results of the simulation study that was performed to evaluate the performance of the proposed methods. These results are of interest to all readers, and they serve as a basis for the case study conducted in Section 4.
Section 5 then discusses several extensions of the proposed methods. In Section 5.1, we provide a method for estimating region-wise return periods once a pooling region has been chosen. Here, a region-wise return period of a given event is defined as the number of years that one has to wait on average until an event of the same or even greater magnitude is observed anywhere in the pooling region. Extensions to different model assumptions that suit, e.g., other variables such as temperature are discussed in Section 5 as well. Last but not least, we come to a conclusion in Section 6. Some mathematical details and further illustrations on the simulation study and the case study are postponed to a supplement.

Assessing spatial homogeneities for precipitation extremes

2.1 A homogeneous model for precipitation extremes
The observational data of interest consist of annual or seasonal maximal precipitation amounts (over some fixed time duration, e.g., a day) collected over various years and at various locations (in practice, each location may correspond to a spatial region; we separate these two terms from the outset to avoid misunderstandings: subsequently, a region shall be a set of locations). More precisely, we denote by $m_d^{(t)}$ the observed maximal precipitation amount in season t and at location d, with t = 1, . . . , n and d = 1, . . . , D. The location of primary interest shall be the one with index d = 1. Note that the choice of d = 1 is made for illustrative purposes only and can be replaced by any index d ∈ {1, . . . , D}.
In view of the stochastic nature, we assume that $m_d^{(t)}$ is an observed value of some random variable $M_d^{(t)}$. Since $M_d^{(t)}$ is generated by a maxima operation, standard extreme value theory (Coles, 2001) suggests to assume that $M_d^{(t)}$ follows the generalized extreme value (GEV) distribution, i.e., $M_d^{(t)} \sim \mathrm{GEV}(\mu_d(t), \sigma_d(t), \gamma_d(t))$ for some $\mu_d(t), \sigma_d(t) > 0$, $\gamma_d(t) \in \mathbb{R}$, where the GEV(µ, σ, γ) distribution with location parameter µ > 0, scale parameter σ > 0 and shape parameter γ ∈ R is defined by its cumulative distribution function
$$G_{\mu,\sigma,\gamma}(x) = \exp\Big(-\big(1 + \gamma \tfrac{x-\mu}{\sigma}\big)^{-1/\gamma}\Big), \qquad (1)$$
for x such that $1 + \gamma \frac{x-\mu}{\sigma} > 0$ (with the case γ = 0 interpreted as the limit $\exp(-e^{-(x-\mu)/\sigma})$). Due to climate change, the temporal dynamics at location d, which are primarily driven by the function $t \mapsto (\mu_d(t), \sigma_d(t), \gamma_d(t))$, are typically non-constant. Any proxy for climate change qualifies as a suitable temporal covariate, and a standard assumption in extreme event attribution studies, motivated by the Clausius-Clapeyron relation, postulates that
$$\mu_d(t) = \mu_d \exp\Big(\frac{\alpha_d\, \mathrm{GMST}(t)}{\mu_d}\Big), \quad \sigma_d(t) = \sigma_d \exp\Big(\frac{\alpha_d\, \mathrm{GMST}(t)}{\mu_d}\Big), \quad \gamma_d(t) = \gamma_d \qquad (2)$$
for certain parameters $\alpha_d, \gamma_d \in \mathbb{R}$, $\mu_d, \sigma_d > 0$. Here, GMST(t) denotes the smoothed global mean surface temperature anomaly, see Philip et al. (2020). Note that (2) implies $\mu_d(t)/\sigma_d(t) = \mu_d/\sigma_d$ for all t, hence the model may be identified as a temporal scaling model. It is further assumed that any temporal dependence at location d is completely due to GMST(t), which we treat as deterministic and which implies that $M_d^{(1)}, \ldots, M_d^{(n)}$ are stochastically independent, for each d = 1, . . . , D. For the moment, the spatial dependence will be left unspecified.
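As a concrete illustration of Equation (1) and the scaling model (2), the following Python sketch evaluates the GEV cdf and the time-shifted parameters. The paper's own implementation is an R package, so this is only a minimal stand-alone sketch; the parameter values are hypothetical, and note that scipy parameterises the shape as c = −γ.

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, mu, sigma, gamma):
    """GEV cdf from Equation (1); scipy's shape convention is c = -gamma."""
    return genextreme.cdf(x, c=-gamma, loc=mu, scale=sigma)

def scale_gev_params(mu, sigma, alpha, gmst_t):
    """Time-varying parameters of the scaling model (2): both mu and
    sigma are multiplied by exp(alpha * GMST(t) / mu)."""
    factor = np.exp(alpha * gmst_t / mu)
    return mu * factor, sigma * factor

# hypothetical parameter values; the ratio mu(t) / sigma(t) does not
# depend on t, which is why the model is called a scaling model
mu_t, sigma_t = scale_gev_params(mu=20.0, sigma=5.0, alpha=1.5, gmst_t=1.0)
```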
Recall that the location of interest is the one with d = 1, which is characterised by the four parameters $\mu_1, \sigma_1, \gamma_1, \alpha_1$. As described before, estimating those parameters based on the observations from location d = 1 only may be unpleasantly inaccurate, which is why one commonly assumes that the D locations have been carefully selected by experts to meet the following space-time homogeneity assumption:
$$H_0^{\mathrm{ED}}: \ \exists\, \vartheta \in \Theta \ \text{ such that } \ \vartheta_1 = \cdots = \vartheta_D = \vartheta, \qquad (3)$$
where $\Theta := (0, \infty)^2 \times \mathbb{R}^2$ and $\vartheta = (\mu, \sigma, \gamma, \alpha)^\top$, $\vartheta_d = (\mu_d, \sigma_d, \gamma_d, \alpha_d)^\top$, and where the upper index ED stands for 'equal distribution', since, in short, Equation (3) states that the location-wise GEV parameters coincide for the D locations.
In the subsequent sections, we aim at testing the validity of the expert's hypothesis $H_0^{\mathrm{ED}}$. Here, it is not only of interest to test the hypothesis for the whole set {1, . . . , D}, but also to find a (maximal) subset A ⊂ {1, . . . , D} with 1 ∈ A and |A| = k ≥ 2 on which the space-time homogeneity assumption holds. Here, for an arbitrary index set A, the latter assumption may be expressed through
$$H_0^{\mathrm{ED}}(A): \ \exists\, \vartheta_A \in \Theta \ \text{ such that } \ \vartheta_d = \vartheta_A \ \text{ for all } d \in A, \qquad (4)$$
with Θ as in Equation (3) and $\vartheta_A = (\mu_A, \sigma_A, \gamma_A, \alpha_A)^\top$, meaning that the location-wise GEV parameters coincide for all locations with index in the set A, making the respective locations a possible pooling region. Now, a maximal subset A for which Equation (4) holds may be determined with the following strategy: since we are interested in finding all locations that 'match' the location of primary interest with index d = 1, we test for each pair $A_d = \{1, d\}$, d = 2, . . . , D, whether the null hypothesis $H_0^{\mathrm{ED}}(A_d)$ holds. This provides us with a set of p-values based on which we can decide which locations to reject and which not to reject. Those locations that are not rejected can then be assumed to be sufficiently homogeneous and are thus included in the suggestion of a pooling region of maximal extent. For further details on this strategy and the impact of the induced multiple testing problem, see Section 2.5.

Wald-type test statistics
We define test statistics which allow to test for the sub-hypotheses $H_0^{\mathrm{ED}}(A)$ of $H_0^{\mathrm{ED}}$ from Equation (4), where A ⊂ {1, . . . , D}. For that purpose, we propose to use classical Wald-type test statistics; see Section 14.4.2 in Lehmann and Romano (2021) for a general discussion and Lilienthal et al. (2022) for a similar approach in temporally stationary GEV models, i.e., with $\alpha_d$ fixed to $\alpha_d = 0$. Write $A = \{d_1, \ldots, d_k\}$ with $1 \le d_1 < \cdots < d_k \le D$ and let $h_A: \mathbb{R}^{4D} \to \mathbb{R}^{4(k-1)}$ be defined by
$$h_A(\vartheta_1, \ldots, \vartheta_D) = \big(\vartheta_{d_2} - \vartheta_{d_1}, \ldots, \vartheta_{d_k} - \vartheta_{d_1}\big)^\top.$$
We may then write $H_0^{\mathrm{ED}}(A)$ equivalently as $h_A(\vartheta_1, \ldots, \vartheta_D) = 0$. Hence, significant deviations of $h_A(\hat\vartheta)$ from 0, with $\hat\vartheta$ from Section 2.2, provide evidence against $H_0^{\mathrm{ED}}(A)$. Such deviations may be measured by the Wald-type test statistic
$$T_n(A) = n\, h_A(\hat\vartheta)^\top \big(H_A \hat\Sigma H_A^\top\big)^{-1} h_A(\hat\vartheta),$$
where $\hat\Sigma$ denotes an estimator of the asymptotic covariance matrix of $\hat\vartheta$ and where $H_A = \dot h_A(\vartheta) \in \mathbb{R}^{4(k-1) \times 4D}$ denotes the Jacobian matrix of $\vartheta \mapsto h_A(\vartheta)$, which is a matrix with entries in {−1, 0, 1} that does not depend on ϑ. In view of the asymptotic normality of $\hat\vartheta$, see Section 2.2, the asymptotic distribution of $T_n(A)$ under the null hypothesis $H_0^{\mathrm{ED}}(A)$ is the chi-square distribution $\chi^2_{4(k-1)}$ with 4(k−1) degrees of freedom; see also Section 4 in Lilienthal et al. (2022). Hence, rejecting $H_0^{\mathrm{ED}}(A)$ if $T_n(A)$ exceeds the (1 − α)-quantile of the $\chi^2_{4(k-1)}$-distribution provides a statistical test of asymptotic level α ∈ (0, 1). The finite-sample performance of the related test in the stationary setting was found to be quite inaccurate (see Lilienthal et al., 2022). To overcome this issue, we propose a suitable bootstrap scheme in the next section.
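A generic sketch of the Wald-type statistic may look as follows (Python, illustrative only; the covariance estimate is taken as given here, whereas the actual method obtains it as described in Section 2.2):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, sigma_hat, n, A):
    """Wald-type statistic for H0: the parameter vectors of all locations
    in A coincide.

    theta_hat : (D, 4) array of location-wise scale-GEV estimates
    sigma_hat : (4D, 4D) estimated asymptotic covariance of the stacked
                estimator vec(theta_hat)
    A         : sorted list of 0-based location indices to compare
    """
    D, p = theta_hat.shape
    k = len(A)
    # H is the Jacobian of h_A: it stacks the differences
    # theta_d - theta_{d_1} for d in A[1:], so its entries are -1, 0, 1
    H = np.zeros((p * (k - 1), p * D))
    for j, d in enumerate(A[1:]):
        H[j * p:(j + 1) * p, A[0] * p:(A[0] + 1) * p] = -np.eye(p)
        H[j * p:(j + 1) * p, d * p:(d + 1) * p] = np.eye(p)
    h = H @ theta_hat.reshape(-1)
    T = n * h @ np.linalg.solve(H @ sigma_hat @ H.T, h)
    # chi-square limit with 4(k-1) degrees of freedom under H0
    p_value = chi2.sf(T, df=p * (k - 1))
    return T, p_value
```

For identical location-wise estimates the statistic is exactly zero, and it grows with the squared differences between the compared parameter vectors.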

Parametric bootstrap devices for deriving p-values
Throughout this section, we propose two bootstrap devices that allow to simulate approximate samples from the $H_0^{\mathrm{ED}}(A)$-distribution of the test statistic $T_n(A)$ from Equation (8). Based on a suitably large set of such samples, one can compute a reliable p-value for testing $H_0^{\mathrm{ED}}(A)$, even for small sample sizes. The first method is based on a global fit of a max-stable process model to the entire region under consideration, while the second one is based on fitting multiple pairwise models. The main difference between the two approaches is that the first one can test the hypothesis $H_0^{\mathrm{ED}}(A)$ for arbitrary subsets A ⊂ {1, . . . , D}, while the second approach is restricted to testing the null hypothesis on subsets of cardinality two, i.e., it can only test whether a pair of locations is homogeneous. Depending on the question that is asked, applying one or the other method may be advantageous.

Global bootstrap based on max-stable process models
The subsequent bootstrap device is a modification of the parametric bootstrap procedure described in Section 5.3 of Lilienthal et al. (2022). Fix some large number B, say B = 200; larger numbers are typically better, but going beyond B = 1000 is usually not worth the extra computational effort.
The basic idea is as follows: for each b = 1, . . . , B, a bootstrap sample is simulated under the null hypothesis and the test statistic is recomputed on that sample. Here, for simulating the bootstrap samples, we assume that the spatial dependence structure of the observed data can be sufficiently captured by a max-stable process model. Max-stable processes provide a natural choice here, since they are the only processes that can arise, after proper affine transformation, as the limit of maxima of independent and identically distributed random fields $\{Y_i(x): x \in \mathbb{R}^p\}$ (Coles, 2001, Section 9.3). Parametric models for max-stable processes are usually stated for unit Fréchet (i.e., GEV(1, 1, 1)) margins. Therefore, the first steps in our algorithm below aim at transforming the margins of our observed data to be approximately unit Fréchet.
More precisely, the parametric bootstrap algorithm is defined as follows: Algorithm 1 (Bootstrap based on max-stable processes).
(1)–(2) Estimate the location-wise scale-GEV parameters $\hat\vartheta_d$ and transform the observations to approximately unit Fréchet-distributed data, by letting $Y_d^{(t)} = -1/\log G_{\hat\mu_d(t), \hat\sigma_d(t), \hat\gamma_d}\big(M_d^{(t)}\big)$, with G from Equation (1). (3) Fit a set of candidate max-stable process models with standard Fréchet margins to the observations $(Y_1^{(t)}, \ldots, Y_D^{(t)})_{t=1,\ldots,n}$ and choose the best fit according to the composite likelihood information criterion (CLIC), which is a model selection criterion that is commonly applied when fitting max-stable process models. Throughout, we chose three candidate models, among them the Smith process (3 parameters) and the Brown-Resnick process (2 parameters).
For further details on max-stable processes, the mentioned models and the CLIC, see Davison et al. (2012) and Davison and Gholamrezaee (2012). Respective functions are implemented in the R package SpatialExtremes (Ribatet, 2022). (4) Note that until now we have not used the particular hypothesis $H_0^{\mathrm{ED}}(A)$; hence, the preceding steps only need to be carried out once, even when multiple hypotheses are of interest. (5) Assume that $H_0^{\mathrm{ED}}(A)$ from Equation (4) is true, and estimate the four-dimensional model parameter $\vartheta_A = (\mu_A, \sigma_A, \gamma_A, \alpha_A)^\top \in \Theta$ by (pseudo) maximum likelihood based on the pooled sample $(M_d^{(t)})_{t=1,\ldots,n,\, d \in A}$. Denote the resulting parameter vector as $\hat\vartheta_A = (\hat\mu_A, \hat\sigma_A, \hat\gamma_A, \hat\alpha_A)^\top$, and note that $\hat\vartheta_A$ should be close to $\hat\vartheta_d$ for each d ∈ A if the null hypothesis is true. In a classical test situation, one may now reject $H_0^{\mathrm{ED}}(A)$ for a fixed set A at significance level α ∈ (0, 1) if p(A) ≤ α. In the current pooling situation, we would need to apply the test to multiple pooling regions A, which hence constitutes a multiple testing problem where standard approaches yield inflated levels. We discuss possible remedies in Section 2.5.
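The final p-value computation from bootstrap replicates of the test statistic can be sketched generically (Python, illustrative; the '+1' correction is a common convention, not necessarily the exact formula used in the authors' R package):

```python
import numpy as np

def bootstrap_p_value(t_obs, t_boot):
    """Share of null replicates at least as extreme as the observed
    statistic, with the usual '+1' finite-sample correction."""
    t_boot = np.asarray(t_boot)
    return (1 + np.sum(t_boot >= t_obs)) / (len(t_boot) + 1)

# observed statistic 5.0 against B = 9 replicates simulated under H0
p = bootstrap_p_value(5.0, [1.0, 2.0, 6.0, 0.5, 7.0, 3.0, 1.5, 4.0, 2.5])
# two replicates (6.0 and 7.0) are >= 5.0, hence p = (1 + 2) / 10 = 0.3
```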

Pairwise bootstrap based on bivariate extreme value distributions
Recall that the location of primary interest is the one with index d = 1.
As stated in Section 2.1, it is of interest to test all bivariate hypotheses $H_0^{\mathrm{ED}}(\{1, d\})$ with d = 2, . . . , D. For that purpose, we may apply a modification of the bootstrap procedure from the previous section that makes use of bivariate extreme value models only. By doing so, we decrease the model risk implied by imposing a possibly restrictive global max-stable process model.
The modification only affects steps (3) and (4) from Algorithm 1. More precisely, for testing the hypothesis $H_0^{\mathrm{ED}}(A_d)$ with $A_d = \{1, d\}$ for some fixed value d = 2, . . . , D, we make the following modifications: Algorithm 2 (Pairwise bootstrap based on bivariate extreme value distributions). Perform steps (1) and (2) from Algorithm 1 with the set {1, . . . , D} replaced by $A_d$. (3a) Fit a set of bivariate extreme value distributions to the bivariate sample $(Y_1^{(t)}, Y_d^{(t)})_{t=1,\ldots,n}$, assuming the marginal distributions to be unit Fréchet. Choose the best fit according to the Akaike information criterion (AIC), a model selection criterion that rewards a good fit of a model and penalises the model's complexity at the same time (Akaike, 1973). Note that Algorithm 2 is computationally more expensive than Algorithm 1, since model selection, fitting of the dependence models and the subsequent simulation must be performed separately for each hypothesis $H_0^{\mathrm{ED}}(A_d)$ of interest.

Combining test statistics
As already addressed at the end of Section 2.1, it is not only of interest to test the global hypothesis $H_0^{\mathrm{ED}}$, since a possible rejection of $H_0^{\mathrm{ED}}$ gives no indication about which locations deviate from the one of primary interest. Instead, one might want to test hypotheses on several subsets and then pool those subsets for which no signal of heterogeneity was found. In this subsection, we provide the mathematical framework for testing sub-hypotheses and discuss how to deal with the induced multiple testing problem.
Mathematically, we propose to regard $H_0^{\mathrm{ED}}$ as a global hypothesis that is built up from elementary hypotheses of smaller dimension. A particularly useful decomposition is based on pairwise elementary hypotheses: recalling the notation $H_0^{\mathrm{ED}}(A)$ from Equation (4), we clearly have
$$H_0^{\mathrm{ED}} = \bigcap_{d=2}^{D} H_0^{\mathrm{ED}}(\{1, d\}),$$
i.e., $H_0^{\mathrm{ED}}$ holds globally when it holds locally for all pairs {1, d} with d ∈ {2, . . . , D}. We may now apply either Algorithm 1 or Algorithm 2 to obtain a p-value, say $p_d^{\mathrm{raw}} = p(\{1, d\})$, for testing $H_0^{\mathrm{ED}}(\{1, d\})$, for any d ∈ {2, . . . , D}. Each p-value may be interpreted as a signal for heterogeneity between locations 1 and d, with smaller values indicating stronger heterogeneity. The obtained raw list of p-values may hence be regarded as an exploratory tool for identifying possible heterogeneities.
Since we are now dealing with a multiple testing problem, it might be advisable to adjust for multiple comparisons in order to control error rates. This can be done by interpreting the raw list based on classical statistical testing routines, in which p-values are compared with suitable critical values to declare a hypothesis significant. Several methods appear to be meaningful, and we discuss three of them in the following. For this, let α ∈ (0, 1) denote a significance level, e.g., α = 0.1.

IM (Ignore multiplicity): reject homogeneity for all pairs {1, d} for which $p_d^{\mathrm{raw}} \le \alpha$. In doing so, we do not have any control over false rejections. In particular, in case D is large, false rejections of some null hypotheses will be very likely. On the other hand, the procedure will have decent power properties and will likely detect most alternatives. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably little bias and large variance.
Holm (Control the family-wise error rate): reject homogeneity for all pairs selected by Holm's stepdown procedure, which sorts the raw p-values in increasing order, compares the i-th smallest with α/(D − i), and rejects until the first comparison fails; this guarantees that the family-wise error rate (FWER), i.e., the probability of at least one false rejection, is at most α. In general, controlling the family-wise error rate will result in comparably little power, i.e., we might falsely identify some pairs of locations as homogeneous. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably large bias and little variance.
BH (Control the false discovery rate): reject homogeneity for all pairs selected by the Benjamini-Hochberg (BH) stepup procedure, which rejects the hypotheses belonging to the i* smallest p-values, where i* is the largest i such that the i-th smallest p-value is at most iα/(D − 1). Under an additional assumption on the p-values that belong to the true null hypotheses (they must exhibit some positive dependence), the BH procedure is known to asymptotically control the false discovery rate (FDR) at level α, i.e.,
$$\mathrm{FDR} := \mathbb{E}\left[\frac{\text{number of false rejections}}{\text{number of all rejections}}\, \mathbb{1}(\text{at least one rejection})\right] \le \alpha,$$
see Theorem 9.3.3 in Lehmann and Romano (2021). Control of the FDR will be confirmed by the simulation experiments in Section 3. If one were interested in guaranteed theoretical control of the FDR, one might alternatively apply the Benjamini-Yekutieli (BY) stepup procedure; see Benjamini and Yekutieli (2001) and Theorem 9.3.3 in Lehmann and Romano (2021). In view of the fact that the procedure is much more conservative than BH, we do not recommend its application in the current setting.
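The three rejection rules can be sketched as follows (Python; minimal textbook implementations of Holm's stepdown and the BH stepup applied to a hypothetical list of raw p-values — in practice one would use a vetted library routine):

```python
import numpy as np

def reject_im(p, alpha):
    """IM: ignore multiplicity, compare every raw p-value with alpha."""
    return np.asarray(p) <= alpha

def reject_holm(p, alpha):
    """Holm stepdown procedure (controls the family-wise error rate)."""
    p = np.asarray(p)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - i):  # threshold alpha/(m - i + 1) for 1-based i
            reject[idx] = True
        else:
            break                      # stepdown: stop at the first failure
    return reject

def reject_bh(p, alpha):
    """Benjamini-Hochberg stepup procedure (controls the FDR)."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i*alpha/m
        reject[order[:k + 1]] = True
    return reject

# hypothetical raw p-values for D - 1 = 5 pairwise tests, at alpha = 0.1
pvals = [0.001, 0.02, 0.04, 0.2, 0.5]
n_im = int(reject_im(pvals, 0.1).sum())
n_holm = int(reject_holm(pvals, 0.1).sum())
n_bh = int(reject_bh(pvals, 0.1).sum())
# the liberal IM rule rejects at least as many pairs as BH,
# which rejects at least as many as the conservative Holm rule
```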
Concerning a subsequent analysis, estimators based on a pooled sample obtained from the BH procedure can be expected to exhibit bias and variance to be somewhere between the IM and Holm procedure.
Remark 1. The decomposition of $H_0^{\mathrm{ED}}$ into hypotheses of smaller dimensionality is not unique. For instance, we may alternatively write
$$H_0^{\mathrm{ED}} = \bigcap_{k=1}^{K} H_0^{\mathrm{ED}}(B_k),$$
where $\{1\} \subset B_1 \subset B_2 \subset \cdots \subset B_K = \{1, \ldots, D\}$ is an increasing sequence of index sets. In practice, the sequence is supposed to be derived from some expert knowledge of the region of interest; it shall represent a sequence of possible pooling regions where $B_k$ is constructed from $B_{k-1}$ by adding the locations which are a priori 'most likely' homogeneous to the locations in $B_{k-1}$. Note that, necessarily, K ≤ D − 1, which provides an upper bound on the number of hypotheses to be tested.
The derivation of respective testing methods is straightforward.In view of the facts that the choice of the sequence is fairly subjective and that the eventual results crucially depend on that choice, we do not pursue the method any further.

Simulation Study
A large-scale Monte Carlo simulation study was conducted to assess the performance of the proposed bootstrap procedures in finite-sample situations. We aim at answering the following questions: (a) Regarding the test's power: what percentage of locations that are heterogeneous w.r.t. the location of primary interest can be expected to be identified correctly? (b) Regarding the test's error rates: what percentage of locations that are homogeneous w.r.t. the location of primary interest can be expected to be wrongly identified as heterogeneous (FDR)? What is the probability of wrongly identifying at least one location that is homogeneous w.r.t. the location of interest as heterogeneous (FWER)? (c) Regarding the chosen pooling regions: how does return level (RL) estimation based on the pooling regions proposed by the bootstrap procedures compare to RL estimation based on the location of interest only or the whole (heterogeneous) region?
The data was generated in such a way that the spatio-temporal dynamics from the case study in Section 4 are mimicked. To achieve this, we started by fitting the scale-GEV model from Equation (2) to annual block maxima of observations from 1950-2021 at 16 spatial locations in Western Europe (i.e., n = 72 and D = 16) that are arranged in a 4 × 4 grid; see Figure 1 and the additional explanations in Section 4. The locations correspond to the center points of the grid cells; the distance between the center points of two neighbouring grid cells is approximately 140 km. The location-wise GEV parameter estimates $\hat\vartheta_d$ exhibit the following approximate ranges over d ∈ {1, . . . , 16}: $\hat\mu_d \in (18.1, 30.8)$ with a mean of 20.85, $\hat\sigma_d \in (4.185, 7.92)$ with a mean of 5.3, $\hat\gamma_d \in (-0.13, 0.36)$ with a mean of 0.08 and $\hat\alpha_d \in (-2.3, 5.08)$ with a mean of 1.5. Fitting the scale-GEV model to the full pooled sample of size n · D = 1152, we obtained parameter estimates that were close to the means over the location-wise parameter estimates, with 20.37, 5.8, 0.1 and 1.5 for the location, scale, shape and trend parameter, respectively. Next, we transformed the margins to (approximate) unit Fréchet by applying the transformation from Equation (9), such that we can fit several max-stable process models to the transformed data. The best fit was Smith's model with approximate dependence parameters $\sigma_{11} = 0.4$, $\sigma_{12} = 0.2$, $\sigma_{22} = 0.9$; see Davison et al. (2012) for details on the model.
Based on these model fits, we chose to generate data with the following specifications: first, the sample size was fixed to n = 75 and the regional 4 × 4 grid was chosen as described before, i.e., D = 16. The grid cell/location labelled '10' is chosen as the one of primary interest. Further, the dependence structure is fixed to that of Smith's model with (approximately) those parameters that gave the best fit on the observed data, i.e., $\sigma_{11} = 0.4$, $\sigma_{12} = 0.2$, $\sigma_{22} = 0.9$. For simulating data, we first simulate from this max-stable process model (Ribatet, 2022) and then transform the margins to scale-GEV distributions, either in a homogeneous or in a heterogeneous manner. Here, the globally homogeneous model is defined by fixing the marginal scale-GEV parameters, for each d ∈ {1, . . . , 16}, to approximately the mean values of the location-wise GEV parameters obtained for the real observations.
For each of the 449 models, we now apply the following three bootstrap procedures, each carried out with B = 300 bootstrap replications (recall that the grid cell of interest is the one labelled '10'): (B1) the bootstrap procedure from Algorithm 1 with A = {1, . . . , 16}; (B2) the bootstrap procedure from Algorithm 1 for all sets $A_d = \{10, d\}$ with d ∈ {1, . . . , 16} \ {10}; (B3) the bootstrap procedure from Algorithm 2 for all sets $A_d = \{10, d\}$ with d ∈ {1, . . . , 16} \ {10}. Note that the second and third method both yield 15 raw p-values. Each procedure was applied to 500 simulated samples from all models under consideration.
Regarding (B1), we compute the percentage of rejections among the 500 replications, which represents the empirical type I error of the test under the homogeneous model and the empirical power under the heterogeneous models.The results can be found in Figure 2. The null hypothesis is met in the central square only, and we observe that the nominal level of α = 0.1 is perfectly matched.All non-central squares correspond to different alternatives, and we observe decent power properties in both scenarios.Note that a rejection only implies that the entire region {1, . . ., 16} is not homogeneous; there is no information on possible smaller subgroups that are homogeneous to the location of interest.
Regarding (B2) and (B3), rejection decisions were obtained for each hypothesis $H_0^{\mathrm{ED}}(\{10, d\})$ by one of the three methods from Section 2.5. The empirical family-wise error rate is then the percentage of cases (over 500 replications) for which at least one true null hypothesis was rejected. Likewise, for the false discovery rate, we calculate, for each replication, the number of false rejections and divide it by the total number of rejections (when the total number of rejections is 0, this ratio is set to 0). The empirical false discovery rate is obtained by taking the mean over all 500 replications. Similarly, for assessing the power properties, we calculate the empirical proportion of correct rejections (i.e., among the 2 or 7 locations that deviate, the proportion of detected heterogeneous locations) over all 500 replications.
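The empirical error rates described above can be computed as sketched below (Python, illustrative; the 0/0 = 0 convention for the false-discovery proportion matches the text):

```python
import numpy as np

def empirical_error_rates(rejections, null_true):
    """Empirical FWER, FDR and power over simulation replications.

    rejections : (n_reps, n_tests) boolean array of rejection decisions
    null_true  : (n_tests,) boolean array, True where the null hypothesis
                 actually holds (so a rejection there is a false discovery)
    """
    rejections = np.asarray(rejections, dtype=bool)
    null_true = np.asarray(null_true, dtype=bool)
    false_rej = rejections[:, null_true].sum(axis=1)
    total_rej = rejections.sum(axis=1)
    # FWER: share of replications with at least one false rejection
    fwer = np.mean(false_rej >= 1)
    # FDR: mean false-discovery proportion, with the convention 0/0 = 0
    fdp = np.divide(false_rej, total_rej,
                    out=np.zeros(len(rejections)), where=total_rej > 0)
    fdr = fdp.mean()
    # power: mean proportion of the deviating locations that are detected
    power = rejections[:, ~null_true].sum(axis=1).mean() / (~null_true).sum()
    return fwer, fdr, power
```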
Results for the false discovery and family-wise error rate are given in Table 1. We find that the p-value combination methods from Section 2.5 are sufficiently accurate: the BH method controls the false discovery rate, while Holm's method controls the family-wise error rate. This holds exactly for procedure (B3), where the maximal FDR (FWER) of the BH (Holm) method is at 9.4% (8.7%), and approximately for (B2), where the maximal FDR (FWER) is at 12.2% (12.6%). Further, we see that the IM procedure controls neither the FWER nor the FDR.
The power properties for procedure (B2) combined with the BH method are shown in Figure 3. We see that the procedure is able to detect some of the deviations from the null hypothesis, with more correct rejections the stronger the deviation. The method is particularly powerful when the location and scale parameters deviate in opposite directions, i.e., when $c_\mu > 0$ and $c_\sigma < 1$ or $c_\mu < 0$ and $c_\sigma > 1$. There is no obvious pattern regarding the deviations of the shape and trend parameter. Further, we analogously show the power properties of the IM method with bootstrap (B2) in Figure 3. As expected, this method has more power against all alternatives under consideration. However, this comes at the cost of more false discoveries, as can be seen in Table 1.
The results for bootstrap scheme (B3) were very similar and are therefore not shown here, but can be found in Section B of the supplementary material.Likewise, we omit the results for the more conservative Holm procedure, which exhibits, as expected, less power against all alternatives.Further, we repeated the simulation study with an increased location-wise sample size of n = 100.As one would expect, the tests have more power in this case.
The results presented so far show that the proposed pooling methods work 'as intended': the theoretical test characteristics are well approximated in finite-sample situations, and we observe decent power properties. In practical applications, however, spatial pooling of locations is usually the starting point for subsequent analyses. For instance, one may be interested in estimating return levels at the location of interest based on the data from all locations that were identified as homogeneous. Moreover, the analysis of alternative data sets like climate model data may be based on the homogeneous locations identified within the analysis of observations. This suggests that the methods should be examined with regard to their quality in subsequent analyses. For that purpose, we consider, as an example, the problem of return level estimation at the location of interest. The state-of-the-art method consists of GEV fitting at the location of interest only, which results in (asymptotically) unbiased estimators that suffer from large variance. Basing the estimator on pooled regions will decrease the variance, but at the same time increase its bias if some heterogeneous locations have been wrongly identified as homogeneous.
In particular, pooling based on a conservative testing approach like the BH procedure leads to the acceptance of many locations and thus to a large pooling area and low estimation variance. However, some of the chosen locations will most likely violate the null hypothesis, which yields a rather large estimation bias. For pooling based on a more liberal rejection approach like the IM procedure, the estimation bias and variance behave exactly opposite: since the null hypotheses are more likely to be rejected, the resulting pooled sample is smaller (i.e., larger estimation variance) but 'more accurate' (i.e., smaller estimation bias).
For our comparison, we consider fitting the scale-GEV model based on pooled locations that have been obtained from one of the following eight methods: m ∈ {LOI, full, MS IM, MS Holm, MS BH, biv. IM, biv. Holm, biv. BH}.
Here, LOI refers to considering the location of interest only (no pooling), full refers to pooling all available locations, and the last six methods encode pooling based on any combination of the proposed p-value combination methods and bootstrap approaches.
For each method, we compute the maximum likelihood estimate $\hat\vartheta = (\hat\mu, \hat\sigma, \hat\gamma, \hat\alpha)^\top$ of the scale-GEV model parameters and transform this to an estimate of the T-year return level (RL) in the reference climate of year t by
$$\widehat{\mathrm{RL}}_t(T) = G^{-1}_{\hat\mu(t), \hat\sigma(t), \hat\gamma}\big(1 - 1/T\big),$$
where $\hat\mu(t) = \hat\mu \exp(\hat\alpha\, \mathrm{GMST}(t)/\hat\mu)$ and $\hat\sigma(t) = \hat\sigma \exp(\hat\alpha\, \mathrm{GMST}(t)/\hat\mu)$ and where G is the cumulative distribution function of the GEV distribution, see Equation (1). Now, in our simulation study, we know the true value of the target RL, say $\mathrm{RL}_t(T)$. From the 500 replications we can therefore compute the empirical mean squared error (MSE) of method m as
$$\mathrm{MSE}(m) = \frac{1}{500} \sum_{j=1}^{500} \Big(\widehat{\mathrm{RL}}_t^{(m,j)}(T) - \mathrm{RL}_t(T)\Big)^2,$$
where $\widehat{\mathrm{RL}}_t^{(m,j)}(T)$ denotes the estimated RL obtained in the j-th replication with method m. Note that we have suppressed the MSE's dependence on T and t from the notation.
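Under the scale-GEV model, the return level transformation can be sketched as follows (Python, illustrative only; the parameter values below are hypothetical, and scipy's shape convention is c = −γ):

```python
import numpy as np
from scipy.stats import genextreme

def return_level(T, mu, sigma, gamma, alpha, gmst_t):
    """T-year return level in the reference climate of year t, i.e. the
    (1 - 1/T)-quantile of GEV(mu(t), sigma(t), gamma), where both mu and
    sigma are rescaled by exp(alpha * GMST(t) / mu).
    scipy's genextreme uses the shape convention c = -gamma."""
    factor = np.exp(alpha * gmst_t / mu)
    return genextreme.ppf(1 - 1 / T, c=-gamma,
                          loc=mu * factor, scale=sigma * factor)

# hypothetical parameter values, roughly of the magnitude reported above
rl = return_level(T=100, mu=20.0, sigma=5.5, gamma=0.1, alpha=1.5, gmst_t=1.0)
```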
In Figure 4 we compare MSEs of the 100-year RL with reference climate as in year 2021, which is given by $\mathrm{RL}_{2021}(100) = 55.87$, by plotting the difference MSE($m_1$) − MSE($m_2$) with $m_1$ ∈ {MS BH, MS IM} and $m_2$ ∈ {full, LOI}, as obtained for the setting where $|A_{\mathrm{dev}}| = 7$. The plots reveal that both the MS BH and the MS IM method are superior to the LOI fit for almost all scenarios. Comparing the two methods to the full fit reveals that there are certain scenarios for which the full fit performs substantially worse, mostly when the shape and scale parameter deviate in the same direction for the alternatives. For those scenarios where the full fit outperforms the two methods, the discrepancy is not very large, with the BH method performing slightly better than the IM method.
A comparison between MS BH and MS IM is shown in Figure 5 for $|A_{\mathrm{dev}}| \in \{2, 7\}$. The results reveal that the BH method slightly outperforms the IM method in the case $|A_{\mathrm{dev}}| = 2$ for almost all alternative scenarios. In the case $|A_{\mathrm{dev}}| = 7$, the results are quite mixed, with the IM method becoming clearly superior when the shape, scale and location parameters deviate jointly to the top. In all other scenarios, the differences are only moderate, sometimes favoring one method and sometimes the other. Corresponding results for the bootstrap methods based on bivariate extreme value distributions are very similar and therefore not shown. Further, the results were found to be robust against the choices of t = 2021 and T = 100 that were made here for the return level estimation.
Overall, the results suggest the following practical recommendation: if the full sample is expected to be quite homogeneous a priori (for instance, because it was built based on expert knowledge), then estimation based on BH-based pooling is preferable over the other options (LOI, the full and the IM-based fit). If one expects a larger number of heterogeneous locations, it is advisable to apply the IM procedure (or any other liberal procedure), which likely rejects most of the heterogeneous locations and hence reduces the bias. In general, the liberal behavior of IM-based pooling suggests its use when it is of highest practical interest to obtain a pooled region that is as homogeneous as possible (as a trade-off, one has to accept that the region is probably much smaller than the initial full region).

Case study

Among other things, our illustrative application of the methods explained above will reveal that grid cell 11 has been rightfully dismissed, while grid cell 17 might have been considered homogeneous. Further, there is no clear evidence against any other grid cell that has been declared homogeneous. We apply the two proposed bootstrap procedures to areas (A) and (B). Note that the raw p-values obtained with the bootstrap based on bivariate extreme value distributions should be very similar (or even identical when using the same seed for random number generation) for those grid cells that appear in both areas, while they may differ to a greater extent for the MS bootstrap. This is because the p-value for a given pair obtained with the bivariate bootstrap procedure only depends on the observations of the pair, while the respective p-value obtained with the MS bootstrap also depends on the spatial model that was fitted to the whole area. However, even if the raw p-value of a given pair obtained for setting (B) coincides with the raw p-value obtained for setting (A), the adjustment for multiple testing can lead to slightly different rejection decisions for the pair at a given level α. The bootstrap procedures are applied with B = 2000 bootstrap replications.
We start by discussing the results of the application to the larger grid in (A). Recall that, for a given significance level α, one rejects the null hypothesis for all grid cells whose p-value is smaller than α. To visualise the results, we therefore shade the grid cells according to the magnitude of their (adjusted) p-value. Here, we divide the colour scale into three groups, [0, 0.05], (0.05, 0.1] and (0.1, 1], with a dark red tone assigned to the first group, a brighter red tone to the second group and an almost transparent shade to the third group. This allows us to read off the test decisions for significance levels of α ∈ {0.05, 0.1}: when the significance level is chosen as α = 0.1, all tiles with a reddish shade are rejected, while when working with a level of α = 0.05, only tiles in the dark red shade are rejected. Results for both the conservative BH procedure and the liberal IM procedure are shown in Figure 7. For completeness, results for Holm's method, which is even more conservative than BH, as well as the BH and IM p-values themselves, can be found in the supplementary material, Tables C.2 and C.3. One can see that, for a given rejection method (i.e., BH or IM), the MS and bivariate procedures mostly agree on the rejection decisions that would be made at a level of 10% (compare the rows of Figure 7 to see this). The same holds when working with a significance level of 5%.
Further, as expected, the IM method rejects more hypotheses than the BH method. However, according to the results of the simulation study, it is quite likely that at least one of these rejections is a false discovery.
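To make the multiplicity adjustment behind these rejection decisions concrete, the following minimal sketch implements the Benjamini-Hochberg (step-up, FDR-controlling) and Holm (step-down, FWER-controlling) adjustments on a vector of raw p-values. The p-values in `raw` are hypothetical, not taken from the case study; the adjustment formulas themselves are the standard ones referred to in Section 2.5.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the FDR)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, starting from the largest p-value
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

def holm(pvals):
    """Holm step-down adjustment (controls the FWER)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * (m - np.arange(m))
    adj = np.maximum.accumulate(scaled)
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# hypothetical raw bootstrap p-values for ten pairwise tests
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
adj_bh = benjamini_hochberg(raw)
adj_holm = holm(raw)
```

One then rejects all hypotheses whose adjusted p-value is below the chosen level α; since Holm's adjusted values are never smaller than BH's, Holm rejects at most as many hypotheses.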
Analogous results for the 4 × 4 grid in (B) are shown in Figure 8. As discussed above, except for the MS BH method, the results are consistent with those obtained for the 6 × 6 grid in the sense that, for those locations contained in both grids, the locations with p-values of critical magnitude (< 10%) coincide (compare the plot in the upper right corner of Figure 8 with the plot in the upper right corner of Figure 7 to see this for the MS IM method, and similarly for the other methods). For the MS BH method, grid cells 10, 14, 15 and 16 are no longer significant at a level of 10%, but we recorded an adjusted p-value of 0.106 for those four grid cells, so this is a rather tight decision. The p-values obtained for the 4 × 4 grid can be found in Table C.1 in the supplementary material.
Let us now move on to the interpretation. Considering the larger grid first, some grid cells for which the characteristics of extreme precipitation differ (according to expert opinion) from those of the grid cell of the target location are detected as heterogeneous. These rejected grid cells are located along the coast and in mountainous terrain. Comparing the results with Kreienkamp et al. (2021); Tradowsky et al. (2022), we observe that grid cell 11 was rejected in their study based on expert knowledge. For grid cell 17, however, we do not detect any statistical evidence that the probabilistic behavior of extreme precipitation differs from that of the grid cell of the target location, even when applying the liberal IM procedure. We would like to stress, though, that non-rejection of a null hypothesis does not provide any evidence in favor of the null hypothesis, even when ignoring the multiple testing issue. Hence, expert knowledge that leads to rejection should, in general, outweigh any statistical non-rejection. This particularly applies to the eastern (continental) grid cells in the larger 6 × 6 grid, which can be influenced by heavy precipitation caused by synoptic situations different from those affecting the target region.
Moreover, as the results for locations 10, 14, 15 and 16 showed some discrepancy across the different testing procedures, we suggest that the final decision on the exclusion or inclusion of these locations in a spatial pooling approach should be based on expert knowledge of the meteorological characteristics and on the willingness to trade possible bias for variance (with a possibly larger bias when including the locations; note that the statistical evidence against homogeneity in the bootstrap based on bivariate extreme value distributions is only weak, and wrongly declaring the regions as homogeneous is possibly not too harmful). The same holds for locations 9, 20, 23 and 27, for which only the IM method yielded p-values between 5% and 10%. Again, these rather small p-values could be false discoveries, and since the heterogeneity signal is also not too strong, there is no clear evidence that these locations need to be excluded from pooling.
For a last evaluation of the results from the pairwise tests, we estimated the 100-year RLs in the reference climate of the year 2021, i.e., with reference value GMST(2021) = 0.925 °C, on five different data sets obtained from the 4 × 4 grid. Here, we use the data sets consisting of data from
• the location of interest only,
• pooling those grid cells suggested by the results of the case study (i.e., all cells but 11, or all cells but 10, 11, 14, 15, 16) or by expert opinion (i.e., all cells but 11 and 17),
• pooling all grid cells of the 4 × 4 grid.
The results can be found in Table 2. Finally, we would like to mention that similar results were obtained when applying the BH test procedures to all triples containing the pair of grid cells (20, 21), i.e., the extended target region considered in the study of Kreienkamp et al. (2021); Tradowsky et al. (2022), consisting of those regions in Germany and Belgium affected worst by the July 2021 flooding.

Extensions
In this section, we discuss how to estimate region-wise return levels under homogeneity assumptions (Section 5.1).We also propose two possible extensions of the pooling approach from the previous sections to other hypotheses (Section 5.2) or other underlying model assumptions (Section 5.3).

Estimation of regional return levels and return periods
As pointed out in Kreienkamp et al. (2021); Tradowsky et al. (2022), among others, an estimated return period (RP) of T years for a given event in a fixed reference climate (e.g., the preindustrial climate), obtained from a fit of the GEV distribution to a pooled homogeneous sample, has the following interpretation: for each fixed location/tile within the region, one can expect one event of the same or larger magnitude within T (imaginary) years of observing the reference climate. We refer to this quantity as the local return period. Obviously, within those T years, one would expect more than one event of similar magnitude to occur at at least one of the locations of the pooling region. Likewise, for a given T, one would expect a higher T-year return level for the whole region; the latter corresponds to the value that is expected to be exceeded only once in T years at at least one of the locations.
Mathematically, using the notation from Section 2.1, the exceedance probability of a value r at at least one among D ≥ 2 locations in the reference climate corresponding to year t is given by p_t(r) = P(∃ j ∈ {1, ..., D} : M_{t,j} > r), such that the return period of event r for the region is RP^reg_t(r) = 1/p_t(r). Further, the T-year return level of the region in the climate corresponding to year t is the minimal value RL satisfying RP^reg_t(RL) ≥ T. If the locations were independent, we would simply have p_t(r) = 1 − G(r; μ(t), σ(t), γ)^D, where G is the distribution function of the GEV distribution and where μ(t), σ(t) and γ denote the parameters in the reference climate of year t from Equation (2) under the homogeneity assumption from Equation (3). The locations are, however, usually not independent in applications. In the following, we propose a simulation-based estimation method that involves max-stable process models to account for the spatial dependence. As before, the R package SpatialExtremes (Ribatet, 2022) allows for fitting and simulating max-stable process models.
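Under the simplifying independence assumption, the regional quantities above have closed forms that can be evaluated directly. The following sketch does so with hypothetical parameter values (μ, σ, γ, D are illustrative, not estimates from the case study); note that SciPy's `genextreme` parametrizes the shape as c = −γ relative to the climate-science sign convention used here.

```python
import numpy as np
from scipy.stats import genextreme

def regional_rp_independent(r, mu, sigma, gamma, D):
    """Return period of an exceedance of r at >= 1 of D independent,
    homogeneous GEV locations; scipy's shape parameter c equals -gamma."""
    G = genextreme.cdf(r, c=-gamma, loc=mu, scale=sigma)
    return 1.0 / (1.0 - G ** D)     # 1 / P(at least one exceedance)

def regional_rl_independent(T, mu, sigma, gamma, D):
    """T-year regional return level under independence: solve G(r)^D = 1 - 1/T."""
    return genextreme.ppf((1.0 - 1.0 / T) ** (1.0 / D),
                          c=-gamma, loc=mu, scale=sigma)

# hypothetical GEV parameters and a 16-tile region
mu, sigma, gamma, D = 40.0, 10.0, 0.1, 16
rl_local = genextreme.ppf(1.0 - 1.0 / 100, c=-gamma, loc=mu, scale=sigma)
rp_reg = regional_rp_independent(rl_local, mu, sigma, gamma, D)
rl_reg = regional_rl_independent(100, mu, sigma, gamma, D)
```

Because real locations are positively dependent, these independence formulas overstate the effect; they serve only as a reference point for the simulation-based estimator below.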
Algorithm 3 (Simulation-based estimation of the region-wise RL and RP).
(1) Fit the scale-GEV parameters to the pooled homogeneous sample, resulting in the parameter vector θ̂ = (μ̂, σ̂, γ̂, α̂).
(2) Transform the margins of the pooled data to approximately unit Fréchet by applying the transformation from Equation (9) with the parameter estimate from Step 1. Then fit several max-stable process models to the obtained data and choose the best fit according to the information criterion CLIC.
(3) Replicate for b = 1, ..., B the following steps:
(i) Generate one random observation (y*_{b,1}, ..., y*_{b,D}) from the chosen max-stable process model.
In particular, once we have estimated the local 100-year RL, we can obtain an estimate of the return period this event has for the whole region. Likewise, when we have an estimate of the local return period of an event with value r, we can estimate what the event value for that return period would be for the whole region.
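The final estimation step of Algorithm 3 reduces to evaluating the empirical distribution function of the simulated region-wise maxima. The sketch below implements exactly that step; the sample of maxima is a synthetic stand-in (a Gumbel sample with arbitrary parameters), whereas in the paper these maxima are computed from simulations of the fitted max-stable process model.

```python
import numpy as np

def regional_estimates(max_sample, r, T):
    """Estimate the regional return period of level r and the regional
    T-year return level from a sample of simulated region-wise maxima,
    mirroring the empirical-cdf step of Algorithm 3."""
    m = np.asarray(max_sample, dtype=float)
    exceed = np.mean(m > r)                    # 1 - F*(r)
    rp_reg = np.inf if exceed == 0 else 1.0 / exceed
    rl_reg = np.quantile(m, 1.0 - 1.0 / T)     # (F*)^{-1}(1 - 1/T)
    return rp_reg, rl_reg

rng = np.random.default_rng(1)
# stand-in sample of regional maxima; the paper draws these from a fitted
# max-stable process model instead
sample = rng.gumbel(loc=50.0, scale=8.0, size=100_000)
rp, rl = regional_estimates(sample, r=60.0, T=100)
```

With B in the order of 100 000 replications, as used below, the Monte Carlo error of these empirical estimates is small compared to the parameter estimation uncertainty.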
We illustrate the estimators for the pooled data sets from Section 4. The estimates are based on B = 100 000 simulation replications and are shown in Table 3. We see that the local 100-year return levels have substantially shorter region-wise return periods. In the region with 15 tiles (only cell 11 excluded), the estimated local 100-year RL at the reference climate of 2021 can be expected to be exceeded once in approximately 19 years in at least one of the 15 tiles. We find a similar region-wise return period for the pooling region consisting of 14 tiles. In the pooling region consisting of 11 tiles, the local 100-year return level becomes a region-wise 33-year event. This comparably larger value arises from the smaller region that is considered: the smaller the region, the less likely it is that one of the locations exceeds a high threshold. Further, as expected, we find that the region-wise 100-year return levels at the reference climate of 2021 are larger than their local counterparts. For the regions consisting of 15 and 14 tiles, this increase is approximately 26%, while it is approximately 17.3% for the region consisting of 11 tiles.

Table 3: Estimated local (second column) and regional (fourth column) 100-year RLs for the reference climate of 2021, for three possible choices of pooling regions as indicated by the first column. Column 3 shows the regional return periods of the local 100-year events.

A homogeneous scaling model with location-wise scaling factor
In this section, we maintain the temporal dynamics of the scale-GEV model from Equation (2). However, instead of testing for the homogeneity assumption from Equation (3), we additionally allow for a location-wise scaling factor under the null hypothesis. Such an approach can be useful when it is known that observations from different locations occur on different scales but, apart from that, show a common probabilistic behaviour. In fact, a stationary version of the following model is commonly used in hydrology, where it is known as the Index Flood approach (Dalrymple, 1960). More precisely, suppose that each location d follows the scale-GEV model with parameter vector (μ_d, σ_d, γ, α_d) = (c_d μ, c_d σ, γ, c_d α), where c_d > 0 is a location-specific scaling factor that we may fix to 1 at the location of primary interest (say d = 1, i.e., c_1 = 1). In other words, the model in Equation (15) can be rewritten in terms of the common parameters (μ, σ, γ, α) and the scaling factors c_2, ..., c_D. Note that the parameters μ_1, ..., μ_D, σ_1, ..., σ_D, α_1, ..., α_D then satisfy the relationships σ_d = δ μ_d, α_d = η μ_d and α_d = κ σ_d for certain parameters δ, η, κ; in particular, μ_1, ..., μ_D, σ_1, ..., σ_D, α_1, ..., α_D can be retrieved from μ_1, ..., μ_D, δ, η (note that the constraint on α_d/σ_d is not needed, but comes as a consequence of the first two relations). Fitting this model instead of fitting the scale-GEV distribution to each location separately has the advantage of substantially reducing the number of parameters that need to be estimated (4 + (D − 1) = D + 3 instead of 4D parameters). Once the local scaling factors are identified, we can bring all observations to the same scale by dividing each location's observations by its location-specific scaling factor. One can then test whether such a local scaling model holds on a subset A = {d_1, ..., d_k} ⊂ {1, ..., D} with cardinality k = |A| ≥ 2 by testing the corresponding hypothesis with a Wald-type test statistic. In this case, the latter is defined as T^LS_n(A) = n g_A(θ̂_n)' (G_A(θ̂_n) Σ̂_n G_A(θ̂_n)')^{-1} g_A(θ̂_n), where g_A: R^{4D} → R^{3(k−1)} collects, for j = 2, ..., k, the differences σ_{d_j}/μ_{d_j} − σ_{d_1}/μ_{d_1}, α_{d_j}/μ_{d_j} − α_{d_1}/μ_{d_1} and γ_{d_j} − γ_{d_1}, with Jacobian matrix G_A(θ) ∈ R^{3(k−1)×4D}, since the hypothesis in Equation (17) may be rewritten as H^LS_0(A): g_A(θ) = 0. When considering this kind of modification, the bootstrap algorithms from Section 2.4, Steps (5)-(7), must be adapted accordingly. In Step (5), one has to estimate the parameters under the constraint of the considered null hypothesis by adapting the log-likelihood accordingly. The estimated parameters are then used in the transformation Step (6). Further, the test statistic in Steps (6) and (7) is replaced by T^LS_n(A) from (18). Further details are omitted for the sake of brevity.
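The generic structure of such a Wald-type statistic, g(θ̂)' (G Σ̂ G')^{-1} g(θ̂) with G the Jacobian of the constraint function, can be sketched compactly. The example below uses a numerical Jacobian and a toy one-dimensional constraint (equality of two parameters); the actual g_A and covariance estimate for the scale-GEV model are more involved and follow Sections 2.2-2.3.

```python
import numpy as np

def wald_statistic(theta_hat, sigma_hat, g, eps=1e-6):
    """Wald-type statistic g(th)' (G S G')^{-1} g(th) for H0: g(theta) = 0,
    with G the forward-difference Jacobian of g at theta_hat and S the
    estimated covariance matrix of theta_hat."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    g0 = np.atleast_1d(g(theta_hat))
    G = np.empty((g0.size, theta_hat.size))
    for j in range(theta_hat.size):
        step = np.zeros_like(theta_hat)
        step[j] = eps
        G[:, j] = (np.atleast_1d(g(theta_hat + step)) - g0) / eps
    return float(g0 @ np.linalg.solve(G @ sigma_hat @ G.T, g0))

# toy constraint: equality of two parameters, H0: theta_1 - theta_2 = 0
g = lambda th: np.array([th[0] - th[1]])
T_wald = wald_statistic([1.2, 1.0], np.diag([0.04, 0.04]), g)
```

Large values of the statistic speak against the null hypothesis; critical values are obtained from the bootstrap procedures rather than from the asymptotic chi-square limit.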

General homogeneous models with smooth parametrization
In this section, we consider general GEV models in which the location, scale and shape parameters are allowed to depend in a (fixed) differentiable way on some parameter vector ϑ ∈ R q and some temporal covariate c (t) ∈ R p with p, q ∈ N.
More precisely, suppose that f_μ, f_σ and f_γ are (known) real-valued functions of ϑ and c that are differentiable with respect to their first argument ϑ. We assume that, for each d = 1, ..., D, there exists an unknown parameter ϑ_d such that the block maxima at location d follow the GEV distribution with location f_μ(ϑ_d, c^(t)), scale f_σ(ϑ_d, c^(t)) and shape f_γ(ϑ_d, c^(t)). The global null hypothesis of interest within this model is assumed to be expressible as h(ϑ_1, ..., ϑ_D) = 0 for a differentiable function h: R^{qD} → R^s with s ∈ N.
An example is given by the linear shift model that is frequently considered when modelling temperature extremes in extreme event attribution studies (see Philip et al., 2020), where the location parameter shifts linearly in the covariate, μ(t) = μ + α c^(t), while the scale and shape parameters are held constant, i.e., ϑ = (μ, σ, γ, α). A possible global null hypothesis of interest could then be full homogeneity, ϑ_1 = ... = ϑ_D. When considering this kind of extension, one has to adapt the maximum likelihood estimator as well as the estimator of its covariance matrix; hence Steps (1)-(2) and (5)-(7) in the bootstrap algorithms are affected. Further details are omitted for the sake of brevity.
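A maximum likelihood fit of such a linear shift model can be sketched in a few lines. The code below is an illustrative stand-in, not the paper's implementation: it fits μ(t) = μ0 + μ1·c(t) with constant scale and shape to synthetic data by direct numerical optimization, and again uses SciPy's convention that the shape parameter c equals −γ. The covariate `c` mimics a smoothed GMST anomaly.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genextreme

def fit_shift_gev(x, cov):
    """ML fit of a linear shift model mu(t) = mu0 + mu1 * c(t) with constant
    scale and shape; scipy's shape parameter c equals -gamma."""
    def nll(theta):
        mu0, mu1, log_sigma, gamma = theta
        # negative log-likelihood; invalid parameter values yield +inf
        return -genextreme.logpdf(x, c=-gamma, loc=mu0 + mu1 * cov,
                                  scale=np.exp(log_sigma)).sum()
    start = np.array([x.mean(), 0.0, np.log(x.std()), 0.1])
    res = minimize(nll, start, method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
    mu0, mu1, log_sigma, gamma = res.x
    return mu0, mu1, np.exp(log_sigma), gamma

rng = np.random.default_rng(0)
c = np.linspace(0.0, 1.0, 72)     # stand-in smoothed GMST covariate
x = genextreme.rvs(c=-0.1, loc=20.0 + 2.0 * c, scale=1.5, size=72,
                   random_state=rng)
mu0, mu1, sigma, gamma = fit_shift_gev(x, c)
```

Parametrizing the scale on the log level keeps the optimization unconstrained; with only 72 block maxima per location, the shape estimate in particular carries substantial uncertainty.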

Conclusion
Extreme event attribution studies can build upon a GEV scaling model. Depending on the analysed variable, it may be useful to apply spatial pooling and fit the GEV distribution to a pooled sample of observations collected at sufficiently homogeneous spatial locations, as has been done in Kreienkamp et al. (2021); Tradowsky et al. (2022); Vautard et al. (2015), among others. Here, we propose several statistical methods that enable the selection of a homogeneous pooling region from a larger initial region. The BH approach was found to be quite conservative; hence some heterogeneous locations are likely to be declared homogeneous. The IM approach is a more liberal alternative with a higher proportion of rejected locations, which may contain some homogeneous ones. In subsequent analyses, the selected pooling region typically results in a classical bias-variance trade-off: the larger the pooling region, the smaller the variance. At the same time, the bias may be larger, given that some heterogeneous locations may have been declared homogeneous. In practice, the tests should always be complemented by expert knowledge of the driving meteorological/climatological background processes.
To make the statistical approach for selecting homogeneous pooling regions described here usable for the extreme event attribution community, we have developed a software package that can be freely downloaded and used by applied researchers (Zanger, 2022). The selection of spatial pooling regions for attribution studies may hence be based on a combination of expert knowledge and thorough statistical tests. The experts applying the methods can thereby decide between a conservative approach, which tends to reject more locations, and a liberal approach, which tends to accept more locations as being homogeneous. This decision depends on the a priori knowledge about the meteorology of the analysed area and the specific requirements of the study.
If the ultimate interest is the estimation of, for example, return levels, one may, as an alternative to the classical approach based on pooling selected locations, consider penalized maximum likelihood estimators with a penalty on large heterogeneities (Bücher et al., 2021). A detailed investigation of the resulting bias-variance trade-off would be a worthwhile topic for future research.

Declaration
Ethical Approval Not applicable.
SUPPLEMENT TO THE PAPER: "Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing" Leandra Zanger, Axel Bücher, Frank Kreienkamp, Philip Lorenz, Jordis Tradowsky. This supplement contains mathematical details on the maximum likelihood estimator and the estimation of its covariance matrix, as well as additional simulation results and further details on the case study. References like (1) refer to equations from the main paper, while references like (A.1) or Figure B.2 refer to equations or figures within this appendix.
We next motivate the claimed normal approximation. First of all, in view of the differentiability of the log-likelihood in the parameter, the vector of maximum likelihood estimators is necessarily a zero of the gradient of the log-likelihood function, i.e., we have ∇L_n(θ̂_n) = 0. Denoting the true parameter vector by θ, a Taylor expansion implies that 0 = ∇L_n(θ) + ∇²L_n(θ)(θ̂_n − θ) + R_n, where R_n denotes higher-order terms which are negligible. Solving for θ̂_n − θ, one finds that the relevant entries are sums of terms of the form f(c^(t)) g(M_{t,d}) for suitable functions f and g, where the M_{t,d} are GEV-distributed and independent over t. Under suitable assumptions on t ↦ f(c^(t)) (and hence on c^(t)), we obtain that the variance of I_{n,d,ϑ_d} is of the order 1/n. As a consequence, I_{n,d,ϑ_d} is well approximated by J_{n,d,ϑ_d}, defined just after Equation (7). More precisely, in an asymptotic framework where one assumes that c^(t) = h(t/n) for some continuous function h: [0, 1] → R, expressions as in Equation (A.7) converge to a deterministic limit (provided that both the integral and the expectation exist).
Next, consider √n L_{n,θ}. It suffices to argue that we may apply a suitable version of the Central Limit Theorem, under suitable assumptions on t ↦ c^(t). Similarly as for I_{n,θ}, by Equation (A.5), each entry of √n L_{n,θ} is a normalized sum of independent terms, where μ̂_j(c) = μ̂_j exp(α̂_j c/μ̂_j). The final estimator for Σ_n is Σ̂_n = (Σ̂_{n;j,k})_{j,k=1}^D with Σ̂_{n;j,k} = Ĵ^{-1}_{n,j,ϑ̂_j} Ĉ_{n,j,k} Ĵ^{-1}_{n,k,ϑ̂_k}.
B Additional results of the simulation study

B.2 Additional results for record length n = 100
Since the bootstrap procedures implicitly depend on the asymptotic distribution of the test statistic, we repeated the simulation study with a larger sample size, in order to investigate the sample size's impact on the performance of the bootstrap procedure. The location-wise sample size is increased to n = 100. Again, the FDR and FWER are reported (Table B.1), as well as the power plots for the BH method and bootstrap (B2) in Figure B.5. The power has substantially increased (on average by 50% for the BH method and by 18% for the IM method). The results for the other methods and for (B3) were again similar.

Figure 1 :
Figure 1: Illustration of the grid used for the simulation. The regions contained in A_dev are shaded in blue, with |A_dev| = 2 shown in the left plot and |A_dev| = 7 shown in the right plot. The region of interest is the one labelled 10.

Figure 2 :
Figure 2: Rejection rates in % obtained for (B1), in the setting where either 2 (left plot) or 7 (right plot) regions deviate from the others. Each coloured square contains the rejection rate for one of the 225 different models, with the central square with c_μ = c_σ = 0 corresponding to the null hypothesis. The x- and y-axes and the facets determine the values of the scale-GEV parameter vector of the deviating locations through Equation (14).

Figure 3 :
Figure 3: Proportion of correct rejections in % obtained with the BH procedure (upper row) and the IM procedure (lower row) at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 locations deviate from the rest (right column), with the bootstrap procedure based on max-stable processes. The axes and facets are as described in Figure 2.

Figure 4 :
Figure 4: Comparison of MSEs of RL_2021(100) obtained for different choices of the method m, in the setting where |A_dev| = 7. Shown are the differences MSE(m1) − MSE(m2), with m1 and m2 as indicated in the plot title. Negative values (red) therefore indicate a lower MSE for the method mentioned first, and vice versa for positive values. The axes and facets are as described in Figure 2.

Figure 5 :
Figure 5: Comparison of MSEs of RL_2021(100) in the setting where |A_dev| = 2 (left) and |A_dev| = 7 (right). Shown are the differences MSE(MS BH) − MSE(MS IM). Negative values (red) therefore indicate a lower MSE for the BH method, while positive values (blue) indicate a lower MSE for the IM method. The axes and facets are as described in Figure 2.

Figure 6 :
Figure 6: Regions analysed within this case study and the respective numbering used here. The data consist of April-September block maxima of tile-wise averaged daily precipitation sums (RX1day) from 1950-2021.

Figure 7 :
Figure 7: (Adjusted) p-values obtained with the BH (left) and the IM (right) method on the 6 × 6 grid, with the bootstrap based on max-stable processes (top row) and the bootstrap based on bivariate extreme value distributions (bottom row).

Figure 8 :
Figure 8: Adjusted p-values obtained with the BH (left) and the IM (right) method on the 4 × 4 grid, obtained with the bootstrap based on max-stable processes (top row) and the bootstrap based on bivariate extreme value distributions (bottom row).
(ii) Transform the margins to GEV margins by applying the transformation in (10) with the parameters estimated in Step 1, resulting in the observation (x*_{b,1}, ..., x*_{b,D}), and compute its maximum m^{(t),*}_{max,b} = max_{d=1,...,D} x*_{b,d}.
The region-wise T-year return level RL^reg_t(T) and the return period RP^reg_t(r) of an event with value r can now be estimated from the empirical cumulative distribution function F̂*_t of the sample (m^{(t),*}_{max,b})_{b=1,...,B} through RL̂^reg_t(T) = (F̂*_t)^{-1}(1 − 1/T) and RP̂^reg_t(r) = 1/(1 − F̂*_t(r)).

Figure B. 1 :
Figure B.1: Proportion of correct rejections in % obtained with the Benjamini-Hochberg procedure at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on bivariate extreme value distributions. The axes and facets are as described in Figure 2.

Figure B. 2 :
Figure B.2: Proportion of correct rejections in % obtained with the Benjamini-Hochberg procedure at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on bivariate extreme value distributions. The axes and facets are as described in Figure 2.
The time series used throughout the case study are shown in Figure C.1. Along with the Apr-Sep maxima of 1950-2021, we plot values of μ̂(t) = μ̂ exp(α̂ GMST(t)/μ̂), where μ̂ and α̂ are estimated on the data of location 21 only (blue line), the respective location d (red line), or the pooled data of the pair (21, d) (green line), for d ∈ {1, ..., 36} \ {21}.

Figure B. 3 :
Figure B.3: Proportion of correct rejections in % obtained with the Holm procedure at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on max-stable processes. The axes and facets are as described in Figure 2.

Figure B. 4 :
Figure B.4: Proportion of correct rejections in % obtained with the Holm procedure at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on bivariate extreme value distributions. The axes and facets are as described in Figure 2.

Figure B. 5 :
Figure B.5: Proportion of correct rejections in % obtained with the Benjamini-Hochberg procedure at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on max-stable processes and record length n = 100. The axes and facets are as described in Figure 2.

Figure B. 6 :
Figure B.6: Proportion of correct rejections in % obtained when ignoring the multiple testing problem, at a level of 0.1, in the setting where two stations deviate from the rest (left column) or 7 stations deviate from the rest (right column), with the bootstrap procedure based on max-stable processes and record length n = 100. The axes and facets are as described in Figure 2.

Figure C. 1 :
Figure C.1: Observations and fitted trend as estimated for each location d, d = 1, ..., 36 (red line), as well as the trend obtained from the GEV fit for location 21 (blue line) and the GEV fit obtained from the pooled sample consisting of the pair (21, d), d = 1, ..., 36 (green line). Location labels are given in the top right corner.
Bootstrap samples {M*_{t,d} : t ∈ {1, ..., T}, d ∈ {1, ..., D}} are generated in such a way that they satisfy the null hypothesis H^ED_0. For each fixed A ⊂ {1, ..., D} with k = |A| ≥ 2, the test statistics computed on all bootstrap samples, say (T*_{n,b}(A))_{b=1,...,B}, are then compared to the observed test statistic T_n(A). Since the bootstrap samples do satisfy H^ED_0(A), the observed test statistic T_n(A) should differ significantly from the bootstrapped test statistics in case H^ED_0(A) is not satisfied for the observed data.
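The comparison of the observed statistic with its bootstrap replicates reduces to computing the usual bootstrap p-value, i.e., the (corrected) proportion of replicates at least as large as the observed value. A minimal sketch, with a chi-square sample as a hypothetical stand-in for the bootstrapped null distribution:

```python
import numpy as np

def bootstrap_pvalue(t_obs, t_boot):
    """p-value of an observed statistic against B bootstrap replicates
    drawn under H0; the +1 correction keeps the p-value strictly positive."""
    t_boot = np.asarray(t_boot, dtype=float)
    return (1 + np.sum(t_boot >= t_obs)) / (t_boot.size + 1)

rng = np.random.default_rng(3)
t_boot = rng.chisquare(df=3, size=2000)   # stand-in for (T*_{n,b}(A))_b
p = bootstrap_pvalue(11.34, t_boot)       # 11.34 ~ chi2_3 0.99-quantile
```

The resulting raw p-values, one per subset A, are then fed into the multiplicity adjustments of Section 2.5.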

Table 1 :
False Discovery Rate (FDR) and Family-Wise Error Rate (FWER) for the three p-value combination methods from Section 2.5 and the two bootstrap methods (B2) and (B3). The stated values are the minimum, maximum and mean across the 224 alternative models from each scenario.

Table 2 :
Estimated parameters and estimate of RL 2021 (100) obtained when pooling all grid cells but the ones given in the first column.

Table C.1: (Adjusted) p-values for all three methods from Section 2.5 obtained with both bootstrap methods applied to the 4 × 4 grid. Values that are significant at the 10%-level are shown in boldface. Results are based on B = 2000 bootstrap replications.