Testing ensembles of climate change scenarios for “statistical significance”

Climate impacts and adaptation research increasingly uses ensembles of regional and local climate change scenarios. In doing so, the ensembles are examined to evaluate whether they describe a systematic difference between present states (and impacts) and envisaged future states, and such differences are often characterized as being statistically significant. The term "significance" is well defined in statistical terminology as the result of a test of a null hypothesis that is applied to samples of observations obtained with a defined sampling strategy. However, such a statistical null hypothesis may not be a well-posed problem in the context of the evaluation of climate change scenarios. Therefore, the usage of terms such as "statistically significant scenario" may be misunderstood in the general discourse about the certainty of projected climate change. We propose to employ instead a simple descriptive approach for characterizing the information in an ensemble of scenarios. Physical plausibility in the light of theoretical reasoning often adds robustness to the interpretation of climate change scenarios.


Introduction
Ensembles of climate change scenarios, say of future seasonal precipitation in Northern Germany, are sometimes described as being significant. In scientific contexts, this term usually refers to an assessment of the likelihood of a given outcome under an assumed set of conditions, with greater significance associated with rarer outcomes under those assumptions.
A key element in determining statistical significance is a null hypothesis Ho, against which statistical rareness is assessed; significant results are those that occur in the far tails of the distribution that prevails under Ho. To assess statistical significance under Ho, scenarios are implicitly viewed as realizations of a random variable or process, i.e., as members of a well-defined population of possible events. Decisions regarding Ho are generally made on the basis of a test statistic t that summarizes the available sample of realizations; this statistic is itself a realization of a random variable T, where T represents the population of all possible outcomes of t under repeated sampling. Note that generally the test statistic has been constructed so that it has desirable optimality properties under an assumed set of conditions, although this is not always the case.
How far in the tails, and in which directions, outcomes provide information regarding significance is often a matter of social agreement that may or may not, in part, be explicitly articulated as an alternative hypothesis Ha. In climate science, the tail is often given by those events larger than 95% of possible outcomes under Ho, corresponding to a significance level of 5% or less. The null hypothesis is rejected when the outcome t, as summarized by the test statistic on the basis of a collection of scenarios p1, p2, ..., pn, is inconsistent with the statistical model proposed under Ho. An underlying assumption that supports the interpretation of t is that the available collection of scenarios p1, p2, ..., pn forms a sample of realizations of a random process P, where the sampling process is understood. That is, the assessment of the rarity of the outcome under Ho depends upon the assumption that the available sample of scenarios is representative of a generating process, or population, P. We will see later, when we discuss ensembles of opportunity [17], that this is a nontrivial assumption.
It is understood that decisions concerning the null hypothesis are subject to error, with an error rate that corresponds to the selected significance level. Generally, hypothesis tests only operate with the user-specified error rate characteristics when all assumptions are satisfied. In the case of climate change scenarios (for example, simulated differences between future and present precipitation), tests of the null hypothesis that there is no change on average, Ho: ΔµP = 0, versus the alternative hypothesis Ha: ΔµP > 0, are often applied to ensembles of scenarios constructed from multiple climate models. [We employ here one-sided tests for the sake of simplicity.] Rejection of this null hypothesis means that the scenarios tend to be positive, but it does not describe the range of behaviour seen in the scenarios particularly well, since individual ensemble members could show precipitation reductions even if the ensemble mean tends to be positive.
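As an illustrative sketch, the one-sided test of Ho: ΔµP = 0 against Ha: ΔµP > 0 can be carried out as a one-sample t-test on the ensemble of projected changes. The numbers below are made up for the example, not actual scenario output:

```python
import math

# Hypothetical projected changes in seasonal precipitation (%) for an
# 11-member ensemble; synthetic numbers, not actual scenario output.
delta_p = [4.2, -1.1, 6.3, 2.8, 5.0, 3.7, -0.4, 7.1, 2.2, 4.9, 3.3]

n = len(delta_p)
mean = sum(delta_p) / n
s = math.sqrt(sum((x - mean) ** 2 for x in delta_p) / (n - 1))  # sample std. dev.
t_stat = mean / (s / math.sqrt(n))  # test statistic t for Ho: mean change = 0

# One-sided 5% critical value of Student's t with n - 1 = 10 degrees of
# freedom (standard tabulated value).
t_crit = 1.812
reject = t_stat > t_crit

print(f"t = {t_stat:.2f}, reject Ho at the 5% level (one-sided): {reject}")
# Note: two ensemble members are negative even though Ho is rejected,
# illustrating that rejection says little about the range of behaviour.
```

The point of the closing comment is the one made in the text: rejecting Ho describes the ensemble mean, not the spread of individual members.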
However, our primary concern is with the more general case in which potential future change is described with an ensemble of climate change scenarios that are derived from multiple climate models and perhaps with multiple forcing scenarios. Such ensembles of opportunity encompass variation between ensemble members that is due not just to natural internal variability in the climate system as simulated by climate models, but also to model and forcing scenario differences. Here, an appropriate formulation of a null hypothesis and of the underlying random processes is much more of a challenge than in the previous simpler case.
To explain the concern we will consider two examples, one based on an ensemble of scenarios for future rainfall amount in Northern Germany, and a second based on an ensemble of stream flow projections for the Peace River in British Columbia, Canada. We begin with the rainfall projections as a typical example of projected change in a weather variable. To be specific, we consider n scenarios of possible future climate change at a given location for, say, seasonal summer rainfall amounts p1, ..., pn (as shown in Figure 1). Let us assume that the projected change pi > 0 in m cases and that pi < 0 in n − m cases. In a world without forced change it would be reasonable to expect equal numbers of increases and decreases. That is, we would expect the null hypothesis Ho: E(m/n) = 0.5 (with E() denoting the expectation of a random variable) to be true at all locations. If the pi are drawn independently (from an imaginary infinite population P of possible scenarios) and if Ho is true, then m has a binomial distribution B(m; n, 0.5), where the probability of each pi being positive is 0.5. This is a simple test (the sign test; [6]), but good for explaining concepts. If we make the necessary assumptions, then we can determine an m* so that ∑_{k ≤ m*} B(k; n, 0.5) ≤ 5%, and we would reject the no-change null hypothesis against the alternative hypothesis of precipitation decrease at less than or equal to the 5% significance level when the actual number m, derived from our specific sample, is less than or equal to m*. This is because m ≤ m* would be observed in 5% or fewer cases when the probabilities of negative and positive outcomes are equal. Figure 1 shows the number m of positive changes for Germany amongst the eleven scenarios available in the Norddeutscher Klimaatlas [8].
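The critical value m* can be computed directly from the binomial distribution; for the n = 11 case considered here it works out to m* = 2:

```python
from math import comb

def binom_cdf(m, n, p=0.5):
    """P(X <= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m + 1))

n = 11  # ensemble size, as in the Norddeutscher Klimaatlas example

# Largest m* with P(m <= m*) <= 5% under Ho: P(increase) = 0.5.
m_star = max(m for m in range(n + 1) if binom_cdf(m, n) <= 0.05)

print(f"m* = {m_star}, P(m <= m*) = {binom_cdf(m_star, n):.4f}")
# m* = 2: reject the no-change hypothesis when at most 2 of the 11
# members project an increase.
```

This matches the m ≤ 2 criterion used in the discussion of Figure 1 below: P(m ≤ 2) = 67/2048 ≈ 3.3%, while P(m ≤ 3) ≈ 11.3% already exceeds the 5% level.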
This ensemble of regional climate scenarios is composed of 11 high-resolution regional climate simulations, including 4 simulations from COSMO-CLM operating at ~20 km spatial resolution and driven with ECHAM5 global simulations (2×A1B, 2×B1). The change projected by the Norddeutscher Klimaatlas scenarios is almost everywhere "significant" at less than the 5% level (with m ≤ 2); only about 3 (green) boxes exhibit values of m = 3 or m = 4 that are "not significant".
Should it therefore be said that "the ensemble is significant for Germany"? Unfortunately, such a statistical interpretation is fraught with difficulty. One issue related to the application of the sign test, which is discussed increasingly in the climate science community, is whether the scenario members can be considered to be independent (cf. [18], [17], [1]). This is nevertheless a minor issue in relation to the broader question: what is the population of potential scenarios that we are referring to? In what sense is the ensemble p1, ..., pn representative? What is P? The difficulty is that the test makes an inference about non-observed cases, namely additional scenarios that could, in principle, be drawn from P by running additional climate models and considering additional forcing scenarios. However, we would be challenged to describe the statistical sampling strategy that led to the available ensemble [17].
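A small enumeration illustrates why the independence issue matters. Suppose, purely hypothetically, that the 11 members stem from only 4 underlying models and that, in the extreme, members sharing a model always agree in sign; the sign test's actual false-positive rate then far exceeds its nominal level:

```python
from itertools import product
from math import comb

# Hypothetical grouping: 11 ensemble members from 4 underlying models.
group_sizes = [4, 3, 2, 2]

# Extreme dependence: every member of a model shares the sign of that
# model's change, and each model's sign is a fair coin under Ho.
reject = 0
for signs in product([0, 1], repeat=len(group_sizes)):
    m = sum(s * g for s, g in zip(signs, group_sizes))  # number of increases
    if m <= 2:  # sign-test rejection region for n = 11 at the 5% level
        reject += 1
actual_rate = reject / 2 ** len(group_sizes)

# Nominal rate if all 11 members were independent: P(m <= 2) under B(11, 0.5).
nominal_rate = sum(comb(11, k) for k in range(3)) / 2 ** 11

print(f"actual false-positive rate: {actual_rate:.3f}")  # 0.188
print(f"nominal rate: {nominal_rate:.3f}")               # 0.033
```

With this assumed dependence structure the test rejects a true no-change hypothesis almost six times as often as advertised; real ensembles lie somewhere between the independent and fully dependent extremes.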
To explore this difficulty further, one might ask whether the inference for Germany applies to: a) "All scenarios": in which case, we would have to determine what is meant by "all". This presumably means all conceivable models, all conceivable emission scenarios, and all conceivable downscaling approaches, plus an understanding of how the available 11 scenarios were selected from that broad population. Moreover, this would need qualification: all emission scenarios that are deemed valid by contemporary economists, or only by followers of a certain school of thought? All climate models?
There is likely no way to make an assertion about "all scenarios", because that set is simply not definable. Nevertheless, there have been attempts to quantify uncertainty from models, forcing scenarios and downscaling, for example, using complex hierarchical Bayesian models (cf. [9], [10]), and serious thought has been given to the basis for the interpretation of statistics calculated from multimodel ensembles (cf. [12]).
b) "Scenarios based on a specific emission scenario": in which case we would want to make individual statements for each emission scenario. This is the approach that was used by the Intergovernmental Panel on Climate Change (IPCC) in its 4th Assessment Report [7].
c) "Scenarios based on a specific emission scenario from a restricted class of models": this is closer to being a tractable problem if the class of models is sufficiently restricted, albeit still a very large problem. For example, one might consider models that share the same code, but in which parameter settings have been varied systematically using a well designed sampling scheme, such as is used in the climateprediction.net project [2] (see also [11]).
These concerns of interpretation are not restricted to scenarios of future weather conditions; they extend to scenarios of more complex aspects of the future climate, such as future stream flow in large river basins. The possibility that future hydrologic regimes may change is of concern both for users of rivers and from an ecological perspective. Figure 2 shows a collection of projections of change in the annual hydrograph (flow regime) of the Peace River in British Columbia (BC) at the Bennett Dam, which generates a large proportion of BC's electricity.
These scenarios (see [13]) show that, for this system and gauging site, winter flow is projected to increase in all scenarios, that the timing of peak flow advances in most (but not all) scenarios, and that summer flow is reduced. These changes are consistent with warming and a winter precipitation increase in a basin that, in the present climate, is dominated by winter snow storage (i.e., a nival flow regime). Some aspects of these changes would be considered "significant" if the sign test were applied naively, as discussed above, but, as with precipitation change in Northern Germany, this would be misleading. The projected changes shown in Figure 2 are physically plausible, and do cover a considerable range of uncertainty, but clearly not all uncertainties (e.g., uncertainty that results from the choice of downscaling technique or hydrologic model is not represented in Figure 2 at all), and we are again challenged to understand the sampling process that led to the selection of these 23 scenarios from a hypothetical population of such scenarios.

Concluding remarks
The question arises as to why the term "significant" is used at all. Presumably, one reason is that statistical "significance" is often confounded with importance. The lay and statistical usages of "significance" are often mixed in the overall discourse on climate change. For many, this terminology may suggest a degree of certainty that is not, in fact, warranted. This shrouds the character of scenarios as scripts of possible future change, which should be used by stakeholders to examine possible consequences and possible countermeasures [14]. Among lay people, this may also contribute to the common conflation of the term "projections" with the term "predictions" [3], in spite of the careful distinctions made in IPCC reports.
Even if statistical testing were completely appropriate, the dependence of the power of statistical tests on the sample size n remains a limitation on interpretation: with a sufficiently large ensemble, even a practically negligible tendency towards, say, precipitation decrease would eventually be declared significant. We would suggest that the strength of discrimination, such as in recurrence analysis [15], is an important adjunct to testing for the identification of robust information.
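The dependence of power on n can be made concrete with the sign test itself. For a fixed, modest true tendency (here the assumed, hypothetical value P(increase) = 0.45), the probability of rejecting Ho grows towards 1 as the ensemble size grows, even though the tendency itself stays small:

```python
from math import comb

def sign_test_power(n, p_pos, alpha=0.05):
    """Power of the one-sided sign test (Ha: decrease) when the true
    probability of an increase is p_pos (a hypothetical value)."""
    # Critical value under Ho: largest m* with P(m <= m* | p = 0.5) <= alpha.
    cdf, m_star = 0.0, -1
    for m in range(n + 1):
        cdf += comb(n, m) * 0.5 ** n
        if cdf <= alpha:
            m_star = m
        else:
            break
    # Power: probability of observing m <= m* when the true p is p_pos.
    return sum(comb(n, k) * p_pos ** k * (1 - p_pos) ** (n - k)
               for k in range(m_star + 1))

for n in (11, 50, 200, 1000):
    print(n, round(sign_test_power(n, p_pos=0.45), 3))
```

For n = 11 the power is well below 10%, while for n = 1000 it exceeds 90%: "significance" here reflects ensemble size as much as the size of the change.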
Nevertheless, given the conceptual problems discussed above, it is perhaps best to simply express the state of knowledge in a descriptive manner such as the following: using n scenarios constructed with the models A, B, ..., emissions scenarios S1, S2, ..., and so on, we find that rainfall amounts decrease in most grid boxes for all scenarios, and that in the remaining few grid boxes, they decrease in most (72.3%) but not all scenarios. One might add that previously computed scenarios, possibly using much coarser grids and less advanced models, resulted in similar or consistent projections. Such an approach might be criticized in some quarters as underselling the utility of scenarios, but it is our view that there is greater risk in communicating quantified uncertainty when the basis for that quantification is not clear. We should be clear that they are scenarios, and not forecasts. Regardless of how the information in an ensemble of scenarios is communicated, their core utility is that they provide planning tools based on the principles of physics and on possible, plausible and internally consistent ideas (in the sense of Schwartz, 1991) about the future.
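Such a descriptive summary is straightforward to compute. The sketch below uses an assumed, synthetic 11 × 50 array of projected changes (not actual scenario output) to produce statements of the kind suggested above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: projected rainfall change (%) for 11 scenarios
# (rows) at 50 grid boxes (columns); made-up numbers, not actual
# scenario output. Negative values are decreases.
changes = rng.normal(loc=-3.0, scale=2.0, size=(11, 50))

frac_decreasing = (changes < 0).mean(axis=0)    # per grid box
all_decrease = (frac_decreasing == 1.0).mean()  # boxes where every member decreases
most_decrease = (frac_decreasing > 0.5).mean()  # boxes where a majority decreases

print(f"decrease in all 11 scenarios:        {all_decrease:.0%} of grid boxes")
print(f"decrease in a majority of scenarios: {most_decrease:.0%} of grid boxes")
```

The output is a plain description of the ensemble, with no claim about a hypothetical population of scenarios from which the members were drawn.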

Acknowledgments
Many thanks to Moritz Maneke of Norddeutsches Klimabüro, who supplied us with the regional scenarios.