Reliability and importance of structural diversity of climate model ensembles
Abstract
We investigate the performance of the newest generation multi-model ensemble (MME) from the Coupled Model Intercomparison Project (CMIP5). We compare the ensemble to the previous generation models (CMIP3) as well as several single model ensembles (SMEs), which are constructed by varying components of single models. These SMEs range from ensembles where parameter uncertainties are sampled (perturbed physics ensembles) through to an ensemble where a number of the physical schemes are switched (multi-physics ensemble). We focus on assessing reliability against present-day climatology with rank histograms, but also investigate the effective degrees of freedom (EDoF) of the fields of the variables, which makes the statistical test of reliability more rigorous, and consider the distances between the observation and ensemble members. We find that the features of the CMIP5 rank histograms, of general reliability on broad scales, are consistent with those of CMIP3, suggesting a similar level of performance for present-day climatology. The spread of MMEs tends towards being “over-dispersed” rather than “under-dispersed”. In general, the SMEs examined tend towards insufficient dispersion and the rank histogram analysis identifies them as being statistically distinguishable from many of the observations. The EDoFs of the MMEs are generally greater than those of SMEs, suggesting that structural changes lead to a characteristically richer range of model behaviours than is obtained with parametric/physical-scheme-switching ensembles. For the distance measures, the observations and model ensemble members are similarly spaced from each other for the MMEs, whereas for the SMEs, the observations generally lie well outside the ensemble. We suggest that multi-model ensembles should represent an important component of uncertainty analysis.
Keywords
Climate model · Multi-model ensembles · Reliability · Rank histogram · Degree of freedom · Perturbed physics ensembles

1 Introduction
Due to our lack of understanding of the climate system and limitations of computational power, climate models are far from perfect. The different models do, however, span a considerable range of output, which leads to the possibility of making probabilistic predictions of the future based on the models (Collins et al. 2012). How best to integrate ensembles of models into a probabilistic calculation is still a matter of debate. For example, one approach to generating probabilistic future predictions is to implement a weighting procedure based on the performance of the present-day climate simulation (e.g., Sexton et al. 2012). One of the prerequisites for implementation of such a method is that the ensemble employed should initially be broad enough to include the truth. Understanding the characteristics of the ensembles that have already been generated is an important step in this process. Here we build on earlier work investigating the reliability of climate model ensembles (e.g., Annan and Hargreaves 2010, hereafter AH10; Yokohata et al. 2012, hereafter Y12). The multi-model ensembles (MMEs) are made up of output from common experiments run by the world’s modelling centres. These models vary in construction and contain different parameterisations of climate processes and different methods for the numerical integration (different grids, numerical schemes etc.). No one model is better than all the others in all aspects (e.g., Gleckler et al. 2008). As such, we may consider the MME as sampling at least some of our uncertainties in how a climate model should be constructed. One such MME is the Coupled Model Intercomparison Project phase three (CMIP3, Meehl et al. 2007), which contributed to the fourth assessment report of the Intergovernmental Panel on Climate Change. Subsequently, a new phase of CMIP (CMIP5, Taylor et al. 2012) has been started.
This MME contains more models and new, hopefully improved, model versions of the older models, some with increased resolution and complexity (i.e., with additional feedbacks being prognostically modelled). The number of structurally distinct ensemble members (i.e., excluding initial condition ensembles) is increased in CMIP5 (Taylor et al. 2012), which should enable more robust conclusions to be drawn about the ensemble characteristics.
In addition to the MMEs, some modelling centres have, over the last decade, developed ensembles based on a single model (single model ensembles, SMEs). One kind of SME is a “perturbed physics” ensemble (PPE), in which uncertainties in model parameters are sampled (Murphy et al. 2004; Stainforth et al. 2005; Collins et al. 2006a; Webb et al. 2006; Annan et al. 2005a, b; Jackson et al. 2008; Sanderson 2011; Yokohata et al. 2010). Some new PPEs based on the newly developed models contributing to CMIP5 have recently been generated (Shiogama et al. 2012; Klocke et al. 2011). The first SMEs merely varied the values of parameters (which are single numbers in the model code), but more recently researchers have started to create ensembles with larger differences by switching between different sets of physical schemes. An ensemble created in this way has been termed a “multi-physics” ensemble (MPE) (Watanabe et al. 2012; Gettelman et al. 2012).
Here we investigate the reliability of the new CMIP5 ensemble and compare it to previous ensembles, both MMEs and SMEs. We use the rank histogram approach (AH10, Y12), which is often used in the field of numerical weather prediction (Jolliffe and Primo 2008, hereafter JP08). In previous work using these statistical tests (AH10, Y12 and Hargreaves et al. 2011), we were unable to reject the hypothesis of reliability for the CMIP3 MME for either the modern climate or the climate change of the Last Glacial Maximum. This gives us some confidence in the CMIP3 ensemble. Conversely, it was found that the SMEs were generally less reliable (Y12; Hargreaves et al. 2011), although it should be noted that no MPEs were analysed in those studies.
The methods for assessing reliability used in these previous analyses have some limitations. First, the statistical test of reliability depends on the number of independent observations, as discussed in JP08, but that number was assumed rather than calculated in the previous work. In AH10 and Y12, climatological mean fields of observations are compared with those of model ensemble members at each grid point. Since neighbouring grid points are not necessarily independent, it is not easy to know the number of independent observations in the fields, which corresponds to the “effective degrees of freedom” (EDoF). If the EDoF increases, the statistical test for reliability becomes stricter (JP08).
Second, in the rank histogram analysis presented in Y12, the number of bins in the rank histogram (which should naturally be the number of ensemble members plus one) was reduced to 11 throughout, for consistency with the number of ensemble members in the CMIP3 ensemble. This may reduce the power of the test if the rebinning smooths the histogram of the larger ensemble. In addition, the rank histogram of each climate variable was investigated separately in Y12, whereas creating multi-variate rank histograms allows the overall characteristics of the climate model ensembles to be investigated.
Third, the rank histogram does not provide information on the magnitude of model errors. In terms of model error, Y12 investigated only the relationship between the errors of ensemble mean and standard deviation of model ensemble members.
In this work, we address these issues, calculating the EDoF (using the formulation by Bretherton et al. 1999 as in Annan and Hargreaves 2011), exploring the effect of increasing the number of bins in the rank histogram, and calculating multi-variate rank histograms. In addition to the rank histogram we explore other ways of evaluating the ensemble, analysing the distances between models and observational data by calculating the minimum spanning trees (e.g., Wilks 2004) and the average of the distances between the observation and the models for all the ensembles.
In Sect. 2, the model ensembles of MMEs and SMEs and the methods of analysis are presented. The analysis methods include the calculation of the rank histogram and the statistical test for reliability (Sect. 2.2), the formulation of the EDoF (Sect. 2.3), and the distances between the observation and model ensemble members (Sect. 2.4). Results and discussion are presented in Sect. 3 and summarised in Sect. 4.
2 Model ensembles and methods of analysis
2.1 Climate model ensembles
Table 1 List of the CMIP5 multi-model ensemble

Model name | Institute | Reference |
---|---|---|
ACCESS1.0 | Commonwealth Scientific and Industrial Research Organization (CSIRO) and Bureau of Meteorology (BOM), Australia | http://wiki.csiro.au/confluence/display/ACCESS/ACCESS+Publications |
BCC-CSM1.1 | Beijing Climate Center, China Meteorological Administration | |
CanESM2 | Canadian Centre for Climate Modelling and Analysis | |
CCSM4 | National Center for Atmospheric Research | Gent et al. (2011) |
CNRM-CM5 | Centre National de Recherches Meteorologiques/Centre Europeen de Recherche et Formation Avancees en Calcul Scientifique | http://www.cnrm.meteo.fr/cmip5 (follow model description link) |
CSIRO-Mk3.6.0 | Commonwealth Scientific and Industrial Research Organization in collaboration with Queensland Climate Change Centre of Excellence | Rotstayn et al. (2010), http://cmip-pcmdi.llnl.gov/cmip5/ |
FGOALS-s2 | LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences | |
GFDL-CM3, GFDL-ESM2G, GFDL-ESM2M | NOAA Geophysical Fluid Dynamics Laboratory | |
GISS-E2-H, GISS-E2-R | NASA Goddard Institute for Space Studies | |
HadCM3, HadGEM2-CC, HadGEM2-ES | Met Office Hadley Centre | Collins et al. (2001), Smith et al. (2007), Smith et al. (2010), Jones et al. (2011), Martin et al. (2011), Collins et al. (2011), Bellouin et al. (2007), Collins et al. (2008), Johns et al. (2006), Martin et al. (2006), Ringer et al. (2006) |
INM-CM4 | Institute for Numerical Mathematics | Volodin et al. (2010) |
IPSL-CM5A-LR, IPSL-CM5A-MR, IPSL-CM5B-LR | Institut Pierre-Simon Laplace | |
MIROC4h, MIROC5, MIROC-ESM, MIROC-ESM-CHEM | Atmosphere and Ocean Research Institute (The University of Tokyo), National Institute for Environmental Studies, and Japan Agency for Marine-Earth Science and Technology | Sakamoto et al. (2012), Tatebe et al. (2012), Watanabe et al. (2010), Watanabe et al. (2011) |
MPI-ESM-LR, MPI-ESM-P | Max Planck Institute for Meteorology | Raddatz et al. (2007), Marsland et al. (2003) |
MRI-CGCM3 | Meteorological Research Institute | Yukimoto et al. (2011) |
NorESM1-M, NorESM1-ME | Norwegian Climate Centre | |
Table 2 List of the CMIP3 ensemble

Model | Institute | CMIP3-AO | CMIP3-AS | References |
---|---|---|---|---|
CCSM3 | National Center for Atmospheric Research | ○ | ○ | Collins et al. (2004), Smith and Gent (2004) |
CGCM3.1-T47, CGCM3.1-T63 | Canadian Centre for Climate Modelling and Analysis | ○ | ○ | McFarlane et al. (1992), Flato (2005), Pacanowski et al. (1993) |
CNRM-CM3 | Centre National de Recherches Meteorologiques/Centre Europeen de Recherche et Formation Avancees en Calcul Scientifique | ○ | | Salas-Mélia et al. (2005) |
ECHAM5/MPI-OM | Max Planck Institute for Meteorology | ○ | ○ | Roeckner et al. (2003), Marsland et al. (2003), Haak et al. (2003) |
ECHO-G | | ○ | | Roeckner et al. (1996), Legutke and Maier-Reimer (1999), Min et al. (2004) |
FGOALS-g1.0 | LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences | ○ | | Yu et al. (2002), Yu et al. (2004) |
GFDL-CM2.0, GFDL-CM2.1 | NOAA Geophysical Fluid Dynamics Laboratory | ○ | ○ | Delworth et al. (2006), Gnanadesikan et al. (2006), Wittenberg et al. (2006), Stouffer et al. (2006) |
IPSL-CM4 | Institut Pierre-Simon Laplace | ○ | | Marti et al. (2006) |
MIROC3-Hi, MIROC3-Med | Atmosphere and Ocean Research Institute (The University of Tokyo), National Institute for Environmental Studies, and Japan Agency for Marine-Earth Science and Technology | ○ | ○ | K-1 model developers (2004) |
MRI-CGCM | Meteorological Research Institute | ○ | ○ | Shibata et al. (1999), Yukimoto et al. (2001) |
PCM | | ○ | | Washington et al. (2000) |
UKMO-HadCM3 | Met Office Hadley Centre | ○ | | Gordon et al. (2000), Pope et al. (2000) |
UKMO-HadGEM1 | Met Office Hadley Centre | ○ | ○ | Martin et al. (2004), Roberts (2004) |
In the present study, we also create a CMIP5+CMIP3-AO ensemble, which simply combines CMIP5-AO and CMIP3-AO. The number of CMIP5+CMIP3-AO ensemble members is 44. In this combined ensemble, we make no adjustment or allowance for the possibility that some models may be particularly closely related to one another, for example consecutive generations from a single modelling centre. Such issues are of course a major topic, but they are beyond the scope of this work (e.g., Masson and Knutti 2011).
Table 3 List of single-model ensembles

Ensemble | Experiment | Model | Number of parameters perturbed | Number of ensemble members | References of model and ensembles |
---|---|---|---|---|---|
HadCM3-AO | 20th century by AOGCM | HadCM3 | 31 | 17 | Gordon et al. (2000), Murphy et al. (2007), Collins et al. (2006a) |
HadSM3-AS | Control by ASGCM | HadSM3 | 31 | 128 | Pope et al. (2000), Webb et al. (2006), Yokohata et al. (2010) |
NCAR-A | Control by AGCM | CAM3.1 | 15 | 100 | Collins et al. (2006b), Jackson et al. (2004), Jackson et al. (2008) |
MIROC5-AO | Control by AOGCM | MIROC5 | 10 | 36 | Watanabe et al. (2010), Shiogama et al. (2012) |
MIROC3-AS | Control by ASGCM | MIROC3.2 | 13 | 32 | K-1 model developers (2004), Yokohata et al. (2010) |
MIROC-MPE-A | Control by AGCM | MIROC3.2 and MIROC5^{a} | Physical schemes switched^{a} | 8 | K-1 model developers (2004), Watanabe et al. (2010), Watanabe et al. (2012) |
HadCM3-AO and HadSM3-AS were created in the Quantifying Uncertainty in Modelling Predictions (QUMP) project. The atmospheric components of HadCM3 and HadSM3 are identical, with a horizontal resolution of 2.5° latitude by 3.75° longitude and 19 vertical levels. The ocean component of HadCM3 has a resolution of 1.25° × 1.25° with 20 levels. In HadSM3, a motionless 50 m slab ocean is coupled to the atmospheric model and the ocean heat transport is diagnosed for each member.
See Y12 and references therein for further details on the construction of the HadSM3-AS (Murphy et al. 2004; Webb et al. 2006), HadCM3-AO (e.g., Collins et al. 2010), NCAR-A (Jackson et al. 2004, 2008), and MIROC3-AS (Annan et al. 2005a, b; Yokohata et al. 2010) ensembles. Here we outline the main features of the construction of the two new SMEs, MIROC5-AO and MIROC-MPE-A. These SMEs were constructed within the Japan Uncertainty Modelling Project (JUMP). For MIROC5-AO, Shiogama et al. (2012) devised a method to create an ensemble with an atmosphere–ocean coupled model without flux correction. This ensemble is based on a new version of MIROC developed for the CMIP5 project, whose physical schemes are more sophisticated and whose performance is improved relative to the former version (Watanabe et al. 2010). The atmospheric component of MIROC5 used in this study has T42 (about 300 km grid) horizontal resolution, whereas the original version of MIROC5 has T85 (about 150 km grid) resolution, with 40 vertical levels. The ocean component model has approximately 1° horizontal resolution and 49 vertical levels with an additional bottom boundary layer. Using results from AGCM experiments, Shiogama et al. (2012) chose sets of parameter values for which the energy budget at the top of the atmosphere was predicted to be close to zero, so that these members would not exhibit climate drift, and then ran AOGCM models with these parameter sets. The number of ensemble members in MIROC5-AO is 36.
Although the climate sensitivity of MIROC3.2 is relatively high compared to other CMIP3 models at 4.0 K (Yokohata et al. 2008), that of MIROC5 is substantially lower at 2.6 K (Watanabe et al. 2010). Since the differences in the response to CO2 increase are caused by changing model physical schemes, Watanabe et al. (2012) created a “multi-physics” ensemble (MPE) by switching physical schemes of MIROC3.2 to those of MIROC5. The MPE uses a full factorial design in which three schemes were switched: vertical diffusion, cloud microphysics, and cumulus convection. Including the two control models, there are therefore 2³ = 8 simulations in total.
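The full factorial design described above can be sketched as follows (the scheme labels are illustrative shorthand, not the models' actual configuration names):

```python
from itertools import product

# Hypothetical sketch: enumerate the 2^3 = 8 members of a multi-physics
# ensemble in which each of three schemes is taken from either MIROC3.2
# or MIROC5. The two all-one-model combinations are the control models.
schemes = ["vertical_diffusion", "cloud_microphysics", "cumulus_convection"]

members = [dict(zip(schemes, choice))
           for choice in product(["MIROC3.2", "MIROC5"], repeat=len(schemes))]

print(len(members))  # 8 members, including the two control models
```

The two members in which all three schemes come from the same model reproduce the MIROC3.2 and MIROC5 controls; the other six are hybrids.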
2.2 Reliability and rank histogram of model ensembles
In the present study, we follow the same philosophy in the definition of reliability and interpretation of the rank histogram as Y12, which is analogous to how it is commonly used in numerical weather prediction. The definition of the term “reliable” in this study is as follows: the ensemble is reliable if the observational data can be considered as having been drawn from the distribution defined by the model ensemble. That is, the null hypothesis of a uniform rank histogram is not rejected (JP08). Of course, in reality, creation of a perfect ensemble is impossible, so with enough data and ensemble members, all ensembles may be found to be unreliable at some level. What we are really testing here is whether the ensembles may be shown to be unreliable for the metrics of interest. Investigating the spatial scale at which the ensembles become unreliable is an interesting topic for future work, but is outside the scope of this paper (Sakaguchi et al. 2012).
Since the data are historical, the analysis here is essentially that of a hindcast, and since some of these data may have been used during model construction and tuning, it is debatable to what extent they can be considered to provide validation of the models. Furthermore, the relationship between current performance and prediction of future climate change remains unclear (e.g., Abe et al. 2009; Knutti 2010, Shiogama et al. 2011). Thus, reliability over a hindcast interval is not necessarily a sufficient condition to demonstrate that the model forecasts are good (Y12). On the other hand, it is clearly preferable that an ensemble should account for sufficient uncertainties to provide a reliable depiction of reality. Where an ensemble is not reliable in this sense, it must raise some doubts as to how credible it is as a representation of uncertainties in the climate system.
The method for calculating the rank histograms in this study is the same as that described in AH10 and Y12, and involves constructing rank histograms for the gridded mean climatic state of the model ensembles for the present-day climate with respect to various observational data sets. We use nine climate variables: surface air temperature (SAT), sea level pressure (SLP), precipitation (rain), and the top-of-atmosphere (TOA) shortwave (SW) and longwave (LW) full-sky radiation, clear-sky radiation (CLR, the radiative flux where clouds do not exist), and cloud radiative forcing (CRF, the radiative effect of clouds, diagnosed from the difference between full-sky and clear-sky radiation; Cess et al. 1990).
We consider uncertainties in the observations by using two independent datasets, listed in Table 3 of Y12. As in Y12, we used the point-wise difference between each pair of data sets as an indication of observational uncertainty, although this is likely to be somewhat of an underestimate of the true error.
In addition to the mean climate states, we evaluated the long-term trends in the historical experiments of CMIP5-AO, CMIP3-AO, and HadCM3-AO. Because its trend can be robustly attributed to external forcing, we evaluate the long-term trend of SAT over the last 40 years (1960–1999). We do not investigate the twentieth century trends of precipitation, SLP, or TOA radiation because the interannual-to-decadal variability of these variables is generally large, and there are large uncertainties, and sometimes artificial trends, in the observations owing to the difficulty of measuring these variables (Trenberth et al. 2007).
The features of the rank histogram can be interpreted as follows. If a model ensemble were perfect, such that the true observed climatic variable could be regarded as indistinguishable from a sample of the model ensemble, then the rank of each observation would lie with equal probability anywhere in the model ensemble, and thus the rank histogram should have a uniform distribution (subject to sampling noise). On the other hand, if the distribution of a model ensemble is relatively under-dispersed, such that the ensemble spread does not capture reality, then the observed values will lie towards the edge or outside the range of the model ensemble, and the rank histogram will form an L- or U-shaped distribution. An ensemble with a persistent bias, either too high or too low, may either have a trend across the bins, or a strong peak in one end bin if the bias is sufficiently large. If the histogram has a domed shape with the highest values towards the centre, then the ensemble is over-dispersed: it is broader than an ensemble from which the observations would be statistically indistinguishable.
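The basic rank histogram construction can be sketched in a few lines (a simplified illustration, not the authors' code; representing observational uncertainty by adding noise of that magnitude to the ensemble members is an assumption about the implementation details):

```python
import numpy as np

def rank_histogram(obs, ensemble, obs_err=None, rng=None):
    """Rank of each observed grid-point value within the ensemble.

    obs      : (n_points,) observed climatological field
    ensemble : (n_ens, n_points) ensemble climatologies
    obs_err  : optional (n_points,) observational error std; noise of this
               magnitude is added to the members so the comparison accounts
               for observational uncertainty (assumed detail).
    Returns counts over the n_ens + 1 possible ranks.
    """
    rng = np.random.default_rng(rng)
    ens = ensemble.astype(float).copy()
    if obs_err is not None:
        ens += rng.normal(0.0, obs_err, size=ens.shape)
    # Rank = number of ensemble members lying below the observation (0..n_ens)
    ranks = (ens < obs).sum(axis=0)
    return np.bincount(ranks, minlength=ens.shape[0] + 1)
```

For a reliable ensemble, the returned counts should be approximately uniform across the n_ens + 1 bins; L-, U- and dome shapes are diagnosed as described above.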
In Y12, a value of 10 was used for n_{obs}, based on the estimate by Annan and Hargreaves (2011), in which the EDoF of the SAT, SLP and rain fields of the CMIP3 ensemble ranges from 4 to 11. However, the effective degrees of freedom may differ among model ensembles, and the statistical test for uniformity also depends on n_{obs}. In the present study, therefore, we estimate the effective degrees of freedom with the method of Bretherton et al. (1999), which is described in the next section.
As described in JP08, under the null hypothesis of a uniform underlying distribution, the Chi square statistic for the full distribution is sampled from approximately a Chi square distribution with (k − 1) degrees of freedom. Using a table of the Chi square distribution and the value of T in Eq. (1), we can calculate the p value and reject the hypothesis of uniform distribution if the p value is smaller than the level of significance. Similarly, each of the components such as bias, V-shape, ends, left-ends, and right-ends calculated by the formulation of JP08, should have an approximate Chi square distribution with one degree of freedom. We can also estimate the p value of these components and test the hypothesis of a uniform distribution.
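The test and a JP08-style decomposition might be sketched as follows (a hedged illustration, assuming the bin counts have already been scaled so that they sum to the number of independent observations; only the linear "bias" and quadratic "V-shape" contrasts are shown, not the full set of JP08 components):

```python
import numpy as np
from scipy.stats import chi2

def jp08_test(counts):
    """Chi square test of rank-histogram uniformity with a JP08-style
    decomposition into 'bias' (linear) and 'V-shape' (quadratic) components.

    counts : observed bin counts of the rank histogram (length k).
    Returns a dict of p values; a small p value rejects uniformity.
    """
    counts = np.asarray(counts, dtype=float)
    k, n = counts.size, counts.sum()
    e = n / k                           # expected count per bin under uniformity
    dev = counts - e
    T = (dev ** 2 / e).sum()            # overall statistic, ~ chi2(k - 1)

    x = np.arange(k) - (k - 1) / 2.0
    lin = x / np.sqrt((x ** 2).sum())   # orthonormal linear contrast (bias)
    quad = x ** 2 - (x ** 2).mean()
    quad /= np.sqrt((quad ** 2).sum())  # orthonormal quadratic contrast (V-shape)

    t_bias = (lin @ dev) ** 2 / e       # each component ~ chi2(1)
    t_v = (quad @ dev) ** 2 / e
    return {"overall": chi2.sf(T, k - 1),
            "bias": chi2.sf(t_bias, 1),
            "v_shape": chi2.sf(t_v, 1)}
```

A perfectly flat histogram gives p = 1 for every component; a histogram with all the mass in one end bin yields very small overall and bias p values.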
2.3 Effective degrees of freedom of model ensembles
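The estimate used in this study follows Bretherton et al. (1999), as noted in Sect. 1: the EDoF is approximated by matching the first two moments of the EOF variance spectrum, EDoF ≈ (Σλ_i)² / Σλ_i², where λ_i are the eigenvalues of the spatial covariance matrix. A minimal sketch (the function name and the area-weighting detail are illustrative assumptions, not the paper's code):

```python
import numpy as np

def effective_dof(field, weights=None):
    """Moment-matching estimate of the effective degrees of freedom
    (Bretherton et al. 1999): (sum lam)^2 / sum lam^2, where lam are the
    eigenvalues of the covariance matrix (the EOF variances).

    field   : (n_ens, n_points) ensemble of climatological fields
    weights : optional (n_points,) area weights (assumed detail)
    """
    x = field - field.mean(axis=0)        # anomalies about the ensemble mean
    if weights is not None:
        x = x * np.sqrt(weights)          # area-weighted covariance
    s = np.linalg.svd(x, compute_uv=False)
    lam = s ** 2                          # eigenvalues of the covariance matrix
    return lam.sum() ** 2 / (lam ** 2).sum()
```

If all the variance is carried by a single EOF, the estimate tends to 1; if it is spread equally over m orthogonal modes, the estimate tends to m, consistent with the behaviour discussed in Sect. 3.2.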
2.4 Distances between observation and model ensemble members
The rank histogram analysis discussed in Sect. 2.2 considers only the rank ordering of models and observations, and thus information on the distances between the observation and ensemble members is missing. It also takes an intrinsically univariate and scalar viewpoint of the data, considering each observation independently of the others. An alternative approach, based on minimum spanning trees (Wilks 2004), handles multidimensional data sets directly and also considers the distance between ensemble members and the observations. Therefore, we also investigate our ensembles using this approach, which we now briefly describe. We consider a 2D data field, and the equivalent output field from each ensemble member, as points (“nodes”) in a high dimensional space, with the length of the “edge” or line segment between each pair of them defined as the area-weighted RMS difference. In graph theory, a tree is a set of n − 1 edges which collectively connect n nodes, and if each edge is assigned a length, then a minimum spanning tree (MST) is a tree of minimum total length (which is unique if all the pairwise distances differ).
Once the pair-wise distances between observation and model ensemble members defined in Eq. (4) are obtained, the MST for any set of nodes, and its total length (i.e., the sum of the lengths of its edges) can be readily generated using a standard algorithm. Here, in order to understand the relationship of the distances between the ensemble members and those between observation and ensemble members, leave-one-out analysis as described in Wilks (2004) is performed. First, the MST for the nodes excluding the observations, namely the MST for the model ensemble members, defined as M(0), is calculated. Then, the MSTs in which the observational data is used to replace each ensemble member in turn from 1 to n_{ens}, defined as M(k) for k = 1 to n_{ens} is calculated. Finally the rank of the total length of M(0) among those of M(k) for k = 1 to n_{ens} is evaluated. Here, the rank is defined as one if the M(0) has the smallest total length. If the observations were drawn from the same distribution as the ensemble, then the length of the M(0) should be indistinguishable from the lengths of the other M(k). If, however, the observations are relatively distant from the ensemble, then the M(0) will be shorter than the M(k). Given a sufficiently large number of observational data sets, the histogram of the ranks of the associated MSTs can be generated (the MST rank histogram) but, since we only have a small number of data fields, we prefer to examine the ranks on an individual basis in Sect. 3.1.
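The leave-one-out procedure can be sketched with standard library routines (a simplified illustration under stated assumptions: the function name is hypothetical, and the normalised area weighting inside the RMS distance is an assumed detail):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_rank(obs, ensemble, weights=None):
    """Leave-one-out MST rank (Wilks 2004).

    Builds M(0), the MST of the ensemble members alone, and M(k), the MSTs
    in which the observation replaces member k, and returns the rank of the
    length of M(0) among all the lengths (rank 1 means M(0) is shortest,
    i.e. the observation lies outside the ensemble).
    """
    w = np.ones(obs.size) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    nodes = np.vstack([ensemble, obs])          # observation is the last node
    # Pairwise area-weighted RMS distances between all nodes
    d = squareform(pdist(nodes * np.sqrt(w)))
    n_ens = ensemble.shape[0]

    def mst_len(idx):
        return minimum_spanning_tree(d[np.ix_(idx, idx)]).sum()

    base = mst_len(list(range(n_ens)))          # M(0): members only
    lens = [mst_len([j for j in range(n_ens + 1) if j != k])
            for k in range(n_ens)]              # M(k): obs replaces member k
    return 1 + sum(l < base for l in lens), base, lens
```

When the observation sits far outside a tight ensemble, every M(k) must include one long edge to the observation, so M(0) is the shortest tree and the rank is 1, matching the starred entries discussed in Sect. 3.4.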
3 Results and discussions
3.1 Rank histogram of model ensembles
As Fig. 1 shows, the difference between the MMEs (red) and SMEs (blue) is striking. The rank histograms for the MMEs are generally dome-shaped, while those of the SMEs are U-shaped (with large peaks at the highest and lowest ranks). This means that in the SMEs there are large areas where either all of the ensemble members underestimate the observation (the peak at the lowest rank) or all of them overestimate it (the peak at the highest rank). This result is similar to that shown in Y12.
As found in Y12, the histograms of the SMEs tend to have peaks at the highest and lowest ranks, but the details vary between model ensembles and variables. In general, the histograms of climate variables related only to dynamical processes (SLP, SW clear-sky radiation) tend to be U-shaped in the SMEs, possibly because model parameters related to dynamical processes are not generally perturbed in the SMEs. It is interesting to note that the peaks at the highest and lowest ends for MIROC-MPE-A are smaller than those of MIROC5-AO and MIROC3-AS. For example, the histograms of the LW radiation are not U-shaped in MIROC-MPE-A. As described in Sect. 2.1, MIROC-MPE-A is constructed by replacing model schemes for cloud physics, vertical diffusion etc. (Watanabe et al. 2012). Therefore, it is reasonable to expect that MIROC-MPE-A has more structural diversity than the ensembles based on its parent models, MIROC5-AO and MIROC3-AS, which would lead to the rank histograms for the MPE generally being closer to a flat distribution.
In Y12, the statistical test for reliability was performed by assuming that the effective degrees of freedom, n_{obs} in Eq. (1), is 10. In the present study, however, we estimate n_{obs} using the EOF analysis explained in the next section, and then perform the statistical tests in Sect. 3.3.
3.2 Effective degree of freedom of model ensembles
As shown in Fig. 3, the EDoF of the model ensembles increases with increasing number of ensemble members, appearing to asymptote to a relatively small value for some ensembles, but continuing to increase in other cases. The SMEs tend to exhibit systematically lower EDoF than the MMEs, with the exception of the MIROC3-AS SME. This analysis suggests that parametric variation is generally less effective than structural changes in spanning a diverse range of climatological behaviour.
3.3 Reliability of model ensembles from statistical tests of rank histogram
Table 4 The minimum p values of the Chi square statistics calculated from the rank histograms

Value | CMIP5+CMIP3-AO | CMIP5-AO | CMIP3-AO | CMIP3-AS | HadCM3-AO | HadSM3-AS | NCAR-A | MIROC5-AO | MIROC3-AS | MIROC-MPE-A |
---|---|---|---|---|---|---|---|---|---|---|
# of ens | 44 | 28 | 16 | 10 | 17 | 128 | 100 | 36 | 32 | 8 |
Overall | 0.1666 | 0.2187 | 0.2444 | 0.5025 | 0.0499 | 0.0021 | 0.0000 | 0.0000 | 0.0000 | 0.1529 |
SAT | 0.1397 | 0.1973 | 0.1126 | 0.0744 | 0.5760 | 0.2447 | 0.0000 | 0.0089 | 0.0715 | 0.0318 |
Rain | 0.1543 | 0.1520 | 0.2712 | 0.3895 | 0.0421 | 0.0065 | 0.0000 | 0.0000 | 0.4034 | 0.0474 |
SLP | 0.3705 | 0.3251 | 0.5283 | 0.4251 | 0.0000 | 0.0032 | 0.0000 | 0.0000 | 0.0011 | 0.0001 |
SW Net | 0.0612 | 0.1220 | 0.0595 | 0.1735 | 0.7786 | 0.2361 | 0.0001 | 0.0000 | 0.0000 | 0.2237 |
LW Net | 0.1350 | 0.0685 | 0.1739 | 0.3324 | 0.5814 | 0.2361 | 0.0003 | 0.0022 | 0.0096 | 0.7967 |
SW CRF | 0.2648 | 0.2985 | 0.2486 | 0.4832 | 0.8193 | 0.7722 | 0.0000 | 0.0000 | 0.0000 | 0.2468 |
LW CRF | 0.3430 | 0.2498 | 0.3248 | 0.3590 | 0.5973 | 0.0920 | 0.0011 | 0.0379 | 0.0829 | 0.6906 |
SW CLR | 0.1845 | 0.1672 | 0.3413 | 0.1861 | 0.0005 | 0.0000 | 0.0000 | 0.0005 | 0.0000 | 0.0052 |
LW CLR | 0.2338 | 0.2738 | 0.2795 | 0.4113 | 0.2557 | 0.4781 | 0.3169 | 0.4899 | 0.3190 | 0.6425 |
SAT trend | 0.4565 | 0.5968 | 0.3194 | NA | 0.3356 | NA | NA | NA | NA | NA |
# of p < 0.05 | 0 | 0 | 0 | 0 | 3 | 3 | 8 | 8 | 5 | 4 |
The p values using the nine climate variables (denoted as “Overall” in Table 4) of the MMEs are larger than the threshold (0.05, i.e., a significance level of 5 %), which means that, according to this analysis, these ensembles have not been shown to be unreliable. Although their rank histograms appear somewhat domed, they are acceptably close to the uniform distribution. On the other hand, the p values of many of the results from the SMEs are smaller than the threshold. These ensembles can then be said to be unreliable because their rank histograms are significantly different from the uniform distribution. The U-shaped characteristic of the SME histograms indicates that these ensembles are under-dispersed.
Among the SMEs, the p value of HadCM3-AO is almost on the threshold (0.05), and MIROC-MPE-A is larger than this threshold. One possible reason for the relatively good performance of these models in the statistical test is that the number of ensemble members (i.e., number of bins in Fig. 1) is small, as discussed above. Another possible reason for the reliability of MIROC-MPE-A compared to the other SMEs might be that the multi-physics ensemble has more structural diversity compared to the original MIROC5-AO or MIROC3-AS, and thus it is sufficiently diverse to span the observations.
In Table 4, the p values of the nine climate variables (plus the SAT trend for the ensembles performing the historical simulation) are also shown. For the MMEs, the number of climate variables with p values smaller than the threshold is zero; that is, none of the variables provides evidence that these MMEs are unreliable. For the SMEs, on the other hand, the reliability varies between climate variables and model ensembles. HadCM3-AO and HadSM3-AS perform relatively well compared to the other ensembles (four of the p values are less than 0.05).
Conversely, the p values of all the SMEs apart from MIROC-MPE-A fall below the threshold, the U-shaped histograms indicating that these ensembles are under-dispersed. Only the p value of HadCM3-AO is sensitive to small changes in the EDoF, as a slight decrease in the EDoF would put it above the threshold.
The reason for the tendency towards a dome shape in the MMEs is unclear. Y12 describes how tuning an ensemble to observations will tend to centralise it on them (meaning that the distance from the ensemble mean to the observations, normalised by the ensemble spread, will shrink). Thus, tuning to modern observations might tend to result in a domed rank histogram if the untuned ensemble had a flat distribution. However, several of the SMEs have certainly been tuned to observations without this phenomenon occurring; they are instead under-dispersed. Moreover, there seems to be no direct way to measure the extent to which such tuning has been explicitly or implicitly performed for the MMEs, or for which climatic variables.
We should note that the rank histogram technique is typically used in the field of numerical weather prediction, where far more observations and simulations are available (and thus the effective degrees of freedom are greater) than in the present work. The rank histogram results shown here may therefore be less conclusive. For this reason, in the next section we investigate the relationship between the observations and the model ensembles based on the distances between them.
3.4 Distances between observation and model ensembles
Since the rank histogram discussed above evaluates only the rank ordering of the observation amongst the model ensemble members, we also investigate the distances of the model ensembles from the observations in several ways. First, we calculate minimum spanning trees (MSTs) by removing the observation and each ensemble member in turn, as described in Sect. 2.3. This procedure yields N + 1 MSTs, where N is the ensemble size. If the ensemble members are collectively far away from the observation (compared to their distances from each other), then the MST omitting the observation is shorter than the MSTs omitting individual ensemble members. With only a small number of data sets, we do not explicitly form a rank histogram and test for non-uniformity; instead, we examine the rank of the observation-omitting MST for each variable in turn and consider whether it lies at the extreme end of the set of MST lengths.
Table 5 Rank of the minimum spanning tree (MST) computed without the observation, among the N + 1 MSTs obtained by removing the observation or each ensemble member in turn

| Value | CMIP3+5-AO | CMIP5-AO | CMIP3-AO | CMIP3-AS | HadCM3-AO | HadSM3-AS | NCAR-A | MIROC3-AS | MIROC5-AO | MIROC-MPE-A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # of ens | 44 | 28 | 16 | 10 | 17 | 128 | 100 | 32 | 36 | 8 |
| Overall | 16 | 12 | 10 | 5 | 1* | 2* | 1* | 1* | 1* | 1* |
| SAT | 10 | 8 | 6 | 6 | 1* | 2* | 1* | 1* | 1* | 1* |
| Rain | 17 | 7 | 14 | 8 | 1* | 4* | 1* | 5 | 1* | 1* |
| SLP | 21 | 14 | 15 | 8 | 1* | 2* | 2* | 1* | 1* | 1* |
| SW Net | 21 | 14 | 15 | 7 | 3 | 5* | 1* | 1* | 1* | 1* |
| LW Net | 21 | 12 | 13 | 4 | 1* | 4* | 1* | 2 | 1* | 2 |
| SW CRF | 14 | 11 | 11 | 5 | 3 | 5* | 1* | 1* | 1* | 1* |
| LW CRF | 17 | 12 | 12 | 4 | 2 | 3* | 1* | 2 | 1* | 2 |
| SW CLR | 20 | 12 | 14 | 4 | 1* | 1* | 1* | 1* | 1* | 1* |
| LW CLR | 11 | 7 | 11 | 6 | 1* | 2* | 1* | 1* | 1* | 1* |
| SAT trend | 10 | 8 | 6 | NA | 5 | NA | NA | NA | NA | NA |
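The leave-one-out MST procedure underlying these ranks can be sketched as follows. This is a minimal illustration assuming plain Euclidean distances between flattened fields; the paper's actual distance measure may include area weighting and normalisation, and `mst_rank_of_observation` is an illustrative name, not a function from the study.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_rank_of_observation(fields):
    """Rank of the MST length obtained by dropping the observation
    (row 0) among the MST lengths obtained by dropping each of the
    n + 1 data sets (observation plus n members) in turn.

    fields : array of shape (n + 1, m); row 0 is the observation,
             rows 1..n the ensemble members, each flattened to m values.
    """
    dist = squareform(pdist(fields))  # pairwise Euclidean distances
    n_total = len(fields)
    lengths = []
    for leave_out in range(n_total):
        keep = np.delete(np.arange(n_total), leave_out)
        sub = dist[np.ix_(keep, keep)]
        lengths.append(minimum_spanning_tree(sub).sum())
    # Rank 1 means the MST without the observation is the shortest,
    # i.e. the members cluster together away from the observation.
    return int(np.sum(np.array(lengths) < lengths[0])) + 1
```

A rank of 1, as found for almost every variable in the SMEs, corresponds to the degenerate case in which removing the observation shortens the tree more than removing any single ensemble member.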
In Fig. 6, the circles indicate the average error of the model ensemble members. For climate variables such as SAT, SLP, and SW and LW clear-sky radiation, shown in Fig. 6(1), (3), (6), and (9), the average errors of the MMEs and SMEs are similar, but the distances between ensemble members are larger in the MMEs than in the SMEs. As discussed in the rank histogram analysis, the insufficient diversity of the SMEs may be related to the fact that parameters in dynamical processes were not perturbed in the SMEs.
4 Summary
- 1.
The rank histograms using all the climate variables of the MMEs tend towards being dome-shaped, with a peak around the middle rank, while those of the SMEs are U-shaped, with strong peaks at the highest and lowest ranks (Fig. 1). This indicates that the spread of the MMEs tends towards being "over-dispersed", in that the rank of the observations generally stays close to the middle of the range, while the SMEs tend to be "under-dispersed", in that the ensemble members often collectively overestimate or underestimate the observation. Even though the over-dispersion of the MMEs does not reach the level of statistical significance, the similarity of CMIP5 to CMIP3 (Figs. 1, 2) suggests that it has arisen as a consequence of the way in which the diverse range of models has been constructed, rather than merely by chance.
- 2.
The EDoF of the model ensembles is calculated for a range of ensemble sizes (Figs. 3, 4), and we find that the MMEs generally have larger EDoF than the SMEs. One of the SMEs, MIROC3-AS, has EDoF similar to the MMEs. The method used to sample the parameters may affect the resultant EDoF in the PPEs.
- 3.
Using the EDoF formulated in Eq. (3), a statistical test of the reliability of the model ensembles is performed (Table 4). Multi-variate histograms using all the climate variables ("Overall" in Table 4) indicate that the rank histograms of the MMEs are not significantly different from the uniform distribution; thus, with respect to this analysis, the MMEs may be considered reliable. On the other hand, the rank histograms of the SMEs, except that of MIROC-MPE-A, are U-shaped and significantly different from the uniform distribution, indicating that they are under-dispersed (see Fig. 1). These results suggest that structural diversity is important for the ensemble spread to encompass the observations. The large EDoF of the MMEs should contribute to their reliability.
- 4.
The dependence of reliability on the EDoF is also investigated (Fig. 5). The MMEs, which tend towards being over-dispersed, remain reliable even when the EDoF is increased by about a factor of two. Most of the SMEs are also robustly under-dispersed, but HadCM3-AO could be considered reliable if its EDoF has been slightly overestimated. The rank histogram of MIROC-MPE-A is not statistically different from the uniform distribution. This may be because the number of ensemble members is small, which makes the statistical test less powerful, and also because "multi-physics" ensembles can sample structural uncertainties to some extent by switching physical schemes (Watanabe et al. 2012).
- 5.
MSTs (the minimum sum of distances between ensemble members; Table 5) and the average distances between the observations and the model ensemble members (Fig. 6) are calculated. In the MMEs, the distances between ensemble members are not different from those between the observation and the ensemble members. By contrast, the distances between ensemble members in the SMEs are smaller than those between the observation and the ensemble members. These results are consistent with the rank histogram analysis, in which the spread of the MMEs includes the observation but that of the SMEs does not.
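The spread-versus-error comparison summarised in this item can be illustrated with a small sketch, again assuming plain Euclidean distances on flattened fields (the paper's measure may include weighting and normalisation); `spread_vs_error` is an illustrative name.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def spread_vs_error(obs, members):
    """Compare member-to-member spread with observation-to-member error.

    obs     : flattened observed field, shape (m,)
    members : ensemble fields, shape (n, m)
    Returns (mean pairwise distance among members,
             mean distance from the observation to the members).
    """
    spread = pdist(members).mean()
    error = cdist(obs[None, :], members).mean()
    return spread, error
```

For an MME-like ensemble the two numbers are comparable, whereas an under-dispersed SME-like ensemble has an error that greatly exceeds its spread.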
It should be noted that the SMEs examined here were not explicitly designed to be reliable according to the rank histogram metric, although they were designed with some expectation that each member of the ensemble would verify well against a basket of observations. It would be an interesting endeavor to set out to produce a reliable PPE or MPE and to design a perturbation algorithm accordingly. As shown in Collins et al. (2010), the algorithm for parameter perturbations in a PPE does influence the diversity of the mean climates and trends seen in each member, suggesting that such an endeavor might be possible, perhaps using some iterative algorithm. Such challenges remain a subject of future research.
On the other hand, since our analysis shows that the MMEs are reliable when compared to the subset of observational fields examined, with a spread tending towards "over-dispersed" rather than "under-dispersed", it may be useful to apply unequal weights to generate improved simulations of future climate (e.g., Collins et al. 2012). For example, if we choose a subset of ensemble members from the CMIP5 ensemble, the rank histogram approach should be useful: we can choose a subset of members whose reliability is higher, i.e., whose rank histogram is close to uniform. However, present-day reliability does not necessarily imply reliability for future projections, so additional work is required to investigate the relationships between simulation errors and uncertainties in projections (e.g., Collins et al. 2012). Further cause for caution arises from the only test of reliability performed to date for a climate change, that of the Last Glacial Maximum (Hargreaves et al. 2011), which found no evidence of the ensemble being over-dispersed. In addition, a domed rank histogram may also be a consequence of tuning towards observations, in which case such weighting would amount to double-counting the data. These issues require further investigation, so at present the most robust strategy may be to use the whole MME when using climate model ensembles for probabilistic prediction.
Acknowledgments
We acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups (listed in Table 1 of this paper) for producing and making available their model output. For CMIP the US Department of Energy’s Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. M.C. was partially supported by funding from NERC grants NE/I006524/1 and NE/I022841/1. MW is supported by the Joint DECC/Defra Met Office Hadley Centre Climate Programme (GA01101). T.Y., J.D.A, H.S., S.E., M.Y., J.C.H. were supported by the Global Environment Research Fund of the Ministry of the Environment of Japan (S-10, Integrated Climate Assessment – Risks, Uncertainties and Society, ICA-RUS). We thank anonymous reviewers for their constructive comments. We gratefully acknowledge Tamaki Yasuda and Osamu Arakawa for helping to get CMIP data.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.