Climate model ensembles
For the MMEs, both the CMIP5 (Taylor et al. 2012) and CMIP3 (Meehl et al. 2007) ensembles are used in the analysis. The CMIP5 dataset is obtained from the federated archives initiated under the Earth System Grid project (http://esg-pcmdi.llnl.gov/), led by the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and advanced through the Earth System Grid Federation (ESGF; http://esgf.org/wiki/ESGF_Overview; Williams et al. 2011), established under the Global Organization for Earth System Science Portals (GO-ESSP; http://go-essp.gfdl.noaa.gov/). We use the CMIP5 model output of the historical simulation from the 28 atmosphere–ocean coupled models (CMIP5-AO) for which sufficient data were available in the archives. The models used in the analysis are summarised in Table 1. We use only one run for each model listed in Table 1, so the number of ensemble members in CMIP5-AO is 28.
Table 1 List of CMIP5 multi-model ensemble
The CMIP3 dataset was obtained from the PCMDI archives (Meehl et al. 2007), and we use the output from the historical simulations by both the atmosphere–ocean coupled model (CMIP3-AO) and the atmosphere-slab ocean coupled model (CMIP3-AS). The CMIP3 models used for the analysis are the same as those in Y12, and the details are summarised therein. We use only one run for each model listed in Table 2, so the number of ensemble members in CMIP3-AO for which suitable outputs are available is 16, and that of CMIP3-AS is 10.
Table 2 List of CMIP3 ensemble
In the present study, we also create a CMIP5+CMIP3-AO ensemble, which simply combines CMIP5-AO and CMIP3-AO; the number of its ensemble members is 44. In this combined ensemble, we make no adjustment or allowance for the possibility that some models may be particularly closely related to one another, for example consecutive generations from a single modelling centre. Such issues are of course a major research topic (e.g., Masson and Knutti 2011), but they are beyond the scope of this work.
We use six different SMEs based on structurally distinct models, as summarised in Table 3. The PPEs created by HadCM3 (Gordon et al. 2000), HadSM3 (Pope et al. 2000), CAM3.1 (Collins et al. 2006b), and MIROC3.2 (K-1 model developers 2004) are here called HadCM3-AO, HadSM3-AS, NCAR-A, and MIROC3-AS, respectively. These four ensembles were also used in Y12. In addition, a new PPE from the MIROC5 atmosphere–ocean coupled model (Watanabe et al. 2010), and a new MPE created from a mixture of elements from the MIROC3.2 and MIROC5 atmosphere models, are analysed. These new ensembles are hereafter called MIROC5-AO and MIROC-MPE-A.
Table 3 List of single-model ensembles
HadCM3-AO and HadSM3-AS were created in the Quantifying Uncertainty in Modelling Predictions (QUMP) project. The atmospheric components of HadCM3 and HadSM3 are identical, with a resolution of 2.5° latitude by 3.75° longitude and 19 vertical levels. The ocean component of HadCM3 has a resolution of 1.25° × 1.25° with 20 levels. In HadSM3, a motionless 50 m slab ocean is coupled to the atmospheric model, and the ocean heat transport is diagnosed for each member.
See Y12 and references therein for further details on the construction of the HadSM3-AS (Murphy et al. 2004; Webb et al. 2006), HadCM3-AO (e.g., Collins et al. 2010), NCAR-A (Jackson et al. 2004, 2008), and MIROC3-AS (Annan et al. 2005a, b; Yokohata et al. 2010) ensembles. Here we outline the main features of the construction of the two new SMEs, MIROC5-AO and MIROC-MPE-A. These SMEs were constructed within the Japan Uncertainty Modelling Project (JUMP). For MIROC5-AO, Shiogama et al. (2012) devised a method to create an ensemble with an atmosphere–ocean coupled model without flux correction. This ensemble is based on a new version of MIROC developed for the CMIP5 project, whose physical schemes are more sophisticated and whose performance is improved relative to the former version (Watanabe et al. 2010). The atmospheric component of MIROC5 used in this study has T42 (about 300 km grid) horizontal resolution, whereas the original version of MIROC5 has T85 (about 150 km grid) resolution, with 40 vertical levels. The ocean component has approximately 1° horizontal resolution and 49 vertical levels with an additional bottom boundary layer. Using results from AGCM experiments, Shiogama et al. (2012) chose sets of parameter values for which the energy budget at the top of the atmosphere was predicted to be close to zero, so that these members would not exhibit climate drift, and then ran the AOGCM with these parameter sets. The number of ensemble members in MIROC5-AO is 36.
Although the climate sensitivity of MIROC3.2 is relatively high compared to other CMIP3 models at 4.0 K (Yokohata et al. 2008), that of MIROC5 is substantially lower at 2.6 K (Watanabe et al. 2010). Since the differences in the response to CO2 increase are caused by changing model physical schemes, Watanabe et al. (2012) created a “multi-physics” ensemble (MPE) by switching physical schemes of MIROC3.2 to those of MIROC5. Using a full factorial design, three schemes were changed in the MPE: vertical diffusion, cloud microphysics, and cumulus convection. Including the two control models, there are, therefore, 8 simulations in total.
Reliability and rank histogram of model ensembles
In the present study, we follow the same philosophy in the definition of reliability and interpretation of the rank histogram as Y12, which is analogous to how it is commonly used in numerical weather prediction. The definition of the term “reliable” in this study is as follows: the ensemble is reliable if the observational data can be considered as having been drawn from the distribution defined by the model ensemble. That is, the null hypothesis of a uniform rank histogram is not rejected (JP08). Of course, in reality, creation of a perfect ensemble is impossible, so with enough data and ensemble members, all ensembles may be found to be unreliable at some level. What we are really testing here is whether the ensembles may be shown to be unreliable for the metrics of interest. Investigating the spatial scale at which the ensembles become unreliable is an interesting topic for future work, but is outside the scope of this paper (Sakaguchi et al. 2012).
Since the data are historical, the analysis here is essentially that of a hindcast, and since some of these data may have been used during model construction and tuning, it is debatable to what extent they can be considered to provide validation of the models. Furthermore, the relationship between current performance and prediction of future climate change remains unclear (e.g., Abe et al. 2009; Knutti 2010, Shiogama et al. 2011). Thus, reliability over a hindcast interval is not necessarily a sufficient condition to demonstrate that the model forecasts are good (Y12). On the other hand, it is clearly preferable that an ensemble should account for sufficient uncertainties to provide a reliable depiction of reality. Where an ensemble is not reliable in this sense, it must raise some doubts as to how credible it is as a representation of uncertainties in the climate system.
The method for calculating the rank histograms in this study is the same as that described in AH10 and Y12, and involves constructing rank histograms for the gridded mean climatic state of the model ensembles for the present-day climate with respect to various observational data sets. We use nine climate variables: surface air temperature (SAT), sea level pressure (SLP), precipitation (rain), and the top-of-atmosphere (TOA) shortwave (SW) and longwave (LW) radiation for full-sky conditions, clear-sky conditions (CLR; the radiative flux where clouds do not exist), and cloud radiative forcing (CRF; the radiative effect of clouds, diagnosed as the difference between full-sky and clear-sky radiation; Cess et al. 1990).
We consider uncertainties in the observations by using two independent datasets, listed in Table 3 of Y12. As in Y12, we used the point-wise difference between each pair of data sets as an indication of observational uncertainty, although this is likely to be somewhat of an underestimate of the true error.
In addition to the mean climate states, we evaluated the long-term trend in the historical experiments by CMIP5-AO, CMIP3-AO, and HadCM3-AO. Because the long-term trend of SAT is robustly attributed to external forcing, we evaluate it over the last 40 years of the period (1960–1999). We do not investigate the twentieth-century trends of precipitation, SLP, or TOA radiation because the interannual to decadal variability is generally large in these variables, and there are large uncertainties, and sometimes artificial trends, in the observations owing to the difficulty of measuring these variables (Trenberth et al. 2007).
The methodology of the rank histogram calculation is described below. First, the model data and observational data were interpolated onto a common grid (resolution of T42 in CMIP5-AO, CMIP3-AO, HadCM3-AO, MIROC5-AO, MIROC-MPE-A, and T21 for the other model ensembles). Second, we inflate the model ensemble to account for observational uncertainties by adding random Gaussian deviates to the model outputs as follows,
$$ X_{\text{model}}^{\prime } = X_{\text{model}} + \sigma_{\text{obs}} Z, $$
where \( X_{\text{model}} \) is the value of the model ensemble member, \( \sigma_{\text{obs}} \) is the standard deviation of the mean of the two observations listed in Table 3 of Y12, and \( Z \) is a value randomly sampled from a standardised Gaussian distribution. Details are described in Sect. 2.4 of Y12. In this way, the sampling distributions of the observations and perturbed model data will be the same if the underlying sampling distributions of reality and models coincide. Due to the large number of data points, our results are robust to sampling variability in these random perturbations. Third, at each grid point, we compared the value of the observation with the ensemble of model values, evaluating the rank of the observation in the ordered set of ensemble values and observed value. Here a rank of one corresponds to the case where the value of the observation is larger than all the ensemble members. We generate a global map of the rank of the observation, R(l,m), where l and m denote the indices of the latitudinal and longitudinal grid points, for each variable and each ensemble. Using this map, the rank histogram h(i) is the histogram of the ranks, weighted by the fractional area of each grid box over the whole grid.
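As an illustration of the inflation and area-weighted ranking steps above, a minimal sketch is given below, assuming pre-gridded NumPy arrays; the function name and array shapes are hypothetical, and ties between the observation and inflated members are not treated specially:

```python
import numpy as np

def rank_histogram(ens, obs, sigma_obs, area, seed=None):
    """Area-weighted rank histogram of an observation within an ensemble.
    ens: model ensemble on a common grid, shape (n_ens, n_lat, n_lon)
    obs: observational field, shape (n_lat, n_lon)
    sigma_obs: observational uncertainty per grid point, shape (n_lat, n_lon)
    area: fractional grid-box areas, assumed to sum to one
    """
    rng = np.random.default_rng(seed)
    n_ens = ens.shape[0]
    # Inflate the ensemble with Gaussian observational noise
    ens_infl = ens + sigma_obs * rng.standard_normal(ens.shape)
    # Rank 1 <=> observation larger than all members, so count the
    # members below the observation and map to ranks 1..n_ens+1
    n_below = (ens_infl < obs).sum(axis=0)
    rank = n_ens + 1 - n_below  # the map R(l, m)
    h = np.zeros(n_ens + 1)
    for i in range(1, n_ens + 2):
        h[i - 1] = area[rank == i].sum()  # area-weighted histogram h(i)
    return h
```

With `area` normalised to sum to one, `h` also sums to one, and a perfectly reliable ensemble would give a flat histogram up to sampling noise.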
The features of the rank histogram can be interpreted as follows. If a model ensemble were perfect, such that the true observed climatic variable can be regarded as indistinguishable from a sample of the model ensemble, then the rank of each observation lies with equal probability anywhere in the model ensemble, and thus the rank histogram should have a uniform distribution (subject to sampling noise). On the other hand, if the distribution of a model ensemble is relatively under-dispersed, such that the ensemble spread does not capture reality, then the observed values will lie towards the edge or outside the range of the model ensemble, and the rank histogram will form an L- or U-shaped distribution. An ensemble with a persistent bias, either too high or too low, may either have a trend across the bins, or a strong peak in one end bin if the bias is sufficiently large. If the histogram has a domed shape with highest values towards the centre, then this implies that the ensemble is overly broad compared to a statistically indistinguishable one.
Since a model ensemble can be regarded as unreliable if the rank histogram of observations is significantly non-uniform, we performed a statistical test for uniformity, whose details are described in Y12. We use the technique introduced by JP08 and decompose the Chi square statistic, T, into components relating to “bias” (the trend across the rank histogram), “V-shape” (peak or trough towards the centre), “ends” (both left and right end bins are high or low), and “left-ends” or “right-ends” (the left or right end bin is high or low). Using the rank histogram h(i) as defined above, the Chi square statistic can be written as
$$ T = \sum\limits_{i = 1}^{k} {\frac{{\left[ {n_{\text{obs}} h(i) - e_{i} } \right]^{2} }}{{e_{i} }}} $$
(1)
where k is the number of bins in the rank histogram (corresponding to the maximum rank), and i is the index of the rank of the observation. \( e_i = n_{\text{obs}}/k \) corresponds to the expected bin value for a uniform distribution, and \( n_{\text{obs}} h(i) \) is the “observed value of the ith bin” in JP08. In the present study, h(i) is calculated as a probability, corresponding to the probability that the rank of the observation falls in the ith bin, and \( n_{\text{obs}} \) is the “number of observations” in JP08. Since the values of neighbouring grid points are highly correlated, their observation ranks cannot be considered independent of each other, and thus \( n_{\text{obs}} \) is also referred to as the “effective degrees of freedom of the data”, as discussed in AH10 and Y12.
In Y12, a value of 10 was used for \( n_{\text{obs}} \), based on the estimate of Annan and Hargreaves (2011), in which the value for SAT, SLP, and rain of the CMIP3 ensemble ranges from 4 to 11. However, the effective degrees of freedom may differ among model ensembles, and the statistical test for uniformity also depends on \( n_{\text{obs}} \). In the present study, therefore, we estimate the effective degrees of freedom using the method of Bretherton et al. (1999), which is described in the next section.
As described in JP08, under the null hypothesis of a uniform underlying distribution, the Chi square statistic for the full distribution approximately follows a Chi square distribution with (k − 1) degrees of freedom. Using a table of the Chi square distribution and the value of T in Eq. (1), we can calculate the p value and reject the hypothesis of a uniform distribution if the p value is smaller than the significance level. Similarly, each of the components (bias, V-shape, ends, left-ends, and right-ends) calculated by the formulation of JP08 should approximately follow a Chi square distribution with one degree of freedom. We can therefore also estimate the p values of these components and test the hypothesis of a uniform distribution.
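As a concrete sketch, the overall test and the “bias” and “V-shape” components can be computed with orthonormal contrast vectors in the spirit of JP08; the specific linear and quadratic contrasts used here are assumptions and may differ in detail from the JP08 formulation:

```python
import numpy as np
from scipy.stats import chi2

def uniformity_test(h, n_obs):
    """Chi-square test of rank-histogram uniformity, with a JP08-style
    decomposition into 'bias' (linear) and 'V-shape' (quadratic) parts.
    h: rank histogram as probabilities (sums to 1)
    n_obs: effective degrees of freedom of the data
    Returns (T, p_all, p_bias, p_vshape)."""
    k = len(h)
    e = n_obs / k                       # expected count per bin
    x = (n_obs * h - e) / np.sqrt(e)    # standardised residuals; T = sum x^2 is Eq. (1)
    T = float(np.sum(x ** 2))
    # Orthonormal contrast vectors (assumed forms, not necessarily JP08's)
    i = np.arange(k)
    u_lin = i - i.mean()
    u_lin = u_lin / np.linalg.norm(u_lin)      # linear contrast -> "bias"
    u_quad = (i - i.mean()) ** 2
    u_quad -= (u_quad @ u_lin) * u_lin         # orthogonalise against linear
    u_quad -= u_quad.mean()                    # orthogonalise against constant
    u_quad = u_quad / np.linalg.norm(u_quad)   # quadratic contrast -> "V-shape"
    T_bias = float((u_lin @ x) ** 2)
    T_v = float((u_quad @ x) ** 2)
    # Full statistic ~ chi2(k-1); each component ~ chi2(1) under the null
    return T, chi2.sf(T, k - 1), chi2.sf(T_bias, 1), chi2.sf(T_v, 1)
```

A uniform histogram gives T = 0 and p values of one, while a histogram concentrated in one end bin yields a large T and small p values.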
Effective degrees of freedom of model ensembles
We use the formulation of the effective degrees of freedom (EDoF) by Bretherton et al. (1999). Using the spatial patterns of the climatologies of the model ensemble members, the EDoF can be written as
$$ N_{\text{ef}} \left( n \right) = \left( {\sum\limits_{k = 1}^{n} {f_{k}^{2} } } \right)^{ - 1} $$
(2)
where \( N_{\text{ef}} \) is the effective degrees of freedom, n is the number of members in a model ensemble, and \( f_k \) is the fractional contribution of EOF k to the total variance. \( f_k \) is calculated from the EOF analysis across the climatologies of the model ensemble members. Equation (2) means that if the fractional contribution from the leading (small-k) EOFs is large, then the differences in spatial patterns among the ensemble members can be explained by a few EOF patterns, and thus the EDoF of the model ensemble is small.
In Bretherton et al. (1999), it is shown that for any sampling distribution, the estimate of EDoF presented in Eq. (2) based on a finite sample will tend to underestimate the true EDoF which would be obtained from an infinite sample of the same distribution. \( N_{\text{ef}}^{\text{true}} \), the value of EDoF for an infinite number of ensemble members, can be estimated as follows.
$$ N_{\text{ef}}^{\text{true}} = \frac{N_{\text{ef}}(n)}{1 - N_{\text{ef}}(n)/n} $$
(3)
The EDoFs calculated as above are used in the statistical test of the reliability of the rank histogram. We set \( n_{\text{obs}} = N_{\text{ef}}^{\text{true}} \) in Eq. (1), and then perform the statistical test using the rank histogram described in Sect. 2.2.
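A minimal sketch of Eqs. (2) and (3), assuming the ensemble climatologies have already been area-weighted and flattened into a two-dimensional array; the function name and this preprocessing are hypothetical:

```python
import numpy as np

def effective_dof(fields):
    """Estimate the EDoF of an ensemble following Bretherton et al. (1999).
    fields: ensemble climatologies, shape (n, n_points), assumed already
    area-weighted and flattened.
    Returns (N_ef(n), N_ef_true)."""
    n = fields.shape[0]
    anom = fields - fields.mean(axis=0)   # anomalies about the ensemble mean
    # Singular values of the anomaly matrix give the EOF variances;
    # their normalised squares are the fractional contributions f_k
    s = np.linalg.svd(anom, compute_uv=False)
    var = s ** 2
    f = var / var.sum()
    n_ef = 1.0 / np.sum(f ** 2)           # Eq. (2)
    n_ef_true = n_ef / (1.0 - n_ef / n)   # Eq. (3), finite-sample correction
    return n_ef, n_ef_true
```

For example, an ensemble whose members differ only along a single spatial pattern has \( N_{\text{ef}}(n) \approx 1 \), and Eq. (3) inflates this slightly to account for the finite ensemble size.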
Distances between observation and model ensemble members
The rank histogram analysis discussed in Sect. 2.2 only considers the rank ordering of models and observations, and thus information on the distances between the observation and ensemble members is missing. It also takes an intrinsically univariate and scalar viewpoint of the data, considering each observation independently of the others. An alternative approach, based on minimum spanning trees (Wilks 2004), handles multidimensional data sets directly, and also considers the distance between ensemble members and the observations. Therefore, we also investigate our ensembles using this approach, which we now briefly describe. We consider a 2D data field, and the equivalent output field from each ensemble member, as points (“nodes”) in a high-dimensional space, with the length of the “edge” or line segment between each pair of them defined as the area-weighted RMS difference. In graph theory, a tree is a set of n − 1 edges which collectively connect n nodes; if each edge is assigned a length, then a minimum spanning tree is a tree of minimum total length (which will be unique if all the pairwise distances differ).
Therefore, in order to calculate the minimum spanning tree (MST), we first evaluate the pair-wise distances \( D_{kl} \) between the climatology of an observational data field and the equivalent output from the model ensemble, via the global area-averaged RMS difference as follows,
$$ D_{kl} = \sqrt {\frac{1}{{n_{i} n_{j} }}\sum\limits_{j = 1}^{{n_{j} }} {\sum\limits_{i = 1}^{{n_{i} }} {\left[ {X_{k} (i,j) - X_{l} (i,j)} \right]^{2} A_{ij} } } } $$
(4)
where i and j denote the indices of the grid points, and \( n_i \) and \( n_j \) are the numbers of grid points in latitude and longitude. k and l in Eq. (4) are the indices of the observation and the model ensemble members used for the pair-wise distances. Here we define k < l, with k = 0 for the observation and k or l = 1 to \( n_{\text{ens}} \) for the model ensemble members, where \( n_{\text{ens}} \) is the number of ensemble members given in Table 3. \( X_k(i,j) \) and \( X_l(i,j) \) denote the values of the climate variables used for the above analysis. \( A_{ij} \) is the area weight of each grid box (the ratio of each grid-box area to the global area).
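Equation (4) can be sketched as follows; as an assumption, the \( 1/(n_i n_j) \) prefactor and the weights \( A_{ij} \) are combined here into grid-box area fractions that sum to one, which is one plausible reading of the normalisation:

```python
import numpy as np

def pairwise_rms_distances(fields, area):
    """Area-weighted RMS distances D_kl (Eq. 4) between all pairs of fields.
    fields: shape (n_ens + 1, n_lat, n_lon); row 0 is assumed to be the
    observation and rows 1..n_ens the ensemble members.
    area: grid-box area fractions, assumed to sum to one (the 1/(n_i n_j)
    factor of Eq. 4 is taken as absorbed into these weights)."""
    n = fields.shape[0]
    D = np.zeros((n, n))
    for k in range(n):
        for l in range(k + 1, n):
            diff2 = (fields[k] - fields[l]) ** 2
            # Area-weighted mean square difference, then square root
            D[k, l] = D[l, k] = np.sqrt(np.sum(diff2 * area))
    return D
```

The resulting symmetric matrix contains all edge lengths needed for the MST analysis below.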
Once the pair-wise distances between the observation and the model ensemble members defined in Eq. (4) are obtained, the MST for any set of nodes, and its total length (i.e., the sum of the lengths of its edges), can readily be generated using a standard algorithm. Here, in order to understand the relationship between the distances among the ensemble members and those between the observation and the ensemble members, a leave-one-out analysis as described in Wilks (2004) is performed. First, the MST for the nodes excluding the observations, namely the MST for the model ensemble members, defined as M(0), is calculated. Then, the MSTs in which the observational data replace each ensemble member in turn, defined as M(k) for k = 1 to \( n_{\text{ens}} \), are calculated. Finally, the rank of the total length of M(0) among those of M(k) for k = 1 to \( n_{\text{ens}} \) is evaluated. Here, the rank is defined as one if M(0) has the smallest total length. If the observations were drawn from the same distribution as the ensemble, then the length of M(0) should be indistinguishable from the lengths of the other M(k). If, however, the observations are relatively distant from the ensemble, then M(0) will be shorter than the M(k). Given a sufficiently large number of observational data sets, the histogram of the ranks of the associated MSTs could be generated (the MST rank histogram) but, since we only have a small number of data fields, we prefer to examine the ranks on an individual basis in Sect. 3.1.
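The leave-one-out procedure can be sketched with a standard MST routine, here SciPy's `minimum_spanning_tree` (the wrapper itself is hypothetical; note that SciPy treats zero entries as absent edges, which is harmless for strictly positive distances):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_rank(D):
    """Leave-one-out MST rank (Wilks 2004).
    D: (n_ens + 1, n_ens + 1) symmetric distance matrix, with row/column 0
    being the observation. Returns the rank of the ensemble-only MST
    length M(0) among the lengths M(k) obtained by swapping the
    observation for member k; rank 1 means M(0) is the smallest."""
    n = D.shape[0]  # n_ens + 1 nodes

    def mst_length(idx):
        sub = D[np.ix_(idx, idx)]
        return minimum_spanning_tree(sub).sum()

    # Dropping node 0 gives M(0); dropping node k >= 1 gives M(k),
    # i.e. the observation replaces member k
    lengths = [mst_length([j for j in range(n) if j != k]) for k in range(n)]
    m0 = lengths[0]
    return 1 + sum(m < m0 for m in lengths[1:])
```

A rank of one thus flags an observation that is relatively distant from the ensemble, while a rank near the middle is consistent with the observation being drawn from the ensemble distribution.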
In order to focus more directly on the distances between the observation and the ensemble members, we also calculate the average distances between the observation and the models. For the observations, and then for each ensemble member in turn, we calculate the average of the distances from it to all the other nodes:
$$ \overline{{D_{k} }} = \frac{1}{{n_{\text{ens}} }}\sum\limits_{l \ne k}^{{n_{\text{ens}} }} {D_{kl} } $$
(5)
If the distances from the observation to ensemble members are larger than those among ensemble members, \( \overline{{D_{0} }} \) is larger than \( \overline{{D_{k} }} \) with \( k \ne 0 \).
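Equation (5) amounts to averaging the off-diagonal entries of each row of the distance matrix (a sketch; each of the \( n_{\text{ens}} + 1 \) nodes has exactly \( n_{\text{ens}} \) other nodes, so the divisor is \( n_{\text{ens}} \) for every row):

```python
import numpy as np

def mean_distances(D):
    """Average distance from each node to all other nodes (Eq. 5).
    D: (n_ens + 1, n_ens + 1) distance matrix, node 0 = observation.
    Returns an array whose entry k is the mean of D_kl over l != k."""
    n_ens = D.shape[0] - 1
    # The diagonal of D is zero, so the row sum already excludes l = k
    return D.sum(axis=1) / n_ens
```

Comparing entry 0 of the result against the remaining entries then shows whether the observation is, on average, farther from the nodes than the ensemble members are.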