1 Introduction

The behaviour of globally averaged surface temperatures over the period 1998 to 2012/2014 has attracted a great deal of attention, with questions being raised as to whether the observed values represented a serious hiatus (or pause) in global warming or whether they simply corresponded to a decadal-scale fluctuation associated with internal variability of the climate system (Lewandowsky et al. 2015, 2016). The Intergovernmental Panel on Climate Change (IPCC 2013; Flato et al. 2013) concluded that a reduction of the warming trend (1998–2012) was attributable, in roughly equal measure, to a cooling contribution from internal climate variability and from external forcing comprising solar and volcanic activity. For example, Fyfe et al. (2016) estimated that the planet warmed at a rate of +0.170 °C per decade from 1972 to 2001, but at a slower rate of +0.113 °C per decade from 2000 to 2014, contrary to the expectations for an increase in global warming over time. However, an updated analysis by Karl et al. (2015) indicated no change in the global warming trend, while Lewandowsky et al. (2015) also found no substantive evidence for a “pause” or “hiatus”, maintaining that it was “statistically indistinguishable from previous fluctuations”.

Closely related to this issue is an apparent discrepancy between observed temperatures and those simulated by climate models. The Intergovernmental Panel on Climate Change (IPCC 2013; Flato et al. 2013) also noted that: “There is very high confidence that [climate] models reproduce the general features of the global-scale annual mean surface temperature increase over the historical period, including the more rapid warming in the second half of the twentieth century”. They also concluded, however, that “most simulations of the historical period do not reproduce the observed reduction in global mean surface warming trend over the last 10–15 years”. They attributed these differences, to a substantial degree, to internal variability in the real world with possible contributions from forcing errors and, for some climate models, an overestimation of the response to increasing greenhouse gases (GHGs) and other anthropogenic forcings including sulphate aerosols (cf. Bindoff et al. 2013). Their analysis also indicates that most models tended to warm more than observations did from 1961–1990 to 1998–2012.

On the other hand, other studies (e.g. Brown et al. (2015); Marotzke and Forster (2015); England et al. (2015)) found no evidence for any systematic overestimate in the responses of the models, while Huber and Knutti (2014) concluded that “… the reduced warming and mismatch between models and observations can, to a large extent, be explained by the combined effect of reduced forcing and natural variability, each of these components contributing about an equal amount”.

It has been suggested that natural variability during this period comprised a cool phase of the Interdecadal Pacific Oscillation (IPO; or the closely related Pacific Decadal Oscillation (PDO), Han et al. 2013). The IPO is a naturally occurring form of interdecadal climate variability centred in the Pacific Ocean (Power et al. 1999; Folland et al. 1999; Meehl and Arblaster 2012; Salinger et al. 2001; Henley et al. 2015; England et al. 2014; Kirtman et al. 2013). The IPO is known to have some effect on the variability of global temperatures on decadal time scales (Brown et al. 2014; Meehl et al. 2011, 2013). This variability is very largely absent from multi-model mean (MMM) values since any such cool and warm phases are uncorrelated across different model simulations such that their effects tend to cancel out. Consequently, several studies have focused on individual model simulations and their ability to match observed variability. Of particular interest here are the studies by Kosaka and Xie (2013), Watanabe et al. (2014), England et al. (2014), Meehl et al. (2014) and Kociuba and Power (2015), who focused on the ability of models to simulate Pacific Ocean variability. Kosaka and Xie (2013) showed that differences between modelled and observed decadal-scale trends could be largely removed in their climate model if the model was forced with observed sea surface temperatures (SSTs) in the eastern tropical Pacific, rather than allowing the model to calculate its own SSTs. England et al. (2014) could largely simulate the observed changes by simply forcing their climate model with observed tropical Pacific winds. Watanabe et al. (2014) showed that approximately 30–50 % of decadal-scale anomalies in global average temperatures could be simulated in their model when forced with observed tropical wind-stress anomalies. Kociuba and Power (2015) found that none of the 35 CMIP5 models they examined simulated the marked strengthening of the Walker circulation observed over the 33-year period 1980–2012, while Meehl et al. (2014) showed that some CMIP5 models could reproduce features of the hiatus period, given the correct initial conditions. They identified 10 out of 262 CMIP5 simulations that happened to match the slowdown in the observations as reflected in 15-year trends.

Other studies attempting to discriminate among the models and assess if there are any implications for projections for the end of the century include Huber and Knutti (2014), Risbey et al. (2014), England et al. (2015) and Marotzke and Forster (2015). Risbey et al. (2014) and England et al. (2015) argued that the recent discrepancies between models and observations were not that important, since there is little difference between the temperature projections for the end of the century from those models which can simulate features of the recent period and those that do not—the implication being that there should soon (i.e. post-2014) be a return to a warming trend similar to that predicted by the models (Huber and Knutti 2014).

With the benefit of hindsight, brought about by record warm temperatures associated with the 2015/2016 El Niño event, it is possible to better address some of the issues surrounding the so-called “hiatus” period. A great deal of attention has been given to trends over 15-year periods, especially 1998–2012, but here we are much more interested in the longer-term changes which better reflect anthropogenic climate change (Wehner and Easterling 2015; Lewandowsky et al. 2015). We consider the role of the Interdecadal Pacific Oscillation (IPO), the frequency of El Niño-Southern Oscillation (ENSO) events, and internal climate variability more broadly in accounting for the differences between observed and simulated multi-decadal changes in the Pacific and global temperatures. This is an important extension of previous work to longer time scales, which means, for example, that the conclusions discussed above relating to 15-year trends may not necessarily apply. We then consider what implications the contrasts between modelled and observed multi-decadal temperature changes have for global and Pacific temperature projections for the coming century.

The data, models and methods used are described in Sect. 2. Results are presented in Sects. 3, 4, 5, 6. Section 3 provides an analysis of decadal, multi-decadal and longer-term trends in global and Pacific temperatures using data up to 2016. Model means of the climate model simulations of the historical period are compared with observations in Sect. 4. The ability of internal, multi-decadal climate variability to account for the contrasts between the multi-decadal changes in the MMM and the observations is examined in Sect. 5. The implications the contrasts have for projections of global and Pacific temperature over the remainder of the twenty-first century are examined in Sect. 6. Results are summarised and discussed in Sect. 7.

2 Data and methods

Observed global annual average surface temperatures (1900–2015) are taken from the HadCRUT4 data set generated by the Hadley Research Centre, available from http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/download.html.

All temperatures are expressed as anomalies with respect to the 1961–1990 average (which is a standard reference period used by e.g. the World Meteorological Organization). We also frequently refer to the period 1998–2012, because this has been the focus of numerous studies dealing with the so-called hiatus, including the latest IPCC report (Flato et al. 2013).

At the time of writing, global values were available for much of 2016 and indicated that the end-of-year annual average was highly likely to exceed that of 2015 by about +0.2 °C (Schmidt, https://mobile.twitter.com/ClimateOfGavin/status/731599988141248512). Consequently, we estimate the 2016 value to be +0.95 °C.

Here we also refer to Pacific Ocean averages calculated for the box region 10 °S to 25 °N and 160 °E to 110 °W. This region has been chosen since it encompasses a node of the Interdecadal Pacific Oscillation, a pattern of SST variability at decadal time scales centred on the Pacific Ocean (England et al. 2014; Kirtman et al. 2013; Power et al. 1999; Folland et al. 1999; Meehl and Arblaster 2012; Salinger et al. 2001; Henley et al. 2015). It is also a region where a hiatus in warming trends is apparent.

A 2016 value for the Pacific box region (+0.71 °C) was estimated based on the correlation with global average values and the fact that, in years following major El Niño events, these values tend to decrease by about 0.4 °C. Note that the results are not affected by whether we restrict the analyses to 2015, the last year for which complete observations were available at the time of writing, or include the plausible estimates for 2016.

Eighty-three simulations from 38 models from the CMIP5 archive (Taylor et al. 2012) have been used to analyse the pre-industrial, the historical (1900–2005) and the future (2006–2100) periods under three (RCP2.6, RCP4.5 and RCP8.5) emissions scenarios. The historical runs included forcing from increasing greenhouse gases, changing sulphate aerosols, major volcanic eruptions and solar variability (Moss et al. 2010; van Vuuren et al. 2011). The historical and future simulations were combined to produce model estimates for 1900–2100. Global means were calculated for each year using the same masks as the HadCRUT4 data set. For years after 2012, the 2012 masks were used. Note that the HadCRUT4 values may underestimate the actual rate of warming, because surface air temperatures over oceans are assumed to be the same as SSTs (Cowtan et al. 2015). The full list of models used is given in Supplementary Table 1.

IPO patterns are identified using the 2nd EOF of 13-year low pass filtered SSTs (Meehl and Arblaster 2012). We used gridded SST data (1877–2003) from the (MLOST) V3.5.324 data set (Vose et al. 2012) and model simulated values over the historical period 1900–2005. Data pre-processing was conducted as described by Jones et al. (2013). Both the observations and model values were re-gridded to a common 1.5° × 1.5° grid. Very little Arctic data is included, so undersampling of the Arctic (Cowtan and Way 2014; Karl et al. 2015) does not affect the results.

While an IPO index can be defined as the EOF time series, it can also be more simply calculated as the linear combination of anomalies averaged over three distinct box regions (Henley et al. 2015). These index values, up to and including 2015, are available from http://www.esrl.noaa.gov/psd/data/timeseries/IPOTPI.

El Niño/La Niña events are defined according to NINO34 annual average temperatures. We first detrend the data (both observations and simulated values) by subtracting 31-year running averages from the raw values. El Niño/La Niña years are those in which the detrended anomalies exceed plus or minus one standard deviation (running 31-year values). An El Niño-like index is +1 for an El Niño year, −1 for a La Niña year and zero otherwise (i.e. an ENSO-neutral year). This standardization enables a comparison between different time series with different trends and different levels of variability and has been used successfully in previous studies (e.g. Power and Smith 2007).
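The classification described above can be sketched in a few lines of Python. This is a minimal illustration only, not the code used for the analysis: the function names are hypothetical, and it assumes a centred 31-year window that is truncated at the ends of the series and a population standard deviation, neither of which is specified in the text.

```python
from statistics import mean, pstdev

def running_window(values, i, width=31):
    """Centred window around index i, truncated at the series ends (assumption)."""
    half = width // 2
    return values[max(0, i - half): i + half + 1]

def enso_index(nino34, width=31):
    """Classify each year as +1 (El Nino), -1 (La Nina) or 0 (ENSO neutral)."""
    # Step 1: detrend by subtracting the running average from the raw values.
    detrended = [x - mean(running_window(nino34, i, width))
                 for i, x in enumerate(nino34)]
    index = []
    for i, d in enumerate(detrended):
        # Step 2: compare with one running standard deviation of the
        # detrended anomalies (population SD is an assumption).
        sd = pstdev(running_window(detrended, i, width))
        if d > sd:
            index.append(1)
        elif d < -sd:
            index.append(-1)
        else:
            index.append(0)
    return index
```

Because the threshold is defined relative to each series' own running mean and variability, the same function can be applied unchanged to observed and simulated NINO34 values.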

3 Analysis of observations

Figure 1a shows observed global annual average surface temperature anomalies (relative to 1961–1990) for the period 1901 up to 2016. The years 1998–2012 are highlighted since this period has been the subject of so many studies. It can be seen that temperature anomalies could be described as relatively stagnant between 1997 and 2014. 31-year running averages indicate a fairly constant rate of warming up to the present, despite the relatively stagnant period.

Fig. 1

Observed annual average surface temperatures 1921–2016 expressed as anomalies relative to the average for the period 1961–1990. The values over the period 1901–1997 are indicated by black circles, the values between 1998 and 2012 are indicated by blue circles, and the values between 2013 and 2016 by red circles. Running 31-year average values are indicated by the yellow squares. The smoothed curves represent 3rd order polynomial fits to the data over three 97-year windows: 1901–1997 (black), 1916–2012 (blue) and 1920–2016 (red): a Globe and b Pacific. In both cases, the six panels show the effect of applying different polynomial fits to the data

The raw data, including the 31-year average values, are well represented by 3rd order polynomial fits. Three such fits to the raw data are also shown, based on overlapping 97-year windows (1901–1997, 1916–2012, and 1920–2016). Unlike running averages, these provide estimates of the magnitude of background trends at the beginning and end of each window. The effect of applying different polynomial fits to the data is indicated by the panels in the figure, which show the estimated background trends at the end of each window. The higher order fits reflect relatively shorter-term fluctuations in the data compared to the lower order fits. A 3rd order polynomial provides a close fit to the data since (a) it explains 85 % of the variance in the 1921–2016 data, and (b) this is more than the variance explained by a 2nd order fit (80 %) yet not much less than that explained by the higher order fits. Importantly, the 3rd order polynomial fits yield almost identical background trends, indicating that they capture a robust long-term feature of the data. In contrast, the trends associated with the higher order fits tend to differ, being more sensitive to shorter-term fluctuations including hiatus periods. As a consequence, the data can be described as comprising a robust, ever-increasing long-term warming trend that is independent of shorter-term fluctuations at the end of the series.
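As an illustration of how the explained variance and end-of-window background trend of such a fit might be computed, the following sketch uses NumPy's least-squares polynomial routines. The function names are hypothetical, and centring the years before fitting is a numerical convenience assumed here rather than a step described in the text.

```python
import numpy as np

def poly_explained_variance(years, temps, order):
    """Fit a polynomial of the given order; return (coefficients, R^2)."""
    years = np.asarray(years, dtype=float)
    temps = np.asarray(temps, dtype=float)
    # Centre the years so the least-squares problem is well conditioned.
    x = years - years.mean()
    coeffs = np.polyfit(x, temps, order)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((temps - fitted) ** 2)
    ss_tot = np.sum((temps - temps.mean()) ** 2)
    return coeffs, 1.0 - ss_res / ss_tot

def end_of_window_trend(years, temps, order=3):
    """Slope of the fitted polynomial in the final year, in deg C per decade."""
    years = np.asarray(years, dtype=float)
    coeffs, _ = poly_explained_variance(years, temps, order)
    # Differentiate the fitted polynomial and evaluate at the last year.
    slope_per_year = np.polyval(np.polyder(coeffs), years[-1] - years.mean())
    return 10.0 * slope_per_year
```

Calling `poly_explained_variance` with `order` set to 1, 2, 3 and higher on the same window gives the kind of comparison of explained variance discussed above, while `end_of_window_trend` applied to each 97-year window gives the corresponding background trend at the end of that window.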

A different picture emerges for Pacific average temperatures (Fig. 1b), as the 31-year average temperature anomalies (yellow squares) and a 3rd order polynomial reflect the existence of a slow-down over the period 1990–2000. Apart from the linear fits, the background trends from the fits over the three windows all differ considerably in response to the values after 1997. In each case, the fits based on the early 1901–1997 window indicate a relatively moderate to strong warming trend, the 1916–2012 window yields far less warming and even cooling, while the recent 1920–2016 window indicates a return to moderate warming. These features reflect a distinct, but temporary, hiatus in the long-term warming trend in the Pacific.

We subtract the 3rd order polynomial fit to the full data set (1900–2016) from the raw global values to obtain the residual or detrended values (black, Fig. 2). These represent the effects of variability at relatively shorter time scales, including contributions from the IPO, El Niño/La Niña events and external sources such as volcanic eruptions. The IPO can only explain about 16 % of the residual variance in global temperatures, mainly because of regional cancellations (Chen et al. 2008). The IPO-based values (obtained using linear regression) are shown in green and indicate that the IPO can contribute no more than about −0.03 °C to the slowdown in observed temperatures after 1998. The NINO34 index indicates that this contribution comprises five La Niña events (1999, 2000, 2007, 2008 and 2011) and only a single El Niño event (2002) between 1999 and 2014.

Fig. 2

Detrended Pacific temperature anomalies (black) versus anomalies associated with the IPO (green). El Niño (red) and La Niña (blue) years based on NINO34 temperatures are indicated

4 Comparing model simulations with observations

Figure 3 compares both the observed global and Pacific average surface temperature anomalies (relative to 1961–1990) over the period 1900–2016 with MMM values from the CMIP5 simulations using the RCP4.5 emissions scenario. Note that for the time period considered, there is very little difference between the MMM values from the RCP8.5, RCP4.5 and RCP2.6 simulations. Third order polynomial fits to the data over the period are shown in order to better compare the long-term trends in the data sets. In both panels the model values and the observations behave similarly after about 1910, up until about 1990, but appear to diverge thereafter. By 2016 the model ensemble values are about +0.2 °C and +0.3 °C warmer than the observations for the globe and the Pacific respectively. In particular, the MMM global temperatures exceed the observed values in every year after 1998 and, for Pacific temperatures, in every year after 1993. This period is also unusual in that the observed global temperature was lower than the temperature in all ensemble members for all years between 2011 and 2014, except in 10 out of a possible 332 instances.

Fig. 3

Annual average surface temperatures 1921–2016 from observations and multi-model mean values from CMIP5 (RCP4.5) simulations. The temperatures are all expressed as anomalies relative to 1961–1990. The smoothed curves are based on 3rd order polynomial fits to the data over the period 1920–2016. The upper and lower smooth curves correspond to the multi-model maximum and minimum values respectively. a Observed global values (black), model values (red squares). b Pacific values (blue circles), model values (orange squares). Note differences in scale

The probability that the MMM average value for 1986–2016 comes from the same population as the observations is only 0.14 according to a simple t test. However, these differences are dependent on the baseline period we use to define the anomalies (Hawkins and Sutton 2015). For example, these differences would disappear if we used the 1986–2016 period to define the anomalies in both the models and observations, but at the expense of making the models appear unrealistically cool for much of the twentieth century. If we take the period 1900–1985 as the baseline period, then the model values based on the 1961–1990 period are warmer by +0.03 °C. By subtracting this amount from each model value we can make the 1900–1985 average value match that of the observations. While this reduces the differences at the end of the time series, it does not greatly affect the divergence between them. For the 1986–2016 period, the resultant probability increases to 0.34, but not enough to indicate that the MMM values as a whole are consistent with the observations, despite the record warm years in 2015 and 2016. The choice of baseline does not affect the interpretation of the trends in the data.
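The baseline adjustment described above amounts to subtracting a single constant from the model series. A minimal sketch follows, with a hypothetical function name and toy inputs; it is intended only to make the arithmetic explicit.

```python
def align_to_observations(model, obs, years, start, end):
    """Shift model anomalies so their mean over start..end matches the observations.

    Returns the shifted model series and the offset that was subtracted.
    """
    in_base = [start <= y <= end for y in years]
    model_base = [m for m, keep in zip(model, in_base) if keep]
    obs_base = [o for o, keep in zip(obs, in_base) if keep]
    # Offset = model baseline mean minus observed baseline mean
    # (about +0.03 deg C for the 1900-1985 baseline, according to the text).
    offset = sum(model_base) / len(model_base) - sum(obs_base) / len(obs_base)
    return [m - offset for m in model], offset
```

Since the same offset is removed from every year, year-to-year differences and fitted trends in the model series are unchanged, which is why the choice of baseline alters the size of the model-minus-observation difference but not the interpretation of the trends.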

The issue here is that we expect the effects of greenhouse-related climate change to dominate at a certain time scale, with any other sources of variability represented at shorter time scales. The fact that the observed data are well represented by a 3rd order polynomial (a form of low pass filter), and that the implied trends at the end of successive 100-year windows are almost identical, suggests that we can identify just such a signal and that other sources of variability occur on time scales much less than about 50 years, i.e. hiatus periods of up to 20 years represent variability not associated with a robust background warming trend. Therefore, using the same “filter” on model-based data allows a comparison to be made of the actual and simulated warming signal. Furthermore, the MMM global warming trend is effectively unaffected by any simulated internal variability since this tends to cancel out across the different simulations. As a consequence, the difference between the observed temperatures and the MMM values of about +0.2 °C is unlikely to represent the effects of internal variability. As already noted, the IPO can only explain about 0.03 °C of the detrended global values (Fig. 2).

A short-term excess of La Niña events over El Niño events between 2003 and 2014 contributed to the observed Pacific hiatus. However, over the longer period 1986–2016 the excess was only 1 (Fig. 2). Over this same period, based on simulated Pacific temperatures, the average excess of La Niña-like events over El Niño-like events across all 83 simulations was −0.66 (i.e. slightly more El Niño-like events than La Niña-like events), which is not significant given that the standard deviation of the excess values across all simulations is 2.9. These numbers indicate that changes in the frequency of El Niño/La Niña events can affect short-term trends but contribute relatively little to any changes in long-term trends. Even if we assume that an excess of La Niña events can explain a similar amount as the IPO (which strictly cannot be done because these variables are closely related, Power et al. 2005; Newman et al. 2003) the total contribution would still only be about 0.06 °C.
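The excess statistic used here can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the index follows the +1/−1/0 convention of Sect. 2, and the use of a population standard deviation across simulations is an assumption. The example years are the observed events between 1999 and 2014 listed in Sect. 3.

```python
from statistics import mean, pstdev

def la_nina_excess(index):
    """Number of La Nina years (-1) minus number of El Nino years (+1)."""
    return sum(1 for v in index if v == -1) - sum(1 for v in index if v == 1)

def ensemble_excess(indices):
    """Mean and standard deviation of the excess across several simulations."""
    excesses = [la_nina_excess(ix) for ix in indices]
    return mean(excesses), pstdev(excesses)

# Observed events between 1999 and 2014, as listed in Sect. 3.
la_nina_years = {1999, 2000, 2007, 2008, 2011}
el_nino_years = {2002}
observed = [-1 if y in la_nina_years else 1 if y in el_nino_years else 0
            for y in range(1999, 2015)]
observed_excess = la_nina_excess(observed)  # five La Nina minus one El Nino
```

A mean excess that is small compared with its spread across simulations, as reported above, indicates no systematic preference for either phase over the period considered.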

The Atlantic Multi-decadal Oscillation (AMO) represents North Atlantic sea surface temperature variability and appears to represent a source of variability in global average temperatures independent of the Pacific (Chylek et al. 2014). However, observations suggest that it would have contributed relatively warm values over recent years (see Fig. 4c). The observed AMO cannot, therefore, help explain why the observed global temperatures are cooler than the MMM values.

Fig. 4

Spatial pattern of multi-decadal temperature change (from 1961–1990 to 1998–2012) from a observations (HadCRUT4), b models (MMM) and c the difference between them (observations minus MMM)

In the case of the Pacific, the situation is different since the background warming trend does appear to be affected by a hiatus (Fig. 1b). At the end of the time series the model values again appear too warm by about +0.3 °C. If we use the 1900–1985 baseline, the difference is slightly less (+0.28 °C). The IPO explains about 68 % of the variance in the Pacific time series but can explain no more than about 0.15 °C during the hiatus period. This is about half the difference between the observed and MMM values at the end of the time series, indicating that, even for this region, the models, as a whole, appear too warm. It is worth noting that this difference persists across, and is often exacerbated by, the choice of different baseline periods.

Finally, while the IPO makes a contribution to the recent multi-decadal change in Pacific temperature (Sect. 3.1; Fig. 10), the multi-decadal change in the IPO does not appear large in terms of its own historical record (see Appendix). This does not support the hypothesis (see, e.g., Kociuba and Power 2015) that the contrast between observed and modelled multi-decadal variability can be accounted for by an extreme natural excursion of the IPO in the real world.

5 Pacific variability

Figure 4a shows the global pattern of multi-decadal temperature change (i.e., temperature change from 1961–1990 to 1998–2012). Although the changes depicted are dominated by the long-term warming trend, they also reflect a contribution by the IPO consistent with earlier research (e.g., Trenberth and Fasullo 2013; England et al. 2014; Meehl and Arblaster 2012; Goddard 2014; Kociuba and Power 2015). Figure 4b shows the corresponding MMM temperature anomaly pattern. It appears relatively featureless over the entire Pacific Ocean when compared with the observed pattern (Fig. 4a). This is expected since averaging across model simulations largely filters out any internally generated features, including the impact of the IPO. The difference between the two patterns (Fig. 4c) therefore emphasizes the contribution of the real-world IPO, and other sources of internal variability, to the model-to-observed contrast in the multi-decadal temperature change. In fact, if models and the forcing applied to them were perfect, and observational error is ignored, then non-zero values in Fig. 4c would, by construction, arise entirely from interdecadal climate variability.

Figure 4c indicates that the (cool) difference between the observed and MMM values exceeds 0.7 °C in the central Pacific. Is internal variability large enough that we can reasonably expect it to account for such a value? In order to address this question, let us assume, for the time being, that modelled internal multi-decadal variability has a realistic magnitude. Figure 5a shows the spatial structure of the standard deviation (SD) of multi-decadal variability arising from internal climate variability alone, as it appears in pre-industrial simulations. If we suppose that the models and the forcing applied to them are perfect, then the population standard deviation is given by SD* = SD (1 + 1/N)^{1/2}, where N is the number of simulations. Defining D2012 as the difference between the observed temperature change over the period 1998–2012 (relative to 1961–1990) and the MMM value over the same period, then the ratio D2012/SD* provides a measure of the role of internal climate variability. For example, if D2012/SD* = 1.0 in a particular location, then the difference is equivalent to 1 SD*, indicating that it is easily accounted for by internal variability.
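The ratio can be made concrete with a short sketch. The function names and inputs are hypothetical; the factor (1 + 1/N) is what one obtains for the variance of the difference between a single realisation and the mean of N realisations when all are assumed independent with a common standard deviation SD.

```python
import math

def sd_star(sd, n):
    """SD of (one realisation minus the mean of n independent realisations)."""
    # Var(obs - MMM) = SD^2 + SD^2 / n, under the independence assumption.
    return sd * math.sqrt(1.0 + 1.0 / n)

def variability_ratio(obs_change, model_changes, sd):
    """D / SD*, where D is the observed change minus the multi-model mean change."""
    mmm = sum(model_changes) / len(model_changes)
    d = obs_change - mmm
    return d / sd_star(sd, len(model_changes))
```

For large N the correction factor approaches 1, so SD* is only slightly larger than SD, and a ratio with magnitude well above about 2 at a given location would be difficult to attribute to internal variability of the modelled magnitude alone.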

Fig. 5

Estimates of the contribution of internal climate variability to recent multi-decadal temperature change over the Indo-Pacific: a multi-model mean (MMM) of the standard deviation (SD) of ∆T = T15(t) − T30(t) (°C) that can arise from internal climate variability alone, and the (spatially varying) ratio D(2012)/SD*, where SD* is the standard deviation of variability that can arise in D(2012) from internal climate variability alone, under the assumption that the models and the forcing applied to them are perfect. b All models, c ten warmest models only and d ten coldest models only, are used to calculate both D and SD*. Here D(2012) = Tobs(2012) − MMM [Tmodels(2012)], T15 is a 15-year average ending in year t, and T30 is an average over a 30-year period ending 22 years prior to t. This statistic is chosen to match the averaging lengths and gaps between the periods used in Fig. 6 (i.e. the 15-year period 1998–2012 and the preceding 30-year period 1961–1990). The estimates presented are based on the internal climate variability evident in pre-industrial simulations (see Supplementary Table 1 for a list of the simulations analysed). Under the assumption that the models and the forcing applied to them are perfect, the standard deviation of the difference, D, between ∆Tobs and the MMM of ∆Tmodels is given by SD* = SD (1 + 1/N)^{1/2}, where N is the number of models used in the calculation of the MMM

Figure 5b shows the pattern of D2012/SD* for all simulations, while Fig. 5c, d show the patterns for the ten “warmest” (based on D2012) and the ten “coolest” simulations respectively. In all three cases (in Fig. 5b–d) the magnitude of D2012/SD* over the Pacific Ocean actually exceeds 4, indicating that the differences are extremely large compared with modelled internal multi-decadal variability. In fact the warmest models exhibit ratios exceeding five in some locations. The model-to-observed contrast in D2012 is therefore very unlikely—especially in the case of the warmest models—to be caused by internal climate variability alone.

This conclusion is partially based on the assumption that models simulate a realistic level of internal multi-decadal climate variability. But do they? Given the importance of the IPO to multi-decadal variability in the Pacific, let us begin answering this question by examining how well the models simulate the IPO. IPO patterns were diagnosed in each of the model simulations. The pattern matching correlation coefficient with the observed IPO ranges from 0.3 to about 0.75 (cf. Brown et al. 2014). The MMM IPO pattern (Fig. 6b) resembles the observed pattern (Fig. 6a) but the modelled temperature excursions depicted are not as large. This is consistent with the findings of Kociuba and Power (2015). They concluded that the difficulty models have in simulating the recent multi-decadal strengthening of the Walker circulation may be partially due to a systematic underestimate of internal interdecadal variability in the strength of the Walker circulation. It is also consistent with the results of England et al. (2014), who drew similar conclusions in relation to Pacific trade wind strength, and with the fact that models tend to underestimate autocorrelation arising from the El Niño-Southern Oscillation, which can make it difficult for models to generate enough decadal and longer-term IPO-related variability (Kociuba and Power 2015).

Fig. 6

Standard deviation of SST variability (1900–2005) linked to the Interdecadal Pacific Oscillation (IPO): a observations (Rayner et al. 2003), b multi-model mean (MMM). Stippling indicates that over 70 % of models have the same sign as the multi-model mean, which exceeds the 99 % statistical significance level under the assumption of model independence (Power et al. 2012)

While the analysis above suggests that modelled IPO variability might be too weak, the observational record is relatively brief for this purpose, and so our level of confidence in the conclusion that models underestimate the level of multi-decadal variability in the Pacific is not high. Furthermore, the modelled internal multi-decadal climate variability would have to be far weaker than real-world internal interdecadal climate variability in order for this to account for the model-to-observed contrasts over the Pacific (in Figs. 4c, 5b–d). For example, if the standard deviation is doubled to account for this possible model deficiency, then the ratio (Fig. 5b–d) still exceeds 2.5. A naturally occurring excursion of the real-world IPO would therefore have to be very large in historical terms. But this does not appear to be the case, as the recent multi-decadal excursion of the IPO index (see Appendix) is not unusual in terms of the IPO’s own historical record.

We therefore conclude that it is highly unlikely that Pacific Ocean internal variability (even if allowance is made for possibly weaker than observed simulated variability) alone can account for the resulting contrast between the observations and the model simulations (in Figs. 4c, 5b–d; especially Fig. 5c—the “warmest” models).

Finally, note that while Huber and Knutti (2014) argue that ENSO variability could account for a 15-year cooling trend of −0.06 °C, we find little evidence that the IPO and El Niño/La Niña events can explain the apparently robust MMM warm bias in multi-decadal temperature changes.

6 Model temperature projections

Figure 7 compares observed and individual model values (15-year trailing averages of anomalies relative to 1961–1990) for both the globe and the Pacific. The observed global value in 2012 (Fig. 7a) is lower than 90 % of the individual values, while the corresponding Pacific value (Fig. 7b) is lower than all the model simulations analysed. Also shown is the evolution of the ten warmest (based on D(2012)) simulations (red lines) and the ten coolest simulations (blue lines). The plot indicates that the projected global values at the end of the twenty-first century partly depend on the values at the start of the twenty-first century, since the red and blue lines tend to separate. On the other hand, there is no evidence for a comparable separation over the Pacific.

Fig. 7

Trailing 15-year average surface air temperature anomalies, relative to preceding 30-year averages. For example, the value in 2012 represents the multi-decadal difference 1998–2012 relative to 1961–1990. Observed (HadCRUT4, black) and model values (1920–2100, red, blue and grey): a Global and b Pacific. The red lines correspond to the ten warmest values at 2012, the blue lines to the ten coolest, and the grey lines to the remainder
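The metric plotted in Fig. 7 can be illustrated with a minimal sketch. The function name `trailing_anomaly` and the linear toy series are assumptions made purely for illustration; they are not the data or code used in this study. The sketch averages a series over the 15 years ending in a given year and subtracts the 1961–1990 mean, so that the 2012 value corresponds to 1998–2012 relative to 1961–1990.

```python
from statistics import fmean


def trailing_anomaly(series, end_year, window=15, base=(1961, 1990)):
    """Mean over the `window` years ending at `end_year`, minus the base-period mean.

    `series` maps year -> annual-mean temperature (deg C).
    """
    baseline = fmean(series[y] for y in range(base[0], base[1] + 1))
    recent = fmean(series[y] for y in range(end_year - window + 1, end_year + 1))
    return recent - baseline


# Hypothetical series (NOT observed or model data): steady warming of
# 0.01 deg C per year from 1920 to 2100.
toy = {year: 0.01 * (year - 1920) for year in range(1920, 2101)}

# Value "at 2012": the 1998-2012 mean relative to the 1961-1990 mean.
d_2012 = trailing_anomaly(toy, 2012)
print(round(d_2012, 3))
```

For the toy series the result is simply the trend multiplied by the 29.5-year separation between the mid-points of the two periods.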

Figure 7 also indicates that model-to-model variability in multi-decadal temperature change in the Pacific (between 1961–1990 and 1998–2012) is largely driven by internal variability, whereas model-to-model variability in recent multi-decadal global temperature change is partially driven by model-to-model differences in the response of models to external forcing.

Figure 8 provides model results from three different twenty-first century scenarios in which GHG emissions and concentrations are highest under RCP8.5, lower under RCP4.5 and lowest under RCP2.6. It can be seen that the simulations that yield relatively warm changes over the past half-century (red lines) tend to produce larger projected changes. Similarly, simulations that yield relatively cool temperature changes over the past half-century (blue lines) tend to exhibit smaller projected changes. This finding is independent of the baseline period chosen. The contrast between late twenty-first century projections from the models exhibiting the greatest change over the past half-century and the models exhibiting the least change over the past half-century is clearest under RCP8.5 (Fig. 8a) and RCP4.5 (Fig. 8b).

Fig. 8

Trailing 15-year global average surface air temperatures. Observed (HadCRUT4, black) and model values (1920–2100, red, blue and grey): a RCP8.5, b RCP4.5 and c RCP2.6. The red lines correspond to the ten warmest MMM values at 2012, the blue lines to the ten coolest, and the grey lines to the remainder

The link between recent multi-decadal temperature change and projected temperature change is further illustrated in Fig. 9, which shows the projected global average temperature change for the period 2086–2100 versus the simulated multi-decadal temperature change for the period 1998–2012 (both changes relative to 1961–1990). In each case, the relatively warm present-day models yield relatively warm projections, and the strength of this effect increases with higher emissions. For RCP8.5 (Fig. 9a) the correlation coefficient (r) associated with the line-of-best-fit is +0.45, which is statistically significant (p value of 0.005).

Fig. 9

Projected global average temperature change for the end-of-century 15-year period (2086–2100) (relative to 1961–1990) versus model average multi-decadal temperature change from 1961–1990 to 1998–2012, for three different emissions scenarios. The blue and red squares indicate the average values for “cool” and “warm” models, respectively. Lines-of-best-fit, and associated correlation coefficients and p values, are shown. a RCP8.5, b RCP4.5 and c RCP2.6
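As an illustration of the kind of calculation summarised in Fig. 9, the sketch below fits a least-squares line and computes a Pearson correlation coefficient between recent and projected changes. The function names and the five data pairs are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the CMIP5 simulations. A p value would additionally require the cumulative distribution function of Student's t distribution (available in statistical libraries), so only the t statistic is computed here.

```python
import math


def fit_and_correlate(x, y):
    """Return (slope, intercept, r) for an ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical values (deg C, both relative to 1961-1990); NOT model output.
recent_change = [0.30, 0.40, 0.50, 0.60, 0.70]      # 1998-2012
projected_change = [3.9, 4.3, 4.2, 4.8, 4.7]        # 2086-2100

slope, intercept, r = fit_and_correlate(recent_change, projected_change)
t = t_statistic(r, len(recent_change))
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}, t={t:.2f}")
```

A positive slope indicates that models with larger recent changes also tend to have larger projected changes relative to the same baseline.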

Note that this link does not necessarily mean that the warming rate between 1998–2012 and 2086–2100 is greater for the warmer models. Suppose we define “cool models” as models that simulate recent warming less than 0.35 °C, and “warm models” as models that simulate recent warming greater than 0.55 °C. Then under RCP8.5 the “warm models” warm more than the “cool models” (4.15 °C compared with 3.71 °C), whereas the warming is similar for both “warm models” and “cool models” under RCP4.5 (1.98 and 1.96 °C, respectively), and under RCP2.6 the “warm models” actually warm less than the “cool models” (1.03 and 1.23 °C, respectively) between the two periods (i.e. 1998–2012 and 2086–2100). We might therefore conclude that model-to-model deviations in the magnitude of global warming over the past half-century do not seem to provide simple or clear guidance for twenty-first century temperature change for the same models. These results appear broadly consistent with those of England et al. (2015), who showed that model-to-model deviations in recent, much shorter (15-year) trends do not provide any guidance on model-to-model deviations in the magnitude of late twenty-first century warming.
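The comparison between “cool” and “warm” models can be sketched as follows. The thresholds (0.35 and 0.55 °C) are those quoted in the text, but the function names and the model values are hypothetical. Because both changes are expressed relative to 1961–1990, the warming between 1998–2012 and 2086–2100 is obtained by subtracting the recent change from the projected change.

```python
from statistics import fmean

COOL_MAX = 0.35  # deg C: recent warming below this defines a "cool" model
WARM_MIN = 0.55  # deg C: recent warming above this defines a "warm" model


def group_of(recent_change):
    """Classify a model by its 1998-2012 change relative to 1961-1990."""
    if recent_change < COOL_MAX:
        return "cool"
    if recent_change > WARM_MIN:
        return "warm"
    return "other"


def mean_further_warming(models):
    """Average warming from 1998-2012 to 2086-2100 for each group.

    `models` is a list of (recent_change, projected_change) pairs, both
    relative to 1961-1990, so their difference is the change between
    the two later periods.
    """
    groups = {}
    for recent, projected in models:
        groups.setdefault(group_of(recent), []).append(projected - recent)
    return {name: fmean(values) for name, values in groups.items()}


# Hypothetical (recent, projected) pairs in deg C; NOT CMIP5 output.
toy_models = [(0.30, 4.0), (0.32, 4.1), (0.45, 4.4), (0.60, 4.8), (0.70, 4.9)]
result = mean_further_warming(toy_models)
print({name: round(value, 2) for name, value in result.items()})
```

In this toy ensemble the “warm” group warms more between the two periods than the “cool” group, as found under RCP8.5; different input values could equally give similar or reversed group means, as found under RCP4.5 and RCP2.6.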

However, under RCP8.5 the “warm models” do tend to warm more than the “cool models” between 1998–2012 and 2086–2100, and there is a (very marginally) larger warming of the “warm models” under RCP4.5. This leaves open the possibility that some of these models might overestimate the warming response to the imposition of greenhouse gas increases. But clearly, model-to-model contrasts in the sensitivity to GHG increases alone cannot fully explain the results.

Model-to-model contrasts in the response to both (a) GHG increases and (b) sulphate aerosol changes may need to be taken into account in order to understand the results. This possibility can arise if “cool models” tend to cool more than the other models in response to sulphate aerosol increases and if “warm models” tend to warm more than the other models in response to GHG increases. Factors (a) and (b) could then explain the results if: (a) dominates the contrasts in the twenty-first century warming rates between “warm” and “cool models” under RCP8.5 (which has the largest GHG increases); (b) dominates under RCP2.6 (which has the smallest GHG increases); and neither dominates under RCP4.5 (which has GHG increases that lie between those of RCP2.6 and RCP8.5).

Finally, note that since the modelled multi-decadal changes over the past half-century as a whole appear too high, most likely because of issues associated with forcings and/or the sensitivity to GHGs and other anthropogenic forcing (Flato et al. 2013; Kirtman et al. 2013), it follows that the projections might also be too high.

7 Summary and discussion

The vast majority of the numerous earlier papers related to the so-called global warming hiatus (e.g. Fyfe et al. 2013; Fyfe and Gillett 2014; Flato et al. 2013; Hawkins et al. 2014; Kosaka and Xie 2013; see Introduction for additional references) focussed on relatively short-period (e.g. 15-year) trends. Here we primarily focus on much longer-term (multi-decadal) changes. A key example of a multi-decadal change is a change from a 30-year reference period (1961–1990) to a later 15-year period (e.g. 1998–2012). Such metrics measure changes over approximately half a century. This very different focus means that earlier conclusions drawn on the basis of analyses examining much shorter-term change (e.g. trends over 15-year periods) do not necessarily apply.

We examined decadal, multi-decadal and longer-term changes in global and Pacific temperatures using data up to near-present, and compared these to simulated changes over the same period. We identified large and important model-to-observed contrasts, and we examined the implications the contrasts have for projections of global and Pacific temperature over the remainder of the twenty-first century. In order to do this we also examined the ability of models to simulate multi-decadal climate variability and the degree to which internal, multi-decadal climate variability can account for the model-to-observed contrasts.

A key finding is that, in all of the CMIP5 model simulations analysed, for the period 1998–2012 relative to 1961–1990, the Pacific warms more on multi-decadal time-scales than do the observations. Furthermore, at the global scale, 90 % of the simulations exhibit greater than observed multi-decadal surface warming.

The more moderate observed warming is due in part to cooling by the IPO and to an increase in the frequency of La Niña events. However, we find that the magnitude of the contrast between the observed and simulated multi-decadal changes in Pacific temperature is so large that it is highly unlikely that Pacific Ocean internal variability alone, even if allowance is made for possibly weaker than observed simulated variability, can account for the contrast. This is especially true for the contrast between observations and the models exhibiting the greatest multi-decadal warming. We also find that, while the IPO makes a contribution to the recent multi-decadal change in Pacific temperature, the corresponding multi-decadal change in IPO indices is not large in terms of its own historical record. This does not support the hypothesis (see, e.g., Kociuba and Power 2015) that the contrast between observed and modelled multi-decadal variability over the past half-century can be accounted for by an extreme, natural excursion of the IPO in the real world.

Together, this indicates that imperfections in the models or the forcing applied to them are partially responsible for the model-to-observed contrast in multi-decadal temperature change over the past half-century. To be relevant the imperfections must result in too little cooling during 1998–2012 relative to the reference period (1961–1990), or too little warming during the reference period. This could include imperfections in the forcing used in the simulations over the past half-century. Candidates for errors in forcing include: an underestimate of stratospheric aerosol concentrations; reduced solar output; lower levels of water vapour in the upper atmosphere; sulphate aerosols; and the impact of GHGs on warming in the tropical Pacific (Andersson et al. 2015; Clement et al. 1996; Flato et al. 2013; Huber and Knutti 2014; Santer et al. 2014; Schmidt et al. 2014) in the more recent period (e.g. 1998–2016). Huber and Knutti (2014) estimate that underestimating stratospheric and solar factors could have contributed to a cooling trend of −0.07 °C over 15 years, but this appears insufficient to account for the approximately 0.2 °C difference at the end of the time series. Another candidate for explaining the model-to-observed contrast is that some of the models, especially those that exhibit the greatest multi-decadal warming to date, may overestimate the warming response to the imposition of greenhouse gas increases.

We showed, however, that model-to-model contrasts in the sensitivity to GHG increases alone do not fully explain the model-to-model contrasts in projected global temperature increases from 1998–2012 to 2086–2100. Instead we showed that competing, model-to-model contrasts in the response to both (a) GHG increases and (b) sulphate aerosol might explain the results if we assume that: (a) dominates the contrasts in the twenty-first century warming rates between “warm models” (i.e., the models that warmed the most over the past half-century) and “cool models” (i.e., the models that warmed the least over the past half-century) under RCP8.5 (which has the largest GHG increases among the RCP scenarios); (b) dominates under RCP2.6 (which has the smallest GHG increases); and neither (a) nor (b) dominates under RCP4.5 (which has GHG increases that lie between those of RCP2.6 and RCP8.5). This could then explain: why “warm models” warm more than “cool models” under RCP8.5 from 1998–2012 to 2086–2100; why the temperature changes between 1998–2012 and 2086–2100 are similar in the “warm models” and “cool models” under RCP4.5; and why the “cool models” warm more than the “warm models” do, between 1998–2012 and 2086–2100, under RCP2.6.

Further substantial warming over the twenty-first century is nevertheless projected in all the models under business as usual emission scenarios (RCP4.5 and RCP8.5; Figs. 1, 6 respectively). If attention is restricted to models with more accurate simulations of recent multi-decadal temperature change then the resulting model ensemble has far fewer members exhibiting the highest warming levels in the late twenty-first century. The warming nevertheless remains much larger than warming to date, under both the RCP4.5 and RCP8.5 scenarios (Table 1).

Table 1 Details of MMM temperature anomalies (°C) with respect to 1961–1990. The observed (HadCRUT4) value for 1998–2012 is +0.39 °C

Finally, we found that observations up to 2016 show no evidence of a hiatus in the background global warming trend. This is consistent with the findings from other studies (Karl et al. 2015; Lewandowsky et al. 2015, 2016) that stress the importance of multi-decadal time scales rather than short-term trends (e.g. Fyfe et al. 2013; Flato et al. 2013; Hawkins et al. 2014; Kosaka and Xie 2013). However, the data do indicate a temporary hiatus in the warming trend for the Pacific, due in part to cooling by the IPO associated with a short-term excess of La Niña events between 2003 and 2014.