How accurately can the climate sensitivity to CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{

The equilibrium climate sensitivity (ECS, in K) to CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} doubling is a large source of uncertainty in projections of future anthropogenic climate change. Estimates of ECS made from non-equilibrium states or in response to radiative forcings other than 2×CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {2}\times \hbox {CO}_{2}$$\end{document} are called “effective climate sensitivity” (EffCS, in K). Taking a “perfect-model” approach, using coupled atmosphere–ocean general circulation model (AOGCM) experiments, we evaluate the accuracy with which CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} EffCS can be estimated from climate change in the “historical” period (since about 1860). We find that (1) for statistical reasons, unforced variability makes the estimate of historical EffCS both uncertain and biased; it is overestimated by about 10% if the energy balance is applied to the entire historical period, 20% for 30-year periods, and larger factors for interannual variability, (2) systematic uncertainty in historical radiative forcing translates into an uncertainty of ±30to45%\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\pm }\,30\, {\rm to} \,45\%$$\end{document} (standard deviation) in historical EffCS, (3) the response to the changing relative importance of the forcing agents, principally CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} and volcanic aerosol, causes historical EffCS to vary over multidecadal timescales by a factor of two. In recent decades it reached its maximum in the AOGCM historical experiment (similar to the multimodel-mean CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} EffCS of 3.6 K from idealised experiments), but its minimum in the real world (1.6 K for an observational estimate for 1985–2011, similar to the multimodel-mean value for volcanic forcing). The real-world variations mean that historical EffCS underestimates CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} EffCS by 30% when considering the entire historical period. The difference for recent decades implies that either unforced variability or the response to volcanic forcing causes a much stronger regional pattern of sea surface temperature change in the real world than in AOGCMs. We speculate that this could be explained by a deficiency in simulated coupled atmosphere–ocean feedbacks which reinforce the pattern (resembling the Interdecadal Pacific Oscillation in some respects) that causes the low EffCS. We conclude that energy-balance estimates of CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} EffCS are most accurate from periods unaffected by volcanic forcing. Atmosphere GCMs provided with observed sea surface temperature for the 1920s to the 1950s, which was such a period, give a range of about 2.0–4.5 K, agreeing with idealised CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} AOGCM experiments; the consistency is a reason for confidence in this range as an estimate of CO2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\hbox {CO}_{2}$$\end{document} EffCS. Unless another explosive volcanic eruption occurs, the first 30 years of the present century may give a more accurate energy-balance historical estimate of this quantity.


Introduction
The equilibrium climate sensitivity (ECS), defined as the steady-state global-mean surface air temperature change due to a doubling of the atmospheric carbon dioxide concentration, has been used for decades as a benchmark for the magnitude of climate change predicted by general circulation models (GCMs) in response to CO 2 increase. Although an equilibrium climate is not expected in the future, ECS is relevant to future climate change because it correlates with global warming under realistic time-dependent scenarios for the future, which are dominated by CO 2 increase Knutti et al. 2017;Grose et al. 2018). Over the past 25 years, GCMs have considerably improved in their simulation of present climate and historical climate change (Reichler and Kim 2008;Flato et al. 2013, where by "historical" we mean since the 19th century), but their ECS has had a persistently wide spread. The range of ECS simulated by GCMs was 1.9-5.2 K (Mitchell et al. 1990) when assessed in the first Assessment Report of the Intergovernmental Panel on Climate Change, and 2.1-4.7 K in the most recent (the Fifth Assessment Report, AR5, Flato et al. 2013).
This uncertainty has stimulated efforts to evaluate the ECS from observed historical climate change. One common approach is to apply the global-mean energy balance of the climate system where F is the effective radiative forcing (ERF, Myhre et al. 2013, calculated from observed or estimated forcing agents), N is the global-mean net downward radiative flux at the top of the atmosphere (TOA) i.e. the heat flux into the climate system, T is the global-mean surface temperature change with respect to an unperturbed equilibrium in which N = F = 0 , and R = F − N = T is the radiative response of the system to change in T. Note that F is positive downwards, while R is positive upwards.
Our in Eq. (1) is the positive-stable climate feedback parameter W m −2 K −1 , with > 0 so that R = T resists F. This sign convention is convenient for our purposes. Some papers on this subject use a negative-stable climate feedback parameter , numerically the same as ours but with + T instead of − T in Eq. (1). The advantage of that convention is that those processes which are positive feedbacks in a physical sense e.g. water vapour feedback, tending to amplify T, make positive contributions to the net , which is negative. The reciprocal of (= − ) is the climate sensitivity parameter S = 1∕ ( K W −1 m 2 ); the larger , the smaller S . This quantity is always given a positive sign, regardless of the sign convention for .
The energy balance (Eq. 1) implies that ECS = F 2× ∕ , where F 2× is the ERF of 2 × CO 2 , since N = 0 in the perturbed equilibrium. Thus a larger implies a smaller ECS. When is estimated from climate change which has not reached equilibrium (whether historical, future or under idealised scenarios), F 2× ∕ = SF 2× is called the "effective climate sensitivity" (EffCS), which equals the ECS only if is a constant, as was formerly assumed (e.g. by Gregory et al. 2002, among many others). The usual method to estimate in CMIP5 is from Eq. (1), by regression of N against T for the abrupt4xCO2 experiment, in which CO 2 is instantaneously quadrupled at t = 0 with respect to the control state (Gregory et al. 2004). Recent work shows that historical climate change tends to give a larger median estimate of , and hence a smaller EffCS, than GCMs do under idealised high-CO 2 scenarios, such as abrupt4xCO2, which have ERF of the magnitude typically projected for the 21st century (Forster 2016).
Since the unperturbed equilibrium is not a known historical state, in practice Eq. (1) is applied to the differences (denoted by Δ, in N, F and T) between two historical states (Gregory et al. 2002;Otto et al. 2013) or by regression in the differential form Both Eqs. (2) and (3) eliminate the unknown equilibrium state. If data is available throughout the period of interest, regression (Eq. 3) is a more efficient estimator of the slope than differences (Barnes and Barnes 2015). Either way, this is a modified version of the method of Gregory et al. (2004), following Forster and Gregory (2006) and Tett et al. (2007), for the situation where F is time-dependent. Many studies have estimated from real-world historical F, N and T using Eqs. (1), (2) or (3) in various ways (examples are cited in the review by Knutti et al. 2017).
ERF F is not an observable quantity, and has to be calculated using models of radiative transfer, calibrated formulae (e.g. supplementary material of Myhre et al. 2013) and atmosphere GCM (AGCM) experiments (Sect. 3.1; Hansen et al. 2005). Therefore historical F is a source of systematic uncertainty in estimating , especially on account of anthropogenic tropospheric aerosol forcing (Gregory et al. 2002;Myhre et al. 2013;Forster 2016;Skeie et al. 2018).
Historical N is a source of statistical uncertainty in estimating , due to the combination of two circumstances. First, internally generated i.e. unforced variations in the climate system add statistical "noise" to the externally forced signal in N. Second, the comparative shortness of the observational record of N limits the possibility of reducing the imprecision due to the noise. N can be evaluated reasonably precisely from satellite measurements of the global TOA Earth radiation budget, especially by the Earth Radiation Budget Experiment (ERBE) during 1985-1988 and by the Clouds and Earth's Radiant Energy System (CERES) since 2000, and of global ocean temperature measurements by Argo floats since 2005 (Allan et al. 2014;Roemmich et al. 2015;Palmer 2017). N can be estimated less precisely from the sparser ocean temperature measurements made by ships back to the 1960s, but hardly at all for earlier decades (Abraham et al. 2013).
An alternative method for estimating (Sect. 6.1) has recently been developed, using an AGCM experiment called amip-piForcing, in which observed sea surface temperature (SST) is a boundary condition, to which simulated N responds Zhou et al. 2016;). This method does not involve knowing real historical F and N, and thus avoids the uncertainties associated with these quantities. The amip-piForcing experiment gives a larger (smaller EffCS) for historical climate change than experiments using the same AGCMs, incorporated in coupled atmosphere-ocean GCMs (AOGCMs), to simulate the response to 4 × CO 2 . Moreover, amip-piForcing shows substantial decadal historical variation in . For any transient climate state, the EffCS and quantify the relationship between changes in global-mean R and global-mean T, determined by the response to SST of surface and atmospheric processes which affect TOA radiation. The AOGCM, AGCM and energy-budget analyses provide evidence that is not constant in various ways. We can distinguish two kinds of reason for the inconstancy of . First, might depend on the magnitude of global-mean T or F, which could be formalised by making Eq. (1) non-linear in these quantities (Meraner et al. 2013;Good et al. 2012;Gregory et al. 2015;Bloch-Johnson et al. 2015). Second, R and may vary because of changes in the pattern of SST, i.e. "pattern effects" Ceppi and Gregory 2019). Such effects cannot be predicted by Eq. (1), because it deals only with global means, and it becomes nonsensical in limiting cases. For instance, if changing SSTs alter R but not T, is infinite and EffCS is zero.
The inconstancy of raises the question which is the title of this paper. To address the question, we analyse AOGCM simulations of the historical period. The analysis has two aspects. First, we evaluate how accurately we would be able to estimate the EffCS for CO 2 forcing from the historical record if the real world truly behaved like an AOGCM i.e. a "perfect-model" test. The AOGCMs enable this investigation because they provide complete datasets for many alternative realisations of the historical period, whereas the historical period has occurred only once in the real world and the observational dataset of it is incomplete. Second, we investigate the causes of the time-variation of in the historical period. We make use of AOGCM experiments that simulate change due to unforced variability alone and to subsets of historical forcings, whereas we cannot control these influences in the real world.
In Sect. 2 we give details of the AOGCM experiments, and in Sect. 3 we derive estimates of F for the AOGCMs. In Sect. 4 we show that, if the AOGCMs are realistic, dR∕dT evaluated from historical climate change by Eq. (3) may be an imprecise and biased estimate of the historical , owing to the statistical effects of unforced variability. In Sect. 5 we show that varies during the historical period in response to the changing nature of the forcing, which is not due to CO 2 alone. The AOGCMs indicate that the most recent decades should have closest to its CO 2 value, but in Sect. 6 we present evidence that the historical time-variation of in the AOGCMs may be unrealistic in that regard, by comparison with AGCM amip-piForcing experiments. We conclude in Sect. 7 by discussing the answer to the question posed by the paper, in view of the statistical and systematic errors in estimating the CO 2 from the historical .
Throughout the paper, uncertainties written with ± in the text and shown by coloured shading in the diagrams are one standard deviation or one standard error (as appropriate). Our notation for different methods of estimating , discussed throughout the paper, is summarised in Table 1.

AOGCM historical experiments
We analyse results from the historical, historicalNat and his-toricalGHG experiments from 16 AOGCMs of the Coupled Model Intercomparison Project Phase 5 (CMIP5, Table 2). Climate change is calculated with respect to the piControl experiment, which has constant pre-industrial forcing agents. The historical, historicalGHG and historicalNat experiments begin in the latter part of the 19th century from piControl states, and run to 2005 with time-dependent historical changes in forcing agents. The historical experiment includes all changes in atmospheric composition, anthropogenic and volcanic aerosols, solar irradiance and landuse; historicalGHG includes changes only in greenhouse gas concentrations, historicalNat only in the natural forcing agents of volcanic aerosol and solar irradiance.
Unforced interannual variability in T (pooled standard deviation of 0.11 K in the AOGCM piControl experiments) In this paper > 0 is the positive-stable climate feedback parameter ( W m −2 K −1 ), evaluated as the slope from regression of the globalmean annual-mean radiative response R against surface air temperature change T, from real-world estimates or from ensembles of historical simulations with AOGCMs and AGCMs. Various choices for the regression are denoted as shown in the table, second column for the entire historical period (labelled "All", time-independent and marked with an overbar), third for 30-year periods (labelled "30", timedependent and marked with a tilde) where ̃ (t) applies to the 30 years centred on time t. Lower-case subscripts denote ensemble means of integrations from individual models, upper-case denote multimodel means  mann et al. 2013). Therefore, in order to clarify the forced signal, historical experiments with most AOGCMs have been run as ensembles of various sizes, with each integration in the ensemble beginning from a different state in the piControl experiment. Provided the states are sufficiently separated, the unforced variability in the ensemble members is not correlated, and its temporal standard deviation is a factor 1∕ √ N smaller in the ensemble mean of N integrations than in each individually.
The CMIP5 historical ensembles have no more than 10 members and fewer in most cases (Table 2). We also use a much larger historical ensemble of 100 members carried out with the MPI-ESM1.1 AOGCM, which is an updated version of the CMIP5 AOGCM of Giorgetta et al. (2013). We assume that variations in global climate in the mean of this ensemble are mostly the response to forcing, since unforced variability is reduced by a factor of 10. This makes it very useful in a perfect-model approach, since we can obtain an accurate estimate of its true , provided we know F, which is the subject of the next section.

Historical radiative forcing
To apply the global-mean energy balance to observed climate change, we need to know historical ERF. Myhre et al. (2013, AR5) estimated F(t) from historical emissions and atmospheric composition, radiative transfer calculations, and a variety of models. The net forcing goes up as greenhouse gas concentrations increase, partly compensated by negative ERF from anthropogenic aerosols (our Fig. 1, their Figure 8.18). There is a large negative spike for a small number of years following each major volcanic eruption, due to reflection of sunlight by aerosol formed from sulphur dioxide injected into the stratosphere. A wide systematic uncertainty range of 1.1-3.3 W m −2 is given for the net anthropogenic ERF at 2011 relative to 1750.
In the following sections we diagnose from CMIP5 historical experiments using Eq. (1). For that purpose we need to know F in the AOGCMs, which may be substantially different from the real world F, on account of various model errors. The object of this section is to estimate the model F.

Diagnosis using AGCMs
The historical F(t) can be diagnosed for an AOGCM by running a pair of experiments with the AGCM alone, having prescribed unchanging climatological pre-industrial sea surface temperature and sea ice concentration. One of the experiments, called piClim-histall, has time-dependent atmospheric composition and land use for the historical period, while the other is a control, called piClim-control, with constant pre-industrial forcings (Hansen et al. 2005;Held et al. 2010;Andrews 2014;Pincus et al. 2016).  -0  2  ACCESS1-3  3  2  CNRM-CM5  10  5  6  CSIRO-Mk3-6-0  10  5  5  CanESM2  5  5  5  GFDL-CM3  5  3  3  6  GFDL-ESM2M  1  1  1  5  GFDL-ESM2G  3  HadGEM2-ES  5  4  3  If we assume, despite the forcing, that the surface boundary conditions enforce the same surface temperature in the two experiments, T = 0 ⇒ F = N for the difference in energy balance Eq. 1 between them. That is, the historical ERF equals the net input N of energy to the climate system due to the forcing agents. Surface temperature is free to change over land, for practical reasons (e.g. Kamae et al. 2019), giving T ≃ 10 % of the equilibrium T (Andrews et al. 2012, red crosses in their Fig. 1). This effect has not been quantified for CMIP5 historical simulations, but it will be possible to quantify it in CMIP6 using the experiments piClim-histall and piClim-control.
We have run the experiments with the ECHAM6.3 and HadGEM2-A AGCMs to obtain F(t) for MPI-ESM1.1 and HadGEM2-ES AOGCMs, which incorporate these AGCMs respectively. The ECHAM6.3 (MPI-ESM1.1) F(t) is very close to the AR5 estimate, whereas the HadGEM2 F increases considerably less ( Fig. 1), in part due to strong negative land-use forcing . The difference between these two models illustrates the possibly large but unknown spread in CMIP5 F.

Forcing due to tropospheric and volcanic aerosol
To examine the consistency between our set of AOGCMs and the AR5 regarding forcing, we estimate the historical annual-mean T(t) expected in response to the AR5 F(t) with the "step model", which uses T(t) in response to a stepchange in CO 2 in each AOGCM as a kernel to be convolved with the forcing timeseries (more detail given in Appendix A). The step-model mean shows more warming during the historical period than the AOGCM mean (Fig. 2a). We suggest that this is because the AR5 F is larger than the AOGCM mean F, due to the negative anthropogenic aerosol forcing being stronger in AOGCMs than in reality, consistent with the expert judgement of Myhre et al. (2013). Alternatively, EffCS may be larger for anthropogenic aerosol forcing than it is for CO 2 (i.e. efficacy greater than unity, defined at the start of Sect. 5; Hansen et al. 2005;Shindell 2014;Marvel et al. 2016;but cf. Paynter and Frölicher 2015). The step model implicitly assumes the same EffCS for all forcing agents. The multimodel standard deviation of the step-model timeseries is 0.08 K (the pink envelope in Fig. 2a, pooled over years), which must be due mostly to the AOGCM spread in climate feedback, because the step model uses the same AR5 F for all AOGCMs. The multimodel standard deviation of the AOGCM historical timeseries is 0.14 K (the grey envelope, pooled over years). If the standard deviation of unforced interannual variability in T in every AOGCM were 0.11 K, which is the pooled estimate from piControl, and if the 64 historical integrations (Table 2) were equally weighted (both of these are fair approximations), unforced variability would make a negligible contribution of 0.11∕ √ 64 = 0.013 K to the AOGCM historical multimodel standard deviation. Therefore we suggest that the multimodel standard deviation is larger for the AOGCMs than the step model because of the AOGCM spread in F. Since different choices have been made for numerous aspects of the formulation of AOGCMs, the actual ERF in a given CMIP5 historical run will not necessarily be the same as the AR5 median estimate for the real world.
To estimate the uncertainty in F from AOGCMs, we take N ≃ F∕3 for the multimodel mean (Gregory and Forster 2008), whereby Eq. (2) becomes = (F − N)∕T ≃ 2 3 F∕T ⇒ T ≃ 2 3 F∕ . Therefore the fractional uncertainty in T will be the sum in quadrature of the fractional uncertainties in and historical F, which we assume to be uncorrelated ). For the timemean of 1986-2005 (the reference period of the AR5 for projections) relative to the time-mean of 1860-1879 (our reference period for ERF in Fig. 1), T has a standard deviation in the step model of about ±15% . This uncertainty is attributable to . It is negligible compared with the standard deviation in the AOGCMs in T of ±45% , which must therefore be nearly entirely attributable to the AOGCM uncertainty in F. By comparison, if the AR5 likely range for F of 1.13-3.33 W m −2 K −1 at 2011 relative to 1750 ) is assumed to represent the 5-95% range of a normal distribution, its standard deviation is ±30%.
We have evaluated the root-mean-square (RMS) difference in T(t) for 1900 onwards between the step-model mean and the AOGCM mean as a function of a time-independent scaling factor applied to the AR5 timeseries of anthropogenic aerosol ERF. The smallest RMS difference, meaning the closest mean match of the step models to the AOGCMs (dashed red line in Fig. 2b), is obtained by making the anthropogenic aerosol ERF 50% stronger (more negative) than the AR5 estimate. Consistent with this finding, the estimate by Zelinka et al. (2014) of the anthropogenic aerosol ERF at 2000 relative to 1860 in a set of AR5 AGCMs is 1.6±0.4 times larger than the AR5 median estimate.
It may also be noted that the negative spikes of F in volcano years are not as deep in the AGCMs as in the AR5 estimate ( Fig. 1). Linear regression of AGCM F against AR5 F for the years with strong volcanic forcing gives 0.78 for ECHAM6.3 and 0.58 for HadGEM2. This is qualitatively consistent with earlier findings that volcanic forcing is about 80% of the AR5 estimate in the mean of CMIP5 AOGCMs (Larson and Portmann 2016), and about 70% in the HadCM3 AOGCM , which the latter authors attributed to rapid cloud adjustments not included in the AR5 estimate. Fig. 2 Timeseries of historical global-mean annual-mean surface air temperature, relative to the time-mean of 1900-2005, from observations, from CMIP5 AOGCMs (using the ensemble mean for each AOGCM) and from the step-model emulation of CMIP5 using the AR5 ′ ERF timeseries with scaling factors (described in the text) applied to volcanic and anthropogenic aerosol ERF. The solid lines show the multimodel mean for the AOGCMs and the emulation of AOGCMs. In a the envelopes show the ensemble standard deviation, and b compares the multimodel means with the observational estimate Step model with CMIP5 abrupt4xCO2 and AR5 forcing Krakatau Santa Maria Katmai Agung

El Chichon Pinatubo
Ditto with anthropogenic aerosol ERF x 1.5

Estimate of CMIP5 historical forcing
To estimate the historical F(t) in CMIP5 models, in view of the findings of this section, we multiply the AR5 volcanic F by 0.8 and the AR5 anthropogenic aerosol F by 1.5. Henceforth by " AR5 ′ forcing" we mean the AR5 F with these modifications. The AR5 ′ F is not a revised estimate for the real world. We note that there there is a model spread of ±45% , but we do not have estimates for individual CMIP5 models. In CMIP6, the historical F for each model will be diagnosed by the AGCM experiments of Sect. 3.1, which are included in the Radiative Forcing Model Intercomparison Project (RFMIP, Pincus et al. 2016).

Using regression to estimate historical climate feedback
During the historical period, the net forcing grows, T rises, and the heat loss R to space increases. The 100-member MPI-ESM1.1 historical ensemble is useful to illustrate this behaviour because it is so large that the noise is fairly small in the ensemble mean, and because we have a diagnosis of F for this model (Sect. 3.1), enabling an accurate estimate of R = F − N . We see that the decadal trends of R = F − N and T usually have the same sign, both usually being positive, and their interannual variability shows some similarity as well, especially regarding the negative excursions caused by volcanic forcing (Fig. 3). Their agreement on these features means that the ensemble-mean annual-mean R and T are positively correlated (with coefficient of 0.94, Fig. 4). This is consistent with the assumption R = T of the energy balance Eq. 1, which motivates the estimation of from the covariation of R and T.
In this section, we summarise some statistical issues that affect the accuracy of the estimate. Its findings are important to the interpretation of historical data, but its subject is a digression from the physical investigation. Therefore we have put the detailed discussion and mathematical demonstrations in appendices.
Following many other authors, we obtain according to Eq. (3) as the slope from linear regression of R against T. Unforced variability affects N and hence R, making statistically uncertain. From the MPI-ESM1.1 historical ensemble, the distribution of obtained by regression of R against T in the individual integrations is 1.38 ± 0.08 W m −2 K −1 (mean and standard deviation). This is consistent with the median of 1.43 W m −2 K −1 estimated by Dessler et al. (2018) from the same dataset using differences between the means of the last and the first decades Eq. 2. The standard deviation of slopes from the difference method is 0.14 W m −2 K −1 , larger than from the regression method, because the latter uses more data, making it a more efficient estimator (Appendix D.1).
The choice of T as independent variable follows our physical intuition that T determines the magnitude of R rather than vice-versa. Using the historical MPI-ESM1.1 ensemble, we show that this choice is preferable also on statistical grounds (Appendix B). We show further that estimates of historical made by OLS regression from real-world R and T are biased low, giving an overestimate of historical EffCS, due to noise T ′ in T which does not produce proportionate variability T ′ in R (Appendix C).
Evaluating the statistics for all the AOGCMs, we find that the bias is larger in ̃ (multimodel mean of 20%) for a 30-year period than in (10%) for the entire historical period. The bias affects the difference method as well as OLS regression (Appendix D.1). Total least-squares regression is a method that would avoid the bias, but it is not obviously applicable because it depends on information that we do not have (Appendix D.5).
As well as the mean bias, individual integrations give a spread of slopes due to the noise. The consequent uncertainty is larger in ̃ than in (multimodel mean respectively of 0.42 W m −2 K −1 or ∼ 30%, and 0.11 W m −2 K −1 or ∼ 10%, Appendix C).
For the real world, random error in the observational dataset, due to instrumental uncertainty or sampling, is a possible source of noise in T that is uncorrelated with R, but this is not relevant to the model world, where we have perfect information. In both worlds, unforced variability in the climate system, unrelated to F, is the likely source of bias, through two physical mechanisms (both demonstrated in Appendix D.6).
First, if variability is driven by spontaneous fluctuations in N that have some persistence, and if the response in T to these fluctuations has some thermal inertia, will be biased low (the second case considered by Proistosescu et al. 2018). This effect could be caused for example by interannual variability in cloudiness, and hence planetary albedo, produced by regional climate variability; such variations may persist with anomalies of SST, and the heat capacity of the upper ocean sets the timescale of response. The effect causes to be underestimated by OLS because the spontaneous fluctuation in N is misattributed to R.
Second, if spontaneous variability in SST produces a response in N with a different from the externally forced response, probably because it has a different geographical pattern (Dessler et al. 2018), the OLS slope is contaminated by from the variability. Unlike the first mechanism, this one can produce variability in in either sense.

Time-variation of historical climate feedback related to forcing agents
The original motivation for estimating ECS from historical climate change depends on the assumption that is constant. If it is not, the historical may differ from for idealised CO 2 -forced climate change (Paynter and Frölicher 2015). In this section, we examine the dependence of in AOGCMs on time, and relate this to the changing nature of the forcing, in order to work out how CO 2 may best be estimated from historical . The relationship between forcing and climate response is often discussed in terms of the efficacy, defined as T forced by unit F of the given agent divided by T for unit forcing of CO 2 (Hansen et al. 2005). Our discussion is related to this concept, but it is framed in terms of because we are interested in the variation of R with T due to climate feedbacks. In contrast, efficacy quantifies the dependence of T on F, which involves ocean heat uptake as well, and its definition therefore requires a choice of scenario and timescale for the temperature response. For example, efficacy may be defined using T after a specified elapsed time in an AOGCM experiment with constant forcing (as by Hansen et al. 2005) or the equilibrium T under constant forcing of an AGCM with a slab ocean.

Time-variation of climate feedback in the historical experiment
In the MPI-ESM1.1 historical ensemble, we evaluate the time-variation of ̃ i (t) and ̃ e (t) (see Table 1 for definition) by regression in overlapping 30-year periods e.g. ̃ for the 30 years centred on 1st January 1940 is obtained from regression of annual means for 1925-1954. We find that ̃ e (t) shows significant decadal variation (solid orange line in Fig. 5a). For example, ̃ e = 1.14 ± 0.30 W m −2 K −1 in 1924 and 2.63 ± 0.36 W m −2 K −1 in 1955, whose difference of 1.49 ± 0.47 W m −2 K −1 is significant at the 1% level. This variation must be evidence of time-dependence which is synchronous across the ensemble of integrations, and therefore attributable to external forcing.
On the other hand, ̃ i (t) does not depend significantly on time (dotted orange line in Fig. 5a), judged by comparison with its standard deviation of 0.35 W m −2 K −1 due to unforced variability (the standard deviation among the 100 integrations, pooled over years, not shown). This is because unforced variability has a greater effect on individual integrations, and obscures the response to forcing that can be discerned in the ensemble mean.
Since the historical ensembles with CMIP5 models are much smaller than the MPI-ESM1.1 ensemble, to suppress the unforced variability we aggregate the models, by calculating a time-dependent climate feedback parameter, denoted by ̃ E (Table 1), from the multimodel-mean R(t) and T(t) of the ensemble means of individual CMIP5 models i.e. treating the models as equally weighted members of a "superensemble". (We use the word "multimodel" instead of just "model" to emphasise that it is a mean over all models, rather than the mean over all integrations of a single model.) We assume that the forced response will have correlated time-dependence among the models, whereas the unforced variability will be uncorrelated. The multimodel mean is used for similar reasons in statistical studies of attribution Time-dependent climate feedback parameter ̃ E (the same solid black line in all panels, labelled "CMIP5 E" in panel (a) and "historical" in the other two) for the multimodel mean of the CMIP5 historical experiment, a compared with the mean ̃ I of individual CMIP5 models (labelled "CMIP5 I"), and with ̃ e and ̃ i from the MPI-ESM1.1 ensemble, b compared with ̃ E for the multimodel means of the CMIP5 historicalGHG and historicalNat experiments, and with the time-mean (dotted horizontal line) of ̃ for 30-year periods in the CMIP5 piControl simulations, c compared with ̃ E for the multimodel means of the AGCM amip-piForcing, the CMIP5 historicalNat experiments, and an estimate made from observational datasets for N and T. The lightly coloured regions around the some of the lines are ±1 standard error, with ±1 standard deviation for CMIP5 I in (a). In b and c the vertical dashed lines indicate the beginning of the three periods of the regression analysis of Fig. 6a, centred on 1930, 1960 and 1990. Note that ̃ decreases upwards on the vertical axis, in order that the effective climate sensitivity increases upwards of climate change to forcing agents (e.g. Jones et al. 2013;Hua et al. 2018).
The small standard error of ̃ E (grey envelope in Fig. 5b) means that its time-variation is well-defined and statistically significant. It is moreover rather similar to ̃ e of MPI-ESM1.1 (compare solid black and orange lines in Fig. 5a), corroborating the idea that the time-variation is forced, and thus similar among all models. There is a minimum in ̃ E around 1930, a maximum during 1945-1974, and the absolute minimum (highest EffCS) occurs after 1980. The timevariation cannot be an artefact arising from the OLS bias because the minima in ̃ occur when the rate of warming is largest (around 1930 and after 1980), and hence the bias towards small ̃ due to unforced variability is of minimal importance compared with the response to forcing.
The time-variation of ̃ E in the CMIP5 historical experiment is similar in amplitude and period to the time-variation of ̃ in the AGCM amip-piForcing experiment with observed historical sea-surface temperature (described in Sect. 1; , but different in time-profile (compare black and blue lines in Fig. 5c). We will study amip-piForcing in Sect. 6, once we have drawn conclusions from the present section concerning the response to forcing in the AOGCMs.
For comparison, we also calculate a multimodel mean, denoted by ̃ I (t) (dotted black line in in Fig. 5a), from the ̃ i (t) timeseries of the individual models. Like ̃ i of MPI-ESM1.1, ̃ I has insignificant forced time-variation, judged by comparison with the standard deviation among integrations (grey envelope, calculated for each model ensemble and pooled over models; if also pooled over years, the standard deviation is 0.42 W m −2 K −1 ). The lack of significant forced variation is due to the dominance of ̃ by unforced variability in individual integrations, while the greater OLS bias (Sect. 4) caused by larger unforced variability explains why � I < � E at all times (compare solid and dotted black lines in Fig. 5a).

Greenhouse-gas forcing
Since the largest historical forcing is CO 2 , we consider the possibility that the response to CO 2 could somehow cause forced time-variation in ̃ E . Most CMIP5 models have a tendency for to decrease with time under constant CO 2 (Armour et al. 2013;Andrews et al. 2015). In our set of CMIP5 AOGCMs, regression of −N against T for years 1-20 and years 1-140 of abrupt4xCO2 gives multimodelmean = 1.26 and 1.02 W m −2 K −1 respectively. In some AGCMs and AOGCMs, it has been found that decreases as CO 2 concentration rises (Good et al. 2012;Jonko et al. 2012;Gregory et al. 2015). Either of these effects might explain the long-term decreasing tendency in historical ̃ E (Fig. 5b), although not its decadal variation.
To test this hypothesis, we calculate E in the histori-calGHG experiment, whose forcing is predominantly CO 2 , using the AR5 estimate of greenhouse-gas F(t). We find that R and T in historicalGHG have a high correlation coefficient of 0.99 over the historical period (1871-2005, shown in red in Fig. 6a for the period since 1915), and there is little time-variation in ̃ E in the historicalGHG experiment (solid red line in Fig. 5b). Therefore we reject the hypothesis that the long-term decreasing trend in historical ̃ E is due to CO 2 forcing. After about 1960, historical ̃ E decreases strongly. This tendency is opposite to that of historicalGHG ̃ E , which increases slightly, perhaps due to reduction of OLS bias as the greenhouse-gas forcing grows relative to the unforced variability (Appendices D.3 and D.6).

Comparison of historicalGHG and abrupt4xCO2 climate feedback
The historicalGHG E = 1.03 ± 0.01 W m −2 K −1 (EffCS 3.6 K, Fig. 6a) is close to multimodel-mean = 1.02 W m −2 K −1 from years 1-140 of abrupt4xCO2 (Sect. 5.2). The correlation coefficient across models between abrupt4xCO2 and historicalGHG e is 0.55 for years 1-20 and 0.68 for years 1-140, both significant at the 10% level. This similarity is expected, since historicalGHG is dominated by CO 2 forcing, but because CO 2 varies with time and perhaps with CO 2 concentration, and might differ among the various greenhouse gases, we cannot expect a perfect correlation. We suppose that it is larger for years 1-140 because this timescale is more similar to the length of the historicalGHG experiment. The correlation might also be reduced by our neglect of model-dependence in the greenhouse-gas F(t), which we do not know for any of the models. To take this approximately into account, we recalculate historicalGHG e using the AR5 greenhouse-gas F scaled for each AOGCM by the ratio of that AOGCM's abrupt4xCO2 ERF to the multimodel-mean value. The correlation coefficients with abrupt4xCO2 are increased to 0.61 for years 1-20 and 0.77 for years 1-140 (Fig. 7a), supporting the conjecture that the model spread in greenhouse-gas forcing is substantial (Andrews et al. 2012;Chung and Soden 2015). The historicalGHG e is about 10% larger than abrupt4xCO2 for years 1-140 in the multimodel mean.

Volcanic and anthropogenic aerosol forcings
We have seen that the time-dependence of historical ̃ E is statistically significant (Sect. 5.1), but not related to greenhouse-gas forcing (Sect. 5.2). Therefore we suppose that it is due to the varying relative importance of the other forcing agents. Such an effect could occur if depends on the nature of the forcing. As discussed at the start of Sect. 5, this idea is related to the efficacy of forcing agents. For many agents, including anthropogenic aerosols, is found to be close to CO 2 (efficacy is near unity), provided ERF is used to quantify forcing (Hansen et al. 2002;Shine et al. 2003;Sherwood et al. 2015). For volcanic aerosol, may be larger than for CO 2 (EffCS smaller, efficacy less than unity; Marvel et al. 2016;Ceppi and Gregory 2019).
In this discussion, we frequently consider and contrast three consecutive historical periods, which have different mixtures of forcing, as described in the following paragraphs. We choose them each to be 30 years, like the sliding window used to evaluate ̃ , because that means the OLS bias will not affect their comparison (Sect. 4).
The time-dependence of ̃ E in historicalNat, in which the forcing is dominated by volcanic aerosol (Fig. 1), shows large decadal variation (Fig. 5b). During 1915During -1944 were no large volcanic eruptions, so the variation of T and R and their correlation of 0.41 are all relatively small (green crosses in Fig. 6a) and must be due nearly entirely to unforced variability. For historicalNat during this period regression gives Unlike in historicalNat, T and R have substantial trends in the historical experiment during 1915-1944 (black crosses in Fig. 6a) due to anthropogenic forcing, especially by greenhouse gases (Fig. 1). The historical ̃ E = 1.4 ± 0.1 W m −2 K −1 of this period (solid black line) is somewhat larger than for greenhouse gas forcing (solid red line). This could be explained by the growth of negative anthropogenic aerosol forcing during this period, with a smaller (larger EffCS) than for greenhouse-gas forcing; the combination would produce a larger than either alone (Appendix B in supplementary online material of Gregory and Andrews 2016).  1915-1944 1945-1974 1975-2005 Historical volcano years are in circles For historicalNat for the period since 1945, during which there were three large volcanic eruptions, ̃ E is fairly constant (green line in Fig. 5b). The regression of R against T gives = 2.5 ± 0.2 W m −2 K −1 for 1945-1974 and 2.4 ± 0.1 W m −2 K −1 for 1975 onwards, which are very similar (EffCS 1.5 K), and more than twice historicalGHG ̃ E (compare the dotted and dashed red lines in Fig. 6a with the dotted and dashed green lines). These results suggest that the climate feedback parameter for volcanic forcing is larger (smaller EffCS) than for greenhouse gases (predominantly CO 2 ) in CMIP5 AOGCMs on average.
For 1945-1974 (30 years centred on 1st January 1960) historical ̃ E = 2.1 ± 0.2 W m −2 K −1 , similar to historicalNat (dotted black and green lines in Fig. 6a), and distinct from historicalGHG (dotted red line). We suggest that historical and historicalNat ̃ E are similar during this period because the increase in greenhouse-gas forcing in the historical experiment is offset by the increase in negative anthropogenic aerosol forcing, leaving only a small net anthropogenic forcing trend (Fig. 1), so the strong volcanic forcing from Agung is the greatest influence in both experiments.
For 1975-2005 (a period of 31 years, centred in 1990 and running up to the end of the CMIP5 historical integrations), historical ̃ E = 1.2 ± 0.1 W m −2 K −1 diverges from histori-calNat and comes much closer to historicalGHG (black approaches red in Fig. 5b, dashed black and red lines have a similar slope in Fig. 6a). We suggest that the historical and historicalGHG ̃ E are similar during this period because the net anthropogenic forcing grows much more rapidly due to greenhouse gas increase, once the aerosol forcing is steady (Fig. 1). Despite the further years of volcanic forcing from El Chichon and Pinatubo, the greenhouse-gas forcing dominates the historical F and the consequent rise in T (Fig. 3).   1925-1954 (in red), d timemean piControl ̃ . In a we plot for years 1-140 of abrupt4xCO2, and in b-d years 1-20. In a we use the AR5 estimate for historical-GHG F(t), scaled for each AOGCM by its own abrupt4xCO2 ERF (as discussed in the text), and for b, c we use our AR5 ′ estimate for historical F(t) for all AOGCMs except HadGEM2-ES and MPI-ESM1.1 (models J and P), for which we use F(t) diagnosed in these models individually (compared in Fig. 1). The dotted line in all panels is 1:1; all models lie to the left of this line in d, indicating that piControl � < abrupt4xCO2 In summary, the time-variation of historical ̃ E in CMIP5 can be mainly explained by the varying importance of forcings due to greenhouse gases and volcanic aerosol, if is larger for the latter. This means the EffCS is higher ( smaller) when volcanic forcing is relatively less important, around 1940 (when there were no major eruptions) and since 1975 (when greenhouse-gas forcing has rapidly increased). The growth of negative anthropogenic aerosol forcing during the intermediate period meant that the increase in net anthropogenic forcing was less important than the volcanic forcing, so the EffCS was dominated by response to volcanic forcing, and was relatively low. This explanation does not require EffCS for anthropogenic aerosol to differ substantiantially from the CO 2 EffCS.

Comparison of historical and abrupt4xCO2 climate feedback
Despite the large time-variation of E (black in Fig. 5), multimodel-mean R and T are highly correlated (coefficient of 0.94 for 1871-2006, black symbols in Fig. 6b). Moreover, E = 1.27 ± 0.04 W m −2 K −1 for the entire historical period (dotted black line in Fig. 6b) is very close to the multimodelmean = 1.26 W m −2 K −1 for years 1-20 of abrupt4xCO2 (Sect. 5.2).
However, for individual AOGCMs, the correlation of e with abrupt4xCO2 is much weaker, and insignificant at the 10% level, at 0.24 for years 1-20 (Fig. 7b) and − 0.02 for years 1-140. The multimodel standard deviation of the difference between e and abrupt4xCO2 is 37% (0.47 W m −2 K −1 ). The likely reason is the large AOGCM spread in F, which we have estimated as ± 45 % (Sect. 3.2), due principally to anthropogenic aerosol. Scaling the greenhouse-gas forcing using the ratio of abrupt4xCO2 ERF, as we did for historicalGHG, raises the correlation coefficients somewhat, to 0.37 and 0.24, but they are are still insignificant at the 10% level, confirming the dominant effect of uncertainty in non-greenhouse-gas forcing.
A more accurate estimate might be obtained from periods which are dominated by CO 2 forcing, when historical ̃ should be closer to CO 2 and F is more accurately known. One possibility is the recent decades, when the greenhouse-gas forcing has been increasing rapidly and the anthropogenic sulphate aerosol forcing has been fairly constant (Sect. 5.4; Gregory and Forster 2008;Bengtsson and Schwartz 2013), so historical and historicalGHG ̃ E are consequently close (Fig. 5b). For 1975For -2004 (30 years centred on 1st January 1990) the correlation of ̃ e with abrupt4xCO2 is 0.64 (Fig. 7c), a considerably stronger correlation than for e , and the standard deviation of the difference is smaller, at 27%. Scaling the greenhouse-gas forcing using the ratio of abrupt4xCO2 ERF improves the correlation only a little in this case.
For most of the historical period, ̃ E (t) is much larger (EffCS smaller) in historical than historicalGHG (the timemean difference between the black and red lines is 0.75 W m −2 K −1 in Fig. 5b), but the multimodel-mean difference between historical e and abrupt4xCO2 is only 2% (0.03 W m −2 K −1 ). We can understand this apparent contradiction by considering multimodel-mean R(t) and T(t). The slope during intervals of volcanic forcing (joined by solid orange lines in Fig. 6b) is evidently greater than at other times, consistent with time-varying historical ̃ E (t) (Fig. 5b). However, the volcanic forcing is small on the long-term mean, and although the periods affected by volcanic forcing are of several years, they are only temporary digressions from the long-term trend. Hence the large volcanic ̃ has little effect on the best-fit slope for the entire historical period (dotted black line in Fig. 6b), which is only a little larger than ̃ E = 1.19 ± 0.10 W m −2 K −1 for the last 30 years of the timeseries (dashed black line, the same as in Fig. 6a).
In summary, in the AOGCMs, as an estimate of abrupt4xCO2 , historical E has a small positive bias, because of the influence of volcanic forcing, and a large uncertainty, due principally to anthopogenic aerosol forcing. In the real world, we cannot evaluate accurately because we do not have adequate estimates of F and N for the entire historical period. Response to volcanic forcing has a much stronger effect on the time-dependent ̃ E than it does on E . Therefore ̃ E from periods that are affected by volcanoes has a large positive bias as an estimate of abrupt4xCO2 . In the AOGCMs, the bias is smallest in the period since 1975, during which we have the best observations of the real world. HistoricalGHG and piControl are different in the character of the covariation of R and T, which is highly correlated in the former but not in the latter (correlation coefficient of 0.24 between annual-mean R and T in the piControl population). Nonetheless, their regression slopes are similar. Although historicalGHG ̃ E is greater than piControl ̃ during nearly all the historical period, their difference is rarely statistically significant (Fig. 5b, 5% two-tailed significance level) before about 1970. This explains the simularity of historicalNat and historicalGHG ̃ E during 1915-1945. For each model we compare the piControl ̃ for unforced variability with abrupt4xCO2 for CO 2 forcing. These quantities have a modest but significant correlation across models (0.55, Fig. 7d), as found by Zhou et al. (2015) for the cloud component. Colman and Power (2018) note both similarities and differences in feedbacks for decadal variability and CO 2 forcing. It is clear that abrupt4xCO2 is larger than piControl ̃ in all models, leading us to infer that historicalGHG ̃ e and ̃ E are also larger than piControl. In some models, piControl � < 0.5 W m −2 K −1 , implying EffCS exceeding 7 K, and it is negative in one model (MIROC5). Dessler (2013) found similar results for piControl experiments of AOGCMs from the Coupled Model Intercomparison Project Phase 3 (CMIP3). These low values result from a pronounced OLS bias due to noise in T that is not correlated with R (Appendix C). There is a more complex relationship between R and T for internally generated fluctuations, and it is physically incorrect to treat R simply as an instantaneous response to T (Xie and Kosaka 2017; Lutsko and Takahashi 2018;Proistosescu et al. 2018) 6 Time-variation of historical climate feedback related to SST patterns

Comparison of unforced and
Previously published work has shown that the variation of is mostly determined by the pattern and magnitude of sea surface change in response to radiative forcing (Armour et al. 2013;Andrews et al. 2015;Haugstad et al. 2017;Ceppi and Gregory 2019). The effect of the agent comes mainly via the surface forcing, which is rapidly modified by climate feedbacks, ocean heat uptake and atmospheric and oceanic dynamical responses. We depend on AOGCMs to project the consequent sea surface changes, but we do not know whether their results are realistic in the characteristics relevant to .
In this section we compare from historical AOGCM simulations, driven by forcing agents, with from AGCM simulations driven by sea surface conditions prescribed from observations. AMIP experiments have shown that AGCMs reproduce the time-variation of TOA radiation and other quantities quite well when given realistic surface conditions (Allan et al. 2014). Thus the advantage of the AGCM simulations is their closer resemblance than the AOGCM simulations to the real historical record, while their disadvantage is that they do not allow us to isolate the effects of the individual forcing agents and unforced variability, which have imprinted their effects all together on the observational sea surface conditions.

Time-variation of climate feedback in the amip-piForcing experiment
The AGCM experiment named amip-piForcing, using observationally derived time-dependent historical sea-surface boundary conditions from the Atmosphere Model Intercomparison Project (AMIP, Gates et al. 1999;Hurrell et al. 2008), with constant pre-industrial forcing agents (atmospheric composition etc.), has recently been carried out with various AGCMs ( In each of these AGCMs, e obtained by regression of −N against T from amip-piForcing for the entire historical period is larger (EffCS smaller) than in the abrupt4xCO2 experiment with the corresponding AOGCM ). Regression of multimodel-mean R against T for the five AGCMs gives E = 1.59 ± 0.08 W m −2 K −1 for amip-piForcing (blue crosses and dotted line in Fig. 6b), about 30% larger than both historical ̃ E (black crosses and dotted line), and multimodel mean abrupt4xCO2 = 1.25 W m −2 K −1 for years 1-20 (Sect. 5.5).
When computed in a 30-year window, ̃ (t) shows large decadal variation, but the spread of ̃ among the integrations of each AGCM is rather small, because most of the interannual variability is prescribed through the sea surface conditions (SST patterns dominate the effect, and sea ice variations are relatively uninfluential; Gregory and Andrews 2016). In each AGCM, there is consequently little difference between ̃ i (t) and ̃ e (t) , unlike in AOGCMs. Owing to the strong influence of the common surface boundary conditions, the AGCMs furthermore have synchronised time-variations in ̃ , illustrated by ̃ E of the multimodel mean (blue in Fig. 5c), but they have different time-means and vary with roughly constant offsets. Their spread is similar to that of in the standard idealised amip-p4K AGCM experiment, which imposes a uniform SST warming of 4 K (Ringer et al. 2014).
The minimum ̃ E (maximum EffCS) of amip-piForcing is close to historicalGHG ̃ E (1.03 W m −2 K −1 , Sect. 5.5), and occurs in the middle of the longest interval without major volcanic eruptions, when forced climate change was therefore anthropogenic. This is consistent with the inference that EffCS for greenhouse-gas forcing is higher than for volcanic forcing. For the five AGCMs in our ensemble of amip-piForcing experiments, we have compared ̃ e for 1925-1954 with abrupt4xCO2 of the corresponding AOGCM (red in Fig. 7c). The rank correlation is perfect, and the (product-moment) correlation coefficient is 0.94, consistent with the dominance of CO 2 forcing during this period.
The maximum ̃ E (minimum EffCS) of amip-piForcing is attained in the period since 1960, during which it is fairly constant, while CMIP5 historical ̃ E is declining (EffCS increasing), due to the dominance of the greenhouse-gas increase over volcanic forcing once anthropogenic aerosol has stabilised (as found above, Sect. 5.4). The large recent ̃ E ≃ 2.5 W m −2 K −1 of amip-piForcing is outside the range of all individual CMIP5 historical integrations since 1960 (Marvel et al. 2018) and of all individual CMIP5 piControl integrations, whose maximum ̃ are 2.3 and 2.2 W m −2 K −1 respectively for 30-year periods, and it is about twice the CMIP5 multimodel-mean abrupt4xCO2 (Sect. 5.5).

Effect of patterns of SST change on radiative response
Since amip-piForcing and historical experiments both reproduce observed T(t) closely, the differences in ̃ = dR∕dT between amip-piForcing and historical, which are particularly large around 1940 and 1990 (Fig. 5c), must be due to differences in R(t). During 1925During -1954During (30 years around 1940, R = F − N in the CMIP5 historical multimodel mean has an increasing trend, but R = −N in the HadCM3-A amip-piForcing experiment has no trend (black in Fig. 8b and blue in Fig. 8a respectively), consistent with ̃ being smaller in amip-piForcing (EffCS larger). By contrast, during 1974-2004 (30 years around 1990), R is increasing about twice as fast in amip-piForcing, which has larger ̃ (EffCS smaller).
To investigate how the two sets of sea surface fields (one from CMIP5 AOGCMs, the other from observations) produce the same T(t), but different R(t), we use three further HadCM3-A experiments with constant pre-industrial forcing agents, like amip-piForcing. These experiments have no interannual variation in sea ice concentration, which follows the climatological annual cycle of the AMIP dataset for 1871-1900. The first of the three is the amip-piForcingClimI experiment , which has the same SST fields as amip-piForcing, and yields very similar R(t) (blue and cyan in Fig. 8a), confirming that the interannual variation is due almost entirely to SST changes (rather than sea ice changes).
The other two experiments follow Zhou et al. (2016). One of them applies the global warming but no change in SST pattern, while the other applies the pattern of change but no global warming. They aim to distinguish the effects on from variation of global-mean T and from the changing pattern of SST. The monthly SST fields for 1871-2012 for both experiments are derived from the AMIP SST fields T S (x, y, M, Y) , where x, y are longitude and latitude, M the month within the year and Y the year.
First we calculate the monthly SST climatology T SC (x, y, M) of the late nineteenth century (1871-1900), which we treat as the unperturbed climate, then we calculate the anomaly T S = T S (x, y, M, Y) − T SC (x, y, M) of the SST in a given month from the unperturbed climatological mean. In one experiment, a geographically uniform warming T SU is added to the climatological SST, equal to the global-mean of the anomaly, where G(⋅) denotes a global mean. In the other experiment, the local perturbation T SD to the climatology is the deviation of the local anomaly from its global mean, By construction, and In the experiment with the uniform perturbation T SU , the time-mean global-mean surface air temperature anomaly is T = 0.37 K for 1975-2004 with respect to the 1871-1900 climatology, almost the same as amip-piForcingClimI, and 15% less than T = 0.44 K from amip-piForcing because of omitting the effect of the recent decline in Arctic sea-ice.
The zero-mean perturbation T SD to SST produces negligible global-mean temperature change, but the time-varying changes to the pattern of SST have a strong effect on cloudiness and thus affect N and hence R. During 1975During -2004, the trends in R in the HadCM3-A uniform and deviation experiments are positive ( dR∕dT > 0 ) and about the same size (dotted red and grey lines in Fig. 8a). Each alone is similar to the trend in the CMIP5 historical experiment (dotted black line in Fig. 8b), consistent with our finding above that in amip-piForcing, whose SST perturbation is the sum of the uniform and deviation perturbations, the trend of R is about twice the size as in the historical experiment, making the EffCS smaller in amip-piForcing.
During 1925-1954, the trends in R in the HadCM3-A uniform and CMIP5 historical experiments are positive and similar, but the R in the HadCM3-A deviation experiment has a negative trend. That is, although global-mean T is rising, the changing pattern of SST tends to produce an increasing trend in heat uptake (dN∕dT > 0, dR∕dT < 0) by the climate system. The opposed trends due to the global mean and its pattern lead to the weak net trend of R and make the EffCS larger in amip-piForcing during this period.
Thus R is not a response to T alone, but depends also on the changing patterns of SST. It could be that both the global mean and the patterns have the same causes (unforced or forced), but they do not have a consistent relationship. The time-variation of ̃ in amip-piForcingClimI (and therefore amip-piForcing) is mainly due to the patterns of T SD , while ̃ for the uniform T SU is fairly constant through the historical period (Fig. 8c). Assuming that HadCM3-A is typical of AGCMs in amip-piForcing, we suppose that the common time-variation of ̃ is due to the patterns, while the fairly time-constant model spread is due to model-dependent climate feedback in response to uniform warming.

Differences between simulated and observed responses to volcanic forcing
In Sect. 5.4 we concluded that the time-dependence of historical ̃ E could be mainly explained by the varying relative importance of forcings due to greenhouse gases and volcanic aerosol, if is larger for the latter. In Sect. 6.1 we saw that the time-variation of ̃ E is different for amip-piForcing and historical. In Sect. 6.2 we attributed the time-variation in amip-piForcing to the changing patterns of deviation of SST from its global mean. We conjecture that these findings Fig. 8 a,  For information about the effect of volcanoes, we turn to historicalNat. There is greater similarity in time-dependence of ̃ E since 1930 between historicalNat and amip-piForcing than between historical and amip-piForcing (Fig. 5c). Although all three have smaller ̃ E in the first half of the twentieth century (higher EffCS), the minimum has a similar magnitude and date (around 1940) in amip-piForcing and historicalNat, while historical is increasing by then, having reached its minimum earlier and at a larger value. Moreover, ̃ E is minimum (highest EffCS) in recent decades in historical, but maximum (lowest EffCS) and similar in amip-piForcing and historicalNat. During this period in the latter two experiments ̃ E is close to 2.3 W m −2 K −1 (magenta cross in Fig. 5c, EffCS 1.6 K), which is the value calculated from observational estimates for 1985-2011 for T (HadCRUT4 blended land and sea surface temperature, Morice et al. 2012) and N (ERBE and CERES satellite measurements of TOA radiative flux, Allan et al. 2014) with the AR5 F.
Despite the similarity of the timeseries of ̃ E (t) in amip-piForcing and historicalNat, their R(t) timeseries look quite different (Fig. 8a, b). In historicalNat, immediately after each major volcanic eruption, there is a large negative spike in R, which then returns to zero over ∼ 10 years. The same structure is apparent in R in the historical experiment, where it is superimposed on the positive trend due to global warming. The episodic covariation of volcanically forced T and R gives the large ̃ E ≃ 2.5 W m −2 K −1 of historicalNat for the period since 1975 (green in Fig. 6b).
In the same period, while amip-piForcing has a similar ̃ E (blue line), it does not show unusually large variations in R at the times of eruptions (Fig. 8a); on the contrary, it has larger excursions at other times, presumably due to unforced variability. The same difference of character can be seen when comparing T from the CMIP5 historical experiment with the observational estimate (Fig. 2). Rapid cooling following major eruptions is clear in CMIP5, but not in observations.
The forced response in R to volcanoes in obvious in the historicalNat multimodel mean (green line in Fig. 8b), because the unforced variability has been intentionally suppressed by taking the mean. The negative spikes in R should also be present in amip-piForcing if the CMIP5 simulated forced response is realistic. Because amip-piForcing is driven by the observed record of SST, which is a single realisation of history rather than a mean, we expect that unforced variability will be larger than in the historicalNat multimodel mean, and could cancel out a volcanic spike by chance.
However, it seems unlikely that all the historical major eruptions would have been obscured in this way. The his-toricalNat multimodel mean R(t) falls below − 0.3 W m −2 following the eruptions of Krakatau, Agung, Santa Maria and Pinatubo (green line in Fig. 8b). The same is true for all four of these eruptions in the majority of the 31 individual historicalNat integrations (Table 2), where we count R < − 0.3 W m −2 in the year of the eruption or in either of the following two years as a volcanic signal. There is no historicalNat integration in which fewer than two of these four eruptions produce such a signal, but none of them does in amip-piForcing R (blue line in Fig. 8a).
An alternative possibility is that unforced variability in R is larger in the real world than in CMIP5 AOGCMs, and dwarfs all variations of the size of the forced volcanic signal. Such large unforced variability would dominate the T-R relationship throughout the historical period, Neither anthropogenic nor natural forced signals would be discernible; instead ̃ E would be fairly steady, like in the individual historical integrations ( ̃ i of MPI-ESM1.1 and ̃ I of CMIP5 in Fig. 5a, Sect. 5.1). This is quite unlike what we see in amip-piForcing (Figs. 5c and 6b).
Therefore we suggest that CMIP5 AOGCMs are not realistic in their response to volcanic forcing. In the real world, represented by amip-piForcing, volcanic forcing does not cause a large rapid cooling of T, as it does in CMIP5. Instead, volcanic forcing "sucks" heat from the ocean beneath. The system reacts as though it had a large heat capacity, so that T ≃ 0 ⇒ R ≃ 0 ⇒ N ≃ F < 0 , yielding a negative spike in N. We suggest that, in both the real world and CMIP5, the volcanically forced SST pattern gives a large , but that it lasts for longer in the real world. Following the eruption, the pattern of SST change causes R > 0 for a decade or two, perhaps through some persistent response to the subsurface cooling (discussed in Sect. 7). Consequently the volcanic episodes since 1960 are not distinct in the real world, but form a continuous period.
In support of this suggestion, we note that the normalised patterns of SST variation during 1975SST variation during -2004 in historical-Nat and observations have some similarities (Fig. 9a, b), especially regarding features in the North and low-latitude Pacific. On the other hand, the normalised patterns of the historical and historicalGHG experiments (Fig. 9c, d) resemble each other in these regions. For these "normalised patterns", we exclude areas poleward of 65 • , where observational SST data is sparse and the comparison with model data is complicated by the treatment of sea-ice. We regress local annual-mean SST over the 30 years against its area-mean within 65 • S-65 • N, to obtain a pattern in K K −1 with unit mean. Note that any correlated variation of local SST and global mean will contribute to this pattern, both trends and variability. Finally we subtract unity uniformly, and divide by the spatial standard deviation. The result is a field with zero mean and unit standard deviation.
The observed and historicalNat patterns could be consistent with a low EffCS because the warming in the west Pacific in these patterns leads to large upper tropospheric warming, giving large negative lapse-rate feedback, and increased stability in the low-cloud regions, giving small or negative cloud feedback (Zhou et al. 2016;Ceppi and Gregory 2017;. Further GCM experiments or analyses are needed to establish how the differences in the observed and CMIP5 SST patterns lead to their various values of . Although the pattern of SST change in historicalNat is somewhat similar to observations, it is much less pronounced, as shown by smaller magnitude of SST variation explained by regression in historicalNat (0.025 K) compared with observations (0.100 K). (This number is the spatial standard deviation of the field obtained from multiplying the pattern in K K −1 from the regression, before normalisation, by the temporal standard deviation of T. This field quantifies the local temporal variation of SST due to the globalmean temporal variation.) The comparison suggests that the AOGCMs respond with a realistic pattern to volcanic forcing, but too weakly. Consequently the stronger SST variation due to greenhouse-gas forcing (0.044 K) is able to overwhelm the volcanic pattern during 1975-2004 in the CMIP5 historical experiment, making ̃ E similar to historicalGHG (Fig. 5c). In the real world, on the other hand, the volcanic response is persistent and dominant, and accounts for the low EffCS of the AMIP period.

How accurately can CO 2 EffCS be estimated from historical EffCS?
Many calculations have been published of the effective climate sensitivity (EffCS), i.e. the equilibrium warming of global-mean surface air temperature for doubled CO 2 , as estimated from non-equilibrium states or radiative forcings other than 2 × CO 2 . Some calculations use observed climate change during the historical period, others use GCM simulations of climate change with idealised elevated CO 2 concentration. For convenience, we refer to these two kinds of estimate as "historical" and " CO 2 ". Both historical EffCS and CO 2 EffCS have a wide spread (Knutti et al. 2017). We have quantified several reasons for the differences among these estimates, in order to address the question which supplies the title of this work. First, the estimate of the climate feedback parameter using ordinary least-square regression (OLS) of the globalmean top-of-atmosphere radiative response against the global-mean surface temperature change from a single realisation of historical change (such as the real world) is both uncertain and biased towards low values by the presence of unforced variability. The bias causes EffCS ∝ 1∕ to be overestimated, in the multimodel mean by about 10% for regression of the entire historical period, and 20% for 30-year periods. It is unimportant in scenarios of strong forcing, such as abrupt4xCO2, but cannot be neglected when considering historical variations.
Second, evaluating historical EffCS is hampered by the systematic uncertainty in the forcing F, which in CMIP5 AOGCMs gives a ± 45% uncertainty in historical EffCS. The present phase of the Coupled Model Intercomparison Project contains new experiments which should greatly reduce the spread in all the model forcings, but an accurate estimate of real-world historical EffCS from the global-mean  energy balance depends on reduction of the uncertainty in real-world historical F, assessed as about ± 30 % by the AR5. Third, varies substantially on multidecadal timescales, according both to AOGCM historical experiments, which simulate climate change in response to forcing agents, and to AGCM amip-piForcing experiments, in which observed historical sea surface temperature is prescribed. This means that historical EffCS depends on the period from which it is evaluated. The historical and amip-piForcing experiments indicate that for most of the historical period the EffCS was smaller ( larger) than CO 2 EffCS, by up to a factor of ∼ 2 at some times. This bias is in the opposite direction to and therefore not explained by bias in the OLS slope.
The time-variation of in the historical experiments can mainly be explained by the varying relative importance of greenhouse gas and volcanic aerosol forcing, provided that the EffCS for volcanic aerosol forcing is smaller than for CO 2 forcing (i.e. its efficacy is less than unity), so that historical EffCS falls below CO 2 EffCS during volcanically affected periods. As a result, the EffCS from regression of the historical multimodel mean for the entire historical period is about 5% lower than CO 2 EffCS.
The time-variation of in the amip-piForcing experiments is due to the evolving patterns of SST, and synchronised in all the AGCMs because of their common boundary conditions. The EffCS from regression of the amip-piForcing multimodel mean for the entire historical period is about 30% less than CO 2 EffCS, a much greater bias than in the historical multimodel mean.
AOGCM historical and AGCM amip-piForcing experiments agree that the EffCS was relatively high in the period around 1940, when there were no large volcanic eruptions, and both greenhouse-gas and anthropogenic aerosol forcings were increasing in magnitude. The EffCS for this period in amip-piForcing has a range of 2.1-4.6 K, and is highly correlated with AOGCM CO 2 EffCS across models. The agreement increases confidence in this range as an estimate of CO 2 EffCS.
Since 1960, there have been three large volcanic eruptions. During this period, EffCS falls to its lowest values in amip-piForcing, of around 1.6 K, in agreement with our observational estimate for the 27 years around 1998, and consistent with low EffCS for volcanic forcing. On the other hand, EffCS increases since 1960 in the historical experiment, converges with the historicalGHG EffCS, and is correlated across AOGCMs with the CO 2 EffCS. We further discuss the disagreement between historical and amip-piForcing in Sect. 7.2.
Nearly 30 years have now passed since the eruption of Pinatubo, similar to the interval between the eruption of Katmai and 1940, so we might expect that the EffCS has returned to its CO 2 value, although another decade of observations may be required to demonstrate it clearly. Because greenhouse-gas forcing is increasing more rapidly now than in the early 20th century, the OLS bias in will be less important. We therefore consider that the EffCS of the first 30 years of the present century may give the most accurate energy-balance historical estimate of CO 2 EffCS, especially if the uncertainty in F can be reduced, unless another explosive volcanic eruption occurs.

SST and EffCS since 1975
We have carried out AGCM experiments to show that the observed pattern of SST change during 1975-2004 (the final 30 years of the CMIP5 historical experiments) induces heat loss from the climate system, producing the historically low EffCS that is simulated in amip-piForcing, and suppressing the greenhouse warming. In some respects this pattern (Fig. 9a, b) resembles the Interdecadal Pacific Oscillation, which has been associated with the reduced rate or hiatus of global warming during the early twenty-first century, through the influence of accelerated Pacific trade winds on ocean heat uptake Meehl et al. 2016;Oka and Watanabe 2017;Xie and Kosaka 2017).
The observed pattern of SST change during 1975-2004 has some similarities to the pattern that results during the same period from volcanic forcing in the AOGCM his-toricalNat experiment, including for instance the contrast between strong warming in the western Pacific and cooling or weak warming in the east, consistent with feedbacks giving a low EffCS (Zhou et al. 2016;Ceppi and Gregory 2017;. However, the amplitude is much weaker in historicalNat than in observations. Therefore in the historical experiment the volcanic pattern is overwhelmed by the greenhouse-gas pattern as the latter forcing increases, whereas in the real world the similar but stronger pattern has continued to dominate. This explains why for recent decades is larger (EffCS smaller) when estimated from observations or AGCM amip-piForcing experiments than from AOGCM historical experiments.
There are several possible causes of the observed SST pattern, apart from volcanic forcing. It could be forced by anthropogenic aerosol (Smith et al. 2016), which is not distinguished in our analysis of the time-dependence of the EffCS. It could be due to an internal mode of Pacific interannual variability that is stimulated by the response to or recovery from volcanic forcing (Emile-Geay et al. 2008;Maher et al. 2015;Khodri et al. 2017;Hua et al. 2018;Eddebbar et al. 2019), or it could be due entirely to unforced variability.
Whatever the cause, it is striking that in amip-piForcing, associated with this pattern, reaches such a large value, given that it is derived from the single realisation of observed climate history. This contrasts with the AOGCMs, in which we found evaluated from a single integration to be biased low by the presence of unforced variability (Appendix C), and comparably large values are attained only in the multimodel mean. We speculate that there are coupled atmosphere-ocean feedbacks which reinforce this SST pattern in the real world but are lacking in models (McGregor et al. , 2018Raedel et al. 2016;Yuan et al. 2018;Liu et al. 2018).
The divergence of historical and amip-piForcing indicates either that the AOGCM forced response is unrealistic, or that unforced variability has recently taken the EffCS outside the range it shows in piControl experiments. Either explanation implies a deficiency in AOGCMs, and calls for further investigation.

Prospects for estimating the climate response to CO 2
There are powerful reasons for wanting to evaluate the CO 2 EffCS from existing historical data, rather than waiting until we have accumulated enough further years of greenhouse-gas-forced climate change to enable an accurate energy-budget estimate. For the period since the 1980s, an estimate of EffCS can already be made from the observed energy budget (subject to systematic uncertainty in F), but this may be an underestimate of the CO 2 EffCS, due to pattern effects (Sects. 7.1 and 7.2). To avoid this problem, GCMs have been used to obtain relationships between historical and CO 2 -forced EffCS that may be used to correct observationally derived estimates of the EffCS (Armour 2017;. However, such methods suffer from systematic uncertainty owing to their dependence on the SST patterns being correctly represented by GCMs.
In order to make better use of the observed data and to refine or constrain AOGCM projections of the future, we need to study the interactions of the forcings, climate feedbacks and ocean heat uptake with the spatiotemporal patterns of SST change. Although such an analysis is more difficult than appealing to the historical global energy balance, it is necessary because the assumption that a single constant global climate feedback parameter can describe the responses to all forcings on all timescales is clearly inadequate.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A The step model
The step model (Good et al. 2011Hansen et al. 2011;) is based on the assumption that the climate responses X i (t) in the quantities of interest (T and N) to separate forcings F i (t) combine linearly to give . By assuming further that the response to any step-change in forcing depends only on the size of the step and not the nature of the forcing agent, we can estimate the response to historical time-dependent net forcing F(t) due to all agents by treating it as the sum of a set of discrete steps in forcing, such that F(t) = ∑ t j=1 [F(j) − F(j − 1)] where j are successive instants of time (we use a timestep of one year) and F(0) = 0 . The response of an AOGCM at time t to the forcing increment which occurred at time j < t is estimated as X 4× (t − j + 1)[F(j) − F(j − 1)]∕F 4× , where X 4× is the AOGCM's time-dependent response to the step-change forcing F 4× in the abrupt4xCO2 experiment, since time t is timestep t − j + 1 since the forcing increment [F(j) − F(j − 1)] occurred. Hence, adding up the response to all previous increments, Note that the step-model makes no assumption about the value or time-variation of , except that it is the same for all magnitudes and kinds of forcing.

B Choice of independent variable for regression
Ordinary least-squares (OLS) linear regression assumes that all variations in the independent variable x cause proportionate variations in the dependent variable y. If there is "noise" in y, meaning fluctuations that are linearly uncorrelated with the "signal", which is a function of x, the OLS estimate of the slope dy∕dx is imprecise, with a standard error that increases with the amplitude of the noise (Appendix D.2), but it is unbiased, meaning that expectation value of the estimate equals the true value. On the other hand, if our data for x contain some noise which does not cause variations in y i.e. the "true" independent x on which y depends is not precisely known (possible sources of such noise are considered in Sect. 4), the OLS estimate of the slope is biased. It is expected to be smaller than the true value, and the bias grows with the amplitude of the noise (Appendix D.3). Therefore if one of the variables contains noise which is not correlated with the other variable, the former should be chosen as dependent and the latter as independent, in order to obtain an unbiased estimate of the slope. This is the natural choice for a situation where the independent variable is chosen precisely by the experimenter, and the dependent variable is measured with some uncertainty. In our application, N and T are physically both dependent on the prescribed F, so it is not obvious which of R = F − N or T we should select as the independent variable.
Because random error is small in the MPI-ESM1.1 historical ensemble mean, we expect the bias in the estimated slope to be small, regardless of whether T or R is chosen as the independent variable. The correlation between T and R is less than unity, so the slopes for the two choices are not quite equal (Appendix D.4), but they are close, namely 1.36 ± 0.04 W m −2 K −1 for regression of ensemble-mean R against ensemble-mean T, denoted by e (Table 1, solid line in Fig. 4), and 1.54 ± 0.05 W m −2 K −1 for T against R (dashed line), where the standard error is inferred from the residual of the fit. Therefore the historical slope for the ensemble mean is e =1.4-1.5 W m −2 K −1 , assuming the underlying physical relationship is truly linear.
The mean of the ensemble of slopes obtained by regression of R against T in the individual integrations is i = 1.38 ± 0.01 W m −2 K −1 (mean and standard error), not shown in Fig. 4 because it is statistically indistinguishable from e . However, the mean of the slopes from individual members when we regress T against R is quite different (dotted line in Fig. 4, slope of 2.08 ± 0.01 W m −2 K −1 ), and looks like a poor fit to the ensemble-mean data. This bias is the expected outcome of OLS regression of y against x when x contains noise which is uncorrelated with y (Appendix D.3). If there is uncorrelated noise in R, linear regression of T against R gives an estimate of dT∕dR which is biased low, and hence its reciprocal = dR∕dT is biased high.
To minimise the bias, we prefer to choose T as the independent variable for OLS regression (Appendix D.4), assuming that the noise in R is not correlated with T.
Certainly, there appears to be more noise in R than in T (Fig. 3), consistent with physical understanding that T is related to the time-integral of N (although a similar bias in the slope could be caused by correlated noise in T and R, Appendix D.6). The results from the MPI-ESM1.1 are consistent with assuming that T contains no noise, but this may not hold for other AOGCMs.

C Error in estimating climate feedback from a single ensemble member
Using the HadGEM2 historical F (Sec. 3.1), we carry out the calculations of Appendix B for the HadGEM2-ES historical ensemble, which comprises only five members, a typical size for CMIP5 submissions. We obtain i = 0.94 ± 0.10 W m −2 K −1 and e = 1.22 ± 0.14 W m −2 K −1 , thus e > i , unlike MPI-ESM1.1, in which we found above that e ≃ i . The correlation coefficient between ensemble-mean R and T is 0.59, weaker than for MPI-ESM1.1 due to the smaller ensemble size and consequently greater noise in the ensemble mean.
For the same calculations with the historical experiments of other CMIP5 AOGCMs we use our AR5 ′ estimate for F(t) (Sect. 3.3), because F has not been diagnosed in these models. Since F is model-dependent, it may differ from the AR5 ′ estimate, so from the regression could be inaccurate; that would be a systematic error that affects all the ensemble members of each model equally, rather than a statistical uncertainty affecting them randomly. Within each model ensemble, noise produces a spread of . The geometrical multimodel mean of the ensemble standard deviation of is 0.11 W m −2 K −1 , ∼ 10% of the multimodel-mean e .
Across AOGCMs, the correlation cofficient of i and e is very high (0.96, Fig. 10a) but e > i (Fig. 10b), as for HadGEM2-ES, except in the MPI and CanESM2 AOGCMs, in which e ≃ i . This is consistent with the bias of OLS regression whereby the slope is underestimated when there is noise in T that is not correlated with R (Appendix B); because the noise is larger in individual integrations than in the ensemble mean, i is underestimated more severely than e . Furthermore, the bias tends to be greater for larger e (Fig. 10b, correlation 0.61), consistent with the same explanation (Appendix D.3). The multimodel-mean underestimate of i with respect to e is 10%.
As mentioned in Sect. 1, estimates of using observed N can be made only from the more recent ∼ 30 years, since interannual variation of N is not well enough known at earlier times. To evaluate the effect of the OLS bias on estimated from a 30-year period, denoted by ̃ (Table 1), with each AOGCM we regress R against T for 30-year periods starting in every year (i.e. they overlap) in every integration, obtaining a timeseries ̃ (t) for each integration (following . From these we calculate the ensemble-mean timeseries, denoted by ̃ i (t) , and its historical time-mean. The time-mean is the expectation value of ̃ for a randomly chosen 30-year period of a single integration. The geometrical multimodel mean of the ensemble standard deviation of ̃ , pooled over years in each model, is 0.42 W m −2 K −1 , 30% of the multimodelmean time-mean ̃ e . Similarly, from the ensemble-mean R and T of each model we compute the ̃ e (t) for 30-year periods and its historical time-mean.
Across models, the correlation coefficient of the timemeans of ̃ i and ̃ e is high (0.88), but time-mean ̃ e is greater in all cases (Fig. 10c), consistent with a greater bias of OLS regression for a randomly chosen 30-year period of a single integration than of the ensemble mean, just as for i and e , but the effect is more pronounced because the noise is more important for a shorter period. The multimodel-mean underestimate of ̃ i with respect to ̃ e is 20%. Since the CMIP5 ensembles are fairly small, it is likely that ̃ e is also biased, and the underestimate of the true value therefore greater.

D Statistical issues in regression
In this appendix, we consider various statistical issues related to the estimation of as the slope of the regression of R against T. These issues apply more generally than to those specific variables. The general problem is to estimate the slope m in the linear relationship y(t) = mx(t) , where x and y are timeseries of length n with values at times t = 1 , 2 , … , n , given the data x i and ŷ i , which may differ from x and y because of random noise. (To simplify the formulae we have chosen the origin so that the means of x and y are zero.) In the model world, we may have an ensemble of integrations i = 1, … N , with the same x and y in all but different noise in each. For ensemble member i, we obtain an estimate m i = cov(x i ,ŷ i )∕var(x i ) of m = dy∕dx by ordinary least-squares linear regression (OLS) of ŷ i (t) against x i (t) . The OLS estimate minimises the root-mean-square (RMS) of the residuals of the y i (t) from the fitted line in the y-direction. By doing so it maximises the likelihood that the residuals are consistent with independent identically distributed random noise i (t) in y.

D.1 The difference method is a special case of regression
In the special case of n = 2 , whatever the noise may be, a straight line can be drawn exactly through the 1.8 Relationships in CMIP5 AOGCM historical experiments between evaluated from the ensemble-mean R(t) and T(t), and the ensemble-mean of evaluated from R(t) and T(t) in individual integrations, a, b between i and e , c between time-mean ̃ i and timemean ̃ e (see Table 1 for notation). Only those AOGCMs which have more than one ensemble member are included (see Table 2). We use our AR5 ′ estimate for historical F(t) for all AOGCMs except HadGEM2-ES and MPI-ESM1.1 (models J and P), for which we use F(t) diagnosed in these models individually (compared in Fig. 1). The dotted line in b is zero on the vertical axis; all models lie very near or above this line, indicating that e − i ≥ 0 . The dotted line in a, c is 1:1; all models lie very near or to the right of this line in a, indicating that e ≥ i (consistent with b), and in c, indicating that time-meañ e ≥ time-meañ i two points x = x 0 ± 1 2 Δx and ŷ = y 0 ± 1 2 Δy , leaving zero residual. Denoting a mean by M(⋅) , we obtain M(x) = x 0 , M(ŷ) = y 0 , var(x) = M(x 2 ) − (M(x)) 2 = ( 1 2 Δx) 2 , cov(x,ŷ) = M(xŷ) − M(x)M(ŷ) = 1 4 Δx Δy . Hence for this case the OLS formula gives m = var(x)∕cov(x,ŷ) = Δy∕Δx , the slope of the line passing through the points. Therefore m estimated as the slope between the endpoints in x is a special case of OLS, using a minimal amount of data, and the results derived in this appendix, that m is uncertain and may be biased on account of noise in x and y, apply to the difference method Eq. 2 just as they do to regression Eq. 3.

D.2 No bias in m due to uncorrelated noise in y
The rationale for the use of OLS is that the the independent variable x i is perfectly known but the dependent variable ŷ i is noisy, With these assumptions, var(x) = var(x) , and since M(x) = 0 . Therefore the OLS slope is an imprecise estimate of m. However, the expectation value E(m) = m , because E(M(x )) = 0 if there is no correlation between x and ; we call the noise "uncorrelated" to indicate that is not correlated with x or y. Thus, the OLS estimate of the slope is not biased by the presence of uncorrelated noise in y.
To illustrate this, we choose a set of n = 10 random numbers x(t) in the interval 0-1, and take m = 1 ⇒ y = x (x, y and y = x are shown in black in Fig. 11a). We generate N = 10 5 instances of ŷ i (t) from y(t) by adding independent normally distributed i (t) with standard deviation of 0.075. The correlation coefficients of x with ŷ i have a positively skewed distribution (red in Fig. 11b). We regress each ŷ i (t) against x(t) to obtain m i (an example ŷ i and its regression line are shown in red in Fig. 11a). The distribution of m is normal, its mean is m = 1 and its standard deviation 0.079 (red in Fig. 11c). If we increase the amplitude of noise to 0.100 and 0.125, m remains unbiased but becomes less precise (standard deviation of 0.105 for green and 0.131 for blue in Fig. 11c), and the correlation is degraded gradually (Fig. 11b).
Although x was chosen randomly, there is no uncorrelated noise in x in this example, because x i = x i . For example, we might have where x(t) is the response to external forcing and the same in all ensemble members, while i (t) is unforced variability that is different in each member. Although might be called "noise in x", it is perfectly correlated with noise m in y. If all variations x ′ in x , however they are caused, produce corresponding variations mx ′ in ŷ , m will be an unbiased estimate of m. If x and y are T and R, this is the case which Proistosescu et al. (2018) call "ocean-forced".

D.3 Bias in m due to uncorrelated noise in x
If y is not noisy but x contains uncorrelated noise i (t) in ensemble member i, we have which differs from Eq. (5) because the variations in x do not produce proportionate variations m in ŷ . In this situation and Similiar to Sect. D.2, E(M(x )) = 0 for uncorrelated noise, giving i.e. the estimate of the slope is not only imprecise, but also biased low if there is uncorrelated noise in x. (We have written this as an approximation because the expectation value of a ratio does not exactly equal the ratio of expectation values.) The slope is underestimated, through the appearance of var( ) in the denominator, because OLS assumes that all variations in x cause variations in ŷ . The larger the ratio of noise to signal var( )∕var(x) , the greater the bias. This bias has been called "regression dilution" (Frost and Thompson 2000). We illustrate this case with the same x(t) and y(t) as the previous case, but this time we take ŷ(t) = y(t) and generate N instances of x i (t) from x(t) by adding independent normally distributed i (t) . The distribution of m i from regressing y(t) against x i (t) is negatively skewed and biased low (median 0.95, 5-95% range 0.85-1.09, red in Fig. 11d). For larger noise, the spread and the bias both increase (median 0.92 for green and 0.88 for blue in Fig. 11d). The distribution of correlation coefficients in the three cases are the same as for noise in y, because the formula is symmetrical in x and y.
In our application we are estimating m = from R = y and T = x . The expected magnitude of the bias in ̂ is therefore If var(T) and var( ) are independent of , this formula predicts that the expected bias in ̂ will increase in proportion to . In our set of model simulations of the past, var(T) is not independent of , because we expect that a model with Illustration of the effect of random noise on ordinary least squares regression. We take the x(t) shown in black in a, with a slope of unity so that y = x , generate many sets of x i (t) and ŷ i (t) by adding noise either to y or x, and calculate the distribution of estimated slopes. a Red shows an example with noise in y of standard deviation 0.075 and its regression line, grey envelope is the 5-95% range of regression lines; b distribution of correlation coefficients between x i (t) and ŷ i (t) with noise in either x or y; c, d distribution of slopes of regression lines when there is noise in y or x respectively; b-d each show results for noise with three different standard deviations, as indicated by the key in d a larger (smaller EffCS) will produce a smaller historical T increase. This makes var(T) smaller, 1∕(var(T) + var( )) larger, and strengthens the dependence of the expected negative bias E(̂) − upon .

D.4 Correct choice of independent variable
If y is independent and perfectly known while x is dependent and noisy, we should instead minimise the RMS deviations of the x from the fitted line in the x-direction, obtaining from ensemble member i an estimate m † i = cov(x i ,ŷ i )∕var(ŷ i ) of the slope dx∕dy . The product m † im i = (cov(x i ,ŷ i )) 2 ∕(var(x i )var(ŷ i )) = r 2 i , where r i is the (product-moment) correlation coefficient between x i and ŷ i . Thus the lines fitted in the two ways have equal slopes m i = 1∕m † i if and only if x i and ŷ i are perfectly correlated or anticorrelated ( r i = ± 1).
In the usual situation of imperfect correlation, the choice of independent variable therefore makes a difference to the OLS estimate of the slope. This is because of the bias caused by noise in the independent variable (Sect. D.3). If one of the variables is noisy and the other is not, we must treat the noisy variable as the dependent one to get an unbiased estimate of the slope.

D.5 Uncorrelated noise in both x and y
If there is independent noise in both x and y, we cannot get an unbiased estimate of m using OLS. This case can be be treated with "orthogonal" or "total least-squares" regression, in which the RMS deviation of the points from the line is minimised in a direction orthogonal to the line, but that requires a prior estimate of the relative size of and , which we do not have. Other methods, called "error in variables", have been developed for this case (e.g. Cahill et al. 2015).

D.6 Correlated noise in x and y
Another situation to consider is that of correlated noise in x and y. Suppose that where is a constant and i is noise that is different in each ensemble member. Because i affects both x i and ŷ i , the noise x i (t) − x i (t) = i (t) in x and the noise ŷ i (t) − y i (t) = i (t) + i (t) in y have a non-zero correlation coefficient var( )∕ √ 2 var( ) + var( ) . Now by following the method of Appendix D.3 we obtain

E(m) = E
cov(x,ŷ) var(x) ≃ m var(x) + var( ) var(x) + var( ) = m 1 + ( ∕m)(var( )∕var(x)) 1 + (var( )∕var(x)) , assuming x and are uncorrelated. This case is more general than, and encompasses, all of those previously considered. If var( ) ≪ var(x) , the noise in x is negligible, and we recover E(m) = m . If = m , y i (t) = m(x i (t) + i (t)) , as in Eq. (5), in which case we have shown that E(m) = m still (Appendix D.2). If = 0 , the noise in x and y is decorrelated, and E(m) = m∕(1 + var( )∕var(x)) < m (as in Appendix D.3). The general formula with ≠ 0 applies to two relevant physical situations in which T is x, R is y and m is the climate feedback parameter for forced climate change on multidecadal timescales.
Firstly, suppose there is unforced variability that arises spontaneously in N and causes correlated variability T ′ in T. This is the case which Proistosescu et al. (2018) call "radiatively forced", and we describe it qualitatively in Sect. 4. We can illustrate the effect with a simple model. Suppose that that the spontaneous random variability Φ(t) in N(t) has a stepwise behaviour, such that Φ(t) = Φ j for j ≤ t < j+1 , with a step-change in N of Φ j − Φ j−1 at t = j . According to the step model (Appendix A), the response of T ′ to Φ is for j ≤ t < j+1 , where Θ(t) is the response of T per unit step-change in forcing at t = 0 . This T ′ response will add a further perturbation T ′ to N, assuming the same climate feedback parameter applies to both forced and unforced variations. If T F (t) is the response of T to external forcing F(t), we have T = T F + T � , N = F − T F + Φ j − T � and R = F − N = T F − Φ j + T � . We can rewrite this as with This has the for m of Eq. (8) for cor related n o i s e , w i t h x = T F + H , = Θ(t − j )Φ j , y = R , = ( Θ(t − j ) − 1)∕Θ(t − j ) = − 1∕Θ(t − j ) and m = , where H is the response of T to Φ earlier than j .
Physically, the correlation arises because the noise in T is the response to Φ j , while the noise in R is the sum of Φ j itself and the response in N to Φ j . Since the responses to Φ j in both N and T are proportional to Φ j , the noise in R and T is correlated. From = m − 1∕Θ(t − j ) we obtain − m = −1∕Θ(t − j ) < 0 because for climate stability we must have Θ(t) > 0 . Hence < m ⇒ E(m) < m . The climate feedback parameter will inevitably be underestimated if the correlation is due to spontaneous fluctuations in N. The effect is therefore similar to regression dilution (Appendix D.3) but it is not formally the same.
The correlation is present because both Φ and T have non-zero timescales of change. A zero timescale of response in T means it changes instantly when the energy balance is perturbed, keeping the system always in equilibrium with T = F + Φ . This requires Θ(t) = 1∕ for all t > 0 , and hence = 0 , so the correlation vanishes. With stepwise variation, Φ has persistence with a non-zero timescale. This can be removed by replacing its step-changes at times j with -function spikes. In that case Φ = 0 between these times, and Φ j does not appear in R = (T F + T � ) . This is the situation of perfectly correlated noise described by Eq. (5), with = T � , effectively the same as no noise, because signal and noise cannot be distinguished.
Secondly, could represent unforced variability that arises spontaneously in T on interannual timescales, causing an immediate radiative response in R that may have a climate feedback parameter ≠ m . The estimate of m obtained by regression of R against T will be biased in the direction of by unforced variability. The larger var( )∕var(x) , the greater the bias. The ratio will be large if unforced variability is large, or if the record is short and hence shows little forced change. Unlike the previous cases, the bias in m could be in either direction; when ≶ m , E(m) ≶ m.