1 Introduction

The Earth's climate sensitivity is a key measure of the longer-term climate response to external forcing. It is perhaps the most important ill-quantified climate system parameter. In principle, climate sensitivity represents the equilibrium change in mean surface temperature to a doubling of atmospheric CO2 concentration from preindustrial levels, once the deep ocean has reached a stable state. In practice it is normally estimated using some approximate measure, often derived from disequilibrium changes. Climate sensitivity has been estimated from various types of evidence, but none of these has narrowly constrained its value. The first five Assessment Reports by the Intergovernmental Panel on Climate Change (IPCC) relied heavily on estimates of climate sensitivity from global climate model (GCM) simulations. The 1.5–4.5 K likely range for climate sensitivity in the 2013 IPCC Fifth Assessment Report (AR5) was identical to the range presented in the landmark Charney (1979) report, with the great increase in GCM sophistication since 1979 not having led to any narrowing of the climate sensitivity range.

GCMs use semi-empirical approximations (parameterizations) to represent subgrid-scale cloud and convection processes that are known to be critical to determining the model's climate sensitivity, which varies by up to a factor of three among GCMs. In one well-regarded GCM, a simple change to how convective precipitation was parameterized varied its climate sensitivity by a factor of two, with no obvious change in how well the model otherwise performed (Zhao et al. 2016). Changing the order in which the various parameterized atmospheric modules were updated in each time step was found to vary another GCM's climate sensitivity by a factor of up to two, with ambiguity existing regarding the optimum ordering (Donahue and Caldwell 2018). Moreover, the universal use in GCMs of deterministic parameterizations may bias their climate sensitivity upwards (Strommen et al. 2019). Such issues make the reliability of GCM-derived estimates of climate sensitivity questionable.

In the light of such issues, and the further widening of the range of GCM climate sensitivities in the latest (CMIP6) generation of GCMs (Zelinka et al. 2020), the IPCC Sixth Assessment Report (AR6) abandoned the previous reliance on GCM climate sensitivities. Instead, evaluation of climate sensitivity was approached by combining estimates based on different lines of evidence, such as process understanding (feedback analysis), the historical instrumental record, and paleoclimate data.

Combining different lines of evidence should, to the extent that they are independent, enable climate sensitivity to be estimated more precisely than from any single line of evidence (Stevens et al. 2016). A comprehensive attempt to do so was made by Sherwood et al. (2020, henceforth S20), a 92-page study. S20 was conducted under the auspices of the World Climate Research Programme's Grand Science Challenge on Clouds, Circulation and Climate Sensitivity and provides a very detailed investigation of climate sensitivity. As the most influential recent assessment, S20 was cited over twenty times in the relevant AR6 chapter, which approached climate sensitivity estimation on very similar lines to S20, albeit not using its formal probabilistic methods. There are in principle considerable strengths in S20's scientific approach. Its main results were derived by combining understanding from feedback analysis (Process evidence) with evidence from changes since circa 1850 (Historical evidence), and from cold and warm past periods (Paleoclimate evidence)—three lines of evidence that S20 judged to be largely independent.

The contribution the present study makes to estimation of climate sensitivity is three-fold. First, it identifies statistical problems in S20. The main methodological argument is that, when Bayesian methods are used, an Objective rather than a Subjective Bayesian approach should be taken. This means that rather than the investigator choosing the prior distribution, the prior distribution should be mathematically computed, based on the assumed statistical model relating to all the evidence to be analyzed (Bernardo 2009). S20 used a Subjective Bayesian statistical method, with an investigator-selected prior distribution, that has been shown to be capable of producing unrealistic climate sensitivity estimates when used to combine differing types of evidence (Lewis 2018), and S20 provided no evidence that it did not do so in this case. Moreover, for all except Process evidence, S20 used a method of estimating likelihoods that turns out to be unsound. This study validates its own likelihood estimates by using multiple methods and cross-checking their results. S20's method is shown to often result in serious likelihood underestimation at higher climate sensitivity levels.

The second contribution of this study is that it develops and applies an Objective Bayesian approach to combining differing climate sensitivity evidence, using a mathematically computed prior distribution. The results using the methodology developed and the same input assumptions as S20 are then used to assess what effect the statistical problems identified in S20 have on its results. It is found that they bias S20's estimation of climate sensitivity downwards, although only to a minor extent even at the upper uncertainty bound when all three lines of evidence are combined.

This study's third contribution is to review and where appropriate revise the input assumptions used by S20, paying particular regard to more recent evidence, and to investigate the effect of the revised input assumptions on estimates of climate sensitivity using the developed Objective Bayesian methodology. Some of the revisions to input assumptions relate to the treatment in certain cases of CO2 forcing and/or the warming it causes. This study differs from S20 regarding the appropriate scaling of CO2 forcing, and comparison of warming, where different changes in CO2 atmospheric concentrations are involved, and regarding scaling CO2 forcing where its use requires a different estimation basis from that on which the forcing estimate was derived. The combined effects of the revisions to S20's CO2 related estimates and to other input assumptions result in a major reduction in estimated climate sensitivity.

The paper is structured as follows. Climate sensitivity measures are discussed in Sect. 2. Section 3 deals with statistical methods, while Sect. 4 reviews S20's input assumptions and proposes certain revisions to them. Section 5 sets out results based on S20's original input assumptions but using the corrected likelihood estimates, using alternatively S20's chosen Subjective Bayesian prior distribution or the Objective Bayesian (mathematically computed) prior distributions. Section 6 presents results using the revised input assumptions and Objective Bayesian prior distributions. Section 7 discusses the statistical issues and both their effects and the effects of using different input assumptions.

2 Climate sensitivity measures

The traditional measure of climate sensitivity is the equilibrium change in global mean surface temperature (GMST) following a doubling of the atmospheric CO2 concentration (equilibrium climate sensitivity, henceforth ECS). While the equilibrium involved allows for the deep ocean to reach a steady state, it excludes changes in slow components (e.g., ice sheets). Such an equilibrium is achievable in a GCM but not in the real climate system. The corresponding equilibrium change over timescales that allow for feedbacks from changes in the slow components to occur is called Earth system sensitivity (ESS).

Under the standard linear forcing-feedback framework, the excess of a change ΔF in effective radiative forcing (ERF) over the change in top-of-atmosphere (TOA) planetary radiative imbalance, ΔN, is equal and opposite to the climate system's radiative response ΔR, measuring all radiation downwards. ERF is a measure of the increase in TOA radiative imbalance resulting from a change in atmospheric composition, such as an increase in CO2 concentration, with surface temperature held constant but the atmosphere allowed to adjust to the change. ΔR is taken to be the product of the change ΔT in global mean near-surface air temperature (GMAT), or in GMST, and a fixed climate feedback parameter \(\lambda^{{{\text{fixed}}}}\). Accordingly:

$$\left( \Delta F - \Delta N \right) = - \Delta R = - \lambda^{\text{fixed}} \Delta T \tag{1}$$

Under this framework, it follows that for \(\Delta N = 0\), representing equilibrium:

$$\Delta T = - \Delta F/\lambda^{\text{fixed}} \tag{2}$$

and hence for the ERF from a doubling of CO2 concentration, F2×CO2:

$${\text{ECS}} = - F_{2 \times \text{CO2}} /\lambda^{\text{fixed}} \tag{3}$$

However, in GCMs (and the real climate system) the climate feedback parameter may not in fact be fixed, in which case a linear projection to \(\Delta N = 0\) will not provide an accurate estimate of ECS.

Rather than seeking to estimate ECS, S20 instead estimate an effective climate sensitivity S, that corresponds to the effective sensitivity in GCMs derived from feedbacks occurring during the first 150 simulation years after an abrupt quadrupling of CO2 concentration from its preindustrial level (abrupt4xCO2), treating climate feedbacks as being fixed.

For GCMs, S is normally derived by linearly regressing, usually using annual average values, changes (ΔN) in TOA radiative imbalance on changes (ΔT) in GMAT over abrupt4xCO2 simulations, with changes being measured relative to values during an unforced preindustrial control simulation by the GCM. The ΔT and ΔN values are rescaled to reflect the ratio of F2×CO2 to the ERF from a quadrupling of CO2, F4×CO2. The slope of the regression line, λ, is a measure of the effective climate feedback parameter operating over the regression period. The regression line is continued forwards to \(\Delta N = 0\), indicating radiative equilibrium, with S being defined as the ΔT value at that point, and backwards to \(\Delta T = 0\), the rescaled ΔN value at that point providing an estimate, \(F_{{2 \times {\text{CO2}}}}^{{{\text{regress}}}}\), of F2×CO2. Hence:

$$S = - F_{2 \times \text{CO2}}^{\text{regress}} /\lambda \tag{4}$$

If the climate feedback parameter is not fixed, in general λ differs from \(- F_{2 \times {\text{CO2}}} /{\text{ECS}}\), S differs from ECS, and \(F_{{2 \times {\text{CO2}}}}^{{{\text{regress}}}}\) differs from F2×CO2.
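The regression procedure described above can be sketched as follows. This is a minimal illustration on synthetic data with a deliberately fixed feedback parameter; the function name `effective_sensitivity`, the forcing and feedback values, and the noise level are all assumptions made for the example, not values from any GCM.

```python
import numpy as np

def effective_sensitivity(delta_T, delta_N, forcing_ratio=2.0):
    """Regress dN on dT over an abrupt4xCO2 run; return (S, lambda, F2x_regress).

    delta_T, delta_N: annual-mean anomalies relative to the control run.
    forcing_ratio: assumed ratio F4xCO2 / F2xCO2 used to rescale both series.
    """
    dT = np.asarray(delta_T, dtype=float) / forcing_ratio
    dN = np.asarray(delta_N, dtype=float) / forcing_ratio
    lam, f2x_regress = np.polyfit(dT, dN, 1)   # slope, then intercept at dT = 0
    S = -f2x_regress / lam                     # dT at which the line reaches dN = 0
    return S, lam, f2x_regress

# Synthetic 150-year series (hypothetical numbers, fixed feedback)
rng = np.random.default_rng(0)
true_lambda, true_F4x = -1.2, 8.0              # W m-2 K-1 and W m-2
years = np.arange(1, 151)
dT = (true_F4x / -true_lambda) * (1.0 - np.exp(-years / 30.0))
dN = true_F4x + true_lambda * dT + rng.normal(0.0, 0.05, years.size)

S, lam, f2x = effective_sensitivity(dT, dN)
print(f"S = {S:.2f} K, lambda = {lam:.2f} W m-2 K-1, F2x_regress = {f2x:.2f} W m-2")
```

Because the synthetic feedback is fixed, the regression recovers the imposed slope and intercept; with a feedback parameter that changes over the run, the fitted λ, S and regressed forcing would differ from their true counterparts, as noted above.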

In the vast majority of GCMs, the local slope of the relationship between ΔN and ΔT weakens over the course of 150-year abrupt4xCO2 simulations, strongly suggesting that the model ECS exceeds S. Since feedbacks activated only on a long timescale affect the climate extremely slowly, S is more relevant than ECS (or ESS) to climate change over the next few centuries. However, deriving S from paleoclimate evidence, which reflects equilibrium changes, requires an estimate of the ECS to S ratio, with its excess over one being defined as \(\zeta = {\text{ECS}}/S - 1\).

To obtain a valid estimate for climate sensitivity to doubled CO2 concentration from data involving a different change in CO2 concentration, it is necessary to scale the temperature change involved by the ratio of F2×CO2 to the ERF associated with the particular change in CO2 concentration, even assuming that climate sensitivity is unaffected by the effect of the difference in CO2 concentration change on the climate state. S20 define their S in GCMs as the linearly regressed warming over years 1–150 after a quadrupling of CO2 concentration, extrapolated to zero ΔN and then divided by two. This scaling factor, while popular, is difficult to justify when the actual ratio of the ERF change involved (F4×CO2) to F2×CO2 has been estimated with reasonable precision to be 2.10, so that F4×CO2 is 5% greater than twice F2×CO2 (Byrne and Goldblatt 2014; Etminan et al. 2016; Meinshausen et al. 2020).
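The size of the effect of the two scaling choices can be checked with simple arithmetic. In the sketch below the quadrupled-CO2 warming of 6 K is a hypothetical placeholder; only the divisors 2 and 2.10 come from the text.

```python
def scale_to_doubling(delta_T_4x, forcing_ratio):
    """Scale warming from a CO2 quadrupling to its doubled-CO2 equivalent."""
    return delta_T_4x / forcing_ratio

delta_T_4x = 6.0                                     # K, hypothetical extrapolated 4xCO2 warming
s_halved = scale_to_doubling(delta_T_4x, 2.0)        # divide by two
s_erf_scaled = scale_to_doubling(delta_T_4x, 2.10)   # divide by the estimated ERF ratio

relative_excess = s_halved / s_erf_scaled - 1.0
print(f"halved: {s_halved:.3f} K, ERF-scaled: {s_erf_scaled:.3f} K, "
      f"excess: {100 * relative_excess:.1f}%")
```

Whatever the warming value used, dividing by 2 rather than 2.10 raises the resulting sensitivity estimate by the same 5%.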

S20 defend their division of abrupt4xCO2 temperature changes by 2 (rather than 2.10) on the basis that it brings S estimated on their basis closer to estimated ECS in models with very long abrupt2xCO2 simulations, which they estimate as 6% higher than S derived by halving temperature changes in those models' abrupt4xCO2 simulations, implying \(\zeta = 0.06\). However, that argument conflicts with their valid desire for a measure that is as closely related as possible to scenarios of practical relevance. Moreover, the increase in S that S20 introduce by basing it on a biased scaling of F4×CO2 to F2×CO2 results in inconsistent estimation of S between their three lines of evidence, a serious flaw. The biased scaling only affects (via the resulting ζ estimate) their estimation of S from Paleoclimate evidence, since its estimation from both Process and Historical evidence is based on directly estimated F2×CO2, and is independent of scaled F4×CO2.

Here, S20's one-half scaling factor and the resulting 0.06 central ζ estimate for \({\text{ECS}}/S - 1\) are retained when investigating the effect of the Objective Bayesian statistical method on their results. However, in Sect. 4 the ζ estimate is revised to 0.135, the mean ζ estimate in both abrupt2xCO2 and abrupt4xCO2 long-run simulations (16 in all) by eleven GCMs (Rugenstein et al. 2020). No scaling from F4×CO2 to F2×CO2 is required for these calculations, since temperature changes are only compared within abrupt4xCO2 and abrupt2xCO2 simulations, not between them.

3 Methods

Scientific knowledge regarding the properties of any real-world system, or of a simplified conceptual model used to represent it, emanates from observing aspects of the system behavior. The results of such observations or assessments based thereon ('data-variables') are typically numerical and somewhat uncertain, and are regarded as subject to random errors. Conceptual models of the system usually relate the data-variables used as input assumptions to system properties of interest that are regarded as fixed but unknown ('parameters'), assumed here to be represented by continuously-valued variables. A key role of statistical inference is then to draw valid conclusions from data-variables regarding such parameters, as regards their values and associated uncertainty.

It is essential for scientific inference that the statistical methods used are calibrated, in the sense that the uncertainty ranges they generate closely approximate confidence intervals. That is, over the long run (over many applications of the method to different data sets) the true parameter value will lie below a properly derived (x%–y%) confidence interval in about x% of cases, and above it in about (100 − y)% of cases.
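The long-run meaning of calibration can be illustrated by simulation. The sketch below assumes a single normally distributed observation with known standard deviation, for which the 5%–95% confidence interval is known exactly; the true parameter value, error size and number of trials are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta, sigma, n_trials = 3.0, 1.0, 200_000   # hypothetical settings
z95 = 1.6448536269514722                          # 95th percentile of the standard normal

# One noisy observation per trial, and the 5%-95% interval built from it
obs = rng.normal(true_theta, sigma, n_trials)
lower = obs - z95 * sigma
upper = obs + z95 * sigma

frac_below = np.mean(true_theta < lower)   # true value lies below the interval
frac_above = np.mean(true_theta > upper)   # true value lies above the interval
print(f"below: {frac_below:.3f}, above: {frac_above:.3f}")
```

For a calibrated method both fractions should be close to 5%, which is what this exact interval delivers; the same check can be applied to credible intervals from any Bayesian procedure.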

The data likelihood, a joint function of data and parameter values, plays a central role in statistical parameter inference. It represents the joint probability density of the observed data as a function of the parameter value(s). Provided errors in the observed data are independent, their joint probability density is the product of that for each data-variable. Usually, only the ratio of the data likelihood to its highest value matters.

An important property of likelihood functions is that, where two likelihood functions concerning the same system and parameters but derived from independent data exist, the information they jointly contain about the parameter is representable by their product (Birnbaum 1962; Pawitan 2001, Sects. 2.3 and 7.2). This property is used when combining the three different lines of evidence.
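As a minimal sketch of this product rule, consider two independent, normally distributed observations of the same parameter. The observation values and standard deviations below are hypothetical; for normal likelihoods the product peaks at the precision-weighted mean, which gives a simple check.

```python
import numpy as np

def normal_likelihood(theta, obs, sd):
    """Likelihood of theta (up to a constant) given a normal observation."""
    return np.exp(-0.5 * ((obs - theta) / sd) ** 2)

theta = np.linspace(0.0, 10.0, 10001)                   # parameter grid
like_1 = normal_likelihood(theta, obs=3.0, sd=1.0)      # first line of evidence
like_2 = normal_likelihood(theta, obs=4.0, sd=0.5)      # second, independent line

like_joint = like_1 * like_2
like_joint /= like_joint.max()    # only the ratio to the maximum matters

theta_hat = theta[np.argmax(like_joint)]
precision_weighted = (3.0 / 1.0**2 + 4.0 / 0.5**2) / (1.0 / 1.0**2 + 1.0 / 0.5**2)
print(theta_hat, precision_weighted)
```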

3.1 Bayesian parameter inference

There are two main statistical paradigms, Frequentist and Bayesian (Bernardo and Smith 1994). In both, parameter inference revolves around likelihood functions. However, Bayesians treat fixed but uncertain parameters as having distributions representing degrees of belief, in effect as if random variables with a probability distribution, while Frequentists do not, notwithstanding Frequentist confidence distributions (Schweder and Hjort 2002, 2016).

Both Stevens et al. (2016) and S20 employ Bayesian methods for combining climate sensitivity evidence; Frequentist methods appear less suitable for this task. Bayesian methods provide a means of coherently updating personal beliefs about an unknown parameter with external evidence as to its value. However, for continuously-valued parameters they do not in general provide calibrated estimates that properly reflect such evidence. By contrast, Frequentist confidence measures are derived from randomness in the data values and are intrinsically calibrated.

In the continuous case, from Bayes' theorem (Bayes 1763) the posterior probability density function (PDF), \(p_{{\varvec{\theta}}} \left( {{\varvec{\theta}}|{\varvec{y}}} \right)\), for a parameter (vector) θ on which observed data y depend, is proportional to the data likelihood \(p_{{\varvec{y}}} \left( {{\varvec{y}}|{\varvec{\theta}}} \right)\) (the probability density of the data treated as a function of θ, for fixed y) multiplied by the density of a 'prior distribution' (prior) for θ, \(p_{{\varvec{\theta}}} \left( {\varvec{\theta}} \right)\):

$$p_{{\varvec{\theta}}} ({\varvec{\theta}}|{\varvec{y}}) = c \, p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})p_{{\varvec{\theta}}} ({\varvec{\theta}}) \tag{5}$$

(the subscripts indicating the variable each density is for). The constant c is such that \(p_{{\varvec{\theta}}} \left( {{\varvec{\theta}}|{\varvec{y}}} \right)\) integrates to unit probability; it is the reciprocal of \(\int {p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})p_{{\varvec{\theta}}} ({\varvec{\theta}})\,d{\varvec{\theta}}}\). If the parameter being estimated were a random variable with actual probability distribution \(p_{{\varvec{\theta}}} \left( {\varvec{\theta}} \right)\) then (5) would follow from the conditional probability lemma. However, this is not the case here (Fraser and Reid 2011).

The Bayesian equivalent of a confidence interval, a credible interval, reflects probability implied by the posterior PDF. Whether a credible interval is calibrated or not will among other things depend on the choice of prior, which the investigator is free to select. S20 select a prior that is uniform in λ-space, and therefore proportional to \(F_{{{2} \times {\text{CO2}}}} /S^{{2}}\) in S-space. In the common 'Subjective Bayesian' view adopted by S20, the prior is a probability distribution representing the investigator's degrees of belief about parameter values before incorporating information from the current data. There is no requirement that the posterior PDF be calibrated, and the resulting credible intervals may be far from actual confidence intervals (Fraser 2011). However, to avoid Bayesian inference providing misleading results, it is necessary to use a prior that provides correct calibration of posterior probabilities to frequencies and hence confidence intervals (Fraser et al. 2010; Lewis and Grünwald 2018).
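The statement that a prior uniform in λ is proportional to F2×CO2/S² in S-space follows from the change of variables S = −F2×CO2/λ, and can be checked numerically. The sketch below assumes a fixed, hypothetical F2×CO2 of 4.0 W m⁻² and an arbitrary λ range; neither is taken from S20.

```python
import numpy as np

rng = np.random.default_rng(2)
F2x = 4.0                      # W m-2, hypothetical fixed value
lam_lo, lam_hi = -3.0, -0.3    # W m-2 K-1, arbitrary range for the uniform prior

lam = rng.uniform(lam_lo, lam_hi, 2_000_000)   # prior uniform in lambda
S = -F2x / lam                                 # implied samples of S

edges = np.linspace(1.5, 6.0, 19)              # bins of width 0.25 K
counts, _ = np.histogram(S, bins=edges)
density = counts / (S.size * np.diff(edges))   # empirical prior density in S

# Geometric bin centres make F2x / S^2 equal its exact bin average
centres = np.sqrt(edges[:-1] * edges[1:])
analytic = F2x / centres**2 / (lam_hi - lam_lo)

max_rel_err = np.max(np.abs(density / analytic - 1.0))
print(f"largest relative deviation from F2x/S^2 scaling: {max_rel_err:.3f}")
```

The implied prior in S-space therefore declines as 1/S², assigning progressively less weight to high sensitivity values before any data are considered.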

In the alternative 'Objective Bayesian' view, in the absence of existing evidence regarding parameter values the prior should consist of a mathematical weighting function intended to have minimal influence on inference relative to the data, so that it is as noninformative as possible.

Under this approach, any existing probabilistic evidence concerning the parameter being estimated may appropriately be represented by a likelihood function for a notional observation, from which a posterior density has been calculated using a noninformative prior (Hartigan 1965), rather than by a posterior density.

Noninformative priors are mathematical priors that are generally intended to result in probability matching posterior PDFs, which produce credible intervals that are (at least approximately) true confidence intervals, and they are often judged on that basis (Berger and Bernardo 1992, p. 36; Kass and Wasserman 1996). Typically, a noninformative prior primarily reflects how the expected informativeness of the data about the parameter value(s) varies with the parameter value(s). How informative data are expected to be about the parameter value(s) is represented by the (expected) Fisher information, which has a key role in likelihood-based inference.

The gold-standard method for producing noninformative priors is reference analysis (Bernardo 1979; Berger and Bernardo 1992), which results in Bayesian inference that is objective in the sense that it only depends on observed data and model assumptions, as is the case for Frequentist inference (Bernardo 2011). In Objective Bayesianism as well as Subjective Bayesianism, however, subjective choices will still be made by the investigator in relation to the data and model used.

Under Subjective Bayesianism, independent new evidence about a parameter is incorporated by updating an existing posterior PDF (treated as the prior) by multiplying it by the data likelihood function for the new evidence and renormalizing to unit total probability, as S20 does. Such updating would be valid mathematically were the parameter a random variable, but it is not. Bayesian updating satisfies Subjective Bayesian axioms (Bernardo and Smith 1994) and results in coherent personal beliefs, but that does not imply that the resulting inference will exhibit satisfactory probability matching even if the existing posterior PDF was well calibrated.

Bayesian updating is in any event unsupportable for Objective Bayesianism, since the noninformative prior for the original likelihood will in general differ from that for the new likelihood. Bayesian updating would therefore produce inference that varied with the order in which different evidence was incorporated (Kass and Wasserman 1996). It can also result in quite poor frequentist calibration (Lewis 2018).

The order-dependence problem does not arise if Bayes' theorem is applied once only, to the joint likelihood function for the combined evidence, with a single noninformative prior being computed that reflects the nature of the combined evidence (Lewis 2013a, b). This is the method employed in the present study. As shown in Lewis (2018) and Lewis and Grünwald (2018), using such a single-step method when combining climate sensitivity evidence results in more realistic inference than using Bayesian updating, even when a noninformative prior is used to incorporate the first line of evidence, although the magnitude of the improvement will vary.

Where a univariate parameter, such as climate sensitivity, is the only parameter being estimated, a Jeffreys' prior (Jeffreys 1946), which in that case is normally also the reference prior, gives credible intervals that match confidence intervals more closely than any other prior (Welch and Peers 1963; Hartigan 1965), and is therefore the most appropriate prior to use for weighting the combined evidence-providing data likelihood functions. A Jeffreys' prior is proportional (with arbitrary scaling) to the square root of the Fisher information (or, for a multivariate parameter, to the square root of its determinant).

Fisher information for different likelihood functions combines additively, provided the likelihood functions are derived from independent data (Pawitan 2001, Sect. 8.4). Therefore, Jeffreys' prior for inference from the combined likelihood function is obtainable by adding in quadrature the Jeffreys' priors for the separate likelihood functions, after scaling each to equal the square root of the Fisher information. The probability matching of posterior PDFs derived by this method has been tested and found to be accurate in cases involving various probability distributions (Lewis 2013a, b), including in the context of combining evidence regarding climate sensitivity (Lewis 2018).
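The quadrature rule can be sketched for two toy lines of evidence about S. The example assumes one normally distributed estimate of λ (with S = −F2×CO2/λ and F2×CO2 treated as exactly known) and one normally distributed direct estimate of S; all numerical values are hypothetical and the setup is far simpler than the statistical models used in this study.

```python
import numpy as np

F2x = 4.0                      # W m-2, hypothetical and treated as exact
lam_obs, lam_sd = -1.3, 0.4    # toy evidence A: normal estimate of lambda
S_obs, S_sd = 3.5, 1.5         # toy evidence B: normal estimate of S itself

S = np.linspace(0.5, 20.0, 3901)   # grid for S, step 0.005 K
dS = S[1] - S[0]
lam_of_S = -F2x / S

like_A = np.exp(-0.5 * ((lam_obs - lam_of_S) / lam_sd) ** 2)
like_B = np.exp(-0.5 * ((S_obs - S) / S_sd) ** 2)

# Fisher information about S: (d mean / dS)^2 / variance for normal errors
info_A = (F2x / S**2) ** 2 / lam_sd**2
info_B = np.full_like(S, 1.0 / S_sd**2)

prior_A = np.sqrt(info_A)                        # Jeffreys' prior for A alone
prior_B = np.sqrt(info_B)                        # Jeffreys' prior for B alone
prior_joint = np.sqrt(prior_A**2 + prior_B**2)   # added in quadrature

post = like_A * like_B * prior_joint             # Bayes' theorem applied once
post /= post.sum() * dS                          # normalise to unit probability
cdf = np.cumsum(post) * dS
median = np.interp(0.5, cdf, S)
print(f"posterior median S = {median:.2f} K")
```

Note that the priors are combined only after each has been scaled to equal the square root of its Fisher information; adding arbitrarily scaled priors would not reproduce the Jeffreys' prior for the combined likelihood.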

3.2 Statistical models

I use the same statistical models as S20. These derive from simple forcing-feedback physical models, for the various lines of evidence, as follows. Terms on the right-hand sides of (6) to (13) each represent the 'true' value of each variable, the best observational estimate of which is taken to include an additive error (ε) term, which has been omitted for clarity. These variables are termed "data-variables", as generally their estimated values are ultimately derived from observational data. Their estimated error characteristics are inputs to the statistical models.

For Process evidence, total climate feedback λ is taken as the sum of component feedbacks:

$$\lambda = \lambda_{\text{Planck}} + \lambda_{\text{WV+LR}} + \lambda_{\text{sfcAlbedo}} + \lambda_{\text{clouds}} + \lambda_{\text{stratospheric}} + \lambda_{\text{atmosComp}} \tag{6}$$

where λPlanck is feedback from extra emission to space from vertically uniform warming, the anti-correlated water vapor (WV) and lapse rate (LR) feedbacks are combined into λWV+LR, λsfcAlbedo is surface albedo feedback, λclouds is cloud feedback, λstratospheric is feedback from changes in stratospheric water vapor and temperature, and λatmosComp is feedback from changes in atmospheric composition. The error/uncertainty term (ελ) for total feedback λ represents the sum of independent error terms for its components, with λclouds being likewise the sum of components, each subject to independent errors. Since errors are independent and are assumed to be normally distributed, ελ is also normally distributed, with variance equal to the sum of the variances of its components.
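The way independent normal component errors combine can be shown in a few lines. The component means and standard deviations below are illustrative placeholders in W m⁻² K⁻¹, not estimates stated in this text.

```python
import math

# (mean, standard deviation) for each feedback component; illustrative values only
components = {
    "Planck": (-3.20, 0.10),
    "WV+LR": (1.15, 0.15),
    "sfcAlbedo": (0.30, 0.15),
    "clouds": (0.45, 0.33),
    "stratospheric": (0.00, 0.10),
    "atmosComp": (0.00, 0.15),
}

# Means add; for independent errors the variances add
lam_mean = sum(mean for mean, _ in components.values())
lam_var = sum(sd**2 for _, sd in components.values())
lam_sd = math.sqrt(lam_var)
print(f"lambda = {lam_mean:.2f} +/- {lam_sd:.2f} W m-2 K-1")
```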

Provided that the estimates of the climate feedback components are of the values they take over 150-year abrupt4xCO2 GCM simulations, which they all are, the resulting climate feedback estimate will be on a basis consistent with its derivation over such simulations, and hence be of λ. As a result, S can be derived from this feedback estimate, using (4), without any adjustment being made to it. S20 do so by dividing the feedback estimate into (unadjusted) F2×CO2, but (4) requires use of \(F_{{2 \times {\text{CO2}}}}^{{{\text{regress}}}}\). F2×CO2 should accordingly be multiplied by a scaling factor, γ, to convert it to \(F_{{2 \times {\text{CO2}}}}^{{{\text{regress}}}}\) (Sect. 4.1), so that:

$$S = - \gamma F_{2 \times \text{CO2}} /\lambda \tag{7}$$

For Historical evidence, differences in sea surface temperature (SST) change patterns may cause feedback estimated using (1), denoted \(\lambda_{{{\text{hist}}}} = - [\Delta F_{{{\text{Hist}}}} - \Delta N_{{{\text{Hist}}}} ]/\Delta T_{{{\text{Hist}}}}\), to differ from that over 150-year GCM abrupt4xCO2 simulations, λ. An estimate of the effect, denoted \(\Delta \lambda = \lambda - \lambda_{{{\text{hist}}}}\), of such differences in SST change patterns (the historical pattern effect) is allowed for when computing S, but not when calculating an alternative measure, Shist. Since the appropriate λ is thereby being used to estimate S, F2×CO2 should be scaled by γ, as for Process evidence. Using (7):

$$S = - \gamma F_{2 \times \text{CO2}} /(\lambda_{\text{hist}} + \Delta \lambda ) = \gamma F_{2 \times \text{CO2}} /\{ [\Delta F_{\text{Hist}} - \Delta N_{\text{Hist}} ]/\Delta T_{\text{Hist}} - \Delta \lambda \} \tag{8}$$


$$S_{\text{hist}} = - F_{2 \times \text{CO2}} /\lambda_{\text{hist}} = F_{2 \times \text{CO2}} /\{ [\Delta F_{\text{Hist}} - \Delta N_{\text{Hist}} ]/\Delta T_{\text{Hist}} \} \tag{9}$$
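The two Historical-evidence expressions above can be evaluated with a small helper. The function name `s_from_historical` and every input value are hypothetical, chosen only to show how the pattern-effect term Δλ and the scaling factor γ enter S but not Shist.

```python
def s_from_historical(dF, dN, dT, F2x, d_lambda=0.0, gamma=1.0):
    """Return (S, S_hist) from historical changes.

    dF, dN: changes in ERF and TOA imbalance (W m-2); dT: warming (K).
    d_lambda: historical pattern effect, lambda - lambda_hist (W m-2 K-1).
    gamma: scaling factor converting F2x to its regression-based counterpart.
    """
    lam_hist = -(dF - dN) / dT                 # feedback implied by historical changes
    S = -gamma * F2x / (lam_hist + d_lambda)   # allows for the pattern effect
    S_hist = -F2x / lam_hist                   # ignores the pattern effect
    return S, S_hist

# Hypothetical central values
S, S_hist = s_from_historical(dF=2.2, dN=0.6, dT=1.0, F2x=4.0, d_lambda=0.5)
print(f"S = {S:.2f} K, S_hist = {S_hist:.2f} K")
```

With a positive Δλ the feedback over abrupt4xCO2 simulations is less negative than the historical feedback, so S exceeds Shist.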

ΔFHist is the sum of forcing component ERFs, which (apart from F2×CO2) have uncertainties that are independent of each other and of the likewise independent uncertainties in the other right-hand side terms in (8) and (9):

$$\begin{gathered} \Delta F_{\text{Hist}} = \Delta F_{\text{Hist}}^{\text{CO2}/2\times} F_{2 \times \text{CO2}} + \Delta F_{\text{Hist}}^{\text{othGHG}} + \Delta F_{\text{Hist}}^{\text{O3}} + \Delta F_{\text{Hist}}^{\text{aerosol}} + \Delta F_{\text{Hist}}^{\text{LandUse}} \hfill \\ + \Delta F_{\text{Hist}}^{\text{vapor}} + \Delta F_{\text{Hist}}^{\text{BCsnow}} + \Delta F_{\text{Hist}}^{\text{contrails}} + \Delta F_{\text{Hist}}^{\text{solar}} + \Delta F_{\text{Hist}}^{\text{volcanic}} \hfill \\ \end{gathered} \tag{10}$$

where \(\Delta F_{\text{Hist}}^{\text{othGHG}}\) includes well-mixed non-CO2 greenhouse gases, \(\Delta F_{\text{Hist}}^{\text{O3}}\) includes tropospheric and stratospheric ozone, \(\Delta F_{\text{Hist}}^{\text{vapor}}\) represents stratospheric water vapor and \(\Delta F_{\text{Hist}}^{\text{BCsnow}}\) represents black carbon on snow and ice.

When S is inferred from Process or Historical evidence using S20's assumptions, γ is set to one, as no adjustment to F2×CO2 was made in S20.

For Paleoclimate evidence, there are separate models for changes related to the Last Glacial Maximum (LGM), mid-Pliocene Warm Period (mPWP) and Paleocene-Eocene Thermal Maximum (PETM), all of which involve a division by (1 + ζ) to convert an estimate for ECS into one for S.

3.2.1 For the LGM

$$S = \Delta T_{\text{LGM}} /\{ (1 + \zeta )[\Delta F_{\text{LGM}}^{\text{CO2}/2\times} + (\Delta F_{\text{LGM}}^{\text{exCO2}} + 0.5\alpha \Delta T_{\text{LGM}}^{2})/F_{2 \times \text{CO2}} ]\} \tag{11}$$

where \(\Delta F_{{{\text{LGM}}}}^{{{\text{CO2}{{/2 \times}}}}}\) represents the LGM − preindustrial CO2 forcing change as a fraction of that from CO2 doubling; \(\Delta F_{{{\text{LGM}}}}^{{{\text{exCO2}}}}\) represents the corresponding change in non-CO2 forcing; and α is a coefficient for state dependence in climate feedback.

3.2.2 For the mPWP

$$S = \Delta T_{\text{mPWP}} /\{ [\log (1 + \Delta {\text{CO2}}_{\text{mPWP}} )/\log \left( 2 \right)](1 + f_{\text{mPWP}}^{\text{CH4}} )(1 + \zeta )(1 + f_{\text{mPWP}}^{\text{ESS}} )\} \tag{12}$$

where \((1 + f_{{{\text{mPWP}}}}^{{{\text{ESS}}}} )\) represents the ratio of ESS to ECS for the mid-Pliocene, \(f_{{{\text{mPWP}}}}^{{{\text{CH4}}}}\) the forcing from methane relative to that from CO2, and ΔCO2mPWP the fractional increase in mid-Pliocene CO2 concentration over an assumed 284-ppm equivalent preindustrial well-mixed greenhouse gases state. A logarithmic CO2 forcing–concentration relationship holds over this range.
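The mPWP relationship above can be written as a short function. The function name `s_from_mpwp` and the input values are hypothetical and serve only to show how each factor scales the result.

```python
import math

def s_from_mpwp(dT, dCO2_frac, f_CH4, zeta, f_ESS):
    """Effective sensitivity S implied by mid-Pliocene warming.

    dT: mPWP warming (K); dCO2_frac: fractional CO2 increase over preindustrial.
    f_CH4: methane forcing relative to CO2 forcing; zeta: ECS/S - 1.
    f_ESS: ESS/ECS - 1 for the mid-Pliocene.
    """
    doublings = math.log(1.0 + dCO2_frac) / math.log(2.0)   # CO2 forcing in units of F2x
    return dT / (doublings * (1.0 + f_CH4) * (1.0 + zeta) * (1.0 + f_ESS))

# Hypothetical central values
S_mpwp = s_from_mpwp(dT=3.0, dCO2_frac=0.4, f_CH4=0.4, zeta=0.06, f_ESS=0.5)
print(f"S = {S_mpwp:.2f} K")
```

As a consistency check, an exact doubling of CO2 with no methane forcing, ζ = 0 and no slow feedbacks returns the warming itself.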

3.2.3 For the PETM

$$S = [1/(1 + \zeta )]/\{ [(1 + f_{\text{PETM}}^{\text{CO2nonLog}} )\log (1 + \Delta {\text{CO2}}_{\text{PETM}} )/\log \left( 2 \right)](1 + f_{\text{PETM}}^{\text{CH4}} )/\Delta T_{\text{PETM}} - \beta_{\text{PETM}} /F_{2 \times \text{CO2}} \} \tag{13}$$

where βPETM allows for possible state-dependence of climate feedback and any slow feedbacks affecting ESS but not ECS, and \(f_{{{\text{PETM}}}}^{{{\text{CO2nonLog}}}}\) scales CO2 forcing from a logarithmic relationship with concentration, to correct for deviations therefrom at high concentration. S20 omits \(f_{{{\text{PETM}}}}^{{{\text{CO2nonLog}}}}\), so it is set to zero when S is inferred using S20's assumptions. Uncertainty in \(f_{{{\text{PETM}}}}^{{{\text{CO2nonLog}}}}\) is minute relative to that in ΔCO2PETM and is therefore ignored.

Significant correlation is assumed to exist between errors in ΔTPETM and ΔTmPWP and between errors in ΔCO2PETM and ΔCO2mPWP, with errors in all other variables being independent.

The foregoing equations are rearrangements of S20 Eqs. (4), (18), (20), (21), (22), (23) and (24), with the additional γ and \(\left( {{1} + f_{{{\text{PETM}}}}^{{{\text{CO2nonLog}}}} } \right)\) terms.

S20 also considered, but neither created a statistical model for nor used in their main results, 'emergent constraints' on climate sensitivity. These depend on relationships between selected observationally-constrainable variables and climate sensitivity in GCMs. In almost all cases, strong relationships found in one generation of GCMs have been statistically insignificant and/or substantially different in another GCM generation (Caldwell et al. 2014, 2018; Schlund et al. 2020), casting substantial doubt on their reliability. Biases common to all or most models are another concern.

3.3 Likelihood estimation for S

Likelihood estimation for a parameter is not straightforward when its value depends on multiple data-variables, even apart from the question of how to combine different lines of evidence. For each line of evidence separately, each set of data-variable value realizations corresponds to a unique value of the parameter and data-variable errors are independent, so the joint probability density for any set of data-variable values can be derived as the product of their PDFs at the values concerned, and assigned to the parameter value that they imply.

However, each value of the parameter, S, will correspond to an infinite set of combinations of differing data-variable values. Accordingly, producing a single likelihood corresponding to each S value requires some method of weighting probability densities for the different data-variable value combinations.

S20's likelihood estimation method involves sampling S uniformly and F2×CO2 pro rata to its PDF, the sample ratios providing λ samples, and for each line of evidence except Process (where the likelihood is analytically calculable) also sampling pro rata to its PDF each remaining data-variable involved other than ΔT. They take the likelihood of each resulting multivariate sample set as equal to the PDF of ΔT at the value implied by the sample set's λ and data-variable values. They bin the multivariate sample sets by their S values, and compute the S likelihood for each bin as the average of the likelihoods of the sample sets it contains.

While S20's likelihood estimation method may satisfactorily estimate the actual likelihood for S in simple cases, it is not clear why it would provide a realistic estimate of the likelihood where, for example, a data-variable has substantial asymmetrical uncertainty or S is related to it non-linearly. Such circumstances arise with Historical evidence, due to the asymmetrical PDF for aerosol ERF, and with PETM (and to a lesser extent mPWP) evidence, due to the logarithmic relationship between CO2 concentration ratio and ERF. Investigation confirms that S20's method of likelihood estimation is indeed unsound (supplemental material S2). Their method causes substantial misestimation of the Historical and PETM likelihoods, and non-negligible misestimation of the mPWP likelihood. The PETM misestimation is worsened by a coding error that resulted in the ΔCO2PETM standard deviation used being one tenth of its correct value (supplemental material S2). Therefore, I do not use S20's likelihood estimation method in this study.

Rather than relying on a single likelihood estimation method, I employ three alternative methods, with the resulting likelihoods cross-checked. Each method involves setting up S value bins on a fine (0.01 K) grid spanning 0−20 K.

The first and third likelihood estimation methods involve first randomly sampling all the data-variables involved in estimating S from the line of evidence concerned that are not treated as fixed, weighting the sampling pro rata to their PDFs. The S value that each resulting multivariate sample set's data-variable values implies is computed, and each sample set is allocated to the appropriate S bin. The number of sample sets in each bin then provides an estimated (posterior) PDF for S. Between \(10^{7}\) and \(10^{8}\) sample sets are drawn, depending on the case. Since this procedure requires a unique S value for each sample set, these likelihood estimation methods can only be used for single lines of evidence (for a single period in the case of paleoclimate evidence). This sampling method of deriving a posterior PDF for S has been widely used (Gregory et al. 2002; Otto et al. 2013; Lewis and Curry 2015, 2018). The method is prior-free, in the sense that no explicit prior selection is required; however, it is equivalent to Bayesian estimation using a noninformative prior.

The first likelihood estimation method effectively involves estimating, at each S-bin value, the probability-weighted likelihood integrated over data-variable space; it gives the highest weight to those combinations of data-variable values most likely to arise. This 'integrated likelihood' method is implemented by taking the sample sets generated and used to derive a PDF for S as set out in the preceding paragraph, and computing the likelihood for each sample set (as the product of the data-variable PDFs at their sampled values). The likelihood for the S value at each bin center is then derived as the simple average of the likelihoods of all sample sets in the bin.
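The sampling and binning steps can be sketched for a toy two-variable case. The sketch assumes the simple model S = −F2×CO2/λ with independent Normal data-variables, using the S20 Process central values quoted elsewhere in this paper (λ = −1.30 ± 0.44, F2×CO2 = 4.0 ± 0.3) purely for illustration. The function and variable names are hypothetical, and far fewer samples are drawn than in the study.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def integrated_likelihood(n_samples=200_000, bin_width=0.01, s_max=20.0,
                          lam=(-1.30, 0.44), f2x=(4.0, 0.3), seed=0):
    """Return (pdf, likelihood) lists over S bins of width bin_width on [0, s_max)."""
    rng = random.Random(seed)
    n_bins = int(round(s_max / bin_width))
    counts = [0] * n_bins
    like_sum = [0.0] * n_bins
    for _ in range(n_samples):
        # Sample each data-variable pro rata to its PDF
        lam_i = rng.gauss(*lam)
        f_i = rng.gauss(*f2x)
        if lam_i == 0.0:
            continue
        s = -f_i / lam_i                      # S implied by this sample set
        if not (0.0 <= s < s_max):
            continue                          # outside the S grid
        k = min(int(s / bin_width), n_bins - 1)
        counts[k] += 1
        # Likelihood of the sample set: product of data-variable PDFs
        like_sum[k] += normal_pdf(lam_i, *lam) * normal_pdf(f_i, *f2x)
    # Bin counts give the sampling-derived PDF for S
    pdf = [c / (n_samples * bin_width) for c in counts]
    # Simple average of sample-set likelihoods within each bin (NaN if empty)
    likelihood = [ls / c if c else float("nan")
                  for ls, c in zip(like_sum, counts)]
    return pdf, likelihood

pdf, likelihood = integrated_likelihood()
```

Because the PDF is normalized by the total number of samples drawn, it integrates to the probability that S falls inside the grid rather than to exactly one. Empty bins are left as NaN here rather than smoothed.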

The second method uses the profile likelihood (Pawitan 2001), a widely used measure that typically provides a close approximation to likelihood derived using more sophisticated methods. This method applies the entire weight to that combination of data-variable values which, at the S value concerned, maximizes the likelihood. The profile likelihood is derived by using an optimization algorithm to find, for each fine-grid value of S, the data values combination that maximizes the product of the data-variable PDFs for the line or lines of evidence concerned, allowing for any uncertainties that are common or correlated between data-variables.
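For the same toy model (S = −F2×CO2/λ, independent Normal λ and F2×CO2, S20 Process central values used for illustration only), the profile likelihood can be sketched as follows. For two Gaussians the constrained maximum has a closed form, which is used here in place of the numerical optimization algorithm described above. The names are hypothetical and the likelihood is returned only up to a constant factor.

```python
import math

def profile_likelihood(s, lam=(-1.30, 0.44), f2x=(4.0, 0.3)):
    """Profile likelihood at S for the toy model, up to a constant factor."""
    mu_l, sd_l = lam
    mu_f, sd_f = f2x
    # On the constraint F = -S * lambda, the product of the two Normal PDFs
    # is maximized at this lambda (closed form, valid for Gaussians only).
    lam_star = ((mu_l / sd_l**2 - s * mu_f / sd_f**2)
                / (1.0 / sd_l**2 + s**2 / sd_f**2))
    f_star = -s * lam_star
    chi2 = ((lam_star - mu_l) / sd_l) ** 2 + ((f_star - mu_f) / sd_f) ** 2
    return math.exp(-0.5 * chi2)

# Evaluate on a fine 0.01 K grid and locate the maximum
grid = [0.01 * k for k in range(1, 2001)]
profile = [profile_likelihood(s) for s in grid]
s_hat = grid[profile.index(max(profile))]
```

In this toy case the maximum falls at the grid point nearest 4.0/1.30, that is 3.08 K. Asymmetric or correlated data-variables would require a genuine numerical optimizer.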

Finally, a likelihood is calculated using the data-doubling method (Efron 1993; Lewis 2018). This involves the supposition that the evidence involved represented an observational data set, and that an identical but independent data set had also been observed. A posterior PDF for S based on emulating the stronger evidence provided by such doubled data is computed. This can often be effected by halving the actual variance of each data-variable and sampling pro rata to the reduced variance data-variable PDFs. An implied likelihood is then computed by dividing that PDF by the posterior PDF corresponding to the actual data set, derived by sampling pro rata to the actual data-variable PDFs. The validity of this 'doubled data likelihood' method follows directly from (5) if a noninformative prior is involved, since the same prior is noninformative for repeated observations from the same experiment, and the doubled data likelihood will equal the product of the original data likelihood with itself:

$$p_{{\varvec{\theta}}} ({\varvec{\theta}}|{\varvec{yy}})/p_{{\varvec{\theta}}} ({\varvec{\theta}}|{\varvec{y}}) = [c_{1} \, p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})p_{{\varvec{\theta}}} ({\varvec{\theta}})]/[c_{2} p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})p_{{\varvec{\theta}}} ({\varvec{\theta}})] = c_{3} p_{{\varvec{y}}} ({\varvec{y}}|{\varvec{\theta}})$$

However, this method will not work satisfactorily where evidence is represented by a distribution for which emulating a doubled data version is problematic. That is the case for S20's Historical evidence, where the aerosol ERF distribution is highly asymmetrical and has no analytical form.

Before each derived PDF and likelihood is used it is generally smoothed with a spline-based method, retaining sufficient degrees of freedom to very closely match the shape of the unsmoothed original.

3.4 Noninformative prior and posterior PDF estimation

When either an integrated or doubled-data likelihood is used, a prior for S is derived by dividing the associated sampling-derived PDF by the estimated likelihood. This is an exact probability-matching prior by construction, and, since a Jeffreys' prior provides the closest probability matching, the derived prior is necessarily a noninformative Jeffreys' prior (provided the likelihood is valid).

The profile likelihood method only produces a likelihood, so it is necessary to separately derive a Jeffreys' prior, \(\pi_{{{JP}}} (S)\), to use therewith. The 'data-space movement' method used is based on a direct measure of the local informativeness of the data about the parameter. Details of this method, and of the calibration of all the Jeffreys' priors, are given in the supplemental material (S3).

A PDF for S can then be derived as the product of the profile likelihood and the related data-space movement prior. This PDF cannot account for probability outside the range of S values used, so it is normalized to unit probability over that range. References to profile likelihood method PDFs are to such PDFs derived directly, or after combining likelihoods and priors from different lines of evidence.
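The product-and-normalize step can be sketched on a discrete S grid. The function name and the three-point example inputs are hypothetical and carry no physical meaning.

```python
def normalized_posterior(likelihood, prior, bin_width):
    """Multiply likelihood by prior and normalize to unit probability on the grid."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized) * bin_width
    if total <= 0.0:
        raise ValueError("likelihood x prior has no mass on the grid")
    return [u / total for u in unnormalized]

# Placeholder inputs on a three-bin grid of width 0.5
posterior = normalized_posterior([0.2, 1.0, 0.4], [1.0, 0.5, 0.25], 0.5)
```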

3.5 Applying the Objective Bayesian statistical methods to combining S20's evidence

The statistical models employed in S20 to link S to data-based evidence necessitate a more general approach to combining lines of evidence using a single combination-based noninformative prior than was employed in Lewis (2018) and Lewis and Grunwald (2018). The mechanics involved are detailed in the supplemental material (S4).

A key motivation for using several different likelihood and noninformative prior estimation methods is that comparison of their performance when combining lines of evidence can provide confidence that both those methods, and the combination methods used, are valid. For S20's Historical evidence, to which the doubled data method cannot successfully be applied, likelihoods from the integrated likelihood and profile likelihood methods are almost identical. However, the profile likelihood data-space movement prior from Historical evidence is poor, due to imperfect optimization and difficulty representing the informativeness of the aerosol forcing distribution used by S20, and profile likelihood based estimates of S from all lines of evidence combined cannot reliably be derived by a single optimization. I therefore examine the combined-evidence likelihoods, priors and posterior PDFs for S that the three methods produce when combining S20's Process evidence with LGM and mPWP Paleoclimate evidence. For the two sampling-based methods, doing so involves combining separate estimates from Process, LGM and mPWP evidence. For the profile likelihood method, these are one-step estimates from simultaneous inference using data from all three individual lines of evidence. For all methods, the combined-evidence posterior PDF for S is normalized to unit probability over 0–20 K.

Figure 1a compares the combined Process, LGM and mPWP evidence likelihood estimates from the aforementioned three methods. They are indistinguishable. Figure 1b compares the three related Jeffreys' priors from \(S = 1.5{\text{ K}}\) up. Below that level the integrated likelihood and doubled data sampling-derived priors are artefacted, due to the paucity of samples for Process evidence, and behave erratically. However, the likelihood is almost zero below \(S = 1.5{\text{ K}}\), and the total probability in that region is only 0.05%, so the effect on inference for S is negligible. The resulting posterior PDFs for S using the three methods (Fig. 1c) are indistinguishable. Their medians are all within 0.01 K of each other, their 5th percentiles are all 2.07 K and their 95th percentiles are all within ± 0.03 K.

Fig. 1

Results from combining S20's Process evidence with LGM and mPWP Paleoclimate evidence using the integrated likelihood, doubled data and profile likelihood based methods. a Combined evidence likelihoods; b Combined evidence priors; c Combined evidence posterior probability density functions. In each case the solid black line is from the integrated likelihood method, the dashed magenta line is from the doubled data method and the dotted cyan line is from the profile likelihood method and/or the related data-space movement prior

The very close agreement among combined-evidence inference for S using three different methods of deriving and combining likelihoods, noninformative Jeffreys' priors and PDFs provides strong support for the general validity of all the methods for this application.

The main results presented in Sects. 5 and 6 use the robust sampling-based integrated likelihood method—which in all cases produces a satisfactory likelihood and prior—for all lines of evidence.

4 Review and revision of S20 data-variable assumptions

I now make some revisions to S20's data-variable assumptions for various lines of evidence, which are justified on the basis of more recent evidence, by a preferable alternative interpretation of the same evidence, or because they remedy an error or omission. The scaling factor γ for F2×CO2 is included in these revisions. The original and revised estimates for all data-variables are set out in Tables 1, 2 and 3, with the reasons for changes. The evidence justifying each revision is reviewed in detail in the supplemental material (S5); evidence relating to a number of the unrevised data-variable estimates is also reviewed there. Results from applying the Objective Bayesian approach to inference using S20's assumptions and the revised assumptions are given in Sects. 5 and 6 respectively.

Table 1 Parameters of the data-variable distributions used to estimate S from Process evidence
Table 2 Parameters of the data-variable distributions used to estimate S from Historical evidence. All changes are in global and time means between the 1861–1880 and 2006–2018 periods
Table 3 Parameters of the data-variable distributions used to estimate S from Paleoclimate evidence

The units of all stated feedback values are Wm−2 K−1. Uncertainties indicated by ± represent one standard deviation, with a Normal distribution, denoted N(mean, standard deviation), assumed.

4.1 F2×CO2 and its scaling when using Eq. (4)

S20 use the estimate of stratospherically-adjusted forcing from doubled CO2 of 3.80 Wm−2 per the simplified formula in Etminan et al. (2016), and add 5% for tropospheric adjustments, arriving at an ERF estimate for F2×CO2 = 4.0 ± 0.3 Wm−2. Meinshausen et al. (2020) fitted Etminan et al.'s results more precisely, obtaining an F2×CO2 value 1.5% lower. Based on their more accurate formula, and using the same 5% tropospheric adjustment, F2×CO2 ERF was assessed at 3.93 ± 0.3 Wm−2 in AR6 (Forster et al. 2021). The ratio of F4×CO2 to F2×CO2 per the Meinshausen et al. (2020) formula is 2.10, 5.0% higher than under a log(concentration) relationship. I adopt these numbers in Sect. 6 when estimating S.
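The figures in this paragraph can be checked with simple arithmetic. The sketch below assumes the 1.5% reduction applies to the 3.80 Wm−2 stratospherically-adjusted value before the 5% tropospheric adjustment is added; the variable names are illustrative.

```python
f_strat_etminan = 3.80     # W m^-2, stratospherically-adjusted, Etminan et al. (2016)
tropospheric_adj = 1.05    # +5% for tropospheric adjustments

# S20: 3.80 * 1.05 = 3.99, quoted as 4.0
f2x_s20 = f_strat_etminan * tropospheric_adj

# AR6: value 1.5% lower (assumed applied before the adjustment), about 3.93
f2x_ar6 = f_strat_etminan * (1.0 - 0.015) * tropospheric_adj

# Under a purely logarithmic relationship F4x/F2x would be exactly 2
excess_over_log = 2.10 / 2.0 - 1.0   # 0.05, i.e. 5.0% higher
```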

Care must be taken to use the appropriate F2×CO2 value when applying (4). S20 use their estimate of the actual ERF from a doubling of CO2 concentration. However, as stated in Sect. 3.2, when feedback is estimated using a linear model on a basis consistent with behavior during years 1–150 of abrupt4xCO2 simulations, normally by ordinary least squares regression, then F2×CO2 should be converted into an estimate of \(F_{{{{2 \times CO2}}}}^{{{\text{regress}}}}\), the ERF implied by the y-axis regression line intercept, by multiplying it by \(\gamma = F_{{{{2 \times CO2}}}}^{{{\text{regress}}}} /F_{{{{2 \times CO2}}}}^{{}}\). That is because S is defined in terms of the climate feedback λ arising from system behavior over years 1–150 after a (hypothetical or actual) quadrupling of CO2 concentration, and as per (4) equals \(- F_{{{{2 \times CO2}}}}^{regress} /\lambda\) (Sect. 2). When climate feedback weakens during the course of 150-year abrupt CO2 forcing simulations, as it does in the vast majority of GCMs, \(F_{{{{2 \times CO2}}}}^{{{\text{regress}}}}\) will underestimate the GCM's actual F2×CO2. Therefore, dividing the actual F2×CO2 by climate feedback estimated over the whole simulation period is bound to overestimate S (supplemental material S1; compare red and black lines in Figure S1.1). For an ensemble of CMIP5 and CMIP6 GCMs, the \(F_{{{{2 \times CO2}}}}^{regress} /F_{{{{2 \times CO2}}}}^{{}}\) ratio is 0.86 ± 0.09 (supplemental material Table S1, rounding up the standard deviation, S5.1.5). I adopt this estimate.

S20 recognize this issue, conceding a similar overestimation of S, but neglect it, asserting incorrectly that it only affects feedback estimates from GCMs. This misconception results in S20's estimates of S from Process and Historical evidence being biased high. The issue is intrinsic to the use of (4) and the S20 definitions of S and λ, which involve a single λ value, estimated on a basis consistent with that obtained by regressing over 150-year abrupt4xCO2 simulations.

For Process evidence, the bias is self-evident, since almost all λ components are estimated by or on a basis consistent with regressing changes over abrupt4xCO2 simulations. In particular, low cloud feedback, the dominant cause of weakening feedback over abrupt4xCO2 simulations, and hence of \(\gamma < 1\), is so estimated (supplemental material S5.1.1, S5.1.3).

For Historical evidence, S20 estimate λhist as \(- (\Delta F_{{{\text{Hist}}}} - \Delta N_{{{\text{Hist}}}} )/\Delta T_{{{\text{Hist}}}}\) and divide it into −F2×CO2 to estimate Shist (supplemental material Figure S1.1, blue line). They then adjust λhist by their estimate of the difference between λ in abrupt4xCO2 simulations and λhist, and estimate S by dividing the resulting λ estimate into −F2×CO2, rather than (as should be done) into \(- F_{{{{2 \times CO2}}}}^{{{\text{regress}}}}\), thereby overestimating S (supplemental material Figure S1.1, red line).

Accordingly, in the statistical models S20 uses to estimate S from both Process and Historical evidence, F2×CO2 needs to be scaled by \(F_{{{{2 \times CO2}}}}^{regress} /F_{{{{2 \times CO2}}}}^{{}}\). I do so in Sect. 6. Paleoclimate estimation of S is unaffected, since that in effect estimates climate feedback from equilibrium changes, derives ECS by dividing it into −F2×CO2, and then converts ECS into S (see (11), (12) and (13)).

4.2 Process evidence

The data-variable distributions adopted by S20, and as revised here, for estimating S from Process evidence are summarised in Table 1. The main changes made are to low cloud feedback, reflecting strong recent evidence that it is weaker than S20's assessment, and scaling F2×CO2 to \(F_{{{{2 \times CO2}}}}^{{{\text{regress}}}}\) (Sect. 4.1).

The significant revision in estimated tropical and mid-latitude (60°S–60°N) marine low cloud feedback is discussed in detail in the supplemental material (S5.1.3). In brief, both the S20 and the revised estimates depend primarily on observational estimates of low cloud response to cloud-controlling factors (CCF). S20's assessment of tropical marine low cloud feedback was based primarily on, and equals, an estimate from the Klein et al. (2017) review, which only took two CCF into consideration. For the 30–60° mid-latitude bands S20 also used GCM-derived evidence. The revised median estimate is the observationally-constrained 60°S–60°N value from Myers et al. (2021). They use a more comprehensive set of CCF and argue that their feedback estimate is more realistic than Klein et al.'s. Cesana and Del Genio (2021) likewise find the Klein et al. feedback estimate to be too high.

The revisions to the S20 median data-variable values change the central λ estimate from Process evidence alone from −1.30 ± 0.44 in S20 to −1.53 ± 0.44, while the maximum likelihood estimate of S changes from 3.08 K to 2.21 K.
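The two S values quoted above follow from the central estimates, taking S = −γ F2×CO2/λ with γ = 1 under S20's assumptions and γ = 0.86 after revision (Sect. 4.1). A minimal check, with illustrative variable names and ignoring all uncertainty:

```python
def s_central(lam, f2x, gamma=1.0):
    """Central S implied by feedback lam (W m^-2 K^-1) and forcing f2x (W m^-2)."""
    return -gamma * f2x / lam

# S20 assumptions: lambda = -1.30, F2xCO2 = 4.0, no gamma scaling
s_s20 = s_central(-1.30, 4.0)

# Revised assumptions: lambda = -1.53, F2xCO2 = 3.93, gamma = 0.86
s_revised = s_central(-1.53, 3.93, gamma=0.86)
```

Rounded to two decimals these give 3.08 K and 2.21 K. About 0.46 K of the 0.87 K reduction comes from the stronger feedback and lower F2×CO2, with the remainder due to the γ scaling.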

4.3 Historical evidence

The data-variable distributions adopted by S20, and as revised here, for estimating S from Historical evidence are summarised in Table 2. The changes made to S20's data-variable estimates are discussed in detail in the supplemental material (S5.2). The main changes made are to aerosol forcing, reflecting the most quantitative of the recent evidence that it is weaker than S20 assumed, to other forcings and the ΔGMAT–ΔGMST difference, reflecting AR6 assessments, to the historical pattern effect, reflecting evidence that most SST datasets indicate little unforced element, and scaling F2×CO2 to \(F_{{{{2 \times {\text{CO2}}}}}}^{{{\text{regress}}}}\).

The significant revision in estimated aerosol ERF is discussed in considerable detail in the supplemental material (S5.2.3). Briefly, S20 use the unconstrained aerosol forcing distribution from Bellouin et al. (2020; hereafter B20). That distribution is based on a complex theoretical formula that depends on a number of factors, all of which are estimated separately. There is considerable evidence suggesting that B20 overestimate aerosol forcing strength. The revised estimate median uses recent evidence (Gryspeerdt et al. 2019; Possner et al. 2020; Glassmeier et al. 2021) regarding just one of the factors involved in B20's calculations: the cloud liquid water path sensitivity factor used when adjusting the radiative forcing from aerosol-cloud interactions to an ERF basis. The revised median aerosol ERF estimate is derived by carrying out the same computation as in B20 except for changing the estimate of that sensitivity factor. The revised ERF estimate adopts a Gaussian uncertainty distribution with the same 5% bound as assessed in AR6; B20's theoretically-derived distribution assigns significant probability to extremely negative aerosol ERF values, but these appear inconsistent with observational evidence that is independent of the global temperature record.

Another significant revision is that to the Historical pattern effect feedback adjustment, discussed in detail in the supplemental material (S5.2.4). Briefly, S20's median estimate is based on that reported by Andrews et al. (2018), which was derived by comparing, in six GCMs, estimated feedback in abrupt4xCO2 simulations with that in fixed SST simulations with evolving observational historical SST patterns. S20 reduced the Andrews et al. (2018) estimate by 0.1 Wm−2 K−1 to allow for the possibility that the pattern effect may be smaller than reported in that study. However, substantial evidence now exists that the observational SST dataset used for all the fixed SST simulations assessed by Andrews et al. (2018) is an outlier in terms of the magnitude of the pattern effect that it gives rise to (Lewis and Mauritsen 2021; Zhou et al. 2021; Fueglistaler and Silvers 2021). The revised pattern effect adjustment reflects that evidence.

The revisions to the S20 data-variable median values reduce the estimates of S and Shist that they imply from 5.82 to 2.16 K, and from 3.37 to 2.05 K, respectively.

4.4 Paleoclimate evidence

Paleoclimate evidence has the advantage of being largely independent of, and sometimes involving a much larger signal than, the historical period, but suffers from relating to different states of the Earth. The evidence is also derived from imprecise, geographically limited, and potentially biased proxies that provide estimates for only some relevant variables.

S20 evaluate evidence from climate transitions during the LGM, mPWP and PETM, but their main results exclude PETM evidence. For all three periods, an estimate of ECS is converted into one for S by dividing it by (1 + ζ), the distribution of which is revised (Sect. 2; supplemental material S5.3.1), as is the median F2×CO2 estimate (Sect. 4.1).

4.4.1 LGM

The best studied paleoclimate transition, and that most used for estimating climate sensitivity, is the change from the LGM, the coldest phase in the last ice age, some twenty thousand years ago to the preindustrial Holocene. A significant advantage of the LGM transition is that, unlike more distant periods, there is proxy evidence not only of changes in temperature and CO2 concentration but also of non-CO2 forcings, and that enables estimation of the effects on radiative balance of slow (ice sheet, etc.) feedbacks, which need to be treated as forcings in order to estimate ECS (and hence S) rather than ESS. Moreover, the temperature proxy evidence is sufficient to enable spatially-weighted global means to be estimated (Annan and Hargreaves 2013).

The two changes made to S20's data-variable estimates are discussed in detail in the supplemental material (S5.3.2). The revision to land ice and sea-level forcing adds an estimate of the omitted albedo change caused by sea-level fall exposing more land. The revision to ΔT brings it closer to the average of estimates from S20's cited sources. The revisions I make to S20's median LGM data-variable values, along with the revised ζ estimate, reduce the estimate of S that they imply from 2.63 K to 1.97 K.

4.4.2 mPWP

The mid-Pliocene warm period, approximately 3 Ma ago, was moderately warmer than preindustrial times, and in that respect a closer analogue than the LGM of conditions expected during this century. However, the temperature change involved was smaller than for the LGM, there is more uncertainty about CO2 levels, temperature proxies are more limited, and usable proxy-based estimates of non-CO2 forcing are unavailable.

The changes made to S20's data-variable estimates are discussed in detail in the supplemental material (S5.3.3). They relate to ΔT and the ESS–ECS ratio, in both cases reflecting ratios per the more recent PlioMIP2 project (Haywood et al. 2020). The revisions I make to S20's mPWP data-variable median values, along with the revised ζ estimate, reduce the estimate of S that they imply from 3.36 to 2.33 K, more in line with estimates from the LGM and PETM.

4.4.3 PETM

The PETM temperature excursion period some 56 Ma ago was much warmer than the present, and differed geographically and orographically. S20 state that the PETM is arguably the best pre-Pliocene warm interval for estimating ECS; it has been fairly well studied and involves a large signal. Nevertheless, they excluded PETM evidence in their Baseline estimates "due to the large uncertainties and the danger of over-constraining the likelihood should these be underestimated". While the PETM uncertainties are substantial, S20 makes generous allowance for them, and any underestimation of uncertainties appears much more likely to cause overestimation than underestimation of S (supplemental material S5.3.4). In view of that and of the large signal involved in the PETM, I use it, but as an alternative to the mPWP rather than combining their evidence, since doing so provides very little benefit when the estimated data-variable error correlations between the two periods are allowed for (supplemental material S5.3.5).

The one revision I make to S20's PETM data-variable median values allows for the CO2 ERF–concentration relationship not being exactly logarithmic, which in combination with the revised ζ estimate reduces the estimate of S that they imply from 2.38 K to 1.99 K. This change is discussed in detail in the supplemental material (S5.3.4), where the evidence relating to several of S20's other PETM data-variable estimates is also discussed.

The data-variable distributions adopted by S20, and as revised here, for estimating S from Paleoclimate evidence for each period are summarised in Table 3.

5 Results using S20 data-variable assumptions: comparison using different methods

I now compare S20's results with those derived here using the same input assumptions and either a computed noninformative Jeffreys' prior or the same prior as used in S20. I start by comparing likelihoods, as these are the foundation of parameter inference and are unaffected by the prior used, and then discuss the computed Jeffreys' priors. Finally, the posterior PDFs and numerical percentile values for S produced in this study are presented and compared with those in S20.

5.1 Likelihoods

The likelihoods derived using the profile likelihood method, the sampling-based integrated likelihood and the data doubling methods described in Sect. 3.3 agree very closely with each other (supplemental material Figures S1 to S3). They should also agree with the likelihoods shown in S20, as they are based on the same statistical models and data-variable assumptions. The integrated likelihoods derived in this study are shown by solid lines in Fig. 2; likelihoods from S20 are the same color but dotted. The overall Paleoclimate likelihood is that from combining evidence from the LGM and mPWP.

Fig. 2

Likelihoods for S based on S20's data-variable assumptions as derived in this study (solid lines) and, for comparison, those shown in S20 (dotted lines). a Likelihoods from evidence for the three paleoclimate periods. b Likelihoods from Process evidence and from combining Paleoclimate evidence for the LGM and mPWP; S20 did not use evidence for the PETM for their estimation of S. c Likelihoods from Historical evidence for both S and Shist. d Likelihoods from combined Process, Paleoclimate (LGM and mPWP combined), and Historical evidence, computed as the product of the likelihoods from the three lines of evidence. The likelihoods used in this study have been derived using the integrated likelihood method and normalized to a maximum of one. The S20 likelihoods were accurately digitized over the full S ranges of the relevant figures in S20

Figure 2a shows that mPWP paleoclimate evidence discriminates more strongly against very high S values than does LGM or, particularly, PETM evidence, despite its median value being the highest of the three. This is primarily because fractional uncertainty in forcing, and in those terms that effectively modify forcing, is lowest for mPWP evidence and highest for PETM evidence.

Figure 2b and c show that Paleoclimate evidence likelihoods downweight the possibility of very high S values most strongly, while Historical evidence does so very weakly. The latter primarily reflects the use of the unconstrained Bellouin et al. (2020) aerosol forcing distribution, which assigns substantial probability to strongly negative values.

S20's own likelihoods generally peak marginally earlier than those computed here, and thereafter decline faster. The difference is barely noticeable for the LGM, but rather larger for the mPWP and for the overall Paleoclimate (LGM and mPWP combined) evidence. The difference is major for the likelihoods from PETM and Historical evidence. This is particularly the case, in ratio terms, for Shist (cyan lines in Fig. 2c). The differences arise because S20 employ an invalid method for deriving likelihoods (supplemental material S2). The virtual identity of the present study's estimated likelihoods using three different methods (two for Historical evidence) provides further confirmation that the S20 likelihoods are incorrect, and quantifies their inaccuracy.

When likelihoods from all lines of evidence used in S20 are combined multiplicatively (Fig. 2d), the resulting likelihood drops far more sharply than those from any individual line of evidence. The absolute difference between the combined S20 likelihoods and those computed here is relatively small in this case. This is because the S20 likelihoods are reasonably accurate below the likelihood maxima, and the combined likelihood drops to a low level before the errors in the S20 likelihoods (which partially cancel) grow very significant. By \(S = 5{\text{ K}}\), the S20 combined likelihood is ~ 25% lower than that calculated here, but by that point the 95th probability percentile has been reached.
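Multiplicative combination on a common S grid, followed by scaling to a maximum of one as in Fig. 2, can be sketched as below. The function name and the three-point example likelihoods are hypothetical placeholders, not values from this study.

```python
import math

def combine_likelihoods(*likelihoods):
    """Pointwise product of likelihoods on a shared S grid, scaled to a peak of one."""
    combined = [math.prod(values) for values in zip(*likelihoods)]
    peak = max(combined)
    return [c / peak for c in combined]

# Placeholder likelihoods for three lines of evidence on a three-point grid
process = [0.2, 1.0, 0.5]
paleo = [0.5, 1.0, 0.1]
historical = [0.4, 1.0, 0.9]
combined = combine_likelihoods(process, paleo, historical)
```

Even in this small example the product falls away from its peak faster than any of its factors, which is the behavior described for the combined likelihood.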

The likelihood difference for Process evidence is small, but of the opposite sign to the other cases; S20's likelihood peaks marginally later than that computed here and declines more slowly after the peak. The Process likelihood can also be derived using a formula that accurately approximates the distribution of the ratio of two normally-distributed variables, here λ and F2×CO2 (Raftery and Schweder 1993; Lewis 2018). The likelihood per that formula almost exactly matches the likelihoods derived using this study's three methods, but not S20's likelihood. Since S20's Process likelihood computation is not subject to the defects identified in their other likelihood computations, that suggests some other problem may exist in S20's statistical computations.

5.2 Computed Jeffreys' priors

Figure 3 shows the computed and calibrated integrated likelihood based Jeffreys' priors for each of the three main lines of evidence, before and after transformation to λ-space (which aids comparing them), and so-transformed priors for individual Paleoclimate periods. Since these priors are very similar to those estimated using the revised assumptions, comments on them are deferred until the latter are discussed in Sect. 6.

Fig. 3

Jeffreys' priors based on S20's data-variable assumptions, computed using the integrated likelihoods. a For each main line of evidence and for the combined evidence from all three lines. Paleoclimate evidence from the LGM and mPWP is included but, as in S20, not that from the PETM. b As (a) but with the prior, although still plotted against S values, having been transformed into λ-space. c As for (b), but for evidence from individual paleoclimate periods. Transformation of the prior to λ-space is effected in the standard way, by multiplying the original, S-space, prior by the absolute Jacobian of the inverse transformation, being \(S^{2}/F_{2 \times \text{CO2}}\). The relatively small uncertainty in F2×CO2 is ignored when effecting this transformation, which accounts for the decline in the transformed Process evidence prior as S reduces to a low level. Note that the separate priors shown cannot validly be added in quadrature to obtain a Jeffreys' prior for the All combined case, or for Paleoclimate evidence using the priors in panel (c), because such summing would multiply count the influence of uncertain data-variables in common. The panel (a) and (b) plots start at 1.5 K because the priors become artefacted below that level.
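The transformation described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `F2X` is a placeholder fixed value for F2×CO2 (its uncertainty being ignored, as in the caption), `to_lambda_space` is a hypothetical helper name, and the two example priors are generic shapes rather than the computed Jeffreys' priors.

```python
import numpy as np

F2X = 4.0  # assumed fixed F2xCO2 (W m^-2); placeholder value, uncertainty ignored


def to_lambda_space(S, prior_S, F2x=F2X):
    """Density for lambda = F2x / S, evaluated at the S grid points.

    p_lambda = p_S * |dS/dlambda| = p_S * S**2 / F2x
    """
    return prior_S * S**2 / F2x


S = np.linspace(1.5, 20.0, 1851)

prior_uniform_in_lambda = F2X / S**2   # S-space form of a prior that is flat in lambda
prior_uniform_in_S = np.ones_like(S)   # a prior that is flat in S

flat = to_lambda_space(S, prior_uniform_in_lambda)  # constant after transformation
rising = to_lambda_space(S, prior_uniform_in_S)     # grows as S**2 after transformation
```

Plotting the transformed prior against S, as in panels (b) and (c), therefore shows directly how a prior departs from uniformity in λ.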

5.3 Posterior PDFs and percentiles for S

Figure 4a shows (solid lines) sampling-derived primary posterior PDFs for each main line of evidence, representing in each case the product of the estimated integrated likelihood and a Jeffreys' prior, normalized to unit probability over the 0–20 K S range used. Posterior PDFs from using a uniform-in-λ prior, as in S20, with the same likelihoods are also shown (dashed lines). For Process evidence, for which Jeffreys' prior is uniform in λ, the two PDFs coincide. For Historical and particularly Paleoclimate evidence, use of a uniform-in-λ prior biases the posterior PDF towards lower S values and, in the Paleoclimate case, excessively constrains high S values.
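A grid-based version of this calculation (likelihood times prior, normalized over 0–20 K, then read off at chosen percentiles) might look like the following sketch. The likelihood is a made-up shape, `posterior_percentiles` is a hypothetical helper, and `prior_slower` (proportional to S^-1.5) is merely a stand-in for a prior that declines less rapidly than the uniform-in-λ prior; it is not one of the Jeffreys' priors computed in this study.

```python
import numpy as np


def posterior_percentiles(S, likelihood, prior, probs=(0.05, 0.5, 0.95)):
    """Normalize likelihood * prior over the grid and return the S percentiles."""
    dens = likelihood * prior
    # Trapezoid-rule cumulative integral, starting from zero at the first grid point
    increments = 0.5 * (dens[1:] + dens[:-1]) * np.diff(S)
    cdf = np.concatenate(([0.0], np.cumsum(increments)))
    cdf /= cdf[-1]  # unit probability over the grid range
    return np.interp(probs, cdf, S)


S = np.linspace(0.01, 20.0, 2000)  # 0-20 K range (starting just above zero)
F2X = 4.0                          # placeholder F2xCO2 value (W m^-2)

# Invented likelihood: a normal estimate of lambda = F2X / S
likelihood = np.exp(-0.5 * ((F2X / S - 1.2) / 0.5) ** 2)

prior_uniform_lambda = S ** -2.0   # uniform-in-lambda prior expressed in S-space
prior_slower = S ** -1.5           # stand-in for a more slowly declining prior

pct_uniform_lambda = posterior_percentiles(S, likelihood, prior_uniform_lambda)
pct_slower = posterior_percentiles(S, likelihood, prior_slower)
```

With the same likelihood, every percentile in `pct_slower` lies above its counterpart in `pct_uniform_lambda`, mirroring the direction of the shift seen between the dashed and solid lines.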

Fig. 4

Posterior PDFs based on S20's data-variable assumptions. a PDFs for S from separate Process, Historical and Paleoclimate evidence, using computed Jeffreys' priors (solid lines) and uniform-in-λ priors (dashed lines). b Combined evidence PDFs for S derived in this study (solid lines) using both Jeffreys' prior and a uniform-in-λ prior and, for comparison, the Baseline PDF in S20 using a uniform-in-λ prior (dotted line), accurately digitized from the relevant figure in S20. In panels (a) and (b) this study's PDFs for S have been normalized to unit probability over 0–20 K. Normalization has a negligible effect on the comparison with the S20 Baseline PDF. c Unnormalized PDF for Shist, that from this study (solid cyan line) and the "non-Bayesian Shist PDF" from S20's Fig. 11(b), accurately digitized (dashed orange line). The dotted red line shows the S20 PDF scaled by a factor of 0.86. The black line shows the PDF implied by use of S20's Shist likelihood with their uniform-in-λ prior

The primary combined evidence posterior PDF (Fig. 4b, solid red line) represents the product of the estimated Process, Historical and Paleoclimate likelihoods, being the combined likelihood, and Jeffreys' prior for the combined evidence. The PDF using a uniform-in-λ prior is also shown (solid blue line). The difference between these is much smaller than in the case of separate Historical or Paleoclimate evidence, reflecting the combined likelihood (Fig. 2d) being much narrower. The dotted cyan line shows S20's Baseline, uniform-in-λ prior based, PDF. It is very close to the uniform-in-λ prior based PDF derived here, which follows from the closeness of its combined evidence likelihood to that derived here, notwithstanding the substantial differences in the Paleoclimate and, particularly, Historical likelihoods at high S.

Figure 4c shows (solid cyan line) this study's posterior PDF for Shist, derived by a sampling-based method, without normalization to unit probability over 0–20 K. The dashed orange line shows S20's corresponding "non-Bayesian Shist PDF", derived directly (by sampling) from their (19), the equivalent of (9) here. When S20's PDF is scaled by a factor of 0.86 (Footnote 3; dotted red line) it closely matches this study's sampling-based Shist PDF (which equates to a Bayesian posterior PDF derived using a noninformative prior).

Although all S20's main probabilistic estimates are based on Bayesian analysis with a uniform-in-λ prior, no such results are given for Shist. However, the solid black line in Fig. 4c shows a uniform-in-λ prior based PDF, using the accurate emulation of S20's Shist likelihood (supplemental material Figure S2.1(a)). This PDF is much better constrained at high Shist than the non-Bayesian sampling-derived PDFs, primarily reflecting S20's misestimation of the Shist likelihood.

Table 4 presents results in the form of medians and 66%, 90% and 95% uncertainty ranges for posterior PDFs for S and Shist on S20's data-variable assumptions, using this study's methods, with the comparative S20 results where available. It is evident from the high percentile S values that Paleoclimate evidence gives the strongest constraints on upper uncertainty bounds, with Historical evidence constraining them least. That is consistent with the relative shapes of the likelihood functions for the three lines of evidence.

Table 4 Primary results using S20's data-variable assumptions, sampling-derived integrated likelihoods and Jeffreys' priors. PDFs for S have been normalized to unit probability over 0–20 K except where stated. All values are in K and except for the medians (50th percentiles) are rounded to the nearest 0.05 K

Notwithstanding the difficulty the optimization-based profile likelihood method has in deriving a satisfactory data-space movement prior for S20's Historical evidence, the 'All combined' 5% to 95% percentile values from combining that method's results for each line of evidence are all within 0.05 K of those using the sampling-based integrated likelihood method and related Jeffreys' priors, as given in Table 4.

Compared to S20's Baseline combined evidence results, the Table 4 median estimate for S is approximately 0.13 K higher, and the 95% bound 0.35 K higher. The likely reasons for these differences being small are that (i) neither of S20's most seriously inaccurate likelihood estimates was used for its Baseline estimate, resulting in the combined likelihood that they used for that purpose deviating only modestly from that given by this study's methods until both have fallen to a moderate level, as well as being nearly identical to it up to the likelihood maximum; (ii) both S20's uniform-in-λ prior and this study's combined-evidence Jeffreys' prior fall sharply over the high likelihood (> 0.5 × its maximum) region, by about two-thirds in S20's case, which means that differences in the likelihood and (still declining) prior used beyond that region have a minor effect on S estimates; and (iii) over the high likelihood region the Jeffreys' prior only increases by ~ 25% relative to a uniform-in-λ prior, which difference is only sufficient to produce small upward shifts in the median and higher percentile S estimates.

PDFs for S used in Table 4 were normalized to unit probability over 0–20 K, except in one case. As discussed in Sect. 3, for combined evidence virtually all probability lies within the 0–20 K range over which computations are performed and over which total probability is normalized to unity. Likewise, almost no probability for S lies outside 0–20 K when combining two or more lines of evidence, or using Paleoclimate evidence alone, and under 1% does for Process evidence alone. However, when using Historical evidence alone 30% of samples produce S values that are above 20 K or negative due to ΔR > 0, implying an unstable climate system, and 15% of samples do so for Shist. The substantial proportion of sampled S and Shist values exceeding 20 K primarily reflects the significant probability assigned by S20's data-variable assumptions to highly negative aerosol ERF values: \(\Delta F_{\text{Hist}}^{\text{aerosol}} < -2{\text{ Wm}}^{-2}\) in 17% of samples. Unnormalized results for S and Shist from Historical evidence are therefore also given, without probability being restricted to any range of S values. This unrestricted basis, which correctly reflects the implications of the data-variable uncertainty distributions, is usual for sampling-based energy budget studies that derive Shist (Gregory et al. 2002; Otto et al. 2013; Lewis and Curry 2015, 2018).
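The difference between restricted and unrestricted results can be illustrated by a small sampling sketch. It assumes that (9) has the usual energy-budget form Shist = F2×CO2 ΔT/(ΔF − ΔN), which is consistent with the description of TCR as omitting the ΔN deduction. All the input distributions below are invented placeholders, not S20's or this study's data-variable assumptions. Treating samples with a non-positive denominator as lying beyond any finite S value is likewise an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Invented, purely illustrative data-variable distributions
F2x = rng.normal(4.0, 0.3, n)   # W m^-2
dT = rng.normal(1.0, 0.1, n)    # K
dF = rng.normal(1.8, 0.6, n)    # W m^-2
dN = rng.normal(0.6, 0.15, n)   # W m^-2

denom = dF - dN
S_hist = F2x * dT / denom       # assumed energy-budget form of (9)

# Share of samples falling outside the 0-20 K computational range
inside = (S_hist >= 0.0) & (S_hist <= 20.0)
frac_outside = 1.0 - inside.mean()

# Restricted basis: discard out-of-range samples (equivalent to renormalizing)
median_restricted = np.median(S_hist[inside])

# Unrestricted basis: keep every sample, placing unstable cases (denom <= 0) at +inf
unrestricted = np.where(denom > 0, S_hist, np.inf)
median_unrestricted = np.median(unrestricted)
```

Because the discarded samples all sit in the upper tail under this convention, `median_unrestricted` exceeds `median_restricted`, and the gap widens as `frac_outside` grows.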

S20 did not provide any estimate of the transient climate response (TCR), a shorter-term climate sensitivity measure; however, Table 4 does so. TCR is estimated as for Shist but omitting the deduction for ΔN in (9), a common method (Otto et al. 2013; Lewis and Curry 2015, 2018; Forster et al. 2021), so that:

$${\text{TCR}} = \frac{F_{2 \times \text{CO2}}}{\Delta F_{\text{Hist}} / \Delta T_{\text{Hist}}}$$

The resulting median TCR estimate of 2.26 K exceeds the AR6 likely range, and there is a 7% probability that TCR exceeds 20 K. Moreover, if S20's estimates of the historical pattern effect are accurate then over half of it is unforced (supplemental material S5.2.4) and will have depressed \(\Delta T_{{{\text{Hist}}}}\). Amending (15) to correct for that would increase the implied S20 TCR estimate.
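A sampling-based evaluation of the TCR equation above can be sketched as follows. The three input distributions are invented placeholders and do not reproduce the Table 4 values; the function name `tcr` and the convention of treating non-positive forcing samples as infinite TCR are assumptions made for the illustration.

```python
import numpy as np


def tcr(F2x, dF_hist, dT_hist):
    """TCR = F2x / (dF_hist / dT_hist), the energy-budget form given above."""
    return F2x / (dF_hist / dT_hist)


rng = np.random.default_rng(1)
n = 100_000

# Placeholder distributions (illustrative only, not S20's assumptions)
F2x = rng.normal(4.0, 0.3, n)   # W m^-2
dF = rng.normal(2.0, 0.5, n)    # W m^-2
dT = rng.normal(1.0, 0.1, n)    # K

samples = tcr(F2x, dF, dT)
# Assumption: a non-positive forcing sample carries no finite TCR, so place it at +inf
samples = np.where(dF > 0, samples, np.inf)

median, p95 = np.percentile(samples, [50, 95])
frac_above_20K = np.mean(samples > 20.0)
```

A wider or more skewed ΔF distribution, such as one giving substantial weight to strongly negative aerosol forcing, would raise `frac_above_20K` in the same way as described for the S20 assumptions.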

Table 5 presents equivalent results from posterior PDFs based on uniform-in-λ priors, with the comparative S20 results where available. Where Process evidence is not used, the entire posterior probability will in theory be located immediately above zero S, resulting in all S percentiles being almost zero (supplemental material S7). This study's Table 5 results reflect imposing a restriction to \(S \ge 0.01{\text{ K}}\), which avoids the uniform-in-λ prior producing non-negligible probability at very low S values.

Table 5 Results if using S20's uniform-in-λ prior as well as their data-variable assumptions, with PDFs normalized to unit probability over 0–20 K, and comparisons with similarly based results in S20's Table 10. All values are in K and except for the medians (50th percentiles) are rounded to the nearest 0.05 K

The S values are lower, particularly at high percentiles, than when using Jeffreys' priors, except for Process alone where the two types of prior and hence the S values are identical. This behavior reflects the fact that, while all the Jeffreys' priors decrease with S, they decline less rapidly than the uniform-in-λ prior. Equivalently, when transformed to λ-space, all the non-Process Jeffreys' priors increase with S, whereas the uniform-in-λ prior does not. The S values at higher percentiles derived here using a uniform-in-λ prior differ from those per S20's Baseline results, as would be expected given the identified differences between their likelihoods. However, the differences in S values are small, only reaching 0.1 K beyond the median. That is consistent with the differences in likelihoods only becoming significant beyond the medians; since the uniform-in-λ prior varies as \(S^{-2}\), the effect on the posterior PDFs of sizeable likelihood differences is muted at higher S values.

Comparing the present study's results using the two types of prior, the differences between their 95th percentile S values are quite significant when Historical and/or Paleoclimate evidence are used, either alone (differences of ~ 1 K) or in combination (a difference of 0.6 K, or 0.9 K using S20's results). When Process evidence is used in combination with Historical and/or Paleoclimate evidence the differences are smaller. When combining all three lines of evidence, the 95% bound is only 0.2 K lower when using a uniform-in-λ prior, and the difference in medians is under 0.1 K. That is mainly due to the combined evidence being much more informative about S, and constraining it more tightly, than any of the separate lines of evidence, and partly due to contributions to the combined Jeffreys' prior from Process and Historical evidence respectively being uniform-in-λ, and increasing only gently relative to a uniform-in-λ prior.

S20 also present various posterior PDFs based on a uniform-in-S prior. That prior is unsuitable for S estimation unless ΔT uncertainty dominates, which it almost never does, and will often result in uncertainty ranges that are far from being true confidence intervals. For Historical evidence, use of a uniform-in-S prior has been shown to be unsuitable and to result in seriously biased estimation (Annan and Hargreaves 2011; Lewis 2014). For Process evidence, the likelihood for λ is normal in λ space, a case for which there is general agreement that use of a uniform prior is appropriate. When transformed to S space, the resulting appropriate prior for S for Process evidence will (applying the Jacobian factor) be proportional to \(S^{-2}\), very far from uniform (Footnote 4). Most damningly, use of a uniform-in-S prior would result, even for 'All combined' evidence, in the S values for all percentiles being an unbounded function of the upper bound of the S range over which normalization to unit probability occurs, increasing without limit as that bound tends to infinity (Footnote 5).
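The dependence of percentiles on the normalization range under a uniform-in-S prior can be demonstrated numerically. The sketch below uses an invented likelihood that is normal in λ = F2×CO2/S, with placeholder values for `lam_mean`, `lam_sd` and `F2x`. Such a likelihood tends to a small but non-zero constant as S grows, so with a flat prior in S the posterior mass in the tail keeps growing with the upper bound.

```python
import numpy as np


def percentile_uniform_in_S(upper, prob, lam_mean=1.3, lam_sd=0.4, F2x=4.0, step=0.01):
    """S value at cumulative probability `prob` when a uniform-in-S prior is
    normalized over (0, upper]. Likelihood parameters are illustrative only."""
    S = np.arange(step, upper + step / 2, step)
    likelihood = np.exp(-0.5 * ((F2x / S - lam_mean) / lam_sd) ** 2)
    cdf = np.cumsum(likelihood)  # flat prior: posterior is proportional to likelihood
    cdf /= cdf[-1]
    return S[np.searchsorted(cdf, prob)]


upper_bounds = (20.0, 200.0, 2000.0)
medians = [percentile_uniform_in_S(u, 0.5) for u in upper_bounds]
p95s = [percentile_uniform_in_S(u, 0.95) for u in upper_bounds]
```

In this toy case both `medians` and `p95s` increase each time the upper bound is raised, whereas the corresponding percentiles under a uniform-in-λ prior would be essentially unaffected.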

6 Results using the revised data-variable assumptions

Results are now presented using the data-variable assumptions as revised per Sect. 4, including rectification of the omission of the necessary scaling of F2×CO2 when using Process and Historical evidence ("the revised assumptions"), employing the integrated likelihood method with computed Jeffreys' priors.

6.1 Likelihoods

Figure 5 shows likelihoods derived using the revised assumptions (solid lines), with similarly calculated likelihoods using S20's data-variable assumptions in the same colors but dotted. The revised assumptions produce likelihoods that peak at lower S values, and are lower at high S values, than do S20's assumptions.

Fig. 5

Likelihoods for S based on this study's revised data-variable assumptions (solid lines) and, for comparison, those derived in this study using S20's data-variable assumptions (same color, dotted, lines). Likelihoods, on both sets of data-variable assumptions, have been derived using the integrated likelihood method and normalized to a maximum of one. a Likelihoods from evidence for the three Paleoclimate periods: LGM, mPWP and PETM. b Likelihoods from Process evidence and from combining Paleoclimate evidence from the LGM and either the mPWP or the PETM. S20 did not combine LGM and PETM evidence so no comparison dotted line is shown for this case. c Likelihoods from Historical evidence for both S and Shist. d Likelihoods from combining Process and Historical evidence with Paleoclimate evidence from the LGM combined with that from either the mPWP or the PETM

The likelihoods for both separate and combined evidence derived using the integrated likelihood method, the doubled data method, and the profile likelihood method, are all almost identical when employing the revised assumptions (supplemental material Figures S4 to S7). This includes, unlike on S20's data-variable assumptions, likelihoods from the doubled data method applied to Historical evidence, confirming that it is the difficulty of incorporating the unusual historical aerosol forcing distribution employed by S20 that prevents the doubled data method working satisfactorily in that case.

6.2 Computed Jeffreys' priors

Figure 6a shows the computed and calibrated Jeffreys' priors for S for each main line of evidence, with Paleoclimate represented by LGM evidence combined with either mPWP or PETM evidence, and for all main lines of evidence combined, from S = 1 K upwards. Below that level, they start to become artefacted—as they do below ~ 1.5 K when using S20's data-variable assumptions—due to the paucity of samples for Process and Historical evidence. The likelihood and the probability are both almost zero below those levels, so the effect on inference for S is negligible.

Fig. 6

Computed Jeffreys' priors based on this study's revised data-variable assumptions. a Prior in S-space for each main line of evidence and for the combined evidence from all three types. Paleoclimate evidence from the LGM and PETM combined is included as well as that from the LGM and mPWP combined. b As for (a) but with the S priors transformed into priors in λ-space (that is, priors for λ, but plotted at the corresponding S values). c Priors transformed into λ-space for evidence from individual paleoclimate periods. The transformations from S-space to λ-space are effected by multiplying them by \(S^{2}/F_{2 \times \text{CO2}}\), for simplicity ignoring the minor uncertainty in F2×CO2 and hence not being exact.

In all cases the priors decline rapidly with increasing S, reflecting declining informativeness of the evidence about S, making it difficult to compare their behavior at high S levels. Figure 6b shows these priors transformed into λ-space (that is, converted into priors for λ) but plotted against the corresponding values of S. Figure 6c shows the transformed priors for LGM, mPWP and PETM Paleoclimate evidence separately. Strictly, because for computational reasons the relatively small uncertainty in F2×CO2 is ignored when effecting the transformation, it is into priors for \(\overline{F_{2 \times \text{CO2}}}/S\), whereas \(\lambda = F_{2 \times \text{CO2}}/S\). The transformed priors show how informative the evidence is about λ at each S value, which varies comparatively little. Their shapes are all similar when using the revised data-variable assumptions to those when using S20's assumptions (Fig. 3b, c), except that the revised Historical evidence prior flattens out much earlier with rising S. For Process evidence, the prior should in both cases be uniform (constant) when transformed to λ-space (the same prior as used for all lines of evidence in S20), since the λ likelihood is for the sum of normally distributed variables. The drop in the transformed priors that occurs at low S is due to uncertainty in F2×CO2 not being removed by the transformation. At low S values, which arise where ΔR is large and ΔT is small, fractional uncertainty in ΔT translates to significant absolute uncertainty in λ, hence Paleoclimate evidence (which has large ΔT uncertainty) is relatively uninformative.
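The statement that the Process evidence prior should be uniform in λ-space can be checked in a deliberately simplified setting. The sketch below assumes a single observation that is normally distributed about F2×CO2/S with known standard deviation, and a fixed F2×CO2. This is a toy model rather than the sampling-based computation used in this study, and `jeffreys_prior_S`, `lam_sd` and `F2x` are hypothetical names and values. Jeffreys' prior is taken as the square root of Fisher information, here obtained by finite differences.

```python
import numpy as np


def jeffreys_prior_S(S, lam_sd=0.4, F2x=4.0, h=1e-4):
    """Jeffreys' prior for S in the toy model y ~ Normal(F2x / S, lam_sd).

    Fisher information is (d mean / dS)**2 / lam_sd**2, so the prior is
    |d mean / dS| / lam_sd. The derivative is taken by central differences.
    """
    dmean_dS = (F2x / (S + h) - F2x / (S - h)) / (2.0 * h)
    return np.abs(dmean_dS) / lam_sd


F2X = 4.0      # placeholder fixed F2xCO2 (W m^-2)
LAM_SD = 0.4   # placeholder standard deviation of the lambda estimate

S = np.linspace(1.0, 10.0, 901)
prior_S = jeffreys_prior_S(S, lam_sd=LAM_SD, F2x=F2X)

# Transform to lambda-space by the Jacobian factor S**2 / F2x
prior_lambda = prior_S * S**2 / F2X
```

In this toy model `prior_S` falls as S^-2 while `prior_lambda` is constant at `1 / LAM_SD`, consistent with a uniform prior in λ-space when the likelihood is normal in λ.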

At low S values, Process evidence is most informative, and dominates the combined evidence prior. As S increases, mPWP and hence Paleoclimate evidence becomes increasingly more informative, due mainly to fractional uncertainty in ΔR being low for the mPWP, and at high S dominates the combined evidence prior. Historical evidence is less informative than Process evidence at all S values, although only modestly so at high S when using S20's data-variable assumptions. The priors generally have slightly lower values when using the revised rather than S20's original data-variable assumptions. The reasons for this are discussed in the supplemental material (S6).

In all cases the priors derived using the doubled data and profile likelihood data-space movement methods are close to those derived using the primary integrated likelihood method over S of 1–10 K, outside which virtually zero probability lies (supplemental material Figures S8, S9). In some Paleoclimate cases there is minor divergence at \(S > 4{\text{ K}}\) when using the data-space movement method, but for combined Paleoclimate evidence little probability exists there. Although all the priors have been calibrated to equal (the square root of) Fisher information, those derived using the profile likelihood data-space movement method are almost identical with and without calibration, confirming the soundness of that method. These comments apply also (at S > 1.5 K) to the priors derived when using S20's data-variable assumptions (Fig. 3), save that only the integrated likelihood method produces a usable prior for S20's Historical evidence (see Sects. 3.3 and 3.5), and hence for the combined evidence.

6.3 Posterior PDFs and percentiles for S

The estimated posterior PDFs based on the revised assumptions (Fig. 7, solid lines) all peak at lower S values, and are better constrained beyond their peaks, than those based on S20's assumptions (shown dotted).

Fig. 7

Posterior PDFs for S based on the revised data-variable assumptions (solid lines) and, for comparison, on S20's original assumptions (dotted lines), in both cases using the sampling-based integrated likelihood method. The PDFs for individual separate lines of evidence are obtained directly by sampling, but equate to the product of the estimated integrated likelihood and derived Jeffreys' prior. Save in panel (a), comparative PDFs involving the PETM are not given, since S20 did not provide any such PDFs. Except in panel (d), the PDFs have all been normalized to unit probability over 0–20 K. a PDFs for S from Paleoclimate LGM, mPWP and PETM evidence separately. b PDFs for S from separate Process, Historical and Paleoclimate (LGM combined with mPWP or PETM) evidence. c Combined evidence PDFs derived using, as Paleoclimate evidence, that from the LGM combined with that from either the mPWP (black lines) or the PETM (purple line). d Unnormalized PDFs for S and Shist. These account for probability outside 0–20 K, which is substantial when using S20's original assumptions.

Figure 7a shows the PDFs for separate lines of Paleoclimate evidence, derived using the revised data-variable assumptions and, for comparison, using S20's assumptions. Figure 7b shows PDFs from combining Paleoclimate LGM evidence with that from the mPWP or PETM, and for separate Process and Historical evidence. Figure 7c shows the final PDFs after combining Process, Historical and Paleoclimate evidence. The PDFs incorporating PETM evidence are almost identical to those incorporating mPWP evidence; they peak marginally earlier, and at a slightly higher level.

Figure 7d shows unnormalized posterior PDFs for S and Shist. The PDFs for S are the same as in panel (b) except for the lack of normalization. When using the revised assumptions, there is little difference between the normalized and unnormalized PDFs, as only 2.0% of the probability lies beyond S = 20 K, down from 30% when using S20's assumptions.

Posterior PDFs computed using the doubled data method, or the profile likelihood method and its data-space movement prior, are visually identical to those using the primary integrated likelihood method shown in Fig. 7, save for a marginal difference in the peak PDF level for separate LGM and PETM evidence (supplemental material Figures S10 to S13). For separate evidence, the sampling based integrated likelihood and doubled data method both derive their PDFs directly by sampling, so they are bound to be identical in these cases. However, PDFs computed using the non-sampling profile likelihood method and its very differently derived data-space movement prior are completely independent of those from the integrated likelihood and doubled data methods.

Table 6 presents this study's primary results, in the form of medians and 66%, 90% and 95% uncertainty ranges from posterior PDFs, on the revised data-variable assumptions, derived using the integrated likelihood method and Jeffreys' priors. Results based on combining different pairs of lines of evidence, as well as all of them, are given. PDFs for S used in Table 6 were normalized to unit probability over 0–20 K; almost no probability (≤ 0.1%) lies outside that range.

Table 6 Primary results using the revised data-variable assumptions and Jeffreys' priors. All values are in K and except for the medians (50th percentiles) are rounded to the nearest 0.05 K

As when using S20's assumptions, the S values at high percentiles confirm that Historical evidence is least important for constraining the upper uncertainty bounds, but Process evidence now constrains them almost as strongly as Paleoclimate evidence.

The limited revisions made to S20's assumptions reduce by one-third, from 3.23 to 2.16 K, the median estimate of S given by the combined evidence, using Jeffreys' priors and warm Paleoclimate evidence from the mPWP in both cases. The 83% and 95% uncertainty bounds reduce respectively from 4.1 to 2.7 K and from 5.05 to 3.2 K. If warm Paleoclimate evidence is instead taken from the PETM when the revised assumptions are used, the PDF percentiles from the median upwards reduce further, by ~ 0.05 K.

All the profile likelihood method (using the associated data-space movement prior) derived percentile values are within ± 0.02 K of those in Table 6, with medians identical, when combining evidence in two stages (supplemental material S4). Moreover, when using the revised assumptions, the profile likelihood optimization process can simultaneously combine Process, Historical, LGM, and either mPWP or PETM evidence, and hence produce, in a single step, a posterior PDF for all lines of evidence combined. The 1% to 99% percentile points of those two PDFs match those from the sampling-based integrated likelihood method within ± 0.02 K.

The median S values when omitting evidence from each of the three main lines in turn, with Paleoclimate evidence combining LGM evidence with that from either the mPWP or PETM, are all within 0.1 K of the average of the two 'All combined' values.

It is useful to establish how sensitive the combined-evidence results are to the various categories of revisions to data-variable assumptions. A few of the revisions might be regarded as more questionable since they are based wholly or partly on reevaluation of existing evidence. That category includes LGM cooling and non-CO2 forcing, and also the revision to Historical aerosol forcing, which although largely based on newer evidence concerns a very poorly constrained forcing.

Table 7 divides the revisions made into six categories, starting with those that appear least debatable, being that to F2×CO2 together with appropriate adjustments (omitted in S20) to the calculation of the CO2 ERF estimates used, or which arise from alignment of the CO2 concentrations used for estimating the ECS to S ratio. Without these changes, the bases of S estimation are biased and are not consistent between lines of evidence.

Table 7 Sensitivity of the primary combined evidence results (using LGM + mPWP for Paleo) to the various revisions of data-variable assumptions, using the integrated likelihood method and Jeffreys' priors. All values are in K and except for the medians (50th percentiles) are rounded to the nearest 0.05 K

The next set of changes relates primarily to the substitution of the AR6 Historical non-aerosol ERF time series estimates, and of the AR6 zero-mean estimate of the difference between Historical GMST and GMAT warming, for the estimates used by S20. Together, these two categories of revision reduce the median S estimate by almost 0.6 K, and the 95% uncertainty bound by almost 1 K.

The third category comprises all other revisions that are based entirely on newer evidence or later data, other than cloud feedback, the least well constrained feedback. The main changes involved are to scaling factors used for estimation of S from mPWP evidence, with those factors being derived from a more recent model intercomparison project than previously, and to the estimate of the Historical pattern effect. These revisions reduce the median S estimate by a further 0.25 K. Next, cloud feedback is revised, which reduces the median S estimate by almost 0.15 K more, to 2.25 K, with the 95% uncertainty bound now down to 3.4 K. Including the penultimate category, of revisions to LGM cooling and non-CO2 forcing, brings the median S estimate down by another 0.1 K, to 2.15 K.

Finally, and somewhat counterintuitively, revising the Historical aerosol ERF distribution, with the resulting S estimation basis now being the same as that in Table 6, does not further reduce the median S estimate. That estimate remains unchanged within computational uncertainty, although the 95% (and 97.5%) uncertainty bounds reduce by a further 0.05 K. Investigation suggests the principal cause is likely that, although the aerosol ERF distribution used by S20 has a median that (after scaling between periods using the AR6 time series) is some 0.2 Wm−2 more negative than for the adopted revised distribution, its mode is actually the less negative of the two. The shapes of the two distributions are such that the Historical S likelihoods resulting from their use, with other revisions to Historical data-variables having been made, are very similar below approximately 2 K. While highly negative aerosol ERF values, which correspond to high S values, are much more probable when using the S20 aerosol distribution, resulting in a larger Historical S likelihood, those high S values are almost ruled out by their low likelihood from Process and Paleoclimate evidence.

Table 8 presents sampling-derived medians and 66%, 90% and 95% uncertainty range bounds for S, for the three main lines of evidence and for each Paleoclimate period separately, using the revised and (in italics) the original S20 assumptions. S20 did not provide any similar estimates. Posterior PDFs for S were normalized to unit probability over 0–20 K except where stated; results are also given without excluding probability outside that range, where it is non-negligible.

Table 8 Percentile results for individual paleoclimate periods, and for each main line of evidence, separately, including for Shist and TCR, using the revised assumptions and, in italics in the following row, using the same methods but S20's assumptions. All values are in K and except for the medians (50th percentiles) are rounded to the nearest 0.05 K

Consistency between median S values from different lines of evidence is much improved when based on the revised assumptions; they span (without restriction to 0−20 K) only 1.9−2.4 K, compared to 2.4−6.1 K using S20's assumptions.

The median TCR is 1.54 K, within the AR6 likely range. The median TCR and Shist estimates are respectively almost one-third lower, and almost 40% lower, than when using S20's assumptions.

7 Discussion

This study first identifies statistical problems in S20. Using a Subjective Bayesian statistical method involving an investigator-selected prior distribution, as S20 does, may produce unrealistic climate sensitivity estimation when used to combine differing types of evidence (Lewis 2018), even assuming that the data likelihood functions are correct. In this case, I found that the method S20 used for estimating likelihoods for all but Process evidence was in fact unsound, and that it underestimated likelihood at high S levels, substantially so in some cases. I also found that S20 used an uncertainty estimate for PETM CO2 forcing that was a factor of ten too low, due to an apparent coding error, further biasing their likelihood estimate (although not affecting their main results).

This study then develops an Objective Bayesian approach to combining differing climate sensitivity evidence that, unlike the method used in Lewis and Grünwald (2018), is not restricted to dealing with a particular simple statistical model. The approach involves computationally deriving Jeffreys' prior distributions that are designed to maximize the influence of the data on the results and to produce probabilistic estimates that are as close as possible to being confidence intervals, and thus are well calibrated. Three different inferential methods employed for this purpose each provide nearly identical estimated likelihoods and Jeffreys' priors, and final results. This result is very supportive of the validity of the methods used and of the results they produce.

The robustness of S20's results to the use of properly calibrated statistical methods and validly calculated likelihood estimates is then examined, using the Objective Bayesian methods developed in this study. It is shown that while S20's choice of prior and its likelihood misestimation lead to over-constraining of high S levels, based on S20's data-variable assumptions the downwards bias in S20's Baseline combined evidence results is modest: the median S estimate is approximately 0.13 K low, and the 95% uncertainty bound 0.35 K low. However, the bias in S20's No Process results is over twice as large.

The other main contribution of this study is to assess the impact of revising various input data-variable distributions used by S20, by:

  (i) adjusting the F2×CO2 value used for inferring S from Process and Historical evidence to reflect the effect of climate feedback changing over GCM abrupt4xCO2 simulations, as should undoubtedly be done;

  (ii) allowing for the CO2 concentration-ERF relationship being slightly non-logarithmic, and estimating the ECS to S ratio in a way that is unaffected by that relationship;

  (iii) changing some of S20's other data-variable estimates to reflect more recent information; and

  (iv) using arguably better justified (albeit not based purely on more recent information), alternative estimates for a few other data-variables.

I find that doing so results in substantially lower and better constrained estimates for S. The median S estimate when combining all lines of evidence, using the Objective Bayesian method and the LGM and mPWP for Paleoclimate evidence, reduces from 3.23 to 2.16 K.

All the revised data-variable estimates are not only defensible but also, in my view, better justified than S20's original estimates given the evidence now available. Moreover, omitting the only revisions that depend, to a greater or lesser extent, on re-evaluating existing evidence changes the combined evidence results only very modestly, and omitting just the revision of the Historical aerosol forcing has almost no effect on them.

It therefore currently remains quite plausible that S is below 2 K. The truncation in S20's results of the lower bound for S does not appear justified given the range of data-variable estimates supported by relevant, mainly more recent, studies. According to the revised data-variable assumptions 'All combined: Paleo LGM + mPWP' results, there is a 36% probability of S being under 2 K, considerably greater than the 26% probability of S exceeding 2.5 K. These results also imply that it is extremely unlikely that S is below 1.5 K, and extremely unlikely that S is above 3.2 K.

The revised data-variable median Historical evidence estimates of Shist and TCR are somewhat higher than the comparable estimates in Lewis and Curry (2018), of 1.66 K and 1.33 K respectively. The excess is mainly due to a stronger aerosol ERF change, even after revising S20's assumptions. Further revising S20's median aerosol ERF to match the change per the AR5 time-series, extended post-2011 using AR6's annual changes, would reduce the Table 8 median Shist and TCR to 1.82 K and 1.40 K respectively. Changing the base period to 1869–1882 to match Lewis and Curry (2018), thereby avoiding the poorly observed 1861–1868 period, would further reduce those estimates, to 1.79 K and 1.37 K. The methane shortwave ERF adjustment in AR6, together with its greater estimated change in radiative imbalance, can account for the small remaining differences.