1 Introduction

Every practising scientist knows that their everyday work is made up of choices. Sherwood et al, in their recent assessment of Equilibrium Climate Sensitivity (ECS), refer to methodological choices they were faced with as "unavoidably subjective" (Sherwood et al 2020, 9). The existence of choices that require some measure of subjective judgement in science has, in turn, long been acknowledged by philosophers of science. A crucial example of such subjective factors is values, i.e. things we find desirable or worthy of pursuit. For example, the choice of one theory over competing theories is not only mediated by evidence, but also by preference for values such as simplicity, fruitfulness, scope, or predictive accuracy (Kuhn 1977; Sober 2015; Schindler 2018) that are integral to science and often called epistemic values. Other values, correspondingly called non-epistemic values and including moral, political, or social values (Longino 1990, 1997; Douglas 2009; Elliott 2017; Brown 2020)Footnote 1, can also influence science, for example through background assumptions (Anderson 2004), the choice of research topics, questions, and methods (Elliott 2017), or when choosing how much evidence to demand before drawing conclusions (Douglas 2009). This implies that science is not free of values, although scientists (e.g. Steel et al., 2017), policymakers, and the general public may still assume otherwise. The latest Working Group I report by the Intergovernmental Panel on Climate Change (IPCC) has acknowledged the involvement of values in the production of climate change information, e.g. for decision-making (IPCC 2021, Chs. 1, 10). The relevance of values in less applied climate science, however, has yet to be discussed within the climate-scientific community (Pulkkinen et al 2022a).

Many influences of value-judgements are not very controversial. They have justified and positive roles to play in science (Elliott 2017), and the emerging consensus among philosophers is that striving for value-free science is both unrealistic and undesirable. A role for (both epistemic and) non-epistemic values is furthermore compatible with contemporary notions of scientific objectivity (e.g. Tsou et al. 2015; Longino 1990; Koskinen 2020; for a proposal on the objectivity of the IPCC specifically see Jebeile 2020) and hence does not undermine the reliability of science for societal purposes (Wilholt 2013). Nonetheless, the influence of values can be highly problematic if value-judgements direct research towards a predetermined conclusion (Anderson 2004) or are pathways of ideological influence on scientific results (Elliott 2017). Value influences can also occur without scientists being aware of them; some value-judgements might for example go unnoticed because they are uncontroversial and conventional (e.g. Schroeder 2019). Current research efforts in philosophy are therefore directed towards drawing a distinction between ‘legitimate’ and ‘illegitimate’ roles of values (Douglas 2009; Elliott and McKaughan 2014; Intemann 2015; Schroeder 2019; Lusk 2020). Suggested frameworks include: that value influence is legitimate when it serves the legitimate aims of research (Elliott 2017; Elliott and McKaughan 2014; Intemann 2015); that non-epistemic values can legitimately play a role in evaluating the risk of error (i.e. balancing the risk of false positives and false negatives) (Douglas 2009); that scientific research needs to be democratic and incorporate the input of stakeholders to be legitimate (Douglas 2005; Intemann 2015; Schroeder 2019; Brown 2020); or that value-judgements guiding science are legitimate as long as they are made transparent (Elliott 2017, 2020). These normative proposals need to be applicable if they are to be implemented in climate science (e.g. Elliott 2017; Laplane et al 2019).

To address the need for value-management in climate science identified in the philosophical literature, the crucial first step, timely now for foundational climate science, is for scientists to become aware of and acknowledge value-judgements in their work (Pulkkinen et al 2022a). To that effect, we provide a detailed analysis of choices and value-judgements in multi-model based assessments in climate science, using ECS as a case study. ECS — the “equilibrium (steady state) change in the surface temperature [of Earth] following a doubling of the atmospheric carbon dioxide [(CO2)] concentration from pre-industrial conditions” (IPCC 2021, glossary) — is an “[almost] iconic” number (Knutti et al. 2017, 727; Mauritsen and Roeckner 2020, 9) that is taken to characterise various aspects of anthropogenic climate change and is as such important for climate policy (Kaya et al. 2016; IPCC 2021). Research efforts have hence focused on ECS assessments since the beginnings of climate science, making them benchmarks of the field. Only in the latest assessments has this meant a turn away from relying directly on model output (Sherwood et al 2020; Forster et al 2021). We will show that the earlier assessments, not just that by Sherwood et al (2020), involve plenty of subjective choices, which make room for value-judgementsFootnote 2. The same is the case in contemporary assessments of other variables or phenomena, which continue to be based on multi-model ensembles (IPCC 2021, Chs. 4, 10).

In the remainder of this paper, we provide the historical context on climate modelling and ECS uncertainty (Section 2); describe the methodology used to analyse choices and value-judgements in model-based ECS assessments (Section 3); illuminate typical choices and value-judgements in each assessment step from the choice of research question to the publishing and communication of the findings (Section 4); draw conclusions across the steps with regard to the types of relevant values, discuss whether sensitivity studies can avoid values, and suggest requirements for applicable norms and ideals regarding value-management (Section 5); discuss the steps of the latest assessments that are based on multiple lines of evidence and how those assessments differ in terms of choices and values from model-based assessments, highlighting aspects that remain applicable there and identifying which others would need to be the focus of further in-depth study (Section 6); and provide conclusions (Section 7).

2 Context: historical evolution of ECS estimates, and their uncertainty ranges

The history of climate sensitivity begins with Arrhenius’s (1896) estimate (for a summary of all historical contributions, see, e.g., Forster et al 2021). Since Charney et al (1979), a series of assessment results provide snapshots of the complex historical evolution of the knowledge regarding climate sensitivity and its uncertainty range (Fig. 1). From Charney et al (1979) up to and including the IPCC’s Fifth Assessment Report (AR5; Collins et al. 2013), the assessments were directly based on climate models, meaning that the history of ECS estimates is closely linked to that of climate modelling. Only very recently, with Forster et al (2021) following Sherwood et al (2020) and a turn away from model-based assessment to multiple other lines of evidence, has the ECS uncertainty range narrowed (Fig. 1; see also Voosen 2020; Meehl et al 2020).

Climate sensitivity is an emergent property of global climate models or Earth system models (short: models), so that model ECS can be evaluated by prescribing an idealised CO2 perturbation and running a simulation to approximate equilibrium (e.g. Charney et al 1979; Rugenstein et al 2020). The resulting sensitivity measure is here referred to as EqCS. This approach is practical if only simple ocean representations such as slab-ocean models are used, as was the case for models analysed in early assessments (Charney et al 1979; Mitchell et al. 1990), but is less feasible for models with dynamical ocean components due to computing constraints. The methods used to estimate ECS have therefore evolved along with the models themselves (see also Meehl et al 2020). In this ‘model evolution’, generations of modellers from various disciplines have replaced, changed, and added representations of more and more climate system components and physical processes (Randall et al 2019), but it is common for current-generation models to still carry components that are largely unchanged since their early days. Since the introduction of the ECS concept, further acknowledgement of time scale differences among Earth system components (e.g. ice sheets, biogeochemical cycles, atmospheric feedbacks) and improved understanding of the Earth system (e.g. state dependencies and pattern effects) have also revealed the need to define ECS more precisely (Forster et al 2021). The sensitivity measure that is derived by applying the current standard method (Gregory et al 2004; Eyring et al 2016; Forster et al 2016), sometimes referred to as effective climate sensitivity (EffCS) (Zelinka et al 2020), is still largely assigned the same general purpose as the original ECS was and may also be referred to as “ECS calculated by the Gregory method” (Meehl et al 2020, 3).

Fig. 1

From Charney to IPCC AR6: Historical evolution of major ECS estimates and their communication. Shown are the assessment result, i.e. the best estimate for real-world ECS (purple crosses) and its uncertainty range (whiskers), and the ECS values directly derived from climate models (black dots) and their unweighted multi-model mean (MMM; grey crosses) from the respectively latest model ensemble available at the time the assessment was made. For the assessment results, where given, the best estimate (not discussed in TAR, explicitly not determined in AR5), the likely range (red; from FAR on referred to as likely which from TAR on is specified as 33–66%); the very likely range (orange; 10–90%); the extremely likely range (yellow; 5–95%); and/or the virtually certain range (blue; 1–99%) are shown. In the Charney report, the uncertainty range (referred to as “we believe [...] that [...] [ECS] will be in [this] range”) (Charney et al 1979, 16) is composed of the model-derived probable bounds (pink) and additional, process-informed, uncertainty (light pink). In AR4, the possibility of values higher than the likely range is emphasised (turquoise). Sherwood et al (2020, 1) provide a second set of ranges (dashed lines) derived from “tests of robustness to difficult-to-quantify uncertainties and different priors”. The x axis labels indicate where effective climate sensitivity (EffCS) is introduced, which is one of the changes over time in the types of models, experiments, and methodologies employed (Section 2). Data from Charney et al (1979), Flynn and Mauritsen (2020), Meehl et al (2020), Sherwood et al (2020) and IPCC reports up to AR6, for details see SI

The assessed ECS uncertainty range from Charney et al (1979) to Collins et al. (2013) was directly based on ECS estimates from the respectively latest generation of models (black dots in Fig. 1). The assessed range stayed remarkably constant, but the underlying scientific understanding did improve (Collins et al. 2013) and the communication of uncertainty evolved: The Charney report gave a most likely value of near 3°C along with “rough estimates of the probable bounds” giving 1.5°C to 4.5°C based on “at best informed guesses” (Charney et al 1979, 16), while IPCC AR5 did not give a best estimate and stated that “ECS is likely in the range 1.5°C to 4.5°C with high confidence, extremely unlikely less than 1°C (high confidence) and very unlikely greater than 6°C (medium confidence)” (Collins et al. 2013, 1033, following IPCC guidelines introducing the separation between ‘confidence in the validity of a finding’ and standardised ‘quantified measures of uncertainty’, Mastrandrea et al 2010). Such historical examples emphasise changes over time in the choices made regarding the communication of results and their uncertainty. Given the emergence of new methodologies, tools, and evidence, as well as changes in the societal context over time, we cannot assume that those choices were made under the same premises in the different assessments. This complicates the interpretation of the choices from the perspective of values in science. Therefore, instead of comparing choices between assessments made at different times in history, we next analyse choices made during any one assessment. We examine the complete assessment process, since choices and value-judgements made anywhere within the process may impact the final assessment result.

3 Methodology: identifying relevant value-judgements

We examine the process of assessing climate sensitivity by distinguishing between choices, on the one hand, and value-judgements (e.g. ‘I value a simple model (perhaps more than a complex one)’) and values (simplicity, complexity, etc.)Footnote 3 that are relevant to the choice, on the other hand. This terminology is used throughout the remainder of the paper. The relevance of a value or value-judgement to a choice can mean that the choice is or may be guided fully or partially by the value-judgement. It can also mean that the options themselves can carry different value-implications.Footnote 4 This coarse notion is established in the literature and is here sufficient for the aims of the study.

The discussion is structured by dividing the assessment process into five steps (Fig. 2). These steps are tied to the procedure of a typical assessment and build on each other both in a time dimension and a ‘knowledge-building’ dimension. For each step of a model-based assessment (roman numerals and names as in Fig. 2a and as the subheaders in Section 4), one or more specific examples of choices made in the scientific literature are given. For each example (labelled using capital letters A–M, in the text and in Tables 1, 2, 3, 4 and 5), we give a description of the choice along with the referenced literature; state who has agency in making the decision and, where applicable, what strategies scientists may apply to investigate this influence; state how this choice/decision might impact the assessment result; and give examples of which epistemic values and non-epistemic value-judgements may be relevant to the choice. Note that some of the values and value-judgements listed are mainly pragmatic considerations, and as such controversial in the literature with regard to their classification as epistemic or non-epistemic values. Some readers might therefore contest the positioning of these considerations in Tables 1, 2, 3, 4 and 5, which is however not crucial for the conclusions of the paper. For a discussion of the corresponding steps for assessments based on other lines of evidence, see Section 6.

Fig. 2

ECS assessment process. Steps of (a) model-based ECS assessments and (b) assessments based on multiple lines of evidence rather than on direct model output. The steps build on each other as indicated by the step-arrows. The placement on the x axis indicates the relative importance of epistemic (left) and non-epistemic (right) values, to show that all values of both kinds may be relevant to all steps, but that epistemic and non-epistemic values, respectively, dominate more at either end of the assessment process. If step (iii) in (a) is adjusted, both schemata (a, b) apply also to assessments of other climate-scientific results

4 Results: choices and value-judgements in the model-based ECS assessment process

Table 1 Choices and values in step (i) of model-based ECS assessments

In what follows, we describe the steps of a model-based ECS assessment process depicted in Fig. 2a, identifying choices and value-judgements following the methodology described in the previous section.

4.1 (i) Choices and value-judgements in choosing the research question

In the assessment process (Fig. 2a), the first step is to choose whether to assess the variable or phenomenon (here: ECS) in the first place (Table 1A with Seneviratne and Hauser 2020; Stevens et al 2016b), which is foundational for the assessment. This choice of research question is closely linked to the aims of research, and as such mostly uncontroversially influenced by non-epistemic values (Elliott 2017; Intemann 2015). The aim of estimating climate sensitivity, for example, has not been motivated by curiosity alone, but also by growing concern about the possibility of anthropogenic climate change (Fleming 1998; Weart 2010) and its impact on life on Earth. Furthermore, many assessments have been and are being commissioned by governments (e.g. Press 1981), so it is uncontested that (social) value-judgements are relevant for determining which research is deemed pursuit-worthy. For the IPCC reports, the procedure is such that each report is preceded by a scoping meeting, in which scientists develop a draft outline for the report for governments to approve (IPCC 1999). Direct and indirect influence on the choice of research question is also exerted by scientific and non-scientific institutions (funding agencies, journals, etc.). Research questions also need to be chosen in all of the scientific work underlying the assessments. Value-judgements can be relevant to those choices, too, which might influence the assessment results indirectly (Table 1B with Doherty et al. 2008; Nowottnick et al 2011; Kovilakam and Mahajan 2016 and Table 1C with Johansson et al 2015; Lewandowsky et al 2015).

Table 2 As Table 1, except with examples for assessment step (ii) rather than (i)

4.2 (ii) Choices and value-judgements in model-building

We consider model-building the next step of a model-based assessment (Fig. 2a). This is because model-based assessments of ECS and other quantities do rely on models being built, even if current-generation climate models are not built specifically to assess ECS or for any other single purpose. In this step, scientists face numerous choices due to uncertainty about the most appropriate structure (structural uncertainty) and parameter values of a model (parametric uncertainty) (e.g. Winsberg 2018). Motivated by their differing characteristics, we discuss in the following (1) numerical and computational choices centralFootnote 5 and/or deepFootnote 6 in the model; (2) the choice of which processes to include and how; and (3) tuning choices.

(1) Examples for these types of choices include details of a model’s computational implementation and numerical discretisation (Table 2D). Such choices tend to be built-in and inherited throughout the model development, causing climate models to have path-dependency (Lenhard and Winsberg 2010), so that choices made at one time in history impact what options are available at a later time (Winsberg 2012). These details may impact present-day model ECS through determining model bias (Toniazzo et al 2020). They can also determine model-running efficiency at modern high-performance computing (HPC) facilities (Hewitt et al 2011; Wedi et al. 2013), which impacts ECS assessments indirectly by affecting for example the scope for sensitivity studies regarding model-building choices, the scope for exploring the influence of internal variability on ECS (Deser et al 2020), and the feasibility of estimating EqCS rather than EffCS (Section 2, Rugenstein et al. 2020). We cannot assume that the original choices were made in knowledge of their present-day consequences. Value-judgements in the original choice “might very well have been opaque to the actors who put them there, and they are certainly opaque to those who stand at the end of the [...] model construction” (Winsberg 2012, 132), but we identify largely epistemic values as relevant.

(2) Choices regarding which model components or processes to represent (Table 2E) are deep in the model but contribute all the more to model uncertainty. These choices can generally involve value-judgements by prioritising processes that are known or suspected to influence model performance regarding a specific region or a phenomenon that impacts some stakeholders more than others (Table 1B). We highlight choices relating to the representation of processes that cannot be explicitly modelled but need to be parameterised (Schneider et al 2017). Significant for ECS are in particular the parameterisations of convective and microphysical (aerosol-)cloud processes, through their influence on cloud feedbacks (Boucher et al. 2013; Zelinka et al 2020). There are plenty of feasible alternatives concerning which microphysical processes to include, how to include them, and how they are assumed to interact with each other and with other parts of the model, even if in practice the variety of implemented choices has narrowed with the widespread adoption of one particular scheme. Regarding the choice of processes, the values of completeness (e.g. Randall et al. 2007, 592; Boucher et al. 2013, 573) or complexity (Flato et al. 2013, 749) are often evoked, as expressed also in the historical tendency for the number of processes considered to increase (e.g. Edwards 2011). However, completeness can come into tension with simplicity, and there is often a “battle between simple, targeted, and selective representation on one hand and completeness on the other hand” (Knutti 2018, 339). How to parameterise a given process is subject to additional considerations; an example is the choice of how to represent heterogeneous freezing of cloud droplets to ice and of rain to snow (e.g. Bigg 1953; Reisner et al. 1998), again evoking values like simplicity. Highlighting that the different parameterisations “must work in unison”, Boucher et al. (2013, 584) state that “[t]he system of parameterisations must balance simplicity, realism, computational stability and efficiency”, evoking a range of epistemic and pragmatic values.

(3) Given a model structure, subject to the choices above, the degrees of freedom offered by parametric uncertainty are typically explored in the “art” (Hourdin et al 2017, 589) of tuning the model (Table 2F). The objective of tuning is generally to ensure that the model is, performance-wise, a reasonable representation of the observed climate system; the choice of the specific tuning target varies between modelling groups (Hourdin et al 2017; Schmidt et al 2017), and both epistemic and non-epistemic value-judgements are relevant here (Schmidt and Sherwood 2015). Tuning practice can have immediate implications for the interpretation of inter-model spread (Kiehl 2007; Knutti 2008), which calls for transparency (Hourdin et al 2017; Schmidt et al 2017). For example, rather than tuning their model to other metrics and letting ECS emerge, Mauritsen and Roeckner (2020) turn things around by explicitly tuning to a target ECS. Further choices in tuning refer, e.g., to observational uncertainty; Bender (2008) showed for instance that for the common target of top-of-the-atmosphere radiative balance derived from global satellite estimates, the dataset chosen can impact model ECS. Value-judgements can be relevant to such choices of one dataset over another.
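To illustrate how such a dataset choice can propagate into the tuned model, consider the following minimal, purely conceptual sketch. It tunes a single cloud parameter of a toy energy-balance 'model' so that the simulated top-of-the-atmosphere (TOA) imbalance matches an observational target; the two target values and the toy response function are hypothetical stand-ins, not values from any real model or satellite product.

```python
# Toy sketch: how the choice of observational tuning target can change a tuned parameter.
# All numbers and the response function are illustrative assumptions, not real data.

def toa_imbalance(cloud_param):
    """Hypothetical global-mean TOA imbalance (W m-2) as a function of a cloud parameter."""
    return 2.0 - 1.5 * cloud_param  # simple, monotonically decreasing toy response

def tune(target, lo=0.0, hi=2.0, tol=1e-6):
    """Bisection: find the parameter value whose simulated imbalance matches the target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if toa_imbalance(mid) > target:
            lo = mid          # imbalance too high: increase the parameter
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two hypothetical satellite-derived targets for the TOA imbalance (W m-2):
for name, target in [("dataset_A", 0.7), ("dataset_B", 0.9)]:
    print(f"{name}: tuned cloud parameter = {tune(target):.3f}")
```

The two tuned parameter values differ, and in a real model such a difference would feed through to cloud feedbacks and hence to model ECS, which is the effect Bender (2008) documents.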

A strategy to address some of the freedom of choice in model-building is the use of perturbed parameter ensembles (PPEs), in which the space of some or all of the degrees of freedom offered by the poorly constrained parameters is sampled (see Section 5); the samples are sometimes used as input to statistical emulators (e.g. Sexton et al 2021) to describe even more parameter combinations. The use of multi-model ensembles (MMEs; step iv) aims to complementarily address structural model uncertainty, while single-model large ensembles specifically address internal variability uncertainty.
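The following sketch indicates what such a PPE-plus-emulator workflow can look like in practice: a Latin hypercube sample of a hypothetical three-dimensional parameter space, a toy stand-in for the expensive model runs, and a Gaussian-process emulator fitted to the resulting (parameter, response) pairs. Parameter names, ranges, and the response function are illustrative assumptions only.

```python
# Minimal PPE sketch (assumed parameter names/ranges; toy response instead of a climate model).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
params = {"entrainment": (0.5, 2.0), "ice_fall_speed": (0.5, 1.5), "rh_crit": (0.6, 0.9)}

def latin_hypercube(n, bounds, rng):
    """Simple Latin hypercube sample over the given parameter bounds."""
    d = len(bounds)
    u = (rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T + rng.random((n, d))) / n
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)

X = latin_hypercube(30, list(params.values()), rng)

# Stand-in for running the climate model at each parameter setting and deriving its ECS:
def toy_model_ecs(x):
    return 2.5 + 0.8 * x[0] - 0.5 * x[1] + 1.2 * x[2] + 0.1 * rng.standard_normal()

y = np.array([toy_model_ecs(x) for x in X])

# Emulator: predicts the (toy) ECS response for parameter combinations that were never run.
emulator = GaussianProcessRegressor(normalize_y=True).fit(X, y)
x_new = np.array([[1.0, 1.0, 0.75]])
print("emulated ECS at untried parameters:", emulator.predict(x_new))
```

Even in this caricature, someone has to decide which parameters to perturb, over which ranges, and with how many samples; those are exactly the choices to which value-judgements remain relevant.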

4.3 (iii) Choices and value-judgements in deriving model ECS

After choosing to study ECS and building a model, the third step in model-based assessments is to derive an ECS estimate for a given climate model (Fig. 2a), which requires the choice of a method. The main methodological choice has historically been that of the model setup (coupled vs. slab-ocean, Section 2) and of the model ‘experiment’ (terminology as in Schmidt and Sherwood 2015). For the widely available CMIP6 data from current-generation models, a panel of expert scientists, in consultation with the community, has chosen the standard of coupled 4xCO2 simulations (Eyring et al 2016), which is then implemented by the modelling groups. Alternative choices can in theory be made at the ECS assessment stage but are practically inhibited by the computing resources, time, and specific expertise required to run a model even when its code, instructions, and documentation are published. The choice of coupled 4xCO2 simulations then suggests estimating ECS as EffCS (as also estimated by Sherwood et al 2020) with the 150-year ‘Gregory method’ (Section 2), which will impact the assessed ECS value as well as its interpretation. Epistemic and non-epistemic value-judgements are relevant to the choice (Table 3G). Within the constraints of model data availability, alternative choices regarding methodological details can readily be made (Boucher et al 2020; Rugenstein et al 2020), since authors of inter-model comparison studies will routinely analyse the published model data themselves rather than using the ECS values published by the modelling centres.
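For concreteness, the core of the standard 150-year Gregory method can be sketched as follows: regress the annual-mean net TOA radiative imbalance of an abrupt-4xCO2 simulation (relative to the pre-industrial control) against the corresponding global-mean surface temperature anomaly, and take half of the temperature at which the fitted line reaches zero imbalance as EffCS. The arrays below are synthetic placeholders for data that would in practice be read from CMIP output; the snippet is a minimal sketch of the regression step only, not a full implementation of all the methodological details discussed above.

```python
# Minimal sketch of the 'Gregory method' for EffCS (cf. Gregory et al 2004).
# delta_T: global-mean surface temperature anomaly (K), abrupt-4xCO2 minus piControl, years 1-150
# delta_N: global-mean net TOA radiative imbalance anomaly (W m-2) for the same years
import numpy as np

def effcs_gregory(delta_T, delta_N):
    """EffCS from a linear fit delta_N = F + lam * delta_T over 150 years of abrupt-4xCO2."""
    lam, F = np.polyfit(delta_T, delta_N, 1)   # slope = feedback parameter (negative), intercept = forcing
    warming_at_equilibrium_4x = -F / lam       # where the fitted line crosses delta_N = 0
    return warming_at_equilibrium_4x / 2.0     # divide by 2: from 4xCO2 to a CO2 doubling

# Synthetic example (illustrative numbers only): F = 7.4 W m-2, lam = -1.1 W m-2 K-1
rng = np.random.default_rng(1)
delta_T = np.linspace(2.0, 6.0, 150) + 0.2 * rng.standard_normal(150)
delta_N = 7.4 - 1.1 * delta_T + 0.4 * rng.standard_normal(150)
print(f"EffCS = {effcs_gregory(delta_T, delta_N):.2f} K")
```

Choices such as which years of the simulation to regress over, or whether to halve the quadrupling response at all, are among the methodological details to which the value-judgements of Table 3G are relevant.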

4.4 (iv) Choices and value-judgements in combining multi-model results

The fourth step is to combine estimates from different models (Fig. 2a). Different models and model versions differ in their ECS estimates derived with a given method (step iii) because the choices in the model-building (step ii) were made differently, and such estimates are combined into a range and (potentially) a best estimate in this step. The first choice here is that of the ensemble, for example the latest CMIP generation, largely for epistemic and pragmatic reasons (Table 4H). Epistemic reasons include the acknowledgement of model improvement over time, both in terms of improved physical bases of many parameterisations and/or reduced phenomenological model bias (IPCC 2021), motivating the common choice to analyse models by generation and compare with earlier generations (as, e.g., Collins et al. 2013; Zelinka et al 2020; Flynn and Mauritsen 2020). Rauser et al. (2015) argue that alternatives to this choice of ensemble, i.e. cross-generational ensembles, should be used more often. One argument for this is that the small number of generational CMIP ensembles available to date provides limited power for assessing the robustness of emergent constraints (e.g. Schlund et al 2020). Further choices might include the cut-off date by which model data has to be available in CMIP in order to be included in the analysis. This is clearly guided by pragmatic factors, although non-epistemic value-judgements might also be relevant (Table 4H). There can also be explicitly epistemic value-judgements, e.g. to exclude some models due to suspected model artefacts such as model drifts that are not understood (PAGES2k-PMIP3 group 2015). Choices in this step (like weighting a model based on some metric) might effectively ‘undo’ some choices made in the model-building (like tuning the model to a different metric) that influence a specific model’s ECS value. This illustrates the important notion that the impact of choices and relevant value-judgements cannot be assumed to be linearly additive along the assessment steps (Winsberg 2018).

Table 3 As Table 1, except with examples for assessment step (iii) rather than (i)
Table 4 As Table 1, except with examples for assessment step (iv) rather than (i)

The models forming the ensemble are often treated equally when combined into a best estimate (usually the multi-model mean, MMM) and into a characterisation of model spread (usually percentiles of the full range) (e.g. Flato et al. 2013, 745). This ‘plain’ MMM that results from weighting all models equally is sometimes also referred to as “model democracy” (e.g. Abramowitz et al 2019, 101), but displays first and foremost the value of simplicity, since it is a simple solution. Although model spread is routinely treated “as some measure of uncertainty” in the literature and also in past IPCC reports (Collins et al. 2013, 1043), it is not the same as uncertainty regarding our knowledge of the real world’s ECS (Tebaldi and Knutti 2007; Parker 2018). A variety of model ensemble-based uncertainty quantification approaches have therefore been developed (Flato et al. 2013) that involve choosing to weight, filter, scale, or otherwise constrain model-based estimates (Eyring et al 2019; Brunner et al 2020). The constraint may for example be based on model performance with regard to observations and/or on model independence (Masson and Knutti 2011; references in Abramowitz et al 2019). Both epistemic and non-epistemic values might be relevant for quantifying uncertainties (Table 4I and Winsberg 2012). When choosing an uncertainty quantification method, one might consider that the methods differ in various and complex ways, for instance regarding their measure of performance (typically, present-day climate or forced climate change; see discussion and references, e.g., in Brunner et al 2020). The methods also differ in how they assume the model sample to be related to the true climate (Abramowitz et al 2019), or in whether they make explicit use of the model diversity (i.e. spread) to reduce uncertainty in emergent constraints (Schlund et al 2020 and references therein). The methods and their application include a variety of technical choices (e.g. Tokarska et al 2020) that require a high degree of expert knowledge (see, e.g., Ribes and Terray 2013, 2852; Schurer et al 2018, 8657) and that might involve non-epistemic value-judgements (Table 4J). Other methods include similar choices more implicitly (Brunner et al 2020).
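The contrast between 'model democracy' and a constrained estimate can be made concrete with a small sketch. Below, a set of synthetic, purely illustrative model ECS values is first combined with equal weights and then with weights that reward performance against observations and penalise lack of independence, loosely in the spirit of the performance-and-independence weighting reviewed by Brunner et al (2020); the shape parameters sigma_d and sigma_s and all input numbers are hypothetical choices, which is exactly where the value-judgements discussed above enter.

```python
# Illustrative sketch: equal weighting vs. performance-and-independence weighting of model ECS values.
# All inputs are synthetic; sigma_d and sigma_s are free 'shape' choices left to the analyst.
import numpy as np

ecs = np.array([2.8, 3.1, 3.4, 4.7, 5.2])           # ECS per model (K), synthetic
dist_to_obs = np.array([0.4, 0.6, 0.5, 1.2, 1.4])   # performance: generalised distance to observations
dist_between = np.abs(ecs[:, None] - ecs[None, :])  # crude stand-in for inter-model distances

sigma_d, sigma_s = 0.6, 0.5   # analyst-chosen shape parameters

performance = np.exp(-(dist_to_obs / sigma_d) ** 2)
similarity = np.exp(-(dist_between / sigma_s) ** 2)
independence = 1.0 / (1.0 + similarity.sum(axis=1) - 1.0)  # subtract each model's similarity to itself
weights = performance * independence
weights /= weights.sum()

print("equal-weight MMM:   ", ecs.mean())
print("weighted estimate:  ", (weights * ecs).sum())
print("5-95% of raw spread:", np.percentile(ecs, [5, 95]))
```

The weighted and unweighted estimates differ, and how much they differ depends on sigma_d, sigma_s, and the chosen performance metric, illustrating why such technical choices require expert judgement and can carry value-judgements.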

4.5 (v) Choices and value-judgements in publishing and communicating findings

The last step of a model-based assessment is that of publishing and communicating (Fig. 2). The communication of results involves many choices regarding how information is visualised, worded, and more generally framed — the same result may thus be communicated in multiple ways and each option may evoke a specific value (Elliott 2017). By framing we mean how a result is embedded in the context and how the result is represented — since trying to present the data without a frame is a framing decision as well (ibid.), this is necessarily a choice, potentially trading off different values (see also McKaughan and Elliott 2013). For communication to be effective, such framing decisions depend on the choice of the target audience (Corner et al. 2018; Šucha and Sienkiewicz 2020). The target audience of IPCC reports, which include ECS assessments, is largely defined by scientific and non-scientific institutions under clear involvement of non-epistemic value-judgements. Because this target audience is rather broad (IPCC 2012), decisions will have to be made regarding the relative importance of reaching different audience subgroups that may for instance differ in terms of cultural background or statistical literacy; non-epistemic value-judgements are relevant to this choice.

Choices regarding the presentation of the assessment results apply particularly to ECS uncertainty. Not many numerical ECS values above zero can be strictly ruled out, i.e. assigned a likelihood of exactly 0 (Sherwood et al 2020). Uncertainty may then be presented as full probability density functions or as ranges (Table 5K and e.g. Bindoff et al. 2013, Fig 10.20), and choices have to be made regarding the visualisation of these distributions or ranges and regarding the verbal representation of the results. For example, the choices of which likelihood intervals to report (for typical choices, see Knutti et al. 2017, Fig. 2) or of which aspects of the distribution to highlight (Fig. 1) remain despite standardised IPCC guidelines for “assessing, characterizing, and reporting uncertainties in a more consistent – and to the extent possible, quantitative – fashion” (Moss and Schneider 2000), and can impact the assessed and communicated ECS through the perception and interpretation of uncertainty. Emanuel (2014) and Sutton (2018) argue that the tails of the distributions, including low-likelihood high-impact scenarios, have been given (too) little focus in past IPCC reports. In terms of concrete choices regarding the wording, Sutton (2018, 1155) suggests that “using terms (...) such as ‘very unlikely’ or ‘extremely unlikely’ [constituted] a clear steer that policymakers should largely ignore such possibilities.” This might imply trade-offs between the value of avoiding alarmism as a (potentially misassumed, Oreskes 2020) risk to trust in science on the one hand, and the value of ensuring effectively communicated, complete information to policymakers that enables societal action on the other hand (Table 5K). Recent, complementary approaches communicate the plausibility of a specific numerical ECS value together with that numerical value’s implications (‘storylines’, Sutton 2018; Shepherd et al 2018).
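The dependence of the communicated message on the chosen intervals can be illustrated in a few lines of code. The sketch below draws a synthetic, right-skewed 'ECS distribution' (a stand-in only, not any assessed distribution) and reports several central intervals; which percentile bounds are equated with which calibrated likelihood term is itself one of the reporting conventions discussed above, and is therefore passed in explicitly here rather than hard-coded.

```python
# Sketch: one synthetic distribution, summarised with different (chosen) likelihood intervals.
import numpy as np

rng = np.random.default_rng(2)
ecs_samples = rng.lognormal(mean=np.log(3.2), sigma=0.25, size=100_000)  # synthetic, right-skewed

# Reporting choices: which percentile bounds to report (and which calibrated term to attach to them).
intervals = {"central 66% range": (17, 83), "central 90% range": (5, 95), "central 98% range": (1, 99)}

for label, (lo, hi) in intervals.items():
    bounds = np.percentile(ecs_samples, [lo, hi])
    print(f"{label}: {bounds[0]:.1f}-{bounds[1]:.1f} K")
print(f"median (one possible 'best estimate'): {np.median(ecs_samples):.1f} K")
```

Reporting only the narrowest of these intervals, or omitting the upper tail, conveys a different message to policymakers than reporting the wider ones, which is the communication choice at stake in the debate cited above.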

Table 5 As Table 1, except with examples for assessment step (v) rather than (i)

Not only the final assessment result, but also the research underlying the preceding assessment steps (i–iv) is subject to communication. However, in those steps the audience consists predominantly of other scientists, so that choices regarding how the findings are communicated are arguably less influential than they are in step (v). Instead, the more influential choices seem to be whether a finding is communicated, i.e. published, at all. ‘Finding’ is here broadly defined as including assessment results and model-derived results, model-experiment data, and the model codes. Several model versions will for example be produced in the model development process, but not all are published. The choice of which one(s) to publish is arguably part of the model-building (step (ii)), but looking at it from this angle might highlight additionally relevant value-judgements, for example when thinking about the publication of outlying models (Table 5L with Andrews et al 2019) or non-conforming results (Table 5M with Schwartz 2007; Knutti et al 2008). Questions about the publication of null results may also occur, be it an unsuccessful effort to build a model with a specific characteristic, a probing of whether a recent observation warrants a change in the best-estimated ECS, or an emergent constraint that does not work. Additional influence might be exerted by peer reviewers and journal editors, whose decisions can involve similar value-judgements. Any resulting publication bias risks unsupported conclusions (for emergent constraints, examples in, e.g., Schlund et al 2020).

5 Discussion

The analysis in Section 4 shows that value-judgements are relevant at each step underlying multi-model based assessments. Here, we look across the steps to synthesise implications for the role and management of values in science more generally.

5.1 Implications for the value-free ideal of science

The abundance and variety of value-judgements that we have identified support the philosophical literature that questions the ‘value-free ideal’ of science (VFI; e.g. Betz 2013; Lacey 1999; but see John 2015; Jebeile 2020). Although there are also more relaxed, and thereby more ambiguous, articulations of the VFI (e.g. to minimise the influence of non-epistemic values in scientific reasoning, Reiss and Sprenger 2020), the VFI in its strictest interpretation (e.g. Betz 2013) holds that social values ought not to influence the justification of scientific findings. Our results show that eliminating or substantially reducing the influence of non-epistemic values on accepting or assessing theories, on gathering evidence, or on scientific reasoning more broadly, as the VFI demands, is at best difficult, and in many cases neither possible nor desirable. Many choices in the model-building step (Table 2), the choice of the time scale in the very definition of ECS (Table 3G), or model-weighting choices (Table 4I) are for example part of such activities, and we have identified non-epistemic value-judgements that are relevant to these choices. We have further identified instances where the influence of non-epistemic values may make science more attuned to the needs of society, e.g. giving political incentives to study ECS in the first place (Table 1A) or leading the scientific community to scrutinise findings of very high or low ECS with their particularly high societal consequences (Table 5M). The latter is a quintessential example of non-epistemic value-judgements legitimately affecting scientific reasoning with regard to the quality of evidence through what philosophers call ‘inductive risk’: when scientists’ choice of what counts as enough evidence for accepting a hypothesis involves evaluating the social or ethical consequences of accepting an incorrect hypothesis or rejecting a correct one (i.e. of making a false positive or a false negative error, respectively) (e.g. Douglas 2009). We suggest, however, that judgements related to inductive risk are relevant to different degrees within the ECS assessment process: Because social or ethical consequences of choices in the model-building step are not easily identified or predicted (e.g. Table 2D; also Schmidt and Sherwood 2015, 154 and references therein), it will be difficult to make these choices according to corresponding risk preferences. We therefore suggest that non-epistemic values might in general have more influence at steps towards either end of the assessment process, where potential consequences of an option to a choice are more readily anticipated (Fig. 2). There are, however, notable non-epistemic value-judgements relevant also in the model-building step (e.g. Tables 1B and 2E, F), which are important to consider even if they refer only indirectly to ECS.

5.2 Do sensitivity studies avoid value-judgements?

Various frameworks are employed to address or circumvent the impacts that methodological choices have on assessment results. Sensitivity studies that address model uncertainty range from single-parameter and single-model analyses to PPEs and MMEs. Insofar as different options to a choice reflect different value-judgements, sensitivity studies reveal the dependence of the assessment not only on the underlying choices, but also on values. However, because it is computationally (e.g. Table 2DE) and intellectually infeasible for sensitivity studies to sample the complete space spanned by all the choices (and values) underlying the steps of ECS assessments, they cannot eliminate value-judgements. Instead, value-judgements are relevant to the very choice of which (e.g. parameter) space to sample. Sensitivity studies regarding choices (and values) made in the other assessment steps are subject to the same intellectual and similar practical limitations, for example the limited number of alternative communication options that can be realised in any one report (e.g. Table 5K).

5.3 Requirements for applicable normative ideals

Existing scientific practice may already fulfil aspects of the normative ideals regarding values in science proposed in the philosophical literature and listed in the introduction. The historical evolution of the uncertainty ranges given for ECS (Fig. 1) might for instance bear witness to the delivery of Parker and Risbey’s (2015) desiderata of faithfulness and completeness for uncertainty communication in light of maturing physical understanding: The better ECS was understood, the more remaining gaps of uncertainty were also identified, so a report might have been faithful and complete given the knowledge at the time even if later reports state broader uncertainty ranges. Broad, and conscious, implementation of these and other norms will however require the development of an awareness among scientists regarding value-judgements in their work (Pulkkinen et al 2022a), and it requires the norms to be applicable to their work in the first place. For this, the diversity of the choices and value-judgements (Section 4) poses challenges, four of which stand out based on our study:

  • the distributed agency and influence from institutions and agents beyond the scientific community;

  • the opacity regarding the final impact of choices (e.g. on ECS and the associated uncertainty range) and the technical nature of many choices;

  • the multipurposeness of models as tools as well as the multiple aims of scientific assessments and the diversity of the target audience;

  • the variety of different types of value-judgements and their possible influence, inclusive of, but not limited to, risk preferences.

A satisfactory normative account of the role of values in climate modelling would take these characteristics into account and hence be applicable to climate-scientific practice. The existing normative ideals, which tie the legitimacy of value influence to the legitimacy of the aims of research, to restricting the influence to evaluating the risk of error, to the democratic endorsement of values or stakeholder input, or to the transparency of value-judgements, have difficulties accommodating the above-mentioned characteristics, limiting the ideals’ applicability to ECS assessments and climate science more broadly. The technical nature of choices raises a number of issues. For example, it is not clear how one might achieve meaningful stakeholder involvement. Furthermore, given the opacity of models, transparency about model-builders’ value-judgements risks leading to false assumptions about what those value-judgements imply for the model results. It is therefore challenging to see how the existing normative suggestions can be applied across the variety of different choices analysed. Even though the normative ideals might not have been developed to be applicable to any or all of the specific choices in ECS assessments, we argue that broad applicability is a virtue for normative frameworks. Besides being desirable, having a single framework to follow is perhaps even necessary if it is to be implemented in practice. Our doubts regarding the applicability of existing norms support the notion that it is important for scientists to develop an awareness of values in their research: this is not only a necessary first step for value-management (Pulkkinen et al 2022a) and a large part of some norms themselves (e.g. Brown 2020), but also allows scientists to engage in the discussion and thereby facilitate the development of more applicable norms (e.g. Pulkkinen et al 2022b).

6 From multi-model assessments to multiple lines of evidence

As a break with previous practice, and a breakthrough in terms of constraining ECS, the most recent ECS assessments — Sherwood et al (2020) and IPCC AR6 (Forster et al 2021) — are not based on direct model estimates, but use multiple other lines of evidence. Besides the fundamental value of constraining power, the shift from model-based assessment to other lines of evidence is also a choice that balances values like broadness of the evidential basis vs. simplicity and ease of reproducibility, or novelty of method vs. historical continuity of method (see Section 2). With the steps of the new assessment process comes also a modified set of choices and value-judgements, some of which are implied by the acknowledgement of subjective aspects by Sherwood et al (2020). A complete analysis of choices and value-judgements in this new type of assessment deserves to be the focus of a follow-up study, but key points are sketched out in the following.

The assessment steps in Sherwood et al (2020) and Forster et al (2021) have both similarities and differences with the steps in model-based assessments (Fig. 2). They similarly involve ‘choosing the research question’ and ‘publishing and communicating findings’, meaning that the choices and value-judgements discussed in Section 4(i) and (v), respectively, directly apply (e.g. Tables 1A, C and 5K, M). The new steps in between involve the choice of the lines of evidence; the choice of literature informing the assessment of each line of evidence and their synthesis; and the combination of the lines of evidence into a final assessment result. To these new steps, some of the choices and value-judgements identified for the steps immanent to model-based assessments apply equally or similarly, including for example those regarding the definition of ECS (Table 3G).

Choices and value-judgements new to ECS assessments based on lines of evidence other than models concern, firstly, the choice of these lines of evidence. Both Sherwood et al (2020) and Forster et al (2021), for instance, include process understanding, warming over the instrumental record, and paleoclimates, while Forster et al (2021) include emergent constraints as a fourth line of evidence. This might broaden the evidence base but potentially introduces additional errors associated with emergent constraints (Schlund et al 2020).

Secondly, the literature that informs each line of evidence needs to be chosen and combined into an interim assessment result. Appreciable value-judgements can be relevant to these choices. This starts with the infeasibility, even for a team of authors and many peer reviewers, of being aware of and familiar with each and every scientific publication, making the selection of literature sensitive to the composition of the assessment’s author team. Although value-judgements can also be relevant for choosing and potentially subsetting from a CMIP suite of models (Section 4(iv)), that selection is arguably more delimited than the selection of results from the body of literature. The synthesis of the literature into assessment results for each line of evidence also involves choices of weighting; compared to ‘model democracy’ (one model, one vote), ‘paper democracy’ (one paper, one vote) is not even a point of discussion, given the differing scope and evidence base of individual studies. Value-judgements may also enter the assessment through the trust placed in the names or affiliations of the authors of specific publications, to a greater extent than there is room for in a model selection process. This may even be a possible source of illegitimate bias, addressed by value-management ideals proposing diversity within the scientific community (e.g. Longino 1990).

Finally, the combination of the different lines of evidence into a final assessment result also involves choices and value-judgements. Forster et al (2021, 7.5.5), for instance, write that the lines of evidence “can be combined [...] formally using Bayesian statistics, though such a process is complex and involves formulating likelihoods and priors (Annan and Hargreaves, 2006; Stevens et al., 2016; Sherwood et al., 2020[...]). However, it can be understood that [the general principle allows another approach for combining the lines of evidence]” (emphasis added). This suggests that the values of complexity vs. simplicity are relevant here, among other potential non-epistemic value-judgements. Just as the use of a model does not imply less involvement of values compared to non-quantitative methods, the use of ‘formal’ methods does not necessarily imply less involvement of values compared to more informal methods. The choice of priors, to which the resulting ECS range can be sensitive (Sherwood et al 2020), is for instance much discussed both in the scientific (e.g. Sherwood et al 2020 and references therein) and the philosophical literature (e.g. Steel 2015; Sprenger 2018).
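To make the role of these choices tangible, the following toy sketch combines three hypothetical 'lines of evidence', each represented by a likelihood over ECS, with two alternative priors on a common grid. It is a deliberately simplified caricature of a Bayesian combination (Sherwood et al (2020), for instance, work with likelihoods over feedbacks and forcing rather than over ECS directly), and all distributions and numbers are illustrative assumptions; the point is merely that the choice of prior, and of the likelihood shapes, is itself a choice to which value-judgements can be relevant.

```python
# Toy Bayesian combination of 'lines of evidence' for ECS; all distributions are illustrative only.
import numpy as np

S = np.linspace(0.1, 10.0, 2000)   # ECS grid (K)
dS = S[1] - S[0]

def gaussian(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

# Hypothetical likelihoods for three lines of evidence (not the Sherwood et al 2020 likelihoods):
likelihood = gaussian(S, 3.1, 1.0) * gaussian(S, 2.9, 1.3) * gaussian(S, 3.4, 1.5)

priors = {"uniform in S": np.ones_like(S),
          "uniform in feedback (~1/S^2)": 1.0 / S**2}   # Jacobian of a change of variables

for name, prior in priors.items():
    posterior = prior * likelihood
    posterior /= posterior.sum() * dS                   # normalise to a probability density
    cdf = np.cumsum(posterior) * dS
    lo, med, hi = np.interp([0.05, 0.5, 0.95], cdf, S)
    print(f"prior {name}: median {med:.1f} K, 5-95% range {lo:.1f}-{hi:.1f} K")
```

Running the sketch shows that the reported median and range shift with the prior even though the 'evidence' is identical, which is the sensitivity to priors discussed in the literature cited above.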

In contrast to model-based assessments, it is more feasible for the steps of this other assessment type to all be performed by the same agent(s). There is also a reduction in the multipurposeness of tools and aims of research compared to model-based assessments, because the literature is reviewed and synthesised specifically and only for the purpose of assessing ECS. We further suggest that potential consequences of the options to a choice are more readily anticipated in this type of assessment, which means that inductive risk considerations regarding the final assessment result can influence choices in all steps more than is the case in model-based assessments (Fig. 2a). Thereby, existing value-management ideals involving the matching of risk preferences to those of society (e.g. Intemann 2015) may be more applicable here than for model-based ECS assessments.

However, the challenges identified for model-based assessments also hold for many of the literature findings on which the new assessments are based. Information from models is for example involved in numerous, essential ways in each of the lines of evidence in IPCC AR6 (Forster et al 2021, 7.5.5–6), meaning that choices and value-judgements discussed in Section 4(ii) may be inherited even by assessments that are not directly based on models. This means that value-management for ECS assessments of the new type would ultimately also have to include the values relevant to model-building, and to any other scientific work providing evidence of relevance for ECS assessment.

7 Conclusions

Based on the argument that awareness of values in the scientific process is a necessary first step towards proper value-management, and that such awareness is largely lacking in foundational areas of climate science, we examined choices and value-judgements underlying multi-model-based assessments in climate science. We used Equilibrium Climate Sensitivity (ECS) as a case study, and found that numerous epistemic and non-epistemic value-judgements are relevant at every step of the assessment process, impacting the assessment in various ways. We argued that non-epistemic values associated with risk preferences are likely more relevant at steps towards either end of the assessment process. We further noted that sensitivity studies do not provide a way to avoid value-judgements, and identified characteristics of model-based assessments that limit the applicability of existing normative proposals from the philosophical literature for the management of values. We also discussed the latest ECS assessments that are based on multiple lines of evidence rather than on direct model estimates; we highlighted choices and value-judgements that remain equal or similar and identified which others would need to be the focus of further in-depth study. The findings are important for ongoing and future research, which will share or even inherit many of the value-judgements discussed here. This applies not only to studies of ECS (Meehl et al 2020), but also to future projections derived either from emulators constrained with ECS or from multi-model projections combined with expert judgement (IPCC 2021, Chs. 4, 10), or even to assessments of climate impacts based on model inter-comparisons (e.g. Warszawski et al. 2014). Further reflection on value-judgements in climate-scientific assessments is hence desirable.