Evaluating process-based integrated assessment models of climate change mitigation

Process-based integrated assessment models (IAMs) project long-term transformation pathways in energy and land-use systems under what-if assumptions. IAM evaluation is necessary to improve the models’ usefulness as scientific tools applicable in the complex and contested domain of climate change mitigation. We contribute the first comprehensive synthesis of process-based IAM evaluation research, drawing on a wide range of examples across six different evaluation methods including historical simulations, stylised facts, and model diagnostics. For each evaluation method, we identify progress and milestones to date, and draw out lessons learnt as well as challenges remaining. We find that each evaluation method has distinctive strengths, as well as constraints on its application. We use these insights to propose a systematic evaluation framework combining multiple methods to establish the appropriateness, interpretability, credibility, and relevance of process-based IAMs as useful scientific tools for informing climate policy. We also set out a programme of evaluation research to be mainstreamed both within and outside the IAM community.

The IPCC Special Report on Global Warming of 1.5°C was informed by 411 scenarios from 10 global IAMs. Process-based IAMs are also used more directly in climate policy formulation, including the periodic global stocktake of progress under the Paris Agreement (Grassi et al. 2018), international negotiations under the UNFCCC (UNEP 2015; UNFCCC 2015), and national strategies, targets, and regulatory appraisals (BEIS 2018; Weitzel et al. 2019).
For IAMs to be useful scientific tools for climate policy analysis, policymakers need to have confidence in the models and their analyses. Evaluating modelling tools such as IAMs means assessing both the models and their performance so as to articulate the grounds on which they can be declared good enough for their intended use (Oreskes 1998). Evaluation is a necessarily broad and open-ended process.
IAM evaluation has a long history in multi-model comparison projects (Huntington et al. 1982), but other evaluation methods have been implemented on a more ad-hoc basis by individual modelling teams with little community-wide coordination or consolidation. While important for tacit learning and model development, such evaluation practices are less impactful on the wider climate research community.
Process-based IAMs have also been criticised for a range of perceived failings, including technological hubris (Anderson and Peters 2016), omitted drivers of sociotechnical change (Geels et al. 2017), and understating future uncertainties (Rosen and Guenther 2015). As IAMs have become more widely used for informing policy (van Beek et al. 2020), the lack of a coherent and systematic approach to evaluation has become more conspicuous.
In response, we contribute the first synthesis of IAM evaluation research, drawing on a wide range of examples across six different evaluation methods: historical simulations, near-term observations, stylised facts, model hierarchies from simple to complex, model intercomparison projects (including diagnostic indicators), and sensitivity analysis. For each method, we review key milestones in historical development and application, and draw out lessons learnt as well as remaining challenges.
Following Cash et al. (2003), we also propose four criteria against which evaluation can help improve IAMs and their usefulness in policy contexts: appropriateness, interpretability, credibility, and relevance. We map each evaluation method onto these criteria, and conclude by arguing for a systematic evaluation framework which combines the strengths of multiple methods to overcome the limitations of any single method.

Process-based IAMs not benefit-cost models
Throughout this article, we use 'IAMs' to mean process-based integrated assessment models (or what Weyant (2017) calls 'detailed process' or DP IAMs). These IAMs:
(1) Represent explicitly the drivers and processes of change in global energy and land-use systems linked to the broader economy, often with a high degree of technological resolution in the energy supply
(2) Capture both biophysical and socioeconomic processes including human preferences, but do not generally include future impacts or damages of climate change on these processes
(3) Project cost-effective 'optimal' mitigation pathways under what-if assumptions or subject to pre-defined outcomes such as limiting global warming to 2°C (Sathaye and Shukla 2013)
Many process-based IAMs originate in energy system models or energy-economy models which have since integrated land use, greenhouse gas emissions, and other climate-related processes (Krey et al. 2014; Sathaye and Shukla 2013). Examples of process-based IAMs include AIM-Enduse (Hibino et al. 2013), GCAM (Iyer et al. 2015), IMACLIM (Waisman et al. 2011), IMAGE (van Vuuren et al. 2015), MESSAGE-GLOBIOM (Krey et al. 2016), and REMIND (Luderer et al. 2013). We provide further examples in the Online Resources, including process flow diagrams of select IAMs in Online Resources 7.
In this article on model evaluation, we do not consider integrated assessment models used for cost-benefit analyses which have simplified representations of energy and land-use systems (and which are also referred to in the literature as 'IAMs'). Examples of cost-benefit integrated assessment models include DICE (Nordhaus 2013), PAGE (Hope and Hope 2013), and FUND (Anthoff and Tol 2013). These highly aggregated models are used in a cost-benefit framework to analyse economically optimal levels of abatement taking into account future impacts of climate change (Moore and Diaz 2015; Stern 2006). Cost-benefit models have also been widely applied in the USA and elsewhere for projecting the social cost of carbon in order to internalise climate-related impacts in regulatory appraisal processes (Greenstone et al. 2013; NAS 2016).
By focusing only on process-based IAMs, we follow established precedent which recognises fundamental differences between the two types of model (Kunreuther et al. 2014). Evaluation research confronts very different issues in process-based and cost-benefit models even though similar methods can be applied. For process-based IAMs with detailed representations of biophysical and socioeconomic processes across multiple subsystems, evaluation is concerned with how causal mechanisms and interactions generate outcomes of interest. As process-based IAMs are complex 'black boxes', running in specialised programming environments with high technical barriers to entry, evaluation is also concerned with issues of interpretability and transparency.
In contrast, causal processes encoded in cost-benefit type models are simplified and widely accessible (e.g. in spreadsheet tools) so understanding how outcomes of interest are generated is less of an issue. Rather, evaluation is concerned with the empirical and theoretical defensibility of assumptions such as discount rates (Metcalf and Stock 2015), of parameterised relationships such as climate impact damage functions (Cai et al. 2015), and of general modelling approaches such as the integration of mitigation co-benefits in a welfare maximisation framework (Stern 2016).
Consequently, we focus only on evaluation of process-based IAMs, and do not consider cost-benefit type models further.

What is IAM evaluation and why do it?
As Barlas and Carpenter (1990) observe: 'Models are not true or false but lie on a continuum of usefulness'. IAMs are useful for strengthening scientific understanding of coupled human and natural systems relevant to climate change (Moss et al. 2010). As an example, reference or baseline scenarios are modelled to characterise salient uncertainties in future development pathways (e.g. Riahi et al. 2017). IAMs are also useful for informing climate policymakers on the options and implications of decisions or indecision (Edenhofer and Minx 2014). As an example, policy or mitigation scenarios are modelled to help policymakers understand the consequences of different carbon pricing tariffs at regional or global scales (Vrontisi et al. 2018).
Evaluating IAMs should help improve their usefulness as scientific tools for policy-relevant analysis.
Evaluation is an open-ended process testing both model structure and model behaviour (Barlas 1996; Eker et al. 2018). Evaluating model structure tests how the modelled system is represented in equations (encoding laws, principles, causal relationships, and drivers of change), parameterisations (making simplifying assumptions about complex phenomena), constraints (imposed as external conditions), variables, and values assigned to input variables or parameters (Pirtle et al. 2010). Evaluating model behaviour tests how observed system responses are reproduced. This is commonly done by selecting an evaluation period and tuning or calibrating specific model variables and parameters to match the initial conditions and external drivers of change observed over that period. The model is then run to test how well it predicts non-calibrated outcomes (Snyder et al. 2017).
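As a minimal sketch of this calibrate-then-test procedure, the Python fragment below tunes a single parameter of a toy model over a calibration period and then measures how well the model reproduces observations it was not tuned to. The energy intensity series, the constant-rate model form, and all function names are hypothetical and chosen only for illustration; they are not taken from any IAM or from the studies cited above.

```python
# Toy hindcast: calibrate on 1990-2000, then test against 2005-2010.
# All data values below are hypothetical, not real observations.

def calibrate_rate(years, values):
    """Constant annual rate of change through the first and last calibration points."""
    span = years[-1] - years[0]
    return (values[-1] / values[0]) ** (1.0 / span) - 1.0

def project(base_year, base_value, rate, years):
    """Extrapolate from the base year at a constant annual rate."""
    return [base_value * (1.0 + rate) ** (y - base_year) for y in years]

def mean_abs_pct_error(simulated, observed):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs(s - o) / o for s, o in zip(simulated, observed)) / len(observed)

years = [1990, 1995, 2000, 2005, 2010]
intensity = [10.0, 9.0, 8.1, 7.5, 7.0]  # hypothetical energy intensity index

# Split the series into a calibration period and a held-out test period.
calib_years, calib_obs = years[:3], intensity[:3]
test_years, test_obs = years[3:], intensity[3:]

rate = calibrate_rate(calib_years, calib_obs)
hindcast = project(calib_years[-1], calib_obs[-1], rate, test_years)
error = mean_abs_pct_error(hindcast, test_obs)

print(f"calibrated rate: {rate:.4f} per year")
print(f"hindcast: {[round(v, 2) for v in hindcast]}, error: {error:.1%}")
```

In this toy case the hindcast undershoots the held-out values because the assumed constant rate of improvement slows in the hypothetical data, which is the kind of divergence that would prompt a review of modelling assumptions.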
IAMs represent complex, dynamic systems characterised by deep uncertainties. Uncertainties are epistemic (ignorance), parametric (inexactness), and societal (values). Epistemic uncertainties are associated with limits to knowledge of how the modelled system functions. Parametric uncertainties are associated with the reduction of complex phenomena to tractable model formulations and parameterisations (Oreskes 1998). Societal uncertainties are associated with values and worldviews which become embedded in model assumptions. Representation of social or economic processes not based on physical laws may be particularly contested (Oppenheimer et al. 2016; Schneider 1997).
As a result of these uncertainties, whether an IAM accurately represents the system being modelled cannot be definitively established (Oreskes et al. 1994; Sargent 2013). The same applies to other models used to assess environmental problems in coupled human-natural systems (Beck and Krueger 2016; van der Sluijs et al. 2005). As we discuss further below, this also means that for IAMs there is some overlap between uncertainty analysis and evaluation.
Evaluating IAM behaviour is similarly problematic. A close fit of IAM output to observational data does not necessarily mean the IAM accurately represents the modelled system. First, simulation results may be specific to the tuned parameterisations (Oreskes et al. 1994). Second, more than one model conceptualisation or parameterisation can generate the same output (Beugin and Jaccard 2012; van Ruijven et al. 2010). For example, van Ruijven et al. (2010) found multiple combinations of parameters could broadly reproduce historical transport energy demand in the TIMER model. Beugin and Jaccard (2012) similarly found multiple parameter distributions in backcasting simulations could reproduce observed technology adoption decisions in the CIMS model. Third, two or more settings (or errors) in the model inputs and parameterisations may partially cancel each other out (Schindler and Hilborn 2015).³ All these evaluation issues apply generically to other models of complex systems.⁴
As a consequence of these difficulties in formally testing IAMs' structure and behaviour, IAM evaluation is necessarily a continual and iterative process of learning, development, and improvement (Barlas 1996).
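The second difficulty, that different parameterisations can reproduce the same observation, can be illustrated with a deliberately simple sketch. The toy demand function, its two parameters, and the numerical values are all hypothetical; the example only shows the general point and does not represent the TIMER or CIMS analyses.

```python
from itertools import product

def toy_demand(income, elasticity, efficiency):
    """Hypothetical toy model: demand rises with income and falls with efficiency."""
    return income ** elasticity / efficiency

calibration_income = 16.0
observed_demand = 4.0  # hypothetical observation used for calibration

# Search a small grid of candidate parameter pairs for those matching the observation.
grid = product([0.5, 0.75, 1.0], [1.0, 2.0, 4.0])
fits = [(e, f) for e, f in grid
        if abs(toy_demand(calibration_income, e, f) - observed_demand) < 1e-9]

# Parameter pairs that fit equally well in calibration diverge out of sample.
future_income = 81.0
projections = {pair: toy_demand(future_income, *pair) for pair in fits}

for pair, value in projections.items():
    print(pair, round(value, 2))
```

Here three parameter pairs match the single calibration point, yet their projections at a higher income level differ by more than a factor of two, which is why a close historical fit alone cannot confirm model structure.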
We propose four inter-related criteria which guide this IAM evaluation process: appropriateness, interpretability, credibility, and relevance (see also Cash et al. (2003)).
First, IAM evaluation should improve the appropriateness of a model for addressing a specific scientific question (Jakeman et al. 2006; Sargent 2013). Matching model to task becomes harder as ever more extensive, higher resolution representations of coupled natural-human systems create IAMs with numerous possible applications (Gargiulo and Gallachóir 2013; van Beek et al. 2020).
Second, IAM evaluation should improve the interpretability of results, taking model structure and assumptions into account (DeCarolis et al. 2017; McDowall et al. 2014). Mapping meaning from the modelled world to the real world is of longstanding concern (Wynne 1984). A modelling mantra applicable to IAMs is to communicate 'insights not numbers' (Peace and Weyant 2008) alongside assumptions, uncertainties, and limitations (Beck and Krueger 2016; Cooke 2015; Kloprogge et al. 2007).
Third, IAM evaluation should improve the credibility of modelling analysis among user communities. How the producers and users of knowledge interact may be as important in determining the credibility of IAMs as the modelling analysis itself (Fischhoff 2015; Nakicenovic et al. 2014). In setting out best practice for model development and evaluation, Jakeman et al. (2006) urge a 'sceptical review' of models by users.
Fourth, IAM evaluation should improve the relevance of modelling analysis for informing scientific understanding and supporting decision-making on climate change mitigation. Cash et al. (2003) use the term 'salience' in a similar way to describe scientific information which is relevant to decision-making bodies or publics. The credibility and relevance criteria are most closely linked to the application of uncertain model results in complex and contested policy domains like climate change (Beck and Krueger 2016).
In strengthening the appropriateness, interpretability, credibility, and relevance of IAMs, it is important to emphasise that evaluation does not make IAMs more accurate nor more reliable in predicting the future. This is not what IAMs are designed to do. Rather, evaluation helps to improve the IAMs as useful scientific tools for understanding mitigation pathways, decision options, and outcomes (Edenhofer and Minx 2014; Peace and Weyant 2008). IAMs sit alongside many other decision-support tools for informing climate policy, ranging from expert elicitations and bottom-up sectoral modelling to learning from experience and participatory appraisals (Kunreuther et al. 2014).
³ The converse also holds. Divergence between model output and observational data does not necessarily mean the IAM is a poor representation of the modelled system. Divergence may be due to errors in inputs defining initial conditions or exogenous drivers of change, and large divergences may occur even when the model is only very slightly mis-specified (Thompson and Smith 2019).
⁴ As an example in climate modelling, Tebaldi and Knutti (2007) note: 'Although model agreement with observations is very valuable in improving the model, and is a necessary condition for a model to be trusted, it does not definitely prove that the model is right for the right reason. There are well-known examples where errors in different components of a single model tend to cancel. The use of the same datasets for tuning and model evaluation raises the question of circular reasoning'.

Methods for evaluating process-based IAMs
IAMs began to emerge in the late 1980s, building on longer established traditions in energy system and macroeconomic modelling (as well as early system dynamics models). Concern for model evaluation dates back to these IAM precursors in the 1970s. In particular, the US-based Energy Modelling Forum (EMF) played a pivotal role in the early experimentation and development of methods now used in IAM evaluation (Smith et al. 2015).
These IAM evaluation methods can be classified into six types. Three evaluation methods use observational data: historical simulations, near-term observations, and stylised facts. Two evaluation methods use comparisons between models: model hierarchies from simple to complex, and model inter-comparison projects. Sensitivity analysis is a sixth evaluation method which is commonly applied to individual models, but can also form part of model inter-comparisons. Sensitivity analysis is an example of how uncertainty analysis techniques overlap with evaluation methods given the many structural uncertainties which characterise IAMs' representation of incompletely understood socioeconomic and biophysical processes (Millner and McDermott 2016). Setting out the 'ten basic steps of good, disciplined model practice', Jakeman and colleagues emphasise the 'comprehensive testing of models' using methods to establish 'high enough confidence in estimates of model variables and parameters, taking into account the sensitivity of the outputs to all the parameters jointly, as well as the parameter uncertainties' (Jakeman et al. 2006) (see Online Resources 1 for further discussion).
For each of the six evaluation methods, we review historical development and applications (Fig. 1), and summarise lessons learnt as well as remaining challenges. We also consider the importance of model checks, documentation, and transparency for enabling independent verification of the scientific and policy applications of IAMs.

Historical simulations
Although prominent in early IAM evaluation practice (Toth 1994), historical simulations or hindcasting studies are relatively uncommon. Available simulations tend to be limited in time horizon, spatial scale, and model output compared to observations. Examples of simulated quantities in IAMs compared against observations include energy demand for the USA during 1960-1990 (Manne and Richels 1992); energy use in US buildings during the period 1995-2010 (Chaturvedi et al. 2013); the Indian economy's response to rising oil prices during -2006 (Guivarch et al. 2009); and transportation energy demand in Western Europe during 1970 (van Ruijven et al. 2010). In each case, the simulations led to revised modelling assumptions to reduce divergence from observations. (For further details and examples, see Online Resources 2.)
One hindcasting study with the AIM-CGE model compared a much broader set of simulated quantities to historical data including the primary energy and electricity supply mixes at both global and regional scales (Fujimori et al. 2016). An analogous study with the GCAM model examined model fit to historical global and regional land-use allocations (Snyder et al. 2017). These global hindcasting studies are rare as IAMs represent very diverse biophysical and socioeconomic processes as well as policy signals. Simulated quantities must be sufficiently disaggregated to match this heterogeneity in underlying causal mechanisms (Schwanitz 2013). Relevant causal mechanisms (or model components) should also be structurally constant over the simulation period.
The ability of IAMs to reproduce historical observations has further limitations as an evaluation method for several reasons.
First, historical simulations cannot demonstrate models' predictive reliability in future conditions that lie outside the range of historical experience (Oreskes 1998). This is a particular issue for IAMs as the modelled system may not exhibit structural constancy between past and future.⁵ Policy decisions informed by IAM analysis may even lead to changes in the causal relationships enshrined in a model's representation of energy, land use, and economic systems (DeCarolis et al. 2012; Weyant 2009). As an early historical example, the Club of Rome's 'Limits to Growth' scenarios from the 1970s are considered widely influential in shaping changes to resource management decision-making and policies (Nye 2004).
Second, IAMs are commonly used to define least-cost mitigation pathways to serve as normative reference points for what-if problems such as how to limit warming to 2°C. Moreover, some IAMs are designed to find solutions which are inter-temporally optimal assuming perfect foresight over a 100-year timeframe. These normative applications of IAMs are not designed to reproduce how the modelled system actually behaves (Keppo et al. 2021). IAMs may also include optimisation elements to capture price formation in markets. However, real markets are imperfect and IAMs may not capture the numerous distortions through which observed prices are reflected (Trutnevyte 2016).
Third, IAMs focus on system responses to policy interventions relative to a dynamic and uncertain baseline, rather than an equilibrium (Rosen and Guenther 2016). As IAM baselines are dynamic, it is difficult to clearly separate drivers of change (e.g. economic growth, prices) from system responses (e.g. energy resource use, technology deployment, and greenhouse gas emissions).
These limitations are compounded by the practical challenge of finding observational data to describe historical energy and land-use systems in sufficient detail (Chaturvedi et al. 2013). Data challenges are more formidable in developing countries (van Ruijven et al. 2011), and prior to the 1970s when few energy data were systematically collected (Macknick 2011).
Historical simulations are therefore limited in their ability to give confidence in IAMs' representation of modelled systems (DeCarolis 2011). But they remain a useful evaluation method under certain conditions: observational data are available; external drivers of change are clearly identifiable; the structure of modelled system components is constant; and normative characteristics can be relaxed.
Fig. 1 Historical development and landmark studies in IAM evaluation methods
⁵ Hodges and Dewar (1992) set out various conditions for testing model behaviour against observations, one of which is structural constancy. DeCarolis et al. (2012) give examples of how this condition may be violated in IAMs: 'Condition 2 requires that the 'causal structure' of the system being modeled remain constant through time. Energy economies at different geographic scales ... violate this condition. National priorities, technological change, and resource availability can result in structural economic shifts that are not captured by [energy-economy optimisation] models'. Note that the term 'structural change' is used differently in economic modelling to refer to changes in the sectoral composition of the economy and its value-adding activity. This type of structural change should be endogenously simulated, and does not represent a challenge to model evaluation.

Near-term observations
The unfolding future provides near-term observations which can be compared against ex ante model projections made a decade or more ago. This is distinct from longer-term historical simulations which are run ex post.
Baseline scenarios from the IPCC Special Report on Emission Scenarios (SRES) were projected by IAMs in the late 1990s (Nakicenovic et al. 2000). These have been tracked against actual socioeconomic developments and emissions (Manning et al. 2010; Raupach et al. 2007; van Vuuren and O'Neill 2006). Through the 2000s, emission trends tended to track the upper bound of ex ante projections across a range of baseline assumptions (Peters et al. 2012). One implication is that scenario studies may inadequately capture uncertainty ranges in key drivers of change (Schwanitz and Wierling 2016). However, a recent comparison of multiple IPCC scenarios against observed fossil CO2 emissions as well as population, economic, and energy system drivers over the period 1990-2020 found that observations have tended to track middle-of-the-road IAM projections (Strandsbjerg Tristan Pedersen et al. 2021).
IAM projections of energy prices and demand have also been compared against unfolding near-term observations (Craig et al. 2002; Pilavachi et al. 2008; Smil 2000). One recent study found near-term IAM projections underestimated the contribution of energy demand reductions to sustained CO2 emission declines in 18 industrialised countries (Le Quéré et al. 2019). Near-term projections by IAMs of renewable energy technology deployment have also been found to consistently under-estimate observed growth rates, even under assumptions of stringent climate policy to limit warming to 2°C (Creutzig et al. 2017).
Divergence from near-term observations is a potential source of insight for improving modelling efforts, if modellers look back at past projections (Koomey et al. 2003). However, modelled responses to policy interventions in the near term are not necessarily good indicators of long-term trends (van Vuuren et al. 2010). IAMs are generally designed to represent long-term dynamics such as the replacement of capital stock and path dependence from increasing returns to scale. Many IAMs also run on 5- or 10-year time steps which capture only multiyear averages (see Online Resources 5). However, IAMs have been applied recently to analyse both long-term and near-term outcomes of policy processes such as the national commitments made under the Paris Agreement (Vrontisi et al. 2018).
As with historical simulations, recent historical experience is useful for comparison against ex ante IAM projections only under certain conditions. First, only modelled processes with short-term characteristics or local responses should be tested against observations to ensure comparisons are clearly interpretable. Second, IAMs should resolve processes in short time steps (1-5 years) or have structural elements responsive to short-term drivers. Third, the system response to policy interventions (e.g. renewable energy regulation) or exogenous shocks (e.g. oil crises, collapse of the Soviet Union) should be clear and isolatable.

Stylised facts
An alternative method for drawing on history to evaluate IAMs examines whether generalisable historical patterns or 'stylised facts' are reproduced in model projections. This approach derives from the economist, Kaldor, who proposed 'a stylised view of the facts' that held when observing economic growth over long time periods, ignoring business cycles or other causes of volatility (Kaldor 1957; Leimbach et al. 2015).
The IPCC Special Report on Emission Scenarios (SRES) introduced comparisons of historical patterns in energy intensity and primary energy shares with future trends simulated by IAMs (Nakicenovic et al. 2000). Continuing in this vein, Schwanitz (2013) proposed a set of stylised facts describing aggregate long-term behavioural features of the energy system and economy that are broadly applicable and expected to persist. Several studies have tested IAMs' ability to reproduce such patterns under both baseline and climate policy assumptions. Examples include developing country transitions from traditional fuels to electrification as incomes rise (van Ruijven et al. 2008); durations of technology diffusion correlating positively with extents of diffusion (Wilson et al. 2012); and primary energy consumption correlating positively with economic growth (Schwanitz 2013). In each case, model projections were broadly consistent with historical dynamics, albeit with local or spatial differences. (For further details and examples, see Online Resources 3.)
Rates of change in key system variables can also be compared between past and future to evaluate the responsiveness of actual and modelled systems. To date, this method has been applied principally to IAM projections of technology deployment. Maximum projected rates of change are broadly consistent with maximum rates observed historically, even in scenarios with stringent climate policy (Iyer et al. 2015; van Sluisveld et al. 2015). Some studies have further triangulated between past, modelled futures, and expert opinions. Compared to IAM analysis, experts tend to have more bullish expectations for projected growth in renewable technologies, but more conservative expectations for fossil and nuclear technologies (van Sluisveld et al. 2018).
Testing the ability of IAMs to reproduce stylised facts is an additional way to draw on observational data to build confidence in structural representations of long-term system dynamics. However, the use of stylised facts as an evaluation method is restricted to aggregate system-level indicators or relationships, rather than specific causal mechanisms. This makes it hard to attribute any divergence from historical patterns.
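The comparison of maximum rates of change between past and future can be sketched in a few lines. The fragment below is only an illustration under stated assumptions: the capacity series are invented, the 5-year step is assumed, and the function name is hypothetical rather than drawn from the cited studies.

```python
def annual_growth_rates(series, step_years):
    """Compound annual growth rates between consecutive points of a series."""
    return [(later / earlier) ** (1.0 / step_years) - 1.0
            for earlier, later in zip(series, series[1:])]

STEP_YEARS = 5  # assumed spacing between data points

# Hypothetical installed capacity (arbitrary units), not real data.
historical = [1.0, 2.0, 4.0, 6.0]
projected = [6.0, 9.0, 12.0, 14.0]

max_historical = max(annual_growth_rates(historical, STEP_YEARS))
max_projected = max(annual_growth_rates(projected, STEP_YEARS))

# Flag whether the fastest projected growth stays within historical precedent.
within_precedent = max_projected <= max_historical

print(f"max historical rate: {max_historical:.1%}")
print(f"max projected rate:  {max_projected:.1%}")
print(f"within historical precedent: {within_precedent}")
```

A check of this kind operates on an aggregate indicator, so a projected rate outside the historical range would flag a result for closer inspection without identifying which causal mechanism is responsible.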

Model hierarchies from simple to complex
Models such as IAMs face a tension between elaboration and elegance (Held 2005). More complex models may be more realistic, but may also be less tractable and interpretable: '… it is ironic that as we add more factors to a model, the certainty of its predictions may decrease even as our intuitive faith in the model increases' (Oreskes 2003). Whether a model has a 'good-enough' representation of the modelled system can be effectively tested through stripped-down versions designed to capture only the fundamental drivers of change (Jakeman et al. 2006).
'Model hierarchies' is a term used to describe models of the same system but spanning a range of complexity in terms of processes, dimensions, parameterisations, and spatial resolution (Held 2005; Stocker 2011). This is common in climate modelling: 'With the development of computer capacities, simpler models have not disappeared; on the contrary, a stronger emphasis has been given to the concept of a 'hierarchy of models' as the only way to provide a linkage between theoretical understanding and the complexity of realistic models' (p113, Treut et al. 2007). Climate models that are conceptually simpler or have less fine-grained resolution of processes and regions are useful for testing understanding of the modelled system. This helps interpret more complex models (Stainforth et al. 2007).⁷
In the early years of the Energy Modelling Forum (EMF), it was common practice to use simplified analytical frameworks to help interpret and understand larger-scale model results (Huntington et al. 1982; Thrall et al. 1983). This early practice of developing IAM model hierarchies has declined in more recent years. IAM development has tended inexorably towards ever finer scale resolution of ever more processes to assess ever more outcomes. For example, IAMs are now commonly used to analyse not just emission pathways but also progress towards a wide range of sustainable development goals (McCollum et al. 2018; von Stechow et al. 2015).⁸
One of the few fairly recent examples of a simple model being used to test structural uncertainty in a more complex IAM is the aptly named 'SIMPLE' model of global agriculture. This was designed to represent a minimal set of biophysical and economic relationships while still capturing the main drivers of global cropland use (Baldos and Hertel 2013). As an evaluation exercise, the SIMPLE model was tested to see if it could reproduce observed global trends in key indicators including crop land area, production, yield, and price over a historical simulation period from 1961 to 2006. The good fit to observations gave confidence in the basic model conceptualisation of land-use change dynamics that is embedded in more complex IAMs (Baldos and Hertel 2013).
There are many other opportunities for simple models of resource use, energy service demands, energy commodity trade, or technology deployment, to sit alongside complex global IAMs in model hierarchies which balance the competing merits of both elegance and elaboration. As well as enabling transparent testing of key elements of model structure, model hierarchies from simple to complex also mean appropriate models can be matched to the needs defined by particular research questions.
⁷ Stainforth et al. (2007) go on to explain: 'It seems clear that the use of large nonlinear models is necessary in climate science but in the analysis of their output we must clearly identify assumptions which imply simpler linear models would have sufficed, at least until we understand the physics of the linear relations our complex models have revealed to us.'
⁸ IAMs do have some simplified components including reduced-form climate or land-use models (Calvin and Fisher-Vanden 2017; Moore et al. 2017). These simplified model components can be tested against standalone models or sectoral models of greater complexity.

Model inter-comparisons
Model inter-comparison projects (MIPs) compare outputs, insights, and fits to observations across an ensemble of models. Like model hierarchies, MIPs are used to explore structural uncertainties in different models' representations of the same system.
Comparing results between multiple IAMs is a longstanding feature of climate mitigation analysis (Gaskins Jr. and Weyant 1993; Smith et al. 2015). The Energy Modelling Forum (EMF) started doing model comparisons in 1976, and in its early studies alternated policy-relevant MIPs with MIPs that were more diagnostic of model behaviour (Huntington et al. 1982; Sweeney and Weyant 1979). MIPs coordinated by EMF have contributed to IPCC assessments since 1995. (For further details including a historical timeline, see Online Resources 4.) To enable comparability, MIPs require carefully designed experiments that harmonise key scenario assumptions (including external drivers and constraints) and standardise the reporting of model output (Huntington et al. 1982). IAM MIPs use controlled variations of policy assumptions (Clarke et al. 2009; Tavoni et al. 2015; Wilkerson et al. 2015), technology assumptions (Riahi et al. 2015), or socioeconomic development assumptions (Kriegler et al. 2016), or explore ensemble uncertainties. Other MIPs have focused on specific regions (van der Zwaan et al. 2016) or economic sectors (Ruane et al. 2017).
MIPs are a prominent evaluation method for IAMs, generating strong tacit learning for participating modelling teams. Within-ensemble agreement in IAM MIPs is often interpreted as providing 'robust' insights. However, agreement within the ensemble should be interpreted cautiously if structural differences between models are not systematic and models share approaches or components (Parker 2013). Policy decision-making informed by IAM MIPs should be based on 'estimates from many plausible, structurally distinct models' (Millner and McDermott 2016).
Model diagnostics are a specialised application of MIPs using a standardised set of indicators or performance metrics. These indicators classify model behaviour under harmonised scenario assumptions (Bennett et al. 2013). Diagnostic indicators therefore serve to 'fingerprint' models. Although descriptive, these indicators are an enabling step towards explaining characteristic IAM performance in terms of model structure and assumptions (Wilkerson et al. 2015). Model fingerprints also enable specific IAMs to be selected to match the analytical needs of specific scientific or policy questions. (For further details of the IAMs participating in these diagnostic studies, see Online Resources 5.)
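As a rough illustration of how a diagnostic indicator can 'fingerprint' models, the sketch below computes one simple indicator, the fractional emission reduction relative to baseline under a harmonised policy scenario, and uses it to sort models into response classes. The indicator choice, the 0.5 threshold, the model names, and the emission values are all assumptions made for illustration; they are not taken from the diagnostic studies cited above.

```python
def relative_abatement(baseline_emissions, policy_emissions):
    """Fraction of baseline emissions abated in the policy scenario."""
    if baseline_emissions <= 0:
        raise ValueError("baseline emissions must be positive")
    return (baseline_emissions - policy_emissions) / baseline_emissions

def fingerprint(runs, threshold=0.5):
    """Classify each model by its abatement response (threshold is arbitrary)."""
    result = {}
    for name, (baseline, policy) in runs.items():
        indicator = relative_abatement(baseline, policy)
        label = "high response" if indicator >= threshold else "low response"
        result[name] = (indicator, label)
    return result

# Hypothetical models and emissions (GtCO2/yr) under the same harmonised
# carbon price; placeholder values, NOT results from any real IAM.
runs = {"model_A": (60.0, 15.0), "model_B": (50.0, 35.0)}
prints = fingerprint(runs)
for name, (indicator, label) in prints.items():
    print(f"{name}: relative abatement = {indicator:.2f} ({label})")
```

Because every model is run under the same scenario assumptions, differences in the indicator can be attributed to the models themselves rather than to their inputs.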

Sensitivity analysis
Sensitivity analysis is used to identify model inputs and parameterisations influential on model output, and to attribute some of the uncertainties in outputs to uncertainties in inputs. By focusing on parametric uncertainty, sensitivity analysis is useful for testing the stability of the model over possible parameter ranges (e.g. to identify non-linearities or threshold effects in model behaviour). In IAMs, sensitivity analysis also helps identify value-laden parameters, such as discount rates, that are influential on model outcomes (Beck and Krueger 2016). This guides further empirical research if uncertainties are parametric, or user-led appraisals of appropriate input ranges if uncertainties are societal (van der Sluijs et al. 2005).
Local or 'one-at-a-time' methods test output sensitivities to changes in single inputs or parameters; global methods vary multiple inputs or parameters simultaneously (Saltelli et al. 2008). Local sensitivities in IAMs are commonly reported in model evaluation studies. Influential inputs or parameters include rates of time preference (Belanger et al. 1993), rates of technological change (Sathaye and Shukla 2013), and investment costs of energy supply technologies (Koelbl et al. 2014). However, local sensitivity analyses on discrete parameters provide limited insights on how well a model represents the modelled system (Saltelli and D'Hombres 2010).
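A local, one-at-a-time analysis can be sketched as follows. The toy emissions model, its parameter names, and the ± 10% perturbation are hypothetical and chosen only to show the mechanics; they do not represent any actual IAM.

```python
def toy_model(p):
    """Toy emissions model (illustrative only, not an IAM)."""
    return p["demand"] * p["carbon_intensity"] * (1.0 - p["efficiency_gain"])

def one_at_a_time(model, baseline, delta=0.10):
    """Perturb each parameter by -delta and +delta, holding the others fixed.

    Returns {name: (relative output change when lowered, when raised)}.
    """
    reference = model(baseline)
    changes = {}
    for name, value in baseline.items():
        low = dict(baseline, **{name: value * (1.0 - delta)})
        high = dict(baseline, **{name: value * (1.0 + delta)})
        changes[name] = (
            model(low) / reference - 1.0,
            model(high) / reference - 1.0,
        )
    return changes

baseline = {"demand": 100.0, "carbon_intensity": 0.5, "efficiency_gain": 0.2}
oat = one_at_a_time(toy_model, baseline)
for name, (down, up) in oat.items():
    print(f"{name}: {down:+.1%} / {up:+.1%}")
```

In this toy case the output responds one-for-one to demand but only weakly to the efficiency parameter. Because each parameter is varied around a single baseline point, the method reveals nothing about interactions between parameters, which is the limitation noted above.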
Global sensitivity analyses are also possible in IAMs using computationally efficient techniques (Borgonovo 2010). These have been used to explore the multi-dimensional global space spanned by uncertain model inputs and parameters, both in single models (Pye et al. 2015), and in MIPs (Marangoni et al. 2017; McJeon et al. 2011). This is a promising avenue of research for opening up the interpretability of IAM results in terms of input assumptions, particularly if reported alongside model applications (Mundaca et al. 2010). (For further details and examples, see Online Resources 6.) However, care must be taken not to over-interpret results as model sensitivities may not correspond to those in real-world systems (Thompson and Smith 2019). The range over which inputs or parameters are sensitised may be informed by historical variation or defined arbitrarily (e.g. ± 10%). Formal techniques combining quantitative assessment with qualitative (expert) judgement can help inform both realistic ranges over which to vary parameters, and the interpretation of the sensitivity analysis (van der Sluijs et al. 2005).
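To make the contrast with local methods concrete, the sketch below estimates first-order variance-based sensitivity indices with a plain Monte Carlo 'pick-freeze' estimator, in which all inputs are sampled simultaneously over their ranges. This is one standard global technique, chosen here for brevity; it is not necessarily the method used in the studies cited above. The toy model, parameter names, ranges, and sample size are assumptions.

```python
import random

def first_order_indices(model, ranges, n=20000, seed=0):
    """Estimate first-order variance-based indices by pick-freeze sampling.

    ranges maps each input name to a (low, high) uniform range.
    Assumes the model output is not constant over the sampled ranges.
    """
    rng = random.Random(seed)
    names = list(ranges)

    def sample():
        return {k: rng.uniform(*ranges[k]) for k in names}

    # Two independent sample matrices A and B, evaluated once each.
    a_rows = [sample() for _ in range(n)]
    b_rows = [sample() for _ in range(n)]
    f_a = [model(row) for row in a_rows]
    f_b = [model(row) for row in b_rows]

    pooled = f_a + f_b
    mean = sum(pooled) / len(pooled)
    variance = sum((y - mean) ** 2 for y in pooled) / (len(pooled) - 1)

    indices = {}
    for name in names:
        total = 0.0
        for a, b, y_a, y_b in zip(a_rows, b_rows, f_a, f_b):
            mixed = dict(a)
            mixed[name] = b[name]  # take only this input from B, the rest from A
            total += y_b * (model(mixed) - y_a)
        indices[name] = total / n / variance
    return indices

def toy_model(x):
    """Additive toy model; 'unused' has no effect by construction."""
    return 4.0 * x["tech_cost"] + 2.0 * x["demand_growth"] + 0.0 * x["unused"]

ranges = {"tech_cost": (0.0, 1.0), "demand_growth": (0.0, 1.0), "unused": (0.0, 1.0)}
idx = first_order_indices(toy_model, ranges)
for name, value in idx.items():
    print(f"{name}: {value:.2f}")
```

For this additive toy model the analytical first-order indices are 0.8, 0.2, and 0, so the estimates can be checked directly. In a real IAM each model evaluation is costly, which is why computationally efficient sampling designs or emulators are typically needed.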

Summary of strengths and limitations
Each of the IAM evaluation methods reviewed has certain restrictions on its application, and limitations on what can be learnt about model structure and behaviour. Each IAM evaluation method also faces conceptual, methodological, or practical challenges for its use and further development. Table 1 summarises the strengths and limitations of each method (see Online Resources 8 for further detail). Applying multiple evaluation methods in concert allows the limitations of one to be compensated by the strengths of another.

A systematic multi-method approach for strengthening IAM evaluation
Each evaluation method contributes to the testing of IAMs against one or more of the four evaluation criteria: appropriateness, interpretability, credibility, and relevance. These connections between method and criteria are shown as coloured arrows in the upper panel of Fig. 2. In the text below we explain each connection.
Evaluation methods that delineate specific characteristics of models or model performance support appropriateness in matching tool with task. Diagnostic indicators help select IAMs with specific performance characteristics to answer related policy questions. A hierarchy of models allows simpler, more clearly interpretable IAMs to be used for characterising general system dynamics. IAMs with specific causal mechanisms tested against observations in historical simulations or as stylised facts are appropriate for policy analysis linked to those mechanisms. These connections from evaluation method to the appropriateness criterion are shown as red arrows in Fig. 2.
IAMs vary widely in their resolution of mitigation measures, policy options, and spatial scales (Table 2.SM6 in Forster et al. 2018). Evaluating IAMs against the appropriateness criterion helps establish the 'resolution adequacy' of a model for a given task. Several evaluation methods contribute insights on resolution adequacy. For example, near-term observations that diverge from historical model projections may reveal influential processes omitted from model representations. Model inter-comparison projects (MIPs) or diagnostic experiments involving models with different technological, process, and spatial resolutions are useful for identifying the influence of resolution on the outcomes being explored in the MIP (see Online Resources 9 for further discussion).
MIPs also contribute to the interpretability of IAMs by linking model behaviour and resulting policy insights to structural representations of energy, land use, and economic processes. Sensitivity analysis similarly links model behaviour to input assumptions and parameter values. Standardisation of performance metrics and diagnostic indicators supports interpretability by 'fingerprinting' the distinctive behaviour of each model (Wilkerson et al. 2015). As IAMs represent multiple systems and their interactions, a pragmatic approach for improving interpretability is to evaluate individual model components sequentially (Harmsen et al. 2015; van Vuuren et al. 2011). These connections from evaluation methods to the interpretability criterion are shown as purple arrows in Fig. 2.
All three evaluation methods using historical data to test IAMs' abilities to reproduce observed short- and long-term dynamics are important for establishing credibility. In model hierarchies, simpler models have the further advantage of being more accessible to third parties (Craig et al. 2002; Crout et al. 2009; Edmonds and Reilly 1983). Making models and model results more transparent, documented, and accessible opens them up to independent review by a potentially diverse range of modellers, domain experts, and policy users (DeCarolis et al. 2017; DeCarolis et al. 2012; NCC 2015). This can include peer review or expert appraisal of models' theoretical consistency in conceptualising and representing the modelled system (Oppenheimer et al. 2016; van der Sluijs et al. 2005). These connections from evaluation methods to the credibility criterion are shown as blue arrows in Fig. 2.
Evaluation methods should similarly help strengthen the relevance of IAMs for advancing understanding of emission pathways and policy responses. Over the longer term, both MIPs and sensitivity analyses help improve IAMs' ability to identify robust alternatives for achieving defined policy goals such as 2°C climate stabilisation (DeCarolis 2011; Drouet et al. 2015). Relevance for contemporary policy issues could be strengthened by testing IAM projections against near-term observations linked to clear policy levers, or by coupling IAMs to more detailed sectoral models to better capture constraints. These connections from evaluation methods to the relevance criterion are shown as green arrows in Fig. 2.
As Fig. 2 shows, the connections from methods to criteria are not unique. Multiple methods applied to single criteria may raise different issues or identify different grounds for improvement. This is entirely consistent with IAM evaluation as a continual process of learning and model development (Barlas 1996). A multi-method approach to IAM evaluation is necessary not just to ensure progress against the full set of evaluation criteria but also to compensate the limitations of some methods with the strengths of others (Table 1).
The lower panel of Fig. 2 gives an example of how this systematic IAM evaluation framework could be applied. We use renewable energy deployment for meeting climate targets as a specific IAM application. Whereas stylised facts may show long-term IAM projections conservatively in line with observed historical dynamics, near-term observations may reveal IAM underestimation of recent deployment trends. Resulting insights for IAM improvement include both relaxing long-term growth constraints and updating near-term technology cost and performance assumptions. Both insights from different methods strengthen the IAM's usefulness for the specific application. (For further details of this example, see Online Resources 8.)
The need for a systematic multi-method approach to IAM evaluation defines a major medium-to-long-term programme for IAM evaluation research and practice. Community-wide activities can play an important role in both driving and coordinating such a programme. High-value activities realisable in the near term include:
- Automating the calculation of diagnostic indicators in scenario databases: to lower the threshold for diagnostic fingerprinting of new generations of IAMs.
- Establishing community protocols for global sensitivity analyses over a wide range of variables for both individual modelling groups and as part of MIPs (Marangoni et al. 2017): to increase the frequency, feasibility, and comparability of uncertainty analysis within the IAM community, and to prioritise evaluation research efforts for reducing parametric uncertainties.
- Publishing community libraries of relevant historical data for hindcasting and stylised-facts experiments (Schwanitz 2013): also helps to minimise the variation in MIPs due to unnecessary base-year differences.
Further examples of the effort required to standardise, interpret, facilitate, and best use IAM evaluation methods are provided in Online Resources 9.

Conclusion
Evaluating IAMs helps establish the legitimacy of their use, the appropriateness and adequacy of their application, and confidence in their results among users. We have synthesised many examples, benefits, insights, and limitations of applying different evaluation methods to IAMs. With the growing prominence of global IAMs in international and national climate policy, the time is ripe for establishing a more systematic approach to IAM evaluation, combining multiple methods in an ongoing, collaborative process involving both modellers and users.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.