Introduction

The Coupled Model Intercomparison Project phase 6 (CMIP6) collects several simulations of the global climate models (GCMs) currently used to interpret past and future climate changes (Eyring et al. 2016; IPCC 2021). However, these GCMs calculate equilibrium climate sensitivity (ECS) values ranging from 1.8 to 5.7 \({^\circ }\)C (IPCC 2021). The ECS is the most important climatic parameter as it measures the long-term increase in air temperature near the surface that should result from an increase in radiative forcing of approximately 3.8 W/m\(^{2}\), which corresponds to a doubling of the atmospheric CO2 concentration from 280 ppm (which is defined as the preindustrial level) to 560 ppm. The uncertainty of the ECS is highly problematic as it indicates that the climate system is still poorly understood and modeled. Consequently, the extent of future climate change is also rather uncertain as the impact of anthropogenic CO2 emissions on the climate cannot yet be adequately quantified (cf. Knutti et al. 2017).

The uncertainty of the ECS stems from the fact that various climate feedback mechanisms—in particular water vapor and cloud cover—are still too little known and modeled, as already found 60 years ago by Möller (1963). In the absence of climate feedback mechanisms, the Stefan–Boltzmann law for blackbodies predicts that a doubling of the atmospheric CO2 concentration could cause an increase in global surface temperature of about 1 \({^\circ }\)C. Therefore, only strong positive climate feedbacks could significantly increase the ECS above such a value, but their existence is still debated.
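As a rough check of the no-feedback figure quoted above, the following sketch linearizes the Stefan–Boltzmann law around an effective emission temperature. The value of 255 K and the function name `no_feedback_warming` are assumptions made for illustration only; the forcing of 3.8 W/m\(^{2}\) is the one quoted in the text.

```python
# Minimal sketch: no-feedback warming from a linearized Stefan-Boltzmann law.
# Assumption (not from the text): the Earth radiates to space like a blackbody
# at an effective emission temperature of about 255 K.

SIGMA_SB = 5.670374e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def no_feedback_warming(forcing_wm2, t_emission_k=255.0):
    """Return dT = dF / (4 * sigma * T^3), the linearized blackbody response."""
    planck_response = 4.0 * SIGMA_SB * t_emission_k ** 3  # W m^-2 K^-1
    return forcing_wm2 / planck_response


# Forcing for a CO2 doubling as quoted in the text (about 3.8 W/m^2).
delta_t = no_feedback_warming(3.8)
print(f"No-feedback warming: {delta_t:.2f} K")
```

Under these assumptions the result is close to 1 \({^\circ }\)C, in line with the value stated above.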

Constraining the ECS value is an urgent task of climatology. In fact, at least two-thirds of the CMIP6 GCMs could be severely defective. For example, if the models are grouped into low (\(1.5<ECS\le 3.0\) \({^\circ }\)C), medium (\(3.0<ECS\le 4.5\) \({^\circ }\)C) and high (\(4.5<ECS\le 6.0\) \({^\circ }\)C) sensitivity values and, say, the actual ECS is less than \(3.0\) \({^\circ }\)C, the GCMs with \(ECS>3\) \({^\circ }\)C should be ignored. Therefore, it is very important that detailed evaluations of the models are carried out in order to determine if, where and how the models should improve, both on a global scale (as proposed, for example, in this work) and on regional scales, as done in numerous other studies (e.g.: Heo et al. 2014; Seo et al. 2018, and many others).

Constraining the ECS also has important policy implications because the expected warming for the 21st century depends on the value of the model’s ECS (Grose et al. 2017; Scafetta 2022): the higher the ECS, the greater the expected warming due to GHG emissions. For example, Huntingford et al. (2020) found that the wide ECS range of the CMIP6 GCMs implies that at thermal equilibrium the global surface temperature could warm by between 1.0 and 3.3 \({^\circ }\)C above the pre-industrial period (1850–1900) even if anthropogenic emissions ceased today.

Scientists have already wondered whether a strong response to greenhouse gases could be realistic (Voosen 2019). Indeed, high-ECS CMIP6 models have already been found to perform poorly (e.g.: Ribes et al. 2021; Scafetta 2022; Tokarska et al. 2020; Zhu et al. 2020), while the medium and even the low ECS models are being carefully evaluated.

For example, Nijsse et al. (2020) derived that the most likely ECS interval should be 1.9–3.4 \({^\circ }\)C while alternative studies, often empirically based, have suggested that the actual ECS could be even lower, probably between 1 and 2.5 \({^\circ }\)C (e.g.: Lewis and Curry 2018; Lindzen and Choi 2011; Scafetta 2013; Stefani 2021; Wijngaarden and Happer 2020). Most GCMs seem to overestimate the observed surface warming since 1980 (Scafetta 2021b, 2022) and also that observed in the global (McKitrick and Christy 2020) and tropical troposphere (Mitchell et al. 2020), in particular at its top (200–300 hPa) where the CMIP6 GCMs predict an unobserved hotspot (McKitrick and Christy 2018). A similar situation also occurred with the previous CMIP3 and CMIP5 GCMs (Fu et al. 2011; Scafetta 2012a, 2013). Actually, as Knutti et al. (2017) acknowledged, there is a dichotomy between the observed and modeled ECS as GCMs tend to favor sensitivity values at the top of the probable range, while several studies based on instrumentally recorded warming and some from paleoclimate favor values in the lower part of the range. Therefore, not only the models with high ECS, but also those with medium ECS should be and are being seriously questioned.

Scafetta (2021a) and Scafetta (2022) showed that the performance of the GCMs improves as their ECS decreases and, in any case, the low ECS GCMs appear to be the best performing models. However, even low-ECS GCMs need further evaluation because biases in some regions (e.g. on land) could be offset by opposite biases in other regions (e.g. on ocean). Furthermore, serious uncertainties remain in the solar forcing and in the temperature records themselves (Connolly et al. 2021; D’Aleo 2016; Watts 2022). These uncertainties question the warming trend reported by the available climate records and, directly or indirectly, the models themselves. Finally, climate systems seem to be regulated by various natural oscillations from the decadal to the millennial scales, which the GCMs are unable to reproduce, the presence of which would also imply low ECS values, probably between 1 and 2 \({^\circ }\)C (Scafetta 2012a, 2013, 2021c).

Focusing on the performance of the CMIP6 GCMs, Scafetta (2022) proposed that the probable ECS range could be constrained by a statistical investigation to find which GCM group (low, medium or high ECS) best reproduces the observed global surface warming between 1980–1990 and 2011–2021 as reported by ERA5-T2m (Hersbach et al. 2020; Simmons et al. 2021). The period 1980–2021 was chosen because it is optimally covered by all available climatic temperature records. Scafetta (2022) analyzed the “average” simulations provided by the Koninklijk Nederlands Meteorologisch Instituut (KNMI) Climate Explorer (Oldenborgh 2020) of 38 CMIP6 GCMs with three shared socioeconomic pathway (SSP) emission scenarios, which also allowed for a partial evaluation of the internal variability of the models. The low ECS GCM group was found to be perfectly compatible, at least on a global scale, with the 2011–2021 warming relative to the 1980–1990 period. Conversely, both GCM groups with medium and high ECS showed too high warming trends.

A possible objection to the analysis proposed in Scafetta (2022) is that temperature records should be compared with actual members of the CMIP6 GCM ensemble instead of their ensemble averages because the unforced internal variability of the models produces different results due to uncertainties in the initial conditions as well as in the internal parameters of the models. This problem will be addressed in this paper considering that:

  1. physical models, including the GCMs, should be accurate and precise (see Appendix 2);

  2. there are still open issues regarding the reliability of the available global surface temperature records.

In fact, theoretical models must reproduce observations within a reasonably small error. In our case it should be evident that the poor precision of a GCM cannot be used as a pretext to justify its poor accuracy. For example, a low-precision model could produce a very wide range of different hindcasts due to its internal variability. In this situation, even if some of its hindcasts fit the observations, the result should still be considered unsatisfactory if the mean of the GCM set diverges too much from the actual data. Similarly, if an ECS GCM group produces a set of hindcasts that too sparsely encompass the observations, the ECS values that characterize that group should be considered unrealistic even though some of the models in the same group might perform better than others. In general, the accuracy, precision and ECS category of the GCMs must be evaluated simultaneously.

Furthermore, surface-based temperature records appear to exhibit non-climatic warming biases due to poorly corrected urban heat island effects or other local surface phenomena (e.g.: Connolly et al. 2021; D’Aleo 2016; Scafetta 2021a; Watts 2022). To account for this problem, the satellite temperature measurements of the lower troposphere obtained with microwave sounding units (MSU) and proposed by the University of Alabama in Huntsville (UAH-MSU-lt v6) (Spencer et al. 2017) will also be analyzed.

UAH-MSU-lt is the temperature record that features the lowest global warming trend (about 0.13 °C/decade) from 1980 to 2021 among all available global temperature records. According to GCM simulations, the troposphere is expected to warm up faster than the surface (up to a factor of 3) because greenhouse gases are expected to warm the atmosphere first (Mitchell et al. 2020). Consequently, the global warming trend of the troposphere estimated from satellite measurements should be further reduced to simulate the global warming trend at the surface. Here, these corrections are ignored and UAH-MSU-lt is assumed to represent the possible lowest limit for the global warming trend of the surface. Therefore, comparison with this satellite temperature record could help assess the presence of non-climatic warming bias in the surface temperature records, particularly on land where large contaminated areas appear to exist (cf. Scafetta and Ouyang 2019; Scafetta 2021a).

Indeed, preliminary analyses have shown that the land seems to have warmed too much and too quickly compared to the ocean (Scafetta 2021a). Connolly et al. (2021) used data from rural stations only and showed that the warming of the Northern Hemisphere’s land surface should be significantly lower than that reported by the available surface-based temperature records based on both rural and urban stations. Watts (2022) examined the quality of the U.S. temperature stations from which official temperature records are obtained and concluded that approximately 96% of them could not meet the National Oceanic and Atmospheric Administration (NOAA) requirements for “acceptable placement” because they could be significantly contaminated by different heat sources. In general, the surface temperature records and the homogenization algorithms used to adjust them present several problems that may have exaggerated the warming. Thus, the integrity of the available global surface temperature records and, therefore, the ability to correctly determine the global warming trend of the 20th and 21st centuries should be questioned as well (Connolly et al. 2021; D’Aleo 2016).

There is a different MSU record (Mears and Wentz 2016), which shows a warming trend that is more compatible with those presented by the surface-based temperature records. However, this alternative satellite-based record is not analyzed here because it would overlap the results of the surface-based temperature records. In any case, adopting it in the present study may not be optimal because it only covers the latitude range from 70.0\({^\circ }\) S to 82.5\({^\circ }\) N and because it appears to perform worse than UAH-MSU-lt that better agrees with the radiosonde temperature database (Christy et al. 2018).

Here, we significantly expand the analysis presented by Scafetta (2022) by testing 143 GCM average simulations and all 688 GCM member simulations available on the KNMI website against four surface-based global temperature records (ERA5-T2m, HadCRUT5, GISTEMP v4, NOAAGlobalTemp v5) and the UAH-MSU-lt v6 satellite-based record. Since we wish to narrow the ECS range, we again group the models into three classes corresponding to low, medium and high ECS values, as proposed in Scafetta (2022). ECS GCM groups that produce systematically biased trends (e.g. too hot or too cold relative to the observed temperatures) should be questioned and not used for policy even though some simulations may appear to reproduce the observations. Finally, we compare the GCM hindcasts with observed land and ocean warming values to determine whether the surface-based records could be regionally biased and whether the ECS should be further constrained towards lower values.

Data and methods

We analyze the monthly reanalysis field near-surface air temperature (ERA5-T2m) record from 1980 to 2021 (Hersbach et al. 2020; Simmons et al. 2021). We repeat the same analysis using the HadCRUT5 (infilled data) (Morice et al. 2021), GISTEMP v4 (Lenssen et al. 2019), and NOAAGlobalTemp v5 (Zhang et al. 2019) global surface temperature records. Some of these records, however, may not cover the entire surface of the globe from 1980 to 2021. There are other global surface temperature records such as those proposed by the Japanese Meteorological Agency (JMA, Ishihara 2006) and by the Berkeley Earth group (BE, Rohde and Hausfather 2020), which will also be discussed briefly. For completeness, as explained in the Introduction, we add a comparison with the UAH-MSU-lt v6 temperature measurements (Spencer et al. 2017).

We also analyze all 143 “average” surface air temperature (tas) records and all 688 ensemble member records from 38 different CMIP6 GCMs downloadable from KNMI Climate Explorer. These simulations were produced using historical forcings (1850–2014) further extended up to 2100 with four different SSP scenarios: SSP1-2.6 (low GHG emissions), SSP2-4.5 (intermediate GHG emissions), SSP3-7.0 (high GHG emissions) and SSP5-8.5 (very high GHG emissions) (IPCC 2021). These four scenarios are nearly indistinguishable until 2021. Thus, from 1850 to 2021, the four simulation sets can be considered independent assessments of the same models under nearly identical forcing conditions, which also helps to assess in first approximation the internal variability of the models.

The 1980–2021 period was chosen to better evaluate the performance of the CMIP6 GCMs. This period is optimally covered by numerous climatic temperature records including those based on satellite measurements that are alternative to those based on land and oceanic measurements that could be affected by various non-climatic biases, which are difficult to eliminate (D’Aleo 2016; Watts 2022). In fact, going back in time from 1980 to 1850, the temperature records are affected by ever-larger uncertainties and uncovered areas, which makes evaluating the CMIP6 models even more difficult. A possible advantage of the present study is that previous studies evaluating the performance of the CMIP6 models attempted to constrain the ECS by comparing GCM simulations only with surface climate records from 1850 to 2020 (Ribes et al. 2021) or from 1981 to 2014 (Tokarska et al. 2020), or even using uncertain paleoclimate records (Zhu et al. 2020) and concluded that only high-ECS models (\(ECS>4.5\) \({^\circ }\)C) could be excluded. However, there are open questions as to whether cooling adjustments applied to different Earth surface temperature records from 1850 to 1980 are justified (D’Aleo 2016) and whether more recent periods of the same climate records are affected by non-climatic warming biases (Connolly et al. 2021; Scafetta 2021a). These biases could have exaggerated the 20th century warming trend and incorrectly provided support for the medium-ECS GCMs.

The 1980–2021 warming for each record is calculated by evaluating the 2011–2021 average temperature anomaly relative to the 1980–1990 period. 11-year intervals are used to bypass biases due to interannual fluctuations such as those related to ENSO and the 11-year solar cycle. Then, we apply standard statistical tests to decide if and how the observed warming values for each of the temperature records are reproduced by the three ECS GCM groups.
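The warming metric described above can be sketched as follows. This is a minimal illustration, not the code used for the study: the function name `period_warming` and the synthetic linear series are hypothetical, and a real application would read the monthly anomalies of a temperature record or GCM simulation instead.

```python
import numpy as np


def period_warming(years, anomalies, base=(1980, 1990), recent=(2011, 2021)):
    """Mean anomaly over `recent` minus mean anomaly over `base` (inclusive years)."""
    years = np.asarray(years)
    anomalies = np.asarray(anomalies, dtype=float)
    in_base = (years >= base[0]) & (years <= base[1])
    in_recent = (years >= recent[0]) & (years <= recent[1])
    return anomalies[in_recent].mean() - anomalies[in_base].mean()


# Synthetic monthly series (hypothetical): a pure linear trend of 0.18 C/decade.
months = np.arange(1980 * 12, 2022 * 12)  # month index, Jan 1980 .. Dec 2021
years = months // 12                      # calendar year of each month
time = months / 12.0                      # decimal years
series = 0.018 * (time - 1980.0)          # anomaly in C

warming = period_warming(years, series)
print(f"Warming 1980-1990 to 2011-2021: {warming:.3f} C")
```

For a purely linear series the metric equals the trend multiplied by the 31 years separating the centers of the two 11-year windows, which is a convenient sanity check for any implementation.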

The ERA5-T2m global surface temperature average warming from 1980–1990 to 2011–2021 is estimated to be:

$$\begin{aligned} \Delta T_{mean}=0.578\,{^\circ }{\text {C}}. \end{aligned}$$
(1)

The other temperature records give: HadCRUT5 (infilled data), \(\Delta T_{mean}=0.581\) \({^\circ }\)C; GISTEMP v4, \(\Delta T_{mean}=0.570\) \({^\circ }\)C; NOAAGlobalTemp v5, \(\Delta T_{mean}=0.523\) \({^\circ }\)C. HadCRUT5 (infilled data), GISTEMP, and ERA5-T2m give nearly identical warmings. We also observe that HadCRUT5 (non-infilled data) gives 0.549 \({^\circ }\)C and HadCRUT4 (Morice et al. 2012) gives 0.521 \({^\circ }\)C. BE gives \(\Delta T_{mean}=0.591\) \({^\circ }\)C and JMA gives \(\Delta T_{mean}=0.557\) \({^\circ }\)C, which do not differ much from the above estimates. Thus, according to the available surface-based global temperature records, the global surface warming from 1980–1990 to 2011–2021 has been between 0.52 and 0.59 \({^\circ }\)C, or approximately between 0.50 and 0.60 \({^\circ }\)C, with an average of 0.56 \({^\circ }\)C. In contrast, the satellite-based UAH-MSU-lt v6 temperature record gives \(\Delta T_{mean}=0.402\) \({^\circ }\)C, suggesting that the actual warming from 1980–1990 to 2011–2021 may have been even less than 0.40 \({^\circ }\)C because, as explained in the Introduction, according to the GCMs the temperature trend of the troposphere should be scaled down to make it compatible with the surface warming trend.

For the temperature records, since 1980 the error of the average over an 11-year period can be estimated to be very small, \(\bar{\sigma }_{95\%}\approx 0.01\) \({^\circ }\)C (see Appendix 1), which represents about 2% of the warming from 1980–1990 to 2011–2021, and is less than the differences between the various temperature records.

As explained in Sect. 1, the proposed analysis groups the CMIP6 GCMs into three subsets characterized by low (\(1.5<ECS\le 3.0\) \({^\circ }\)C), medium (\(3.0<ECS\le 4.5\) \({^\circ }\)C) and high (\(4.5<ECS\le 6.0\) \({^\circ }\)C) sensitivity values. This choice is based on the following heuristic considerations. The IPCC (2013), which used the CMIP5 GCMs, estimated a “likely” ECS range of 1.5–4.5 \({^\circ }\)C. This range can be heuristically divided into two equal parts: \(1.5<ECS\le 3.0\) \({^\circ }\)C and \(3.0<ECS\le 4.5\) \({^\circ }\)C. However, the IPCC (2021) adopted the CMIP6 GCMs that extended the ECS range up to 6 \({^\circ }\)C so that an equally large third range, \(4.5<ECS\le 6.0\) \({^\circ }\)C, could be added to the previous two. Zelinka et al. (2020) explained that the increased climate sensitivity in the CMIP6 models was due to stronger positive cloud feedbacks from decreased extratropical cloud cover and albedo, which, however, might be questionable.

Therefore, the interval \(1.5<ECS\le 3.0\) \({^\circ }\)C collects the GCMs with ECS values most consistent with different empirical results, as discussed in Sect. 1; the interval \(3.0<ECS\le 4.5\) \({^\circ }\)C collects the other GCMs that also the IPCC (2013) would have considered acceptable; finally, the interval \(4.5<ECS\le 6.0\) \({^\circ }\)C collects the GCMs included in the IPCC (2021) but which in 2013 the IPCC itself considered to predict an unlikely high ECS.
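The grouping rule can be written as a small helper. This is only a sketch: the function name `ecs_group` and the two hypothetical model entries are illustrative, while the ECS of CAMS-CSM1-0 (2.29 \({^\circ }\)C) is the value quoted elsewhere in this paper.

```python
def ecs_group(ecs):
    """Classify an ECS value (in C) into the low/medium/high bins used here."""
    if 1.5 < ecs <= 3.0:
        return "low"
    if 3.0 < ecs <= 4.5:
        return "medium"
    if 4.5 < ecs <= 6.0:
        return "high"
    return "outside"  # not covered by the three intervals


# CAMS-CSM1-0 uses the ECS quoted in this paper; the other entries are hypothetical.
examples = {"CAMS-CSM1-0": 2.29, "hypothetical-A": 3.0, "hypothetical-B": 4.6}
groups = {name: ecs_group(value) for name, value in examples.items()}
print(groups)
```

Note that the intervals are open on the left and closed on the right, so a model with an ECS of exactly 3.0 \({^\circ }\)C belongs to the low group.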

Analysis of the CMIP6 GCM simulations

Figure 1 shows the GCM simulations (left) and their ensemble \(mean\pm 1\sigma\) range (right) grouped according to the three GCM ECS sets with respect to the ERA5-T2m global surface temperature record (black, 12-month moving average). All records are temperature anomalies relative to the period 1980–1990. Figure 2 shows a similar comparison with respect to the HadCRUT5, GISTEMP v4, NOAAGlobalTemp v5 and UAH-MSU-lt v6 temperature records.

Fig. 1

(Left) GCM global surface temperature simulations (colored curves) and (right) \(\pm 1\sigma\) GCM global surface temperature ensembles (yellow area) versus the ERA5-T2m record (black, 12-month moving average)

Fig. 2

GCM global surface temperature ensembles (yellow area, \(\pm 1\sigma\)) versus HadCRUT5 (infilled data), GISTEMP v4, NOAAGlobalTemp v5, and UAH-MSU-lt v6 temperature records (black, 12-month moving average)

Both figures show that as the ECS increases, the global surface warming predicted by the models also increases. However, only the low-ECS GCM group can be considered perfectly consistent with the surface-based global temperature records because it encloses them well within the \(\pm 1\) \(\sigma\) GCM range (yellow area).

Figures 1 and 2 also show that, compared to the satellite record, even the GCM group with low ECS seems to overestimate the observed warming. In fact, even for the low ECS GCM group from 2011 to 2021 the UAH-MSU-lt record is not well enclosed within the \(\pm 1\sigma\) model ensemble (yellow) area although a better agreement is found in the period 2015–2020. The latter was characterized by the significant El Niño warming events of 2015–2016 and 2020 (Appendix 1, Fig. 10). Therefore, the 2015–2020 warming relative to the period 2000–2014 could also be temporary (Scafetta 2021c) and not related to the warming hindcasted by the models because it is clearly due to natural climatic fluctuations while the average warming produced by the models is due to anthropogenic forcing. From 2015 to 2022, in fact, a slight cooling trend is observed. From 2000 to 2014 the UAH-MSU-lt v6 record also clearly shows the so-called global warming “hiatus” or “pause” (IPCC 2013). This decade-long lack of warming began to seriously question the GCMs, and various statistical solutions were proposed to circumvent the problem by referring to the fluctuations of the unforced internal variability of the models (e.g. Meehl et al. 2011). Figures 1 and 2 also show that, at present, the “pause” appears missing or attenuated in the latest versions of the surface-based global temperature records.

Analysis of the GCM average simulations

Scafetta (2022) analyzed the average simulations of 38 GCMs using the historical + SSP2-4.5, SSP3-7.0, and SSP5-8.5 radiative forcing scenarios up to June 2021; the warming values for each model were collected in the table published there. Figure 3 graphically shows the results of the same analysis, which was updated to the whole year 2021 and also included the SSP1-2.6 simulations, compared to the temperature observations (green vertical lines). In total, 143 average records are analyzed. For each ECS GCM group the statistics provide (see Table 1):

  • High-ECS GCMs (51 records): \(\Delta T_{mean}=0.94\pm 0.22\) \({^\circ }\)C;

  • Medium-ECS GCMs (43 records): \(\Delta T_{mean}=0.79\pm 0.10\) \({^\circ }\)C;

  • Low-ECS GCMs (49 records): \(\Delta T_{mean}=0.59\pm 0.10\) \({^\circ }\)C.

The result confirms that the GCM group with low ECS is perfectly compatible with the observed warming (Eq. 1) within the \(\pm 1\) \(\sigma\) range. In contrast, both GCM groups with medium and high ECS show warming biases. Moreover, as Scafetta (2022) already observed, Fig. 3 also shows that none of the medium and high ECS models predict an average warming of less than 0.6 \({^\circ }\)C, which is above the warming reported by all global surface temperature records. This result suggests that models with \(ECS>3\) \({^\circ }\)C should be questioned at the 95% confidence level. Thus, by considering only the GCM ensemble averages for the four SSPs, the real ECS should be equal to or lower than 3 \({^\circ }\)C.
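The \(\pm 1\sigma\) compatibility check can be reproduced from the group statistics listed above and the ERA5-T2m warming of Eq. 1. The sketch below is only an illustration of that comparison; the helper name `within_one_sigma` is hypothetical.

```python
# Group statistics (mean, 1-sigma) in C, as listed above for the GCM average simulations.
group_stats = {"low": (0.59, 0.10), "medium": (0.79, 0.10), "high": (0.94, 0.22)}
observed = 0.578  # ERA5-T2m warming from Eq. 1, in C


def within_one_sigma(value, mean, sigma):
    """True if `value` lies inside [mean - sigma, mean + sigma]."""
    return abs(value - mean) <= sigma


compatible = {
    group: within_one_sigma(observed, mean, sigma)
    for group, (mean, sigma) in group_stats.items()
}
print(compatible)
```

With these numbers only the low-ECS group encloses the observed warming within its \(\pm 1\sigma\) range.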

Table 1 Warming from 1980–1990 to 2011–2021 for average simulations of 38 GCMs using historical + SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 forcings
Fig. 3

Average temperature changes (2011–2021 minus 1980–1990) hindcasted by the mean simulations of 38 CMIP6 GCMs. The blue vertical lines represent the temperature change measured by HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5, and UAH-MSU-lt v6 temperature records, respectively, with their 95% confidence interval. The three yellow boxes represent the \(\pm 1\sigma\) dispersion of the data referring to the low (\(1.5<ECS\le 3.0\) \({^\circ }\)C), medium (\(3.0<ECS\le 4.5\) \({^\circ }\)C) and high (\(4.5<ECS\le 6.0\) \({^\circ }\)C) ECS GCMs. See Table 1

However, Fig. 3 also shows that if the UAH-MSU-lt record better reproduces the actual 2011–2021 warming, the GCM group with low ECS would also be too hot because, out of 49 GCM ensemble averages with low ECS, 48 cases (98%) are warmer than 0.40 \({^\circ }\)C. The GCM that best agrees with the satellite record is CAMS-CSM1-0 whose ECS is 2.29 \({^\circ }\)C.

Analysis of the full range of the GCM ensemble members

Figure 4 shows in four panels the temperature variations (2011–2021 minus 1980–1990) of the 688 simulations of GCM ensemble members available per forcing set (Hist+SSP1-2.6, Hist+SSP2-4.5, Hist+SSP3-7.0 and Hist+SSP5-8.5; red dots) against the five temperature records (vertical lines). The figure visually confirms that the vast majority of ensemble member simulations produced by the GCM groups with medium and high ECS run too hot relative to all five temperature records.

Fig. 4

Temperature change (2011–2021 minus 1980–1990) hindcasted by the full range of model simulations (red dots). The vertical lines represent the global surface warming from 1980–1990 to 2011–2021 reported by HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5, and UAH-MSU-lt v6 temperature records, respectively, with their 95% confidence interval. Each of the four panels represents a set of forcings: (a) Hist+SSP1-2.6, (b) Hist+SSP2-4.5, (c) Hist+SSP3-7.0 and (d) Hist+SSP5-8.5

To examine how observed warming values are placed within the distributions of possible GCM hindcasts for each of the three ECS groups, we count how many member simulations record temperatures colder or warmer than each of the five temperature records. Table 2 reports the results.

Table 2 Number of single GCM simulations reporting mean temperature changes (2011–2021 minus 1980–1990) lower or higher than HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5 and UAH-MSU-lt v6, respectively

The analysis confirms that the low-ECS models produce results that well enclose the 2011–2021 average temperatures obtained using the surface temperature records, which always fall within the statistical interval \(\pm 1\sigma\) (corresponding to the 16–84% probability interval) of the distribution of the GCM hindcasts. In contrast, 94–100% and 97–100% of the hindcasts produced by the GCMs with medium and high ECS, respectively, are warmer than all five temperature records. Therefore, also considering the full range of the available CMIP6 GCM simulations, the GCMs with medium and high ECS run too hot. Thus, the actual ECS should be equal to or lower than 3 \({^\circ }\)C.

However, 96% of the GCM simulations from the low-ECS GCM group are warmer and only 4% cooler than the lower troposphere temperature record. Thus, once again, we find that if UAH-MSU-lt better reproduces the actual global warming from 1980–1990 to 2011–2021, the vast majority of the low-ECS GCM ensemble members would also be found to run too hot.
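The counting behind Table 2 amounts to comparing each member hindcast with an observed warming value. A minimal sketch follows, assuming the member warmings are available as a plain list; the numbers in `members` are hypothetical and are not data from this study.

```python
import numpy as np


def warmer_colder_counts(hindcasts, observed):
    """Return (n_colder, n_warmer, fraction_warmer) relative to `observed`."""
    hindcasts = np.asarray(hindcasts, dtype=float)
    n_warmer = int((hindcasts > observed).sum())
    n_colder = int((hindcasts < observed).sum())
    return n_colder, n_warmer, n_warmer / hindcasts.size


# Hypothetical member warmings (C) for one ECS group; not data from this study.
members = [0.45, 0.52, 0.61, 0.66, 0.70, 0.58, 0.49, 0.73]
n_colder, n_warmer, frac_warmer = warmer_colder_counts(members, observed=0.578)
print(n_colder, n_warmer, f"{100 * frac_warmer:.1f}% warmer")
```

An observed value is well enclosed by a group when the fraction of warmer members stays roughly between 16% and 84%, which corresponds to the \(\pm 1\sigma\) interval used in the text.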

Statistical modeling of the GCM unforced internal variability

Figure 5 shows the boxplots relating to the simulations shown in Fig. 4 for each model. Again, the GCM group with low ECS is best centered around the surface-based observations indicated by the horizontal blue lines while the GCM groups with medium and high ECS exhibit systematic warming bias except for very few models. However, the dispersion of the boxplots varies greatly among the GCMs because the models are not physically equivalent to each other and, furthermore, probably because of the different number of simulations available for each model.

Fig. 5

Boxplots of the CMIP6 ensemble members depicted in Fig. 4 for each CMIP6 GCM; # represents the number of the available simulations for each GCM. The horizontal blue lines represent the global surface warming from 1980–1990 to 2011–2021 reported by HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5, and UAH-MSU-lt v6 temperature records, respectively. The whiskers extend from each end of the box for a range up to 1.5 times the interquartile range

In fact, the GCMs are represented unevenly in the KNMI collection because the number of simulations available for each GCM varies from 3 to 100 among the models: see Fig. 5. Therefore, the statistics discussed in Sect. 3.2 may be skewed towards models with a larger number of available simulations because they will weigh more in the statistical test reported in Table 3. This problem could be solved by using a Monte Carlo strategy to simulate the spread of GCM hindcasts that could be associated with unforced internal variability. This exercise is proposed below.
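The weighting issue can be illustrated with a toy example. The two ensembles below are hypothetical and deliberately extreme; the sketch only contrasts a pooled statistic, where every member counts once, with an equal-weight statistic, where every model counts once. It is not the procedure used to produce Table 3.

```python
import numpy as np

# Hypothetical warmings (C) for two models with very different ensemble sizes.
ensembles = {
    "model-A": np.full(50, 0.80),             # 50 members, all warmer than observed
    "model-B": np.array([0.50, 0.55, 0.62]),  # 3 members, mostly cooler
}
observed = 0.578

# Pooled statistic: every member counts once, so model-A dominates.
pooled = np.concatenate(list(ensembles.values()))
pooled_frac_warmer = float((pooled > observed).mean())

# Equal-weight statistic: one fraction per model, then averaged across models.
per_model = [float((values > observed).mean()) for values in ensembles.values()]
equal_weight_frac_warmer = float(np.mean(per_model))

print(f"pooled: {pooled_frac_warmer:.2f}, equal weight: {equal_weight_frac_warmer:.2f}")
```

In this toy case the pooled fraction is dominated by the larger ensemble, which is the kind of skew that assigning the same spread to every model mean is meant to avoid.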

Table 3 Probabilities \(P_{\Delta T<GCMs}\) and \(P_{\Delta T>GCMs}\) that the warming hindcast from 1980–1990 to 2011–2021 for each ECS GCM ensemble is warmer or colder, respectively, than that reported by HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5 and UAH-MSU-lt v6: see Fig. 6

It can be assumed that each GCM produces simulations distributed around a mean \(\mu _{m}\) with a given standard deviation \(\sigma _{m}\) characterizing its internal variability. We note that \(\sigma _{m}\) should be assumed constant for all GCM averages because it could be interpreted as a “precision” requirement for GCMs. Indeed, GCM hindcasts should always agree with observations within an acceptable statistical uncertainty.

We propose three different options for \(\sigma _{m}\) covering approximately the ranges of the GCM boxplots shown in Fig. 5: \(\sigma _{H}\approx 0.05\) \({^\circ }\)C (high precision), \(\sigma _{M}\approx 0.10\) \({^\circ }\)C (medium precision), and \(\sigma _{L}\approx 0.15\) \({^\circ }\)C (low precision).

Figure 5 suggests that the high-precision option (\(\sigma _{H}\approx 0.05\) \({^\circ }\)C) could be satisfied by most GCMs; it requires the model mean to be within \(\pm 0.1 \,{^\circ }\)C (95% confidence interval) of the actual warming value. The 95% confidence range becomes \(\pm 0.2 \,{^\circ }\)C for the medium-precision requirement (\(\sigma _{M}\approx 0.10\,{^\circ }\)C) and \(\pm 0.3\,{^\circ }\)C for the low-precision option (\(\sigma _{L}\approx 0.15\,{^\circ }\)C).
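The correspondence between the three \(\sigma\) options and their 95% ranges follows from the two-sided Gaussian quantile, which is about 1.96 and is rounded to 2 in the text. A short sketch using only the Python standard library (the variable names are illustrative):

```python
from statistics import NormalDist

# Two-sided 95% quantile of the standard normal distribution (about 1.96).
z95 = NormalDist().inv_cdf(0.975)

precision_options = {"high": 0.05, "medium": 0.10, "low": 0.15}  # sigma in C
half_widths = {name: z95 * sigma for name, sigma in precision_options.items()}

for name, width in half_widths.items():
    print(f"{name} precision: +/- {width:.2f} C at 95% confidence")
```

Rounded to one decimal, the half-widths are 0.1, 0.2 and 0.3 \({^\circ }\)C, matching the ranges quoted above.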

Appendix 2 shows that the interval \(\pm 0.1\,{^\circ }\)C (95% confidence), which corresponds to the high precision option, \(\sigma _{H}\approx 0.05\,{^\circ }\)C, should be the preferred choice for the acceptable uncertainty related to the internal variability that should be requested for the GCMs because it could be derived from the variability of the temperature records themselves.

Figure 5 also shows that the low-precision option \(\sigma _{L}\approx 0.15\,{^\circ }\)C is only consistent with the EC-Earth3 GCM. The usefulness of this model should be questioned because it hindcasts 2011–2021 global surface warming values ranging between 0.5 and 1.2 \({^\circ }\)C with an average of 0.82 \({^\circ }\)C. This means that EC-Earth3 is both inaccurate and imprecise in hindcasting the global surface warming from 1980 to 2021.

Figure 6 shows the combined probability density functions (PDF) and the related boxplots derived from all the GCM means reported in Fig. 3 and Table 1 with the three precision requirements for the three ECS GCM groups compared to the warming levels obtained with the adopted five temperature records. The complementary Gaussian error function was used to evaluate the relative statistical position of the five actual warming values within each probability density function.

Fig. 6

a–c GCM probability density functions relative to three model precision requirements: \(\sigma =0.05\,{^\circ }\)C, \(\sigma =0.10\,{^\circ }\)C and \(\sigma =0.15\,{^\circ }\)C, which approximately correspond to \(\sigma _{95\%}=0.10\,{^\circ }\)C, \(\sigma _{95\%}=0.20\,{^\circ }\)C and \(\sigma _{95\%}=0.30\,{^\circ }\)C. The green vertical lines represent the global surface warming from 1980–1990 to 2011–2021 reported by HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobalTemp v5 and UAH-MSU-lt v6 temperature records. d–f Boxplots of the probability density functions depicted in panels a–c, respectively, using double whiskers and boxes indicating the following probability ranges: 2.5%, 16%, 25%, 50%, 75%, 84% and 97.5%

For each model mean \(\mu _{m}\) and precision \(\sigma\), the probability \(P_{m}\) that the GCM hindcast is larger than the measured warming \(\Delta T\) is

$$\begin{aligned} P_{m}=\frac{1}{\sigma \sqrt{2\pi }}\intop _{\Delta T}^{\infty }e^{-\frac{(t-\mu _{m})^{2}}{2\sigma ^{2}}}\,dt=\frac{1}{2}\,\mathrm {erfc}\left( \frac{\Delta T-\mu _{m}}{\sigma \sqrt{2}}\right) . \end{aligned}$$
(2)

Thus, the mean \(P_{\Delta T<GCMs}=\frac{1}{N}\sum _{m=1}^{N}P_{m}\) across all models for each ECS GCM group gives the probability of obtaining simulations warmer than the reference temperature value. \(P_{\Delta T<GCMs}\) can also be obtained by integrating the probability density functions shown in Fig. 6a–c from the green line to infinity, or by a Monte Carlo strategy that generates, for example, 1000 random values for each model from a Gaussian distribution with mean \(\mu _{m}\) and standard deviation \(\sigma\). The relevant statistics are shown in Table 3.
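Equation (2) and its Monte Carlo counterpart can be sketched as follows. This is a minimal illustration rather than the code used for Table 3: the function names are hypothetical, and the model means and observed warming are placeholder values, not entries from Table 1.

```python
import math
import random

def prob_warmer(mu_m, sigma, delta_t):
    """Eq. (2): probability that a Gaussian hindcast N(mu_m, sigma) exceeds delta_t."""
    return 0.5 * math.erfc((delta_t - mu_m) / (sigma * math.sqrt(2.0)))

def prob_warmer_mc(mu_m, sigma, delta_t, n=1000, rng=None):
    """Monte Carlo estimate of the same probability from n Gaussian draws."""
    rng = rng or random.Random(0)
    hits = sum(1 for _ in range(n) if rng.gauss(mu_m, sigma) > delta_t)
    return hits / n

def group_probability(means, sigma, delta_t):
    """Mean of P_m over all model means in one ECS group."""
    return sum(prob_warmer(mu, sigma, delta_t) for mu in means) / len(means)

# Illustrative placeholder values (assumptions, not taken from Table 1).
example_means = [0.55, 0.60, 0.70]
example_delta_t = 0.58
p_group = group_probability(example_means, 0.05, example_delta_t)
```

The analytic form is exact for a Gaussian, so the Monte Carlo estimate should converge to it as the number of draws grows; with only 1000 draws per model, differences of about one percentage point are to be expected.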

Figure 6a–c show that the GCM group with low ECS (blue curves) always produces predictions well-centered on the observed warming for the four surface temperature records because their 2011–2021 values always fall within the \(\pm 1\sigma\) statistical interval (which corresponds to the 16–84% probability range) of the GCM distributions for the high, medium, and low precision options. However, once again, if the actual 1980–2021 warming is given by UAH-MSU-lt, even the GCM group with low ECS seems to be biased towards values that are too hot in 95%, 91% and 85% of possible cases, respectively, for the three precision options (Table 3).

The predictions of the medium (purple) and high (red) ECS GCM groups always show significant warming biases. Also, particularly for the GCM group with high ECS, the PDF appears to have two peaks, implying that the GCMs in this group are physically very different from each other because they produce very different warming hindcasts that are clustered around 0.8 \({^\circ }\)C and 1.2 \({^\circ }\)C; the warmest PDF peak is mostly due to the CanESM5 GCM.

For the high-precision requirement (\(\sigma _{H}=0.05\,{^\circ }\)C), these two GCM groups produce results warmer than the observed values from a minimum of 98% to a maximum of 100% of their possible output, which is outside the 95% confidence interval. For the medium precision option (\(\sigma _{M}=0.10\,{^\circ }\)C), the medium and high GCM groups produce results warmer than the observed values from a minimum of 93% to a maximum of 100% of their possible outputs, which is at the limit of the 95% confidence interval. For the low precision option (\(\sigma _{L}=0.15\,{^\circ }\)C), the GCM groups with medium and high ECS produce warmer results than the four surface-based temperature records from 88 to 95% of cases. Conversely, 99% or more of the theoretical hindcasts of the GCM groups with high and medium ECS would be warmer than UAH-MSU-lt even for the low precision option (\(\sigma _{L}=0.15\,{^\circ }\)C).

The boxplots illustrated in Fig. 6d–f were obtained using the Monte Carlo strategy proposed above, which simulates 1000 randomly distributed outputs for each of the 143 model averages for each of the three precision options (for a total of \(3\times 143{,}000\) theoretical hindcasts). The three panels show that in all cases, with respect to the observed temperature values, the groups with medium and high ECS are well outside the 68% confidence interval (i.e. the \(\pm 1\sigma\) interval). Furthermore, the GCM groups with medium and high ECS indicate levels of warming that are respectively 30% and 50% greater than those actually observed and, consequently, their accuracy is rather low. The accuracy of the low-ECS GCM group is good compared to the surface-based temperature records, but it still reports average warming that is about 30% larger than that reported by the satellite temperature record. The whisker extension of the boxplots shows that the precision of the ECS groups ranges from modest (\(\pm 0.2\,{^\circ }\)C) for the low-ECS group under the high-precision option to very poor (\(\pm 0.5\,{^\circ }\)C) for the high-ECS group under the low-precision option.
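The pooling behind the boxplots can be sketched as below. This is an assumption-laden illustration: the function name, the nearest-rank percentile lookup and the placeholder means are not taken from the paper; only the percentile levels (2.5%, 16%, 25%, 50%, 75%, 84%, 97.5%) and the 1000 draws per model come from the text.

```python
import random

def pooled_percentiles(means, sigma, n_per_model=1000, seed=0):
    """Draw n_per_model Gaussian values around each model mean, pool them,
    and return the percentiles used for the boxes and double whiskers."""
    rng = random.Random(seed)
    pooled = sorted(rng.gauss(mu, sigma) for mu in means for _ in range(n_per_model))
    levels = (2.5, 16, 25, 50, 75, 84, 97.5)
    # Nearest-rank style percentile lookup on the sorted pooled sample.
    return {p: pooled[min(len(pooled) - 1, int(p / 100 * len(pooled)))] for p in levels}

# Illustrative placeholder means (assumptions, not values from Table 1).
box = pooled_percentiles([0.55, 0.60, 0.70], sigma=0.05)
```

Because the samples of all models in a group are pooled, the resulting whiskers reflect both the assumed internal variability \(\sigma\) and the spread among the model means.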

Testing the land versus the ocean warming

Surface-based temperature records imply that the GCM group with low ECS performs better than those with medium and high ECS, which suggests that the most likely ECS value should be equal to or lower than 3 \({^\circ }\)C. However, UAH-MSU-lt implies that even the low-ECS GCMs may perform quite poorly. The observed discrepancy between the surface and satellite temperature records may be due to the presence of various non-climatic warming biases in the surface temperature records (Connolly et al. 2021; D’Aleo 2016; Scafetta 2021a; Watts 2022). This problem is investigated here by comparing the GCM hindcasts against the land and the ocean temperature observations.

Figure 7a–f show the areal distribution of warming from 1980–1990 to 2011–2021 produced by the CMIP6 GCM ensemble average and by HadCRUT5, ERA5-T2m, GISTEMP v4, NOAAGlobTemp v5 and UAH-MSU-lt v6. Equivalent maps for each GCM are found in Scafetta (2021b).

Figure 7b shows that the UAH-MSU-lt v6 temperature record covers the latitude range 80\({^\circ }\) S–80\({^\circ }\) N. Figure 7c, d show that ERA5-T2m and HadCRUT5 (infilled data) are global because they adopt interpolations of meteorological models to extend coverage also to data-sparse regions of the globe such as the poles and other uninhabited areas (large deserts and forests). Figure 7e, f show that the GISTEMP and NOAAGlobTemp records do not cover large areas; in particular, the polar regions are poorly represented.

Figure 7b is characterized by lighter colors than the other temperature panels, which means that the UAH-MSU-lt temperature record shows less warming than the surface-based temperature records almost everywhere. All six temperature panels in Fig. 7 also show that the land area has warmed more than the ocean region. In any case, Fig. 7c–f show that the surface temperature records present a greater temperature difference between the land and the ocean regions. The visual comparison with the CMIP6 ensemble average simulation (Fig. 7a) suggests the same general pattern but, furthermore, the oceanic area appears slightly warmer than all five temperature records. The temperature records also show extensive ocean areas where significant cooling is observed such as around Antarctica, the eastern equatorial Pacific, the North Atlantic and a few other regions. These cooling regions reveal interesting dynamic patterns that are not captured by the average simulation of the CMIP6 ensemble. These patterns are best emphasized in the areal t-test proposed in Scafetta (2022).

Fig. 7

Areal distribution of the warming from 1980–1990 to 2011–2021 for the CMIP6 ensemble average simulation and for the HadCRUT5 (infilled data), ERA5-T2m, GISTEMP v4, NOAAGlobTemp v5, and UAH-MSU-lt v6 temperature records

Table 4 reports the warming over 80\({^\circ }\) S:80\({^\circ }\) N, 60\({^\circ }\) S:80\({^\circ }\) N, 0\({^\circ }\) N:80\({^\circ }\) N and 60\({^\circ }\) S:0\({^\circ }\) S latitudinal ranges from 1980–1990 to 2011–2021 over land+ocean (total), land, and ocean. Table 4 also reports the ratios between the land and the ocean warming levels.

Table 4 Left columns: observed and hindcasted warming over 80\({^\circ }\) S:80\({^\circ }\) N, 60\({^\circ }\) S:80\({^\circ }\) N, 0\({^\circ }\) N:80\({^\circ }\) N, and 60\({^\circ }\) S:0\({^\circ }\) S latitude ranges from 1980–1990 to 2011–2021 over land+ocean (total), land, and ocean, and land/ocean ratio

The area 0\({^\circ }\) N:80\({^\circ }\) N shows that from 1980–1990 to 2011–2021 the surface temperature records warmed on average by about \(0.32\pm 0.05\) \({^\circ }\)C more than the satellite-based UAH-MSU-lt record, while over the area 80\({^\circ }\) S:80\({^\circ }\) N the surface-based records warmed on average by about \(0.15\pm 0.02\,{^\circ }\)C more than the satellite record. A similar warming bias on land also appears in the Southern Hemisphere (60\({^\circ }\) S:0\({^\circ }\) S) because the surface-based temperature records show ocean warming averaging \(0.05\pm 0.03\) \({^\circ }\)C less than the satellite record while their land area warmed by \(0.08\pm 0.03\) \({^\circ }\)C more.

Figure 8 shows the results for each GCM (using the 143 GCM average simulations available for each SSP) for the latitudinal interval 60\({^\circ }\) S:80\({^\circ }\) N, which is optimally covered by all temperature records and includes all continents except Antarctica. The results are also reported in Tables 5, 6 and 7 and could be used to evaluate possible anomalous temperature trends on the continents.

Fig. 8

2011–2021 warming relative to 1980–1990 of the models over a land+ocean, b land, c ocean areas within the 60\({^\circ }\) S:80\({^\circ }\) N latitude range for the 143 average model simulations (colored dots) and the five temperature records (colored lines). d Ratio between the land and the ocean warming. See Tables 5, 6 and 7

Table 5 Low-ECS GCMs: hindcasted warming from 1980–1990 to 2011–2021 within the 60\({^\circ }\) S:80\({^\circ }\) N latitude range over land+ocean (total), land, ocean, and land/ocean ratio
Table 6 Medium-ECS GCMs: hindcasted warming from 1980–1990 to 2011–2021 within the 60\({^\circ }\) S:80\({^\circ }\) N latitude range over land+ocean (total), land, ocean, and land/ocean ratio
Table 7 High-ECS GCMs: hindcasted warming from 1980–1990 to 2011–2021 within the 60\({^\circ }\) S:80\({^\circ }\) N latitude range over land+ocean (total), land, ocean, and land/ocean ratio

Figure 8a compares the synthetic and observed global warming levels from 1980–1990 to 2011–2021. Figure 8b, c show the land and the ocean average warming levels, respectively. These figures show that the performance of the models is similar to what we have obtained in the previous sections, i.e. the GCM group with low ECS performs significantly better than the medium and high ECS GCM groups, which show a warming bias for most of their GCMs.

Figure 8d shows the relationship between average warming on the land and the ocean areas. The mean land/ocean ratio for the vast majority of the models is \(1.75\pm 0.20\), a value that lies between the results obtained for the surface temperature records (ranging from 1.95 to 2.32) and that of the satellite temperature record, which gives 1.51.

The results shown in Fig. 8 can be interpreted as follows.

  1. Figure 8b shows that the land surface temperature records are on average 0.4 \({^\circ }\)C warmer than the satellite-based one. By contrast, Fig. 8c shows that the surface-based ocean temperatures are on average at most 0.1 \({^\circ }\)C warmer than the satellite ones.

  2. Therefore, it can be assumed that on the ocean, the satellite-based temperature record is sufficiently compatible with the surface-based ones. If so, the large divergence observed on land between surface and satellite recordings could suggest that the land measurements are significantly contaminated by non-climatic warming biases, including those related to urbanization (cf.: Connolly et al. 2021; D’Aleo 2016; Scafetta 2021a; Watts 2022).

  3. A similar conclusion would also be indirectly supported by the GCM hindcasts, which show that the CMIP6 models are usually unable to correctly reconstruct the large land/ocean temperature ratio observed in the surface temperature records. In fact, the models give a land/ocean ratio equal to \(1.75\pm 0.20\), while the surface records give ratios between 1.95 and 2.32.

  4. However, as the GCMs attempt to reconstruct the global surface warming of the surface temperature records even though they cannot adequately explain their large land/ocean warming ratio, the models could have calibrated internal parameters to obtain a compromise that attempts to approximate the global surface warming by simulating a warmer ocean and a cooler land than observed.

If point 4 above is correct, the reliability of the low-ECS GCMs should also be questioned. In fact, Fig. 8d shows contradictory results regarding the low-ECS GCMs because some models agree better with the surface-based temperature records, a few others agree better with the satellite temperature record, while the rest report land/ocean ratios between the two levels, as the vast majority of the medium and high ECS models do.

We now assume that the GCMs’ predicted land/ocean temperature ratio (average ratio = \(1.75\pm 0.20\)) corresponds to the actual physical characteristics of the climate system and that the ocean temperature warming of the surface records (on average, \(0.43\pm 0.03\) \({^\circ }\)C, see Table 4) is sufficiently accurate. If so, from 1980–1990 to 2011–2021 the land surface within the latitude interval 60\({^\circ }\) S:80\({^\circ }\) N should have warmed by \(0.75\pm 0.1\,{^\circ }\)C instead of the observed \(0.93\pm 0.03\,{^\circ }\)C. If the hypothesis is correct, the spurious warming of the land surface due to uncorrected non-climatic warming biases could be quantified as approximately +0.2 \({^\circ }\)C. The proposed correction implies that global surface warming from 1980–1990 to 2011–2021 could be at least about 0.05 \({^\circ }\)C (\(\sim\) 10%) lower than what the surface-based records report, which further increases the warming bias of the medium and high-ECS GCMs observed in Figs. 1, 2, 3, 4, 5 and 6.
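The arithmetic behind the expected land warming can be reproduced with a short sketch. The input values are those quoted above; the first-order propagation of independent relative errors is an assumption made here for illustration, since the text does not state how the \(\pm 0.1\,{^\circ }\)C uncertainty was derived, and the function name is hypothetical.

```python
import math

def expected_land_warming(ocean, ocean_err, ratio, ratio_err):
    """Land warming implied by ocean warming times a land/ocean ratio,
    with first-order propagation of independent relative errors (assumed)."""
    land = ocean * ratio
    rel_err = math.sqrt((ocean_err / ocean) ** 2 + (ratio_err / ratio) ** 2)
    return land, land * rel_err

# Values quoted in the text: ocean 0.43 +/- 0.03 C, ratio 1.75 +/- 0.20.
land, land_err = expected_land_warming(0.43, 0.03, 1.75, 0.20)
observed_land = 0.93  # C, observed land warming quoted in the text
spurious = observed_land - land
print(f"expected land warming: {land:.2f} +/- {land_err:.2f} C")
print(f"implied non-climatic excess: {spurious:.2f} C")
```

Under these assumptions the sketch gives about 0.75 ± 0.10 \({^\circ }\)C for the land and an excess of about 0.18 \({^\circ }\)C, consistent with the rounded values of \(0.75\pm 0.1\,{^\circ }\)C and +0.2 \({^\circ }\)C given in the text.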

The results depicted in Fig. 8 also help to better evaluate the individual GCMs. For example, Fig. 5 suggests that three high-ECS models (CNRM-CM6-1-f2, CNRM-ESM2-1-f2 and CIESM) produce warming relatively close to what is reported by the surface-based temperature records. However, Fig. 8d indicates that the same models fail to reproduce the land/ocean temperature ratio of the same temperature records, showing significantly lower (CNRM) or higher (CIESM) results than reported. Therefore, it appears that in these GCMs the biases that occur in some regions are offset by opposite biases that occur in other regions.

The last four columns of Table 4 report the global (land+ocean) and land warming calculated assuming that the ocean warming of the temperature records is correct and that the land/ocean warming ratios hindcasted by the models are correct as well. The global estimate was calculated from the ocean and the land ones weighted with their relative area percentages within each latitudinal range. In particular, we found that for the Northern Hemisphere (0\({^\circ }\) N:80\({^\circ }\) N), the land could have warmed about 0.087 \({^\circ }\)C less than what is reported on average by HadCRUT5, ERA5-T2m, and GISTEMP. This bias roughly corresponds to the difference in warming estimated in Connolly et al. (2021) for the Northern Hemisphere land area by comparing the temperature records reconstructed by using both urban+rural stations and rural-only stations, which should present significantly mitigated non-climatic warming biases.
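The area weighting described above can be sketched as follows. The function names are hypothetical, and the land fraction of 0.30 is a round value chosen here only for illustration; the area percentages actually used for Table 4 are not given in this section.

```python
def area_weighted_warming(land, ocean, land_fraction):
    """Combine land and ocean warming using their area fractions in a latitude band."""
    if not 0.0 <= land_fraction <= 1.0:
        raise ValueError("land_fraction must lie between 0 and 1")
    return land_fraction * land + (1.0 - land_fraction) * ocean

def corrected_global(ocean, model_ratio, land_fraction):
    """Global warming assuming the observed ocean warming and the model land/ocean ratio."""
    return area_weighted_warming(ocean * model_ratio, ocean, land_fraction)

# land_fraction=0.30 is a hypothetical round value for illustration only;
# the text does not state the area percentages used for Table 4.
estimate = corrected_global(ocean=0.43, model_ratio=1.75, land_fraction=0.30)
```

Since the land fraction differs between latitude bands, the same ocean warming and land/ocean ratio yield different global estimates for each band.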

In conclusion, the proposed land-ocean comparison suggests that the surface-based temperature records most likely exhibit non-climatic warming biases and that the actual global surface warming from 1980–1990 to 2011–2021 may have been approximately between 0.50 and 0.55 \({^\circ }\)C, which is approximately 10% lower than what is reported in Sect. 2. This means that the medium and high-ECS GCM groups are further confirmed to run too hot and that the low ECS GCM group performs slightly worse than concluded in Sect. 3 because the average warming of its hindcasts from 1980–1990 to 2011–2021 is approximately 0.6 \({^\circ }\)C (Table 1). However, if UAH-MSU-lt reproduces the global surface warming more accurately, the surface-based temperature records would exhibit a warming bias of up to 30% of the reported values, which would indicate that even the low ECS GCMs run significantly too hot and need to be scaled down by 33% to reduce their mean warming from 0.6 to 0.4 \({^\circ }\)C, which is the warming reported by the satellite-based measurements. Indeed, further indirect evidence that the land surface temperature records could be affected by a significant warming bias is given by the divergence observed between instrumental and dendroclimatological proxy temperature records over the past 50 years, where the former show a warming trend significantly higher than the latter (Büntgen et al. 2021; Esper et al. 2018; Scafetta 2021a).

Climate change expectations for the 21st century

Climate impacts several areas of economic and environmental importance and its changes may require the implementation of various adaptation policies. However, climate change could also adversely affect some of the Earth’s climate systems such as in areas of water scarcity, coastal communities, natural ecosystems and others (IPCC 2022). It is reasonable to assume that if climate change is too rapid and too significant, different areas could reach a point of vulnerability where adaptation will no longer be sufficient to avoid serious adverse effects. However, adaptation policies are much more affordable than mitigation ones, so the risks associated with possible future climate change should not be overestimated.

The IPCC (2021) used the CMIP6 GCMs and various scenarios of global socioeconomic change projected up to 2100 to produce hypothetical future storylines of climate change for the 21st century. Four SSP scenarios were studied here: SSP1-2.6 (low GHG emissions in which CO2 emissions are reduced to zero around 2075); SSP2-4.5 (intermediate GHG emissions in which CO2 emissions increase around the current rate until 2050, and then decrease but do not reach net zero by 2100); SSP3-7.0 (high GHG emissions where CO2 emissions double by 2100); and SSP5-8.5 (very high GHG emissions where CO2 emissions triple by 2075).

The IPCC (2022) states that if the global surface temperature rises significantly above 2 \({^\circ }\)C over the next few decades compared to the pre-industrial period (1850–1900), adaptation policies may not be sufficient to reduce high risks related to climate change at least in some areas. Aggressive climate mitigation policies should therefore be implemented because the CMIP6 GCMs predict that the temperature will likely increase between 2 and 3 \({^\circ }\)C (compared to 1850–1900) by 2050 if anthropogenic greenhouse gas emissions are not significantly reduced as soon as possible.

However, in the previous sections we found that only the GCM group with low ECS, which is also the one predicting less warming, optimally reproduces the observed warming from 1980–1990 to 2011–2021 reported by the surface-based global temperature records. Therefore, its scenario forecasts for the 21st century should be preferred for policy. Furthermore, we also found that global warming from 1980–1990 to 2011–2021 reported by the surface temperature records may need to be reduced on average by about 10% assuming that the ocean warming is correct and that the correct land/ocean temperature ratio is the one predicted by the models. Finally, if UAH-MSU-lt better reproduces the actual warming from 1980–1990 to 2011–2021, even the simulations of the low ECS GCMs would be running too hot and the warming they produce would need to be reduced by 33% to optimally accommodate the observations. Here, we show and discuss how the climate could change in the 21st century under the above assumptions.

Figure 9 shows the simulations produced by the low ECS GCMs from 1980 to 2100 using the historical + SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 scenarios in three conditions: panels a1–a8 show the original GCM simulations versus HadCRUT5 (infilled data); panels b1–b8 show the GCM simulations reduced by 10% compared to NOAAGlobalTemp v5; and panels c1–c8 show the GCM simulations reduced by 33% compared to UAH-MSU-lt v6. The ordinates represent the temperature anomaly relative to the 1850–1900 average of the corresponding GCM set. The temperature records are baselined with the model simulations in 1980–1990. Table 8 reports the global surface warming forecasts produced by the low ECS GCMs in the periods 1980–1990, 2011–2021, 2040–2060 and 2080–2100 in the same three conditions.
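The three conditions can be illustrated with a short sketch. It assumes that the reduction factor (RF = 0.90 or 0.67) multiplies the model anomaly relative to 1850–1900 and that each temperature record is shifted so that its 1980–1990 mean matches that of the scaled simulation; this is one reading of the procedure, the function names are hypothetical, and the series are synthetic.

```python
def scale_simulation(anomalies, rf):
    """Scale model anomalies (relative to 1850-1900) by a reduction factor rf."""
    return [rf * a for a in anomalies]

def rebaseline(record, years, target_mean, start=1980, end=1990):
    """Shift an observational record so its start-end mean equals target_mean."""
    window = [v for y, v in zip(years, record) if start <= y <= end]
    offset = target_mean - sum(window) / len(window)
    return [v + offset for v in record]

# Synthetic illustration only: a linear model anomaly and a flat record.
years = list(range(1980, 2101))
model = [0.4 + 0.02 * (y - 1980) for y in years]
scaled = scale_simulation(model, rf=0.67)
model_8090 = sum(scaled[:11]) / 11  # mean over 1980-1990 (11 annual values)
record = rebaseline([0.0] * len(years), years, model_8090)
```

With this convention, scaling lowers both the 1980–1990 starting level and the subsequent warming of the simulations, which is why the 1980–1990 values in Table 8 also differ among the three conditions.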

Fig. 9

Low-ECS GCM simulations from 1980 to 2100 using historical + SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 scenarios: (a1–a8) original GCM simulations versus HadCRUT5 (infilled data); (b1–b8) GCM simulations reduced by 10% versus NOAAGlobalTemp v5; (c1–c8) GCM simulations reduced by 33% versus UAH-MSU-lt v6. The ordinates represent the temperature anomaly relative to the 1850–1900 average of the corresponding model set. The temperature records are baselined with the GCM simulations in 1980–1990

Table 8 Low-ECS GCMs: global surface warming in the periods 1980–1990, 2011–2021, 2040–2060, and 2080–2100 using the simulations depicted in Fig. 9. Original GCM simulations; (RF = 0.90) GCM simulations reduced by 10%; (RF = 0.67) GCM simulations reduced by 33%. The temperature anomaly is relative to the 1850–1900 average of the corresponding model set. The temperature records are baselined with the model simulations in 1980–1990

The analysis shows that the expected warming of the low-ECS GCMs by 2040–2060 is close to 2 \({^\circ }\)C also for the SSP3-7.0 and SSP5-8.5 scenarios, which Hausfather and Peters (2020) described as “unlikely” and as “highly unlikely”, respectively. However, if the surface temperature records contain a warming bias and, therefore, the GCM simulations need to be scaled down to better agree with the actual warming, the projected warming for 2040–2060 could be lower (or even significantly lower if UAH-MSU-lt v6 is correct) than 2 \({^\circ }\)C also for the SSP3-7.0 and SSP5-8.5 scenarios.

There is indirect evidence that the surface-based temperature reconstructions could be affected by non-climatic warming biases. In fact, compared to the 1850–1900 mean, the 1980–1990 average warming is 0.54 \({^\circ }\)C for HadCRUT5 (infilled data), 0.48 \({^\circ }\)C for GISTEMP v4 (using the period 1880–1900) and 0.47 \({^\circ }\)C for NOAAGlobalTemp v5 (using the period 1880–1900). However, the low ECS GCMs give a slightly lower 1980–1990 warming, which is \(0.41\pm 0.20\) \({^\circ }\)C when averaging all GCM simulations, although they better hindcast the 1980–2021 warming. In contrast, the medium and high ECS GCMs give \(0.48\pm 0.27\) \({^\circ }\)C and \(0.47\pm 0.23\) \({^\circ }\)C, respectively, which better fit the climate records; but then these same GCMs fail to hindcast the observed warming from 1980 to 2021.

Furthermore, the warming hindcasted by the low-ECS models from 1850–1900 to 1980–1990 would be lower than \(0.41\pm 0.20\) \({^\circ }\)C if the climate simulations produced by them were to be scaled down. This evidence would suggest that the more recently applied homogenization adjustments to climate data to attempt to remove their non-climatic biases may have been inadequate and may have added or left spurious warming. For example, the continuous homogenization adjustments made to the surface-based temperature records during the last 10 years may have improperly cooled the raw temperature data of the past for many land stations (D’Aleo 2016) and, simultaneously, may have improperly increased the warming trend from the 1970s to the present, and, in particular, that of the period 2000–2021 (Connolly et al. 2021; Scafetta 2021a; Watts 2022). In fact, the scientific literature has indicated the period 2000–2014 as a “hiatus” or “pause” in global warming (IPCC 2013) because all surface and satellite climate records available before 2014 (e.g. HadCRUT3, which was discontinued in 2014, Brohan et al. 2006) showed more than a decade of relatively little change. Later, however, new versions of the surface temperature records were published (e.g. HadCRUT4 and later HadCRUT5 non-infilled and infilled data) and the 2000–2014 “pause” has gradually disappeared because, from one version of the record to the next, it has been replaced by an increasingly strong warming trend; e.g. the 2000–2014 trend was 0.03 °C/decade for HadCRUT3, 0.08 °C/decade for HadCRUT4, 0.10 °C/decade for HadCRUT5 non-infilled data, and 0.14 °C/decade for HadCRUT5 infilled data. Yet, the 2000–2014 global warming “hiatus” is still visible in the UAH-MSU-lt v6 record, which shows a 2000–2014 warming trend of 0.012 °C/decade (Fig. 2).
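Trends expressed in °C/decade, such as those quoted above, are commonly obtained from a linear fit. The sketch below assumes an ordinary least-squares fit to annual values, which the text does not specify, and uses a synthetic series with a known slope rather than any of the actual temperature records; the function name is hypothetical.

```python
def trend_per_decade(years, temps):
    """Ordinary least-squares slope of temps against years, in degrees C per decade."""
    n = len(years)
    mean_y = sum(years) / n
    mean_t = sum(temps) / n
    num = sum((y - mean_y) * (t - mean_t) for y, t in zip(years, temps))
    den = sum((y - mean_y) ** 2 for y in years)
    return 10.0 * num / den

# Synthetic annual series with a known slope of 0.014 C/yr (illustration only).
years = list(range(2000, 2015))
temps = [0.30 + 0.014 * (y - 2000) for y in years]
slope = trend_per_decade(years, temps)
```

Over a window as short as 15 years, a fitted slope is sensitive to the choice of end points and to interannual variability, so small differences between records should be read with that in mind.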

The above findings and considerations suggest that the actual ECS should be relatively low, which implies that, over the next few decades, climate change will likely be moderate and that adaptation policies should be sufficient to manage any adverse effects that may occur.

Conclusion

Here I tested how well the CMIP6 GCMs, grouped into low, medium and high ECS subgroups, hindcast the global surface temperature warming from 1980–1990 to 2011–2021 reported by four surface temperature records (ERA5-T2m, HadCRUT5, GISTEMP v4, and NOAAGlobTemp v5) and by the satellite-based UAH-MSU-lt v6 temperature record. The latter was used as the lowest possible estimate for the global surface temperature warming during the analyzed period. The rationale for adding a comparison with the lower troposphere temperature record is that surface temperatures could be affected by significant non-climatic warming bias due, for example, to poorly corrected urban heat island effects and many other factors (Connolly et al. 2021; D’Aleo 2016; Scafetta 2021a; Watts 2022). For example, indirect evidence for a significant warming bias, especially over land, may also be provided by the so-called “Divergence Problem”, that is, the apparent decoupling between tree-ring width chronologies and the rising temperature measurements starting from the 1970s (Büntgen et al. 2021; Esper et al. 2018; Scafetta 2021a).

Using the 143 GCM mean simulations available for four different SSPs, all medium and high ECS models turn out to be warmer than observations. Using the 688 CMIP6 ensemble member simulations available, 94–100% of the simulations produced by GCMs with medium and high ECS hindcasted greater warming than the five temperature records. In contrast, the low-ECS models are statistically distributed around the observed warming values obtained from the four surface-based temperature records. However, if the UAH-MSU-lt record better represents the actual 2011–2021 warming, even the low-ECS GCM group would produce hindcasts that are on average too hot.

I also tested whether the internal variability of the models could produce results distributed around the observations. Its effect was modeled using three fixed precision options. Assuming high (\(\sigma _{H}\approx 0.05\,{^\circ }\)C) and medium (\(\sigma _{M}\approx 0.10\,{^\circ }\)C) precision, it was found that 98–100% and 92–98%, respectively, of all possible outputs from the medium and high ECS GCMs would be warmer than observations. Only the theoretical results produced by the low-ECS GCM group optimally agree with the surface-based temperature records. If the required model precision is quite low (\(\sigma _{L}\approx 0.15\) \({^\circ }\)C), the medium and high ECS GCM groups would agree better with the data, but this agreement could still be quite unsatisfactory because 87–93% (which is still well outside the \(\pm 1\) \(\sigma\) or 68% confidence interval) of their hindcasts would still be too hot. In any case, the low precision option should be considered very unsatisfactory because it would allow the GCMs to deviate too much from the observations. Moreover, such poor precision would not seem consistent with the natural variability of the data as argued in Appendix 2. Figure 5 suggests that such a low precision could only occur for the EC-Earth3 GCM.

Figures 5 and 6 also show that very few GCMs with medium and high ECS could produce some simulations consistent with the actual temperature values. In particular, the two high-ECS CNRM models (Séférian et al. 2019; Voldoire et al. 2019) appear to perform better than the other models of the same group. However, as a group, the high-ECS models are physically incompatible with the low-ECS ones. Indeed, the internal parameters of the GCMs are carefully tuned to obtain results as acceptable as possible (Hourdin et al. 2017; Mauritsen et al. 2019). Therefore, the good performance of some isolated cases could hardly be used to validate the corresponding model since the tuning operations also risk masking fundamental physical problems and, therefore, the need for model and/or forcing improvements.

It was found that only the low-ECS GCM group agrees optimally with the surface-based temperature records because their full hindcast range well encompasses the actual temperature warming values from 1980–1990 to 2011–2021. Therefore, since the three chosen ECS ranges should be considered large enough to be incompatible with each other, the GCM group with low ECS should be preferred to the other two, implying that the most likely ECS should be equal to or lower than 3 \({^\circ }\)C. This result confirms Scafetta (2022). In fact, the performance of the models seems to increase as the ECS decreases (Scafetta 2021b).

However, the actual ECS could also be significantly lower than 3 \({^\circ }\)C if the UAH-MSU-lt record better represents the 2011–2021 surface warming. In fact, the satellite record shows that from 1980–1990 to 2011–2021 the global surface temperature may have warmed by about 0.40 \({^\circ }\)C, which is about 30% less than the 0.58 \({^\circ }\)C reported by ERA5-T2m, HadCRUT5 (infilled data), and GISTEMP v4. In this case, even the GCM group with low ECS would show poor accuracy in reproducing the temperature data because their average hindcast is about 0.60 \({^\circ }\)C. This means that the actual ECS could also be 33% lower than that which characterizes the low ECS GCM group: that is, it could need to be reduced from 1.8–3.0 \({^\circ }\)C to 1.2–2.0 \({^\circ }\)C. This conclusion cannot be ruled out because: (1) the surface temperature records appear to be severely affected by non-climatic warming bias (Connolly et al. 2021; D’Aleo 2016; Scafetta 2021a; Watts 2022), as the direct comparison between land and ocean warming proposed here also seems to confirm (Fig. 8); (2) a number of independent studies have concluded that the ECS could be within such a low range (e.g.: Lewis and Curry 2018; Lindzen and Choi 2011; Scafetta 2013; Stefani 2021; Wijngaarden and Happer 2020).

There is a third possibility which would also imply that the actual ECS should be relatively low. The climate system, in fact, appears to be also modulated by multidecadal and millennial natural oscillations such as those related to solar forcings and other astronomical ones, which are not reproduced by the GCMs (cf.: Scafetta 2013, 2021c; Wyatt and Curry 2014). Their presence implies that the ECS of GCMs should be at least halved (cf.: Loehle and Scafetta 2011; Scafetta 2012a, 2021c) and could vary approximately between 1.0 and 2.5 \({^\circ }\)C, as found by several independent studies (cf.: Lewis and Curry 2018; Lindzen and Choi 2011; Scafetta 2013; Stefani 2021; Wijngaarden and Happer 2020). If so, future climate warming and changes will be moderate and naturally oscillating (Scafetta 2013, 2021c) and the rate of global surface warming should likely remain quite low until 2030–2040, when solar activity is expected to increase again due to its natural multi-decadal oscillations (Scafetta 2012b; Scafetta and Bianchini 2022; and several others).

In any case, even remaining within the theoretical framework of the CMIP6 GCMs, it should be concluded that only the low ECS GCM group can be considered sufficiently validated by the global surface warming observed from 1980–1990 to 2011–2021. Therefore, only the 21st century climate projections produced by the low ECS GCMs should be used for policy. For decades to come, these models predict more moderate warming than the GCM groups with medium and high ECS do for similar greenhouse gas emission scenarios. By 2050, projected warming is expected to be around 2 \({^\circ }\)C or less even for the worst greenhouse gas emission scenarios. This moderate warming should not be considered particularly alarming because the impact and risk assessments related to it are considered “moderate” assuming even low to no adaptation (IPCC 2022). Furthermore, as surface-based temperature records are likely affected by warming biases and are characterized by natural oscillations that are not reproduced by the CMIP6 models, the global warming expected for the next few decades may be even more moderate than predicted by the low-ECS GCMs and could easily fall within a safe temperature range where climate adaptation policies will suffice. Therefore, aggressive mitigation policies aimed at rapidly and drastically reducing GHG emissions in order to avoid an overly rapid rise in temperature do not seem justified, also because their costs seem to outweigh any realistic benefits (cf. Bezdek et al. 2019).