Optimal fingerprinting is a technique based on multiple linear regression (e.g. Hasselmann 1979, 1993, 1997; Hegerl and von Storch 1996; Hegerl et al. 1997; Allen and Tett 1999; Thorne 2001; Stott et al. 2001, 2003; Tett et al. 2002; Allen and Stott 2003; Allen et al. 2006). In Sect. 5.1 we summarise the method, and in Sect. 5.2 the construction of fingerprints. Section 5.3 presents analyses for two signals (ANT and NAT) and Sect. 5.4 for three (GHG, NAT and OA).
Methodology
Optimal fingerprinting assumes that observations y may be represented as a linear sum of simulated signals (\(X_i\), referred to as fingerprints) in response to external forcings on the climate system, plus unforced internally generated variability (\(\varepsilon\), also referred to as climate noise),
$$\begin{aligned} \mathbf{y } = \sum _{i=1}^{n} (\beta _{i} X_{i}) + \varepsilon = \beta \mathbf{X } + \varepsilon \end{aligned}$$
(1)
where \(\beta _i\) are scaling factors (also referred to as amplitudes) corresponding to each of the fingerprints and n is the number of fingerprints.
The observations and fingerprints can have spatial and/or temporal dimensions, e.g. a geographical pattern of change or time series for multiple layers. The method assumes that the spatiotemporal patterns of the responses X are correctly simulated, but not necessarily their amplitudes, and the \(\beta _i\) are chosen to give the best fit to the observations. Thus the method may indicate deficiencies in the magnitudes of the simulated responses.
We assume that the fingerprints are perfectly known (have no errors) and therefore have no uncertainty; this is an acceptable approximation if we compute the fingerprints from a large ensemble of runs. In that case, the optimal \(\beta\) are evaluated using ordinary least squares (OLS) regression (Allen and Tett 1999),
$$\begin{aligned} {\tilde{\beta }} = (X^{T}C_{N}^{-1}X)^{-1}(X^{T}C_{N}^{-1}y) = B^{T}y \end{aligned}$$
(2)
where \({\tilde{\beta }}\) refers to the best estimate of \(\beta\) and \(C_{N}\) is the covariance of unforced variability, whose role is to give higher weight to aspects of the response with greater signal-to-noise ratio and thus maximise the statistical significance. In the optimal fingerprinting algorithm, weighting by the inverse noise covariance is carried out by projecting the observations y and the fingerprints X onto the EOFs of \(C_{N}\), scaled by the inverse of its singular values; the scaling factors are then calculated by OLS regression in the rotated space, giving more weight to regions of phase space with lower unforced variability and less weight to regions with higher unforced variability. The method also permits a truncated representation, in which the observations and fingerprints are projected onto only the leading EOFs, reducing their dimensions to the chosen truncation, which can affect the outcome. The maximum number of EOFs that can be estimated from the control (the maximum truncation) depends on the size of the fingerprints and the length of the control simulation, which might not permit all EOFs to be estimated.
In this study the size of the fingerprints (owing to sufficient temporal and spatial averaging) and the number of years of control available allow all EOFs to be estimated (i.e. no truncation is needed). Nevertheless, many studies have investigated the sensitivity of the results to truncation and we refer the reader to those (Hegerl and von Storch 1996; Allen and Tett 1999; Stott et al. 2001; Tett et al. 2002; Allen and Stott 2003; Jones et al. 2013).
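As an illustration of Eq. (2) and the prewhitening described above, the following is a minimal numpy sketch; the function name, the eigendecomposition of \(C_{N}\) and the optional truncation argument are illustrative assumptions rather than the code used in this study.

```python
import numpy as np

def estimate_beta(y, X, C_N, n_eofs=None):
    """OLS scaling factors: beta-tilde = (X^T C_N^-1 X)^-1 X^T C_N^-1 y.

    y      : (j,)   observation vector
    X      : (j, n) fingerprint matrix, one column per signal
    C_N    : (j, j) noise covariance estimated from control runs
    n_eofs : optional truncation (number of leading EOFs retained);
             None keeps all EOFs, as in this study.
    """
    # Eigendecomposition of the noise covariance: C_N = E diag(lam) E^T
    lam, E = np.linalg.eigh(C_N)
    order = np.argsort(lam)[::-1]            # sort EOFs by decreasing variance
    lam, E = lam[order], E[:, order]
    if n_eofs is not None:                   # optional truncation
        lam, E = lam[:n_eofs], E[:, :n_eofs]

    # Project onto the EOFs and scale by 1/sqrt(lambda): directions with low
    # unforced variance receive more weight ("prewhitening").
    P = E / np.sqrt(lam)                     # (j, k) prewhitening operator
    y_w, X_w = P.T @ y, P.T @ X

    # Ordinary least squares in the rotated (whitened) space
    beta, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    return beta
```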
Since the observations y contain unforced variability, there is a statistical uncertainty in \({\tilde{\beta }}\), which is normally distributed with mean \(\beta\) and variance,
$$\begin{aligned} {\tilde{V}}({\tilde{\beta }}) = B^{T}C_{N_{2}}B \end{aligned}$$
(3)
where \(C_{N_{2}}\) is another estimate of the noise covariance matrix, made independently of \(C_{N}\) (Allen and Tett 1999; Allen and Stott 2003). It is important to use independent estimates of noise covariance for the optimization and for the evaluation of uncertainty in order to avoid a systematic bias towards underestimation of the latter (Hegerl and von Storch 1996; Allen and Tett 1999).
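A corresponding sketch of Eq. (3), with the second, independent noise estimate \(C_{N_{2}}\) supplied separately; variable and function names are illustrative.

```python
import numpy as np

def beta_covariance(X, C_N, C_N2):
    """Return V(beta-tilde) = B^T C_N2 B, with B^T = (X^T C_N^-1 X)^-1 X^T C_N^-1."""
    C_N_inv = np.linalg.inv(C_N)
    A = np.linalg.inv(X.T @ C_N_inv @ X)   # (n, n)
    B = C_N_inv @ X @ A                    # (j, n), so beta-tilde = B.T @ y
    return B.T @ C_N2 @ B                  # (n, n) covariance of beta-tilde
```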
It is not possible to evaluate unforced variability adequately from observations, since they contain forced responses, do not cover the entire ocean and are at most 50 years long, which is insufficient to characterize variability on decadal or centennial timescales. Therefore pre-industrial control simulations of AOGCMs are used, assuming that the simulated variability gives a realistic estimate. \(C_{N}\) and \(C_{N_{2}}\) are estimated from independent sections of equal length from the control runs.
Once we have estimated the scaling factors \({\tilde{\beta }}\), with their respective confidence intervals calculated from \({\tilde{V}}({\tilde{\beta }})\), two tests are carried out to determine whether the signals are detected and whether their amplitudes are consistent with the observations:
-
Detection: tests the null hypothesis that the amplitude of the observed response to a particular forcing, or combination of forcings, is consistent with zero. By consistent we mean that the uncertainty range of \({\tilde{\beta }}\) spans zero. If the uncertainty range of \({\tilde{\beta }}\) is inconsistent with zero and \({\tilde{\beta }}\) is positive, the null hypothesis is rejected and we conclude that the pattern is detected. Alternatively, if the uncertainty range of \({\tilde{\beta }}\) is consistent with zero, the null hypothesis is not rejected and we conclude that the pattern is not detected.
-
Amplitude consistency: assuming that the signals are detected, we next test the null hypothesis that the amplitude of the observed response is consistent with the simulated response. If the uncertainty range of \({\tilde{\beta }}\) includes unity, the simulated pattern or patterns are consistent with the observations. If the uncertainty range of \({\tilde{\beta }}\) does not span unity, the null hypothesis is rejected: if it lies below unity, the simulated pattern of response is overestimated and is therefore scaled down; if above unity, the simulated pattern of response is underestimated and is therefore scaled up. Both tests are sketched below.
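As a schematic illustration of the two tests (not the exact statistical construction used here), a two-sided Gaussian interval can be derived from the diagonal of \({\tilde{V}}({\tilde{\beta }})\); the 5–95% level is an assumption.

```python
import numpy as np
from scipy.stats import norm

def assess_signal(beta, var_beta, level=0.90):
    """Return (detected, amplitude_consistent) for one scaling factor."""
    z = norm.ppf(0.5 + level / 2.0)          # two-sided Gaussian quantile
    lo = beta - z * np.sqrt(var_beta)
    hi = beta + z * np.sqrt(var_beta)
    detected = lo > 0.0                       # interval excludes zero, beta positive
    consistent = lo <= 1.0 <= hi              # interval spans unity
    return detected, consistent
```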
Constructing the fingerprints
To construct the CMIP5 multi-model mean fingerprints we use only the 8 models for which historical, historicalNat and historicalGHG simulations are all available, giving 40, 24 and 26 model simulations respectively (see Table 1). All the fingerprints are for the 46 years 1960–2005.
Table 1 CMIP5 climate models used in the detection and attribution studies

The simplest fingerprint we consider is a time series for the global mean or for a specific geographical region (an ocean basin or a latitude/longitude range) and one layer covering a particular depth range (e.g. 0–300 m). These fingerprints \(F_{N}(t)\) have only the time dimension, where the subscript N refers to a particular signal and t is the year. The values are calculated from 3D annual means of ocean temperature by spatial averaging over the region of interest, weighted by the area of each grid cell, and then averaging over the depth levels, weighted by the thickness of the levels.
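The averaging described above might be sketched as follows; the variable names (temp, cell_area, layer_thickness, mask) are assumptions rather than the CMIP5 or observational variable names.

```python
import numpy as np

def one_layer_fingerprint(temp, cell_area, layer_thickness, mask):
    """temp            : (t, z, y, x) annual-mean ocean temperature
    cell_area       : (y, x) horizontal grid-cell areas
    layer_thickness : (z,) thickness of each level in the chosen depth range
    mask            : (y, x) boolean mask selecting the region of interest
    Returns F_N(t), a time series for one region and one depth range."""
    w = np.where(mask, cell_area, 0.0)
    # Area-weighted horizontal mean at each time and level -> (t, z)
    horiz = (temp * w).sum(axis=(-2, -1)) / w.sum()
    # Thickness-weighted vertical mean over the selected levels -> (t,)
    return (horiz * layer_thickness).sum(axis=-1) / layer_thickness.sum()
```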
The novelty of this study is to construct fingerprints with multiple levels, which may allow us to discriminate the signals better by considering their vertical structure. These multi-level fingerprints \(F_{N}(t,l)\) have a dimension l for layer number (e.g. for two layers, 0–300 m and 300–700 m). The observations and each of the CMIP5 models have different numbers and depths of levels, so we calculated the vertical means by selecting the nearest available levels. Vertical interpolation was tested as an alternative, but the differences in the results are negligible.
We also construct fingerprints considering multiple geographical regions: \(F_{N}(t,l,r)\), where r is the region number. This is particularly relevant for GHG and OA, which could be better discriminated since the anthropogenic aerosols are not well mixed and are concentrated in the Northern Hemisphere where they are mainly emitted.
For the optimal fingerprinting algorithm, the fingerprints are reshaped as X(N, j), where N is the number of fingerprints (one for each forcing or signal) and j is the number of elements in each fingerprint. Similarly, the observations have the form y(j). For example, with 46 years and two layers, \(j=2 \times 46 = 92\). If there are two regions and three layers \(j=2 \times 3 \times 46 = 276\).
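A short sketch of this reshaping step, assuming the fingerprints are held as numpy arrays; the ordering of the flattened dimensions is an illustrative choice.

```python
import numpy as np

n_years, n_layers, n_regions = 46, 3, 2
F_ant = np.random.randn(n_regions, n_layers, n_years)  # placeholder F_ANT(t, l, r)
F_nat = np.random.randn(n_regions, n_layers, n_years)  # placeholder F_NAT(t, l, r)

X = np.stack([F_ant.reshape(-1), F_nat.reshape(-1)])   # shape (N, j) = (2, 276)
y = np.random.randn(X.shape[1])                         # observations y(j), placeholder

# Note: the regression sketch in the methodology section expects the signals
# as columns, i.e. X.T of shape (j, N).
print(X.shape, y.shape)                                 # (2, 276) (276,)
```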
Estimating unforced variability
A total of 5423 years are available from the pre-industrial control simulations of the 8 CMIP5 models. Since their lengths differ, using the entire control runs would bias the control covariance matrix towards the models with longer control runs. To avoid this we follow previous studies (e.g. Gillett et al. 2002; Jones et al. 2013) in limiting the length of all control runs to the shortest available. With 500 years from each model, 4000 years of control are used in total. Since two independent estimates of the noise covariance are needed (one for the optimization and one for the hypothesis testing), 250 years of each model's control run are used for each estimate, concatenated to make a sequence of 2000 years. Segments are extracted from this sequence with dimension C(t, l, s), where t is the year within the segment, l is the layer and s is the segment number. To maximise the amount of information used in calculating the noise covariance matrix, the control segments are maximally overlapped (segment 1 is years 1–46 of the control sequence, segment 2 is years 2–47, etc.).
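A simplified sketch of the covariance estimate from maximally overlapped segments; it ignores details such as segments spanning the joins between concatenated models, and the names are illustrative.

```python
import numpy as np

def noise_covariance(control, seg_len=46):
    """control : (n_years, n_layers) concatenated control-run anomalies.
    Returns C_N of shape (j, j) with j = seg_len * n_layers."""
    n_years = control.shape[0]
    segs = np.stack([control[s:s + seg_len].reshape(-1)       # flatten to length j
                     for s in range(n_years - seg_len + 1)])  # overlapping starts
    segs -= segs.mean(axis=0)                                 # remove the segment mean
    return segs.T @ segs / (segs.shape[0] - 1)                # (j, j) covariance
```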
Two-signal attribution (anthropogenic and natural forcings)
Global mean analysis
First we consider one-layer analyses of temperature change for various depth ranges: 0–100 m, 0–300 m, 0–700 m, 0–1000 m and 0–2000 m. The best-estimate scaling factors and uncertainty for ANT and NAT forcings using spatially complete and sub-sampled fields are shown in Fig. 10. Both ANT and NAT signals are detected (the scaling factor uncertainty is inconsistent with zero) for all depth ranges in all datasets (ORAS4, EN4GR, EN4L and IK09) for both full and sub-sampled fields, with the exception of the 0–1500 m range for IK09 using spatially complete fields.
The uncertainty of the scaling factors is greater in the sub-sampled fields, consistent with the increase in variance upon sub-sampling (Sect. 4). For the upper layers (0–100 m and 0–300 m), where the observational coverage is best, both \(\beta _{ANT}\) and \(\beta _{NAT}\) are generally consistent with unity, but the best estimate is usually less than unity (consistent with previous attribution studies, e.g. Weller et al. 2016). For the spatially complete fields, although not for the sub-sampled fields, \(\beta _{ANT} < 1\) for the upper 300 m, to match the strong cooling that occurs in the tropical Pacific and Indian Oceans (see Sect. 2). For the deeper layers (0–700 m, 0–1000 m and 0–2000 m), the spatially complete and sub-sampled fields behave differently. In the spatially complete fields \(\beta _{ANT} < 1\), except for ORAS4 in the 0–2000 m range, whereas for the sub-sampled fields \(\beta _{ANT} > 1\), consistent with our inference of a sampling bias towards the North Atlantic, where the warming is greatest (Sect. 4). For both spatially complete and sub-sampled fields, \(\beta _{NAT} \ge 1\) in the deeper layers, indicating insufficiently deep penetration of volcanic variability in the model simulations.
The scaling factors for two-layer and three-layer fingerprints (Fig. 11) have smaller uncertainty than for single layers, because of the extra constraint from the vertical structure. Both the ANT and NAT signals are detected for all depth ranges and datasets for both spatially complete and sub-sampled fields, except for IK09. If only the upper 300 m are covered, \(\beta _{ANT} < 1\), as for the one-layer fingerprints. With deeper layers, however, both the ANT and NAT scaling factors are consistent with unity in most cases. The spatially complete and sub-sampled fields show similar results, but again the uncertainty of the scaling factors for the sub-sampled fields is slightly larger, so the scaling factors are consistent with unity in more cases.
Regional analysis
With one-layer fingerprints for individual basins, not all the signals are always detected, depending on the observational product and on whether the fields are sub-sampled (Fig. 12 shows 0–300 m as an example). Using spatially complete fields, the ANT signal is always detected and consistent with unity in the Atlantic, and in most cases in the Pacific, but not always in the Indian and Southern Oceans. The NAT signal is detected in the Pacific basin and the Southern Ocean (except in IK09) in all cases, but not in the other basins. Overall, using sub-sampled fields means that signals are detected more frequently. As before, using multiple layers constrains the uncertainty of the scaling factors, and the signals are then detected for almost all depths, for all datasets, and for both spatially complete and sub-sampled fields.
In general the ANT scaling factors for the upper Pacific and Indian Oceans are less than unity because of the observed cooling in the southern part of the tropics (Sect. 3). Therefore we consider a global mean fingerprint excluding the Pacific Ocean between 10\(^{\circ }\)N and 30\(^{\circ }\)S and the Indian Ocean north of 30\(^{\circ }\)S, in order to cut out the region of cooling, which may be due to unforced variability. With this fingerprint the ANT signal is detected in all cases and is mostly consistent with unity (not shown). However, the NAT signal is not detected in any case, implying that most of the volcanic signal is found in the tropical Pacific Ocean. This is also consistent with the analyses of the individual regions.
We also consider a common fingerprint for multiple regions by concatenating the Atlantic, Pacific, Indian and Southern Oceans. The results (not shown) are very similar to those using global mean fingerprints, but the scaling factors tend to be smaller, and some that were consistent with unity are now significantly less than unity, because considering multiple regions reduces the uncertainty of the scaling factors.
Residual consistency test
Allen and Tett (1999) proposed, as a consistency check, comparing the residuals of the best-estimate combination of signals with unforced variability from the independent estimate of the noise \(C_{N_{2}}\). Failure of this test means that the noise estimate is too large or too small, or that the simulated patterns of response are systematically, rather than statistically, in error, in which case the discrepancy will contribute to the residual. (A systematic error in the amplitudes of the patterns, if the patterns themselves are realistic, will not cause the check to fail, since the scaling factors will correct it.) The residual consistency test fails in most cases discussed (crosses on the left side of Figs. 10, 11 and 12). The test succeeds for the upper 100 m alone and for some individual basins, most frequently the Atlantic. This is probably because the Atlantic and the upper 100 m are the best observed, and it agrees with Weller et al. (2016), who show consistency for the upper 220 m in the Atlantic. In the case of the Southern Ocean, for one-layer analyses with both spatially complete and sub-sampled fields, the residual consistency test passes for all depths in ORAS4 (Fig. 12), but not in the other products. In contrast to the Atlantic, the Southern Ocean is the least observed region, so most of the pattern there is due to the ocean model in the reanalysis.
Given that both the spatially complete and the sub-sampled fields fail the residual consistency check, this may suggest that the cause is neither the observational sampling nor the infilling method, assuming the sub-sampled fields are not significantly affected by observational uncertainty. In all cases of failure, the variance of the residuals is larger than the simulated unforced variance. This could be due to a model underestimate of the magnitude of unforced variability, or to model systematic errors in the forced patterns (as noted above, a systematic error in the amplitudes alone would not cause the check to fail).
If we assume that the simulated unforced variability has a realistic spatiotemporal pattern, we may correct its magnitude by applying a scaling factor. It is desirable also to allow for errors in the fingerprints, i.e. the patterns of forced response (cf. Huntingford et al. 2006). A simple method is to inflate the estimates of unforced variance by another appropriate factor, assuming that the model errors have the same patterns as the variability. To deal with both possibilities at once, we use the F-ratio of the residual consistency test (the variance of the residuals divided by the control variance) as a scaling factor for \(C_{N}\) and \(C_{N_{2}}\).
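A hedged sketch of this inflation step: the F-ratio is computed from the whitened residuals and used to scale both noise covariance estimates. The exact statistic of the Allen and Tett (1999) test involves the residual degrees of freedom and the retained EOFs; this simplified version is an assumption.

```python
import numpy as np

def f_ratio_and_inflate(y, X, beta, C_N, C_N2):
    """Return the F-ratio and the inflated covariances (C_N, C_N2).

    y : (j,), X : (j, n) fingerprints as columns, beta : (n,) best estimates.
    """
    r = y - X @ beta                               # regression residuals
    lam, E = np.linalg.eigh(C_N2)
    r_w = (E / np.sqrt(lam)).T @ r                 # whiten with the independent noise estimate
    dof = r.size - beta.size                       # residual degrees of freedom
    f = (r_w @ r_w) / dof                          # ~1 if residuals match control variability
    return f, f * C_N, f * C_N2
```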
The consistency test can no longer be used as such (because, in effect, it is forced to pass). The best estimates of the scaling factors remain the same but their uncertainty is increased (dotted uncertainty ranges in Figs. 10, 11 and 12), especially for the spatially complete fields. For instance, in one-layer and multi-layer global analyses with spatially complete fields, the ANT and NAT signals generally cannot be detected after scaling when layers below 300 m are included (the larger uncertainty makes the scaling factors consistent with zero), except in ORAS4, but with the sub-sampled fields the signals are still detected. The uncertainty of the scaling factors for ORAS4 does not increase as much as for the other observational datasets.
Three-signal attribution (to greenhouse gas, natural and other anthropogenic forcings)
Finally we carry out a three-fingerprint detection and attribution analysis considering ocean temperature change in time and depth for GHG, NAT and OA. Since we do not have simulations for OA forcing alone, we deduce the OA scaling factor (\(\beta _{OA}\)) from the historical simulations, by making a linear combination of the historical, historicalNat and historicalGHG scaling factors (\(\beta _{hist}\), \(\beta _{histNat}\), \(\beta _{histGHG}\) respectively). This approach has been widely investigated in previous detection and attribution studies of surface air temperature (e.g. Tett et al. 1999; Stott et al. 2001, 2003; Gillett et al. 2012; Stott and Jones 2012; Jones et al. 2013), but not of ocean temperature change. In this section we present only the analyses with multiple layers, since these give smaller uncertainty in the scaling factors, as in the two-signal case (Sect. 5.3).
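As a sketch of how \(\beta _{OA}\) and its uncertainty can be deduced, we assume the standard linear transformation used in studies such as Tett et al. (1999), i.e. that the historical response is the sum of the GHG, NAT and OA responses; the precise combination used in this study may differ, so the matrix below is an assumption.

```python
import numpy as np

# Assumed transformation: if X_hist = X_GHG + X_NAT + X_OA, then
# beta_GHG = beta_hist + beta_histGHG, beta_NAT = beta_hist + beta_histNat,
# beta_OA = beta_hist.
J = np.array([[1, 1, 0],    # beta_GHG
              [1, 0, 1],    # beta_NAT
              [1, 0, 0]])   # beta_OA

def transform(beta_raw, V_raw):
    """beta_raw = (beta_hist, beta_histGHG, beta_histNat), V_raw its covariance."""
    return J @ beta_raw, J @ V_raw @ J.T   # transformed scaling factors and covariance
```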
Global mean analysis
All three signals are detected in many cases, but more are detected with sub-sampled than with spatially complete fields (Fig. 13). The uncertainties of the scaling factors are generally smaller than in the two-signal analyses (Fig. 11), indicating that the extra degree of freedom improves the fit and reduces the contribution of the misfit to the residual. As for the two-signal analyses, the residual consistency check almost always fails, and scaling the unforced variability estimates by the F-ratio increases the uncertainty of the scaling factors. After scaling, most signals are no longer detected in the spatially complete fields, but in the sub-sampled fields the uncertainties do not increase as much and in general the signals are still detected (Fig. 13e–h).
An issue that typically arises when considering this three-signal combination is that the GHG and OA signals tend to be degenerate (e.g. Allen et al. 2006), because they have a similar spatiotemporal pattern, although of different magnitude and opposite sign. This is shown by plotting the joint distribution of the \(\beta\) factors (Fig. 14, in which the crosses show the best guess and the ellipses enclose 90\(\%\) confidence regions). For the GHG and OA signals, the distribution is strongly tilted, indicating that the signals are correlated, that is, if one of the signals is underestimated or overestimated the other signal will have the same behaviour. This means that, although three signals can be detected, these two are not entirely independent.
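The degeneracy can be quantified from the covariance of the scaling factors; a minimal sketch follows, in which the indices for the GHG and OA entries are illustrative. A correlation close to \(\pm 1\) corresponds to a strongly tilted ellipse in Fig. 14.

```python
import numpy as np

def beta_correlation(V, i=0, j=2):
    """Correlation between scaling factors i and j (e.g. GHG and OA) from V(beta-tilde)."""
    return V[i, j] / np.sqrt(V[i, i] * V[j, j])
```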
Regional analysis
With fingerprints simultaneously considering four regions and multiple layers, all three signals are detected in most cases (Fig. 15), with \(\beta \ge 1\). After the noise estimate has been scaled by the F-ratio, the signals are still detectable in the sub-sampled fields. Detection is possible in more cases than for the global mean analysis (Fig. 13) and the uncertainty of the scaling factors is smaller, indicating that the extra discrimination of the signals by using regional information improves the fit.
As for the two-fingerprint analysis, we consider geographical regions with the objective of determining whether the GHG and OA signals can be better discriminated. Since the temporal structure of the warming may differ between the hemispheres, we also considered separate Northern Hemisphere and Southern Hemisphere fingerprints, but no benefit is found: the three signals are detected in most cases, showing the same behaviour as in the previous analyses.