Learning in greenhouse gas emission inventories in terms of uncertainty improvement over time
Abstract
This paper addresses the problem of learning in greenhouse gas (GHG) emission inventories understood as reductions in uncertainty, i.e., inaccuracy and/or imprecision, over time. We analyze the National Inventory Reports (NIRs) submitted annually to the United Nations Framework Convention on Climate Change. Each NIR contains data on the GHG emissions in a given country for a given year as well as revisions of past years’ estimates. We arrange the revisions, i.e., estimates of historical emissions published in consecutive NIRs, into a table, so that each column contains revised estimates of emissions for the same year, reflecting different realizations of uncertainty. We propose two variants of a two-step procedure to investigate the changes of uncertainty over time. In step 1, we assess changes in inaccuracy, which we consider constant within each revision, by either detrending the revisions using the smoothing spline fitted to the most recent revision (method 1) or by taking differences between the most recent revision and the previous ones (method 2). Step 2 estimates the imprecision by analyzing the columns of the data table. We assess learning by detecting and modeling a decreasing trend in inaccuracy and/or imprecision. We analyze carbon dioxide (CO_{2}) emission inventories for the European Union (EU15) as a whole and its individual member countries. Our findings indicate that although there is still room for improvement, continued efforts to improve accounting methodology lead to a reduction of uncertainty of emission estimates reported in NIRs, which is of key importance for monitoring the realization of countries’ emission reduction commitments.
Keywords
Uncertainty, Inaccuracy, Imprecision, GHG emission inventory, Learning, Regression model
1 Introduction
Assessing the uncertainty of greenhouse gas (GHG) inventories is a complex problem that has been investigated for many years; however, no commonly accepted solution has been found. Low uncertainty of GHG emission inventories, namely, high accuracy and precision of emission estimates, is key to setting reduction targets for climate treaties (Jonas et al. 2010), monitoring treaty implementation (Bun et al. 2010), and establishing reliable emission trading schemes (Ermolieva et al. 2014).
According to the Guidelines for National Greenhouse Gas Inventories (cf. IPCC 2006, vol 1, Ch. 3), accuracy is an agreement between the true value and the average of repeated measured observations or estimates of a variable. Thus, inaccuracy (systematic error) is a result of failure to capture all relevant processes involved, because the available data are not representative of all real-world situations, or because of instrument error. Precision, in turn, is the agreement among repeated measurements or estimates of the same variable. High precision corresponds to a low random error.
Over time, as methods for accounting GHG emissions evolve (from the tier 1 and tier 2 approaches recommended in IPCC (2000, 2006) to the tier 3 approach considered in IPCC (2006)), both the accuracy and precision of GHG inventories may change, undermining or improving the effectiveness of policies. The evolution of accounting methodology is particularly well reflected in the emission estimates published each year by the parties to the United Nations Framework Convention on Climate Change (UNFCCC) in the form of National Inventory Reports (NIRs). Each of these reports contains GHG emission data for a given year and revised estimates of past years’ emissions. These estimates are considered to reflect the best available knowledge and are therefore treated as “true emissions.” Yet, they are bound to change with the following year’s revisions, as new data and knowledge about emission sources and processes become available to the institutions preparing the GHG inventories. The emergence of this new knowledge may allow the reporting institutions “to learn” how to prepare better quality GHG inventories. Here, we understand learning in a positive (not normative) sense as a detectable increase in the accuracy of revisions and/or an increase in the precision of initial estimates of new GHG emissions over time.
The problem of investigating learning is in line with the discussion on uncertainty assessment of NIRs considered, for example, in Nahorski and Jęda (2007), where the uncertainty of each reported revision was analyzed separately, and in Marland et al. (2009) and Hamal (2010), where changes in uncertainty over time were investigated. The concept of learning was also discussed in Żebrowski et al. (2015). Here, we especially build upon the work of Jarnicka and Nahorski (2015), and Jarnicka and Nahorski (2016), where models for evolution of uncertainty structure over time were developed and applied to CO_{2} emission inventories submitted by parties to the UNFCCC in their NIRs; however, we distinguish between uncertainty related to reported revisions and uncertainty related to emissions, referring to them as inaccuracy and imprecision. This allows for learning to be considered in terms of reduction of inaccuracy and imprecision over time.
In this paper, we discuss methods of detecting and assessing learning in a set of consecutive NIRs. More specifically, we analyze estimates of carbon dioxide (CO_{2}) emissions, excluding those from the land use, land use change, and forestry (LULUCF) sector, as the uncertainties of LULUCF emissions are large and may easily overshadow subtle trends in emission estimates. Detecting learning requires a two-stage analysis. First, information on inaccuracy and imprecision needs to be extracted from revisions of GHG inventories. We deal with this problem in Section 2, where we describe our main method of assessing uncertainty components (method 1), based on the detrending of consecutive revisions. Subtraction of the estimated trend extracts inaccuracy, and the transformed emission estimates are then used to evaluate imprecision. The method works on the assumption that detrending “cleans” the data of the information on the “real emission,”^{1} leaving only the inventory uncertainty. To assess the quality of this “cleaning,” we use an auxiliary method (method 2), which follows a similar analysis, but with the estimated trend being replaced by the most recent revision of historical emission estimates. We conclude Section 2 with a graphical illustration of methods 1 and 2. The second stage of our analysis, the detection of learning, is discussed in Section 3; there, we consider how to detect trends in the changes of inaccuracy and imprecision over time and how to interpret those trends as learning, and we develop an algorithm to detect and assess learning (algorithm 1). Section 4 presents the results obtained by applying this procedure to the GHG emission inventories of the EU15 and its individual member countries. Section 5 presents conclusions.
2 Data presentation and uncertainty assessment
Indexing the data
Uncertainty \( {U}_j^n \) represents an interplay between the inaccuracy and the imprecision unique to each data point \( {E}_j^n \). We observe that inaccuracy is associated with each revision, namely, an entire row of Table 1, rather than its single entries. Indeed, for each year j, j = 2001, … , 2015, the estimates \( {E}_j^n,\ n=1990,\dots, j \), published in that year, were calculated using the same accounting method (by this, we mean choices on adopting specific emission factor values and on ascribing activity data to subsectors, but still following the accounting schemes suggested by the UNFCCC) and thus have the same systematic error, that is, the same inaccuracy. However, inaccuracy differs across revisions (for instance, due to improved emission factors or minor changes in the classification of activity data, which occur from revision to revision). The evolution of inaccuracy is described by the time series U_{j}, j = 2001, … , 2015, where U_{j} denotes the inaccuracy of the jth revision.
Imprecision, on the other hand, is an attribute of a set of repeated estimates of the same quantity. It is therefore associated with the columns of Table 1, where the nth column, n = 1990, … , 2015, contains repeated estimates of emissions that occurred in the year n. The changes in imprecision of emission estimates are reflected by the time series U^{n}, n = 1990, … , 2015, where U^{n} is the estimate of imprecision based on \( {U}_j^n \), j = max {2001, n}, … , 2015.
First, we “clean” the data of information about the “real emission” to extract uncertainty. We perform that “cleaning” by operating on the rows of Table 1 and propose two variants of the “cleaning” procedure. The first variant is based on detrending the rows of Table 1. The second, complementary method makes use of the most recent revision (the last row of Table 1) in place of the estimated trend, in order to assess the amount of information captured by the trend. We analyze the data thus transformed row-wise to extract the inaccuracy of consecutive revisions, reflected by the time series U_{j}, j = 2001, … , 2015. Finally, once the inaccuracy of revisions is extracted from the data, we perform a column-wise estimation of the imprecision of emission estimates U^{n}, n = 1990, … , 2015.
Differences (2) and (3) correspond to the inaccuracy of revisions. Inaccuracy is understood as a systematic bias, i.e., the difference between the true value and the average of its repeated estimates. However, each revision consists of a series of different values (i.e., just one estimate for each year, starting in 1990), not repeated estimates of the same value. Hence, using the standard deviation is a suitable way of describing the inaccuracy of revisions.
Interpreting the results obtained when applying method 1 depends on the fulfillment of assumptions in model (4), in particular, on the normality of differences \( {d}_j^n \). To verify normality, we use the Shapiro-Wilk test (considered the most reliable normality test) with significance level α = 0.05 and confirm the results with the Lilliefors test (recommended for use in small samples). If the normality assumption is satisfied, we also test the differences in model (4) for the significance of the population mean value, using the two-tailed t test with α = 0.05. If the normality condition is not met, the t test cannot be used, as we deal with small samples (see, e.g., Cowan 1998). We can apply its nonparametric version, i.e., the Mann-Whitney test, but need to take into account that it refers to the median, not the mean value. In fact, that test only provides some information on the mean value for normal-like distributions (in particular symmetric ones) when the mean and median are close to each other.
The assumption on the insignificance of the population mean value is of secondary importance and is needed only to formally confirm the way the standardization is performed. The assumption of normality, however, is of critical importance. If this assumption is satisfied, we can say that detrending “cleans” the data sufficiently, removing all the information on the “real emission,” so that we are left only with information on inaccuracy. If the normality condition is not met, this may indicate that the estimation of the “real emission” was not good enough (most likely due to substantial approximation errors), which makes detrending less effective. This may affect the inaccuracy assessment and lead to different results in the learning investigation.
Note that there is one row of data less to be analyzed in method 2, compared with method 1, as for every n, \( {E}_{2015}^n - {E}_{2015}^n = 0 \). Moreover, as opposed to (2), the difference \( {d}_{2014}^n = {E}_{2015}^n - {E}_{2014}^n \) does not represent residuals in a nonparametric regression approach. We can therefore expect that the normality condition may not be met (not only for this difference but for other differences too). This should result in a different behavior of these differences, compared with the approach based on the smoothing spline, but we have to check whether it helps in the learning investigation.
According to the above interpretation, verification of normality provides two types of information. If the normality condition is met, we can assume that differences (both in method 1 and method 2) consist only of inaccuracy (which needs to be estimated), but we must be aware that this information may be incomplete. On the other hand, the lack of normality means that part of the “real emission” has been left over in the analyzed differences, which may affect the behavior of inaccuracy (and therefore also imprecision), and make it difficult to capture learning.
Note that the interpretation of inaccuracy estimates (5) obtained with method 1 is similar to that for the inaccuracy estimates calculated with method 2, as in both cases, the relative estimates are calculated with respect to the “real emission” represented either by the smoothing spline \( {\mathrm{Sp}}_{2015}^n \) or by the most recent revision \( {E}_{2015}^n \). The relative imprecision estimates calculated in the second step of methods 1 and 2 are based on the results obtained in the first step—thus, they are also relative to the “real emission.”
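The first step of method 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `revisions` table holds made-up numbers, the function names are hypothetical, and the row-wise spread is computed as a plain sample standard deviation of the differences; the normalization relative to the “real emission” used for the estimates (5) is not reproduced here.

```python
from statistics import stdev

# Hypothetical revision table (made-up numbers): revisions[j][n] is the
# estimate, published in revision year j, of the emissions in year n.
revisions = {
    2013: {2011: 100.0, 2012: 98.0, 2013: 97.0},
    2014: {2011: 101.0, 2012: 99.0, 2013: 96.0, 2014: 95.0},
    2015: {2011: 102.0, 2012: 99.0, 2013: 96.5, 2014: 94.0, 2015: 93.0},
}


def method2_differences(table):
    """Return d[j][n] = E_latest^n - E_j^n for every revision j before the latest."""
    latest = max(table)
    reference = table[latest]
    return {
        j: {n: reference[n] - estimate for n, estimate in row.items()}
        for j, row in table.items()
        if j != latest  # the latest row would be identically zero
    }


def rowwise_spread(differences):
    """Sample standard deviation of each row of differences (rows with >= 2 entries)."""
    return {
        j: stdev(list(row.values()))
        for j, row in differences.items()
        if len(row) >= 2
    }


diffs = method2_differences(revisions)
spread = rowwise_spread(diffs)
```

The most recent revision drops out of `diffs`, which mirrors the remark that method 2 has one row less to analyze than method 1.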
3 Investigating learning
To detect and assess learning, if present, in inaccuracy and imprecision, we analyze the time series of their estimates \( {\hat{U}}_j,\ j=2001,\dots, 2015 \) and \( {\hat{U}}^n,\ n=1990,\dots, 2015 \), obtained using method 1 or method 2 (presented in Section 2).
We assume that learning refers to improvement in the certainty and precision of emission inventories over time, that is, to an observed reduction in uncertainty. We distinguish between learning in the inaccuracy of revisions and learning in the imprecision of emission estimates; however, we may not be able to fully disentangle the two.
We check the aforementioned time series of inaccuracy and imprecision estimates for a trend, namely, the presence of a trend and then its monotonic behavior. In both cases, learning corresponds to the trend decreasing over time (the downward trend), where time is understood as a year of revision in the case of inaccuracy, and as a year in which emissions occurred, in the case of imprecision. This trend can be modeled by a regression curve taking positive values, being decreasing, and approaching zero asymptotically. We can expect some residual uncertainty always to be present. In that case, the trend will stabilize around some level above zero, which in principle can be modeled within the framework proposed here. However, assumptions on asymptotic behavior are of low practical importance, as we work with short samples. For simplicity, we assume that the trend decreases to zero. In addition, we require the curve modeling the trend to be concave up. This is a mild technical assumption, facilitating the use of regression models to assess learning, as we want to avoid the situation where the curve modeling the trend crosses the horizontal axis and takes on negative values.
Both examples presented in Fig. 2 illustrate learning, although the one depicted in Fig. 2b illustrates it at a much slower rate. This shows that we can also assess the rate of learning based on the model fitted and on its goodness of fit. Thus, having estimated inaccuracy and imprecision, we first check them for a downward trend (detecting learning) and then assess that learning (if detected).
3.1 Detecting trends in uncertainty
To test uncertainty estimates for a downward trend, we first perform the Bartels test^{4} for randomness (Bartels 1982), testing the null hypothesis H_{0}: randomness against the left-sided alternative hypothesis H_{1}: trend. This nonparametric rank test is very sensitive in trend detection, showing evidence of a trend even if it is very weak. It does not, however, distinguish between a downward and an upward trend. To check this, the Cox-Stuart test^{5} (Cox and Stuart 1955) can be used, with null hypothesis H_{0}: randomness against the left-sided alternative hypothesis H_{1}: downward trend.
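As a rough illustration of what the Bartels test measures, the sketch below computes only its test statistic, the rank version of the von Neumann ratio: the sum of squared differences between consecutive ranks divided by the sum of squared deviations of the ranks from their mean. Under randomness this ratio is close to 2, and values well below 2 point to a trend. The function names are hypothetical, and the conversion of the statistic into a p value is deliberately left out.

```python
def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_von_neumann_ratio(values):
    """Rank von Neumann ratio used as the statistic of the Bartels test."""
    ranks = average_ranks(values)
    mean_rank = sum(ranks) / len(ranks)
    denominator = sum((r - mean_rank) ** 2 for r in ranks)
    if denominator == 0:
        raise ValueError("all observations are equal; the ratio is undefined")
    numerator = sum((ranks[i] - ranks[i + 1]) ** 2 for i in range(len(ranks) - 1))
    return numerator / denominator


trending = rank_von_neumann_ratio([5.0, 4.0, 3.0, 2.0, 1.0])     # steadily decreasing
alternating = rank_von_neumann_ratio([1.0, 5.0, 2.0, 4.0, 3.0])  # zigzag pattern
```

A monotone series gives a ratio far below 2, while a zigzag series gives a ratio above 2, which is why a left-sided alternative corresponds to a trend.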
Both of the above tests are quite easy to perform and work well for small samples (as in the analysis considered here), but as nonparametric tests they may, in some cases, be insufficiently powerful. Their combination is therefore important, allowing us to confirm the presence of the trend detected by the Bartels test (slightly oversensitive and therefore ideal for initial analysis) and, at the same time, to apply the Cox-Stuart test (less powerful) only to those data where the trend is present. To perform the aforementioned tests, we take the most common significance level α = 0.05 (e.g., Cowan 1998; Brandt 2014), as it works well in most cases. Setting α at 0.05 means that there is a 5% chance of rejecting the null hypothesis when it is true (a type I error). By reducing α (e.g., to 0.01), we reduce the chance of a type I error but increase the chance of not rejecting H_{0} when the alternative hypothesis is true (a type II error). Thus, 5% seems to be a good balance between these two issues.
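A minimal sketch of the left-sided Cox-Stuart test follows, assuming the common textbook formulation: each observation in the first half of the series is paired with its counterpart in the second half (the middle observation is dropped when the length is odd), zero differences are discarded, and the number of positive differences is compared with a Binomial(m, 1/2) distribution. The function name is hypothetical, and the variant used in the paper may differ in details such as the handling of ties.

```python
from math import comb


def cox_stuart_downward_p(values):
    """Left-sided Cox-Stuart p value for H1: downward trend."""
    n = len(values)
    offset = (n + 1) // 2  # pairs x[i] with x[i + offset]; skips the middle if n is odd
    differences = [values[i + offset] - values[i] for i in range(n - offset)]
    nonzero = [d for d in differences if d != 0]  # ties carry no sign information
    m = len(nonzero)
    if m == 0:
        return 1.0  # no usable pairs, hence no evidence of a trend
    positives = sum(1 for d in nonzero if d > 0)
    # P(T <= positives) for T ~ Binomial(m, 1/2): few increases favor H1.
    return sum(comb(m, k) for k in range(positives + 1)) / 2 ** m


p_decreasing = cox_stuart_downward_p([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
p_increasing = cox_stuart_downward_p([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

With only five pairs, the smallest attainable p value is 1/32, which illustrates why the test has limited power on short series.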
3.2 Assessing learning
If a downward trend in uncertainty is present, we can model it by fitting a regression curve. Since the linear regression cannot be used (a straight line does not satisfy the model requirements as it crosses the horizontal axis at some point) and we want to keep the analysis as simple as possible, we consider nonlinear regression models that can be transformed into a standard linear regression (e.g., Myers 1990; Hocking 2013). This allows us to use coefficients of determination R^{2} to compare the results.

(M1) \( Y = e^{at + b} \), exponential model

(M2) \( Y = e^{a\ln (t) + b} \), power model
Variable Y represents uncertainty (inaccuracy or imprecision), while t corresponds to time (in years). Thus, both take only positive values and can be logtransformed. If a < 0, both curves are decreasing to zero, but the first one at a much faster rate. The difference between their shapes can be observed in Fig. 2, where panel (a) illustrates model (M1), while panel (b) corresponds to model (M2).
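Both models become straight lines after a log transform: ln Y = at + b for the exponential model and ln Y = a ln(t) + b for the power model. The sketch below fits them by ordinary least squares on the transformed data and reports R^{2} of the linearized fit. The function names and the sample data are hypothetical; this is an illustration of the transformation, not the authors' code.

```python
from math import exp, log


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return a, b, r_squared


def fit_exponential(t, y):
    """Exponential model Y = exp(a*t + b), fitted as ln Y = a*t + b."""
    return fit_line(list(t), [log(v) for v in y])


def fit_power(t, y):
    """Power model Y = exp(a*ln(t) + b), fitted as ln Y = a*ln(t) + b."""
    return fit_line([log(v) for v in t], [log(v) for v in y])


# Made-up, noise-free series that decays exponentially with a = -0.14, b = 2.
years = list(range(1, 11))
uncertainty = [exp(-0.14 * t + 2.0) for t in years]
a_exp, b_exp, r2_exp = fit_exponential(years, uncertainty)
a_pow, b_pow, r2_pow = fit_power(years, uncertainty)
```

On this synthetic series the exponential fit is exact, while the power fit is visibly worse, which is the kind of contrast the R^{2} comparison relies on.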
Because of that difference, we distinguish between strong learning (learning at a faster rate) and weak learning (learning at a slower rate). We say that there is a strong learning in uncertainty when the observed downward trend can be modeled using (M1) with a reasonably good fit. If model (M2) is fitted instead, we call it weak learning (or learning at a slower rate).
We select the model based on its goodness of fit, measured by R^{2}, which indicates how much of the relationship between variables Y and t (uncertainty and time, respectively) is explained by the model used (e.g., Soong 2004; Ryan 2008). For instance, the value of R^{2} < 0.5 indicates that less than 50% of the relationship between variables is explained (and in such a case, the model most likely fails to satisfy the assumptions required, e.g., on the normality of residuals).
In this paper, we will consider such explanatory capabilities of the model as being insufficient and will use a cutoff value for R^{2} equal to 0.5. This choice of the cutoff value is arbitrary, as there are no strict rules regarding the threshold, although it is often assumed that it should equal at least 60–70%. In some areas, low values of R^{2} (around 30%) are considered sufficient. Taking a cutoff value at 50% seems to be reasonable here.
The values of R^{2} < 0.5 for model (M1) will be interpreted as no evidence of a strong learning. In such cases, model (M2) will be used, but if R^{2} for this model is again smaller than 0.5, we will say that even a weak learning could not be detected.
Algorithm to detect and assess learning
According to Algorithm 1, the exponential model is preferred over the power model, which is consistent with the interpretation given above. If fitting the exponential model gives R^{2} > 0.5, this is equivalent to a strong learning, in which case the power model is not considered. We use the power model, if fitting the exponential model gives R^{2} < 0.5. This means that the criterion for the choice of model (M1) or (M2) is, in fact, the cutoff value and that the values of R^{2} obtained as the results should be compared independently for each model.
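The decision logic described above can be sketched as follows. The function name and signature are hypothetical, the inputs are assumed to be precomputed (p values of the two trend tests and R^{2} of both fitted models), and treating R^{2} exactly equal to the cutoff as sufficient is an assumption, since the text only covers values above and below 0.5.

```python
ALPHA = 0.05      # significance level used for both trend tests
R2_CUTOFF = 0.5   # cutoff for the goodness of fit


def classify_learning(p_bartels, p_cox_stuart_down, r2_exponential, r2_power):
    """Classify a series of uncertainty estimates as strong, weak, or no learning."""
    if p_bartels >= ALPHA:
        return "no learning"  # randomness not rejected: no trend detected
    if p_cox_stuart_down >= ALPHA:
        return "no learning"  # a trend exists, but it is not shown to be downward
    if r2_exponential >= R2_CUTOFF:
        return "strong learning"  # exponential model (M1) fits well enough
    if r2_power >= R2_CUTOFF:
        return "weak learning"  # only the power model (M2) fits well enough
    return "no learning"  # downward trend, but neither model fits


# Hypothetical inputs for illustration only.
trend_not_downward = classify_learning(0.003, 0.77, 0.0, 0.0)
good_exponential_fit = classify_learning(1e-7, 1e-5, 0.69, 0.60)
only_power_fits = classify_learning(1e-8, 1e-5, 0.40, 0.79)
```

Because the exponential model is checked first, a good power fit is never consulted when the exponential fit already passes the cutoff, so the two R^{2} values are not compared with each other.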
4 Learning in the EU15 emission inventories
The method of detecting learning discussed in previous sections is generic and can be applied to any set of consecutive GHG inventories or their parts (specific sectors). Here, we demonstrate that potential, by applying the method to analyze the estimates of total CO_{2} emissions excluding LULUCF sector, submitted annually to the UNFCCC in the form of the NIRs^{6} produced by each of the EU15 member countries, along with the emission estimates for the entire EU15.^{7} The emission estimates analyzed cover the period from 1990 to 2015, published in the years 2001–2015.
4.1 Analyzing the EU15 emission inventories
The detrended differences oscillate randomly around zero. However, if we compare them, we can observe some regularities, as if they were following the same pattern (see Fig. 5a). The differences calculated according to the second method show rather chaotic behavior (Fig. 5b), but we can also observe groupings of differences with similar behavior, for example, those related to the earliest or the most recent revisions.
This suggests that the detrended differences have been “cleaned” sufficiently, while those based on the most recent revision may still involve some information on the “real emission.” To verify this, we carry out normality tests (the Shapiro-Wilk and the Lilliefors test), with α = 0.05, and (if possible) t tests to verify the insignificance of the population mean value. The tests conducted show that in most cases, no statistical evidence can be found against the null hypothesis on the normality of the detrended differences. The tests fail in the case of the earliest revisions, which can partly be explained by the small sample sizes. In all cases where the normality condition is met, we also conduct the two-tailed t tests, which show that in most cases, the true population mean is statistically insignificant and can be assumed to be zero.
Checking normality for differences based on the most recent revision shows, in turn, that in most cases, the differences cannot be considered to be normally distributed. This translates into a different behavior and properties of differences calculated by method 1 and method 2.
Corollary 1

By detrending the revisions, we managed to remove all the information on the “real emission,” leaving only the inaccuracy.

By subtracting the most recent revision, we “cleaned” the data only partially; some information on the “real emission” is still present.
We find \( {\hat{\sigma}}_j \) and use them to evaluate changes in inaccuracy over time, as described in Diagrams 1 and 2 (for methods 1 and 2, respectively), and apply algorithm 1 (depicted in Table 2) to check them for learning. First, we analyze the inaccuracy estimates obtained using method 1. The Bartels test for randomness, with null hypothesis H_{0}: randomness against the left-sided alternative hypothesis H_{1}: trend, performed with α = 0.05, detects a trend in inaccuracy (as p value = 0.0028 < α, we reject the null hypothesis of randomness). To check whether it is a downward trend, we use the Cox-Stuart test, with H_{0}: randomness against H_{1}: downward trend. As p value = 0.77 > α, we cannot reject H_{0} in favor of a downward trend. However, to explain the result of the Bartels test, we also use the right-sided Cox-Stuart test, with the alternative hypothesis of an upward trend. This time p value = 0.007 < α, which provides evidence of an upward trend in inaccuracy. Therefore, no learning in inaccuracy is detected.
Investigating learning in EU15 CO_{2} emission inventories (method 1); fitted model: Y = e^{at + b}
Uncertainty  |  Bartels test  |  Cox-Stuart test  |  Coefficient b  |  Coefficient a  |  F-test  |  Residuals  |  R^{2}  |  Result  | 
Inaccuracy  |  p = 0.0028 (trend)  |  p = 0.007 (upward trend)  |  –  |  –  |  –  |  –  |  –  |  No learning in inaccuracy detected  | 
Imprecision  |  p = 6.5 × 10^{−8} (trend)  |  p = 0.000024 (downward trend)  |  280.2 (p = 5.6 × 10^{−8})  |  −0.14 (p = 5.3 × 10^{−7})  |  p = 5.3 × 10^{−7}  |  SE = 0.7; normality (Shapiro-Wilk) p = 0.34  |  0.69  |  Strong learning in imprecision  | 
Investigating learning in EU15 CO_{2} emission inventories (method 2); fitted model: Y = e^{a ln (t) + b}
Uncertainty  |  Bartels test  |  Cox-Stuart test  |  Coefficient b  |  Coefficient a  |  F-test  |  Residuals  |  R^{2}  |  Result  | 
Inaccuracy  |  p = 0.312 (randomness)  |  p = 0.773 (randomness)  |  –  |  –  |  –  |  –  |  –  |  No learning in inaccuracy detected  | 
Imprecision  |  p = 9.3 × 10^{−9} (trend)  |  p = 0.000021 (downward trend)  |  1654.8 (p = 7.0 × 10^{−9})  |  −217.5 (p = 6.9 × 10^{−9})  |  p = 6.9 × 10^{−9}  |  SE = 0.4; normality (Shapiro-Wilk) p = 0.21  |  0.79  |  Weak learning in imprecision  | 
The analysis carried out according to algorithm 1 with both methods 1 and 2 shows that there is no learning in inaccuracy. Method 1 enabled a weak upward trend to be detected. Using method 2, we could observe random inaccuracy behavior over time. As the differences in method 2 were not normally distributed, it can be concluded that the inaccuracy has not been sufficiently extracted. Both methods allowed us to capture learning in imprecision, but method 1 resulted in detecting learning at a faster rate, while method 2 detected learning at a slower rate. This can be explained by a worse “cleaning” of the data when using method 2.
Corollary 2

There is no learning in inaccuracy (none of the approaches used allowed us to capture it).

We have not lost any information on uncertainty due to detrending, while extracting uncertainty with method 2 was insufficient.

There is strong learning in imprecision (even insufficient extraction of uncertainty allowed us to capture it, although at a slower rate).
4.2 Learning assessment for the EU15 member countries
The data on GHG emissions in the EU Inventory Reports, checked for possible learning in Section 4.1, are obtained by adding those reported by member countries. Analysis of the NIR data for each of the EU15 member countries should explain and confirm the previous results. Firstly, some countries are expected to follow the same scheme, where strong learning in imprecision is captured by applying method 1, and only weak learning in imprecision is captured by applying method 2. This refers to countries with high emissions reported (as their contribution to the data is significant), and those with particularly strong learning in imprecision detected using method 1. Secondly, there are likely to be countries showing no learning at all (which may have slightly weakened the downward trend in imprecision observed for the EU15). Of interest to us are any results in between, far from these extreme cases, and whether or not any similarities between neighboring countries can be observed.
We conduct the analysis, using both method 1 and method 2, and applying algorithm 1 to detect and assess learning, as in Section 4.1, and compare the results obtained for various countries.
The results of the learning investigation allow the countries analyzed to be divided into six groups.
4.2.1 Group I: no learning in inaccuracy, strong learning in imprecision
When analyzing inaccuracy estimated with method 2, the Bartels test indicated the presence of a trend, but that result was not confirmed in further analysis. As with the EU15, learning at a faster rate was captured using method 1, with the fit of the exponential model R^{2} = 0.79 for Germany, R^{2} = 0.74 for the Netherlands, and R^{2} = 0.59 for the UK. Weak learning was captured using method 2, where the fit of the power model, used to illustrate changes in imprecision for those countries, was equal to R^{2} = 0.73, R^{2} = 0.62, and R^{2} = 0.52 for Germany, the Netherlands, and the UK, respectively.
Given that the CO_{2} emissions for those countries are quite high compared with other countries, they have a large impact on the results obtained by the entire EU15. This impact is also due to the fact that similar statistical properties of the differences analyzed can be observed. The detrended differences turned out to be mostly normally distributed with the population mean value zero, while those obtained based on the most recent revision, as for the EU15, were mostly nonnormal. This can be interpreted, as before, in terms of sufficiently and insufficiently “cleaned” revision data series.
4.2.2 Group II: weak learning in inaccuracy, strong learning in imprecision
By investigating learning with method 1, we managed to observe strong learning in imprecision. Tests for randomness showed the presence of a downward trend in imprecision, and the exponential model fitted to this trend gave R^{2} = 0.77 for Austria and R^{2} = 0.84 for Finland.
Method 2, in turn, allowed learning to be captured both in inaccuracy and imprecision, although both at a slower rate. Tests for randomness showed evidence of a downward trend in inaccuracy. To assess this, the power model was used, giving a fairly poor fit with R^{2} slightly over 50%, namely, R^{2} = 0.58 for Austria and R^{2} = 0.59 for Finland. This, however, enabled us to consider it a weak learning in inaccuracy. The analysis of changes in imprecision also indicated a weak learning, with the fit of the power model R^{2} = 0.61 for Austria and R^{2} = 0.75 for Finland. Such results may ultimately indicate a strong learning in imprecision, as a weak learning was captured despite the insufficiently “cleaned” data. As we did not detect learning in inaccuracy in the case of detrending, the learning can be considered so weak that the sufficient “cleaning” of the data (by detrending) makes capturing it impossible.
4.2.3 Group III: weak learning in inaccuracy, strong learning in imprecision (detected only when using method 1)
Comparing the results obtained for Ireland with those for Austria and Finland, we can see a good fit of the exponential model, used to illustrate the changes in imprecision over time. This translates into strong learning in imprecision. In the case of Ireland, the fit is slightly worse, with R^{2} = 0.63, which may indicate that learning in imprecision is slightly less pronounced and becomes undetectable after extracting inaccuracy with method 2. Thus, leaving some information on the “real emission” (in method 2) enables a weak learning in inaccuracy to be captured, at the price, however, of not detecting learning in imprecision.
The case of Ireland illustrates the discussion in Section 2, confirming that using different approaches may, in some cases, be crucial.
4.2.4 Group IV: no learning in inaccuracy, weak learning in imprecision
At the same time, analysis of changes in inaccuracy over time with method 2 confirmed that there is no learning in inaccuracy. The behavior of inaccuracy estimates was, however, different than under method 1 (see Figs. 8 and 12).
In the first case, we observed an upward trend in inaccuracy. Method 2 showed, in turn, that changes in inaccuracy are random (as confirmed by tests for randomness). For Spain, the Bartels test indicated the presence of a trend, but further analysis did not confirm this result (the Cox-Stuart test showed evidence for the randomness of the data).
4.2.5 Group V: weak learning in inaccuracy, no learning in imprecision
The power model fitted provided R^{2} = 0.59 and R^{2} = 0.67 for Denmark and Sweden, respectively. Neither method 1 nor method 2 enabled learning in imprecision to be captured. Tests for randomness showed no presence of a trend in imprecision, indicating random changes in imprecision over time and hence no learning in imprecision.
It is easy to observe the similarity in the behavior of the estimated uncertainty over time with respect to the changes both in inaccuracy and imprecision. We should stress that the data analyzed, both for Denmark and Sweden, seem to be chaotic and random. This was already noticeable when the differences were being analyzed. The detrended differences were mainly nonnormally distributed, which means that detrending did not sufficiently “clean” the data. The same could be observed for differences based on the most recent revision. Thus, due to the nature of the data for Denmark and Sweden, we were, in fact, unable to sufficiently extract the uncertainty.
4.3.4 Group VI: no learning in inaccuracy, no learning in imprecision
It should be noted that, as two of these countries (i.e., Greece and Luxembourg) started their official reporting to the UNFCCC later (Greece in 2002 and Luxembourg in 2004), the samples analyzed in those cases were slightly shorter. However, this did not affect the results obtained. It is worth mentioning that for each of these four countries, as in the case of the data for Denmark and Sweden, random and chaotic behavior could be observed. Only some of the detrended differences turned out to be normally distributed, which, as in the previous case, confirms the random nature of the emission inventories for these countries.
Corollary 3

Only for three countries (Austria, Finland, and Ireland) did we manage to capture learning both in imprecision and in inaccuracy (the latter at a slower rate).

Only three countries (Germany, Netherlands, and the UK) followed the scheme observed for the entire EU15, with learning in imprecision.

For 9 of the 15 countries considered, the CO_{2} emission inventories showed random changes in inaccuracy and imprecision rather than learning.

In most cases, we managed to detect only weak learning, either in inaccuracy (Denmark and Sweden) or in imprecision (Italy, Portugal, and Spain), or we detected no learning at all (Belgium, France, Greece, and Luxembourg).
Summary of learning results for EU15 countries
Country  Learning in inaccuracy (method 1)  Learning in inaccuracy (method 2)  Learning in imprecision (method 1)  Learning in imprecision (method 2)  Group  
Austria  –  Weak  Strong  Weak  II 
Belgium  –  –  –  –  VI 
Denmark  –  Weak  –  –  V 
Finland  –  Weak  Strong  Weak  II 
France  –  –  –  –  VI 
Germany  –  –  Strong  Weak  I 
Greece  –  –  –  –  VI 
Ireland  –  Weak  Strong  –  III 
Italy  –  –  Weak  –  IV 
Luxembourg  –  –  –  –  VI 
Netherlands  –  –  Strong  Weak  I 
Portugal  –  –  Weak  –  IV 
Spain  –  –  Weak  –  IV 
Sweden  –  Weak  –  –  V 
UK  –  –  Strong  Weak  I 
5 Conclusions and policy recommendations
The practice of revising GHG inventories provides a unique opportunity to conduct a diagnostic analysis of the quality of emission estimates, in terms of both their accuracy and precision. The volume of data collected over the last 15 years has just become sufficient to allow for the application of statistical methods to detect a reduction of uncertainty (i.e., learning) in accounting-based estimates of national GHG emissions published in NIRs. We emphasize that further collection of new data (both new emission estimates and revisions of the old ones) is recommended, as longer data samples increase the confidence in the results obtained in Section 4.
In general, method 1 appears to be better than method 2 at detecting learning in imprecision. For the EU15, with method 1, we were able to find evidence of strong learning in imprecision, while with method 2, we captured only weak learning. This conclusion is strengthened by the observation (cf. Table 5) that whenever method 1 detects strong learning in imprecision, method 2 indicates only weak learning (for countries from groups I and II); Ireland (group III) is an exception here, as method 2 detected weak learning in inaccuracy instead of imprecision. Moreover, whenever method 1 detects weak learning in imprecision, method 2 fails to find any evidence of learning (for countries in group IV). Method 2, however, occasionally allows detection of weak learning in inaccuracy when method 1 fails to find evidence of learning in inaccuracy (for countries in groups II, III, and V). Yet, this comes at the price of a generally worse performance in detecting learning in imprecision (only weak learning was detected for countries in group II, and no learning for groups III and V).
A closer look at the fulfillment of the normality assumption sheds some light on this apparent difference in performance between the two methods discussed. In most cases, the differences between the emission estimates and the trend used in method 1 are normally distributed. Thus, detrending removes all the information about the “real emission,” but potentially also some information on inaccuracy. This may render inconspicuous trends in relative inaccuracy virtually undetectable using method 1. On the other hand, the differences between the most recent revision and the older ones, as analyzed in method 2, are in general not normally distributed. This means that some information on the “real emissions” was still left over in the transformed data, interfering with the estimation of inaccuracy and thus affecting the assessment of imprecision. This may be the reason why method 2 detects only weak learning (if any). However, this insufficient “cleaning” of the data from information on the “real emissions” may, in some cases, retain some information on inaccuracy (which is removed by method 1), making method 2 more suitable for detecting feeble trends in inaccuracy. To summarize, method 1 may have a slight tendency to underestimate learning in inaccuracy, while method 2 may be more pessimistic in assessing learning in imprecision.
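To make the difference between the two variants of step 1 concrete, the sketch below builds the differences that each method analyzes. It is a simplified illustration under stated assumptions: a centered moving average stands in for the smoothing spline of method 1, the inaccuracy of a revision is summarized by the mean of its differences, and all names (moving_average, step1_differences, inaccuracy_per_revision) and data values are hypothetical.

```python
from statistics import mean


def moving_average(values, window=3):
    """Centered moving average, truncated at the ends of the series.

    A simple stand-in for the smoothing spline fitted to the most recent
    revision in method 1; it is not the smoother used in the paper.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


def step1_differences(revisions, method):
    """Return one row of differences per revision.

    `revisions` lists the revisions from oldest to most recent; each
    revision holds estimates for consecutive emission years starting from
    the same first year, so older revisions are shorter.
    """
    latest = revisions[-1]
    if method == 1:
        # Method 1: detrend every revision with a trend fitted to the latest one.
        reference = moving_average(latest)
        rows = revisions
    elif method == 2:
        # Method 2: subtract the latest revision from each earlier revision.
        reference = latest
        rows = revisions[:-1]
    else:
        raise ValueError("method must be 1 or 2")
    # zip() stops at the shorter sequence, so each row keeps its own length.
    return [[est - ref for est, ref in zip(row, reference)] for row in rows]


def inaccuracy_per_revision(differences):
    """Summarize each revision by its mean difference (an assumed estimator)."""
    return [mean(row) for row in differences]


if __name__ == "__main__":
    # Hypothetical emission estimates; the last list is the latest revision.
    revisions = [
        [100, 102],
        [101, 103, 104],
        [100, 102, 105, 107],
    ]
    for method in (1, 2):
        diffs = step1_differences(revisions, method)
        print(method, inaccuracy_per_revision(diffs))
```

In method 2 the reference series still contains year-to-year fluctuations of the latest revision, whereas in method 1 those fluctuations remain in the differences; this is one way to see why the two variants can leave different amounts of information in the transformed data.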
We should note that there is no central agency providing independent inventorying of GHG emissions for the whole EU15, and the NIRs for the EU15 are simply obtained as the aggregated NIRs of its member countries. Thus, any learning which we were able to detect in emission data for the EU15 is due to improvements in GHG inventorying at the national level. This aggregation, however, has a smoothing effect on the evolution of inaccuracy and imprecision for the EU15 (Figs. 6 and 7) compared with individual member countries (Figs. 8, 9, 10, 11, 12). The reduced variability helps with detection of learning and in drawing stronger conclusions about the satisfactory performance of the methods proposed.
The results presented in this paper have several practical consequences for policy. First, the analysis carried out both for the entire EU15 (Section 4.1), and for its individual member countries (Section 4.2) shows that there is still much room for further reductions in the uncertainty of emission inventories reported to the UNFCCC. Evidence of a slow increase in accuracy is feeble at best, while many countries also fail to improve the precision of their emission estimates in a noticeable way.
We were unable to detect learning in inaccuracy in the emission estimates of the EU15 as a whole, which is generally consistent with our findings for individual member countries. Only in several cases of relatively small emitters did method 2 capture weak learning (as presented in Table 5). This apparent general lack of improvement in accuracy of inventories (both for the entire EU15 and on the national level) is likely to be explained by the fact that all emission estimates (both new and revised) are based on the same accounting schemes suggested by the UNFCCC in IPCC (2000), and later in IPCC (2006). However, the result of introducing new accounting guidelines in IPCC (2006) is noticeable in the formation of peaks in the differences between the smoothing spline and the most recent revision (Fig. 5a), as well as in inaccuracy estimates for the EU15 (Figs. 6a and 7a) and for most countries analyzed (see Figs. 8, 9, 10, 11, 12, 13, 14). This observation suggests that subsequent updates of GHG emissions accounting guidelines have the potential to reduce the inaccuracy of emission estimates.
An improvement in the precision of the EU15 emission estimates was detected by both methods proposed. We ascribe this effect to learning in imprecision detected for individual countries, mainly by the big emitters: Germany and the UK (strong learning), and possibly Italy and Spain (weak learning). A possible explanation of this improved precision is the availability of better knowledge about emission processes and emission factors. Further efforts to improve this knowledge are recommended, as they have been proven to reduce the inaccuracy of GHG estimates in the past.
Methods 1 and 2 presented here offer alternatives to the ways of assessing uncertainty suggested in the reporting guidelines, namely, the tier 1 or tier 2 approach, and later also the tier 3 approach (IPCC 2000, 2006). These different approaches used in uncertainty assessments published in NIRs make it difficult not only to compare uncertainty for various countries (using different approaches), but often also to track changes in uncertainty over time for a given country (which used different approaches in consecutive years). Step 1 of the proposed methods (cf. Diagrams 1 and 2), together with the evaluation of inaccuracy changes over time, can be useful in such cases (a similar analysis was carried out in Jarnicka and Nahorski (2016), where a parametric model was considered, although the results were compared with official assessments only in the few cases available). Moreover, uncertainty estimates published in NIRs are not revised (except for emissions in the base year, usually 1990). This limits the insights into the evolution of uncertainty that could be collected from NIRs. The methods proposed here offer a way of building a more complete picture of the evolution of uncertainty.
We conclude with a recommendation for the continuation and expansion of the practice of annual revisions of GHG emission estimates published in consecutive NIRs. With the help of the methods proposed here, these revisions allow monitoring of improvements in the quality of national GHG inventories and can possibly identify countries (or sectors) for which the uncertainty of emission estimates is still not satisfactory. Reducing the uncertainty of national GHG inventories is of key importance for monitoring whether countries have achieved their emission reduction commitments and for setting future reduction targets that are likely to ensure the desired results.
Footnotes
 1.
We explain this notion in greater detail in Section 2.
 2.
Calculation of the emission estimates, based on the measurements collected, takes approximately 2 years; thus, the most recent data reported in 2017 originate from the year 2015.
 3.
To simplify the notation, we omit the delay in publishing the data and assume that the NIR containing the estimates of emissions for the year j and the revised estimates of all previous years were published in the year j.
 4.
The Bartels test is the non-parametric version of von Neumann’s ratio test for randomness. It ranks the observations from the smallest to the largest and tests the ratio of the sequential variance calculated from consecutive ranks to the variance based on deviations of ranks from the mean. For values far from the test statistic (two-sided test), there is evidence for non-randomness. In the left-sided test (used in our analysis), randomness is tested against trend, while in the right-sided test it is tested against regular oscillations.
 5.
The Cox-Stuart sign test is based on the binomial distribution. Its test statistic is the number of positive slopes between points that are separated by about half of the observations. The null hypothesis of randomness can be interpreted in terms of positive and negative slopes being equally likely. Both two-sided and one-sided alternative hypotheses can be considered. The left-sided alternative hypothesis (considered here for the analysis) indicates that negative slopes are more likely than positive ones, which corresponds to a downward trend.
 6.
 7.
EU reports are the aggregate of the GHG emission inventories of all member countries. Originally, these were the EU15 countries, but after the expansion of the European Union, these reports also contain the emissions of new member states. However, for comparison, the EU15 data are included in the reports of the expanded EU.
References
 Bartels R (1982) The rank version of von Neumann’s ratio test for randomness. J Am Stat Assoc 77(377):40–46
 Brandt S (2014) Data analysis: statistical and computational methods for scientists and engineers, 4th edn. Springer, New York
 Bun A, Hamal K, Jonas M, Lesiv M (2010) Verification of compliance with GHG emission targets: annex B countries. Clim Chang 103(1–2):215–225. https://doi.org/10.1007/s1058401099066
 Cowan G (1998) Statistical data analysis. Clarendon Press, Oxford
 Cox DR, Stuart A (1955) Some quick tests for trend in location and dispersion. Biometrika 42(1/2):80–95
 Ermolieva T, Ermoliev J, Jonas M, Obersteiner M, Wagner F, Winiwarter W (2014) Uncertainty, cost-effectiveness and environmental safety of robust carbon trading: integrated approach. Clim Chang 124(3):633–646. https://doi.org/10.1007/s1058401308242
 Hamal K (2010) Reporting GHG emissions: change in uncertainty and its relevance for detection of emission changes. Interim Report IR10003. IIASA, Laxenburg
 Hocking RR (2013) Methods and applications of linear models: regression and the analysis of variance. In: Wiley series in probability and statistics, 3rd edn. John Wiley & Sons, Inc., Hoboken
 IPCC (2000) Good practice guidance and uncertainty management in national greenhouse inventories, http://www.ipccnggip.iges.or.jp/public/gp/english/. Accessed 28 May 2019
 IPCC (2006) Guidelines for national greenhouse gas inventories, http://www.ipccnggip.iges.or.jp/public/2006gl/. Accessed 13 Nov 2018
 Jarnicka J, Nahorski Z (2015) A method for estimating time evolution of precision and accuracy of greenhouse gases inventories from revised reports. Proc. 4th Intl Workshop on Uncertainty in Atmospheric Emissions, Kraków, Poland, 2015, pp. 97–102, available at http://www.ibspan.waw.pl/unws2015/images/publications/4thWorkshopProceedings.pdf. Accessed 28 May 2019
 Jarnicka J, Nahorski Z (2016) Estimation of temporal uncertainty structure of GHG inventories for selected EU countries. In: Ganzha M, Maciaszek L, Paprzycki M (eds) Proceedings of the 2016 FedCSiS Conference ACSIS, vol 8. IEEE, pp 459–465. https://doi.org/10.15439/2016F318
 Jonas M, Gusti M, Jęda W, Nahorski Z, Nilsson S (2010) Comparison of preparatory signal analysis techniques for consideration in the (post)Kyoto policy process. Clim Chang 103(1–2):175–213. https://doi.org/10.1007/s1058401099146
 Marland G, Hamal K, Jonas M (2009) How uncertain are estimates of CO2 emissions? J Ind Ecol 13:4–7. https://doi.org/10.1111/j.15309290.2009.00108.x
 Myers RH (1990) Classical and modern regression with applications, 2nd edn. Duxbury Press, Belmont
 Nahorski Z, Jęda W (2007) Processing national CO_{2} inventory emission data and their total uncertainty estimates. Water Air Soil Pollut Focus 7:513–527. https://doi.org/10.1007/s1126700691146
 Ryan TP (2008) Modern regression methods, 2nd edn. Wiley Series in Probability and Statistics, John Wiley & Sons, New York
 Soong TT (2004) Fundamentals of probability and statistics for engineers. John Wiley & Sons, New York
 Żebrowski P, Jonas M, Rovenskaya E (2015) Assessing the improvement of greenhouse gases inventories: can we capture diagnostic learning? Proc. 4th Intl Workshop on Uncertainty in Atmospheric Emissions, Kraków, Poland, 2015, pp. 90–96, available at: http://www.ibspan.waw.pl/unws2015/images/publications/4thWorkshopProceedings.pdf. Accessed 28 May 2019
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.