1 Introduction

After the signing of the Kyoto Protocol in 1997, governments, industry stakeholders and academia began working on the development of effective and efficient environmentally driven policies and economic mechanisms. In this context, the European Union's (EU) current efforts in support of sustainable development and climate change mitigation comprise three main challenges, i.e. reducing greenhouse gas (GHG) emissions, consistently increasing energy production from renewables (Renewable Energy Directive—RED), and increasing energy efficiency (Energy Efficiency Directive—EED).

Within the EU environmental protection framework, the European Union Emissions Trading System (EU-ETS) was launched in 2005, and it currently covers approximately 45% of EU28 polluting emissions. The EU-ETS was implemented in a staggered, phased approach, and 2020 is the last year of the third phase, as presented below.

  • EU-ETS phase 1 (2005-2007) - the absence of reliable data on actual emissions, and the consequently inaccurate estimates, led to an allowance surplus (in million metric tonnes of CO2 equivalent - MtCO2-eq):

    • Allocated allowances: 6370 MtCO2-eq;

    • CO2 emissions: 6215 MtCO2.

  • EU-ETS phase 2 (2008-2012):

    • Allocated allowances: 11373 MtCO2-eq;

    • Carbon emissions: 9613 MtCO2.

  • EU-ETS phase 3 (2013-2020):

    • Allowances surplus in 2017: 1.6 billion.

Moreover, in 2019 the EU had to artificially remove 700 million allowances from circulation. Hence, the observed evolution of the EU trading system indicates a sustained and substantial structural supply–demand imbalance that kept distorting the market and compromising the scheme's effectiveness as an emissions reduction driver [1].

Furthermore, notwithstanding the systematic and continued EU efforts and policies addressing the reduction of carbon emissions, the EU Member States' projections converge to an EU-wide total GHG emissions reduction of at most 32%, which falls short of the 40% target for 2030. EU-ETS specific projections indicate that stationary installations could reach a 10% reduction in emissions between 2020 and 2030, which is insufficient to accomplish the 2030 reduction target of 43% compared to 2005 levels [2].

Thus, inaccurate carbon emissions predictions may be one of the root factors behind the overall ineffectiveness of the EU environmental regulatory framework. Therefore, the achievement of the European Union's ambitious targets will require additional policies resulting from new holistic and creative approaches to predicting carbon emissions trends.

In this scenario, and considering the findings related to the EU-ETS experience, our contribution explores the following research opportunities: (a) the efficiency of market-based climate change mitigation policies could be significantly improved by more accurate prediction of carbon emissions trends; (b) there is a crucial need for more accurate carbon emissions predictions that better support each climate change mitigation initiative (i.e. EU-ETS, EED, RED) by considering the particularities of the industry / economy sectors under their coverage; (c) machine learning (ML) methods and techniques have the potential to grasp such particularities from economic and energy consumption indicators, which might lead to more realistic carbon emissions forecasts.

Therefore, the present article introduces a computational learning framework for carbon emissions prediction incorporating a sophisticated and effective algorithm for the assessment and selection of the most relevant environmental impacting factors. This algorithm was innovatively engineered together with an artificial neural network (ANN), forming a forecast framework able to improve its architecture according to the selected impacting factors (predicting features). The framework's evaluation against current mainstream machine learning models, and its benchmarking against recently published research on carbon emissions prediction, indicate that our contribution is relevant and capable of supporting the improvement of environmental policy effectiveness.

2 Relevant related research

A myriad of studies have analyzed carbon emissions behavior and attempted to predict it by means of several different approaches, methods and techniques, considering different emissions impacting factors (predictors). Econometric approaches were used by Guan et al. [3], Anger [4], Li and Lu [5], Robalino-Lopez et al. [6], Scott et al. [7], and Mi et al. [8]. Game Theory emerged as one of the preferred approaches to address decision-making and supply chain challenges linked to carbon emissions and carbon policies, as in Chang and Chang [9], Yang et al. [10], Yang et al. [11], and Xu et al. [12].

Whereas General Equilibrium Theory (Wang and Wang [13], Gavard et al. [14], Zhang et al. [15]), Operational Research (Cui et al. [16], Hong et al. [17]), Index Number Theory (Wang et al. [18], Solaymani et al. [19]), Variational Inequality Theory (Allevi et al. [20]), and Grey Systems Theory (Jiang et al. [21]) also played an important role in support of such studies, a significant number of studies opted for the application of techniques within the Statistical and Computational Learning domain, as presented in Table 1.

Table 1 Methodologies, techniques and predicting variables for carbon emissions prediction

The literature review provided fundamental insights into the existing carbon emissions impacting factors and how to apply them in emissions prediction models. It was noted that the studies benefiting from Computational Learning and Statistical Learning theories were the ones providing more information on how the predictors (or the availability / choice of different predictors) impact prediction confidence level and accuracy.

The literature review also allowed us to identify some very important challenges related to carbon emissions prediction, considering the amount and diversity of potential predictors. Firstly, a particular predictor's correlation with a specific target varies depending on the scenario / region. Secondly, the systematic generation of trustworthy carbon emissions information only started in the 1990s, and its availability is restricted to some parts of the world.

Thirdly, carbon emissions can be characterized as a worldwide, multisectoral, interconnected phenomenon; e.g. the pollution outcomes of international flights may have contributing components spread across all continents if we consider the airline headquarters location, the flight route (origin–overflown area–destination), the aircraft manufacturer (engine manufacturer, fuselage manufacturer, tyre manufacturer, etc.), and the fuel producer (petroleum, biofuels).

As an additional example, consider that a huge transnational enterprise may move its heavily polluting production to regions where carbon policies are less strict or even nonexistent (carbon leakage). In such a scenario, carbon emissions prediction models should be scalable in order to progressively cover broader scopes and process more data.

However, such required scalability would lead to the use of an increasing number of predictors (the dimension of the predicting model's feature space), which would not be accompanied by additional instances, since the availability of trustworthy data is limited. Thus, any intended predicting model should be capable of addressing data-related problems such as endogeneity, heteroskedasticity, multicollinearity, and dimensionality.

The overall insights accrued from the literature review drove us to choose a computational learning / statistical learning approach for the design of our prediction framework. It was also noted that, considering the nature of carbon emissions related data, it would generally be impossible to work with any parametric learning method. Additionally, fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures [32]. Therefore, we opted for neural networks as the core learning model of our proposed prediction framework.

3 Methods and framework evaluation

Within our research we designed and implemented a computational learning framework for carbon emissions prediction incorporating a RReliefF-driven feature selection method and an iterative neural network architecture improvement.

The Relief family of algorithms [33, 34] qualifies the attributes in a dataset as a function of the Euclidean distance computed between neighbouring instances. These non-parametric and non-myopic algorithms are able to capture non-linear relationships and run in low-order polynomial time.

The algorithms' outcomes are attribute weights that have a probabilistic interpretation, i.e. the weights are proportional to the difference between two conditional probabilities: the probability of the attribute's value being different conditioned on the nearest hit, and conditioned on the nearest miss [34].

Figure 1 graphically presents the designed framework, composed of four modules: the Features Engineering Module (FEM), the Model Generation Module (MGM), the Model Evaluation Module (MEM), and the Prediction Explanation Module (PEM).

Fig. 1

Computational learning framework modules

3.1 Data analysis

Our research focused on the European Union—28 States (EU28), and our dataset comprises data obtained from the European Union (Eurostat), International Energy Agency (IEA), Organisation for Economic Co-operation and Development (OECD), and World Bank (WB) databases, as described below.

  • Sources: Eurostat, IEA, OECD, World Bank.

  • Scope:

    • EU28;

    • 1990 – 2017;

    • Total CO2 emissions / sectoral CO2 emissions;

    • 26 economic / energy indicators (candidate predictors).

  • Data Aggregation Levels:

    • Regional;

    • Annual;

    • Total Emissions / Energy Industries / Industries / Commerce – Public Services / Transport / Residential / Aviation.

Tables 2 and 3 introduce the prediction targets and potential predictors explored in our research.

Table 2 Prediction targets—MtCO2 / IEA
Table 3 Candidate predictors

The characteristics of our research data led us to apply neural networks (NN) as the core learning model in our proposed framework. In this section we provide a deeper analysis of our data, which corroborates this choice and provides additional insights on how to deal with the data issues mentioned in the previous section.

Pearson's coefficient measures the statistical association between two continuous variables as a function of the covariance observed between them. It provides information about the magnitude of the association, or correlation, as well as the direction of the relationship.

The results of the Pearson correlation test are bound by some important assumptions regarding the tested data, i.e. the variables should be normally distributed and exhibit linearity and homoskedasticity. The test outcomes are values ranging from −1 to +1, where +1 indicates a perfect positive relationship, −1 indicates a perfect negative relationship, and 0 indicates that no relationship exists; strong correlations are indicated by values between ±0.5 and ±1. The research data was submitted to the test, and Table 4 presents the potential predictors listed in order of correlation strength.

Table 4 Pearson correlation test—target T1 / Total CO2 (MtCO2)

The analysis of the Pearson test outcomes flags some important insights. The predictor A18 (Energy Use) shows the strongest correlation with total CO2 emissions, which is in accordance with the knowledge accrued from the literature review. However, regarding the predictor A3 (Population), the test result is counterintuitive, as it indicates a strong negative relationship between population and CO2 emissions; the literature review also contradicts such a negative correlation.

Figure 2 provides a visualization of the predictors A18 and A3, and corroborates the Pearson test outcome. A deeper analysis of the test outcome raises another important flag, i.e. the predictor A4 (Temperature HDD) shows an irrelevant correlation with total CO2 emissions; considering the research scope (EU28), such an outcome seems inconsistent with real-world energy consumption dynamics.

Fig. 2

Exploratory visualization for T1, A18, A3 and A4

The combined analysis of Table 4 and Fig. 2 indicates potential non-linearity between the variables, as well as possible additional violations of the Pearson test assumptions. Thus there is a need for a more sophisticated correlation analysis of the data, and as such, we submitted it to the Spearman correlation test and to the Hoeffding's D statistic test.

Spearman correlation uses ranks instead of assumptions about the distributions of the two variables and, as such, it analyzes the association between variables at the ordinal measurement level. Thus, the test does not assume that the variables are normally distributed, and it can be applied in cases where the Pearson assumptions (continuous-level variables, homoskedasticity, and normality) are not fulfilled.

Similarly to the Pearson test, Spearman analysis outcomes are values between −1 and +1, and the test results are representative provided the data can be ordinally arranged and the scores on one variable can be monotonically related to the other variable. The Hoeffding's D test [35] measures the independence of the data sets by computing the distance between the product of the marginal distributions under the null hypothesis and the empirical bivariate distribution. The test is able to identify linear / non-linear and monotonic / non-monotonic functional relationships, as well as non-functional relationships. The test outcome is a value between −0.5 and +1, where larger values indicate stronger relationships; it provides no information about the direction of the correlation.
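The practical difference between the two tests can be illustrated with a small numpy sketch (illustrative data and our own function names, not the research dataset): for a monotonic but non-linear relationship, Spearman detects the perfect monotonic association while Pearson understates it.

```python
import numpy as np

def pearson(x, y):
    # Pearson's r: covariance normalized by the standard deviations
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Spearman's rho: Pearson's r computed on the ranks of the data
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

x = np.linspace(0.0, 1.0, 100)
y = np.exp(5.0 * x)        # monotonic but strongly non-linear

# Spearman captures the perfect monotonic association (rho = 1.0),
# while Pearson understates it because the relationship is not linear
print(round(spearman(x, y), 3))   # 1.0
print(pearson(x, y) < 0.95)       # True
```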

We then continued the analysis of the potentially abnormal results related to the predictors A3 and A4. Table 5 presents the results of the aforementioned tests, and it is possible to observe that A3 increases its relative correlation level as the test sophistication increases. Moreover, the Spearman test result corroborates the abnormal negative relationship between population and total carbon emissions.

Table 5 Potential predictors ranked according to the specified tests' outcomes

Regarding A4, the joint results imply that none of the tests were able to properly analyze the correlation level between temperature and CO2 emissions. In this scenario, the next data analysis step comprised the use of a state-of-the-art, computationally efficient, non-myopic, and non-parametric algorithm able to indicate and weight complex patterns of association, i.e. the RReliefF algorithm. Once a target variable is defined, RReliefF scores the correlated variables with values ranging from −1 (worst) to +1 (best).

Table 6 presents the outcomes of the aforementioned analysis. The Hoeffding's D test and the RReliefF algorithm do not provide information about the direction of the correlation; thus the Pearson and Spearman outcomes are presented in terms of absolute value.

Table 6 Data tests results / comparative perspective for Target T1

Although the RReliefF score for feature A3 does not contradict the information obtained from the other tests, the score attributed to feature A4 indicates an important level of correlation between temperature (A4) and CO2 emissions, whereas such a condition was not apparent in the other tests' outcomes. Consequently, this discrepancy between the tests was further investigated.

The aggregation level of the data is a characteristic that may adversely limit the effectiveness of such tests. Therefore, the next research step comprised the analysis of CO2 emissions at a lower level of aggregation, i.e. testing the emissions data of specific industry / economy sectors. Figure 3 shows the comparison between the emissions from commercial and public services (T4) and temperature (A4).

Fig. 3

Exploratory visualization: T4 and A4

The similarities between the curves are relevant, and Table 7 confirms this observation by presenting the results of the Hoeffding's D correlation analysis, where feature A4 is listed as the most relevant one when T4 is the target variable. The application of the Hoeffding's D test to a carbon emissions dataset featuring a lower level of aggregation confirmed the outcome of the RReliefF analysis.

Table 7 Hoeffding's D statistic test—prediction target T4

Based on these conclusions, the next research step consisted of expanding the RReliefF analysis to our research dataset in its entirety, while taking the carbon emissions at a lower level of aggregation, i.e. total emissions split into sectoral emissions (Table 2). Table 8 presents the results of the RReliefF scoring for our research data.

Table 8 RReliefF scores for the research dataset

Still analyzing feature A4, it is possible to observe a strong correlation towards the target T6 (residential emissions), which is confirmed by the exploratory visualization in Fig. 4.

Fig. 4

Exploratory visualization: T6 and A4

These findings confirmed the applicability of the RReliefF algorithm to assess (score and rank) the correlation level of our research data, and qualified its use in our learning framework's features engineering process.

3.2 Features engineering and neural network architecture iterative design

In the proposed learning framework, the features engineering and the model generation (i.e. NN architecture design) are iteratively accomplished by two modules, i.e. the Features Engineering Module (FEM) and the Model Generation Module (MGM), as can be observed in Fig. 1.

The FEM accomplishes the data dimensionality and quality treatment. Such combined treatment is done by a RReliefF-driven Backward Feature Elimination (BFE) aiming at: (a) selecting relevant predictors, in order to reduce the dataset's feature space and avoid the curse of dimensionality; (b) reducing the computational complexity of the learning algorithm featured in the MGM; (c) improving the accuracy of predictions; (d) facilitating the interpretation of results; and (e) reducing the data storage space. The feature selection process follows the notation presented in Table 9.

Table 9 RReliefF algorithm notation

The RReliefF algorithm, as presented in Fig. 5, takes as input a vector of attribute values \(\left[ {\varvec{A}} \right]\) and the predicted value \(\tau\) for each training instance \(I\), and provides as outcome a vector \(W\) containing the scores of the attributes.

Fig. 5

RReliefF algorithm pseudocode (Robnik-Šikonja and Kononenko, 2003)

As observed in algorithm steps 8 and 9, RReliefF uses Eq. 16 for the iterative update of feature weights according to their probabilistic relevance for the predictions. The intuition behind this weight computation as an expression of probabilistic relevance is conveyed by Eq. 17.

$$W\left[ A \right] = \frac{{P_{diffC|diffA} \cdot P_{diffA} }}{{P_{diffC} }} - \frac{{\left( {1 - P_{diffC|diffA} } \right) \cdot P_{diffA} }}{{1 - P_{diffC} }}$$
(17)

In the formulation of Eq. 17, \(P_{diffA}\) represents the probability of having different values of \(A\) within the nearest instances, \(P_{diffC}\) represents the probability of having a different prediction within the nearest instances, and \(P_{diffC | diffA}\) represents the probability of having a different prediction given a different value of \(A\) within the nearest instances.
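The weight update can be sketched in a compact numpy implementation. This is our own simplification of the Fig. 5 pseudocode: the instance-influence weights d(i, j) are taken as uniform over the k neighbours, and all function and variable names are ours.

```python
import numpy as np

def rrelieff(X, y, m=100, k=10, seed=0):
    """Simplified RReliefF (regression Relief) weight estimation.

    X: (n, a) feature matrix; y: (n,) continuous target.
    Returns W, a vector of a scores, roughly in [-1, +1]."""
    rng = np.random.default_rng(seed)
    n, a = X.shape
    span_x = X.max(axis=0) - X.min(axis=0)
    span_x[span_x == 0] = 1.0                 # guard constant features
    span_y = y.max() - y.min()
    if span_y == 0:
        span_y = 1.0

    n_dc = 0.0                                # N_dC accumulator
    n_da = np.zeros(a)                        # N_dA[A] accumulators
    n_dcda = np.zeros(a)                      # N_dC&dA[A] accumulators

    for _ in range(m):
        i = rng.integers(n)                   # random instance R_i
        dist = np.linalg.norm((X - X[i]) / span_x, axis=1)
        neigh = np.argsort(dist)[1:k + 1]     # k nearest neighbours (self excluded)
        d = np.full(k, 1.0 / k)               # uniform influence d(i, j)
        diff_y = np.abs(y[neigh] - y[i]) / span_y    # diff(tau, R_i, I_j)
        diff_x = np.abs(X[neigh] - X[i]) / span_x    # diff(A, R_i, I_j)
        n_dc += np.sum(diff_y * d)
        n_da += diff_x.T @ d
        n_dcda += diff_x.T @ (diff_y * d)

    # frequency form of Eq. 17: W[A] = N_dC&dA/N_dC - (N_dA - N_dC&dA)/(m - N_dC)
    return n_dcda / n_dc - (n_da - n_dcda) / (m - n_dc)

# illustrative check: feature 0 drives the target, feature 1 is pure noise
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = X[:, 0].copy()
w = rrelieff(X, y, m=400)
```

In this illustrative run the informative feature receives a visibly higher weight than the noise feature, which is the behaviour the FEM relies on.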

Thus, within the FEM, the initial features \(A_{1} \ldots A_{26}\) (potential predictors) are scored by the RReliefF algorithm; the features are then indexed and ranked based on the attributed scores, which leads to a rearranged feature set. In the next step, the interaction with the MGM starts, i.e. the rearranged feature set is fed to the Backpropagation Neural Network (NN/BP), the network is trained, and the vector containing the current learning framework status (features subset, NN/BP architecture, NN/BP prediction accuracy) is stored. Subsequently, new feature subsets are created by backward feature elimination, the NN/BP is trained with each new subset, and the learning framework status vector is updated.
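The FEM/MGM interaction can be sketched as follows. This is a simplified illustration, not the paper's implementation: an ordinary least-squares fit stands in for the NN/BP, the hidden-layer resizing is omitted, and all names are ours.

```python
import numpy as np

def backward_elimination(X, y, scores, train_frac=0.8):
    """RReliefF-score-driven BFE sketch: rank features by their scores,
    then drop the weakest-scored feature one at a time, refitting the
    stand-in learner and tracking the best status (error, subset)."""
    n = len(y)
    cut = int(train_frac * n)                 # Pareto-style 80/20 split
    order = list(np.argsort(scores)[::-1])    # best-scored feature first
    subset = order[:]
    best_rmse, best_subset = None, None
    while subset:
        Xs = X[:, subset]
        # stand-in learner: least squares with intercept (paper: NN/BP)
        A = np.c_[Xs[:cut], np.ones(cut)]
        coef, *_ = np.linalg.lstsq(A, y[:cut], rcond=None)
        pred = np.c_[Xs[cut:], np.ones(n - cut)] @ coef
        rmse = float(np.sqrt(np.mean((y[cut:] - pred) ** 2)))
        if best_rmse is None or rmse < best_rmse:
            best_rmse, best_subset = rmse, subset[:]
        subset.pop()                          # eliminate current weakest feature
    return best_rmse, best_subset

# illustrative run: features 0 and 1 are informative, feature 2 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)
best_rmse, best_subset = backward_elimination(X, y, np.array([0.9, 0.5, 0.01]))
```

The returned status pair (best error, best subset) corresponds to the framework status vector described above, minus the NN/BP architecture component.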

The NN/BP featured in the MGM has the following characteristics:

Feed-forward network \(\nu \left( x \right)\) defined as follows:

$$\nu \left( x \right):\, = f^{L} \left( {W^{L} f^{{L - 1}} \left( {W^{{L - 1}} ...f^{1} \left( {W^{1} x} \right)...} \right)} \right)$$
(18)

Number of layers: 3; hidden layer activation function (transfer function): logistic (sigmoid), defined as follows:

$$f^{L - 1} \left( {x_{i} } \right) = \frac{1}{{1 + e^{{ - x_{i} }} }}$$
(19)

Training method: backpropagation;

Normalization method: unit interval;

Training cost (loss) function: residual sum of squares, defined as follows:

$$RSS = \mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \nu \left( {x_{i} } \right)} \right)^{2}$$
(20)

where \(y_{i}\) is the \(i\)th value of the target variable, \(x_{i}\) is the \(i\)th value of the predicting variables, and \(\nu (x_{i})\) is the predicted value of \(y_{i}\).
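A minimal numpy sketch of the network defined by Eqs. 18–20: one logistic hidden layer, full-batch backpropagation on the RSS loss. A linear output layer is assumed for simplicity, and the hyperparameters and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    # logistic transfer function, Eq. 19
    return 1.0 / (1.0 + np.exp(-z))

def train_nn_bp(X, y, hidden=7, lr=0.1, epochs=100, seed=0):
    """3-layer feed-forward net nu(x) = W2 . sigmoid(W1 . x) (cf. Eq. 18),
    trained by full-batch backpropagation on the RSS loss (Eq. 20)."""
    rng = np.random.default_rng(seed)
    n, a = X.shape
    W1 = rng.normal(scale=0.5, size=(a, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    Y = y.reshape(-1, 1)
    for _ in range(epochs):
        H = sigmoid(X @ W1)                              # hidden activations
        E = H @ W2 - Y                                   # residuals nu(x_i) - y_i
        gW2 = 2.0 * H.T @ E                              # dRSS/dW2
        gW1 = 2.0 * X.T @ ((E @ W2.T) * H * (1.0 - H))   # dRSS/dW1 (chain rule)
        W2 -= lr * gW2 / n
        W1 -= lr * gW1 / n
    return W1, W2

def rss(X, y, W1, W2):
    # residual sum of squares, Eq. 20
    pred = sigmoid(X @ W1) @ W2
    return float(np.sum((y.reshape(-1, 1) - pred) ** 2))

# illustrative run on a simple synthetic target
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1]
W1_0, W2_0 = train_nn_bp(X, y, epochs=0)     # untrained baseline
W1, W2 = train_nn_bp(X, y, epochs=200)       # trained weights
```

Training steadily reduces the RSS relative to the untrained baseline, which is all this sketch is meant to demonstrate.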

As previously mentioned, the FEM and the MGM interact iteratively, and this interaction allows for an innovative BFE/RReliefF-driven improvement of the NN/BP, i.e. the number of neurons in the hidden layer is changed along with the feature subsets, and the learning framework status vector is updated accordingly. Once the framework's stop conditions are reached, the status vector encloses the best feature subset and the best NN/BP architecture in terms of prediction accuracy.

3.3 Learning framework evaluation

As observed in Fig. 1, the Model Evaluation Module (MEM) built into the implemented learning framework features the best NN/BP architecture (the MGM outcome) and feeds it with the evaluation dataset. The module also features three additional ML models (Support Vector Machine—SVM, Gradient Boosting Machine—GBM, and Random Forest—RF), which are used to benchmark and thereby complement the learning framework's performance evaluation. The results of the benchmarking are presented in the next subsection, along with the overall assessment of the proposed framework's performance.

3.3.1 Original contribution

To the best of our knowledge, our proposed learning framework is the first to implement an iterative neural network architecture improvement supported by a Backward Feature Elimination search method driven by the RReliefF algorithm.

The framework iteratively learns (NN/BP architecture, feature subset) on an ad hoc basis, i.e. specifically for each economy / industry sector. The implemented framework's evaluation / validation processes benefited from real-world data accrued from the EU, OECD, IEA, and the World Bank. The whole dataset covers the period 1990–2017, and the training and test datasets were determined in accordance with the Pareto principle (80/20) for data sampling.

Table 10 presents the accuracy (Root Mean Square Error—RMSE for the test dataset) of the proposed learning framework for the totality of the EU28 CO2 emissions as well as for sectoral emissions. The table also presents the accuracy for the experiment control NN/BP, the framework featuring NN/BP supported by plain BFE (NN/BP-BFE), and the framework featuring NN/BP supported by RReliefF driven BFE (NN/BP-RReliefF/BFE).

Table 10 Computation learning framework evaluation figures

The accuracy figures in Table 10 demonstrate the improved performance of the proposed learning framework when compared to the control NN/BP, as well as to other possible framework designs. The table also presents the number of predictors in the learned feature subset, and the number of neurons in the hidden layer.

As observed in Table 10, the proposed approach combining backward feature elimination, RReliefF feature qualification, and iterative improvement of the NN/BP architecture effectively boosted the carbon emissions prediction accuracy for the EU28 scope dataset.

The computational complexity of the neural network is \(O\left( {h^{a} } \right)\), where \(h\) is the number of hidden layers and \(a\) is the number of features (predictors), and the training process converged in less than 100 epochs with a learning rate of 0.1. The RReliefF algorithm's computational complexity is \(O\left( {n \cdot m \cdot a} \right)\), where \(n\) is the number of training instances and \(m\) is the number of training instances used by the algorithm to update the weights. The computational environment featured an Intel i9-9900K CPU supported by an NVIDIA RTX 2080 GPU (8 GB) and 32 GB of DDR4 RAM, and the whole prediction process took approximately 1 h on average per prediction target.

3.3.2 Learning framework validation

The validation of the original contribution provided by the proposed learning framework consisted of two different analyses. Firstly, its outcomes were compared to three current mainstream ML models (i.e. SVM, GBM, RF) using our research dataset. Secondly, our accuracy figures were benchmarked against the results of recently published research targeting carbon emissions prediction.

In the validation process we worked with the mean absolute percentage error (MAPE) as the accuracy metric, and focused on the prediction target T1 (total CO2 emissions). The best-performing model designed by our proposed learning framework features 7 neurons in the hidden layer and 16 predictors out of the candidates presented in Table 3, i.e.: A18, A4, A23, A16, A8, A17, A10, A6, A22, A12, A11, A25, A9, A7, A24, and A26.
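For reference, the MAPE metric used throughout this subsection can be computed as follows (the values below are illustrative only, not our research data):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * float(np.mean(np.abs((y_true - y_pred) / y_true)))

# illustrative values only: both predictions are off by 2% of the
# true value, so MAPE = 2.0
print(mape([4000.0, 3500.0], [4080.0, 3430.0]))   # 2.0
```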

The trained framework achieved an accuracy of 2.28% (MAPE) on the test dataset, and Table 11 allows us to benchmark the result of the proposed learning framework against the results of other NN/BP implementations.

Table 11 Computational learning framework benchmarking figures (MAPE) for prediction target T1—Total EU28 Carbon Emissions (NN/BP specific)

As observed in Table 11, the predictions of the proposed learning framework are significantly better than those provided by similar studies.

Table 12 compares the result of our proposed framework with the results of three ML models (GBM, RF, SVM) supported by plain BFE and using our research dataset, as well as with the results of recently published research using models other than NN/BP.

Table 12 Computational learning framework benchmarking figures (MAPE) for prediction target T1—Total EU28 Carbon Emissions (mainstream ML models)

The figures in Tables 11 and 12 reinforce the relevance of the results accrued by our research, and confirm ANNs as a powerful algorithm capable of processing a large amount of non-linear and non-parametric data. The RReliefF algorithm, in turn, efficiently assesses and ranks the predicting variables (features) by effectively addressing non-linear relationships, time series, noisy and correlated features, as well as high-order feature interactions (complex association patterns).

The iterative combination of these two algorithms produced a powerful and scalable prediction tool able to process huge datasets featuring complex and incomplete data. The improved ad hoc learning capability of our framework makes it potentially applicable to any region in the world and to any level of data aggregation.

Whereas our original contribution represents an important step towards the better design and implementation of environment protection initiatives and policies, its effectiveness is greatly enhanced when combined with an explainable Artificial Intelligence (XAI) technique, given the black-box nature of ANN algorithms.

3.4 Predictions explanation

Regarding ML outcomes, it is critically important to ensure that prediction accuracy relies on valid feature (predictor) computations, i.e. that the ML model is providing the right answer for the right reasons. Specifically addressing our research: although ANNs are top-performing, non-parametric and scalable algorithms, they lack the algorithmic transparency required to adequately support policy-making and decision-making processes targeting complex environmental challenges. Therefore, we improved our learning framework with an XAI module (i.e. the PEM—Prediction Explanation Module) featuring the Local Interpretable Model-agnostic Explanations—LIME technique [36].

LIME is a scalable method that creates local interpretable surrogate models (explanations) around a given instance in order to estimate how data points influence the global model's predictions. LIME translates the explanation problem into an optimization problem. The search space comprises explanations generated by the local interpretable surrogate models \(g \in G\), where \(G\) is a class of interpretable models. Locality is defined by a proximity measure \(\pi \left( {x,z} \right)\) expressing the distance between an instance \(z\) and \(x\). The interpretability degree of the surrogate model is assessed by means of a complexity measure \({\Omega }\left( g \right)\). Thus, considering \(f\left( x \right)\) the model to be explained, the local fidelity measure \({\mathcal{L} }\left( {f,g,\pi } \right)\) expresses how unfaithful \(g\) is in approximating \(f\) in the locality \(\pi\). Finally, the LIME outcome ensuring both interpretability and local fidelity is defined by:

$$\varepsilon \left( x \right) = \mathop {\arg \min }\limits_{g \in G} {\mathcal{L}}\left( {f,g,\pi } \right) + {\Omega }\left( g \right)$$
(21)

Within the learning framework's XAI module, the model to be explained is defined as \(f\left( x \right) = \nu \left( x \right)\) (i.e. the ANN). \(G\) comprises ridge regression models over the perturbed samples \(z^{\prime} \in Z\) (the perturbed samples dataset), such that \(g\left( {z^{\prime}} \right) = \beta_{g} \cdot z^{\prime}\). The complexity measure \({\Omega }\left( g \right)\) is expressed in terms of the number of non-zero coefficients in the linear model, \(\pi\) is defined by the Euclidean distance, and the local fidelity is computed as a square loss. We thus define:

$${\mathcal{L}}\left( {f,g,\pi } \right) = \mathop \sum \limits_{z,z^{\prime} \in Z} \pi \left( {x,z} \right)\left( {\nu \left( z \right) - g\left( {z^{\prime}} \right)} \right)^{2}$$
(22)

From a general perspective, for each prediction to be explained, the LIME algorithm perturbs the observation n times; statistics for each variable are extracted and perturbations are then sampled from the variable distributions. The model to be explained then predicts the outcome of all perturbed observations, and the algorithm calculates the Euclidean distance from all perturbations to the original observation and selects the m features with the highest absolute weight in a ridge regression fit of the complex model outcome.

Afterwards, a simple model is fitted to the perturbed data, explaining the complex model outcome with the m features from the perturbed data, weighted by their distance to the original observation. Finally, the algorithm extracts the feature weights from the simple model and uses them to explain the local behavior of the complex model.
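The procedure above can be sketched as a minimal, self-contained local surrogate. This is our own simplified implementation of the idea in Eqs. 21–22, not the lime library: Gaussian perturbations, an exponential kernel on the Euclidean distance, and a closed-form weighted ridge fit; the stand-in black-box f and all names are illustrative.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, kernel_width=0.75, alpha=1.0, seed=0):
    """Minimal LIME-style surrogate (cf. Eqs. 21-22): perturb x, query the
    black-box f, weight samples by an exponential kernel on the Euclidean
    distance pi(x, z), and fit a weighted ridge regression g whose
    coefficients explain f in the locality of x."""
    rng = np.random.default_rng(seed)
    a = len(x)
    Z = x + rng.normal(scale=0.5, size=(n_samples, a))   # perturbations z
    fz = np.array([f(z) for z in Z])                     # black-box outputs nu(z)
    dist = np.linalg.norm(Z - x, axis=1)                 # pi(x, z)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)         # locality weights
    A = np.c_[Z, np.ones(n_samples)]                     # add intercept column
    G = A.T * w                                          # A^T W
    beta = np.linalg.solve(G @ A + alpha * np.eye(a + 1), G @ fz)
    return beta[:a]                                      # local feature weights

# illustrative stand-in black box: locally, feature 0 dominates feature 1
f = lambda z: 3.0 * z[0] + 0.1 * z[1] ** 2
weights = lime_explain(f, np.array([1.0, 1.0]))
```

The returned weights play the role of the per-feature explanations reported in Table 13: near x, the surrogate recovers the dominant local influence of feature 0.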

In order to demonstrate the whole outcome of our implemented learning framework, Table 13 presents predictions and explanations (feature weights) for two cases, i.e. predictions for the years 2013 (the first year of EU-ETS phase 3) and 2016 (the most accurate learning framework prediction). For the sake of simplicity, among the 16 selected features, Table 13 presents the explanations for the first 5 as ranked by the RReliefF algorithm.

Table 13 Predictions explanation module outcomes

As observed in Table 13, the XAI module featuring LIME provides insights on how a specific feature contributes to a specific prediction (case). We observe, for instance, that oil-based energy supply (predictor A23) consistently and positively contributes to, and induces, total CO2 emissions. In this context it is extremely important to differentiate two distinct impacts, i.e. contribution and induction. Although it is clear that energy production from oil combustion definitely contributes to CO2 emissions, its use may induce a reduction of total CO2 emissions due to the replacement phenomenon, by which an increment in oil use may imply a reduction in the use of a more polluting source of energy, such as coal. Hence, our learning framework's outcomes express inductive relationships rather than contributory ones.

This improvement makes our research contribution much more valuable for the design of better environmental initiatives and policies dependent on CO2 emissions forecasts. Additionally, the outcomes of our proposed learning framework may provide some background for future work addressing carbon emissions causality analysis, as well as potential improvements on both ANN and XAI techniques.

4 Conclusions

The accurate forecasting of CO2 emissions is one of the most important inputs for any decision-making process targeting climate change / global warming mitigation. Therefore, in our attempt to contribute to this global environmental challenge, we implemented a computational learning framework for carbon emissions prediction. Our framework features the capacity to iteratively improve the prediction feature set and the backpropagation neural network (NN/BP) architecture according to the statistical assessment of the data computed by the RReliefF algorithm. Our research relied on real-world data obtained from the European Union, the OECD, the International Energy Agency and the World Bank, for the period 1990–2017.

The outcomes of the designed prediction framework were successfully evaluated against different NN/BP-based solutions (NN/BP, NN/BP-BFE, NN/BP-RReliefF/BFE, NN/BP-CT, NN/BP-IPSO, NN/BP-PCA, NN/BP-RF), as well as against different mainstream machine learning models (GBM-BFE, RF-BFE, SVM-BFE, SVM-RF). Additionally, the featured XAI module provided insights on how different predictors impacted a specific prediction case. Therefore, our results demonstrate the effectiveness of our approach in terms of increased and explained prediction accuracy, which may adequately support the design and improvement of environmental initiatives and policies.

Finally, the outcomes of our implemented learning framework may provide some background for future work addressing carbon emissions causality analysis, as well as potential improvements on both ANN and XAI techniques.