1 Introduction

Energy is central to modern societies. Within the energy system, a multitude of causal relationships with varying characteristics coexist, mutually influencing one another. The analysis and modeling of various energy-related phenomena are consequently also central problems within the scientific community. At the same time, the process of reliably modeling and predicting serial and time-dependent energy-related phenomena is both an important and difficult endeavor. The level of difficulty varies, among other things, according to the level of abstraction, that is, the time frame over which one is called upon to construct robust and efficient computational tools.

Since the variance of the observation values of energy-related phenomena, and hence the picture of their respective time series, depends on the time frame of analysis, different strategies and methods are used. This work concerns medium-term forecasting. Specifically, it is an investigation of ensemble methods for the problem of medium-term forecasting of the Greek energy load system, using additional information derived from the Greek injected energy balance. Ensembles are generally preferred for their robustness and increased accuracy. Here, to formulate an ensemble means to combine the outputs of various individual algorithms in order to generate better predictions. Given that, in machine learning, a time series forecasting problem can be formalized as a regression problem, this paper presents results regarding the investigation of more than 400 selected ensemble methods based on machine learning regression techniques, leading to a recommendation of a specific layout. Moreover, it shows that the incorporation of specific multivariate input settings containing additional energy-related features generally improves the predictions.

Specifically, a total of 435 methods were tested. The experiments covered scenarios predicting five different time shifts: one-day, seven-day, fourteen-day, twenty-one-day, and thirty-day forecast time frames. Results were evaluated against six different metrics. Considering the extracted evaluations and their similarities, the results of three of the six metrics were selected for presentation through various visualizations, for optimal readability; the complete results are in the appendix. Regarding input setups, in addition to the univariate version, each layout included multivariate feature settings related to both the system load and four additional variables that captured energy production values from different sources of the Greek injected energy balance: lignite, natural gas, hydroelectric production, and energy imports. In total, 435 methods \(\times\) 64 feature setups \(\times\) 5 time shifts = 139200 experiment loops were executed. An efficient and low-cost ensemble scheme that generates the final predictions by averaging the outputs of two base learners, namely the orthogonal matching pursuit and the Huber regressor, is proposed. The proposed method, being both computationally inexpensive and highly interpretable, elements that were among the initial goals of the present investigation, displays increased accuracy, robustness, and flexibility in all forecast scenarios tested.

Below, a brief overview of related works is given first. The presentation of the proposed method follows. Then, the experimental procedure and the results, illustrated through a variety of visualizations, are presented. Lastly, the extracted conclusions are given, along with some future extensions.

2 Related work

Energy is fundamental within the structure of modern societies. So much so that, until recently, an instability that would jeopardize the smooth continuity of energy supply seemed unthinkable for the populations of developed or even developing countries. Nowadays, that risk is unfolding. Even beyond that, the case for being able to model time-dependent energy data is largely self-evident. In terms of the various scientific endeavors, the motivations can be summed up as follows: there are many substantial reasons to investigate reliable and efficient energy time series forecasting tools. And, indeed, the community did and continues to do so. Thus, there is a multitude of related works, of which only a small part is mentioned here.

In [1], support vector regression along with a strategy for automatic lag selection in time series analysis is used in electricity load forecasting scenarios. Experiments involved four electricity demand forecasting datasets and showed that the combination of support vector machines (SVMs) with the proposed automated lag selection strategy outperformed state-of-the-art lag selection strategies, achieving good overall forecasting performance. Again, for short-term forecasts of energy loads, a two-stage adaptive hybrid method incorporating a self-organizing map (SOM) to cluster the input data and SVMs to fit each cluster is proposed in [2], while in [3], a hybrid scheme incorporating random forest and multilayer perceptron (MLP) is proposed. Also, in [4], an energy load forecasting scheme based on the k-nearest neighbors algorithm is proposed; the results indicate that the proposed method can analyze and forecast short-term power loads with high accuracy. A hybrid model based on clustering and the auto-regressive integrated moving average (ARIMA) is used in [5] to forecast the electricity load of university buildings. Moreover, in [6], XGBoost was used for feature extraction and forecasting the weekly energy load in the Australian energy market, demonstrating high prediction accuracy. In [7], two strategies for embedding the discrete wavelet transform into neural network-based short-term load forecasting schemes are presented. For this purpose, four MLP architectures were tested on hourly load and temperature data for North American and Slovakian electric utilities. In [8], a machine learning process based on artificial neural networks (ANNs) is used to forecast the very short-term load of two machine tools. In terms of robustness, the work points toward the use of a combination of time series features and ANNs.

Regarding deep learning schemes, in [9], a methodology for forecasting building energy load based on two long short-term memory (LSTM) schemes, that is, vanilla LSTM and an LSTM-based sequence-to-sequence architecture, is provided. In [10], an efficient LSTM neural network-based framework trained and tested on publicly available data is proposed for short-term residential load forecasting. In [11], using electricity consumption data from France, a model incorporating optimally selected time-lag features and LSTMs stands out as the best among various configurations for modeling and forecasting medium-term aggregate load data. A hybrid method in a sequential learning layout for short-term residential load forecasting is presented in [12]; the model utilizes a convolutional neural network (CNN) to extract features from the input dataset, which are then fed into gated recurrent units (GRUs). A CNN-based deep neural network algorithm for short-term load forecasting is introduced in [13], where three convolution and three pooling layers extract features that are eventually flattened and sent to the output layer under a fully connected layout. Moreover, a deep learning-based individual building load forecasting methodology using convolutional neural networks on historical loads is presented in [14]. In [15], an architecture based primarily on convolutional and recurrent neural networks is proposed to extract appropriate features while also modeling the implicit dynamics of historical data in the context of short-term load forecasting. Regarding multi-short-term load forecasting, a recurrent inception convolution neural network (RICNN) that combines recurrent neural networks with one-dimensional convolutional neural networks is presented in [16]. In [17], a fuzzy-based ensemble using hybrid deep neural networks is proposed for forecasting seven-day hourly loads: using a novel fuzzy clustering technique that incorporates a newly proposed neural network architecture, the input data are partitioned into overlapping clusters and subsequently exploited to form an ensemble of regressors consisting of a radial basis function neural network (RBFNN), a CNN, a pooling layer, and two fully connected layers. After experimentally evaluating various attention mechanisms with different RNNs and time horizons, a sequence-to-sequence recurrent neural network with an attention mechanism is proposed in [18]. Moreover, a novel ensemble scheme exploiting the stacking technique to combine various deep neural network models using sliding window-based principal component regression for short-term load forecasting is proposed in [19]. Also, for short-term multi-energy load forecasting, a method that exploits a combination of a convolutional neural network and a bidirectional gated recurrent unit, that is, CNN-BiGRU, optimized using an attention mechanism, is presented in [20].

There is also a multitude of works regarding reviews and comparisons; several indicative papers follow. In [21], a tutorial review of techniques, methodologies, and evaluation methods regarding probabilistic electric load forecasting is provided. Machine learning methods as applied to load forecasting problems are discussed in [22], divided into two categories: the first consists of individual techniques, and the other concerns hybrid methodologies. Various research papers concerning forecast methodologies for electricity load are reviewed comparatively in [23] based on multiple criteria, such as time frames, inputs, and outputs. A survey of standard deep learning techniques deployed within the energy load forecasting context is presented in [24], with an accuracy comparison favoring the incorporation of convolutional neural networks along with k-means clustering in terms of results based on the root-mean-square error (RMSE). Lastly, a review of academic works regarding electricity demand forecasting published over the period from 2010 to 2020 is presented in [25], investigating the algorithms used, electricity consumption theories and impact factors, metrics, and forecast horizons. Since the above report is merely indicative, the reader is urged to further explore the relevant literature. In view of the above, and beyond the upcoming proposal of a specific ensemble method, this paper also aspires to be a comprehensive applied comparison of methods and input strategies.

3 Proposed method

The proposed method is an ensemble layout that consists of two individual base learners that are blended using a voting technique in which each of the base algorithms is trained on the whole training dataset, and the individual outputs are then combined through averaging to produce the final prediction.

Fig. 1 Proposed method flowchart

Note that in regression, voting ensembles involve combining the predictions of multiple regression methods and forming a final prediction based on the average of the individual predicted values. Thus, using a voting technique in regression scenarios refers to averaging the results of several regression schemes used as base learners. The two base learners incorporated are the orthogonal matching pursuit and the Huber regressor. Regarding the combinatory scheme, Fig. 1 visually depicts, in a relative abstraction, the proposed method.

3.1 Huber regressor

The Huber regressor used is a linear model that utilizes the so-called Huber loss function, proposed by P. Huber in 1964 [26]. The function is defined piecewise by the formula given below:

$$\begin{aligned} L\left( y,\hat{y} \right) =\left\{ \begin{array}{l}\frac{1}{2}\left( y-\hat{y} \right) ^{2} \; \; \text{if} \, \left| y-\hat{y} \right| \le \alpha \\ \alpha \cdot \left| y-\hat{y} \right| -\frac{\alpha ^{2}}{2} \; \; \text{otherwise} \end{array}\right. \end{aligned}$$
(1)

where y is the target variable, \(\hat{y}\) is the corresponding prediction, and \(\alpha \in \mathbb {R}^{+}\) is a hyperparameter. Given the above formula, one can observe that the Huber loss function has the advantage of exploiting both the squared loss \(L(d)=d^{2}\) and the absolute loss \(L(d)= \left| d \right|\). The former results in an arithmetic mean-unbiased estimator, while the latter results in a median-unbiased estimator. While the squared loss function is heavily influenced by outliers, the Huber loss function switches to the absolute loss at higher values of the residual \(|y-\hat{y}|\). This selective nature of the Huber loss function allows the method to exploit the benefits of a median-unbiased estimator while maintaining the advantages of a mean-unbiased, minimum-variance estimator of the mean. This scheme produces a robust method that is resistant to outliers while not completely ignoring their existence and impact on the generated predictions. It should be noted that determining the value of the hyperparameter \(\alpha\) is more of an investigation than a simple decision. There are various studies incorporating different ways to select a fixed value of \(\alpha\), since choosing an inappropriate value can affect the prediction efficiency of the model. Among them, the use of the \(5\%\) asymptotic efficiency rule and selections based on the dimension of the input data stand out [27]. Nevertheless, in this work, \(\alpha\) was set to 0.0001, a choice that has also been widely adopted in the literature.
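To make the piecewise definition concrete, a minimal NumPy sketch of Eq. (1) follows; the default threshold shown is purely illustrative and is not the tuned setting of this work.

```python
import numpy as np

def huber_loss(y, y_hat, alpha=1.35):
    """Huber loss of Eq. (1): quadratic for small residuals,
    linear beyond the threshold alpha (illustrative default)."""
    d = np.abs(y - y_hat)
    return np.where(d <= alpha, 0.5 * d ** 2, alpha * d - 0.5 * alpha ** 2)
```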

3.2 Orthogonal matching pursuit

Orthogonal matching pursuit [28] is an orthogonal variation of the matching pursuit (MP) algorithm. Like its predecessor, it is a greedy sparse approximation method. Both MP and OMP were designed and widely used as compressed sensing methods, compressed sensing being a signal processing technique for efficiently acquiring and reconstructing a signal. MP reconstructs an approximation of the original “signal” by iteratively improving the solution so as to minimize the residual norm. It can also be thought of as a forward greedy algorithm for feature selection, in which the correlation between the residual and the candidate features is used to decide which feature to add next at each iteration; the length of the orthogonal projection serves as this correlation measure. The correlated part is then subtracted from the residual, and the procedure is repeated on the updated residual. When the residual falls below a user-defined threshold, the algorithm terminates. The correlation values produced at each iteration step are used as weights for the corresponding selected features, forming a combination that constitutes the output prediction. The main difference between classic MP and OMP is that at each step of the latter, all of the extracted coefficients are updated by computing the orthogonal projection of the data onto the set of features selected thus far [29].

Considering regression tasks, OMP aims to approximate the fit of a linear model with constraints imposed on the number of nonzero coefficients. Regarding the final prediction \(\hat{y}\), at the end of the task, the obtained solution would be

$$\begin{aligned} \hat{y}(w,x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+...+w_{p}x_{p} \end{aligned}$$
(2)

where \(w=(w_{1},w_{2},...,w_{p})\) denotes the vector of the coefficients of the linear model. The method aims to solve the problem presented in the following formula:

$$\begin{aligned} \underset{w}{\text{argmin}}\ \Vert y-Xw \Vert _{2}^{2}\quad \text{subject to} \quad \Vert w\Vert _{0} \le n_{\text{nonzero coefs}} \end{aligned}$$
(3)

OMP shares many common traits with classical MP, but it has been shown experimentally that it usually demonstrates better performance at the price of increased computational cost [30]. Moreover, when certain restricted isometry conditions are satisfied, the stability and performance of the method are guaranteed [31].
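The greedy loop described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under simplifying assumptions (a fixed sparsity budget, least-squares refitting at each step), not the library routine used in the experiments.

```python
import numpy as np

def omp(X, y, n_nonzero_coefs):
    """Sketch of OMP for Eq. (3): greedily add the column most
    correlated with the residual, then refit all selected
    coefficients jointly (the orthogonal projection step)."""
    residual, support = y.astype(float).copy(), []
    for _ in range(n_nonzero_coefs):
        j = int(np.argmax(np.abs(X.T @ residual)))  # best candidate feature
        support.append(j)
        # jointly re-estimate all selected coefficients (orthogonalization)
        w, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ w
    coef = np.zeros(X.shape[1])
    coef[support] = w
    return coef
```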

3.3 Blended ensemble

As already mentioned, Fig. 1 presents the process of forming the final forecast according to the proposed method. Toward interpreting the efficiency, to be presented below, that the proposed ensemble exhibits within the given energy framework, one can consider the general properties of the two methods selected as constituent algorithms. In doing so, one notices that the two base learners complement each other: they exhibit different behaviors in the presence of outliers, so the proposed method incorporates base algorithms that are both accurate and diverse. Specifically, while the Huber regressor, as already mentioned, balances the influence of extreme values, the orthogonal matching pursuit, which incorporates the L2 norm, is particularly affected by outliers and cannot respond well to them [32]. Thus, their combination tackles the issue.

Moreover, the incorporated blending scheme that combines the base learners by averaging [33] their fitted outputs is simple and thus interpretable, while the whole process remains efficient both computationally and in terms of runtime. That is straightforward, given that after the individual models are fit, there is just one computation left to form the final result, i.e., computing the mean of the corresponding individual predictions. Moreover, given that the final output is formed after the individual algorithms are run, parallelization is also quite plausible as a strategy for reducing the overall execution time.
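In scikit-learn terms, the whole scheme reduces to the following sketch; the synthetic data and default hyperparameters are assumptions for illustration, since the actual experiments were run through PyCaret.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import HuberRegressor, OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 4)), rng.normal(size=500)  # placeholder data

# Each base learner is fit on the full training set; the final
# forecast is the plain average of the two individual predictions.
blend = VotingRegressor([("omp", OrthogonalMatchingPursuit()),
                         ("huber", HuberRegressor())])
blend.fit(X[:400], y[:400])
y_pred = blend.predict(X[400:])  # mean of the OMP and Huber outputs
```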

4 Experimental procedure and results

The experimental procedure followed by the results will now be presented. The whole process is a broad continuation of the experiments presented in [34].

4.1 Experimental procedure

In what follows, the individual elements of the experimental process are presented. Specifically, the initial datasets are listed first, along with the applied pre-processing toward the construction of the final sets used in training. The base learners tested follow along with the ensemble strategy implemented. Finally, the metrics used to evaluate the results are given.

4.1.1 Data

Two initial datasets were used, covering a sampling period starting on November 30, 2020, and ending on October 30, 2022. The first dataset contained the hourly load records of the Greek energy system [35]. The second contained measurements from the Greek energy balance regarding four different injected energy production sources, i.e., natural gas, hydroelectric, lignite, and imports [36].

Table 1 Data statistics

The time series of the energy load was used as the prediction target, while those of the energy balance were utilized as features of possible covariates within the investigation context. Given that a claim regarding existing correlations between system load and injected energy production seems at least intuitively plausible, an experimental investigation was deemed necessary. As already mentioned, the framework of the investigation concerned medium-term forecasts of the energy load of the system, and as such, the hourly records were converted into daily data. Hence, a 24-hour rolling average was applied to the hourly system load data, which were then resampled to a 24-hour frequency, generating a time series consisting of a single daily record. Similarly, the average hourly energy production per source was extracted from the energy balance data. One should note that, according to the above, the number of initial features was five. Thus, five features consisting of daily records containing averages for a period spanning November 30, 2020, to October 30, 2022, were produced. Various aspects of the aforementioned set of features can be seen in Table 1.
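One plausible pandas rendering of this downsampling step, assuming an hourly DataFrame with a load column (hypothetical names and synthetic values), is the following.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the hourly system load records
idx = pd.date_range("2020-11-30", "2022-10-30", freq="H")
hourly = pd.DataFrame(
    {"load": np.random.default_rng(0).normal(5000, 500, len(idx))}, index=idx)

daily_load = (
    hourly["load"]
    .rolling(window=24).mean()   # 24-hour rolling average
    .resample("24H").last()      # keep a single smoothed record per day
    .dropna()
)
```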

Figure 2 depicts the correlation heatmap of the initial features used. From the outset, there is an intuitive tendency to associate the energy balance with the energy load; hence, the heatmap offers an indicative overview of the initial feature space, given that this work explores possible improvements in the predictive schemes concerning the energy load after the incorporation of features describing the energy balance. However, the correlation heatmap is included solely as supplementary information so that a better understanding of the data can be realized; it was not used in the selection or elimination of any features.

Fig. 2 Feature correlation heatmap

The set containing the five initial features was supplemented with three additional features, consisting of time series derived by applying 7-day, 14-day, and 30-day rolling averages to the system load data. From these 8 final features, 63 multivariate combinations of input setups consisting of up to 4 input variables were created; these, together with the univariate version, constituted the 64 individual investigation cases for each method per time frame (a sketch of the combination step follows below). Fig. 3 illustrates the process of creating the feature setups. The experiments were performed over 5 time frames: single-day, 7-day, 14-day, 21-day, and 30-day forecast shifts. The final datasets were divided into training and testing subsets according to an 80/20 percent ratio. A 10-fold cross-validation adjusted to time series scenarios was incorporated: in time series cross-validation, given that the order of the data points matters and there are dependencies between consecutive data points, the data are split into multiple folds, here 10, each consisting of a contiguous block of time. The model is trained on the first fold and evaluated on the second, then trained on the first two folds and evaluated on the third, and so on, until all folds have been used for evaluation.
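For concreteness, the 63 multivariate setups can be reproduced with a few lines of Python; the feature names below are shorthand for the series described above, and the assumption, following the text, is that the load series is always present while one to three of the seven remaining features are added to it.

```python
from itertools import combinations

extras = ["Gas", "Hydro", "Lignite", "Imports",      # energy balance features
          "Load_RM_7", "Load_RM_14", "Load_RM_30"]   # rolling mean load features

# load plus 1 to 3 extra features: C(7,1) + C(7,2) + C(7,3) = 7 + 21 + 35 = 63
setups = [("Load",) + combo
          for r in (1, 2, 3)
          for combo in combinations(extras, r)]
assert len(setups) == 63  # adding the univariate case yields the 64 setups
```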

Fig. 3 Feature setup creation

Lastly, it would be an oversight not to mention that, to identify possible outliers, an anomaly detection procedure was carried out using PyCaret’s Anomaly Detection Module, an unsupervised machine learning module used to identify rare items, events, or observations. Here, the isolation forest technique was used. This technique isolates outlier data points by building an ensemble of randomly constructed isolation trees. An isolation tree is a binary tree that partitions the data space recursively by randomly selecting a feature and a split value within the range of the feature’s values. The process is repeated until each data point is isolated in its own leaf node. The anomaly score for each data point is calculated from the path length of the tree that isolates it. The intuition behind this approach is that normal data points require more splits to be isolated than anomalous data points, which are typically fewer in number and farther from the majority of the data. Figure 4 depicts the results of this procedure, with the outliers shown as red dots.
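As a rough sketch of this screening step, assuming PyCaret’s anomaly module and a hypothetical DataFrame daily_features holding the eight daily series, the procedure could look as follows; the produced Anomaly and Anomaly_Score columns follow PyCaret’s conventions.

```python
from pycaret.anomaly import setup, create_model, assign_model

# daily_features: hypothetical DataFrame containing the eight daily series
s = setup(data=daily_features, session_id=1)
iforest = create_model("iforest")   # isolation forest detector
labeled = assign_model(iforest)     # appends Anomaly and Anomaly_Score columns
outliers = labeled[labeled["Anomaly"] == 1]
```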

Fig. 4 Anomaly detection: possible outliers per feature

4.1.2 Methods

As for the methods, a total of 25 regression algorithms were used as base learners. These methods constituted the components of the selected ensemble layouts that were created. After being tested individually as candidates in the ranking, their exported results provided the necessary insight for exploring candidate groupings, and they were finally re-evaluated against each ensemble thus produced. Table 2 lists the 25 methods used along with their familiar abbreviations and reference papers. For additional information regarding the algorithms used, the reader is urged to follow the referenced literature.

Table 2 Base algorithms

The Python library PyCaret [59] was used to implement the machine learning methods examined. Cross-validation on the training set was conducted according to the time series fold strategy of PyCaret. PyCaret automates hyperparameter tuning using a grid search with k-fold cross-validation, where k is a user-specified parameter; the default value of 10 was used in our experimental procedure. During cross-validation, the dataset is divided into k folds, and the model is trained and evaluated on each fold. The best set of hyperparameters is selected based on the average performance across all folds. In this work, a procedure suitable for time series data, namely “TimeSeriesSplit,” was used as the cross-validation technique. This technique splits the data into consecutive time periods, ensuring that the model is evaluated on future periods to avoid data leakage: at each iteration, the training set includes all the data up to a certain point in time, and the validation set includes the data from the next period.

The base algorithms were grouped into ensemble regressors with a blending technique. Blending is an ensemble technique that exploits the output of multiple models using consensus among estimators to produce a prediction. Specifically, given the individual models, a voting regressor is trained; within the forecasting framework, voting refers to averaging the individual outputs. Thus, each of the base learners is first trained on the whole dataset, and an average of the individual outputs is then generated.

The following grouping and control procedure was followed. Initially, in addition to the twenty-five individual methods, all possible blended combinations of two of the twenty-five learners were formed. This resulted in an initial ranking table containing a total of 325 methods: 25 base learners and 300 blended regressors consisting of all two-way combinations of the base 25. Given these results, each layout was treated as a candidate for further grouping: from the 325 methods, and according to their corresponding performances on the metrics, the most promising layouts were sought and extracted, whether individual algorithms or any of the 300 two-learner ensemble schemes. Given the insight extracted from the initial results, 110 additional ensemble layouts consisting of up to five base regressors were generated. After all the potential methods were identified, training was repeated, producing a total of 435 method results that populated the final ranking tables. A condensed sketch of the pairwise grouping stage follows.
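The sketch below illustrates this stage under stated assumptions: PyCaret’s regression module with its time series fold strategy, placeholder names such as train_df, and only four of the twenty-five PyCaret model IDs shown for brevity.

```python
from itertools import combinations
from pycaret.regression import setup, create_model, blend_models

# train_df: hypothetical DataFrame of lagged features plus the "Load" target
s = setup(data=train_df, target="Load", fold=10,
          fold_strategy="timeseries", data_split_shuffle=False)

# four of the twenty-five base learners, referenced by PyCaret model IDs
base = {mid: create_model(mid) for mid in ["huber", "omp", "par", "tr"]}

# all two-way blends of the created learners (300 pairs in the full run)
pairs = {(a, b): blend_models([base[a], base[b]])
         for a, b in combinations(base, 2)}
```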

4.1.3 Metrics

Results were formed according to six commonly used metrics, each of which exposes a different view of the performance of the methods used. Specifically, the following six metrics were used to derive the overall results: MAE, MAPE, MSE, RMSE, RMSLE, and R\(^2\). In what follows, however, regarding the presentation of the extracted results, three of the six metrics are presented. This decision was made taking into account, on the one hand, the optimal control of the results and, on the other hand, the similarity of the results of specific metrics.
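For reference, all six metrics can be computed with standard scikit-learn helpers; the sketch below is illustrative rather than the exact evaluation code, since the experiments relied on PyCaret’s built-in scoring.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, mean_squared_log_error, r2_score)

def evaluate(y_true, y_pred):
    """Compute the six evaluation metrics used in the rankings."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "RMSLE": np.sqrt(mean_squared_log_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred),
    }
```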

4.2 Results

The presentation of the results involves the following steps: the evaluation data are presented in a variety of visualizations, organized first by feature setups and then by methods, followed by a report on the proposed method and the justification of this choice. To construct the final ranking arrangements, Friedman and Nemenyi post hoc tests [60, 61] were carried out, implemented using the Python library from STAC (Statistical Tests for Algorithm Comparison) [62]. Note that the data were not normally distributed, the performance of the methods examined varied significantly, and the data variances were non-homogeneous; additionally, the comparison included more than four methods, each with a sufficient number of samples. Hence, the Friedman test was an appropriate choice, according to the documentation.
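As an illustration of the omnibus step, the Friedman test can be run with SciPy as below; this is only an equivalent sketch with placeholder data, since the actual rankings and the Nemenyi post hoc procedure were produced with STAC.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# errors: hypothetical (n_experiments x n_methods) matrix of metric values
errors = np.random.default_rng(0).random((30, 5))  # placeholder data

stat, p = friedmanchisquare(*(errors[:, j] for j in range(errors.shape[1])))
if p < 0.05:
    print("At least one method differs; proceed to the Nemenyi post hoc test.")
```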

Given the large number of results, aggregation strategies were adopted. Regarding input features, all 64 setups were joined per metric, regardless of prediction time frame. Statistical tests were then applied to these groupings, the output of which is summarized in various rankings. As already mentioned, for better readability and given the similarity between results, elements from three of the six metrics are presented; results regarding all six metrics can be found in the appendix. Note that all results support the argument in favor of the method proposed here. Since these aggregations do not represent errors but error rankings, aggregate error box plots are also presented. Again, these plots concern joined results across all time shifts. Given these, one can observe that two particular feature settings always occupy the top two positions, alternating between them. As far as the algorithms are concerned, a similar presentation strategy was followed. The multitude of methods imposed a necessary reduction in their number for presentation purposes. Thus, initially, a Friedman test was performed on each time shift per metric. Given the six metrics and five time frames, a set of thirty rankings emerged. From these, the 10 best methods in each case were selected. An aggregate Friedman test was then performed on all 64 best-performing methods, combining all shifts and metrics. In addition, again for the 64 best algorithms, a Friedman ranking was performed per metric by uniting the time shifts. Lastly, an additional ranking was made by keeping those methods that have as a constituent one of the base learners of the proposed ensemble; thus, out of 435 schemes, 128 remained. From all of the above, as well as from a thorough investigation of the individual results, the proposed ensemble emerged. The overall results are contained in the appendix, and information related to the error values for each method is presented in corresponding violin plots. In all cases, Friedman rankings are also depicted in CD-diagrams.

4.2.1 Feature setups

Regarding the comparison of feature setups, the rankings were extracted with respect to metrics, taking into account all time shifts. All available feature setups are included in the comparisons presented by the Friedman rankings, as their number was small enough that no discarding had to be applied for readable presentation of the results.

Fig. 5 Aggregate feature ranking box plots: all shifts

Looking at the results depicted in Fig. 5, one can observe that in MAE the \(\texttt{Load\_RM\_30\_Gas}\) setup ranks first, while in the remaining two presented metrics, that is, R\(^2\) and RMSE, \(\texttt{Load\_RM\_30\_Gas\_Hydro}\) outperforms the rest. As already mentioned, the layout names refer to the features used in addition to the target. Thus, the \(\texttt{Load\_RM\_30\_Gas}\) setup consists of the following features: the time series of past system loads that is used in every input setup; a time series of system loads to which a 30-day rolling average has been applied; and the time series of energy production from natural gas. By the same logic, the additional feature that the \(\texttt{Load\_RM\_30\_Gas\_Hydro}\) setup incorporates, relative to the above, is hydroelectric energy production.

Furthermore, by observing Fig. 6, it can be verified that the two aforementioned feature setups can be considered close in terms of their contribution to the prediction; the statistical similarities are depicted by the horizontal lines joining them. Also, the univariate version always ranks below the multivariate setups. Furthermore, a cluster of feature setups that are always in the top ten can be identified, such as \(\texttt{Load\_RM\_14\_Gas}\), \(\texttt{Gas}\), \(\texttt{Load\_RM\_30}\), \(\texttt{Load\_RM\_30\_Hydro}\), and \(\texttt{Load\_RM\_14\_Gas\_Hydro}\). It is noticeable that, although there are some rearrangements, the composition of the top ten remains almost the same.

Fig. 6 Aggregate feature ranking CD-diagrams: all shifts

In addition to the above, an overall Friedman test was performed, taking into account all time shifts and metrics. The results are presented in Fig. 7. In first place is \(\texttt{Load\_RM\_30\_Gas\_Hydro}\), with \(\texttt{Load\_RM\_30\_Gas}\) following; however, the behavior exhibited by each of them is judged to be similar. Next are \(\texttt{Load\_RM\_14\_Gas}\), \(\texttt{Load\_RM\_14\_Gas\_Hydro}\), the univariate version, and then Hydro and Gas. It is generally observed that the contribution of Gas, Hydro, and their rolling mean versions is greater than that of Lignite and Imports. Although the univariate version appears in the top ten, there are feature setups identified as contributing more to the predictions and not exhibiting similar behavior. The claim that specific multivariate feature setups improve the predictions is thus strengthened. In particular, the aforementioned layouts that contain rolling means show greater predictive power.

Fig. 7 Aggregate feature ranking CD-diagrams: all shifts & metrics

4.2.2 Methods

As for the case study regarding methods, a Friedman test was first performed to extract the results both by metric and by time shift. Here, however, an additional monitoring strategy was followed for the presentation, to tackle the large number of results. Initially, the results from 6 metrics \(\times\) 5 shifts = 30 statistical tests were extracted. Then, a set was created containing the methods that appeared in the first ten positions of any ranking. As a result, the number of methods was reduced from 435 to 64. The Friedman test was then applied to these 64 methods, taking into account all five time shifts per metric.

Fig. 8 Aggregate method ranking violin plots: all shifts

From the tables, the violin plots, and the CD-diagrams of the results, various observations can be extracted. First of all, from Fig. 8, one can see that in MAE the ensemble method consisting of the Huber regressor, the Theil–Sen regressor, and orthogonal matching pursuit ranks first. However, in both of the other presented metrics, namely R\(^2\) and RMSE, the proposed layout, that is, the ensemble consisting of orthogonal matching pursuit and the Huber regressor, outperforms them all; it is also the method that ranks second in terms of MAE. The ensemble consisting of the Huber regressor, the Theil–Sen regressor, and orthogonal matching pursuit that ranked first in MAE ranks third in every other metric. The blended method consisting of orthogonal matching pursuit and the Theil–Sen regressor is always in second place in all metrics except MAE, where it is third; as such, it exhibits very good behavior. In general, methods that use orthogonal matching pursuit as a base learner dominate the top of the ranking. Furthermore, up to position 25, the vast majority of methods contain pairs or triplets drawn from the aforementioned learners. Other methods with particularly good performance include orthogonal matching pursuit combined with the Theil–Sen regressor and the passive aggressive regressor, as well as the combination of orthogonal matching pursuit, the Huber regressor, and the passive aggressive regressor, which are always placed between third and sixth place. The method consisting of orthogonal matching pursuit and passive aggressive regression exhibits similar behavior.

Fig. 9 Aggregate method ranking CD-diagrams: all shifts

Accordingly, although in lower positions than those mentioned above, one finds, always in the top ten in terms of the majority of the metrics, the following: orthogonal matching pursuit combined with the Theil–Sen regressor and random sample consensus, as well as orthogonal matching pursuit blended with the Huber regressor and least-angle regression. Moreover, it is rare to find an individual method in the top ten; for example, orthogonal matching pursuit occupies seventh place in the ranking in terms of RMSE. The results strongly suggest that the use of ensembles improves the forecast. Figure 9 depicts the aforementioned rankings using critical difference diagrams that capture the various statistical relationships between the methods appearing in the top ten positions, in terms of their aggregate Friedman scores with respect to metrics and taking into account all time shifts. In general, a numerical correlation is observed among methods that appear in close positions.

Given the aforementioned and after an extended check of the individual results, the method consisting of orthogonal matching pursuit (OMP) along with the Huber regressor (HR) was chosen as having the best behavior. To further confirm whether this ensemble can indeed be safely recommended, and to further investigate the performance of ensembles containing HR and OMP, this particular blended layout was compared with all schemes containing at least one of its base learners, and another Friedman test was conducted. This time, all metrics were used to form the final decision. Indeed, the proposed method takes first place in this Friedman score ranking as well; the ensemble consisting of orthogonal matching pursuit and the Theil–Sen regressor ranks second, while the one consisting of the Huber regressor, the Theil–Sen regressor, and orthogonal matching pursuit ranks third. The results are presented in the corresponding critical difference (CD) diagram in Fig. 10, where the HR+OMP ensemble ranks first, followed by OMP+TR and HR+TR+OMP. Notably, these three methods are not connected by horizontal lines in the CD-diagram, indicating that their performances differ significantly. The proposal of the above method is therefore reinforced.

Thus, both from the indicative results presented and from the overall performances in all metrics, the proposed method exhibited the best behavior. A more detailed overview of the results can be realized by referring to the appendix, where various summary tables of results for all metrics are listed.

Fig. 10 Aggregate winner versus containing-base ranking CD-diagrams: all shifts

5 Conclusion

This work investigated ensemble methods for medium-term multivariate energy load forecasting. Its goal was twofold: on the one hand, to provide an extensive framework of comparisons between a large number of single and ensemble regression models in the task of medium-term energy load forecasting; on the other hand, given that energy consumption and energy efficiency are both interrelated and crucial within the contemporary context, to propose a method as simple as possible yet robust that incorporates energy balance data in a multivariate layout. An extensive experimental process comparing methods and feature setups was conducted, and an ensemble method was proposed. Thus, the present work is also a comparative study that benchmarks a multitude of individual methods and ensembles in the context of energy load forecasting. In addition, the incorporation of time series containing Greek energy balance data, together with averaged versions of the system load observations in various multivariate layouts, was tested for possible forecast improvements.

The results point in two directions with relative clarity. On the one hand, the use of multivariate setups containing various input features generally and significantly improves the generated forecasts. Specifically, the additional use of energy production features and, above all, the integration of system energy load features to which a rolling average over a one-month horizon has been applied are overall placed in the first positions of the Friedman rankings. On the other hand, a multitude of selected blended method layouts was tested, with ensembles consisting of up to five primary learners. A specific ensemble method exhibited the greatest efficiency and is proposed in relation to the context and framework of this work. The method in question includes two base learners: the orthogonal matching pursuit and the Huber regressor. The two constituent methods display such individual characteristics that, under the prism of the blended layout, they complement each other, forming a robust scheme. From both the individual and aggregated results, the general superiority of the proposed method can be seen: the method generally ranks first in the various Friedman rankings while being both effective and low-cost. Moreover, the averaging combinatory scheme is both interpretable and easily parallelizable.

Regarding future work, the incorporation of sentiment analysis and emotion classification techniques within the system load forecasting task framework is already underway. As a result, a larger feature space is being researched. Moreover, the use of deep learning and other state-of-the-art techniques is under investigation, both in terms of possible ensemble proposals as well as in contrast with the learners used in this work in terms of both efficiency and cost in relation to the specific time-frame scenarios examined.