Motivation

The increasing penetration of renewable energy resources (RES) in today’s power system has made energy forecasting a popular theme. It is very important for grid operators and decision makers to know how much power RES will produce over the next hours and days (Dobschinski et al. 2017). Along with this, predicting load demand and consumption plays a vital role in the operation and planning of the modern power system. Storage of electrical energy is necessary when there is excess power production from RES and low load demand. However, electricity cannot be stored on a massive scale, as energy storage is expensive, requires high maintenance and has a limited lifespan. Because of this, utilities have to balance supply and demand at every moment. These limitations lead to several interesting characteristics of energy forecasting, which include data collection and the need for high accuracy. Forecasting errors lead to an imbalance between supply and demand, which adversely affects operational cost, reliability and efficiency. Forecasts of energy production and consumption are usually based on meteorological data such as solar irradiation and temperature, and on the number of occupants and appliances, respectively. However, there are scenarios where on-site measurements of solar irradiation and other meteorological variables like temperature and humidity are unavailable and only past power measurements are available. In such cases, data-driven models utilizing the available past power production data can be used. In this paper we aim to exploit the available past power data and to assess the accuracy of a data-driven forecasting model when data pre-processing techniques are applied. The idea is to choose an appropriate anomaly detection technique and data-driven methodology for energy production forecasting, and to develop a unified model for long-term forecasting at an hourly (short-term) time step.

Background

In the past decades, different approaches for forecasting energy production, distribution and consumption have been implemented. In the domain of energy consumption forecasting, researchers use several techniques, which include traditional methods such as regression, time series and statistical methods, along with soft computing techniques such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), fuzzy logic and Grey prediction. A good overview of these techniques can be found in (Suganthi and Samuel 2012).

Predictions based on larger datasets in connection with deep learning are becoming common. In (Marino et al. 2016), the authors implemented an energy load forecasting technique with long short-term memory (LSTM). Two variants of LSTM are presented: the standard LSTM and the LSTM-based Sequence-to-Sequence (S2S) architecture. Both variants are tested on data with one-hour and one-minute time step resolution; the results indicate that S2S worked well on both datasets.

To integrate RES into the power grid, forecasting photovoltaic (PV) yield is very important, as the output of PV systems is sensitive to weather conditions and to the varying strength of solar irradiance striking the PV surface throughout the day. The input variables and prediction horizon affect the accuracy of the prediction model. In general, the relevant variables available as inputs to a solar power prediction model include historical measurements of PV generation and historical measurements of explanatory variables such as temperature, global irradiance, wind speed or cloud coverage (Wan et al. 2015). In the domain of energy production forecasting, several studies reveal the potential of Artificial Intelligence (AI). The authors of (Saberian et al. 2014) implemented a solar power modelling method using artificial neural networks (ANNs) with two neural network structures, namely a general regression neural network (GRNN) and a feedforward back propagation (FFBP) network, to model the output power of a PV panel. They used meteorological data and estimated generated power to train the GRNN and FFBP. The results indicated higher accuracy when using FFBP. In (Khatib and Elmenreich 2015), the authors proposed a generalized regression artificial neural network for predicting hourly solar radiation. In (Alanazi et al. 2017), the authors implemented a nonlinear autoregressive neural network for the prediction of irradiance. The authors of (Gandelli et al. 2014) implemented a new hybrid method, PHANN (physical hybrid artificial neural network), which combines a physical model with a statistical model (neural network), and concluded that the PHANN method is more accurate than an ANN. The authors of (Dolara et al. 2018) employed a similar method based on a theoretical clear sky solar radiation model and reported improved accuracy for the hybrid method. It can be inferred that combining several techniques, such as statistical and physical ones, could improve performance.
Based on the reviewed papers, we assume that it is possible to improve the accuracy by applying advanced approaches such as soft computing techniques, which outperform naive methods coming from statistical theory. However, complex models such as deep learning models have a limitation in terms of interpretability.

Nevertheless, it is possible to improve the accuracy by applying data pre-processing techniques (anomaly detection), i.e. feature selection and outlier rejection. Before applying any forecasting model, these two important issues should be considered (Saleh et al. 2016), as both have a direct impact on the forecasting model performance. Large datasets often contain many ineffective features, and a feature selection process can reduce the considered features to the effective ones. This process can improve the model performance and provide faster decisions. The authors of (Saleh et al. 2016) implemented a data mining-based load forecasting strategy and divided the whole process into two parts: data pre-processing and load estimation. The data pre-processing step performed outlier rejection to eliminate bad data using distance-based outlier rejection, and feature selection using a genetic algorithm. However, the authors clearly mention that outliers are rejected based on a global view, where extreme values are considered outliers. It is noteworthy that these values do represent real measured values, and in some circumstances extreme values may indicate sudden events. Secondly, they used a historical electricity load dataset to construct the case study and did not explore the validity of their model on a real-time dataset, which may pose additional challenges.

In energy and power applications, anomaly detection is an important aspect of fields such as electric load forecasting (Chen et al. 2014; Chakhchoukh et al. 2011) and energy production forecasting. In (Luo et al. 2018), the authors implemented a model-based anomaly detection method for very short-term load forecasting. The method includes two components: an underlying model, i.e. a dynamic regression model (DRM), and an adaptive anomaly threshold. Some of the recent work on anomaly detection is presented in Table 1.

Table 1 State-of-the-art: Anomaly detection or outlier rejection

Data-driven modelling (DDM) is emerging as another important aspect of the energy production forecasting problem. The output power produced by PV is highly correlated with the weather conditions. Hence, weather variables are usually considered important parameters in training the prediction algorithm. However, when weather data is unavailable, it is interesting to use data-driven models that rely only on past PV output production data. DDM is based on analysing the data about a system, in particular finding connections between the system state variables (input, internal and output variables) without explicit knowledge of the physical behaviour of the system. The authors of (Ordiano et al. 2017) implemented simple weather-free data-driven models by considering only the past generated power and the time of day as inputs. Table 2 presents a short review of work done in the domain of energy production forecasting.
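To make the weather-free setting concrete, the following minimal sketch shows one way to turn an hourly PV power series into input and target pairs that use only past power values and the time of day. The function name `make_supervised`, the parameters `n_lags` and `start_hour`, and the sample readings are illustrative assumptions, not the feature set of any cited work.

```python
from typing import List, Tuple


def make_supervised(power: List[float], n_lags: int, start_hour: int = 0
                    ) -> Tuple[List[List[float]], List[float]]:
    """Turn an hourly power series into (features, target) pairs.

    Each feature row holds the previous `n_lags` power values followed by
    the hour of day (0-23) of the value being predicted.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    features, targets = [], []
    for t in range(n_lags, len(power)):
        hour = (start_hour + t) % 24  # hour of day of the target sample
        features.append(power[t - n_lags:t] + [float(hour)])
        targets.append(power[t])
    return features, targets


# Hypothetical hourly PV output in kW, with the first reading taken at 06:00.
pv_kw = [0.0, 0.4, 1.1, 1.9, 2.4, 2.6, 2.5]
X, y = make_supervised(pv_kw, n_lags=3, start_hour=6)
```

Any regression model can then be trained on `X` and `y`; the number of lags is a tuning choice that would have to be validated on the data at hand.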

Table 2 State-of-the-art: Data-driven modelling

(Filik et al. 2011) proposed a novel unified model for short-, medium- and long-term hourly electric energy demand forecasting. The authors compared the accuracy of the analytically developed model with three different ANN architectures and achieved the highest accuracy with the time delay back propagation ANN architecture.

Methods

Material

Energy forecasting algorithms are trained and tested on energy consumption and production datasets. These datasets contain energy readings from smart meters and the power output produced by PV. The forecasting approaches in the literature usually rely on proprietary data. Instead, we will use freely available benchmark data for testing future energy forecast models, which makes the comparison between approaches easier to understand. In this study we intend to use the Open Power System Data (OPSD) (open-power-system-data.org) and the Australian Solar home electricity dataset provided by Ausgrid (aus-grid.com.au).

Model evaluation

We first split the data into training and testing datasets and then run the machine learning algorithm on the training dataset to generate the prediction model. Then we use the test dataset to evaluate the model. To avoid underfitting and overfitting, cross-validation will be performed. Various performance metrics are available in the literature for evaluating the performance of a forecasting algorithm. These standardized performance measures or metrics help in providing forecast evaluations and benchmarking (Pelland et al. 2013). They include Pearson correlation (ρ), mean bias error (MBE, or bias), mean square error (MSE) and root mean square error (RMSE), mean absolute error (MAE) and standard deviation (SDE).
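The evaluation procedure above can be sketched as follows. The text does not fix a particular cross-validation scheme; the sketch assumes a chronological hold-out split followed by rolling-origin (expanding window) folds, a common choice for time series because the validation samples always lie after the training samples. The function names, the 20% test fraction and the fold sizes are illustrative assumptions.

```python
from typing import Iterator, List, Tuple


def chronological_split(series: List[float], test_fraction: float = 0.2
                        ) -> Tuple[List[float], List[float]]:
    """Split a series so every test sample is later than every training sample."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    cut = int(round(len(series) * (1.0 - test_fraction)))
    return series[:cut], series[cut:]


def rolling_origin_folds(n_samples: int, n_folds: int, horizon: int
                         ) -> Iterator[Tuple[range, range]]:
    """Yield (train, validation) index ranges with an expanding training window."""
    first_train_end = n_samples - n_folds * horizon
    if first_train_end < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = first_train_end + k * horizon
        # Validation block of length `horizon` directly follows the training block.
        yield range(0, train_end), range(train_end, train_end + horizon)


# Hypothetical series of ten hourly readings.
series = [float(v) for v in range(10)]
train, test = chronological_split(series, test_fraction=0.2)
folds = list(rolling_origin_folds(len(train), n_folds=3, horizon=2))
```

The held-out `test` part is used only once for the final evaluation, while the folds are used for model selection on the training part.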

1. Pearson correlation is the coefficient that measures the correlation between actual and forecasted values, defined in (1)

$$ \rho =\frac{\operatorname{cov}\left(p_{meas},p_{pred}\right)}{\sigma_{p_{meas}}\,\sigma_{p_{pred}}} $$

2. The RMSE metric described in (Zhang et al. 2015) provides a global error measure over the entire forecasting period, given by (2)

$$ \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left(p_{pred,i}-p_{meas,i}\right)}^2} $$

3. The mean absolute percentage error (MAPE) assesses uniform prediction errors, given by (3)

$$ \mathrm{MAPE}=\frac{100}{N}\sum_{i=1}^{N}\left|\frac{p_{pred,i}-p_{meas,i}}{p_0}\right| $$

where p_meas,i represents the actual solar power generation at the ith time step, p_pred,i is the corresponding solar power generation estimated by the forecasting model, p_0 is the normalizing reference power (commonly the installed capacity of the plant), and N is the number of points estimated in the forecasting period. In (1), p_meas and p_pred denote the full series of measured and predicted values. This metric is useful for evaluating the overall performance of the forecasts, especially when extreme events are a concern.
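The three metrics in (1) to (3) can be computed directly from paired series of measured and predicted power. The sketch below is a plain-Python illustration; the function names, the sample values and the use of a hypothetical plant capacity `capacity_kw` for the p_0 term in (3) are assumptions made for the example.

```python
import math
from typing import Sequence


def _check(meas: Sequence[float], pred: Sequence[float]) -> None:
    if len(meas) == 0 or len(meas) != len(pred):
        raise ValueError("series must be non-empty and of equal length")


def pearson(meas: Sequence[float], pred: Sequence[float]) -> float:
    """Equation (1): covariance divided by the product of standard deviations."""
    _check(meas, pred)
    n = len(meas)
    mean_m, mean_p = sum(meas) / n, sum(pred) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(meas, pred))
    var_m = sum((m - mean_m) ** 2 for m in meas)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    # The 1/n factors cancel in the ratio; a constant series gives a zero
    # denominator and is not handled here.
    return cov / math.sqrt(var_m * var_p)


def rmse(meas: Sequence[float], pred: Sequence[float]) -> float:
    """Equation (2): root of the mean squared prediction error."""
    _check(meas, pred)
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(meas, pred)) / len(meas))


def mape(meas: Sequence[float], pred: Sequence[float], p0: float) -> float:
    """Equation (3): mean absolute error as a percentage of the reference p0."""
    _check(meas, pred)
    return 100.0 / len(meas) * sum(abs((p - m) / p0) for m, p in zip(meas, pred))


# Hypothetical measured and predicted PV power in kW, and plant capacity.
measured = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 1.5, 3.0]
capacity_kw = 4.0

print(pearson(measured, predicted))
print(rmse(measured, predicted))
print(mape(measured, predicted, capacity_kw))
```

Because RMSE squares each error before averaging, a few large deviations raise it more than they raise MAPE, which weights all errors linearly.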

Methodology

Figure 1 presents the flowchart of the proposed forecasting process. To address the research questions, we first propose to conduct a case study that aims to benchmark the anomaly detection method and evaluate the link between forecasting accuracy and anomaly detection method.

Fig. 1 Flowchart of forecasting process based on predictive data mining techniques

In this work we plan to include three steps:

  • In the first step, data pre-processing techniques are applied to perform anomaly detection and outlier rejection. Three machine learning based approaches are considered:

    • Density-based anomaly detection

    • Clustering-based anomaly detection

    • Support vector machine-based anomaly detection

  • In the second step, the data pre-processed with the anomaly detection technique chosen in the first step are used to train the data-driven model based on predictive data mining techniques. The investigation of these two steps will explore the interplay between the anomaly detection technique and forecasting model accuracy.

  • The third step involves developing a unified model which forecasts accurately for different time horizons i.e. short-term, medium-term and long-term forecasting.
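As an illustration of the first step, the sketch below implements one simple density-based detector. The text does not name a specific algorithm, so the choice here is an assumption: each reading is scored by its mean distance to its k nearest neighbours (a sparse neighbourhood gives a high score), and readings whose score exceeds a multiple of the median score are flagged. The function names, `k`, `factor` and the sample readings are hypothetical.

```python
import statistics
from typing import List


def knn_density_scores(values: List[float], k: int = 3) -> List[float]:
    """Mean distance from each value to its k nearest neighbours (higher = sparser)."""
    if not 1 <= k < len(values):
        raise ValueError("k must satisfy 1 <= k < len(values)")
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_anomalies(values: List[float], k: int = 3, factor: float = 3.0
                   ) -> List[bool]:
    """Flag values whose sparsity score exceeds `factor` times the median score."""
    scores = knn_density_scores(values, k)
    threshold = factor * statistics.median(scores)
    return [s > threshold for s in scores]


# Hypothetical noon PV readings in kW on seven consecutive days.
noon_kw = [2.1, 2.3, 2.2, 2.4, 9.8, 2.0, 2.2]
flags = flag_anomalies(noon_kw, k=3, factor=3.0)
cleaned = [v for v, is_anomaly in zip(noon_kw, flags) if not is_anomaly]
```

Flagged readings are candidates rather than confirmed errors: as noted in the Background, extreme values can be real measurements of sudden events, so whether to reject them should be decided against the forecasting accuracy obtained in the second step. The median-based threshold also degenerates when most readings are identical (for example night-time zeros), in which case the data would need to be grouped or the threshold chosen differently.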

Conclusion

Intelligent decision making is important for providing unprecedented flexibility in energy management for the future power system. This requires accurate forecasts of future energy production and demand/consumption.

In this paper we first discussed the terminology of energy forecasting and its classification based on different time horizons, followed by a detailed state-of-the-art review, which revealed that advanced soft computing approaches will likely outperform statistical methods. However, complex methods have a limitation in terms of interpretability.

We proposed to apply data pre-processing techniques along with a data-driven forecasting model, which can possibly improve accuracy even when using partial information, i.e. past power data.