# Embedding based quantile regression neural network for probabilistic load forecasting

## Abstract

Compared with traditional point load forecasting, probabilistic load forecasting (PLF) is of great significance for advanced system scheduling and planning with higher reliability. Medium term probabilistic load forecasting with hourly resolution has proved practical, especially in medium term energy trading, and can improve forecasting performance compared with approaches that only utilize daily information. Two main uncertainties exist when PLF is implemented: the first is the temperature fluctuation at the same time of each year; the second is the load variation, meaning that the load can vary even if the observed indicators are fixed, since other unobserved external indicators can be responsible for the variation. Therefore, we propose a hybrid model considering both temperature uncertainty and load variation to generate medium term probabilistic forecasts with hourly resolution. An innovative quantile regression neural network with parameter embedding is established to capture the load variation, and a temperature scenario based technique is utilized to generate temperature forecasts in a probabilistic manner. It turns out that the proposed method outperforms commonly used benchmark models in the case study.

## Keywords

Probabilistic load forecasting, Feature embedding, Artificial neural network, Quantile regression, Machine learning

## 1 Introduction

Power load forecasting plays a core role in the planning and scheduling of power systems, for it not only reduces the cost of mismatch between generated power and actual demand, but also enhances the reliability of the whole system by eliminating inadequate dispatching of energy. Most of the literature on load forecasting techniques focuses on point forecasting, which generates a single forecast value for a specific moment in the future. Nevertheless, the power load is becoming increasingly volatile with the growing fluctuation and uncertainty caused by natural and man-made variation such as distributed renewable energy integration. As a result, forecasting approaches reflecting the uncertainty of the load are required by an increasing number of decision-makers in the energy industry. Single-point prediction cannot represent the randomness appearing in the load, and may sometimes invalidate investment in power supply because of the sporadic gap between real and predicted values [1, 2].

Compared with point forecasting, probabilistic load forecasting describes the variation of the load by providing outputs in the form of a probability density function (PDF), confidence intervals, or quantiles of the distribution. It can be more suitable for confirming objective demands in system planning and energy trading, and is therefore utilized in a wider range of applications.

Literature on probabilistic load forecasting is relatively limited compared with that on traditional point forecasting. According to Hong and Fan [3], a combination of two or three of the following components can be utilized to generate probabilistic load forecasts: creating input scenario simulations, designing probabilistic models, and transforming point forecasts into probabilistic forecasts through post-processing. References [4, 5, 6] mainly utilized input scenario simulation to create probabilistic forecasts. In [7], three basic input scenario generation methods (fixed-date, shifted-date, and bootstrap) were discussed, and an empirical study of these methods, measured by pinball loss, was conducted.

Besides, more efforts have been devoted to developing probabilistic forecasting models. They can be grouped into the following categories: time series based, statistical regression based, sequence operation theory based, and other machine learning based methods. Fang [8] proposed a model based on chaotic time series. Sequence operation theory (SOT) was established by Kang [9], aiming to handle complicated probabilistic modeling. It has been utilized in modeling correlated stochastic variables [10], which can be used to generate probabilistic forecasts together with other statistical models. Statistical and other machine learning models have been even more widely adopted in probabilistic forecasting, such as multiple linear regression [5, 11], quantile regression [12], gradient boosting [13], generalized additive models (GAM) [14], and kernel density estimation (KDE) [15].

In addition, probabilistic forecasts obtained by post-processing have also proved effective. In the studies of Xie [6] and McSharry [16], residual simulation was used to convert point forecasts into probabilistic forecasts. Liu [12] applied forecast combination to optimize results, which tended to yield a great boost in performance.

It can be concluded from the literature that probabilistic forecasting covers a wide time scope from short term to long term. Some works focused on short term probabilistic load forecasting [1, 17], whereas even more works focused on medium and long term probabilistic load forecasting [4, 5, 6, 7, 14, 15], because of its great significance in energy trading and system planning [3].

This paper offers a solution for long term probabilistic forecasting of hourly loads, combining input variable scenario simulation with a probabilistic model to generate forecasts. Concretely, an artificial neural network (ANN) is utilized as the basic structure to capture nonlinear relationships among variables. Although ANN has been mentioned in some literature related to probabilistic load forecasting [7, 18], it was simply treated as a model generating point forecasts, and the uncertainty of the outputs, which can be described by the model itself, was ignored. Thus, we refine the traditional ANN into a model that can generate probabilistic forecasts. We first feed the model with multiple inputs generated by the scenario-based method. Then a regularized loss resembling that of quantile regression is adopted as the loss function to be optimized by the ANN, together with advanced optimization algorithms that help avoid local minima, to describe the randomness of the load over an annual scope.

Besides, we also use embedding, a technique mapping low dimensional variables into a high dimensional space, which has been widely adopted for handling categorical variables in other neural-network scenarios [19, 20]. It proves to achieve better results than other common techniques used in previous literature, such as one-hot encoding. Altogether, the proposed method outperforms state-of-the-art benchmarks in medium term probabilistic load forecasting on the dataset described in the case study section.

It should be pointed out that some studies have already considered both the uncertainty in the input scenarios and the output variation. However, they combined input scenarios either with output residual simulation based on statistical methodology [21] or with a traditional probabilistic statistical model [14]. Compared with these efforts, our method stands out by fusing probabilistic outputs into a flexible non-linear network, which does not require setting up an extra combination of input variables and can better capture the non-linear dependencies between input and output variables owing to its complex structure.

The main contributions of this paper are as follows:

- 1) An ANN-based probabilistic forecasting model with a regularized quantile optimization objective is proposed, considering both the randomness of inputs and the output variation described by a solid non-linear model.

- 2) A novel embedding method is utilized to handle categorical input variables, showing potential effectiveness in enhancing the performance of load forecasting. It can be readily adapted to other machine-learning related scenarios in the field of power system scheduling and operation.

- 3) Dual uncertainties are considered based on the input fluctuation and the load variation described by a robust non-linear model, which has been relatively less considered in previous studies.

## 2 Framework

### 2.1 Outliers detection

Two steps are designed for outlier detection. The first step is a naïve continuity-based method. It is hypothesized that the hourly load should not exhibit a two-sided saltation (an abrupt jump relative to both neighbouring points) at any point. So the anomaly criterion is set as:

However, this naïve method cannot capture outliers beyond temporary false records. Thus, the multiple linear regression model (also called the Vanilla Model in [11]) is utilized as an outlier detector in the second stage. This method was first proposed and proved effective in [21]. After the model is fitted to the historical hourly load, the absolute percentage error (APE) is calculated for each hourly load in the training set. The original load observations in the training set with APE values higher than 50% are considered outliers and are replaced by the values estimated by the outlier detector.

Besides, it is of great significance to apply naïve outlier detection first. Granted, the baseline model can detect and correct relatively sparse outliers, yet the model based method can be detrimentally affected when the number of anomalous load points increases. For example, in bus load forecasting, the number of outliers appearing in the bus load data cannot be neglected, so researchers have to use a naïve method to clean the data first. It can be concluded that applying naïve outlier detection before other more advanced anomaly correction methods is quite necessary, bringing robustness to the load forecasting process as a whole.
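The second-stage detector can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit on a single temperature feature as a stand-in for the Vanilla Model of [11], whose full specification is not reproduced here; the function name `replace_outliers_by_ape` and the toy data are hypothetical.

```python
import numpy as np

def replace_outliers_by_ape(load, features, ape_threshold=0.5):
    """Fit a linear regression and replace points whose APE exceeds the threshold.

    Assumption: a plain least-squares model stands in for the detector model.
    """
    load = np.asarray(load, dtype=float)
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(load)), features])
    coef, *_ = np.linalg.lstsq(X, load, rcond=None)
    fitted = X @ coef
    # Absolute percentage error of each observation against the fitted value.
    ape = np.abs(load - fitted) / np.abs(load)
    is_outlier = ape > ape_threshold
    # Outliers are replaced by the detector's estimate; other points are kept.
    cleaned = np.where(is_outlier, fitted, load)
    return cleaned, is_outlier

# Toy data: a linear load-temperature relation with one injected false record.
temperature = np.linspace(0.0, 30.0, 50)
load = 100.0 + 2.0 * temperature
load[10] = 400.0
cleaned, is_outlier = replace_outliers_by_ape(load, temperature)
```

In this toy case only the injected record exceeds the 50% APE threshold, so it is the only value that gets replaced.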

### 2.2 Trend analysis

We extract the linear trend by simply adding a linear variable ranging from 0 to 1 as an input of the following regression model. The experimental results show that the forecasting model performs better with the linear trend input than without it.
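As a small illustration of this trend input, the sketch below builds a variable rising linearly from 0 at the first sample to 1 at the last and appends it to a feature matrix. The temperature values and the helper name `linear_trend` are illustrative assumptions.

```python
import numpy as np

def linear_trend(n_samples):
    """Trend variable ascending linearly from 0 (first sample) to 1 (last sample)."""
    return np.linspace(0.0, 1.0, n_samples)

# Hypothetical hourly temperatures; the trend is added as an extra input column.
temperature = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
trend = linear_trend(len(temperature))
features = np.column_stack([temperature, trend])
```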

### 2.3 Data normalization

### 2.4 Training probabilistic forecasting models considering load variation

The first stage of forecasting is training a regression model considering load variation, which is proposed as a quantile regression neural network (QRNN) in this paper, generating probabilistic results in the form of quantiles. Normalized hourly variables (temperature, day types, etc.) act as training features, whereas the corresponding hourly loads are the training labels supervising the training process of QRNN. The training process iterates, fine-tuning the parameters of the model, and is terminated once the validation loss no longer decreases.

### 2.5 Combining temperature uncertainty in load forecasting on the basis of QRNN

Since QRNN is trained based on temporally simultaneous features, it cannot be utilized directly in forecasting one year ahead because some features, like hourly temperature in the next year, cannot be foreseen. So temperature uncertainty should be considered in real forecasting stage. The final results of load forecasting are generated by replacing the simultaneous temperature fed into QRNN with historical temperature scenarios.

## 3 Probabilistic load forecasting considering load variation and temperature uncertainty

In this section, the formulation of the forecasting problem is presented, followed by a detailed description of the model proposed in this paper.

### 3.1 Problem formulation

*t*; \(N_{\tau }\) is the dimension of the vector, which is also the number of quantiles \(\tau\); \(h(\cdot )\) denotes the general function mapping input variables to the output load, which in this paper is established by QRNN; \(T_{t}\) refers to the hourly temperature; \(Trend_{t}\) stands for the linear trend, ascending linearly from the first point to the last in the whole dataset; \(M_{t}\) (time mode) consists of four components, which can be formulated as:

*t*.

### 3.2 Embedding technique for categorical variables

In a forecasting problem, categorical variables like the day type at moment *t* should be converted to numeric representations in order to fit most numerically solved formulas. The most common techniques are direct numbering and one-hot encoding. Generally speaking, embedding is a technique that maps one-dimensional categorical variables to numerical features in a high dimensional space. It turns out that categorical variables mapped by the embedding technique capture more information than those encoded by other common techniques, owing to the flexibility in the output vector dimension and the complexity of the embedding parameters.

*t*; \(\varvec{M}_{t}^{one-hot} \in \mathbb {R}^{4 \times N_{\text {max}}}\) is one-hot representation of \(\varvec{m}_t^\text {T}\), where \(N_{\text {max}}\) denotes the largest number of categories in elements of \(M_t\); \(\varvec{Q} \in \mathbb {R}^{N_{\text {max}} \times N_{em}}\) denotes the embedding parameter matrix, containing \(N_{\text {max}} \times N_{em}\) individual parameters, which can be learned and updated in the training process together with other parts of the neural network.
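The matrix dimensions given above suggest that the embedded features can be written as the product of the one-hot matrix and \(\varvec{Q}\), which is equivalent to looking up rows of \(\varvec{Q}\). The following sketch illustrates this equivalence under stated assumptions: \(N_{\text{max}} = 24\), \(N_{em} = 3\), the category indices in `m_t`, and a randomly initialized `Q` are all hypothetical, whereas in the paper \(\varvec{Q}\) is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_max, n_em = 24, 3                   # assumed sizes: N_max categories, N_em embedding dims
Q = rng.normal(size=(n_max, n_em))    # embedding matrix (random here, learned in practice)

# Hypothetical category indices of the four time-mode components at moment t.
m_t = np.array([13, 2, 6, 1])

# One-hot representation with shape (4, N_max).
one_hot = np.zeros((4, n_max))
one_hot[np.arange(4), m_t] = 1.0

# Multiplying by Q selects one row of Q per component, giving shape (4, N_em).
embedded = one_hot @ Q
same_as_lookup = np.allclose(embedded, Q[m_t])
```

Because the product only selects rows, deep learning libraries typically implement embedding layers as a table lookup rather than an explicit matrix multiplication.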

### 3.3 Quantile regression neural network

An artificial neural network (ANN) has been proved suitable for regression problems with multiple features owing to its complicated connection of variables and non-linear transformation through activation functions [22]. The most commonly used ANNs for regression problems utilize the back propagation (BP) algorithm to update parameters by minimizing a loss between the outputs of the ANN \(\hat{y}\) and the real value *y*, such as the mean square error (MSE).

*N* stands for the number of samples fed into the network each time, and \(E_t\) and \(\hat{E}_t^{\tau }\) are the real value and the predicted value corresponding to quantile \(\tau\), respectively.

By setting \(\tau\) as 1, 2, ..., \(N_{\tau }\), \(N_{\tau }\) forecasting results at time *t*, \(\hat{E}_t^{1}\), \(\hat{E}_t^{2}\), ..., \(\hat{E}_t^{N_{\tau }}\), are obtained through \(N_{\tau }\) QRNNs with different loss functions. By concatenating these results, the estimate of \(\varvec{E}_{t}\) is obtained as \(\varvec{\widetilde{E}}_{t}\).
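To make the quantile objective concrete, the sketch below implements the standard pinball (quantile) loss averaged over a batch. It is an assumption that this matches the data-fitting part of the loss in (7); the regularization terms of the paper are not included, and the function name `pinball_loss` is illustrative.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1).

    Under-forecasts (y_true > y_pred) are weighted by tau,
    over-forecasts by (1 - tau).
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

# For tau = 0.9, forecasting 10 MW too low costs nine times more
# than forecasting 10 MW too high.
loss_under = pinball_loss([100.0, 100.0], [90.0, 90.0], tau=0.9)
loss_over = pinball_loss([100.0, 100.0], [110.0, 110.0], tau=0.9)
```

This asymmetry is what drives a network trained with a high \(\tau\) toward the upper part of the load distribution and one trained with a low \(\tau\) toward the lower part.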

### 3.4 Combining temperature uncertainty on the basis of QRNN

It should be noted that \(\varvec{\widetilde{E}}_{t}\) indicates the variation of the load given that the exact simultaneous temperature is known beforehand. However, in a medium term forecasting problem, the temperature cannot be foreseen over a horizon as long as a year. Since the temperature in a specific zone follows a similar, though not identical, pattern at the same moment of each year, the hourly temperature can be forecasted by stacking the temperatures at nearby moments in previous years. Temperature scenario generation for temperature forecasting is proposed based on this hypothesis and has proved effective in modeling the uncertainty of medium term hourly temperature [7].

*h* on the *d*th day of year *y*, then the temperature scenario can be represented as:

Then we replace \(T_t\) in (6) by the elements of \(\varvec{Ts}^{h}_{y, d}\). As a result, the outputs of QRNN capture both temperature and load uncertainty. The final quantiles are generated according to the empirical distribution constructed from these outputs.
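A possible way to assemble such a scenario set is sketched below. It assumes a shifted-date construction that collects the temperature at the same hour on the days within \(\pm m\) of the target day across the available historical years, which yields \((2m+1) \times n\) scenarios and is consistent with the 90 scenarios reported for \(m = 4\) and \(n = 10\) in the case study. The array layout, the wrap-around at year boundaries, and the function name `temperature_scenarios` are simplifying assumptions.

```python
import numpy as np

def temperature_scenarios(history, day, hour, m=4):
    """Collect shifted-date temperature scenarios for one target hour.

    history: array of shape (n_years, n_days, 24) holding past hourly
             temperatures (assumed layout).
    Returns a 1-D array of (2 * m + 1) * n_years scenario temperatures.
    """
    n_days = history.shape[1]
    # Days d - m, ..., d + m; wrapping at the year ends is a simplification.
    days = (day + np.arange(-m, m + 1)) % n_days
    return history[:, days, hour].ravel()

# Synthetic history for 10 years, used only to show the resulting shape.
rng = np.random.default_rng(0)
history = rng.normal(loc=10.0, scale=8.0, size=(10, 365, 24))
scenarios = temperature_scenarios(history, day=0, hour=5, m=4)
```

Each element of `scenarios` would then be fed to the trained model in place of the unknown future temperature.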

## 4 Comparison and evaluation criteria

In this section, several evaluation criteria in the field of probabilistic forecasting are reviewed, and the benchmark models used for comparison in the case study are introduced.

### 4.1 Evaluation criterion

*t* respectively; \(\tau\) is the targeted quantile of the forecasting distribution. It is in fact similar to the loss function in (7). Pinball loss considers the overall contribution of the forecasting results by aggregating over quantiles; since the quantiles are discrete and their number can be set to a feasible quantity, the computation is simplified. Moreover, a lower pinball loss indicates a better forecasting result. This is the criterion used to evaluate the proposed method and the benchmarks in this paper.

### 4.2 Benchmark models

In addition, a neural-network based model is introduced with (10) as the optimization target; we denote this model as MLP (multi-layer perceptron). This model acts as a parallel to MLR since both take similar inputs, estimate parameters by optimizing the same objective (10), and consider only temperature uncertainty. MLP has a structure similar to QRNN, yet it contains no embedding layers, has only one hidden layer after the inputs are fed into the network, and uses ReLU as the activation function.

Besides, it should be mentioned that \(T_t\) is replaced by temperature scenarios in the final forecasting for all three benchmark models in order to generate probabilistic forecasting results.

## 5 Case study

In this section, we present an experiment based on a real-world dataset. The proposed model is built with Keras, a deep learning library in Python, and the benchmark models are built with Scikit-Learn.

### 5.1 Introduction of dataset and experiment settings

The hourly load and corresponding weather information are obtained from the official website of ISO New England, which is accessible to the public. The data cover 8 different zones in New England, US. We only utilize the time information (hour, week, month, year), load, and dry-bulb temperature in this case study. In our experiment, the data from 2004 to 2015 are selected and divided into our training set, validation set, and test set.

### 5.2 Procedures of proposed forecasting approach in the experiment

*lr*, the dimension of embedding layer \(N_{embedding}\), and regularization factors \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) by minimizing the validation loss. Hourly data in 2015 are used for test of final forecasting performance. The outputs of QRNN are 9 quantile values \(\hat{E_{\tau }}\) estimated by minimizing (7), setting \(\tau\) from 0.1 to 0.9. Figure 5 shows the output intervals by QRNN with real temperature as an input. The interval implies the variation of the load even if the temperature is fixed.

In the second stage, as stated in the last section, the uncertainty of temperature needs to be considered by giving a probabilistic forecast of the hourly temperature in 2015.

The temperature scenario based method described in Sect. 3 proved more effective than other temperature forecasting techniques such as quantGAM [14] in this specific case study. Concretely, *m* and *n* in (8) are set to 4 and 10 in the case study for all models. As a result, 90 temperature scenarios are generated and plugged into (6), giving 810 forecasting results in total. The final 9 quantiles are generated from the empirical distribution based on these 810 results. Figure 6 shows the final results considering both load variation and temperature uncertainty.
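The last step, turning the 810 pooled results for one target hour into 9 final quantiles, can be sketched as follows. The use of `np.quantile` with its default linear interpolation is an assumption, since the exact way the empirical distribution is inverted is not specified above; the QRNN outputs are replaced by synthetic numbers and the function name `final_quantiles` is hypothetical.

```python
import numpy as np

def final_quantiles(qrnn_outputs, taus):
    """Pool all per-scenario quantile outputs and read off empirical quantiles.

    qrnn_outputs: array of shape (n_scenarios, n_quantiles) for one target hour.
    """
    pooled = np.asarray(qrnn_outputs, dtype=float).ravel()
    return np.quantile(pooled, taus)

taus = np.arange(1, 10) / 10.0        # 0.1, 0.2, ..., 0.9

# Synthetic stand-in for 90 scenarios x 9 quantile outputs (810 values).
rng = np.random.default_rng(0)
outputs = np.sort(rng.normal(1000.0, 50.0, size=(90, 9)), axis=1)
quantiles = final_quantiles(outputs, taus)
```

Pooling before taking quantiles means the final interval widens both when the temperature scenarios disagree and when each individual QRNN interval is wide.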

### 5.3 Comparison and discussion

In this subsection, the following crucial questions are answered through corresponding comparisons. Firstly, does a model combining the output variation described by a probabilistic model with temperature input uncertainty perform better than one that only takes stochastic temperature scenarios into account? Secondly, can QRNN outperform other statistical models that consider dual uncertainty? Thirdly, is embedding of categorical features beneficial compared with traditional techniques like one-hot encoding? Finally, an overall comparison of forecasting performance between the proposed model and three benchmark models is presented.

Figure 7 shows three forecasting results for the same horizon. All three models underestimate the hourly load. Since QRNN captures both temperature uncertainty and load variation, the error is compensated by a wider forecasting interval, leading to a decrease in pinball loss, whereas MLR, which does not consider load variation, fails to compensate for such error, leading to a significant deviation on this test day.

On the other hand, although LQR considers dual uncertainty as illustrated in Sect. 4, the final forecasting results of LQR in Fig. 7c reveal two main problems caused by simply modeling hourly load and temperature separately with naïve linear quantile regression. Since the LQR model is trained separately for each fixed hour and day type, loads are estimated independently and concatenated by hour and date into the final load series. This leads to discontinuity between hours, which can be detrimental to the forecasting results owing to the lack of smoothness. This argument actually undermines the "training in separate hours" pattern in [14], since the continuity of the load over time is ignored. Besides, the forecasting interval is conspicuously widened. This can be explained by the fact that LQR only takes temperature and its polynomials as inputs in the case study, which can lead to overestimation of the interval because of the scarcity of input feature types.

Table 1 Annual forecasting pinball loss

Zone | QRNN (MW) | MLR (MW) | MLP (MW) | LQR (MW) | MaxRI (%) |
---|---|---|---|---|---|
CT | 104.9 | 111.8 | 110.8 | 133.1 | 21.2 |
SEMA | 52.0 | 56.9 | 54.3 | 62.3 | 16.5 |
NEMA | 75.7 | 84.4 | 81.2 | 96.9 | 21.9 |
WCMA | 51.6 | 56.8 | 55.6 | 69.2 | 25.4 |
VT | 15.3 | 14.8 | | 19.6 | 21.9 |
NH | 31.1 | 33.8 | 33.1 | 37.8 | 17.7 |
RI | 25.9 | 28.1 | 27.3 | 31.7 | 18.3 |
ME | 22.1 | 25.8 | 25.1 | 27.1 | 18.5 |

Table 2 Comparison between embedding and one-hot encoding

Zone | Embedding (MW) | One-hot encoding (MW) |
---|---|---|
CT | 104.9 | 106.8 |
SEMA | 52.0 | 52.6 |
NEMA | 75.7 | 78.8 |
WCMA | 51.6 | 53.8 |
VT | 15.3 | 15.9 |
NH | 31.1 | 33.2 |
RI | 25.9 | 28.1 |
ME | 22.1 | 22.8 |

Table 1 shows the final forecasting pinball loss in the 8 zones of New England for the proposed approach together with three benchmarks, as well as the maximum relative improvement (MaxRI). Since a lower pinball loss indicates a better probabilistic forecast, QRNN outperforms the three benchmark models in 7 of the 8 zones; in the remaining zone (VT), it is only 3.8% worse than the best model. The MaxRI column shows that QRNN outperforms the benchmark models significantly. The relative improvements across all zones reach approximately 20%, indicating the effectiveness of our proposed method against the benchmarks in the case study.

In addition, MLR and MLP are parallel benchmarks representing models that consider only the single uncertainty of temperature. The results show that they have similar performance in the case study, yet MLP performs slightly better since it has a higher capability of modeling non-linear effects and interactions between variables. Although LQR considers both load variation and temperature, the widened forecasting interval and the discontinuity in the load series may contribute to its high pinball loss.

To demonstrate the potential effectiveness of embedding for categorical parameters, another comparison is conducted and the final results are shown in Table 2. It should be mentioned that the results of QRNN with embedding reported here are fine-tuned by adjusting the embedding layers to minimize the validation loss. It can be concluded that, compared with one-hot encoding, optimized parameter embedding can decrease the pinball loss and, in other words, can better capture the features of input variables in probabilistic forecasting.
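The MaxRI figures can be checked with a short calculation. The definition used below, the relative reduction in pinball loss against the worst-performing benchmark, is an inference rather than a formula stated in the text, but it reproduces the reported values when the QRNN loss is taken from the embedding column of Table 2 (for example 104.9 MW for CT). The function name `max_relative_improvement` is hypothetical.

```python
def max_relative_improvement(proposed, benchmarks):
    """Relative improvement (%) of the proposed model over the worst benchmark.

    Assumed definition: (max benchmark loss - proposed loss) / max benchmark loss.
    """
    worst = max(benchmarks)
    return 100.0 * (worst - proposed) / worst

# CT zone: QRNN 104.9 MW against MLR 111.8, MLP 110.8, and LQR 133.1 MW.
maxri_ct = max_relative_improvement(104.9, [111.8, 110.8, 133.1])
# SEMA zone: QRNN 52.0 MW against MLR 56.9, MLP 54.3, and LQR 62.3 MW.
maxri_sema = max_relative_improvement(52.0, [56.9, 54.3, 62.3])
print(round(maxri_ct, 1), round(maxri_sema, 1))  # 21.2 16.5
```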

## 6 Conclusion

In this paper, an innovative method for probabilistic load forecasting is proposed. By considering both input uncertainty and output variation, the proposed QRNN model performs better than commonly used benchmark models. Besides, embedding techniques have shown potential in handling categorical inputs, which can enhance the overall forecasting performance. Further studies can be conducted in multiple directions, such as optimizing the network structure with state-of-the-art techniques like deep neural networks and utilizing multi-temporal information to train the model, thereby mining more hidden information and enhancing the performance of load forecasting.

## Notes

### Acknowledgements

This work was supported by National Key R&D Program of China (No. 2016YFB0900100).

## References

- [1] Yang W, Kang C, Xia Q et al (2006) Short term probabilistic load forecasting based on statistics of probability distribution of forecasting errors. Autom Electr Power Syst 30(19):47–52
- [2] Li Z, Ye L, Zhao Y et al (2016) Short-term wind power prediction based on extreme learning machine with error correction. Prot Control Mod Power Syst 1(1):1–8
- [3] Hong T, Fan S (2016) Probabilistic electric load forecasting: a tutorial review. Int J Forecast 32(3):914–938
- [4] Hyndman RJ, Fan S (2010) Density forecasting for long-term peak electricity demand. IEEE Trans Power Syst 25(2):1142–1153
- [5] Hong T, Wilson J, Xie J (2014) Long term probabilistic load forecasting and normalization with hourly information. IEEE Trans Smart Grid 5(1):456–462
- [6] Xie J, Hong T, Laing T et al (2017) On normality assumption in residual simulation for probabilistic load forecasting. IEEE Trans Smart Grid 8(3):1046–1053
- [7] Xie J, Hong T (2016) Temperature scenario generation for probabilistic load forecasting. IEEE Trans Smart Grid. https://doi.org/10.1109/TSG.2016.2597178
- [8] Fang R, Zhou J, Zhang Y et al (2009) Short-term probabilistic load forecasting using chaotic time series. Journal of Huazhong University of Science and Technology (Natural Science Edition) 37(5):125–128
- [9] Kang C, Bai L, Xia Q et al (2002) Implement of probabilistic production cost simulation algorithm based on sequence operation theory. Proc CSEE 22(9):6–11
- [10] Zhang N, Kang C (2012) Dependent sequence operation for wind power outputs analyses. J Tsinghua Univ 52(5):704–709
- [11] Hong T, Wang P, Willis HL (2011) A naïve multiple linear regression benchmark for short term load forecasting. In: Proceedings of the 2011 IEEE power and energy society general meeting, Detroit, USA, 24–28 July 2011, pp 1–6
- [12] Liu B, Nowotarski J, Hong T (2017) Probabilistic load forecasting via quantile regression averaging on sister forecasts. IEEE Trans Smart Grid 8(2):730–737
- [13] Taieb SB, Hyndman RJ (2014) A gradient boosting approach to the Kaggle load forecasting competition. Int J Forecast 30(2):382–394
- [14] Gaillard P, Goude Y, Nedellec R (2016) Additive models and robust aggregation for GEFCom2014 probabilistic electric load and electricity price forecasting. Int J Forecast 32(3):1038–1050
- [15] Haben S, Giasemidis G (2016) A hybrid model of kernel density estimation and quantile regression for GEFCom2014 probabilistic load forecasting. Int J Forecast 32(3):1017–1022
- [16] McSharry PE, Bouwman S, Bloemhof G (2005) Probabilistic forecasts of the magnitude and timing of peak electricity demand. IEEE Trans Power Syst 20(2):1166–1172
- [17] Zhou J, Zhang Y, Li Q et al (2005) Probabilistic short-term load forecasting based on dynamic self-adaptive radial basis function network. Power Syst Technol 34(3):37–41
- [18] Ranaweera DK, Karady GG, Farmer RG (1996) Effect of probabilistic inputs on neural network-based electric load forecasting. IEEE Trans Neural Netw 7(6):1528–1532
- [19] Covington P, Adams J, Sargin E (2016) Deep neural networks for YouTube recommendations. In: Proceedings of the 10th ACM conference on recommender systems, Boston, USA, 15–19 September 2016, pp 191–198
- [20] Li Y, Xu L, Tian F et al (2015) Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In: Proceedings of 2015 international conference on artificial intelligence, Buenos Aires, Argentina, 25–31 June 2015, pp 3650–3656
- [21] Xie J, Hong T (2016) GEFCom2014 probabilistic electric load forecasting: an integrated solution with forecast combination and residual simulation. Int J Forecast 32(3):1012–1016
- [22] Lee D, Baldick R (2014) Short-term wind power ensemble prediction based on Gaussian processes and neural networks. IEEE Trans Smart Grid 5(1):501–510
- [23] McDonald GC (2010) Ridge regression. Wiley Interdiscip Rev Comput Stat 1(1):93–100
- [24] Hong T, Pinson P, Fan S et al (2016) Probabilistic energy forecasting: global energy forecasting competition 2014 and beyond. Int J Forecast 32(3):896–913
- [25] Wan C, Xu Z, Pinson P (2014) Probabilistic forecasting of wind power generation using extreme learning machine. IEEE Trans Power Syst 29(3):1033–1044

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.