1 Introduction

The electrical load is affected by various factors such as weather conditions, distributed renewable integration, demand response implementation, energy policy, and emergent events. Traditional load forecasting only provides an expected value of the future load. In modern power systems, characterizing the uncertainties of future load benefits the reliability and economy of the whole system. An increasing number of probabilistic load forecasting (PLF) methods have been proposed in recent years [1].

The whole process of load forecasting includes data preprocessing, feature engineering, model establishment and optimization, and result analysis and visualization. The quantile regression model is the most frequently used tool for PLF. An embedding-based quantile regression neural network (QRNN) was proposed in [2] for PLF, where discrete variables such as day type, day of the week, and hour of the day are modeled by an embedding matrix. Another improved QRNN was presented in [3]. The network includes a Gaussian noise layer and a dropout layer to overcome the overfitting issue. A “separate-forecasting-integrate” framework was proposed in [4] to forecast the net load with high penetration of behind-the-meter photovoltaic (PV). The decomposed parts are forecasted by QRNN, quantile random forest, or quantile gradient boosting regression tree (QGBRT), while the dependencies between load and PV uncertainties are modeled by a copula function and dependent discrete convolution (DDC) [5]. Quantile, density, and interval are the three main forms of probabilistic forecasts. Kernel density estimation can bridge quantile forecasts and density forecasts. The triangle kernel function was used in [6] to transform the quantiles obtained by QRNN into density forecasts.

Since various point load forecasting methods already exist, it is natural to transform point forecasts into probabilistic forecasts. A temperature scenario generation approach was proposed in [7], where various temperature scenarios were fed into the point forecasting model to generate multiple point forecasts, from which the quantiles could then be calculated. A simple quantile regression averaging model was established on multiple point forecasts generated by sister forecasting models in [8]. A deep investigation of the point forecasting residual was conducted in [9], where residual simulation was implemented based on the normality assumption to produce probabilistic forecasts. Instead of focusing on a specific PLF model, an ensemble approach was proposed in [10] to combine the quantile probabilistic forecasts of multiple models. The combining model was formulated as a linear programming (LP) problem to minimize the pinball loss of the final model.

PLF has also been applied to individual households in addition to system-level load [11]. Conditional kernel density estimation was used in [12] to model the conditional distribution of residential load in different time periods. In [13], the uncertainty within smart meter data was forecasted by a boosting additive quantile regression model. Case studies show the superiority of the additive model over the normality assumption-based method. Gaussian process regression (GPR) was applied in [14] for residential load forecasting, where different forms of kernels were compared.

Most of the load forecasting literature focuses on the establishment and optimization of the forecasting model, whereas very little research focuses on the preprocessing and feature engineering phase, especially for PLF. Feature selection, an important part of feature engineering, tries to select the most useful or relevant features from the original data to enhance the accuracy of load forecasts and increase the interpretability of the forecasting model. When the number of original features is very large, it can also reduce the complexity of solving the model. Most feature selection research concentrates on traditional point forecasting, for example the least absolute shrinkage and selection operator (LASSO) method [15]. However, feature selection for probabilistic forecasting has rarely been studied. To the best of our knowledge, the only feature selection work for PLF is [16], which varies the number of lagged hours and lagged days of the temperature. The global feature selection method proposed there is quite similar to an exhaustive search. The method was tested on seven states of the United States. Its main idea is to evaluate the performance of feature selection using pinball loss instead of root mean square error (RMSE).

In this paper, we enrich the PLF feature selection literature by proposing a novel sparse penalized quantile regression method. To tackle the challenges of heavy computational burden and non-differentiability, the alternating direction method of multipliers (ADMM) [17] is applied to decompose the optimization problem. The results show that the proposed feature selection method improves the performance of the probabilistic forecasting model compared with the original model and the heuristic model.

The main contributions of this paper can be summarized as follows:

1) Proposing a feature selection method for PLF by introducing an \(L_1\)-norm sparse penalty into the quantile regression model.

2) Designing an ADMM algorithm to decouple the large-scale optimization problem and boost the efficiency of the model training process.

3) Validating the superiority of the proposed sparse penalized quantile regression method by numerical simulation based on cases from open datasets.

The remainder of this paper is organized as follows: Section 2 provides the dataset used and the regression models for load forecasting; Section 3 introduces a straightforward feature selection method which is used as a benchmark in the case studies; Section 4 introduces the proposed feature selection method and the ADMM-based training method; Section 5 presents the implementation details of our proposed method; and Section 6 conducts the case studies on the open datasets from the Global Energy Forecasting Competition 2012 (GEFCom 2012).

2 Data and model

2.1 Load dataset exploration

The load data used in this paper is the open dataset used for GEFCom 2012 [18]. The dataset includes 5 years of hourly load data of 20 regional power systems of North Carolina from 2004 to 2008. The corresponding hourly temperature data from 11 weather stations are also provided.

Figure 1 shows the four-year load data of the first zone and temperature data from the first weather station from 2004/01/01 to 2007/12/31. The load and weather show clear periodicity as well as large variations. We can also find an approximate quadratic relationship between the load and temperature data. To produce highly accurate forecasts, we need to carefully model how calendar date and temperature influence the load.

Fig. 1 Load and temperature data from 2004/01/01 to 2007/12/31

2.2 Linear regression model considering recency effects

This section introduces the forecasting model and the input features to be selected. Temperature has a complex impact on the electrical load. Both the current and past temperatures influence the load because of the inertia of air temperature change and of consumers’ perception. For example, when the temperature rises suddenly, consumers in the area often do not respond immediately but take cooling measures, such as turning on air conditioners, several minutes or hours later. Thus, the load sequence may lag behind the temperature sequence. We call this the lag effect or recency effect. In addition, the way temperature affects load may change with the day of the year and the hour of the day, which means that there are cross effects between temperature and calendar variables.

Multiple linear regression models are widely used in the field of load forecasting by connecting the characteristic variables and dependent load. They can generate accurate load forecasts without consuming abundant computation resources. Since the load is highly dependent on temperature as well as calendar variables, for instance hour, week, and month, [19] proposes a naive vanilla benchmark model considering cross effects of variables mentioned above corresponding to the time of load being forecasted. It can be formulated as:

$$\begin{aligned} {\left\{ \begin{array}{ll} \hat{y}_t = \beta _{0} + \beta _{1}N_t + \beta _{2}M_t + \beta _{3}W_t + \beta _{4}H_t + \beta _{5}W_tH_t +f_r(T_t) \\ \begin{aligned} f_r (T_t) &{} = \beta _{6}T_t + \beta _{7}T_t^2 + \beta _{8}T_t^3 + \beta _{9}T_tM_t + \beta _{10}T_t^2M_t \\ &{} \quad + \beta _{11}T_t^3M_t +\beta _{12}T_tH_t + \beta _{13}T_t^2H_t + \beta _{14}T_t^3H_t \end{aligned} \end{array}\right. } \end{aligned}$$

where \(\hat{y}_t\) denotes the estimated load at time t; \(\beta _{0}\) to \(\beta _{14}\) denote the linear coefficients of the regression model; \(T_t\) denotes the temperature at time t; \(M_t\), \(W_t\) and \(H_t\) denote month-of-the-year, day-of-the-week, and hour-of-the-day classification variables corresponding to time t, respectively; \(N_t\) denotes a series of natural numbers (i.e., 1, 2, ...) to describe the long-term increase of the load; and \(f_r(\cdot )\) is a function of \(T_t\) that accounts for the polynomial relationship between temperature and load and the cross effects among temperature, month of the year, and hour of the day.

An improved multiple linear regression model considering the recency effect is proposed in [20], which takes the moving historical average temperature, the lagged temperature, and their interactions with the calendar variables into the regression model. The model with recency effect can improve the performance of load forecasting. It is expressed as follows:

$$\begin{aligned} \begin{aligned} \hat{y}_t&=\ \beta _{0} + \beta _{1}N_t + \beta _{2}M_t + \beta _{3}W_t + \beta _{4}H_t \\&\quad +\beta _{5}W_tH_t + f_r (T_t) + \sum \limits _{d=1}^{N_D}{f_r (\widetilde{T}_{t,d})} + \sum \limits _{h=1}^{N_H}{f_r (T_{t-h})}\\ \end{aligned} \end{aligned}$$

where the last two terms represent the recency effect; \(N_D\) and \(N_H\) denote the numbers of days and hours of the lagged temperature that are considered in the recency effect, respectively; d and h denote the indexes for the lagged days and hours, respectively; and \(\widetilde{T}_{t,d}\) denotes the moving historical average temperature, which is calculated as follows:

$$\begin{aligned} \widetilde{T}_{t,d} = \frac{1}{24}\sum \limits _{h=24d-23}^{24d}T_{t-h} \quad d=1,2,\ldots ,N_D \end{aligned}$$
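As an illustration of (3), the following minimal sketch computes the moving historical average temperature for a given hour and lagged day. The function name, the 0-based hour index, and the toy temperature series are assumptions made for this example only.

```python
def moving_average_temperature(temps, t, d):
    """Average of the 24 hourly temperatures in the d-th day before hour t.

    temps is a list of hourly temperatures and t is a 0-based index into it
    (this indexing convention is an assumption of the sketch).
    """
    first_lag, last_lag = 24 * d - 23, 24 * d
    if t - last_lag < 0:
        raise ValueError("not enough history for the requested lag")
    # Lags h = 24d-23, ..., 24d, exactly as in the summation limits of (3).
    window = [temps[t - h] for h in range(first_lag, last_lag + 1)]
    return sum(window) / 24.0


# Toy series in which the temperature equals the hour index.
temps = [float(i) for i in range(72)]
avg_d1 = moving_average_temperature(temps, t=48, d=1)  # averages hours 24..47
avg_d2 = moving_average_temperature(temps, t=48, d=2)  # averages hours 0..23
```

Each lagged day therefore contributes one averaged temperature, which then enters \(f_r(\cdot )\) in the same way as the current temperature.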

Thus, the model considering recency effect in (2) can be neatly presented as follows:

$$\begin{aligned} \begin{aligned} \hat{y}_t = \varvec{\beta }^{\mathrm {T}}{\varvec{X}}_t \end{aligned} \end{aligned}$$

where \(\varvec{X}_t\) denotes a collection of all the features; and \(\varvec{\beta }\) denotes the vector of the coefficients to be trained.

The number of features \(N_F\) depends on the number of lagged days \(N_D\) and the number of lagged hours \(N_H\). \(M_t\), \(W_t\), and \(H_t\) are all represented by dummy variables. Dummy coding uses the all-zero vector \(\mathbf {0}\) to represent one category of each classification variable. Thus, the dummy encoding has one dimension fewer than one-hot encoding. This guarantees that the final feature matrix has full rank after adding an all-one column representing the intercept. \(M_t\), \(W_t\), and \(H_t\) need to be represented by 11, 6, and 23 classification variables, respectively. Thus, the total number of features \(N_F\) is:

$$\begin{aligned} \begin{aligned} N_F=&1+11+6+23+23\times 6\\ &+(3+3\times 11+3\times 23)(1+N_D+N_H)\\ =&\,284+105(N_D+N_H) \end{aligned} \end{aligned}$$

If we consider temperature lags of 7 days and 12 hours, the total number of features is 2279, which makes the regression model a very high-dimensional problem and results in a heavy computational burden. This is the main reason for conducting feature selection. In the following two sections, two LASSO-based feature selection methods (Pre-LASSO and Quantile-LASSO) are introduced.
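The feature count in (5) can be checked with a few lines of code. This is only an arithmetic sketch; the function and variable names are illustrative and not part of the original model description.

```python
def count_features(n_days, n_hours):
    """Number of regression features (intercept excluded) for given lag settings."""
    month, weekday, hour = 11, 6, 23  # dummy-coded sizes of M_t, W_t and H_t
    # Trend N_t, then M_t, W_t, H_t and the W_t * H_t interaction.
    calendar = 1 + month + weekday + hour + weekday * hour
    # T, T^2, T^3 plus their interactions with M_t and H_t, i.e. one f_r block.
    per_temperature = 3 + 3 * month + 3 * hour
    # One f_r block for the current temperature and one per lagged day and hour.
    return calendar + per_temperature * (1 + n_days + n_hours)


n_d3_h4 = count_features(3, 4)
n_d7_h12 = count_features(7, 12)
```

With these settings the function returns 1019 features for 3 lagged days and 4 lagged hours, and 2279 features for 7 lagged days and 12 lagged hours.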

3 Pre-LASSO based feature selection

This section first introduces a benchmark for feature selection named Pre-LASSO. The main idea of Pre-LASSO is to select the features based on point forecasting model and then use the selected features for quantile model training.

A forecasting model is trained to minimize the total loss:

$$\begin{aligned} \hat{\varvec{\beta }} = \arg \mathop {\min }_{\varvec{\beta }}\sum \limits _{t=1}^{N_T}l(r_t) \end{aligned}$$

where \(r_t\) is the fitting residual, \(r_t=y_t-\varvec{\beta }^{\mathrm {T}}\varvec{X}_t\); \(y_t\) is the real load at time t; \(N_T\) is the number of time periods; and \(l(\cdot )\) is the loss function. For traditional point forecasting, the loss function is the squared error (\(l(r_t)=r_t^2\)).

LASSO is an efficient and mature compression estimation method for feature selection and regularization [21]. It adds the \(L_1\)-norm sparse penalty to the original loss function of the regression model:

$$\begin{aligned} \hat{\varvec{\beta }} = \arg \mathop {\min }_{\varvec{\beta }}\sum \limits _{t=1}^{N_T}l(r_t) + \lambda ||\varvec{\beta } ||_1 \end{aligned}$$

where \(||\varvec{\beta } ||_1\) is the \(L_1\)-norm sparse penalty term; and \(\lambda\) is the weight for the sparse penalty and can be determined by cross validation. \(L_1\)-norm penalty can force the optimization process to change some regression coefficients to 0 or make the vector \(\varvec{\beta }\) sparse. The features with 0 value coefficients will be filtered out, thus this can be regarded as feature selection.

Since feature selection has been widely studied for point load forecasting, a very intuitive way is to conduct feature selection using traditional LASSO and then use the selected features for quantile regression. We call this approach Pre-LASSO, because the features are selected before training the probabilistic forecasting model.

The Pre-LASSO method can be divided into two stages. The first stage is to select features using traditional point forecasting based LASSO:

$$\begin{aligned} \hat{\varvec{\beta }} = \arg \mathop {\min }_{\varvec{\beta }\in \mathbb {R}^{N_F}}\sum \limits _{t=1}^{N_T}(y_t - \varvec{\beta }^{\mathrm {T}}\varvec{X}_t)^2 + \lambda ||\varvec{\beta } ||_1 \end{aligned}$$

The problem in (8) is solved using least angle regression (LARS) method [22].
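To make the first stage concrete, the sketch below solves the objective in (8) on a toy problem and reads off the selected features. It uses cyclic coordinate descent rather than LARS, purely because coordinate descent fits in a few lines; the function names, the toy data, and the number of sweeps are assumptions of this example.

```python
def soft_threshold(b, a):
    """Soft-thresholding operator: shrink b towards zero by a."""
    if b > a:
        return b - a
    if b < -a:
        return b + a
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize sum_t (y_t - beta^T x_t)^2 + lam * ||beta||_1 coordinate by coordinate."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta because beta starts at zero
    col_sq = [sum(X[t][j] ** 2 for t in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho_j = sum(X[t][j] * (residual[t] + X[t][j] * beta[j]) for t in range(n))
            new_bj = soft_threshold(rho_j, lam / 2.0) / col_sq[j]
            delta = new_bj - beta[j]
            for t in range(n):
                residual[t] -= X[t][j] * delta
            beta[j] = new_bj
    return beta


def selected_features(beta, tol=1e-10):
    """Indices of the features whose coefficients are not shrunk to zero."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Toy data: the load depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = lasso_coordinate_descent(X, y, lam=1.0)
kept = selected_features(beta)
```

In this toy case the weak feature is removed and only feature 0 is passed on to the second stage.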

The second stage is to conduct quantile regression based on the selected features:

$$\begin{aligned} \hat{\varvec{\beta }}_{2,q} =\arg \mathop {\min }_{\varvec{\beta }_{2,q} \in \mathbb {R}^{k_2}}\sum \limits _{t=1}^{N_T}\rho _q(y_t - \varvec{\beta }_{2,q}^{\mathrm {T}}\varvec{X}_t') \quad q = 1,2,\ldots ,Q \end{aligned}$$

where \(k_2\) denotes the number of features that have been selected in the first stage; \(\varvec{X}_t'\) denotes the feature vector that has been selected in the first stage; \(\varvec{\beta }_{2,q}\) denotes the coefficient vector for the \(q^{\mathrm {th}}\) quantile; \(\rho _q\) denotes the loss function; and Q denotes the number of quantiles to be forecasted. Note that the quantile regression model is trained individually for each quantile. For the \(q^{\mathrm {th}}\) quantile load forecasting, the loss function \(\rho _q\) is the pinball loss:

$$\rho _q(r_{q,t})=\left\{\begin{array}{ll} (q-1)r_{q,t}& r_{q,t}\le 0 \\ qr_{q,t}& r_{q,t}>0\end{array}\right.$$

where \(r_{q,t}\) denotes the \(q^{\mathrm {th}}\) quantile error.
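A minimal sketch of the pinball loss in its standard non-negative form is given below, with the residual taken as the real load minus the forecasted quantile. The function names are illustrative.

```python
def pinball_loss(residual, q):
    """Pinball loss of residual r = y - y_hat at quantile level q in (0, 1)."""
    if residual > 0:
        return q * residual        # under-forecast, weighted by q
    return (q - 1.0) * residual    # over-forecast, weighted by 1 - q


def total_pinball_loss(y, y_hat, q):
    """Sum of the pinball losses over all time periods."""
    return sum(pinball_loss(a - b, q) for a, b in zip(y, y_hat))


loss_under = pinball_loss(2.0, 0.9)   # forecast 2 units too low at q = 0.9
loss_over = pinball_loss(-2.0, 0.9)   # forecast 2 units too high at q = 0.9
```

For a high quantile such as 0.9, under-forecasting is penalized much more heavily than over-forecasting, which is what pushes the fitted value towards the upper part of the load distribution.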

4 Sparse penalized quantile regression (Quantile-LASSO)

The Pre-LASSO method introduced in the last section is straightforward and easily implemented by directly applying traditional LASSO. However, this method has two drawbacks:

1) Pre-LASSO directly selects the input features according to the performance of the point load forecasting instead of the performance of the probabilistic forecasting. Different supervised metrics may lead to different selected features.

2) It is unreasonable to use the same selected features for all quantile regression models. Feature selection should be conducted individually for each quantile.

In this section, we propose a sparse penalized quantile regression method that selects the features by directly modifying the objective function of the quantile regression model. To distinguish our method from Pre-LASSO, we name it Quantile-LASSO.

4.1 Problem formulation

The traditional quantile regression model optimizes the parameter \(\varvec{\beta }_{q}\) to minimize the total pinball loss:

$$\begin{aligned} \hat{\varvec{\beta }}_{q} =\arg \mathop {\min }_{\varvec{\beta }_{q}}\sum \limits _{t=1}^{N_T}\rho _q(r_{q,t}) \end{aligned}$$

where \(r_{q,t}=y_t-\varvec{\beta }_{q}^{\mathrm {T}}\varvec{X}_t\).

The Quantile-LASSO method can be easily formulated by adding an \(L_1\)-norm penalty into the objective function of the quantile regression:

$$\begin{aligned} \hat{\varvec{\beta }}_{q} = \arg \mathop {\min }_{\varvec{\beta }_{q}}\sum \limits _{t=1}^{N_T}\rho _q(r_{q,t}) + \lambda _q ||\varvec{\beta _q} ||_1 \end{aligned}$$

where \(\lambda _q\) is the weight for the sparse penalty of the \(q^{\mathrm {th}}\) quantile. For different quantiles, the best values of \(\lambda _q\) are different. Quantile-LASSO shares a similar feature selection strategy with traditional LASSO. The Quantile-LASSO model in (12) is a special form of (7), obtained by substituting the loss function \(l(r_t)\) with the pinball loss \(\rho _q(r_{q,t})\).

Since the pinball loss and the \(L_1\)-norm are convex, it is easy to prove that the model in (12) is a convex optimization problem. Even though the Quantile-LASSO model can be neatly represented like traditional LASSO, solving the optimization problem is not a trivial task. There are two main challenges:

1) Since the number of training samples and the number of features to be selected are very large, the feature selection process is cast as a big data problem and a large-scale convex optimization problem.

2) Neither the pinball loss nor the \(L_1\)-norm is differentiable everywhere. It is hard to use traditional gradient descent based optimization methods to solve the problem.

4.2 ADMM algorithm

We tackle the above two challenges by using ADMM to decompose each iteration of the large-scale convex optimization problem into two sub-optimization problems. The two sub-optimization problems can be solved using off-the-shelf methods.

ADMM can efficiently solve optimization problems of the form:

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {min}(f(\varvec{r})+g(\varvec{\beta }))\\ \text {s.t.}\;\; \varvec{Ar}+\varvec{B\beta }=\varvec{C} \end{array}\right. } \end{aligned}$$

where \(\varvec{r}\) and \(\varvec{\beta }\) are the decision variables; \(f(\cdot )\) and \(g(\cdot )\) are convex functions; and \(\varvec{A}\), \(\varvec{B}\), and \(\varvec{C}\) are constant matrices and vectors in the linear constraint. The Quantile-LASSO model for each quantile in (12) can be written in the form of (13) by treating the residual vector as an auxiliary decision variable.
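For concreteness, one way to read this correspondence, with \(\varvec{I}\) denoting the \(N_T\times N_T\) identity matrix (a symbol introduced here only for illustration), is:

$$\begin{aligned} f(\varvec{r})=\sum \limits _{t=1}^{N_T}\rho _{q}(r_{q,t}),\quad g(\varvec{\beta })=\lambda _q ||\varvec{\beta } ||_1,\quad \varvec{A}=\varvec{I},\quad \varvec{B}=\varvec{X},\quad \varvec{C}=\varvec{y} \end{aligned}$$

so that the linear constraint reads \(\varvec{r}+\varvec{X}\varvec{\beta }=\varvec{y}\), which is exactly the definition of the residual vector.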

The augmented Lagrangian function of (12) can be written as follows:

$$\begin{aligned} \begin{aligned} L_{\gamma }(\varvec{r},\varvec{\beta },\varvec{u})&=\rho _{q}(\varvec{r}) + \lambda ||\varvec{\beta } ||_1 + \varvec{u}^{\mathrm {T}} (\varvec{y}-\varvec{X}\varvec{\beta }-\varvec{r})\\&\quad + \frac{\gamma }{2} ||\varvec{y}-\varvec{X}\varvec{\beta }-\varvec{r} ||_2^2 \end{aligned} \end{aligned}$$

where \(\varvec{u}\) is the dual variable; \(\gamma\) is a defined positive constant to control the step of each iteration; \(\varvec{y}\) and \(\varvec{X}\) are the vector of \(y_t\) and the matrix of \(\varvec{X}_t\) in all \(N_T\) time periods, respectively; and \(\rho _{q}(\varvec{r})=\sum \limits _{t=1}^{N_T}\rho _{q}(r_{q,t})\).

ADMM combines the decomposability of dual ascent with the superior convergence properties of the method of multipliers [17]. The basic idea of ADMM is to minimize the augmented Lagrangian function alternately over the two original decision variables \(\varvec{\beta }\) and \(\varvec{r}\), and then to update the dual variable \(\varvec{u}\). Thus, each ADMM iteration consists of two sub-optimization problems and one dual update:

$$\begin{aligned} {\left\{ \begin{array}{l} \varvec{\beta }^{k+1} := \arg \min _{\varvec{\beta }} \, L_{\gamma }(\varvec{r}^k,\varvec{\beta },\varvec{u}^{k}) \\ \varvec{r}^{k+1} := \arg \min _{\varvec{r}} \, L_{\gamma }(\varvec{r},\varvec{\beta }^{k+1},\varvec{u}^{k}) \\ \varvec{u}^{k+1} := \varvec{u}^{k} + \gamma (\varvec{y} - \varvec{X}\varvec{\beta }^{k+1} - \varvec{r}^{k+1}) \end{array}\right. } \end{aligned}$$

If we define \(\varvec{s}=\varvec{y}-\varvec{X}\varvec{\beta }-\varvec{r}\), then we have

$$\begin{aligned} \begin{aligned} \varvec{u}^{\mathrm {T}} \varvec{s} + \frac{\gamma }{2} ||\varvec{s} ||_2^2 =\frac{\gamma }{2} ||\varvec{s}+\frac{1}{\gamma }\varvec{u} ||_2^2-\frac{1}{2\gamma } ||\varvec{u} ||_2^2 \end{aligned} \end{aligned}$$

Thus, the update of \(\varvec{\beta }\) can be rewritten as follows:

$$\begin{aligned} \begin{aligned} \varvec{\beta }^{k+1} :=&\arg \min _{\varvec{\beta }}\,\lambda ||\varvec{\beta }||_1+ (\varvec{u}^{k})^{\mathrm {T}} \varvec{s}+ \frac{\gamma }{2} ||\varvec{s} ||_2^2 \\ =&\arg \min _{\varvec{\beta }}\, \lambda ||\varvec{\beta }||_1+\frac{\gamma }{2} ||\varvec{s}+\frac{1}{\gamma }\varvec{u}^k ||_2^2 -\frac{1}{2\gamma } ||\varvec{u}^k ||_2^2\\ =&\arg \min _{\varvec{\beta }} \,\lambda ||\varvec{\beta }||_1+\frac{\gamma }{2} ||\varvec{y}-\varvec{X}\varvec{\beta }-\varvec{r}^k+\frac{1}{\gamma }\varvec{u}^k ||_2^2 \end{aligned} \end{aligned}$$

The sub-optimization problem in (17) has the same form as (8), with \(\varvec{y}-\varvec{r}^k+\varvec{u}^k/\gamma\) playing the role of the target, so it can also be solved using the LARS method [22].

The update of \(\varvec{r}\) can be rewritten as follows:

$$\begin{aligned} \begin{aligned} \varvec{r}^{k+1} :=&\arg \min _{\varvec{r}}\, \rho _{q}(\varvec{r}) +(\varvec{u}^{k})^{\mathrm {T}} \varvec{s}+ \frac{\gamma }{2} ||\varvec{s} ||_2^2 \\ =&\arg \min _{\varvec{r}} \,\rho _{q}(\varvec{r}) +\frac{\gamma }{2} ||\varvec{s}+\frac{1}{\gamma }\varvec{u}^k ||_2^2 -\frac{1}{2\gamma } ||\varvec{u}^k ||_2^2 \\ =&\arg \min _{\varvec{r}} \,\rho _{q}(\varvec{r}) + \frac{\gamma }{2} ||\varvec{y} - \varvec{X}\varvec{\beta }^{k+1} - \varvec{r}+\frac{1}{\gamma } \varvec{u}^k ||_2^2 \\ \end{aligned} \end{aligned}$$

The sub-optimization problem in (18) has a closed-form solution, which can be obtained using subdifferential calculus:

$$\begin{aligned} \begin{aligned} \varvec{r}^{k+1} :=S_{1/(2\gamma )}\left( \varvec{y} - \varvec{X}\varvec{\beta }^{k+1}+\frac{\varvec{u}^k}{\gamma }-\frac{2\varvec{q}-\mathbf{1}}{2\gamma }\right) \\ \end{aligned} \end{aligned}$$

where \(\varvec{q}\) and \({\mathbf{1}}\) are \(N_T\times 1\) vectors whose elements all equal q and 1, respectively; and S denotes the soft thresholding operator, which is applied element-wise and defined as:

$$S_{a}(b)=\left\{\begin{array}{lll}b-a & b>a\\ 0 & -a \leq b \le a\\b+a & b<-a \end{array}\right.$$
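The element-wise \(\varvec{r}\)-update can be checked numerically. The sketch below assumes the pinball loss with slope q for positive residuals and slope q-1 otherwise, writes the minimizer of the scalar sub-problem by case analysis (which is equivalent to soft thresholding a shifted argument), and compares it with a brute-force grid search. All function names, the grid, and the test values are assumptions of this example.

```python
def pinball(r, q):
    """Pinball loss with slope q for r > 0 and slope q - 1 otherwise."""
    return q * r if r > 0 else (q - 1.0) * r


def r_update(c, q, gamma):
    """Minimizer over r of pinball(r, q) + (gamma / 2) * (r - c)**2 for a scalar c."""
    if c > q / gamma:
        return c - q / gamma            # stationary point on the branch r > 0
    if c < -(1.0 - q) / gamma:
        return c + (1.0 - q) / gamma    # stationary point on the branch r < 0
    return 0.0                          # the kink at r = 0 is optimal


def brute_force_r_update(c, q, gamma, half_width=5.0, steps=20001):
    """Grid search over [-half_width, half_width], used only as a cross-check."""
    grid = [-half_width + 2.0 * half_width * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda r: pinball(r, q) + 0.5 * gamma * (r - c) ** 2)


# Largest disagreement between the closed form and the grid search.
worst_gap = max(
    abs(r_update(c, q, gamma) - brute_force_r_update(c, q, gamma))
    for c in (-3.0, -0.2, 0.0, 0.3, 2.5)
    for q in (0.1, 0.5, 0.9)
    for gamma in (0.5, 2.0)
)
```

The disagreement stays below the grid resolution, which suggests that the piecewise rule is the exact minimizer and that no iterative solver is needed for this sub-problem.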

To summarize, the large-scale Quantile-LASSO model is decomposed into two sub-optimization problems, one of which can be solved using the LARS method, while the other has a closed-form solution. In this way, the Quantile-LASSO model can be solved efficiently and the global optimum can be reached.
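The complete iteration can be sketched in plain Python as follows. This is an illustrative implementation under several assumptions rather than the code used in the case studies: the \(\varvec{\beta }\)-update is solved approximately by a few warm-started coordinate descent sweeps instead of LARS, a fixed number of iterations replaces a convergence test, and the function names, default values, and toy data are hypothetical.

```python
def soft_threshold(b, a):
    """Soft-thresholding operator: shrink b towards zero by a."""
    if b > a:
        return b - a
    if b < -a:
        return b + a
    return 0.0


def prox_pinball(c, q, gamma):
    """Minimizer over r of rho_q(r) + (gamma / 2) * (r - c)**2 (slopes q and q - 1)."""
    if c > q / gamma:
        return c - q / gamma
    if c < -(1.0 - q) / gamma:
        return c + (1.0 - q) / gamma
    return 0.0


def quantile_lasso_admm(X, y, q, lam, gamma=1.0, n_iter=500, n_sweeps=5):
    """ADMM sketch for min_beta sum_t rho_q(y_t - beta^T x_t) + lam * ||beta||_1."""
    n, p = len(y), len(X[0])
    beta, r, u = [0.0] * p, [0.0] * n, [0.0] * n
    col_sq = [sum(X[t][j] ** 2 for t in range(n)) for j in range(p)]
    for _ in range(n_iter):
        # beta-update: LASSO problem with target z = y - r + u / gamma.
        z = [y[t] - r[t] + u[t] / gamma for t in range(n)]
        fit = [sum(X[t][j] * beta[j] for j in range(p)) for t in range(n)]
        for _sweep in range(n_sweeps):
            for j in range(p):
                if col_sq[j] == 0.0:
                    continue
                rho_j = sum(X[t][j] * (z[t] - fit[t] + X[t][j] * beta[j]) for t in range(n))
                new_bj = soft_threshold(rho_j, lam / gamma) / col_sq[j]
                delta = new_bj - beta[j]
                if delta != 0.0:
                    for t in range(n):
                        fit[t] += X[t][j] * delta
                    beta[j] = new_bj
        # r-update: element-wise closed-form minimizer.
        r = [prox_pinball(y[t] - fit[t] + u[t] / gamma, q, gamma) for t in range(n)]
        # Dual update on the constraint residual y - X beta - r.
        u = [u[t] + gamma * (y[t] - fit[t] - r[t]) for t in range(n)]
    return beta


# Toy check: with an intercept-only model and no penalty, the 0.5 quantile
# fit should approach the sample median instead of the outlier-driven mean.
X = [[1.0] for _ in range(5)]
y = [1.0, 2.0, 3.0, 4.0, 100.0]
beta_median = quantile_lasso_admm(X, y, q=0.5, lam=0.0, n_iter=2000)
beta_sparse = quantile_lasso_admm(X, y, q=0.5, lam=1e6, n_iter=50)
```

With a very large penalty every coefficient is shrunk to zero, which is the mechanism that removes features; intermediate penalties keep only part of them.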

5 Implementation

This section introduces the process of implementing the proposed feature selection method for probabilistic forecasting. The whole procedure is shown in Fig. 2.

Fig. 2 Procedures for Quantile-LASSO implementation

First, we collect the historical load data and the corresponding temperature data. Data preprocessing, including data cleaning, normalization, and dataset splitting, is then conducted. The load data are cleaned by exploring the relationship between load and temperature and detecting sudden changes. Details are provided in our previous work [2]. All the features are normalized into [0, 1] using min-max scaling. The whole dataset is split into three parts for training, validation, and testing, respectively.

For the Quantile-LASSO method, feature selection is conducted individually for each quantile q, which means that the Quantile-LASSO method is run Q times. Taking the \(q^{\mathrm {th}}\) quantile as an example, we first generate the search path of \(\lambda _q\), which includes a number of candidate values of the adjustment parameter \(\lambda _q\). Then, we conduct a line search along this path. For each candidate value of \(\lambda _q\), the ADMM algorithm proposed in Section 4.2 is employed to train the Quantile-LASSO model in (12). Different values of \(\lambda _q\) produce different coefficients \(\varvec{\beta }_q\) and thus result in different selected features. For each value of \(\lambda _q\), we evaluate the performance of the selected features and the trained model using pinball loss on the validation dataset. In this way, we can find the optimal adjustment parameter \(\lambda _{q,{\mathrm {best}}}\). Finally, we conduct PLF on the testing dataset using the features selected with the best penalty coefficient \(\lambda _{q,{\mathrm {best}}}\) and record the performance for comparison with other benchmarks such as the Pre-LASSO method.
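The line search over the penalty can be sketched as below. The range of the path (0.0001 to 100) follows the case study settings, while the number of candidate values, the function names, and the toy validation loss are assumptions; in practice the callable passed to `select_lambda` would train the Quantile-LASSO model on the training data and return its pinball loss on the validation data.

```python
import math


def lambda_path(min_exp=-4.0, max_exp=2.0, n_points=25):
    """Log-spaced candidate penalties from 10**max_exp down to 10**min_exp."""
    step = (max_exp - min_exp) / (n_points - 1)
    return [10.0 ** (max_exp - i * step) for i in range(n_points)]


def select_lambda(path, validation_loss):
    """Return (best_lambda, best_loss) over the path.

    validation_loss is a callable mapping a penalty value to the pinball loss
    obtained on the validation dataset (hypothetical interface).
    """
    scored = [(validation_loss(lam), lam) for lam in path]
    best_loss, best_lam = min(scored)
    return best_lam, best_loss


def toy_loss(lam):
    """Stand-in validation curve with its minimum at lam = 0.1."""
    return (math.log10(lam) + 1.0) ** 2


path = lambda_path()
best_lam, best_loss = select_lambda(path, toy_loss)
```

Running this search once per quantile gives one \(\lambda _{q,{\mathrm {best}}}\) for each of the Q models.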

The Pre-LASSO method has similar implementation procedures to the Quantile-LASSO method. The search path of \(\lambda\) is also generated and tested one by one using the validation dataset. However, the optimal adjustment parameter \(\lambda _{\mathrm {best}}\) is searched according to RMSE instead of pinball loss on the validation dataset. After determining the optimal adjustment parameter \(\lambda _{\mathrm {best}}\), we retain the features with non-zero coefficients to train the quantile regression model on validation dataset and test the performance on the testing dataset. For the Pre-LASSO method, the quantile regression model for different quantiles uses the same selected features.

After the two methods have been trained and validated, we compare the performances of the two methods on the same testing dataset in terms of pinball loss.

6 Case studies

6.1 Experiment setups

The load and temperature data used in the case studies are from GEFCom 2012, of which the basic information is introduced in Section 2. We choose three-year load and temperature data from 2005 to 2007 as the training dataset, first half-year data of 2008 as the validation dataset, and the second half-year data of 2008 as the test dataset, which means the final performance of the load forecasting is evaluated on the second half-year data of 2008.

We use average quantile score (AQS) to evaluate the performance of the proposed and competing methods. AQS is defined as the average of the pinball loss of all the quantiles:

$$\begin{aligned} S_{AQS}=\frac{1}{Q N_T}\sum _{q=1}^{Q}\sum _{t=1}^{N_T}\rho _{q}({{y}_{t}}-{{{\hat{y}}}_{q,t}}) \end{aligned}$$

where \({{\hat{y}}_{q,t}}\) denotes the forecasted \(q^{\mathrm {th}}\) quantile of the load. A total of 9 quantiles \(0.1, 0.2, \ldots , 0.9\), which are denoted as \(q_1, q_2, \ldots , q_9\), are used to form the probabilistic forecasts in this paper.
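A minimal sketch of the AQS computation is shown below, with the residual taken as the real load minus the forecasted quantile. The dictionary-based input format and the toy numbers are assumptions made for illustration.

```python
def average_quantile_score(y_true, quantile_forecasts):
    """Average pinball loss over all quantile levels and time periods.

    quantile_forecasts maps each quantile level q to a list of forecasts
    aligned with y_true (an input format assumed for this sketch).
    """
    total = 0.0
    for q, forecasts in quantile_forecasts.items():
        for y, y_hat in zip(y_true, forecasts):
            r = y - y_hat
            total += max(q * r, (q - 1.0) * r)  # pinball loss of the residual
    return total / (len(quantile_forecasts) * len(y_true))


y_true = [10.0, 20.0]
forecasts = {0.1: [8.0, 22.0], 0.9: [12.0, 18.0]}
aqs = average_quantile_score(y_true, forecasts)
```

The score is expressed in the same unit as the load, and a lower value indicates better probabilistic forecasts.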

Fig. 3 Changes of AQS on testing dataset with variation of \(\lambda\) for Area 6

Since the proposed feature selection is designed for linear regression models, the two base competing methods are the original linear quantile regression and the linear quantile regression based on Pre-LASSO. The base forecasting model, described in Section 2, considers the recency effects of temperature on loads. There are two variables to be determined: the number of lagged days \(N_D\) and the number of lagged hours \(N_H\). We choose two pairs \((N_D, N_H)\), (3, 4) and (7, 12), in our case studies, which are denoted as the D3-H4 model and the D7-H12 model, respectively. The search path of \(\lambda _q\) in the Quantile-LASSO model and of \(\lambda\) in the Pre-LASSO model for the \(L_1\)-norm penalty is the same for both models. According to (5), the D3-H4 and D7-H12 models have 1019 and 2279 features to be selected, respectively, excluding the intercept. Neural networks and gradient boosting regression trees are powerful regression models commonly used for load forecasting. They were widely used in GEFCom 2012 and 2014 [18, 23]. To show the superiority of the MLR model for load forecasting, two nonlinear quantile regression models, QRNN [24] and QGBRT [25] with default parameters, are also tested for comparison.

6.2 Results

Figures 3 and 4 present the changes of AQS on the testing dataset with the variation of \(\lambda\) for two randomly selected areas (Area 6 and Area 9) using Pre-LASSO. The search path of \(\lambda\) varies from 0.0001 to 100, i.e., \(-{\mathrm {lg}} (\lambda )\) varies from \(-2\) to 4. From Figs. 3 and 4, we can see that as \(\lambda\) decreases, i.e., as \(-{\mathrm {lg}} (\lambda )\) increases, the number of selected features \(N_{oS}\) shows a clear increasing trend in all cases. When \(-{\mathrm {lg}} (\lambda )>3\), \(N_{oS}\) is close to the number of original features.

Fig. 4 Changes of AQS on testing dataset with variation of \(\lambda\) for Area 9

Note that a lower AQS means better model performance. It can be seen from Figs. 3 and 4 that the AQSs of the two models in the two areas follow roughly similar trends: they first decrease and then fluctuate before settling at a stable value. The final stable value of AQS, corresponding to \(-{\mathrm {lg}} (\lambda )=4\), can be viewed as the performance of the model without feature selection. This means that, according to the cross validation, we can find a value of \(\lambda\) that corresponds to the minimum AQS before the AQS becomes stable. The minimum values of AQS are produced by the optimal value of \(-{\mathrm {lg}} (\lambda )\). The optimal values of the adjustment parameter \(-{\mathrm {lg}} (\lambda )\) differ across models and areas.

For D3-H4 model in Area 6 and Area 9, the numbers of selected features with the optimal adjustment parameters of Pre-LASSO are about 550 and 300, respectively, which are only 55% and 30% of the number of original features. We can see only 0.9% and 3.2% improvements in terms of AQS. However, for D7-H12 model, only 20% and 10% of the original features have been selected for Area 6 and Area 9, and gain 4.5% and 4.1% improvements, respectively.

The feature selection based on D3-H4 model in Area 6 fails to bring a significant improvement on the AQS performance; while the feature selection based on D7-H12 model gains a more significant improvement. A possible reason is that D7-H12 model contains more recency effects of temperature and the effective features are selected. The Pre-LASSO method selects different features and has different improvements for different areas and different models, because the loads in different areas have different responsiveness on the current and lagged temperatures. For example, if the load of an area is less affected by the temperature, the elimination of the temperature-related features will less change the performance of the PLF model in this area. However, for the area whose load is very sensitive to temperature, feature selection may show higher gains on the performance. Figure 5 presents the forecasted quantiles and the real load of Area 2 over ten days from 17 July 2007 to 26 July 2007. The forecasting results are obtained using D7-H12 model with Quantile-LASSO.

Fig. 5 Probabilistic forecast results of Area 2 from 2007/07/17 to 2007/07/26

Table 1 Numbers of features of Quantile-LASSO and Pre-LASSO for D7-H12 model
Table 2 Performance of different feature selection methods in terms of pinball loss

Table 1 gives the number of features selected by the Quantile-LASSO method with the optimal adjustment parameters \(\lambda _{q,\text{best}}\) for different areas from \(q_1\) to \(q_9\). The Quantile-LASSO method is implemented based on the D7-H12 model. Interestingly, the number of selected features shows an increasing trend over the first four quantiles (from \(q_1\) to \(q_4\)), whereas there is no clear trend for larger quantiles. The lower quantiles may correspond to the base load, which is less influenced by complex recency effects. Thus, the number of selected features is much smaller.

The number of features selected by Pre-LASSO does not change for different quantiles. Compared with Pre-LASSO, Quantile-LASSO method can dynamically select features by adjusting the sparse penalty (adjustment parameter \(\lambda\)) according to the performance of the model on validation dataset. In this way, Quantile-LASSO can produce better probabilistic forecasting after feature selection compared with Pre-LASSO.

Table 2 presents the AQS of the Quantile-LASSO method, the Pre-LASSO method, the original model without feature selection (MLR), QRNN, and QGBRT, where the inputs of QRNN and QGBRT are the same as those of the MLR model without feature selection. Both Pre-LASSO and Quantile-LASSO with the D3-H4 model and the D7-H12 model have lower AQS than the original model without feature selection. In addition, the results again verify that Quantile-LASSO, which selects features individually for each quantile, performs better than Pre-LASSO and the original method for all ten areas. QRNN has the worst performance because of the large number of input features without selection. For the D3-H4 model, QGBRT, a powerful regression technique, performs better than Quantile-LASSO in three areas. However, for the D7-H12 model, Quantile-LASSO has the best performance in all areas. QGBRT may perform worse with a larger number of input features, which also indicates the importance of feature selection for probabilistic forecasting.

7 Conclusion

This paper provides a Quantile-LASSO method for feature selection in the PLF model by adding an \(L_1\)-norm penalty to the loss function. An ADMM algorithm is proposed to solve the large-scale optimization problem. The method is compared with the Pre-LASSO method and the original model without feature selection. Pre-LASSO is easy to implement using off-the-shelf algorithms but brings limited improvement. Quantile-LASSO has a higher degree of freedom to adaptively select the features for different quantiles. The average relative improvements of Quantile-LASSO in terms of pinball loss are 10.46% and 6.06% compared with traditional quantile regression without feature selection and with Pre-LASSO, respectively. Future work will focus on the parallel implementation of the proposed ADMM algorithm on larger datasets and on applications to probabilistic wind power or PV forecasting.