Deep neural networks for the quantile estimation of regional renewable energy production

Wind and solar energy forecasting have become crucial for the inclusion of renewable energy in electrical power systems. Although most works have focused on point prediction, it is currently becoming important to also estimate the forecast uncertainty. With regard to forecasting methods, deep neural networks have shown good performance in many fields. However, the use of these networks for comparative studies of probabilistic forecasts of renewable energies, especially for regional forecasts, has not yet received much attention. The aim of this article is to study the performance of deep networks for estimating multiple conditional quantiles on regional renewable electricity production and compare them with widely used quantile regression methods such as the linear, support vector quantile regression, gradient boosting quantile regression, natural gradient boosting and quantile regression forest methods. A grid of numerical weather prediction variables covers the region of interest. These variables act as the predictors of the regional model. In addition to quantiles, prediction intervals are also constructed, and the models are evaluated using different metrics. These prediction intervals are further improved through an adapted conformalized quantile regression methodology. Overall, the results show that deep networks are the best performing method for both solar and wind energy regions, producing narrow prediction intervals with good coverage.


Introduction
In the last few years, there has been a large increase in the installed capacity of both wind and solar renewable energy. Wind and solar energy are nondispatchable energy sources, which means that they are not under the control of an operator; instead, these energy sources depend on weather conditions. This dependence makes the integration of wind and solar energy into the electricity grid more difficult than that of dispatchable sources. Given that the amount of energy to be generated cannot be controlled, the only alternative is to forecast it with as much accuracy as possible. Numerical weather prediction (NWP) systems, which are based on mathematical/physical models of the atmosphere, are one of the most accurate ways to predict meteorological variables. However, in electricity generation, the most relevant dependent variable is how much electricity will be generated. One way of determining this is to couple NWP systems with machine learning models. The goal of the latter is to find the relation between NWP variables (the inputs to the model) and the electricity produced (the output, which is the dependent variable). An example of this approach can be found in [1].
However, most works deal with point or deterministic forecasts. It is currently becoming increasingly important to estimate the uncertainty associated with renewable energy forecasts [2]. Such forecasts, known as probabilistic forecasts, are more informative than those obtained from deterministic models. Several works have shown how probabilistic renewable energy forecasts allow for improvements in the management of power systems [3], participation in the electricity market [4][5][6], and the bidding strategy of ancillary services of renewable power plants [7].
Probabilistic forecasts can be represented in different ways. Sets of quantiles are one of the most widely used representations of the predicted probability distribution [8]. A well-known method to estimate quantiles is to minimize the quantile loss using (linear) quantile regression, where linear models are trained for each of the quantiles to be estimated. It is important to remark that these are conditional quantiles (the model outputs the quantile conditioned on the inputs/independent variables). However, in linear quantile regression, the relation between inputs and outputs is assumed to be linear. For nonlinear relationships, other machine learning methods can be used. For instance, support vector machines can be extended to quantile regression by using the quantile loss in their optimization [9]. Random forests can be easily extended so that quantiles are output instead of deterministic predictions [10]. Gradient boosting techniques can be formulated as gradient descent optimization, and therefore, they can also return conditional quantiles by minimizing the quantile loss [11]. A disadvantage of gradient boosting, as well as of linear quantile regression and support vector machines, is that a different model has to be fit for each quantile. The recently introduced natural gradient boosting method, which follows the general gradient boosting framework, can also be used to estimate quantiles; in this case, all quantiles can be provided by a single model [12]. All the nonlinear methods for quantile estimation described above (SVQR, QRF, GBQR and NGB) have been used in recent energy forecasting work [13][14][15][16].
Deep neural networks (DNNs) are nonlinear models that have been very successful in recent years in many research fields, such as computer vision [17][18][19][20], natural language processing [21,22] and renewable energy forecasting [23][24][25][26]. In [27], the most widely used methods in power research, such as convolutional neural networks (CNNs), autoencoders and deep belief networks, were reviewed. However, deep learning has been mainly used for point forecasting. For example, in [28], CNNs were employed for wind power point prediction. In [29], similar research was carried out; here, dense fully-connected neural networks were utilized to forecast wind power for a single wind farm. A hybrid LSTM-CNN method was employed in [30] to make point predictions of solar power, and LSTM models were also studied in [31] for short-term renewable electricity generation for a location. Apart from the most common renewable energy sources (i.e., solar and wind sources), the modeling of hydrogen production has also been considered using DNNs [32] but not from a probabilistic perspective.
Given that the most common training method of neural networks is gradient descent, these networks can also be used to obtain conditional quantiles by minimizing quantile loss. As a result, probabilistic predictions can be obtained. Despite their overall good performance, neural networks have not received much attention for comparative studies of probabilistic forecasting of renewable energies.
For instance, [33] made a comparison of several methods for computing probabilistic forecasts, but no neural networks were used. They started with several point forecast methods, including decision trees, nearest neighbors, gradient boosting, random forests, and lasso/ridge regression, and used some ensemble techniques to obtain quantiles for probabilistic solar energy forecasting. Additionally, in [34], different methods, such as decision trees, random forests and gradient boosting together with bootstrapping, were compared for the construction of probabilistic forecasts, but again, no neural networks were used in the study. In other studies [35,36], in addition to random forests and gradient boosting decision trees, neural networks were used for quantile estimation. However, these neural architectures only contained one or two hidden layers [37,38], and deep network performance was not studied.
In this article, we propose the use of deep neural networks (networks with more than 2 layers) for the quantile estimation of renewable energy (both solar and wind energy). Instead of estimating a single quantile as in other works [39,40], the proposed quantile regression deep neural network (QRDNN) model has been designed to estimate multiple quantiles. In previous works, a different network was trained for each quantile to be estimated, which requires a large computational effort. The QRDNN model outputs all required quantiles using a single model, hence saving computational time. The combination of multiple layers and multiple output quantiles allows for complex nonlinear processing in the initial layers, while the last layer is used to adapt to each of the quantiles.
Pairs of quantiles can be used as the lower and upper limits of prediction intervals (PIs), which are widely used to represent the uncertainty of the dependent variable with a given probability. PIs should be as narrow as possible. However, this property is not directly considered when estimating quantiles. Therefore, PIs obtained from quantiles may be wider than necessary. In this work, QRDNN has been extended using conformalized quantile regression (CQR) [41], which allows the PIs obtained from the quantiles to be calibrated. The CQR is a very recently introduced calibration method that has seldom been used for deep networks [42]. Additionally, CQR has been adapted to the power generation problem addressed in this work by using several time prediction horizons. This is achieved by computing conformity scores that are dependent on the time horizon. This allows the separate adaptation of PIs to the characteristics of each time horizon.
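A minimal numpy sketch of the CQR calibration step may help fix ideas. This is an illustration, not the authors' implementation: the interval bounds, sample sizes and miscoverage level are hypothetical, and the per-horizon adaptation described above would amount to applying the same computation separately to the calibration scores of each time horizon.

```python
import numpy as np

def cqr_margin(lo_cal, hi_cal, y_cal, eps):
    """Conformity margin computed on a held-out calibration set.

    lo_cal / hi_cal: predicted lower and upper quantiles on that set,
    y_cal: observed values, eps: target miscoverage (0.1 -> 90% PI).
    """
    # Conformity score: signed distance of y outside its interval
    # (negative when y falls inside the interval).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    # Finite-sample-corrected empirical quantile of the scores.
    level = min(np.ceil((n + 1) * (1 - eps)) / n, 1.0)
    return np.quantile(scores, level)

# Toy example: raw intervals that are systematically too narrow.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
lo, hi = np.full_like(y, -1.0), np.full_like(y, 1.0)  # ~68% raw coverage
m = cqr_margin(lo, hi, y, eps=0.1)
coverage = np.mean((y >= lo - m) & (y <= hi + m))
```

Widening both bounds by the margin `m` restores the nominal coverage of roughly 1 - eps on data exchangeable with the calibration set.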
The field of application for this work is probabilistic forecasting at the regional level. Most works deal with energy forecasting at the local level (e.g., a single wind farm or photovoltaic plant), but for some applications, electricity production is required to be aggregated at the regional level (e.g., areas, regions, or countries) [43][44][45][46]. In regional forecasting, the geographical dispersion of plants in the region provides a balancing effect that results in lower variability in energy production compared to that of individual plants (solar or wind). On the other hand, regional forecasting has some particular issues, such as maintenance operations or the down-regulation of individual plants, that add noise to the electricity production data. To empirically study regional forecasting, quantile models are obtained for the electricity production in four provinces in Spain at different forecasting horizons. To obtain a greater understanding of deep networks for renewable probabilistic forecasting, the two most important renewable energies, solar (Ciudad Real and Córdoba provinces) and wind (Granada and Lugo provinces) energies, are studied.
In summary, the main contributions of this article are:
• The combination of deep neural networks containing multiple layers and multiple quantiles at the output (QRDNN) is used to estimate a set of quantiles, which in turn allows the estimation of a set of PIs.
• The conformalized quantile regression method is applied to calibrate the multiple PIs obtained from the quantiles at the network output, and it is adapted to multiple time prediction horizons.
• An exhaustive comparative study in the context of regional renewable energy forecasting for both solar and wind energy is conducted. The performance of QRDNN is compared with linear quantile regression (LQR) and state-of-the-art methods, such as support vector quantile regression (SVQR), gradient boosting quantile regression (GBQR), natural gradient boosting (NGB) and quantile regression forests (QRFs). Systematic hyperparameter tuning by grid search is used for all methods. The comparison uses metrics related to quantile estimation as well as metrics related to the goodness of the PIs obtained from the quantiles.
The structure of this article is as follows. In Section 2, the meteorological and production datasets are described. In Section 3, the machine learning methods employed in the article are introduced. In Section 4, the methodology, models, metrics, and evaluation procedure are presented. In Section 5, the obtained results are documented and discussed. Finally, in Section 6, the main conclusions of this work are drawn.

Data
As previously mentioned, NWP variables (independent variables) are used as the inputs to predict the amount of renewable energy (solar or wind) generated in a region. Regarding the independent variables, an observational spatial grid is set across different Spanish regions ("provincias") from which we will be able to obtain these variables. This means that for every point on the observational grid, a complete set of NWP variables will be collected.
Data in the netCDF4 format are provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) in the ERA5 database [47]. Overall, it is possible to obtain two data products: the ensemble mean and the reanalysis data. The former represents the actual meteorological forecasts for each of the variables, provided as the mean of a forecast ensemble (data from the variables forecast by the NWP system at different time horizons and at several locations in the spatial grid). The latter, the reanalysis data, are posterior recalibrations produced with the aim of reducing forecasting errors.
While a spatial resolution of 0.25° × 0.25° is available for the reanalysis data, a resolution of 0.5° × 0.5° is provided for the ensemble mean data. Furthermore, reanalysis data are provided hourly, while the ensemble mean data are obtained every 3 hours beginning at 00:00. However, preliminary tests carried out for this article suggested that the uncertainty of the ensemble mean data allows for better modeling of the energy generation uncertainty. Therefore, the dataset has been constructed with the ensemble mean data, and the NWP variables are extracted from a spatial grid with a 0.5° × 0.5° resolution.
We define four grid extensions to cover the majority of the four regions (Spanish provinces). The grids in the regions of Córdoba and Ciudad Real are employed for solar energy prediction, while the grids in Lugo and Granada are used for wind energy prediction. Figure 1 shows the observational grid for these four regions. The grid in Lugo includes longitudes from -8° to -6.5° and latitudes from 42° to 44°. In Córdoba, the grid includes longitudes from -5.5° to -4° and latitudes from 37° to 39°. In Granada, the grid spans longitudes from -4.5° to -2° and latitudes from 36.5° to 38°. Finally, the grid in Ciudad Real includes longitudes from -5° to -2.5° and latitudes from 38.5° to 39.5°.
Additionally, data for the dependent variable (generated energy) is obtained from the open data portal ESIOS of the Spanish regulator Red Eléctrica Española [48]. Within this portal, users can obtain data related to energy consumption, generation, and exchange, among other indicators. Electricity generation data are provided in hourly intervals. In addition, data can be filtered by the type of production (solar or wind in our design) and by region. Therefore, we have selected the type of energy and the desired temporal set according to our selected regions.
We now explain how the complete dataset is built. The data provided by the ECMWF must be transformed to obtain a 2-dimensional data matrix with observations in the rows and variables in the columns. The NWP variables, which are provided by the ECMWF in netCDF4 format, are contained in a three-dimensional array: each variable is measured at a specific latitude, longitude, and time. The data are arranged so that every time point is an observation and every variable X_i at each latitude j and longitude k is an input. For example, if we have N meteorological variables on a j × k spatial grid, the procedure yields a set of T observations (rows) and N × j × k independent variables (columns).
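One way this rearrangement could be carried out with numpy is sketched below. The dimensions are hypothetical (chosen to mirror the 5 × 4 grid with 8 wind variables used for Lugo), and the array layout is an assumption, not the authors' exact pipeline.

```python
import numpy as np

# Hypothetical sizes: T time steps and N variables on a j x k grid.
T, N, j, k = 100, 8, 5, 4
nwp = np.random.default_rng(0).normal(size=(T, N, j, k))

# Rows become observations (time steps); each (variable, latitude,
# longitude) triple becomes one of the N * j * k predictor columns.
X = nwp.reshape(T, N * j * k)
```

With C-order reshaping, the value of variable n at grid cell (a, b) and time t ends up in column n*j*k + a*k + b of row t.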
Specifically, for our purpose, the variables shown in Table 1 are utilized. These selections are made according to other research in regional point energy prediction [1] that resulted in successful outcomes.
The ECMWF provides 8 daily time horizon forecasts for each variable: 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, 18:00, and 21:00 (8 is therefore the temporal resolution of the ensemble mean data). Therefore, there will be a maximum number of 8 observations per day in our dataset. As previously explained, the independent variables for each observation (i.e., each row in the dataset) are obtained from ECMWF [47]. In addition, the dependent variable (electrical energy produced) is obtained from the ESIOS system [48] by matching the time horizons of each observation with the times from the ESIOS system (e.g., data from the 15:00 time horizon from ECMWF is matched with energy produced during the 15:00-16:00 time period from the ESIOS system). For wind energy, all forecast horizons are used. For solar energy, only those time horizons that correspond to year-round daylight hours (i.e., 09:00, 12:00, and 15:00) are used.

Methods
Given the independent variables x = (x_1, x_2, ..., x_p), the conditional distribution function F(y|x) = P(Y ≤ y | X = x) (1) indicates the probability that the dependent variable Y is less than or equal to a given value y. The α-quantile is then defined as Q_α(x) = inf{y : F(y|x) ≥ α} (2), i.e., the smallest value such that the probability that Y is smaller than or equal to Q_α(x) is at least α.
In the following subsections, the machine learning methods used in this article to estimate the quantiles conditioned on the independent variables are described. In general, these quantile models will be represented by Q̂_α(x). In these methods, a training set with N_ins instances, I_train = {(x_1, y_1), ..., (x_N_ins, y_N_ins)}, is used to fit Q̂_α(x).

Linear quantile regression
The general framework of the linear quantile regression (LQR) model is derived from the linear regression model, which allows us to make predictions and inferences over the quantiles of a dependent variable. The α-quantile of the dependent variable is modeled as a linear combination of the predictors, Q̂_α(x) = β_1 x_1 + β_2 x_2 + ... + β_p x_p (3), where x = (x_1, x_2, ..., x_p) is the set of predictors, β = (β_1, β_2, ..., β_p) is the set of coefficients and p is the number of predictors. Whereas classical linear regression models are built by minimizing the squared residuals of the fitted values, LQR minimizes the quantile loss (or pinball loss) function (5), defined in terms of the residual u = y − Q̂_α(x) (4) as L_α(u) = αu if u ≥ 0 and (α − 1)u if u < 0. This loss is asymmetrical, with a different penalty for residuals above (u ≥ 0) and below (u < 0) the fitted quantile, and it can be shown that its minimization converges to the required α-quantile.
Here, (5) applies to a single (x, y) pair; in general, the loss is defined over a set of N_ins instances T = {(x_i, y_i)}, i = 1, ..., N_ins, as shown in (6): L_α = Σ_i L_α(y_i − Q̂_α(x_i)).
Thus, analogously to linear regression, the fitting process for LQR becomes the minimization problem β̂ = argmin_β Σ_i L_α(y_i − β^T x_i) (7), where N_ins is the size of the training data (i.e., the number of instances or observations). LQR requires one model per quantile to be trained: Q̂_α1, Q̂_α2, ..., Q̂_αNquan, where N_quan is the number of quantiles to be estimated. In this study, the R package quantreg is used to fit the different LQR models and obtain estimates of the conditional quantiles. Information about this package can be found in [49].
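The pinball loss and its quantile-recovering property can be sketched in a few lines of numpy. This is an illustration, not the quantreg implementation: a grid search over constant predictors shows that minimizing (6) recovers the empirical α-quantile.

```python
import numpy as np

def pinball_loss(y, q_hat, alpha):
    """Quantile (pinball) loss of (5)-(6), averaged over the instances."""
    u = y - q_hat                                 # residuals, as in (4)
    return np.mean(np.where(u >= 0, alpha * u, (alpha - 1) * u))

# The constant that minimizes the pinball loss over a sample is (close to)
# the empirical alpha-quantile, as a simple grid search illustrates.
rng = np.random.default_rng(1)
y = rng.exponential(size=5000)
grid = np.linspace(0.0, 8.0, 4001)
best = grid[np.argmin([pinball_loss(y, c, 0.9) for c in grid])]
```

Up to the grid resolution, `best` coincides with `np.quantile(y, 0.9)`, which is the property that makes (5) a valid training objective for every method in this section.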

Support vector quantile regression
Support vector quantile regression (SVQR) is a technique for estimating quantiles and is based on the idea of support vector regression (SVR) [50].
Standard support vector machines can be used for classification and regression; the regression variant is known as SVR. In the simplest approaches, linear models f(x) = ω^T x + b are constructed, as shown in (8).
where ω is a vector of weights, obtained by solving the minimization problem formulated in (9): min_{ω,b} λ‖ω‖² + Σ_i L(y_i − f(x_i)). This optimization finds a balance between the simplicity of the model (the first term of (9)) and the loss of the model on each of the instances (the second term of (9)).
where λ is a regularization hyperparameter that represents the tradeoff between these terms (sometimes C, or Cost, is used instead, where C = 1/λ), and L(u) is the loss function. For classification problems, the loss is usually the hinge loss, while for regression problems, the ε-insensitive L1 loss is commonly used.
Nonlinear models f(x) = ω^T χ(x) + b can also be obtained by using nonlinear mappings χ. These mappings are not applied explicitly; instead, kernels and the kernel trick allow the optimization required to train the SVR model to be solved without actually carrying out the mapping. The most widely used kernel is the radial basis function (Gaussian) kernel, k(x, x') = exp(−γ ‖x − x'‖²) (10). Nonlinear models can then be defined in terms of the kernel as f(x) = Σ_i a_i k(x_i, x) + b (11).
where the a_i are coefficients obtained by the optimization process once the kernel has been included and γ is the kernel bandwidth parameter. The concepts from SVR have been extended to quantile estimation [9], and used in recent work in the energy field [13], by plugging the quantile loss L_α, defined in the previous section in (5), into the SVR optimization of (9). This extension allows the SVR machinery to extend quantile regression to nonlinear models. Given that the loss function L_α is different for different α values, a different model Q̂_α has to be obtained for each α. The liquidSVM library, a recent and fast implementation of SVR that provides methods for SVR-based quantile estimation [51], is used in this study.
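The prediction side of (10)-(11) is simple enough to sketch directly. The coefficients, support vectors and bandwidth below are hypothetical placeholders for what the SVQR optimization (e.g., liquidSVM) would produce; only the formulas themselves are from the text.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel of (10): k(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2)))

def svqr_predict(x, support_vectors, coeffs, b, gamma):
    """Kernel expansion of (11): f(x) = sum_i a_i k(x_i, x) + b.
    The a_i and b are assumed to come from a fitted SVQR model."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for a, sv in zip(coeffs, support_vectors)) + b
```

Note that k(x, x) = 1 for any x, so a support vector contributes its full coefficient a_i when the query point coincides with it and exponentially less as the distance grows.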

Gradient boosting quantile regression
Gradient boosting (GB) is an ensemble machine learning method. GB models have the mathematical form F(x) = Σ_{i=1..M} γ_i h_i(x), shown in (12).
where the h_i(x) are the members of the ensemble (called weak models) and γ_i ≥ 0 are the weights of each model in the ensemble. M is the size of the ensemble (i.e., the total number of weak models). The GB training method is sequential in the sense that a sequence of partial ensembles F_1, F_2, ..., F_m, ... is constructed until the final ensemble F_M is obtained. This is carried out by computing F_{m+1}(x) = F_m(x) + γ_m h_m(x), so that F_{m+1} improves the previous ensemble F_m by adding a new ensemble member h_m. This process is repeated until the ensemble is complete.
Each new model h_m added to the ensemble is trained in a way that ensures that the transition from ensemble F_m to ensemble F_{m+1} follows a gradient descent procedure. This means that adding h_m to F_m moves the ensemble in the direction opposite to the gradient of the loss function, i.e., the direction in which the error decreases the most. This is achieved by training each h_m on a modified dataset in which the inputs are the same as those in the original dataset, but the outputs are the negative gradients of (13): r_i = −∂L(y_i, F_m(x_i))/∂F_m(x_i).
Thus, every model h_m added to the ensemble is trained with the {(x_1, r_1), (x_2, r_2), ..., (x_N_ins, r_N_ins)} dataset. This general formulation of gradient boosting allows the method to optimize any loss function for which partial derivatives can be computed. Typically, loss functions such as the mean square error (MSE) or mean absolute error (MAE) are used, yielding GB ensembles that optimize those losses. However, this mechanism also allows ensembles that optimize the quantile loss to be obtained, a capability offered by standard gradient boosting software packages (for this article, LightGBM [52] is used). Given that different α values lead to different quantile loss functions, a different ensemble has to be trained for every α-quantile.
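For the quantile loss, the pseudo-residuals of (13) take a particularly simple form: α when the residual is positive and α − 1 otherwise. The toy run below is an illustration only, not the LightGBM algorithm; the "weak model" is deliberately trivial (a constant equal to the mean pseudo-residual).

```python
import numpy as np

def quantile_negative_gradient(y, f, alpha):
    """Pseudo-residuals of (13) for the quantile loss: the negative gradient
    of L_alpha with respect to the current ensemble prediction f."""
    u = y - f
    return np.where(u > 0, alpha, alpha - 1.0)

# Toy boosting loop: the "weak model" h_m is the constant equal to the
# mean pseudo-residual, added with weight gamma as in F_{m+1}.
y = np.array([1.0, 2.0, 3.0, 4.0])
f = np.zeros_like(y)            # initial ensemble F_0
gamma = 0.5
for _ in range(200):
    r = quantile_negative_gradient(y, f, alpha=0.5)
    f = f + gamma * r.mean()    # F_{m+1} = F_m + gamma * h_m
# For alpha = 0.5, f settles inside the set of sample medians [2, 3].
```

Real GB implementations fit a regression tree to the pseudo-residuals instead of a constant, but the descent direction is exactly this one.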
In this section, the main ideas of GB as applied to quantile regression have been illustrated. Other technical details have not been discussed, but a complete overview of GB can be found in [11]. Additionally, although in principle the ensemble members h_i can be any kind of model, most implementations use regression trees as base models, which have been shown to be very powerful and efficient. Finally, this study uses the LightGBM implementation, which has its own advantages and technical particularities; although it differs slightly from the foundational ideas discussed in this section, LightGBM is described in detail in [52].
The main hyperparameters of GB are the number of ensemble members M, the maximum depth of the trees in the ensemble, and the shrinkage (or learning rate) ν. If a learning rate different from 1.0 is used, the GB update becomes F_{m+1}(x) = F_m(x) + ν γ_m h_m(x) (14). All these hyperparameters allow the ensemble to be regularized and overfitting to be controlled.
Large M values, large maximum depth, or large learning rates usually lead to overfitting, and their values must be carefully adjusted so that models with good generalization are obtained.

Natural gradient boosting
Natural gradient boosting (NGBoost) is a recent method that uses boosting models to compute probabilistic predictions in regression problems [12,16,53]. The first difference between NGBoost and standard boosting is that the ensemble is used to estimate the parameters of the conditional probability distribution (e.g., the mean μ and the log standard deviation log(σ) of a normal distribution f_(μ,σ)(y|X = x)) rather than the dependent variable Y itself. In other words, the outputs of the boosting ensemble described in (12) are the parameters of the probability distribution of the dependent variable. For instance, if the parameters are μ and log(σ), a model with two ensembles, one per parameter, is obtained (15). Quantiles can then be obtained from these fitted probability distributions. The second difference between NGBoost and standard boosting is that rather than using the standard gradient of (13), NGBoost uses the natural gradient. The reason is that, to obtain gradients in this formulation, distances between probability distributions must be computed, and the distances between the parameters that represent distributions (e.g., (μ, log(σ))) do not represent the differences between their associated probability distributions well. Natural gradients, which rely on divergences such as the Kullback-Leibler divergence or the L2 divergence, are therefore the proper way to measure differences between the actual probability distributions, and they replace the standard gradients in the GB algorithm.
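Using only the Python standard library, the quantile-extraction step can be sketched as follows. The parameter values are hypothetical, standing in for the outputs of the two ensembles in (15); the point is that one fitted distribution yields every quantile via the inverse CDF.

```python
import math
from statistics import NormalDist

# Hypothetical NGBoost-style outputs for a single input x: the two
# ensembles of (15) predict mu and log(sigma) of a normal distribution.
mu, log_sigma = 10.0, 0.5
dist = NormalDist(mu, math.exp(log_sigma))

# All alpha-quantiles follow from the same fitted distribution, so a
# single model suffices, unlike in LQR, SVQR or GBQR.
q05, q50, q95 = (dist.inv_cdf(a) for a in (0.05, 0.5, 0.95))
```

For a normal distribution, the median equals μ and the outer quantiles are symmetric around it; other distribution families supported by NGBoost would simply swap in a different inverse CDF.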

Quantile regression forests
Random forests (RFs) are another ensemble machine learning method. Unlike gradient boosting, the ensemble in RFs is not based on the improvement of a weak learner; instead, it is based on fitting a large number of learners and bagging to make a joint prediction.
One of the particularities of this method is that it relies on randomization to prevent overfitting. From training data {(x 1 , y 1 ), ..., (x N ins , y N ins )} of size N ins , each one of the M base learners {h 1 (x), h 2 (x), . . . , h M (x)} (regression trees for this project) takes a bootstrapped sample with replacement. Furthermore, only a random subset of m features from the p available features are employed to grow the trees. Trees are grown until the minimum sample size required for splitting a node is reached [54]. The number of trees M, the maximum number of selected features m, and the minimum number of observations required to split a node of the tree are important hyperparameters of this method.
Following [10], predictions are made using standard random forests by averaging the individual predictions of the trees in the ensemble {h_1, h_2, ..., h_M}. Each tree h_j makes a prediction by sending a new instance x down the tree until it reaches a leaf. The leaf contains all the training observations {(x_i, y_i)} that reached it during the training process, and the tree's prediction is the average of the dependent variable of those instances; the prediction of the forest is then the average of the tree predictions, ŷ(x) = (1/M) Σ_j h_j(x). This process is used for point prediction, but random forests can also easily be used to estimate quantiles [10]. Given that the leaf reached by a new instance x contains a set of observations {(x_i, y_i)}, the values {y_i} can be used to construct an empirical distribution. These empirical distributions can be averaged across all trees in the ensemble, and quantiles can then be computed from the average distribution.
More formally, let L(x, h_j) be the set of training observations in the leaf reached by x in tree h_j, and let w(x, y, h_j) be the proportion of instances in L(x, h_j) for which y_i = y; if no instance in L(x, h_j) has the value y for the dependent variable, then w(x, y, h_j) = 0. Averaging over the trees gives the weight w(x, y) = (1/M) Σ_j w(x, y, h_j). The final conditional distribution function can be estimated by the empirical distribution of the unique values y_i found in the leaves L(x, h_j), assuming that each value has probability w(x, y_i), as can be seen in (16): F̂(y|x) = Σ_{i=1..uv} w(x, y_i) 1{y_i ≤ y}, where uv is the number of unique values of the dependent variable present in the leaves and {y_1, y_2, ..., y_uv} are those unique values. Unlike LQR and GBQR, QRF allows all desired quantiles (α_1, α_2, ..., α_N_quan) to be extracted from a single model.
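The averaging-and-inversion step of (16) can be sketched for a single new instance. This is an illustration of the idea, not scikit-garden's implementation; the leaf contents are hypothetical.

```python
import numpy as np

def forest_quantiles(leaf_values, alphas):
    """QRF-style quantiles for one new instance x.

    leaf_values holds, for each tree h_j, the training targets found in
    the leaf L(x, h_j) reached by x. The within-leaf empirical CDFs are
    averaged over trees as in (16) and then inverted, so every requested
    alpha-quantile comes from this single fitted forest."""
    leaves = [np.asarray(l, dtype=float) for l in leaf_values]
    ys = np.unique(np.concatenate(leaves))        # unique values y_1..y_uv
    cdf = np.mean([[np.mean(leaf <= y) for y in ys] for leaf in leaves],
                  axis=0)                         # average empirical CDF
    return [ys[np.searchsorted(cdf, a)] for a in alphas]

# Three trees whose reached leaves contain these hypothetical targets:
q50, q90 = forest_quantiles([[1, 2, 2, 3], [2, 3, 4], [1, 1, 2]], [0.5, 0.9])
```

Inverting the averaged CDF at several α values at once is what lets QRF serve all quantiles from one model.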
In this article, the scikit-garden package for Python was used to fit the different QRF models [55].

Quantile regression deep neural networks
Neural networks have been proven to be powerful methods for both classification and regression. In this work, DNNs are used to estimate a set of quantiles through the proposed QRDNN model (see Fig. 2). Like most fully-connected DNNs, QRDNN consists of an input layer, which contains the predictors or inputs x; several hidden layers, each with a defined number of neurons; and an output layer, which, in this work, produces the estimated quantiles.
Each hidden layer computes a matrix multiplication followed by a nonlinear activation function g (e.g., ELU, ReLU or sigmoid). If x is the vector of inputs, L_1 is the weight matrix from the input layer to the first hidden layer, and b_1 is the bias vector of the first hidden layer, then the output of the first layer is given by (17): a_1 = g(L_1 x + b_1).
The remaining hidden layers follow the same structure, except that the activation of the previous layer is used as the input, until the output layer is reached. Thus, the output of the i-th layer (i = 2, 3, ...) is given by (18): a_i = g(L_i a_{i−1} + b_i).
where L_i is the weight matrix from layer i − 1 to layer i and b_i is the bias vector of layer i. The outputs of the neural network (the activation of the output layer) are the estimated quantiles, and the network has one output neuron for each α-quantile to be estimated, as can be seen in Fig. 2.
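The forward pass of (17)-(18) with a multi-quantile output layer can be sketched in numpy. The weights here are random and untrained, and the layer sizes are illustrative (the 160 inputs mirror the Lugo wind grid); this is a sketch of the architecture, not the trained PyTorch model.

```python
import numpy as np

def elu(z, a=1.0):
    """ELU activation: a soft alternative to ReLU with shape parameter a."""
    return np.where(z > 0, z, a * (np.exp(z) - 1.0))

def qrdnn_forward(x, hidden_layers, output_layer):
    """Forward pass of (17)-(18): each hidden layer applies a weight matrix
    L_i, a bias b_i and the nonlinearity g; the output layer is linear,
    with one neuron per estimated alpha-quantile."""
    a_i = x
    for L_i, b_i in hidden_layers:
        a_i = elu(L_i @ a_i + b_i)
    L_out, b_out = output_layer
    return L_out @ a_i + b_out

rng = np.random.default_rng(0)
p, width, n_quan = 160, 32, 9    # illustrative sizes; 9 output quantiles
hidden = [(0.05 * rng.normal(size=(width, p)), np.zeros(width)),
          (0.05 * rng.normal(size=(width, width)), np.zeros(width))]
output = (0.05 * rng.normal(size=(n_quan, width)), np.zeros(n_quan))
q_hat = qrdnn_forward(rng.normal(size=p), hidden, output)
```

The single shared trunk followed by an n_quan-wide linear head is what allows one model to output all quantiles at once.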
Training large neural networks that contain several hidden layers with many neurons in each layer may lead to overfitting. A common approach to prevent overfitting is to use dropout layers. These additional layers randomly hide or ignore some outputs from a hidden layer with a probability p. Thus, the DNN will not employ all weights. Therefore, it is more difficult to overfit the training data, which results in a network with better generalization.
Loss functions usually used for training neural networks are the mean square error (for regression) and cross-entropy (for classification). However, when the neural network is used to estimate quantiles, these functions are not useful. Given that quantile estimation can be formulated as the minimization of quantile loss ((5) and (6)), the approach followed in this work is based on the optimization of these functions.
However, instead of using (5) and (6) in a straightforward way, an equivalent formulation (19) is used. A straightforward implementation of (5) and (6) would require a loop over the instances in which, for every instance, the sign of the residual (4) is checked before computing αu or (α − 1)u. In (19), this explicit loop is removed, which allows for more efficient execution with PyTorch [56] and graphics processing units (GPUs): L_α = Σ_i max(α u_{α,i}, (α − 1) u_{α,i}), where u_{α,i} = y_i − Q̂_α(x_i). Given that 0 < α < 1, the factor (α − 1) is always negative, so max returns α u_{α,i} if the residual u_{α,i} is positive and (α − 1) u_{α,i} otherwise; hence, (19) is equivalent to (5). The sum Σ represents the addition of all elements of the vector of instance-wise losses.
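The equivalence between the loop-free form (19) and the per-instance definition (5)-(6) can be checked directly in numpy. This mirrors, but is not copied from, the paper's PyTorch implementation.

```python
import numpy as np

def quantile_loss_vec(y, q_hat, alpha):
    """Loop-free form of (19): since 0 < alpha < 1, max(alpha*u, (alpha-1)*u)
    equals alpha*u for positive residuals and (alpha-1)*u otherwise."""
    u = y - q_hat
    return np.sum(np.maximum(alpha * u, (alpha - 1.0) * u))

def quantile_loss_loop(y, q_hat, alpha):
    """Direct per-instance implementation of (5)-(6), for comparison."""
    total = 0.0
    for y_i, q_i in zip(y, q_hat):
        u = y_i - q_i
        total += alpha * u if u >= 0 else (alpha - 1.0) * u
    return total
```

Because the vectorized form is a single elementwise max plus a reduction, it maps directly onto GPU kernels, which is the motivation given above.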
Deep networks have several hyperparameters that are important to tune. In this work, these are:
• The number of layers and the number of neurons per layer. If the model is too complex, there is a risk of overfitting; if it is too simple, underfitting might occur.
• The learning rate, which controls the size of the learning step. If it is too large, the optimum can be missed.
• The batch size. Generally, the loss and parameter updates are computed on packets called minibatches, which are smaller than the complete dataset; finding the right minibatch size can be important.
• The activation function. Different nonlinearities may work better for particular problems, so it is important to find the right one. In this article, sigmoid, tanh, ELU and ReLU are tested. The ELU, which has a parameter α that controls its shape, is a (soft) alternative to ReLU.
• The optimizer. Whereas stochastic gradient descent (SGD) is the most widely used optimizer, for some problems, better results may be obtained with more advanced optimizers. In this article, we also test Adam, an optimizer well known for its excellent results [57].
To program the neural networks for this work, the PyTorch framework is used [56].
For QRDNN, better results may be achieved in terms of quantile loss when training only one conditional quantile rather than training a set of quantiles. However, this requires training one deep neural network for every conditional quantile. As the goal of this article is to propose an efficient method where several PIs can be built from the multiple-quantile output, training QRDNN with a set of α quantiles is preferred.
The inputs x of the model are the selected meteorological variables on the grid that cover the regions of interest. For instance, for the Lugo region, a grid of size 5 × 4 is defined (see Fig. 1(a)). Given that Lugo is a wind region, and 8 meteorological variables have been selected for wind, the model will have 5 × 4 × 8 = 160 variables. For Córdoba, which is a solar region, x will contain 5 × 4 × 15 = 300 meteorological variables.
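The mapping from the gridded NWP data to the model's input vector can be sketched as follows (array shapes follow the text; the variable names are illustrative):

```python
import numpy as np

# Hypothetical batch of NWP data for Lugo: a 5 x 4 grid with 8 wind variables.
n_samples, rows, cols, n_vars = 100, 5, 4, 8
nwp = np.zeros((n_samples, rows, cols, n_vars))

# Each sample is flattened into one input vector x of length 5 * 4 * 8 = 160.
X = nwp.reshape(n_samples, -1)
assert X.shape == (100, 160)

# For a solar region such as Córdoba (15 variables), x has 5 * 4 * 15 = 300 inputs.
```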
In this article, conditional PIs are also constructed from the (conditional) quantiles. A conditional PI for inputs x is a pair of lower and upper bounds that contain the dependent variable with a probability called the prediction interval nominal probability (PINP). Alternatively, the probability of not covering the dependent variable can be used. Note that in other works, α is used to represent this probability; however, in this work, α represents the α-quantiles, so this probability will be referred to as ε = 1 − PINP. PIs can be computed by using the quantiles ε/2 and 1 − ε/2 as lower and upper bounds, respectively, which yields a coverage probability of 1 − ε. For instance, a 99% prediction interval can be built as shown in (21).

Evaluation procedure
To train and evaluate the models, three datasets are constructed: the training, validation, and test sets. Two full years of data are used for the training set, a different year is used for the validation set and hyperparameter tuning, and another year is used for the test set. A 4-year period is selected so that the maximum generation remains approximately constant for the whole period; as a result, models trained on some years can be tested on the remaining years without adaptation. The datasets created for each of the four Spanish regions considered in this work are described below: • Lugo (wind energy). All independent variables in the three sets were standardized by computing the standard deviation and mean of the training and validation sets for each region and applying them to the training, validation, and test partitions.
Concerning the dependent variable, some transformations were applied to address normality issues: a decimal logarithm transformation was applied, followed by a standardization using the same procedure as that used for the independent variables. The transformation process of the dependent variable can be seen in Fig. 3. In Fig. 3(a) and (c), the histograms of the dependent test variable are shown for Granada (wind energy) and Ciudad Real (solar energy), respectively. The Ciudad Real dependent variable, whose histogram has an almost bimodal distribution (i.e., small and large amounts of energy generated), undergoes a larger change in shape. In Fig. 3(b) and (d), the histograms of the transformed dependent variable are presented.
This transformation allows us to reduce the skewness of the distribution.
Standardizing the complete dataset may improve the training process, as the dependent and independent variables then have the same range of values and similar shapes.
In addition, some model weights will no longer dominate others.
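The transformation of the dependent variable can be sketched as follows. This is a minimal version assuming strictly positive production values (the handling of zeros, which the decimal logarithm cannot take, is not detailed in the text), with statistics computed on the training and validation data as described above:

```python
import numpy as np

def fit_target_transform(y_trainval):
    """Fit a decimal-log transform followed by standardization,
    with statistics computed on the training and validation data only."""
    z = np.log10(y_trainval)
    return z.mean(), z.std()

def transform(y, mu, sigma):
    """Apply log10 then standardize with the fitted statistics."""
    return (np.log10(y) - mu) / sigma

def inverse_transform(t, mu, sigma):
    """Map model outputs back to the original production scale."""
    return 10.0 ** (t * sigma + mu)
```

The inverse transform is needed to report quantiles and prediction intervals on the original production scale.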
In the training process, 10 quantiles are modeled by each method for every region. These quantiles are given as follows: Q .005 , Q .025 , Q .05 , Q .075 , Q .1 , Q .9 , Q .925 , Q .95 , Q .975 and Q .995 . This enables the possibility of building 5 PIs that have different coverage: PI 80% , PI 85% , PI 90% , PI 95% and PI 99% .
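The pairing between the five target coverages and the ten estimated quantiles follows directly from the ε/2 and 1 − ε/2 rule; a small sketch (function name illustrative):

```python
def pi_bounds(pinp):
    """Quantile levels bounding a PI with nominal coverage `pinp` (PINP)."""
    eps = 1.0 - pinp
    return round(eps / 2, 4), round(1.0 - eps / 2, 4)

# The five PIs map exactly onto the ten trained quantiles:
# e.g. the 99% PI uses Q_.005 and Q_.995; the 80% PI uses Q_.1 and Q_.9.
pairs = {p: pi_bounds(p) for p in (0.80, 0.85, 0.90, 0.95, 0.99)}
```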
To select the best possible combination of hyperparameters, an exhaustive grid search is completed. We explore all possible combinations of hyperparameter values within a predefined space. The sets of values are presented in Table 2.
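An exhaustive grid search simply enumerates the Cartesian product of the candidate values; a sketch with a hypothetical search space (the actual values are those listed in Table 2):

```python
from itertools import product

space = {                       # hypothetical values, not Table 2 itself
    "n_layers": [2, 3, 4],
    "learning_rate": [1e-3, 1e-2],
    "batch_size": [64, 128],
    "activation": ["relu", "elu"],
}

# One configuration per combination; each is trained and scored on the
# validation set, and the smallest mean quantile loss wins.
configs = [dict(zip(space, combo)) for combo in product(*space.values())]
```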
It is important to note the differences between the methods used in the fitting process. While LQR and GBQR can obtain one conditional quantile per model, QRF, NGB, and SVQR can fit the complete conditional distribution function and QRDNN can obtain all ten quantiles at once by means of the set structure.
The evaluation procedure consists of choosing the hyperparameter set with the smallest average quantile loss across the ten selected quantiles (24). Thus, we extract the 10 conditional quantiles from each method and compute the mean quantile loss across them for every hyperparameter configuration. This means that, for instance, GBQR is modeled with the same hyperparameter configuration for all the quantiles, so that a homogeneous selection is obtained.
Thus, the best hyperparameter values for each method, region and type of energy (wind/solar) are given in Table 3.
In some methods, such as LQR, GBQR and QRDNN, predicting multiple close conditional quantiles may introduce the problem of quantile crossing, in which a nominally lower quantile is predicted above a higher one. This may specifically occur when quantiles are very close (e.g., Q 0.975 and Q 0.995). To address this problem when evaluating the models on the test set, a rearrangement of the predicted quantiles can be applied.
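A simple and commonly used post hoc remedy, shown here as an illustration rather than the authors' exact procedure, is to sort each instance's predicted quantiles:

```python
import numpy as np

def fix_crossing(q_pred):
    """Rearrange predicted quantiles in nondecreasing order per instance.

    q_pred: (n_instances, n_quantiles) array whose columns correspond to
    increasing quantile levels; sorting each row removes any crossing.
    """
    return np.sort(q_pred, axis=1)
```

Sorting never increases the quantile loss of the original predictions, which makes it a safe correction before building PIs.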

Metrics
During the development of this work, several metrics were employed to evaluate the different models. First, quantile loss was used to evaluate the models on the test set and to select the best performing model during hyperparameter tuning. Quantile loss was already defined in (4), (5), and (6), but it is reproduced in (22) and (23) below for convenience. In general, we are interested in obtaining models not just for a specific quantile α but for a set of quantiles α = {α_1, . . . , α_q, . . . , α_{N_quan}}. In this case, the average quantile loss across all the different quantiles can be used (24).
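The average quantile loss over a set of quantiles (referred to as (24) in the text) can be written as follows; this is a reconstruction consistent with the notation above, as the original typesetting is not preserved in this copy:

```latex
\bar{L} \;=\; \frac{1}{N_{\mathrm{quan}}} \sum_{q=1}^{N_{\mathrm{quan}}} L_{\alpha_q}
```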
The continuous ranked probability score (CRPS) is a metric that measures the quality of a probability distribution [58]. When the distribution is represented by multiple quantiles, as it is in our case, CRPS is defined by (25).
It can be seen that CRPS is the addition of two components. The first component measures the distance between each of the quantiles and the actual value of the dependent variable. The value of this component will be minimized when the quantiles accurately reflect the data distribution.
The second component, which is independent of the data, measures the distance between the quantiles. The minimization of this component leads to sharper distributions (i.e. quantiles are closer to each other). The lower the CRPS is, the better. In fact, when quantile predictions degenerate to point predictions (i.e. all quantiles become the same value, and a single prediction is produced), CRPS becomes the mean absolute error (MAE).
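A standard sample-based estimator of CRPS for a forecast represented by N equally likely quantiles combines exactly these two components, an accuracy term and a sharpness term; a sketch (the normalization and sign conventions of (25) in the paper may differ from this energy-form estimator):

```python
import numpy as np

def crps_from_quantiles(quantiles, y):
    """Sample-based CRPS for one observation y, with the predictive
    distribution represented by N equally likely quantiles.

    Energy form E|X - y| - 0.5 E|X - X'|: the first term rewards quantiles
    close to the observation; the second depends only on the distances
    between the quantiles themselves.
    """
    q = np.asarray(quantiles, dtype=float)
    accuracy = np.abs(q - y).mean()
    sharpness = 0.5 * np.abs(q[:, None] - q[None, :]).mean()
    return accuracy - sharpness
```

When all quantiles collapse to a single point prediction, the second term vanishes and the score reduces to the absolute error, consistent with the remark above that degenerate quantile predictions turn CRPS into the MAE.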
Other metrics have been used in this work to evaluate PIs. The prediction interval coverage probability (PICP) [59] measures the proportion of instances covered by the interval, and it is given by (26).
where 1_{y_i ∈ PI(x_i)} is an indicator function whose value is 1 when y_i ∈ PI(x_i) for a given x_i and 0 otherwise, and PI(x_i) is the prediction interval associated with instance x_i. The PICP is expected to be larger than the nominal probability, which is known as the prediction interval nominal probability (PINP), but should be as close to it as possible.
Another important metric is the width of the generated intervals. The average interval width (AIW) [59] is shown in (27); it is normalized by the maximum possible width of every set. In (27), Upp(x_i) and Low(x_i) are the upper and lower bounds of the prediction interval for x_i, respectively.
Given that it is trivial to attain high coverage by increasing the interval width, a simple but effective metric is the ratio between the coverage and width [15], as shown in (28). When there are similar PICPs among different models, a larger ratio provides a better understanding of model performance.
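The three interval metrics can be computed together; a sketch assuming, as stated above, that the interval width is normalized by the maximum possible width of the set (names are illustrative):

```python
import numpy as np

def pi_metrics(low, upp, y, width_max):
    """PICP (26), normalized AIW (27) and coverage-width ratio (28).

    low, upp : per-instance lower and upper PI bounds
    y        : observed values
    width_max: maximum possible width used for normalization
    """
    picp = np.mean((y >= low) & (y <= upp))      # fraction of covered instances
    aiw = np.mean((upp - low) / width_max)       # normalized mean width
    return picp, aiw, picp / aiw
```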
The Winkler score (WS) (see (29)) is a widely used metric to evaluate PIs. It is basically the width of the PI with an added penalty for those observations outside the interval bounds [60]. Therefore, the smaller the WS is, the better.
where Upp_ε(x_i) and Low_ε(x_i) represent the upper and lower bounds of the interval for x_i, and ε is defined for the PIs by (1 − ε) = PINP, the desired coverage. W_ε is obtained as the average of the W_{i,ε} over all the instances in the test set.
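The Winkler score can then be sketched as follows, with the 2/ε penalty weight following the definition above (function and argument names are illustrative):

```python
import numpy as np

def winkler_score(low, upp, y, pinp):
    """Mean Winkler score (29) for a PI with target coverage `pinp`.

    The score is the interval width plus a penalty proportional to the
    distance of uncovered observations from the nearest violated bound.
    """
    eps = 1.0 - pinp
    width = upp - low
    below = np.maximum(low - y, 0.0)    # violation below the lower bound
    above = np.maximum(y - upp, 0.0)    # violation above the upper bound
    return np.mean(width + (2.0 / eps) * (below + above))
```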

Conformalized quantile regression for prediction interval estimation
As seen in Section 4.1, the properties of the associated prediction interval, such as the coverage or width, are not directly taken into account when constructing PIs from estimated conditional quantiles. In other words, we rely on a good estimation of the quantiles, but the PI itself is not directly optimized.
To consider these properties in our estimated PIs, we apply conformalized quantile regression (CQR) [41] in our methodology. The CQR framework is based on the posterior adjustment of conditional quantiles by means of a validation set. This has been recently applied to wind power estimation in a time series context with good results [42].
LetQ α (x) be a model obtained from training set I train for estimating two quantiles to construct a PI with a target coverage of 1 − ε = P I NP , as depicted in (30).
Then, conformity scores are computed by evaluating the PIs on the validation set I_val, as shown in (31):

E_i = max{Low(x_i) − y_i, y_i − Upp(x_i)}    (31)

This score represents the distance from the value y_i to the PI when the target value is not covered by the PI and the (negative) distance to the nearest PI bound when the PI includes the target variable. Therefore, this score considers both undercoverage and overcoverage.
Finally, we can build a PI with calibrated quantiles for y_{i+1} from data x_{i+1}, as shown in (32), where the interval bounds are expanded (or shrunk) by the empirical (1 − ε) quantile of the conformity scores computed on I_val. The PIs constructed from calibrated quantiles are expected to better approach the PINP, reducing their width in the case of overcoverage and increasing it in the case of undercoverage. We note that this is a general approach; in our problem, we estimate PIs for different PINPs and time horizons. Therefore, we propose adapting the CQR methodology by computing different conformity scores for each PINP and time horizon considered. The resulting calibrated PI for a specific PINP and time t is shown in (33).
Note that I_val,t is the validation set restricted to observations at time t. Overall, this approach is especially useful for the solar energy regions, where the PIs present larger differences depending on the time horizon.
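The adapted calibration step can be sketched as follows. It follows the standard conformalized quantile regression recipe [41], applied once per (PINP, t) slice with the finite-sample correction that is usual in that framework; the exact implementation details of the paper are not shown, so names and the correction form are assumptions:

```python
import numpy as np

def conformalize_pi(low_val, upp_val, y_val, low_test, upp_test, pinp):
    """Calibrate one PI using validation conformity scores (eq. (31)).

    Intended to be called once per PINP and time horizon t, with the
    validation arrays restricted to observations at that t (I_val,t).
    """
    scores = np.maximum(low_val - y_val, y_val - upp_val)   # conformity scores
    n = len(y_val)
    # finite-sample-adjusted empirical quantile at level 1 - eps = pinp
    level = min(np.ceil((n + 1) * pinp) / n, 1.0)
    q_hat = np.quantile(scores, level)
    # negative q_hat narrows an overcovering PI; positive q_hat widens it
    return low_test - q_hat, upp_test + q_hat
```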

Results
In this section, we discuss the results obtained on the test sets. First, model performance is evaluated in terms of the accuracy of the quantiles. Next, PIs are built from the quantiles and tested according to their coverage, width, and WS. Then, PIs are estimated from the calibrated quantiles as explained in Section 4.4 to show their improvement. Finally, an analysis by season is presented for the PIs generated by the QRDNN.

Quantile estimation
We present the average quantile loss (Fig. 4) obtained by the 10 quantiles and report the results by the time horizon (hours), method, and region. Results have also been averaged across all time horizons, as displayed in the rightmost columns of Fig. 4.
Regarding wind energy forecasting (Fig. 4(a)), the largest loss in all cases is observed when LQR is used. The performance of NGB is usually the second worst, followed by GBQR. Regarding the best performing methods, the best results on the Lugo data are achieved using QRDNN and QRF, whereas on the Granada data, the best results are achieved using QRDNN and SVQR, with QRDNN showing slightly lower loss for the majority of the time horizons and also on average. Furthermore, we note that the methods generally perform better on the Lugo data than on the Granada data.
With respect to solar energy forecasting, as shown in Fig. 4(b), the largest loss for every time horizon is always observed when using LQR. On average, NGB and QRF are the second worst performing methods on the Ciudad Real and Córdoba data, respectively. The most accurate method is again QRDNN, which shows slightly lower loss in all cases except on the Ciudad Real data at 12:00.
Another metric that gives a general understanding of the method performance regarding the accuracy of quantiles is CRPS (Fig. 5). It can be seen that the results are similar to those of quantile loss: the lowest loss is mainly obtained using QRDNN, both for solar and wind energy predictions.
However, there are some changes in the rest of the methods. For example, a low CRPS is obtained using LQR for solar energy prediction (Fig. 5(b)). However, we will later see that this result comes at the cost of low coverage.
Favorable performance is not obtained for this metric when QRF is used: it obtains the worst CRPS values in solar energy prediction, and in wind energy prediction, its results are worse than those obtained using GBQR. Similar CRPS values are achieved using NGB and SVQR in the wind energy regions. In the solar regions, SVQR is slightly better than NGB on the Ciudad Real data, and the opposite behavior is observed on the Córdoba data.

Prediction interval estimation
Methods studied in this work are used to build PIs from their estimated quantiles, as described in Section 4. Therefore, the PI metrics PICP and AIW are presented for the regions of Granada, Lugo, Ciudad Real and Córdoba in Tables 4, 5, 6 and 7, respectively.
Each of these tables contains 5 subtables, one per PINP target value. There is one row per method and one column per time horizon. The table cells show both the PICP value and AIW value (the latter is shown in parentheses). The rightmost column is the average of the PICP and AIW values (the latter is shown in parentheses) across all time horizons. Note that in all the resulting tables, when a PICP equal to or higher than the target PINP is achieved for a given method, the corresponding value is shown in bold.
First, the PICP and AIW results from Granada (wind energy) are presented in Table 4. Generally, the desired coverage is not achieved using LQR, GBQR, and NGB, whereas it is achieved using SVQR only on a few occasions. In contrast, reasonable coverage is obtained using QRF and QRDNN at most times and on average (rightmost column). Coverage is achieved for all PIs at all hours using QRF, except at 09:00. The desired coverage is not obtained in some cases using QRDNN; this mostly occurs in the first half of the day (00:00, 03:00, 06:00, 09:00) and for high PINP values. However, it is important to note that the difference between PICP and PINP is quite small in these cases, while the AIW obtained using QRDNN is the smallest among all the methods for every target PINP and at every hour analyzed. Additionally, in terms of the mean (rightmost column of Table 4), the target PINP (or a value close to it) is obtained only using QRDNN and QRF, with the narrowest intervals (smallest AIW) obtained using QRDNN.
In Table 5, the results on the Lugo data (wind energy) are shown. A similar behavior, in which the desired coverage is not reached, is observed for LQR and GBQR; except for the PINP at 99%, coverage is also not obtained using SVQR. In this region, the performance of NGB improves: the desired coverage (on average) is achieved for the PINP at 80%, 85%, and 90% with a relatively low interval width, whereas using QRF, it is achieved for all PINP values and times. The PINP is met using QRDNN in all cases for the 80% and 85% PIs and in most cases for the 90% and 95% PIs. The only PINP for which coverage is not reached on average (rightmost column) by QRDNN is 99%, but even in this case, the result is very close (PICP = 98.7%). On average, the QRDNN intervals are still generally narrower.
We continue with the solar energy regions, starting with the Ciudad Real data (Table 6). As can be seen, the desired coverage cannot be achieved for any target PINP using LQR and GBQR, although GBQR comes close for the 99% PINP (98% on average). Similar to the wind energy results, QRF is the best performing method regarding PICP coverage: coverage is achieved using QRF at every hour for the PINP values of 80%, 85%, 90%, and 95%. For the 99% PINP target at 15:00, QRF is close to but does not meet the desired coverage (98.90%); however, on average (rightmost column), coverage is met using this method. Using NGB, the target coverage is achieved on average for the PINP at 80%, 85%, and 90%, whereas using SVQR, it is achieved for the PINP at 99%. Lastly, using QRDNN, coverage is met at all hours and, on average, for the 80%, 85% and 90% targets. Furthermore, although the coverage is not satisfactory for the 95% and 99% PIs using QRDNN, it is not far off on average (94.9% and 98%, respectively), and the width is generally lower than that of the rest of the nonlinear methods.
Finally, the results for Córdoba (solar energy) are presented in Table 7. Once again, accurate coverage is not achieved using LQR. However, Córdoba is the first region where reasonable coverage for some hours is achieved using GBQR: coverage is achieved at 09:00 for the 80%, 85% and 90% PIs, but not for the rest of the hours. For the 95% and 99% PIs, the PINP is achieved using GBQR for the first 2 and 3 time horizons, respectively, and coverage is also achieved on average for these PIs. As expected, the target coverage is achieved using QRF at most hours for every PI, and it is always achieved on average. The behavior of QRDNN is similar to that in the rest of the regions: coverage is achieved at every hour for the 80% and 85% PIs. For the 90% and 95% PIs, coverage is only missed at 09:00 (and the 90% case is close). For the 99% PI, the desired coverage is only reached at 12:00, although it stays quite close to the PINP in the remaining cases. Good performance is achieved using NGB regarding the coverage target (it is achieved for the mean PINP values at 80%, 85% and 90%) while the PIs are kept narrow. In addition, almost all the PINP values are met on average (except 80%) using SVQR, with PIs of similar width to those obtained using QRDNN. This is the only region where other nonlinear methods (SVQR and NGB) compete with QRDNN regarding PI width (and coverage). However, we will see in the following section how the results can be improved by calibrating the PIs.
In summary, according to the previous analyses, two methods stand out overall: QRF and QRDNN. Using QRF, the target coverage is always achieved, and using QRDNN, the coverage is either achieved or close to being achieved. Comparing the two methods (considering only the average results across all time horizons), the desired PINP is achieved using QRDNN in most cases, and even when coverage is not attained, the difference is smaller than 1.02% in the worst case and much smaller in general. Moreover, the intervals obtained using QRDNN are 8% to 29% narrower than those obtained using QRF (8% to 24% if only intervals where PICP ≥ PINP are considered).
To complete the PI coverage and width analysis, an example of the 95% PINP for July 2018 using the Lugo data is provided in Fig. 6. The red area represents the interval generated by QRDNN, and the blue area represents the interval generated by QRF.
The real wind power production data is represented by black points.
It is easily noted that the QRF intervals are wider for the majority of the points in the set.
The PICP is a crisp metric in the sense that, for an individual instance, its value is either 0 or 1: if an instance lies outside the PI but very close to a PI bound, its PICP contribution is still 0. A smoother assessment of the obtained PIs is provided by the WS (29). Its value is basically the interval width plus a penalization that is linear in the distance between the instance and the PI bounds (the penalization is zero if the instance is within the PI). Thus, if the instance is outside the PI but not too far from the lower or upper bound, the penalization is low; however, it grows quickly with distance, as it is weighted by 2/(1 − PINP). In Table 9, the mean Winkler score for each region, method and PI is shown, with the best WS value in bold. In the first wind region (Granada), the lowest WS for all coverages is achieved using QRDNN, except for WS99, where the best result is achieved using QRF. For the Lugo data, the best score for every coverage is obtained using QRDNN, except for WS90, where the performance of QRF is slightly better. These results coincide with the cases in which the target PINP is achieved or almost met by a given method. In general, the worst values are obtained using LQR and GBQR. In addition, in the solar energy regions (Ciudad Real and Córdoba), we find that QRDNN is the best performing method for every coverage and region, except for the PINP at 99% on the Ciudad Real data, where SVQR performs slightly better.
In summary, QRDNN is the best performer for the WS metric, except for a few cases.
We conclude this section of the analysis by commenting on the final metric: the (mean) coverage-width ratio (Table 10). First, we consider the fact that narrow intervals can be achieved at the cost of large deviations from the required coverage for some methods. Methods whose coverage deviates from the target PINP by more than one unit are represented using a smaller font, and they are not taken into account when computing the best ratio. Thus, for example, for the PINP value of 99%, we only take into account methods with a PICP equal to or greater than 98% when computing the best ratio (shown in bold).
It can be seen that QRDNN is clearly the best performing method for both wind energy regions (Granada and Lugo) and for one solar energy region (Ciudad Real). This DNN-based method is surpassed only on the Córdoba data by NGB. As we will see in the next section, this may be caused by poor quantile calibration, where the constructed PIs may be wider than necessary.

Prediction interval estimation with calibrated quantiles
Given the results regarding quantile and PI estimation and taking coverage, width and ratio into account, QRF and QRDNN are considered to be the two best performing methods in the regions of Granada, Lugo, and Ciudad Real, where both methods have achieved robust performance. In the region of Córdoba, however, NGB is considered jointly with QRDNN due to its good results in relation to the coverage-width ratio (Table 10).
In this section, we show how improvements in the PI quality can be made by following the conformalized regression methodology presented in Section 4.4. For this purpose, we report the coverage and width of the above mentioned methods in their conformalized forms.
First, Table 11 shows the PICP and AIW results on the Granada data using the conformalized forms of QRF (CQRF) and QRDNN (CQRDNN). Generally, we can see that conformalizing reduces the coverage. This may be due to the fact that a larger coverage than required is obtained using these methods, and calibration with the validation set reduces it, which also results in narrower PIs. Nevertheless, although in some cases the calibrated PIs do not achieve the target PINP, there is never a large deviation, with the advantage that AIW is reduced. DNN-based methods (QRDNN and CQRDNN) remain the methods with the best performance due to their narrow PIs.
In Table 12, the PICP and AIW results on the Lugo data using CQRF and CQRDNN are shown. Similarly to the case of the Granada data, the coverage tends to be reduced when an excess of coverage with respect to the target PINP is found in the validation set for the conformalized methods. As a result, the PICP values obtained by CQRF and CQRDNN are closer to the PINP and, in some cases, below it. However, the improvement in AIW makes conformalizing worthwhile, especially for the PINP at 80%, 85%, and 90%, where the improvement in AIW is exceptional (see the mean column in Table 12). For the 99% PINP, the AIW is slightly larger, but the coverage is also increased, which is what is required in this case.
Table 13 shows the PI estimation performance of the conformalized methods, CQRF and CQRDNN, on the Ciudad Real data (solar). For CQRF, there is only a slight reduction in coverage with respect to the calibrated quantile results, but calibration still yields a significant improvement in PI width, especially for QRDNN and the 80%, 85%, and 90% target PINPs. There is some AIW increase for the 95% and 99% PINPs, but in the 95% case this is actually required to increase the coverage. Although both methods benefit from calibration, conformalized QRDNN is still better than CQRF in terms of AIW.
The results for the Córdoba data are displayed in Table 14. As previously mentioned, NGB was chosen for comparison with QRDNN in this region. It is interesting to note that calibration does not improve the PIs for NGB (CNGB), as excessive coverage and a larger AIW are obtained. On the other hand, the PIs are greatly improved using CQRDNN: the coverage moves closer to the PINP target, satisfying it, and the AIW decreases.
Furthermore, the use of calibrated quantiles makes CQRDNN the best performing method for this region, with results superior to those originally obtained using NGB.
In summary, in most cases, calibrated quantiles (conformalized quantile regression) result in a PICP that is closer to the target PINP and a decreased AIW. CQRDNN benefits particularly from calibration, as also shown in Table 15, where the coverage-width ratio improves when using CQRDNN instead of QRDNN. Improvements occur for most PINP values and regions, especially for Córdoba. We note that it is more difficult to improve the ratio for the PINP at 99%, which is understandable due to the high coverage requirement. In general, we can confirm that employing calibrated quantiles improves the PI quality, and the overall results show the good performance of the deep NN-based methods.

Analysis by season
Generally, better overall performance in relation to prediction interval coverage, width, and quality across the time horizons analyzed is achieved using CQRDNN. To complete this section of the results, the PIs obtained using this method are studied from a seasonal perspective: predictions made on the test set are disaggregated into the four seasons of the year to check for possible variability during the year. We present the PICP and AIW results for every season, region, and PINP in Table 16.
In summary, the results based on the season follow the general trends displayed in the first part of this section, although specific behaviors are observed for some seasons. Generally, good coverage is achieved using CQRDNN with a few exceptions (e.g., the Granada data in the summer and winter, the Lugo data in the spring and summer (wind), and the Ciudad Real data in spring (solar)). However, the results remain close to the PINP. Relatively narrow intervals are still obtained using CQRDNN. For both solar regions, the narrowest intervals are obtained during the summer as this is the most stable season regarding solar radiation. For the Granada data (wind energy), some PINP values in the summer and winter were low for CQRDNN. Regarding the width of the intervals, it can be seen that intervals for summer are wider, while in autumn, winter, and spring the AIW values are similar for equal values of the PINP. The analysis by season for the second wind energy region (Lugo) shows similar patterns, with the coverage being achieved, except for some PINP values in the spring and summer. However, the coverage remains close, especially for the summer results. Regarding AIW, the highest values are observed during the summer, while the narrowest intervals are obtained during winter and autumn.
For the solar energy regions, the results on the Ciudad Real data indicate that high coverage is achieved using CQRDNN during summer and for most PINP values during autumn, whereas coverage is lower during winter and spring. Regarding the seasons, the AIW during winter and autumn is high compared with spring and summer. Finally, for the Córdoba data, the required coverage is generally achieved by CQRDNN in each season. In this region, we cannot find significant differences in the AIW across the seasons.

Conclusions
In this work, deep neural networks (QRDNNs) were utilized to estimate multiple quantiles in the context of regional renewable energy production forecasting in Spain. These networks were compared with methods that have recently been used in the energy forecasting field, such as linear quantile regression (LQR), support vector quantile regression (SVQR), gradient boosting quantile regression (GBQR), natural gradient boosting (NGB) and quantile regression forests (QRF). For this purpose, the NWP variables were extracted from a spatial grid defined to encompass each region. Four regions were selected as representatives of each type of renewable energy: Granada and Lugo for wind energy, and Ciudad Real and Córdoba for solar energy. The models were used to predict 10 conditional quantiles. The methodology involved systematic hyperparameter tuning by a grid search, where the best performing models were selected according to the mean quantile loss. In addition, from the 10 estimated quantiles, 5 PIs were constructed for different nominal probability coverages (80%, 85%, 90%, 95% and 99%).
Both quantiles and intervals were evaluated by the appropriate metrics (quantile loss, CRPS, interval coverage and width (PICP and AIW, respectively), coverage-width ratio, and WS). With respect to quantile estimation, the best performing method for the quantile loss metric for all regions, on average (across all time horizons), is QRDNN. This method is followed by QRF (wind) and by GBQR and SVQR (solar). Regarding the CRPS, the lowest values are obtained using QRDNN, followed in this case by GBQR (wind) and by NGB and SVQR (solar). In summary, QRDNN shows consistently good performance across both metrics and energy types/regions, whereas the performance of the other methods may depend on the metric and/or region.
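The mean quantile (pinball) loss used for model selection can be sketched as follows; this is a minimal NumPy illustration, not the authors' implementation:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for a single quantile level tau:
    underestimates are weighted by tau, overestimates by (1 - tau)."""
    err = y_true - y_pred
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

def mean_quantile_loss(y_true, q_preds):
    """Average pinball loss over a dict {tau: predictions}, as would be
    used to rank candidate hyperparameter configurations."""
    return np.mean([pinball_loss(y_true, p, t) for t, p in q_preds.items()])
```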
Regarding PIs obtained from the quantiles and the coverage (PICP) and width (AIW) metrics, QRF and QRDNN are the two most consistent methods. The desired coverage (PINP) is always obtained (on average across all time horizons) using QRF in both solar and wind energy regions, whereas the PINP is obtained for most cases on average using QRDNN, and in any case, it never deviates from the desired value by more than 1%. An important advantage of QRDNN is that the intervals it generates are 8% to 29% narrower than the ones generated by QRF.
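The two interval metrics compared here are computed directly from the interval bounds; a minimal sketch:

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction interval coverage probability: the fraction of
    observations that fall inside [lower, upper]."""
    return np.mean((y >= lower) & (y <= upper))

def aiw(lower, upper):
    """Average interval width."""
    return np.mean(upper - lower)
```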
Another metric that reflects the quality of the QRDNN intervals is the WS. In the solar energy regions, QRDNN is always the method with the lowest score, except for the 99% PINP on the Ciudad Real data. For wind energy, this result holds in the majority of cases. Finally, concerning the PICP-to-AIW ratio, QRDNN is always the best performing method, except on the Córdoba data (once the methods that are far from the desired coverage are excluded).
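The WS (Winkler score) rewards narrow intervals but penalizes observations that fall outside them. A minimal sketch of the standard formulation for a central (1 - alpha) interval (an illustration, not the authors' code):

```python
import numpy as np

def winkler_score(y, lower, upper, alpha):
    """Winkler (interval) score for a central (1 - alpha) interval:
    the interval width plus a penalty, scaled by 2/alpha, proportional
    to how far an observation falls outside the interval. Lower is better."""
    width = upper - lower
    below = (lower - y) * (y < lower)  # penalty when y is under the bound
    above = (y - upper) * (y > upper)  # penalty when y is over the bound
    return np.mean(width + (2.0 / alpha) * (below + above))
```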
In this work, conformalized quantile regression has also been applied to further improve the quality of the PIs. This approach is based on calibrating the estimated conditional quantiles using a validation set. The general methodology has been extended by taking into account the time horizon of the prediction, leading to an improvement in interval width. Overall, the conformalized version of QRDNN (CQRDNN) tends to perform better.
In summary, QRDNNs, and especially their conformalized form, achieve consistently good performance across the different metrics for both regional wind and solar electricity production, and are remarkable with respect to the narrowness of the generated PIs while offering good coverage.