1 Introduction

Considerable efforts have been made in the past decades to make solar energy a real alternative to conventional energy generation systems. There are two main technologies, solar thermal electricity (STE) and solar photovoltaic (PV) energy, and many countries have already reached a notable solar share in their energy mixes. Moreover, substantial growth is expected in the near future (International Energy Agency, 2018).

Contrary to conventional generation, solar electricity generation is conditioned by the weather and is therefore highly intermittent. Transient clouds and aerosol intermittency lead to considerable variability in solar power plant yield over a wide range of temporal scales, particularly from minutes to hours. This poses serious issues for solar power plant management and for the integration of their yield into the electricity grid [1]. Currently, in addition to expensive storage-based solutions, the use of solar radiation forecasts is the only plausible way to mitigate this intermittency. Therefore, the development of accurate solar radiation forecasting methods has become an essential research topic [2].

Solar forecasting methods can be classified by forecasting horizon: nowcasting methods address forecasts up to one hour ahead, short-term forecasting covers horizons of up to 6 hours, and longer-range methods aim at producing forecasts days ahead. The techniques associated with these horizons are essentially different [3,4,5]. In recent years there has been increasing interest, particularly, in short-term forecasting, fostered by the expected massive deployment of solar PV energy. Accurate short-term solar forecasts are important to ensure the quality of the PV power delivered to the electricity network and, thus, to reduce ancillary costs [6, 7]. Short-term forecasting has also been successfully used for the management of STE plants [8, 9] and for the participation of PV and STE plants in the energy market [8, 10].

Short-term forecasts can be derived either from satellite imagery [11, 12] or from Numerical Weather Prediction (NWP) models [13,14,15]. As measured solar radiation datasets have become progressively available, data-driven methods have become increasingly popular [16]. The performance of the different approaches is compared in [15, 17, 18].

The use of NWP models for short-term solar forecasting has some important advantages, such as the global and easy availability of the forecasts. Because of this, the approach was extensively evaluated during the past decade [14, 15, 19]. Nevertheless, its reliability is far from optimal, and machine-learning methods play an important role in providing enhanced solar forecasts derived from NWP models [20, 21]. In this context, the inputs for the machine learning techniques are forecasts of several meteorological variables provided by NWP physical models such as those of the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Global Ensemble Forecast System (GEFS). Meteorological variables are forecast at the points of a grid over the area of interest. However, the number of features resulting from these grids is usually large, especially if several vertical levels are included. This may result in models that do not generalize well, and techniques to reduce the dimensionality of the data are required.

Dimensionality reduction techniques can be divided into feature selection and feature extraction. Feature selection methods select the most relevant variables in the grid, while feature extraction summarizes the information of the whole grid into fewer features. Both approaches have been used in the context of renewable energy forecasting with machine learning [22, 23]. Feature selection techniques were used in [24], where methods such as Linear Correlation, ReliefF, and Local Information Analysis were explored to study how the number of NWP grid nodes used as input affects the accuracy of the solar forecasting model.

In [25], feature extraction (PCA) is compared with feature selection (a minimal-redundancy, maximal-relevance method) to reduce the dimensionality of variables in a grid for wind power forecasting in the east of China. The authors conclude that PCA is a good choice to simplify the feature set while obtaining competitive results. PCA has also been used in [26], together with domain knowledge, to extract features from an NWP grid to improve renewable energy forecasting. Advanced machine learning methods, such as convolutional neural networks, have also been used as a feature extraction scheme for wind power prediction using NWPs, showing competitive results compared to a PCA baseline [27]. García-Hinde et al. [28] present a study on feature selection and extraction methods for solar radiation forecasting. The study includes classical methods, such as PCA or variance and correlation filters, and novel methods based on the adaptation of support vector machines and deep Boltzmann machines to the task of feature selection. Results show that one of the novel methods (the support vector machine adaptation) and PCA select highly relevant features. Verbois et al. [29] combine feature extraction (PCA) and stepwise feature selection of NWP variables for solar irradiance forecasting, comparing favorably with other benchmark methods. In [30], a hybrid approach that combines PCA and deep learning is presented to forecast wind power on horizons from hours to years, showing good performance. A recent study on solar irradiance forecasting has compared many methods on different datasets, where PCA has been used as the main method for feature extraction and dimensionality reduction [31]. In general, it is observed that PCA, even in recent works, is one of the most widely-used methods to extract features in renewable energy forecasting.

PCA is a multivariate statistical technique that transforms a number of correlated variables into a smaller set of uncorrelated variables called principal components [32]. PCA has two main limitations. First, it performs a global linear analysis through an axis transformation that best represents the mean and variance of the given data, but it lacks the ability to represent local information. Second, PCA is an unsupervised method, that is, the target output is not used to extract the new features, which may be a drawback for finding the best low-dimensional representation whenever labels are available.

In this article we propose Supervised Local Maximum Variance Preserving (SLMVP), a kernel method for supervised feature extraction and dimensionality reduction. The method combines both characteristics: it preserves the maximum local variance and distribution of the data, while also taking into account the distribution of the data with respect to the response variable in order to find an embedding that best represents the given data structure. The method can be applied to multiclass and regression problems when the sample size m is small and the dimensionality p is relatively large or very large, as opposed to Fisher's Linear Discriminant Analysis (LDA) [33], one of the foundational and most important approaches to classification. In summary, SLMVP uses the fully or partially labeled dataset to extract new features that maximize the variance of the embedding that best represents the common local distances [34], and computationally it is based on weighted graphs [35]. Additionally, the method can perform either a linear or a non-linear transformation of the original space by using different kernels as the similarity metric.

To validate the SLMVP method, it has been tested for feature extraction in order to improve solar radiation forecasting (both Global Horizontal Irradiance (GHI) and Direct Normal Irradiance (DNI)) for a 3-hour forecasting horizon, and compared to PCA (the most popular workhorse in the area), but also to other state-of-the-art methods that have not been previously used in the context of solar radiation forecasting. These methods are: (1) Locality Preserving Projection (LPP), an unsupervised local dimensionality reduction method that finds linear projective maps by solving a variational problem that optimally preserves the neighborhood structure of the dataset [36]; (2) Linear Optimal Low-Rank projection (LOL), a supervised dimensionality reduction method that learns a lower-dimensional representation in a high-dimensional, low-sample-size setting, extending PCA by incorporating class-conditional moment estimates into the low-dimensional projection [37]; and (3) Supervised Non-negative Matrix Factorization (SNMF), which extends Non-negative Matrix Factorization (NMF) to the supervised setting [38, 39]. SNMF integrates the logistic regression loss function into the NMF framework and solves it with an alternating optimization procedure. All of these methods are able to solve the “large p, small m” problem, as opposed to many classical statistical approaches that were designed with a “small p, large m” situation in mind (e.g. LDA). Features have been extracted from meteorological forecasts (obtained from the GEFS) at the points of a grid around two locations in the Iberian Peninsula: Seville and Lisbon. Two grid sizes have been tested, small and large. The performance of SLMVP has been compared with PCA, LPP, LOL, and SNMF using two different regressors, a linear one (standard Linear Regression (LR)) and a non-linear technique (Gradient Boosting). Thus, the main contributions of this work are:

  • A new local and supervised dimensionality reduction method capable of solving the “large p, small m” problem.

  • The application of SLMVP to reduce the dimensionality of the NWP variables in a grid for the solar radiation forecasting problem.

  • The comparison of SLMVP with PCA (one of the most widely-used feature extraction methods in the context of renewable energy), with LPP, and with two recent state-of-the-art supervised methods, LOL and SNMF, showing the usefulness of the proposed method.

The structure of the article is as follows. Section 2 explains the SLMVP method, which is tested using the data described in Section 3 and the experimental design included in Section 4. The Conclusions section summarizes the main results.

2 Supervised dimensionality reduction method: Kernel-SLMVP

As mentioned in Section 1, PCA is an unsupervised method that performs a global analysis of the whole dataset. In contrast to global data projection techniques like PCA, methods based on local structure preservation, e.g. ISOMAP [40], LPP [36], Laplacian Eigenmaps [41], and Locally Linear Embedding [42], have been proposed to overcome this global character. Although these techniques rely on linear optimization, they can also represent non-linear geometric features through locally linear modeling of data lying on a low-dimensional manifold [43]. Note that these non-linear methods still do not consider labeled data, that is, they are unsupervised. Recently, Linear Optimal Low-Rank (LOL) projection has been proposed, incorporating class-conditional means. The key intuition behind LOL is that it can jointly use the means and variances from each class (like LDA), but without requiring more dimensions than samples [37]. Another recent method is Supervised Non-negative Matrix Factorization (SNMF), which extends Non-negative Matrix Factorization (NMF) to the supervised setting [38]. SNMF integrates the logistic regression loss function into the NMF framework and solves it with an alternating optimization procedure. For both methods, regression can be carried out by projecting the data onto a lower-dimensional subspace and then applying linear or non-linear regression techniques. This mitigates the curse of dimensionality.

The Supervised Local Maximum Variance Preserving (SLMVP) dimensionality reduction method addresses both the difficulty of LPP with “large p, small m” problems and the global character of LOL and SNMF, while remaining supervised. SLMVP preserves the maximum local variance of the data, being able to represent non-linear properties, but also uses the output information (in a supervised way) to preserve the local patterns between inputs and outputs. In summary, it uses the fully or partially labeled dataset to extract new features that best represent the local maximum joint variance.

SLMVP is based on a graph representation for a given set of inputs \(\textbf{x}_{1},\textbf{x}_{2},\ldots,\textbf{x}_{m} \in \Re^{p}\) and a set of outputs \(\textbf{y}_{1},\textbf{y}_{2},\ldots,\textbf{y}_{m} \in \Re^{l}\), with m being the number of sample data points, and p and l the number of input and output features; in our case l = 1, and p = 342 for the small grid and p = 12274 for the large grid. The application of a similarity function \({\mathscr{S}}\) on the inputs, \({\mathscr{S}}_{x}(\textbf{X}) : \textbf{X} \in \Re^{m \times p}\), and on the outputs, \({\mathscr{S}}_{y}(\textbf{Y}) : \textbf{Y} \in \Re^{m \times l}\), defines an input weighted graph {H,U} and an output weighted graph {I,V}, with H and I being the nodes, and U and V the edge weights, respectively. The graphs are not constrained and can be fully connected, or some weights can be zero, meaning that the connection between those points disappears. The weight of a link represents the similarity between two data points. These characteristics give the method its local character. Following [41] and [35], a graph-embedding viewpoint can be used to reduce the dimensionality, mapping a weighted connected graph G = (V,E) to a line so that connected points stay as close together as possible.

The unsupervised dimensionality reduction problem aims to find the mapping \(\textbf{y}_{i}^{\prime} = \textbf{A}^{T}\textbf{x}_{i} : \textbf{y}_{i}^{\prime} \in \Re^{k}\), with \(k \ll p\), that keeps neighboring points in the multidimensional data close in the low-dimensional space, which can be expressed by the following cost function:

$$ J_{ns} = \frac{1}{2}\sum\limits_{ij}\|\textbf{y}^{\prime}_{i} - \textbf{y}^{\prime}_{j}\|^{2} w_{ij} $$
(1)

where \(\textbf{W} \in \Re^{m \times m}\), with entries \(w_{ij}\), is the similarity matrix \({\mathscr{S}}_{x}(\textbf{X})\).

Following this graph-embedding approach, SLMVP solves the supervised version: it seeks the mapping \(\textbf{y}_{i}^{\prime} = \textbf{A}^{T}\textbf{x}_{i} : \textbf{y}_{i}^{\prime} \in \Re^{k}\), with \(k \ll p\), that keeps neighboring points close in the low-dimensional space but preserves only those distances that are shared by the input and output spaces, given the similarity functions \({\mathscr{S}}_{x}\) and \({\mathscr{S}}_{y}\) for each of them. The cost function is then expressed by:

$$ J_{s} = \frac{1}{2}\sum\limits_{ij}\|\textbf{y}^{\prime}_{i} - \textbf{y}^{\prime}_{j}\|^{2} z_{ij} $$
(2)

where \(\textbf{Z} \in \Re^{m \times m}\) represents the joint similarity matrix between the input \({\mathscr{S}}_{x}(\textbf{X})\) and output \({\mathscr{S}}_{y}(\textbf{Y})\) similarity matrices, with \(z_{ij} = {\sum}_{k=1}^{m} u_{ik} v_{kj}\). Note the difference between (1), which is unsupervised, and (2), which defines a supervised manifold learning problem using the similarity matrix between inputs and outputs.

The minimization of the cost function (2) can be expressed, after some transformations, in its kernelized form (Kernel-SLMVP) as the following maximization problem:

$$ \max tr(\boldsymbol{Y}^{\boldsymbol{T}}{\boldsymbol{K}}_{x} {\boldsymbol{K}}_{y} {\boldsymbol{Y}}) $$
(3)

where \(\textbf{K}_{x} = {\mathscr{S}}_{x}(\textbf{X})\) and \(\textbf{K}_{y} = {\mathscr{S}}_{y}(\textbf{Y})\) are the input and output similarity graphs expressed as kernel functions (e.g. the polynomial kernel \(K(a,b) = (1 + a \cdot b)^{d}\) of degree d, or the Gaussian kernel \(K(a,b)=e^{\frac{-\lvert a-b \rvert^{2}}{2\sigma^{2}}}\)). Finally, (3) can be solved as an eigenvector problem on B as follows:

$$ \boldsymbol{XK}_{x}\boldsymbol{K}_{y}\boldsymbol{X}^{T}\boldsymbol{B} = \lambda\boldsymbol{B} $$
(4)

where B is the learned latent space. The projection of the input-space data X onto this space, \(\textbf{P} = \textbf{B}^{T}\textbf{X}\), gives the new extracted features to be used by the machine learning model. The Python code of SLMVP has been released publicly at [44].
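The following Python fragment is a minimal sketch of the Kernel-SLMVP computation described by (3) and (4); it is not the released implementation [44], and the function names, the choice of a Gaussian kernel for both similarity graphs, the storage of samples as rows, and the symmetrization step are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, gamma):
    """Pairwise Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2), gamma = 1/(2*sigma^2)."""
    sq_norms = np.sum(A ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (A @ A.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def slmvp_fit(X, Y, n_components, gamma_x=0.1, gamma_y=0.1):
    """X: (m, p) inputs (rows are samples), Y: (m, l) outputs.
    Returns the projection matrix B with shape (p, n_components)."""
    Kx = gaussian_kernel(X, gamma_x)        # input similarity graph, (m, m)
    Ky = gaussian_kernel(Y, gamma_y)        # output similarity graph, (m, m)
    M = X.T @ Kx @ Ky @ X                   # corresponds to X Kx Ky X^T when samples are columns
    M = 0.5 * (M + M.T)                     # symmetrize so a real eigendecomposition applies
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]       # keep the directions with the largest eigenvalues
    return eigvecs[:, order[:n_components]]

def slmvp_transform(X, B):
    """Extracted features P = B^T X (here returned with one row per sample)."""
    return X @ B
```

In this sketch, B is built from the leading eigenvectors of the matrix in (4), and slmvp_transform returns the projected features that are then fed to the regression model.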

3 Data description

The dataset used in this study concerns GHI and DNI measurements at two radiometric solar stations in the Iberian Peninsula: Seville and Lisbon. GHI and DNI have been acquired with a Kipp & Zonen CMP6 pyranometer, with a 15-minute resolution.

The set of inputs is a collection of forecasted meteorological variables obtained from GEFS at different levels of the atmosphere and at different latitudes and longitudes. More specifically, 9 meteorological variables at different levels are used (see Table 1), making a total of 38 attributes at each latitude-longitude pair. Latitudes go from 32 to 51 and longitudes go from -18 to 6 with a resolution of 0.5 degrees. In this work, two grids of different sizes have been used: a small grid with 3 × 3 = 9 points around the solar station (Seville and Lisbon) and a larger one with 17 × 19 = 323 points. For Seville, the larger grid covers the Iberian Peninsula (latitudes: 36 to 44, longitudes: 350 to 359.5, both with a resolution of 0.5 degrees). In the case of the Lisbon solar station, the larger grid has been shifted to cover part of the Atlantic Ocean (latitudes also go from 36 to 44 and longitudes go from 346 to 355.5). Figure 1 shows both the wide and narrow grids, centered around Seville and Lisbon (in blue). Since each point in the grid contains 38 attributes, the small grid results in 3 × 3 × 38 = 342 input variables, and the larger one in 17 × 19 × 38 = 12274 inputs.

Table 1 Meteorological Variables
Fig. 1 17 × 19 (black) and 3 × 3 (red) grids
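As an illustration of how the grid forecasts become the model inputs, the following sketch flattens a (timestamps × latitude × longitude × variables) array into one row per timestamp; the array name, shapes, and use of NumPy are assumptions for illustration only.

```python
import numpy as np

n_lat, n_lon, n_vars = 3, 3, 38        # small 3 x 3 grid, 38 attributes per grid point
n_samples = 480                        # training samples per CV iteration (see Section 4)

# forecasts[t, i, j, v]: value of variable v at grid point (i, j) for timestamp t
forecasts = np.zeros((n_samples, n_lat, n_lon, n_vars))

# One row per timestamp; 3 x 3 x 38 = 342 columns for the small grid,
# 17 x 19 x 38 = 12274 columns for the large grid.
X = forecasts.reshape(n_samples, n_lat * n_lon * n_vars)
print(X.shape)                         # (480, 342)
```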

GEFS provides predictions of meteorological variables for a 3-hour forecasting horizon four times a day (at 00:00, 06:00, 12:00, and 18:00). The corresponding GHI and DNI measurements are also used. To select the relevant hours of the day for GHI and DNI, samples with a zenith angle larger than 75 degrees have been removed. Given this restriction, data times range from 9:15 to 18:00. The input–output data cover the period from March 2015 to March 2017.

In this study, GHI and DNI are normalized by the clear-sky irradiance according to (5).

$$ I_{kt}(t)=I(t)/I_{cs}(t) $$
(5)

where I(t) stands for GHI or DNI at time t and Ics(t) is the clear-sky irradiance at time t.
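A minimal sketch of this normalization, combined with the zenith-angle filter described above, could look as follows; the function name and the assumption that the clear-sky irradiance comes from an external clear-sky model are illustrative.

```python
import numpy as np

def clear_sky_index(I, I_cs, zenith_deg, max_zenith=75.0):
    """Normalized irradiance I_kt = I / I_cs as in (5), dropping samples whose
    solar zenith angle exceeds max_zenith degrees (as described in Section 3)."""
    keep = zenith_deg <= max_zenith
    return I[keep] / I_cs[keep], keep
```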

4 Experimental validation

In order to study the performance of the SLMVP algorithm, it is combined with a regression method to predict normalized GHI and DNI for a 3-hour forecasting horizon and compared with the other above-mentioned methods: PCA, LPP, LOL, and SNMF. The regression technique uses as inputs the attributes/features obtained from the input-space transformation produced by the SLMVP, PCA, LPP, LOL, and SNMF methods. As suggested in [37], to learn the projection matrix for the LOL method we partition the data into K classes (we select K = 10) equally spaced over the target variable range [0, 1], thus obtaining a K-class classification problem (a sketch of this discretization is given below). In this work, both linear and non-linear regression methods have been tested. As a non-linear method, a state-of-the-art machine learning technique has been used: Gradient Boosting Regression (GBR) [45, 46]. This technique has shown considerable success in predictive accuracy in recent years (see for instance [24, 47,48,49]).
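The discretization used to fit LOL could be sketched as follows, assuming the normalized target lies in [0, 1]; the function name and the use of np.digitize are illustrative choices, not the exact procedure of [37].

```python
import numpy as np

def discretize_target(y, n_classes=10):
    """Map a normalized target in [0, 1] to labels 0 .. n_classes-1 using
    equally spaced bins, giving the K-class problem required to fit LOL."""
    inner_edges = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]
    return np.digitize(y, inner_edges)
```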

In this Section, first the methodology employed is described. Then, the results comparing SLMVP with PCA, LPP, LOL, and SNMF for different GEFS grid sizes will be presented.

4.1 Methodology

Cross-validation (CV) has been applied to study the performance of SLMVP, PCA, LPP, LOL, and SNMF. In standard CV, instances are distributed randomly into CV partitions. However, our study involves time series data, and therefore there are temporal dependencies between consecutive samples (in other words, consecutive samples can be highly correlated). Hence, in this study, grouped 4-fold CV has been used, as explained next. The data have been split into 4 groups, one for each week of every month: fold 1 contains the first week of each month (January, February, ...), fold 2 the second week of every month, and so on. This guarantees that the training and testing partitions never contain instances belonging to the same week, which allows a more realistic analysis of the performance of the methods (a sketch of this grouping is given after this paragraph). Since in this work the optimal number of features must be selected, a validation set strategy has been used. For this purpose, each training partition (which contains 3 folds) is again divided into training and validation sets. The validation set contains one week of each month out of the three weeks of data available in the training partition. The remaining two weeks (those not used for validation) are used for training.
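A sketch of this grouping, assuming timestamps are available as a pandas DatetimeIndex and using scikit-learn's LeaveOneGroupOut to obtain the 4 folds, is shown below; the exact assignment of days to weeks is an assumption based on the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

def week_of_month_group(timestamps):
    """Fold id 0..3 from the week of the month of each timestamp
    (days 1-7 -> 0, 8-14 -> 1, 15-21 -> 2, 22 and later -> 3)."""
    days = pd.DatetimeIndex(timestamps).day
    return np.minimum((days - 1) // 7, 3)

# groups = week_of_month_group(timestamps)
# for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
#     ...  # 4 iterations, one held-out week-of-month group each
```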

Mean Absolute Error (MAE) has been used as the performance measure (6). Given that a 4-fold CV has been employed, results are the CV-average of MAE.

$$ MAE = \frac{\sum_{i=1}^{N} \lvert y_{i} - o_{i} \rvert}{N} $$
(6)

where N is the number of samples, and yi and oi are the actual value and the output of the model, respectively. Note that the number of samples used for training is 480, which is much smaller than the number of dimensions of the large grid (480 ≪ 12274) and of the same order of magnitude as the number of dimensions of the small grid (342 ≈ 480).

The performance of the methods is evaluated as follows. Recall that the number of selected projected features is very relevant, and that the features obtained by the different methods are ordered by their importance. Then, in order to analyze the optimal number of dimensions, the performance of both the linear and the GBR regression methods is evaluated for 5, 10, 20, 50, 100, and 150 projected features. Given that 4-fold CV is used for performance evaluation, in each of the 4 CV iterations there is a training, a validation, and a testing partition. For each iteration, the regression models are trained with the training partition and then the validation and test errors are obtained. The averages over the 4 iterations are computed for the three errors (train, validation, and test). The validation error is used to select the optimal number of features.

SLMVP, SNMF, and GBR have some hyper-parameters that require tuning in order to improve results. Five hyper-parameters were fitted: the gamma parameter \(\gamma = \frac{1}{2\sigma^{2}}\) that defines the Gaussian kernel function of the SLMVP method; α, β, and 𝜃, which define the weight of each term of the SNMF method; and the number of estimators and the tree depth (which belong to GBR). The following ranges of values for each hyper-parameter were tested:

  • γ (SLMVP): from 0 to 2 in steps of 0.1

  • α, β, 𝜃 (SNMF): values in {0, 0.1, 0.01, 0.001}

  • Number of estimators (GBR): from 10 to 200 in steps of 10

  • Tree depth (GBR): from 1 to 10 in steps of 1

In order to tune the hyper-parameters, a systematic procedure known as grid search was used. This method tries all possible combinations of hyper-parameter values. A model for each hyper-parameter combination is trained on the training partition and evaluated on the validation partition, and the best combination on the validation set is selected (a sketch of this procedure is given below).
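A minimal sketch of this grid search for the SLMVP+GBR combination is given below; it assumes the training/validation arrays (X_train, y_train, X_val, y_val) produced by the split of Section 4.1 and the slmvp_fit/slmvp_transform helpers sketched in Section 2, and it omits the degenerate γ = 0 value.

```python
import itertools
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

gammas = np.arange(0.1, 2.05, 0.1)        # SLMVP Gaussian gamma
n_estimators_grid = range(10, 210, 10)    # GBR number of estimators: 10 .. 200
depth_grid = range(1, 11)                 # GBR tree depth: 1 .. 10

best_mae, best_params = np.inf, None
for gamma, n_est, depth in itertools.product(gammas, n_estimators_grid, depth_grid):
    # Extract features with SLMVP using the candidate gamma, then fit GBR
    B = slmvp_fit(X_train, y_train[:, None], n_components=20, gamma_x=gamma, gamma_y=gamma)
    model = GradientBoostingRegressor(n_estimators=n_est, max_depth=depth)
    model.fit(slmvp_transform(X_train, B), y_train)
    mae = mean_absolute_error(y_val, model.predict(slmvp_transform(X_val, B)))
    if mae < best_mae:
        best_mae, best_params = mae, (gamma, n_est, depth)
```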

4.2 Results

Table 2 shows the average GHI MAE for the best number of components for the different methods and grid sizes. Table 3 displays the same information for DNI. The best number of components has been selected using the MAE on the validation set. In all cases, it is observed that the use of the non-linear regression technique (GBR) considerably improves the errors for PCA and LPP, improves them in some cases for LOL and SNMF, and always yields only minor improvements for SLMVP. For instance, in the case of the small grid for GHI in Seville (Table 2, top left), the use of GBR with PCA improves the MAE considerably (from 0.6467 with LR to 0.0126 with GBR, a 6.8% improvement). Similar improvements for the PCA, LPP, and LOL MAE can be observed for the large grid, from 0.6084 to 0.0155, an 8.51% improvement (Table 2, top right). Lisbon GHI (Table 2, bottom) behaves in a similar way. For SNMF the differences are almost nonexistent. In the case of SLMVP, although GBR obtains better errors than LR, the difference between the linear and non-linear regressors is smaller than for PCA and LPP. For instance, for the GHI results for Seville (top of Table 2), for the small grid (top left) the difference between GBR and LR (when using SLMVP) is only 0.1562 vs. 0.1673, and for the large grid (top right) it is 0.1653 vs. 0.1845. Similar differences can be observed for GHI at Lisbon (bottom of Table 2). This is reasonable because SLMVP uses a non-linear kernel, so even when LR is used, part of the non-linearity of the problem has already been captured by the SLMVP feature extraction process. Conclusions for DNI (Table 3) follow a similar trend: PCA and LPP benefit more from using a non-linear method (GBR) than SLMVP and SNMF, whereas LOL benefits more from using a regularized linear regressor. LOL incorporates linear class-prior information, which is beneficial for LR.

Table 2 Average test MAE and number of selected components for different methods and for GHI at Seville and Lisbon locations
Table 3 Average test MAE and number of selected components for different methods and for DNI at Seville and Lisbon locations

Analyzing the results depending on the size of the grid (small vs. large), it is observed that the use of the large grid does not result in better MAE values. The best errors are always obtained with the small grid in all cases of Tables 2 (GHI) and 3 (DNI). When the large grid is used, more components are selected for SLMVP but, as already mentioned, this does not improve the results.

Summarizing the results so far, for both irradiances (GHI and DNI) and both locations (Seville and Lisbon), the best performance is always obtained with the SLMVP method combined with the non-linear regression method (GBR). In order to better quantify this improvement, Table 4 shows the percentage improvement of SLMVP relative to PCA, LPP, LOL, and SNMF for the best models (SLMVP+GBR, PCA+GBR, LPP+GBR, LOL+LR, and SNMF+GBR/LR). In summary, SLMVP is 4.92% better than LOL for Seville and around 3.99% better for Lisbon, 5.88% better than PCA for Seville and around 3.12% better for Lisbon, 25.93% better than LPP for Seville and around 16.29% better for Lisbon, and 6.21% better than SNMF for Seville and around 2.82% better for Lisbon.

Table 4 Percentage improvement of SLMVP relative to PCA, LPP, LOL, and SNMF

In order to visualize the relation between the number of components and the error, Fig. 2 shows the GHI validation and test MAE for the different numbers of components. This is done for SLMVP, PCA, LPP, LOL, and SNMF using GBR as the regressor, for Seville and Lisbon (top/bottom, respectively), and for the small and large grids (left/right, respectively). The same information is displayed in Fig. 3 for DNI. It is observed that the best number of PCA components is usually smaller than for the other methods and that LPP and LOL usually benefit slightly from a larger number of components.

Fig. 2 Average MAE for GHI of SLMVP-GBR, PCA-GBR, LPP-GBR, LOL-GBR, and SNMF-GBR along the number of components (x-axis) for the small and large grids in Seville and Lisbon

Fig. 3 Average MAE for DNI of SLMVP-GBR, PCA-GBR, LPP-GBR, LOL-GBR, and SNMF-GBR along the number of components (x-axis) for the small and large grids in Seville and Lisbon

SNMF and SLMVP behave similarly with respect to the number of components, with the optimal number being slightly smaller for SLMVP. In contrast to PCA and LPP, the information and components found by SLMVP benefit the performance of the regression method. In Figs. 2 and 3 it is observed that, up to 20 components, the errors decrease in all study cases. With the small grid, 20 components is the best solution for all datasets except Lisbon DNI, which reaches the best solution with 50 components (see the left part of Fig. 2 for GHI and the left part of Fig. 3 for DNI). When the large grid is used, more than 20 components are generally beneficial, with 50 or 100 components being selected as the best options (50 components for Seville and 100 components for Lisbon, although 50 and 100 components perform similarly for both locations). In those figures, it is also observed that, although validation and test errors follow a similar trend, the best error in validation does not always correspond to the best error in test. This is to be expected because the validation error is only an estimate obtained with a finite independent sample. But at least it can be seen that, in all cases, using the validation error to determine the best number of components is a reliable way of achieving a reasonable test error.

Figures 2 and 3 also show that the performance of SLMVP is better than that of PCA, LPP, LOL, and SNMF for every number of components (except for a few PCA cases and one SNMF case). In order to quantify these improvements, Tables 5, 6, 7 and 8 show the percentage improvement of SLMVP over PCA, LPP, LOL, and SNMF, respectively, using the best regression model for each number of components. The superiority of SLMVP is clearly observed, but it is interesting to note that when 5 components are used (and in some cases with 10 components), either PCA is better or the improvement of SLMVP is smaller. This suggests that PCA is able to find relevant information when only very few components are allowed. In any case, it is clear from Figs. 2 and 3 that more than 5 components are required in order to obtain the best results.

Table 5 Improvements in percentage (%) of SLMVP over PCA for the different number of components
Table 6 Improvements in percentage (%) of SLMVP over LPP for the different number of components
Table 7 Improvements in percentage (%) of SLMVP over LOL for the different number of components
Table 8 Improvements in percentage (%) of SLMVP over SNMF for the different number of components

Finally, to verify that the SLMVP technique is superior to PCA, LPP, LOL, and SNMF not only for the optimal number of dimensions but independently of the number of dimensions selected, we have used a two-sample t-test for equal means to test whether the observed average error improvement across the different numbers of dimensions for each dataset could be due to chance. The obtained significance levels are shown in Table 9. We applied this test under the null hypothesis that the means are equal, allowing the observations to have different standard deviations. As observations we used the 6 test error results (for 5, 10, 20, 50, 100, and 150 extracted components) obtained for each dataset. We conclude that the improvement obtained in the SLMVP vs. LPP analysis is significant, rejecting the hypothesis of equal means with a p-value always below 0.001. Versus PCA, the p-value is below 0.05 in 4 out of 8 cases (3 for Lisbon and 1 for Seville); versus LOL, the p-value is below 0.05 also in 4 out of 8 cases (3 for Lisbon and 1 for Seville); and versus SNMF, the p-value is below 0.05 also in 4 out of 8 cases (2 for Lisbon and 2 for Seville), rejecting the hypothesis of equal means in those cases.

Table 9 Dimensionality two-sample t-test analysis for equal means and 5, 10, 20, 50, and 150 dimensions

In summary, we observed that for Lisbon the null hypothesis is rejected in 12 out of 16 cases (and two of the remaining cases have p-values close to the 5% threshold, namely 0.08 and 0.1), and in 8 out of 16 cases for Seville. These insights suggest that the source data for the two locations have different properties: Lisbon may contain noisier data, and therefore our method obtains larger improvements there because of the noise tolerance introduced by the use of locality.

5 Conclusions

Using machine learning methods to forecast GHI or DNI radiation based on features from NWP grids typically involves a large number of attributes. In this article, a supervised method for feature transformation and reduction (SLMVP) has been proposed to extract the most relevant features, addressing the limitations of the PCA technique, namely its inability to represent locality and non-linear patterns and to use labeled data. PCA is one of the most widely used methods to extract features and reduce dimensionality in renewable energy. Three other state-of-the-art dimensionality reduction methods that include locality (LPP) or supervision (LOL and SNMF) have also been included in the comparison.

The five methods have been tested and compared on radiation data at two different Iberian locations: Seville and Lisbon. Both linear and non-linear (GBR) regression methods have been used on the components extracted from SLMVP, PCA, LPP, LOL, and SNMF.

Results show that for both types of radiation (GHI and DNI) and both locations, SLMVP yields smaller MAE than the other methods. In order to assess the influence of the size of the NWP grid, two sizes have been tested, small and large. SLMVP produces better radiation estimates for both, although the small grid yields slightly lower errors. It has also been shown that PCA tends to underestimate the number of features required to obtain the best results. LPP obtains the worst results, and this is especially noticeable for the large grid. SNMF has also shown a degradation in its performance for the large grid compared with SLMVP. In summary, the small grid works better, and the improvement of SLMVP over the other methods is about 6.24% for Seville GHI, 3.60% for Seville DNI, 1.73% for Lisbon GHI, and around 4.50% for Lisbon DNI.

Finally, although SLMVP, PCA, and LPP all benefit from using a non-linear regression method (GBR), this benefit is larger for PCA and LPP because they are not able to perform non-linear transformations. LOL does not benefit from non-linear regression and in some cases obtained better results using the regularized linear regressor. SNMF benefits slightly from non-linear regression for all but one of the small-grid cases, but not for the large ones. Because SLMVP is able to use non-linear transformations, the difference between the linear and non-linear regression methods is, as expected, smaller (but still present).

We can conclude that SLMVP is a competitive method for dimensionality reduction in the context of solar radiation forecasting using NWP variables, beating PCA, currently the most widely used method, as well as LOL and SNMF, two recent state-of-the-art supervised dimensionality reduction methods. Overall, SLMVP also obtains better results independently of the number of dimensions used, showing its robustness.

We envision that different machine learning methods would benefit from being combined with SLMVP, and it will therefore be of interest to verify this using datasets from this and other domains.