1 Introduction

Wind power is a critical pillar in the pursuit of global carbon neutrality, and its installation capacity has steadily increased in recent decades as reported by Global Wind Energy Council (GWEC 2022). This upward trend provides a solid foundation for powering our society with clean and renewable energy, while also mitigating the environmental pollution caused by fossil fuels. However, the volatility of wind power and its increasing penetration pose new challenges to the safety and stability of power grid operations. Studying wind power prediction (WPP) is critical and valuable because accurate results can facilitate power grids and wind farms to better manage the wind power generation uncertainty. Accurate wind power predictions can benefit many downstream applications, such as more efficient wind power integration (Wang et al. 2019a), intelligent market operations (Usaola et al. 2004), and monitoring wind turbine performance (Ma et al. 2014). WPP has become a classical problem in the renewable energy field and has attracted a large volume of studies (Landberg 1999; Liu et al. 2010; Liu et al. 2021b, c, d, e, a; Woo et al. 2019). These studies persistently aim to develop new technologies that achieve state-of-the-art performance in terms of accuracy and reliability.

In the literature, various approaches have been proposed for improving the accuracy of wind power predictions. The advancements in this field can be analyzed from two aspects, the problem formulation and methodological advancement, as displayed in Fig. 1. The formulation of the WPP problem can be further categorized based on input and output design considerations. Early studies attempted to estimate wind power using environmental and physical attributes (Landberg 1999), such as the topological information and meteorological information, with integrating wind power conversion dynamics or historical wind power generation records (Liu et al. 2010). With the wide deployment and continuous advancement of supervisory control and data acquisition (SCADA) systems in commercial wind farms, recent studies (Liu et al. 2021d; Woo et al. 2019; Khodayar and Wang 2018) have attempted to learn spatial–temporal correlations from the SCADA data to future wind power. To analyze temporal correlations, multivariate time series data were used to reflect the system dynamics, while to learn spatial correlations due to environmental factors, data collected from different sites were integrated into a tensor (Liu et al. 2021d). Another study (Woo et al. 2019) projected wind turbines into a 2-dimensional grid to reflect their geo-spatial relationship while such strategy might lose its effectiveness in cases that turbines were sparsely distributed. An alternative solution (Khodayar et al. 2018) involved organizing wind turbines into a graph constructed using locations and mutual information. Existing WPP studies can also be categorized based on various output considerations, which include the power output level, prediction horizon, and prediction type. Prediction of wind power outputs has been studied at three levels, the region, wind farm, and wind turbine. Depending on the prediction horizon, WPP tasks can be classified as short-term (0–6h ahead), medium-term (6–24h ahead), or long-term prediction (more than 24h ahead) (Khodayar et al. 2018). According to the considered output type, WPP tasks are classified into deterministic and probabilistic predictions. Deterministic predictions provide estimated spot values of future wind power while probabilistic predictions quantify the uncertainty of future wind power by inferring confidence intervals, quantiles, or even distributions.

Fig. 1
figure 1

The categorization of WPP studies in literature

Metrics for assessing the performance of WPP constitute another important consideration in problem formulation. The choice of evaluation metric depends on the type of WPP tasks. For deterministic WPP tasks, RMSE is the most widely used evaluation metric. Other metrics, such as mean absolute error (MAE), mean absolute percentage error (MAPE), and their standard deviations, are also meaningful for measuring WPP performance from different perspectives. In probabilistic WPP tasks, well-known and commonly used metrics include continuous ranked probability score (CRPS) and prediction interval coverage probability (PICP).

In addition to the problem formulation, WPP studies can also be categorized based on methodological development as shown in the right part of Fig. 1. Existing efforts in studying WPP methods have been devoted into two directions, physics-based and data-driven. Physics-based methods (Landberg 1999) focused on developing WPP models based on physical attributes and principles. On the other hand, data-driven methods (Sideratos and Hatziargyriou 2007; Brown et al. 1984; Treiber et al. 2016; Madhiarasan and Deepa 2017) focused on developing models for estimating future wind power outputs based on SCADA data and even a joint consideration of SCADA data as well as historical measurements and predictions of physical and environmental attributes. Three series of data-driven models (Sideratos and Hatziargyriou 2007) have been developed in literature, statistical models (Brown et al. 1984), classical machine learning (ML) models (Treiber et al. 2016), and recent deep learning (DL)-based models (Madhiarasan and Deepa 2017). Statistical models include time series models, such as persistent method (PM) (Bludszuweit et al. 2008), auto-regressive integrated moving average (ARIMA) (Chen et al. 2009), and Kalman filters (KF) (Bossanyi 1985), which model historical patterns of wind power to estimate its future, as well as linear regression (LR) models, such as least absolute shrinkage and selection (LASSO) (Cavalcante et al. 2017), which incorporates data of exogenous factors in addition to the consideration of historical wind power records. Classical machine learning algorithms, such as support vector regressor (SVR) (Zendehboudi et al. 2018), shallow neural networks (SNN) (Wang et al. 2021b), tree-based models like decision tree (DT) for regression (Heinermann and Kramer 2016) and random forest (RF) for regression (Lahouar and Slama 2017), etc., have been applied to model nonlinearities in wind power generation using SCADA data. With the recent development of deep learning, there has been increasing interest in applying DL-based models for better WPP performance. Recent DL-based models include classical ones, such as deep neural network (DNN) (Methaprayoon et al. 2007), convolutional neural network (CNN) (Liu et al. 2021d), recurrent neural network (RNN) (Cali and Sharma 2019), as well as latest ones, such as graph neural network (GNN) (Wu et al. 2022), attention-based model (AM) (Tian et al. 2022) and Physics-informed graph network (PINN) (Pombo et al. 2022).

DL-based methods have brought unique advantages on data processing and feature engineering into WPP modeling. Convolutional and recurrent mechanisms automate the process of embedding a vast input space from spatial and temporal aspects to derive a low-dimensional representation. Researchers have hypothesized that this new data processing paradigm carries a larger scope of information in data to benefit WPPs. Two well-known variants of RNN models, long short term memory (LSTM) (Liu et al. 2019b, a) and gated recurrent unit (GRU) (Liu et al. 2022b, a), have been frequently applied to extract temporal patterns of wind power data. Signal processing techniques, such as the empirical mode decomposition (EMD) (Abedinia et al. 2020), variational mode decomposition (VMD) (Abdoos 2016), and SSA (Dong et al. 2017), have been used to filter out noises and extract data fluctuation modes to facilitate recurrent neural networks to better learn latent patterns in time series for WPPs.

DL techniques have been developed to solve both deterministic and probabilistic WPP tasks. Various DL-based structures (Lahouar and Slama 2017; Methaprayoon et al. 2007; Liu et al. 2021d; Cali and Sharma 2019) have been developed for direct inference of deterministic values of future wind power outputs. To perform probabilistic WPP tasks, DL-based models have been extended for the quantile regression (QR) (Nielsen et al. 2006), lower upper bound estimation (LUBE) (Khosravi and Nahavandi 2013), kernel density estimation (KDE) (Bessa et al. 2012), and mixture density network (MDN) (Zhang et al. 2020a, b, c). In a few studies, clustering algorithms, such as K-means (Wang et al. 2018), C-means (Yang et al. 2021a), EM (Liu et al. 2018a, b, c), etc., have been employed to group wind power time series into different clusters representing different wind power fluctuation patterns, which help specify deep learning models into prediction scenarios.

Although deep learning has proven effectiveness in processing large volumes of data, its robustness and generalization ability across different WPP tasks still require further improvement. One possible reason for this is the fact that the hyperparameters and configurations of deep models, such as the number of layers and neurons in each layer, can greatly affect their performance. To address this challenge, a number of optimization algorithms have been developed, such as the grid search (GS) (Zhang et al. 2014), cuckoo search (CS) (Li et al. 2021), genetic algorithm (GA) (Huang et al. 2015), particle swarm optimization (PSO) (Amjady et al. 2011), and grey wolf optimization (GWO) (Lu et al. 2020), which aim to attain the optimal architecture for a given deep learning model. Another attempt observed is incorporating attention mechanism for dynamically processing features after the input or latent layers to adaptively respond WPP scenarios (Yang and Zhang 2021). An emerging trend is developing PINN based WPP models (Huang and Wang 2022). Physics and domain knowledge were leveraged to better govern the network design and prediction performance to enable a better generalization in WPPs (Lagomarsino-Oneto et al. 2023).

This paper provides a survey of recent WPP studies powered by deep learning from two dimensions, the WPP formulations and WPP methods. We discuss recent innovations in WPP formulations in terms of the input designs, output types, and evaluation metrics. In terms of methodology, this paper offers a systematic summary of recent advancements in four main components of many deep learning-based WPP (DL-WPP) modeling pipelines, the input signal processing, data pattern clustering, latent feature engineering, and model architecture optimization. In addition, we review three types of learning schemes based on these four components. The review focuses particularly on the latent feature engineering component, which is a significant benefit of deep learning due to its ability to accommodate high-dimensional input space and automate the engineering of latent features representing such input space. Based on the characteristics of the deep learning models, 28 deep learning network configurations covered in this review are grouped into 8 types. Finally, future research directions are discussed by analyzing the limitations of current WPP studies and identifying the hot spots of great potential for benefiting the technical development in WPP studies. The contributions of this paper can be summarized as follows.

  • Compared with existing WPP reviews (Vargas et al 2019; Qian et al. 2019; Khalid and Javaid 2020; Wang et al. 2021c, 2016, 2020; Jung and Broadwater 2014; Marugán et al. 2018; Liu et al. 2020a, b), which commonly listed a large set of reported WPP methods, this paper surveyed the literature with jointly landing discussions in the WPP problem formulation evolution as well as the mechanism advancement in data-driven WPP from classical models to carefully generate a clearer image on the WPP advancement driven by the recent rapid DL development.

  • This paper sheds light on surveying over 140 recent studies investigating DL-WPP methods. Based on this comprehensive analysis of recent state-of-the-art studies, this paper provides a high-level overview of the current frontiers of data-driven WPPs.

  • This paper serves as an interface for accessing emerging WPP developments, indexing them from multiple perspectives. Readers can quickly refer to their interested WPP studies according to the data organization, feature engineering, evaluation metrics, deep learning-based modeling frameworks, etc.

  • This paper provides details of current state-of-the-art DL-based modeling methods, guiding the audience to replicate these WPP models. Meanwhile, a comparative analysis of the performance of latest DL-based WPP models is conducted based on datasets of three commercial wind farms to facilitate the audience to further understand the effectiveness of these models on WPP tasks.

  • Discussions of the up-to-date promising future research directions are provided via analyzing the limitations of existing WPP studies and identifying new DL hot spots for possibly advancing WPP performance. These future trends aim to serve as guidance for researchers to study more advanced WPP methods leading to a more accurate and reliable WPP performance.

The remaining parts of this paper are organized as follows. Section 2 summarizes recent DL-WPP studies in terms of the problem formulations. Section 3 revisits and elaborates details of emerging deep learning-based methods for WPP tasks. Section 4 provides a discussion of promising future research trends and Sect. 5 concludes the insights of this survey study.

2 Data-driven wind power prediction problem formulation

Let \(x\) represent the input data related to wind power generations and \(y\) represent the actual wind power output. The objective of data-driven WPP studies is to develop a model \(f(\cdot )\) that predicts the wind power output, where the predicted output is denoted by \(\widehat{y}=f(x)\). Typically, the model \(f(\cdot )\) is trained by minimizing the error measures between the prediction \(\widehat{y}\) and the true value \(y\).

Based on this formulation, new developments have occurred on the design of input \(x\) and the output \(y\) considered in WPP studies. Along with the innovation in problem formulation, we also observe new developments in metrics for more effectively evaluating WPP performance. Thus, a survey from this perspective is also needed. In this section, we first review on the scope of the considered input \(x\) and its organization forms. Then, we summarize the different settings of output y considered in WPP studies. Finally, we survey the evaluation metrics used in WPP tasks.

2.1 WPP input and its organization

As shown in Table 1, recent WPP studies have mainly focused on numerical weather predictions (NWP) and SCADA data as inputs. The NWP input data, \(x\in {\mathbb{R}}^{T\times {N}_{NWP}}\), where \(T\) represents the length of the sequence and \({N}_{NWP}\) represents the number of NWP attributes (e.g., the predicted temperature and wind speed), have been utilized in WPP model development (Bessa et al. 2012). These NWP data are usually extracted from the data provided by other weather forecasting sources. For example, in studies (Hong et al. 2016; Klinges et al. 2022; Kokkos et al. 2021), the NWP data extracted from the global gridded weather data were utilized to enable the consideration of gridded macroclimatic variables as the input.

Table 1 Inputs of WPP studies

Wind turbine SCADA data is another frequently employed data source for developing inputs \(x\in {\mathbb{R}}^{T\times {N}_{SCADA}}\), where \({N}_{SCADA}\) describes the number of SCADA attributes, e.g., the historical wind power, historical wind power ramp, wind speed, wind direction, generator torque, blade pitch angle, etc., in WPP studies (Methaprayoon et al. 2007; Liu et al. 2021b, c, d, e, a). With the advancement of the SCADA system, more related attributes become available for enhancing WPP. Several recent studies (Valsaraj et al. 2020, 2022; Külüm et al. 2023; Weide et al. 2022) targeted at enriching the information supply for WPP tasks by including anemometric data collected at higher heights, which aimed to overcome the deficit of anemometers mounted on the back end of the turbine nacelle on measuring wind speeds.

A joint consideration of both weather and historical data as the input for WPP tasks has also been reported (Ghadi et al. 2014; Cali et al. 2019). Both NWP and SCADA attributes were simply integrated into one input described as \(x=[{x}_{NWP},{x}_{SCADA}]\), where \({x}_{NWP}\in {\mathbb{R}}^{{T}_{NWP}\times {N}_{NWP}}\) represents the part containing NWP data, \({x}_{SCADA}\in {\mathbb{R}}^{{T}_{SCADA}\times {N}_{SCADA}}\) represents the input SCADA data, \({T}_{NWP}\) is the time length of NWP data considered, and \({T}_{SCADA}\) is the time length of SCADA data considered.

To incorporate more relevant attributes and enrich WPP input, recent studies (Pombo et al. 2022; Huang and Wang 2022; Li and Zhang 2022; Zhang and Zhao 2021) have also considered applying the physics-informed methods to augment data and improve accuracy. Such efforts jointly consider knowledge from astronomy, fluid dynamics, power curve modelling, and the spatiotemporal correlation among sensors and actuators. Astronomy knowledge usually comprises of the solar position, which can be extracted by means of the azimuth and elevation angles, as well as the weather conditions which is usually captured by installed cameras. Fluid dynamics consider the flux which measures a quantity’s flow rate carried by a moving fluid per unit of normalized area, as well as the turbulence intensity which evaluates the level of velocity fluctuation of a fluid. Power curve presents the relationship between wind speed and output power.

To further improve WPP performance, studies (Liu et al. 2021d; Woo et al. 2019; Khodayar and Wang 2018) have explored incorporating higher dimensional information from data sources to build more meaningful WPP inputs. Data considered in the input was expanded from a single wind turbine or a single wind farm site to a group of its neighbors with a size \(P\) which enables a consideration of the spatial influences, such as the wake effects, geostrophic wind, ground roughness, etc., as shown in Fig. 2. These studies (Liu et al. 2021d; Woo et al. 2019; Khodayar and Wang 2018) organized high dimensional input data using three strategies, the stacking, projection, and graph-based model. Meanwhile, although the following description is offered with using the SCADA data as an illustrative example, the joint consideration of meteorological measurements and predictions as well as SCADA data can also be expanded to a group of neighbors of the targeted turbine for learning the spatial pattern among those attributes.

Fig. 2
figure 2

Three types of data organization considering spatial influence

2.1.1 Stacking strategy

In (Liu et al. 2021d), SCADA data collected from \(P\) turbines were directly stacked into a 3-dimensional tensor, \(x\in {\mathbb{R}}^{T\times P\times {N}_{SCADA}}\) to consider richer information. However, the organized input \(x\) does not reflect the actual geographical distribution of the turbines.

2.1.2 Projection strategy

In (Woo et al. 2019), data from multiple sources were projected onto a 2-dimensional grid to construct a 4-dimensional tensor \(x\in {\mathbb{R}}^{T\times W\times H\times {N}_{SCADA}}\) using the SCADA data of \(P\) turbines, where W and H are the width and height of grids. This strategy considers the geographical distribution of wind turbines. However, once the distribution of the turbines is sparse, the majority values in \(x\) are blank, leading to inefficiency and impaired WPP performance.

2.1.3 Graph-based modeling strategy

In (Khodayar and Wang 2018; Liu et al. 2023), a graph is developed to model the relationship between wind turbines via incorporating geographical information, such as the longitude, latitude, and altitude as well as the mutual information between the wind turbines. This strategy expands the input \(x\) to a tuple \(x=[G,{x}_{SCADA}]\), where \({x}_{SCADA}\in {\mathbb{R}}^{T\times P\times {N}_{SCADA}}\) is the stacked SCADA data and \(G\) is the modeled graph given by the correlation matrix. Efficient correlation matrices considered include following candidates:

  • Mutual information matrix: \({G}_{ij}=MI(i,j)\).

  • Exponential negative distance matrix: \({G}_{ij}={e}^{-Dis(i,j)}\).

  • Multiplication of mutual information and exponential Negative Distance: \({G}_{ij}=MI\left(i,j\right)\times {e}^{-Dis(i,j)}\)

  • Mutual information controlled exponential Negative Distance: \({G}_{ij} = \left\{\begin{array}{c}0, if MI(i,j)<\tau \\ {e}^{-Dis(i,j)}, if MI(i,j)\ge \tau \end{array}\right.\)

where \(MI\left(i,j\right)\) is the mutual information of turbines \(i\) and \(j\), \(Dis\left(i,j\right)\) is the distance between turbines \(i\) and \(j\), as well as \(\tau\) is a predefined threshold.

2.2 WPP output settings

As shown in Table 2, recent WPP studies considered different output settings. Regarding the targeted prediction level, most studies targeted on either the wind turbine level power prediction (Liu et al. 2021d; Woo et al. 2019; Khodayar and Wang 2018; Zhang et al. 2019a, b) or the wind farm level power prediction (Dong et al. 2016; Ghadi et al; 2014). A few studies (Osório et al. 2015; Catalão et al. 2010) uniquely discussed predicting a regional level wind power output, which is the total power output of multiple wind farms.

Table 2 Output settings of WPP studies

Depending on the prediction output type, two WPP tasks, the deterministic WPP and probabilistic WPP, are studied. Deterministic WPPs (Treiber et al. 2016; Madhiarasan and Deepa 2017; Bludszuweit et al. 2008) predict the spot value of wind power. However, point predictions are prone to errors and lack of the capability to quantify the future wind power uncertainty. To address this issue, probabilistic WPPs (Khosravi and Nahavandi 2013; Bessa et al. 2012) investigate the provision of the confidence interval, quantile, or distribution of the future wind power, which enable the uncertainty quantification.

Meanwhile, WPP prediction tasks are typically classified into three types based on the prediction horizons which aim to serve different downstream tasks:

  1. 1.

    Short-term WPP (0–6h ahead): Short-term WPPs are the most commonly discussed. The results can be applied to enhance the efficiency of wind power utilization in grids, scheduling (Wang et al. 2019a), reducing regulation costs in electricity market operations (Usaola et al. 2004), as well as optimizing and monitoring wind farm performances (Ma et al. 2014).

  2. 2.

    Median-term WPP (6–24h ahead): Median-term WPPs mainly aim to support dynamic operations in power systems, such as balancing between the wind generation and load (Menemenlis et al. 2012), energy scheduling (Shi et al. 2012), and load following (Paterakis et al. 2014).

  3. 3.

    Long-term WPP (more than 24 h ahead): Long-term WPPs contribute into a variety of downstream tasks, such as electricity pricing (Wang et al. 2017a, b, c), unit commitment (Wang et al. 2008), turbine maintenance (Ren et al. 2021), storage management (Blonbou et al. 2011), and power trading (Pircalabu et al. 2017).

    Another important extension of WPP involves wind power ramp prediction (Cui et al. 2023; Hu et al. 2023), which uniquely sheds the research light on better supporting WPP tasks against the future sudden large wind speed changes. Both the point prediction of wind power ramps (Gallego et al. 2014) and the probabilistic prediction of wind power ramps (He et al. 2023) have been studied.

2.3 WPP performance evaluation metrics

To evaluate the WPP performance, a variety of assessment metrics have been designed and applied.

Table 3 summarizes the metrics applied in recent WPP studies (Chitsaz et al. 2015; Osório et al. 2015; Catalão et al. 2010; Qureshi et al. 2017). These metrics can be divided into two groups, metrics for evaluating deterministic and probabilistic WPPs.

Table 3 Metrics considered in evaluating WPP performance

In deterministic WPPs, absolute error-based errors including MAE, NMAE, MAPE and symmetric MAPE (sMAPE) as well as squared error-based metrics including mean square error (MSE), mean square error (RMSE), and normalized mean square error (NRMSE) are most utilized. MAE is a basic but widely considered metric that takes the average absolute error of each pair of observed and predicted wind power outputs. NMAE presents the normalized version of MAE according to the maximal wind power generation. MAPE and sMAPE present the proportion of the absolute error comparing with the actual wind power generation and the average of prediction and actual wind power, respectively. Squared error metrics penalize the large prediction error via taking the square of the error. In the case that two models show the same performances in terms of MAE on a dataset, the one with larger errors in certain data points is more likely to obtain a larger RMSE.

Evaluating probabilistic WPP performance is more complex than evaluating deterministic WPP. Three types of metrics, prediction interval-based, distribution-based, and quantile-based metrics, are considered. Regarding prediction intervals, reliability and sharpness are critical performance measures. PICP is the most widely applied reliability metric, reflecting the proportion of prediction intervals containing the actual wind power generation. Average coverage error (ACE) metric is another reliability metric, which measures the difference between PICP and the nominal confidence of the prediction interval. prediction interval normalized average width (PINAW) is a typical sharpness metric, measuring the average width of prediction intervals. To comprehensively consider both reliability and sharpness, the coverage width-based criterion (CWC) metric is defined as a multivariate function of PINAW and PICP. CRPS and skill score are typical distribution-based and quantile-based metrics, respectively, which measure the fitness of the predicted distribution by comparing it with the actual wind power distribution.

3 Deep learning based wind power prediction methods

This section presents a comprehensive review of the DL-based methods in recent WPP studies. As shown in Fig. 3, most of reported DL-based WPP (DL-WPP) methods can be categorized into four groups, Scheme 1–Scheme 4, based on the adopted learning scheme, which consists of four information processing components, the signal processing component \({f}_{sp}(\cdot )\), clustering component \({f}_{c}(\cdot )\), feature engineering component \({f}_{fe}(\cdot )\), and optimization component \({f}_{o}(\cdot )\). First, we provide a review of Scheme 1–Scheme 4.

Fig. 3
figure 3

Four learning schemes for developing DL-WPP models

3.1 Learning schemes

Scheme 1 (Ma et al. 2014; Woo et al. 2019; Treiber et al. 2016) presents the simplest end-to-end modeling pipeline based on deep learning. The \({f}_{fe}(\cdot )\) is developed using deep learning algorithms as models to take the inputs and generate the wind power predictions. Optionally, a meta-learning process \({f}_{o}(\cdot )\) can be conducted to optimize hyper-parameters of \({f}_{fe}(\cdot )\). The power prediction \(\widehat{y}\) is obtained using Eqs. (1) and (2).

$$\widehat{y}={f}_{fe}^{*}(x)$$
(1)
$${f}_{fe}^{*}={f}_{o}({f}_{fe}, x\mathrm{^{\prime}},y\mathrm{^{\prime}})$$
(2)

where \({f}_{fe}^{*}\) is regarded as the optimal feature engineering component for generating predictions, \(x^{\prime},y^{\prime}\) are inputs and outputs of the training set.

Scheme 2 extends Scheme 1 via incorporating wind power sequence mode decomposition to generate prediction modes sharing similar patterns (Zu and Song 2018; Han et al. 2019a, b; Dong et al. 2017). The formulation of Scheme 2 is illustrated via Eqs. (3)–(6). The \({f}_{sp}(\cdot )\) is used to decompose the input into \(m\) subseries \({s}_{1}, {s}_{2},\dots ,{s}_{m}\). The \({i}^{th}\) subseries is processed using the corresponding feature engineering component \({f}_{f{e}_{i}}(\cdot )\). As in Scheme 1, optionally, all of the feature engineering components can also be optimized via \({f}_{o}(\cdot )\) in Scheme 2.

$$\widehat{y}={\sum }_{i=1}^{m}{\widehat{y}}_{i}$$
(3)
$${\widehat{y}}_{i}={f}_{f{e}_{i}}^{*}\left({s}_{i}\right) \forall i=\mathrm{1,2},\dots ,m$$
(4)
$${s}_{1},{s}_{2},\dots ,{s}_{m}={f}_{sp}(x)$$
(5)
$${f}_{f{e}_{1}}^{*},{f}_{f{e}_{2}}^{*},\dots ,{f}_{f{e}_{m}}^{*}={f}_{o}({f}_{f{e}_{1}},\dots ,{f}_{f{e}_{m}},{f}_{sp}, x\mathrm{^{\prime}},y\mathrm{^{\prime}})$$
(6)

Scheme 3 presents a further advancement on top of Scheme 2 by clustering data based on wind power sequence modes to create prediction modeling scenarios (Liu et al. 2021c; Abedinia et al. 2020; Azimi et al. 2016). The formulation of Scheme 3 is described in Eqs. (7)–(10). The \({f}_{sp}\left(\cdot \right)\) is applied to decompose the input into \(m\) subseries \({s}_{1}, {s}_{2},\dots ,{s}_{m}\). The generated sub-sequences are then used to group data into \(n\) clusters \({cl}_{1},c{l}_{2},\dots ,c{l}_{n}\) using a clustering component \({f}_{c}(\cdot )\). Data of the \({i}^{th}\) cluster are processed using the corresponding feature engineering component \({f}_{f{e}_{i}}(\cdot )\), which can be optionally optimized by the \({i}^{th}\) optimization component \({f}_{{o}_{i}}(\cdot )\).

$$\widehat{y}={f}_{f{e}_{j}}^{*}(x)$$
(7)
$$j={f}_{c}({s}_{1},{s}_{2},\dots ,{s}_{m})$$
(8)
$${s}_{1},{s}_{2},\dots ,{s}_{m}={f}_{sp}(x)$$
(9)
$${f}_{f{e}_{i}}^{*}={f}_{{o}_{i}}\left({f}_{f{e}_{i}},{f}_{sp}, {x}^{\mathrm{^{\prime}}},{y}^{\mathrm{^{\prime}}}\right) \forall i=\mathrm{1,2},\dots ,n$$
(10)

Scheme 4 develops hybrid models to attain a greater flexibility in the specification. In Scheme 4, multiple models are used to generate predictions. The final prediction is calculated using an optimized ensemble process based on the performance of each model. The formulation of Scheme 4 is described by Eqs. (11)–(14).

$$\widehat{y}={\sum }_{i=1}^{m}{c}_{i}{\widehat{y}}_{i}$$
(11)
$${c}_{1},{c}_{2},\dots ,{c}_{m}={f}_{e}({f}_{f{e}_{1}}^{*},{f}_{f{e}_{2}}^{*},\dots ,{f}_{f{e}_{m}}^{*}, {x}^{\mathrm{^{\prime}}},y\mathrm{^{\prime}})$$
(12)
$${\widehat{y}}_{i}={f}_{f{e}_{i}}^{*}\left(x\right) \forall i=\mathrm{1,2},\dots ,m$$
(13)
$${f}_{f{e}_{i}}^{*}={f}_{{o}_{i}}\left({f}_{f{e}_{i}}, {x}^{\mathrm{^{\prime}}},{y}^{\mathrm{^{\prime}}}\right) \forall i=\mathrm{1,2},\dots ,m$$
(14)

Table 4 summarizes recent WPP studies and their adopted learning schemes. Next, we will conduct a comprehensive review on four components, \({f}_{sp}(\cdot )\),\({f}_{c}\left(\cdot \right)\), \({f}_{fe}(\cdot )\), and \({f}_{o}(\cdot )\), of WPP methods.

Table 4 Learning schemes in DL-WPP model development

3.2 Signal processing, clustering, feature engineering and optimization components

3.2.1 Signal processing component

In WPP studies, time series inputs are typically treated as a signal. Therefore, advanced signal processing methods may better capture the patterns of the input time series from different perspectives. As shown in Table 5, there are four types of signal processing methods, the frequency-based methods, mode decomposition-based methods, singular spectrum analysis-based methods, and combined methods.

Table 5 Signal processing methods
3.2.1.1 Frequency-based methods

The frequency-based methods aim to study the raw signal in different frequency domains for improving WPP performance. The Fourier transform (Zhou et al. 2022a, b) and wavelet transform (Catalão et al. 2010; Ahn and Hur 2023; Nascimento et al. 2023; Zhang et al. 2022a, b; Chi and Yang 2023; Aly 2022) are two most widely applied frequency-based methods. The Fourier transform decomposes the input signal into frequency components, which are represented as the sum of sine and cosine of different frequencies. In comparison, the wavelet transform decomposes the input signal into wavelets, which are obtained via shifting and scaling a continuously differentiable wavelet function. To enhance the performance of wavelet transform methods, some variants, such as wavelet packet decomposition (WPD) (Zu and Song 2018; Meng et al. 2016) and empirical wavelet transform (EWT) (Yan et al. 2020; Liu et al. 2018a, b, c), have been proposed.

  • Mode decomposition-based methods: The mode decomposition-based methods including the VMD (Abedinia et al 2020) and EMD (Abdoos 2016) aim to decompose the input signal into several intrinsic mode functions and the residual.

  • Singular spectrum analysis-based methods: singular spectrum analysis (SSA) (Zhang et al. 2019a) aims to obtain spectrum information on the input signal via singular value decomposition on trajectory matrix of the input time series.

  • Secondary methods: Secondary methods (Wu et al. 2020a) combine multiple series decomposition methods to improve the efficiency of WPP.

The signal processing component can also serve filtering out the noises (Saffari et al. 2021; Zhang et al. 2022a; Wang et al. 2023). In (Peng et al. 2020), the wavelet transform is utilized to effectively denoise the original signals, while avoiding distortion and information loss to some extent, The mother wavelet in the wavelet transform can be scaled and time-shifted by the scale factor and the time-shifting factor, producing a series of sub-wavelets to extract the target features under different resolutions.

3.2.2 Clustering component

Clustering methods aim to extract intrinsic information from wind data by dividing the input data into groups based on their similarity. Each group is then processed by a different feature engineering module. In the WPP literature, K-means (Wang et al. 2018), C-means (Yang et al. 2021a), and expectation maximization (Liu et al. 2018a) are the most widely used clustering algorithms. Time series clustering methods, such as the K-shape algorithm (Liu et al. 2021d), may also improve WPP performance by directly analyzing the similarity among the time series.

3.2.3 Optimization component

Optimization algorithms are applied to attain the best architecture of the deep learning model, which includes determining the number of neurons and layers. The grid search (GS) algorithm (Liu et al. 2021d) is the simplest and most widely used algorithm. In the GS algorithm, a set of candidate architectures are defined based on domain knowledge or preliminary trials, and the best architecture is selected based on its performance on the validation set.

However, the performance of the GS algorithm is highly dependent on expert knowledge, as the number of candidates is usually limited to reduce computational cost. To explore a larger solution space, meta-heuristic algorithms, such as the GA (Liu et al. 2021a), PSO (Ma et al. 2014), shark smell optimization (SSO) (Abedinia et al. 2020), atomic search (AS) (Li et al. 2020a, b, c), CS (Li et al. 2021), clonal selection algorithm (CSA) (Chitsaz et al. 2015), crisscross optimization (CSO) (Yin et al. 2017), dragonfly algorithm (DA) (Shi et al. 2017a, b), sparrow search (SS) (Abdoos 2016), and GWO (Lu et al. 2020), have been proposed. Table 6 summarizes the optimization algorithms used in WPP studies.

Table 6 Optimization algorithms

3.2.4 Feature engineering component

One significant advantage of deep learning is the ability of automating and adaptively learning a low-dimensional embedding, which is a set of latent features representing the raw high-dimensional input. Many deep learning methods can simultaneously serve the feature extraction and the prediction in WPP modeling. The feature extraction module \({f}_{fe}(\cdot )\) aims to extract informative latent features from the input. The prediction module \({f}_{p}(\cdot )\) aims to model the mapping from extracted features to the wind power output. Therefore, the entire process including the feature engineering to the prediction can be expressed by Eqs. (15) and (16),

$$\widehat{y}={f}_{fe}\left(z\right)={f}_{p}({z}_{f})$$
(15)
$${z}_{f}={f}_{f}(z)$$
(16)

where \(z\) is the input of the feature engineering module. It can be either the original input \(x\) or the subseries \(s\).

Next, we will review a total of 28 deep learning models grouped into 8 types as listed in Table 7. Among these models, fully connected neural networks can be used for both the feature extraction and prediction modules. Probabilistic output models can only be used for the prediction module. Deep learning models including autoencoders, convolutional networks, recurrent networks, etc., are usually regarded as the feature extraction module, which learn a low-dimensional representation from the input to be fed into an additional regression layer for generating predictions.

Table 7 Deep learning models

3.2.5 Fully connected neural networks

Figure 4 illustrates the structure of fully connected neural networks, which consist of three components: an input layer, one or more hidden layers, and an output layer. The input \(z\) or \({z}_{f}\) is initially flattened to a one-dimensional vector \({z}_{1}\), which is fed into the neural network. It is processed by \(n\) neural layers including n−1 hidden layers and one output layer. Finally, results are generated from the output layer. The last hidden layer provides the extracted features.

Fig.4
figure 4

Fully connected neural networks

Let \({z}_{i}\) denote the output of the \({i}^{th}\) layer, which is also the input of \({\left(i+1\right)}^{th}\) layer, the formulation of fully connected neural network is described in Eqs. (17)\(-\)(19),

$${z}_{i+1}={f}_{a}\left({W}_{i}{z}_{i}+{b}_{i}\right) \forall i=\mathrm{1,2},\dots ,n-1$$
(17)
$${z}_{f}={f}_{f}(z)={z}_{n-1}$$
(18)
$${f}_{fe}\left(z\right)={f}_{p}\left({z}_{f}\right)={z}_{n}$$
(19)

where \({W}_{i}\) is a learnable matrix, \({b}_{i}\) is a learnable vector, and \({f}_{a}\) is an activation function.

In WPP studies, the output layer of FNN is designed to provide an accurate prediction of the wind power \(\widehat{y}\). Therefore, the parameters \({W}_{i}\) and \({b}_{i}\) are obtained by minimizing the error between \(\widehat{y}\) and actual output \(y\). Let \(z{\prime}\) and \(y{\prime}\) be the input and output of the training set respectively, and \(\widehat{y}{\prime}\) as the prediction. The parameters \({W}_{i}\) and \({b}_{i}\) of FNN can be formulated as Eqs. (20) and (21),

$${\widehat{y}}{\prime}={f}_{FNN}({W}_{1},{b}_{1},\dots ,{W}_{n},{b}_{n};z{\prime})$$
(20)
$${W}_{1},{b}_{1},\dots ,{W}_{n},{b}_{n}={\text{argmin}}L({\widehat{y}}{\prime},y{\prime})$$
(21)

where \({f}_{FNN}\) is the FNN model serving as either the feature extraction or prediction module, \(L({\widehat{y}}{\prime},y{\prime})\) is a pre-defined loss function. Let \({N}_{t}\) be the number of the instances in the training set. Common loss functions used are provided as follows, where MSE is the most frequently applied one.

  • MSE: \(L\left({\widehat{y}}{\prime},y{\prime}\right)=\frac{1}{{N}_{t}}{\sum }_{i=1}^{{N}_{t}}{\left({\widehat{y}}_{i}{\prime}-{y}_{i}{\prime}\right)}^{2}\)

  • MAE: \(L\left({\widehat{y}}{\prime},y{\prime}\right)=\frac{1}{{N}_{t}}{\sum }_{i=1}^{{N}_{t}}|{\widehat{y}}_{i}{\prime}-{y}_{i}{\prime}|\)

  • Huber Loss Function: \(L\left({\widehat{y}}{\prime},y{\prime}\right)=\frac{1}{{N}_{t}}{\sum }_{i=1}^{{N}_{t}}\left\{\begin{array}{c}\frac{1}{2}{\left({\widehat{y}}_{i}{\prime}-{y}_{i}{\prime}\right)}^{2} (if \left|{\widehat{y}}_{i}{\prime}-{y}_{i}{\prime}\right|>1)\\ \left|{\widehat{y}}_{i}{\prime}-{y}_{i}{\prime}\right|-\frac{1}{2} (Otherwise)\end{array}\right.\)

Apart from the generic DNN (Methaprayoon et al. 2007; Abedinia et al. 2020), some variants of the FNN, such as restricted Boltzmann machine (RBM) (Peng et al. 2016), deep belief network (DBN) (Wang et al. 2018; Zhang et al. 2019a) and extreme learning machine (ELM) (Yin et al. 2017; Ding et al. 2020), are proposed to enhance the prediction accuracy. The fuzzy neural network (Khodayar et al. 2022; Bilal et al. 2023; Qiao et al. 2022; Xu et al. 2022a; Li et al. 2020a), which utilizes fuzzy influence techniques to determine the values of the neurons, has also received discussions. The Fuzzy NN is well-known for its effectiveness in tackling the uncertainty in the SCADA data, making them well-suited for WPP with incomplete information or ambiguous data.

3.2.5.1 Autoencoders

The generic autoencoder (AE) has the same structure as the FNN shown in Fig. 4. AEs aim to encode the input \(z\) into a latent representation \({z}_{f}\) using the information of \(z\) itself. Hence, different from FNN, the output layer of AE targets to output the reconstruction of the input \(z\), and the values of the hidden nodes are regarded as the feature \({z}_{f}\). The parameters \({\theta }_{AE}\) of AE are inferred via minimizing the reconstruction loss between the input \(z\) and its reconstruction \(\widehat{z}\). The formulation of AE is provided in Eqs. (22) and (23),

$${\widehat{z}}{\prime}=W{f}_{AE}\left({\theta }_{AE};{z}{\prime}\right)+b$$
(22)
$$W,b,{\theta }_{AE}={\text{argmin}}L(\widehat{z}{\prime},z{\prime})$$
(23)

where \({f}_{AE}\) is the AE model, which serves as the feature extraction module.

Recently, a few variants of the AE have been proposed to improve the performance of WPPs:

  • Stacked AE (SAE): In study (Wang et al. 2021a, b, c, d), the AE model is improved by stacking multiple AEs together. The first AE takes the original input \(z\) and outputs the latent feature \({z}_{1}\). The subsequent \({i}^{th} (i>1)\) AE takes the output \({z}_{i-1}\) of \({(i-1)}^{th}\) AE as the input and produces the latent feature \({z}_{i}\).

  • Sparse SAE: In study (Yin et al. 2021), a Sparse SAE is proposed to learn more concise features by introducing a sparse penalty term into the loss function of AE, is proposed. Let \({\overline{z} }_{i}\) denote the average value of \({z}_{i}\), and \(\rho\) denote a sparse parameter, which is set to a small number near 0. The sparse penalty term is defined as the Kullback–Leibler (KL) divergence of \(\rho\) and \({z}_{i}\), \(KL\left(\rho ||{\overline{z} }_{i}\right)=\rho {\text{log}}\left(\frac{\rho }{\overline{{z }_{i}}}\right)+\left(1-\rho \right){\text{log}}(\frac{1-\rho }{1-{\overline{z} }_{i}})\).

  • In study (Li et al. 2020a, b, c), rough neurons in the SAE are introduced to address the uncertainty of the wind. Different from generic AEs, the output \({z}_{i}\) is determined using the rough set theory.

3.2.5.2 Probabilistic output models

In the literature, four probabilistic output models for generating probabilistic outputs are observed: QR, LUBE, KDE, and MDN. The descriptions of these models are provided as follows.

  • QR: The QR aims to directly estimate the quantile with neural networks. In QR, the best parameters of NNs \({\theta }_{QR}\) can be obtained by minimizing the negative skill score.

  • LUBE: The LUBE model aims to directly estimate the quantile with neural networks, which is usually trained by minimizing the CWC metric.

  • KDE: KDE methods attempt to estimate the distribution of wind power, which is modeled by a probability density function.

  • MDN: The MDN method mixes multiple PDFs to allow sufficient flexibility in modeling the wind power distribution. Usually, the Gaussian mixture model (GMM) is adopted to model the PDF because of its simplicity and convenience for sampling and computing the distribution. However, the GMM may lead to density leakage problems in the mixture model. Recent studies (Zhang et al. 2020a, b, c) addressed this issue by replacing the GMM with the beta kernel. The KDE and MDN models are usually optimized by maximizing the likelihood of the distribution. One recent study (Yang et al. 2021a, b) further improved the training process by using a Wasserstein distance-based adversarial learning algorithm.

3.2.5.3 Convolutional neural networks

CNNs (Liu et al. 2019b, a; He et al. 2020) are known for their shift invariance, meaning they can detect objects equally well regardless of their locations in the input. CNNs consist of convolution layers and pooling layers. In each convolution layer, a learnable kernel \(g\) is used to learn local features of the data. Typically, a pooling layer is used immediately after each convolution layer to aggregate the local information of the output of the convolution layer and select the most concise and efficient features. The general formulation of CNN is provided in Eqs. (24)–(26),

$${f}_{f}(z)={z}_{n+1}$$
(24)
$$z_{{i + 1}} = Pool\left( {g_{i} *z_{i} } \right)\,\forall i = {\text{1}},{\text{2}}, \ldots ,n$$
(25)
$${z}_{1}=z$$
(26)

where \(n\) is the number of convolution and pooling layers, \({g}_{i}\) is the kernel of the \({i}^{th}\) convolution layer, \(*\) denotes the convolution operator, and \(Pool\left(\cdot \right)\) is a pooling function. Depending on the type of CNN used (1DCNN, 2DCNN, or 3DCNN), different types of convolution operators and pooling functions are utilized in WPP studies.

3.2.5.4 1-dimensional CNN (1DCNN)

As shown in the top part of Fig. 5, 1DCNN takes a one-dimensional vector as input. If the original input is multidimensional, it should be flattened to a vector before being fed to the 1DCNN. In 1DCNN, 1-dimensional convolution operator and pooling functions are utilized. The formulation of 1D convolution operator is provided in Eq. (27), where \(k\) is the size of the kernel. Common pooling functions including the max pooling and average pooling are described in Eqs. (28)–(29) and Eqs. (30)–(31), respectively, where \(l\) describes the size of pooling kernel.

$$\left( {g * z} \right)_{p} = \mathop \sum \limits_{{j = p - \frac{k}{2}}}^{{p + \frac{k}{2}}} g_{{j - p - \frac{k}{2}}} z_{j}$$
(27)
Fig. 5
figure 5

1D and 2D convolutional neural networks

$$Maxpool\left(z\right)=\widetilde{z}$$
(28)
$${\widetilde{z}}_{p}=\underset{j=p\times l,p\times l+1,\dots ,p\times l+l-1}{{\text{max}}}{z}_{j}$$
(29)
$$Avgpool\left(z\right)=\overline{z }$$
(30)
$${\overline{z} }_{p}=\frac{1}{l}{\sum }_{j=p\times l}^{p\times l+l-1}{z}_{j}$$
(31)
3.2.5.5 2-dimensional CNN (2DCNN)

As shown in the bottom part of Fig. 5, the 2DCNN expands 1DCNN to a 2-dimensional grid with considering the input of a 2-dimensional matrix form. If the original input is three dimensional, \(z\in {\mathbb{R}}^{T\times P\times {N}_{feat}}\), where \(T\) is the length of time series, \(P\) is the number of wind turbines, and \({N}_{feat}\) is the number of features in each time step and wind turbine of input \(z\), a common organization is to regard the time steps as different input channels, and the features of different wind turbines are organized as a matrix.

3.2.5.6 3-dimensional CNN (3DCNN)

The 3DCNN is typically used to extract spatial–temporal features from the 3-dimensional input \(z\in {\mathbb{R}}^{T\times P\times {N}_{feat}}\). The definition of 3DCNN is similar to that of 1DCNN and that of 2DCNN.

To obtain the best parameter estimation of the kernels \(g\), the extracted features \({f}_{CNN}\left(z\right)={z}_{n+1}\) are usually transformed into the prediction via an FNN. The loss \(L(\cdot ,\cdot )\) between the prediction and actual wind power in the training set is applied to optimize \(g\), as shown in Eqs. (32) and (33).

$$\widehat{y}{\prime}={f}_{FNN}({f}_{CNN}\left(z\right))$$
(32)
$${g}_{1},{g}_{2},\dots ,{g}_{n}={\text{argmax}}L({\widehat{y}}^{\mathrm{^{\prime}}},y)$$
(33)
3.2.5.7 Recurrent neural networks

As shown in Fig. 6, RNNs are efficient models for processing 2-dimensional multi-variate time series data \(z\in {\mathbb{R}}^{T\times {N}_{feat}}\) across \(T\) time steps. In each step, the RNN takes \({z}_{t}\), the \({t}^{th}\) time step of \(z\), and the last hidden state \({h}_{t-1}\) as inputs and outputs the current hidden state \({h}_{t}\). The general formulation of RNN is provided in Eqs. (34) and (35).

Fig. 6
figure 6

Recurrent neural networks

$${f}_{f}\left(z\right)=[{h}_{1},{h}_{2},\dots ,{h}_{T}]$$
(34)
$${h}_{t}=RNN({h}_{t-1}, {z}_{t}) \forall i=\mathrm{1,2},\dots ,T$$
(35)

where \({h}_{0}\) is pre-defined vector usually set to a vector of zeros. As the vanilla RNN encounters the gradient vanishing and explosion issues, variants of RNN, such as the GRU and LSTM, as well as an advanced development on top of RNN, such as the attention mechanism, BiLSTM, and ConvLSTM, more frequently appear in WPP studies.

 

  • GRU: In each time step \(t\), the GRU (Tian et al. 2022) utilizes a reset gate \({r}_{t}\) and an update gate \({u}_{t}\) to control the hidden state \({h}_{t}\). The reset gate \({r}_{t}\) determines whether the last state \({h}_{t-1}\) is considered or reset to a new state, and the update gate \({u}_{t}\) determines whether \({h}_{t}\) is updated by the new input \({z}_{t}\) or remains the old value \({h}_{t-1}\)

  • LSTM: Similar to GRU, the LSTM (Neshat et al. 2021) uses an input gate \({i}_{t}\), a forget gate \({f}_{t}\) and an output gate \({o}_{t}\) to control the input, forget and output process of the hidden state \({h}_{t}\).

  • BiLSTM: The BiLSTM (Jahangir et al. 2020; Huang et al. 2022) is an efficient variant of LSTM. Different from the generic LSTM, which only considers the past information from \({h}_{t-1}\), the BiLSTM takes into consideration of both past and future information in each time step.

  • Attention Mechanism: The gradient of the RNN models accumulates across all time steps, potentially leading to gradient vanishing and gradient exploding issues when the length of the input time series is long. To alleviate such issues, the attention mechanism (Yang and Zhang 2021; Ren et al. 2022; Zhang et al. 2023) is introduced with considering all of the hidden states \({h}_{1},{h}_{2},\dots ,{h}_{t-1}\) in the past time steps.

  • ConvLSTM: As shown in Fig. 7, the ConvLSTM (Wilms et al. 2021) is a model to extract spatial–temporal patterns by leveraging convolution operations to modeling spatial correlations and the LSTM for learning temporal patterns. The ConvLSTM takes a four-dimensional input \(z\in {\mathbb{R}}^{W\times H\times T\times {N}_{feat}}\), which is placed on a \(W\times H\) grid according to the distribution of wind turbines. The formulation of ConvLSTM is provided in Eqs. (36)\(-\)(42).

    $${f}_{ConvLSTM}\left(z\right)=[{h}_{1},{h}_{2},\dots ,{h}_{T}]$$
    (36)
    $${i}_{t}=sigmoid\left({W}_{zi}*{z}_{t}+{W}_{hi}*{h}_{t-1}+{W}_{ci}\circ {c}_{t-1}+{b}_{i}\right) \forall t=1, 2,\dots ,T$$
    (37)
    $${f}_{t}=sigmoid\left({W}_{zf}*{z}_{t}+{W}_{hf}*{h}_{t-1}+{W}_{cf}\circ {c}_{t-1}+{b}_{f}\right) \forall t=1, 2,\dots ,T$$
    (38)
    $${g}_{t}=tanh({W}_{zg}*{z}_{t}+{W}_{hg}*{h}_{t-1}+{b}_{g}) \forall t=1, 2,\dots ,T$$
    (39)
    $${c}_{t}={f}_{t}\circ {c}_{t-1}+{i}_{t}\circ {g}_{t} \forall t=1, 2,\dots ,T$$
    (40)
    $${o}_{t}=sigmoid\left({W}_{zo}*{z}_{t}+{W}_{ho}*{h}_{t-1}+{W}_{co}\circ {c}_{t-1}+{b}_{o}\right) \forall t=1, 2,\dots ,T$$
    (41)
    $${h}_{t}={o}_{t} tanh({c}_{t}) \forall t=1, 2,\dots ,T$$
    (42)

    where \({f}_{ConvLSTM}(\cdot )\) is the ConvLSTM model, \({W}_{zi}, {W}_{hi}, {W}_{zf}, {W}_{hf}, {W}_{zg}, {W}_{hg}, {W}_{zo}, {W}_{ho}\) are learnable kernels, and \({W}_{ci}, {W}_{cf},{{W}_{co},b}_{i},{b}_{f},{b}_{g},{b}_{o}\) are learnable vectors.

Fig. 7
figure 7

Illustration of ConvLSTM

However, in WPP tasks which faces the sparse distribution of wind turbines, the large portion of blank values may degrade the efficiency and performance of the model. In such cases, graph-based models are better choices for extracting spatial features.

3.2.5.8 Graph-based neural networks

Graph based neural networks leverage the graph that represents the geographical correlation of the wind turbines to extract the spatial features. As shown in Fig. 8, GNN is a generic graph-based neural network. In GNN, the input \({z}_{i}\) in each wind turbine \(i\) is transformed to the feature space via a DNN. The feature is then concatenated to form a matrix \({M}_{z}\), which is multiplied by the graph matrix \(G\), and activated by an activation function \({f}_{a}\). The general formulation of GNN is provided in Eqs. (43)–(45).

$${f}_{GNN}\left(z\right)={z}_{f}$$
(43)
$${z}_{f}={M}_{z}G$$
(44)
$${M}_{z}=[{f}_{DNN}\left({z}_{1}\right),{f}_{DNN}\left({z}_{2}\right),\dots ,{f}_{DNN}\left({z}_{P}\right)]$$
(45)

where \({f}_{GNN}\left(\cdot \right)\) represents the GNN model.

Fig. 8
figure 8

Illustration of GNN

The structure of GNN can be expanded to extract spatial–temporal features of the input \(z\). There are two types of improvements to achieve this goal.

The first improvement is replacing the DNN with RNN. By this mean, the temporal features of each wind turbine are first extracted by the RNN, and the spatial–temporal features are finally provided by the GNN.

The second improvement is to reorganize the input \(z\) as a time series. In each time step, the GNN is used to extract the spatial feature. These features are finally concatenated together and processed by an RNN to obtain the spatial–temporal features.

Graph convolutional neural network (GCN) is an efficient variant of GNN. Unlike the original GNN, GCN utilizes a graph Laplacian \(L\) instead of \(G\), which is formulated as Eq. (46).

$$L={I}_{P}-{D}^{-\frac{1}{2}}G{D}^{-\frac{1}{2}}$$
(46)

where \({I}_{P}\) is an identity matrix with order \(P\), the \(D\) is a diagonal matrix formulated as Eq. (47).

$$D_{{ij}} = \left\{ {\begin{array}{*{20}c} {0(i \ne j)} \\ {\sum\limits_{{j = 1}}^{P} {G_{{ij}} (i = j)} } \\ \end{array} } \right.$$
(47)
3.2.5.9 Self-attention-based neural networks

Self-attention-based neural networks are constructed based on the attention mechanism, which is shown in Fig. 9. Transformer (Tian et al. 2022) is a generic self-attention-based neural network that has been designed to extract temporal features of the input. In Transformer, the input \(z\in {\mathbb{R}}^{T\times {N}{\prime}}\) is regarded as a time series with \(T\) time steps. Here, \({N}{\prime}={N}_{feat}\), if only data of the target turbine are considered, and having \({N}{\prime}=P\times {N}_{feat}\), if data from neighboring \(P\) wind turbines are considered. To learn the relationship between time steps, the input \({z}_{t}\) of each time step \(t\) is first transformed to the query, key and value vectors \({Q}_{t},{K}_{t}, {V}_{t}\in {\mathbb{R}}^{T\times {d}_{h}}\), where \({d}_{h}\) is the number of hidden dimensions, via three different DNNs respectively. The vectors \({Q}_{t},{K}_{t}, {V}_{t} (t=\mathrm{1,2},\dots ,T)\) are concatenated to form multivariate time series \(Q, K, V\). The attention \(A\) of the input is evaluated as the similarity between \(Q\) and \(K\). Finally, the features are produced by the activation value of multiplication of the \(A\) and \(V\). The formulation of Transformer is provided in Eqs. (48)–(51).

$${f}_{Trans}\left(z\right)={z}_{f}={f}_{a}(AV)$$
(48)
$$A=Sim(Q,K)$$
(49)
$$Q,K,V=\left[{Q}_{1},{Q}_{2},\dots ,{Q}_{T}\right],\left[{K}_{1},{K}_{2},\dots ,{K}_{T}\right],\left[{V}_{1},{V}_{2},\dots ,{V}_{T}\right]$$
(50)
$${Q}_{t},{K}_{t},{V}_{t}={f}_{DNN,Q}\left({z}_{t}\right),{f}_{DNN,K}\left({z}_{t}\right),{f}_{DNN,V}\left({z}_{t}\right) \forall t=\mathrm{1,2},\dots ,T$$
(51)

where \({f}_{Trans}\left(z\right)\) is the Transformer model, \({f}_{DNN,Q}\), \({f}_{DNN,K}\), \({f}_{DNN,V}\) are three DNN models. \(Sim(\cdot ,\cdot )\) is a similarity function.

Latest WPP studies have explored the performance of Transformer-based models, such as the Informer (Zhou et al. 2021; Nascimento et al. 2023; Huang et al. 2022), Autoformer (Wu et al. 2021), Pyraformer (Liu et al. 2021b), and Fedformer (Zhou et al. 2022a, b; Deng et al. 2022). Studies observed that the transformer-based models could improve WPP efficiency by sampling a set of informative queries and keys with incorporating signal processing methods. Additionally, the DLinear model (Zeng et al. 2023) reported the performance improvement of WPPs through a simpler attention scheme. In DLinear, the attention mechanism was defined as a linear transformation from the historical time steps to the future time steps.

Fig. 9
figure 9

Illustration of self attention mechanism

3.3 Prediction performance analyses

In this section, we first conduct an analysis based on results of existing studies to consolidate views of the WPP improvement brought by DL-WPP methods comparing with traditional ones as reported in Table 8. Next, to horizontally compare the effectiveness of developed DL models in WPP tasks, we conduct a computational experiment replicating famous and recent DL models based on our collected wind farm SCADA datasets.

Table 8 RMSE improvements of deep learning methods

The performance improvement of the developed DL-WPP method over the best-performed traditional machine learning WPP method is analyzed article-wise based on results reported of considered articles. Results of such analytics are reported in Table 8. In each article, the DL configuration of the reported DL-WPP method is analyzed and the best performed classical machine learning based WPP method is identified. As the Root Mean Square Error (RMSE) is the only metric that simultaneously utilized in all considered articles, RMSE values of the developed DL-WPP method and the best-performed traditional WPP method are retrieved and the RMSE improvement percentage is computed for each article. Analytical results in Table 8 revealed that existing studies unanimously observed the improvement generated by DL models and RMSE improvement could range from 2.5% to 87.68% across studies. Such significant variation can be caused by multiple reasons. First, the WPP task setups across studies differ in terms of considered prediction horizons and targets. Secondly, results of reported WPP methods were examined based on datasets collected by different research groups, which might possess completely different wind patterns. Moreover, the coverage of benchmarking models applied is different in studies reported in Table 8. Although we can conclude the performance improvement based on each work in Table 8, it is difficult to derive a fair conclusion via a horizontal comparison of DL-WPP methods developed in existing studies due to previously mentioned three reasons.

To discover more meaningful insights, in this work, we would like to further verify the effectiveness of recent DL-WPP method development via reproducing and comparing latest DL-WPP models based on our SCADA datasets collected from three commercial wind farms, which cover a larger population of wind turbines and more recent samples. Specific descriptions of three datasets are offered in Table 9.

Table 9 Dataset Descriptions

Since deterministic short-term WPP is most frequently studied, we consider the wind turbine power output prediction with horizons ranging from 10 to 60 min in this computational experiment. Meanwhile, processed data are divided into the training set, validation set, and test set respectively with a 0.6:0.2:0.2 split ratio. The commonly considered RMSE metric is employed to evaluate the WPP performance of developed models. An extended set of promising DL models including DLinear, Informer, Transformer, CNN, LSTM, GRU and DNN are considered in this further computational experiment based on recent studies (Zhou et al. 2022a, b; Zhou et al. 2021; Zeng et al.2023).

Results of our computational experiments are reported in Table 10. It is observable that the DLinear model significantly outperforms other candidates on most datasets. Meanwhile, we also identify that, in more than 69% test cases, DLinear obtains 2.0% improvements compared with the best performed one from other considered DL models. The performance of Informer, Transformer, and GRU models are also promising, which are only 2.0%, 3.1% and 3.9% worse in average than that of the DLinear, respectively. Comparing the performance on different datasets, the DLinear model is 7.6%, 9.4% and 2.8% better than the candidates on average. Thus, the DLinear model may be a better option for achieving the state-of-the-art performance in the considered short-term WPP task by comparing with other DL models.

Table 10 RMSE (KW) of deep learning models for short-term WPP

4 Promising trends for future deep learning-based WPP studies

In future WPP studies, DL techniques will take an increasingly important role with the rapid development and advancement. Although existing WPP studies have achieved promising performance in WPP, the WPP can be further improved by addressing the following limitations and issues.

  • Inputs: Although spatial–temporal correlations are already considered in some studies, most of them are based on the self-correlation in the data and the locations of the wind farm sites. Some important factors, such as the wake effect and topological information, can be considered to improve the performance. In (Park and Park 2019), a Physics-informed graph network (PIGN) has been developed to attack such issues and promising results in the wind power estimation have been reported. More advanced mechanisms can be developed to bring more values into the WPP tasks.

  • Features: Existing deep learning-based WPP studies engineer the WPP features using well-designed models, which highly depend on the training data. More robust and reliable WPP features are required to improve the WPP performance.

  • Models: With rapid development of the deep learning techniques, especially in the natural language processing and computer vision, the deep models evolve at a fast pace. Apart from the advanced deep models in other research communities, we can also design advanced models specific to WPP tasks based on our domain knowledge.

  • Evaluation: RMSE and MAE are the most widely used metrics in recent studies. However, such metrics only present the error in average, which are inadequate in real grid applications. More specific metrics are required to evaluate the WPP performance specific to certain downstream grid operations.

Next, we present four promising research trends for applying DL techniques in WPP.

4.1 More advanced design of WPP input organization

In the WPP input organization, we observe three main trends, the incorporation of geographical information, the privacy-preserving data sharing paradigm, and the usage of multiple resolution data.

Geographical information is a crucial factor in physics-based methods (Landberg 1999). Current WPP studies (Khodayar and Wang 2018) have utilized the location of the wind turbines to depict their distribution and improve WPP performances. Other geographical information, such as topography and roughness, could also be utilized to better represent the actual geographical situations of wind turbines and wind farms.

Privacy-preserving data sharing is another important aspect in WPP studies. When data are distributed across different wind farms, sharing them may raise safety and privacy concerns in situations where WPP methods and service providers involve external entities. Moreover, wind power units from different power plants located in different regions or countries may not always be accessible due to various imposed regulations. To address these issues, (Liu and Zhang 2022a, b) proposed a bi-party data-driven modeling framework to learn the spatial–temporal features in different wind farms while preserving privacy. More methods with advanced privacy-preserving schemes for various WPP modeling tasks could be developed to further enhance the performance of WPP.

Most existing WPP studies have considered the sampling interval of SCADA data to be the same as the desirable WPP resolution. However, a recent study (Liu and Zhang 2022a) has discovered that the usage of multiple sampling resolution data may significantly improve WPP performance. It is also interesting to investigate whether multiple resolution data in spatial dimension can enhance WPP. In other words, it may be possible to utilize turbine-level SCADA data to predict farm-level wind power output.

In summary, the WPP input organization plays a crucial role in determining the amount of information conveyed to WPP tasks. Therefore, further studies should be conducted to investigate advanced WPP input organizations, providing better references for predictions.

4.2 Identifying WPP features Benefiting Domain Generalization

As stated in Sect. 3, the input space considering spatial temporal correlation is \({\mathbb{R}}^{T\times P\times {N}_{SCADA}}\), which is extremely large. In applying DL methods into WPPs, learning latent representations from the overwhelming input space may be too specific towards a particular WPP task and dataset, resulting in the lack of generalizability. To alleviate this issue, the ML field has presented a study trend on identifying a subset of meaningful features which possesses the causal relationship towards the concerned output. These features are considered as ones helping the model domain generalizability.

To identify a subset of latent features beneficial to the domain generalization, several deep learning methods (Arjovsky et al. 2019) utilize the invariant risk minimization (IRM), which is a learning scheme from the causal learning paradigm that optimizes the loss function under different environments. This approach can identify causal features that are robust to shifts in the environment. Therefore, it is promising to apply such a technique to WPP studies due to the high volatility of wind and the high-dimensional attributes after considering the spatial–temporal-system dynamics correlations. It is foreseeable that more WPP studies will consider the domain generalization issue into the WPP feature engineering, which seem to be scarce in the current literature.

4.3 Efficient feature engineering models

Currently, various complex models, including CNN, RNN, and transformer-based models, are commonly used in WPP studies. However, these models cannot be considered superior to others in terms of the best performance in all situations. For instance, one study (Zeng et al. 2023) showed that a simple linear model outperformed most of the complicated transformer-based models when predicting a long sequence. However, the reason behind this observation is still unclear. More studies are required to scientifically identify the optimal WPP methods for different WPP tasks.

Physics-informed WPP is also an interesting direction. Currently, physicals-informed techniques (Gijón et al. 2023; Wu et al. 2023; Tartakovsky et al. 2023; Pombo et al. 2022) are commonly utilized in expanding the dataset to obtain richer information. Because of the intrinsic relationship among the system dynamics in SCADA data, it is also promising to utilize physical principles for guiding the model design. Physics and domain knowledge can be leveraged to better govern the network design and prediction performance to enable a better generalization in WPPs (Lagomarsino-Oneto et al. 2023). With such development, it is possible to obtain a more reliable and robust prediction. In (Park and Park 2019), a PIGN has been developed to attack such issues and promising results in the wind power estimation have been reported. More advanced mechanisms can be developed to bring more values into the WPP tasks based on PINN.

4.4 Evaluation of the WPP performance in different scenarios

As stated in Sect. 2, deterministic WPP studies commonly utilized RMSE and MAE based metrics to evaluate the WPP performance. These metrics merely evaluate the difference between the prediction and actual power output. All instances are equally treated in these metrics. However, in some power grid operations, there are three additional requirements for wind power predictions.

  • Concentration on the high wind power output instances: It is crucial to concentrate on high wind power output instances as it is more challenging for power systems to operate during periods of high wind power. Although predicting peaks is difficult, predicting high wind power output instances in WPPs should receive more attention.

  • Prevention of adverse effect: It is important to prevent adverse effects which occur when the WPP prediction decreases while the actual WPP output increases, causing undesired operations in the power system. Therefore, more effective metrics need to be developed and utilized to evaluate the WPP performance, taking into account these requirements and other factors in power system operations.

  • Wind power ramp consideration: The wind power ramp events possess a great potential of facilitating WPP models to address impacts brought by sudden large wind changes. Hence, it is of great practical value to study WPP methods together with better utilizing the historical wind power ramp pattern and with integrating more effective wind power ramp predictions.

Overall, these requirements highlight the need for more research in developing effective WPP methods and metrics that can meet the diverse needs of power grid operations.

5 Conclusions

This paper provided a comprehensive review of recent deep learning development in WPPs. It covered more than 140 recent WPP studies with advanced deep learning from two perspectives: WPP formulations and WPP methods.

First, different developments in WPP formulations including the new designs of inputs, new settings of outputs, and new application of evaluation metrics reported in recent WPP studies were summarized. The evolution of input designs was propelled by the broader availability of input sources and an increased interest in leveraging high-dimensional data for WPPs. Early data-driven studies mostly considered one source of input, either historical SCADA or NWP data. Subsequently, the input design evolved to jointly consider data sources and model the spatial influence to convey richer information in modeling. Moreover, the scope of input data considered expanded from a targeted wind turbine or wind farm site to include a group of neighboring sites. This change enabled a more comprehensive analysis of the influence of spatial–temporal data patterns on WPP modeling. To efficiently utilize data information under such setting, new data organization strategies including stacking, projection, and graph modelling were presented. To cope with incorporating more relevant attributes and enriching the WPP input, physics-informed methods were also employed to augment data and subsequently improve the accuracy based on a joint consideration of relevant knowledge. Meanwhile, in the WPP model output setting, the distribution of the power output at different levels was considered to provide more value prediction outcomes and to enable the quantification of future power output uncertainty. Recent studies also explored the link between the wind power ramp and WPP tasks, which aimed at coping with the impact of sudden large wind changes on WPP performances.

Next, a comprehensive review of the deep learning-based modeling methods for WPPs was conducted. Based on the process of converting inputs to wind power prediction, most of presented deep learning based modeling frameworks could be decomposed into four components, the signal processing component, clustering component, feature engineering component, and optimization component. It was observed that many advances in WPP modeling facilitated by deep learning were seen in the feature engineering component, mainly leveraging higher dimensional information and engineering low-dimensional but representative latent features for attaining better WPP performances. To conduct a more in-depth review of the development of feature engineering techniques, a total of 8 groups of 28 state-of-the-art deep learning models were compared. The FNN model was one basic but well-known NN-based option. Via stacking multiple neural layers, the FNNs were able to transform the input to a set of latent features representing the useful information beneficial to the WPP task. Unlike FNNs, the AE models focused on extracting the latent representation based on the information of input itself. To address the uncertainty of the future wind power, probabilistic models were studied to provide the quartiles, confidence intervals, or distributions instead of a spot value of future wind power in WPP. Convolutional models and recurrent models were typically designed to extract the local spatial features and temporal features of the high-dimensional input data, respectively. To jointly analyze the spatial and temporal patterns, the ConvLSTM was proposed via integrating the advantages of convolution and recurrent models. However, the input data of ConvLSTM needed to be projected into a two-dimensional grid, which could be inefficient with a sparse distribution of the wind turbines or wind farms. To efficiently analyze the spatial patterns of wind data, graph-based models were also investigated. The attention-based models were also explored in WPP studies to adaptively analyze the spatial or temporal patterns of the input via attention mechanisms. Recent results reported that the attention-based models offered higher efficiency and better performance than convolutional and recurrent models in WPP tasks.

To verify the performance advancement brought by deep learning methods, we surveyed the existing literature and reported the improvement generated by deep learning methods against traditional models. We further verify the effectiveness of recent WPP development via reproducing and comparing latest deep learning models based on SCADA datasets collected from three commercial wind farms. The results demonstrated that DLinear achieved better performance on all of datasets considered in this work. The Informer, Transformer, and GRU model could also obtain promising results.

The future trends in WPP studies including the advanced input organization design, interpretable WPP features, more emerging modeling mechanisms, and more effective evaluation metrics were introduced and discussed. It is expected that these research areas will receive more attention in future WPP studies, as improvements in these aspects are likely to lead to further improvements in WPP accuracy and reliability.

In summary, this review serves as a guide for researchers and software developers dedicated to WPP studies and its downstream tasks.