1 Introduction

One of the most significant effects of human activity is global warming. It occurs when greenhouse gases such as CO2, CH4, N2O, and water vapor accumulate in the atmosphere in greater quantities due to the excessive use of fossil fuels as an energy source [2]. To mitigate global warming, transitioning from fossil fuels to renewable energy sources such as hydro, geothermal, solar, wind, wood, plant residues, biomass, tidal, and wave energy is essential [35]. Solar energy is one of the most promising renewable sources for producing large amounts of energy. However, it is limited to daylight hours and varies with the seasons, which makes it unstable and unpredictable. As a result, predicting its output is challenging and requires sophisticated methodologies. In the literature, the methodologies employed for this purpose fall into four categories: physical methods, statistical models, artificial intelligence techniques, and their hybrid and ensemble approaches [3, 27].

Physical methods use atmospheric physical and mechanical data such as wind speed, temperature, rainfall, humidity, and day duration, and are based on numerical weather prediction (NWP) to simulate atmospheric dynamics (Lorenz et al. [30]; Urquhart et al. [47]). Statistical approaches regress unknown constants to determine the mathematical relationship between inputs and outputs (Ye et al. [51]). They are widely used in the literature, but their performance has not met expectations because they are unsuitable for modeling nonlinear relationships (De Giorgi et al. [10]; Sheng et al. [43]; Yadav et al. [49]). The capacity of artificial intelligence algorithms, such as machine learning and more recently deep learning, to model nonlinear relationships has caught the attention of researchers. A wide range of machine learning algorithms has been employed (AlShafeey & Csaki, 2021; Belmahdi et al., 2022; De Leone et al. [11]; Frederiksen & Cai [16]; Malvoni et al. [33]; Markovics & Mayer [34]; Yagli et al. [50]; Zhao et al. [56]). The future of deep learning in solar energy prediction is promising due to its capacity for generalization. Architectures used include the Feed-Forward Neural Network (FFN) (Rodriguez, Azcarate, et al., 2022; Rodriguez, Galarza, et al., 2022), the Long Short-Term Memory (LSTM) / Recurrent Neural Network (RNN) ([14]; Hafiz et al. [20]; Hossain & Mahmood [21, 29]; X. Luo et al. [32]; Park et al. [38]; Wang et al. [48]), the Convolutional Neural Network (CNN) (C. Zhang et al. [54]), the Gated Recurrent Unit (GRU) (Qu et al. [40]), and the Encoder-Decoder (ED) (Chang et al. [8]). Data decomposition, feature extraction, or hybridization of algorithms is used to compose hybrid and ensemble algorithms (Ishaq et al., 2022; Khan et al. [23]; Lai et al. [25]; P. Li et al. [26]; Lin et al. [28]; Ospina et al. [37]; Tang et al. [46]; J. Zhang et al. [55]).

In this study, we present a thorough performance comparison among LSTM encoder-decoder (ED) architectures, namely LSTM-ED, Conv-LSTM-ED, and CNN-LSTM-ED, for predicting daily solar PV energy output. Because a plain LSTM struggles to handle the instability and unpredictability of continuous historical data, we prefer several ED-based LSTM architectures. To our knowledge, no previous study compares solar energy prediction using these specific network architectures. The evaluation is carried out on a historical dataset of 1741 daily observations. This comprehensive dataset allows us to fully characterize the effectiveness of these methods and to compare them fairly. The panel data are provided by 26 different panels connected to an inverter on the roof of the Hidayet Türkoğlu Sports Complex in Istanbul, Turkey. In addition to the panel data, meteorological data are also used for the prediction step. Three error metrics are used to evaluate the performance of the models: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-square (R2). The experimental findings reveal that Conv-LSTM-ED achieves good overall prediction accuracy. Additionally, measurements using the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) indicate that Conv-LSTM-ED is superior to the other two models in terms of both fitness and complexity.

The remainder of this paper is organized as follows. Section 2 introduces the materials and methods used in this study, starting with the studied area and data collection and moving on to the LSTM-ED architectures that are used. Section 3 presents the case study, covering data pre-processing, the error metrics used for evaluation, and the hyper-parameter settings of the models. Section 4 contains the results and discussion, evaluating model performance and effectiveness using error metrics, hyper-parameter settings, and figures. Section 5 summarizes the findings and suggests directions for further research.

2 Materials and methods

2.1 Studied area and data collection

In order to develop reliable and successful deep learning applications, high-quality, long-term historical data is essential. However, acquiring such data can be challenging because data providers may not record it or may be unwilling to share it. The solar PV panel and meteorological data used in this study are entirely real, having been obtained directly from their providers. The panel data were collected from 26 different panels connected to an inverter on the roof of the Hidayet Türkoğlu Sports Complex in Istanbul, Turkey, at one-day intervals. Data were collected separately from each panel, and experiments were conducted individually for each one. The information obtained from the panels includes the power (P) output of each panel and the energy (E) it produced. The technical information about the panels is detailed in Table 1.

Table 1 Technical Panel Information

Solar PV panels exhibit varying generation characteristics depending on the time of year. As a result, meteorological data significantly influence production characteristics and are also used as input features in this study. Due to the annual periodicity in weather patterns, at least one year of data is necessary for training algorithms effectively. Consequently, all datasets used in this study encompass panel and meteorological data spanning from February 24, 2017, to November 30, 2021. Each panel is represented by 1741 data points, amounting to a total of 45,266 data points across all panels. The meteorological data were obtained from the Güngören/Davutpaşa Marmara Automatic Meteorology Observation Station, which is operated by the 1st Regional Directorate of the Republic of Turkey Ministry of Environment, Urbanization and Climate Change, Turkish State Meteorological Service, and is located near the Hidayet Türkoğlu Sports Complex. The meteorological variables used in this study are listed in Table 2.

Table 2 Meteorological Features

In this study, we predicted the value of energy (E in kWh) for the next day (t + 1) using daily input features including power (P in kWe), maximum apparent power (MaxAP), minimum apparent power (MinAP), mean apparent power (MeanAP), maximum relative humidity (MaxRH), minimum relative humidity (MinRH), mean relative humidity (MeanRH), maximum temperature (MaxT), minimum temperature (MinT), mean temperature (MeanT), maximum wind direction (MaxWD), mean wind direction (MeanWD), maximum wind speed (MaxWS), mean wind speed (MeanWS), and the energy (E in kWh) produced on the current day (t).
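As an illustration, the one-step-ahead supervised framing described above might be set up as follows in Python; the file name and column labels are hypothetical placeholders that simply mirror the feature list.

```python
import pandas as pd

# Hypothetical daily dataset for one panel; column names only mirror the
# features listed above and are not the actual field names of the dataset.
df = pd.read_csv("panel_01_daily.csv", parse_dates=["date"], index_col="date")

feature_cols = [
    "P", "MaxAP", "MinAP", "MeanAP",          # power and apparent power
    "MaxRH", "MinRH", "MeanRH",               # relative humidity
    "MaxT", "MinT", "MeanT",                  # temperature
    "MaxWD", "MeanWD", "MaxWS", "MeanWS",     # wind direction and speed
    "E",                                      # energy produced on day t
]

X = df[feature_cols].iloc[:-1].values    # inputs observed on day t
y = df["E"].shift(-1).iloc[:-1].values   # target: energy on day t + 1
```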

2.2 Methodology

This study proposes LSTM-ED, Conv-LSTM-ED, and CNN-LSTM-ED models to measure and compare the reliability and accuracy of one-step-ahead PV forecasting. The methods used in this study are summarized as follows.

2.2.1 LSTM

Recurrent Neural Networks (RNNs) are a specialized version of traditional FFNs with cyclic connections to hidden neurons that can handle sequential data processing. Unlike traditional FFNs, which cannot accept sequential inputs and require all of their inputs (and outputs) to be independent of one another, RNN models realize a sequence-to-sequence mapping between input and output. As a result, the output is derived based on previous computations. In theory, RNNs can utilize past sequential information over long sequences. In practice, however, the duration of the sequential information they can remember is limited to only a few steps back because of the vanishing-gradient problem, which constrains the memory of an RNN. Figure 1 illustrates the unrolled RNN, and Fig. 2 depicts the RNN architecture.

Fig. 1
figure 1

Unrolled RNN cell

Fig. 2
figure 2

RNN architecture

\({x}_{t}\) and \({h}_{t}\) represent the input of the cell and the recurrent information of the cell at time t, respectively. tanh is the hyperbolic tangent activation function.
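For illustration, a single recurrent step of this vanilla RNN cell can be written in a few lines of NumPy; the dimensions below are arbitrary and purely didactic.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy example: 15 input features, 8 hidden units.
rng = np.random.default_rng(0)
x_t = rng.normal(size=15)
h_prev = np.zeros(8)
W_x, W_h, b = rng.normal(size=(8, 15)), rng.normal(size=(8, 8)), np.zeros(8)
h_t = rnn_step(x_t, h_prev, W_x, W_h, b)   # new recurrent state h_t
```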

LSTM models effectively extend the memory of RNNs, enabling them to learn and maintain dependencies over long input sequences. This extended memory can retain information for prolonged periods, with capabilities to read, write, and delete data from its memory cells. The memory unit within an LSTM is known as a ‘gated cell,’ where ‘gate’ refers to the mechanism that determines whether to preserve or discard information in the memory. An LSTM model identifies critical features from the input and retains them for extended durations. During the training phase, the significance of the information is controlled by the weight values assigned to it, dictating whether it should be kept or removed. Consequently, an LSTM model learns to discern which information is crucial to keep and which to discard.

An LSTM network generally consists of three gates: the forget gate, the input gate, and the output gate. The forget gate decides whether to retain or discard existing information in the cell state. This decision is made using a sigmoid function that processes information from the previous output and the current input to produce the forget gate’s output. Mathematically, this can be expressed as follows:

$${f}_{t}=\sigma \left({W}_{f}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$$
(1)

where \(\sigma\) is the sigmoid activation function, whose output ranges from 0 to 1. \({f}_{t}\) is the forget gate, \({W}_{f}\) is the weight of the forget gate, and \({b}_{f}\) is the bias of the forget gate. \({h}_{t-1}\) represents the recurrent information of the cell at time \(t-1\), and \({x}_{t}\) represents the input of the cell at time t.

The input gate determines whether the new information should be added to the cell. To do this, a sigmoid layer decides which values will be updated. Subsequently, a tanh layer generates a vector of candidates for the state. The outputs of the sigmoid and tanh layers are written as follows:

$${i}_{t}=\sigma \left({W}_{i}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)$$
(2)
$${\stackrel{\sim}{C}}_{t}=tanh\left({W}_{C}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{C}\right)$$
(3)

where \({i}_{t}\) is the input gate, \({W}_{i}\) is the weight of the input gate, and \({b}_{i}\) is the bias of the input gate. \({\stackrel{\sim}{C}}_{t}\) represents the candidate vector, and tanh is the hyperbolic tangent activation function, whose output ranges from −1 to 1. While \({W}_{C}\) is the weight of the candidate state, \({b}_{C}\) is the bias of the cell state. The cell state \({C}_{t}\) is then updated by adding the product of the input gate and the candidate vector to the previous cell state scaled by the forget gate:

$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\stackrel{\sim}{C}}_{t}$$
(4)

The output gate determines which parts of the current cell state will contribute to the final output. To achieve this, the sigmoid function takes the previous state and the current input into account. The new hidden state is then obtained by multiplying the tanh of the cell state with the sigmoid layer’s output, and this state is passed on to the next step. The mathematical equations for the output gate are as follows:

$${o}_{t}=\sigma \left({W}_{o}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)$$
(5)
$${h}_{t}={o}_{t}*\text{tanh}\left({C}_{t}\right)$$
(6)

Some formulations also feed the previous cell state \({C}_{t-1}\) to the gates (so-called peephole connections), in which case Eqs. (1), (2), and (5) become:

$${f}_{t}=\sigma \left({W}_{f}\bullet \left[{C}_{t-1},{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$$
(7)
$${i}_{t}=\sigma \left({W}_{i}\bullet \left[{C}_{t-1},{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)$$
(8)
$${o}_{t}=\sigma \left({W}_{o}\bullet \left[{C}_{t-1},{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)$$
(9)

where \({o}_{t}\) is the output gate, \({W}_{o}\) is the weight of the output gate, and \({b}_{o}\) is the bias of the output gate. In the equations above, ‘∙’ denotes the multiplication of a weight matrix with the concatenated vector, and element-wise multiplication is represented by ‘*’. The inner structure of the LSTM unit, according to the equations above, is given in Fig. 3.
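For illustration only, Eqs. (1)-(6) can be collected into a single NumPy step; this is a didactic sketch of the standard (non-peephole) cell, not the Keras implementation used later in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following Eqs. (1)-(6); each W acts on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])      # concatenated [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # Eq. (1): forget gate
    i_t = sigmoid(W_i @ z + b_i)           # Eq. (2): input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # Eq. (3): candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # Eq. (4): cell state update
    o_t = sigmoid(W_o @ z + b_o)           # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)               # Eq. (6): new hidden state
    return h_t, c_t
```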

Fig. 3
figure 3

Inner structure of LSTM unit

The LSTM architecture is shown in Fig. 4.

Fig. 4
figure 4

LSTM architecture

2.2.2 Conv-LSTM

Conv-LSTM is a variation of LSTM that was initially presented by Shi et al. [44] to address spatio-temporal prediction challenges in precipitation forecasting. In a Conv-LSTM network, the input features are 3D spatio-temporal tensors, with the first two dimensions representing the spatial dimensions. Unlike the classic LSTM model, the Conv-LSTM cell’s input-to-state and state-to-state transitions use convolutional operations that produce 3D tensors instead of matrix multiplication. The model can be written with the following equations, where convolution is indicated by the ‘*’ symbol and the Hadamard product by the ‘◦’ symbol:

$${i}_{t}=\sigma \left({W}_{xi}*{X}_{t}+{W}_{hi}*{H}_{t-1}+{W}_{Ci}\circ {C}_{t-1}+{b}_{i}\right)$$
(10)
$${f}_{t}=\sigma \left({W}_{xf}*{X}_{t}+{W}_{hf}*{H}_{t-1}+{W}_{Cf}\circ {C}_{t-1}+{b}_{f}\right)$$
(11)
$${C}_{t}={f}_{t}\circ {C}_{t-1}+{i}_{t}\circ \text{tanh}\left({W}_{xC}*{X}_{t}+{W}_{hC}*{H}_{t-1}+{b}_{C}\right)$$
(12)
$${o}_{t}=\sigma \left({W}_{xo}*{X}_{t}+{W}_{ho}*{H}_{t-1}+{W}_{Co}\circ {C}_{t-1}+{b}_{o}\right)$$
(13)
$${H}_{t}={o}_{t}\circ \text{tanh}\left({C}_{t}\right)$$
(14)

where \({X}_{1},\dots ,{X}_{t}\) are the inputs, \({C}_{1},\dots ,{C}_{t}\) are the cell states, and \({H}_{1},\dots ,{H}_{t}\) are the hidden states. \({i}_{t}\), \({f}_{t}\), and \({o}_{t}\) are the input, forget, and output gates, respectively. The inner structure of the Conv-LSTM unit, as described by Eqs. (10–14), is depicted in Fig. 5.

Fig. 5
figure 5

The inner structure of Conv-LSTM

2.2.3 CNN-LSTM

CNNs are frequently used in feature engineering, because they can focus on the most important features of the data, as well as in time series forecasting. LSTM can address many time-series tasks that feed-forward networks with fixed-size time windows cannot handle. CNN-LSTM, on the other hand, combines the advantages of both CNN and LSTM networks. In this sense, it can be considered a hybrid network that allows for various combinations of CNN and LSTM architectures. The main architecture of this network typically includes a one-dimensional convolutional layer, a pooling layer, an LSTM hidden layer, and a fully connected layer. The architecture of the CNN-LSTM is illustrated in Fig. 6.
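A minimal Keras sketch of this typical layout (one-dimensional convolution, pooling, an LSTM layer, and a fully connected output) is shown below; the filter count, kernel size, and window length are illustrative assumptions rather than values used in this study.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

n_steps_in, n_features = 8, 1   # assumed look-back window of 8 steps

model = Sequential([
    Conv1D(64, kernel_size=2, activation="relu",
           input_shape=(n_steps_in, n_features)),  # 1-D convolutional layer
    MaxPooling1D(pool_size=2),                     # pooling layer
    LSTM(64),                                      # LSTM hidden layer
    Dense(1),                                      # fully connected output
])
```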

Fig. 6
figure 6

Architecture of CNN-LSTM

2.2.4 Encoder-decoder models

In many sequence-to-sequence applications, where the goal is to construct a model that predicts future outputs from a given set of historical or past data, the encoder-decoder paradigm is widely used in computer science and related fields (Ekinci, Omurca, & Özbay, 2021; Gürses-Tran, Körner, & Monti, 2022; Qiu et al. [39]; B. Zhang et al. [53]; Zhu et al. [57]). However, LSTM architectures are not always effective for sequence-to-sequence prediction due to difficulties arising from fixed input and output shapes (Zdravkovic, Ciric, & Ignjatovic, 2022). To address this issue, Cho et al. [9] proposed using RNNs within an encoder-decoder model. In this model, one RNN network serves as the encoder, extracting timing characteristics from historical data, while another RNN network acts as the decoder, predicting the time series. The encoder compresses the data from the variable-length input sequence into a vector derived from the sequence of RNN hidden states. The final hidden state of the encoder provides a fixed-dimensional representation of the input sequence. The decoder uses another RNN network to generate the variable-length output sequence ([15]; Gangopadhyay et al. [17]; Gupta et al. [18]). The encoder-decoder model is a generic approach for learning the conditional distribution of a variable-length sequence conditioned on another variable-length sequence, where the input and output may not have the same length [31].

Sutskever et al. [45] proposed replacing the RNNs in the ED model with LSTMs for natural language processing (NLP) tasks. The first LSTM layer reads the input-language sentence and encodes it into a fixed-dimensional vector representation. The second LSTM layer then generates the output-language sentence from this vector. This model is referred to as LSTM-ED.

Unlike the traditional LSTM model, LSTM-ED reads the entire input sequence into a fixed-length vector (Nguyen et al. [36]). This internal representation is the final hidden state of the encoder and serves as the initial hidden state of the decoder. Because this vector embeds the whole sequence, the approach is also known as sequence embedding, and its built-in “memory” allows the model to perform well even with very long input sequences.

In the Conv-LSTM-ED model, Conv-LSTM is used for encoding, and LSTM is used for decoding. The incorporation of Conv-LSTM as the encoder enhances the model’s feature extraction capabilities. This aligns with our hypothesis that enhancing the model to extract data features will improve the prediction accuracy. The Conv-LSTM model reads a sequential input and passes it directly to a decoder LSTM through convolutional operations.

Another variation of the LSTM-ED model incorporates a CNN in the encoder layer. While CNNs alone cannot directly process sequential data, they are used to read the data and extract key features. An LSTM then decodes these features and forecasts a sequential series. This model is called CNN-LSTM-ED because it uses a CNN as the encoder and an LSTM as the decoder.

3 Case Study

3.1 Data preparation

Missing data is a problem that must be addressed in data science applications. The presence of missing data not only results in a loss of information but also hinders the ability of models to extract hidden patterns from datasets (T. Liu, Jin, Zhong, & Xue, 2022). Consequently, rows with missing data are either removed or the gaps are filled using various methods. However, deleting rows with missing data is not advisable, especially for time series data; instead, missing values should be filled using interpolation, which is a crucial pre-processing step. In the dataset examined here, missing values are present only in the P and E features, with a total of 4,350 missing values across the 26 panels. These missing values have been filled using the padding method, which fills each missing entry with the last available value above it [5].
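In pandas this corresponds to forward filling ("pad"); a minimal sketch, assuming the panel data are held in a DataFrame with columns P and E (column names are illustrative):

```python
import pandas as pd

# Assuming the panel data are in a DataFrame `df` (see the earlier sketch),
# forward-fill the missing P and E values with the last observed value.
df = pd.read_csv("panel_01_daily.csv", parse_dates=["date"], index_col="date")
df[["P", "E"]] = df[["P", "E"]].ffill()   # equivalent to fillna(method="pad")
```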

The experimental power and energy data used in this study range from 0 to 157.1 and from 0 to 3463, respectively. Model accuracy can be adversely affected by imbalances in data ranges. To address this issue, min-max normalization is employed, which maps the range to [0, 1]. Min-max normalization facilitates accurate predictions by preserving the relationships between data instances. The formula for min-max normalization is as follows:

$${v}_{i}=\frac{{x}_{i}-{x}_{min}}{{x}_{max}-{x}_{min}}$$
(15)

where \({x}_{i}\) denotes the \({i}_{th}\) data point, \({x}_{min}\) denotes the data point with the lowest value, and \({x}_{max}\) denotes the data point with the highest value. The normalized version of \({x}_{i}\), denoted by \({v}_{i}\), lies between [0, 1].

Training and test sets must be properly defined for accurate evaluation of models. However, there is no fixed rule for splitting a dataset into training and test sets. In this case, the dataset is divided into a 75% training set and a 25% test set. The training set contains 1307 observations, while the test set comprises 434 observations.
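A sketch of the chronological split and the min-max scaling of Eq. (15) is given below; fitting the minimum and maximum on the training portion only is an assumption, since the text does not state which portion they were computed from.

```python
import numpy as np

split = 1307                           # 75% of the 1741 daily observations
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Min-max normalization (Eq. 15), fitted on the training data (assumption).
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_n = (X_train - x_min) / (x_max - x_min)
X_test_n = (X_test - x_min) / (x_max - x_min)
```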

3.2 Error metrics

We employ RMSE, MAE, R2, and loss value as error metrics in our experiments to evaluate the performance of the experimental algorithms. Additionally, we use the AIC and BIC criteria to assess the architectures.

RMSE, which estimates the standard deviation of the residuals, is the square root of the average squared difference between the observed and predicted values. A higher RMSE indicates a worse fit, with greater dispersion around the mean, while a lower RMSE signifies a better fit, with smaller discrepancies between the observed and predicted values. Therefore, a lower RMSE value is preferable. The formula for calculating RMSE is as follows:

$$RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}$$
(16)

where \({y}_{i}\) is the observed value, \({\widehat{y}}_{i}\) is the predicted value of the \(i\)-th observation, and n is the total number of observations in the test set.

MAE represents the average absolute difference between the observed and predicted values. A lower MAE indicates that the prediction error of the model is smaller, which corresponds to more accurate predictions. The formula for calculating MAE is as follows:

$$MAE=\frac{1}{n}\sum _{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$
(17)

R2 is the square of the correlation coefficient between the observed and predicted values, and it assesses how well the model explains the variance in the data [12]. A high R2 value indicates a strong correlation between the observed and predicted values, whereas a low R2 value suggests a weak correlation. R2 values range from [0, 1]. If R2 is greater than 0.8, the model is considered reliable; if it is lower, the model is considered less reliable [24]. R2 is calculated using the following formula:

$${R}^{2}=1-\frac{\sum _{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{\sum _{i=1}^{n}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}$$
(18)

where \(\stackrel{-}{y}\) is the mean of the observed values of variable y.

When working with different models, it is very important to perform calculations related to statistical model fitting, as model selection is based on these results. The most popular techniques for choosing between models are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). While R-squared (R²) is used solely to evaluate how well the model fits the data, AIC and BIC also take into account model complexity [7, 13].

The AIC, which assesses the fit of the model, is calculated using the following formula:

$$AIC=L\times \text{log}\left(\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right)+2\times F$$
(19)

The effectiveness of the model for data prediction is assessed using the Bayesian Information Criterion (BIC), which is calculated as follows:

$$BIC=L\times \text{log}\left(\frac{1}{n}\sum _{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right)+F\times \text{log}\left(L\right)$$
(20)

In Eqs. 19 and 20, ‘L’ represents the number of observations in the dataset, and ‘F’ denotes the number of features. A model is considered optimal when both the AIC and BIC values are at their lowest. The fundamental distinction between these two criteria is that the BIC tends to prefer simpler models, while the AIC may favor more complex models [1].
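The five criteria can be computed directly from the test predictions; the sketch below follows Eqs. (16)-(20), with AIC and BIC written in the log-mean-squared-error form and the standard BIC penalty F × log(L).

```python
import numpy as np

def evaluate(y_true, y_pred, n_features):
    """RMSE, MAE, R2 (Eqs. 16-18) and AIC/BIC (Eqs. 19-20) on the test set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n_obs = len(y_true)                                   # L in Eqs. 19-20
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    aic = n_obs * np.log(mse) + 2 * n_features
    bic = n_obs * np.log(mse) + n_features * np.log(n_obs)
    return rmse, mae, r2, aic, bic
```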

3.3 Hyper-parameters of ED-based LSTM architectures

We build our environment in the Jupyter IDE using the Python programming language together with the scikit-learn (sklearn) and Keras libraries.

In this study, we compare ED-based LSTM architectures to determine which model and which hyper-parameters are better suited for forecasting energy (E).

In the LSTM-ED architecture, the number of neurons in the input layer is set to 15, corresponding to the number of features in the datasets. There is no general rule for determining the optimal number of hidden neurons; our LSTM layers have 128 hidden neurons. The LSTM encoder receives the input sequences, and a repeat vector feeds the compressed vector from the LSTM encoder to the LSTM decoder, which also has 128 hidden neurons. The output states of the LSTM encoder serve as inputs to the LSTM decoder. Finally, two time-distributed layers transform the output of the LSTM decoder into the predicted output; the time-distributed layer acts as a wrapper that applies a dense layer to every time step (Wolf & Yang, 2020). The first dense layer has 100 units with the Rectified Linear Unit (ReLU) activation function, followed by a second dense layer with a single unit. The choice of activation function strongly influences the performance of a deep learning model. To conclude, a dropout layer set at 0.2 acts as a regularizer to reduce overfitting, and a flatten layer is included.
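A minimal Keras sketch of this LSTM-ED configuration is given below; the look-back of one day and the exact ordering of the dropout and flatten layers are assumptions, so this should be read as an approximation of the described architecture rather than the authors' exact code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (LSTM, RepeatVector, TimeDistributed,
                                     Dense, Dropout, Flatten)

n_steps_in, n_features, n_steps_out = 1, 15, 1   # assumed look-back of 1 day

model_lstm_ed = Sequential([
    LSTM(128, input_shape=(n_steps_in, n_features)),   # encoder
    RepeatVector(n_steps_out),                          # feed compressed vector to decoder
    LSTM(128, return_sequences=True),                   # decoder
    TimeDistributed(Dense(100, activation="relu")),     # first time-distributed dense layer
    TimeDistributed(Dense(1)),                          # second dense layer, single unit
    Dropout(0.2),                                       # regularization
    Flatten(),
])
```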

In the Conv-LSTM-ED model, a two-dimensional Conv-LSTM is used as an encoder to handle the one-dimensional, univariate time series, while an LSTM is used as a decoder, as in the previous architecture. The ConvLSTM2D layer, which implements two-dimensional Conv-LSTM models, can also handle one-dimensional univariate time-series data when the input is reshaped accordingly. The convolutional encoder extracts 128 feature maps from the input sequence, and these feature maps are then amplified. Following the convolutional layer, a flatten layer creates a vector of 128 values by flattening the output feature map values. A repeat vector feeds this compressed vector to the LSTM decoder with 128 hidden neurons. ReLU is used as the activation function in both the encoder and the decoder. Finally, two time-distributed layers transform the output of the LSTM decoder into the predicted output. The first dense layer with 100 units uses ReLU as the activation function, followed by a second dense layer with a single unit. A dropout layer of 0.2 is added at the end.
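A corresponding Conv-LSTM-ED sketch is given below; ConvLSTM2D expects five-dimensional input (samples, time, rows, columns, channels), and the reshape convention and kernel size shown here are assumptions, not values taken from the original implementation.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (ConvLSTM2D, Flatten, RepeatVector, LSTM,
                                     TimeDistributed, Dense, Dropout)

n_seq, n_rows, n_cols, n_channels = 1, 1, 1, 1   # one day of one variable (assumption)

model_convlstm_ed = Sequential([
    ConvLSTM2D(filters=128, kernel_size=(1, 1), activation="relu",
               input_shape=(n_seq, n_rows, n_cols, n_channels)),  # encoder
    Flatten(),                                              # vector of 128 values
    RepeatVector(1),                                        # feed compressed vector to decoder
    LSTM(128, activation="relu", return_sequences=True),    # decoder
    TimeDistributed(Dense(100, activation="relu")),
    TimeDistributed(Dense(1)),
    Dropout(0.2),
])
```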

We have also developed a CNN-LSTM-ED model for a single variable that uses two convolutional layers as the encoder and an LSTM as the decoder. The sequence from the previous day serves as the model’s input. With a kernel size of one timestamp, the first convolutional layer extracts 128 feature maps from the input sequence. The second layer performs the same function on the feature maps created by the first layer and amplifies the features. This second convolutional layer is followed by a max-pooling layer, which reduces the feature maps by half while retaining the maximum values of the features. Next, a long vector of 128 values is created by flattening the output feature map values from the max-pooling layer. A repeat vector feeds the compressed vector from the flatten layer to the LSTM decoder with 125 hidden neurons. ReLU is used as the activation function in both the encoder and the decoder. Finally, two time-distributed layers transform the output of the LSTM decoder into the predicted output. The first dense layer with 100 units uses ReLU as the activation function, followed by a second dense layer with a single unit. A dropout layer of 0.2 is included at the end.
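A CNN-LSTM-ED sketch along the same lines is shown below; a look-back window of two steps is assumed so that the max-pooling layer can halve the feature maps, and the kernel size of one follows the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Flatten, RepeatVector,
                                     LSTM, TimeDistributed, Dense, Dropout)

n_steps_in, n_features = 2, 1   # assumed two-step window for a univariate series

model_cnn_lstm_ed = Sequential([
    Conv1D(128, kernel_size=1, activation="relu",
           input_shape=(n_steps_in, n_features)),   # first encoder layer
    Conv1D(128, kernel_size=1, activation="relu"),  # second encoder layer
    MaxPooling1D(pool_size=2),                      # halve the feature maps
    Flatten(),                                      # long vector of feature values
    RepeatVector(1),
    LSTM(125, activation="relu", return_sequences=True),   # decoder (125 units as stated)
    TimeDistributed(Dense(100, activation="relu")),
    TimeDistributed(Dense(1)),
    Dropout(0.2),
])
```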

Optimizing the loss function is crucial for developing deep learning models and fine-tuning their parameters, and it has gained significant importance in recent years. Throughout the training phase of all models, we employ the Adam optimizer, a stochastic gradient-descent method that adapts the learning rate using estimates of the first- and second-order moments of the gradients. Adam is known for its quick convergence, which is the primary reason for its use in this study. All models are trained with a batch size of 72, for 50 and 100 epochs in two separate experiments.
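Under these settings, training any of the three models might look as follows; the mean-squared-error loss is an assumption, since the loss function is not named explicitly, and the inputs must first be reshaped to the three-dimensional (or, for ConvLSTM2D, five-dimensional) shape each encoder expects.

```python
# Training configuration shared by all models: Adam optimizer, batch size 72,
# and either 50 or 100 epochs (two separate experiments).
model_lstm_ed.compile(optimizer="adam", loss="mse")
history = model_lstm_ed.fit(
    X_train_n.reshape(-1, 1, 15), y_train,   # assumed reshape to (samples, 1, 15)
    epochs=50, batch_size=72, verbose=0,
)
```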

The proposed architectures are illustrated in Fig. 7.

Fig. 7
figure 7

(a) LSTM-ED (b) Conv-LSTM-ED (c) CNN-LSTM-ED

4 Results and discussions

Five evaluation metrics were used to assess the performance of the LSTM-ED, Conv-LSTM-ED, and CNN-LSTM-ED models. To confirm the models’ reliability, the study evaluated their performance across 26 panels.

4.1 Evaluation of model performances

It is important to note that the training algorithms and structural compositions of the LSTM-ED, Conv-LSTM-ED, and CNN-LSTM-ED models differ from one another. Consequently, this study investigated the performance and stability of each model. Using the training datasets, all three models underwent 50 and 100 iterations of training. Subsequently, their performances were evaluated using the test datasets. The model that produced the highest average R2 value over a single time step in the test rounds was selected as the best model for each experiment. The top-performing ED-based LSTM models were then compared. Finally, the findings from the LSTM-ED models are presented.

The results for each iteration of the LSTM-ED models during the test phases are displayed in Tables 3 and 4. With low RMSE and MAE values and exceptionally high R2 values in the test phases, it appears that, at 50 iterations, all models could be adequately trained for nearly all panels on average. However, all models had a problem with panel 1 (p1) in terms of R2; the success of time series prediction therefore depends as much on data quality as on the algorithms used. When the error metrics for 100 iterations are averaged, the models are observed to fail more often than at 50 iterations across all metrics. Overall, the performance results indicate that the Conv-LSTM-ED model surpasses the LSTM-ED and CNN-LSTM-ED models in the test phases by achieving higher R2 values and lower RMSE and MAE values. The CNN-LSTM-ED model exhibits much larger error ranges across all three metrics than the other two models for all iterations. These findings suggest that the Conv-LSTM-ED model excels in terms of model stability, reliability, and accuracy.

Table 3 Performances of the LSTM-ED models for 50 iterations for test dataset
Table 4 Performances of the LSTM-ED models for 100 iterations for test dataset

The scatter diagrams for the LSTM-ED, Conv-LSTM-ED, and CNN-LSTM-ED models for p1 (the worst) and p10 (the best) predictions after 50 and 100 iterations are shown in Figs. 8 and 9, respectively.

Fig. 8
figure 8

Scatter diagrams of the p1

Fig. 9
figure 9

Scatter diagrams of the p10

When examining the spread of the scatter plots, it is observed that the spread for p10 is smaller than that for p1, and the spread obtained after 50 iterations is smaller than that obtained after 100 iterations. From this, we can infer that the differences between the observed and predicted values are smaller for p10 than for p1, and smaller after 50 iterations than after 100.

To assess the generalization capacity of each model on the test dataset, we examined the loss plot for panel p10 shown in Fig. 10. The relatively low test loss for each model suggests that their generalization capacity is satisfactory, indicating that the models perform successfully.

Fig. 10
figure 10

Loss values of the p10

4.2 Evaluation of model efficiencies

When selecting a model, the one that minimizes the information criterion is typically chosen. In the literature, two common information criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). AIC tends to favor larger models due to its lighter penalty, which can lead to inconsistency. In contrast, BIC includes a larger penalty term, which promotes consistency but may favor overly simple models.

To determine the correct model order, the magnitudes of these information criteria are compared across several models. Box plots are often utilized to assess model quality and to identify the model that best fits the data. Figure 11 presents the AIC and BIC box plots for each iteration count.

Upon examining Fig. 11, it is clear that the AIC and BIC values are in agreement with the R2, RMSE, and MAE values. The lowest AIC and BIC values are achieved with the Conv-LSTM-ED model at 50 iterations. Furthermore, a comparison between the models and the number of epochs indicates that neither the specific model nor the number of epochs provides a distinct advantage for the current dataset. Combining all the results, it becomes apparent that the Conv-LSTM-ED model demonstrates robust capabilities in forecasting solar photovoltaic energy.

Fig. 11
figure 11

(a) AIC for 50 iterations (b) AIC for 100 iterations (c) BIC for 50 iterations (d) BIC for 100 iterations

5 Conclusions

This study examines the performance of ED-based LSTM models for predicting the solar energy output of PV panels. To demonstrate the viability and effectiveness of the compared models, several case studies are presented as examples. The findings reveal that the proposed models successfully predict the energy output of solar PV panels with high accuracy and reasonable prediction intervals.

Initially, the ED-based LSTM models are tested on 26 PV panels connected to a single inverter to facilitate comparison. Subsequently, the relationship between the models’ performance and their complexity is evaluated. The experimental dataset was collected from an inverter connected to 26 distinct panels installed on the roof of the Hidayet Türkoğlu Sports Complex in Istanbul, Turkey. The study also considers meteorological factors that significantly influence the panels’ performance for energy prediction. The results show that the Conv-LSTM-ED model outperforms the LSTM-ED and CNN-LSTM-ED models, although all three networks exhibit similar accuracy and are highly effective for making one-day-ahead predictions.

This work compares application-ready deep learning models from a high-level programming library, with a focus on solar PV panel energy prediction. As such, it serves as a guide for researchers seeking to develop practical solutions. Future research may explore the impact of varying look-ahead times on model performance. Additionally, an investigation could be conducted to determine which subset of features plays a more significant role in prediction accuracy through feature selection.