1 Introduction

The increasing global awareness of, and research interest in, reducing the end-use of conventional energy resources (i.e., fossil fuels) are mainly due to their excessive greenhouse gas emissions, which are among the key factors behind current climate change issues. Renewable energy resources conveniently serve as a direct substitute, especially when their costs are competitive and their technologies are mature (Maji et al. 2019; Bouchouicha et al. 2020a). These resources are abundant worldwide and, most importantly, have a lower carbon footprint (El-Shimy 2017; Tseng 2017; Li et al. 2019; Hassan et al. 2021a).

As a well-established clean power production approach, solar power generation is gaining renewed research momentum for the benefits it offers (Slimani et al. 2020). Solar photovoltaics (PV) is considered a mainstream option in the power sector. An increasing number of countries generate more than 20% of their electricity using PV systems. Nations that have shown great interest in a significant shift to solar energy over the past 3 years include Egypt, Brazil, Mexico, Algeria, Pakistan, Turkey, and the Netherlands (IRENA 2020; Bailek et al. 2017a, 2017b).

Despite its benefits, clean electricity generation also has some limitations. The main drawback of these technologies, especially PV, is the stochastic and intermittent nature of the resource. The generated PV power varies mainly due to continuous changes in solar irradiance (Bouchouicha et al. 2019). This, as described by Ehsan et al. (2017), results in several issues when connecting the PV system to the grid. Li et al. (2019) found that intermittent solar irradiance affects the general dispatchability of the generated power. This critical issue can be mitigated by accurately predicting the PV power production, which allows the grid operator to intervene directly to increase system efficiency and maintain the energy balance in stand-alone or grid-connected modes.

Accurate estimates of the energy performance of PV systems are valuable for the planning and operational security of power systems (Wu et al. 2021; Razmjoo et al. 2019). Many highly accurate models have been proposed for different locations around the globe. Mazumdar et al. (2014) derived a statistical method to empirically model the ramping behavior of utility-scale solar PV power output over short time scales. The analysis was carried out in terms of the ramp rate (i.e., ramp up or ramp down), which is the change in the generated power of the PV system over a given time interval. It was reported that the proposed model could be used to estimate the frequencies of PV ramp events. Hassan et al. (2021b) developed genetically optimized models based on an autoregressive exogenous neural network to predict PV power production at different sub-hourly time steps (i.e., 5 to 60 min) at different desert locations in Algeria and Australia. The proposed models were sufficiently accurate, with relative random error components varying between 10 and 20%, and performed better for shorter prediction horizons. Bouchouicha et al. (2020a) used linear and non-linear approaches to estimate the electric power production of a 20 MWp plant installed in the Adrar region, in the south of Algeria, based on instantaneous radiometric and meteorological data at 15-min intervals. According to their results, all artificial neural network-based models are superior in prediction accuracy and performance stability, with cascade-forward neural network-based models providing the most reliable predictions. Trigo-González et al. (2019) estimated hourly PV energy production. They presented a multiple linear regression model to determine the hourly PV production using the performance ratio factor for selected technologies (cadmium telluride and multi-crystalline silicon). Their linear regression model was validated against an independent dataset and showed a root mean square error of around 8% for the San Pedro de Atacama plants and 7 to 16% for the Antofagasta plants. Wang et al. (2021) compared the performances of different prediction models for forecasting PV power production. A simple efficiency model, a temperature correction model, and a one-diode model were proposed for each PV configuration. The simple efficiency model overestimated the power output of PV modules by approximately 10%, except for cadmium telluride (CdTe) PV modules.

Furthermore, it was reported that the one-diode model has the best accuracy for predicting the power output of monocrystalline (Mono-Si) and polycrystalline (Poly-Si) PV modules. Regressive/linear methods can be classified as linear, multiple-linear stationary, and non-linear stationary methods. These methods estimate the correlation between a dependent variable (i.e., produced power) and independent variables (i.e., predictors) and require high-quality historical data on PV output and weather conditions to enhance the PV power estimates.

Machine learning models could be too complicated for some end users. However, it is standard today to use advanced machine learning methods in the data analysis and forecasting of many solar energy systems. For instance, Dolara et al. (2015) compared three physical models to predict the power output of a PV cell. The models were based on equivalent circuits with three, four, and five parameters, drawn from the photo-generated current, reverse saturation current, diode ideality factor, series resistance, and shunt resistance. Experimental data for 29 PV modules were collected in Milano, Italy, and used to calibrate the models. The adopted calibration method was computationally fast and showed good accuracy for the considered PV modules. Alam et al. (2015) proposed a scheme for modeling and identifying the maximum power output of a stand-alone PV generator. To establish the performance of the PV generator, a static model (i.e., with non-varying inputs) was developed in a MATLAB/Simulink environment, considering ambient air temperature and solar irradiance as real-time variables. Then, two cases of constant solar radiation with the static model and constant solar radiation with the dynamic PV model were analyzed and compared. The PV model was validated with experimental data for 30 days, and results were analyzed for each 30-min interval. It was concluded that the developed model has a high accuracy of 99.72%. Liu et al. (2018) developed a two-stage model to estimate prediction intervals for PV power output. The generalized regression neural network, extreme learning machine, and Elman neural network were integrated using the optimized backpropagation genetic algorithm to develop a weight-varying combination forecast model. The non-parametric kernel density estimation method was then adopted to estimate the prediction intervals from the statistical distribution of the errors of the earlier deterministic point predictions. It was reported that the proposed algorithm produced much more accurate results than the conventional approaches.

This work presents a baseline study for desert areas in Australia and Algeria to provide needed information for future developments. Many studies have been conducted to monitor challenges for solar energy in desert areas, e.g., radionuclides (Aba et al. 2018; El-Kenawy et al. 2022), PV plants’ performance (Aoun et al. 2019), improvement of sustainable energy systems (Bailek et al. 2018), and passive air pollution (Tang and Al-Dousari 2006). However, to the authors’ knowledge, an evaluation of the power production of various PV technologies under the arid desert climate is unavailable. The intermittent nature of PV production poses a significant obstacle to integrating PV systems into the electric grid. As a result, accurate predictions are needed. Within the scope of this paper, regression models based on multiple environmental factors affecting the PV power generation of various types of photovoltaic panels in six typical arid desert areas in Australia and Algeria are established and tested on hourly time scales, taking into consideration the technological features of photovoltaics, along with the actual characteristics of the operation settings and climatic conditions of the considered sites. Then, an effective ensemble-learning approach is applied to the optimal (best-fit) input combinations to obtain more accurate estimates.

2 Materials and methods

2.1 Data collection

Desert climate is experienced in arid regions. It is characterized by excessive evaporation and very low precipitation, ranging between 25 and 200 mm annually (Vaughn 2005; Sikka 1997). Dry climate regions cover 26.22% of the global land (Kottek et al. 2006). Adrar is the second-largest town in the Algerian desert, in the southern region of the country. It is characterized by energy-rich solar resources (Bouchouicha et al. 2017, 2015; Bailek et al. 2020a) and relatively flat terrain, where the highest point reaches 421 m. The region receives annual global solar irradiation higher than 2200 kWh/m2, with around 3500 sunshine hours (mostly clear-sky days). Alice Springs is located in Australia’s interior desert region. It is part of the Northern Territory of Australia and receives marginal rainfall. The area receives an average global horizontal irradiation of ~ 6.17 kWh/m2/day with a daily average sunshine duration of more than 9 h (Darula et al. 2010).

Desert areas tend to have clear skies for most of the year, making it easier to forecast the PV output power compared to, e.g., tropical and temperate climates. However, other factors should be considered. For instance,

  • It is widely established that the performance of solar PV systems is degraded with increasing temperatures (Rezk and Fathy 2017). Therefore, the actual outdoor performance of the solar PV cells needs to be quantified before exploring the large-scale deployment of PV plants.

  • Sandstorms are frequent, and wind speeds are higher in typical desert areas, which results in a considerable deviation from the expected performance of PV panels based on standard test conditions (Mostefaoui et al. 2019).

  • The dust accumulation rate is typically higher, and PV panels are typically cleaned less frequently, since many such plants are installed in remote locations (Mostefaoui et al. 2019; Huang et al. 2016).

  • The climate in the studied regions is mostly clear throughout the year, hence the arid desert climate classification. However, this does not mean all days of the year have clear skies. The examined areas have their considerable shares of cloudy and rainy days (Weatherspark. 2022).

Long-term measurements of the PV parameters and the relevant meteo-solar parameters are used in this study. For Adrar, these datasets are obtained from the Renewable Energy Research Unit in the Saharan Region (URERMS) site, which is located at 27°53′N latitude and 00°16′W longitude and has an elevation of 269 m. For Alice Springs, similar datasets are obtained from the Desert Knowledge Australia Solar Centre (DKASC), located at 23°46′S latitude and 133°52′E longitude, at 558 m (Desert Knowledge Australia Solar Centre - Download Data. 2020). As shown in Table 1, various photovoltaic technologies in six plants have been considered in this study, namely monocrystalline, hybrid silicon (heterojunction “HIT” cells), amorphous silicon, cadmium telluride, and polycrystalline.

Table 1 General specifications of the six considered PV modules

All data that passed simple quality tests were used in this work. The data quality check considers the physically possible, statistically possible, and extremely rare limits of all monitored parameters, as detailed in Hassan et al. (2021c). For instance, the tests ensured that the ratio between the ground-level global horizontal solar irradiance and the extraterrestrial horizontal irradiance (i.e., the global clearness index) is above 0.0 (considering daylight hours) and below 1.2. The tests also ensured that the minimum wind speed and relative humidity are ≥ 0.0 m/s and ≥ 0.0%, respectively, and that the maximum relative humidity is ≤ 100%. Besides, the automated check highlighted any short-term data gaps in the recorded datasets. Finally, the few missing or omitted data points (usually near sunset and sunrise) were filled using two-directional exponential smoothing (Hassan et al. 2021c). The recorded datasets consist of solar PV power production and other meteo-solar parameters, namely global horizontal solar irradiance, ambient air temperature, relative humidity, and wind speed. All parameters are averaged from the original 5-min resolution to obtain hourly average values. The period of measurements varies from two to five years, depending on the station (Table 1). Features with larger ranges would otherwise dominate the calculation process. Therefore, it is essential to scale and normalize the data to guarantee that all features lie within the same bounds and are treated similarly by the physical and machine learning models. One simple way to do this is the min–max scaler, which maps each feature to the range between 0 and 1. Figure 1 depicts the frequency distributions of normalized power production from each plant throughout the periods of data collection shown in Table 1. The produced power from these plants is normalized based on the peak capacity of each plant, hence the Wh/Whp unit (the “p” subscript is for “peak”).
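
The range tests and the min–max scaling described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function names are hypothetical, and the handling of night-time records and of constant features is an assumption.

```python
def clearness_index(ghi, extraterrestrial):
    """Global clearness index: ground-level GHI over extraterrestrial horizontal irradiance."""
    return ghi / extraterrestrial


def passes_quality_checks(ghi, extraterrestrial, wind_speed, rel_humidity):
    """Apply the range tests quoted in the text to a single record."""
    if extraterrestrial <= 0.0:
        # Assumption: night-time records are rejected because the index is undefined.
        return False
    kt = clearness_index(ghi, extraterrestrial)
    return (0.0 < kt < 1.2) and wind_speed >= 0.0 and 0.0 <= rel_humidity <= 100.0


def min_max_scale(values):
    """Scale one feature to [0, 1]; a constant feature maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]
```

For example, `min_max_scale([10, 20, 30])` maps the smallest value to 0, the largest to 1, and the middle value to 0.5.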

Fig. 1 Frequency histograms of normalized PV power of each technology

The original datasets can be divided into three subsets based on the corresponding sky conditions, represented by the global clearness index, as shown in Table 2. The global clearness index is the ratio between the ground-level global horizontal irradiance and the corresponding extraterrestrial horizontal irradiance (Hassan et al. 2021d). It can be seen that 64.78–65.69% of the input datasets correspond to clear sky conditions, 22.53–23.47% to partly cloudy sky conditions, and 11.75–12.31% to cloudy sky conditions. This indicates that the study areas experience predominantly clear-sky weather (desert environment), which facilitates the prediction of PV power generation using relatively simple models. Under cloudy sky conditions, in contrast, cloud formations are frequent and unpredictable, leading to relatively poor prediction accuracy of simple models. Table A1 in the supplementary material provides quantitative statistical summaries of the different measured meteo-solar parameters during the measurement periods.

Table 2 Frequency of data corresponding to the four categories of sky conditions for all selected technologies

In general, the performance of a PV power system is influenced by electrical and solid-state material characteristics. However, meteo-solar parameters, such as global horizontal irradiance (\(GHI\)), average air temperature (\(TEM\)), wind speed (\(WSP\)), and relative humidity (\(REH\)), have been frequently reported as the most influential variables in determining the instantaneous PV power output, with different degrees of influence. Table A2 in the supplementary material shows that the Pearson correlation coefficients (R) between PV power output and \(GHI\) are the strongest, ranging between 0.870 and 0.970 for the six studied plants. The other meteorological parameters are less correlated with the produced power, but their correlation coefficients are still considerable.

2.2 Ensemble learning models

A weighted sum ensemble is an ensemble learning approach that combines the predictions from multiple models, where the contribution of each ensemble member to the final prediction is weighted in proportion to its capability or skill. The ensemble learning approach is adopted to achieve better performance than that obtained from single methods. Five machine learning techniques were selected: Decision Tree Regressor (DTR), Random Forest Regressor (RFR), MLP Regressor, Support Vector Regression (SVR), and K-Neighbors Regressor (KNR). Owing to their good predictive performance, these algorithms were employed to implement the ensemble using the K-Neighbors Regressor (KNR-ensemble).
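
The generic weighted-sum idea can be illustrated with a short sketch. It is not the KNR-ensemble used in this study: the function names are hypothetical, and taking each member's skill as the inverse of its validation RMSE is an assumption made purely for illustration.

```python
def skill_weights(rmse_by_model):
    """Weights proportional to 1/RMSE, normalized to sum to 1.

    Assumption: skill is measured as inverse validation RMSE.
    """
    inverse = {name: 1.0 / rmse for name, rmse in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_sum_prediction(predictions, weights):
    """Combine the members' predictions for one sample into a single estimate."""
    return sum(weights[name] * predictions[name] for name in predictions)
```

With this choice, a member whose validation RMSE is half that of another receives twice its weight in the combined prediction.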

The Random Forest (RF) algorithm is a successful and widely employed ensemble technique (Mao and Wang 2012). The RF algorithm can be used for both regression and classification. RF has gained considerable interest due to its immunity to noise and its accuracy compared to single classifiers. No reasonable change is expected in the RF tree due to small changes in training data because of the hierarchical architecture of tree classifiers. The main drawback of the RF algorithm is the high variance. However, RF usually performs better than the decision tree (DT) algorithm.

Multilayer perceptron (MLP) with two or more hidden layers is considered an artificial neural network (ANN). MLP is a well-performing algorithm for classification and regression (Keshtegar et al. 2022; Bouchouicha et al. 2020b). This is due to the ability of MLP to learn non-linear decision boundaries, which makes it flexible enough to give reasonable solutions to real-world tasks. MLP has many artificial neurons and connections, named processing elements (PEs). These PEs emulate the operations of the human nervous system based on a particular training algorithm.

Support Vector Regression (SVR) is also a robust algorithm (VanDeventer et al. 2019; Liu et al. 2021). SVR has the flexibility to define how much error is acceptable in our model and will find an appropriate line (or hyperplane in higher dimensions) to fit the data. This can be achieved by tuning the tolerance of falling outside the acceptable error rate and an acceptable error margin.

The K-Neighbors Regressor (KNR) algorithm stores the training samples and uses a similarity measure to make predictions for new cases (Qun’ou et al. 2021). The prediction for a new sample is derived from its nearest stored samples or points. The number of nearest neighbors, k, is an adjustable parameter that controls the flexibility of the model. The default value of the k parameter is one.
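
The nearest-neighbor idea, applied to regression, can be sketched as follows. This is a minimal illustration assuming Euclidean distance and an unweighted average of the neighbors' targets; the function name and the default `k=3` are hypothetical choices for the example and do not reflect the settings used in this study.

```python
import math


def knn_regress(train_x, train_y, query, k=3):
    """Predict a target as the mean target of the k nearest training samples."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Pair each training target with its Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k
```

A small k follows the training data closely, whereas a larger k averages over more neighbors and yields smoother predictions.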

The Polar Rose Guided Whale Optimization algorithm based on the Dynamic Adaptive technique (AD-PSO-Guided WOA) has been used for feature selection in the present investigation. The main target of the individuals in the exploitation group of the Adaptive Dynamic Polar Rose Algorithm combined with the Guided Whale Optimization Algorithm is to move toward the optimal or best solution. The main target of the individuals in the exploration group is to search the area around the leaders. The exchange of agents between the population groups is updated dynamically (Ghoneim et al. 2021). Algorithm 1 shows the complete steps of computations in the AD-PSO-Guided WOA algorithm. The algorithm starts by initializing the population parameters, the fitness or objective function, the number of required iterations, and the parameters needed to start the AD-PSO-Guided WOA algorithm. The fitness function is then evaluated for all individuals in the population to identify the best solution. The algorithm converts all the available solutions to binary ones using the following equations.

$$\begin{array}{ll}& {X}_{d}^{(t+1)}=\left\{\begin{array}{ll}0&\;\mathrm{if\;Sigmoid}\;({X}^{*})<0.5\\ 1& \mathrm{otherwise}\end{array}\right.\end{array}$$
(1)
$$\mathrm{Sigmoid}\;({X}^{*})=\frac{1}{1+{e}^{-10({X}^{*}-0.5)}}$$
(2)

where \({X}^{*}\) is the best individual. The algorithm searches for and updates the best solution at the end of each iteration. If the algorithm is stuck, it selects three random search solutions, \({X}_{rand1}\), \({X}_{rand2}\), and \({X}_{rand3}\), to be used in updating the position of the current search agents (solutions) based on the following equation.

$$X\left(t+1\right)={w}_{1} {X}_{rand1}+z {w}_{2} \left({X}_{rand2}-{X}_{rand3}\right)+\left(1-z\right) {w}_{3} \left(X-{X}_{rand1}\right)$$
(3)

where \(z=1-{\left(\frac{t}{Ma{x}_{iter}}\right)}^{2}\) at iteration \(t\) and maximum iterations \(Ma{x}_{iter}\). The fitness function \({F}_{n}\) is then calculated for each \({X}_{i}\) using this update, which is called Guided WOA. Otherwise, the fitness function \({F}_{n}\) is calculated using the PSO algorithm for each \({X}_{i}\). The algorithm ends when the iterations are completed, selecting the best solution.
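
Equations (1) to (3) can be written out as a short sketch. It covers only the binarization and the guided position update, not the full AD-PSO-Guided WOA algorithm. The function names are hypothetical, and applying the sigmoid component-wise to a position vector, as well as passing the weights \({w}_{1}\), \({w}_{2}\), and \({w}_{3}\) as plain arguments, are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Eq. (2): steep sigmoid centred at 0.5."""
    return 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))


def binarize(position):
    """Eq. (1): map each continuous component to 0 or 1 (component-wise, by assumption)."""
    return [0 if sigmoid(x) < 0.5 else 1 for x in position]


def guided_update(x, x_rand1, x_rand2, x_rand3, t, max_iter, w1, w2, w3):
    """Eq. (3): update a search agent using three randomly selected solutions."""
    z = 1.0 - (t / max_iter) ** 2  # decays from 1 at t = 0 to 0 at t = max_iter
    return [
        w1 * r1 + z * w2 * (r2 - r3) + (1.0 - z) * w3 * (xi - r1)
        for xi, r1, r2, r3 in zip(x, x_rand1, x_rand2, x_rand3)
    ]
```

Because \(z\) decreases from 1 to 0 over the iterations, the difference term \({X}_{rand2}-{X}_{rand3}\) dominates early in the search, while the term \(X-{X}_{rand1}\) dominates toward the end.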

Algorithm 1 The binary AD-PSO-Guided WOA algorithm

2.3 Regression models

Recent literature has established that an empirical model for estimating PV power output can be developed by employing linear and multi-linear regression models (Trigo-González et al. 2019; Azevedo Dias et al. 2017). In addition, the diurnal fluctuation of PV power production follows both linear and non-linear trends (Dolara et al. 2015). This can be attributed to the linear and non-linear fluctuations of the influential meteo-solar parameters, e.g., the global horizontal solar irradiance. The empirical models developed in this study are based on the correlation between meteo-solar parameters and PV power (Mostefaoui et al. 2019). They allow the two approaches (regression and ensemble learning) to be compared for predicting the PV power output in the desert environment. The nine regression models are expressed as

$${P}_{pv}={b}_{1}\;GHI+a$$
(4)
$${P}_{pv}={b}_{1}\;TEM+a$$
(5)
$${P}_{pv}={b}_{1}\;WSP+a$$
(6)
$${P}_{pv}={b}_{1}\;GHI+{b}_{2}\;TEM+a$$
(7)
$${P}_{pv}={b}_{1}\;GHI+{b}_{2}\;WSP+a$$
(8)
$${P}_{pv}={b}_{1}\;GHI+{b}_{2}\;REH+a$$
(9)
$${P}_{pv}={b}_{1}\;TEM+{b}_{2}\;REH+{b}_{3}\;WSP+a$$
(10)
$${P}_{pv}={b}_{1}\;GHI+{b}_{2}\;TEM+{b}_{3}\;WSP+a$$
(11)
$${P}_{pv}={b}_{1}\;GHI+{b}_{2}\;TEM+{b}_{3}\;REH+a$$
(12)

where \({P}_{pv}\) is the produced power, and \(a\), \({b}_{1}\), \({b}_{2}\), and \({b}_{3}\) are the fitted regression coefficients.
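
As an illustration of how such coefficients can be fitted, the sketch below estimates \({b}_{1}\) and \(a\) for a single-predictor model of the form of Eq. (4) by ordinary least squares. The function name is hypothetical and the use of ordinary least squares is an assumption; the multi-predictor models (Eqs. (7) to (12)) would require solving the corresponding normal equations instead.

```python
def fit_simple_linear(x, y):
    """Ordinary least-squares estimates (b1, a) for y = b1 * x + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope
    a = mean_y - b1 * mean_x  # intercept
    return b1, a
```

For example, fitting the points (0, 1), (1, 3), (2, 5), and (3, 7) returns a slope of 2 and an intercept of 1.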

2.4 Model evaluation

About 80% of the dataset collected at each location was used for fitting/training the regression and ensemble models. The remaining 20% was employed to test the reliability of the developed models. The sampling and assignment of observations to the two subsets were performed randomly based on a uniform distribution instead of chronological partitioning. This is to reduce the dependency of the developed models on the specific data used in the fitting process and to ensure an equivalent performance of the models when handling new datasets (Bouchouicha et al. 2019).
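
A random 80/20 partition of this kind can be sketched as follows. The function name, the fixed seed, and the rounding of the cut point are assumptions made so that the example is reproducible.

```python
import random


def random_split(records, train_fraction=0.8, seed=0):
    """Shuffle records uniformly at random, then split them into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split can be reproduced
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Every record ends up in exactly one of the two subsets, and the assignment ignores chronological order.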

The metrics used for performance evaluation include the mean bias error (MBE), the root mean square error (RMSE), the relative root mean square error (RRMSE, %), and Pearson’s correlation coefficient (R). The main calculation formulas for the four metrics are as follows (Muzathik et al. 2011; Jamil et al. 2018; Almorox et al. 2020, 2021; Bailek et al. 2017)

$$\mathrm{MBE}=\frac{1}{M}\sum_{m=1}^{M}\left(\widehat{{Y}_{m}}-{Y}_{m}\right)$$
(13)
$$RMSE=\sqrt{\frac{1}{M}\sum_{m=1}^{M}{[\widehat{{Y}_{m}}-{Y}_{m}]}^{2}}$$
(14)
$$RRMSE=\frac{RMSE}{\overline{{Y }_{m}}}\times 100$$
(15)
$$R=\frac{{\sum }_{m=1}^{M}\left(\widehat{{Y}_{m}}-\overline{\widehat{{Y }_{m}}}\right)\left({Y}_{m}-\overline{{Y }_{m}}\right)}{\sqrt{\left[{\sum }_{m=1}^{M}{\left(\widehat{{Y}_{m}}-\overline{\widehat{{Y }_{m}}}\right)}^{2}\right]\left[{\sum }_{m=1}^{M}{\left({Y}_{m}-\overline{{Y }_{m}}\right)}^{2}\right]}}$$
(16)

where \(M\) is the number of observations in the subset, \(\widehat{{Y}_{m}}\) and \({Y}_{m}\) are the mth estimated and observed PV power values, respectively, and \(\overline{\widehat{{Y }_{m}}}\) and \(\overline{{Y }_{m}}\) are the arithmetic means of the estimated and observed values, respectively.
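
Equations (13) to (16) translate directly into code. The following is a minimal sketch with hypothetical function names, using only the standard library.

```python
import math


def mbe(estimated, observed):
    """Eq. (13): mean bias error."""
    return sum(e - o for e, o in zip(estimated, observed)) / len(observed)


def rmse(estimated, observed):
    """Eq. (14): root mean square error."""
    squared = sum((e - o) ** 2 for e, o in zip(estimated, observed))
    return math.sqrt(squared / len(observed))


def rrmse(estimated, observed):
    """Eq. (15): RMSE relative to the mean observed value, in percent."""
    return 100.0 * rmse(estimated, observed) / (sum(observed) / len(observed))


def pearson_r(estimated, observed):
    """Eq. (16): Pearson's correlation coefficient."""
    mean_e = sum(estimated) / len(estimated)
    mean_o = sum(observed) / len(observed)
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(estimated, observed))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_e * var_o)
```

Note that a constant offset between estimates and observations leaves R equal to 1 while MBE and RMSE are non-zero, which is why the metrics are reported together.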

3 Results and discussion

Firstly, linear and multiple linear regression techniques were used to study the relationship between meteo-solar parameters and power production of various PV technologies in desert climate conditions. As a result, for each PV technology in this study, performance analyses employed six multi-linear (MLR) and three linear regression (LIR) models for MOS1, HIS, AMS, MOS2, CDT, and POS technologies (Eqs. (4) to (12)). The fitted coefficients of all models are shown in Table 3.

Table 3 Regression models and their fitted coefficients in this study

In the first step, and to obtain an objective overview of the models’ performances, only three technologies are selected for the preliminary analysis, namely MOS1, HIS, and AMS. Figure 2 compares the different categories of models based on the test data. To enhance the readability of this figure, only R and RMSE values are displayed. The R and RMSE values show that the multi-linear regression models fit the PV power output better than the corresponding linear regression models in the test stage.

Fig. 2 Comparison between the best- and worst-performing multi-linear and linear regression models

Various comparisons are also made to assess the performance of the selected and tested correlations under arid desert climates. The estimation results for the power output of the MOS and AMS technologies showed that the developed models registered higher RRMSE values for the MOS technology than for the AMS technology. In addition, the performance results of the different linear regression models are close to each other, except for the first LIR model (Eq. (4)), whose R and RRMSE values differ markedly from those of the other models, especially for AMS technology. As such, the best-performing model from the linear regression category (Eq. (4)) is selected based on the error metrics of the testing subsets.

It follows from Table 4 that the different multiple-linear regression models produce a wide range of R values (ranging from 0.4030 for model #7 to 0.9762 for model #9), as well as a wider range of estimated RRMSEs (ranging from 11.3389% for model #9 to 48.46% for model #7). Model #9, a multiple-linear model with global horizontal irradiance, ambient temperature, and relative humidity as inputs, shows superior performance in terms of testing error measures. Model #8 (another multiple-linear model) performs close to model #9. Regarding R values, model #9 emerges as the best-fitted model for HIS and AMS technologies. In contrast, model #8 yielded the best performance for MOS1 technology.

Table 4 Testing performance indices of PV power models (based on normalized \({P}_{pv}\) values) for MOS1, HIS, and AMS technologies

The relative ability of the models to predict the PV power output is, a priori, a function of sky conditions. So far, the models have been analyzed under all-sky conditions. This section analyzes them under specific sky conditions (clear, partially cloudy, and cloudy skies), as mentioned in Table 2. Taylor diagrams are used to obtain an analytical description of the two best-performing regression models under different sky conditions for all technologies. The Taylor diagram, shown in Fig. 3, indicates that each model behaves differently in predicting the PV power output of all technologies depending on the sky conditions. It is observed that model #9 shows better estimates under overcast and partially cloudy sky conditions and produces error estimates equal to those of model #8 under relatively clearer skies. This implies that the combination of global irradiance, air temperature, and relative humidity is most closely related to the PV power output of all technologies, followed by the combination of global irradiance, temperature, and wind speed.

Fig. 3 Evaluation of the two best-performing regression models under different sky conditions for all technologies

Next, the proposed ensemble learning techniques with the top-performing inputs (global horizontal irradiance, ambient temperature, and relative humidity) were used to predict the power production of various PV technologies in desert climate conditions. Figure 4 shows the performance of the individual and ensemble learning models relative to the best-performing regression model, in terms of percentage drops in RMSE. It can be seen from Fig. 4 that the RMSE drops do not exceed 50% for the DTR, MLP, SVR, and RFR models for the various PV technologies, except for MOS technology, where the percentage exceeds this value. In contrast, the ensemble learning techniques reported a percentage exceeding 75%. Generally, the RMSE drops of the individual models vary from technology to technology. Also, as expected, the proposed ensemble model registered higher RMSE drops than the individual models. Therefore, it can be inferred that the proposed ensemble model outperformed the individual models.
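
The percentage drop in RMSE is not defined explicitly in the text. A plausible reading, assumed here, is the reduction relative to the RMSE of the reference regression model; the function name and the numbers in the usage note are hypothetical.

```python
def rmse_drop_percent(rmse_reference, rmse_candidate):
    """Percentage reduction in RMSE relative to a reference model (assumed definition)."""
    return 100.0 * (rmse_reference - rmse_candidate) / rmse_reference
```

Under this definition, a candidate with an RMSE of 0.025 measured against a reference with an RMSE of 0.10 corresponds to a drop of 75%.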

Fig. 4 Percentage drops in the RMSE when replacing model #9 (best-performing regression model) with machine learning models

The error estimates of the proposed ensemble model for each PV module technology are summarized in Table 5. Table 5 shows that the proposed ensemble model produced its largest errors for MOS2 technology, with RMSE and RRMSE values of 0.0323 and 6.3820%, respectively. This is followed by CDT technology, with an RMSE of 0.0315 and an RRMSE of 5.9625%. It is worth mentioning that the MOS2 plant is the only one located at Adrar, where seasonal passing clouds occur. This demonstrates the importance of incorporating cloud parameters into PV power prediction models, which will be further examined in future works.

Table 5 PV power prediction errors (normalized values) when applying the proposed ensemble model for each technology

On the other hand, the model provided the best performance for MOS1 technology, with MBE and RRMSE values of − 0.0004 and 1.5109%, respectively. Moreover, the correlation coefficient of the results obtained using the proposed model exceeded 0.99 for almost all the PV technologies used in the present work, with a maximum of 0.9994 for MOS1 technology.

Table A3 in the supplementary material compares the results of the proposed ensemble-learning technique with alternative models developed in previous studies in terms of RRMSE. It is clear from the table that the proposed approach has substantially lower error values (down to RRMSE of 1.5109%) compared to conventional machine-learning algorithms, such as ANN, SVR, and MLR, with RRMSEs ranging between 3.6 and 9.546%.

In general, multi-linear regression models offer competitive performance under different sky conditions. Their performance decreases as the sky cloudiness increases. Besides, when integrating global solar radiation, ambient air temperature, and relative humidity measurements as typical inputs, the regression models’ performance improves generally and achieves efficient performance with the proposed ensemble learning, with an estimated accuracy of over 99%.

Finally, the solar and climate data variables contain enough information to predict the power generation of different PV technologies accurately. It should also be stated that PV modules of the different technologies operate in an outdoor environment with numerous fluctuations in other operating conditions, such as cloud and dust pollutant parameters, leading to deterioration in the output power of the PV modules. However, the effect of these factors varies slightly among the different PV technologies used for power generation. The dust pollution effect depends on the local area where the PV system is mounted and the site’s local environmental conditions (Abd El-Wahab et al. 2018; Darwish et al. 2018). This will be covered in future studies.

4 Conclusions

This study proposes new regression and ensemble-learning models by studying six different desert sites in the Algerian Big South desert and the Australian Northern Territory to achieve accurate estimations of the performance of PV power plants running in desert areas. Six PV module technologies were selected for the analysis. A feature selection method was developed to enhance the ensemble-learning models using the AD-PSO-Guided WOA algorithm. The proposed approach considers the photovoltaic system’s technological features and the actual characteristics of the operation settings and climatic conditions for experiment sites (global irradiance, ambient temperature, and relative humidity). The following points summarize the findings of the study:

  • Multi-linear regression models offer competitive performances under the different sky conditions, with their performances declining as the sky becomes cloudier.

  • By incorporating global solar irradiance, ambient air temperature, and relative humidity measurements as model inputs, the performances of the regression models generally improve.

  • With these inputs, the ninth developed regression model showed RRMSE values of up to 11.33%, based on the normalized values of the PV power output.

  • The proposed K-Neighbors Regressor ensemble model showed a reduction of 83.8% in the RMSE of the top-performing regression model, with an estimated accuracy of over 99%.

  • The drops in RMSE do not exceed 50% for DTR, MLP, SVR, and RFR-based models for the various PV technologies, except for MOS technology. In contrast, the ensemble learning techniques reported a percentage exceeding 75%.

  • Generally, all selected individual models have different percentage reductions in RMSE that vary from one technology to another. However, the proposed ensemble model registered higher percentage reductions in RMSE values.

  • The proposed ensemble model produced its largest errors for the MOS2 technology, with a maximum RRMSE of 6.382%.

  • The ensemble model also provided the best performance for MOS1 technology, corresponding to a maximum RRMSE of 1.511%.

  • The correlation coefficients of the proposed model for almost all PV technologies adopted in the present work exceeded 0.99, with a maximum of 0.9994 for the MOS1 technology.

  • It is concluded that the proposed model best fits all examined PV technologies, which are the most suitable for desert locations. It also outperforms conventional models in the literature, reducing the RRMSE by up to 6.32-fold.

However, it should be noted that this study primarily focused on the overall influences of meteorological parameters without considering ground and atmospheric parameters, such as dust accumulation and cloud formations, which are not regularly measured in meteorological networks. In future works, these parameters will be considered as additional predictors for more precise predictions. It is also recommended to re-evaluate the proposed models under other climate zones, including tropical and temperate climates.