Introduction

Perovskite solar cells (PSCs) are a class of photovoltaic devices that have received much interest recently [1,2,3,4] because of their high power conversion efficiencies, low cost, and simple fabrication process [5,6,7]. However, despite these advantages, the performance and stability of PSCs still need improvement, and their properties must be optimized for practical applications [6]. They are limited by several factors, such as poor stability [6], low reproducibility [8], and a limited understanding of the underlying physics [1, 9]. Recently, machine learning (ML), with its ability to analyze large amounts of data and identify patterns, has emerged as a powerful tool to address these challenges and optimize the performance of PSCs [10,11,12,13,14,15,16,17,18,19].

Machine learning is a subfield of artificial intelligence that creates algorithms and models that allow computers to learn from data and improve their predictions or choices over time [20]. Machine learning algorithms can be broadly categorized into two types: supervised learning [21], where the algorithm learns from labeled training data to make predictions on new, unseen data, and unsupervised learning [22], where the algorithm learns from unlabeled data to identify patterns or relationships in the data. With the increasing availability of large amounts of data and powerful computing resources, machine learning is becoming an increasingly important tool in fields such as healthcare, finance, and e-commerce [23].

Machine learning has been increasingly utilized in optimizing and designing perovskite solar cells [10,11,12,13,14,15,16,17,18,19]. Machine learning algorithms like random forest, decision trees, and neural networks can analyze large amounts of data and identify critical factors that affect the performance of PSCs [11, 24]. Such ML algorithms have been employed to predict photovoltaic performance and improve device efficiency [10]. By using these algorithms, researchers can perform experiments more efficiently and quickly and gain a deeper understanding of the relationships between the various parameters that influence the performance of PSCs. For example, some researchers have used machine learning to analyze the effect of various processing conditions and material properties on the performance of PSCs [25]. This has allowed for identifying key factors that influence the performance of PSCs, such as the crystallinity of the perovskite material, thickness of the perovskite layer, the morphology of the electrodes, and the interface quality between the perovskite layer and the electrodes [10,11,12,13,14,15,16,17,18,19].

Studies reported in the literature have shown that machine learning can accurately predict the power conversion efficiency (PCE) of perovskite solar cells based on the composition and structural parameters [13, 17, 26, 27]. This can reduce the number of experiments required and accelerate the optimization process. In addition, machine learning has been used to identify the dominant factors affecting the performance of perovskite solar cells, allowing for targeted improvement efforts [13, 27]. Moreover, deep learning approaches have been applied to analyze perovskite solar cell characterization data, such as scanning electron microscopy and X-ray diffraction, to better understand the relationship between the microstructure and photovoltaic performance [28, 29].

The main problem addressed here is selecting a machine learning model or algorithm that is compatible with and valid for a specific problem. More specifically, this research concerns perovskite solar cells and their power conversion efficiencies in terms of analysis, optimization, and prediction. The proposed study compares machine learning algorithms on several criteria to draw a general conclusion about which type of algorithm excels in which area. In other words, it is a detailed, comprehensive study of machine learning algorithms regarding accuracy, complexity, and resource consumption.

Dataset generation

Following the objective of the current study, which is to investigate the compatibility of various ML models with perovskite-based datasets, constructing datasets that reflect the performance of perovskite solar cells is a concrete first step. Herein, we generate three datasets for a cesium lead halide perovskite solar cell by combining simulation data and experimental results. The reason for selecting CsPbX3 (X = I, Br, or Cl) perovskite solar cells is our prior experience in simulating, fabricating, and characterizing this type of PSC [4, 30,31,32,33,34,35,36,37,38,39]. As shown in the schematic in Fig. 1, a typical CsPbX3 solar cell is constructed of a perovskite thin film sandwiched between a hole transport layer (NiO) and an electron transport layer (mesoporous TiO2). Additionally, top and counter electrodes are used for collecting carriers; a symmetric cell is assumed in the current study, with fluorine-doped tin oxide (FTO) coated glass as electrodes. All the material parameters associated with the cell are listed in Table 1.

Fig. 1 Schematic diagram demonstrating the main building layers of cesium lead halide PSC while highlighting the key parameters impacting the PCE

Table 1 Material parameters for the CsPbX3 (X = I, Br, or Cl) PSC

As mentioned earlier, one of the critical aspects of dataset construction is validating the data against real experimental measurements. Accordingly, we chose NiO as the hole transport layer, knowing that other materials, such as spiro, may show better performance; this choice was dictated by our limited fabrication facilities. The same methodology provided in this study can be applied to other PSC configurations, and we consider such investigations a future extension of the current study.

The first dataset generated in this study correlates the material properties, mainly the porosity and the layer thickness of the TiO2, with the overall solar cell power conversion efficiency (PCE). This mirrors our earlier work on dye-sensitized solar cells [11, 24]. Herein, the dataset size is limited to 1470 points, with experimentally seeded data from our work in [34, 35] along with simulation results obtained using SCAPS optoelectronic modeling [32, 39]. Secondly, the electron transport layer thickness, doping, and defects are correlated against the cell PCE in a fully numerical, SCAPS-based dataset. This dataset contains around 5600 points, with three inputs and one output. Finally, the thickness, doping, and defect levels of the hole transport layer, electron transport layer, and perovskite thin film are studied in terms of the overall PCE of the PSC. This third dataset has nine inputs and one output, with 50,400 points. The three generated datasets vary in both size and complexity, i.e., the number of inputs. We enforced this variation deliberately to provide a practical evaluation of the proposed ML models under different datasets while keeping the nature of the dataset fixed.
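As a rough illustration of how a gridded dataset of this kind can be laid out, the sketch below enumerates the two inputs of the first dataset over the ranges reported for it in the results (thickness 0.1 to 7.0 μm, porosity 0 to 20%). The step sizes (0.1 μm and 1%) are assumptions, chosen only because they reproduce the stated 1470 points; the sampling actually used in the study is not specified in the text. The function `simulate_pce` is a hypothetical placeholder for the device simulation and returns no physical value.

```python
from itertools import product

# Assumed step sizes (not stated in the text): 0.1 um for thickness, 1 % for porosity.
thickness_um = [i / 10 for i in range(1, 71)]   # 0.1 ... 7.0 um -> 70 values
porosity_pct = list(range(0, 21))               # 0 ... 20 %     -> 21 values


def simulate_pce(thickness, porosity):
    """Hypothetical placeholder for a device simulation; returns no real PCE."""
    return None


# One row per (thickness, porosity) combination: inputs first, output last.
rows = [(t, p, simulate_pce(t, p)) for t, p in product(thickness_um, porosity_pct)]

print(len(rows))  # number of grid points: 70 * 21
```

Under these assumed steps the grid has 70 × 21 = 1470 rows, which matches the size quoted for the first dataset; other step choices could give the same count.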

Machine learning algorithms

As mentioned earlier, this study aims to provide a comprehensive analysis of the application of machine learning models to complex datasets, specifically in the field of perovskite solar cell PCE. We aim to provide insights into effectively comparing, selecting, implementing, and applying different machine learning algorithms to given datasets. Using a range of popular regression models and hyperparameter tuning techniques, we seek to identify best practices for achieving optimal model performance and reducing bias in machine learning analyses [40,41,42]. Ultimately, our goal is to guide researchers and practitioners looking to leverage machine learning methods for their data-driven projects.

In our study, we utilized three distinct datasets to model the relationship between several input parameters and the PCE of solar cells, as discussed in the previous section. To ensure a comprehensive evaluation, we employed a variety of machine learning regression models, namely random forest (RF) [11, 24, 43, 44], gradient boosting (GBR) [45], K-nearest neighbors (KNN) [46], and linear regression [47]. We deliberately selected various machine learning algorithms with varying complexity and bias-variance tradeoff responses to mitigate potential biases and limitations in our comparative analysis. This approach was undertaken to achieve a diverse and comprehensive set of models, thus ensuring a more balanced and unbiased evaluation. The bias-variance tradeoff represents a well-known challenge in machine learning, which refers to the balance between the amount of bias in a model (i.e., prior assumptions about the form of the final model output, regardless of the nature of the problem) and the degree of variance (i.e., the sensitivity of a model's performance to the training data and amount) [48]. We fine-tuned the hyperparameters of the selected ML models to optimize their performance on each respective dataset, thereby minimizing the root mean square error.

Concerning the hyperparameters, we adopted two popular techniques for hyperparameter tuning: grid search and randomized search [49]. Grid search involves defining a grid of hyperparameter values and exhaustively searching all possible combinations of these values to determine the optimal set of hyperparameters. Alternatively, randomized search randomly samples hyperparameters from a defined distribution and evaluates their performance on the data. While hyperparameter tuning can significantly improve the performance of machine learning models, it can also be computationally expensive and time-consuming. However, given the importance of achieving optimal model performance, we deemed it necessary to devote significant resources to this pre-stage of our analysis. The overall procedure is described in the flowchart presented in Fig. 2. Concerning the computational cost required to run the hyperparameter optimizers and the ML models, all the models were run on our lab workstation. The workstation has 2 × Xeon Gold 6240 2.6 GHz processors, with 36 cores and 24 MB cache, each with 32 GB RAM, supported by 2 × 480 GB SSDs.
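The difference between the two tuning strategies described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the tuning code used in the study: the search space, the `toy_score` function (a stand-in for a validation RMSE), and all values are hypothetical and unrelated to Table 2.

```python
import random
from itertools import product

# Hypothetical search space for illustration only (not the values in Table 2).
space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
}


def toy_score(params):
    """Stand-in for a validation RMSE; lower is better. Purely illustrative."""
    depth = params["max_depth"] if params["max_depth"] is not None else 20
    return abs(params["n_estimators"] - 100) / 100 + abs(depth - 5) / 10


def grid_search(space, score):
    """Evaluate every combination in the grid and return the best one."""
    names = list(space)
    candidates = [dict(zip(names, combo)) for combo in product(*space.values())]
    return min(candidates, key=score), len(candidates)


def randomized_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return min(candidates, key=score), len(candidates)


best_grid, n_grid = grid_search(space, toy_score)
best_rand, n_rand = randomized_search(space, toy_score, n_iter=5)
print(best_grid, n_grid)
print(best_rand, n_rand)
```

Grid search evaluates all 3 × 4 = 12 combinations and is guaranteed to find the best point on the grid, whereas randomized search evaluates only the requested number of draws and may miss it; that is the cost/coverage tradeoff between the two techniques.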

Fig. 2 Flowchart presenting an ML algorithm's entire procedure for PSC dataset handling

Results and discussion

Based on the methodology described in "Dataset generation" and "Machine learning algorithms", the generated datasets are fed into four different ML models: random forest, gradient boosting, K-nearest neighbors, and linear regression. Additionally, hyperparameter tuning techniques are used to optimize the learning processes. We obtained the results for each dataset separately and, because the tuned values were similar, we generalized the hyperparameters of the algorithms in Table 2 across all three datasets. The linear regression model was implemented directly because of its simplicity and because it does not require an exhaustive tuning process. However, it should be noted that generalizing the hyperparameters was specific to the datasets considered, and we do not recommend it as a standard practice for all machine learning problems. Although it worked well in our case without significantly affecting performance, we recommend applying hyperparameter tuning to each problem separately to ensure optimal results.

Table 2 Hyperparameters extracted for optimized machine learning algorithms

The study observed the variation of the algorithms in the accuracy and response to different training sizes. The primary ratio of 70/30 of the dataset for the training and validation sets was used to evaluate the models. The default configuration and hyperparameters of the algorithms were considered and compared with a tuned version to examine the impact of hyperparameter tuning on performance.
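For concreteness, a 70/30 split of the kind mentioned above can be sketched as follows. This is a minimal standard-library illustration under the assumption of a simple shuffled split; the function name `train_validation_split` and the seed are hypothetical, and the study's own splitting routine is not described in the text.

```python
import random


def train_validation_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder indices standing in for the 1470 rows of the first dataset.
samples = list(range(1470))
train, valid = train_validation_split(samples, train_fraction=0.7)
print(len(train), len(valid))
```

For the 1470-point dataset this gives 1029 training points and 441 validation points; calling the function with `train_fraction=0.3` reverses the proportions.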

Generally, the default settings for the selected ML algorithms performed well. However, after embedding the optimized hyperparameters, the tuned versions showed a remarkable improvement in the accuracy and performance of the models, as demonstrated in the next section, illustrating the importance of hyperparameter tuning. The default settings may not always produce an acceptable result for different types of problems, so the algorithms should be adjusted accordingly. For evaluation, four main aspects were considered: accuracy, complexity, computational cost, and training time. The following sections illustrate these aspects for each ML model under the three proposed datasets.

Accuracy, training size, feature importance, and contribution

We ran our models on each dataset separately, recorded the errors using the root mean square error (RMSE) metric, and plotted the actual efficiencies against the predictions. There is no universal standard for an acceptable accuracy; the threshold is largely a matter of user preference. The metrics available to assess performance mainly revolve around the error, with many variations such as mean absolute error, mean squared error, root mean squared error, and R-squared. We also ran the models with different training sizes to compare how their performance adapts to the amount of training data.
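The error metrics named above follow their standard definitions and can be written out directly. The sketch below uses only the Python standard library; the efficiency values are made up for illustration and are not results from the study.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up efficiencies (%) for illustration only.
actual = [8.0, 9.0, 10.0, 11.0]
predicted = [8.5, 9.0, 9.5, 11.0]

print(mae(actual, predicted), mse(actual, predicted))
print(rmse(actual, predicted), r_squared(actual, predicted))
```

Because RMSE is expressed in the units of the target (here, percentage points of PCE), it is straightforward to compare against the spread of efficiencies in each dataset.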

On the first dataset, with thickness and porosity as inputs, the models exhibited near-impeccable performance. The dataset consisted of 1470 points, with thickness values ranging from 0.1 to 7.0 μm and porosity from 0 to 20%. However, this dataset was considered limited and trivial due to the small changes in efficiency and the nearly linear relation between input and output parameters. Additionally, with only 1470 samples, the 30% portion of the split amounts to only about 440 points. The RMSEs for a 70:30 training and validation split and for a 30:70 split are listed in Table 3; see also Fig. 3.

Table 3 Root mean square error for the three datasets with and without tuning in the machine learning algorithms
Fig. 3 Actual efficiency against predicted efficiency applied on the first dataset for (a) gradient boosting default model, (b) random forest default model, (c) K-nearest neighbors default model, (d) gradient boosting tuned model, (e) random forest tuned model, (f) K-nearest neighbors tuned model, and (g) linear regression ML algorithms

Although the RMSE of linear regression was the highest, linear regression was robust against changes in the amount of training data, whereas the RMSEs of the other models nearly doubled. Given the linearity and simplicity of the dataset, linear regression was the preferred model because it saves time and achieves high accuracy with less resource consumption, as highlighted in the following sections. It did not require more complex approaches such as the decision trees used in the other models. While its RMSE was higher than that of the other models and its plot was less accurate, it was deemed highly efficient for datasets of a similar nature.

Random forest and gradient boosting, on the other hand, offered useful extra features, such as feature importance, which illustrates each input parameter's effect on the overall output. Both the thickness and the porosity were found to be almost equally influential in determining the output efficiency. Despite this advantage, their more complex decision trees resulted in higher resource consumption, which was unnecessary for this type of dataset. Thus, for future studies with similar datasets, it would be essential to weigh the benefits of these extra features against their higher computational costs.

The second dataset contained 5600 points with three inputs (thickness, doping, and defects of the ETL) and one output (PCE). Compared to the first dataset, this dataset provided more variety and more challenges for the machine learning models. The non-linear relationship between the inputs and the output made the linear regression model inadequate, as it is biased towards linear relationships regardless of the dataset. The model pushed all predicted values towards 9–10%, which indicates a failure to capture the relationship between the parameters and the output. Thus, it was not considered for this case. The input–output plots of the other models showed a significant decrease in performance for efficiencies ranging from 7.5 to 10%, which was attributed to outliers and to a lack of samples in this range, leading to inadequate model training there. Ensemble methods handled sparse data better and performed better under these challenging conditions. Their predicted efficiencies were close to ideal, as shown in Fig. 4; RMSE values are listed in Table 3.
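The behavior described above, where a linear model collapses its predictions towards a near-constant value, can be reproduced on a toy example. The sketch below fits an ordinary least-squares line to a symmetric quadratic relation; the data are synthetic and unrelated to the study's datasets, and `fit_line` is a hypothetical helper written only for this illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy, symmetric non-linear relation (not data from the study).
x = [i / 10 for i in range(-10, 11)]
y = [v ** 2 for v in x]

slope, intercept = fit_line(x, y)
predictions = [slope * v + intercept for v in x]

# Spread of the predictions versus spread of the true targets.
print(max(predictions) - min(predictions), max(y) - min(y))
```

Here the best-fitting line is essentially flat, so every prediction sits at the mean of the targets even though the targets themselves span the full range; the real datasets are not symmetric in this way, so this is only an analogy for the effect.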

Fig. 4 Actual efficiency against predicted efficiency applied on the second dataset for (a) gradient boosting default model, (b) K-nearest neighbors default model, (c) random forest default model, (d) gradient boosting tuned model, (e) K-nearest neighbors tuned model, (f) random forest tuned model, and (g) linear regression ML algorithms

The significant impact of inadequate training on the tuned versions suggests that training the model on sufficient samples is crucial; otherwise, the effectiveness of hyperparameter tuning is greatly diminished. Hyperparameter tuning is only effective when the model is trained on enough samples and the dataset is representative of the population it is supposed to model. This is because tuning adjusts the hyperparameters so that the model fits the available dataset as closely as possible. If that dataset is not representative or lacks enough samples, the optimized hyperparameters may not transfer to the population the model is supposed to describe.

The random forest and gradient boosting models were found to perform better than linear regression and KNN, with RF and GBR having better RMSEs than their tuned versions. The feature importance showed that doping was the most influential input parameter, followed by thickness and defects, affecting the efficiency by 93.6, 5.72, and 0.68%, respectively, for the RF model and 93.65, 5.66, and 0.69%, respectively, for the GBR model. The non-linear relationship between inputs and output required more complex models like random forest and gradient boosting to provide accurate predictions. However, these models came at a higher computational cost compared to linear regression and KNN. Therefore, for similar datasets in the future, the tradeoff between accuracy and computational cost should be weighed when selecting a model.
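To make the idea of a per-input contribution concrete, the sketch below computes a permutation importance: the increase in RMSE when one input column is shuffled, expressed as a share of the total. Note that this is a different, model-agnostic technique swapped in for illustration; the percentages quoted above come from the tree-based models themselves, and their exact computation is not described in the text. The model `toy_model` and the data are hypothetical.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(model, X, y, seed=0):
    """Increase in RMSE when one input column is shuffled, as a share (%) of the total."""
    rng = random.Random(seed)
    baseline = rmse(y, [model(row) for row in X])
    increases = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        increases.append(max(0.0, rmse(y, [model(r) for r in X_perm]) - baseline))
    total = sum(increases)
    if total == 0:
        return [0.0] * len(increases)
    return [100 * inc / total for inc in increases]


# Hypothetical fitted model: strongly driven by input 0, weakly by input 1, not by input 2.
def toy_model(row):
    return 10 * row[0] + 1 * row[1] + 0 * row[2]


rng = random.Random(1)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

shares = permutation_importance(toy_model, X, y)
print([round(s, 2) for s in shares])
```

As with the importances reported for RF and GBR, the shares sum to 100%, so a single dominant input leaves little for the others.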

The third dataset includes the content of the second dataset with additional samples and inputs. The dataset has nine inputs (the thickness, doping, and defects of each of the three critical layers in the PSC) and one output, the PCE. Despite the increase in the number of samples and inputs, the models' performance still drops in the middle of the plot due to outliers and the small number of samples with such values. In this enlarged dataset, linear regression captured some patterns and relationships, which is an improvement over the previous dataset. Nevertheless, after analyzing the performance of the models, linear regression still does not appear to be a competitive alternative to the other models, as indicated by Fig. 5.

Fig. 5 Actual efficiency against predicted efficiency applied on the third dataset for (a) gradient boosting default model, (b) random forest default model, (c) K-nearest neighbors default model, (d) gradient boosting tuned model, (e) random forest tuned model, (f) K-nearest neighbors tuned model, and (g) linear regression ML algorithms

The analysis reveals a minimal increase in RMSEs compared with the previous case (see Table 3), indicating a slight performance drop that can be considered negligible. Overall, the tuned gradient boosting regressor achieved the lowest errors, but this comes at a price. Random forest, on the other hand, performed very close to GBR with fewer resources, as discussed in the following sections. In this case, the variation in training size did not significantly impact the RMSEs, which can be attributed to the fact that the dataset was already sizeable.

Still with the third dataset, after increasing the sample size and adding the same input parameters for the two other layers in the solar cell, the model still identifies the ETL doping as the dominant factor, with a 79% contribution in determining the efficiency of the solar cell (see Fig. 6). We attribute the dominating impact of the ETL doping to the significance of the ETL in the PSC structure. Generally, both the ETL and the HTL have an essential role in the operation of a PSC. However, placing the ETL in front of the perovskite thin film boosts its importance in PSC structures. Besides its standard function of transporting captured electrons to the electrode, the ETL acts as an optical filter for the photons received by the perovskite active layer. Consequently, varying the doping in the ETL creates a trade-off between increasing the mobility of the carriers, mainly electrons, and increasing the level of impurities in the porous medium, which can negatively impact the perovskite absorption. Accordingly, the ETL doping showed a significant influence on the PCE of perovskite solar cells. According to the random-forest importance ranking in Fig. 6, the ETL defects (8.7%) are more influential than the ETL thickness (1.7%), contrary to the findings on the second dataset, and the thickness of the perovskite layer (8.3%) has a greater effect on the efficiency than the thicknesses of the ETL and HTL.

Fig. 6 Importance and contribution of input parameters effect on efficiency according to GBR

Complexity

The complexity of an algorithm refers to the amount of resources required for the algorithm to perform its task. In this study, each algorithm's complexity was assessed based on the number of crucial hyperparameters required for it to achieve optimal performance, together with the tradeoffs that arise from these hyperparameters and from the algorithm's method of operation. Random forest and KNN were generally considered to have lower complexity than gradient boosting. Random forest requires fewer hyperparameters than gradient boosting, which makes it easier to tune and faster to train; its main tradeoffs are limited to choices such as the number of trees to include and the depth of each tree. KNN has only one crucial hyperparameter, the number of nearest neighbors, which makes it simpler to implement and tune. Gradient boosting, in contrast, has higher complexity, as it requires several crucial hyperparameters, such as the learning rate, number of trees, and tree depth. Tuning these hyperparameters can be time-consuming, and the performance can be sensitive to their values. As for linear regression, in complex cases one may apply regularization techniques, which would increase the complexity of the model. However, considering the dominant non-linearity and the variation in our datasets, we implemented a direct, default linear regression model. Additionally, assigning a hyperparameter value that is too high or too low can degrade performance when the underlying relationships are complex, which adds to the overall complexity of the model. In the case of KNN, selecting the number of neighbors can be challenging, since a value that is too low may lead to over-fitting, while a value that is too high may result in under-fitting.
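The two extremes of the KNN tradeoff mentioned above can be seen on a toy example. The sketch below is a minimal standard-library KNN regressor, not the implementation used in the study; the function `knn_predict` and the data are hypothetical.

```python
def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x (Euclidean distance)."""
    by_distance = sorted(
        zip(X_train, y_train),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k


# Toy one-input data (not from the study).
X_train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_train = [1.0, 3.0, 2.0, 5.0, 4.0]

# k = 1: each training point is its own nearest neighbour, so the fit is exact.
fit_k1 = [knn_predict(X_train, y_train, x, k=1) for x in X_train]
# k = n: every query receives the mean of all targets, so all structure is lost.
fit_kn = [knn_predict(X_train, y_train, x, k=len(X_train)) for x in X_train]

print(fit_k1)
print(fit_kn)
```

With k = 1 the model memorizes the training data, noise included, which is the over-fitting extreme; with k equal to the number of training points it returns the same average everywhere, which is the under-fitting extreme. Useful values lie in between and are found by tuning.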

Regarding each algorithm's approach, gradient boosting and random forest are considered complex regression approaches due to the exhaustive decision tree construction and mathematics involved. On the other hand, linear regression and KNN are very straightforward. Ranking the algorithms in our results by complexity, from lowest to highest, the order would be linear regression, KNN, random forest, and gradient boosting.

Computational cost and training time

In this study, the training time and computational cost of each model could be perceived as negligible, given that training took a matter of seconds and did not consume significant resources. They were still taken into consideration, as they offer valuable insights into the performance of the same models when applied on a much larger scale.

Gradient boosting was found to be computationally and training-wise more expensive and time-consuming than random forest and the rest of the candidates; one reason behind that would be that gradient boosting builds trees sequentially, where each tree is fitted on the residuals of the previous tree. This process requires more computational power, especially when the number of trees and depth of each tree is large. On the other hand, random forest builds trees independently in parallel and combines them afterward, which is less computationally expensive than gradient boosting. However, in general cases, the actual computational cost of each algorithm can depend on several factors, such as the dataset size, the complexity of the features, the number of trees, and the hyperparameters used. Therefore, the relative computational cost of random forest and gradient boosting may vary depending on the specific situation.
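The sequential nature of gradient boosting described above can be sketched with single-split trees (stumps) on a one-input toy problem. This is a minimal standard-library illustration of boosting with squared error, not the library implementation used in the study; `fit_stump`, `gradient_boost`, the learning rate, and the data are all hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D input, minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def gradient_boost(x, y, n_trees, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_trees):
        # This step depends on the current predictions, so it cannot run in parallel.
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(v) for p, v in zip(predictions, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)


def mse(model, x, y):
    return sum((model(v) - t) ** 2 for v, t in zip(x, y)) / len(y)


# Toy non-linear data (not from the study).
x = [i / 10 for i in range(20)]
y = [v ** 2 for v in x]

errors = [mse(gradient_boost(x, y, n), x, y) for n in (1, 5, 20)]
print(errors)
```

Each stage needs the residuals produced by all earlier stages, which is why the loop cannot be parallelized across trees; in a random forest, by contrast, each tree is fitted on its own resampled data and the fits are independent of one another.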

Compared to random forest and gradient boosting, KNN and linear regression are generally less computationally expensive. The computational cost of KNN depends on the number of data points and the number of features used. However, it does not involve any model training process, which makes it relatively fast. For linear regression, the cost is typically lower than that of random forest and gradient boosting, especially for smaller datasets with fewer features; the first dataset tested in this study is a valid example. However, it is worth noting that the actual computational cost of KNN and linear regression can depend on the specific implementation and on the size and complexity of the dataset, among other factors. Therefore, it is always recommended to compare the computational efficiency of different algorithms in a specific setting before making a final decision.

To find the optimal hyperparameters for each model, we used Grid Search and Randomized Search, which are techniques for tuning the model's hyperparameters. Grid Search exhaustively searches through all possible combinations of hyperparameters, while Randomized Search randomly samples from a defined search space. This is a pre-stage that can be computationally expensive and time consuming, as it took about 10–30 min for some models to find the optimal hyperparameters out of a few options offered to the model.

Finally, it is worth highlighting that the procedure implemented in this study can serve as a generic methodology for studying the performance of PSCs using ML models. Future work can consider other dataset features that may influence the cell PCE. These parameters include, but are not limited to, interface engineering between the absorber and the HTM. Additionally, future studies should investigate datasets with different behavior, i.e., from various solar cell technologies.

Conclusion

In conclusion, this study evaluated the performance of four machine learning algorithms, linear regression, KNN, random forest, and gradient boosting, on three different datasets with different characteristics. The hyperparameters of the algorithms were generalized, and their default settings were compared to their tuned versions to examine the impact of hyperparameter tuning on performance. The study observed the variation of the algorithms in accuracy and response to different training sizes. The study found that hyperparameter tuning can significantly improve the accuracy and performance of the models. The default settings may not always produce an acceptable result on different types and natures of problems, and thus the algorithms should be modified accordingly. Regarding the complexity of the algorithms, linear regression and KNN are more straightforward, while gradient boosting is the most complex. Random forest has fewer tradeoffs, making it easier to tune and faster to train. KNN has only one crucial hyperparameter, while gradient boosting requires several.

We can also emphasize the correlation between the criteria: higher complexity generally implies a higher computational cost, with higher accuracy expected in return. By comparison, simpler models tend to perform worse on more challenging problems. However, no single model is better overall; the choice depends on the user's preferences and resources. In other words, when selecting a machine learning model, the question should be which criteria matter the most. In terms of computational cost and training time, gradient boosting was found to be the most expensive and time-consuming, followed by random forest, while KNN and linear regression are generally less computationally expensive. However, the actual computational cost of each algorithm can depend on several factors, such as the dataset size and complexity. To mitigate potential biases and limitations in our comparative analysis, we deliberately selected a wide range of machine learning algorithms with varying degrees of complexity and bias-variance tradeoff responses. Their hyperparameters were fine-tuned to optimize performance on each dataset, thereby minimizing the root mean square error. Finally, we recommend random forest as a compromise model for datasets extracted from PSCs. Although gradient boosting recorded the best RMSE, its computational cost, especially with several hyperparameter optimizations, diluted its superiority in terms of accuracy. Conversely, a simple model such as linear regression fails to capture sophisticated multi-feature datasets. This makes random forest a compromise model with acceptable accuracy and computational cost.