Selecting an appropriate machine-learning model for perovskite solar cell datasets

Salah, Mohamed M.; Ismail, Zahraa; Abdellatif, Sameh

doi:10.1007/s40243-023-00239-2

Original Paper
Open access
Published: 25 September 2023

Volume 12, pages 187–198, (2023)

Materials for Renewable and Sustainable Energy

Abstract

Artificial-intelligence-based algorithms are now widely used to solve engineering problems. This study provides a comprehensive analysis of the application of machine learning (ML) models to complex datasets in the field of solar cell power conversion efficiency (PCE). Three datasets of varying size and complexity are generated for perovskite solar cells. Various popular regression models and hyperparameter tuning techniques are studied to guide researchers and practitioners looking to leverage machine learning methods for their data-driven projects. Specifically, four ML models were investigated: random forest (RF), gradient boosting (GBR), K-nearest neighbors (KNN), and linear regression (LR), with model accuracy, complexity, computational cost, and training time as the evaluation criteria. The importance and contribution of each input were examined across the datasets, showing that the doping of the electron transport layer (ETL) is the dominant parameter controlling the cell's overall PCE. ETL doping accounted for 93.6% of the contribution to the cell PCE in the second dataset (ETL inputs only), reducing to 79.0% in the third dataset.

Introduction

Perovskite solar cells (PSCs) are a class of photovoltaic devices that have received much interest recently [1,2,3,4] because of their high power conversion efficiencies, low cost, and simple fabrication process [5,6,7]. However, despite these advantages, the performance and stability of PSCs still need improvement, and their properties must be optimized for practical applications [6]. They are limited by several factors, such as poor stability [6], low reproducibility [8], and a limited understanding of the underlying physics [1, 9]. Recently, machine learning (ML), with its ability to analyze large amounts of data and identify patterns, has emerged as a powerful tool to address these challenges and optimize the performance of PSCs [10,11,12,13,14,15,16,17,18,19].

Machine learning is a subfield of artificial intelligence that creates algorithms and models that allow computers to learn from data and improve their predictions or choices over time [20]. Machine learning algorithms can be broadly categorized into two types. The first is supervised learning [21], in which the algorithm learns from labeled training data to make predictions on new, unseen data. The second is unsupervised learning [22], in which the algorithm learns from unlabeled data to identify patterns or relationships in the data. With the increasing availability of large amounts of data and powerful computing resources, machine learning is becoming an increasingly important tool in healthcare, finance, and e-commerce [23].

Machine learning has been increasingly utilized in optimizing and designing perovskite solar cells [10,11,12,13,14,15,16,17,18,19]. Machine learning algorithms like random forest, decision trees, and neural networks can analyze large amounts of data and identify critical factors that affect the performance of PSCs [11, 24]. Such ML algorithms have been employed to predict photovoltaic performance and improve device efficiency [10]. By using these algorithms, researchers can perform experiments more efficiently and quickly and gain a deeper understanding of the relationships between the various parameters that influence the performance of PSCs. For example, some researchers have used machine learning to analyze the effect of various processing conditions and material properties on the performance of PSCs [25]. This has allowed for identifying key factors that influence the performance of PSCs, such as the crystallinity of the perovskite material, thickness of the perovskite layer, the morphology of the electrodes, and the interface quality between the perovskite layer and the electrodes [10,11,12,13,14,15,16,17,18,19].

Studies reported in the literature have shown that machine learning can accurately predict the PCE of perovskite solar cells based on the composition and structural parameters [13, 17, 26, 27]. This can reduce the number of experiments required and accelerate the optimization process. In addition, machine learning has been used to identify the dominant factors affecting the performance of perovskite solar cells, allowing for targeted improvement efforts [13, 27]. Moreover, deep learning approaches have been applied to analyze perovskite solar cell characterization data, such as scanning electron microscopy and X-ray diffraction, to better understand the relationship between the microstructure and photovoltaic performance [28, 29].

The main problem addressed here is selecting a machine learning model or algorithm that is compatible with, and valid for, a specific problem. The specific problem considered is the analysis, optimization, and prediction of the power conversion efficiency of perovskite solar cells. The study compares machine learning algorithms on several criteria to establish which type of algorithm excels in one area compared to another. In other words, it is a detailed, comprehensive study of machine learning algorithms regarding accuracy, complexity, and resource consumption.

Dataset generation

Following the objective of the current study, which is to investigate the compatibility of various ML models with perovskite-based datasets, constructing datasets that reflect the performance of perovskite solar cells is a concrete first step. Herein, we generate three datasets for a cesium lead halide perovskite solar cell, combining simulation data and experimental results. The reason for selecting CsPbX₃ (X = I⁻, Br⁻, or Cl⁻) perovskite solar cells is our prior experience in simulating, fabricating, and characterizing this type of PSC [4, 30,31,32,33,34,35,36,37,38,39]. Based on the schematic in Fig. 1, a typical CsPbX₃ solar cell is constructed of a perovskite thin film sandwiched between a hole transport layer (NiO) and an electron transport layer (mesoporous TiO₂). Additionally, top and counter electrodes are used for collecting carriers, where a symmetric cell is assumed in the current study with fluorine-doped tin oxide (FTO) coated glass as electrodes. All the material parameters associated with the cell are listed in Table 1.

Table 1 Material parameters for the CsPbX₃ (X = I⁻, Br⁻, or Cl⁻) PSC

As mentioned earlier, one of the critical aspects of dataset construction is validating the data against real experimental measurements. Accordingly, we chose NiO as the hole transport layer, knowing that other materials, such as spiro, may show better performance; the choice was dictated by limited fabrication facilities. The same methodology provided in this study can be applied to other PSC configurations, and we consider such investigations a future extension of the current study.

The first dataset in this study correlates the material properties of the TiO₂, mainly porosity and layer thickness, with the overall solar cell power conversion efficiency (PCE). This mirrors our earlier work on dye-sensitized solar cells [11, 24]. Herein, the dataset size is limited to 1470 points, with experimentally seeded data from our work in [34, 35] along with simulation results obtained using SCAPS optoelectronic modeling [32, 39]. Secondly, the electron transport layer thickness, doping, and defects are correlated with the cell PCE in a fully numerical, SCAPS-based dataset. This dataset contains around 5600 points, with three inputs and one output. Finally, the thickness, doping, and defect levels of the hole transport layer, electron transport layer, and perovskite thin film are studied in terms of the overall PCE of the PSC. The third dataset is generated in a nine-input, one-output format with 50,400 points. The three generated datasets vary in both size and complexity, i.e., the number of inputs. We enforced this variation intentionally to provide a practical evaluation of the proposed ML models under various datasets while keeping the nature of the dataset fixed.

Machine learning algorithms

As mentioned earlier, this study aims to provide a comprehensive analysis of the application of machine learning models to complex datasets, specifically in the field of perovskite solar cell PCE. We aim to provide insights into effectively comparing, selecting, implementing, and applying different machine learning algorithms to given datasets. Using a range of popular regression models and hyperparameter tuning techniques, we seek to identify best practices for achieving optimal model performance and reducing bias in machine learning analyses [40,41,42]. Ultimately, our goal is to guide researchers and practitioners looking to leverage machine learning methods for their data-driven projects.

In our study, we utilized three distinct datasets to model the relationship between several input parameters and the PCE of solar cells, as discussed in the previous section. To ensure a comprehensive evaluation, we employed a variety of machine learning regression models, namely random forest (RF) [11, 24, 43, 44], gradient boosting (GBR) [45], K-nearest neighbors (KNN) [46], and linear regression [47]. We deliberately selected various machine learning algorithms with varying complexity and bias-variance tradeoff responses to mitigate potential biases and limitations in our comparative analysis. This approach was undertaken to achieve a diverse and comprehensive set of models, thus ensuring a more balanced and unbiased evaluation. The bias-variance tradeoff represents a well-known challenge in machine learning, which refers to the balance between the amount of bias in a model (i.e., prior assumptions about the form of the final model output, regardless of the nature of the problem) and the degree of variance (i.e., the sensitivity of a model's performance to the training data and amount) [48]. We fine-tuned the hyperparameters of the selected ML models to optimize their performance on each respective dataset, thereby minimizing the root mean square error.

Concerning the hyperparameters, we adopted two popular techniques for hyperparameter tuning: grid search and randomized search [49]. Grid search involves defining a grid of hyperparameter values and exhaustively searching all possible combinations of these values to determine the optimal set of hyperparameters. Alternatively, randomized search randomly samples hyperparameters from a defined distribution and evaluates their performance on the data. While hyperparameter tuning can significantly improve the performance of machine learning models, it can also be computationally expensive and time-consuming. However, given the importance of achieving optimal model performance, we deemed it necessary to devote significant resources to this pre-stage of our analysis. The overall procedure is described in the flowchart presented in Fig. 2. Concerning the computational cost required to run the hyperparameter optimizers and the ML models, all the models were run on our lab workstation. The computational unit has 2 × Xeon Gold 6240 2.6 GHz processors, with 36 cores and 24 MB cache, each with 32 GB RAM, supported by 2 × 480 GB SSD drives.
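The difference between the two tuning strategies can be illustrated with a minimal, dependency-free sketch. The search space, the hyperparameter names, and the `validation_rmse` function are all hypothetical: the function is a made-up stand-in for "train a model with these settings and score it on the validation set", not the procedure used in this study. In practice a library routine would normally replace both helpers.

```python
import itertools
import random

# Hypothetical search space; the names mimic common tree-ensemble settings.
SPACE = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [2, 4, 8, 16],
    "learning_rate": [0.01, 0.05, 0.1, 0.5],
}

def validation_rmse(params):
    """Made-up stand-in for training a model and scoring it on validation data.

    By construction its minimum is at n_estimators=200, max_depth=8,
    learning_rate=0.1.
    """
    return (
        abs(params["n_estimators"] - 200) / 400
        + abs(params["max_depth"] - 8) / 16
        + abs(params["learning_rate"] - 0.1)
    )

def grid_search(space, score):
    """Evaluate every combination in the grid; return (best_params, n_evaluations)."""
    names = list(space)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*(space[n] for n in names))
    ]
    return min(candidates, key=score), len(candidates)

def randomized_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations; return (best_params, n_evaluations)."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return min(candidates, key=score), len(candidates)

grid_best, grid_evals = grid_search(SPACE, validation_rmse)
rand_best, rand_evals = randomized_search(SPACE, validation_rmse, n_iter=10)
print(grid_evals, grid_best)   # all 4 * 4 * 4 combinations are scored
print(rand_evals, rand_best)   # only 10 sampled combinations are scored
```

The grid search is guaranteed to find the best point on the grid but its cost grows multiplicatively with every added hyperparameter, whereas the randomized search has a fixed budget and may miss the optimum.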

Results and discussion

Based on the methodology described in "Dataset generation" and "Machine learning algorithms", the generated datasets are fed into four different ML models: random forest, gradient boosting, K-nearest neighbors, and linear regression. Additionally, hyperparameter tuning techniques are used to optimize the learning processes. We obtained the results for each dataset separately and, because the tuned values were similar, generalized the hyperparameters of the algorithms in Table 2 for all three datasets. The linear regression model was implemented directly because of its simplicity and because it does not require exhaustive tuning. However, it should be noted that this generalization was specific to the datasets considered, and we do not recommend it as a standard practice for all machine learning problems. Although it was found to work well in our case without significantly affecting performance, we recommend applying hyperparameter tuning to each problem separately to ensure optimal results.

Table 2 Hyperparameters extracted for optimized machine learning algorithms

The study observed the variation of the algorithms in the accuracy and response to different training sizes. The primary ratio of 70/30 of the dataset for the training and validation sets was used to evaluate the models. The default configuration and hyperparameters of the algorithms were considered and compared with a tuned version to examine the impact of hyperparameter tuning on performance.

Generally, the default settings for the selected ML algorithms performed well. However, after embedding the optimized hyperparameters, the tuned versions showed a remarkable improvement in accuracy and performance, as demonstrated in the next section, illustrating the importance of hyperparameter tuning. The default settings may not always produce acceptable results for different types of problems, so the algorithms should be adjusted accordingly. For evaluation, four main aspects were considered: accuracy, complexity, computational cost, and training time. The following sections illustrate these aspects for each ML model under the three proposed datasets.

Accuracy, training size, feature importance, and contribution

We ran our models on each dataset separately, recorded the errors using the root mean square error (RMSE) metric, and plotted the actual efficiencies against the predictions. There is no universal standard for a viable or acceptable accuracy; it is largely a matter of user preference. The metrics available to assess performance mainly revolve around the error, with many variations such as mean absolute error, mean squared error, root mean squared error, and R-squared. We also ran the models with different training sizes to compare their performance and how well they adapt to the amount of training data.
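For reference, the four error metrics named above can be computed as in the following minimal sketch. The function name `regression_metrics` and the sample efficiency values are illustrative assumptions, not data from this study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up PCE values in percent, for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is expressed in the same unit as the target (here, percentage points of PCE), which is one reason it is convenient for comparing models on the same dataset.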

The first dataset, with thickness and porosity as inputs, yielded near-perfect performance. The dataset consisted of 1470 points with thickness values ranging from 0.1 to 7.0 μm and porosity from 0 to 20%. However, this dataset was considered limited and trivial due to the small changes in efficiency and the nearly linear relation between input and output parameters. Additionally, with only 1470 samples, the 30% share set aside for testing amounts to only about 440 points. The RMSEs for a 70:30 training and validation split and a 30:70 split are listed in Table 3, cf. Fig. 3.
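The figures quoted for the first dataset can be checked with a short sketch. The sampling steps (0.1 μm in thickness, 1% in porosity) are an assumption: they are not stated in the text, but they are one regular grid that reproduces the reported 1470 points. The helper `split_sizes` is hypothetical.

```python
import itertools

# Assumed sampling steps (not stated in the text): 0.1 um and 1 %.
thickness_um = [round(0.1 * i, 1) for i in range(1, 71)]   # 0.1 ... 7.0
porosity_pct = list(range(0, 21))                          # 0 ... 20

grid = list(itertools.product(thickness_um, porosity_pct))
n_total = len(grid)

def split_sizes(n, train_percent):
    """Return (n_train, n_validation) for an integer percentage split."""
    n_train = n * train_percent // 100
    return n_train, n - n_train

print(n_total)                    # 1470
print(split_sizes(n_total, 70))   # (1029, 441)
print(split_sizes(n_total, 30))   # (441, 1029)
```

Under the 30:70 split, the models therefore see only 441 training points, which is consistent with the larger RMSEs reported for that configuration.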

Table 3 Root mean square error for the three datasets with and without tuning in the machine learning algorithms

Although linear regression had the highest RMSE, it was robust against changes in the amount of training data, whereas the RMSEs of the other models nearly doubled. Given the linearity and simplicity of the dataset, linear regression was the preferred model because it trains quickly and reaches high accuracy with low resource consumption, as highlighted in later sections. It also did not require more complex approaches such as the decision trees used in the other models. While its RMSE was higher than that of the other models and its plot was less accurate, it was deemed highly efficient for datasets of a similar nature.

Random forest and gradient boosting, on the other hand, offered useful extra features, such as feature importance, which illustrates each input parameter's effect on the overall output. It was found that the thickness and porosity inputs were almost equally influential in determining the output efficiency. Despite this advantage, their more complex decision trees resulted in higher resource consumption, which was unnecessary for this dataset type. Thus, for future studies with similar datasets, it would be essential to weigh the benefits of these extra features against their higher computational costs.

The second dataset contained 5600 points with three inputs (thickness, doping, and defects of the ETL) and one output (PCE). Compared to the first dataset, this dataset provided more variety and more challenges for the machine learning models. The non-linear relationship between the inputs and output made the linear regression model inadequate, as it is biased towards linear relationships regardless of the dataset. The model collapsed all its predictions to around 9–10%, which indicates a failure to capture the relationship between the parameters and the output. Thus, it was not considered for this case. The input–output plots of the models showed a significant decrease in performance for efficiencies ranging from 7.5 to 10%, which was attributed to too few samples in that range and to outliers in the dataset, leading to inadequate model training for such samples. Ensemble methods proved better at handling the missing data and performed better under these challenging conditions. The predicted efficiencies of the models were close to ideal, as shown in Fig. 4; RMSE values are listed in Table 3.
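Why a linear model can end up predicting nearly the same value everywhere is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions: a single input and a made-up response that peaks in the middle of the input range, not the SCAPS data of this study. The helper `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up non-linear response: highest in the middle of the input range
xs = [float(x) for x in range(-5, 6)]
ys = [10.0 - 0.1 * x * x for x in xs]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]

spread_actual = max(ys) - min(ys)
spread_predicted = max(predictions) - min(predictions)
print(slope, intercept, spread_actual, spread_predicted)
```

Because the rising and falling halves of the curve cancel, the fitted slope is essentially zero and every prediction sits at the mean of the targets, even though the actual values span a range of 2.5.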

The significant impact of inadequate training on the tuned versions suggests that training the model on sufficient samples is crucial; otherwise, the effectiveness of hyperparameter tuning is greatly diminished. Hyperparameter tuning is only effective when the model is trained on enough samples and the dataset is representative of the population it is supposed to model, because tuning optimizes the hyperparameters to fit the available data as closely as possible. If the dataset is not representative or lacks sufficient samples, the optimized hyperparameters may not be helpful for that population.

The random forest and gradient boosting models were found to perform better than linear regression and KNN, with RF and GBR having better RMSEs than their tuned versions. The feature importance showed that doping was the most influential input parameter, followed by thickness and defects, affecting the efficiency by 93.6, 5.72, and 0.68%, respectively, for the RF model and 93.65, 5.66, and 0.69%, respectively, for the GBR model. The non-linear relationship between inputs and output required more complex models like random forest and gradient boosting to provide accurate predictions. However, these models came at a higher computational cost compared to linear regression and KNN. Therefore, for similar datasets in the future, the tradeoff between accuracy and computational cost should be weighed when selecting a model.
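How percentage contributions of this kind can be derived is sketched below using permutation importance, a model-agnostic technique. This is a swapped-in method chosen because it needs no library: the text does not state how the RF and GBR importances were computed, and tree ensembles commonly report impurity-based importances instead. The predictor `toy_model`, its coefficients, and the random inputs are all hypothetical.

```python
import random

FEATURES = ["doping", "thickness", "defects"]

def toy_model(row):
    """Hypothetical fitted model in which doping dominates by construction."""
    doping, thickness, defects = row
    return 10.0 * doping + 2.0 * thickness + 0.5 * defects

def mse(model, rows, targets):
    """Mean squared error of the model over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, seed=0):
    """Share (%) of the total error increase caused by shuffling each column."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)   # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(max(mse(model, shuffled, targets) - baseline, 0.0))
    total = sum(increases)
    return [100.0 * inc / total for inc in increases]

rng = random.Random(42)
rows = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
targets = [toy_model(r) for r in rows]   # noise-free toy targets

importance = dict(zip(FEATURES, permutation_importance(toy_model, rows, targets)))
print(importance)
```

The shares are normalized to sum to 100%, which makes them comparable in form to the contributions quoted above, although the two importance definitions generally give different numbers.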

The third dataset includes the content of the second dataset with additional samples and inputs. The dataset has nine inputs (thickness, doping, and defects for each of the three main critical layers in the PSC) and one output, the PCE. Despite the increase in the number of samples and inputs, the models' performance still drops in the middle of the plot due to outliers and the small number of samples with such values. In this enlarged dataset, linear regression captured some patterns and relationships, which is an improvement over the previous dataset. Nevertheless, after analyzing the performance of the models, linear regression still does not appear to be a competitive alternative to the other models, as indicated by Fig. 5.

The analysis reveals a minimal increase in RMSEs compared with the previous case (see Table 3), indicating a slight performance drop that can be considered negligible. Overall, the tuned gradient boosting regressor gave the lowest errors in all cases, but this comes at a price. Random forest, on the other hand, performed very close to GBR with fewer resources, as discussed in the following sections. In this case, the variation in training size did not significantly impact the RMSEs, which can be attributed to the fact that the dataset was already sizeable.

Still with the third dataset, after increasing the sample size and adding the same input parameters for two other layers in the solar cell, the model still identifies the ETL doping as the dominant factor, with a 79% contribution in determining the efficiency of the solar cell, see Fig. 6. We can attribute the dominating impact of the ETL doping to the significance of the ETL in the PSC structure. Generally, both the ETL and the HTL have an essential role in the operation of a PSC. However, placing the ETL in front of the perovskite thin film boosts its importance in PSC structures. Besides its standard function of transporting captured electrons to the electrode, the ETL acts as an optical filter for the photons received by the perovskite active layer. Consequently, varying the doping in the ETL creates a trade-off between increasing the mobility of the carriers, mainly electrons, and increasing the level of impurities in the porous medium, which can negatively impact the perovskite absorption. Accordingly, the ETL doping showed a significant influence on the PCE of perovskite solar cells. The random-forest importance ranking in Fig. 6 places the ETL defects (8.7%) above the ETL thickness (1.7%), which is contrary to the findings on the previous dataset, and shows the thickness of the perovskite layer (8.3%) to be more influential on the efficiency than the thicknesses of the ETL and HTL.

Complexity

The complexity of an algorithm refers to the amount of resources required for the algorithm to perform its task. In this study, each algorithm's complexity was assessed based on the number of crucial hyperparameters required for it to achieve optimal performance, together with the tradeoffs that result from these hyperparameters and from the way each algorithm operates. Random forest and KNN were generally considered to have lower complexity than gradient boosting. Random forest requires fewer hyperparameters than gradient boosting, which makes it easier to tune and faster to train. Additionally, random forest has fewer tradeoffs, such as the number of trees to include and the depth of each tree. KNN has only one crucial hyperparameter, the number of nearest neighbors, which makes it simpler to implement and tune. In contrast, gradient boosting has higher complexity, as it requires several crucial hyperparameters, such as the learning rate, number of trees, and tree depth. Tuning these hyperparameters can be time-consuming, and the performance can be sensitive to their values. As for linear regression, in complex cases one may apply regularization techniques, which would increase the complexity of the model. However, considering the dominant non-linearity and variation in our datasets, we implemented a direct, default linear regression model. Additionally, the tradeoffs that result from setting a hyperparameter too high or too low can affect performance on complex relationships and thereby influence the overall complexity of the model. In the case of KNN, selecting the number of neighbors can be challenging, since choosing a value that is too low may lead to over-fitting, while selecting a value that is too high may result in under-fitting.
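The effect of the number of neighbors can be seen in a minimal one-dimensional KNN regressor. The data below are synthetic (a smooth curve with a deterministic zig-zag standing in for noise) and the helper names are hypothetical. With k = 1 each training point is its own nearest neighbor, so the training error is zero and the zig-zag is memorized; with k equal to the dataset size every prediction is the global mean.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to query (1-D input)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def training_rmse(xs, ys, k):
    """RMSE of the KNN predictions evaluated on the training points themselves."""
    preds = [knn_predict(xs, ys, x, k) for x in xs]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))

# Synthetic data: smooth trend plus an alternating +/-0.2 offset as "noise"
xs = [i / 10 for i in range(30)]
ys = [math.sin(x) + (0.2 if i % 2 else -0.2) for i, x in enumerate(xs)]

for k in (1, 5, len(xs)):
    print(k, round(training_rmse(xs, ys, k), 3))
```

A zero training error at k = 1 is a symptom of over-fitting rather than a sign of a good model, which is why the neighbor count is normally chosen on held-out validation data.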

Regarding each algorithm's approach, gradient boosting and random forest have been considered complex regression approaches due to the exhaustive decision tree construction and mathematics involved. On the other hand, linear regression and KNN are very straightforward. Ranking the algorithms in our results based on complexity, the least to highest order would be linear regression, KNN, random forest, and gradient boosting.

Computational cost and training time

In this study, the training time and computational cost of each model could be perceived as negligible, given that training took a matter of seconds and did not consume significant resources. They were nevertheless taken into consideration, as they offer valuable insight into how the same models would perform when applied on a much larger scale.

Gradient boosting was found to be computationally and training-wise more expensive and time-consuming than random forest and the rest of the candidates; one reason behind that would be that gradient boosting builds trees sequentially, where each tree is fitted on the residuals of the previous tree. This process requires more computational power, especially when the number of trees and depth of each tree is large. On the other hand, random forest builds trees independently in parallel and combines them afterward, which is less computationally expensive than gradient boosting. However, in general cases, the actual computational cost of each algorithm can depend on several factors, such as the dataset size, the complexity of the features, the number of trees, and the hyperparameters used. Therefore, the relative computational cost of random forest and gradient boosting may vary depending on the specific situation.
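The sequential dependency described above can be made concrete with a minimal sketch of gradient boosting for squared error, using depth-1 trees (stumps) on a single input. This is a simplified illustration under stated assumptions, not the implementation used in this study: the data are made up, and `fit_stump` and `boost` are hypothetical helpers.

```python
def fit_stump(xs, residuals):
    """Best single-split regression tree (depth 1) on one input, by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    """Fit trees one after another, each on the residuals left by its predecessors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)   # cannot start before earlier trees finish
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    model = lambda x: base + learning_rate * sum(t(x) for t in trees)
    return model, history

xs = [i / 4 for i in range(40)]
ys = [(x - 5.0) ** 2 for x in xs]   # made-up non-linear target
model, mse_history = boost(xs, ys, n_trees=50, learning_rate=0.3)
print(mse_history[0], mse_history[-1])
```

Each call to `fit_stump` needs the residuals produced by all earlier trees, so the loop cannot be parallelized across trees. A random forest, by contrast, fits every tree on its own resampled copy of the data, so its trees can be built independently.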

Compared to random forest and gradient boosting, KNN and linear regression are generally less computationally expensive. The computational cost of KNN depends on the number of data points and the number of features used. However, it does not involve any model training process, which makes it relatively fast. For linear regression, the cost is typically lower than that of random forest and gradient boosting, especially for smaller datasets with fewer features; the first dataset tested in this study is a valid example. However, it is worth noting that the actual computational cost of KNN and linear regression can depend on the specific implementation and on the size and complexity of the dataset, among other factors. Therefore, it is always recommended to compare the computational efficiency of different algorithms in the specific setting before making a final decision.

To find the optimal hyperparameters for each model, we used Grid Search and Randomized Search, which are techniques for tuning the model's hyperparameters. Grid Search exhaustively searches through all possible combinations of hyperparameters, while Randomized Search randomly samples from a defined search space. This is a pre-stage that can be computationally expensive and time consuming, as it took about 10–30 min for some models to find the optimal hyperparameters out of a few options offered to the model.

Finally, it is worth highlighting that the procedure implemented in this study can serve as a generic methodology for studying the performance of PSCs using ML models. Future work can use datasets with other features that influence the cell PCE, including, but not limited to, interface engineering between the absorber and the HTM. Additionally, future studies should investigate datasets with different behavior, i.e., other solar cell technologies.

Conclusion

In conclusion, this study evaluated the performance of four machine learning algorithms, linear regression, KNN, random forest, and gradient boosting, on three different datasets with different characteristics. The hyperparameters of the algorithms were generalized, and their default settings were compared to their tuned versions to examine the impact of hyperparameter tuning on performance. The study observed the variation of the algorithms in accuracy and response to different training sizes. The study found that hyperparameter tuning can significantly improve the accuracy and performance of the models. The default settings may not always produce an acceptable result on different types and natures of problems, and thus the algorithms should be modified accordingly. Regarding the complexity of the algorithms, linear regression and KNN are more straightforward, while gradient boosting is the most complex. Random forest has fewer tradeoffs, making it easier to tune and faster to train. KNN has only one crucial hyperparameter, while gradient boosting requires several.

We can also emphasize the correlation between the criteria: higher complexity generally brings higher computational cost and, in return, higher accuracy. By comparison, simpler models tend to perform worse on more challenging problems. However, no single model is better overall; the choice is subjective and depends on the user's preferences and resources. In other words, when selecting a machine learning model, the question should be which criteria matter the most. In terms of computational cost and training time, gradient boosting was found to be the most expensive and time-consuming, followed by random forest, while KNN and linear regression are generally less computationally expensive. However, the actual computational cost of each algorithm can depend on several factors, such as the dataset size and complexity. To mitigate potential biases and limitations in our comparative analysis, we deliberately selected a wide range of machine learning algorithms with varying degrees of complexity and bias-variance tradeoff responses. Hyperparameters were fine-tuned to optimize performance on each dataset, thereby minimizing the root mean square error. Finally, we recommend random forest as a compromise model for datasets extracted from PSCs. Although gradient boosting recorded the best RMSE, its computational cost, especially with several rounds of hyperparameter optimization, diluted its advantage in accuracy. Conversely, simple models such as linear regression fail to capture sophisticated multi-feature datasets. This makes random forest a compromise model with acceptable accuracy and computational cost.

Availability of data and materials

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request. The codes supporting this study's findings are available from the corresponding author upon reasonable request.

Code availability

Not applicable for that section.

Funding

The authors acknowledge the support and contribution of the STDF; this work was supported under project ID 33502.

Author information

Authors and Affiliations

Electrical Engineering Department, Faculty of Engineering and FabLab in the Center for Emerging Learning Technology (CELT), The British University in Egypt (BUE), El-Sherouk, 11837, Cairo, Egypt
Mohamed M. Salah, Zahraa Ismail & Sameh Abdellatif

Contributions

All authors contributed equally to this manuscript. Conceptualization, SA; methodology, MM and SA; software, MM and SA; validation, MM, ZI and SA; formal analysis, ZI and SA; investigation, ZI and SA; resources, ZI and SA; data curation, MM, ZI and SA; writing (original draft preparation), SA; writing (review and editing), SA; visualization, SA; supervision, SA; project administration, SA; funding acquisition, ZI and SA. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Sameh Abdellatif.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethics approval

Not applicable for that section.

Consent to participate

Authors confirm their participation in this paper.

Consent for publication

Authors accept the publication rules applied by the Journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Salah, M.M., Ismail, Z. & Abdellatif, S. Selecting an appropriate machine-learning model for perovskite solar cell datasets. Mater Renew Sustain Energy 12, 187–198 (2023). https://doi.org/10.1007/s40243-023-00239-2

Received: 16 May 2023
Accepted: 21 August 2023
Published: 25 September 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s40243-023-00239-2
