Application of Materials Informatics Tools to the Analysis of Combinatorial Libraries of All Metal-Oxides Photovoltaic Cells

Materials informatics applies informatics tools, frequently in the form of machine learning algorithms, to gain insight into the structure-property relationships of materials and to design new materials with desired properties. Here we describe the application of such algorithms to the analysis of solar cell (i.e., photovoltaic; PV) libraries made entirely from metal oxides (MOs). MO-based solar cells hold the potential to provide clean and affordable energy if their power conversion efficiencies are improved. We demonstrate the power of dimensionality reduction methods to visualize the MO-based solar cell space and the power of several algorithms to develop predictive models for key PV properties. We stress the importance of conducting such studies in collaboration with experimentalists.


Introduction
Materials informatics is an emerging field of research primarily engaged with the application of informatics principles to materials science for the purpose of discovering and developing new materials [1,2]. Several factors contribute to the continuous development of materials informatics, including the rapid growth of available databases that contain experimental and computational information on the structures and properties of materials, and the many similarities between materials informatics and, e.g., chemo- and bio-informatics. Thus, this younger field can draw on the experience and the large arsenal of tools developed in the more established fields [2].
Central to materials informatics is the application of data mining techniques and in particular machine learning approaches, often referred to as Quantitative Structure Activity Relationship (QSAR) modeling, to derive predictive models for a variety of materials-related "activities". Such models could accelerate the development of new materials with favorable properties and provide insight into the factors governing these properties. In this work we present the application of several machine learning tools to the analysis of solar cells.
Growth in energy demands, coupled with the movement towards clean energy, is likely to make solar cells an important part of future energy resources. At present, most solar cells are based on silicon, yet new alternatives are constantly emerging, with examples including organic photovoltaic (OPV) cells [3], dye sensitized solar cells (DSSC) [4] and perovskites [5]. Another appealing alternative is presented by solar cells made entirely of metal oxides (MOs). Such cells are reasonably cheap to manufacture due to the abundance of their constituent elements, are environmentally friendly and are stable over time. Yet their performance, in terms of their ability to convert sunlight into electricity, needs to be improved [6]. Such improvements require the development of new MOs, which could benefit from combining combinatorial materials science for producing solar cell libraries with machine learning approaches to analyze the resulting libraries and direct synthesis efforts.
MO-based solar cell libraries are produced through a combinatorial materials science approach involving the non-uniform deposition of two or more layers of different metal oxides on top of a glass support, followed by the introduction of the appropriate metal contacts. A typical library synthesized in this form consists of multiple cells (169 in our case) each of which is typically characterized by multiple composition and photovoltaic (PV) parameters including, for example, the thickness of the different layers that make up the cell and the thickness ratios between them, the band gap of the absorber layer, the short circuit photocurrent density (Jsc), the open circuit photovoltage (Voc) and the internal quantum efficiency (IQE) [6,7].

Results and Discussion
In this work we present the application of several machine learning tools to the analysis of single and multiple MO-based solar cell libraries. Some of these tools were recently incorporated into a newly developed MATLAB-based decision support system (DSS) called "PV Analyzer" [8]. PV Analyzer integrates tools common to chemo- and materials-informatics, such as simple bi-parametric correlations, heat-maps and principal component analysis (PCA), with a workflow for the development of predictive quantitative structure activity relationship (QSAR) models based on the RANSAC algorithm, which was originally developed for computer vision (see below). The PV Analyzer workflow is composed of three blocks: (1) Input data; (2) Data pre-processing and visualization; (3) RANSAC modeling (Fig. 1). Importantly, PV Analyzer is a modular system which allows for the facile incorporation of additional tools.
The PV libraries studied in this work were gathered over time from multiple literature reports [6,7,9,10 and references cited therein]. Importantly, all were manufactured by the same lab using largely similar methods. First, looking at individual libraries, we demonstrated the ability of standard PCA to identify outliers in the PV space and, together with self-organizing maps (SOMs), to unveil local and global effects of a molybdenum oxide layer on PV properties [9]. Next, five libraries were compiled into a unified dataset for a total of 1165 solar cells, where each solar cell was characterized by seven experimentally measured PV properties: short circuit photocurrent density (Jsc), open circuit photovoltage (Voc), internal quantum efficiency (IQE), maximum photovoltaic power (Pmax), fill factor (FF), series resistance (Rs) and shunt resistance (Rsh). This database was subjected to several linear and nonlinear dimensionality reduction methods (PCA, kernel-PCA, Isomap and diffusion map) to allow for a facile visualization of the resulting "PV space". Furthermore, we studied the relative performance of the different methods in terms of their ability to separate, in the reduced space, the five original libraries that make up the space and to maintain the original neighborhood of each sample [10]. We found that all methods were able to segregate the different libraries into unique regions of the reduced space (Fig. 2), but that PCA performed best in terms of the ability to correctly maintain the local environment of samples, whereas Isomap did the best job in assigning class membership based on the identity of nearest neighbors, i.e., it was the best classifier. In addition, diffusion map identified the smallest number of outliers. We also found that many of the outliers identified by all methods could be rationalized.
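The two evaluation criteria mentioned above (maintaining the local environment of samples, and assigning class membership from nearest neighbors) can be illustrated with a minimal Python sketch. This is not the procedure of ref. [10]: the function names, the choice of Euclidean distance, the value of k and the toy data are all assumptions made for illustration, and the "reduced space" here is obtained simply by dropping one coordinate rather than by PCA, Isomap or a diffusion map.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i."""
    ranked = sorted((math.dist(points[i], p), j)
                    for j, p in enumerate(points) if j != i)
    return {j for _, j in ranked[:k]}

def neighbourhood_preservation(high, low, k):
    """Mean fraction of each sample's k nearest neighbours that survive the reduction."""
    total = 0.0
    for i in range(len(high)):
        kept = knn_indices(high, i, k) & knn_indices(low, i, k)
        total += len(kept) / k
    return total / len(high)

def nn_class_accuracy(low, labels, k):
    """Leave-one-out accuracy of majority-vote library assignment in the reduced space."""
    correct = 0
    for i in range(len(low)):
        votes = [labels[j] for j in knn_indices(low, i, k)]
        predicted = max(set(votes), key=votes.count)  # ties are broken arbitrarily
        correct += predicted == labels[i]
    return correct / len(low)

# Synthetic toy data (NOT measured values): two hypothetical "libraries" A and B,
# each sample described by three made-up, already scaled PV properties.
high = [(0.0, 0.0, 0.1), (0.1, 0.0, 0.0), (0.0, 0.2, 0.1),
        (5.0, 5.0, 0.0), (5.1, 5.0, 0.1), (5.0, 5.3, 0.0)]
labels = ["A", "A", "A", "B", "B", "B"]

# Stand-in for a reduced space: keep only the first two coordinates.
low = [p[:2] for p in high]

print("neighbourhood preservation:", neighbourhood_preservation(high, low, k=2))
print("library assignment accuracy:", nn_class_accuracy(low, labels, k=2))
```

In a real comparison, `low` would be replaced by the coordinates produced by each dimensionality reduction method, and the two scores would be compared across methods.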
Next we used several machine learning algorithms, such as k nearest neighbors (kNN), genetic programming (GP) [11] and RANSAC, to develop predictive QSAR models for key PV properties including Jsc, Voc and IQE for several libraries [7,12]. In particular, the Random Sample Consensus (RANSAC) algorithm is a predictive modeling tool widely used in the image processing field to clean noise from datasets. In this work we demonstrated that RANSAC could be used as a "one stop shop" algorithm for the development and validation of QSAR models, performing multiple tasks including outlier removal, descriptor selection, model development and prediction of test set samples while using the concept of applicability domain [13]. Contrary to most conventional QSAR models, the input descriptors used in these studies were experimentally measured rather than theoretically computed, and included the thicknesses of the different MO layers used to construct the cells and the thickness ratios between them. Measured rather than calculated descriptors are needed because, when the (well defined) different MOs are combined (see above for a short description of the manufacturing process), the exact composition/structure of the resulting solar cell is not well defined and consequently is not amenable to the calculation of structure-based descriptors. Our results (Table 1) demonstrate that QSAR models with good prediction statistics could be developed by any of these methods for most PV properties (except for several poor models developed for Voc) and that these models highlight important factors affecting these properties, in accord with experimental findings. Specifically, the thicknesses of the different MO layers play a pivotal role in determining the PV properties. The resulting models are therefore suitable for designing better solar cells.
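To make the outlier-removal and model-development roles of RANSAC concrete, the following is a minimal, generic sketch in Python for a single descriptor and a linear model. It is not the PV Analyzer implementation (which is MATLAB-based and also performs descriptor selection): the function names, the residual threshold, the number of iterations, the crude range check standing in for an applicability domain, and the synthetic data are all assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def ransac_line(xs, ys, threshold, n_iter=200, seed=0):
    """Generic RANSAC: fit on random minimal samples, keep the largest consensus set."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        i, j = rng.sample(range(len(xs)), 2)   # minimal sample for a line
        if xs[i] == xs[j]:
            continue                           # degenerate sample, skip it
        a, b = fit_line([xs[i], xs[j]], [ys[i], ys[j]])
        inliers = [k for k in range(len(xs))
                   if abs(ys[k] - (a * xs[k] + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus set found")
    # Final model: least-squares refit on the consensus set only.
    a, b = fit_line([xs[k] for k in best_inliers], [ys[k] for k in best_inliers])
    return a, b, best_inliers

def in_domain(x_new, xs, inliers):
    """Simplified applicability check: is x_new inside the range spanned by the inliers?"""
    values = [xs[k] for k in inliers]
    return min(values) <= x_new <= max(values)

# Synthetic data (NOT measured values): a hypothetical layer thickness (arbitrary
# units) against a hypothetical PV property, with two deliberately corrupted samples.
thickness = [float(i) for i in range(10)]
prop = [2.0 * x + 1.0 for x in thickness]
prop[3], prop[7] = 30.0, -10.0   # injected outliers

slope, intercept, inliers = ransac_line(thickness, prop, threshold=0.5)
print("model: y = %.2f * x + %.2f" % (slope, intercept))
print("samples kept:", inliers)
print("4.5 in domain:", in_domain(4.5, thickness, inliers))
print("20.0 in domain:", in_domain(20.0, thickness, inliers))
```

The two injected outliers are excluded from the consensus set, so the final least-squares fit is driven only by the samples that agree with a common model.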
Interestingly, the success of developing kNN-based models demonstrates that the similar property principle which is well established in the small molecules/pharmaceuticals hyperspace also holds in the PV space.
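The similar property principle underlying kNN can be sketched in a few lines of Python: the property of a new cell is estimated as the average over its nearest neighbors in descriptor space. The function name, the unweighted average, the Euclidean distance and the toy values below are assumptions for illustration only and do not reproduce the models of refs. [7,12].

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Predict a property as the mean over the k training samples closest to `query`."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

# Synthetic toy data (NOT measured values): each cell is described by two
# hypothetical layer thicknesses; the target is a made-up PV property.
train_x = [(100, 50), (110, 55), (105, 60), (300, 200), (310, 190), (290, 210)]
train_y = [1.0, 1.2, 1.1, 3.0, 3.2, 2.8]

# A query cell resembling the first group receives a value close to that group's.
print(knn_predict(train_x, train_y, query=(104, 54), k=3))
print(knn_predict(train_x, train_y, query=(300, 200), k=3))
```

In practice the descriptors would normally be scaled before computing distances, so that no single thickness dominates the neighbor search.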

Conclusions
In this work we demonstrated the power of materials informatics methods to visualize the PV space of MO-based solar cell libraries and to develop predictive QSAR models for key PV properties. Importantly, we wish to emphasize the need to perform studies of the type described in this work in close collaboration with experimentalists, in order both to provide physics- and chemistry-based explanations for the observed trends, effects and correlations and to capitalize on the results.