Nanoinformatics, pp. 3–23

# Descriptors for Machine Learning of Materials Data

## Abstract

Descriptors, which are representations of compounds, play an essential role in machine learning of materials data. Although many representations of elements and structures of compounds are known, these representations are difficult to use as descriptors in their unchanged forms. This chapter shows how compounds in a dataset can be represented as descriptors and applied to machine-learning models for materials datasets.

## Keywords

Machine-learning interatomic potential, Lattice thermal conductivity, Recommender system, Gaussian process, Bayesian optimization

## 1.1 Introduction

Recent developments in data-centric approaches should dramatically accelerate progress in materials science. Thanks to recent advances in computational power and techniques, the results of numerous density functional theory (DFT) calculations with predictive accuracy have been stored in databases. Combining such databases with an efficient machine-learning approach should make it possible to build prediction and classification models for target physical properties. Consequently, machine-learning techniques are becoming ubiquitous. They are used to explore materials and structures among a huge number of candidates and to extract meaningful information and patterns from existing data.

A key factor in controlling the performance of a machine-learning approach is how compounds are represented in a dataset. Representations of compounds are called “descriptors” or “features”. To perform machine-learning modeling, available descriptors must be determined according to the evaluation cost of the target property and the extent of the exploration space. Based on these considerations, we aim to select “good” descriptors. Prior or expert knowledge, including a well-known correlation between the target property and other properties, can be used to select good descriptors. In many cases, however, the set of descriptors is determined by trial and error because the predictive performance (i.e., the prediction error and efficiency of the model) depends strongly on the quality and amount of data for the target property.

Section 1.2 shows how to prepare descriptors of compounds. Sections 1.3 and 1.4 introduce representations of chemical elements (elemental representations) and atomic arrangements (structural representations) required to generate compound descriptors. Sections 1.5, 1.6, 1.7, and 1.8 provide applications of machine-learning models for materials datasets, including the construction of a machine-learning prediction model for the DFT cohesive energy, the construction of the machine-learning interatomic potential (MLIP) for elemental metals, materials discovery of low lattice thermal conductivity (LTC), and materials discovery based on the recommender system approach.

## 1.2 Compound Descriptors

Most candidate descriptors can be classified into three groups. The first is the physical properties of a compound in a library and/or their derivative quantities, which are less available. The second is the physical properties of a compound computed by DFT calculations or their derivative quantities. The third is the properties of elements and the structure of a compound and/or their derivative quantities. Combinations of different groups of descriptors can also be useful.

Candidates for compound descriptors based on DFT calculations include volume, band gap, cohesive energy, elastic constants, dielectric constants, etc. The electronic structure and phonon properties can also be used as descriptors. Although a few first-principles databases are available, the numbers of compounds and physical properties in the databases remain limited. Nevertheless, when a set of descriptors that explains a target property well is discovered, a robust prediction model can be derived for the target property. Examples can be found in the literature (e.g., Refs. [1, 2, 3, 4]). Another candidate is simply a set of binary digits, each representing the presence of an element in a compound (Fig. 1.1) [5]. When the training data are composed of *m* kinds of elements, a compound is described by an *m*-dimensional binary vector with elements of one or zero. As a simple extension, each binary digit can be replaced with the chemical composition. Such an application is shown in Sect. 1.7.
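The binary elemental descriptor and its composition-based extension can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the five-element list, and the use of atomic fractions for the "chemical composition" are assumptions made here.

```python
def binary_descriptor(composition, element_list):
    """Return a 0/1 vector marking which elements of element_list are present."""
    return [1 if composition.get(el, 0) > 0 else 0 for el in element_list]


def composition_descriptor(composition, element_list):
    """Replace each binary digit with the atomic fraction of the element (assumed convention)."""
    total = sum(composition.values())
    return [composition.get(el, 0) / total for el in element_list]


# Hypothetical training set containing m = 5 kinds of elements.
elements = ["Li", "Na", "Cl", "Br", "Pb"]
pbclbr = {"Pb": 1, "Cl": 1, "Br": 1}

print(binary_descriptor(pbclbr, elements))       # [0, 0, 1, 1, 1]
print(composition_descriptor(pbclbr, elements))  # atomic fractions, summing to 1
```

Both vectors have the same length *m*, so compounds with different numbers of constituent elements can be compared directly.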

*n*th representation of atom *i* in compound \(\xi \).

Since the representation matrix is only a representation of the unit cell of compound \(\xi \), a procedure to transform the representation matrix into a set of descriptors is needed to compare different compounds. One approach for this transformation is to regard the representation matrix as a distribution of data points in an \(N_x\)-dimensional space (Fig. 1.2). To compare the distributions themselves, representative quantities are subsequently introduced to characterize the distribution as descriptors, such as the mean, standard deviation (SD), skewness, kurtosis, and covariance. The inclusion of the covariance enables the interaction between the element type and crystal structure to be considered.
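A minimal sketch of this transformation is given below, assuming the representation matrix is stored as one row per atom and one column per representation. The function name, the use of population (rather than sample) statistics, and the toy two-atom matrix are assumptions for illustration; skewness and kurtosis are omitted for brevity.

```python
from itertools import combinations
from statistics import fmean, pstdev


def compound_descriptors(rep_matrix):
    """Map an (atoms x representations) matrix to means, SDs, and covariances."""
    columns = [list(col) for col in zip(*rep_matrix)]
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    covs = []
    # One covariance per unordered pair of representations.
    for a, b in combinations(range(len(columns)), 2):
        products = [(x - means[a]) * (y - means[b])
                    for x, y in zip(columns[a], columns[b])]
        covs.append(fmean(products))
    return means + sds + covs


# Toy two-atom cell: column 0 is the atomic number, column 1 a hypothetical
# structural representation (here a coordination number).
toy_matrix = [[11, 6.0],
              [17, 6.0]]
print(compound_descriptors(toy_matrix))
```

The descriptor length depends only on the number of representations, not on the number of atoms in the unit cell, which is what allows different compounds to be compared.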

A universal or complete set of representations is ideal because it can derive good machine-learning prediction models for all physical properties. However, finding a universal set of representations is nearly impossible. On the other hand, many elemental and structural representations have long been proposed, not only in the literature on machine-learning prediction but also in the standard physics and chemistry literature. Many phenomena in physics and chemistry have been explained using these representations. Therefore, making effective use of the existing representations is a good way to generate descriptors.

## 1.3 Elemental Representations

The literature contains numerous quantities that can be used as elemental representations. This chapter employs a set of elemental representations composed of the following: (1) atomic number, (2) atomic mass, (3) period and (4) group in the periodic table, (5) first ionization energy, (6) second ionization energy, (7) electron affinity, (8) Pauling electronegativity, (9) Allen electronegativity, (10) van der Waals radius, (11) covalent radius, (12) atomic radius, (13) pseudopotential radius for the s orbital, (14) pseudopotential radius for the p orbital, (15) melting point, (16) boiling point, (17) density, (18) molar volume, (19) heat of fusion, (20) heat of vaporization, (21) thermal conductivity, and (22) specific heat. These representations can be classified into the intrinsic quantities of elements (1)–(7), the heuristic quantities of elements (8)–(14), and the physical properties of elemental substances (15)–(22). Such elemental representations should capture essential information about compounds. Therefore, they should assist in building models with a high predictive performance, as shown in Sects. 1.5, 1.7 and 1.8.

## 1.4 Structural Representations

The partial radial distribution function (PRDF) is a well-established representation for various structures. To transform the PRDF into structural representations applicable to machine learning, a histogram representation of the PRDF is adopted with a given bin width and cutoff radius (Fig. 1.3). The number of counts for each bin is used as the structural representation.
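The histogram construction can be sketched as below. To keep the example short, it treats a finite cluster of atoms and ignores periodic images, which a real crystal calculation would have to include; the function name and the toy coordinates are assumptions.

```python
import math


def prdf_histogram(positions, species, pair, bin_width, cutoff):
    """Count neighbor distances for one element pair in bins of width bin_width.

    Finite cluster only: periodic images are NOT considered in this sketch.
    """
    n_bins = math.ceil(cutoff / bin_width)
    counts = [0] * n_bins
    first, second = pair
    for i, (pos_i, sp_i) in enumerate(zip(positions, species)):
        if sp_i != first:
            continue
        for j, (pos_j, sp_j) in enumerate(zip(positions, species)):
            if i == j or sp_j != second:
                continue
            r = math.dist(pos_i, pos_j)
            if r < cutoff:
                counts[min(int(r / bin_width), n_bins - 1)] += 1
    return counts


# Toy planar cluster with 2.8 Å nearest-neighbor spacing (hypothetical values).
positions = [(0.0, 0.0, 0.0), (2.8, 0.0, 0.0), (0.0, 2.8, 0.0), (2.8, 2.8, 0.0)]
species = ["Na", "Cl", "Cl", "Na"]
print(prdf_histogram(positions, species, ("Na", "Cl"), bin_width=1.0, cutoff=6.0))
# [0, 0, 4, 0, 0, 0]
```

One such histogram is built per element pair, and each bin count becomes one structural representation.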

*i* and *j*. For example, a pairwise Gaussian-type function is expressed as

*i*. The third-order invariant bond-orientational order parameter (BOP) \(W_l^{(i)}\) for atomic neighborhoods is expressed by

*j* symbol, satisfying \(m_1+m_2+m_3=0\). A set of both \(Q_l^{(i)}\) and \(W_l^{(i)}\) up to a given maximum *l* is used as the structural representations.

## 1.5 Machine Learning of DFT Cohesive Energy

The simplest option is to use only the mean of each elemental representation as a descriptor. The prediction error in this case is 0.249 eV/atom. Figure 1.4a compares the cohesive energy calculated by DFT to that predicted by the kernel ridge regression (KRR) model, where only the test data in one of the 20 trials are shown. Numerous data points deviate from the diagonal line, which represents equal DFT and KRR energies. When considering the means, SDs, and covariances of the elemental representations, the prediction model has a slightly smaller prediction error of 0.231 eV/atom. Additionally, skewness and kurtosis are not important descriptors for the prediction.

Next, descriptors related to structural representations are introduced. They can be computed from the crystal structure optimized by the DFT calculations or from the initial prototype structures. The former is only useful for machine-learning predictions when a target observation is expensive. Since the optimized structure calculation requires the same computational cost as the cohesive energy calculation, the benefit of machine learning is lost when using the optimized structure. The structural representations are computed from the optimized crystal structure only to examine the limitation of the procedure and representations introduced here. KRR models are constructed using many descriptor sets, which are composed of elemental and structural representations. The cutoff radius is set to 6 Å for the PRDF, the generalized radial distribution function (GRDF), and the angular Fourier series (AFS), while the cutoff radius is set to 1.2 times the nearest neighbor distance for the BOP. This nearest neighbor definition is common for the BOP.

Figure 1.4 compares the DFT and KRR cohesive energies, where the KRR models are constructed by (b) a set of the means of the elemental and PRDF histogram representations and (c) a set of the means, standard deviations, and covariances of the elemental and PRDF histogram representations. When considering the means of the elemental and PRDF representations, the lowest prediction error is as large as 0.166 eV/atom. This means that simply employing the PRDF histogram does not yield a good model for the cohesive energy. However, including the covariances of the elemental and PRDF histogram representations produces a much better prediction model and the prediction error significantly decreases to 0.106 eV/atom.

Considering only the means of the GRDFs, prediction models are obtained with errors of 0.149–0.172 eV/atom. These errors are similar to those of prediction models considering the means of the PRDFs. As in the case of the PRDF, the prediction model improves upon considering the SDs and covariances of the elemental and structural representations. The best model shows a prediction error of 0.045 eV/atom, which is about half that of the best PRDF model. This is also approximately equal to the “chemical accuracy” of 43 meV/atom (1 kcal/mol).
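The equivalence between 1 kcal/mol and about 43 meV quoted above can be checked with a short unit conversion. The sketch assumes the thermochemical calorie (4.184 J) and the exact 2019 SI values of the Avogadro constant and the elementary charge; the variable names are illustrative.

```python
AVOGADRO = 6.02214076e23              # particles per mole (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19   # joules per electronvolt (exact SI value)
JOULES_PER_KCAL = 4184.0              # thermochemical kilocalorie (assumed convention)

# Energy per particle in joules, then in eV, then in meV.
joules_per_particle = JOULES_PER_KCAL / AVOGADRO
kcal_per_mol_in_mev = joules_per_particle / ELEMENTARY_CHARGE * 1000.0

print(f"1 kcal/mol = {kcal_per_mol_in_mev:.1f} meV")
print(f"0.045 eV/atom is {45.0 / kcal_per_mol_in_mev:.2f} times chemical accuracy")
```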

Figure 1.4d compares the DFT and KRR cohesive energies, where a set of the means, SDs, and covariances of the elemental and trigonometric GRDF representations is adopted. Most of the data are located near the diagonal line. We also obtain the best prediction model with a prediction error of 0.041 eV/atom by considering the means, SDs, and covariances of the elemental, 20 trigonometric GRDF, and 20 BOP representations. Therefore, the present method should be useful to search for compounds with diverse chemical properties and applications from a wide range of chemical and structural spaces without performing exhaustive DFT calculations.

## 1.6 Construction of MLIP for Elemental Metals

A wide variety of conventional interatomic potentials (IPs) have been developed based on prior knowledge of chemical bonds in some systems of interest. Examples include Lennard-Jones, embedded atom method (EAM), modified EAM (MEAM), and Tersoff potentials. However, the accuracy and transferability of conventional IPs are often lacking due to the simplicity of their potential forms. On the other hand, an MLIP based on a large dataset obtained by DFT calculations can improve both the accuracy and transferability. In the MLIP framework, the atomic energy is modeled by descriptors corresponding to structural representations, as shown in Sect. 1.4. Once the MLIP is established, its computational cost is similar to that of conventional IPs. MLIPs have been applied to a wide range of materials, regardless of the chemical bonding nature of the materials. Recently, frameworks applicable to periodic systems have been proposed [9, 10, 11].

The Lasso regression has been used to derive a sparse representation for the IP. In this section, we demonstrate the applicability of the Lasso regression to derive the IPs of 12 elemental metals (Na, Mg, Ag, Al, Au, Ca, Cu, Ga, In, K, Li, and Zn) [11, 12]. The features of linear modeling of the atomic energy and descriptors using the Lasso regression include the following. (1) The accuracy and computational cost of the energy calculation can be controlled in a transparent manner. (2) A well-optimized sparse representation for the IP, which can accelerate and increase the accuracy of atomistic simulations while decreasing the computational costs, is obtained. (3) Information on the forces acting on atoms and stress tensors can be included in the training data in a straightforward manner. (4) Regression coefficients are generally determined quickly using the standard least-squares technique.
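How the Lasso produces a sparse set of coefficients can be illustrated with a small cyclic coordinate-descent solver. This is a sketch under stated assumptions, not the procedure used in the cited work: the objective is taken as \(\frac{1}{2n}\Vert y - Xw\Vert^2 + \lambda \Vert w\Vert_1\), the solver and function names are chosen here, and the four-sample dataset is synthetic.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w


# Synthetic data: y = 2.0 * x0 + 0.1 * x1, with x2 irrelevant.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]

for lam in (0.5, 0.05):
    w = lasso_coordinate_descent(X, y, lam)
    n_nonzero = sum(1 for wj in w if wj != 0.0)
    print(f"lambda = {lam}: {n_nonzero} nonzero coefficient(s), w = {w}")
```

In this toy case the weakly contributing term is dropped at the larger \(\lambda \) and retained at the smaller one, which is the mechanism that lets the number of basis functions in the IP be traded against accuracy.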

*i* is formulated as

*F*, we use a polynomial function to construct the MLIPs for the 12 elemental metals. In the approximation considering only the power of \(b_n^{(i)}\), the atomic energy is expressed as

To begin with, training and test datasets are generated from DFT calculations. The test dataset is used to examine the predictive power for structures that are not included in the training dataset. For each elemental metal, 2700 and 300 configurations are generated for the training and test datasets, respectively. The datasets include structures made by isotropic expansions, random expansions, random distortions, and random displacements of ideal face-centered-cubic (fcc), body-centered-cubic (bcc), hexagonal-close-packed (hcp), simple-cubic (sc), \(\omega \), and \(\beta \)-tin structures, in which the atomic positions and lattice constants are fully optimized. These configurations are made using supercells constructed by the \(2\times 2\times 2\), \(3\times 3\times 3\), \(3\times 3\times 3\), \(4\times 4\times 4\), \(3\times 3\times 3\), and \(2\times 2\times 2\) expansions of the conventional unit cells for the fcc, bcc, hcp, sc, \(\omega \), and \(\beta \)-tin structures, which are composed of 32, 54, 54, 64, 81, and 32 atoms, respectively.
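The supercell sizes quoted above follow from the number of atoms in each conventional unit cell multiplied by the cube of the expansion factor. The short check below uses the standard atom counts per conventional cell; the dictionary names are illustrative.

```python
# Atoms in the conventional unit cell of each prototype structure.
ATOMS_PER_CELL = {"fcc": 4, "bcc": 2, "hcp": 2, "sc": 1, "omega": 3, "beta-tin": 4}
# Isotropic expansion factor n of the n x n x n supercell given in the text.
EXPANSION = {"fcc": 2, "bcc": 3, "hcp": 3, "sc": 4, "omega": 3, "beta-tin": 2}

supercell_atoms = {
    name: ATOMS_PER_CELL[name] * EXPANSION[name] ** 3 for name in ATOMS_PER_CELL
}
print(supercell_atoms)
# {'fcc': 32, 'bcc': 54, 'hcp': 54, 'sc': 64, 'omega': 81, 'beta-tin': 32}
```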

Table 1.1 RMSEs for the test data of linear ridge MLIPs using 240 terms (unit: meV/atom)

| Function type for \(f_n\) and \(p_{\mathrm{max}}\) | Na | Mg |
|---|---|---|
| Cosine \((p_{\mathrm{max}} = 1)\) | 7.3 | 11.8 |
| Cosine \((p_{\mathrm{max}} = 2)\) | 1.6 | 2.6 |
| Cosine \((p_{\mathrm{max}} = 3)\) | 1.4 | 1.6 |
| Cosine, Gaussian \((p_{\mathrm{max}} = 3)\) | 1.4 | 1.1 |
| Cosine, Bessel \((p_{\mathrm{max}} = 3)\) | 1.4 | 1.3 |
| Cosine, Gaussian, Bessel \((p_{\mathrm{max}} = 3)\) | 1.4 | 0.9 |

Table 1.2 RMSEs for the energy, force, and stress tensor of the Lasso MLIPs showing the minimum criterion score. The optimal cutoff radius for each element is also shown

| Element | Cutoff radius (Å) | Number of basis functions | RMSE (energy) (meV/atom) | RMSE (force) (eV/Å) | RMSE (stress) (GPa) |
|---|---|---|---|---|---|
| Ag | 7.5 | 190 | 2.2 | 0.011 | 0.07 |
| Al | 8.0 | 210 | 3.5 | 0.020 | 0.12 |
| Au | 6.0 | 165 | 2.4 | 0.030 | 0.15 |
| Ca | 9.5 | 234 | 1.2 | 0.010 | 0.03 |
| Cu | 7.5 | 202 | 2.6 | 0.018 | 0.12 |
| Ga | 10.0 | 266 | 2.2 | 0.017 | 0.09 |
| In | 10.0 | 253 | 2.3 | 0.019 | 0.07 |
| K | 10.0 | 197 | 0.3 | 0.001 | 0.00 |
| Li | 8.5 | 222 | 0.4 | 0.005 | 0.02 |
| Zn | 10.0 | 288 | 2.9 | 0.016 | 0.15 |

Figure 1.6a shows the dependence of the RMSE for the energy and stress tensor of the Lasso MLIP on the number of nonzero regression coefficients for the other ten elemental metals. The number of selected terms tends to increase as the regularization parameter \(\lambda \) decreases. The RMSEs for the energy and stress tensor tend to decrease. Although multiple MLIPs with the same number of terms are sometimes obtained from different values of \(\lambda \), only the MLIP with the lowest criterion score with the same number of terms is shown in Fig. 1.6a. Table 1.2 shows the RMSEs for the energy, force, and stress tensor of the optimal Lasso MLIP. The MLIPs are obtained with the RMSE for the energy in the range of 0.3–3.5 meV/atom for the ten elemental metals using only 165–288 terms. The RMSEs for the force and stress are within 0.03 eV/Å and 0.15 GPa, respectively.

Figure 1.6b compares the energies of the test data predicted by the Lasso MLIP and DFT for Al and Zn, which show the largest and second-largest RMSEs for the energy, respectively. Regardless of the crystal structure, the DFT and Lasso MLIP energies are similar. In addition, the prediction error is clearly independent of the energy despite the wide range of structures included in both the training and test data.

The applicability of the Lasso MLIP to the calculation of the force has also been examined by comparing the phonon dispersion relationships computed by the Lasso MLIP and DFT. The phonon dispersion relationships are calculated by the supercell approach for the fcc structure with the equilibrium lattice constant. The phonon calculations use the phonopy code [20]. Figure 1.6c shows the phonon dispersion relationships of the fcc structure for elemental Al and Zn computed by both the Lasso MLIP and DFT. The phonon dispersion relationships calculated by the Lasso MLIP agree well with those calculated by DFT. This demonstrates that the Lasso MLIP is sufficiently accurate to perform atomistic simulations with an accuracy similar to DFT calculations.

It is important to use an extended approximation for the atomic energy in transition metals [21, 22]. The extended approximation also improves the predictive power for the above elemental metals. The MLIPs are constructed by a second-order polynomial approximation with the AFSs described by Eq. (1.6) and their cross terms. For elemental Ti, the optimized angular-dependent MLIP is obtained with a prediction error of 0.5 meV/atom (35245 terms), which is much smaller than the 17.0 meV/atom of the Lasso MLIP that uses only powers of pairwise descriptors. This finding demonstrates that it is very important to consider angular-dependent descriptors when expressing interatomic interactions of elemental Ti. The angular-dependent MLIP can predict the physical properties much more accurately than existing IPs.

## 1.7 Discovery of Low Lattice Thermal Conductivity Materials

Recently, Togo et al. reported a method to systematically obtain the theoretical LTC through first-principles anharmonic lattice dynamics calculations [23]. Figure 1.7a shows the results of first-principles LTCs for 101 compounds as functions of the crystalline volume per atom, *V*. PbSe with the rocksalt structure shows the lowest LTC, 0.9 W/mK (at 300 K). Its trend is similar to that in a recent report on low LTC for lead- and tin-chalcogenides.

Figure 1.7b compares the computed results with the available experimental data. The satisfactory agreement between the experimental and computed results demonstrates the usefulness of the first-principles LTC data for further studies. A phenomenological relationship has been proposed where \(\log \kappa _L\) is proportional to \(\log V\) [24]. Although a qualitative correlation is observed between our LTC and *V*, it is difficult to predict the LTC quantitatively or discover new compounds with low LTCs only from the phenomenological relationship. It should be noted that the dependence on *V* differs remarkably between rocksalt-type and zincblende- or wurtzite-type compounds. However, zincblende- and wurtzite-type compounds show a similar LTC for the same chemical composition.

The 101 first-principles LTC data have been used to create a model to predict the LTCs of compounds within a library [5]. First, a Gaussian process (GP)-based Bayesian optimization [25] is adopted using two physical quantities as descriptors: *V* and density, \(\rho \). These quantities are available in most experimental or computational crystal structure databases. Although a phenomenological relationship is proposed between \(\log \kappa _L\) and *V*, the correlation between them is low. Moreover, the correlation between \(\log \kappa _L\) and \(\rho \) is even worse.

We start from an observed dataset of five compounds randomly chosen from the full dataset. The Bayesian optimization searches for the compound with the maximum probability of improvement [26] among the remaining data; that is, it selects the compound with the highest Z-score derived from the GP. This compound is added to the observed dataset, and the compound with the next maximum probability of improvement is then searched for. Both the Bayesian optimization and random searches are repeated 200 times, and the average number of observed compounds required to find the best compound is examined.
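The search loop described above can be sketched with a small Gaussian process written in plain Python. Everything specific here is an assumption for illustration: the squared-exponential kernel with unit length scale, the noise level, the one-dimensional synthetic "property", and all function names. Because the probability of improvement is a monotonic function of the Z-score \((y_{\min } - \mu )/\sigma \), the sketch ranks candidates by that score directly.

```python
import math
import random


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(X_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-centered GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X_obs)]
         for i, a in enumerate(X_obs)]
    k_vec = [rbf(a, x_new) for a in X_obs]
    y_mean = sum(y_obs) / len(y_obs)
    alpha = solve(K, [yi - y_mean for yi in y_obs])
    v = solve(K, k_vec)
    mean = y_mean + sum(ki * ai for ki, ai in zip(k_vec, alpha))
    var = max(rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k_vec, v)), 1e-12)
    return mean, math.sqrt(var)


def bayes_opt_min(X, y, n_init=5, seed=0):
    """Return the number of observations made before the global minimum is found."""
    rng = random.Random(seed)
    observed = rng.sample(range(len(X)), n_init)
    best = min(range(len(y)), key=y.__getitem__)
    while best not in observed:
        X_obs = [X[i] for i in observed]
        y_obs = [y[i] for i in observed]
        y_min = min(y_obs)

        def z_score(i):
            mean, sd = gp_posterior(X_obs, y_obs, X[i])
            return (y_min - mean) / sd

        candidates = [i for i in range(len(X)) if i not in observed]
        observed.append(max(candidates, key=z_score))
    return len(observed)


# Synthetic library of 20 "compounds" with a single descriptor each.
X = [[0.2 * i] for i in range(20)]
y = [(x[0] - 2.6) ** 2 for x in X]
print("observations needed:", bayes_opt_min(X, y))
```

Averaging the returned count over many random initial selections gives the quantity the text calls \(N_{\mathrm{ave}}\); replacing the Z-score ranking with a random choice gives the random-search baseline.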

The average numbers of compounds required for the optimization using the Bayesian optimization and random searches, \(N_{\mathrm{ave}}\), are 11 and 55, respectively. The compound with the lowest LTC among the 101 compounds (i.e., rocksalt PbSe) can be found much more efficiently using a Bayesian optimization with only two variables, *V* and \(\rho \). However, using a Bayesian optimization only with these two variables is not a robust method to determine the lowest LTC. For example, when the compounds with the lowest and second-lowest LTCs are intentionally removed from the dataset, the Bayesian optimization with only *V* and \(\rho \) requires \(N_{\mathrm{ave}} = 65\) to find LiI, which is larger than that of the random search (\(N_{\mathrm{ave}} = 50\)). The delay in the optimization should originate from the fact that LiI is an outlier when the LTC is modeled only with *V* and \(\rho \). Such outlier compounds with low LTC are difficult to find only with *V* and \(\rho \).

Better correlations with LTC can be found for parameters obtained from the phonon density of states. Figure 1.8 shows the relationships between the LTC and the physical properties. Other than volume and density, the following quantities are obtained by our phonon calculations: mean phonon frequency, maximum phonon frequency, Debye frequency, and Grüneisen parameter. The Debye frequency is determined by fitting the phonon density of states for a range between 0 and 1/4 of the maximum phonon frequency to a quadratic function. The thermodynamic Grüneisen parameter is obtained from the mode-Grüneisen parameters calculated with a quasi-harmonic approximation and mode-heat capacities. The correlation coefficients *R* between \(\log \kappa _L\) and these physical properties are shown in the corresponding panels. The present study does not use such phonon parameters as descriptors because a data library for such phonon parameters for a wide range of compounds is unavailable. Hereafter, we show results only with the descriptor set composed of 34 binary elemental descriptors on top of *V* and \(\rho \).
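The Debye-frequency fit can be sketched as follows. The text only states that the low-frequency part of the phonon density of states (DOS) is fitted to a quadratic function; the additional steps used here are assumptions: the fit has the one-parameter form \(g(\omega ) = a\omega ^2\), the DOS is normalized to \(3N\) modes for \(N\) atoms, and the Debye frequency follows from \(\int _0^{\omega _D} a\omega ^2\, d\omega = 3N\), i.e., \(\omega _D = (9N/a)^{1/3}\).

```python
def debye_frequency(freqs, dos, n_atoms, fraction=0.25):
    """Least-squares fit of dos = a * freq**2 on [0, fraction * max(freqs)].

    Assumes the DOS integrates to 3 * n_atoms and returns (9 * n_atoms / a)**(1/3).
    """
    f_cut = fraction * max(freqs)
    window = [(f, g) for f, g in zip(freqs, dos) if f <= f_cut]
    # Closed-form least-squares solution for the single coefficient a.
    a = sum(g * f ** 2 for f, g in window) / sum(f ** 4 for f, _ in window)
    return (9.0 * n_atoms / a) ** (1.0 / 3.0)


# Synthetic check: an ideal Debye DOS with cutoff frequency 10 (arbitrary units).
freqs = [0.1 * i for i in range(101)]
dos = [9.0 * f ** 2 / 10.0 ** 3 for f in freqs]
print(debye_frequency(freqs, dos, n_atoms=1))  # close to 10
```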

*V*, \(\rho \), and the 34 binary elemental descriptors for the 101 LTC data, low-LTC compounds are ranked according to the Z-score of the 54779 compounds.

*V* and \(\rho \). The magnitude of the Z-score is plotted in the panels corresponding to the constituent elements. The compounds are widely distributed in \(V-\rho \) space. Thus, it is difficult to identify compounds without performing a Bayesian optimization with elemental descriptors. The widely distributed Z-scores for light elements such as Li, N, O, and F imply that the presence of such light elements has a negligible effect on lowering the LTC. When such light elements form a compound with heavy elements, the compound tends to show a high Z-score. It is also noteworthy that many compounds composed of light elements such as Be and B tend to show a high LTC.

Pb, Cs, I, Br, and Cl exhibit special features. Many compounds composed of these elements exhibit high Z-scores. Most compounds showing a positive Z-score are a combination of these five elements. On the other hand, elements in the periodic table neighboring these five elements do not show analogous trends. For example, compounds of Tl and Bi, which neighbor Pb, rarely exhibit high Z-scores. This may sound odd since \(\text {Bi}_2\text {Te}_3\) is a famous thermoelectric compound, and some compounds containing Tl have a low LTC. This may be ascribed to our selection of the training dataset, which is composed only of AB compounds with 34 elements and three kinds of simple crystal structures. In other words, the training dataset is somehow “biased”. Currently, this bias is unavoidable because first-principles LTC calculations are still too expensive to obtain a sufficiently unbiased training dataset with a large enough number of data points to cover the diversity of the chemical compositions and crystal structures. Nevertheless, the usefulness of a biased training dataset for finding low-LTC materials will be verified in the future. Because the training dataset is biased, not all low-LTC materials in the library may be discovered, but some of them can be.

A ranking of LTCs from the Z-score does not necessarily correspond to the true first-principles ranking. Therefore, a verification process for candidates of low-LTC compounds after the virtual screening is one of the most important steps in “discovering” low-LTC compounds. First-principles LTCs have been evaluated for the top eight compounds after the virtual screening. All of them are considered to form ordered structures. However, the LTC calculation is unsuccessful for \(\text {Pb}_2\text {RbBr}_5\) due to the presence of imaginary phonon modes within the supercell used in the present study. All of the top five compounds, \(\text {PbRbI}_3\), PbIBr, \(\text {PbRb}_4\text {Br}_6\), PbICl, and PbClBr, show an LTC of \({<}0.2\) W/mK (at 300 K), which is much lower than that of the rocksalt PbSe [i.e., 0.9 W/mK (at 300 K)]. This confirms the power of the present GP prediction model to efficiently discover low-LTC compounds. The present method should be useful to search for materials in diverse applications where the chemistry of materials must be optimized.

Finally, the performance of Bayesian optimization has been examined using the compound descriptors derived from elemental and structural representations for the LTC dataset containing the compounds identified by the virtual screening. GP models are constructed using (1) the means and SDs of the elemental representations and GRDFs and (2) the means and SDs of elemental representations and BOPs. Figure 1.10 shows the behavior of the lowest LTC during Bayesian optimization relative to a random search. The optimization aims to find PbClBr with the lowest LTC. For the GP model with the BOP, the average number of samples required for the optimization, \(N_{\mathrm{ave}}\), is 5.0, which is ten times smaller than that of the random search, \(N_{\mathrm{ave}} = 50\). Hence, the Bayesian optimization more efficiently discovers PbClBr than the random search.

To evaluate the ability to find a wide variety of low-LTC compounds, two datasets have been prepared after intentionally removing some low-LTC compounds. In these datasets, CuCl and LiI, which respectively show the 11th-lowest and 12th-lowest LTCs, are solutions of the optimizations. For the GP model with BOPs, the average number of observations required to find CuCl and LiI is \(N_{\mathrm{ave}} = 15.1\) and 9.1, respectively. These numbers are much smaller than those of the random search. On the other hand, for the GP model with GRDFs, the average number of observations required to find CuCl and LiI is \(N_{\mathrm{ave}} = 40.5\) and 48.6, respectively. The delayed optimization may originate from the fact that both CuCl and LiI are outliers in the model with GRDFs, although the model with GRDFs has an RMSE similar to that of the model with BOPs. These results indicate that the set of descriptors needs to be optimized by examining the performance of Bayesian optimization for a wide range of compounds to find outlier compounds.

## 1.8 Recommender System Approach for Materials Discovery

Many atomic structures of inorganic crystals have been collected. Of the few available databases for inorganic crystal structures, the ICSD [29] contains approximately \(10^5\) inorganic crystals, excluding duplicate and incomplete entries. Although this is a rich heritage of human intellectual activities, it covers a very small portion of possible inorganic crystals. Considering 82 nonradioactive chemical elements, the number of simple chemical compositions up to ternary compounds \(\text {A}_a\text {B}_b\text {C}_c\) with integers satisfying \(\max (a, b, c)\le 15\) is approximately \(10^8\), but increases to approximately \(10^{10}\) for quaternary compounds \(\text {A}_a\text {B}_b\text {C}_c\text {D}_d\). Although many of these chemical compositions do not form stable crystals, the huge difference between the number of compounds in ICSD and the possible number of compounds implies that many unknown compounds remain. Conventional experiments alone cannot fill this gap. Often, first-principles calculations are used as an alternative approach. However, systematic first-principles calculations without a priori knowledge of the crystal structures are very expensive.
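The order of magnitude of the composition count up to ternary compounds can be reproduced with a short enumeration. The counting convention is an assumption made here and may differ from the one used by the authors: element sets are unordered, each element receives an integer index between 1 and 15, and index tuples that reduce to the same ratio (a common divisor greater than one) are counted once.

```python
from functools import reduce
from itertools import product
from math import comb, gcd


def reduced_ratios(n_species, max_index=15):
    """Count index tuples in [1, max_index]^n_species whose greatest common divisor is 1."""
    return sum(
        1
        for indices in product(range(1, max_index + 1), repeat=n_species)
        if reduce(gcd, indices) == 1
    )


N_ELEMENTS = 82  # nonradioactive elements, as stated in the text

total_up_to_ternary = sum(
    comb(N_ELEMENTS, k) * reduced_ratios(k) for k in (1, 2, 3)
)
print(f"compositions up to ternary: {total_up_to_ternary:.2e}")
```

Under this convention the total lies between \(10^8\) and \(10^9\), consistent with the order of magnitude quoted above and roughly three orders of magnitude larger than the number of ICSD entries.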

Machine learning is a different approach to consider all chemical combinations. A powerful machine-learning strategy is mandatory to discover new inorganic compounds efficiently. Herein we adopt a recommender system approach to estimate the relevance of the chemical compositions where stable crystals can be formed [i.e., chemically relevant compositions (CRCs)] [30, 31]. The compositional similarity is defined using the procedure shown in Sect. 1.2. A composition is described by a set of 165 descriptors composed of the means, SDs, and covariances of the established elemental representations. The probability for CRCs is subsequently estimated on the basis of a machine-learning two-class classification using the compositional similarity. This approach significantly accelerates the discovery of currently unknown CRCs that are not present in the training database.

## References

1. K. Fujimura, A. Seko, Y. Koyama, A. Kuwabara, I. Kishida, K. Shitara, C.A.J. Fisher, H. Moriwake, I. Tanaka, Adv. Energy Mater. **3**, 980 (2013)
2. A. Seko, T. Maekawa, K. Tsuda, I. Tanaka, Phys. Rev. B **89**, 054303 (2014)
3. J. Lee, A. Seko, K. Shitara, K. Nakayama, I. Tanaka, Phys. Rev. B **93**, 115104 (2016)
4. K. Toyoura, D. Hirano, A. Seko, M. Shiga, A. Kuwabara, M. Karasuyama, K. Shitara, I. Takeuchi, Phys. Rev. B **93**, 054112 (2016)
5. A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput, I. Tanaka, Phys. Rev. Lett. **115**, 205901 (2015)
6. A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, I. Tanaka, Phys. Rev. B **95**, 144110 (2017)
7. P.J. Steinhardt, D.R. Nelson, M. Ronchetti, Phys. Rev. B **28**, 784 (1983)
8. A.P. Bartók, R. Kondor, G. Csányi, Phys. Rev. B **87**, 184115 (2013)
9. J. Behler, M. Parrinello, Phys. Rev. Lett. **98**, 146401 (2007)
10. A.P. Bartók, M.C. Payne, R. Kondor, G. Csányi, Phys. Rev. Lett. **104**, 136403 (2010)
11. A. Seko, A. Takahashi, I. Tanaka, Phys. Rev. B **90**, 024101 (2014)
12. A. Seko, A. Takahashi, I. Tanaka, Phys. Rev. B **92**, 054113 (2015)
13. T. Hastie, R. Tibshirani, J. Friedman, *The Elements of Statistical Learning*, 2nd edn. (Springer, New York, 2009)
14. R. Tibshirani, J. R. Stat. Soc. B **58**, 267 (1996)
15. P.E. Blöchl, Phys. Rev. B **50**, 17953 (1994)
16. J.P. Perdew, K. Burke, M. Ernzerhof, Phys. Rev. Lett. **77**, 3865 (1996)
17. G. Kresse, J. Hafner, Phys. Rev. B **47**, 558 (1993)
18. G. Kresse, J. Furthmüller, Phys. Rev. B **54**, 11169 (1996)
19. G. Kresse, D. Joubert, Phys. Rev. B **59**, 1758 (1999)
20. A. Togo, I. Tanaka, Scr. Mater. **108**, 1 (2015)
21. A. Takahashi, A. Seko, I. Tanaka, Phys. Rev. Mater. **1**, 063801 (2017)
22. A. Takahashi, A. Seko, I. Tanaka (2017), arXiv:1710.05677
23. A. Togo, L. Chaput, I. Tanaka, Phys. Rev. B **91**, 094306 (2015)
24. G.A. Slack, *Solid State Physics*, vol. 34 (Academic Press, New York, 1979), pp. 1–71
25. C.E. Rasmussen, C.K.I. Williams, *Gaussian Processes for Machine Learning* (MIT Press, Cambridge, 2006)
26. D. Jones, J. Global Optim. **21**, 345 (2001)
27. D.B. Kitchen, H. Decornez, J.R. Furr, J. Bajorath, Nat. Rev. Drug Discov. **3**, 935 (2004)
28. A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder et al., APL Mater. **1**, 011002 (2013)
29. G. Bergerhoff, I.D. Brown, *Crystallographic Databases*, edited by F.H. Allen et al. (International Union of Crystallography, Chester, 1987)
30. A. Seko, H. Hayashi, H. Kashima, I. Tanaka (2017), arXiv:1710.00659
31. A. Seko, H. Hayashi, I. Tanaka (2017), arXiv:1711.06387

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.