Handbook of Materials Modeling, pp. 1–27

# Quantum Machine Learning in Chemistry and Materials

## Abstract

Within the past few years, we have witnessed the rise of quantum machine learning (QML) models which infer electronic properties of molecules and materials, rather than solving approximations to the electronic Schrödinger equation. The increasing availability of large quantum mechanics reference datasets has enabled these developments. We review the basic theory and key ingredients of popular QML models, such as the choice of regressor, data of varying trustworthiness, the role of the representation, and the effect of training set selection. Throughout, we emphasize the indispensable role of learning curves in the comparative assessment of different QML models.

## 1 Introduction

Society is becoming increasingly aware of its desperate need for new molecules and materials, be it new antibiotics or efficient energy storage and conversion materials. Unfortunately, chemical compounds reside in, or rather hide among, an unfathomably huge number of possibilities, also known as chemical compound space (CCS). CCS is the set of stable compounds which can be obtained through all combinations of chemical elements and interatomic distances. For medium-sized drug-like molecules, the size of CCS is believed to exceed 10^{60} (Kirkpatrick and Ellis 2004). Exploring CCS and locating the “optimal” compounds is thus an extremely difficult, if not impossible, task. Typically, one needs to constrain the search domain in CCS, obtain certain pertinent properties of the compounds within the subspace, and then choose the compounds whose properties come closest to some preset criteria as potential candidates for subsequent updating or validation. Of course, one can conduct experiments for each compound. Alternatively, one can attempt to estimate its properties using modern atomistic simulation tools which, within one approximation or the other, attempt to solve Schrödinger’s equation on a modern powerful computer.

The latter approach is practically more favorable and is referred to as high-throughput (HT) computational screening (Greeley et al. 2006). In spite of its popularity, it is inherently limited by the accessible computational power, considering that (1) the number of possible compounds is much larger than what HT is typically capable of dealing with (∼10^{3}) and (2) very time-consuming, explicitly electron-correlated methods are often necessary to reach chemical accuracy (1 kcal/mol for energies), with computational cost often scaling as *O*(*N*^{6}) (*N* being the number of electrons, a measure of the system size). Computationally more efficient methods generally suffer from rather weak predictive power. They range from force fields and semiempirical molecular orbital methods over density functional theory (DFT) methods to so-called linear scaling methods which assume locality by virtue of fragments or localized orbitals (Kitaura et al. 1999). It remains an outstanding challenge within conventional computational chemistry that efficiency and accuracy apparently cannot coexist.

To tackle this issue, Rupp et al. (2012) introduced a machine learning (ML) Ansatz capable, for the first time, of predicting atomization energies of out-of-sample molecules both fast and accurately. By now, many subsequent studies have shown that ML models enable fast and yet arbitrarily accurate predictions for *any* quantum-mechanical property. There is no “free lunch,” however: the price to pay consists of the acquisition of a pre-calculated training dataset which must be sufficiently representative and dense.

So what is machine learning? It is a field of computer science that gives computers the ability to learn without being explicitly programmed (Samuel 2000). Among the broad categories of ML tasks, we focus on a type called supervised learning with continuous output, which infers a function from labeled training data. Put formally, given a set of *N* training examples of the form {(*x*_{1},*y*_{1}), (*x*_{2},*y*_{2}), ⋯, (*x*_{N},*y*_{N})} with *x*_{i} and *y*_{i} being, respectively, the input (the representation) and output (the label) of example *i*, an ML algorithm models the implicit function *f* which maps input space *X* to label space *Y*. The trained model can then be applied to predict *y* for a new input *x* (belonging to the so-called test set) absent from the training examples. For quantum chemistry problems, the input of QML (also called the representation) is usually a vector/matrix/tensor directly obtained from the composition and geometry {*Z*_{I}, **R**_{I}} of the compound, while the label could be *any* electronic property of the system, notably the energy. The function *f* is implicitly encoded in terms of the nonrelativistic Schrödinger equation (SE) within the Born-Oppenheimer approximation, \(\hat {H}\varPsi = E\varPsi \), whose exact solution is unavailable for all but the smallest and simplest systems. To generate training data, methods with varied degrees of approximation have to be used instead, such as the aforementioned DFT, quantum Monte Carlo (QMC), etc.

Given a specific pair of *X* and *Y* , there are multiple strategies to learn the implicit function *f* : *X* → *Y* . Some of the most popular ones are artificial neural network (ANN, including its various derivatives, such as convolutional neural network) and kernel ridge regression (KRR, or more generally Gaussian process regression).

Based on a recent benchmark paper (Faber et al. 2017b), KRR and ANN are competitive in terms of performance. KRR, however, has the great advantages of simplicity of interpretation and ease of training, provided an efficient representation is used. Within this chapter, we therefore focus exclusively on KRR and Gaussian processes (see Sect. 2 for more details).

Often, each training example is represented by a pair (*x*_{i}, *y*_{i}). However, multiple labels {*y*_{j}}_{i} can also be used, e.g., when several labels are available for the same molecule, possibly resulting from different levels of theory. The latter situation can be very useful for obtaining highly accurate QML models when accurate training data are scarce while coarse data are easy to obtain. Multi-fidelity methods take care of such cases and will be discussed in Sect. 3.

Once the suitable QML model is selected, be it either in terms of ANN, KRR, or in terms of a multi-fidelity approach, two additional key factors will have a strong impact on the performance: The material representation and the selection procedure of the training set. The representation of any compound should essentially result from a bijective map which uses as input the same information which is also used in the electronic Hamiltonian of the system, i.e., compositional and structural information {*Z*_{I},**R**_{I}} as well as electron number. The representation is then typically formatted into a vector which can easily be processed by the computer. Some characteristic representations, introduced in the literature, are described in Sect. 4, where we will see how the performance of QML models can be enhanced dramatically by accounting for more of the underlying physics. In Sect. 5, further improvements in QML performance are discussed resulting from rational training set selection, rather than from random sampling.

Having introduced the basics of ML, we point out two aspects of ML that may not be obvious but that help in interpreting how ML works: (1) ML is an inductive approach based on rigorous inductive reasoning, and it does not require *any* a priori knowledge about the aforementioned implicit function *f* (see Sect. 2), though some insight into what *f* may look like is invaluable for the rational design of representations (see Sect. 4); (2) ML is of interpolative nature, that is, to make reasonable predictions, the new input must fall into the interpolating regime. Furthermore, as more training examples are added to the interpolating regime, the performance of the ML model can be systematically improved, provided a proper (unique) representation is used (see Sect. 4).

## 2 Gaussian Process Regression

In this section, we discuss the basic idea of data-driven prediction of labels: the Gaussian process regression (GPR). In the case of a global representation (i.e., the representation of any compound as a single vector, see Sect. 4 for more details), the corresponding QML model takes the same form as in kernel ridge regression (KRR), also termed the global model. GPR is more general than KRR in the sense that GPR is equally applicable to local representations (i.e., the representation of any compound as a 2D array, with each atom in its environment represented by a single vector, see Sect. 4 for more details). Local GPR models can still successfully be applied when it comes to the prediction of extensive properties (e.g., total energy, isotropic polarizability, etc.) which profit from nearsightedness. The locality can be exploited for the generation of scalable GPR-based QML models which can be used to estimate extensive properties of very large systems.

### 2.1 The Global Model

Consider the standard linear model with additive noise *ε*:

\(f(\mathbf{x}) = \mathbf{w}^{\top}\phi(\mathbf{x}), \qquad y = f(\mathbf{x}) + \varepsilon,\)

where **x** ∈ **X** is the representation, **w** is a vector of weights, and *ϕ*(**x**) is the basis function (or kernel) which maps a *D*-dimensional input vector **x** into an *N*-dimensional feature space. This is the space into which the input vector is mapped, e.g., for an input vector **x**_{1} = (*x*_{11}, *x*_{12}) with *D* = 2, its feature space could be \(\phi ({\mathbf {x}}_1)=(x_{11}^2,x_{11}x_{12},x_{12}x_{11},x_{12}^2)\) with *N* = 4. **y** is the label, i.e., the observed property of target compounds. We further assume that the noise *ε* follows an independent, identically distributed (iid) Gaussian distribution with zero mean and variance *λ*, i.e., \(\varepsilon \sim \mathscr {N}(0, \lambda )\), which gives rise to the probability density of the observations given the parameters **w**, or the likelihood:

\(p(\mathbf{y}\,|\,\phi(\mathbf{X}), \mathbf{w}) = \mathscr{N}\big(\phi(\mathbf{X})^{\top}\mathbf{w},\, \lambda I\big),\)

where *ϕ*(**X**) is the aggregation of columns *ϕ*(**x**) for all cases in the training set. Now we put a zero mean Gaussian prior with covariance matrix *Σ*_{p} over **w** to express our beliefs about the parameters before we look at the observations, i.e., \(\mathbf {w} \sim \mathscr {N}(0, {\Sigma }_p)\). Together with Bayes’ rule, the distribution of **w** can be updated as

\(p(\mathbf{w}\,|\,\mathbf{y}, \phi(\mathbf{X})) \propto p(\mathbf{y}\,|\,\phi(\mathbf{X}), \mathbf{w})\, p(\mathbf{w}),\)

which is called the posterior, with mean \(\bar {\mathbf {w}}\). Thus the predictive distribution for *y*_{∗} = *f*(**x**_{∗}) is a Gaussian whose mean is

\(\bar{y}_* = K({\mathbf{x}}_*, \mathbf{X})\,[K(\mathbf{X}, \mathbf{X}) + \lambda I]^{-1}\,\mathbf{y},\)

where *I* is the identity matrix and *K*(**X**, **X**) = *ϕ′*(**X**)^{⊤}*ϕ′*(**X**) (\(\phi '(\mathbf {X}) = {\Sigma }_p^{1/2} \phi (\mathbf {X})\)) is the kernel matrix (also called covariance matrix, abbreviated as Cov). It is not necessary to know *ϕ* explicitly; its existence is sufficient. Given a Gaussian basis function, i.e., \(\phi '(x)=\exp (-(x-x_0)^2/(2l^2))\) with *x*_{0} and *l* being some fixed parameters, it can easily be shown that the (*i*, *j*)-th element of the kernel matrix *K* is

\(K_{ij} = k({\mathbf{x}}_i, {\mathbf{x}}_j) = \exp \left(- \frac{||{\mathbf{x}}_i - {\mathbf{x}}_j||_2^2}{2\sigma^2} \right),\)

where ||⋅||_{p} is the *L*_{p} norm and *σ* is the kernel width determining the characteristic length scale of the problem. Note that we have avoided the infeasible computation of feature vectors of infinite size by using some kernel function *k*. This is also called the kernel trick. Other kernels can be used just as well, e.g., the Laplacian kernel, \(k({\mathbf {x}}_i, {\mathbf {x}}_j) = \exp \left (- \frac {||{\mathbf {x}}_i - {\mathbf {x}}_j||{ }_1}{\sigma } \right ) \). In terms of the kernel, the prediction for a query **x**_{∗} reads \(y_* = \sum_i c_i\, k({\mathbf{x}}_i, {\mathbf{x}}_*)\), where **c** is the regression coefficient vector, \(\mathbf{c} = [K + \lambda I]^{-1}\mathbf{y}\).

Note that this expression can also be obtained by minimizing the cost function \(C(\mathbf {w}) = \frac {1}{2}\sum _i (y_i - {\mathbf {w}}^{\top }\phi ({\mathbf {x}}_i))^2 + \frac {\lambda }{2} ||\mathbf {w}||{ }_2^2\) with respect to **w**.

Note that *L*_{2} regularization is used, with the regularization parameter *λ* acting as a weight that balances minimizing the sum of squared errors (SSE) against limiting the complexity of the model. This eventually leads to the model called kernel ridge regression (KRR).
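The closed-form expressions above translate directly into code. The following minimal NumPy sketch (all function and variable names are ours, not from any QML package) trains a Gaussian-kernel KRR model on a toy 1D function and predicts a held-out point:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_train(X, y, sigma, lam):
    """Regression coefficients c = (K + lambda*I)^-1 y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, c, X_query, sigma):
    """y_* = sum_i c_i k(x_i, x_*)."""
    return gaussian_kernel(X_query, X_train, sigma) @ c

# toy problem: learn f(x) = sin(x) from 50 noise-free samples
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(50, 1))
y = np.sin(X).ravel()
c = krr_train(X, y, sigma=1.0, lam=1e-10)   # minute lambda: noise-free data
y_star = krr_predict(X, c, np.array([[3.0]]), sigma=1.0)
```

Note how *λ* plays the double role discussed in Sect. 2.3: noise level in the GPR view and regularizer in the KRR view.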

All variants of these global models, however, suffer from the scalability problem for extensive properties of the system such as energy, i.e., the prediction error grows systematically with respect to query system size (predicted estimates will tend toward the mean of the training data while extensive properties grow). This limitation is due to the interpolative nature of global ML models, that is, the predicted query systems and their properties must lie within the domain of training data.

### 2.2 The Local Version

For an extensive property, e.g., the total energy (*E*) of the system, it is usually expressed as a sum over atomic energies (*e*), \(E = \sum_I e_I\). A rigorous partitioning of this kind is provided by Bader’s atoms-in-molecules scheme, in which each atomic energy is obtained by integration over the atomic basin *Ω*_{I} determined by the zero-flux condition of the electron density, \(\nabla \rho(\mathbf{r}_s) \cdot \mathbf{n}(\mathbf{r}_s) = 0\), where **n**(**r**_{s}) is the unit vector normal to the basin surface at **r**_{s}. The advantage of using Bader’s scheme is that the total energy is exactly recovered and that, at least in principle, it includes all short- and long-ranged bonding, i.e., covalent as well as non-covalent (e.g., van der Waals interaction, Coulomb interaction, etc.). Furthermore, due to the nearsightedness of atoms in electronic systems (Prodan and Kohn 2005), atoms with similar local chemical environments contribute a similar amount of energy to the total energy. Using the notion of alchemical derivatives, this effect, a.k.a. chemical transferability, has recently been demonstrated numerically (Fias et al. 2017). Thus it is possible to learn effective atomic energies based on a representation of the local atoms. Unfortunately, the explicit calculation of local atoms is computationally involved (the location of the zero-flux plane is challenging for large molecules), making this approach less favorable. Instead, we can also assume that the aforementioned Bayesian model is applicable to atomic energies as well, i.e.,

\(e({\mathbf{x}}^I) = \sum_j c_j \sum_{J \in j} k({\mathbf{x}}^I, {\mathbf{x}}^J_j),\)

where **x**^{I} is an atomic representation of atom *I* in a molecule and *j* runs over training molecules. By summing up terms on both sides of this equation over all atoms of molecule *i*, we have

\(E_i = \sum_j c_j \sum_{I \in i} \sum_{J \in j} k({\mathbf{x}}^I_i, {\mathbf{x}}^J_j) = \sum_j c_j K_{ij},\)

where *I* and *J* run over all the respective atomic indices in molecules *i* and *j* and where \({\mathbf {x}}^I_i\) is the representation of atom *I* in molecule *i*. Training thus proceeds exactly as in the global model, with the molecular kernel \(K_{ij} = \sum_{I\in i}\sum_{J\in j} k({\mathbf{x}}^I_i, {\mathbf{x}}^J_j)\) and regression coefficients \(c_i = \sum_j ([K + \lambda I]^{-1})_{ij} E_j\). This equation can be rearranged such that the estimated contribution of a query atom *J* to the total energy is decomposed into a linear combination of contributions from each training compound *i*, weighted by its regression coefficient, where the contribution of a training atom *I* ∈ *i* grows with its similarity to atom *J*:

\(e({\mathbf{x}}^J) = \sum_i c_i \sum_{I \in i} k({\mathbf{x}}^I_i, {\mathbf{x}}^J).\)

We note in passing that the value of the covariance matrix element between two compounds (i.e., the double sum over atomic kernels) increases when the size of either system *i* or *j* grows, indicating that the scalability issue can be effectively resolved.
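The double sum over atomic kernels can be sketched directly (a minimal illustration assuming per-atom representation vectors are already available; the function names and toy vectors are ours):

```python
import numpy as np

def atomic_kernel(xi, xj, sigma):
    """Gaussian kernel between two atomic-environment vectors."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma**2))

def molecular_kernel(mol_i, mol_j, sigma):
    """K_ij = sum_{I in i} sum_{J in j} k(x_i^I, x_j^J); the value grows with
    the number of atoms, matching the extensivity of the target property."""
    return sum(atomic_kernel(a, b, sigma) for a in mol_i for b in mol_j)

# two toy "molecules" given as arrays of per-atom representation vectors
mol_a = np.array([[0.0, 1.0], [1.0, 0.0]])               # 2 atoms
mol_b = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])   # 3 atoms
K_ab = molecular_kernel(mol_a, mol_b, sigma=1.0)
```

Adding a third atom to either molecule adds further non-negative terms to the sum, which is exactly the size dependence noted above.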

### 2.3 Hyper-Parameters

Within the framework of GPR or KRR, there are two sets of parameters: (1) parameters determined via training, i.e., the coefficients **c**, whose number grows with the training data, and (2) hyper-parameters whose values are set before the learning process begins, i.e., the kernel width *σ* and the noise level (or regularization parameter) *λ* introduced in Sect. 2.1.

As defined in Sect. 2.1, *λ* measures the level of noise in the training data in GPR. Thus, if the training data is noise-free, *λ* can safely be set to zero or a value extremely close to zero (e.g., 1 × 10^{−10}) to reach optimal performance. This is generally true for datasets obtained by typical quantum chemical calculations, and the resulting training error is (almost) zero. Whenever there is noise in the data (e.g., from experimental measurements), the best *λ* corresponds to some finite value depending on the noise level. The same holds for the training error. In KRR, *λ* seems to have a completely different meaning at first glance: the regularization parameter determining the complexity of the model. In essence, the two interpretations amount to the same thing, i.e., a minute or zero *λ* corresponds to the perfectly interpolating model which connects every single point in the training data, thus representing the most faithful model for the specific problem at hand. One potential risk is poor generalization to new input data (test data), as there could be “overfitting” of the training set. A finite *λ* assumes some noise in the training data, which the model can only account for in an averaged way; the model complexity is thus reduced to some extent by lowering the magnitude of the parameters **w** so as to minimize the cost function *C*(**w**). Meanwhile, some finite training error is introduced. To recap, the balance between SSE and regularization is vital and is reflected by a proper choice of *λ*.

Unlike *λ*, the optimal value of *σ* (*σ*_{opt}) is more dataset specific. Roughly speaking, it is a measure of the diversity of the dataset and controls the similarity (covariance matrix element) of two systems. Typically *σ*_{opt} gets larger as the training data expand into a larger domain. The meaning of *σ* can be elaborated by considering two extremes: (1) when *σ* approaches zero, the training data will be reproduced exactly, i.e., *c*_{i} = *y*_{i}, but with high error for test data, whose predictions revert to the mean, and (2) when *σ* goes to infinity, all kernel matrix elements tend toward one, i.e., a singular matrix, resulting in large errors in both training and test. Thus, the optimal *σ* can be interpreted as a coordinate scaling factor that renders the kernel matrix well-conditioned. For example, Ramakrishnan and von Lilienfeld (2015) selected the lower bound of the kernel matrix elements to be 0.5. For a Gaussian kernel, this implies that \(K_{\min } = \exp (-D_{\max } ^2/2\sigma _{\mathrm {opt}}^2) \approx 0.5\), or \(\sigma _{\mathrm {opt}} \approx D_{\max }/\sqrt {2\ln 2}\), where \(D_{\max }\) is the largest distance matrix element of the training data. Following the same reasoning, *σ*_{opt} can be set to \(D_{\max }/\ln 2\) for a Laplacian kernel.
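This heuristic fits in a few lines (a sketch with our own function name; the L2 norm is used for the Gaussian kernel and the L1 norm for the Laplacian, matching the kernel definitions of Sect. 2.1):

```python
import numpy as np

def sigma_opt(X, kernel="gaussian"):
    """Heuristic initial kernel width from the largest pairwise distance
    D_max, chosen so that the smallest kernel element K_min ~ 0.5."""
    if kernel == "gaussian":
        D = np.sqrt(np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # L2
        return D.max() / np.sqrt(2.0 * np.log(2.0))
    D = np.sum(np.abs(X[:, None] - X[None, :]), axis=-1)              # L1
    return D.max() / np.log(2.0)                                      # Laplacian

X = np.array([[0.0], [1.0], [3.0]])   # toy training representations
s = sigma_opt(X)                      # D_max / sqrt(2 ln 2)
```

By construction, the most distant pair of training points then has a Gaussian kernel element of exactly 0.5.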

The above heuristics are very helpful to quickly identify reasonable initial guesses for the hyper-parameters of a new dataset. Subsequently, the optimal values of the hyper-parameters should be fine-tuned through *k*-fold cross-validation (CV). The idea is to first split the training set into *k* smaller subsets and then (1) for each of the *k* subsets, train a model using the remaining *k* − 1 subsets as training data and test the resulting model on the held-out subset to calculate the predictive error; this step yields *k* error estimates, one for each fold. (2) The overall error reported by *k*-fold cross-validation is then the average of these *k* values. The optimal hyper-parameters are the ones minimizing the overall error. This approach can become computationally demanding when *k* and the training set size are large. But it is of major advantage in problems such as inverse inference where the number of samples is very small, and its systematic application minimizes the likelihood of statistical artifacts.
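The fine-tuning step can be sketched as a simple grid search with *k*-fold CV (plain NumPy; the function names and the toy dataset are ours, not from any library):

```python
import numpy as np

def kfold_mae(X, y, sigma, lam, k=5, seed=0):
    """Average test MAE over k folds for a Gaussian-kernel KRR model."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    maes = []
    for f in folds:
        train = np.setdiff1d(idx, f)                # remaining k-1 subsets
        d2 = lambda A, B: np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
        K = np.exp(-d2(X[train], X[train]) / (2.0 * sigma**2))
        c = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        pred = np.exp(-d2(X[f], X[train]) / (2.0 * sigma**2)) @ c
        maes.append(np.mean(np.abs(pred - y[f])))   # error on held-out fold
    return np.mean(maes)

# grid search over (sigma, lambda) on a toy dataset
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(60, 1))
y = np.sin(X).ravel()
grid = [(s, l) for s in (0.3, 1.0, 3.0) for l in (1e-10, 1e-4)]
best = min(grid, key=lambda p: kfold_mae(X, y, *p))
```

In practice one would start the grid around the heuristic *σ*_{opt} discussed above and refine it iteratively.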

### 2.4 Learning Curves

To assess the performance of a QML model, we need not only its predictive error (*ε*, which can be characterized by the mean absolute error (MAE) or root mean squared error (RMSE) of prediction) for a specific training set but also predictive errors for varied sizes of training sets. Therefore, we can monitor how much progress we have achieved after some incremental changes to the training set size (*N*) so as to extrapolate how much more training data is needed to reach a desirable accuracy. The plot of the *ε* versus *N* relationship is called the learning curve (LC), and examples are shown in Fig. 1 (note that only the test error, i.e., the MAE for the prediction of new data in the test set, is shown; training errors are always zero or minute for noise-free training data).

Multiple factors control the shape of the learning curve, one of which is the choice of representation. If the representation cannot uniquely encode the molecule, i.e., there may exist cases where two different molecules share the same input vector **x**_{i} but have different molecular properties, then it causes ambiguity for the ML algorithm (see more details in Sect. 4.1) and may consequently lead to no learning at all, as illustrated by the dashed curve in Fig. 1, which flattens out at larger training set sizes, indicating poor ML performance.

When *N* is sufficiently large, the predictive error is proportional to the so-called “fill distance” or *mesh norm* *h*_{X}, defined as

\(h_{\mathbf{X}} = \sup_{\mathbf{x}' \in \varOmega}\, \min_{\mathbf{x} \in \mathbf{X}} ||\mathbf{x}' - \mathbf{x}||,\)

where **x** is again the representation of any training instance as an element of the training set **X**, and *Ω* represents the domain of studied systems (i.e., the potential energy surface domain for chemistry problems). Clearly from the definition, the fill distance describes the geometric relation of the set **X** to the domain *Ω* and quantifies how densely **X** covers *Ω*. Furthermore, the fill distance intrinsically contains a dimension dependence *d*, that is, *h*_{X} scales roughly as *N*^{−1∕d} if the **x** are uniform or random grid points in a *d*-dimensional space.

Apart from the exponent, there should also be a prefactor; thus the leading term of the overall predictive error can be described as *b* ∗ *N*^{−a∕d}, where *a* in the exponent is a constant. Therefore, to visualize the error vs. *N*, a log-log scale is the most convenient, for which the learning curve can be represented by a linear relationship: \(\log (\varepsilon ) \approx \log (b) - \frac {a}{d} \log (N)\); thus *a*∕*d* quantifies the rate of learning, while the prefactor \(\log (b)\) is the vertical offset of the learning curve. Through a series of numerical calculations, learning a 1D Gaussian function as well as ground state properties of molecules with steadily improving physics encoded in the representation, it has been found (Huang and von Lilienfeld 2016) that the offset \(\log (b)\) is a measure of target property similarity, defined as the deviation of the proposed model (corresponding to the representation used) from the true model (Huang and von Lilienfeld 2016). While, in general, we do not know the true function (machine learning would be meaningless if we did), we often do have considerable knowledge about the relative target similarity of different representations.
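On the log-log scale, extracting the learning rate *a*∕*d* and the offset \(\log(b)\) amounts to a linear fit; a short sketch with hypothetical (*N*, MAE) values (not from any published learning curve):

```python
import numpy as np

# hypothetical (N, MAE) pairs read off a learning curve
N = np.array([100, 200, 400, 800, 1600])
mae = np.array([10.0, 7.1, 5.0, 3.6, 2.5])

# linear fit on the log-log scale: log(eps) ~ log(b) - (a/d) * log(N)
slope, intercept = np.polyfit(np.log(N), np.log(mae), 1)
rate = -slope        # a/d: the learning rate (magnitude of the slope)
offset = intercept   # log(b): the vertical offset of the learning curve

# extrapolate the training set size needed to reach a target error of 1.0
n_needed = np.exp((offset - np.log(1.0)) / rate)
```

Such an extrapolation is exactly the practical use of learning curves advocated above: estimating how much more training data a desired accuracy will cost.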

Applying the findings above to chemistry problems, we can obtain some insight into how learning curves will behave. Several observations can be explained: First, the learning rate is almost constant or changes very little when different unique representations are used, as the rate depends primarily on the dimensionality of the domain spanned by the molecules considered on the potential energy surface. Second, for a series of isomers, it is much easier to learn their properties in their relaxed equilibrium state than in a distorted geometry.

The limitation that the learning rate will not change much for random sampling with unique representations seems to be a big obstacle toward more efficient ML predictions, meaning that developing a better representation (to lower the offset) can become very difficult even after substantial effort has been invested. However, is it possible to break this curse, reaching an improved learning curve as illustrated by the pink line in Fig. 1? We believe that this should be possible. Note that the linear (log-log) learning curve is obtained for *statistical* models. This implies that there must be “redundancy” in the training data; if we were able to remove those redundancies a priori, we might very well be able to boost the performance and observe superior LCs, such as the pink line in Fig. 1 with large learning rates. In such a case, statistics is unlikely to hold, and the LC may be just a monotonically decreasing function, possibly even a damped oscillation, rather than a line. Strategies for rational sampling will be elaborated in detail in Sect. 5.

## 3 Multilevel Learning

By default, we assume that for each **x**_{i} ∈ **X**, there exists one corresponding *y*_{i} ∈ *Y* in the training examples. This makes perfect sense if *Y* is easy to compute, i.e., in circumstances where a relatively low accuracy of *Y* suffices (e.g., PBE with a medium-sized basis set). It is also possible that highly accurate reference data are required (e.g., CCSD(T) calculations with a large basis set) so as to achieve highly reliable predictions. Unfortunately, we can then afford only a few highly accurate **x** and *y*’s for training, considering the great computational burden. In this situation, one can take great advantage of *y*’s with lower levels of accuracy, which are much easier to obtain. Models which shine in this kind of scenario are called multi-fidelity models, where reference data based on a high (low) level of theory is said to have high (low) fidelity. The nature of this approach is to explore and exploit the inherent correlation among datasets with different fidelities. Here we employ the Gaussian process as introduced in Sect. 2 to explain the main concepts and mathematical structure of multilevel learning.

### 3.1 Multi-fidelity

Suppose we have two training sets {**X**, **y**^{(1)}} (in which the pairs of data are \(({\mathbf {x}}_1, y_1^{(1)}), ({\mathbf {x}}_2, y_2^{(1)}),~\dots \)) and {**X**, **y**^{(2)}}, where **y**^{(2)} has a higher level of fidelity. The number of data points in the two sets is, respectively, *N*_{1} and *N*_{2}, with *N*_{1} > *N*_{2}, reflecting the fact that high-fidelity data are scarce. We consider the following autoregressive model proposed by Kennedy and O’Hagan (2000):

\({\mathbf{y}}^{(2)} = \rho\, {\mathbf{y}}^{(1)} + \delta^{(2)},\)

where *ρ* is a scaling factor and **y**^{(1)} and *δ*^{(2)} are two independent Gaussian processes, i.e., \({\mathbf{y}}^{(1)} \sim \mathscr{N}(0, K_1)\) and \(\delta^{(2)} \sim \mathscr{N}(0, K_2)\). The fact that **y**^{(1)} and *δ*^{(2)} are independent (notated as **y**^{(1)} ⊥ *δ*^{(2)}) indicates that the mean of **y**^{(1)}*δ*^{(2)} satisfies **E**[**y**^{(1)}*δ*^{(2)}] = **E**[**y**^{(1)}]**E**[*δ*^{(2)}], and thus the covariance between **y**^{(1)} and *δ*^{(2)} is zero, i.e., Cov(**y**^{(1)}, *δ*^{(2)}) = **E**[**y**^{(1)}*δ*^{(2)}] − **E**[**y**^{(1)}]**E**[*δ*^{(2)}] = 0. Therefore, **y**^{(2)} is also a Gaussian process with mean 0 and covariance

\(\mathrm{Cov}({\mathbf{y}}^{(2)}, {\mathbf{y}}^{(2)}) = \rho^2 K_1 + K_2.\)

The only missing piece is the covariance between **y**^{(1)} and **y**^{(2)}, which represents the inherent correlation between datasets with different levels of fidelity and is derived as Cov(**y**^{(1)}, **y**^{(2)}) = *K*_{12} = *ρ* Cov(**X**, **X**) = *ρK*_{1} due to the same independence restriction. Now the multi-fidelity structure can be written in the following compact form of a multivariate Gaussian process:

\(\begin{pmatrix} {\mathbf{y}}^{(1)} \\ {\mathbf{y}}^{(2)} \end{pmatrix} \sim \mathscr{N}\left(0, \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}\right),\)

where *K*_{11} = *K*_{1}, *K*_{22} = *ρ*^{2}*K*_{1} + *K*_{2} ≠ *K*_{2}, and *K*_{12} = *K*_{21} due to symmetry. The importance of *ρ* is quite evident from the term *K*_{12}; specifically, when *ρ* = 0, the high-fidelity and low-fidelity models are completely decoupled, and there will be no improvement of the prediction at all by combining the two models.

We are now in a position to make a prediction for a new input **x**_{∗}, given the two levels of training data {**X**, **y**^{(1)}} and {**X**, **y**^{(2)}}. To this end, we first write down the joint density of the training labels and the sought high-fidelity value \(y^{(2)}_*\); conditioning this multivariate Gaussian on the observed **y**^{(1)} and **y**^{(2)} then yields the predictive mean and variance of \(y^{(2)}_*\).

We note in passing that since there are two correlation functions *K*_{1} and *K*_{2}, two sets of hyper-parameters regarding the kernel widths, plus an extra scaling parameter *ρ*, have to be optimized following an approach similar to that explained in Sect. 2.3. This algorithm has already been applied successfully to the prediction of band gaps of elpasolite compounds with high accuracy (Pilania et al. 2017), and it can naturally be extended to other properties. So far, not much work has been done using this algorithm; its potential to tackle complicated chemical problems has yet to be unraveled by future work.
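The autoregressive idea can be illustrated with a simplified two-level sketch (not the full joint-GP inference of Kennedy and O’Hagan: here *ρ* is estimated by least squares and the discrepancy *δ* is fitted as a second KRR model on the residuals; all names and the synthetic data are ours):

```python
import numpy as np

def k_gauss(A, B, sigma):
    """Gaussian kernel matrix between rows of A and B."""
    d2 = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit(X, y, sigma, lam=1e-10):
    """KRR fit; returns a prediction closure."""
    c = np.linalg.solve(k_gauss(X, X, sigma) + lam * np.eye(len(X)), y)
    return lambda Q: k_gauss(Q, X, sigma) @ c

# synthetic fidelities: abundant cheap labels y1, scarce expensive labels y2
rng = np.random.default_rng(2)
X1 = rng.uniform(0, 6, size=(60, 1))
X2 = X1[:15]                                   # high-fidelity subset (N2 < N1)
y1 = np.sin(X1).ravel()                        # low fidelity
y2 = 1.2 * np.sin(X2).ravel() + 0.3 * np.cos(2 * X2).ravel()  # high fidelity

m1 = fit(X1, y1, sigma=1.0)                    # level-1 surrogate
rho = np.linalg.lstsq(m1(X2)[:, None], y2, rcond=None)[0][0]  # scaling factor
m_delta = fit(X2, y2 - rho * m1(X2), sigma=1.0)               # discrepancy model
predict_high = lambda Q: rho * m1(Q) + m_delta(Q)             # y2* = rho*y1* + d*
```

The design choice mirrors the structure above: the abundant low-fidelity data fix the overall shape, and the scarce high-fidelity data only need to correct the (hopefully smooth and small) discrepancy.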

### 3.2 *Δ*-Machine Learning

In the special case where *N*_{1} is equal to *N*_{2}, the low- and high-fidelity models are, respectively, called baseline and target. The baseline property (*y*^{(b)}) is associated with the baseline geometry as encoded in its representation (\(\vec {x}^{(b)}\)), and the target property *y*^{(t)} is associated with the target geometry \(\vec {x}^{(t)}\), respectively. The workhorse of this model is

\(y^{(t)}(\vec{x}^{(t)}) \approx y^{(b)}(\vec{x}^{(b)}) + \sum_i c_i\, k(\vec{x}^{(b)}, \vec{x}^{(b)}_i),\)

i.e., the machine learns only the difference between target and baseline properties.

The *Δ*-ML model has been shown to be capable of yielding highly accurate results for energies if a proper baseline model is used. Other properties can also be predicted with much higher precision compared to a traditional single-fidelity model (Ramakrishnan et al. 2015a). What is more, this approach can save substantial computational time. However, the *Δ*-machine learning model is not fully consistent with the multi-fidelity model. The closest scenario is to set *K*_{1} = *K*_{2} when evaluating the kernel functions of the multi-fidelity model, but this still results in something quite different. There are further issues one would like to resolve: (i) the coupling between different fidelities is not explicit, and the correlation is rather naively accounted for through the *Δ* of the properties from the two levels, assuming a smooth transition from a property surface (e.g., potential energy surface) at one level of theory to another; this assumption is questionable and may fail badly in some cases; (ii) it requires the same amount of data for both levels, which can be circumvented by building recursive versions.
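A minimal sketch of the *Δ*-ML workflow on synthetic data (the baseline and target labels here are toy stand-ins for, e.g., semiempirical and coupled-cluster energies; all names are ours, and baseline and target geometries are taken to coincide for simplicity):

```python
import numpy as np

def k_gauss(A, B, sigma):
    """Gaussian kernel matrix between rows of A and B."""
    d2 = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

# toy baseline (cheap) and target (expensive) labels for the same inputs
rng = np.random.default_rng(3)
X = rng.uniform(0, 6, size=(40, 1))
y_base = np.sin(X).ravel()                                   # baseline level
y_target = np.sin(X).ravel() + 0.2 * np.cos(3 * X).ravel()   # target level

# the machine learns only the difference Delta = y_target - y_base
sigma, lam = 1.0, 1e-10
c = np.linalg.solve(k_gauss(X, X, sigma) + lam * np.eye(len(X)),
                    y_target - y_base)

def predict_target(x_query, y_base_query):
    """Target-level estimate: baseline value plus ML-predicted correction."""
    return y_base_query + k_gauss(x_query, X, sigma) @ c
```

Because the correction is usually smoother and smaller in magnitude than the property itself, the *Δ* model typically needs far less training data than a direct model of the target property.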

## 4 Representation

The problem of how to represent a molecule or material dates back many decades, and the wealth of information (and opinions) about this subject is well manifested by the collection of descriptors compiled in Todeschini and Consonni’s Handbook of Molecular Descriptors (Todeschini and Consonni 2008). According to these authors, a molecular descriptor is defined as “the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” While the majority of these descriptors are graph-based and used for quantitative structure-activity relationship (QSAR) applications (typically producing rather rough correlations between properties and descriptors), our focus is on QML models, i.e., physics-based, systematic, and universal predictions of well-defined quantum-mechanical observables, such as the energy (von Lilienfeld 2018). Thus, to better distinguish the methods reviewed herein from QSAR, we prefer the term “representation” over “molecular descriptor.” Quantum mechanics offers a very specific recipe in this regard: a chemical system is defined by its Hamiltonian, which is obtained exclusively from elemental composition, geometry, and electron number. As such, it is straightforward to define the necessary ingredients of a representation: it should be some vector (or fingerprint) which encodes the compositional and structural information of a given neutral compound.

### 4.1 The Essentials of a Good Representation

There are countless ways to encode a compound into a vector, but what representation can be regarded as “good”? Practically, a good representation should lead to a decent learning curve, i.e., error steadily decreases as a function of training set size. Conceptually, it should fulfill several criteria, including primarily uniqueness (non-ambiguity), compactness, and being size-extensive (von Lilienfeld et al. 2015).

Uniqueness (or non-ambiguity) is indispensable for ML models. We consider a representation to be unique if no pair of distinct molecules produces the same representation. Lack of uniqueness has serious consequences, such as ceasing to learn at an early stage or no learning at all from the very beginning. The underlying origin is not hard to comprehend. Consider two representation vectors **x**_{1} and **x**_{2} for two compounds associated with their respective properties *y*_{1} and *y*_{2}. Now suppose **x**_{1} = **x**_{2} while *y*_{1} ≠ *y*_{2} (no degeneracy is assumed). In the extreme case where only these two points are used to train the ML model, we obviously encounter a singular kernel matrix with all elements being 1; huge prediction errors result, and there is essentially no learning. Even if molecules like these are not chosen for training, it should be clear that such a representation introduces a severe and systematic bias. Furthermore, when trying to predict *y*_{1} and *y*_{2} after training, the two estimates will be the same since the inputs to the machine are the same; the resulting test error is therefore directly proportional to their property difference.

Compactness requires invariance with respect to atom-index permutation, rotation, and translation, i.e., all redundant degrees of freedom of the system should be removed as much as possible while retaining uniqueness. This leads to a more robust representation, meaning that (1) the training set size needed may be significantly reduced and (2) the dimension of the representation vector (and thus its size) is minimized, a virtue which becomes important when the necessary training set size grows large.

Being size-extensive is crucial for the prediction of extensive properties, among which the most important is the energy. This leads to the so-called atomic (or local) representation of an atom in a compound. The local unit can also consist of bonds, functional groups, or even larger fragments of the compound. As pointed out in Sect. 2.2, this type of representation is the crucial stepping stone for building scalable machine learning models. Even intensive properties such as the HOMO-LUMO gap, which do not scale with system size, can be modeled within the framework of atomic representations, as illustrated using the RE-Match metric (De et al. 2016). For specific problems, such as force predictions, an analytic form of the representation is desirable for analysis and rapid evaluation and for subsequent differentiation (with respect to nuclear charges and coordinates) so as to account for response properties.
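The idea of assembling an extensive property from local contributions can be sketched as follows (a deliberately crude illustration; the environment labels and energy values are invented placeholders, not fitted data):

```python
# Hypothetical per-environment atomic energies (arbitrary units): invented
# placeholders standing in for the output of a trained atomic QML model.
ATOMIC_ENERGY = {
    ("C", "sp3"): -38.1,
    ("O", "sp3"): -75.3,
    ("H", "terminal"): -0.6,
}

def total_energy(atoms):
    """An extensive property assembled as a sum of local atomic contributions."""
    return sum(ATOMIC_ENERGY[a] for a in atoms)

# Methane-like and methanol-like atom lists (environments grossly simplified).
methane = [("C", "sp3")] + [("H", "terminal")] * 4
methanol = [("C", "sp3"), ("O", "sp3")] + [("H", "terminal")] * 4
print(total_energy(methane))   # sums 1 C and 4 H contributions
print(total_energy(methanol))  # adding the O environment shifts the total
```

Because the prediction is a plain sum over atoms, doubling the system doubles the predicted energy, which is exactly the scalability property exploited by atomic representations.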

### 4.2 Rational Design

In classical force fields (FFs), 2-body interactions typically scale as 1∕*R*^{n} (*R* being the internuclear distance and *n* being some integer), while 3- and 4-body parts behave as periodic functions of angle and dihedral angle (modern force field approaches also include 2- to (*n* − 1)-body interactions in *n*-body interactions). FFs are essentially a special case of the more general many-body expansion (MBE) in interatomic contributions, i.e., an extensive property of the system (e.g., the total energy) is expanded in a series of many-body terms, namely, 1-, 2-, and 3-body terms, ⋯, i.e.,

$$E = \sum_I E^{(1)}_I + \sum_{I<J} E^{(2)}(R_{IJ}) + \sum_{I<J<K} E^{(3)}(R_{IJ}, R_{IK}, \theta_{IJK}) + \cdots,$$

where *E*^{(n)} is the *n*-body interaction energy, *R*_{IJ} is the interatomic distance between atoms *I* and *J*, and *θ*_{IJK} is the angle spanned by the two vectors \(\vec {R}_{IJ}\) and \(\vec {R}_{IK}\). Other important properties can also be expressed in a similar fashion.
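As a toy illustration of such a truncated many-body expansion (the functional forms and parameters below are invented for illustration and not taken from any actual force field):

```python
import math

def e2(r, n=6, a=1.0):
    """Invented 2-body term decaying as 1/R^n."""
    return -a / r ** n

def e3(theta, k=0.1):
    """Invented periodic 3-body term of the angle."""
    return k * (1.0 + math.cos(theta))

def mbe_energy(atoms, pair_distances, angles):
    """Truncated MBE: E = sum_I E1 + sum_{I<J} E2(R_IJ) + sum_{I<J<K} E3."""
    e1 = {"H": -0.5, "O": -75.0}  # invented free-atom energies
    return (sum(e1[z] for z in atoms)
            + sum(e2(r) for r in pair_distances)
            + sum(e3(t) for t in angles))

# Water-like toy geometry (Angstrom): two O-H bonds and one H..H distance.
atoms = ["O", "H", "H"]
pair_distances = [0.96, 0.96, 1.51]
angles = [math.radians(104.5)]  # the single H-O-H angle
E = mbe_energy(atoms, pair_distances, angles)
print(E)  # dominated by the 1-body terms, with small 2- and 3-body corrections
```

The point of the sketch is structural: each order of the expansion adds a sum over larger tuples of atoms, and truncating the series defines the model class.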

By utilizing the basic variables in MBE, including distance, angles, and dihedral angles in their correct physics-based functional form (e.g., the aforementioned 1∕*R*^{n} dependence of 2-body interaction strength), one can already build some highly efficient representations such as BAML and SLATM (*vide infra*). This recipe relies heavily on preconceived knowledge about the physical nature of the problem.

### 4.3 Numerical Optimization

An alternative to rational design is to select the most relevant features numerically, e.g., by least absolute shrinkage and selection operator (LASSO) regression. Consider a linear model **y** = **Xc**, where **X** is a matrix whose *N* rows are the descriptor vectors **x**_{i} of length *D* for each training data point, **c** is the *D*-dimensional vector of coefficients, and **y** is the vector of training properties with the *i*-th property being *y*_{i}. Our task is to find the tuple of features that yields the smallest sum of squared errors, \(||\mathbf {y-Xc}||{ }_2^2\). Within LASSO, this is cast as a convex optimization problem, i.e.,

$$\mathbf{c} = \underset{\mathbf{c}}{\operatorname{argmin}} \; ||\mathbf{y}-\mathbf{Xc}||_2^2 + \lambda ||\mathbf{c}||_1,$$

where the use of the *L*_{1} norm in the regularization term is pivotal: a smaller *L*_{1} norm is obtained when a larger *λ* is used, thereby purging features of lesser importance. This approach has been exemplified for the prediction of relative crystal phase stabilities (rock-salt vs. zinc-blende) in a series of binary solids (Ghiringhelli et al. 2015). Unfortunately, this approach works best for rather low-dimensional problems. Already for typical organic molecules, the problem rapidly becomes intractable due to the coupling of different degrees of freedom. Under such circumstances, it appears more effective to adhere to the aforementioned heuristics of rational design, as manifested by the fact that almost all ad hoc representations in the literature are based on manual encoding.
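For the special case of orthonormal features, the LASSO solution is available in closed form via soft-thresholding, which makes the feature-purging effect of *λ* easy to see (a minimal sketch with invented data, using the ½‖**y**−**Xc**‖² + λ‖**c**‖₁ convention common in ML libraries; real descriptor matrices require an iterative solver):

```python
def soft_threshold(b, lam):
    """Closed-form LASSO coefficient when the columns of X are orthonormal:
    c_j = sign(b_j) * max(|b_j| - lam, 0), with b = X^T y."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

# Orthonormal toy design: X = identity, so b = X^T y = y.
y = [1.0, 0.1]  # feature 1 is important, feature 2 barely matters
for lam in (0.0, 0.3):
    c = [soft_threshold(b, lam) for b in y]
    print(lam, c)
# lam = 0.0 -> ordinary least squares, both features kept
# lam = 0.3 -> weak feature shrunk exactly to 0.0 (purged)
```

The hard zero (rather than a merely small coefficient) is the hallmark of the *L*_{1} penalty and is what enables feature selection.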

### 4.4 An Overview of Selected Representations

Over the years, numerous molecular representations have been developed by several research groups working on QML. It is not our focus to enumerate all of them; rather, we list and categorize the popular ones. Two categories are proposed: one is based on many-body expansions in vectorial or tensorial form, such as the Coulomb matrix (CM), Bag of Bonds (BoB), Bonds and Angles-based Machine Learning (BAML), Spectrum of London and Axilrod-Teller-Muto potential (SLATM), and the alchemical and structural radial distribution-based representation introduced by Faber, Christensen, Huang, and von Lilienfeld (FCHL). The other category is an electron density model-based representation, the smooth overlap of atomic positions (SOAP).

#### 4.4.1 Many-Body Potential-Based Representation

The Coulomb matrix (CM) representation was first proposed in the seminal paper by Rupp et al. (2012). It is a square atom-by-atom matrix with off-diagonal elements corresponding to the nuclear Coulomb repulsion between atoms, i.e., CM_{IJ} = *Z*_{I}*Z*_{J}∕*R*_{IJ} for atom indices *I* ≠ *J*. Diagonal elements approximate the electronic potential energy of the free atom, which is encoded as \(-0.5Z_I^{2.4}\). To enforce invariance of atom indexing, one can sort the atom numbering such that the sum of *L*_{2} and *L*_{1} norms of each row of the Coulomb matrix descends monotonically in magnitude. Symmetrical atoms will result in the same magnitude. A slight improvement over the original CM can be achieved by varying the power law of *R*_{IJ} (Huang and von Lilienfeld 2016). Best performance is found for an exponent of 6, reminiscent of the leading order term in the dissociative tail of London dispersion interactions. Thus, the resulting representation is also known as the London matrix (LM). The superiority of LM is attributed to a more realistic trade-off between the description of more localized covalent bonding and long-range intramolecular non-covalent interactions (Huang and von Lilienfeld 2016).
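As a concrete illustration, a sorted CM can be assembled in a few lines (a minimal sketch: the water geometry below is approximate, the diagonal follows the sign convention stated in the text, and sorting here uses the plain row norm only):

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix: Z_I*Z_J/R_IJ off-diagonal, -0.5*Z^2.4 on the diagonal
    (sign convention as in the text)."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = -0.5 * charges[i] ** 2.4  # free-atom term
            else:
                M[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return M

def sort_rows(M):
    """Impose atom-index invariance: permute rows and columns by row norm."""
    order = sorted(range(len(M)), key=lambda i: -math.hypot(*M[i]))
    return [[M[i][j] for j in order] for i in order]

# Approximate water geometry (Angstrom): O at origin, two H atoms.
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
M = sort_rows(coulomb_matrix(charges, coords))
print(M[0][0])  # largest-norm row first: the oxygen diagonal term
```

Note how the same permutation is applied to rows and columns, so the matrix stays symmetric while becoming independent of the original atom numbering.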

The Bag of Bonds (BoB) representation (Hansen et al. 2015) retains only the pairwise terms of the CM, grouped into bags according to the elements of the atom pair involved and sorted within each bag. While this improves performance, it lacks uniqueness: one can construct pairs of distinct molecules which map onto exactly the same BoB vector. If, for such a pair, we plot a property as a function *f* of all coordinates, we end up with the same curve due to a spurious degeneracy imposed by the lack of uniqueness; the BoB representation would not distinguish between these two molecules. Only after addition of higher-order many-body potential terms (e.g., the 3-body Axilrod-Teller-Muto potential) is the spurious degeneracy lifted.

Based on this simple example, an important lesson learned is that collective effects which go beyond pairwise potentials are of vital importance for the accurate modeling of fundamental properties such as energies. While adhering to the ideas of bagging for efficiency, a representation consisting of extended bags can be constructed, where each bag may contain interatomic interaction potentials up to three- and four-body terms. BAML was formulated in this way, where (1) all pairwise nuclear repulsions are replaced by Morse/Lennard-Jones potentials for bonded/nonbonded atoms, respectively, and (2) three- and four-body interactions of covalently bonded atoms are included using periodic angular and torsional terms, with their functional forms and parameters extracted from the universal force field (UFF) (Huang and von Lilienfeld 2016; Rappe et al. 1992). BAML achieves a noticeable boost in performance when compared to BoB or CM. Interestingly, the performance improves systematically upon inclusion of higher- and higher-order many-body terms, as the proposed energy model becomes more and more realistic, i.e., increasingly similar to the target. Meanwhile, and not surprisingly, the uniqueness issue present in two-body representations such as BoB is also resolved (see Fig. 3). The main drawback of BAML, however, is that it requires pre-existing force fields, implying a severe bias when it comes to new elements or bonding scenarios. It would therefore be desirable to identify a representation which is more compact and ab initio in nature.
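The bagging idea underlying BoB-type representations can be sketched as follows (an illustration only: pairwise repulsion terms are grouped into bags by element pair and sorted; the zero-padding needed to compare molecules of different composition is omitted):

```python
import math
from collections import defaultdict

def bag_of_bonds(symbols, charges, coords):
    """Group Z_I*Z_J/R_IJ terms into bags keyed by element pair,
    sorted in descending order within each bag (a BoB-style sketch)."""
    bags = defaultdict(list)
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            key = tuple(sorted((symbols[i], symbols[j])))
            r = math.dist(coords[i], coords[j])
            bags[key].append(charges[i] * charges[j] / r)
    return {k: sorted(v, reverse=True) for k, v in bags.items()}

symbols = ["O", "H", "H"]
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
bob = bag_of_bonds(symbols, charges, coords)
print(sorted(bob))  # [('H', 'H'), ('H', 'O')]
```

Sorting within each bag is what restores permutational invariance; the BAML idea replaces the 1∕*R* entries by Morse/Lennard-Jones values and appends further bags for angular and torsional terms.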

The atomic variant of SLATM (aSLATM) characterizes an atom *I* in a molecule by accounting for all possible interactions between atom *I* and its neighboring atoms through many-body potential terms multiplied by a normalized Gaussian distribution centered on the relevant variable (distance or angle). So far, one-, two-, and three-body terms have been considered. The one-body term is simply the nuclear charge, while the two-body part is expressed as

$$\rho^{(2)}_I(r) = \frac{1}{2} \sum_{J \neq I} Z_I Z_J \, g(r) \, \delta(r - R_{IJ}), \tag{35}$$

where *δ*(⋅) is set to a normalized Gaussian function, \(\delta (x) = \frac {1}{ \sigma \sqrt {2\pi } }e^{ - x^2 }\), and *g*(*r*) is a distance-dependent scaling function, capturing the locality of the chemical bond and chosen to correspond to the leading order term in the dissociative tail of the London potential, \(g(R) = \frac {1}{R^6}\). The three-body distribution reads

$$\rho^{(3)}_I(\theta) = \frac{1}{3} \sum_{J \neq I} \sum_{K \neq I,J} Z_I Z_J Z_K \, h(\theta, \mathbf{R}_{IJ}, \mathbf{R}_{IK}) \, \delta(\theta - \theta_{IJK}), \tag{36}$$

where *θ* is the angle spanned by the vectors **R**_{IJ} and **R**_{IK} (i.e., *θ*_{IJK}) and treated as a variable, and *h*(*θ*, **R**_{IJ}, **R**_{IK}) is the three-body contribution depending on both internuclear distances and angle, chosen in form to model the Axilrod and Teller (1943) and Muto (1943) vdW potential,

$$h(\theta, \mathbf{R}_{IJ}, \mathbf{R}_{IK}) = \frac{1 + \cos\theta \cos\theta_{JKI} \cos\theta_{KIJ}}{(R_{IJ} R_{IK} R_{JK})^3}.$$

The aSLATM representation of atom *I* is then obtained through concatenation of all the different many-body potential spectra involving atom *I*, as displayed in Eqs. (35) and (36). The global version, SLATM, simply corresponds to the sum of the atomic spectra.
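A Gaussian-broadened two-body spectrum of this kind can be sketched on a discretized distance grid (a toy illustration: *σ*, prefactors, and the grid are invented choices, not the published SLATM settings):

```python
import math

def two_body_spectrum(charges, coords, grid, sigma=0.02):
    """Sum over atoms I of a Gaussian-broadened pair-distance spectrum,
    each peak scaled by Z_I * Z_J * g(r) with g(r) = 1/r^6 (a sketch)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    spectrum = []
    for r in grid:
        val = 0.0
        for i in range(len(charges)):
            for j in range(len(charges)):
                if i == j:
                    continue
                rij = math.dist(coords[i], coords[j])
                gauss = norm * math.exp(-(r - rij) ** 2 / (2.0 * sigma ** 2))
                val += charges[i] * charges[j] * gauss / r ** 6
        spectrum.append(0.5 * val)
    return spectrum

charges = [8, 1, 1]  # water: O, H, H
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
grid = [0.5 + 0.01 * k for k in range(150)]  # 0.5 ... 1.99 Angstrom
spec = two_body_spectrum(charges, coords, grid)
peak_r = grid[spec.index(max(spec))]
print(peak_r)  # dominant peak close to the O-H bond length of ~0.96
```

Because the spectrum is a fixed-length vector over the grid, no within-bag sorting is needed: similar distances simply fall into similar grid positions.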

SLATM and aSLATM outperform all other representations discussed so far, as evidenced by the learning curves shown in Fig. 3. This outstanding performance is due to several aspects: (1) almost all the essential physics in the systems is covered, including the locality of chemical bonds as well as many-body dispersion; (2) the inclusion of 3-body terms significantly improves the learning; and (3) the spectral distribution of radial and angular features circumvents the problem of sorting within each feature bag, allowing for a more precise matching of atomic environments.

Most recently, the FCHL representation has been introduced (Faber et al. 2017a). It amounts to a radial distribution in elemental and structural degrees of freedom. The configurational degrees of freedom are expanded up to three-body interactions. Four-body interactions were tested but did not result in any additional improvements. For known datasets, FCHL-based QML models reach unprecedented predictive power and even outperform aSLATM and SOAP (see below). In the case of the QM9 dataset, for example, FCHL-based models of atomization energies reach chemical accuracy after training on merely ∼1’000 molecules.

#### 4.4.2 Density Expansion-Based Representation

Within SOAP, an atom *I* in a molecule is represented by the local density of atoms around *I*. Specifically, it is represented by a sum of Gaussian functions with variance *σ*^{2} within the environment (including the central atom *I* and its neighboring atoms *Q*), with the Gaussians centered on the *Q*'s and on *I*:

$$\rho_I(\mathbf{r}) = \sum_{Q} \exp\!\left( -\frac{|\mathbf{r} - \mathbf{R}_{Q}|^2}{2\sigma^2} \right), \tag{38}$$

where **r** is the vector from the central atom *I* to any point in space, while **R**_{Q} is the vector from atom *I* to its neighbor *Q*. The overlap of *ρ*_{I} and *ρ*_{J} can then be used to calculate a similarity between atoms *I* and *J*. However, this similarity is not rotationally invariant. To overcome this, one can integrate out the rotational degrees of freedom over all three-dimensional rotations \(\hat {R}\), and thus the SOAP kernel is defined:

$$k(\rho_I, \rho_J) = \int \mathrm{d}\hat{R} \left| \int \rho_I(\mathbf{r}) \, \rho_J(\hat{R}\mathbf{r}) \, \mathrm{d}\mathbf{r} \right|^2. \tag{39}$$

The integration in Eq. (39) can be carried out by first expanding *ρ*_{I}(**r**) in Eq. (38) in terms of a set of basis functions composed of orthogonal radial functions and spherical harmonics and then collecting the elements in the rotationally invariant power spectrum, based on which *k* can be easily calculated. The interested reader is referred to Bartók et al. (2013).
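For intuition, the rotational average of Eq. (39) can be mimicked in a simplified two-dimensional analogue, where the Gaussian overlap integral has a closed form and the rotational integral reduces to a single angle (a sketch only, not the actual SOAP implementation, which uses the power-spectrum expansion described above):

```python
import math

def rotate(p, t):
    """Rotate a 2D point p by angle t."""
    c, s = math.cos(t), math.sin(t)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def overlap(env_a, env_b, t, sigma=0.5):
    """Overlap of two sums of 2D Gaussians (env_b rotated by t): the
    pairwise Gaussian integral is proportional to exp(-d^2/(4 sigma^2))."""
    s = 0.0
    for a in env_a:
        for b in env_b:
            rb = rotate(b, t)
            d2 = (a[0] - rb[0]) ** 2 + (a[1] - rb[1]) ** 2
            s += math.exp(-d2 / (4.0 * sigma ** 2))
    return s

def soap_like_kernel(env_a, env_b, n_angles=360):
    """2D analogue of Eq. (39): average |overlap|^2 over all rotations."""
    return sum(
        overlap(env_a, env_b, 2.0 * math.pi * k / n_angles) ** 2
        for k in range(n_angles)
    ) / n_angles

env = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy atomic environment
env_rot = [rotate(p, 0.7) for p in env]     # same environment, rotated
k_same = soap_like_kernel(env, env)
k_rot = soap_like_kernel(env, env_rot)
print(abs(k_same - k_rot) < 1e-6)  # True: the kernel is rotationally invariant
```

Averaging |overlap|² over all rotations is what makes the comparison orientation-independent; the power-spectrum construction of Bartók et al. (2013) achieves the same invariance analytically.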

SOAP has been used extensively and successfully to model systems such as silicon bulk or water clusters, each separately with many configurations. These elemental or binary systems are relatively simple, as the diversity of chemistries encoded by the atomic environments is rather limited. A direct application of SOAP to molecules, where there are substantially more possible atomic environments, however, yields learning curves with rather large offsets. This is not such a surprise, as the capability of atomic densities to differentiate between different atom pairs, atom triples, and so on is limited. This shortcoming remains even if one treats different atom pairs as different variables, as adopted in De et al. (2016); averaging out all rotational degrees of freedom might also impede learning due to the loss of relevant information. To remedy some of these problems, a special kernel, the RE-Match kernel (De et al. 2016), was introduced. Most recently, combining SOAP with a multi-kernel expansion enabled additional improvements in predictive power (Bartók et al. 2017).

## 5 Training Set Selection

The last section of this chapter deals with the question of how to select training sets. The selection procedure can have a severe effect on performance: for any given representation, the predictive accuracy is very sensitive to how the training molecules are sampled. Training set selection can be divided into two parts: (1) how to create the training set. The general principle is that the training set should be representative, i.e., it should follow the same distribution as all possible test molecules in terms of input and output. This formally prevents extrapolation and thereby minimizes prediction errors. (2) How to optimize the training set composition.

The majority of algorithms in the literature deal with (2), assuming the existence of some large dataset (or a dataset trivial to generate) from which one can draw using algorithms such as ensemble learning, genetic evolution, or other "active learning"-based procedures (Podryabinkin and Shapeev 2017). All of these methods have in common that they select the training set from a given set of configurations based only on the unlabeled data. This is particularly useful for "learning on the fly"-based ab initio molecular dynamics simulations (Csányi et al. 2004), where an expensive quantum-mechanical calculation is carried out only when a configuration is sufficiently "new."

Part (1) stands out as a challenging task for which few competent algorithms exist. Ideally, a single algorithm would accomplish both parts in one step; the only such method we are aware of is the "amons" approach. We elaborate on these concepts below.

### 5.1 Genetic Optimization

To the best of our knowledge, the first application of a GA for the generation and study of optimal training set compositions for a QML model was published in Browning et al. (2017). The central idea of this approach is outlined as follows. For a given set (*S*_{0}) containing overall *N* molecules, the GA procedure consists of four consecutive steps to obtain the "near-optimal" subset of molecules from *S*_{0} for training the ML model (Browning et al. 2017): (a) Randomly choose *N*_{1} molecules as a trial training set *s*_{1}; repeat *M* times. This forms a population of training sets, termed the parent population and labeled \(\hat {s}^{(1)} = \{s_1, s_2, \dots , s_M\}\). (b) An ML model is trained on each *s*_{i} and then tested on a fixed set of out-of-sample molecules, resulting in a mean prediction error *e*_{i}, which is assigned to *s*_{i} as a measure of how fit *s*_{i} is as the "near-optimal" training set and dubbed "fitness." The smaller *e*_{i}, the larger the fitness. (c) \(\hat {s}^{(1)}\) is consecutively evolved through selection (determining which *s*_{i}'s in \(\hat {s}^{(1)}\) remain in the population to produce a temporarily refined smaller set \(\hat {t}^{(1)}\); a set *s*_{i} with larger fitness has a higher probability of being kept in \(\hat {t}^{(1)}\)), crossover (updating \(\hat {s}^{(1)}\) from \(\hat {t}^{(1)}\), the new \(\hat {s}\) being labeled \(\hat {s}^{(2)}\), with each set *s*_{i} in \(\hat {s}^{(2)}\) obtained by mixing the molecules of two *s*_{i}'s in \(\hat {t}^{(1)}\)), and mutation (randomly changing molecules in some *s*_{i}'s in \( \hat {s}^{(2)}\) to promote diversity, e.g., replacing a -CH_{2}- fragment by -NH- in some molecule). (d) Go to step (b) and repeat the process until the population no longer changes and the fitness ceases to improve. We label the final updated trial training set \(\hat {s}\).
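The GA loop of steps (a)-(d) can be sketched with a toy one-dimensional "dataset" and a 1-nearest-neighbor predictor standing in for the actual ML model (all data, parameters, and the fitness definition below are invented for illustration):

```python
import random

random.seed(0)

# Toy "molecules": scalar features with a smooth target property f(x) = (x-5)^2.
pool = [(x, (x - 5.0) ** 2) for x in range(20)]            # S0
test = [(x + 0.5, (x - 4.5) ** 2) for x in range(19)]      # fixed test set

def predict(train, x):
    """1-nearest-neighbor stand-in for the ML model."""
    return min(train, key=lambda m: abs(m[0] - x))[1]

def fitness(train):
    """Negative mean absolute test error: higher is fitter."""
    return -sum(abs(predict(train, x) - y) for x, y in test) / len(test)

def ga_select(pool, n_train=5, pop=20, gens=30):
    population = [random.sample(pool, n_train) for _ in range(pop)]  # step (a)
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)  # step (b): rank by fitness
        parents = population[: pop // 2]            # step (c): selection
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            child = random.sample(list({*a, *b}), n_train)   # crossover
            if random.random() < 0.2:                        # mutation
                child[random.randrange(n_train)] = random.choice(pool)
            children.append(child)
        population = parents + children             # step (d): iterate
    return max(population, key=fitness)

best = ga_select(pool)
print(round(-fitness(best), 2), "MAE of the GA-selected training set")
```

Keeping the fittest parents in every generation makes the best fitness monotonically non-decreasing, which mirrors the convergence criterion of step (d).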

It’s obvious that the molecules in \(\hat {s}\) should be able to represent all the typical chemistry in all molecules in *S*_{0}, such as linear, ring, cage-like structure, and typical hybridization states (*sp*, *sp*^{2}, *sp*^{3}) if they are abundant in *S*_{0}. Once trained on \(\hat {s}\), the ML model is guaranteed to yield typically significantly better results as the fitness is constantly increasing. This is not useful since the GA “tried” this already; the usefulness has to be assessed by the generalizability of \(\hat {s}\) as training set to test on a new set of molecules is not seen in *S*_{0}. Indeed, as shown in Browning et al. (2017), significant improvements in offsets can be obtained when compared to random sampling. While the remaining out-of-sample error is still substantial, this is not surprising due to the use of less advantageous representations. One of the key findings in this study were that upon genetic optimization, (i) the distance distributions between training molecules were shifted outward and (ii) the property distributions of training molecules were fattened.

### 5.2 Amons

We note that the naive application of active learning algorithms will still result in QML models which suffer from lack of transferability, in particular when it comes to the prediction of larger compounds or molecules containing chemistries not present in the training set. Due to the size of chemical compound space, this issue still imposes a severe limitation for the general applicability of QML. These problems can, at least partially, be overcome by exploring and exploiting the locality of an atom in molecule (Huang and von Lilienfeld 2017), resulting from the nearsightedness principle in electronic systems (Prodan and Kohn 2005; Fias et al. 2017).

We consider a valence-saturated query molecule for illustration, for which we try to build an "ideal" training set. As is well known, any atom *I* (let us assume an *sp*^{3} hybridized C) in the molecule is characterized by itself and its local chemical environment. To a first-order approximation, we may consider its coordination number (*CN* for short) to be a distinguishing measure of its atomic environment, and we can roughly say that any other carbon atom with a coordination number of 4 is similar to atom *I*, as their valence hybridization states are all *sp*^{3}. Another carbon atom with *CN* = 3 in a hybridization state of *sp*^{2} would be significantly different from atom *I*. It is clear, however, that *CN* alone is not enough as an identifier of the atomic environment type: An *sp*^{3} hybridized C atom in the methane molecule (hereafter termed a genuine C-*sp*^{3} environment) is almost purely covalently bonded to its neighbors, while in CH_{3}OH, noticeable contributions from ionic configurations appear in the valence bond wavefunction due to the significant electronegativity difference between the C and O atoms. Thus one would expect very different atomic properties for the *sp*^{3}-C atoms in these two environments, as manifested, for instance, in their atomic energy, charge, or ^{13}C-NMR shift. Alternatively, we can say that oxygen as a neighboring atom has perturbed the ideal *sp*^{3} hybridized C to a much larger extent in CH_{3}OH than the H atom has in methane. To account for these differences, we can simply include fragments which contain *I* as well as all its neighbors. Thus we obtain a set of fragments, for each of which the bond path between *I* and any other atom is 1.

Extending this kind of reasoning to the second neighbor shell, we can add new atoms with a bond path of 2 relative to atom *I* in order to account for further, albeit weaker, perturbations of atom *I*. As such, we can gradually increase the size of the included fragments (characterized by the number of heavy atoms) until we believe that all effects on atom *I* have been accommodated. The set of unique fragments can then be used as a training set for a fragment-based QML model. Note that we saturate all fragments with hydrogen atoms. These fragments can be regarded as effective quasi-atoms, defined as atoms in molecules, or "am-ons." Since amons repeat throughout chemical space, they can be seen as the "words" of chemistry (target molecules being "sentences") or as the "DNA" of chemistry (target molecules being genes and properties their function). Given the complete set of amons, any specific, substantially larger molecule can be queried. Used in conjunction with an atomic representation such as aSLATM or FCHL, amons enable a kind of chemical extrapolation which holds great promise to more faithfully and more efficiently explore the vast domain of chemical space (Huang and von Lilienfeld 2017).
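The neighbor-shell reasoning above can be sketched as a breadth-first enumeration of bond-path-limited fragments on a molecular graph (a simplified illustration: real amon generation additionally handles valences, hydrogen saturation, and geometry):

```python
from collections import deque

def fragment(graph, center, max_path):
    """All atoms within a bond path <= max_path of `center` (BFS)."""
    seen = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if seen[u] == max_path:
            continue
        for v in graph[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return frozenset(seen)

def amon_like_fragments(graph, max_path=2):
    """Unique neighbor-shell fragments around every atom."""
    frags = set()
    for center in graph:
        for d in range(1, max_path + 1):
            frags.add(fragment(graph, center, d))
    return frags

# Methanol connectivity: C bonded to three H and to O; O bonded to C and H.
graph = {
    "C": ["H1", "H2", "H3", "O"],
    "O": ["C", "H4"],
    "H1": ["C"], "H2": ["C"], "H3": ["C"], "H4": ["O"],
}
frags = amon_like_fragments(graph)
print(frozenset({"C", "H1", "H2", "H3", "O"}) in frags)  # True: the CH3-O shell
```

Because identical shells around different atoms collapse to one set entry, the fragment pool stays small even as the query molecule grows, which is the key to the transferability of amons.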

## 6 Conclusion

We have discussed primarily the basic mathematical formulations of all typical ingredients of quantum machine learning (QML) models which can be used in the context of quantum-mechanical training and testing data. We explained and reviewed why ML models can be fast and accurate when predicting quantum-mechanical observables for out-of-sample compounds. It is the authors’ opinion that QML can be seen as a very promising approach, enabling the exploration of systems and problems which hitherto were not amenable to traditional computational chemistry methods.

In spite of the significant progress made within the last few years, the field of QML is still very much in its infancy. This should be clear when considering that the properties explored so far are rather limited and relatively fundamental. The primary focus has been on ground state or local minimum properties. Applications to excited states still remain a challenge (Ramakrishnan et al. 2015b), as do conductivity, magnetic properties, and phase transitions. We believe that new and efficient representations will have to be developed which properly account for all the relevant degrees of freedom at hand.

## Notes

### Acknowledgements

We acknowledge support by the Swiss National Science foundation (No. PP00P2_138932, 407540_167186 NFP 75 Big Data, 200021_175747, NCCR MARVEL).

## References

- Axilrod BM, Teller E (1943) Interaction of the van der Waals type between three atoms. J Chem Phys 11(6):299–300. https://doi.org/10.1063/1.1723844, http://scitation.aip.org/content/aip/journal/jcp/11/6/10.1063/1.1723844
- Bader RF (1990) Atoms in molecules: a quantum theory. Clarendon Press, Oxford
- Bartók AP, Payne MC, Kondor R, Csányi G (2010) Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys Rev Lett 104:136403. https://doi.org/10.1103/PhysRevLett.104.136403
- Bartók AP, Kondor R, Csányi G (2013) On representing chemical environments. Phys Rev B 87:184115. https://doi.org/10.1103/PhysRevB.87.184115
- Bartók AP, De S, Poelking C, Bernstein N, Kermode JR, Csányi G, Ceriotti M (2017) Machine learning unifies the modeling of materials and molecules. Sci Adv 3(12). https://doi.org/10.1126/sciadv.1701816, http://advances.sciencemag.org/content/3/12/e1701816
- Browning NJ, Ramakrishnan R, von Lilienfeld OA, Roethlisberger U (2017) Genetic optimization of training sets for improved machine learning models of molecular properties. J Phys Chem Lett 8(7):1351. https://doi.org/10.1021/acs.jpclett.7b00038
- Csányi G, Albaret T, Payne MC, Vita AD (2004) "Learn on the fly": a hybrid classical and quantum-mechanical molecular dynamics simulation. Phys Rev Lett 93:175503
- De S, Bartok AP, Csanyi G, Ceriotti M (2016) Comparing molecules and solids across structural and alchemical space. Phys Chem Chem Phys 18:13754–13769. https://doi.org/10.1039/C6CP00415F
- Faber FA, Christensen AS, Huang B, von Lilienfeld OA (2017a) Alchemical and structural distribution based representation for improved QML. arXiv preprint arXiv:171208417
- Faber FA, Hutchison L, Huang B, Gilmer J, Schoenholz SS, Dahl GE, Vinyals O, Kearnes S, Riley PF, von Lilienfeld OA (2017b) Fast machine learning models of electronic and energetic properties consistently reach approximation errors better than dft accuracy. https://arxiv.org/abs/1702.05532
- Fasshauer G, McCourt M (2016) Kernel-based approximation methods using Matlab. World Scientific, New Jersey
- Fias S, Heidar-Zadeh F, Geerlings P, Ayers PW (2017) Chemical transferability of functional groups follows from the nearsightedness of electronic matter. Proc Natl Acad Sci 114(44):11633–11638
- Ghiringhelli LM, Vybiral J, Levchenko SV, Draxl C, Scheffler M (2015) Big data of materials science: critical role of the descriptor. Phys Rev Lett 114:105503. https://doi.org/10.1103/PhysRevLett.114.105503
- Greeley J, Jaramillo TF, Bonde J, Chorkendorff I, Nørskov JK (2006) Computational high-throughput screening of electrocatalytic materials for hydrogen evolution. Nat Mater 5(11):909–913
- Hansen K, Biegler F, von Lilienfeld OA, Müller KR, Tkatchenko A (2015) Interaction potentials in molecules and non-local information in chemical space. J Phys Chem Lett 6:2326. https://doi.org/10.1021/acs.jpclett.5b00831
- Huang B, von Lilienfeld OA (2016) Communication: understanding molecular representations in machine learning: the role of uniqueness and target similarity. J Chem Phys 145(16):161102. https://doi.org/10.1063/1.4964627
- Huang B, von Lilienfeld OA (2017) Chemical space exploration with molecular genes and machine learning. arXiv preprint arXiv:170704146
- Kennedy MC, O'Hagan A (2000) Predicting the output from a complex computer code when fast approximations are available. Biometrika 87(1):1–13
- Kirkpatrick P, Ellis C (2004) Chemical space. Nature 432(7019):823–823. https://doi.org/10.1038/432823a
- Kitaura K, Ikeo E, Asada T, Nakano T, Uebayasi M (1999) Fragment molecular orbital method: an approximate computational method for large molecules. Chem Phys Lett 313(3–4):701–706. https://doi.org/10.1016/S0009-2614(99)00874-X
- Muto Y (1943) Force between nonpolar molecules. J Phys-Math Soc Jpn 17:629–631
- Pilania G, Gubernatis JE, Lookman T (2017) Multi-fidelity machine learning models for accurate bandgap predictions of solids. Comput Mater Sci 129:156–163
- Podryabinkin EV, Shapeev AV (2017) Active learning of linearly parametrized interatomic potentials. Comput Mater Sci 140:171–180
- Prodan E, Kohn W (2005) Nearsightedness of electronic matter. Proc Natl Acad Sci USA 102(33):11635–11638. https://doi.org/10.1073/pnas.0505436102
- Ramakrishnan R, von Lilienfeld OA (2015) Many molecular properties from one kernel in chemical space. Chimia 69(4):182. https://doi.org/10.2533/chimia.2015.182, http://www.ingentaconnect.com/content/scs/chimia/2015/00000069/00000004/art00005
- Ramakrishnan R, Dral P, Rupp M, von Lilienfeld OA (2014) Quantum chemistry structures and properties of 134 kilo molecules. Sci Data 1:140022. https://doi.org/10.1038/sdata.2014.22
- Ramakrishnan R, Dral P, Rupp M, von Lilienfeld OA (2015a) Big data meets quantum chemistry approximations: the *Δ*-machine learning approach. J Chem Theory Comput 11:2087–2096. https://doi.org/10.1021/acs.jctc.5b00099
- Ramakrishnan R, Hartmann M, Tapavicza E, von Lilienfeld OA (2015b) Electronic spectra from TDDFT and machine learning in chemical space. J Chem Phys 143:084111. http://arxiv.org/abs/1504.01966
- Rappe AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM (1992) Uff, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 114(25):10024–10035. https://doi.org/10.1021/ja00051a040
- Rasmussen C, Williams C (2006) Gaussian processes for machine learning. Adaptative computation and machine learning series. University Press Group Limited. https://books.google.ch/books?id=vWtwQgAACAAJ
- Ruddigkeit L, van Deursen R, Blum LC, Reymond JL (2012) Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. J Chem Inf Model 52(11): 2864–2875. https://doi.org/10.1021/ci300415d
- Rupp M, Tkatchenko A, Müller KR, von Lilienfeld OA (2012) Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett 108(5):058301. https://doi.org/10.1103/PhysRevLett.108.058301
- Samuel AL (2000) Some studies in machine learning using the game of checkers. IBM J Res Dev 44(1.2):206–226
- Todeschini R, Consonni V (2008) Handbook of molecular descriptors, vol 11. Wiley, Weinheim
- von Lilienfeld OA (2018) Quantum machine learning in chemical compound space. Angew Chemie Int Ed. http://dx.doi.org/10.1002/anie.201709686
- von Lilienfeld OA, Ramakrishnan R, Rupp M, Knoll A (2015) Fourier series of atomic radial distribution functions: a molecular fingerprint for machine learning models of quantum chemical properties. Int J Quantum Chem 115(16):1084–1093. https://doi.org/10.1002/qua.24912