Abstract
Atomistic machine learning (AML) simulations are used in chemistry at an ever-increasing pace. A large number of AML models have been developed, but their implementations are scattered among different packages, each with its own conventions for input and output. Thus, here we give an overview of our MLatom 2 software package, which provides an integrative platform for a wide variety of AML simulations by implementing from scratch, and interfacing existing software for, a range of state-of-the-art models. These include kernel-method-based model types such as KREG (native implementation), sGDML, and GAP-SOAP as well as neural-network-based model types such as ANI, DeepPot-SE, and PhysNet. The theoretical foundations behind these methods are overviewed too. The modular structure of MLatom allows for easy extension to more AML model types. MLatom 2 also has many other capabilities useful for AML simulations, such as the support of custom descriptors, farthest-point and structure-based sampling, hyperparameter optimization, model evaluation, and automatic learning curve generation. It can also be used for multi-step tasks such as Δ-learning, self-correction approaches, and absorption spectrum simulation within the machine-learning nuclear-ensemble approach. Several of these MLatom 2 capabilities are showcased in application examples.
Introduction
Machine learning (ML) has taken computational chemistry by storm [1,2,3,4]. It is often applied to find a relationship between a given molecular geometry and its quantum chemical (QC) properties. A particularly useful application of such atomistic ML (AML) models is mapping molecular potential energy surfaces (PESs) [5,6,7,8]. Creating AML models is, however, a complicated task and requires domain knowledge. Thus, much effort has been put into developing a mathematical foundation and writing specialized software for such simulations.
One of us (POD) started to develop the MLatom program package [9, 10] for atomistic simulations with ML already in 2013, when not many such packages were available. At first, it was written entirely in Fortran and parallelized with OpenMP as a self-contained black-box program for user-friendly calculations. Now, this part comprises the Fortran core of MLatom called MLatomF. Later, MLatomF was complemented by a Python wrapper called MLatomPy, implementing multi-step tasks such as Δ-learning [11] and self-correction [12]. We have implemented these and other methods developed by ourselves, such as structure-based sampling [12], the KREG model [12], and the ML-nuclear-ensemble approach (ML-NEA) for precise absorption spectrum simulations [13], as well as selected literature methods, such as those based on the Coulomb matrix descriptor [14, 15], in MLatom for development purposes, tighter integration, and higher efficiency (see "Native Implementations"). We have also used these native implementations for developing methods for improving QC Hamiltonians [16], for accelerating ML nonadiabatic excited-state dynamics [17], and for PES construction with spectroscopic accuracy by introducing a hierarchical ML (hML) approach [18].
In recent years, we have witnessed the rapid rise of many exciting new ML models [4, 5]. They are often designed for different applications, ranging from very accurate ML PESs trained on as few as a hundred molecular configurations of small and medium-sized molecules [19, 20] to ML models trained on thousands or millions of points to be transferable to large molecules [21, 22]. Each has its own advantages and disadvantages. It is, therefore, highly desirable to be able to test different models before applying them to the problem at hand. This is, however, a formidable task because these models are scattered across many different software packages, each with its own conventions for input and output.
We face the same problem in our research: when we want to test a promising ML model, there is often a high entry barrier to learning how to use the corresponding package. Sometimes the documentation is very poor, and only interaction with experienced users or developers enabled us to use some packages. Often, some critical functionality, such as hyperparameter optimization, is missing.
Thus, as a pragmatic solution, we have provided the community with an integrated platform based on MLatom that interfaces a selection of popular third-party software packages via MLatomPy, written in Python 3.6+ [23, 24]. This platform is released as MLatom 2, with all Python interfaces available as open-source, free software for non-commercial use. Importantly, the same input and output structure can now be used for many state-of-the-art, third-party ML models (see "Interfaces"). We have implemented interfaces to the sGDML [19, 25] (symmetrized gradient-domain ML), GAP and QUIP (providing the GAP [26]-SOAP [27] method), TorchANI [28] (providing the ANI [21] methods), DeePMD-kit [29] (providing the DPMD [30] and DeepPot-SE [31] methods), and PhysNet [22] programs. This selection covers popular representatives of various types of methods, ranging from those based on kernel methods (KMs) to neural networks (NNs). Our implementation also supports hyperparameter optimization using Bayesian methods with the Tree-structured Parzen Estimator (TPE) [32] via the hyperopt package [33].
The modular structure of MLatom allows easy extension to other models in the future, as it requires only writing a separate, independent module that converts MLatom input into the input of the third-party software and parses the latter's output. A similar approach is also used for interfacing various QC software packages [34]. This differs, however, from an alternative approach where a package offers only part of an ML model, e.g., only a descriptor of a molecule to be used as an input for ML, as in DScribe [35].
In the following, we provide an overview of MLatom 2 capabilities and details of the native implementations and interfaces. We also demonstrate the application of MLatom 2 to several typical AML simulation tasks (hyperparameter optimization and generation of learning curves), Δ-learning and structure-based sampling, and the calculation of absorption spectra.
Overview
The philosophy behind MLatom is to provide the community with a black-box-like program that allows a variety of calculation tasks required for ML atomistic simulations to be performed (Fig. 1). The program requires only user-friendly and intuitive input; no scripting is needed. Under the hood, MLatom is built of modules designed to be as independent of each other as possible, to the extent that many modules can be used as standalone programs entirely independent of the main program. As needed, the modules are combined to create a seamless workflow that eliminates step-by-step manual calculations.
The calculation tasks in MLatom fall into ML tasks and data set tasks. ML tasks are calculations involving training and using ML models. Our implementation includes a wide range of such tasks, from basic to multi-step procedures: using an existing ML model from a file, creating an ML model and saving it to a file for later use, estimating ML model accuracy (model evaluation to determine the generalization error), Δ-learning [11], self-correction [12], and learning-curve and nuclear-ensemble spectrum generation [13]. Data set tasks perform all the operations necessary for ML simulations, such as converting geometries given in XYZ format to the input vectors x for ML, splitting the data set into the required subsets (e.g., training, validation, and test), sampling points into these subsets, and performing statistical analysis of ML-estimated values (e.g., errors with respect to reference values). These tasks can be performed either independently from each other (e.g., creating an ML model from available input vectors x) or combined (e.g., by first converting XYZ coordinates to x and then creating an ML model). The user only needs to modify several input lines to switch between these options. In the following, we describe each of these tasks and define the most important concepts in ML atomistic simulations.
ML Tasks
Currently, MLatom supports only supervised learning, which boils down to finding and using an approximating function \(\widehat{f}\left(\mathbf{x};\mathbf{h};\mathbf{p}\right)\) that establishes a relationship between the reference values y and the input vectors x in the training set based on statistically motivated assumptions [36] rather than on purely physical modelling. The approximating function typically has a large number of parameters p and so-called hyperparameters h.
Using ML Models
Using an existing ML model is conceptually simple, as it requires only information about the mathematical form of the approximating function and its (hyper)parameters, including knowing how to transform a molecular geometry into an input vector x. One should pay attention, however, to many technical issues, such as ensuring consistent conventions for storing and retrieving this information from the file for long-term reusability and reproducibility of scientific simulations. Another technical issue is related to performance and accuracy, as the information to be stored can be quite sizable, which can quickly lead to storage and input/output bottlenecks. MLatom saves ML model parameters and other information in a binary file with a fixed structure for the core ML models and uses the native formats of interfaced third-party software without converting them.
In atomistic simulations, we are also often interested in derivatives of properties. For example, in molecular dynamics simulations, we need to know the derivatives of the energy with respect to the atomic coordinates (the energy gradients, i.e., the negative forces). Thus, MLatom can estimate both property values and their partial derivatives with respect to the elements of the input vector or the atomic XYZ coordinates.
Creating ML Models
Creating an ML model is a much more complicated task, as one needs to find the model (hyper)parameters in the right way (Fig. 2). This means searching for values in the parameter space that lead to as low a generalization error as possible [36]. This is not the same as fitting parameters (training) to give as low an error on the training set as possible: modern ML methods can easily reproduce the reference values in the training set exactly (an extreme case of overfitting) [2]. Thus, the standard practice is to set aside a validation set to ensure that training on the remaining data points does not lead to a large error in the validation set, i.e., to avoid overfitting [36]. The remaining data points are called either the training or the subtraining set in the literature, which adds to the confusion. While both conventions are valid, we prefer to call them subtraining points both here and in MLatom. All data points used in creating the ML model we call the training set; in our nomenclature, this set includes both the subtraining and the validation sets. This convention allows for a fairer and more straightforward comparison of ML models trained on the same number of training points, because the validation set, which is used indirectly in training, is accounted for. For example, if a model is trained on 1000 points but another 1000 points were used for validation, then reference data are needed for all 2000 points, and such a model cannot be compared to another model trained on only 1000 points without such a validation set.
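This nomenclature can be illustrated with a minimal sketch (not MLatom's actual code; the helper name `split_dataset` is ours): the data set is split into a held-out test set and a training set, and the training set is further divided into subtraining and validation parts.

```python
import numpy as np

def split_dataset(n_points, f_test=0.2, f_validate=0.2, rng=None):
    """Split indices into subtraining, validation, and test sets.

    Following the nomenclature in the text: the 'training set' is the
    union of the subtraining and validation sets; the test set is held
    out entirely for estimating the generalization error."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(n_points)
    n_test = int(round(f_test * n_points))
    test, train = idx[:n_test], idx[n_test:]
    n_val = int(round(f_validate * len(train)))
    validate, subtrain = train[:n_val], train[n_val:]
    return subtrain, validate, test

sub, val, test = split_dataset(1000)
# training set = subtraining + validation = 800 points; test = 200 points
```

In this example, comparing the resulting model to another one "trained on 1000 points" would be unfair, since only 800 reference points were consumed in total training here.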
When additional information such as derivatives of properties is available, it can be included in the training set too. It is common to train ML models for PESs simultaneously on energies and gradients (or only on gradients), which is known to improve the quality of ML PESs significantly compared to fitting only to energies [7, 37, 38]. NNs simply fit parameters to the reference properties and their derivatives [38], while KMs can include the derivative information in their model explicitly [7, 37].
Many knobs exist and can be tuned in the process of finding suitable parameters for an ML model. One such knob concerns the ML model itself, and another deals with the splitting into subtraining and validation sets. As for the first type, while we do not need to touch the model parameters, as this is the machine's task, we can influence the model by changing its hyperparameters manually [36, 39]. As a side note, the difference between parameters and hyperparameters is somewhat fuzzy, as the latter can be found by a machine too. Some hyperparameters also enter the ML model, while others do not. The external hyperparameters that do not enter the ML model are clearly different from parameters but influence the training process, e.g., the regularization hyperparameter in KRR [36].
In any case, (hyper)parameters can be fitted to reduce the generalization error of the ML model by minimizing the error in the validation set (Fig. 2). For so-called parametric models such as NNs, whose approximating function does not explicitly depend on the training points, finding (hyper)parameters that reduce the validation error is usually the end of the story. However, non-parametric models such as kernel ridge regression (KRR) and Gaussian process regression (GPR) depend explicitly on the training points. In their case, not using the validation set for training the final ML model would discard valuable additional information available to the model and reduce its accuracy. Thus, after the hyperparameters minimizing the error in the validation set for models trained on the subtraining set are found (a procedure also known as model selection), MLatom uses them to train the final model on the whole training set.
During hyperparameter optimization in MLatom, by default, the root-mean-squared error (RMSE) is minimized, but minimization of another type of error can be requested for the native implementations. Alternatively, other defaults can be used by the interfaces if they have their own hyperparameter optimization capabilities. When the ML model is trained on several different properties, the error (loss, L) should reflect the error for each of these properties. For example, for models trained on both property values and their derivatives, e.g., energies and energy gradients, the error in MLatom can be calculated as the sum of the error for values (\({L}_{val}\)) and the weighted error for gradients in XYZ coordinates (\({L}_{grxyz}\)), as typically done in the literature [37]:

$$L={L}_{val}+{w}_{grxyz}{L}_{grxyz}$$
Although this approach gives the user additional flexibility, it has a drawback in that one has to choose the arbitrary parameter \({w}_{grxyz}\). To eliminate this parameter, we introduce in MLatom the geometric mean of the errors of the different properties, which is used by default:

$$L=\sqrt{{L}_{val}\cdot {L}_{grxyz}}$$
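The two loss conventions described above can be sketched in a few lines (a toy illustration with hypothetical function names, not MLatom's implementation):

```python
import numpy as np

def weighted_loss(l_val, l_grxyz, w_grxyz=1.0):
    # Weighted sum of the value and gradient errors; the weight
    # w_grxyz is an arbitrary parameter chosen by the user.
    return l_val + w_grxyz * l_grxyz

def geometric_mean_loss(l_val, l_grxyz):
    # Parameter-free alternative: geometric mean of the two errors.
    return np.sqrt(l_val * l_grxyz)
```

The geometric mean grows when either error grows, but requires no manual weight.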
The final model's accuracy also depends on how the training set is split into subtraining and validation sets; this topic is overviewed below in "Splitting and Sampling".
Estimating Accuracy of ML Models
MLatom also provides the means to estimate the accuracy (generalization error) of an ML model (Fig. 3). Since the validation set has already been used to create the ML model, it is good practice to use another, independent test set to assess the model's performance [36]. The entire data set can be split into training and test sets, and points can be sampled into these subsets using procedures similar to those used for hyperparameter tuning (model selection; see "Splitting and Sampling"). Naturally, hyperparameter tuning and model evaluation can be combined in a single calculation task with MLatom.
Multistep Tasks
The tasks above can be considered basic. We now turn to multi-step tasks built upon them. One such task is Δ-learning, which combines predictions of a low-level QC method with ML corrections to approach high-level QC accuracy [11]. Another is self-correction, which combines several layers of ML models, each layer correcting the residual errors of the previous one; this is useful for KRR models created with a focus on some region of the molecular PES [12]. Other multi-step tasks are learning curve generation and ML spectrum simulation [13], covered in more detail in the next two subsections.
Learning Curves
Here, the concept of learning curves is used to investigate how the ML generalization error depends on the training set size. The relationship between the ML error ε and the training set size \({N}_{tr}\) typically follows a power-law decay [40]:

$$\varepsilon \approx {\varepsilon }_{a}+a{N}_{tr}^{-b}$$
where ε_{a} is the asymptotic error in the limit of an infinitely large training set, a is a non-negative amplitude, and b is the positive exponent telling us how fast the ML model improves when trained with more data. These three parameters define the learning curve, giving a more complete characterization of the performance of a given ML model type than a single generalization error estimated for one training set size.
Since the errors decay quickly, learning curves are often plotted on a log–log scale. In this case, particularly for not too large training sets and small asymptotic errors, a linear curve is often observed [37]:

$$\mathrm{log}\,\varepsilon \approx \mathrm{log}\,a-b\,\mathrm{log}\,{N}_{tr}$$
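The linear behaviour on the log–log scale means that the amplitude a and exponent b can be recovered by a straight-line fit. A short illustration with synthetic data (hypothetical, not MLatom's learning-curve code):

```python
import numpy as np

# Synthetic learning-curve data obeying eps = a * N_tr**(-b)
# with a = 10 and b = 0.5 (asymptotic error assumed negligible).
n_tr = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
eps = 10.0 * n_tr ** -0.5

# On a log-log scale: log10(eps) = log10(a) - b * log10(N_tr),
# so a linear least-squares fit recovers both parameters.
slope, intercept = np.polyfit(np.log10(n_tr), np.log10(eps), 1)
b_fit = -slope
a_fit = 10.0 ** intercept
```

For real data with a non-negligible asymptotic error ε_a, a nonlinear fit of the full power law would be needed instead.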
A learning curve cannot be drawn without the accuracy-estimating step (see "Estimating Accuracy of ML Models") being repeated multiple times. Thus, we provide a dedicated task in MLatom 2 to automate this procedure (Fig. 4).
In the learning curve task, the accuracy for each requested training set size is examined in multiple repeats, in which different training points are sampled. Testing with repeats helps to reduce the bias introduced by a specific combination of training and test sets, probes the statistical variance, and reflects the robustness of an ML model. The results (RMSEs, wall-clock training and prediction times) from each test are recorded in a .csv database file.
ML Nuclear Ensemble Spectra
Electronic spectrum simulation is yet another multi-step task; it uses the ML-nuclear-ensemble approach (ML-NEA) to calculate quantities such as absorption cross sections automatically from as few QC calculations as possible [13]. This approach accelerates the traditional NEA, which usually requires thousands of points in a nuclear ensemble to generate a statistically converged spectrum [41]. Most nuclear ensemble points are relatively similar to each other, which makes ML well suited for efficient interpolation, replacing most QC calculations. Figure 5 shows this approach and its implementation in MLatom schematically.
The calculations require only an initial geometry and Gaussian 16 [42] input files for optimization and normal-mode calculations as well as for the calculation of excited-state properties (excitation energies and oscillator strengths) with the QC method of choice. The user can also provide available pre-calculated data, such as output files with normal-mode calculations or nuclear ensemble geometries. Existing reference QC data can be provided as well. MLatom has an interface to Gaussian 16, which automatically invokes the QC calculations and parses their output to get the equilibrium geometry and frequencies. Then, it passes the required data to the interface to Newton-X [43, 44] to generate a nuclear ensemble of 50,000 conformations sampled from a harmonic-oscillator Wigner distribution of the nuclear coordinates [45].
QC properties for training the ML models are calculated iteratively. The number of training points is increased at each iteration to train 2N_{fs} models for the excitation energies (ΔE_{0n}) and oscillator strengths (f_{0n}) of transitions from the ground state to each excited state n. Then, we evaluate the convergence of the ML predictions: we first calculate the geometric mean of the RMSEs in the validation sets (\({\mathrm{RMSE}}_{\mathrm{geom}}\)) for all ML models and then take the relative change with respect to the previous step (which had \({N}_{batch}=50\) fewer training points) to obtain the relative RMSE (rRMSE), which serves as the convergence criterion; we consider the predictions converged when rRMSE < 0.1.
If the criterion is satisfied, we use the current trained ML models to make predictions for the remaining nuclear ensemble points, substitute any negative ML oscillator strengths with zeros, and then calculate the absorption spectrum with the following equation [41], in which g is a normalized line-shape function of width δ:

$$\sigma \left(E\right)=\frac{\pi {e}^{2}\hslash }{2mc{\varepsilon }_{0}E}\frac{1}{{N}_{p}}\sum_{p=1}^{{N}_{p}}\sum_{n}\Delta {E}_{0n}\left({\mathbf{R}}_{p}\right){f}_{0n}\left({\mathbf{R}}_{p}\right)g\left(E-\Delta {E}_{0n}\left({\mathbf{R}}_{p}\right),\delta \right)$$
where σ is the absorption cross section, m and e are the electron mass and charge, \(\hslash\) is the reduced Planck constant, c is the speed of light, \({\varepsilon }_{0}\) is the vacuum permittivity, \({N}_{p}\) is the number of ensemble points (50k by default), and \(\delta =0.01\) eV is the broadening factor [13]. This summation can become quite computationally intensive for a large number of ensemble points; thus, we implemented it in C++. Currently, only absorption cross sections are available, but the code can be trivially adapted to other electronic spectrum types, such as steady-state fluorescence.
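The structure of this double summation over ensemble points and transitions can be sketched as follows; the physical prefactor is omitted, and the function name and the use of a normalized Gaussian line shape are our assumptions for illustration (not the actual C++ implementation):

```python
import numpy as np

def nea_broadened_sum(e_grid, de, f, delta=0.01):
    """Schematic nuclear-ensemble summation: for each ensemble point p
    and transition n, a line at excitation energy de[p, n] (eV), weighted
    by the oscillator strength f[p, n], is broadened with a normalized
    Gaussian of width delta (eV). Physical constants are omitted; only
    the shape of the summation is illustrated."""
    n_p = de.shape[0]
    spectrum = np.zeros_like(e_grid)
    norm = 1.0 / (delta * np.sqrt(2.0 * np.pi))
    for p in range(n_p):
        for n in range(de.shape[1]):
            g = norm * np.exp(-0.5 * ((e_grid - de[p, n]) / delta) ** 2)
            spectrum += f[p, n] * g
    return spectrum / n_p
```

In production code the two explicit loops would be vectorized or, as in MLatom, compiled, since 50,000 ensemble points times several transitions quickly add up.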
Data Set Tasks
The quality of ML models depends strongly on the descriptor, i.e., the chosen transformation of the molecular structure into the ML input vector x. While descriptors are part of any ML model, MLatom currently only supports converting data sets with XYZ coordinates to the descriptors available in its native implementations (see "Native Implementations"). Other data set operations are discussed in the subsections below.
Splitting and Sampling
As we have seen, for tasks such as creating and estimating the accuracy of the ML model, the data set can be split into subsets: subtraining, validation, training, and test sets [36]. Data points can also be assigned to these subsets in different ways.
The simplest approach is to split the data set into subsets just once and sample the points randomly [10, 36]. When the data set to be explored with ML is known in advance, farthest-point sampling (FPS) and structure-based sampling (SBS) [12] are possible and preferable to random sampling [10]. In both FPS and SBS, the data set points are sorted using an iterative procedure so that each next point is as far as possible from all the previously selected points [10]. The distance between points i and j is judged by the Euclidean distance \({\Vert {\mathbf{x}}_{i}-{\mathbf{x}}_{j}\Vert }_{2}\) between the corresponding descriptors, i.e., the input vectors x_{i} and x_{j}. In the case of SBS, the first point is the near-equilibrium geometry; in the case of FPS, the two most distant points are chosen as the first two points. Since these sampling procedures are based on an iterative greedy algorithm, they are implemented in MLatomF in Fortran and parallelized with OpenMP, which allows efficient sampling for data sets with tens of thousands of points [10]. SBS applied to sampling points from the PES of a single molecule would lead to underrepresentation of the near-equilibrium geometries, as the most distorted geometries would be chosen. As a solution, the geometries are sorted by their Euclidean distance to the equilibrium geometry, sliced into regions corresponding to different degrees of deformation, and, finally, SBS is performed within each of these regions to obtain a balanced set [2, 10, 12]. This slicing procedure is implemented in MLatomPy.
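The greedy FPS procedure described above can be sketched in a few lines (an illustrative NumPy version, not the parallelized Fortran implementation in MLatomF):

```python
import numpy as np

def farthest_point_sampling(x, n_sample):
    """Greedy farthest-point sampling over descriptor vectors x
    (shape: n_points x n_dim). The first two points are the most
    distant pair; each following point maximizes the Euclidean
    distance to its closest already-selected point."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    # Distance of every point to the nearest selected point.
    min_dist = np.minimum(d[i], d[j])
    while len(selected) < n_sample:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, d[nxt])
    return selected
```

Precomputing the full distance matrix keeps the sketch simple but costs O(N²) memory; for large data sets the distances are better computed on the fly, as done in the optimized implementation.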
A more elaborate and slower technique for data set splitting is k-fold cross-validation, with leave-one-out cross-validation being the slowest [2, 10, 36]. In brief, the data set is split into k roughly equal parts, and each of these parts is used in turn for validation/testing, with the remaining parts used for training; after k rotations of the parts, the whole data set has effectively been reused for validation/testing purposes. Thus, this procedure is particularly useful for relatively small training sets.
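A minimal sketch of the k-fold splitting (illustrative only; the helper name is ours):

```python
import numpy as np

def kfold_indices(n_points, k, rng=None):
    """Yield (train, validation) index pairs for k-fold cross-validation:
    the shuffled data set is cut into k roughly equal folds, and each
    fold is used once for validation while the rest is used for training."""
    if rng is None:
        rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(n_points), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]
```

With k equal to the number of points, this reduces to leave-one-out cross-validation.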
All these techniques are available for the native implementations of MLatom. User-defined sampling into these subsets can also be requested; the indices for each of the subsets should then be provided to MLatom. By default, 80%:20% random splitting is used for the native implementations, while interfaced third-party programs may use their own default splitting and sampling for hyperparameter optimization and training (but not for model evaluation).
Analysis of Data
MLatom calculates several built-in statistical metrics for analyzing data, particularly for comparing ML estimations to the reference values. They are overviewed below for the sake of completeness. For N estimated values ŷ and reference values y:

Mean absolute error (MAE):
$${\text{MAE}}=\frac{1}{N}\sum_{i}^{N}\left|{\widehat{y}}_{i}-{y}_{i}\right|$$(8)
Mean signed error (MSE, not to be confused with mean squared error):
$${\text{MSE}}=\frac{1}{N}\sum_{i}^{N}\left({\widehat{y}}_{i}-{y}_{i}\right)$$(9)
Root-mean-squared error (RMSE):
$${\text{RMSE}}=\sqrt{\frac{1}{N}\sum_{i}^{N}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}}$$(10)
Arithmetic means of estimated and reference values, respectively:
$${\mu }_{\widehat{y}}=\frac{1}{N}\sum_{i}^{N}{\widehat{y}}_{i}$$(11)$${\mu }_{y}=\frac{1}{N}\sum_{i}^{N}{y}_{i}$$(12) 
Largest positive and negative outliers as judged by \(\widehat{y}-y\).

Linear regression coefficients a and b in ŷ = a + by, their standard errors (SEs), and the corresponding correlation coefficient R and its squared value R^{2}, found by least-squares fitting [46]:
$$b=\frac{{\text{ss}}_{y\widehat{y}}}{{\text{ss}}_{yy}}$$(13)$$a={\mu }_{\widehat{y}}-b{\mu }_{y}$$(14)$${\text{SE}}\left(a\right)=s\sqrt{\frac{1}{N}+\frac{{\mu }_{y}^{2}}{{\text{ss}}_{yy}}}$$(15)$${\text{SE}}\left(b\right)=\frac{s}{\sqrt{{\text{ss}}_{yy}}}$$(16)$${R}^{2}=\frac{{ss}_{y\widehat{y}}^{2}}{{\text{ss}}_{yy}{\text{ss}}_{\widehat{y}\widehat{y}}}$$(17)
where

$${\text{ss}}_{yy}=\sum_{i}^{N}{\left({y}_{i}-{\mu }_{y}\right)}^{2}$$
$${\text{ss}}_{\widehat{y}\widehat{y}}=\sum_{i}^{N}{\left({\widehat{y}}_{i}-{\mu }_{\widehat{y}}\right)}^{2}$$
$${\text{ss}}_{y\widehat{y}}=\sum_{i}^{N}\left({y}_{i}-{\mu }_{y}\right)\left({\widehat{y}}_{i}-{\mu }_{\widehat{y}}\right)$$
$$s=\sqrt{\frac{{\text{ss}}_{\widehat{y}\widehat{y}}-{\text{ss}}_{y\widehat{y}}^{2}/{\text{ss}}_{yy}}{N-2}}$$
Analogous expressions are used for derived properties, such as partial derivatives; in the latter case, each partial derivative is treated as a data point, e.g., for the RMSE of energy gradients \(\partial E/\partial \mathbf{M}\) in XYZ coordinates evaluated for the PES of a single molecule with N_{at} atoms:

$${\text{RMSE}}=\sqrt{\frac{1}{3{N}_{at}N}\sum_{i}^{N}\sum_{a}^{{N}_{at}}\sum_{t=1}^{3}{\left(\frac{\partial {\widehat{E}}_{i}}{\partial {M}_{at}}-\frac{\partial {E}_{i}}{\partial {M}_{at}}\right)}^{2}}$$
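For reference, the error metrics and regression coefficients defined above can be computed as follows (an illustrative helper, not MLatom's implementation; note that the regression is of ŷ on y, i.e., y is the independent variable):

```python
import numpy as np

def regression_stats(y_hat, y):
    """MAE, MSE (mean signed error), RMSE, and the linear-regression
    coefficients a, b in y_hat = a + b*y together with R^2."""
    err = y_hat - y
    mae = np.mean(np.abs(err))
    mse = np.mean(err)                 # mean *signed* error
    rmse = np.sqrt(np.mean(err ** 2))
    mu_yhat, mu_y = y_hat.mean(), y.mean()
    ss_yy = np.sum((y - mu_y) ** 2)
    ss_yyhat = np.sum((y - mu_y) * (y_hat - mu_yhat))
    ss_yhatyhat = np.sum((y_hat - mu_yhat) ** 2)
    b = ss_yyhat / ss_yy
    a = mu_yhat - b * mu_y
    r2 = ss_yyhat ** 2 / (ss_yy * ss_yhatyhat)
    return mae, mse, rmse, a, b, r2
```

A perfect linear relation ŷ = a + by gives R² = 1 even when the absolute errors are large, which is why MLatom reports both regression and error statistics.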
Native Implementations
This section overviews the theory and provides technical details behind the native implementations available in MLatom. All native ML models are currently based on KRR, so this approach is described first, followed by the details behind the KREG model and the approaches based on the Coulomb matrix. The code for the KREG model, hyperparameter grid search, and farthest-point and structure-based sampling was optimized for efficient computing, while no such efforts were made for the other implementations.
Kernel Ridge Regression
The approximating function \(\widehat{f}\left(\mathbf{x};\mathbf{h};\mathbf{p}\right)\) in KRR is the sum over all \({N}_{tr}\) training points [36]:

$$\widehat{f}\left(\mathbf{x};\mathbf{h};\mathbf{p}\right)=\sum_{j=1}^{{N}_{tr}}{\alpha }_{j}k\left(\mathbf{x},{\mathbf{x}}_{j}\right)$$
where k is the kernel function, p are the model parameters, which include the set of regression coefficients α, and h are the hyperparameters entering the kernel function.
The regression coefficients are found by solving the linear system of equations regularized by adding a small, non-negative constant value λ to the diagonal elements (with the Kronecker delta \({\delta }_{ij}\)) [36]:

$$\sum_{j=1}^{{N}_{tr}}{\alpha }_{j}\left[k\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)+\lambda {\delta }_{ij}\right]={y}_{i},\quad i=1,\dots ,{N}_{tr}$$
or in matrix form:

$$\left(\mathbf{K}+\lambda \mathbf{I}\right)\boldsymbol{\alpha }=\mathbf{y}$$
where I is the identity matrix, K is the kernel matrix whose elements evaluate the kernel function for each pair of training points, and y is the vector of reference values. λ is called the regularization parameter; it is an external hyperparameter that does not enter the approximating function itself but is used for model selection.
This system of equations has an analytical solution, which makes KRR very attractive. The solution is, however, computationally costly for large data sets, as it involves some kind of matrix decomposition that scales as O(N_{tr}^{3}), followed by solving the system of equations, which scales as O(N_{tr}^{2}). MLatom uses the computationally efficient Cholesky decomposition by default. Bunch–Kaufman and LU decomposition approaches are also available and are sometimes necessary when the Cholesky decomposition fails [47]. The solution of the above system of equations also requires calculation of the kernel matrix of size N_{tr}^{2}, which can become very large and no longer fit in the available computer memory. Note that, by default, MLatom does not invert the matrix to solve the system of equations using the common expression [36]:

$$\boldsymbol{\alpha }={\left(\mathbf{K}+\lambda \mathbf{I}\right)}^{-1}\mathbf{y}$$
as it is much less computationally efficient and numerically less precise [39].
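A minimal KRR sketch with a Gaussian kernel, solving the regularized system by Cholesky decomposition rather than matrix inversion, as described above (illustrative NumPy code, not MLatom's Fortran implementation):

```python
import numpy as np

def gaussian_kernel_matrix(x1, x2, sigma):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for every pair of rows.
    d2 = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_train(x_tr, y_tr, sigma, lam):
    # Solve (K + lambda*I) alpha = y via Cholesky decomposition
    # (two triangular solves) instead of forming the matrix inverse.
    K = gaussian_kernel_matrix(x_tr, x_tr, sigma)
    L = np.linalg.cholesky(K + lam * np.eye(len(y_tr)))
    return np.linalg.solve(L.T, np.linalg.solve(L, y_tr))

def krr_predict(x, x_tr, alpha, sigma):
    return gaussian_kernel_matrix(x, x_tr, sigma) @ alpha
```

The Cholesky factorization carries the O(N_tr³) cost; each prediction afterwards only needs the kernel vector against the training points.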
The kernel functions supported by MLatom are [2, 10, 36, 39]:

Gaussian:
$$k\left(\mathbf{x},{\mathbf{x}}_{j}\right)=\mathrm{exp}\left(-\frac{1}{2{\sigma }^{2}}\sum_{s}^{{N}_{x}}{\left({x}_{s}-{x}_{j,s}\right)}^{2}\right)$$(27)
Laplacian:
$$k\left(\mathbf{x},{\mathbf{x}}_{j}\right)=\mathrm{exp}\left(-\frac{1}{\sigma }\sum_{s}^{{N}_{x}}\left|{x}_{s}-{x}_{j,s}\right|\right)$$(28)
Exponential:
$$k\left(\mathbf{x},{\mathbf{x}}_{j}\right)=\mathrm{exp}\left(-\frac{1}{\sigma }{\left[\sum_{s}^{{N}_{x}}{\left({x}_{s}-{x}_{j,s}\right)}^{2}\right]}^{1/2}\right)$$(29)
Matérn:
$$k\left(\mathbf{x},{\mathbf{x}}_{j}\right)=\mathrm{exp}\left(-\frac{1}{\sigma }{\left[\sum_{s}^{{N}_{x}}{\left({x}_{s}-{x}_{j,s}\right)}^{2}\right]}^{1/2}\right)\sum_{k=0}^{n}\frac{\left(n+k\right)!}{\left(2n\right)!}\left(\begin{array}{c}n\\ k\end{array}\right){\left(\frac{2}{\sigma }{\left[\sum_{s}^{{N}_{x}}{\left({x}_{s}-{x}_{j,s}\right)}^{2}\right]}^{1/2}\right)}^{n-k}$$(30)
where \({N}_{x}\) is the dimensionality of the input vector x; the symbol for the hyperparameters h entering the kernel function was dropped for brevity. All of these kernel functions have the length-scale hyperparameter σ, which is an internal hyperparameter. The Matérn kernel function has an additional integer hyperparameter n. The choice of the kernel function depends on the application and can be considered a hyperparameter itself, although MLatom does not make this choice automatically. MLatom performs automatic optimization of the hyperparameters λ and σ on a nested logarithmic grid [10]. Alternatively, the hyperparameters can be optimized using the third-party hyperopt package (see "Interfaces" and its subsection "Hyperopt" below).
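The Laplacian and Matérn kernel functions, for example, can be implemented directly from the expressions above (an illustrative sketch; note that the Matérn function with n = 0 reduces to the exponential kernel):

```python
import numpy as np
from math import comb, factorial

def laplacian_kernel(x, xj, sigma):
    # exp(-sum_s |x_s - x_{j,s}| / sigma), Eq. (28).
    return np.exp(-np.sum(np.abs(x - xj)) / sigma)

def matern_kernel(x, xj, sigma, n):
    # Matérn kernel, Eq. (30): exp(-r/sigma) times a polynomial in r,
    # where r is the Euclidean distance between the input vectors.
    r = np.linalg.norm(x - xj)
    poly = sum(factorial(n + k) / factorial(2 * n) * comb(n, k)
               * (2.0 * r / sigma) ** (n - k) for k in range(n + 1))
    return np.exp(-r / sigma) * poly
```

For n = 1 the polynomial evaluates to 1 + r/σ, recovering the familiar Matérn-3/2 form (with the usual √3 factor absorbed into σ).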
As discussed above, we are often interested in the derivatives of properties. Once the KRR approximating function is trained on reference values of properties, it can be differentiated to obtain the required derivatives with respect to dimension d of the input vector x:

$$\frac{\partial \widehat{f}\left(\mathbf{x}\right)}{\partial {x}_{d}}=\sum_{j=1}^{{N}_{tr}}{\alpha }_{j}\frac{\partial k\left(\mathbf{x},{\mathbf{x}}_{j}\right)}{\partial {x}_{d}}$$
Thus, calculating the approximating function derivatives requires calculation of kernel function derivatives. The expressions for analytical firstorder derivatives of the kernel functions are:

Gaussian:
$$\frac{\partial k\left({\mathbf{x},\mathbf{x}}_{j}\right)}{\partial {x}_{d}}=\frac{1}{{\sigma }^{2}}\left({x}_{j,d}-{x}_{d}\right)k\left({\mathbf{x},\mathbf{x}}_{j}\right)$$(32)
Matérn with n > 0:
$$\frac{\partial k\left(\mathbf{x},{\mathbf{x}}_{j}\right)}{\partial {x}_{d}}=\mathrm{exp}\left(-\frac{{\Vert \mathbf{x}-{\mathbf{x}}_{j}\Vert }_{2}}{\sigma }\right)\sum_{k=0}^{n-1}\frac{\left(n+k-1\right)!}{\left(2n\right)!}\left(\begin{array}{c}n\\ k\end{array}\right)\times \left(n-k\right)\frac{2}{{\sigma }^{2}}{\left(\frac{2{\Vert \mathbf{x}-{\mathbf{x}}_{j}\Vert }_{2}}{\sigma }\right)}^{n-k-1}\left({x}_{j,d}-{x}_{d}\right)$$(33)
where \({\Vert \mathbf{x}-{\mathbf{x}}_{j}\Vert }_{2}={\left[\sum_{s}^{{N}_{x}}{\left({x}_{s}-{x}_{j,s}\right)}^{2}\right]}^{1/2}\) is the Euclidean distance. The exponential and Laplacian kernel functions, as well as the Matérn kernel function with n = 0, are not differentiable.
Often, we need to know the derivatives in XYZ coordinates. Let us define the molecular XYZ coordinates as M, which can be transformed into the input vector x via the descriptor function x(M). Then, the partial derivatives in XYZ coordinates for atom a and dimension t can be obtained using the chain rule:

$$\frac{\partial \widehat{f}}{\partial {M}_{at}}=\sum_{d}^{{N}_{d}}\frac{\partial \widehat{f}}{\partial x{\left(\mathbf{M}\right)}_{d}}\frac{\partial x{\left(\mathbf{M}\right)}_{d}}{\partial {M}_{at}}$$
where x(M)_{d} are the \({N}_{d}\) elements of the input vector x that depend on M_{at}. The expressions for the first-order derivatives of the descriptors \(\partial x{\left(\mathbf{M}\right)}_{d}/\partial {M}_{at}\) available in MLatom are given in the following sections.
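The analytical derivative of the Gaussian kernel, Eq. (32), can be verified against finite differences (an illustrative sketch):

```python
import numpy as np

def gaussian_kernel(x, xj, sigma):
    return np.exp(-np.sum((x - xj) ** 2) / (2.0 * sigma ** 2))

def gaussian_kernel_grad(x, xj, sigma):
    # Analytical first derivative with respect to x, Eq. (32):
    # (x_{j,d} - x_d) / sigma^2 * k(x, x_j) for each dimension d.
    return (xj - x) / sigma ** 2 * gaussian_kernel(x, xj, sigma)
```

Central finite differences agree with the analytical gradient to within the O(h²) truncation error, which is a useful sanity check for any new kernel derivative implementation.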
Note that the derivative information can also be included in the training data for KRR, which usually requires implementing higher-order derivatives of the kernel functions [26, 37, 39, 47,48,49,50]. Such implementations in MLatom are currently underway and will be released in the near future.
All KRR calculations are implemented in Fortran and parallelized with OpenMP, and we use the efficient Intel® Math Kernel Library (MKL) LAPACK [51] routines for calculating the regression coefficients.
KREG
KREG is the ML model type designed for constructing accurate PESs of a single molecule [12]. Its name is, strictly speaking, not an acronym but can be somewhat loosely derived from the first letters of the underlying components: KRR with the RE descriptor and the Gaussian kernel function. The RE descriptor is a vector of all inverse internuclear distances \({r}_{ab}^{-1}\) between atoms a and \(b\ne a\) in a molecule, normalized relative to the corresponding distances in some reference structure of the same molecule, usually its equilibrium geometry:
The RE descriptor is a global descriptor, meaning that it describes the molecule as a whole, and the KREG model learns the QC property directly without partitioning it into, e.g., atomic contributions as done by many models discussed below. This descriptor is also a complete descriptor, meaning that it uniquely represents the molecular geometry, as the latter can always be derived from the descriptor up to rotation, translation, and such symmetry operations as reflection [10]. Rotational and translational invariance is a feature of the RE descriptor that ensures that scalar properties such as the total energy are invariant to these operations according to the laws of physics [10].
The RE descriptor does not, however, ensure homonuclear permutational invariance, e.g., that the interchange of hydrogen atoms in the methyl group CH_{3} does not change the total energy [10, 12]. Thus, several variants of the KREG model are possible depending on how this issue is dealt with. The simplest approximation is to neglect permutational invariance, i.e., to use the unsorted RE descriptor as obtained after transforming the XYZ coordinates [10]. Another approximation is to sort the homonuclear atoms using some criterion. In MLatom, this is done by sorting the homonuclear atoms in descending order with respect to the sum of nuclear repulsions to the other atoms [12]. This may help ensure the same sorting of atoms during structure-based sampling but may lead to discontinuities in the approximating function, which is problematic, e.g., in molecular dynamics [12]. Finally, permutational invariance can be ensured using the permutationally invariant kernel that takes as input the permuted RE descriptor (see "Permutationally invariant kernel") [12]. The latter approach is the most accurate but also the most computationally expensive [10].
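For illustration, the unsorted RE descriptor can be computed in a few lines of NumPy (a minimal sketch following the description above, not MLatom's Fortran implementation; the function name is ours):

```python
import numpy as np

def re_descriptor(coords, ref_coords):
    """Unsorted RE descriptor: for every atom pair (a, b), the inverse
    internuclear distance 1/r_ab normalized by the corresponding inverse
    distance in a reference (e.g., equilibrium) structure, i.e., r_ab(ref)/r_ab."""
    n = len(coords)
    x = []
    for a in range(n):
        for b in range(a + 1, n):
            r = np.linalg.norm(coords[a] - coords[b])
            r_ref = np.linalg.norm(ref_coords[a] - ref_coords[b])
            x.append(r_ref / r)
    return np.array(x)
```

At the reference geometry the descriptor is a vector of ones, and uniformly stretching the molecule scales every element down accordingly.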
If the derivatives of the KREG model are necessary, they can be easily obtained using the expressions discussed in the previous subsection and the first-order derivative of the RE descriptor:
where
The first-order derivatives of the KREG models are implemented for both the unsorted and permuted RE descriptor variants.
Coulomb Matrix
The Coulomb matrix (CM) descriptor is popular in ML studies, where its vectorized form is used as the input vector [14]:
Like the RE descriptor, it is a global, complete descriptor based on the internuclear distances (in Bohr), ensuring rotational and translational invariance. However, it can also differentiate between molecules of different compositions by including the nuclear charges Z (in a.u.), essentially calculating internuclear repulsions in its off-diagonal elements. Its dimensionality, however, is fixed by the largest number of atoms \({N}_{\text{at}}\) among the molecules in the training set, and, for molecules with fewer atoms, the CM descriptor elements are padded with zeros. Better ML models for treating molecules with different numbers of atoms exist nowadays [4], and some of them are interfaced to MLatom, as discussed below. The CM can, though, also be used for constructing the PES of a single molecule.
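For concreteness, here is a minimal NumPy sketch of the CM and its sorted variant (following the standard definitions of Refs. [14, 15]; the function names are ours, and coordinates are assumed to be in Bohr):

```python
import numpy as np

def coulomb_matrix(Z, coords):
    """Coulomb matrix: off-diagonal elements are internuclear repulsions
    Z_i*Z_j / r_ij (distances in Bohr); diagonal elements are 0.5*Z_i**2.4 [14]."""
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        C[i, i] = 0.5 * Z[i] ** 2.4
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            C[i, j] = C[j, i] = Z[i] * Z[j] / r
    return C

def sorted_cm(Z, coords):
    """Sorted variant: permute atoms so that the row norms of the matrix
    are in descending order [15]."""
    C = coulomb_matrix(Z, coords)
    order = np.argsort(-np.linalg.norm(C, axis=1))
    return C[np.ix_(order, order)]
```

The sorting makes the descriptor independent of the input atom order, at the cost of the discontinuities discussed below.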
Like for the RE descriptor, three variants of the CM are available in MLatom: unsorted, sorted, and permuted. In the sorted variant, the atoms are sorted so that the Euclidean norms of the columns and rows of the matrix are in descending order [15]. When the CM descriptor is used for a single-molecule PES, the unsorted CM has many redundant elements (as it is symmetric, only one half of it without the diagonal elements is needed), while sorting can lead to large discontinuities in the approximating function [2]. Despite this, the unsorted Coulomb matrix was used in molecular dynamics studies [52, 53], where its first-order derivative is also needed. It is implemented in MLatom for the unsorted variant as:
Permutationally Invariant Kernel
The permutationally invariant (symmetrized) kernel is employed to take into account the permutational invariance of homonuclear atoms [7, 10, 12, 18]:
where k is one of the kernel functions mentioned above and \(\widehat{P}\) permutes the order of atoms selected by the user. The descriptor is calculated for each such permutation, and hence it is called the permuted descriptor. The denominator normalizes the kernel function [39]. Unnormalized variants of this kernel symmetrization approach were used to extend the original GDML model type [48] to be permutationally invariant (creating the sGDML model type [19] interfaced to MLatom 2 and discussed below) and to create the conceptually related RKHS + F [20] (reproducing kernel Hilbert space using energies and forces) model type.
The first-order derivative of the permutationally invariant kernel defined in Eq. 40 is given by:
The derivative \(\frac{\partial k\left(x\left(\mathbf{M}\right),x\left(\widehat{P}{\mathbf{M}}_{j}\right)\right)}{\partial {M}_{at}}\) is analogous to the derivatives \(\frac{\partial k\left(x\left(\mathbf{M}\right),x\left({\mathbf{M}}_{j}\right)\right)}{\partial {M}_{at}}\) shown above, with the difference that the elements \({x}_{j,d}\) change with each permutation as \({x\left(\widehat{P}{\mathbf{M}}_{j}\right)}_{d}\). The derivatives \(\frac{\partial k\left(x\left(\mathbf{M}\right),x\left(\widehat{P}\mathbf{M}\right)\right)}{\partial {M}_{at}}\) require additional derivation because \({M}_{at}\) enters both \(x\left(\mathbf{M}\right)\) and \(x\left(\widehat{P}\mathbf{M}\right)\). The dth element stemming from atoms a and b in the original atom order corresponds to \({x\left(\mathbf{M}\right)}_{d}\), from which \({x\left(\widehat{P}\mathbf{M}\right)}_{d}\) is subtracted in both the Gaussian and Matérn kernel functions. On the other hand, the element stemming from atoms a and b in the permuted atom order is \({x\left(\widehat{P}\mathbf{M}\right)}_{\widehat{P}d}={x\left(\mathbf{M}\right)}_{d}\), from which \({x\left(\mathbf{M}\right)}_{\widehat{P}d}\) is subtracted in both the Gaussian and Matérn kernel functions. Thus, the expressions for this term for these kernel functions are:

Gaussian:
$$\frac{\partial k\left(x\left(\mathbf{M}\right),x\left(\widehat{P}\mathbf{M}\right)\right)}{\partial {M}_{at}}=-\frac{1}{{\sigma }^{2}}\left[\left({x\left(\mathbf{M}\right)}_{d}-{x\left(\widehat{P}\mathbf{M}\right)}_{d}\right)+\left({x\left(\mathbf{M}\right)}_{d}-{x\left(\mathbf{M}\right)}_{\widehat{P}d}\right)\right]k\left(x\left(\mathbf{M}\right),x\left(\widehat{P}\mathbf{M}\right)\right)$$(42) 
Matérn:
$$\frac{\partial k\left(x\left(\mathbf{M}\right),x\left(\widehat{P}\mathbf{M}\right)\right)}{\partial {M}_{at}}=-\frac{2}{{\sigma }^{2}}\left[\left({x\left(\mathbf{M}\right)}_{d}-{x\left(\widehat{P}\mathbf{M}\right)}_{d}\right)+\left({x\left(\mathbf{M}\right)}_{d}-{x\left(\mathbf{M}\right)}_{\widehat{P}d}\right)\right]\times\mathrm{exp}\left(-\frac{{\Vert x\left(\mathbf{M}\right)-x\left(\widehat{P}\mathbf{M}\right)\Vert }_{2}}{\sigma }\right)\sum_{k=0}^{n-1}\frac{\left(n+k-1\right)!}{\left(2n\right)!}\times\left(\begin{array}{c}n\\ k\end{array}\right)\left(n-k\right){\left(\frac{2{\Vert x\left(\mathbf{M}\right)-x\left(\widehat{P}\mathbf{M}\right)\Vert }_{2}}{\sigma }\right)}^{n-k-1}$$(43)
Interfaces
As mentioned in the "Introduction", it is not easy to start using a new ML model, especially for novices who have not yet gotten their feet wet in this field, and sometimes even for experienced researchers who carry preconceptions from their familiar frameworks. Such difficulties could be alleviated by implementing all desired models in a single all-in-one software package. However, this approach is labor-intensive and unsustainable, considering the fast-growing number of ML models. A better solution is to make interfaces to third-party software, which are easier to implement and modularize. The drawbacks of such interface-based solutions compared to all-in-one software are the need to install multiple third-party packages and the decreased computational efficiency due to data conversion and communication bottlenecks between programs. The interface-based approach has, however, the considerable benefit of rapidly providing one uniform workflow, which eliminates the problem at its origin: the lack of standardization. This allows researchers to quickly test multiple ML model types, an advantage that outweighs the drawbacks in many cases.
Thus, we introduced interfaces to third-party ML software in MLatom 2. Each interface has four main functions, shown in Fig. 6. Arguments parsing translates MLatom-style input arguments into their third-party equivalents. Data conversion takes MLatom-format data and converts it into the corresponding format required by the third-party software. Model training sends the prepared data and arguments to the third-party software and retrieves the trained model. Model using queries the third-party software for predictions from the trained model. The interfaces are modularized Python 3 scripts stored in the subfolder interfaces in the MLatom program directory.
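Schematically, such an interface module can be sketched as follows (a hypothetical minimal example; the keyword map, flag names, and data layout are illustrative and do not correspond to any actual interfaced program):

```python
# Hypothetical skeleton of an interface module; all names are illustrative.
KEYWORD_MAP = {'nthreads': '--num-workers', 'sigma': '--length-scale'}

def parse_args(mlatom_args):
    """Arguments parsing: translate MLatom-style keyword=value arguments
    into the third-party program's command-line flags."""
    flags = []
    for arg in mlatom_args:
        key, _, value = arg.partition('=')
        if key in KEYWORD_MAP:
            flags += [KEYWORD_MAP[key], value]
    return flags

def convert_data(xyz_geometries):
    """Data conversion: reshape MLatom-format data (here, a list of
    (element, x, y, z) tuples per molecule) into the third-party format."""
    return [{'species': [a[0] for a in mol],
             'coordinates': [a[1:] for a in mol]} for mol in xyz_geometries]

def train(data, flags):
    """Model training: hand the converted data and flags to the third-party
    trainer (stubbed out here) and return the trained model."""
    ...

def predict(model, data):
    """Model using: query the third-party model for predictions."""
    ...
```

The first two functions are pure bookkeeping; the last two wrap calls into the third-party package.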
The currently interfaced third-party software packages are listed in Table 1 and are introduced in the subsections below. Note that some programs provide several ML model types other than the listed ones. Using them through the interfaces might already be supported in MLatom or may need slight modifications in the interface modules. However, only the listed model types are tested and discussed in this paper. Generally, we follow the strategy of making no modifications in the interfaced programs, so that they can be taken “as is” and only the path to their executables needs to be known to MLatom. The only exception is PhysNet, which uses TensorFlow v1. To make it compatible with newer versions of Python and TensorFlow [54], the names of some legacy functions need to be changed to their equivalents in TensorFlow v2. Also, it is important to note that the main hyperparameters of each interfaced model can be easily modified through the standard input file of MLatom such that no modifications to the third-party software are required.
Besides the third-party ML software, we also implemented an interface to the hyperparameter optimization library hyperopt [33] to provide a solution for hyperparameter optimization, which is unavailable in most of the third-party software packages.
Hyperopt
Hyperparameters—the parameters that tune the shape of the ML model and stay unchanged during training—have a huge impact on the performance of an ML model. To unleash the full potential of an ML model, the hyperparameters need to be well optimized, which gives rise to the hyperparameter optimization problem. Solving this problem manually is too cumbersome and rarely leads to the optimal solution. Hence, many packages capable of automatic hyperparameter optimization are available; hyperopt [33] is one of them. Hyperopt is an open-source Python library that uses a Bayesian method with the tree-structured Parzen estimator (TPE) [32] to find the optima in hyperparameter spaces.
In MLatom 2, we added an interface that uses the hyperopt library as a convenient solution to the hyperparameter optimization problem. By simply substituting the hyperparameters that need to be tuned with keywords defining the hyperopt search space, the user activates the interface to perform the automatic optimization (see Fig. 2). The optimization splits the training set into sub-training and validation sets. The trained model's error on the validation set is sent to hyperopt to obtain the next search point in hyperparameter space. If KMs are used, the final model is generated after the optimization by training on the entire training set with the optimized hyperparameters, while for NNs, the best model trained during the hyperparameter optimization becomes the final ML model.
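The overall loop that the interface automates can be sketched as follows (random search stands in for hyperopt's TPE to keep the example self-contained; the 80/20 split and all names are illustrative):

```python
import random

def optimize_hyperparameters(train_fn, error_fn, space, training_set,
                             n_trials=20, seed=0):
    """Split the training set into sub-training and validation sets, train a
    model for each trial point in hyperparameter space, and keep the best.
    `space` maps hyperparameter names to (low, high) log10-uniform ranges."""
    rng = random.Random(seed)
    n_sub = int(0.8 * len(training_set))            # 80/20 split (illustrative)
    sub, val = training_set[:n_sub], training_set[n_sub:]
    best = (float('inf'), None)
    for _ in range(n_trials):
        params = {k: 10 ** rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        model = train_fn(sub, params)
        err = error_fn(model, val)                  # validation error guides the search
        best = min(best, (err, params), key=lambda t: t[0])
    return best[1]   # for kernel methods, retrain on the full set with these
```

TPE replaces the random draws with a model-based proposal that concentrates trials in promising regions, but the surrounding workflow is the same.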
sGDML
sGDML [19] is a symmetrized variant of the gradient-domain machine learning (GDML [48]) model type and is interfaced to MLatom through the sGDML program [25]. The sGDML method is a KRR model with a descriptor of unnormalized inverse distances (the ID descriptor)
Thus, it has the same basic properties as the RE descriptor, i.e., the ID descriptor is also a global and complete descriptor ensuring rotational and translational invariance, but not permutational invariance of homonuclear atoms.
In contrast to the KREG model, the GDML model learns only from the energy gradients. It also uses the Matérn kernel function, whose expression differs slightly from that implemented in MLatom. For learning from gradients, the common KRR approximating function for scalar properties is modified by using the covariance between derivatives to predict the XYZ components of the energy gradients:
As usual, the linear system of KRR equations is solved with the kernel matrix K, now of size \(3{N}_{\text{at}}{N}_{\text{tr}}\times 3{N}_{\text{at}}{N}_{\text{tr}}\):
When the energies are needed, they are obtained by integrating the energy gradients, and the integration constant is found by fitting to the reference energy values:
sGDML tackles the permutational invariance by using a modified, unnormalized permutationally invariant kernel with permutations chosen automatically to minimize the matching cost between pairs of training points. The sGDML program also supports automatic hyperparameter optimization via cross-validation.
The sGDML method achieves remarkable accuracy with only a very small number of training points, as was shown for molecular PESs visited during dynamics [19]. Because of the large size of the kernel matrix, the method is practically applicable only to hundreds up to several thousand training points. Note that when we refer to a single “training point,” we mean all the associated information for one geometry, and the real number of reference values available to sGDML is \(3{N}_{\text{at}}\) times larger (the number of Cartesian energy gradient components). sGDML efficiently utilizes all this available information, which explains its accuracy.
The sGDML program requires its own data format, which uses NumPy's [55] npz file as the container. Scripts to convert from other data formats (e.g., extended XYZ) are included in the program. Like MLatom, sGDML has a built-in hyperparameter optimization function for the hyperparameter \(\sigma\) using a grid search, which is enabled by default; \(\lambda\) is not optimized. The users can also specify \(\sigma\) or a list of \(\sigma\) values for the grid search, but only integer values are accepted.
GAP–QUIP
The Gaussian approximation potential [26] (GAP) model is interfaced to MLatom through the QUIP and GAP suite programs. Like the native implementations in MLatom, GAP is also based on a kernel method, although it was developed within the Gaussian process regression (GPR) formalism rather than KRR. In the GAP model, the total energy of a system is represented as the sum of atomic energies:
As a result, local descriptors describing the atomic environment of every single atom are used, rather than global descriptors as in MLatom's native models and sGDML. The GAP suite provides the smooth overlap of atomic positions [27] (SOAP) descriptor for this purpose. The construction of the SOAP descriptor is quite involved as it has to respect all the required symmetries (rotational, translational, permutational); its derivations are given in the literature [7, 27, 56]. Alternative versions of the SOAP descriptor also exist [35].
Here, we describe the main points behind this descriptor. The local environment of an atom is represented by the atomic neighborhood density \({\rho }_{i}\left(\mathbf{r}\right)\) that is constructed using Gaussians for the ith atom as:
where \({\mathbf{r}}_{ij}\) is a vector pointing from atom i to the neighboring atom j, \({\upsigma }_{\text{atom}}\) reflects the atom “size”, and \({f}_{\text{cut}}\) is the cutoff function with a cutoff region of width \({r}_{\Delta }\) approaching the limit \({r}_{\text{cut}}\):
The Gaussians are expanded with a set of orthonormal radial basis functions \({g}_{n}\) [57]:
where \(\widehat{\mathbf{r}}\) projects the direction of the vector r on the unit sphere and \({Y}_{lm}\) are spherical harmonics. For better efficiency, the choices of \(n\) and \(l\) are limited by \({n}_{\text{max}}\) and \({l}_{\text{max}}\), respectively. The orthonormal radial basis functions are constructed from
and the overlap matrix \(\mathbf{S}={\mathbf{U}}^{T}\mathbf{U}\) with elements \({S}_{n{n}^{\prime}}={\int }_{0}^{{r}_{\text{cut}}}{\phi }_{n}\left(r\right){\phi }_{{n}^{\prime}}\left(r\right){r}^{2}dr\) as
The coefficients \({c}_{nlm}\) are obtained as
The SOAP descriptor p_{i} consists of \({\sum }_{m}{\left({c}_{nlm}^{i}\right)}^{*}{c}_{{n}^{\prime}lm}^{i}\), which correspond to the power spectrum elements.
The kernel matrix elements are calculated using the dot-product kernel function
which are subsequently normalized, raised to a power of \(\zeta\) (a positive integer) to tune the sensitivity, and scaled by \({\upsigma }_{\omega }\):
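Putting these steps together, a single kernel entry between two power-spectrum vectors can be sketched as follows (an illustration of the operations just described, not the GAP code; whether \({\upsigma }_{\omega }\) enters linearly or squared is an implementation detail of the actual program):

```python
import numpy as np

def soap_kernel(p_i, p_j, zeta=2, sigma_w=1.0):
    """Dot-product SOAP kernel: the dot product of two power-spectrum
    vectors is normalized, raised to the integer power zeta to tune
    sensitivity, and scaled by sigma_w."""
    k = np.dot(p_i, p_j) / np.sqrt(np.dot(p_i, p_i) * np.dot(p_j, p_j))
    return sigma_w * k ** zeta
```

Because of the normalization, the kernel of a descriptor with itself (or with any rescaled copy of itself) equals sigma_w, and all values are bounded by it.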
The SOAP descriptor and kernel are then used to estimate the most likely values of the atomic energies by performing GPR, which uses the same expression for making estimations as KRR:
The problem with this expression is that the α coefficients cannot be obtained directly using an expression similar to that described in "Kernel ridge regression", because there are (usually) no N_{tr} reference atomic energies \({\varvec{\upvarepsilon}}\) and only N_{obs} total energies E are available. In the GAP approach, this is solved by introducing a linear transformation using the N_{tr} × N_{obs} matrix L with elements 1 or 0 so that
Then, the kernel matrix becomes
Using this kernel matrix, the regression coefficients α can be calculated in the usual manner as in the KRR approach. In the GAP–SOAP notation, the regularization hyperparameter for energies is denoted \({\sigma }_{E}^{2}\). The GAP–QUIP implementation also allows using sparsification to reduce the size of the kernel matrix; in this case, additional parameters defining the size \({N}_{\text{sparse}}\) of the sparse kernel matrix and its regularization parameter \({\sigma }_{\text{jitter}}\), added to its diagonal elements, can be set by the user.
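The role of the matrix L can be illustrated with a small NumPy example (the numbers and the toy kernel are invented):

```python
import numpy as np

# Toy setup: N_tr = 5 atomic environments belonging to N_obs = 2 molecules
# (atoms 0-2 form molecule A, atoms 3-4 form molecule B).
L = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])                      # N_tr x N_obs matrix, elements 1 or 0

eps = np.array([0.5, 0.2, 0.3, 0.7, 0.1])   # (normally unknown) atomic energies
E = L.T @ eps                               # observable total energies: 1.0 and 0.8

# The same transformation converts the kernel matrix between atomic
# environments (N_tr x N_tr) into one between total energies (N_obs x N_obs):
K = np.eye(5)                               # placeholder atomic kernel matrix
K_E = L.T @ K @ L
```

Regression then proceeds against the observable totals E with the transformed kernel, even though the atomic energies themselves were never part of the training data.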
The GAP–SOAP implementation allows including energy gradient information in the kernel matrix. In this case, the transformation matrix L has additional elements with the differentiation operators \(\partial /\partial {M}_{\mathrm{at}}\), which results in calculating the covariance between energies and their partial derivatives and also between the derivatives [7, 26, 56].
The GAP software suite from the libatoms.org website contains the gap_fit program that trains the GAP model with the SOAP descriptor. The QUIP program is used to obtain predictions from the model trained by gap_fit. To use this combination for PES training and prediction, the data need to be provided in the extended XYZ format.
The gap_fit program provides numerous options that enable users to fine-tune the training process, including settings for atomic energy offsets, sparsification, etc. (Table 2). However, the regularization hyperparameter and the hyperparameters of the SOAP descriptor need to be set manually by the user, which makes it harder to realize the model's full potential.
TorchANI
TorchANI [28] is a Python library that implements the ANI model type [21], with PyTorch [58] as its backend for NN construction.
ANI is the abbreviation of ANAKIN-ME, which was back-engineered to Accurate NeurAl networK engINe for Molecular Energies. The ANI atomic environment vector (AEV) used in this model is a local descriptor derived from the descriptor in Behler and Parrinello's work [59].
The ANI descriptor \({\overrightarrow{G}}\!_{i}^{X}\) for the ith atom of element \(X\) contains a radial and an angular part, and both parts are further subdivided into sub-AEVs, each of which considers only atoms of one element.
For each element, the radial sub-AEV consists of input vector elements \({G}_{k}^{R}\in \mathbf{x}\) for different values of the radial shift hyperparameters \({R}_{s}^{\left(k\right)}\):
Similarly, for each pair of elements, the angular sub-AEV consists of input vector elements \({G}_{p,q}^{A}\in \mathbf{x}\) for different combinations of the angular shift hyperparameter \({\theta }_{s}^{\left(q\right)}\) and another set of radial shift hyperparameters \({R}_{s}^{\left(p\right)}\):
In the equations above, \({f}_{c}\) is the cutoff function used in Behler–Parrinello NN potentials [59], similar to that used in GAP–SOAP; \(\eta\), \({R}_{s}^{\left(k\right)}\), \({R}_{s}^{\left(p\right)}\), \({\theta }_{s}^{\left(q\right)}\), and \(\zeta\) are predefined hyperparameters. The parameters \(\eta\) are, like \({R}_{s}\), defined separately for the radial and angular parts.
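The radial part can be sketched as follows (the Gaussian-times-cutoff form follows Behler and Parrinello [59]; treat this as an illustration rather than TorchANI's actual code):

```python
import numpy as np

def f_cut(r, r_c):
    """Cosine cutoff function: smoothly decays from 1 at r = 0 to 0 at r = r_c."""
    return np.where(r <= r_c, 0.5 * np.cos(np.pi * r / r_c) + 0.5, 0.0)

def radial_sub_aev(r_ij, eta, r_s, r_c):
    """One radial sub-AEV element for atom i: a Gaussian centered at the
    radial shift r_s, summed over the distances r_ij to neighbors of one
    element and damped by the cutoff function."""
    r_ij = np.asarray(r_ij)
    return np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * f_cut(r_ij, r_c))
```

Varying r_s over a grid produces the full radial sub-AEV; the angular part is built analogously from triples of atoms.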
After being computed, each atom's AEV is fed as the input vector into its own NN to predict the atomic energy; atoms of the same element share the same NN to ensure the permutational invariance of the trained model.
The total energies are obtained by summing all atomic energies, while atomic forces are generated by differentiating atomic energies using PyTorch’s automatic differentiation engine. TorchANI reads HDF5 files, where training data are organized and stored.
As shown in Table 3, many hyperparameters can be tuned in the ANI descriptors, not to mention the hyperparameters of the NNs. However, as a Python library, TorchANI provides neither default values nor an optimization method for the hyperparameters, but only the basic core functions of the ANI model type as building blocks. The final training scripts need to be written by the users themselves.
DeePMD-kit
DeePMD-kit [29] is a software package with the Deep Potential Molecular Dynamics (DPMD) [30] ML model type and its successor Deep Potential—Smooth Edition (DeepPot-SE) [31] built in. Like ANI, DPMD and DeepPot-SE are based on NNs with local descriptors. However, DeepPot-SE switched to a learned local descriptor rather than the fixed one of its predecessor.
In DeepPot-SE, the generalized local environment matrix \(\widetilde{\mathbf{R}}^{i}\) (which is the descriptor of the original DPMD) and the local embedding matrix \({\mathbf{G}}^{i}\) are used to represent the local environment of atom \(i\) with \({N}_{i}\) neighboring atoms. The matrix \(\widetilde{\mathbf{R}}^{i}\) has \({N}_{i}\) rows, and each row is defined from relative coordinates and distances as:
where:
The \({N}_{i}\times M\) matrix \({\mathbf{G}}^{i}\) is generated by the local embedding network (also called the filter network), which outputs an \(M\)-dimensional vector for each neighboring atom \(j\):
where \({g}_{k}\) is the \(k\)th output of the local embedding network applied to \(s\left({r}_{ij}\right)\).
The final descriptor, or the feature matrix \({\mathbf{D}}^{i}\) of atom \(i\), is defined by
where \({\mathbf{G}}^{i1}\) and \({\mathbf{G}}^{i2}\) are the first \({M}_{1}\) and \({M}_{2}\) columns of \({\mathbf{G}}^{i}\). Translational, rotational, and permutational invariance are preserved by these expressions.
The feature matrices are then passed to NNs that generate atomic energies, as in the ANI model.
The DeePMD-kit program comes with its own Python 3 environment, including TensorFlow and a LAMMPS interface for MD simulations. The training data need to be saved as plain text in a specified style and then transformed into the format the program can read using the provided scripts.
Training with DeePMD-kit is initialized with a JSON input script, in which options and parameters are defined. The main tunable hyperparameters of DeepPot-SE are listed in Table 4, while the hyperparameters of the NNs (e.g., hyperparameters for the network architecture, learning rate schedule, etc.) are not listed. However, this package cannot optimize those hyperparameters. Also, DeePMD-kit's native implementation does not include early stopping, a regularization scheme often required in NN models to control the number of training iterations and to stop the training before the model reaches the overfitting stage. Thus, we provide an external early stopping function as part of the interface module that monitors the training progress (based on the loss for the validation set) in the MLatom/DeePMD-kit output and stops the training when the criterion defined in the input has been reached.
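Conceptually, such an external early-stopping monitor amounts to a patience-based check on the validation loss (a generic sketch; the class and criterion names are ours, not MLatom's input keywords):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved by at least
    `min_delta` for `patience` consecutive checks."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_checks = float('inf'), 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_checks = val_loss, 0   # improvement: reset counter
        else:
            self.bad_checks += 1                       # no improvement this check
        return self.bad_checks >= self.patience
```

In the interface, the monitored quantity is the validation loss parsed from the training output, and reaching the criterion terminates the third-party training process.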
PhysNet
PhysNet [22] is another ML model type based on a learned local descriptor, but it uses a message-passing NN architecture as the underlying model.
In PhysNet, the embedding vectors \({\mathbf{e}}_{z}\) are used as the input vectors:
where the superscript \(l\) on a vector denotes the layer number (\(l=0\) stands for the input vector), and \({z}_{i}\) is the nuclear charge of atom \(i\). The number of features is defined by the hyperparameter \(F\).
Coordinates are transformed to \(\mathbf{g}\) by applying \(K\) radial basis functions and cutoff functions to internuclear distances:
where \(\beta\) and \({\mu }_{k}\) are parameters of radial basis functions and r_{ij} denotes pairwise Euclidean distance between atoms i and j.
Then \({\mathbf{x}}^{0}\) is passed through a stack of \({N}_{\mathrm{module}}\) modules which are connected in series, and \(\mathbf{g}\) is passed to each module.
A building block that is used repeatedly in the modules is the residual block. The residual block is a 2-layer mini residual neural network (ResNet), in which the input vectors also contribute directly to the output vectors by skipping over the layers in between:
where \(\mathbf{W}\) and \(\mathbf{b}\) consist of learnable weights and biases, and \(\sigma\) is the activation function.
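In NumPy, such a 2-layer residual block looks roughly as follows (a sketch; the shifted-softplus activation is the one commonly used in PhysNet-style networks, and the weight shapes are illustrative):

```python
import numpy as np

def shifted_softplus(x):
    """Shifted softplus activation: ln(0.5*exp(x) + 0.5), zero at x = 0."""
    return np.log(0.5 * np.exp(x) + 0.5)

def residual_block(x, W1, b1, W2, b2, sigma=shifted_softplus):
    """2-layer residual block (pre-activation form): the input x skips over
    both dense layers and is added directly to their output."""
    h = W1 @ sigma(x) + b1
    return x + (W2 @ sigma(h) + b2)
```

With all weights and biases at zero, the block reduces to the identity, which is exactly what makes deep stacks of such blocks easy to train.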
Inside each module, a prototype vector \(\widetilde{\mathbf{v}}\) is first generated by a NN:
where the elements of matrix \(\mathbf{G}\) are learnable coefficients for \({g}_{k}\left({r}_{ij}\right)\), and \(\circ\) is the Hadamard product operator.
This prototype \(\widetilde{\mathbf{v}}\) is then refined by \({N}_{\mathrm{residual}}^{\mathrm{interaction}}\) residual blocks to obtain the optimized vector \(\mathbf{v}\), which then ‘interacts’ with \(\mathbf{x}\):
where \(\mathbf{u}\) is also a learnable parameter.
After going through another \({N}_{\mathrm{residual}}^{\mathrm{atomic}}\) residual blocks, the input \(\mathbf{x}\) is passed separately to the next module (if it exists) and to the output block, which turns \(\mathbf{x}\) into the module's output \({\mathbf{y}}^{m}\); the latter contributes to the final output \(\mathbf{y}\) after a linear transformation whose parameters are also learnable.
Unlike the previously described models trained with local descriptors (GAP–SOAP, ANI, DeepPot-SE), PhysNet may also take long-range interactions (e.g., electrostatic and dispersion) into account. By default, dispersion corrections are enabled in the MLatom interface, while electrostatic corrections are disabled because their calculation requires additional input (reference dipole moments).
The official implementation of the PhysNet model is programmed in Python 3.6 with TensorFlow v1, and the data need to be stored in NumPy's npz format with a specific structure. Similar to TorchANI, using the PhysNet program also requires much manual work on script writing and hyperparameter tuning (see Table 5 for the list of the main hyperparameters).
Applications
In this section, we present several case studies demonstrating the capabilities of MLatom 2.
Case Study 1: Hyperparameter Optimization
As mentioned in section "Hyperopt" in "Interfaces", MLatom 2 provides a solution for the hyperparameter optimization problem through the interface to the hyperopt package. Here, we demonstrate a case of using the hyperopt interface to optimize the hyperparameters of the learning rate schedule of the DeepPot-SE model type (start_lr and decay_rate, Table 6). For this, we used PES data (including energy gradients) of ethanol from the MD17 data set [48]. A total of 1 k training points and 20 k test points were chosen randomly from the data set without overlap. Other technical details can be found in Fig. 7 and Table 6.
For comparison, we also took two sets of start_lr and decay_rate from the original literature on the DeepPot-SE [31] and DPMD [30] models (see Table 6). These values were used for the same data set but with much larger training set sizes (50 and 95 k).
As shown in Table 6, the optimization process achieved significantly better accuracy in predicting both energies and gradients than the literature values, even though the latter were used for a closely related task. This indicates that hyperparameter optimization is highly recommended for each new case, even for cases similar to previously published ones.
Case Study 2: Learning Curves
In this part, we provide examples of MLatom's learning curve task (see "Learning curves") to show how KRR performance varies with different molecular descriptors. The unsorted, sorted, and permuted RE descriptors, the unsorted and sorted Coulomb matrix, and the unsorted inverse distances descriptor were examined on ethanol, with energy data from the MD17 data set [48]. The descriptors are denoted as uRE, sRE, pRE, uCM, sCM, and uID, respectively; details for RE and CM can be found in subsections "KREG" and "Coulomb matrix" in section "Native implementations". The uID descriptor is not among the native implementations. Thus, it is provided externally to MLatom, which also demonstrates the support for external, user-defined descriptors. We used the Gaussian kernel throughout, i.e., KRR with the uRE, sRE, and pRE descriptors is equivalent to the corresponding KREG model variants. All these descriptors were tested with seven training set sizes roughly evenly spaced on the log scale from 100 to 10 k. Other training and testing details can be found in Fig. 8a.
The results (Fig. 8b) show the big impact of the choice of molecular descriptor on ML performance. First, let us look at the unsorted descriptors. The RE descriptor and CM are both related to the unnormalized inverse distances (ID) (see subsections "Native implementations" and "sGDML"). The uRE descriptor is a normalized version of the uID descriptor, with its advantages and disadvantages manifested in the differences between the corresponding learning curves. Normalization gives equal importance to both close-range and long-range interactions, which is detrimental to accuracy for scarce training data (up to 1 k training points in the case of this data set) but is advantageous when more training data are available. uCM has a product of nuclear charges in the numerator, which is different for each pair of elements and may introduce suboptimal weighting of the descriptor elements, leading to increased error relative to the uID descriptor [2].
The importance of properly taking into account the permutational invariance is demonstrated by the sorted and permuted descriptors. Sorting is the simplest approach, but it causes discontinuities in the interpolant and leads to much worse results even compared to the unsorted descriptors. One of the more solid approaches, the permutationally invariant kernel with the permuted RE descriptor, preserves permutational invariance without introducing such artifacts and achieves much better performance than uRE and better than uID (except for a very small number of training points).
Case Study 3: Δ-Learning and Structure-Based Sampling
Training with Δ-learning [9] and choosing training points using structure-based sampling [5] can yield ML models with better accuracy. To showcase the advantages of these approaches, we compared them to direct ML models of the target property and to random sampling. Here, we provide test results for four combinations of learning (direct vs. Δ) and sampling (random vs. structure-based) approaches using the KREG model type. The data set from reference [9] containing the PES of CH_{3}Cl calculated at several QM levels was used.
For the Δ-learning in this case study, the CCSD(T)/CBS energy \({E}_{\mathrm{CCSD}(\mathrm{T})}\) served as the target energy, while the MP2/aug-cc-pVQZ energy \({E}_{\mathrm{MP}2}\) was the baseline energy. Thus, the Δ-learning model \({\Delta }_{\mathrm{MP}2}^{\mathrm{CCSD}(\mathrm{T})}\) is defined by:

\({E}_{\mathrm{CCSD}(\mathrm{T})}\approx {\Delta }_{\mathrm{MP}2}^{\mathrm{CCSD}(\mathrm{T})}={E}_{\mathrm{MP}2}+\widehat{f}\left(\mathbf{x}\right),\)
where the ML model giving predictions \(\widehat{f}\left(\mathbf{x}\right)\) is trained on the differences between the target and baseline methods, \({E}_{\mathrm{CCSD}(\mathrm{T})}-{E}_{\mathrm{MP}2}\). Thus, the cost of the Δ-learning model is determined by the cost of the baseline QC method, and the user should provide MLatom with the values calculated with the baseline QC method for both training and prediction.
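The Δ-learning workflow can be illustrated with a toy one-dimensional example using a plain kernel ridge regression written in NumPy. The "baseline" and "target" functions below are arbitrary stand-ins for cheap and accurate QC methods, not real data:

```python
import numpy as np

# Toy Δ-learning sketch with hypothetical 1D "methods": a kernel ridge
# regression (KRR) model learns the difference between target and baseline
# energies; predictions add the learned correction back to the baseline.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.8, 2.0, (50, 1))                     # e.g. a bond length
e_baseline = lambda x: (x - 1.3) ** 2                        # cheap stand-in
e_target = lambda x: (x - 1.3) ** 2 + 0.05 * np.sin(5 * x)   # accurate stand-in

def krr_train(X, y, sigma=0.2, lam=1e-8):
    # Solve (K + lam*I) alpha = y with a Gaussian kernel matrix K.
    K = np.exp(-((X - X.T) ** 2) / (2 * sigma**2))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=0.2):
    K = np.exp(-((X_new - X_train.T) ** 2) / (2 * sigma**2))
    return K @ alpha

# Train only on the difference E_target - E_baseline ...
alpha = krr_train(x_train, (e_target(x_train) - e_baseline(x_train)).ravel())
# ... and add the baseline back at prediction time.
x_new = np.array([[1.5]])
e_pred = e_baseline(x_new).ravel() + krr_predict(x_train, alpha, x_new)
```

Because the ML model only has to learn the (typically small and smooth) correction, it usually needs far fewer training points than a direct model of the target energy itself.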
For sampling, we used 10% of the points of the whole data set as the training set and the remaining 90% as the test set. Also, as illustrated in Fig. 9a, the CH_{3}Cl molecule has three indistinguishable hydrogen atoms, so the permutationally invariant kernel was used (see "Permutationally Invariant Kernel").
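MLatom's structure-based sampling is more involved, but its closest simple relative, greedy farthest-point sampling in descriptor space, can be sketched as follows (a generic illustration, not MLatom's algorithm):

```python
import numpy as np

def farthest_point_sampling(X, n_select, seed_idx=0):
    """Greedy farthest-point sampling over descriptor vectors X of shape (n, d):
    repeatedly pick the point whose minimum distance to the already-selected
    set is largest, so the training set covers the data set evenly."""
    selected = [seed_idx]
    dmin = np.linalg.norm(X - X[seed_idx], axis=1)  # distance to selected set
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dmin))
        selected.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

Compared to random sampling, such coverage-oriented schemes avoid leaving large unexplored regions of the PES, which is one reason structure-based sampling improves accuracy in this case study.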
Figure 9c shows a sample input file and the scatter plots with RMSEs for all four combinations. Both Δ-learning and structure-based sampling led to much more accurate predictions of the CCSD(T)/CBS energy than the simplest combination of random sampling and direct learning, and even more so when the two approaches were combined.
Case Study 4: Absorption Spectrum
In this case study, we calculate the absorption cross section of an acridophosphine derivative [60] (Fig. 10) using the ML-NEA implementation in MLatom and discuss the effect of the number of points in the training set and in the nuclear ensemble. MLatom allows refining cross sections using existing data; we used this feature to perform all the simulations with QC data at the ωB97X-D [61]/def2-SVP [62,63,64] level of theory from our previous publication (see [60] for computational details; energies and oscillator strengths for 30 excited states are available at https://doi.org/10.6084/m9.figshare.13635353).
By default, MLatom determines the optimal number of training points iteratively by adding 50 points at each step, and the cross section is calculated using 50 k points in the nuclear ensemble. In our example, the ML-NEA procedure converged after 200 points. For comparison, the cross section obtained without ML using these 200 points in the nuclear ensemble (QC-NEA spectrum, Fig. 10a) has many peaks, and it is hard to judge which are actual peaks and which are artifacts of insufficient statistical sampling; the ML-NEA spectrum is much smoother, and the ML calculations incur only a small additional computational cost. The popular single-point convolution (SPC) approach gives a blue-shifted spectrum with an arbitrary bandwidth [65].
Although MLatom determines the training set size automatically, one can always request calculation of additional training points to check by visual inspection whether the spectrum has truly converged. For example, comparing the ML-NEA spectra calculated with 200, 250, and 300 points shows that they are very close to each other, with some minor deviations (Fig. 10b). The ML-NEA spectrum obtained with 300 points is nearly the same as the spectrum calculated with 2 k points. One should remember, however, that the accuracy of ML-NEA depends on the accuracy of the underlying QC method, and the difference between the experimental spectrum and ML-NEA, even with the largest training set of 2 k points, is bigger than the difference between the ML-NEA spectra with 200, 250, and 300 points (compare Fig. 10a, b). One minor aspect is that, although the default 50 k points is a rather large ensemble, it still leads to slightly rugged curves. Perfectly smooth curves can be obtained by simply increasing the number of ensemble points (Fig. 10c), and MLatom is a computationally efficient tool for this task; one can calculate very smooth curves with, e.g., 1 M points. One should keep in mind, though, that such a large number is not necessary: it takes about 10 GB of disk space to store the file with the nuclear-ensemble geometries, and more time to make the ML predictions and cross-section calculations.
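The core of the nuclear-ensemble approach, averaging Gaussian-broadened transitions over ensemble geometries, can be sketched as follows. Physical prefactors converting to absolute cross-section units are omitted, and the function name and broadening width are our own choices for illustration:

```python
import numpy as np

def nea_spectrum(energies, osc_strengths, grid, delta=0.05):
    """Nuclear-ensemble spectrum sketch: Gaussian-broadened transitions,
    summed over excited states and averaged over ensemble geometries.
    `energies` and `osc_strengths` have shape (n_geometries, n_states);
    `grid` is the array of photon energies at which to evaluate the spectrum."""
    e = energies[..., None]                    # (n_geom, n_states, 1)
    f = osc_strengths[..., None]
    g = np.exp(-((grid - e) ** 2) / (2 * delta**2)) / (delta * np.sqrt(2 * np.pi))
    return (f * g).sum(axis=1).mean(axis=0)    # spectrum on `grid`
```

The statistical noise of the ensemble average decays only with the number of geometries, which is why the curves become smoother in going from 50 k to 1 M ensemble points, and why cheap ML predictions make such large ensembles affordable.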
Conclusions
In this review article, we have described the MLatom 2 software package, which provides an integrative platform for AML simulations. Unlike other specialized AML packages, MLatom has been developed with the aim of facilitating the application of ML models to the wide variety of tasks often required in computational chemistry research.
Its capabilities range from native implementations, such as the KREG model and other KRR model types (with the Coulomb matrix or any other user-defined descriptors, as well as the Gaussian, Matérn, Laplacian, and exponential kernel functions), to interfaces to third-party packages with popular models. The latter models are overviewed here for the sake of completeness and include sGDML, GAP-SOAP, ANI, DPMD, DeepPot-SE, and PhysNet. Other AML model types can be implemented easily using the modular approach adopted in MLatom for third-party interfaces.
Other important MLatom features for AML simulations, such as model evaluation, hyperparameter optimization, sampling procedures (including farthest-point and structure-based sampling), Δ-learning, self-correction, and automatic learning curve generation, are overviewed too. We also discussed how all the steps required for absorption spectrum simulation with the machine-learning nuclear-ensemble approach (ML-NEA) are integrated in MLatom. Finally, we provided examples of how MLatom can be used for selected applications: hyperparameter optimization, learning curve generation, Δ-learning and structure-based sampling, and absorption spectrum simulation.
MLatom provides a user-friendly, integrated platform for researchers who want to use a wide variety of AML approaches and related techniques. It is also a useful package for educational purposes, as it is used for teaching the basic and advanced concepts behind the use of ML in quantum chemistry (see, e.g., the book chapter [2] and the online tutorials at http://MLatom.com). We are continually developing this platform based on the needs of practical AML computations, such as dynamics, calculation of excited-state properties, rovibrational spectrum simulations, improvement of QC methods, and materials design.
Availability of Data and Material
All the data used in this manuscript are available from the literature and online databases as cited in the article. No new data were generated in this study.
Code Availability
MLatom 2 is available from http://MLatom.com free of charge for non-commercial and non-profit uses, such as academic research and education. All interfaced third-party software should be obtained and installed by the user separately.
References
 1.
Dral PO (2020) Quantum chemistry in the age of machine learning. J Phys Chem Lett 11(6):2336–2347. https://doi.org/10.1021/acs.jpclett.9b03664
 2.
Dral PO (2020) Quantum chemistry assisted by machine learning. In: Ruud K, Brändas EJ (eds) Advances in quantum chemistry. Chemical physics and quantum chemistry, vol 81. Elsevier, Amsterdam, pp 291–324. https://doi.org/10.1016/bs.aiq.2020.05.002
 3.
Butler KT, Davies DW, Cartwright H, Isayev O, Walsh A (2018) Machine learning for molecular and materials science. Nature 559(7715):547–555. https://doi.org/10.1038/s41586-018-0337-2
 4.
von Lilienfeld OA, Müller KR, Tkatchenko A (2020) Exploring chemical compound space with quantum-based machine learning. Nat Rev Chem 4(7):347–358. https://doi.org/10.1038/s41570-020-0189-9
 5.
Manzhos S, Carrington T Jr (2020) Neural network potential energy surfaces for small molecules and reactions. Chem Rev. https://doi.org/10.1021/acs.chemrev.0c00665
 6.
Mueller T, Hernandez A, Wang C (2020) Machine learning for interatomic potential models. J Chem Phys 152(5):050902. https://doi.org/10.1063/1.5126336
 7.
Bartók AP, Csányi G (2015) Gaussian approximation potentials: a brief tutorial introduction. Int J Quantum Chem 115(16):1051–1057. https://doi.org/10.1002/qua.24927
 8.
Behler J (2016) Perspective: machine learning potentials for atomistic simulations. J Chem Phys 145(17):170901. https://doi.org/10.1063/1.4966192
 9.
Dral PO, Xue BX, Ge F, Hou YF, Pinheiro Jr M (2013–2021) MLatom: A Package for Atomistic Simulations with Machine Learning. Xiamen University, Xiamen, China, http://MLatom.com Accessed 23 Feb 2021
 10.
Dral PO (2019) MLatom: a program package for quantum chemical research assisted by machine learning. J Comput Chem 40(26):2339–2347. https://doi.org/10.1002/jcc.26004
 11.
Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA (2015) Big data meets quantum chemistry approximations: the Δ-machine learning approach. J Chem Theory Comput 11(5):2087–2096. https://doi.org/10.1021/acs.jctc.5b00099
 12.
Dral PO, Owens A, Yurchenko SN, Thiel W (2017) Structure-based sampling and self-correcting machine learning for accurate calculations of potential energy surfaces and vibrational levels. J Chem Phys 146(24):244108. https://doi.org/10.1063/1.4989536
 13.
Xue BX, Barbatti M, Dral PO (2020) Machine learning for absorption cross sections. J Phys Chem A 124(35):7199–7210. https://doi.org/10.1021/acs.jpca.0c05310
 14.
Rupp M, Tkatchenko A, Müller KR, von Lilienfeld OA (2012) Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett 108(5):058301. https://doi.org/10.1103/Physrevlett.108.058301
 15.
Hansen K, Montavon G, Biegler F, Fazli S, Rupp M, Scheffler M, von Lilienfeld OA, Tkatchenko A, Müller KR (2013) Assessment and validation of machine learning methods for predicting molecular atomization energies. J Chem Theory Comput 9(8):3404–3419. https://doi.org/10.1021/ct400195d
 16.
Dral PO, von Lilienfeld OA, Thiel W (2015) Machine learning of parameters for accurate semiempirical quantum chemical calculations. J Chem Theory Comput 11(5):2120–2125. https://doi.org/10.1021/acs.jctc.5b00141
 17.
Dral PO, Barbatti M, Thiel W (2018) Nonadiabatic excitedstate dynamics with machine learning. J Phys Chem Lett 9:5660–5663. https://doi.org/10.1021/acs.jpclett.8b02469
 18.
Dral PO, Owens A, Dral A, Csányi G (2020) Hierarchical machine learning of potential energy surfaces. J Chem Phys 152(20):204110. https://doi.org/10.1063/5.0006498
 19.
Chmiela S, Sauceda HE, Müller KR, Tkatchenko A (2018) Towards exact molecular dynamics simulations with machine-learned force fields. Nat Commun 9(1):3887. https://doi.org/10.1038/s41467-018-06169-2
 20.
Koner D, Meuwly M (2020) Permutationally invariant, reproducing kernel-based potential energy surfaces for polyatomic molecules: from formaldehyde to acetone. J Chem Theory Comput 16(9):5474–5484. https://doi.org/10.1021/acs.jctc.0c00535
 21.
Smith JS, Isayev O, Roitberg AE (2017) ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem Sci 8(4):3192–3203. https://doi.org/10.1039/c6sc05720a
 22.
Unke OT, Meuwly M (2019) PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. J Chem Theory Comput 15(6):3678–3693. https://doi.org/10.1021/acs.jctc.9b00181
 23.
van Rossum G (1995) Python tutorial, Technical Report CS-R9526. Centrum voor Wiskunde en Informatica (CWI), Amsterdam
 24.
Rossum GV, Drake FL (2009) Python 3 Reference Manual. CreateSpace, 100 Enterprise Way, Suite A200, Scotts Valley, CA
 25.
Chmiela S, Sauceda HE, Poltavsky I, Müller KR, Tkatchenko A (2019) sGDML: constructing accurate and data efficient molecular force fields using machine learning. Comput Phys Commun 240:38–45. https://doi.org/10.1016/j.cpc.2019.02.007
 26.
Bartók AP, Payne MC, Kondor R, Csányi G (2010) Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys Rev Lett 104(13):136403. https://doi.org/10.1103/Physrevlett.104.136403
 27.
Bartók AP, Kondor R, Csányi G (2013) On representing chemical environments. Phys Rev B 87(18):184115. https://doi.org/10.1103/physrevb.87.184115
 28.
Gao X, Ramezanghorbani F, Isayev O, Smith JS, Roitberg AE (2020) TorchANI: a free and open source PyTorch-based deep learning implementation of the ANI neural network potentials. J Chem Inf Model 60(7):3408–3415. https://doi.org/10.1021/acs.jcim.0c00451
 29.
Wang H, Zhang L, Han J, Weinan E (2018) DeePMD-kit: a deep learning package for many-body potential energy representation and molecular dynamics. Comput Phys Commun 228:178–184. https://doi.org/10.1016/j.cpc.2018.03.016
 30.
Zhang L, Han J, Wang H, Car R, Weinan E (2018) Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys Rev Lett 120(14):143001. https://doi.org/10.1103/PhysRevLett.120.143001
 31.
Zhang LF, Han JQ, Wang H, Saidi WA, Car R (2018) End-to-end symmetry preserving interatomic potential energy model for finite and extended systems. Adv Neural Inf Process Syst 31:4436–4446
 32.
Bergstra J, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyperparameter optimization. In: Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates, Red Hook, NY
 33.
Bergstra J, Yamins D, Cox DD (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, 2013. ICML'13. JMLR.org, pp I–115–I–123. https://doi.org/10.5555/3042817.3042832
 34.
Rezac J (2016) Cuby: an integrative framework for computational chemistry. J Comput Chem 37(13):1230–1237. https://doi.org/10.1002/jcc.24312
 35.
Himanen L, Jäger MOJ, Morooka EV, Federici Canova F, Ranawat YS, Gao DZ, Rinke P, Foster AS (2020) DScribe: library of descriptors for machine learning in materials science. Comput Phys Commun 247:106949. https://doi.org/10.1016/j.cpc.2019.106949
 36.
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
 37.
Christensen AS, von Lilienfeld OA (2020) On the role of gradients for machine learning of molecular energies and forces. Mach Learn Sci Technol 1(4):045018. https://doi.org/10.1088/26322153/abba6f
 38.
Behler J (2011) Neural network potential-energy surfaces in chemistry: a tool for large-scale simulations. Phys Chem Chem Phys 13(40):17930–17955. https://doi.org/10.1039/C1cp21668f
 39.
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Boston
 40.
Cortes C, Jackel LD, Solla SA, Vapnik V, Denker JS (1994) Learning curves: asymptotic values and rate of convergence. Advances in neural information processing systems. Morgan Kaufmann, San Mateo, CA, pp 327–334
 41.
Crespo-Otero R, Barbatti M (2012) Spectrum simulation and decomposition with nuclear ensemble: formal derivation and application to benzene, furan and 2-phenylfuran. Theor Chem Acc 131(6):1237. https://doi.org/10.1007/s00214-012-1237-4
 42.
Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Scalmani G, Barone V, Petersson GA, Nakatsuji H, Li X, Caricato M, Marenich AV, Bloino J, Janesko BG, Gomperts R, Mennucci B, Hratchian HP, Ortiz JV, Izmaylov AF, Sonnenberg JL, Williams-Young DJ, Ding F, Lipparini F, Egidi F, Goings J, Peng B, Petrone A, Henderson T, Ranasinghe D, Zakrzewski VG, Gao J, Rega N, Zheng G, Liang W, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Vreven T, Throssell K, Montgomery Jr. JA, Peralta JE, Ogliaro F, Bearpark MJ, Heyd JJ, Brothers EN, Kudin KN, Staroverov VN, Keith TA, Kobayashi R, Normand J, Raghavachari K, Rendell AP, Burant JC, Iyengar SS, Tomasi J, Cossi M, Millam JM, Klene M, Adamo C, Cammi R, Ochterski JW, Martin RL, Morokuma K, Farkas O, Foresman JB, Fox DJ (2016) Gaussian 16 Rev. C.01. Wallingford, CT
 43.
Barbatti M, Granucci G, Ruckenbauer M, Plasser F, CrespoOtero R, Pittner J, Persico M, Lischka H (2013) NEWTONX: a package for Newtonian dynamics close to the crossing seam. http://www.newtonx.org. Accessed 23 Feb 2021
 44.
Barbatti M, Ruckenbauer M, Plasser F, Pittner J, Granucci G, Persico M, Lischka H (2014) Newton-X: a surface-hopping program for nonadiabatic molecular dynamics. WIREs Comp Mol Sci 4(1):26–33. https://doi.org/10.1002/wcms.1158
 45.
Schinke R (1995) Photodissociation dynamics: spectroscopy and fragmentation of small polyatomic molecules. Cambridge University Press, Cambridge
 46.
Weisstein EW (2020) “Least Squares Fitting.” From MathWorld—A Wolfram Web Resource. https://mathworld.wolfram.com/LeastSquaresFitting.html. Accessed 25 Dec 2020
 47.
Schmitz G, Klinting EL, Christiansen O (2020) A Gaussian process regression adaptive density guided approach for potential energy surface construction. J Chem Phys 153(6):064105. https://doi.org/10.1063/5.0015344
 48.
Chmiela S, Tkatchenko A, Sauceda HE, Poltavsky I, Schütt KT, Müller KR (2017) Machine learning of accurate energy-conserving molecular force fields. Sci Adv 3(5):e1603015. https://doi.org/10.1126/sciadv.1603015
 49.
Denzel A, Kästner J (2018) Gaussian process regression for geometry optimization. J Chem Phys 148(9):094114. https://doi.org/10.1063/1.5017103
 50.
Fdez Galván I, Raggi G, Lindh R (2021) Restricted-variance constrained, reaction path, and transition state molecular optimizations using gradient-enhanced kriging. J Chem Theory Comput 17(1):571–582. https://doi.org/10.1021/acs.jctc.0c01163
 51.
Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D (1999) LAPACK users’ guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA
 52.
Hu D, Xie Y, Li X, Li L, Lan Z (2018) Inclusion of machine learning kernel ridge regression potential energy surfaces in on-the-fly nonadiabatic molecular dynamics simulation. J Phys Chem Lett 9:2725–2732. https://doi.org/10.1021/acs.jpclett.8b00684
 53.
Krämer M, Dohmen PM, Xie W, Holub D, Christensen AS, Elstner M (2020) Charge and exciton transfer simulations using machine-learned Hamiltonians. J Chem Theory Comput 16(7):4061–4070. https://doi.org/10.1021/acs.jctc.0c00246
 54.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker P, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X (2016) TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2016. USENIX Association. https://doi.org/10.5555/3026877.3026899
 55.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, Del Rio JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
 56.
Szlachta WJ, Bartók AP, Csányi G (2014) Accuracy and transferability of Gaussian approximation potential models for tungsten. Phys Rev B 90(10):104108. https://doi.org/10.1103/PhysRevB.90.104108
 57.
Taylor CD (2009) Connections between the energy functional and interaction potentials for materials simulations. Phys Rev B 80(2):024104. https://doi.org/10.1103/PhysRevB.80.024104
 58.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, highperformance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’AlchéBuc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Red Hook, NY, pp 8026–8037
 59.
Behler J, Parrinello M (2007) Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys Rev Lett 98(14):146401. https://doi.org/10.1103/Physrevlett.98.146401
 60.
Schaub TA, Brülls SM, Dral PO, Hampel F, Maid H, Kivala M (2017) Organic electron acceptors comprising a dicyanomethylene-bridged acridophosphine scaffold: the impact of the heteroatom. Chem Eur J 23(29):6988–6992. https://doi.org/10.1002/chem.201701412
 61.
Chai J-D, Head-Gordon M (2008) Long-range corrected hybrid density functionals with damped atom-atom dispersion corrections. Phys Chem Chem Phys 10(44):6615–6620. https://doi.org/10.1039/b810189b
 62.
Weigend F, Ahlrichs R (2005) Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: design and assessment of accuracy. Phys Chem Chem Phys 7(18):3297–3305. https://doi.org/10.1039/B508541a
 63.
Schäfer A, Huber C, Ahlrichs R (1994) Fully optimized contracted Gaussian basis sets of triple zeta valence quality for atoms Li to Kr. J Chem Phys 100(8):5829–5835. https://doi.org/10.1063/1.467146
 64.
Schäfer A, Horn H, Ahlrichs R (1992) Fully optimized contracted Gaussian basis sets for atoms Li to Kr. J Chem Phys 97(4):2571–2577
 65.
Bai S, Mansour R, Stojanovic L, Toldo JM, Barbatti M (2020) On the origin of the shift between vertical excitation and band maximum in molecular photoabsorption. J Mol Model 26(5):107. https://doi.org/10.1007/s00894-020-04355-y
Funding
P.O.D. acknowledges funding by the National Natural Science Foundation of China (No. 22003051) and via the Lab project of the State Key Laboratory of Physical Chemistry of Solid Surfaces. M.B. and M.P.J. acknowledge the support of the European Research Council (ERC) Advanced grant SubNano (Grant agreement 832237).
Author information
Affiliations
Contributions
P.O.D. is the MLatom project supervisor and lead developer. P.O.D. and Y.F.H. derived and implemented analytical gradients in MLatomF and made other improvements to the KREG model. F.G. implemented learning curves and interfaces to the hyperopt, PhysNet, DeePMD-kit, TorchANI, and GAP-SOAP packages. M.P.J. assisted in the implementation of the PhysNet, DeePMD-kit, and TorchANI interfaces. B.X.X. implemented the ML-NEA approach. J.H. assisted in the implementation of the interfaces to GAP-SOAP and DeePMD-kit. P.O.D. did all other implementations in MLatom. P.O.D. and M.B. conceived the project of testing third-party software. P.O.D., F.G., B.X.X., and Y.F.H. wrote the manuscript and prepared all illustrations; all other authors revised, discussed, and commented on the manuscript.
Corresponding author
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection “New Horizon in Computational Chemistry Software” edited by Michael Filatov, Cheol H. Choi and Massimo Olivucci.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dral, P.O., Ge, F., Xue, B.-X. et al. MLatom 2: An Integrative Platform for Atomistic Machine Learning. Top Curr Chem (Z) 379, 27 (2021). https://doi.org/10.1007/s41061-021-00339-5
Received:
Accepted:
Published:
Keywords
 Machine learning
 Quantum chemistry
 Kernel ridge regression
 Neural networks
 Gaussian process regression