Abstract
This chapter describes recent advances in the use of machine learning techniques in reactive atomistic simulations. In particular, it provides an overview of techniques used in training force fields with closed-form potentials, developing machine learning-based potentials, using machine learning to accelerate the simulation process, and analytics techniques for drawing insights from simulation results. The chapter covers basic machine learning techniques, training procedures and loss functions, issues of offline and inline training, and associated numerical and algorithmic issues. It highlights key outstanding challenges, promising approaches, and potential future developments. While the chapter relies on reactive atomistic simulations to motivate models and methods, these are more generally applicable to other modeling paradigms for reactive flows.
1 Introduction and Overview
Time-dependent reactive simulations involve complex interaction models that must be trained using experimental or highly resolved simulation data. The training process, as well as data acquisition, is often computationally expensive. Once trained, the coupling models are incorporated into reactive simulation procedures that involve small timesteps and generate large amounts of data that must be effectively analyzed to draw scientific insights. The past few decades have witnessed significant advances in each of these facets. More recently, increasing attention has been focused on the development and application of machine learning (ML) techniques for increasing the accuracy, generalizability, and speed of such simulations.
In this chapter, we provide an overview of ML models and methods, along with their use in reactive particle simulations. We use highly resolved reactive atomistic simulations as the model problem for motivating and describing ML methods. We start by presenting an overview of common ML techniques that are broadly used in the field. We then present the use of these techniques in training interaction models for reactive atomistic simulations. Recent work has focused on overcoming the timestep constraints of conventional reactive atomistic methods; we describe these methods and survey key results in the area. Finally, we discuss the use of ML techniques in analyzing atomistic trajectories. The goal of this chapter is to provide readers with a broad understanding of the state of the art in the area, unresolved challenges, and available methods and software for constructing simulations in diverse application domains. While we use reactive atomistics as our model problem, the discussion is broadly applicable to other particle-based / discrete-element simulation paradigms.
Reactive atomistic simulations provide understanding of chemical processes at the atomic level that is usually not accessible through common experimental techniques. Quantum chemistry methods have come a long way in modeling electronic structures and subsequent chemical changes at the scale of a few atoms. However, if the interest is in the thermodynamics of chemical reactions, then atomistic techniques are the methods of choice. Here, individual reactions are modeled in an approximate sense, but the system size (or particle number) approaches the thermodynamic limit (or a suitable approximation thereof, i.e., as large as practical). One of the simplest sampling techniques used in atomistic simulations is molecular dynamics, which provides a pseudo-Newtonian trajectory of the system and is applicable to modeling equilibrium as well as nonequilibrium problems. There are other sampling techniques, such as Monte Carlo methods, which are exclusively applicable to equilibrium statistical mechanical models. In this chapter, we primarily focus on reactive molecular dynamics techniques.
1.1 Molecular Dynamics, Reactive Force Fields and the Concept of Bond Order
Molecular Dynamics (MD) is a widely adopted method for studying diverse molecular systems at an atomistic level, with applications ranging from biophysics to chemistry and materials science. While quantum mechanical (QM) models provide highly accurate results, they are of limited applicability in terms of spatial and temporal scales. MD simulations rely on parameterized force fields that enable the study of larger systems (with millions to billions of degrees of freedom) using atomistic models that are computationally tractable and scalable on large computer systems. Typical applications of MD range from computational drug discovery to the design of new materials.
MD is an active field in terms of the development of new techniques. In its most conventional form (i.e., classical MD), it relies on the Born–Oppenheimer approximation, where atomic nuclei and the core electrons together are treated as classical point particles and the interactions of outer electrons are approximated by pairwise and "many-body" terms such as bond, angle, torsion and non-bonded interactions, and additionally by using variable charge models. Each interaction is described by a parametric mathematical formula to compute relevant energies and forces. The collection of various interactions used to describe a molecular system is called a force field. Figure 1 illustrates interactions commonly used in various force fields. Equation 1 gives an example of a simple force field:

\[ E = \sum _{\text {bonds}} K_b (r - r_0)^2 + \sum _{\text {angles}} K_a (\theta - \theta _0)^2 + \sum _{\text {torsions}} V_d \left[ 1 + \cos (\nu \phi - \phi _0) \right] + \sum _{i < j} \left\{ 4 \epsilon _{ij} \left[ \left( \frac{\sigma _{ij}}{r_{ij}} \right) ^{12} - \left( \frac{\sigma _{ij}}{r_{ij}} \right) ^{6} \right] + \frac{q_i q_j}{4 \pi \epsilon \, r_{ij}} \right\} \qquad (1) \]

where \(K_b, r_0, K_a, \theta _0, V_d, \phi _0, \epsilon _{ij}, \nu \) and \(\sigma _{ij}\) denote parameters that are specific to the types of interacting atoms (which may be a pair, triplet, or quadruplet of atoms), \(q_i\) denotes the partial charge of atom i, and \(\epsilon \) denotes some global parameter.
Classical MD models, as implemented in highly popular MD software such as Amber (Case et al. 2021), LAMMPS (Thompson et al. 2022), GROMACS (Hess et al. 2008) and NAMD (Phillips et al. 2005), are based on the assumption of static chemical bonds and, in general, static charges. Therefore, they are not applicable to modeling phenomena where chemical reactions and charge polarization effects play a significant role. To address this gap, reactive force fields (e.g., ReaxFF, Senftle et al. (2016); REBO, Stuart et al. (2000); Tersoff (1989)) have been developed. Functional forms for reactive potentials are significantly more complex than their nonreactive counterparts due to the presence of dynamic bonds and charges. The development of an accurate force field (be it nonreactive or reactive) is a tedious task that relies heavily on biological and/or chemical intuition. More recently, machine learning-based potentials have been proposed to alleviate the burden of force field design and fitting. Even so, the most computationally efficient way to study a large reactive molecular system, as would be necessary in a reactive flow application, is a well-tuned reactive force field model. Hence, this chapter focuses on reactive force fields, and specifically on ReaxFF whenever it is necessary to discuss specific methods and results, since covering all reactive force field models would necessitate a significantly longer discussion. Nevertheless, models and methods discussed for ReaxFF are broadly applicable to other reactive force fields as well.
Bond order is a key concept in reactive simulations; it models the overlap of electronic orbitals. This is intrinsically ambiguous in classical simulations because of approximations in assigning the bond index and bond type based on wave function overlaps (Dick and Freund 1983). In classical reactive simulations, bond order is defined as a smooth function that vanishes with increasing distance between atoms (van Duin et al. 2001). Clearly, such a function must depend on the environment of the atoms to correctly reproduce valencies. In nonreactive classical simulations, bond structure is maintained either by applying constraints where a bond is expected to exist, or by assigning a large energy penalty (typically in the form of a harmonic potential, see e.g. Eq. (1)) if the atoms deviate from the expected bond length (Frenkel and Smit 2002). In either case, an improperly optimized force field can lead to divergent energies or breakdown of the constraint algorithms. Reactive systems, in contrast, have bond orders that smoothly go to zero; they usually do not suffer from this problem, but may end up with an unphysical final structure. Recently proposed ML-based approaches depend only on the atomic positions and sometimes on momenta, but do not carry information on molecular topology. Consequently, such approaches are well-suited for describing reactive simulations.
1.2 Accuracy, Complexity, and Transferability
Three key aspects must be considered when formulating simulation models: (i) Accuracy: A simulation is expected to reproduce the structure as well as the chemical reactions and reaction rates of the model system against the target data. If a model has a sufficient number of free parameters, then, in principle, such a model can accurately describe the physical system. However, the choice of model and its size depend on the availability of target training data, which are usually highly resolved quantum chemistry calculations ranging from Density Functional Theory (DFT) to coupled cluster theory, along with basis sets specifying the desired level of accuracy; (ii) Complexity: For any simulation model, the complexity increases with the number of terms and free parameters in force computations (Frenkel and Smit 2002). Thus, the accuracy of the model goes hand in hand with its complexity. Ideally, we would like a high-accuracy, low-complexity model. Consequently, a clever use of target data for extracting accurate results from a relatively simple model, or alternately, approximations that represent a minimal compromise on accuracy for a significant reduction in model complexity, are desirable; and (iii) Transferability: The models are expected to provide physical insight into the system by reproducing correct properties for different types of systems beyond the training data. This is usually achieved by breaking down the interaction terms into corresponding physical concepts, e.g., bond interaction, angle interaction, shielded 1–4 interaction, etc. Each of these interactions, although suitably abstracted, represents a physical concept that is expected to exhibit similar interaction behavior under different conditions. Thus the total interaction can be computed as a combination of such transferable terms (Frenkel and Smit 2002). We note that the target data (usually obtained using quantum calculations) are not split into such physical abstractions.
This gives rise to numerous models with similar accuracy and varying degrees of transferability. Commonly used reactive potentials such as REBO or ReaxFF are built with transferability as a key consideration. However, even within the limited domain of atomic types and environments, these simulations rarely produce accurate results for a wide variety of problems without requiring a retuning of the force field parameters. Unlike fixed-form potential simulations, machine-learned potentials focus on transferability of the model to atomic environments similar to the training datasets, and optimize for higher accuracy as well as lower complexity.
In the rest of this chapter, we describe how reactive interaction models are constructed, trained, and used in accelerating simulations, in particular by making use of ML-based techniques. We begin our discussion with an overview of common ML models and methods, followed by their use in the simulation toolchain.
2 Machine Learning and Optimization Techniques
We begin our discussion with an overview of general ML techniques. This literature is vast and rapidly evolving. For this reason, we restrict ourselves to common ML techniques as they apply to reactive particle-based simulations.
ML frameworks typically comprise a model, a suitably specified cost function, and a training set over which the cost function is minimized. An ML model corresponds to an abstraction of the physical system (e.g., the force on an atom in its atomic context) and has a number of parameters that must be suitably instantiated. The cost function corresponds to the mismatch between the output of the model and physical (experimental or high-resolution simulated) data. Minimizing the cost function yields the necessary parametrization of the model. Training data are used to match the model output with the target distribution; at the heart of ML procedures is the optimization technique used to perform this matching.
The cost function in typical ML applications is averaged over the training set:

\[ J(\theta ) = \mathbb {E}_{(\textbf{x}, y) \sim P_{data}} \, L(f(\textbf{x}; \theta ), y) \qquad (2) \]

Here, \(J(\cdot )\) represents the cost function, \(P_{data}\) represents the empirical distribution (i.e., the training set), \(L(\cdot )\) is the loss function that quantifies the difference between the estimated and true values, and \(f(\cdot )\) is a prediction function parameterized by \(\theta \). A key point to note here is that we operate on empirical data, and not the "true" data distribution. Hence, this approach is also called empirical risk minimization (Vapnik 1991). The assumption is that minimizing the loss w.r.t. the empirical data will (indirectly) minimize the loss w.r.t. the true data distribution, thereby allowing for generalizability (i.e., the ability to make predictions on unseen data samples). In the rest of this section, we discuss continuous and discrete optimization strategies commonly used in ML formulations.
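As a simple illustration, the Python sketch below evaluates the empirical risk of a one-parameter linear model under a squared loss; the synthetic dataset and the model itself are illustrative choices, not tied to any particular simulation.

```python
import numpy as np

# Empirical risk of a one-parameter linear model f(x; theta) = theta * x
# under a squared loss; the dataset here is synthetic and illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)   # targets near y = 2x

def f(x, theta):
    """Prediction function parameterized by theta."""
    return theta * x

def empirical_risk(theta):
    """Average loss L over the empirical distribution (the training set)."""
    return np.mean((f(x, theta) - y) ** 2)
```

Minimizing `empirical_risk` over `theta` recovers a slope close to the one that generated the data, which is exactly the empirical risk minimization principle described above.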
2.1 Continuous Optimization for Convex and Nonconvex Problems
In many applications, the objective function in Eq. 2 is continuous and differentiable. For such applications, a key consideration is whether the function is convex or nonconvex (recall that a real-valued convex function is one in which the line joining any two points on the graph of the function does not lie below the graph at any point in the interval between the two points). Simple approaches to optimizing convex functions start from an initial guess, compute the gradient, and take a step along the negative gradient. This process is repeated until the gradient is sufficiently small (i.e., the function is close to its minimum). In ML applications, the step size is determined by the gradient and the learning rate: the smaller the gradient, the smaller the step. Convex objective functions arise in models such as logistic regression and single-layer neural networks.
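This basic iteration can be sketched in a few lines; the quadratic objective, starting point, and learning rate below are illustrative choices, not tied to any particular ML model.

```python
# Gradient descent on the convex quadratic J(theta) = (theta - 3)^2; the
# objective, starting point, and learning rate are illustrative choices.
def grad(theta):
    return 2.0 * (theta - 3.0)   # derivative of J

theta = 0.0                      # initial guess
lr = 0.1                         # learning rate
for _ in range(200):             # iterate until the gradient is (numerically) small
    theta -= lr * grad(theta)    # step along the negative gradient
# theta now lies very close to the minimizer at 3
```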
In more general ML models, such as deep neural networks, the objective function (Eq. 2) is not convex. Optimizing nonconvex objective functions in high dimensions is a computationally hard problem. For this reason, most current optimizers use a gradient descent approach (or a variant thereof) to find a local minimum in the objective function space. It is important to note that a point of zero gradient may be a local minimum or a saddle point. Common solvers rely on randomization and noise introduced by sampling to escape saddle points. In deep learning applications, the problem of computing the gradient can be elegantly cast as a backpropagation operation, making it computationally simple and inexpensive. Optimization methods that use the entire training set to compute the gradient are called batch or deterministic methods (Rumelhart et al. 1986). Methods that operate on small subsets of the dataset (called mini-batches) are called stochastic methods. In this context, a complete pass over the training dataset sampled in mini-batches is called an epoch. Stochastic Gradient Descent (SGD) methods are workhorses for training deep neural network models.
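A minimal mini-batch SGD loop for least-squares linear regression can be sketched as follows; the synthetic data, batch size, and learning rate are illustrative choices.

```python
import numpy as np

# Mini-batch SGD for least-squares linear regression; the data, batch size,
# and learning rate are illustrative choices.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=512)

w = np.zeros(3)
lr, batch = 0.05, 32
for epoch in range(50):              # one epoch = one full pass over the data
    order = rng.permutation(len(X))  # reshuffle into new mini-batches
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]
        g = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)  # gradient on the mini-batch
        w -= lr * g
```

Each update uses only a small random subset of the data, which is what distinguishes the stochastic method from its batch counterpart.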
First-order methods such as SGD suffer from slow convergence, lack of robustness, and the need to tune a large number of hyperparameters. Indeed, model training using SGD-type methods incurs most of its computational cost in exploring the high-dimensional hyperparameter space to find model parametrizations with high accuracy and good generalization properties (Goodfellow et al. 2014). These problems have motivated significant recent research into second-order methods and their variants. Second-order methods scale different components of the gradient suitably to accelerate convergence. They also typically have far fewer hyperparameters, making the training process much simpler. However, these methods involve a product with the inverse of the dense Hessian matrix, which is computationally expensive. Solutions to this problem include statistical sampling, low-rank structures, and Kronecker products as approximations for the Hessian.
2.2 Discrete Optimization
In contrast to continuous optimization, in many applications, the variables and the objective function take discrete values, and thus the derivative of the objective function may not exist. This is often the case when optimizing parameters for force fields in atomistic models. Two major classes of techniques for discrete optimization are Integer Programming and Combinatorial Optimization. In Integer Programming, some (or all) variables are restricted to the space of integers, and the goal is to minimize an objective subject to specified constraints. In combinatorial optimization, the goal is to find the optimal object from a set of feasible discrete objects. Combinatorial optimization functions operate on discrete structures such as graphs and trees. The class of discrete optimization problems is typically computationally hard.
A commonly used discrete optimization procedure for force field optimization is the genetic algorithm (Katoch et al. 2021; Mirjalili 2019). A genetic algorithm starts with a population of potentially suboptimal candidate solutions. It successively selects candidates from this population (formally called selection) and combines them (formally called crossover) to generate new candidates. In many variants, mutations are introduced into the candidates to generate new candidates as well. A fitness function is used to screen these new candidates, and the fittest candidates are retained in the population. This process is repeated until the best candidates achieve the desired fitness. In the context of force field optimization, the process is initialized with a set of parametrizations. The fitness function corresponds to the accuracy with which a candidate reproduces the training data. The crossover function generates new candidates through operations such as exchanging corresponding parameters, or taking their min, max, average, or other simple combinations.
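The procedure can be sketched as follows for a toy parameter-fitting problem; the target vector, population size, and mutation settings are illustrative stand-ins, not a real force-field optimization.

```python
import random

# A genetic-algorithm sketch for a toy parameter-fitting problem. The target
# vector stands in for training data; population size, mutation rate, and
# crossover rules are illustrative choices, not a real force-field setup.
target = [0.5, 1.2, -0.3]

def fitness(candidate):
    # Higher is better: negative squared mismatch against the target.
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

def crossover(a, b):
    # Exchange corresponding parameters, or take their average.
    return [random.choice((x, y, (x + y) / 2)) for x, y in zip(a, b)]

def mutate(c, rate=0.2, scale=0.1):
    # Perturb each parameter with a small probability.
    return [x + random.gauss(0, scale) if random.random() < rate else x for x in c]

random.seed(0)
population = [[random.uniform(-2, 2) for _ in range(3)] for _ in range(40)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]   # selection: retain the fittest candidates
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(30)]
    population = parents + children

best = max(population, key=fitness)
```

Because the fittest parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.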
3 Machine Learning Models
While the field of ML is vast, it is common to classify ML algorithms into "supervised" and "unsupervised". In supervised learning algorithms, training data contain both features and labels. The goal is to learn a function that takes as input a feature vector and returns a predicted label. Supervised learning can further be categorized into classification and regression. When labels are categorical, the learning task is commonly called "classification". On the other hand, if the task is to predict a continuous numerical value, it is called "regression". In unsupervised learning algorithms, training data do not have labels. The goal of unsupervised algorithms is to analyze patterns in data without requiring annotation. Common examples of unsupervised algorithms include clustering and dimensionality reduction. We note that there are many other active areas of ML, such as reinforcement learning and semi-supervised learning, that are beyond the scope of this chapter. We refer interested readers to more exhaustive sources for a comprehensive discussion (Bishop and Nasrabadi 2006; Murphy 2012; Shalev-Shwartz and Ben-David 2014; Goodfellow et al. 2016).
3.1 Unsupervised Learning
The most commonly used unsupervised learning techniques are clustering and dimensionality reduction.
3.1.1 Clustering
In clustering, data represented as vectors are grouped together on the basis of some inherent structures (or patterns), typically characterized by their similarities or distances (Saxena et al. 2017; Gan et al. 2020). Clustering algorithms can be categorized on the basis of their outputs into: (i) crisp versus overlapping; or (ii) hard versus soft. In crisp clustering, each data point is assigned to exactly one cluster, whereas overlapping clustering algorithms allow multiple memberships for each data point. In hard clustering algorithms, a datapoint is assigned a 0/1 membership for every cluster (a 1 corresponding to the cluster to which the point is assigned). In soft clustering algorithms, each data point is assigned membership grades (typically in the 0–1 range) that indicate the degree to which the data point belongs to each cluster. If the grades are convex (i.e., they are positive and sum to 1), then they can be interpreted as the probabilities with which a data point belongs to each of the classes. In the general class of fuzzy clustering algorithms (Ruspini 1969), the convexity condition is not required.
Centroid-based clustering refers to algorithms where each cluster is represented by a single, "central" point, which may not be a part of the dataset. The most commonly used algorithm for centroid-based clustering (and indeed all of clustering) is the k-means algorithm of Lloyd (1982). Given a set of datapoints \([\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n]\), and a predefined number of clusters k, the objective function of k-means is given by:

\[ \mathop {\mathrm {arg\,min}}_{\textbf{C}} \; \sum _{i=1}^{k} \sum _{\textbf{x} \in \textbf{C}_i} \Vert \textbf{x} - \mu _i \Vert ^2 \]

where \(\textbf{C}\) is the union of non-overlapping clusters (\(\textbf{C} = \{\textbf{C}_1, \textbf{C}_2, \ldots , \textbf{C}_k \}\)), and \(\mu _i\) represents the mean of all datapoints belonging to cluster i. Stated otherwise, the objective of k-means clustering is to minimize the distance between datapoints and their assigned cluster centers (as represented by the means). The k-means problem is NP-hard, but approximation algorithms such as Lloyd's algorithm can efficiently find local optima.
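Lloyd's algorithm alternates between assigning points to the nearest centroid and recomputing centroids as cluster means; a minimal numpy sketch on illustrative synthetic data follows.

```python
import numpy as np

# Lloyd's algorithm for k-means on synthetic data (two well-separated blobs;
# the blob locations, k, and iteration count are illustrative choices).
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
                  rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))])

k = 2
# Initialize centroids at k distinct data points.
centroids = data[rng.choice(len(data), size=k, replace=False)]
for _ in range(20):
    # Assignment step: each point joins the cluster with the nearest centroid.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    # (kept unchanged if a cluster happens to be empty).
    centroids = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                          else centroids[i] for i in range(k)])
```

Each iteration can only decrease the k-means objective, so the procedure converges to a local optimum, as noted above.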
Distribution-based clustering algorithms work on the assumption that datapoints belonging to the same cluster are drawn from the same distribution. Common algorithms in this class assume that data follow a Gaussian Mixture Model, and typically solve the problem using the Expectation-Maximization (EM) approach. EM performs maximum likelihood estimation in the presence of latent variables. Each iteration has two steps: in the Expectation step (E-step), the latent variables are estimated; in the Maximization step (M-step), the parameters of the model are optimized to better fit the data. In fact, the aforementioned Lloyd's algorithm for k-means clustering is a simple instance of EM.
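The E- and M-steps can be sketched for a simple one-dimensional, two-component Gaussian mixture; fixing the variance (an illustrative simplification; full EM also re-estimates variances) keeps the update rules short.

```python
import numpy as np

# EM for a two-component 1-D Gaussian mixture with fixed, known variance
# (an illustrative simplification; full EM also re-estimates variances).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

mu = np.array([-1.0, 1.0])   # initial guesses for the component means
pi = np.array([0.5, 0.5])    # mixing weights
sigma = 0.5                  # fixed standard deviation

for _ in range(50):
    # E-step: estimate the (latent) responsibilities of each component.
    lik = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: re-estimate means and mixing weights to better fit the data.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)
```

The recovered means approach the centers of the two generating Gaussians, and the mixing weights remain a valid probability distribution.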
Density-based clustering is a class of spatial clustering algorithms in which a cluster is modeled as a dense region in data space that is spatially separated from other clusters. Density-based spatial clustering of applications with noise (DBSCAN) by Ester et al. (1996) is the most commonly used algorithm in this class. DBSCAN requires two parameters: (i) \(\epsilon \), the radius of the neighborhood around each point; and (ii) MinPts, the minimum number of points required to form a dense region. DBSCAN proceeds as follows: first, it finds the \(\epsilon \)-neighborhood of every point. Then, it designates points with at least MinPts neighbors as "core points". Next, it finds connected components of core points by inspecting the neighbors of each core point. Finally, each non-core point is assigned to a cluster if it is in the \(\epsilon \)-neighborhood of one of that cluster's core points; otherwise, it is identified as an outlier, or noise (Schubert et al. 2017).
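These steps can be sketched compactly in Python; the point set, \(\epsilon \), and MinPts values below are illustrative.

```python
import numpy as np

# A compact DBSCAN sketch following the steps above: find eps-neighborhoods,
# mark core points, grow clusters from core points, and label noise as -1.
def dbscan(points, eps, min_pts):
    n = len(points)
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in dists]
    core = [len(nb) >= min_pts for nb in neighbors]   # includes the point itself
    labels = [-1] * n                                 # -1 marks noise / unassigned
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:                 # grow over density-reachable points
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

# Two dense groups of three points each, plus one isolated outlier.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                [10.0, 10.0]])
labels = dbscan(pts, eps=0.5, min_pts=3)
```

On this toy input, the two dense groups form two clusters and the isolated point is labeled as noise.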
Hierarchical clustering refers to a family of clustering algorithms that seek to build a hierarchy of clusters (Maimon and Rokach 2005). The two common approaches to building these hierarchies are bottom-up and top-down. In bottom-up (or agglomerative) clustering, each datapoint initially belongs to a separate cluster. Small clusters are created on the basis of similarity (or proximity), and these clusters are merged repeatedly until all datapoints belong to a single cluster. The reverse process is performed in top-down (or divisive) clustering approaches, where a single cluster is split repeatedly until each datapoint forms its own cluster. The main parameters to choose are the metric (i.e., the distance measure) and the linkage criterion. Commonly used metrics are the L1 and L2 norms, Hamming distance, and inner products. The linkage criterion quantifies the distance between two clusters on the basis of distances between pairs of points across the clusters.
3.1.2 Dimensionality Reduction
Dimensionality reduction is an unsupervised technique common to many applications. Reducing dimensions produces a parsimonious denoised representation of data that is amenable to analysis by complex algorithms that would otherwise not be able to handle large amounts of raw data.
Linear Dimensionality Reduction Techniques
Principal component analysis (PCA) is perhaps the most commonly used linear dimensionality reduction technique. Principal components correspond to directions of maximum variation in the data. Projecting data onto these directions, consequently, maintains the dominant patterns in the data. The first step in PCA is to center the data around zero mean to ensure translational invariance. This is done by computing the mean of the rows of the data matrix M and subtracting it from each row to give a zero-centered data matrix \(M'\). A covariance matrix is then computed as the normalized form of \(M'^TM'\). Note that the \((i,j)\textrm{th}\) element of this covariance matrix is simply the covariance of the \(i\textrm{th}\) and \(j\textrm{th}\) columns of matrix \(M'\). The dominant directions are then computed as the dominant eigenvectors of this covariance matrix. Selecting the k dominant eigenvectors and projecting the data matrix M onto this subspace yields a k-dimensional data matrix that best preserves the variance in the data. A common approach to selecting k is to consider the drop in magnitude of the corresponding eigenvalues. PCA has several advantages: (i) by reducing the effective dimensionality of data, it reduces the cost of downstream processing; (ii) by retaining only the dominant directions of variance, it denoises the data; and (iii) it provides theoretical bounds on the loss of accuracy in terms of the dropped eigenvalues.
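The steps above translate directly into a short numpy sketch; the synthetic 2-D data, which varies mostly along one direction, is an illustrative choice.

```python
import numpy as np

# PCA via the covariance matrix, following the steps above. Rows of M are
# samples, columns are features; the synthetic 2-D data below varies mostly
# along the direction (1, 1) and is purely illustrative.
rng = np.random.default_rng(4)
t = rng.normal(size=200)
M = np.column_stack([t, t + 0.05 * rng.normal(size=200)])

Mc = M - M.mean(axis=0)                 # center each column to zero mean
cov = Mc.T @ Mc / (len(M) - 1)          # normalized covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder: dominant directions first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1                                   # keep the single dominant direction
projected = Mc @ eigvecs[:, :k]         # k-dimensional representation
```

The large gap between the first and second eigenvalues is exactly the "drop in magnitude" criterion for choosing k mentioned above.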
The general class of dimensionality reduction techniques also includes other matrix decomposition techniques. In general, these techniques express matrix data M as an approximate product of two matrices \(UV^T\); i.e., they minimize \(\Vert M - UV^T\Vert \). Various methods impose different constraints on matrices U and V, leading to a general class of methods that range from dimensionality reduction to commonly used clustering techniques. Perhaps the best-known technique in this class is the Singular Value Decomposition (SVD) (Golub and Reinsch 1971), which is closely related to PCA; here the columns of U and V are orthogonal, and the factors have rank k for some value of k. The orthogonality of the column space of these matrices makes them hard to interpret directly in the data space.
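A short numpy sketch of a truncated SVD follows; the random low-rank matrix is an illustrative choice.

```python
import numpy as np

# Truncated SVD: keep the k largest singular values/vectors to obtain a
# rank-k approximation. The random matrix (rank at most 8) is illustrative.
rng = np.random.default_rng(5)
M = rng.normal(size=(30, 8)) @ rng.normal(size=(8, 20))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

def rank_k(k):
    # Scale the leading columns of U by the singular values, then project.
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

Truncating at the true rank reconstructs M exactly (up to floating-point error), and keeping more singular values never increases the approximation error.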
In contrast to SVD, if matrix U is constrained to only positive entries and the columns of matrix U sum to 1, we get a decomposition called archetypal analysis. In this interpretation, the columns of V correspond to the corners of a convex hull of the points in matrix M, also known as pure samples or archetypes, and all data points are expressed as convex combinations of these archetypes. A major advantage of archetypal analysis is that archetypes are directly interpretable in the data space. Another closely related decomposition is nonnegative matrix factorization (NMF), which relaxes the orthogonality constraint of SVD and instead constrains the elements of the factors to be nonnegative (Gillis 2020). In doing so, it loses the error norm minimization properties of SVD, but gains interpretability. All of these methods can be used to identify patterns of coherent behavior among particles in the simulation. We refer interested readers to a comprehensive survey on linear dimensionality reduction methods by Cunningham and Ghahramani (2015).
Nonlinear Dimensionality Reduction
General nonlinear dimensionality reduction techniques are needed for data that reside on complex nonlinear manifolds. This is commonly the case for particle datasets in reactive environments. Nonlinear dimensionality reduction techniques typically operate in three steps: (i) embedding of data onto a low-dimensional manifold (in a high-dimensional space); (ii) defining suitable distance measures; and (iii) reducing dimensionality so as to preserve these distance measures. Among the more common nonlinear dimensionality reduction techniques is isometric feature mapping (Isomap). This technique first constructs a graph corresponding to the dataset by associating a node with each row of the data matrix, with edges connecting each node to its k nearest neighbors. This graph is then used to define distances between nodes in terms of shortest paths. Finally, techniques such as multidimensional scaling (MDS), a generalization of PCA that can use general distance matrices as opposed to the covariance matrices used by PCA, are used to compute low-dimensional representations of the matrix. An alternate approach uses the spectrum of a Laplace operator defined on the manifold to embed data points in a lower-dimensional space. Such techniques fall into the general class of Laplacian eigenmaps.
An alternate approach to nonlinear dimensionality reduction is the use of nonlinear transformations on data in conjunction with a suitable distance measure, followed by MDS for dimensionality reduction. The first two steps of this process (nonlinear transformation and distance measure computation) are often integrated into a single step through the specification of a kernel. The use of such a kernel with MDS is called kernel PCA. The key challenges in the use of these methods relate to: (i) suitable representation techniques (described in Sect. 5); (ii) the choice of kernel function; and (iii) appropriate scaling mechanisms, since distance matrices can have highly skewed distributions and the dominant directions may be determined by a small number of very large entries in the distance matrix. Common approaches to kernel selection rely on polynomial transformations of increasing degree until a suitable spectral gap is observed. Data representations and normalization are highly application- and context-dependent.
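A kernel-PCA sketch under these assumptions (a degree-2 polynomial kernel and synthetic data, both illustrative choices):

```python
import numpy as np

# A kernel-PCA sketch: apply a kernel to the data, double-center the kernel
# matrix, and use its dominant eigenvectors as low-dimensional coordinates
# (the MDS step). The degree-2 polynomial kernel and data are illustrative.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))

K = (1.0 + X @ X.T) ** 2              # polynomial kernel of degree 2
n = len(K)
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
Kc = J @ K @ J                        # double-centered kernel matrix

eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]     # dominant directions first
# Two-dimensional embedding scaled by the square roots of the eigenvalues.
embedding = eigvecs[:, order[:2]] * np.sqrt(np.abs(eigvals[order[:2]]))
```

The kernel replaces the explicit nonlinear transformation, so the reduced coordinates are obtained without ever forming the transformed feature vectors.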
Autoencoder and Deep Dimensionality Reduction
Autoencoders have recently been proposed for use in nonlinear dimensionality reduction (Kramer 1991; Schmidhuber 2015; Goodfellow et al. 2016). Autoencoders are feedforward neural networks (discussed in further detail in Sect. 3.2) that are trained to reproduce the identity function; i.e., the output of the autoencoder network is (approximately) the input itself. Dimensionality reduction is accomplished in this framework by having an intermediate layer with a small number of activation functions. Through this constraint, an autoencoder is trained to "encode" input data into a low-dimensional latent space, with the goal of "decoding" the input back. The output of the encoder therefore represents a nonlinear reduced-dimension representation of the input.
T-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP)
t-SNE (Maaten and Hinton 2008) and UMAP (McInnes et al. 2018) are commonly used nonlinear dimensionality reduction techniques for mapping data to two or three dimensions, primarily for visual analysis. t-SNE computes two probability distributions: one in the high-dimensional space and one in the low-dimensional space. These distributions are constructed so that two points that are close to each other in Euclidean space have similar probability values. In the high-dimensional space, a Gaussian distribution is centered at each data point, and a conditional probability is estimated for all other data points. These conditional probabilities are normalized to generate a global probability distribution over all pairs of points. For points in the low-dimensional space, t-SNE uses a Cauchy distribution to compute the probability distribution. The goal of dimensionality reduction then translates to minimizing the divergence (in terms of KL divergence) between these two distributions. This is typically done using gradient descent. In contrast to t-SNE, the closely related UMAP technique assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant (https://umap-learn.readthedocs.io/en/latest/). Both of these techniques are extensively used in the visualization of high-dimensional data.
3.2 Supervised Learning
The goal of supervised methods is to learn a function from input data vectors to output classes (labels) using training input-output examples. The function should "generalize", i.e., be able to accurately predict labels for unseen inputs. The general learning procedure is as follows: first, the data is split into train and test sets. Then, the function is learnt from the input-output training examples. The learnt function is applied to the test inputs to get predicted outputs. If the algorithm performs poorly on training examples, we say that the algorithm "underfits" the data. This typically occurs when the model is unable to capture the complexity of the data. When a learnt function performs well on training data but poorly (say, with low prediction accuracy) on test data, we say that the algorithm "overfits" to the train set. Overfitting occurs when the algorithm fits to noise, rather than true data patterns. The problem of balancing underfitting and overfitting is called the bias-variance tradeoff. Intuitively, we want the model to be sophisticated enough to capture complex data patterns, but on the other hand, we do not want to endow it with the ability to capture idiosyncrasies of the training examples.
The problem of overfitting can be controlled through a number of approaches. In cross-validation, the training set is further divided into subsets (or folds). The training procedure learns the function by leaving out one fold in every iteration and training on the remaining folds; the model is then validated on the held-out fold. The parameters of the model are optimized to ensure high cross-validation accuracy. Regularization is a technique in which a penalty term is added to the error function to prevent overfitting. Tikhonov regularization is one of the earliest examples of regularization, and is commonly used in linear regression. Early stopping is a form of regularization available when the learner uses iterative methods such as gradient descent. The key idea of early stopping is to continue training only while the learnt function keeps improving its performance on external (unseen) validation data; training is stopped when further improvement on the training set comes at the expense of validation performance. Other approaches to avoid overfitting include data augmentation (increasing the number of data points for training) and improved feature selection. Underfitting can be avoided by using more complex models (e.g., going from a linear to a nonlinear model), increasing training time, and reducing regularization.
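As a concrete sketch combining two of these ideas, the following toy example uses Tikhonov-regularized (ridge) linear regression and selects the regularization weight by k-fold cross-validation. The data set, fold count, and λ grid are illustrative choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Tikhonov-regularized least squares: w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, k=5):
    """Average validation MSE over k folds: train on k-1 folds,
    validate on the held-out fold."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy linear data

lambdas = [1e-3, 1e-1, 1.0, 10.0]
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)  # lambda chosen by CV
```

The regularization weight with the lowest cross-validation error is the one expected to generalize best to unseen data.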
3.2.1 Overview of Supervised Learning Algorithms
Supervised learning algorithms are often categorized as generative or discriminative. Generative algorithms aim to learn the distribution of each class of data, whereas discriminative algorithms aim to find boundaries between different classes. The naive Bayes classifier is a generative approach that uses Bayes' theorem with strong assumptions on independence between the features (Rish 2001). Given a d-dimensional data vector \(\textbf{x} = [x_1, x_2, \ldots , x_d]\), naive Bayes models the probability that \(\textbf{x}\) belongs to class k as follows:
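Under the naive conditional-independence assumption, this probability factorizes as (the omitted proportionality constant is the normalizing evidence term \(p(\textbf{x})\)):

$$p(C_k \mid \textbf{x}) \;\propto\; p(C_k) \prod _{i=1}^{d} p(x_i \mid C_k)$$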
In practice, the parameters for the distributions of features are estimated using maximum-likelihood estimation. Despite the strong assumptions made in naive Bayes, it works well in many practical settings. Linear Discriminant Analysis (LDA) is a binary classification algorithm that models the conditional probability densities \(p(\textbf{x} \mid C_k)\) as normal distributions with parameters \((\mu _k, \Sigma )\), where \(k \in \{0,1\}\) (McLachlan 2005). The simplifying assumption of homoscedasticity (i.e., the covariance matrices are the same for both classes) means that the classifier predicts class 1 if:
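Writing out the two Gaussian log-densities and cancelling the shared quadratic term, the decision rule reduces to a linear threshold (assuming equal class priors in this reconstruction):

$$\textbf{w}^{T}\textbf{x} > c, \qquad \textbf{w} = \Sigma ^{-1}(\mu _1 - \mu _0), \qquad c = \tfrac{1}{2}\, \textbf{w}^{T}(\mu _0 + \mu _1)$$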
More complex generative methods include Bayesian Networks and Hidden Markov Models.
The k-nearest neighbor (kNN) algorithm is an early, and still widely used, discriminative algorithm for both classification and regression. In classification, the label of a test data sample is obtained by a vote of the labels of its k nearest neighbors. In regression, kNN computes the predicted value of a test sample as a function (e.g., the average) of the corresponding values of its k nearest neighbors. Logistic regression uses a logistic function (logit) to model a binary dependent variable. In the training phase, the parameters of the logit function are learnt. Logistic regression is similar to LDA, but makes fewer assumptions.
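A minimal sketch of kNN classification (Euclidean distance, majority vote); the toy clusters below are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated 2-D clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3)  # -> 0
```

For regression, the majority vote would be replaced by, e.g., an average over the k neighbors' target values.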
Support Vector Machine (SVM) (Cortes and Vapnik 1995) is a widely used discriminative model for regression and classification. Given input data \([\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n]\) and corresponding labels \(y_1, y_2, \ldots , y_n\), where \(y_i \in \{-1,1\}, \forall i \in \{1,2,\ldots ,n\}\), SVM aims to optimize the following objective function:
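In its standard soft-margin form (with b denoting the bias of the separating hyperplane, a notational choice of this reconstruction), the objective is:

$$\min _{\textbf{w},\, b} \;\; \lambda \Vert \textbf{w}\Vert ^2 + \frac{1}{n} \sum _{i=1}^{n} \max \bigl (0,\; 1 - y_i(\textbf{w}^{T}\textbf{x}_i - b)\bigr )$$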
Here, the vector \(\textbf{w}\) is normal to the separating hyperplane and \(\lambda \) is the weight given to regularization. The \(\max (.)\) term is called the hinge-loss function; it permits a soft margin, allowing some training points to violate the margin. To handle nonlinear boundaries, SVMs typically use the so-called "kernel trick": implicitly mapping raw data to a high-dimensional representation lets linear learning algorithms learn nonlinear boundaries. The kernel function itself is a similarity measure. Common kernels include Fisher, polynomial, radial basis function (RBF), Gaussian, and sigmoid kernels. Other examples of discriminative methods include decision trees and random forests.
3.2.2 Neural Networks
Neural networks are interconnected groups of units called neurons that are organized in layers. The first layer is called the input layer, and typically has the same dimension as the input. The final layer is called the output layer. The outputs of neural networks can be predictions of class labels, images, text, etc. Each neuron in an intermediate layer receives a number of inputs. It computes a nonlinear function on a weighted sum of its inputs. The resulting output may be fed into a number of neurons in the next layer. The nonlinear function associated with a neuron is called an activation function. Common examples of activation functions include hyperbolic tangent (tanh), sigmoid, Rectified Linear Unit (ReLU), and Leaky ReLU, among many others.
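These activation functions, and the single-neuron computation described above, can be written compactly as follows (the input, weights, and bias values are arbitrary):

```python
import numpy as np

# Common activation functions applied to a neuron's weighted input z.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# np.tanh serves as the hyperbolic tangent activation.

def neuron(x, w, b, activation=np.tanh):
    """A single neuron: a nonlinear function of a weighted sum of inputs."""
    return activation(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.3, -0.2])
out = neuron(x, w, b=0.1, activation=relu)   # relu(-0.3 + 0.1) -> 0.0
```

A layer is simply many such neurons applied to the same input, with the weight vectors stacked into a matrix.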
There are two key steps to designing neural networks for specific tasks. The first step corresponds to design of the network architecture. This specifies the number of layers, connectivity, and types of neurons. The second step parametrizes weights on edges of the neural network using a suitable optimization procedure for matching the output distribution with the target distribution (as discussed earlier in Sect. 2.1).
The term deep learning is used to describe a family of machine learning models and methods whose architectures use neural networks as core components. The word "deep" refers to the fact that learning algorithms typically use neural network models with many layers, in contrast to shallow networks, which typically have one or two intermediate (or hidden) layers (Schmidhuber 2015).
3.2.3 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are neural networks that use convolutions to quantify local pattern matches. CNNs are feedforward networks with one or more convolution layers. CNNs are used extensively in the analysis of images and, more recently, graphs that model connected structures such as molecules. CNNs have an input layer, hidden layer(s), and an output layer. The input to a CNN is a tensor of the form \(\# inputs \times input\ height \times input\ width \times input\ channels\). The height and width parameters correspond to the size of the original images. The number of input channels is typically three (red, green, and blue) for images.
Each of the hidden layers can be one of: (i) a convolutional layer, (ii) a pooling layer, or (iii) a fully connected layer. A convolutional layer takes as input an image, or the output of another layer, and outputs a feature map. This produces a tensor of the form \(\#inputs \times \! feature\ height \times \! feature\ width \times \! feature\ channels\). Each neuron of a CNN processes only a small region of the input, called its receptive field. It convolves this input and passes the result on to the next layer. Pooling layers are used to reduce the dimensionality of the data. They do so by aggregating the outputs of neurons in the previous (convolutional) layer. Pooling strategies can be local (operating on a small subset of neurons) or global (operating on the entire feature map). Common pooling functions include max and average. In fully connected layers, each neuron is connected to every neuron in the next layer. They are often used as the penultimate layer before the output layer, where all features are combined to compute the prediction (i.e., the output). A neural network with only fully connected layers is also called a Multilayer Perceptron (MLP). From this point of view, CNNs are regularized forms of MLPs.
There are a number of parameters associated with CNNs that must be tuned. Specific to convolutional layers, the common parameters are stride, depth, and padding. The depth parameter of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. Stride controls the step with which the convolution filter is translated across the input. Padding allows the augmentation of the input with zeros at the border of the input volume. Other parameters include kernel size and pooling size. Kernel size specifies the number of pixels that are processed together, whereas pooling size controls the extent of downsampling. In common image processing networks, typical values are \(3 \times 3\) for kernels and \(2 \times 2\) for pooling.
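A naive single-channel convolution illustrating the stride and padding parameters can be sketched as follows (like most deep learning frameworks, this computes cross-correlation; the small input and kernel are illustrative):

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Naive single-channel 2-D convolution (cross-correlation) with
    explicit stride and zero-padding parameters."""
    if padding > 0:
        x = np.pad(x, padding)           # zero-pad the border
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value sees only a small receptive field of x.
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(1.0, 10.0).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]
k = np.ones((2, 2))
y = conv2d(x, k)                         # 2x2 output: [[12,16],[24,28]]
```

Increasing `stride` shrinks the output; increasing `padding` grows it, which is how convolutional layers control the spatial size of their feature maps.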
In addition to parameter tuning, regularization is also required to design robust CNNs. In addition to the generic regularization methods mentioned earlier (such as early stopping and L1/L2 regularization), there are CNN-specific approaches. Dropout is a common measure taken to regularize neural networks. Fully connected networks (or MLPs) are prone to overfitting because of their large number of connections. An intuitive way to resolve this issue is to leave out individual nodes (and the corresponding inbound and outbound edges) from the training procedure. Each node is left out with a probability p (p is usually set to 0.5). During the testing phase, the full network is used, with weights scaled so that each node's output matches its expected value over the ensemble of dropped-out networks. Other simple, CNN-specific techniques to control model capacity limit the number of units in hidden layers, the number of hidden layers, and the number of channels in each layer.
Commonly used CNN architectures include LeNet (LeCun et al. 1989), AlexNet (Krizhevsky et al. 2012), ResNet (He et al. 2016), Wide ResNet (Zagoruyko and Komodakis 2016), GoogleNet (Szegedy et al. 2015), VGG (Simonyan and Zisserman 2014), DenseNet (Huang et al. 2017), and Inception (v2 (Szegedy et al. 2016), v3 (Szegedy et al. 2016), v4 (Szegedy et al. 2017)).
3.2.4 Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a neural network in which nodes have internal hidden states, or memory. RNNs can therefore process (temporal) sequences of inputs. They are typically used in the analysis of speech signals, language translation, and handwriting recognition, and more recently in prediction of atomic trajectories in molecular dynamics simulations.
A key feature of RNNs is the ability to share parameters across different parts of the model. Given a sequence of inputs \([\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n]\), the state of an RNN at time t is given as
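in its standard form (Eq. 7), with \(\textbf{h}_t\) denoting the hidden state at time t,

$$\textbf{h}_t = f(\textbf{h}_{t-1}, \textbf{x}_t; \theta ) \qquad \qquad (7)$$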
where \(f(.)\) is the recurrent function and \(\theta \) is the set of shared parameters. From Eq. 7, one can see that RNNs predict future states on the basis of past states.
A generic RNN can, in theory, remember arbitrarily long-term dependencies. In practice, repeated use of backpropagation causes gradients to vanish (i.e., tend to zero) or explode (i.e., tend to infinity). Gated RNNs are designed to circumvent these issues. The most widely used gated RNNs are Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997; Gers et al. 2000) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). Recall that a regular activation neuron consists of a nonlinear function applied to a linear transformation of the input. In addition to this, LSTMs have an internal cell state (different from the hidden-state recurrence previously discussed), and a gating mechanism that controls the flow of information. In all, LSTMs have three gates—an input gate, a forget gate, and an output gate. Specifically, the forget gate allows a network to forget old states that have accumulated over time, thereby mitigating vanishing gradients. GRUs are similar to LSTMs, but with a simplified gating architecture: they combine the LSTM's input and forget gates into a single update gate, and use a reset gate to control how much of the past state is used in computing the new state. GRUs also merge the hidden and cell states. This results in a simpler architecture that requires fewer tensor operations. The problem of exploding gradients is handled by gradient clipping. Two common strategies in gradient clipping are: (i) value clipping—values above and below set thresholds are set to the respective thresholds, and (ii) norm clipping—rescaling the gradient values when their norm exceeds a chosen threshold. Using CNNs and RNNs as building blocks, we can develop complex NN frameworks such as Generative Adversarial Networks (GANs).
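The two clipping strategies can be sketched directly (the threshold values below are illustrative):

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Value clipping: components outside [-threshold, threshold] are
    set to the nearest threshold."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    """Norm clipping: rescale the whole gradient if its norm exceeds
    max_norm, preserving its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])            # gradient with norm 5
g_val = clip_by_value(g, 2.5)        # -> [2.5, -2.5]
g_norm = clip_by_norm(g, 1.0)        # rescaled to norm 1
```

Norm clipping is generally preferred in RNN training because it preserves the gradient direction, while value clipping can change it.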
3.2.5 Generative Adversarial Networks
A Generative Adversarial Network (GAN) is a neural network in which a zerosum game is contested by two neural networks—the generative network and the discriminative network (Goodfellow et al. 2014). The generative network learns to map a predefined latent space to the distribution of the dataset, whereas a discriminative network is used to predict whether an input instance is truly from the dataset or if it is the output of the generative network. The objective of the generative network is to fool the discriminative network (i.e., increase error of the discriminative network), whereas the objective of the discriminative network is to correctly identify true data. The training procedure for a GAN is as follows: first, the discriminative network is given several instances from the dataset, so that it learns the “true” distribution. The generative network is initially seeded with a random input. From there, the generative network creates candidates with the objective of fooling the discriminative network. Both networks have separate backpropagation procedures; the discriminator learns to distinguish the two sources of inputs, even as the generative network produces increasingly realistic data.
GANs have found a number of applications in the synthesis of (realistic) datasets. They have been successful in creating art, synthesizing virtual environments, generating photographs of synthetic faces, and designing animation characters. GANs are often used for the purpose of transfer learning, where knowledge obtained from training in one application can be used in another similar, but different, application.
3.2.6 Transfer Learning
Traditional machine learning is isolated, in that a model is trained in a very specific context, to perform a targeted task. The key idea in transfer learning is that new tasks learn from the knowledge gained in a previously trained task (Weiss et al. 2016). To formally define transfer learning, we first define domain and task. Let \(\mathcal {X}\) be a feature space, and \(\textbf{X}\) be the dataset (i.e., \(\textbf{X} = [\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n] \in \mathcal {X}\)). Similarly, let \(\mathcal {Y}\) be the label space and \(Y = \{y_1, y_2, \ldots , y_n\} \in \mathcal {Y}\) be the labels corresponding to the rows of \(\textbf{X}\). Further, let P(.) denote a probability distribution. A domain is defined as \(\mathcal {D} = \{\mathcal {X}, P(\textbf{X})\}\). Given a domain \(\mathcal {D}\), a task \(\mathcal {T}\) is defined as \(\mathcal {T} = \{\mathcal {Y}, P(Y \mid \textbf{X})\}\). Given source and target domains \(\mathcal {D}_S\) and \(\mathcal {D}_T\) and corresponding tasks \(\mathcal {T}_S\) and \(\mathcal {T}_T\), transfer learning aims to learn \(P(Y_T \mid \textbf{X}_T)\) using information from \(\mathcal {D}_S\) and \(\mathcal {D}_T\). In this setup, we can see that there are four possibilities: (i) \(\mathcal {X}_S \ne \mathcal {X}_T\), (ii) \(\mathcal {Y}_S \ne \mathcal {Y}_T\), (iii) \(P(\textbf{X}_S) \ne P(\textbf{X}_T)\), or (iv) \(P(Y_S \mid \textbf{X}_S) \ne P(Y_T \mid \textbf{X}_T)\). In (i), the feature spaces of the source and target domains are different. In (ii), the label spaces of the tasks are different, which happens in conjunction with (iv), where the conditional probabilities of labels are different. In (iii), the feature spaces of source and target domains are the same, while the marginal probabilities are different.
Case (iii) is interesting for simulations, because the feature spaces for source (simulation) and target (reality) are typically the same, but the marginal probabilities of observations in simulation and reality can be very different.
3.3 Software Infrastructure for Machine Learning Applications
A number of software packages and libraries have been developed over the last decade in support of ML applications in different contexts. Matrix computations are often performed using NumPy (Python) (Harris et al. 2020), Eigen (Guennebaud et al. 2010), and Armadillo (C++) (Sanderson and Curtin 2016, 2020). Standard machine learning methods—including clustering algorithms such as k-means and DBSCAN, classification algorithms such as SVM and LDA, regression, and dimensionality reduction—are available in Python packages such as SciPy (Virtanen et al. 2020) and Theano (Theano Development Team 2016), and in C++ packages such as MLPack (Curtin et al. 2018). Deep learning approaches are often implemented using libraries such as PyTorch (Paszke et al. 2019), TensorFlow (Abadi et al. 2015), Caffe (Jia et al. 2014), Microsoft Cognitive Toolkit, and DyNet (Neubig et al. 2017). We note that a number of machine learning packages written in a source language have readily available interfaces for other languages. For example, Caffe is written in C++, with interfaces available for both Python and MATLAB. Finally, we also note that Julia has wrappers for a number of the Python and C++ libraries.
4 ML Applications in Reactive Atomistic Simulations
Building on our basic toolkit of ML models and methods, we now describe recent advances in the use of ML techniques in reactive atomistic simulations. We focus on three core challenges—the use of ML techniques for training highly accurate atomistic interaction models, the use of ML techniques in accelerating simulations, and the use of ML methods for the analysis of atomistic trajectories. Our discussion applies broadly to particle methods; however, we use reactive atomistic simulations as our model problem. In particular, we use ReaxFF as the force field for simulations.
4.1 ML Techniques for Training Reactive Atomistic Models
Optimization of force-field parameters for target systems of interest is crucial for high fidelity in simulations. However, such optimizations cannot be specific to the sets of molecules present in the target system for two reasons: (i) the utility of a parameter set that only works for a particular system is marginal; and (ii) in a reactive simulation, the molecular composition of a system is expected to change as a result of the reactions during the course of a simulation. For this reason, reactive force field optimizations are performed at the level of groups of atoms, e.g., Ni/C/H, Si/O/H, etc. Nevertheless, the behaviour of a given group of atoms may show variations in different contexts such as combustion, aqueous systems, condensed-phase systems, and biochemical processes. Therefore, it may be desirable to create parameter sets optimized for different contexts (Senftle et al. 2016).
Reactive force fields such as ReaxFF are complex, with a large number of parameters that can be grouped into charge equilibration parameters, bond order parameters, and N-body interaction parameters (e.g., single-body, two-body, three-body, four-body, and nonbonded terms), in addition to system-wide global parameters. As the number of elements in a parameter set increases, force field optimization quickly becomes a challenging problem due to the high dimensionality and discrete nature of the search space. Several methods and software systems have been developed for force field optimization over the years, starting with more traditional methods early on and moving to ML-based methods more recently. After giving an overview of the force field optimization problem, we briefly review traditional methods first and then discuss the ML-based techniques, which mainly draw upon genetic algorithms (see Sect. 2.2) as well as the extensive ML software infrastructure that has been built recently (see Sect. 3.3).
4.1.1 Training Data and Validation Procedures
Training procedures for typical force fields require three inputs: (i) model parameters to be optimized; (ii) geometries, a set of atom clusters that describe the key characteristics of the system of interest (e.g., bond stretching, angle and torsion scans, reaction transition states, crystal structures, etc.); and (iii) training data, chemical and physical properties associated with these atom clusters (such as energy-minimized structures, relative energies for bond/angle/torsion scans, partial charges, and forces), which are typically obtained from high-fidelity quantum mechanical (QM) models or sometimes experiments, along with a function that combines these different types of training items into a quantifiable fitness value:
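A commonly used form of this fitness function (Eq. 8) is the weighted sum of squared errors over the N training items:

$$\text {Error}(m) = \sum _{i=1}^{N} \left( \frac{x_i - y_i}{\sigma _i} \right) ^2 \qquad \qquad (8)$$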
In Eq. 8, m represents the model with a given set of force field parameter values, \(x_i\) is the predicted training data value calculated using the model m, \(y_i\) is the ground truth value of the corresponding training data item, and \(\sigma _i^{-1}\) is the weight assigned to each training item.
Table 1 summarizes commonly used training data types and provides some examples. An energy-based training data item uses a linear relationship of different molecules (expressed through their identifiers), because relative energies, rather than absolute energies, drive the chemical and physical processes. For structural items, geometries must be energy minimized, as accurate prediction of the lowest energy states is crucial. For other training item types, energy minimization is optional, but usually preferred.
4.1.2 Global Methods for Reactive Force Field Optimization
The earliest ReaxFF optimization tool is the sequential one-parameter parabolic interpolation method (SOPPI) (van Duin et al. 1994). SOPPI uses a one-parameter-at-a-time approach, where consecutive single-parameter searches are performed until a certain convergence criterion is met. The algorithm is simple, but as the number of parameters increases, the number of one-parameter optimization steps needed for convergence increases drastically. Furthermore, the success of this method is highly dependent on the initial guess and on the order of the parameters to be optimized.
Due to the drawbacks of SOPPI, various global methods such as genetic or evolutionary algorithms (Dittner et al. 2015; JaramilloBotero et al. 2014; Larsson et al. 2013; Trnka et al. 2018), simulated annealing (SA) (Hubin et al. 2016; Iype et al. 2013) and particle swarm optimization (PSO) (Furman et al. 2018) have been investigated for force field optimization. We discuss some of the promising techniques below.
Genetic Algorithms (GA) often work well for global optimization because, via crossover, they can exploit (partial) separability of the optimization problem even in the absence of any explicit knowledge about its presence. They are also able to make long-range "jumps" in the search space. Because multiple individuals that have survived several selection rounds are continuously present, these "jumps," based on information interchange between individuals, have a high probability of landing at new, promising locations. Last but not least, by admitting operators other than the classic crossover and mutation steps, it is possible to extend GAs within this abstract metaheuristic framework with desirable features of other global optimization strategies, too. GAs are especially useful when dealing with challenging and time-critical optimization problems. The straightforward parallelism and intrinsically high scalability of GAs provide an advantage over other strategies that are either serial in nature or where parallelization yields only decoupled or loosely coupled task-level parallelism. An efficient and scalable implementation of GAs for ReaxFF is provided in the ogolem-spuremd software (Dittner et al. 2015), where the authors demonstrate convergence to fitness values similar to or better than those reported in the literature in a matter of a few hours of execution time through effective use of high-performance computers and advanced GA techniques.
Recently, other population-based global ReaxFF optimization methods have been proposed, such as the particle swarm optimization algorithm RiPSOGM (Furman et al. 2018), the covariance matrix adaptation evolution strategy (CMA-ES) (Shchygol et al. 2019), and the KVIK optimizer (Gaissmaier et al. 2022). Shchygol et al. (2019) explore different optimization choices for the CMA-ES method and the ogolem-spuremd software, as well as a Monte Carlo force field optimizer (MCFF), and they systematically compare these techniques using three training sets from the literature. Their CMA-ES method is an implementation of the stochastic gradient-free optimization algorithm proposed by Hansen (2006), where the main idea is to iteratively improve a multivariate normal distribution in the parameter space, starting from a user-provided initial guess, to find a distribution whose random samples minimize the objective function. The MCFF technique is based on the simulated annealing algorithm. In every iteration, MCFF makes a small random change to the parameter vector and computes the corresponding change in the error function. Any change that reduces the error is accepted; changes that increase the error are accepted with a predetermined probability. With sufficiently small random changes and acceptance rates, MCFF can become a rigorous global optimization method, but at very high computational cost. Through extensive benchmarking, Shchygol et al. conclude that while CMA-ES can often converge to the lowest error rates, it cannot do so on a consistent basis. The GA method employed by ogolem-spuremd can produce consistently good (but not necessarily the lowest) error rates, albeit at higher computational costs compared to CMA-ES. Overall, they found MCFF to underperform compared to CMA-ES and GA at similar computational costs.
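The MCFF-style acceptance rule described above can be sketched as a simple Metropolis-like loop. The quadratic toy error function, fixed acceptance probability, and step size below are illustrative stand-ins (MCFF itself anneals the acceptance of error-increasing moves via a temperature schedule):

```python
import numpy as np

def mc_optimize(error_fn, params, n_steps=2000, step=0.1,
                accept_prob=0.05, rng=None):
    """Random-walk parameter search: accept any change that lowers the
    error; accept error-increasing changes with a small fixed probability
    (a simulated-annealing schedule would decay this over time)."""
    if rng is None:
        rng = np.random.default_rng(0)
    err = error_fn(params)
    best, best_err = params.copy(), err
    for _ in range(n_steps):
        trial = params + step * rng.normal(size=params.shape)
        trial_err = error_fn(trial)
        if trial_err < err or rng.random() < accept_prob:
            params, err = trial, trial_err
            if err < best_err:
                best, best_err = params.copy(), err
    return best, best_err

# Toy "fitness": squared distance to a known parameter vector.
target = np.array([1.0, -2.0, 0.5])
error_fn = lambda p: float(np.sum((p - target) ** 2))
p0 = np.zeros(3)
best, best_err = mc_optimize(error_fn, p0)   # best_err well below error_fn(p0)
```

In an actual force field application, `error_fn` would be the (expensive) fitness of Eq. 8 evaluated over the full training set, which is what makes the large number of such evaluations costly.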
4.1.3 Machine Learning Based Search Methods
While global methods have proven successful for force-field optimization, in the absence of any gradient information these global search methods require a large number of potential energy evaluations and, as such, can be very costly. With the emergence of advanced tools to calculate the gradients of complex functions automatically, machine learning based techniques for the optimization of force fields have attracted interest.
iReaxFF: One of the earliest such attempts is the Intelligent-ReaxFF (iReaxFF) software (Guo et al. 2020). iReaxFF uses the TensorFlow library to automatically calculate gradient information, and uses local optimizers such as Adam or BFGS. An additional benefit of the TensorFlow implementation is that iReaxFF can automatically leverage GPU acceleration. However, iReaxFF does not have the expected flexibility in terms of training data, as it can only be trained to match the ReaxFF energies to the absolute energies from Density Functional Theory (DFT) computations on the training data; relative energies, charges, or geometry optimizations cannot be used in the training, essentially limiting its usability. As iReaxFF tries to exactly match the energies of the training data, the transferability of force fields generated by iReaxFF is also limited. While it is not clearly stated what kind of gradient information is calculated using TensorFlow, the definition of the loss function (the sum of the squared differences between absolute DFT and ReaxFF energies) suggests that the gradients are calculated with respect to atomic positions, which essentially amounts to performing a force-matching based force field optimization. The number of iterations required to reach the desired accuracies for their test cases is rather large, on the order of tens to hundreds of thousands of iterations. Even with GPU acceleration, the training time for a test case reportedly takes several days. This is partly because iReaxFF does not filter out the unnecessary 2-body, 3-body, and 4-body interactions before the optimization step.
JAX-ReaxFF: Another recent effort in this direction is the JAX-ReaxFF software (Kaymak et al. 2022). JAX is an automatic differentiation library by Google for high-performance machine learning research, built on top of the XLA compiler (Bradbury et al. 2020); it can automatically differentiate native Python and NumPy functions. Leveraging this capability, JAX-ReaxFF automatically calculates the derivative of a given fitness function with respect to the set of force field parameters to be optimized, from a Python-based implementation of the ReaxFF potential energy terms. By obtaining gradient information for the high-dimensional optimization space (which generally includes tens to over a hundred parameters), JAX-ReaxFF can employ highly effective local optimization methods such as the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm (Zhu et al. 1997) and the Sequential Least Squares Programming (SLSQP) optimizer (Kraft 1988). The gradient information alone is obviously not sufficient to prevent local optimizers from getting stuck in local minima, but when combined with a multi-start approach, JAX-ReaxFF can greatly improve the training efficiency (measured in terms of the number of fitness function evaluations performed). As demonstrated on a diverse set of systems such as cobalt, silica, and disulfide, which were also used in other related work, JAX-ReaxFF can reduce the number of optimization iterations from tens to hundreds of thousands (as in CMA-ES, ogolem-spuremd, or iReaxFF) down to only a few tens of iterations.
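The multi-start idea can be illustrated independently of JAX, using a hand-written derivative of a toy nonconvex loss (where JAX-ReaxFF would obtain the gradient by automatic differentiation); all functions and constants below are illustrative:

```python
import numpy as np

def loss(x):
    """Toy nonconvex 1-D loss with several local minima."""
    return np.sin(3.0 * x) + 0.1 * x ** 2

def grad(x):
    # Hand-coded derivative; an autodiff tool would produce this.
    return 3.0 * np.cos(3.0 * x) + 0.2 * x

def gradient_descent(x0, lr=0.01, n_steps=500):
    """Simple local optimizer: each run converges to the minimum of
    whichever basin its starting point lies in."""
    x = x0
    for _ in range(n_steps):
        x -= lr * grad(x)
    return x

# Multi-start: run the local optimizer from several starting points
# and keep the best result found.
starts = np.linspace(-4.0, 4.0, 9)
results = [gradient_descent(x0) for x0 in starts]
best_x = min(results, key=loss)
```

Each local run is cheap because it follows the gradient; the spread of starting points is what provides (heuristic) protection against the local minima that a single gradient-based run would get stuck in.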
Another important advantage of JAX is its architectural portability, enabled by the XLA technology (Sabne 2020) used under the hood. Hence, JAX-ReaxFF can run efficiently on various architectures, including graphics processing units (GPUs) and tensor processing units (TPUs), through automatic thread parallelization and vector processing. By making use of efficient vectorization techniques and carefully trimming the 3-body and 4-body interaction lists, JAX-ReaxFF can reduce the overall training time by up to three orders of magnitude (down to a few minutes on GPUs) compared to existing global optimization schemes, while achieving similar (or better) fitness scores. The force fields produced by JAX-ReaxFF have been validated by measuring the macroscale properties (such as density and radial distribution functions) of their target systems.
Beyond speeding up force field optimization, the Python-based JAX-ReaxFF software provides an ideal sandbox environment for domain scientists: they can move beyond parameter optimization and start experimenting with the functional forms of the interactions in the model, adding new types of interactions or removing existing ones as desired. Since evaluating the gradient of the new functional forms with respect to atom positions gives the forces, scientists are freed from the burden of coding the lengthy and error-prone force calculation routines. Through automatic differentiation of the fitness function, as explained above, parameter optimization for the new set of functional forms can be performed without any additional effort by the domain scientists. After parameter optimization, they can readily run MD simulations to test the macroscale properties predicted by the modified set of functional forms as a further validation step before production-scale simulations, or go back to editing the functional forms if the desired results cannot be confirmed in this sandbox environment.
4.2 Accelerating Reactive Simulations
We now discuss how ML techniques can be directly used to accelerate reactive simulations and to improve their accuracy in different application contexts.
4.2.1 Machine Learning Potentials
At a high level, ML based potentials can be defined as follows (Behler 2016):

1.
The potential must establish a direct functional relation between an atomic configuration and the corresponding energy, where the functional must be based on an ML model. As an example, a feed-forward deep neural network may serve as such a functional, with the atomic configuration as input and the energy as output.

2.
Any physical approximations or theoretically grounded constraints are explicitly incorporated into the training data and are not part of the energy functional.
The second requirement in the definition distinguishes ML potentials from traditional fixed-form potentials. It also ensures that, for a “sufficiently complex” energy functional and a “sufficiently large and diverse” training set, an ML based potential can produce arbitrarily accurate model predictions. The training data are typically expected to be generated using a consistent and specific set of methods; it has been observed that mixing data from different QC techniques or experiments leads to poor learning outcomes. The size of the training set depends on the computational cost of generating the data and the accuracy expected of the ML model.
As with most traditional fixed-form potentials, the ML potential energy is expressed as a sum of local energies:
$$E = \sum _{i=1}^{N} E_i,$$
where the local energy \(E_i\) corresponds to the ML energy, which depends on the local neighborhood of the \(i\text {th}\) atom. The chemical environment of an atom is primarily determined by short range interactions (Kohn 1996). Long range interactions, which decay slower than \(r^{-2}\), are usually either cut off to zero at a distance \(R_c\) or smoothly reduced to zero using tapering functions; polynomial tapering functions are used in ReaxFF, for example. The accuracy of such a model depends on the cutoff distance \(R_c\): larger values of \(R_c\) lead to a better approximation of long range interactions. However, a larger \(R_c\) implies a larger atomic neighborhood (which grows as \(R_c^3\)), which means that more sample points are required in the training set. Thus \(R_c\) must be chosen to balance the quality of the long range approximation against the size of the neighborhood.
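The locality ansatz above can be illustrated with a short sketch. The quintic taper and exponential pair term below are illustrative stand-ins (they are not the ReaxFF taper coefficients or a trained ML energy); the point is that each atom's energy depends only on neighbors within \(R_c\), so distant atoms contribute nothing.

```python
# Minimal sketch of the locality ansatz: total energy as a sum of per-atom
# contributions computed from the neighborhood inside a cutoff R_C, with an
# illustrative smooth taper (not the actual ReaxFF taper coefficients).
import numpy as np

R_C = 3.0  # cutoff radius (arbitrary units)

def taper(r, r_c=R_C):
    """Quintic smoothstep-style taper: 1 at r=0, 0 (with zero slope) at r=r_c."""
    x = np.clip(r / r_c, 0.0, 1.0)
    return 1.0 - 10 * x**3 + 15 * x**4 - 6 * x**5

def local_energy(i, pos):
    """Stand-in 'ML' local energy: a tapered pair sum over the neighborhood."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    mask = (d > 0) & (d < R_C)          # only neighbors inside the cutoff
    return np.sum(taper(d[mask]) * np.exp(-d[mask]))

def total_energy(pos):
    return sum(local_energy(i, pos) for i in range(len(pos)))

# Third atom sits outside R_C of the others and contributes nothing.
pos = np.array([[0.0, 0, 0], [1.5, 0, 0], [10.0, 0, 0]])
```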
4.2.2 Training Considerations
ML potentials, like fixed-form potentials, require training. Here we briefly explore the steps in, and potential issues with, the design and training of ML potentials (see, e.g., Unke et al. (2021)).
Choice of quantum method used in generating training data:

Typically, ML-based simulations are orders of magnitude slower than equivalent fixed-form potential simulations (Brickel et al. 2019). However, unlike fixed-form potentials, ML potentials may offer accuracy similar to that of an ab initio method (Sauceda et al. 2020). It is therefore essential to choose an appropriate ab initio method. On one hand, if the ab initio method is very fast and/or less accurate, it defeats the purpose of further approximating these data with a machine-learnt model. On the other hand, a method such as CCSD(T) is so computationally expensive that it becomes difficult to generate enough training data for ML models.
How much data?

The amount of data needed depends on the size of the ML model, the desired accuracy, and the sampling technique used in producing the data set.
Sampling:

Sampling of training data over the domain of atomic configurations is crucial for training the potentials well. For models designed to simulate equilibrium problems, one can potentially rely on samples output by an ab initio molecular dynamics simulation. Depending on the desired accuracy, generating such samples can become prohibitively expensive. An alternative is to use metadynamics-type sampling techniques and generate samples in the vicinity of the free energy minima of the system. However, if the model is intended to address chemical reactions or transition states, then more uniform sampling is required, in which “rare events” are also sampled with relatively high frequency. The framework provided by an ML model does not include any “physics” of the problem; the training data must therefore sample the configuration space thoroughly enough to capture the relevant physics.
Training/validation and testing:

In the usual ML methodology, models are trained and tested against similarly structured but disjoint data sets. In this case, training and validation are performed on data sets that are similarly sampled but distinct. However, testing of the model is usually performed against bulk or physically measurable quantities computed using the trained model. ML potential frameworks often have hyperparameters that require a second optimization step; the testing phase must then be repeated for different hyperparameter values.
4.2.3 Descriptors
Unique description of the atomic neighborhood is a central issue in structure–function prediction problems in biophysics and materials science (Ghiringhelli et al. 2015; Deviller and Balaban 1999; Valle and Oganov 2010). For ML systems, such uniqueness is crucial for effective training. Thus, one must express any atomic neighborhood in a representation that is invariant with respect to the action of the symmetry group of the system. For three dimensional atomistic systems, this is the group of Galilean transformations together with the discrete group of atomic permutations. We summarize commonly used descriptors below, noting that the state of the art in this area is continually evolving.
Atom Centered Symmetry Function (ACSF)
This descriptor expresses the environment of the \(i\text {th}\) atom in terms of a Gaussian basis of varying widths and an angular basis at different resolutions. It uses a cosine taper function given by:
$$f_c(r_{ij}) = \begin{cases} \frac{1}{2}\left[ \cos \left( \frac{\pi r_{ij}}{R_c} \right) + 1 \right] , & r_{ij} \le R_c, \\ 0, & r_{ij} > R_c, \end{cases}$$
where \(r_{ij}\) is the distance between the \(i\text {th}\) and \(j\text {th}\) particles. This ensures that, when multiplied in, a pair quantity goes smoothly to zero as \(r_{ij}\) approaches \(R_c\) from below. Using this taper function, an atom centered descriptor can be written with radial and angular parts as:
$$G_i^{\text {rad}} = \sum _{j=1}^{n} e^{-\eta (r_{ij} - \mu )^2}\, f_c(r_{ij}),$$
$$G_i^{\text {ang}} = 2^{1-\zeta } \sum _{j,k \ne i} \left( 1 + \lambda \cos \theta _{ijk} \right) ^{\zeta } e^{-\eta \left( r_{ij}^2 + r_{ik}^2 + r_{jk}^2 \right) } f_c(r_{ij})\, f_c(r_{ik})\, f_c(r_{jk}),$$
where n is the number of neighbors within the cutoff distance \(R_c\) and \(\lambda = \pm 1\). The descriptor vector is generated by sampling the parameters \(\eta \), \(\zeta \), \(\mu \), and \(\lambda \). By design, ACSF produces a description that is invariant under translation and rotation. We note that the number of symmetry functions needed does not depend on n; it does, however, grow rapidly with the number of parameter combinations sampled. Typically, 50–100 symmetry functions with various parameter values are used per atom (Behler 2016). Furthermore, the number of functions required grows quadratically with the number of atom types in the model. ACSF can be generalized with additional weight functions to improve resolution and complexity (Gastegger et al. 2017).
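A minimal sketch of the radial part of an ACSF descriptor, assuming the standard cosine taper; the parameter values `etas` and `mus` below are illustrative, and the rotation and translation invariance follows because only interatomic distances enter the construction.

```python
# Sketch of radial atom-centered symmetry functions with a cosine taper.
# Parameter values (etas, mus) are illustrative, not tuned for any system.
import numpy as np

def f_cut(r, r_c):
    """Cosine taper: smooth decay to zero at the cutoff r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def acsf_radial(i, pos, etas, mus, r_c=6.0):
    """Descriptor vector for atom i: one radial G value per (eta, mu) pair."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    d = d[(d > 0) & (d < r_c)]          # neighbors inside the cutoff
    return np.array([np.sum(np.exp(-eta * (d - mu) ** 2) * f_cut(d, r_c))
                     for eta, mu in zip(etas, mus)])

pos = np.array([[0.0, 0, 0], [1.0, 0.5, 0], [-0.8, 1.2, 0.3]])
etas = [0.5, 0.5, 4.0]
mus = [0.0, 2.0, 1.0]
g = acsf_radial(0, pos, etas, mus)
```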
Coulomb Matrix (CM)
An alternative descriptor uses the Fourier transform of the Coulomb matrix (Rupp et al. 2012), which is defined as:
$$C_{ij} = \begin{cases} \frac{1}{2} Z_i^{2.4}, & i = j, \\ \dfrac{Z_i Z_j}{|\textbf{R}_i - \textbf{R}_j|}, & i \ne j, \end{cases}$$
where \(Z_i\) is the charge on the \(i\text {th}\) particle. This descriptor is invariant under the transformations listed above; however, it is computationally expensive unless restricted to a local Coulomb matrix (Rupp et al. 2012). The descriptor can be further generalized by using an Ewald matrix in place of the Coulomb matrix (Faber et al. 2015).
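A sketch of the Coulomb-matrix construction of Rupp et al. (2012). For brevity it uses the sorted eigenvalues of the matrix as a permutation-invariant summary (one common variant) rather than the Fourier-transform representation mentioned above:

```python
# Sketch of the Coulomb-matrix descriptor: pairwise Coulomb terms off the
# diagonal, a fitted self-interaction term (0.5 * Z^2.4) on the diagonal.
import numpy as np

def coulomb_matrix(Z, pos):
    n = len(Z)
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4                  # self term
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(pos[i] - pos[j])
    return C

def cm_descriptor(Z, pos):
    # Sorted eigenvalues are invariant under atom relabeling as well as
    # rigid motions of the molecule.
    return np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, pos)))

Z = np.array([8.0, 1.0, 1.0])                                # water-like
pos = np.array([[0.0, 0, 0], [0.96, 0, 0], [-0.24, 0.93, 0]])
desc = cm_descriptor(Z, pos)
```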
Bispectral Coefficients (BC)
In this descriptor, the atomic environment is represented as a local density that is expressed in terms of spherical harmonics on a four dimensional sphere. The density is written as a superposition of delta function densities using the taper function from Eq. (9) as:
$$\rho _i(\textbf{r}) = \delta (\textbf{r}) + \sum _{j} f_c(r_{ij})\, \omega _j\, \delta (\textbf{r} - \textbf{r}_{ij}),$$
where the dimensionless parameter \(\omega _j\) represents the atom type or other internal properties of the \(j\text {th}\) atom. The angular part of this density can be expanded in a spherical harmonics basis, and the radial part in terms of a linear basis. The radial coordinate is transformed into an additional angle, converting the basis to spherical harmonics on the 3-sphere. Let \(U_{m,m^\prime }^j\) be these hyperspherical harmonics; then one can express the local density as:
$$\rho = \sum _{j} \sum _{m, m^\prime } c_{m, m^\prime }^{j}\, U_{m, m^\prime }^{j},$$
where \(c_{m, m^\prime }^{j}\) are the coefficients of the expansion, computed by evaluating the inner product \(\langle U_{m, m^\prime }^{j} \mid \rho \rangle \). The BCs are then computed using the mixing rules as:
$$B_{j_1, j_2, j} = \sum _{m_1, m_1^\prime } \sum _{m_2, m_2^\prime } \sum _{m, m^\prime } \left( c_{m, m^\prime }^{j} \right) ^{*} C_{j_1 m_1 j_2 m_2}^{j m}\, C_{j_1 m_1^\prime j_2 m_2^\prime }^{j m^\prime }\, c_{m_1, m_1^\prime }^{j_1}\, c_{m_2, m_2^\prime }^{j_2},$$
where \(C_{j_1m_1j_2m_2}^{jm}\) are the Clebsch–Gordan coefficients of mixing. These descriptors also satisfy the required invariance properties. A key advantage of BC over ACSF is that BCs can be systematically expanded or truncated based on the accuracy versus complexity trade-offs of the model (Thompsona et al. 2015).
Smooth Overlap of Atomic Positions (SOAP)
In the SOAP descriptor, the local density is generated by smoothing the delta functions into Gaussians as (Bartók et al. 2013):
$$\rho _i(\textbf{r}) = \sum _j \exp \left( -\frac{|\textbf{r} - \textbf{r}_{ij}|^2}{2\sigma ^2} \right) .$$
This density can be expanded in terms of radial and angular bases as:
$$\rho _i(\textbf{r}) = \sum _{n,l,m} c_{nlm}\, g_n(r)\, Y_{l,m}(\theta , \phi ),$$
where \(Y_{l,m} (\theta , \phi )\) are the spherical harmonics and \(g_n(r)\) is a radial basis set chosen for the specific model. The descriptor for atom i is then written as an appropriately normalized power spectrum:
$$p_{n n^\prime l} = \sum _{m} c_{nlm}\, c_{n^\prime lm}^{*}.$$
4.2.4 Energy Functionals
The input to the ML model is a descriptor computed using one of the schemes described above; the output of the model is an energy. We describe common forms of the energy functional here.
Feed Forward Neural Network Based Energy Functional
One of the common ML energy functionals is based on feed forward neural networks (FFNN) (see e.g. Blank et al. (1995), Gassner et al. (1998), Lorenz et al. (2004), Manzhos and Carrington (2006), Behler et al. (2007), Geiger and Dellago (2013), Behler (2014), Behler (2015)). These networks typically take a descriptor \(\textbf{G}_i\) as input and produce an energy value as output. One can write the energy as:
$$E_i = f_m\left( \textbf{W}_{m-1,m} \cdots f_1\left( \textbf{W}_{0,1}\, \textbf{G}_i + \textbf{b}_1 \right) \cdots + \textbf{b}_m \right) ,$$
where the neural network has m layers, \(\textbf{W}_{k-1,k}\) and \(\textbf{b}_k\) are the weights and bias values associated with the \(k\text {th}\) layer, respectively, and \(f_k\) are the nonlinear activation functions associated with the \(k\text {th}\) layer. Forces are computed as the negative gradients of the energy functional; we therefore require the activation functions \(f_k\) to be differentiable.
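A minimal sketch of such an energy functional: a small feed-forward network with differentiable tanh activations mapping a descriptor vector to a scalar energy. The weights here are random stand-ins for trained parameters.

```python
# Minimal feed-forward energy functional: descriptor in, scalar energy out.
# Weights and layer sizes are illustrative stand-ins for a trained model.
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 1]                   # descriptor dim -> hidden -> energy
W = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[1:], sizes[:-1])]
b = [rng.normal(scale=0.1, size=m) for m in sizes[1:]]

def energy(g):
    a = g
    for Wk, bk in zip(W[:-1], b[:-1]):
        a = np.tanh(Wk @ a + bk)       # differentiable activation
    return float(W[-1] @ a + b[-1])    # linear output layer
```

In practice the forces are obtained as \(-\partial E / \partial \textbf{r}\) by differentiating through both the network and the descriptor, typically with automatic differentiation.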
Gaussian Approximation Potential (GAP)
This approximation establishes a mapping between the environment of an atom and the corresponding energy using a Gaussian kernel function:
$$E_i = \sum _{n} \alpha _n \exp \left( -\frac{1}{2} \sum _{l=1}^{L} \frac{(b_l - b_{n,l})^2}{\theta _l^2} \right) ,$$
where L is the number of truncated bispectrum components and \(\textbf{b}\) are the BCs. The determination of the coefficients \(\alpha _n\) is computationally expensive, since its cost grows as \(N^3\) (Li et al. 2015).
Spectral Neighbour Analysis Potential (SNAP)
SNAP simplifies the computation of the \(\alpha _i\) in GAP by replacing Gaussian process regression with linear regression. The energy functional is then given by (Thompsona et al. 2015):
$$E_i = \beta _0 + \sum _{k=1}^{M} \beta _k\, B_k^i,$$
where M is the number of bispectrum coefficients used in the approximation and \(\beta _k\) are the fitted coefficients. The most important advantage of SNAP over GAP is the reduction in computational cost that results from using linear regression.
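Because the SNAP energy is linear in the coefficients \(\beta _k\), training reduces to an ordinary least-squares problem. The sketch below fits coefficients to synthetic data; the random descriptor rows and known coefficient vector stand in for bispectrum components and quantum reference energies.

```python
# Sketch of the SNAP idea: with fixed descriptors B, the energy is linear in
# the coefficients beta, so training is a least-squares fit. Data below are
# synthetic; real workflows fit to quantum-mechanical reference energies.
import numpy as np

rng = np.random.default_rng(0)
M = 5                                  # number of bispectrum components kept
B = rng.normal(size=(200, M))          # descriptor rows for 200 training atoms
beta_true = np.array([0.7, -1.2, 0.3, 0.0, 2.0])
E_ref = B @ beta_true                  # synthetic reference energies

# Linear regression recovers the coefficients in one solve.
beta_fit, *_ = np.linalg.lstsq(B, E_ref, rcond=None)
```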
4.2.5 Accelerating Timestepping Using Deep Networks
We have previously described the use of ML potentials to increase the accuracy and scope of modeled interactions. An important bottleneck in reactive atomistic simulations is the need for small timesteps (sub-femtosecond in typical applications), whose sequential nature limits the temporal scope of simulations. There have been some recent efforts aimed at ML techniques for long-timestep integration. Conventional timestepping schemes use the current atomic state (and in some cases, the few states leading up to it), combined with the force (derived from the energy), to advance the system to the next step. The goal of ML-based time integrators is to use a sequence of past atomic states, along with the energy, to predict the system state over much longer timesteps (e.g., three orders of magnitude longer than conventional integrators).
The use of multiple past states in predicting the next state motivates the use of Recurrent Neural Networks (RNNs) for this task. Recall that RNNs use internal states to process timeseries data. To address the ‘vanishing gradient’ problem discussed in Sect. 3.2.2, RNN variants such as Long Short-Term Memory (LSTM) networks are used for this purpose. There are three key issues in the use of LSTMs as long-timestep integrators: (i) specification of input states for the deep network; (ii) the network architecture; and (iii) the training process. The input to an LSTM-based time integrator is typically limited to a finite region around the atom whose trajectory is predicted. Larger neighborhoods require a significantly larger number of degrees of freedom in the network. While in theory this would improve accuracy, the need for large amounts of training data and the associated training error typically negate this improvement. The network architecture is determined by the complexity of the energy functional and specific domain properties. In current practice, even simple energy terms (Lennard-Jones interactions) require large networks (\(\sim \)100K parameters) for ensembles of as few as 16 particles. The need for training data and the associated training cost are significant. However, such integrators have been shown to be capable of timesteps three orders of magnitude longer than conventional Verlet integrators (Kadupitiya et al. 2020).
In current proposals, which are in relative infancy, the training procedures for the LSTMs use simulation data generated from the specific potential, with well specified boundary conditions (e.g., periodic boundaries). Even in these simple systems, a large amount of training data is needed to accurately predict trajectories. It is observed that for more complex potentials (with multiple terms) and diverse atomic contexts, the need for training data increases substantially.
We note that the use of deep networks for particle dynamics is in relative infancy. There has been significant interest in the use of deep networks for timeintegrating ODEs since the recent work of Chen et al. (2018). Recent advances include symplectic ODENets for learning the dynamics of Hamiltonian systems (Zhong et al. 2019), and associated deep learning architectures (Rusch and Mishra 2021).
5 Analyzing Results from Atomistic Simulations
A key use of machine learning techniques is in the analysis of large amounts of data generated from timedependent simulations. This data generally takes the form of snapshots of trajectories—with each snapshot corresponding to system state comprised of degrees of freedom (position, momentum, etc.) associated with particles, and in the case of reactive simulations, bond information. Complex simulations scale to millions of particles and beyond, over billions of timesteps—leading to datasets that are in excess of terabytes. A number of techniques are deployed to deal with this data volume, including subsampling for reducing storage, indexing for fast access, and compression. While these techniques facilitate storage and access, the focus of this section is primarily on analysis techniques that abstract and extract useful information from trajectories.
We note that ML techniques for the analysis of time-dependent simulations are an active area of research. This section summarizes the rich state of the art in the area; for a more detailed recent summary, we refer readers to excellent reviews by Glielmo et al. (2021), Sidky et al. (2020), and Noé et al. (2020).
5.1 Representation Techniques
We consider a general class of simulations that result in a set of T snapshots of data, each snapshot \(S_i, i = 0 \ldots T-1\), stored as a D dimensional vector, in a matrix M of dimension \(T\times D\). The first challenge we face is to suitably encode the system state at time \(t_i\) into a corresponding vector \(S_i\). This poses challenges with respect to different data structures and their consistent encoding. We consider two common data structures and associated representation techniques:
Vector Fields
The most common data associated with particles are vector fields, including position, momentum, and other particle properties. The first step in representing these vector fields is to account for underlying invariants. For instance, a particle aggregate (e.g., a molecule) may be invariant under rotation and translation. To account for this invariance, aggregates must be represented in a canonical framework so that two aggregates in different orientations can be recognized as identical under affine transformations. The most common technique relies on aligning particle aggregates with known reference aggregates (e.g., reference geometries of molecules) and storing them as deviations from these references under affine transformations. Such transformations can be computed through local formulations solved using shapelets, or through global formulations such as the Orthogonal Procrustes Problem, which has an optimal solution due to Kabsch (1976). Once suitable alignments have been computed, the particle aggregates are stored as vectors of deviations from the reference aggregates. When reference aggregates are unavailable, canonical representations can be derived through suitable internal coordinates, for example, internal distances between reference particles (e.g., distances between pairs of marked atoms in a molecule). This vector of distances provides a canonical representation.
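The Kabsch solution to the Orthogonal Procrustes Problem is short enough to sketch directly: the optimal rotation comes from an SVD of the covariance of the two centered point sets, with a sign correction to exclude improper rotations (reflections).

```python
# Sketch of the Kabsch (1976) alignment: the rotation R (with R @ p_i ~ q_i)
# that optimally aligns a particle aggregate P to a reference geometry Q.
import numpy as np

def kabsch(P, Q):
    """Optimal rotation for two (n, 3) point sets, after centering both."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)              # covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # exclude reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Usage: recover a known rotation from rotated-and-translated copies.
rng = np.random.default_rng(3)
P = rng.normal(size=(6, 3))
a, c = 0.4, 1.1
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1.0]])
Rx = np.array([[1.0, 0, 0], [0, np.cos(c), -np.sin(c)], [0, np.sin(c), np.cos(c)]])
R0 = Rx @ Rz
Q = P @ R0.T + np.array([1.0, -2.0, 0.5])          # rotate, then translate
R = kabsch(P, Q)
```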
Network Models
Reactive simulations often store the bond structure of molecules within snapshots \(S_i\). These structures are invariant to within an isomorphism; i.e., any relabeling of atoms in the molecule should be treated identically. Canonical labelings are challenging to derive because there exist an exponential number of permutations, and corresponding labelings. Deriving canonical labelings to represent the graphs corresponding to molecular structures as vectors requires solving the graph isomorphism problem. For small molecules, this can be done by enumeration; for larger molecules, however, it becomes computationally expensive. One solution relies on a diffusion kernel to derive canonical labelings: the Laplacian of the given graph is used to simulate a diffusion process on the graph, and the stationary probabilities associated with this process are used to represent the graph in a canonical vector form. One may also view this vector in terms of the spectrum of the graph. Other approaches to canonical labeling rely on graph neural networks (GNNs), which are trained to take a graph as input and generate canonical labels as output. The training procedure associates identical labelings with isomorphic graphs.
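A minimal sketch of the spectral view mentioned above: the sorted Laplacian eigenvalues of a bond graph are unchanged by any relabeling of the atoms, so they give a labeling-independent vector fingerprint (with the caveat that spectra are not guaranteed to distinguish every pair of non-isomorphic graphs).

```python
# Sketch of a spectral fingerprint for bond graphs: Laplacian eigenvalues
# are identical for isomorphic (relabeled) molecules.
import numpy as np

def laplacian_spectrum(adj):
    """Sorted eigenvalues of L = D - A for a symmetric adjacency matrix."""
    L = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(L))

# Bond graph of a 4-atom chain, and the same graph with atoms relabeled.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
perm = [2, 0, 3, 1]
A_relab = A[np.ix_(perm, perm)]
```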
5.2 Dimensionality Reduction and Clustering
Using suitable representation techniques, the state \(S_i\) at timestep i is represented as a vector \(v_i\) in dimension \(D_n\). We use the subscript n to denote the native dimension of the representation. The next step in typical analyses is to reduce the native dimension \(D_n\) to a lower (reduced) dimension \(D_r\). This facilitates downstream analyses by denoising the data (filtering out less important dimensions) while simultaneously reducing computational cost. Dimensionality reduction is accomplished through the linear techniques (PCA, SVD, NMF, AA) or nonlinear techniques (kernel PCA, autoencoders) described in Sect. 3.1.
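A compact sketch of the reduction step: PCA via the SVD of the centered snapshot matrix, projecting from the native dimension \(D_n\) to a reduced dimension \(D_r\). The data here are synthetic, with two true degrees of freedom embedded in ten dimensions.

```python
# Sketch of dimensionality reduction of a (T, D_n) snapshot matrix via PCA.
import numpy as np

def pca_reduce(M, d_r):
    """Project snapshots onto the top d_r principal components."""
    Mc = M - M.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:d_r].T                        # (T, d_r) reduced coordinates

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                # 2 true degrees of freedom
mix = rng.normal(size=(2, 10))
M = latent @ mix + 0.01 * rng.normal(size=(100, 10))  # embedded in D_n = 10
Y = pca_reduce(M, 2)
```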
5.3 Dynamical Models and Analysis
Molecular systems evolve through a dynamical operator acting on successive system states. This motivates the natural observation that the datapoints associated with temporal snapshots are not independent; rather, they have temporal correlations that reveal interesting aspects of the underlying system. Identification of temporally coherent subdomains is an important analysis task. The starting point for such analysis is a time-lagged covariance matrix, computed as the correlation (normalized dot product) of a state descriptor at time t with that at time \(t + \delta t\), for a suitably selected lag \(\delta t\). A commonly used method, Time Lagged Independent Component Analysis (TLICA), uses this time-lagged covariance matrix, along with the covariance matrix of the current state, to define a generalized eigenvalue problem. The eigenvectors derived from this generalized eigenvalue problem correspond to the slow modes of the underlying dynamics of the system. We refer to the work of Naritomi and Fuchigami (2013) for a detailed description of this method and its use in analyzing atomic trajectories. These approaches are generalized in a variational framework that aims to characterize the dominant eigenpairs of the propagation operator of the dynamical system. This is achieved by first computing a discrete approximation to the propagation operator, which uses abstractions of the self- and time-lagged covariance matrices to compute transition probabilities from each state at time t to a state at time \(t + \delta t\). The eigenvectors of this operator correspond to the dominant modes of the system. The general variational model is equivalent to TLICA if data points are represented through a linear basis; however, the variational model admits a more general basis through the use of higher-order kernels, and the underlying optimization problem is solved using conventional gradient-descent type methods.
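A minimal sketch of the TLICA computation: estimate the instantaneous and time-lagged covariance matrices from a trajectory and solve the resulting generalized eigenvalue problem (handled here by forming \(C_0^{-1} C_\tau \); the symmetrization and regularization used in production codes are kept minimal). The synthetic trajectory contains one slowly decorrelating coordinate and one memoryless coordinate.

```python
# Sketch of time-lagged independent component analysis: slow modes are
# generalized eigenvectors of the lagged covariance against the
# instantaneous covariance. Minimal estimators; not a production TICA.
import numpy as np

def tica_modes(X, lag):
    """X: (T, d) trajectory; returns eigenvalues/vectors, slowest mode first."""
    X = X - X.mean(axis=0)
    C0 = (X.T @ X) / len(X)                       # instantaneous covariance
    Ct = (X[:-lag].T @ X[lag:]) / (len(X) - lag)  # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                        # symmetrize
    w, V = np.linalg.eig(np.linalg.solve(C0, Ct))
    order = np.argsort(w.real)[::-1]              # slowest (largest) first
    return w.real[order], V.real[:, order]

rng = np.random.default_rng(0)
T = 5000
slow = np.empty(T)
slow[0] = 0.0
for t in range(1, T):                             # slowly decorrelating AR(1)
    slow[t] = 0.99 * slow[t - 1] + 0.1 * rng.normal()
fast = rng.normal(size=T)                         # memoryless coordinate
X = np.column_stack([slow, fast])
w, V = tica_modes(X, lag=10)
```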
5.4 Reaction Rates and Chemical Properties
Reactive simulations often produce diverse chemical constituents. Some of these compounds are transient; however, they still require careful analysis and classification. In the simple case of a two-component silica–water system, the molecular components observed at the end of the simulations include Si–O, Si–O\(_2\), OH, H\(_2\), etc. (Fogarty et al. 2010). Identifying all the molecular components and the corresponding chemical reactions is a difficult problem.
To enumerate all the molecular components, one can treat a simulation timestep as a colored graph, with atom types as node colors and with an edge between two atoms whenever the bond order between the pair exceeds a cutoff value. The enumeration then requires identification of all the distinct classes of isomorphic subgraphs of atoms. Each such class is either a molecule or a molecular fragment present in a single time frame. A hash table of such fragments is then constructed to record the frequency of occurrence of each reactant or product in a single time frame.
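The enumeration procedure can be sketched as follows: threshold bond orders to build an adjacency structure, extract connected components by depth-first search, and hash each fragment. As a simplification, the hash key here is the sorted element multiset (a molecular formula) rather than a full isomorphism-class label.

```python
# Sketch of fragment enumeration in one frame: threshold bond orders, find
# connected components, and tally fragments by formula (a simplification of
# the full isomorphism-class bookkeeping described in the text).
from collections import Counter

def fragments(elements, bonds, bond_orders, cutoff=0.3):
    n = len(elements)
    adj = {i: [] for i in range(n)}
    for (i, j), bo in zip(bonds, bond_orders):
        if bo > cutoff:                   # keep only bonds above the cutoff
            adj[i].append(j)
            adj[j].append(i)
    seen, frags = set(), Counter()
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                      # depth-first component search
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        formula = "".join(f"{el}{c}" for el, c in
                          sorted(Counter(elements[i] for i in comp).items()))
        frags[formula] += 1
    return frags

# Water + hydroxyl in one frame; the weak O-H contact (bond order 0.1)
# falls below the cutoff and does not merge the two fragments.
counts = fragments(["O", "H", "H", "O", "H"],
                   [(0, 1), (0, 2), (3, 4), (2, 3)],
                   [0.9, 0.9, 0.8, 0.1])
```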
For the most common molecular fragments, it is often possible to identify reactions of the kind \(\text {A} + \text {B} \rightleftharpoons \text {AB}\). Such reactions can be modeled using first order differential equations, which can be solved as:
where N is the total number of molecules of types A and B, \(N_\text {AB}\) is the number of molecules of AB, and \(K_f\), \(K_b\) are the forward and backward reaction rates, respectively (Saunders et al. 2022). Within simulations, the computed numbers of each molecular type can be fitted to Eq. (16) as functions of time, yielding the reaction rates and equilibrium concentrations of the various chemical components.
6 Concluding Remarks
In this chapter, we presented an overview of common ML techniques and formulations. We discussed how computationally expensive components of reactive atomistic simulations are formulated in ML frameworks, considerations for training ML models, and trade-offs among accuracy, the need for training data, transferability, and computational cost. While we primarily focused on reactive atomistic simulations, the models and methods discussed apply more generally to discrete element models.
The area of ML techniques for reactive simulations is extremely active and fluid. There is tremendous potential for significant new developments in the area, enabling simulation scales and scope far beyond those currently accessible. In doing so, these techniques hold the promise of new applications and domains.
References
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X (2015) TensorFlow: largescale machine learning on heterogeneous systems. Software available from tensorflow.org
Bartók AP, Kondor R, Csányi G (2013) On representing chemical environments. Phys Rev B 87:184115
Behler J (2014) Representing potential energy surfaces by highdimensional neural network potentials. J Phys: Condensed Matter 26:183001
Behler J (2016) Perspective: machine learning potentials for atomistic simulations. J Chem Phys 145:170901
Behler J (2015) Constructing highdimensional neural network potentials: a tutorial review. Int J Quant Chem 115(16):1032–1050
Behler J, Lorenz S, Reuter K (2007) Representing moleculesurface interactions with symmetryadapted neural networks. J Chem Phys 127:014705
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 4. Springer
Blank TB, Brown SD, Calhoun AW, Doren DJ (1995) Neural network models of potential energy surfaces. J Chem Phys 103:4129
Bradbury J, Frostig R, Hawkins P, Johnson MJ, Leary C, Maclaurin D, WandermanMilne S (2020) Jax: composable transformations of python+ numpy programs, p 18. http://github.com/google/jax
Brickel S, Das AK, Unke OT, Turan HT, Meuwly M (2019) Reactive molecular dynamics for the [Cl–CH3–Br]− reaction in the gas phase and in solution: a comparative study using empirical and neural network force fields. Electron Struct 1:024002
Case DA, Aktulga HM, Belfon K, BenShalom I, Brozell SR, Cerutti DS, Cheatham TE III, Cruzeiro VWD, Darden TA, Duke RE et al (2021) Amber 2021. University of California, San Francisco
Chen TQ, Rubanova Y, Bettencourt J, Duvenaud D (2018) Neural ordinary differential equations. arxiv:1806.07366
Cho K, Merriënboer BV, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoderdecoder approaches. arXiv:1409.1259
Cortes C, Vapnik V (1995) Supportvector networks. Mach Learn 20(3):273–297
Cunningham JP, Ghahramani Z (2015) Linear dimensionality reduction: survey, insights, and generalizations. J Mach Learn Res 16(1):2859–2900
Curtin RR, Edel M, Lozhnikov M, Mentekidis Y, Ghaisas S, Zhang S (2018) mlpack 3: a fast, flexible machine learning library. J Open Source Softw 3(26):726
Deviller J, Balaban AT (eds) (1999) Topological indices and related descriptors in QSAR and QSPR. Gordon and Breach Science Publishers
Dick B, Freund HJ (1983) Analysis of bonding properties in molecular ground and excited states by a cohentype bond order. Int J Quant Chem 24:747–765
Dittner M, Müller J, Aktulga HM, Hartke B (2015) Efficient global optimization of reactive forcefield parameters. J Comput Chem 36(20):1550–1561
Ester M, Kriegel HP, Sander J, Xu X et al (1996) A densitybased algorithm for discovering clusters in large spatial databases with noise. In KDD 96:226–231
Faber F, Lindmaa A, von Lilienfeld OA, Armiento R (2015) Crystal structure representations for machine learning models of formation energies. Int J Quant Chem 115(16):1094–1101
Fogarty JC, Aktulga HM, Grama AY, van Duin ACT, Pandit SA (2010) A reactive molecular dynamics simulation of the silicawater interface. J Chem Phys 132(17):174704
Frenkel D, Smit B (2002) Understanding molecular simulation from algorithms to applications. Academic Press
Furman D, Carmeli B, Zeiri Y, Kosloff R (2018) Enhanced particle swarm optimization algorithm: efficient training of reaxff reactive force fields. J Chem Theory Comput 14(6):3100–3112
Gaissmaier D, van den Borg M, Fantauzzi D, Jacob T (2022) Kvik optimiser—an enhanced reaxff force field training approach, ChemRxiv
Gan G, Ma C, Wu J (2020) Data clustering: theory, algorithms, and applications. SIAM
Gassner H, Probst M, Lauenstein A, Hermansson K (1998) Representation of intermolecular potential functions by neural networks. J Phys Chem A 102(24):4596–4605
Gastegger M, Behler J, Marquetand P (2017) Machine learning molecular dynamics for the simulation of infrared spectra. Chem Sci 8(10):6924–6935
Geiger P, Dellago C (2013) Neural networks for local structure detection in polymorphic systems. J Chem Phys 139:164105
Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: continual prediction with LSTM. Neural Comput 12(10):2451–2471
Ghiringhelli LM, Vybiral J, Levchenko SV, Draxl C, Scheffler M (2015) Big data of materials science: critical role of the descriptor. Phys Rev Lett 114:105503
Gillis N (2020) Nonnegative matrix factorization. SIAM
Glielmo A, Husic BE, Rodriguez A, Clementi C, Noé F, Laio A (2021) Unsupervised learning methods for molecular simulation data. Chem Rev 121(16):9722–9758
Golub GH, Reinsch C (1971) Singular value decomposition and least squares solutions. In: Linear algebra. Springer, pp 134–151
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Goodfellow I, PougetAbadie J, Mirza M, Xu B, WardeFarley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Advances in neural information processing systems, vol 27
Guennebaud G, Jacob B et al (2010) Eigen v3. http://eigen.tuxfamily.org
Guo F, Wen YS, Feng SQ, Li XD, Li HS, Cui SX, Zhang ZR, Hu HQ, Zhang GQ, Cheng XL (2020) Intelligentreaxff: evaluating the reactive force field parameters with machine learning. Comput Mat Sci 172:109393
Hansen N (2006) Towards a new evolutionary computation. Stud Fuzziness Soft Comput 192:75–102
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, GérardMarchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362
Hess B, Kutzner C, Spoel DVD, Lindahl E (2008) Gromacs 4: algorithms for highly efficient, loadbalanced, and scalable molecular simulation. J Chem Theory Comput 4(3):435–447
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hochreiter S, Schmidhuber J (1997) Long shortterm memory. Neural Comput 9(8):1735–1780
Huang G, Liu Z, Maaten LVD, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the conference on computer vision and pattern recognition, pp 4700–4708
Hubin PO, Jacquemin D, Leherte L, Vercauteren DP (2016) Parameterization of the reaxff reactive force field for a proline-catalyzed aldol reaction. J Comput Chem 37(29):2564–2572
Iype E, Hütter M, Jansen APJ, Nedea SV, Rindt CCM (2013) Parameterization of a reactive force field using a Monte Carlo algorithm. J Comput Chem 34(13):1143–1154
JaramilloBotero A, Naserifar S, Goddard WA III (2014) General multiobjective force field optimization framework, with application to reactive force fields for silicon carbide. J Chem Theory Comput 10(4):1426–1439
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia, pp 675–678
Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Sect A 32:922–923
Kadupitiya JCS, Fox GC, Jadhao V (2020) Solving newton’s equations of motion with large timesteps using recurrent neural networks based operators
Katoch S, Chauhan SS, Kumar V (2021) A review on genetic algorithm: past, present, and future. Multi Tools App 80(5):8091–8126
Kaymak MC, Rahnamoun A, O’Hearn KA, van Duin ACT, Merz KM Jr, Aktulga HM (2022) Jaxreaxff: a gradient based framework for extremely fast optimization of reactive force fields. ChemRxiv
Kohn W (1996) Density functional and density matrix method scaling linearly with the number of atoms. Phys Rev Lett 76:3168
Kraft D et al (1988) A software package for sequential quadratic programming. DFVLR Oberpfaffenhofen, Germany
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE J 37(2):233–243
Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25
Larsson HR, van Duin ACT, Hartke B (2013) Global optimization of parameters in the reactive force field ReaxFF for SiOH. J Comput Chem 34(25):2178–2189
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
Li Z, Kermode JR, De Vita A (2015) Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Phys Rev Lett 114:096405
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Info Theory 28(2):129–137
Lorenz S, Groß A, Scheffler M (2004) Representing high-dimensional potential-energy surfaces for reactions at surfaces by neural networks. Chem Phys Lett 395(4–6):210–215
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Maimon O, Rokach L (2005) Data mining and knowledge discovery handbook. Springer
Manzhos S, Carrington T Jr (2006) A random-sampling high dimensional model representation neural network for building potential energy surfaces. J Chem Phys 125:084109
McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426
McLachlan GJ (2005) Discriminant analysis and statistical pattern recognition. Wiley
Mirjalili S (2019) Genetic algorithm. In: Evolutionary algorithms and neural networks. Springer, pp 43–55
Murphy KP (2021) Machine learning: a probabilistic perspective. MIT Press
Naritomi Y, Fuchigami S (2013) Slow dynamics of a protein backbone in molecular dynamics simulation revealed by time-structure based independent component analysis. J Chem Phys 139:215102
Neubig G, Dyer C, Goldberg Y, Matthews A, Ammar W, Anastasopoulos A, Ballesteros M, Chiang D, Clothiaux D, Cohn T, Duh K, Faruqui M, Gan C, Garrette D, Ji Y, Kong L, Kuncoro A, Kumar G, Malaviya C, Michel P, Oda Y, Richardson M, Saphra N, Swayamdipta S, Yin P (2017) DyNet: the dynamic neural network toolkit. arXiv:1701.03980
Noé F, Tkatchenko A, Müller KR, Clementi C (2020) Machine learning for molecular simulation. Ann Rev Phys Chem 71(1):361–390 PMID: 32092281
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc, pp 8024–8035
Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kale L, Schulten K (2005) Scalable molecular dynamics with NAMD. J Comput Chem 26(16):1781–1802
Rish I (2001) An empirical study of the naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence, vol 3, pp 41–46
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by backpropagating errors. Nature 323(6088):533–536
Rupp M, Tkatchenko A, Müller KR, von Lilienfeld OA (2012) Fast and accurate modeling of molecular atomization energies with machine learning. Phys Rev Lett 108:058301
Rusch TK, Mishra S (2021) UnICORNN: a recurrent model for learning very long time dependencies. arXiv:2103.05487
Ruspini EH (1969) A new approach to clustering. Info Control 15(1):22–32
Sabne A (2020) XLA: compiling machine learning for peak performance. Google Res
Sanderson C, Curtin R (2016) Armadillo: a template-based C++ library for linear algebra. J Open Source Softw 1(2):26
Sanderson C, Curtin R (2020) An adaptive solver for systems of linear equations. In: 2020 14th international conference on signal processing and communication systems (ICSPCS). IEEE, pp 1–6
Sauceda HE, Gastegger M, Chmiela S, Müller KR, Tkatchenko A (2020) Molecular force fields with gradient-domain machine learning (GDML): comparison and synergies with classical force fields. J Chem Phys 153:124109
Saunders M, Wineman-Fisher V, Jakobsson E, Varma S, Pandit SA (2022) High-dimensional parameter search method to determine force field mixing terms in molecular simulations. Langmuir
Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin CT (2017) A review of clustering techniques and developments. Neurocomputing 267:664–681
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Schubert E, Sander J, Ester M, Kriegel HP, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst (TODS) 42(3):1–21
Senftle TP, Hong S, Islam MM, Kylasa SB, Zheng Y, Shin YK, Junkermeier C, Engel-Herbert R, Janik MJ, Aktulga HM et al (2016) The ReaxFF reactive force-field: development, applications and future directions. NPJ Comput Mater 2(1):1–14
Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press
Shchygol G, Yakovlev A, Trnka T, van Duin ACT, Verstraelen T (2019) ReaxFF parameter optimization with Monte Carlo and evolutionary algorithms: guidelines and insights. J Chem Theory Comput 15(12):6799–6812
Sidky H, Chen W, Ferguson AL (2020) Machine learning for collective variable discovery and enhanced sampling in biomolecular simulation. Mol Phys 118(5):e1737742
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Stuart SJ, Tutein AB, Harrison JA (2000) A reactive potential for hydrocarbons with intermolecular interactions. J Chem Phys 112(14):6472–6486
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the 31st AAAI conference on artificial intelligence
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Tersoff J (1989) Modeling solid-state chemistry: interatomic potentials for multicomponent systems. Phys Rev B 39(8):5566
Theano Development Team (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv:1605.02688
Thompson AP, Aktulga HM, Berger R, Bolintineanu DS, Brown WM, Crozier PS, in 't Veld PJ, Kohlmeyer A, Moore SG, Nguyen TD et al (2022) LAMMPS: a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Comput Phys Commun 271:108171
Thompson AP, Swiler LP, Trott CR, Foiles SM, Tucker GJ (2015) Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. J Comput Phys 285:316–330
Trnka T, Tvaroska I, Koca J (2018) Automated training of ReaxFF reactive force fields for energetics of enzymatic reactions. J Chem Theory Comput 14(1):291–302
Unke OT, Chmiela S, Sauceda HE, Gastegger M, Poltavsky I, Schütt KT, Tkatchenko A, Müller KR (2021) Machine learning force fields. Chem Rev 121:10142–10186
Valle M, Oganov AR (2010) Crystal fingerprint space – a novel paradigm for studying crystal-structure sets. Acta Crystallogr A 66:507–517
van Duin ACT, Baas JMA, van de Graaf B (1994) Delft molecular mechanics: a new approach to hydrocarbon force fields. Inclusion of a geometry-dependent charge calculation. J Chem Soc Faraday Trans 90(19):2881–2895
van Duin ACT, Dasgupta S, Lorant F, Goddard WA (2001) ReaxFF: a reactive force field for hydrocarbons. J Phys Chem A 105(41):9396–9409
Vapnik V (1991) Principles of risk minimization for learning theory. Advances in neural information processing systems, vol 4
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, SciPy 1.0 Contributors (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272
Weiss K, Khoshgoftaar TM, Wang D (2016) A survey of transfer learning. J Big Data 3(1):1–40
Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv:1605.07146
Zhong YD, Dey B, Chakraborty A (2019) Symplectic ODE-Net: learning Hamiltonian dynamics with control. arXiv:1909.12077
Zhu C, Byrd RH, Lu P, Nocedal J (1997) Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw (TOMS) 23(4):550–560
Acknowledgements
This work is supported by the US National Science Foundation through grants OAC-1807622, OAC-1908691, and CCF-2019263, as well as the National Institutes of Health through grant 5R01GM130641.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
© 2023 The Author(s)
Aktulga, H., Ravindra, V., Grama, A., Pandit, S. (2023). Machine Learning Techniques in Reactive Atomistic Simulations. In: Swaminathan, N., Parente, A. (eds) Machine Learning and Its Application to Reacting Flows. Lecture Notes in Energy, vol 44. Springer, Cham. https://doi.org/10.1007/978-3-031-16248-0_2
DOI: https://doi.org/10.1007/978-3-031-16248-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16247-3
Online ISBN: 978-3-031-16248-0