1 Introduction and Overview

Time-dependent reactive simulations involve complex interaction models that must be trained using experimental or highly resolved simulation data. The training process as well as data acquisition are often computationally expensive. Once trained, the coupling models are incorporated into reactive simulation procedures that involve small time-steps, and generate large amounts of data that must be effectively analyzed for drawing scientific insights. The past few decades have witnessed significant advances in each of these facets. More recently, increasing attention has been focused on the development and application of machine learning (ML) techniques for increasing the accuracy, generalizability, and speed of such simulations.

In this chapter, we provide an overview of ML models and methods, along with their use in reactive particle simulations. We use highly resolved reactive atomistic simulations as the model problem for motivating and describing ML methods. We start by first presenting an overview of common ML techniques that are broadly used in the field. We then present the use of these techniques in training interaction models for reactive atomistic simulations. Recent work has focused on overcoming the time-step constraints of conventional reactive atomistic methods—we describe these methods and survey key results in the area. Finally, we discuss the use of ML techniques in analyzing atomistic trajectories. The goal of this Chapter is to provide readers with a broad understanding of the state of the art in the area, unresolved challenges, and available methods and software for constructing simulations in diverse application domains. While we use reactive atomistics as our model problem, the discussion is broadly applicable to other particle-based/discrete-element simulation paradigms.

Reactive atomistic simulations provide understanding of chemical processes at the atomic level, which are usually not accessible through common experimental techniques. Quantum chemistry methods have come a long way in modeling electronic structures and subsequent chemical changes at the scale of a few atoms. However, if the interest is in the thermodynamics of chemical reactions, then atomistic techniques are the methods of choice. Here, individual reactions are modeled in an approximate sense, but the system size (or particle number) approaches the thermodynamic limit (or a suitable approximation thereof, i.e., as large as practical). One of the simplest sampling techniques used in atomistic simulations is molecular dynamics, which provides a pseudo-Newtonian trajectory of the system, and is applicable in modeling equilibrium as well as non-equilibrium problems. There are other sampling techniques, such as Monte Carlo methods, which are exclusively applicable to equilibrium statistical mechanical models. In this Chapter, we primarily focus on reactive molecular dynamics techniques.

1.1 Molecular Dynamics, Reactive Force Fields and the Concept of Bond Order

Molecular Dynamics (MD) is a widely adopted method for studying diverse molecular systems at an atomistic level, ranging from biophysics to chemistry and material science. While quantum mechanical (QM) models provide highly accurate results, they are of limited applicability in terms of spatial and temporal scales. MD simulations rely on parameterized force fields that enable the study of larger systems (with millions to billions of degrees of freedom) using atomistic models that are computationally tractable and scalable on large computer systems. Typical applications of MD range from computational drug discovery to design of new materials.

Fig. 1 Various classical force field interactions employed in atomistic MD simulations

MD is an active field in terms of the development of new techniques. In its most conventional form (i.e., classical MD), it relies on the Born-Oppenheimer approximation, where atomic nuclei and the core electrons together are treated as classical point particles and the interactions of outer electrons are approximated by pairwise and many-body terms such as bond, angle, torsion and non-bonded interactions, and additionally by using variable charge models. Each interaction is described by a parametric mathematical formula to compute relevant energies and forces. The collection of various interactions used to describe a molecular system is called a force field. Figure 1 illustrates interactions commonly used in various force fields. Equation 1 gives an example of a simple force field where \(K_b, r_0, K_a, \theta _0, V_d, \phi _0, \epsilon _{ij}, \nu \) and \(\sigma _{ij}\) denote parameters that are specific to the types of interacting atoms (which may be a pair, triplet, or quadruplet of atoms), \(\delta _i\) and \(\delta _j\) denote partial charges, and \(\epsilon \) denotes a global parameter.

$$\begin{aligned} V_{tot} = \sum _{bonds} K_b (r-r_0)^2 + \sum _{angles} K_a (\theta - \theta _0)^2 + \sum _{torsions} \frac{V_d}{2}[1 + \cos (\nu \phi - \phi _0)] \\ + \sum _{nonbonded} \frac{\delta _i \delta _j}{4 \pi \epsilon r} + \sum _{nonbonded} 4\epsilon _{ij} \left[ \left( \frac{\sigma _{ij}}{r} \right) ^{12} - \left( \frac{\sigma _{ij}}{r} \right) ^{6} \right] \end{aligned}$$
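The individual terms of Eq. (1) are straightforward to evaluate in code. Below is a minimal Python sketch of the harmonic bond term and the 12-6 Lennard-Jones term for a single pair of atoms; the parameter values are purely illustrative, not a fitted force field:

```python
def harmonic_bond_energy(r, K_b, r0):
    """Harmonic bond term of Eq. (1): K_b * (r - r0)^2."""
    return K_b * (r - r0) ** 2

def lennard_jones_energy(r, eps_ij, sigma_ij):
    """12-6 Lennard-Jones term of Eq. (1)."""
    sr6 = (sigma_ij / r) ** 6
    return 4.0 * eps_ij * (sr6 ** 2 - sr6)

# Illustrative (not fitted) parameters for a single atom pair.
e_bond = harmonic_bond_energy(r=1.10, K_b=300.0, r0=1.09)  # slightly stretched bond
e_lj = lennard_jones_energy(r=3.5, eps_ij=0.1, sigma_ij=3.4)
```

A full force-field implementation sums such terms over all bonds, angles, torsions, and non-bonded pairs, with parameters looked up by the types of the interacting atoms.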

Classical MD models, as implemented in highly popular MD software such as Amber (Case et al. 2021), LAMMPS (Thompson et al. 2022), GROMACS (Hess et al. 2008) and NAMD (Phillips et al. 2005), are based on the assumption of static chemical bonds and, in general, static charges. Therefore, they are not applicable to modeling phenomena where chemical reactions and charge polarization effects play a significant role. To address this gap, reactive force fields (e.g., ReaxFF, Senftle et al. (2016), REBO, Stuart et al. (2000), Tersoff (1989)) have been developed. Functional forms for reactive potentials are significantly more complex than their non-reactive counterparts due to the presence of dynamic bonds and charges. The development of an accurate force field (be it non-reactive or reactive) is a tedious task that relies heavily on biological and/or chemical intuition. More recently, machine learning based potentials have been proposed to alleviate the burden of force field design and fitting. Even so, the most computationally efficient way to study a large reactive molecular system, as would be necessary in a reactive flow application, is a well-tuned reactive force field model. Hence, this Chapter focuses on reactive force fields and specifically on ReaxFF whenever it is necessary to discuss specific methods and results, since covering all reactive force field models would necessitate a significantly longer discussion. Nevertheless, models and methods discussed for ReaxFF are broadly applicable to other reactive force fields as well.

Bond order is a key concept in reactive simulations; it models the overlap of electronic orbitals. This is intrinsically ambiguous in classical simulations because of approximations in assigning bond indices and bond types based on wave function overlaps (Dick and Freund 1983). In classical reactive simulations, bond order is defined as a smooth function that vanishes with increasing distance between the atoms (van Duin et al. 2001). Clearly, such a function must depend on the environment of the atoms to correctly reproduce valencies. In non-reactive classical simulations, bond structure is maintained either by applying constraints where a bond is expected to exist, or by assigning a large energy penalty (typically in the form of a harmonic potential, see e.g. Eq. (1)) if the atoms deviate from the expected bond length (Frenkel and Smit 2002). In either case, an improperly optimized force field can lead to divergent energies or breakdown of the constraint algorithms. Reactive systems, in which bond orders smoothly go to zero, usually do not have this problem but may end up with an un-physical final structure. Recently proposed ML-based approaches depend only on the atomic positions and sometimes on momenta, but do not carry information on molecular topology. Consequently, such approaches are well-suited for describing reactive simulations.

1.2 Accuracy, Complexity, and Transferability

Three key aspects must be considered when formulating simulation models: (i) Accuracy: A simulation is expected to reproduce structure as well as the chemical reactions and reaction rates for the model system against the target data. If a model has a sufficient number of free parameters, then, in principle, such a model can accurately describe the physical system. However, the choice of model and its size depend on the availability of target training data, which are usually highly-resolved quantum chemistry calculations ranging from Density Functional Theory (DFT) to coupled cluster theory, along with a basis set specifying the desired level of accuracy; (ii) Complexity: For any simulation model, the complexity increases with the number of terms and free parameters in force computations (Frenkel and Smit 2002). Thus, the accuracy of the model goes hand in hand with its complexity. Ideally, we would like to have a high-accuracy and low-complexity model. Consequently, a clever use of target data for extracting accurate results from a relatively simple model, or alternately, approximations that represent minimal compromise on accuracy for significant reduction in model complexity, are desirable; and (iii) Transferability: The models are expected to provide physical insight into the system by reproducing correct properties for different types of systems beyond the training data. This is usually achieved by breaking down the interaction terms into corresponding physical concepts, e.g., bond interaction, angle interaction, shielded 1–4 interaction, etc. Each of these interactions, although suitably abstracted, represents a physical concept that is expected to have similar interaction behavior under different conditions. Thus the total interaction can be computed as a combination of such transferable terms (Frenkel and Smit 2002). We note that the target data (usually obtained using quantum calculations) are not split into such physical abstractions.
This gives rise to numerous models with similar accuracy and varying degrees of transferability. Commonly used reactive potentials such as REBO or ReaxFF are built with transferability as a key consideration. However, even within the limited domain of atomic types and environments, these simulations rarely produce accurate results for a wide variety of problems without requiring a re-tuning of the force field parameters. Unlike fixed-form potential simulations, machine-learnt potentials focus on transferability of the model to atomic environments similar to those in the training datasets, and optimize for higher accuracy as well as lower complexity.

In the rest of this chapter, we describe how reactive interaction models are constructed, trained, and used in accelerating simulations, in particular by making use of ML-based techniques. We begin our discussion with an overview of common ML models and methods, followed by their use in the simulation toolchain.

2 Machine Learning and Optimization Techniques

We begin our discussion with an overview of general ML techniques. This literature is vast and rapidly evolving. For this reason, we restrict ourselves to common ML techniques as they apply to reactive particle-based simulations.

ML frameworks typically comprise a model, a suitably specified cost function, and a training set over which the cost function is minimized. An ML model corresponds to an abstraction of the physical system (e.g., the force on an atom in its atomic context) and has a number of parameters that must be suitably instantiated. The cost function quantifies the mismatch between the output of the model and physical (experimental or high-resolution simulated) data; minimizing it over the training data yields the necessary parametrization of the model. At the heart of ML procedures is the optimization technique used to match the model output with the target distribution.

The cost-function in typical ML applications is averaged over the training set:

$$\begin{aligned} J(\theta ) = \mathbb {E}_{(x,y) \sim \hat{P}_{data}} \ \mathbb {L}[f(x;\theta ),y] \end{aligned}$$

Here, J(.) represents the cost-function, \(\hat{P}_{data}\) represents the empirical distribution (i.e., the training set), \(\mathbb {L}(.)\) is the loss-function that quantifies the difference between estimated and true values, and f(.) is a prediction function parameterized by \(\theta \). A key point to note here is that we operate on empirical data, and not the “true” data distribution. Hence, this approach is also called empirical risk minimization (Vapnik 1991). The assumption is that minimizing the loss w.r.t. the empirical data will (indirectly) minimize the loss w.r.t. the true data distribution, thereby allowing for generalizability (i.e., the ability to make predictions on unseen data samples). In the rest of this section, we discuss continuous and discrete optimization strategies commonly used in ML formulations.
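As a concrete illustration of empirical risk minimization, the following sketch evaluates \(J(\theta )\) for a one-parameter linear predictor with squared loss; the predictor and data are hypothetical stand-ins, not from the chapter:

```python
def empirical_risk(f, theta, data, loss):
    """J(theta): the loss averaged over the empirical (training) distribution."""
    return sum(loss(f(x, theta), y) for x, y in data) / len(data)

# Hypothetical one-parameter predictor and squared loss.
def f(x, theta):
    return theta * x

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x, so theta = 2 is optimal
risk_at_2 = empirical_risk(f, 2.0, data, squared_loss)
```

Minimizing `empirical_risk` over \(\theta \) is precisely the training problem discussed in the remainder of this section.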

2.1 Continuous Optimization for Convex and Non-convex Optimization

In many applications, the objective function in Eq. 2 is continuous and differentiable. For such applications, a key consideration is whether the function is convex or non-convex (recall that a real-valued convex function is one in which the line joining any two points on the graph of the function does not lie below the graph at any point in the interval between the two points). Simple approaches to optimizing convex functions start from an initial guess, compute the gradient, and take a step in the direction of the negative gradient. This process is repeated until the gradient is sufficiently small (i.e., the function is close to its minimum). In ML applications, the step size is determined by the gradient and the learning rate—the smaller the gradient, the smaller the step size. Convex objective functions arise in models such as logistic regression and single layer neural networks.
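The iteration just described can be sketched in a few lines; the quadratic objective here is an illustrative stand-in:

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10000):
    """Start from an initial guess and repeatedly step against the gradient
    until the gradient is sufficiently small."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:        # close to the minimum
            break
        x -= lr * g             # step size scales with gradient and learning rate
    return x

# Convex example: J(x) = (x - 3)^2 with gradient 2(x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```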

In more general ML models such as deep neural networks, the objective function (Eq. 2) is not convex. Optimizing non-convex objective functions in high dimensions is a computationally hard problem. For this reason, most current optimizers use a gradient descent approach (or a variant thereof) to find a local minimum in the objective function space. It is important to note that a point of zero gradient may be a local minimum or a saddle point. Common solvers rely on randomization and noise introduced by sampling to escape saddle points. In deep learning applications, the problem of computing the gradient can be elegantly cast as a backpropagation operator—making it computationally simple and inexpensive. Optimization methods that use the entire training set to compute the gradient are called batch or deterministic methods (Rumelhart et al. 1986). Methods that operate on small subsets of the dataset (called minibatches) are called stochastic methods. In this context, a complete pass over the training dataset sampled in minibatches is called an epoch. Stochastic Gradient Descent (SGD) methods are workhorses for training deep neural network models.
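A minimal minibatch SGD loop, fitting a one-parameter least-squares model on synthetic data (the model and data are illustrative assumptions), can be written as:

```python
import random

def sgd(grad_on_batch, theta0, data, lr=0.05, batch_size=4, epochs=50, seed=0):
    """Minibatch SGD: each epoch is a complete pass over the shuffled training set."""
    rng = random.Random(seed)
    data = list(data)
    theta = theta0
    for _ in range(epochs):
        rng.shuffle(data)                    # sampling noise, as discussed above
        for i in range(0, len(data), batch_size):
            theta -= lr * grad_on_batch(theta, data[i:i + batch_size])
    return theta

# Gradient of the minibatch mean of (theta * x - y)^2 for the model y = theta * x.
def grad_on_batch(theta, batch):
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

data = [(0.1 * i, 1.5 * 0.1 * i) for i in range(1, 41)]  # exact slope is 1.5
theta_hat = sgd(grad_on_batch, theta0=0.0, data=data)
```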

First order methods such as SGD suffer from slow convergence, lack of robustness, and the need to tune a large number of hyperparameters. Indeed, model training using SGD-type methods incurs most of its computation cost in exploring the high-dimensional hyperparameter space to find model parametrizations with high accuracy and generalization properties (Goodfellow et al. 2014). These problems have motivated significant recent research in the development of second order methods and their variants. Second order methods scale different components of the gradient suitably to accelerate convergence. They also typically have far fewer hyperparameters, making the training process much simpler. However, these methods involve a product with the inverse of the dense Hessian matrix, which is computationally expensive. Solutions to these problems include statistical sampling, low-rank structures, and Kronecker products as approximations for the Hessian.

2.2 Discrete Optimization

In contrast to continuous optimization, in many applications, the variables and the objective function take discrete values, and thus the derivative of the objective function may not exist. This is often the case when optimizing parameters for force fields in atomistic models. Two major classes of techniques for discrete optimization are Integer Programming and Combinatorial Optimization. In Integer Programming, some (or all) variables are restricted to the space of integers, and the goal is to minimize an objective subject to specified constraints. In combinatorial optimization, the goal is to find the optimal object from a set of feasible discrete objects. Combinatorial optimization functions operate on discrete structures such as graphs and trees. The class of discrete optimization problems is typically computationally hard.

A commonly used discrete optimization procedure in the optimization of force fields is the genetic algorithm (Katoch et al. 2021; Mirjalili 2019). A genetic algorithm starts with a population of potentially suboptimal candidate solutions. It successively selects from this population (formally called selection) and combines them (formally called crossover) to generate new candidates. In many variants, mutations are introduced into the candidates to generate new candidates as well. A fitness function is used to screen these new candidates and the fittest candidates are retained in the population. This process is repeated until the best candidates achieve desired fitness. In the context of force-field optimization, the process is initialized with a set of parametrizations. The fitness function corresponds to the accuracy with which the candidate reproduces training data. The crossover function generates new candidates through operations such as exchange of corresponding parameters, min, max, average, and other simple operators.
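The selection-crossover-mutation loop above can be sketched as follows; the toy fitness function (recovering a known two-parameter target) is a stand-in for the accuracy of a candidate force-field parametrization against training data:

```python
import random

def genetic_search(fitness, n_params, pop_size=30, generations=60, seed=1):
    """Minimal genetic algorithm: keep the fittest half, create children by
    averaging crossover plus Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                        # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2.0 for x, y in zip(a, b)]     # averaging crossover
            child = [x + rng.gauss(0.0, 0.1) for x in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy stand-in for force-field fitting: higher fitness means closer to the target.
target = (2.0, -1.0)
best = genetic_search(lambda p: -sum((x - t) ** 2 for x, t in zip(p, target)),
                      n_params=2)
```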

3 Machine Learning Models

While the field of ML is vast, it is common to classify ML algorithms into “supervised” and “unsupervised”. In supervised learning algorithms, training data contain both features and labels. The goal is to learn a function that takes as input a feature vector and returns a predicted label. Supervised learning can further be categorized into classification and regression. When labels are categorical, the learning task is commonly called “classification”. On the other hand, if the task is to predict a continuous numerical value, it is called regression. In unsupervised learning algorithms, training data do not have labels. The goal of unsupervised algorithms is to analyze patterns in data without requiring annotation. Common examples of unsupervised algorithms include clustering and dimensionality reduction. We note that there are many other active areas of ML, such as reinforcement learning and semi-supervised learning that are beyond the scope of this chapter. We refer interested readers to more exhaustive sources for a comprehensive discussion (Bishop and Nasrabadi 2006; Murphy 2012; Shalev-Shwartz and Ben-David 2014; Goodfellow et al. 2016).

3.1 Unsupervised Learning

The most commonly used unsupervised learning techniques are clustering and dimensionality reduction.

3.1.1 Clustering

In clustering, data represented as vectors are grouped together on the basis of some inherent structures (or patterns), typically characterized by their similarities or distances (Saxena et al. 2017; Gan et al. 2020). Clustering algorithms can be categorized on the basis of their outputs into: (i) crisp versus overlapping; or (ii) hard versus soft. In crisp clustering, each data point is assigned to exactly one cluster, whereas overlapping clustering algorithms allow for multiple memberships for each data point. In hard clustering algorithms, a data-point is assigned a 0/1 membership to every cluster (a 1 corresponding to the cluster the point is assigned to). In soft clustering algorithms, each data point is assigned membership grades (typically in a 0–1 range) that indicate the degree to which data points belong to each cluster. If the grades are convex (i.e., they are positive and sum to 1), then the grades can be interpreted as probabilities with which a data point belongs to each of the classes. In the general class of fuzzy clustering algorithms (Ruspini 1969), the convexity condition is not required.

Centroid-based clustering refers to algorithms where each cluster is represented by a single, “central” point, which may not be a part of the dataset. The most commonly used algorithm for centroid-based clustering (and indeed all of clustering) is the k-means algorithm of Lloyd (1982). Given a set of data-points \([\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n]\), and a pre-defined number of clusters k, the objective function of k-means is given by:

$$\begin{aligned} \mathop {\arg \,\min }\limits _{\textbf {C}} \quad {\sum _{i=1}^{k} \sum _{\textbf{x} \in \textbf{C}_i} || \textbf{x} - \mu _i ||^2} \end{aligned}$$

where \(\textbf{C}\) is the union of non-overlapping clusters (\(\textbf{C} = \{\textbf{C}_1, \textbf{C}_2, \ldots , \textbf{C}_k \}\)), and \(\mu _i\) represents the mean of all data-points belonging to cluster i. Stated otherwise, the objective of k-means clustering is to minimize the distance between data-points and their assigned clusters (as represented by the mean). The k-means problem is NP-hard, but heuristics such as Lloyd’s algorithm can efficiently find local optima.
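Lloyd's algorithm alternates an assignment step and a centroid-update step until the centroids stabilize. A minimal sketch on hypothetical 2-D data:

```python
def lloyd_kmeans(points, centroids, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs; the initial centroids are deliberately poor guesses.
pts = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
cents, clusters = lloyd_kmeans(pts, centroids=[(1.0, 1.0), (2.0, 2.0)])
```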

Distribution-based clustering algorithms work on the assumption that data-points belonging to the same cluster are drawn from the same distribution. Common algorithms in this class assume that data follow Gaussian Mixture Models, and typically solve the problem using the Expectation-Maximization (EM) approach. EM performs maximum likelihood estimation in the presence of latent variables. Each iteration consists of two steps. In the first step, the latent variables are estimated (E-step). This is followed by the Maximization (M-step), where the parameters of the model are optimized to better fit the data. In fact, the aforementioned Lloyd’s algorithm for k-means clustering is a simple instance of EM.
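The E- and M-steps are easy to see in a one-dimensional, two-component Gaussian mixture. The sketch below fixes the variances and mixture weights purely to keep the example short; a full EM implementation would update those as well:

```python
import math

def em_gmm_1d(xs, mu, iters=50, sigma=1.0, w=(0.5, 0.5)):
    """EM for a two-component 1-D Gaussian mixture with fixed variance/weights."""
    mu = list(mu)
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-((x - mu[k]) ** 2) / (2.0 * sigma ** 2))
                 for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: responsibility-weighted mean update for each component.
        for k in (0, 1):
            mu[k] = (sum(r[k] * x for r, x in zip(resp, xs))
                     / sum(r[k] for r in resp))
    return mu

# Points drawn around -3 and +3; EM recovers the two component means.
xs = [-3.2, -3.0, -2.8, 2.9, 3.0, 3.1]
mu_hat = em_gmm_1d(xs, mu=(-1.0, 1.0))
```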

Density-based clustering is a class of spatial-clustering algorithms, in which a cluster is modeled as a dense region in data space that is spatially separated from other clusters. Density-based spatial clustering of applications with noise (DBSCAN) by Ester et al. (1996) is the most commonly used algorithm in this class. DBSCAN requires two parameters: (i) \(\epsilon \)—the size of the neighborhood; and (ii) Minpts—the minimum number of points required to form a dense region. DBSCAN proceeds as follows—first, it computes the \(\epsilon \)-neighborhood of every point. Then, it designates points with at least Minpts neighbors as “core-points”. Next, it finds connected components of core-points by inspecting the neighbors of each core-point. Finally, each non-core-point is assigned to a cluster if it lies in the \(\epsilon \)-neighborhood of one of that cluster’s core-points. If a data-point is in no such neighborhood, it is identified as an outlier, or noise (Schubert et al. 2017).
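The steps above translate into a short (quadratic-time) sketch; the data and parameter choices are illustrative:

```python
def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: returns a label per point; -1 marks noise."""
    n = len(points)
    nbrs = [[j for j in range(n)
             if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]
            for i in range(n)]                   # eps-neighborhood of every point
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster                      # start a new cluster at a core point
        stack = [i]
        while stack:
            j = stack.pop()
            for k in nbrs[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    if core[k]:
                        stack.append(k)          # only core points expand the cluster
        cluster += 1
    return labels

# A dense blob plus one faraway point, which ends up labeled as noise (-1).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (9.0, 9.0)]
labels = dbscan(pts, eps=0.3, min_pts=3)
```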

Hierarchical clustering refers to a family of clustering algorithms that seeks to build a hierarchy of the clusters (Maimon and Rokach 2005). The two common approaches to build these hierarchies are bottom-up and top-down. In bottom-up (or agglomerative) clustering, each data-point initially belongs to a separate cluster. Small clusters are created on the basis of similarity (or proximity). These clusters are merged repeatedly until all data-points belong to a single cluster. The reverse process is performed in the top-down (or divisive) clustering approaches, where a single cluster is split repeatedly until each data-point is its own cluster. The main parameters to choose are the metric (i.e., the distance measures), and the linkage criterion. Commonly used metrics are L-1, L-2 norms, Hamming distance, and inner products. Linkage criterion quantifies distance between two clusters on the basis of distances between pairs of points across the clusters.

3.1.2 Dimensionality Reduction

Dimensionality reduction is an unsupervised technique common to many applications. Reducing dimensions produces a parsimonious denoised representation of data that is amenable to analysis by complex algorithms that would otherwise not be able to handle large amounts of raw data.

Linear Dimensionality Reduction Techniques

Principal component analysis (PCA) is perhaps the most commonly used linear dimension reduction technique. Principal components correspond to directions of maximum variation in data. Projecting data onto these directions, consequently, maintains dominant patterns in data. The first step in PCA is to center the data around zero mean to ensure translational invariance. This is done by computing the mean of the rows of data matrix M and subtracting it from each row to give a zero-centered data matrix \(M'\). A covariance matrix is then computed as the normalized form of \(M'^TM'\). Note that the \((i,j)\textrm{th}\) element of this covariance matrix is simply the covariance of the \(i\textrm{th}\) and \(j\textrm{th}\) columns of matrix \(M'\). The dominant directions in this covariance matrix are then computed as the dominant eigenvectors of this matrix. Selecting the k dominant eigenvectors and projecting the data matrix M onto this subspace yields a k-dimensional data matrix that best preserves variances in data. A common approach to selecting k is to consider the drop in magnitude of corresponding eigenvalues. PCA has several advantages: (i) by reducing the effective dimensionality of data, it reduces the cost of downstream processing; (ii) by retaining only the dominant directions of variance, it denoises the data; and (iii) it provides theoretical bounds on loss of accuracy in terms of the dropped eigenvalues.
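The PCA pipeline just described (center, form the covariance matrix, take dominant eigenvectors, project) can be sketched with NumPy; the toy data matrix is an assumption for illustration:

```python
import numpy as np

def pca(M, k):
    """PCA: center the rows, form the covariance matrix, and project onto
    the k dominant eigenvectors."""
    Mc = M - M.mean(axis=0)                   # zero-center the data
    cov = (Mc.T @ Mc) / (M.shape[0] - 1)      # normalized form of Mc^T Mc
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending order for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # dominant directions first
    components = eigvecs[:, order[:k]]
    return Mc @ components, eigvals[order]

# Points lying almost on the line y = 2x: one dominant direction of variance.
M = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
reduced, spectrum = pca(M, k=1)
```

The large drop between the first and second entry of `spectrum` is exactly the signal used to choose k.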

The general class of dimensionality reduction techniques also includes other matrix decomposition techniques. In general, these techniques express matrix data M as an approximate product of two matrices \(UV^T\); i.e., they minimize \(||M - UV^T||\). Various methods impose different constraints on matrices U and V, leading to a general class of methods that range from dimension reduction to commonly used clustering techniques. Perhaps the best-known technique in this class is the Singular Value Decomposition (SVD) (Golub and Reinsch 1971), which is closely related to PCA; here, the columns of U and V are orthogonal, and the product \(UV^T\) has rank k for some value of k. The orthogonality of the column space of these matrices makes them hard to interpret directly in the data space.

In contrast to SVD, if matrix U is constrained to non-negative entries whose columns sum to 1, we get a decomposition called archetypal analysis. In this interpretation, columns of V correspond to the corners of a convex hull of the points in matrix M, also known as pure-samples or archetypes, and all data points are expressed as convex combinations of these archetypes. A major advantage of archetypal analysis is that archetypes are directly interpretable in the data space. Another closely related decomposition is non-negative matrix factorization (NMF), which relaxes the orthogonality constraint of SVD, instead constraining the elements of matrix U to be non-negative (Gillis 2020). In doing so, it loses the error norm minimization properties of SVD, but gains interpretability. All of these methods can be used to identify patterns of coherent behavior among particles in the simulation. We refer interested readers to a comprehensive survey on linear dimensionality reduction methods by Cunningham and Ghahramani (2015).

Non-linear Dimensionality Reduction

General non-linear dimensionality reduction techniques are needed for data that reside on complex non-linear manifolds. This is commonly the case for particle datasets in reactive environments. Non-linear dimensionality reduction techniques typically operate in three steps: (i) embedding of data onto a low-dimensional manifold (in a high-dimensional space); (ii) defining suitable distance measures; and (iii) reducing dimensionality to preserve distance measures. Among the more common non-linear dimensionality reduction techniques is Isometric feature mapping (Isomap). This technique first constructs a graph corresponding to the dataset by associating a node with each row of the data matrix, with edges corresponding to the k nearest neighbors of the node. This graph is then used to define distances between nodes in terms of shortest paths. Finally, techniques such as multidimensional scaling (MDS)—a generalization of PCA that can use general distance matrices, as opposed to the covariance matrices used by PCA—are used to compute low-dimension representations of the matrix. An alternate approach uses the spectrum of a Laplace operator defined on the manifold to embed data points in a lower dimensional space. Such techniques fall into the general class of Laplacian eigenmaps.

An alternate approach to non-linear dimensionality reduction is the use of non-linear transformations on data in conjunction with a suitable distance measure, followed by MDS for dimensionality reduction. The first two steps of this process (non-linear transformation and distance measure computation) are often integrated into a single step through the specification of a kernel. The use of such a kernel with MDS is called kernel PCA. The key challenges in the use of these methods relate to: (i) suitable representation techniques (described in Sect. 5); (ii) kernel functions; and (iii) appropriate scaling mechanisms since distance matrices can have highly skewed distributions and the directions may be dominated by a small number of very large entries in the distance matrix. Common approaches to kernel selection rely on polynomial transformations of increasing degree until suitable spectral gap is observed. Data representations and normalization are highly application and context dependent.
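A compact kernel-PCA sketch along these lines (kernel evaluation, double-centering, then an MDS-style eigendecomposition) follows; the degree-2 polynomial kernel and the data are assumptions for illustration:

```python
import numpy as np

def kernel_pca(M, k, kernel):
    """Kernel PCA: build the kernel matrix, double-center it, and embed the
    data using the top-k eigenvectors (MDS on the kernel matrix)."""
    n = M.shape[0]
    K = np.array([[kernel(x, y) for y in M] for x in M])
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = J @ K @ J                            # center the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    # Scale eigenvectors by sqrt(eigenvalue) to obtain embedding coordinates.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Degree-2 polynomial kernel, i.e., a fixed non-linear transformation of the data.
poly2 = lambda x, y: (1.0 + x @ y) ** 2
M = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
embedded = kernel_pca(M, k=2, kernel=poly2)
```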

Autoencoder and Deep Dimensionality Reduction

Autoencoders have been recently proposed for use in non-linear dimensionality reduction (Kramer 1991; Schmidhuber 2015; Goodfellow et al. 2016). Autoencoders are feed-forward neural networks (discussed in further detail in Sect. 3.2) that are trained to reproduce the identity function—i.e., the output of the autoencoder neural network is the input itself. Dimensionality reduction is accomplished in this framework by having an intermediate layer with a small number of activation functions. Through this constraint, an autoencoder is trained to “encode” input data into a low-dimensional latent space, with the goal of “decoding” the input back. The output of the encoder therefore represents a non-linear reduced-dimension representation of the input.

T-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP)

t-SNE (Maaten and Hinton 2008) and UMAP (McInnes et al. 2018) are commonly used non-linear dimensionality reduction techniques for mapping data to two or three dimensions—primarily for visual analysis. t-SNE computes two probability distributions—one in the high-dimensional space and one in the low-dimensional space. These distributions are constructed so that two points that are close to each other in Euclidean space have similar probability values. In the high-dimensional space, a Gaussian distribution is centered at each data point, and a conditional probability is estimated for all other data points. These conditional probabilities are normalized to generate a global probability distribution over all points. For points in the low-dimensional space, t-SNE uses a Cauchy distribution to compute the probability distribution. The goal of dimensionality reduction translates to minimizing the distance (in terms of KL divergence) between these two distributions. This is typically done using gradient descent. In contrast to t-SNE, a closely related technique, UMAP, assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant (https://umap-learn.readthedocs.io/en/latest/). Both of these techniques are extensively used in visualization of high-dimensional data.

3.2 Supervised Learning

The goal of supervised methods is to learn a function from input data vectors to output classes (labels) using training input-output examples. The function should “generalize”, i.e., accurately predict labels for unseen inputs. The general learning procedure is as follows: first, the data is split into train and test sets. Then, the function is learnt using the input-output training examples. The learnt function is applied to the test inputs to get predicted outputs. If the algorithm performs poorly on training examples, we say that it “underfits” the data. This typically occurs when the model is unable to capture the complexity of the data. When a learnt function performs well on training data but poorly (say, low prediction accuracy) on test data, we say that the algorithm “overfits” to the train set. Overfitting occurs when the algorithm fits to noise, rather than true data patterns. The problem of balancing underfitting and overfitting is called the bias-variance tradeoff. Intuitively, we want the model to be sophisticated enough to capture complex data patterns, but not so flexible that it captures idiosyncrasies of the train examples.

The problem of overfitting can be controlled through a number of approaches. In cross-validation, the training set is further divided into subsets (or folds). The training procedure learns the function on all but one fold in every iteration, and the model is validated on the held-out fold. The parameters of the model are optimized to ensure high cross-validation accuracy. Regularization is a technique in which a penalty term is added to the error function to prevent overfitting. Tikhonov regularization is one of the early examples of regularization that is commonly used in linear regression. Early stopping is a form of regularization applicable when the learner uses iterative methods like gradient descent. The key idea of early stopping is to continue training only as long as the learning algorithm keeps improving performance on external (unseen) validation data; training is stopped when further improvement on training performance comes at the expense of validation performance. Other approaches to avoid overfitting include data augmentation (increasing the number of data points for training) and improved feature selection. Underfitting can be avoided by using more complex models (e.g., going from a linear to a non-linear model), increasing training time, and reducing regularization.
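A minimal NumPy sketch of the k-fold splitting that underlies cross-validation; each fold is held out for validation exactly once, and the model is trained on the remaining folds.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs: each fold is held out once for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Example: 5-fold split of 20 samples
splits = list(kfold_indices(20, 5))
```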

3.2.1 Overview of Supervised Learning Algorithms

Supervised learning algorithms are often categorized as generative or discriminative. Generative algorithms aim to learn the distribution of each class of data, whereas discriminative algorithms aim to find boundaries between different classes. The Naive Bayes Classifier is a generative approach that uses Bayes’ theorem with strong independence assumptions between the features (Rish 2001). Given a d-dimensional data vector \(\textbf{x} = [x_1, x_2, \ldots , x_d]\), naive Bayes models the probability that \(\textbf{x}\) belongs to class k as follows:

$$\begin{aligned} p(C_k | \textbf{x}) \propto p(C_k) \prod _{i=1}^d p(x_i | C_k) \end{aligned}$$

In practice, the parameters for the distributions of features are estimated using maximum-likelihood estimation. Despite the strong assumptions made in naive Bayes, it works well in many practical settings. Linear Discriminant Analysis (LDA) is a binary classification algorithm that models the conditional probability densities \(p(\textbf{x} | C_k)\) as normal distributions with parameters \((\mu _k, \Sigma )\), where \(k \in \{0,1\}\) (McLachlan 2005). The simplifying assumption of homoscedasticity (i.e., the covariance matrices are the same for both classes) means that the classifier predicts class 1 if:

$$\begin{aligned} \Sigma ^{-1}(\mu _1 - \mu _0) \cdot \textbf{x} > \frac{1}{2} \Sigma ^{-1}(\mu _1 - \mu _0) \cdot (\mu _1 + \mu _0) \end{aligned}$$

More complex generative methods include Bayesian Networks and Hidden Markov Models.
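The LDA decision rule displayed above translates directly into code. A minimal NumPy sketch with hypothetical class means and a shared identity covariance:

```python
import numpy as np

def lda_predict(X, mu0, mu1, Sigma):
    """Homoscedastic LDA: the linear decision rule from the inequality above."""
    w = np.linalg.solve(Sigma, mu1 - mu0)        # Sigma^{-1} (mu1 - mu0)
    threshold = 0.5 * w @ (mu1 + mu0)
    return (X @ w > threshold).astype(int)       # 1 if the inequality holds, else 0

# Toy example: two well-separated Gaussian classes with identity covariance
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.eye(2)
X = np.array([[0.1, -0.2], [1.9, 2.1]])          # one point near each mean
labels = lda_predict(X, mu0, mu1, Sigma)
```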

The k-Nearest Neighbor (k-NN) algorithm is an early, and still widely used, discriminative algorithm for both classification and regression. In classification, the label of a test data sample is obtained by a vote of the labels of its k nearest neighbors. In regression, k-NN computes the predicted value of a test sample as a function of the corresponding values of its k nearest neighbors. Logistic regression uses a logistic function (logit) to model a binary dependent variable. In the training phase, the parameters of the logit function are learnt. Logistic regression is similar to LDA, but with fewer assumptions.
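A minimal NumPy sketch of k-NN classification (Euclidean distances, majority vote) on a hypothetical toy training set:

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest samples
    votes = y_train[nearest]
    return np.bincount(votes).argmax()       # majority label

# Toy training set: two points per class
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
label = knn_classify(X_train, y_train, np.array([0.05, 0.0]), k=3)
```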

Support Vector Machine (SVM) (Cortes and Vapnik 1995) is a widely used discriminative model for regression and classification. Given input data \([\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n]\) and corresponding labels \(y_1, y_2, \ldots , y_n\), where \(y_i \in \{-1,1\}, \forall i \in \{1,2,\ldots ,n\}\), SVM aims to optimize the following objective function:

$$\begin{aligned} \text {minimize} \quad {\lambda ||\textbf{w}||^2 + \sum _{i=1}^n \max (0, 1 - y_i(\textbf{w}\cdot \textbf{x}_i - b)) } \end{aligned}$$

Here, vector \(\textbf{w}\) represents the vector normal to the separating hyperplane and \(\lambda \) is the weight given to regularization. The \(\max (.)\) term is called the hinge-loss function; it allows SVMs to tolerate misclassified points and thus find soft-margin separators. To learn non-linear boundaries, SVMs typically use the so-called “kernel trick”. The idea is that an implicit high-dimensional representation of raw data can let linear learning algorithms learn non-linear boundaries. The kernel function itself is a similarity measure. Common kernels include Fisher, Polynomial, Radial Basis Function (RBF), Gaussian, and Sigmoid functions. Other examples of discriminative methods include decision trees and random forests.
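The displayed SVM objective can be evaluated directly. A minimal NumPy sketch on a hypothetical two-point toy set (no optimization, just the regularized hinge loss); a separating \(\textbf{w}\) yields only the regularization term, while a misclassifying \(\textbf{w}\) pays a large hinge penalty.

```python
import numpy as np

def svm_objective(w, b, X, y, lam=0.01):
    """Regularized hinge loss of the displayed soft-margin SVM objective."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero for points beyond the margin
    return lam * (w @ w) + hinge.sum()

X = np.array([[2.0, 0.0], [-2.0, 0.0]])      # one point per class
y = np.array([1, -1])
w_good = np.array([1.0, 0.0])                # separates the two points
loss_good = svm_objective(w_good, 0.0, X, y)
loss_bad = svm_objective(-w_good, 0.0, X, y) # misclassifies both points
```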

3.2.2 Neural Networks

Neural Networks are interconnected groups of units called neurons that are organized in layers. The first layer is called the input layer, and typically has the same dimension as the input. The final layer is called the output layer. The outputs of neural networks could be predictions of class labels, images, text, etc. Each neuron in an intermediate layer is given a number of inputs. It computes a non-linear function on a weighted sum of its inputs. The resulting output may be fed into a number of neurons in the next layer. The non-linear function associated with a neuron is called an activation function. Common examples of activation functions include hyperbolic tangent (tanh), sigmoid, Rectified Linear Unit (ReLU), and Leaky ReLU, among many others.

There are two key steps to designing neural networks for specific tasks. The first step corresponds to design of the network architecture. This specifies the number of layers, connectivity, and types of neurons. The second step parametrizes weights on edges of the neural network using a suitable optimization procedure for matching the output distribution with the target distribution (as discussed earlier in Sect. 2.1).

The term deep learning is used to describe a family of machine learning models and methods whose architectures use neural networks as core components. The word “deep” corresponds to the fact that learning algorithms typically use neural network models with many layers, in contrast to shallow networks which typically have one or two intermediate (or hidden) layers (Schmidhuber 2015).

3.2.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are neural networks that use convolutions to quantify local pattern matches. CNNs are feed-forward networks with one or more convolution layers. CNNs are used extensively in the analysis of images, and more recently, graphs that model connected structures such as molecules. CNNs have an input layer, hidden layer(s), and an output layer. The input to a CNN is a tensor of the form \(\# inputs \times input\ height \times input\ width \times input\ channels\). The height and width parameters correspond to the size of the original images. The number of input channels is typically three (red, green, and blue) for images.

Each of the hidden layers can be one of: (i) convolutional layer, (ii) pooling layer, or a (iii) fully connected layer. A convolutional layer takes as input an image, or the output of another layer, and outputs a feature map. This produces a tensor of the form \(\#inputs \times \! feature\ height \times \! feature\ width \times \! feature\ channels\). Each neuron of a CNN processes only a small region of the input. This region is called the receptive field. It convolves this input and passes it on to the next layer. Pooling Layers are used to reduce the dimensionality of the data. They do so by aggregating the outputs of neurons in the previous (convolutional) layer. Pooling strategies can be local (operating on a small subset of neurons), or global (operating on the entire feature map). Common pooling functions include max and average. In fully connected layers, outputs of neurons are connected to every single neuron in the next layer. They are often used as the penultimate layer before the output layer, where all weights are combined to compute the prediction (i.e., the output). A neural network with only fully connected layers is also called a Multilayer Perceptron (MLP). From this point of view, CNNs are regularized forms of MLPs.

There are a number of parameters associated with CNNs that must be tuned. Specific to convolutional layers, the common parameters are stride, depth, and padding. The depth parameter of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. Stride controls the translation of the convolution filter. Padding allows the augmentation of the input with zeros at the border of the input volume. Other parameters include kernel size and pooling size. Kernel size specifies the number of pixels that are processed together, whereas pooling size controls the extent of down-sampling. Typical values in common image processing networks are \(3 \times 3\) for kernels and \(2 \times 2\) for pooling.
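A minimal NumPy sketch of a single-channel 2-D convolution illustrating the kernel size, stride, and padding parameters discussed above (a real CNN layer would additionally batch over inputs and channels and add a bias and activation):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2-D convolution (cross-correlation, as in CNN layers)."""
    if padding:
        image = np.pad(image, padding)          # zero-pad the borders
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1                # output (feature map) height
    ow = (iw - kw) // stride + 1                # output (feature map) width
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()  # receptive-field dot product
    return out

img = np.arange(16.0).reshape(4, 4)
feat = conv2d(img, np.ones((2, 2)), stride=2)   # 2x2 kernel, stride 2 -> 2x2 map
```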

In addition to parameter tuning, regularization is also required to design robust CNNs. In addition to the generic methods for regularization mentioned earlier (such as early-stopping and L1/L2 regularization), there are CNN-specific approaches. Dropout is a common measure taken to regularize neural networks. Fully-connected networks (or MLPs) are prone to overfitting because of their large number of connections. An intuitive way to resolve this issue is to leave out individual nodes (and the corresponding inbound and outbound edges) from the training procedure. Each node is left out with a probability p (usually set to 0.5). During the testing phase, the full network is used, with activations scaled so that their expected values match those computed across the different versions of the dropped-out network. Other simple, CNN-specific parameter tuning techniques limit the number of units in hidden layers, the number of hidden layers, and the number of channels in each layer.
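A minimal NumPy sketch of dropout, here in the common “inverted” form that rescales surviving activations at train time so that their expected value matches the test-time (full-network) activation:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(activations, p=0.5, train=True):
    """Inverted dropout: drop units with prob. p at train time; identity at test time."""
    if not train:
        return activations                     # test phase: full network
    mask = rng.random(activations.shape) >= p  # keep each unit with prob. 1 - p
    return activations * mask / (1.0 - p)      # rescale to preserve expectation

a = np.ones(1000)
a_train = dropout(a, p=0.5)                    # ~half the units zeroed, rest doubled
a_test = dropout(a, p=0.5, train=False)        # unchanged
```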

Commonly used CNN architectures include LeNet (LeCun et al. 1989), AlexNet (Krizhevsky et al. 2012), ResNet (He et al. 2016), Wide ResNet (Zagoruyko and Komodakis 2016), GoogLeNet (Szegedy et al. 2015), VGG (Simonyan and Zisserman 2014), DenseNet (Huang et al. 2017), and Inception (v2 (Szegedy et al. 2016), v3 (Szegedy et al. 2016), v4 (Szegedy et al. 2017)).

3.2.4 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a neural network in which nodes have internal hidden states, or memory. RNNs can therefore process (temporal) sequences of inputs. They are typically used in the analysis of speech signals, language translation, and handwriting recognition, and more recently in prediction of atomic trajectories in molecular dynamics simulations.

A key feature of RNNs is the ability to share parameters across different parts of the model. Given a sequence of inputs \([\textbf{x}_1, \textbf{x}_2, ... , \textbf{x}_n]\), the state of the RNN at time t is given as

$$\begin{aligned} \textbf{h}^{(t)} = f(\textbf{h}^{(t-1)}, \textbf{x}^{(t)}; \theta ) \end{aligned}$$

where f(.) is the recurrent function, and \(\theta \) is the set of shared parameters. From Eq. 7, one can see that RNNs predict the future on the basis of past states and inputs.
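The recurrence of Eq. 7 is straightforward to sketch in NumPy; here f is tanh and \(\theta = (W_h, W_x, b)\) is reused at every time step (all dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def rnn_forward(xs, W_h, W_x, b):
    """Vanilla RNN: h_t = f(h_{t-1}, x_t; theta) with f = tanh, shared weights."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:                       # same parameters theta reused at every step
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return np.array(states)

d_in, d_hidden, T = 4, 8, 5
W_h = rng.normal(0, 0.1, (d_hidden, d_hidden))
W_x = rng.normal(0, 0.1, (d_hidden, d_in))
b = np.zeros(d_hidden)
H = rnn_forward(rng.normal(size=(T, d_in)), W_h, W_x, b)   # one state per step
```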

A generic RNN can, in theory, remember arbitrarily long-term dependencies. In practice, repeated use of back-propagation causes gradients to vanish (i.e., tend to zero), or explode (i.e., tend to infinity). Gated RNNs are designed to circumvent these issues. The most widely used Gated RNNs are Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997; Gers et al. 2000) and Gated Recurrent Unit (GRU) (Cho et al. 2014). Recall that a regular activation neuron consists of a non-linear function applied to a linear transformation of the input. In addition to this, LSTMs have an internal cell-state (different from the hidden-state recurrence previously discussed), and a gating mechanism that controls the flow of information. In all, LSTMs have three gates—input gate, forget gate, and output gate. Specifically, the forget gate allows a network to forget old states that have accumulated over time, thereby preventing vanishing gradients. GRUs are similar to LSTMs, but with a simplified gating architecture. GRUs combine LSTM’s input and forget gates into a single update gate, and merge the hidden- and cell-states; a separate reset gate controls how much of the previous state enters the candidate update. This results in a simpler architecture that requires fewer tensor operations. The problem of exploding gradients is handled by gradient clipping. Two common strategies in gradient clipping are: (i) value clipping—values above and below set thresholds are set to the respective thresholds, and (ii) norm clipping—rescaling the gradient values by a chosen norm. Using CNNs and RNNs as building blocks, we can develop complex NN frameworks such as Generative Adversarial Networks (GANs).
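Both gradient clipping strategies are one-liners in NumPy; a minimal sketch of the two variants just described:

```python
import numpy as np

def clip_by_value(grad, threshold):
    """Value clipping: cap each component at +/- threshold."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    """Norm clipping: rescale the whole gradient if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([3.0, -4.0])           # gradient with norm 5
gv = clip_by_value(g, 2.0)          # each component capped at +/- 2
gn = clip_by_norm(g, 1.0)           # rescaled to unit norm, direction preserved
```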

3.2.5 Generative Adversarial Networks

A Generative Adversarial Network (GAN) is a neural network in which a zero-sum game is contested by two neural networks—the generative network and the discriminative network (Goodfellow et al. 2014). The generative network learns to map a pre-defined latent space to the distribution of the dataset, whereas a discriminative network is used to predict whether an input instance is truly from the dataset or if it is the output of the generative network. The objective of the generative network is to fool the discriminative network (i.e., increase error of the discriminative network), whereas the objective of the discriminative network is to correctly identify true data. The training procedure for a GAN is as follows: first, the discriminative network is given several instances from the dataset, so that it learns the “true” distribution. The generative network is initially seeded with a random input. From there, the generative network creates candidates with the objective of fooling the discriminative network. Both networks have separate back-propagation procedures; the discriminator learns to distinguish the two sources of inputs, even as the generative network produces increasingly realistic data.

GANs have found a number of applications in the synthesis of (realistic) datasets. They have been successful in creating art, synthesizing virtual environments, generating photographs of synthetic faces, and designing animation characters. GANs are often used for transfer learning, where knowledge obtained from training in one application can be used in another similar, but different, application.

3.2.6 Transfer Learning

Traditional machine learning is isolated, in that a model is trained in a very specific context, to perform a targeted task. The key idea in transfer learning is that new tasks learn from the knowledge gained in a previously trained task (Weiss et al. 2016). To formally define transfer learning, we first define domain and task. Let \(\mathcal {X}\) be a feature space, and \(\textbf{X}\) be the dataset (i.e., \(\textbf{X} = [\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n] \in \mathcal {X}\)). Similarly, let \(\mathcal {Y}\) be the label space and \(Y = \{y_1, y_2, \ldots , y_n\} \in \mathcal {Y}\) be the labels corresponding to the rows of \(\textbf{X}\). Further, let P(.) denote a probability distribution. A domain is defined as \(\mathcal {D} = \{\mathcal {X}, P(\textbf{X})\}\). Given a domain \(\mathcal {D}\), a task \(\mathcal {T}\) is defined as \(\mathcal {T} = \{\mathcal {Y}, P(Y|\textbf{X})\}\). Given source and target domains \(\mathcal {D}_S\) and \(\mathcal {D}_T\) and corresponding tasks \(\mathcal {T}_S\) and \(\mathcal {T}_T\), transfer learning aims to learn \(P(Y_T|\textbf{X}_T)\) using information from \(\mathcal {D}_S\) and \(\mathcal {D}_T\). In this setup, we can see that there are four possibilities: (i) \(\mathcal {X}_S \ne \mathcal {X}_T\), (ii) \(\mathcal {Y}_S \ne \mathcal {Y}_T\), (iii) \(P(\textbf{X}_S) \ne P(\textbf{X}_T)\), or (iv) \(P(Y_S | \textbf{X}_S) \ne P(Y_T | \textbf{X}_T)\). In (i), the feature spaces of the source and target domains are different. In (ii), the label spaces of the tasks are different, which happens in conjunction with (iv), where the conditional probabilities of labels are different. In (iii), the feature spaces of the source and target domains are the same, while the marginal probabilities are different.
Case (iii) is interesting for simulations, because the feature spaces for source (simulation) and target (reality) are typically the same, but the marginal probabilities of observations in simulation and reality can be very different.

3.3 Software Infrastructure for Machine Learning Applications

A number of software packages and libraries have been developed over the last decade in support of ML applications in different contexts. Matrix computations are often performed using NumPy (Python) (Harris et al. 2020), and Eigen (Guennebaud et al. 2010) and Armadillo (Sanderson and Curtin 2016, 2020) in C++. Standard machine learning methods, including clustering algorithms such as k-means and DBSCAN, classification algorithms such as SVM and LDA, regression, and dimensionality reduction are available in Python packages such as SciPy (Virtanen et al. 2020) and Theano (Theano Development Team 2016), and in C++ packages such as MLPack (Curtin et al. 2018). Deep learning approaches are often implemented using libraries such as PyTorch (Paszke et al. 2019), TensorFlow (Abadi et al. 2015), Caffe (Jia et al. 2014), Microsoft Cognitive Toolkit, and DyNet (Neubig et al. 2017). We note that a number of machine learning packages written in a source language have readily available interfaces for other languages. For example, Caffe is written in C++, with interfaces available for both Python and MATLAB. Finally, we also note that Julia has wrappers for a number of the Python and C++ libraries.

4 ML Applications in Reactive Atomistic Simulations

Building on our basic toolkit of ML models and methods, we now describe recent advances in the use of ML techniques in reactive atomistic simulations. We focus on three core challenges—use of ML techniques for training highly accurate atomistic interaction models, use of ML techniques in accelerating simulations, and use of ML methods for analysis of atomistic trajectories. Our discussion applies broadly to particle methods, however, we use reactive atomistic simulations as our model problem. In particular, we use ReaxFF as the force field for simulations.

4.1 ML Techniques for Training Reactive Atomistic Models

Optimization of force-field parameters for target systems of interest is crucial for high fidelity in simulations. However, such optimizations cannot be specific to the sets of molecules present in the target system for two reasons: (i) utility of a parameter set that only works for a particular system is marginal; and (ii) in a reactive simulation, molecular composition of a system is expected to change as a result of the reactions during the course of a simulation. For this reason, reactive force field optimizations are performed at the level of groups of atoms, e.g. Ni/C/H, Si/O/H, etc. Nevertheless, the behaviour of a given group of atoms may show variations in different contexts such as combustion, aqueous systems, condensed matter phase systems, and biochemical processes. Therefore, it may be desirable to create parameter sets optimized for different contexts (Senftle et al. 2016).

Reactive force fields such as ReaxFF are complex, with a large number of parameters that can be grouped into charge equilibration parameters, bond order parameters, and N-body interaction parameters (e.g., single-body, two-body, three-body, four-body, and non-bonded), in addition to system-wide global parameters. As the number of elements in a parameter set increases, force field optimization quickly becomes a challenging problem due to the high dimensionality and discrete nature of the problem. Several methods and software systems have been developed for force field optimization over the years, starting with more traditional methods early on and moving to ML-based methods more recently. After giving an overview of the force field optimization problem, we briefly review traditional methods first and then discuss the ML-based techniques, which mainly draw upon Genetic Algorithms (see Sect. 2.2) as well as the extensive ML software infrastructure that has been built recently (see Sect. 3.3).

4.1.1 Training Data and Validation Procedures

Training procedures for typical force fields require three inputs: (i) model parameters to be optimized; (ii) geometries, a set of atom clusters that describe the key characteristics of the system of interest (e.g., bond stretching, angle and torsion scans, reaction transition states, crystal structures, etc.); and (iii) training data, chemical and physical properties associated with these atom clusters (such as energy minimized structures, relative energies for bond/angle/torsion scans, partial charges, and forces), which are typically obtained from high-fidelity quantum mechanical (QM) models or sometimes experiments, along with a function that combines these different types of training items into a quantifiable fitness value:

$$\begin{aligned} \text {Error}(m) = \sum _{i=1}^N \left( \frac{x_{i} - y_{i}}{\sigma _i} \right) ^2. \end{aligned}$$

In Eq. 8, m represents the model with a given set of force field parameter values, \(x_i\) is the training data value predicted using the model m, \(y_i\) is the ground truth value of the corresponding training data item, and \(\sigma _i\) controls the weight assigned to each training item (a smaller \(\sigma _i\) implies a higher weight).
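Eq. 8 itself is simple to compute. A minimal NumPy sketch with hypothetical predicted and reference values; the \(\sigma_i\) weights let heterogeneous items (here an energy, a bond length, and an angle) contribute on comparable scales:

```python
import numpy as np

def ff_error(predicted, reference, sigma):
    """Weighted sum-of-squares fitness of Eq. 8: lower is a better parameter set."""
    return np.sum(((predicted - reference) / sigma) ** 2)

# Hypothetical training items: an energy (kcal/mol), a bond length (A), an angle (deg)
x = np.array([-102.1, 1.54, 35.2])   # values predicted by the model m
y = np.array([-102.5, 1.52, 35.0])   # QM / experimental reference values
sigma = np.array([0.5, 0.01, 1.0])   # per-item accuracy weights (sigma_i)
err = ff_error(x, y, sigma)
```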

Table 1 summarizes commonly used training data types and provides some examples. An energy-based training data item uses a linear relationship of different molecules (expressed through their identifiers) because relative energies rather than the absolute energies drive the chemical and physical processes. For structural items, geometries must be energy minimized as accurate prediction of the lowest energy states is crucial. For other training item types, energy minimization is optional, but usually preferred.

Table 1 Examples for commonly used training items. Identifiers (e.g., ID1) refer to structures/molecules

4.1.2 Global Methods for Reactive Force Field Optimization

The earliest ReaxFF optimization tool is the sequential one-parameter parabolic interpolation method (SOPPI) (van Duin et al. 1994). SOPPI uses a one-parameter-at-a-time approach, where consecutive single-parameter searches are performed until a certain convergence criterion is met. The algorithm is simple, but as the number of parameters increases, the number of one-parameter optimization steps needed for convergence increases drastically. Furthermore, the success of this method is highly dependent on the initial guess and the order of the parameters to be optimized.

Due to the drawbacks of SOPPI, various global methods such as genetic or evolutionary algorithms (Dittner et al. 2015; Jaramillo-Botero et al. 2014; Larsson et al. 2013; Trnka et al. 2018), simulated annealing (SA) (Hubin et al. 2016; Iype et al. 2013) and particle swarm optimization (PSO) (Furman et al. 2018) have been investigated for force field optimization. We discuss some of the promising techniques below.

Genetic Algorithms (GA) often work well for global optimization because, via crossover, they can exploit (partial) separability of the optimization problem even in the absence of any explicit knowledge about its presence. They are also able to make long-range “jumps” in the search space. Because the population continuously contains multiple individuals that have survived several selection rounds, these “jumps,” based on information interchange between individuals, have a high probability of landing at new, promising locations. Last but not least, by admitting operators other than the classic crossover and mutation steps, GAs can be extended within this abstract meta-heuristic framework with desirable features of other global optimization strategies. GAs are especially useful when dealing with challenging and time-critical optimization problems. The straightforward parallelism and intrinsic scalability of GAs provide an advantage over strategies that are either serial in nature or admit only decoupled or loosely coupled task-level parallelism. An efficient and scalable implementation of GAs for ReaxFF is provided in the ogolem-spuremd software (Dittner et al. 2015), where the authors demonstrate convergence to fitness values similar to or better than those reported in the literature in a matter of a few hours of execution time through effective use of high-performance computers and advanced GA techniques.
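A minimal sketch of a GA of the kind described above, applied to a toy stand-in for the force-field error function (selection, uniform crossover, Gaussian mutation); real implementations such as ogolem-spuremd are far more sophisticated, and all constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def fitness(p):
    """Toy stand-in for the force-field error function (minimum at p = target)."""
    target = np.array([1.0, -2.0, 0.5])
    return np.sum((p - target) ** 2)

def evolve(pop_size=30, n_params=3, generations=80):
    pop = rng.uniform(-5.0, 5.0, (pop_size, n_params))
    for _ in range(generations):
        scores = np.array([fitness(p) for p in pop])
        parents = pop[np.argsort(scores)[: pop_size // 2]]   # selection
        i = rng.integers(0, len(parents), pop_size)
        j = rng.integers(0, len(parents), pop_size)
        mask = rng.random((pop_size, n_params)) < 0.5        # uniform crossover
        pop = np.where(mask, parents[i], parents[j])
        pop = pop + rng.normal(0.0, 0.1, pop.shape)          # mutation: local "jumps"
    scores = np.array([fitness(p) for p in pop])
    return pop[scores.argmin()], scores.min()

best_params, best_err = evolve()
```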

Recently, other population-based global ReaxFF optimization methods have been proposed, such as the particle swarm optimization algorithm RiPSOGM (Furman et al. 2018), covariance matrix adaptation evolutionary strategy (CMA-ES) (Shchygol et al. 2019), and the KVIK optimizer (Gaissmaier et al. 2022). Shchygol et al. (2019) explore different optimization choices for the CMA-ES method, the ogolem-spuremd software, as well as a Monte-Carlo force field optimizer (MCFF), and they systematically compare these techniques using three training sets from literature. Their CMA-ES method is an implementation of the stochastic gradient-free optimization algorithm proposed by Hansen (2006), where the main idea is to iteratively improve a multi-variate normal distribution in the parameter space to find a distribution whose random samples minimize the objective function starting from a user provided initial guess. The MCFF technique is based on the simulated annealing algorithm to optimize a given set of parameters. In every iteration, MCFF makes a small random change to the parameter vector and computes the corresponding change in the error function. Any change that reduces the error is accepted; changes that increase the error are accepted with a predetermined probability. With sufficiently small random changes and acceptance rates, MCFF can become a rigorous global optimization method, but at very high computational cost. Through extensive benchmarking, Shchygol et al. conclude that while CMA-ES can often converge to the lowest error rates, it cannot do this on a consistent basis. The GA method employed by ogolem-spuremd can produce consistently good (but not necessarily the lowest) error rates, but at higher computational costs compared to CMA-ES. Overall, they have found MCFF to underperform compared to CMA-ES and GA for similar computational costs.
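The MCFF acceptance rule described above (accept all improvements; accept worsening moves with a predetermined probability) can be sketched on a toy error surface; the target parameters, step size, and acceptance probability here are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

def train_error(p):
    """Toy error surface standing in for the force-field training error."""
    return np.sum((p - np.array([0.7, -1.3])) ** 2)

def mcff_search(p0, steps=5000, step_size=0.05, accept_prob=0.05):
    """Accept every improvement; accept worsening moves with a fixed probability."""
    p, err = p0.copy(), train_error(p0)
    best_p, best_err = p, err
    for _ in range(steps):
        trial = p + rng.normal(0.0, step_size, p.shape)   # small random change
        trial_err = train_error(trial)
        if trial_err < err or rng.random() < accept_prob:
            p, err = trial, trial_err
            if err < best_err:                            # track best-so-far
                best_p, best_err = p, err
    return best_p, best_err

p_best, err_best = mcff_search(np.array([3.0, 3.0]))
```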

4.1.3 Machine Learning Based Search Methods

While global methods have proven successful for force-field optimization, in the absence of any gradient information these global search methods require a large number of potential energy evaluations and, as such, can be very costly. With the emergence of advanced tools to calculate the gradients of complex functions automatically, machine learning based techniques for optimization of force fields have attracted interest.

iReaxFF: One of the earliest such attempts is the Intelligent-ReaxFF, iReaxFF, software (Guo et al. 2020). iReaxFF uses the TensorFlow library for automatically calculating gradient information and uses local optimizers such as Adam or BFGS. An additional benefit of the TensorFlow implementation is that iReaxFF can automatically leverage GPU acceleration. However, iReaxFF does not have the expected flexibility in terms of the training data, as it can only be trained to match the ReaxFF energies to the absolute energies from Density Functional Theory (DFT) computations on the training data; relative energies, charges, or geometry optimizations cannot be used in the training, essentially limiting its usability. As iReaxFF tries to exactly match the energies of the training data, the transferability of force fields generated by iReaxFF is also limited. While it is not clearly stated what kind of gradient information is calculated using TensorFlow, their definition of the loss function (which is the sum of the squared differences between absolute DFT and ReaxFF energies) suggests that their gradients are calculated with respect to atomic positions, which essentially amounts to performing a force matching based force field optimization. The number of iterations required to reach the desired accuracies for their test cases is rather large, on the order of tens to hundreds of thousands of iterations. Even with GPU acceleration, the training time for a test case reportedly takes several days. This is partly because iReaxFF does not filter out the unnecessary 2-body, 3-body, and 4-body interactions before the optimization step.

JAX-ReaxFF: Another recent effort built on automatic differentiation is the JAX-ReaxFF software (Kaymak et al. 2022). JAX is an automatic differentiation framework by Google, built on the XLA compiler, for high performance machine learning research (Bradbury et al. 2020); it can automatically differentiate native Python and NumPy functions. Leveraging this capability, JAX-ReaxFF automatically calculates the derivative of a given fitness function with respect to the set of force field parameters to be optimized from a Python-based implementation of the ReaxFF potential energy terms. By learning the gradient information of the high dimensional optimization space (which generally includes tens to over a hundred parameters), JAX-ReaxFF can employ highly effective local optimization methods such as the Limited Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm (Zhu et al. 1997) and the Sequential Least Squares Programming (SLSQP) optimizer (Kraft 1988). The gradient information alone is obviously not sufficient to prevent local optimizers from getting stuck in local minima, but when combined with a multi-start approach, JAX-ReaxFF can greatly improve the training efficiency (measured in terms of the number of fitness function evaluations). As they demonstrate through a diverse set of systems such as cobalt, silica, and disulfide, which were also used in other related work, they can reduce the number of optimization iterations from tens to hundreds of thousands (as in CMA-ES, ogolem-spuremd, or iReaxFF) down to only a few tens of iterations.

Another important advantage of JAX is its architectural portability enabled by the XLA technology (Sabne 2020) used under the hood. Hence, JAX-ReaxFF can run efficiently on various architectures, including graphics processing units (GPU) and tensor processing units (TPU), through automatic thread parallelization and vector processing. By making use of efficient vectorization techniques and carefully trimming the 3-body and 4-body interaction lists, JAX-ReaxFF can reduce the overall training time by up to three orders of magnitude (down to a few minutes on GPUs) compared to the existing global optimization schemes, while achieving similar (or better) fitness scores. The force fields produced by JAX-ReaxFF have been validated by measuring the macroscale properties (such as density and radial distribution functions) of their target systems.

Beyond speeding up force field optimization, the Python-based JAX-ReaxFF software provides an ideal sandbox environment for domain scientists: they can move beyond parameter optimization and experiment with the functional forms of the interactions in the model, adding new types of interactions or removing existing ones as desired. Since evaluating the gradient of the new functional forms with respect to atom positions yields forces, scientists are freed from the burden of coding the lengthy and error-prone force calculation routines. Through automatic differentiation of the fitness function as explained above, parameter optimization for the new set of functional forms can be performed without any additional effort by the domain scientists. After parameter optimization, they can readily run MD simulations to test the macro-scale properties predicted by the modified set of functional forms as a further validation step before production-scale simulations, or return to editing the functional forms if the desired results cannot be confirmed in this sandbox environment.
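The core autodiff idea can be illustrated with a toy example (this is not the JAX-ReaxFF code; the Morse-like pair energy and its parameters are purely illustrative): a single energy function written in JAX yields both forces, via the gradient with respect to positions, and the parameter gradients needed by a local optimizer.

```python
import jax
import jax.numpy as jnp

def energy(params, pos):
    # Toy Morse-like pair energy; D and r0 stand in for
    # force-field parameters to be optimized (illustrative only).
    D, r0 = params
    r = jnp.linalg.norm(pos[0] - pos[1])
    return D * (1.0 - jnp.exp(-(r - r0))) ** 2

grad_pos = jax.grad(energy, argnums=1)     # dE/dpositions; forces are -dE/dx
grad_params = jax.grad(energy, argnums=0)  # dE/dparameters, for training

params = jnp.array([1.0, 1.5])
pos = jnp.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
f = -grad_pos(params, pos)                 # forces on both atoms
g = grad_params(params, pos)               # gradient fed to L-BFGS/SLSQP
```

The same pattern scales to a full potential: whatever Python function computes the fitness, `jax.grad` supplies the gradient the local optimizer consumes.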

4.2 Accelerating Reactive Simulations

We now discuss how ML techniques can be directly used to accelerate reactive simulations and to improve their accuracy in different application contexts.

4.2.1 Machine Learning Potentials

At a high level, ML based potentials can be defined as follows (Behler 2016):

  1. The potential must establish a direct functional relation between atomic configuration and the corresponding energy, where the functional must be based on an ML model. As an example, a forward propagating deep neural network may serve as such a functional, where the input is the atomic configuration and the output is the energy.

  2. Any physical approximations or theoretically grounded constraints are explicitly incorporated into the training data and are not part of the energy functional.

The second requirement in the definition distinguishes traditional fixed form potentials from ML potentials. It also ensures that for a “sufficiently complex” energy functional and a “sufficiently large and diverse” training set, an ML based potential can produce arbitrarily accurate model predictions. Often it is expected that the training data are generated using a consistent and specific set of methods; it has been observed that mixing data from different QC techniques or experiments leads to poor learning outcomes. The size of the training set depends on the computational cost of generating the data and the accuracy expected of the ML model.

As with most traditional fixed form potentials, the ML potential energy is expressed as a sum of local energies:

$$\begin{aligned} E_\text {ML} = \sum _{i=1}^N E_i^\text {nbd}, \nonumber \end{aligned}$$

where the local energy \(E_i^\text {nbd}\) is the ML energy, which depends on the local neighborhood of the \(i\text {th}\) atom. The chemical environment of an atom is primarily decided by short range interactions (Kohn 1996). The long range interactions, which decay slower than \(r^{-2}\), are usually either truncated to zero at a cutoff distance \(R_c\) or smoothly reduced to zero using tapering functions; as an example, polynomial tapering functions are used in ReaxFF. The accuracy of such a model depends on the cutoff distance \(R_c\): larger values of \(R_c\) lead to a better approximation of long range interactions. However, larger \(R_c\) implies a larger atomic neighborhood (which grows as \(R_c^3\)), which means that more sample points are required in the training set. Thus \(R_c\) must be chosen to balance long range accuracy against a tractable neighborhood size.

4.2.2 Training Considerations

ML potentials, like fixed form potentials, require training. Here we briefly explore the steps and potential issues in the design and training of ML potentials (see, e.g., Unke et al. 2021).


Choice of quantum methods used in generation of training data:

Typically, ML based simulations are orders of magnitude slower than equivalent fixed form potential simulations (Brickel et al. 2019). However, unlike fixed form potentials, ML potentials may offer accuracy similar to that of an ab initio method (Sauceda et al. 2020). Thus it is essential to choose an appropriate ab initio method. On one hand, if the ab initio method is very fast and/or less accurate, it defeats the purpose of further approximating these data with a machine-learnt model. On the other hand, a method such as CCSD(T) is so computationally expensive that it becomes difficult to generate enough training data for ML models.

How much data?:

The amount of data needed depends on the size of the ML model, the desired accuracy, and the sampling technique used in producing the data set.


Sampling of training data over the domain of atomic configurations is crucial in achieving good training of the potentials. For models designed to simulate equilibrium problems, one can potentially rely on samples that are the output of an ab initio molecular dynamics simulation. Depending on the desired accuracy, generating such samples can become prohibitively expensive. Another alternative is to use metadynamics-type sampling techniques and generate samples that are in the vicinity of the free energy minima of the system. However, if the model is intended to address chemical reactions or transition states, then a more uniform sampling is required in which “rare events” are also sampled with relatively higher frequencies. The framework provided by an ML model does not include any “physics” of the problem; thus the training data must sample the configuration space sufficiently to include the relevant “physics” of the problem.

Training/validation and testing:

In the usual ML methodology, models are trained and tested against similarly structured but disjoint data sets. In this case, training and validation are performed on data sets that are similarly sampled but distinct. However, testing of the model is usually performed against bulk or physically measurable quantities computed using the trained models. Often the ML potential frameworks have hyperparameters that require a second optimization step; the testing phase must be repeated for different hyperparameter values.


4.2.3 Descriptors

Unique description of the atomic neighborhood is a central issue in structure–function prediction problems in biophysics and materials science (Ghiringhelli et al. 2015; Devillers and Balaban 1999; Valle and Oganov 2010). For ML systems, such uniqueness is crucial for effective training. Thus, one must express any atomic neighborhood in a representation that is invariant with respect to the action of the symmetry group of the system. In the case of three dimensional atomistic systems, this comprises the group of Galilean transformations and the discrete group of atomic permutations. We summarize commonly used descriptors below, noting that the state of the art in this context is continually evolving.

Atom Centered Symmetry Function (ACSF)

This descriptor expresses the environment of the \(i\text {th}\) atom in terms of a Gaussian radial basis of varying widths and an angular basis at different resolutions. It uses a cosine taper function given by:

$$\begin{aligned} T_{R_c}(r_{ij}) = {\left\{ \begin{array}{ll} \frac{1}{2} \left( \cos \left( \frac{\pi r_{ij}}{R_c}\right) + 1 \right) &{} \text {for}\;\;r_{ij} \le R_c \\ 0 &{} \text {for}\;\; r_{ij} > R_c, \end{array}\right. } \end{aligned}$$

where \(r_{ij}\) is the distance between \(i\text {th}\) and \(j\text {th}\) particles. This ensures that, when multiplied, the quantity goes smoothly to zero as \(r_{ij}\) approaches \(R_c\) from below. Using this taper function, an atom centered descriptor can be written with radial and angular parts as:

$$\begin{aligned} G_i^{r}(\eta , \mu )= & {} \sum _{j=1}^{n} e^{-\eta (r_{ij} - \mu )^2} \cdot T_{R_c}(r_{ij}) \end{aligned}$$
$$\begin{aligned} G_i^{\theta } (\eta ,\zeta ,\lambda )= & {} 2^{1-\zeta } \sum _{j,k \ne i}^{n} \left( 1+ \lambda \cos \theta _{ijk}\right) ^\zeta e^{-\eta \left( r_{ij}^2 + r_{ik}^2 + r_{jk}^2\right) } \nonumber \\{} & {} \cdot T_{R_c}(r_{ij}) \cdot T_{R_c}(r_{ik}) \cdot T_{R_c}(r_{jk}), \end{aligned}$$

where n is the number of neighbors within the cutoff distance \(R_c\) and \(\lambda = \pm 1\). The descriptor vector is generated by sampling the parameters \(\eta \), \(\zeta \), \(\mu \), and \(\lambda \). By design, ACSF produces a description that is invariant under translation and rotation. We note that the number of symmetry functions needed does not depend on n; however, it grows very rapidly with the number of parameter combinations. Typically, 50–100 symmetry functions with various parameter values are used per atom (Behler 2016). Further, the number of functions required grows quadratically with the number of atom types in the model. ACSF can be generalized with additional weight functions to improve resolution and complexity (Gastegger et al. 2017).
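The radial part \(G_i^r\) can be sketched in a few lines of NumPy; the \(\eta \), \(\mu \) grids and cutoff below are illustrative choices, not values from the literature. Since the construction uses only interatomic distances, the resulting descriptor is rotation and translation invariant by design.

```python
import numpy as np

def taper(r, r_c):
    # Cosine taper T_{R_c}(r): equals 1 at r = 0, goes smoothly to 0 at r = R_c
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_acsf(pos, i, etas, mus, r_c):
    """Radial symmetry functions for atom i: one G value per (eta, mu) pair."""
    r_ij = np.linalg.norm(pos - pos[i], axis=1)
    r_ij = np.delete(r_ij, i)                  # exclude the atom itself
    t = taper(r_ij, r_c)
    # G^r(eta, mu) = sum_j exp(-eta (r_ij - mu)^2) * T_{R_c}(r_ij)
    return np.array([[np.sum(np.exp(-eta * (r_ij - mu) ** 2) * t)
                      for mu in mus] for eta in etas]).ravel()

pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0]])
g = radial_acsf(pos, 0, etas=[0.5, 4.0], mus=[0.0, 1.0], r_c=6.0)
```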

Coulomb Matrix (CM)

An alternate descriptor uses the Fourier transform of the Coulomb matrix (Rupp et al. 2012), which is defined as:

$$\begin{aligned} M_{ij} = {\left\{ \begin{array}{ll} \frac{1}{2} Z_i^{2.4} &{} i = j\\ \frac{Z_i Z_j}{\left| \textbf{r}_i - \textbf{r}_j\right| } &{} i \ne j, \end{array}\right. } \end{aligned}$$

where \(Z_i\) is the nuclear charge of the \(i\text {th}\) particle. This descriptor is invariant under the transformations listed above; however, it is computationally expensive unless restricted to a local Coulomb matrix (Rupp et al. 2012). The descriptor can be further generalized by using the Ewald matrix instead of the Coulomb matrix (Faber et al. 2015).
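A minimal sketch of the Coulomb matrix construction, using a toy water-like geometry with \(Z_i\) taken as atomic numbers (the geometry is illustrative only):

```python
import numpy as np

def coulomb_matrix(Z, pos):
    """Coulomb matrix: 0.5*Z_i^2.4 on the diagonal, Z_i*Z_j/r_ij off-diagonal."""
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(pos[i] - pos[j])
    return M

# Toy geometry: O at the origin, two H atoms (distances in Angstrom)
Z = np.array([8, 1, 1])
pos = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
M = coulomb_matrix(Z, pos)
```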

Bispectral Coefficients (BC)

In this descriptor, the atomic environment is represented as a local density that is expressed in terms of spherical harmonics on the 3-sphere (the unit sphere in four dimensions). The density is written as a superposition of delta function densities using the taper function from Eq. (9) as:

$$\begin{aligned} \rho _i(\textbf{r}) = \delta (\textbf{r}) + \sum _{r_{ij} < R_c} T_{R_c}(r_{ij}) \omega _j \delta (\textbf{r} - \textbf{r}_{ij}), \end{aligned}$$

where the dimensionless parameter \(\omega _j\) represents the atom type or other internal properties of the \(j\text {th}\) atom. The angular part of this density can be expanded in a spherical harmonics basis and the radial part in terms of a linear basis. The radial part is transformed into an additional angle, converting the basis to hyper-spherical harmonics on the 3-sphere. Let \(U_{m,m^\prime }^j\) be these hyper-spherical harmonics; then one can express the local density as:

$$\begin{aligned} \rho = \sum _{j=0}^{\infty } \sum _{m,m^\prime = -j}^{j} c_{m, m^\prime }^{j} U_{m, m^\prime }^{j}, \end{aligned}$$

where \(c_{m, m^\prime }^{j}\) are the coefficients of expansion. The \(c_{m, m^\prime }^{j}\) are computed by evaluating the inner product \(\langle U_{m, m^\prime }^{j} | \rho \rangle \). The BC are then computed using the mixing rules as:

$$\begin{aligned} B_{j_1, j_2, j}= & {} \sum _{m_1, m_1^\prime = -j_1}^{j_1} \sum _{m_2, m_2^\prime = -j_2}^{j_2} \sum _{m, m^\prime = -j}^{j} c_{m,m^\prime }^{j} \nonumber \\{} & {} \times C_{j_1m_1j_2m_2}^{jm} C_{j_1m_1^\prime j_2m_2^\prime }^{jm^\prime } c_{m_1,m_1^\prime }^{j_1} c_{m_2,m_2^\prime }^{j_2}, \end{aligned}$$

where \(C_{j_1m_1j_2m_2}^{jm}\) are the Clebsch–Gordan coefficients of mixing. These descriptors also satisfy the required invariance properties. One key advantage of BC over ACSF is that BCs can be systematically expanded or truncated based on the accuracy versus complexity trade-offs of the model (Thompson et al. 2015).

Smooth Overlap of Atomic Positions (SOAP)

In the SOAP descriptor, the local density is generated by smoothing the delta functions into Gaussians (Albert et al. 2013):

$$\begin{aligned} \rho _\text {SOAP} (\textbf{r}) = \sum _{j=1}^{N_i} e^{-\alpha \left( \textbf{r} - \mathbf {r_j}\right) ^2}. \nonumber \end{aligned}$$

This density can be expanded in terms of radial and angular bases as

$$\begin{aligned} \rho _\text {SOAP} (\textbf{r}) = \sum _{j=1}^{N_i} \sum _{n,l,m} c^j_{n,l,m} g_n(r) Y_{l,m} (\theta , \phi ), \nonumber \end{aligned}$$

where \(Y_{l,m} (\theta , \phi )\) are the spherical harmonics basis functions, and \(g_n(r)\) is a radial basis set chosen based on the specific model. The descriptor for atom i is then written as an appropriately normalized power spectrum

$$\begin{aligned} p_{n,k,l}(i) = \sum _m c^i_{n,l,m} \left( c^i_{k,l,m}\right) ^*. \nonumber \end{aligned}$$

4.2.4 Energy Functionals

The input to the ML model is a descriptor computed using one of the schemes described above; the model itself serves as the energy functional, mapping the descriptor to an energy value. We describe common forms of the energy functional here.

Feed Forward Neural Network Based Energy Functional

One of the common ML energy functionals is based on feed forward neural networks (FFNN) (see e.g. Blank et al. (1995), Gassner et al. (1998), Lorenz et al. (2004), Manzhos and Carrington (2006), Behler et al. (2007), Geiger and Dellago (2013), Behler (2014), Behler (2015)). These networks typically take a descriptor as input and produce an energy value as output. One can write the energy as:

$$\begin{aligned} E_i= & {} g_m\circ g_{m-1}\circ \cdots \circ g_2 \circ g_1\left( \textbf{G}_i\right) \nonumber \\ \textbf{h}_{k}= & {} g_k(\textbf{h}_{k-1}) = f_k(\textbf{b}_k + \textbf{W}_{k-1,k} \cdot \textbf{h}_{k-1}), \quad \textbf{h}_0 = \textbf{G}_i, \nonumber \end{aligned}$$

where the neural network has m layers; \(\textbf{W}_{k-1,k}\) and \(\textbf{b}_k\) are the weights and bias values associated with the \(k\text {th}\) layer, respectively; and \(f_k\) is the nonlinear activation function associated with the \(k\text {th}\) layer. Forces are computed as negative gradients of the energy functional; thus the activation functions \(f_k\) must be differentiable.
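Such a functional can be sketched as a plain NumPy forward pass; the layer sizes and the tanh activation below are illustrative choices, and the random weights stand in for trained parameters:

```python
import numpy as np

def ffnn_energy(G, weights, biases, w_out, b_out):
    """Feed-forward energy functional: descriptor G -> hidden layers -> scalar E_i."""
    h = G
    for W, b in zip(weights, biases):
        h = np.tanh(b + W @ h)          # f_k = tanh, a differentiable choice
    return b_out + w_out @ h            # linear output layer

rng = np.random.default_rng(0)
sizes = [4, 8, 8]                       # descriptor dim 4, two hidden layers
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
w_out, b_out = rng.normal(size=8), 0.0
E = ffnn_energy(rng.normal(size=4), weights, biases, w_out, b_out)
```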

Gaussian Approximation Potential (GAP)

This approximation establishes a mapping between the environment of an atom and the corresponding energy using a Gaussian kernel function:

$$\begin{aligned} E_i= & {} \sum _n^{N_i} \alpha _n G(\textbf{b}, \textbf{b}_n) \nonumber \\= & {} \sum _{n}^{N_i} \alpha _n e^{- \frac{1}{2} \sum _{l}^{L} \left( \frac{b_l-b_{n,l}}{\theta _l}\right) ^2}, \nonumber \end{aligned}$$

where L is the number of retained bispectrum components and \(\textbf{b}\) are the BCs. Determining the coefficients \(\alpha _n\) is computationally expensive, since the cost grows as \(N^3\) (Li et al. 2015).
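Evaluating this functional is straightforward once the coefficients are known; the sketch below uses synthetic reference descriptors, weights \(\alpha _n\), and length scales \(\theta _l\) as stand-ins for trained quantities:

```python
import numpy as np

def gap_energy(b, b_ref, alpha, theta):
    """GAP-style energy: weighted sum of Gaussian kernels between the
    query descriptor b and stored reference descriptors b_ref."""
    diff = (b[None, :] - b_ref) / theta        # per-component scaling by theta_l
    return float(alpha @ np.exp(-0.5 * np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(4)
b_ref = rng.normal(size=(10, 3))               # 10 reference environments, L = 3
alpha = rng.normal(size=10)
theta = np.ones(3)
E = gap_energy(b_ref[0], b_ref, alpha, theta)
```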

Spectral Neighbour Analysis Potential (SNAP)

SNAP simplifies the computation of \(\alpha _n\) in GAP by replacing Gaussian process regression with linear regression. The energy functional is now given by (Thompson et al. 2015)

$$\begin{aligned} E_i= & {} \beta _0^{\omega _i} + \sum _{k=1}^{M} \beta _k^{\omega _i} \cdot B^i_k, \nonumber \end{aligned}$$

where M is the number of bispectrum coefficients used in the approximation. The most important advantage of SNAP over GAP is the computational simplification due to linear regression.
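Because the energy is linear in the \(\beta \) coefficients, training reduces to ordinary least squares. A minimal sketch with synthetic bispectrum features standing in for real SNAP data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_atoms, M = 200, 5
B = rng.normal(size=(n_atoms, M))            # synthetic bispectrum components B^i_k
beta_true = np.array([0.3, -1.2, 0.8, 0.05, 2.0])
beta0_true = -4.7
E = beta0_true + B @ beta_true               # reference per-atom energies

# Linear regression: augment the features with a column of ones for beta_0
A = np.hstack([np.ones((n_atoms, 1)), B])
coef, *_ = np.linalg.lstsq(A, E, rcond=None)
beta0_fit, beta_fit = coef[0], coef[1:]
```

In practice the reference energies come from quantum calculations and carry noise, but the fitting step is the same least-squares solve.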

4.2.5 Accelerating Time-stepping Using Deep Networks

We have previously described the use of ML potentials to increase the accuracy and scope of modeled interactions. An important bottleneck in reactive atomistic simulations is the need for small timesteps (sub-femtoseconds in typical applications), whose sequential nature limits the temporal scope of simulations. There have been some recent efforts aimed at ML techniques for long-timestep integration. Conventional time-stepping schemes use the current atomic state (and in some cases, the few states leading up to the current state), combined with the force (derived from energy) to advance system state to the next step. The goal of ML-based time integrators is to use a sequence of past atomic states, along with the energy, to predict system state over longer timesteps (e.g., three orders of magnitude longer than conventional integrators).

The use of multiple past states in predicting the next state motivates the use of Recurrent Neural Networks (RNNs) for this task. Recall that RNNs use internal states to process time-series data. To address the ‘vanishing gradient’ problem discussed in Sect. 3.2.2, RNN variants such as Long Short-Term Memory (LSTM) networks are used for this purpose. There are three key issues in the use of LSTMs in long time-step integrators: (i) specification of input states for the deep network; (ii) the network architecture; and (iii) the training process. The input to an LSTM-based time integrator is typically limited to a finite region around the atom for which the trajectory is predicted. Larger neighborhoods require a significantly larger number of degrees of freedom in the network. While in theory this would improve accuracy, the need for large amounts of training data and the associated training error typically negate the improvement. The network architecture is determined by the complexity of the energy functional and specific domain properties. In current practice, even simple energy terms (Lennard-Jones interactions) require large networks (\(\sim \)100K parameters) for ensembles of as few as 16 particles. The need for training data and the associated training cost are significant. However, such integrators have been shown to be capable of timesteps three orders of magnitude longer than conventional Verlet integrators (Kadupitiya et al. 2020).

In current proposals, which are in relative infancy, the training procedures for the LSTMs use simulation data generated from the specific potential, with well specified boundary conditions (e.g., periodic boundaries). Even in these simple systems, a large amount of training data is needed to accurately predict trajectories. It is observed that for more complex potentials (with multiple terms) and diverse atomic contexts, the need for training data increases substantially.

We note that the use of deep networks for particle dynamics is in relative infancy. There has been significant interest in the use of deep networks for time-integrating ODEs since the recent work of Chen et al. (2018). Recent advances include symplectic ODE-Nets for learning the dynamics of Hamiltonian systems (Zhong et al. 2019), and associated deep learning architectures (Rusch and Mishra 2021).

5 Analyzing Results from Atomistic Simulations

A key use of machine learning techniques is in the analysis of large amounts of data generated from time-dependent simulations. This data generally takes the form of snapshots of trajectories—with each snapshot corresponding to system state comprised of degrees of freedom (position, momentum, etc.) associated with particles, and in the case of reactive simulations, bond information. Complex simulations scale to millions of particles and beyond, over billions of time-steps—leading to datasets that are in excess of terabytes. A number of techniques are deployed to deal with this data volume, including subsampling for reducing storage, indexing for fast access, and compression. While these techniques facilitate storage and access, the focus of this section is primarily on analysis techniques that abstract and extract useful information from trajectories.

We note that ML techniques for analyses of time-dependent simulation is an active area of research. This section summarizes the rich state of the art in the area—for a more detailed recent summary, we refer readers to excellent reviews by Glielmo et al. (2021), Sidky et al. (2020), and Noé et al. (2020).

5.1 Representation Techniques

We consider a general class of simulations that result in a set of T snapshots of data—each snapshot \(S_i, i = 0 \ldots T-1\), stored as a D dimensional vector, in a matrix M of dimension \(T\times D\). The first challenge we face is to suitably encode system state at time \(t_i\) into a corresponding vector \(S_i\). This poses challenges w.r.t. different data structures and their consistent encoding. We consider two common data structures and associated representation techniques:

Vector Fields

The most common data associated with particles is in vector fields. This includes position data, momentum, and other particle properties. The first step in representing these vector fields is to account for underlying invariants. For instance, a particle aggregate (e.g., a molecule) may be invariant under rotation and translation. To account for this invariance, these aggregates must be represented in a canonical framework so that two aggregates in different orientations can be viewed as identical under affine transformations. The most common technique relies on aligning particle aggregates with known reference aggregates (e.g., reference geometries of molecules) and storing them as deviations from these reference molecules under affine transformations. Such transformations can easily be computed through local formulations solved using Shapelets, or global formulations such as the Orthogonal Procrustes Problem, which has an optimal solution due to Kabsch (1976). Once suitable alignments have been computed, the particle aggregates are stored as vectors of deviations from the reference aggregates. When reference aggregates are unavailable, canonical representations can be derived through suitable internal coordinates, for example, internal distances between reference particles (e.g., distances between pairs of marked atoms in a molecule). This vector of distances provides a canonical representation.
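The alignment step can be sketched with the Kabsch construction: the optimal rotation comes from the SVD of the cross-covariance between the centered aggregate and the centered reference (the four-point geometry below is purely illustrative):

```python
import numpy as np

def kabsch(P, Q):
    """Optimal rotation R aligning point set P onto Q (rows are points)."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    H = Pc.T @ Qc                              # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T

ref = np.array([[0.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
theta = 0.9
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
moved = ref @ Rz.T + np.array([3.0, -2.0, 1.0])   # rotate + translate the aggregate
R = kabsch(moved, ref)
aligned = (moved - moved.mean(axis=0)) @ R.T + ref.mean(axis=0)
```

After alignment, `aligned - ref` gives the deviation vector actually stored in the representation.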

Network Models

Reactive simulations often store the bond structure of molecules within snapshots \(S_i\). These structures are invariant to within an isomorphism; i.e., any relabeling of atoms in the molecule should be treated identically. Canonical labelings are challenging because there exist an exponential number of permutations, and corresponding labelings. Deriving canonical labelings to represent graphs corresponding to molecular structures as vectors requires solving the graph isomorphism problem. For small molecules, this can be done by enumeration; for larger molecules, however, it is computationally expensive. One solution relies on a diffusion kernel to derive canonical labelings: the Laplacian of the given graph structure is used to simulate a diffusion process on the graph, and the stationary probabilities associated with this process represent the graph in a canonical vector form. One may also view this vector in terms of the spectrum of the graph. Other approaches to canonical labelings rely on graph neural networks (GNNs). These networks take a graph as input and generate canonical labels as output; the training procedure associates identical labelings with isomorphic graphs.

5.2 Dimensionality Reduction and Clustering

Using suitable representation techniques, state \(S_i\) at timestep i is represented as a vector \(v_i\) in dimension \(D_n\). We use subscript n to denote the native dimension of the representation. The next step in typical analyses is to reduce the native dimension \(D_n\) to a lower (reduced) dimension \(D_r\). This facilitates downstream analyses by denoising data (filtering dimensions that are less important), while simultaneously reducing computational cost. Dimensionality reduction is accomplished through the linear (PCA, SVD, NMF, AA) or non-linear techniques (Kernel PCA, Autoencoders) described in Sect. 3.1.

5.3 Dynamical Models and Analysis

Molecular systems evolve through a dynamical operator acting on successive system states. This motivates the natural observation that the data-points associated with temporal snapshots are not independent; rather, they have temporal correlations that reveal interesting aspects of the underlying systems. Identification of temporally coherent subdomains is an important analysis task. The starting point for such analysis is a time-lagged covariance matrix, computed from the correlation (normalized dot product) of a state descriptor at time t with that at time \(t + \delta t\), for a suitably selected lag \(\delta t\). A commonly used method, Time Lagged Independent Component Analysis (TL-ICA), uses this time-lagged covariance matrix, along with the covariance matrix of the current state, to define a generalized eigenvalue problem. The eigenvectors derived from this generalized eigenvalue problem correspond to the slow modes in the underlying dynamics of the system. We refer to the work of Naritomi and Fuchigami (2013) for a detailed description of this method and its use in analyzing atomic trajectories. These approaches are generalized into a variational framework that aims to characterize the dominant eigenpairs of the propagation operator corresponding to the dynamical system. This is achieved by first computing a discrete approximation to the propagation operator, which uses abstractions of the self and time-lagged covariance matrices to compute transition probabilities from each state at time t to a state at time \(t + \delta t\). The eigenvectors of this operator correspond to the dominant modes in the system. This general variational model is equivalent to TL-ICA if data points are represented in a linear basis. However, the variational model admits a more general basis through the use of higher-order kernels, and the underlying optimization problem is solved using conventional gradient-descent type methods.
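A minimal numerical sketch of this construction, on a synthetic two-dimensional trajectory in which a slow and a fast autoregressive process are linearly mixed; the generalized eigenproblem \(C_\tau v = \lambda C_0 v\) recovers the slow direction as the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(2)
T, tau = 20000, 10
slow, fast = np.empty(T), np.empty(T)
slow[0] = fast[0] = 0.0
for t in range(1, T):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()   # slowly decorrelating mode
    fast[t] = 0.10 * fast[t - 1] + rng.normal()   # fast mode
X = np.column_stack([slow + fast, slow - fast])   # mixed observables
X -= X.mean(axis=0)

C0 = X.T @ X / len(X)                             # instantaneous covariance
Ctau = X[:-tau].T @ X[tau:] / (len(X) - tau)      # time-lagged covariance
Ctau = 0.5 * (Ctau + Ctau.T)                      # symmetrize the estimate

# Solve C_tau v = lambda C0 v via whitening: C0 = L L^T
L = np.linalg.cholesky(C0)
Linv = np.linalg.inv(L)
evals, Y = np.linalg.eigh(Linv @ Ctau @ Linv.T)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], Linv.T @ Y[:, order]  # evecs[:, 0] = slowest mode
```

The leading eigenvalue approximates the lag-\(\tau \) autocorrelation of the slow process, and the leading eigenvector points along the direction that demixes it.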

5.4 Reaction Rates and Chemical Properties

Reactive simulations often produce diverse chemical constituents. Some of these compounds are transient; however, they still require careful analysis and classification. In the simple case of a two-component silica–water system, the molecular components observed at the end of the simulations include Si–O, Si–O\(_2\), OH, H\(_2\), etc. (Fogarty et al. 2010). Identifying all the molecular components and the corresponding chemical reactions is a difficult problem.

To enumerate all the molecular components, one can treat a simulation time step as a colored graph, with atom type as the color of a node and an edge between two atoms whenever the bond order between the pair exceeds a cutoff value. Enumeration then requires identification of all the distinct classes of isomorphic subgraphs of atoms. Each such class is either a molecule or a molecular fragment present in a single time frame. A hash table of such fragments is then constructed to record the frequency of occurrence of each reactant or product in a single time frame.
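The enumeration step can be sketched as follows; here a sorted formula string serves as the hash key, which distinguishes fragments only by composition, a simplification of the full isomorphism-class test described above (the frame data is a toy example):

```python
from collections import Counter

def fragments(types, bonds, bond_orders, cutoff=0.3):
    """Count molecular fragments in one frame: edges exist where the
    bond order exceeds the cutoff; connected components are fragments."""
    n = len(types)
    adj = [[] for _ in range(n)]
    for (i, j), bo in zip(bonds, bond_orders):
        if bo > cutoff:
            adj[i].append(j)
            adj[j].append(i)
    seen, counts = [False] * n, Counter()
    for s in range(n):
        if seen[s]:
            continue
        stack, comp = [s], []
        seen[s] = True
        while stack:                       # DFS over one connected component
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        counts["".join(sorted(types[a] for a in comp))] += 1
    return counts

# Toy frame: one water molecule, one OH fragment, one free H atom
types = ["O", "H", "H", "O", "H", "H"]
bonds = [(0, 1), (0, 2), (3, 4), (3, 5)]
bond_orders = [0.9, 0.9, 0.8, 0.1]         # last bond falls below the cutoff
counts = fragments(types, bonds, bond_orders)
```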

For the most common molecular fragments, it is often possible to identify reactions of the kind \(\text {A} + \text {B} \rightleftharpoons \text {AB}\). Such reactions can be modeled using first order differential equations, which can be solved as:

$$\begin{aligned} N_\text {AB}(t) = \frac{K_f\cdot N}{K_f+K_b}\left( 1 - \exp \left[ {- (K_f + K_b)(t-t_0)}\right] \right) , \end{aligned}$$

where N is the total number of molecules of types A and B, \(N_\text {AB}\) is the number of molecules of AB, and \(K_f\), \(K_b\) are the forward and backward reaction rates, respectively (Saunders et al. 2022). Within simulations, the computed number of each molecular type can be fitted to Eq. (16) as a function of time, yielding the reaction rates and equilibrium concentrations of the various chemical components.
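The fitting step can be sketched with `scipy.optimize.curve_fit`; the synthetic counts below stand in for molecule counts extracted from a trajectory, and the rate values are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

N, t0 = 1000.0, 0.0                      # total A+B molecules, start time

def n_ab(t, kf, kb):
    # Eq. (16): relaxation of N_AB(t) toward its equilibrium value
    return kf * N / (kf + kb) * (1.0 - np.exp(-(kf + kb) * (t - t0)))

t = np.linspace(0.0, 50.0, 200)
rng = np.random.default_rng(3)
data = n_ab(t, 0.08, 0.02) + rng.normal(scale=2.0, size=t.size)  # noisy counts

(kf_fit, kb_fit), _ = curve_fit(n_ab, t, data, p0=[0.05, 0.05])
```

The fitted \(K_f\), \(K_b\) then give the equilibrium concentration directly as \(K_f N / (K_f + K_b)\).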

6 Concluding Remarks

In this chapter, we presented an overview of common ML techniques and formulations. We discussed how computationally expensive components of reactive atomistic simulations are formulated in ML frameworks, considerations for training ML models, tradeoffs of accuracy, need for training data, transferability, and computational cost. While we primarily focused on reactive atomistic simulations, the models and methods discussed apply more generally to discrete element models.

The area of ML techniques for reactive simulations is extremely active and fluid. There is tremendous potential for significant new developments in the area, enabling simulation scales and scope far beyond those currently accessible. In doing so, these techniques hold the promise of new applications and domains.