Mathematical nuances of Gaussian process-driven autonomous experimentation

The fields of machine learning (ML) and artificial intelligence (AI) have transformed almost every aspect of science and engineering. The excitement for AI/ML methods is in large part due to their perceived novelty, as compared to traditional methods of statistics, computation, and applied mathematics. But clearly, all methods in ML have their foundations in mathematical theories, such as function approximation, uncertainty quantification, and function optimization. Autonomous experimentation is no exception; it is often formulated as a chain of off-the-shelf tools, organized in a closed loop, without emphasis on the intricacies of each algorithm involved. The uncomfortable truth is that the success of any ML endeavor, and this includes autonomous experimentation, strongly depends on the sophistication of the underlying mathematical methods and software that have to allow for enough flexibility to consider functions that are in agreement with particular physical theories. We have observed that standard off-the-shelf tools, used by many in the applied ML community, often hide the underlying complexities and therefore perform poorly. In this paper, we want to give a perspective on the intricate connections between mathematics and ML, with a focus on Gaussian process-driven autonomous experimentation. Although the Gaussian process is a powerful mathematical concept, it has to be implemented and customized correctly for optimal performance. We present several simple toy problems to explore these nuances and highlight the importance of mathematical and statistical rigor in autonomous experimentation and ML. One key takeaway is that ML is not, as many had hoped, a set of agnostic plug-and-play solvers for everyday scientific problems, but instead needs expertise and mastery to be applied successfully.


Introduction
Machine learning (ML) and artificial intelligence (AI) have transformed how problems involving model creation and decision-making from data are approached in all areas of science and engineering. Examples are wide ranging and include weather forecasts, 1,2 protein folding, 3,4 natural language processing, 5 image recognition, 6,7 and autonomous experimentation. 8-12 Some successes, for instance, IBM's Watson and AlphaGo, reached international fame. When Watson famously won the popular game Jeopardy in 2011 against two of the best human players, the general opinion was that Watson would soon be able to answer any medical or scientific question better than any human; the reality turned out to be very different, and mathematics can explain why. Contrary to what was perceived outside IBM's offices, Watson's architecture was specifically customized to win a game such as Jeopardy, with minimal generalizations in place. A current example is large language models, 13 such as GPT-3 14 or Turing-NLG, 15 which are tailored for natural language processing and can deliver amazing results for some tasks, but are also easily tricked into wrong and overconfident answers. 16 Tuned for a specific task, they do very well. At the other end of the spectrum of generalizability, largely agnostic ML software tools are being developed and distributed (Scikit-learn, PyTorch, TensorFlow) in order to give more people access to the power of ML. Although this generalization is laudable and necessary, it can also lead to user errors and dissatisfaction with the results. There seems to be a natural tradeoff between the power of AI and ML and its generalization potential. Many of the successes of ML can be attributed to the wide availability of off-the-shelf software tools. However, this availability and the user-friendliness of those tools can lead to a one-concept-fits-all attitude and to poor performance of the algorithms in nonstandard scenarios.
To understand this discrepancy, we have to dive a bit deeper into the mathematics of ML.
ML can broadly be divided into supervised and unsupervised methods. Supervised learning uses "labeled" data $D = \{x_i, y_i\}$ to find some function $f(x): \mathbb{R}^n \to \mathbb{R}^m$ that can approximate unobserved pairs $(x_i, y_i)$. $\mathbb{R}^n$ is often called the input space or parameter space and is here denoted by $\mathcal{X}$; $\mathbb{R}^m$ is the output space. Unsupervised learning does not use labels and attempts to find structural (geometric or topological) information about a data set $D = \{x_i\}$. Throughout this article, we focus on autonomous experimentation, which is often classified as "active learning" and is part of supervised ML; even so, many of our takeaways are valid for unsupervised learning as well. Supervised learning can be characterized by two main building blocks: the definition of a function space $\mathcal{F}$ (sometimes called the hypothesis space) containing all conceivable model functions, and the selection of an optimality condition, most often some misfit, that is maximized (or minimized) to find a candidate solution, a particular element of the function space. This second step is called training in ML. Often, neither step gets the attention it deserves. Common mistakes are to use ML tools that span a function space that does not contain functions with the desired behavior based on physics or practitioner intuition, or to combine solutions in an ensemble even though their function spaces are disjoint. A quick note on the diversity of ML: neural networks (NNs) have been so prominent in the literature and media that one could equate them with ML; however, kernels, 17 especially in combination with Bayesian methods, provide a very powerful and flexible framework for learning which, in small data regimes, outperforms NNs. Even so, it can be shown mathematically that all the different methods are just instances of a more fundamental framework. 18,19
As an example, we want to have a look at kernel ridge regression (KRR), 20-22 where the underlying function space is a so-called reproducing kernel Hilbert space (RKHS). 23 For a more accessible notion of the RKHS, we can understand it as a set of functions that are all defined by a weighted linear combination of kernels. In simple terms, kernels are functions that take two points of the parameter space as input and return a measure of similarity between the function values at those two points. Stationary kernels only depend on the distance between the two input points; most often the similarity of the function values decreases as the distance increases. Nonstationary kernels have no such restrictions and can encode complicated rules about how similarity between inputs behaves across the domain. Alternatively, but equivalently, we can view kernels as basis functions defined on the input space, centered at given locations $x_i$. RKHSs have gained popularity in recent years due to their importance for many machine learning methods, such as Gaussian processes (GPs), 24 support vector machines (SVMs), 25 and kernel PCA (principal component analysis). 26 It is not uncommon to see practitioners using a GP posterior mean and the surrogate computed by KRR to create ensemble models, even though, for the same kernel and the quadratic loss function, these models coincide; their agreement then instills unsupported confidence in the ensemble. Ensembles of ML solutions carry the risk of bias if the underlying function spaces are not compatible (e.g., disjoint). The most commonly used optimality condition for KRR is to place a measure on the difference between predicted and given values $\hat{y}_i \in D$, which is the test data set.
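The statement that KRR and the GP posterior mean coincide for the same kernel can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a squared exponential kernel, a zero prior mean, and a ridge penalty equal to the GP noise variance; all names and numbers are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3, signal_var=1.0):
    """Squared exponential kernel between 1D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2.0 * np.pi * x_train) + 0.1 * rng.standard_normal(8)
x_test = np.linspace(0.0, 1.0, 50)
noise_var = 0.01  # assumed value; plays the role of the ridge penalty in KRR

K = rbf_kernel(x_train, x_train)
K_star = rbf_kernel(x_test, x_train)

# KRR: weights of the kernel expansion that minimize the penalized squared misfit.
alpha = np.linalg.solve(K + noise_var * np.eye(8), y_train)
krr_prediction = K_star @ alpha

# GP: condition a zero-mean Gaussian prior on the noisy observations (via Cholesky).
chol = np.linalg.cholesky(K + noise_var * np.eye(8))
gp_mean = K_star @ np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))

max_difference = np.max(np.abs(krr_prediction - gp_mean))
```

Because both predictions come from the same linear system, `max_difference` is at the level of floating-point round-off; averaging the two models in an ensemble therefore adds no independent information.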
In the standard literature, not much effort is spent on different kernels for KRR; instead, the radial basis function (RBF) kernel, $k(x_1, x_2) = \sigma_s^2 \exp\left(-\frac{\|x_1 - x_2\|^2}{2 l^2}\right)$, or other kernels of the Matérn class are often used without justification. 17,27 The RBF kernel gives rise to a function space that only contains infinitely differentiable (very smooth) functions, a property often not supported by the data or the underlying physics (see Figure 1). Similarly, other Matérn kernels, and for that matter all other kernels, have well-defined differentiability properties that directly influence the model. $l$ and $\sigma_s^2$ are free parameters, examples of so-called hyperparameters, that can be interpreted as a global length scale and a signal variance, respectively. Their global validity is often unsupported and, when enforced, can lead to poor performance of kernel methods. These challenges directly affect how well ML can control data acquisition without human supervision.
As instruments and detectors accelerate their peak data acquisition rates and the increasing complexity of scientific questions gives rise to larger and higher-dimensional parameter spaces, it becomes infeasible for the human brain to make optimal decisions about experimental design. Autonomous experimentation (AE) describes the ability of an instrument and algorithm to decide what measurements should be performed next, ideally without the need for human interference; it is a multifaceted field that needs expertise in instrument science, robotics, computer science, and ML. The role of ML is twofold. First, as raw data (images, films, spectra) leave the instrument, they have to be analyzed and reduced in dimensionality. Although many classical methods are successfully being used, ML is increasingly considered a viable option. Second, intelligent autonomous decision-making is performed based on all collected and analyzed data. This decision-making is commonly categorized as active learning, which, as mentioned above, is a kind of supervised learning in which the algorithm can choose its own training data. If no offline training data are available, stochastic process-driven uncertainty quantification (UQ) is often used in the form of Gaussian (stochastic) processes (GPs). 28-31 The principle of a GP is simple: given a set of noisy function evaluations, we define a normal probability distribution that explains the data and can be conditioned on observations to yield a probabilistic view of the model function in unobserved regions (see the next section). A well-tuned GP can quantify the uncertainty of the model function across the domain, allowing for intelligent decision-making and therefore autonomous control. However, the control is only effective if the GP is set up correctly, meaning the right function space is considered and the optimization is sufficiently well posed.
AE is a particularly challenging ML problem because, for new experiments, often no offline training data are available and decisions have to be made on the fly as data are collected. It therefore demands particular rigor in the underlying mathematics; black-box applications of off-the-shelf ML tools will often show poor performance, which manifests itself through the overestimation of uncertainties that severely limits the efficiency of the AE. Autonomous experimentation plays an important role in the materials sciences because scientific questions are often posed as finding one or more materials properties as a function of some parameters. Examples are crystal sizes as a function of an annealing history, x-ray scattering mapping, or point-wise evaluation of spectra originating from neutron scattering.
The aforementioned importance of choosing a sensible function space and an appropriate optimality condition is the main reason why common off-the-shelf tools perform suboptimally. This is not always because they lack the ability for sufficiently flexible definitions; rather, they trade flexibility for user-friendliness, which is often preferred.
The main objective of this article is to show what gains can be made in Gaussian process-driven AE if we open up the black box that is ML and spend some time evaluating and customizing the underlying mathematics and statistics. After a short excursion into some theory, the remainder of this article explores, by example, the shortcomings of some off-the-shelf applications of ML tools for AE and how they can be avoided by simple adjustments of the core algorithm.

Some theory of Gaussian processes and related autonomous experimentation
To maximize the value of the tests in the next section, we present some minimal but necessary theory in this section. We start by introducing Gaussian processes and then move to the way they affect AE through an acquisition functional.

Gaussian processes
Gaussian processes (GPs) are a type of stochastic process, sometimes called a random field, in which a set of random variables, often thought of as function evaluations $\{f(x_1), f(x_2), \ldots\}$, are jointly normally distributed. 24,32,33 Imagine having information about a function in the form of probabilistic function evaluations and being interested in the best guess of those function evaluations in other places. A GP is based on the idea of defining a normal distribution over the known and unknown function evaluations. Given data $D = \{x_i, y_i\}$, a prior probability distribution over functions $f(x)$ can be defined as follows:
$$p(\mathbf{f}) = \frac{1}{\sqrt{(2\pi)^{N} |\mathbf{K}|}} \exp\left[-\frac{1}{2} (\mathbf{f} - \boldsymbol{\mu})^{T} \mathbf{K}^{-1} (\mathbf{f} - \boldsymbol{\mu})\right],$$
where $N$ is the number of data points and $\mathbf{K}$ is the covariance matrix of the data, whose entries are calculated by having the kernel $k(x_i, x_j)$ act on the data positions. Kernels can be seen as basis functions that define the model but also compute how covariances behave as we move away from known data points. In this context, we can understand them as a similarity measure that allows us to calculate covariances purely based on data-point locations. $\boldsymbol{\mu}$ is the prior mean vector. We define the likelihood over observations $y(x)$ as
$$p(\mathbf{y} \mid \mathbf{f}) = \frac{1}{\sqrt{(2\pi)^{N} |\mathbf{V}|}} \exp\left[-\frac{1}{2} (\mathbf{y} - \mathbf{f})^{T} \mathbf{V}^{-1} (\mathbf{y} - \mathbf{f})\right],$$
where $\mathbf{V}$ is the matrix of the noise. 11 Our first test ("The role of noise for autonomous experiments" section) will focus on different choices for the matrix $\mathbf{V}$; however, we assume uncorrelated noise that renders $\mathbf{V}$ diagonal. Most literature assumes independently and identically distributed (i.i.d., also homogeneous, homoscedastic, or simply constant) noise, which translates into $\mathbf{V} = \sigma_n^2 \mathbf{I}$. Often, $\sigma_n^2$ is estimated by the experimenter ad hoc, while others absorb it into the kernel definition and optimize its value. As we will see, it is ideal to estimate the noise during the measurement process, especially for the purpose of AE.
Figure 1. The latent function is inherently non-smooth (non-differentiable). This has to be accounted for when defining a kernel for the Gaussian process (GP)-driven autonomous experiment. The ground truth is displayed on the left. In the center, we see the posterior mean of a GP using the squared exponential radial basis function kernel. In this case, the model has to be smooth, which leads to artifacts in the model. On the right-hand side, the posterior mean using an exponential kernel is shown. The exponential kernel is rarely an appropriate choice, except when the latent function is non-differentiable, which happens to be the case for many mapping experiments in the materials sciences. Here, the use of the exponential kernel leads to a more accurate model prediction.
The vast majority of published work about Gaussian processes only utilizes a few well-known standard stationary kernels to compute covariances. 27 The most frequently used kernel is the RBF kernel
$$k(x_i, x_j) = \sigma_s^2 \exp\left[-\frac{\|x_i - x_j\|^2}{2 l^2}\right],$$
where $\sigma_s^2$ is the constant signal variance and $l$ is the isotropic length scale. 27 Both signal variance and length scale are hyperparameters ($\phi$) of the Gaussian process and can be calculated by solving the optimization problem 24
$$\arg\max_{\phi} \left[-\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^{T} (\mathbf{K}(\phi) + \mathbf{V})^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{1}{2} \log |\mathbf{K}(\phi) + \mathbf{V}| - \frac{N}{2} \log(2\pi)\right] \quad \text{(Equation 4)},$$
with $N$ the number of data points, which can be understood as maximizing the probability that the data would be observed, given a prior probability distribution. Hyperparameters can be seen as free parameters that are part of the kernel and control the quality of the model. Kernel functions can freely be defined to account for ever-increasing model complexity, as long as positive semi-definiteness is maintained. In fact, the real power of Gaussian processes is only revealed by utilizing nonstationary kernels. This is the focus of our second test ("Stationary kernels versus nonstationary kernels for Gaussian processes" section). As kernels become more complex, the number of needed hyperparameters rises and requires advanced optimization procedures; the optimization is often ill-posed and solutions are nonunique. This is the focus of the "Training as a constrained and ill-posed function optimization problem" section.
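The log marginal likelihood that this hyperparameter optimization maximizes can be sketched in a few lines. This is an illustrative example under stated assumptions: an RBF kernel, a zero prior mean, a diagonal noise matrix, and a coarse grid search over the length scale as a stand-in for a proper optimizer. Function names and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, signal_var, length_scale):
    """Squared exponential kernel between 1D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def log_marginal_likelihood(x, y, noise_var, signal_var, length_scale, prior_mean=0.0):
    """log p(y | hyperparameters) for a GP with diagonal noise matrix V."""
    n = len(x)
    cov = rbf_kernel(x, x, signal_var, length_scale) + np.diag(noise_var)  # K + V
    chol = np.linalg.cholesky(cov)
    resid = y - prior_mean
    weights = np.linalg.solve(chol.T, np.linalg.solve(chol, resid))
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * resid @ weights - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)
noise_var = np.full(25, 0.05**2)          # assumed known measurement noise
y = np.sin(6.0 * x) + rng.normal(0.0, 0.05, size=25)

# Grid search over the length scale; a real application would use an optimizer.
length_scales = np.linspace(0.02, 2.0, 100)
scores = [log_marginal_likelihood(x, y, noise_var, 1.0, l) for l in length_scales]
best_length_scale = length_scales[int(np.argmax(scores))]
```

The Cholesky factorization is used instead of an explicit inverse because it yields both the linear solve and the log determinant in a numerically stable way.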
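Given the hyperparameters, we calculate the posterior probability density function given by
$$p(f_0 \mid \mathbf{y}) = \mathcal{N}\left(m(x_0), \sigma^2(x_0)\right), \quad m(x_0) = \mu_0 + \mathbf{k}^{T} (\mathbf{K} + \mathbf{V})^{-1} (\mathbf{y} - \boldsymbol{\mu}), \quad \sigma^2(x_0) = k(x_0, x_0) - \mathbf{k}^{T} (\mathbf{K} + \mathbf{V})^{-1} \mathbf{k} \quad \text{(Equation 5)},$$
where $\mathbf{k}$ is the vector of covariances $k(x_0, x_i)$ between $x_0$ and the data positions, and $\mu_0$ is the prior mean at $x_0$. $x_0$ is the point at which the Gaussian posterior should be predicted. $f_0$ is the value of the latent function $f$ at the point $x_0$.
The posterior contains the posterior mean $m(x_0)$ and the posterior variance $\sigma^2(x_0)$.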
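As an illustration of how the posterior mean and variance are obtained in practice, the sketch below conditions a GP on four observations. It assumes an RBF kernel with fixed hyperparameters, a constant prior mean, and a diagonal noise matrix; the function names and data are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, signal_var=1.0, length_scale=0.2):
    """Squared exponential kernel between 1D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_data, y_data, noise_var, x_query, prior_mean=0.0):
    """Posterior mean m(x0) and variance sigma^2(x0) at each query point."""
    cov = rbf_kernel(x_data, x_data) + np.diag(noise_var)   # K + V
    cross = rbf_kernel(x_query, x_data)                     # rows hold k(x0, x_i)
    mean = prior_mean + cross @ np.linalg.solve(cov, y_data - prior_mean)
    # k^T (K + V)^{-1} k, evaluated row by row for every query point
    reduction = np.sum(cross * np.linalg.solve(cov, cross.T).T, axis=1)
    variance = rbf_kernel(x_query, x_query).diagonal() - reduction
    return mean, variance

x_data = np.array([0.1, 0.4, 0.5, 0.9])
y_data = np.sin(2.0 * np.pi * x_data)
noise_var = np.full(4, 1e-4)
x_query = np.linspace(0.0, 1.0, 101)
mean, variance = gp_posterior(x_data, y_data, noise_var, x_query)
```

The variance collapses to roughly the noise level at the measured positions and grows in the gaps between them, which is the property that acquisition functionals exploit.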

Autonomous experimentation
Once the posterior probability density function (Equation 5) has been calculated, it can be used to decide where future measurements should take place. For this, a function of the posterior, a so-called acquisition functional (sometimes simply called acquisition function), $f_a(x): \mathcal{X} \to \mathbb{R}$, is defined to assign every measurement (point in the domain) a value.
Commonly, regions of low uncertainty are assigned a low value and regions of high uncertainty or probability of finding certain desirable characteristics are assigned a high value.
There is an overwhelming number of acquisition functionals in the literature. Often, new acquisition functionals have to be defined to allow for optimal performance of the AE. Certain acquisition functionals will turn an autonomous experiment into Bayesian optimization. 32,34 Having defined the acquisition functional, 29 we solve $\arg\max_{x \in \mathcal{X}} f_a(x)$ subject to constraints $g_i(x)$, where the constraints can be used to restrict the search to regions that are of special interest or to protect the instrument from navigating to regions that are inaccessible or unsafe. An additional modification is to estimate or learn a cost function $c: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and optimize $f_a(x)/c(x, x_0)$, where $x_0$ is the last measurement location. In this case, new measurement suggestions are cost-sensitive. The "Acquisition functionals for optimal measurement suggestions" section focuses on different choices of acquisition functionals for a simple toy example to emphasize its importance for the successful execution of an autonomous experiment.
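A constrained, cost-sensitive suggestion step can be sketched as follows. This is a simplified illustration rather than a definitive implementation: the acquisition functional is the posterior variance of a unit-variance RBF-kernel GP, the optimization is a search over a candidate grid, feasibility is assumed to mean `constraint(x) <= 0`, and the travel cost is a made-up function of the distance to the last measurement.

```python
import numpy as np

def posterior_variance(x_data, noise_var, x_query, length_scale=0.15):
    """Posterior variance of a unit-variance RBF-kernel GP."""
    def kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)
    cov = kernel(x_data, x_data) + np.diag(noise_var)
    cross = kernel(x_query, x_data)
    return 1.0 - np.sum(cross * np.linalg.solve(cov, cross.T).T, axis=1)

def suggest_next(x_data, noise_var, candidates, constraint, cost, x_last):
    """Maximize f_a(x) / c(x, x_last) over candidates with constraint(x) <= 0."""
    acquisition = posterior_variance(x_data, noise_var, candidates)
    score = acquisition / cost(candidates, x_last)
    score[constraint(candidates) > 0.0] = -np.inf   # exclude infeasible points
    return candidates[int(np.argmax(score))]

x_data = np.array([0.05, 0.2, 0.35])
noise_var = np.full(3, 1e-3)
candidates = np.linspace(0.0, 1.0, 201)

def unsafe_region(x):
    return x - 0.8                      # hypothetical: x > 0.8 is inaccessible

def travel_cost(x, x_last):
    return 1.0 + 5.0 * np.abs(x - x_last)   # hypothetical motor-travel cost

x_next = suggest_next(x_data, noise_var, candidates, unsafe_region, travel_cost, 0.35)
x_next_free = suggest_next(x_data, noise_var, candidates,
                           lambda x: -np.ones_like(x),
                           lambda x, x_last: np.ones_like(x), 0.35)
```

Without constraint and cost, the suggestion moves to the far end of the domain where the variance is largest; with them, the suggestion stays inside the accessible region and closer to the last measurement.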

Case studies
In this section, we present four case studies that were carefully chosen to highlight specific characteristics of GPs and the associated autonomous experiments. Although many data sets do not stem from the materials sciences, the key takeaways always apply to data with similar properties and are agnostic to the field in which the data originated.

The role of noise for autonomous experiments
Reading through most of the available Gaussian process literature, one would be forgiven for assuming that i.i.d. noise is fundamental to the GP framework. On the contrary, the GP framework, as one could expect from a Bayesian method, is able to handle non-i.i.d. noise without any adaptations. Non-i.i.d. noise plays an important role in x-ray scattering and neutron scattering applications, among others. In ML, it is common to ignore noise entirely, and advanced noise models are virtually unheard of. In this example, we want to show that the consideration and inclusion of non-i.i.d. noise into the model is indispensable for optimal autonomous experimentation. 11 This is because the sequence of measurements depends on the location of the maxima of the acquisition functional, which commonly depends on the posterior variance. However, the key takeaway here goes beyond AEs and has important implications for all of ML: accurate estimation and inclusion of noise are important for accurate model predictions. Before looking at the result, we would expect that the data acquisition is biased toward regions with high noise in order to reduce total uncertainty. Our experiment was set up using an anisotropic and stationary Matérn kernel with $\nu = 3/2$. The data set is taken from IR spectroscopy 29,35 and the material is organic matter. In this test, we are approximating a scalar intensity on $[0, 1] \times [0, 1]$.
Consider the ground-truth data and the noise model in Figure 2. The noise model is usually not known before the experiment, but we define it ad hoc to test how treating noise differently in the GP affects autonomous decision-making. For this example, we are performing standard maximum variance steering, where points are placed where the posterior variance is at its maximum. One would expect a higher point density in places of high measurement noise, but only non-i.i.d. noise, estimated and communicated for each new measurement, delivers this behavior. While treating the noise as one hyperparameter, which can be found by solving Equation 4, delivers a satisfactory model, the overall approximation error is larger than for non-i.i.d. noise. Estimating one noise value ad hoc can render the model ineffective because it influences and misleads the hyperparameter optimization, which negatively affects the accuracy of uncertainty estimation.
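The effect of the noise model on maximum variance steering can be reproduced with a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the experiment of Figure 2: it uses a unit-variance RBF kernel with fixed hyperparameters and a made-up noise profile that is large on the right half of the domain.

```python
import numpy as np

def posterior_variance(x_data, noise_var, x_query, length_scale=0.1):
    """Posterior variance of a unit-variance RBF-kernel GP with diagonal V."""
    def kernel(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)
    cov = kernel(x_data, x_data) + np.diag(noise_var)
    cross = kernel(x_query, x_data)
    return 1.0 - np.sum(cross * np.linalg.solve(cov, cross.T).T, axis=1)

x_data = np.linspace(0.0, 1.0, 21)
measured_noise = np.where(x_data < 0.5, 1e-4, 0.25)   # non-i.i.d. noise (assumed)
constant_noise = np.full(21, measured_noise.mean())   # i.i.d. simplification
x_query = np.linspace(0.0, 1.0, 401)

var_measured = posterior_variance(x_data, measured_noise, x_query)
var_constant = posterior_variance(x_data, constant_noise, x_query)

# Maximum variance steering: the next point is placed at the variance maximum.
next_measured = x_query[np.argmax(var_measured)]

left, right = x_query < 0.5, x_query > 0.5
ratio_measured = var_measured[right].mean() / var_measured[left].mean()
ratio_constant = var_constant[right].mean() / var_constant[left].mean()
```

With per-point noise, the posterior variance is markedly higher in the noisy half, so new measurements are steered there; with a single constant noise value, the variance is symmetric and the steering has no reason to prefer the noisy region.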

Stationary kernels versus nonstationary kernels for Gaussian processes
Even more uncommon than non-i.i.d. noise in GPs is the use of nonstationary kernels. From a review by Pilario et al., 27 we can deduce that around 90% of studies employing kernel methods use the RBF kernel. The number is much higher when considering a broader range of stationary kernels. Stationarity in the kernel means we are assuming that covariances between data points only depend on their distance, not on the points' respective locations (i.e., $k(x_1, x_2) = k(|x_1 - x_2|)$), a high standard to be met for most modern data sets; this is especially true for the materials sciences, where changes of some parameters will have much more impact on the properties of a material in some regions of the parameter space than in other regions. Imagine inorganic crystal growth as a function of an annealing history; clearly, in some temperature regions, the crystal size will react much more strongly to temperature changes than in others. A prime example to see, experience, and understand the importance of nonstationarity in the kernel definition is to use the topography of the United States as a test data set. Although this data set obviously did not originate from an AE or in the materials sciences, it has all the characteristics we need to illustrate the importance of nonstationarity. Clearly, covariances should behave differently in the mountainous regions of the Rocky Mountains compared to the Great Plains.
Although for the accuracy of the function approximation the difference between nonstationarity and stationarity could still be bearable, for autonomous experimentation, where the accurate estimation of uncertainty is a deciding factor, nonstationarity plays a vital role in the experiment design, as can be seen in Figure 3.
For this example, the stationary kernel (Equation 8) is the axially anisotropic Matérn kernel of first-order differentiability, with diagonal $M$, length scale $l$, and signal variance $\sigma_s^2$. The nonstationary kernel (Equation 9) 36 is built around the product $f(x_1) f(x_2)$, where $f$ is a sum of radial basis functions; $\alpha_i$ are the heights of these radial basis functions, with $w$ being the parameter controlling the width. The term $f(x_1) f(x_2)$ can be interpreted as a flexible signal variance, which impacts how uncertainties are estimated across the domain. This leads to uncertainties that are much more reflective of the true error compared with the use of the stationary kernel (see Figure 3).
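To make the two kernel designs concrete, the sketch below implements one plausible reading of them. The explicit forms are assumptions, since the exact expressions are not reproduced here: the stationary kernel is taken as $k_{stat}(x_1, x_2) = \sigma_s^2 (1 + \sqrt{3} d / l) \exp(-\sqrt{3} d / l)$ with $d = \sqrt{(x_1 - x_2)^T M (x_1 - x_2)}$, the nonstationary kernel as $f(x_1) f(x_2) k_{stat}(x_1, x_2)$ with unit base signal variance, and $f(x) = \sum_i \alpha_i \exp(-\|x - c_i\|^2 / w)$ with hypothetical centers $c_i$.

```python
import numpy as np

def matern32(x1, x2, signal_var, length_scale, m_diag):
    """Axially anisotropic Matern kernel (nu = 3/2) with diagonal metric M."""
    diff = x1[:, None, :] - x2[None, :, :]
    d = np.sqrt(np.sum(m_diag * diff**2, axis=-1))
    r = np.sqrt(3.0) * d / length_scale
    return signal_var * (1.0 + r) * np.exp(-r)

def rbf_sum(x, centers, heights, width):
    """f(x) = sum_i heights_i * exp(-||x - center_i||^2 / width) (assumed form)."""
    d2 = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / width) @ heights

def nonstationary_kernel(x1, x2, centers, heights, width, length_scale, m_diag):
    """f(x1) * f(x2) * k_stat(x1, x2); f acts as a location-dependent amplitude."""
    base = matern32(x1, x2, 1.0, length_scale, m_diag)
    return np.outer(rbf_sum(x1, centers, heights, width),
                    rbf_sum(x2, centers, heights, width)) * base

rng = np.random.default_rng(2)
points = rng.uniform(0.0, 1.0, size=(30, 2))
centers = np.array([[0.25, 0.25], [0.75, 0.75]])   # hypothetical basis locations
heights = np.array([0.5, 3.0])                     # the alpha_i
K = nonstationary_kernel(points, points, centers, heights, 0.1, 0.3,
                         np.array([1.0, 4.0]))
min_eigenvalue = np.linalg.eigvalsh(K).min()
prior_variance = np.diag(K)
```

Multiplying a valid kernel by $f(x_1) f(x_2)$ preserves positive semi-definiteness, so the construction remains a legitimate covariance while the prior variance $f(x)^2$ now changes across the domain.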

Training as a constrained and ill-posed function optimization problem
One of the major drawbacks of using nonstationary kernels is that they need parametric representations of functions-signal variances, length scales, and so on-which gives rise to many more hyperparameters compared with standard stationary kernels. For the example in the last section, for instance, the kernel definition needed 286 hyperparameters to be found, compared to two for an isotropic RBF kernel. This dependency of the hyperparameter number on the kernel definition shifts the focus of kernel methods to the training (i.e., the optimization of the hyperparameters). In this case study, we want to analyze the characteristics of the solutions of the training for the kernels used in the last example.
The optimization of the hyperparameters is naturally constrained for many kernels due to the subset $(0, \infty)^n \subset \mathbb{R}^n$ on which the log likelihood is well defined (e.g., signal variances and length scales should never be negative or zero). $n$, in this case, is the number of hyperparameters. Other constraints are potentially introduced through domain knowledge. When using local optimizers for hyperparameter training, it is the authors' recommendation to remove the bounds on the optimization by considering simple transformations (e.g., $\exp(-\|x_1 - x_2\| l^2)$ instead of $\exp(-\|x_1 - x_2\| / l)$). Transformations such as this should be applied to make the optimization domain closed but unbounded. In Figure 4, we see top views of marginal log-likelihood functions. It is apparent that nonstationary kernels give rise to nonuniqueness of solutions and regions in which the ratio of eigenvalues of the Hessian is very large, indicating flat regions or ridges.
It is a frequently discussed topic whether to use optimization to find the hyperparameters or to use Markov Chain Monte Carlo (MCMC), which is the fully Bayesian approach. The argument against optimization is that it can lead to overfitting, and the argument against MCMC is that it can be slow to converge, especially in situations when many hyperparameters have to be found. The advantage of optimization is that we do not need to specify a prior, which can be difficult for nonstandard kernels. A potential middle ground is to find optima using optimization and use the Laplace approximation to account for the uncertainty in the hyperparameters.
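The suggested transformation and the Hessian diagnostic can be sketched together. In this hedged example, the exponential kernel is written as $s^2 \exp(-\|x_1 - x_2\| \theta^2)$ so that every real pair $(\theta, s)$ is admissible; squaring the signal amplitude $s$ is an additional assumption made here for illustration. A coarse grid search stands in for an optimizer, and the Hessian is approximated by central finite differences.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
y = np.cumsum(rng.normal(0.0, 0.2, size=30))   # rough, non-smooth toy signal
noise_var = np.full(30, 1e-2)
dist = np.abs(x[:, None] - x[None, :])

def neg_log_likelihood(params):
    """Defined for every real (theta, s): rate = theta**2, signal variance = s**2."""
    theta, s = params
    cov = s**2 * np.exp(-dist * theta**2) + np.diag(noise_var)
    _, log_det = np.linalg.slogdet(cov)
    return (0.5 * y @ np.linalg.solve(cov, y) + 0.5 * log_det
            + 0.5 * len(x) * np.log(2.0 * np.pi))

def hessian(func, point, step=1e-3):
    """Central finite-difference Hessian of func at point."""
    point = np.asarray(point, dtype=float)
    n = len(point)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * step, np.eye(n)[j] * step
            hess[i, j] = (func(point + e_i + e_j) - func(point + e_i - e_j)
                          - func(point - e_i + e_j)
                          + func(point - e_i - e_j)) / (4.0 * step**2)
    return hess

# Coarse grid search over the unbounded domain (stand-in for an optimizer).
grid = np.linspace(-3.0, 3.0, 61)
values = np.array([[neg_log_likelihood((t, s)) for s in grid] for t in grid])
i_best, j_best = np.unravel_index(np.argmin(values), values.shape)
best = np.array([grid[i_best], grid[j_best]])
eigenvalues = np.linalg.eigvalsh(hessian(neg_log_likelihood, best))
```

The objective is finite everywhere, including at zero and at negative parameter values, so no bounds are needed. One side effect of squared parametrizations is that every optimum is mirrored under a sign change of the parameters; this duplication is harmless but worth knowing when counting optima. The ratio of the entries of `eigenvalues` indicates how flat the objective is along one direction compared with the other.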

Acquisition functionals for optimal measurement suggestions
Figure 2. [...] The bottom images show the ground-truth model, the noise model, and the accuracy of the model function approximation as a function of the iteration number. We can see that fixed noise, which is underestimated in this case, has the potential to render the autonomous experiment entirely ineffective. The hyperparameter optimization is misled, which leads to overfitting. Although the optimized noise model performs better, no extra emphasis is put on regions with high noise. The measured noise delivers the best model by strategically increasing measurement point density in high-noise regions.
Figure 3. [...] and nonstationary (Equation 9) kernels. The overall approximation error is lower for the nonstationary kernel. The key takeaway, however, is that uncertainties are much more realistically represented for the nonstationary kernel. The posterior variance for the stationary kernel does not respond to local characteristics of the underlying latent function. In contrast, the posterior variance calculated using the nonstationary kernel seems sensitive to the latent function, which results in very low uncertainties just east of the Rocky Mountains where the topography is nearly flat.
As described in the "Some theory of Gaussian processes and related autonomous experimentation" section, the acquisition functional $f_a$ has a major impact on the performance of an autonomous experiment. Imagine a situation in which a function should be explored but with a focus on regions $\{x \in \mathcal{X}: f(x) < b, b \in \mathbb{R}\}$. In simpler terms, "valleys" with a certain "depth" should be given priority in the exploration. For materials scientists, the valley could be understood as a region of small crystal or grain sizes. Because the problem is exploratory, a practitioner could utilize $f_a(x) = \sigma^2(x)$. However, the focus on valleys could motivate a lower-confidence-bound style acquisition functional $f_a(x) = -(m(x) - 3\sigma(x))$. Ideally, the functional would allow the exploration of the regions of interest. Recalling that an acquisition functional is a function of the posterior probability density function (hence the name "functional"), we could focus on the entire region of interest by defining
$$f_a(x) = P\left(f(x) \le b\right) = \int_{-\infty}^{b} \mathcal{N}\left(f; m(x), \sigma^2(x)\right) \mathrm{d}f \quad \text{(Equation 11)},$$
i.e., the probability that the latent function $f(x) \le b$. For a visual comparison of the mentioned acquisition functionals, see Figure 5. We can see in the figure that all acquisition functions find similar results. However, while pure exploration wastes measurements on exploring the function outside of the region of interest, the lower-confidence bound focuses too much on finding the minima. The acquisition functional (Equation 11) balances the two to allow exploration of the region of interest.
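The three acquisition functionals discussed above can be compared on a handful of candidate points. The sketch below assumes a Gaussian posterior with given mean and standard deviation at each candidate; the numbers are hypothetical and chosen only to show how the rankings differ.

```python
import math
import numpy as np

def variance_acquisition(mean, std):
    """Pure exploration: f_a(x) = sigma^2(x)."""
    return std**2

def lower_confidence_bound(mean, std, weight=3.0):
    """f_a(x) = -(m(x) - weight * sigma(x)); large where low values are plausible."""
    return -(mean - weight * std)

def probability_below(mean, std, bound):
    """f_a(x) = P(f(x) <= bound) under the Gaussian posterior N(m(x), sigma^2(x))."""
    z = (bound - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

# Hypothetical posterior at three candidates: a well-measured deep valley,
# an uncertain point near the valley edge, and an uncertain point far above it.
mean = np.array([-2.0, -0.9, 1.0])
std = np.array([0.05, 0.3, 0.8])
bound = -1.0

explore = variance_acquisition(mean, std)
lcb = lower_confidence_bound(mean, std)
prob = probability_below(mean, std, bound)
```

For these values, the variance ranks the third candidate highest although its mean lies far above the bound, while the lower-confidence bound ranks the first candidate highest although it is already well determined. The probability-based functional assigns each candidate a value according to how likely it is to belong to the region of interest, so the uncertain candidate near the valley edge keeps a substantial value.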

Conclusion
Figure 4. (a) [...] (Equation 8). There is a global optimum that can easily and inexpensively be found with local or global optimizers. We can see from the eigenvalues of the Hessian that there is a direction (long arrow) in which we could move and the function value would only slightly change, which suggests some ambiguity in the model. We can see in (b-d) that this phenomenon becomes increasingly challenging in higher dimensions. (b) The marginal log-likelihood function over a 2D slice through the 286-dimensional space spanned by the hyperparameters. The two presented hyperparameters are the constant width of the basis functions of the kernel and the length scale. Just in this slice, we see three optima (two maxima and one saddle point) that a local optimizer could identify as the final solution. (c) All found eigenvalues of the Hessian in the nonstationary case. The plot reveals that the final solution is actually a saddle point since the eigenvalues change sign. The key takeaway from this figure is that finding the hyperparameters of our model robustly becomes more challenging as we use more flexible nonstationary kernel designs. Although this is not a reason to avoid those more powerful kernels, it motivates the employment of powerful optimization algorithms that can handle nonuniqueness well. (d) The same function in a 2D slice spanned by the first two basis function coefficients. Here, many optima can be identified, and some are marked in the image. Every time, the ratio of the eigenvalues tells a story about the stability and robustness of the solution.
Figure 5. Three different acquisition functionals and their effect on exploring the latent function (top). We assume that the interest of the practitioner is in the valley region bordered by the contour line on both sides. An acquisition functional focusing on minimizing the uncertainty (pure exploration, left) will explore all regions of the input space. A lower-confidence-bound acquisition functional (middle) will focus on finding minima but runs the risk of not exploring the region of interest fully. An acquisition functional using the posterior distribution to value the probability of the latent function value being in the right range accomplishes exactly what was intended. Although none of the shown acquisition functionals is particularly sophisticated, they show the importance of customization in order to maximize the benefit during an autonomous experiment.
In this article, we looked at some mathematical idiosyncrasies and nuances of Gaussian process-driven autonomous data acquisition, which can, if not identified and countered, lead to undesired behavior and even error-prone model identification. We selected four examples: non-i.i.d. measurement noise, nonstationary kernels, the issue of ill-posed optimization problems that have to be solved for training a GP, and tailored acquisition functionals. The key takeaway for all four examples is the same: To maximize the success of a GP function approximation or associated autonomous experiment, we have to go beyond standard setups and customize every aspect and property of the method. This takeaway is not exclusive to the GP framework, but applies to all ML methods. The first test ("The role of noise for autonomous experiments" section) showed that the performance of an autonomous experiment strongly depends on the accurate estimation of the measurement noise. The dangers go much beyond inefficiencies; insufficiently accurate noise estimates can render the resulting function approximation entirely useless. In addition, the estimation of uncertainties across the domain becomes so poor that data acquisition control becomes random due to very small length scales, which leads to overfitting (Figure 2) and high uncertainties almost everywhere. Our next test illustrated the benefits of allowing nonstationarity in the kernel design ("Stationary kernels versus nonstationary kernels for Gaussian processes" section). Function approximation and uncertainty quantification are significantly more accurate for nonstationary kernels. Both affect the performance of the autonomous experiment. More flexible kernels, however, lead to many more hyperparameters that have to be found. In the "Training as a constrained and ill-posed function optimization problem" section, we investigated the nonuniqueness properties of the optimization problem and found that more hyperparameters and nonstationary kernels can lead to many solutions. This is an important lesson for novel automated kernel selection methods, such as deep kernel learning; unnecessary hyperparameters will lead to ill-posed optimization problems that could be impossible to solve robustly. The practitioner will often not be informed about the problem but is left with a faulty model to work with. Our last test ("Acquisition functionals for optimal measurement suggestions" section) drew attention to the acquisition functional of an autonomous experiment. The simple example illustrated (Figure 5) that a well-customized acquisition functional leads to a much more targeted data acquisition.
Of course, there are many more pitfalls to look out for when performing ML and AE. Some of those are very practical, such as visualizing the model, at least in slices, to check its validity before using it for decision-making, and making sure the input data are cleaned up. Simple sanity checks can make sure the data are in order before applying any ML method: Were data points recorded twice? Is noise correctly communicated? Are there "NaN"s in the data set? What about outliers? And so on. For GPs, some of the sanity checks are naturally included because the trained hyperparameters often have a physical meaning, which can be checked against the intuition of the practitioner. For this, however, it is indispensable to know about the mathematics of the underlying method.
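The sanity checks listed above are easy to automate. The following sketch is one possible way to do so and rests on assumptions: inputs are stored as an array of shape (number of points, dimension), noise is communicated as one variance per point, and outliers are flagged with a robust z-score whose threshold is arbitrary. The function name and the toy data are hypothetical.

```python
import numpy as np

def sanity_report(x, y, noise_var, outlier_threshold=5.0):
    """Count common data problems before a GP (or any ML model) is trained."""
    x = np.asarray(x, dtype=float)          # expected shape: (n_points, n_dims)
    y = np.asarray(y, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    finite = np.isfinite(x).all(axis=1) & np.isfinite(y) & np.isfinite(noise_var)
    unique_rows = np.unique(x[finite], axis=0)
    median = np.median(y[finite])
    mad = np.median(np.abs(y[finite] - median))      # robust spread estimate
    scale = 1.4826 * mad if mad > 0.0 else 1.0
    robust_z = np.abs(y[finite] - median) / scale
    return {
        "non_finite": int((~finite).sum()),
        "duplicates": int(finite.sum() - len(unique_rows)),
        "non_positive_noise": int((noise_var[finite] <= 0.0).sum()),
        "outliers": int((robust_z > outlier_threshold).sum()),
    }

x = np.array([[0.0, 0.0], [0.5, 0.5], [0.5, 0.5], [1.0, 0.0],
              [0.0, 1.0], [np.nan, 0.2], [0.3, 0.7]])
y = np.array([1.0, 1.2, 1.2, 0.9, 1.1, 1.0, 50.0])
noise_var = np.array([0.01, 0.01, 0.01, 0.0, 0.01, 0.01, 0.01])
report = sanity_report(x, y, noise_var)
```

For this toy data set, the report flags one point with a NaN input, one position recorded twice, one missing (zero) noise value, and one outlier.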
In summary, the authors hope that this collection of examples will help practitioners in the materials sciences avoid some of the common mathematical and statistical pitfalls when using Gaussian process-driven autonomous experimentation methods.