Active Learning and Bayesian Optimization: a Unified Perspective to Learn with a Goal

Science and engineering applications are typically associated with expensive optimization problems whose solution identifies optimal designs and states of the system of interest. Bayesian optimization and active learning compute surrogate models through efficient adaptive sampling schemes to assist and accelerate this search toward a given optimization goal. Both methodologies are driven by specific infill/learning criteria which quantify the utility, with respect to the set goal, of evaluating the objective function for unknown combinations of optimization variables. While the two fields have seen an exponential growth in popularity over the past decades, their dualism and synergy have received relatively little attention to date. This paper discusses and formalizes the synergy between Bayesian optimization and active learning as symbiotic adaptive sampling methodologies driven by common principles. In particular, we demonstrate this unified perspective by formalizing the analogy between Bayesian infill criteria and active learning criteria as the driving principles of both goal-driven procedures. To support our original perspective, we propose a general classification of adaptive sampling techniques that highlights similarities and differences between the vast families of adaptive sampling, active learning, and Bayesian optimization. Accordingly, the synergy is demonstrated by mapping Bayesian infill criteria onto active learning criteria, and is formalized for searches informed by a single information source as well as by multiple levels of fidelity. In addition, we provide guidelines for applying those learning criteria by investigating the performance of different Bayesian schemes on a variety of benchmark problems, highlighting benefits and limitations with respect to the mathematical properties that characterize real-world applications.


Introduction
In science and engineering, the development of advanced technologies involves the formalization and solution of optimization problems to identify both optimal designs capable of satisfying competing performance requirements [1], and states of the system to monitor its health status during the operational life [2]. Depending on the specific application, the identification of optimal solutions requires the minimization of an objective function that measures the goodness of design configurations with respect to the requirements, or the accuracy of the estimated health status of the system with respect to the measurements. Typically, the scale of complexity of engineering systems requires several evaluations of this objective function through accurate computer simulations (e.g. Computational Fluid Dynamics (CFD) or Computational Structural Dynamics (CSD)) or physical experiments (e.g. lab-scale test benches or real-world testing) before an optimal solution can be assessed. The use of highly complicated representations of those systems leads to a significant bottleneck: the demand for resources to evaluate the objective function for all the combinations of optimization variables is difficult to satisfy adequately. Indeed, the acquisition of data from these high-fidelity models involves non-trivial computational and economic costs arising from the computation of the objective function and its derivatives over, ideally, the entire optimization domain.
Surrogate models are computed on evaluations of the objective function acquired through computer codes and/or physical experiments of the system: these sources of information are mostly treated as purely input/output black-box relationships whose analytical form is unknown and not directly accessible to the optimizer. Thus, the accuracy and efficiency of the resulting surrogate are highly dependent on the sampling approach adopted to select informative combinations of optimization variables for the acquisition of data. Among the numerous sampling schemes available in the literature, it is possible to identify two major families: one-shot and sequential schemes. The one-shot strategy defines a grid of samples over the domain all at once. Examples include Latin hypercube [3], factorial and fractional factorial designs [4, 5], Plackett-Burman [6], and D-optimal [7] designs. However, it is very hard to identify a priori the best design of experiments to efficiently compute the most informative surrogate. To overcome these limitations, sequential sampling selects samples over the domain through an iterative process [8, 9]. Among these, adaptive sampling [10] provides resource-efficient techniques that seek to reduce as much as possible the evaluations of the objective function, and targets the improvement of the fitting quality across the domain and/or the acceleration of the optimization search [11, 12, 13]. Popular adaptive sampling schemes for black-box optimization problems characterized by expensive evaluations of the objective function are those realized through the Bayesian Optimization (BO) methodology [14, 15]. BO aims to efficiently elicit valuable data from models of the system to contain the computational expense of the optimization procedure. The Bayesian routine iteratively computes a surrogate model of the objective function, and defines a goal-driven sampling process through an acquisition function computed on the surrogate information. This acquisition function measures the merit of samples according to certain infill criteria, and permits the selection of the next sample as the one maximizing the query utility with respect to the given optimization goal.
The popular paradigms for Bayesian optimization show a substantial synergy with active learning schemes that has not been explicitly discussed and formally described in the literature to date. This paper proposes the explicit formalization of this synergy through an original perspective of Bayesian optimization and active learning as symbiotic expressions of adaptive sampling schemes. The aim of this unifying viewpoint is to support the use of those methodologies, and to point out and discuss their analogies via mathematical formalization. This unified interpretation is based on the formulation and demonstration of the analogy between Bayesian infill criteria and active learning criteria as the elements responsible for the decision on how to learn from samples to reach the given goal. In support of this unified perspective, this paper first clarifies the concept of goal-driven learning, and proposes a general classification of adaptive sampling methods that recognizes Bayesian optimization and active learning as methodologies characterized by goal-oriented search schemes. We then elucidate the synergy between Bayesian optimization and active learning by mapping the Bayesian learning features onto the active learning properties. The mapping is discussed through the analysis of three popular Bayesian frameworks, both for the case of a single information source and when a spectrum of multiple sources is available to the search. In addition, we observe the capabilities introduced by the different learning criteria over a comprehensive set of benchmark problems specifically defined to stress test and validate goal-driven approaches [16]. The objective is to discuss opportunities and limitations of different learning principles over a variety of challenging mathematical properties of optimization problems frequently encountered in complex scientific and engineering applications. This manuscript is organized as follows. Section 2 discusses goal-driven learning procedures and defines the concept of a goal-driven learner according to surrogate modeling and optimization. In Section 3, we recognize that Bayesian optimization, active learning and adaptive sampling are not fully superimposable concepts, and propose a general classification to position Bayesian optimization and active learning with respect to adaptive sampling methodologies. Then, Section 4 provides an overview of Bayesian optimization and multifidelity Bayesian optimization. Section 5 presents our perspective on the symbiotic relationship between Bayesian optimization and active learning. Then, in Section 6, popular Bayesian optimization and multifidelity Bayesian optimization algorithms are numerically investigated over a variety of benchmark problems. Finally, Section 7 provides concluding remarks.

Goal-Driven Learning
Goal-driven learning is a decision-making process in which each decision is made to acquire the specific information about the system of interest that contributes the most to achieving a given goal [17, 18, 19, 20, 21, 22]. This learning goal can be the increase of the knowledge of the system behaviour over the whole domain of application, or the acquisition of specific knowledge to enhance and accelerate the identification of optimal solutions. Accordingly, a goal-driven learner selects what to learn considering both the current knowledge and the information needed, and determines how to learn by quantifying the relative utility of alternative options in the current circumstances. This paper focuses on Bayesian optimization and active learning as goal-driven procedures where a surrogate model is built to accurately represent the behaviour of a system or to effectively inform an optimization procedure that minimizes given objectives. This goal-driven process is guided by learning principles that determine the "best" location of the domain at which to acquire information about the system, and refine the surrogate model toward the goal: improve the accuracy of the surrogate, or minimize an objective function over the domain. Formally, these surrogate based modeling and optimization problems can be formulated as a minimization problem of the following form:

x* = argmin_{x ∈ χ} f(R(x))    (1)

where f(R(x)) denotes the objective function evaluated at the location x ∈ χ of the domain χ. The objective function is of the general form f = f(R(x)), where R(x) represents the response of the system of interest evaluated through a model, e.g. computer-based numerical simulations or real-world experiments. In surrogate based modeling, the objective function can be represented as the error between the approximation of the surrogate model and the response of the system: the goal is to minimize this error to improve the accuracy of the surrogate over the whole domain. In surrogate based optimization, the objective function represents a performance indicator dependent on the system response: the goal is to minimize this indicator to improve the capabilities of the system according to given performance requirements.
Goal-driven techniques address Equation (1) through an iterative decision-making process where learning principles tailor the acquisition of the specific knowledge about the objective function (the evaluation of f at certain domain locations x) currently needed to update the surrogate and inform the learner toward the given goal.
In this context, the goal-driven learner is the agent that makes decisions based on the current knowledge of the system of interest, and acquires new information to accomplish a given goal while augmenting the awareness about the system itself. In practice, the learner queries the sample that maximizes the utility toward achieving the desired goal: specific learning principles quantify this utility based on the surrogate estimate and in response to information needs. At the same time, the surrogate model is dynamically updated once new information is acquired, and informs the learner to focus and tailor the elicitation of samples on the fly to further pursue the goal. Thus, the distinguishing element of a goal-driven learning procedure is the mutual exchange of information between the learner and the surrogate model: the learner assimilates the information from the surrogate to make a decision aimed at achieving the goal, and the approximation/prediction of the surrogate is enriched by the result of this decision.
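As a concrete illustration, this mutual exchange can be sketched as a simple loop. The snippet below is a minimal, hypothetical example (not taken from any specific framework): the surrogate is a piecewise-linear interpolant, the uncertainty proxy is the distance to the nearest sample, and the utility trades off the predicted value (exploitation) against that proxy (exploration).

```python
import numpy as np

def objective(x):
    # hypothetical expensive black-box f(R(x)); minimum at x = 0.3
    return (x - 0.3) ** 2

X = [0.0, 1.0]                      # initial samples
y = [objective(x) for x in X]

for _ in range(8):
    xs, ys = np.array(X), np.array(y)
    cand = np.linspace(0.0, 1.0, 201)
    # surrogate prediction: piecewise-linear interpolant of observed data
    mu = np.interp(cand, *zip(*sorted(zip(xs, ys))))
    # crude uncertainty proxy: distance to the nearest evaluated sample
    unc = np.min(np.abs(cand[:, None] - xs[None, :]), axis=1)
    utility = -mu + unc             # exploit low predictions, explore gaps
    x_new = float(cand[np.argmax(utility)])
    X.append(x_new); y.append(objective(x_new))   # enrich the surrogate data

best = min(y)
```

Each iteration refines the surrogate with the newly queried sample, which in turn reshapes the utility landscape for the next decision.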

Adaptive Sampling Classification
Bayesian optimization and active learning realize adaptive sampling schemes to efficiently accomplish a given goal while adapting to the previously collected information. In recent years, there has been a profusion of literature devoted to the general topic of adaptive sampling but arguably a blurring of focus: many contributions from different fields have provided a wealth of interesting advancements, but have also led to some degree of confusion around the concepts of adaptive sampling, active learning and Bayesian optimization. Figure 1 illustrates the use of the terms "adaptive sampling", "active learning", and "Bayesian optimization" from 1990 to 2022, together with the combined use of all three terms over the same period. Both the general increasing trend in the use of the three techniques and the associated increase in the combined use of the three terms can be appreciated. The three concepts have often been used as complete synonyms, with a growing abuse motivated by the difficulty of mapping their (blurred) boundaries.
Stemming from these considerations, this paper recognizes that adaptive sampling is not always superimposable with active learning and Bayesian optimization. Figure 2 illustrates the relationships between those three methodologies.
We propose a classification of adaptive sampling techniques in three main families, namely adaptive probing (Section 3.1), adaptive modeling (Section 3.2) and adaptive learning (Section 3.3). This classification is based on the concept of goal-driven learning as the distinctive element of adaptive learning methodologies: the learner assimilates the information from the surrogate model to make a decision aimed at achieving a goal, and the surrogate is enriched by the result of this decision following a mutual exchange of information. Conversely, the adaptive probing and adaptive modeling classes do not realize goal-driven learning: the former does not rely on a surrogate model to assist the sampling procedure, while the latter computes a surrogate model that is not used to inform the search task. This classification clarifies the reciprocal positions of adaptive sampling, active learning and Bayesian optimization.
Accordingly, adaptive sampling and active learning do not completely overlap. Active learning strategies are categorized into population-based and pool-based algorithms according to the nature of the search procedure [23, 24]. In population-based active learning, the distribution of the objective function is available: the learner seeks to determine the optimal training input density to generate training points without relying on a surrogate model of the objective function. Conversely, pool-based active learning computes a surrogate model of the unknown objective function that is used to inform the learner toward a given goal, and is updated during the procedure to refine the informative content supporting the learning procedure. Thus, pool-based active learning methods realize goal-driven learning schemes and can be collocated in the adaptive learning class, while population-based active learning techniques cannot be considered adaptive sampling. Following the proposed classification, Bayesian optimization represents the logical intersection between active learning and adaptive sampling since (i) BO realizes an adaptive sampling scheme toward a given goal, and (ii) the BO goal-driven learning procedure is guided by learning principles also traceable in active learning schemes. This synergy between Bayesian optimization and active learning is the main focus of our work, and the remainder of this manuscript is dedicated to formalizing and discussing this dualism. To support this discussion, we provide additional details of the proposed classification for adaptive sampling, and review some popular approaches for each of the three classes. The literature on adaptive sampling is vast, and a complete review goes beyond the purpose of this work. Although our discussion will not be comprehensive, the objective is to highlight the distinguishing features of each class and clarify the relative positions of adaptive sampling, active learning and Bayesian optimization.

Adaptive Probing
Adaptive probing schemes exploit the observations of previous samples without computing any surrogate model. These sampling procedures are informed exclusively by the collected data to guide the selection of the next location to query, and exclude the adoption of emulators to support the search. Several adaptive probing frameworks have been developed based on the Monte Carlo method [25, 26]. Among these, adaptive importance sampling [27, 28, 29] and adaptive Markov Chain Monte Carlo sampling [30, 31] represent popular methodologies adopted in different practical scenarios, from signal processing [32, 33] to reliability analysis of complex systems [34, 35]. Adaptive importance sampling uses previously observed samples to adapt the proposal densities and locate the regions from which samples should be drawn; this strategy permits the iterative improvement of the quality of the sample distribution and enhances the accuracy of the relative inference from these observations. Adaptive Markov Chain Monte Carlo (MCMC) determines the parameters of the MCMC transition probabilities on the fly through the already collected information. This adaptively generates new samples from a usually complex and high-dimensional distribution, and enhances the overall computational efficiency and reliability of the procedure. In the next paragraph, we report the mathematical formulation of adaptive importance sampling to illustrate the properties of adaptive probing methodologies and the elements that differentiate them from active learning paradigms.

Adaptive Importance Sampling
Adaptive Importance Sampling (AIS) usually considers a generic inference problem characterized by a certain probability density function (pdf) π(x) of a d_x-dimensional vector of unknown static real parameters x ∈ χ. AIS frameworks aim to provide a numerical approximation of some particular moment of x:

I(f) = ∫_χ f(x) π(x) dx    (2)

where f : χ → R can be any function of x integrable with respect to the pdf π(x). The integral I(f) is representative of different mathematical problems, from Bayesian inference [36] to the estimation of rare events [37]. In many practical scenarios, the integral I(f) cannot be computed in closed form. Adaptive importance sampling provides an algorithmic framework to efficiently address this problem.
Let us define a proposal probability density function q(x) to simulate samples under the restriction that q(x) > 0 for all x where π(x)f(x) ≠ 0. AIS provides an iterative procedure that improves the quality of one or multiple proposals q(x) to approximate a non-normalized, non-negative target function π(x). At the beginning, AIS initializes N proposals {q_n(x | θ_{n,1})}_{n=1}^N parameterized through the vectors θ_{n,1}. Then, the procedure simulates K samples x_{n,1}^{(k)}, n = 1, ..., N, k = 1, ..., K, from each proposal, and assigns to each sample an associated importance weight formalized as follows:

w_{n,1}^{(k)} = π(x_{n,1}^{(k)}) / q_n(x_{n,1}^{(k)} | θ_{n,1})    (3)

These importance weights measure the representativeness of each sample simulated from the proposal pdf q(x) with reference to the distribution of the random variables π(x).
At this point, this set of weighted samples (x_{n,1}^{(k)}, w_{n,1}^{(k)}), n = 1, ..., N, k = 1, ..., K, is used to define a self-normalized estimator:

Ĩ(f) = Σ_{n=1}^N Σ_{k=1}^K w̄_{n,1}^{(k)} f(x_{n,1}^{(k)})    (4)

where w̄_{n,1}^{(k)} = w_{n,1}^{(k)} / Σ_{j=1}^N Σ_{l=1}^K w_{j,1}^{(l)} are the normalized weights. This permits the approximation of the target distribution as follows:

π̂(x) = Σ_{n=1}^N Σ_{k=1}^K w̄_{n,1}^{(k)} δ(x − x_{n,1}^{(k)})    (5)

where δ represents the Dirac measure.
Finally, AIS realizes the adaptation phase and updates the parameters of the n-th proposal from θ_{n,1} to θ_{n,2} using the last set of drawn parameters [38] or all the parameters evaluated so far [39]. The whole procedure is repeated until a certain termination criterion is met (e.g. a maximum number of iterations).
This adaptive policy permits the single or multiple proposal densities to gradually evolve and accurately approximate the target pdf. The generation of new samples is driven solely by the measurement of the importance of previous samples (weighting), which supports the updating of the proposal parameters (adaptation). Thus, AIS adaptively locates promising regions to query without benefiting from an overall quantification of the goodness of the whole spectrum of samples available in the domain, e.g. through the construction of a surrogate model. On this basis, AIS and the general class of adaptive probing strategies cannot be considered learning procedures, since the adaptation phase is not informed by a surrogate model updated on the fly during the procedure, and is not guided by a "learner" that assimilates information from this emulator and adapts the next queries to achieve a given goal.
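The weighting/adaptation cycle described above can be sketched in a few lines. The example below is a minimal, hypothetical instance with a single Gaussian proposal adapted by weighted moment matching: the target π is a standard normal and the integrand is f(x) = x², so the true value of I(f) is 1. The floor on the adapted standard deviation is a stabilizing safeguard added for this illustration, not part of the general method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pdf pi(x): standard normal. Integrand f(x) = x**2, so I(f) = E[x^2] = 1.
log_pi = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
f = lambda x: x ** 2

mu, sigma = 3.0, 2.0                         # deliberately poor initial proposal
for _ in range(20):
    x = rng.normal(mu, sigma, 1000)          # simulate samples from q(x | theta)
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    w = np.exp(log_pi(x) - log_q)            # importance weights pi(x)/q(x)
    wn = w / w.sum()                         # self-normalized weights
    I_hat = float(np.sum(wn * f(x)))         # self-normalized estimator of I(f)
    # adaptation: refit the proposal parameters to the weighted samples
    mu = float(np.sum(wn * x))
    sigma = max(float(np.sqrt(np.sum(wn * (x - mu) ** 2))), 0.5)  # safeguard floor
```

After a few adaptation steps the proposal drifts toward the target (mu ≈ 0, sigma ≈ 1) and the estimate Ĩ(f) stabilizes near 1.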

Adaptive Modeling
Adaptive modeling paradigms sample the domain supported by the information from previous queries, and use the collected data to build a surrogate model. However, the informative content encoded in the emulator is not used to guide the sampling and decide the next point to evaluate. Adaptive modeling approaches have been extensively developed for the reliable propagation and quantification of uncertainties [40, 41], the analysis of ordinary or partial differential equations [42, 43], and inverse problems [44, 45]. One common approach is represented by adaptive stochastic collocation methodologies, which use an adaptive sparse grid approximation scheme to construct an interpolant polynomial in a multi-dimensional random space [46, 47]. The adaptive selection of collocation points is driven by an error indicator [48] or estimator [49] that evaluates a certain number of sparse admissible subspaces of the domain: the subspace that exhibits the highest error is included in the grid and the new set of subspaces is identified. Other well-known adaptive modeling approaches are residual-based sampling distributions [50]. This family of techniques is mostly applied to improve the training efficiency of Physics-Informed Neural Network (PINN) surrogate models. Residual-based approaches enhance the distribution of residual points by placing more samples according to certain properties of the residuals during the training of the PINN. This decision can be made on the basis of locations where the residual of the partial differential equation is large [51], according to a probability density function of the residual points [52], or through hybrid approaches of the above [50]. This permits a better accuracy of the final PINN surrogate model while containing the associated computational burden. Both stochastic collocation and residual-based samplings are intended to build an efficient and accurate surrogate model over the domain of samples. However, the sampling procedure is adapted solely on the basis of previously evaluated samples without a learning procedure from data: the surrogate model is not used to inform the decision on where to sample, and is not progressively updated with previous information. In the following, we provide general mathematical details about adaptive stochastic collocation to analyze the peculiarities of the adaptive modeling class, and underline the absence of a learning process during the construction of the surrogate model.

Adaptive Stochastic Collocation
Adaptive Stochastic Collocation (ASC) builds an interpolation function to approximate the outputs of a model of interest. This emulator is constructed on the evaluations of the model at valuable collocation points of the stochastic inputs to obtain the moments and the probability density function of the outputs.
Consider any point x contained in the random space Γ ⊂ R^N with probability distribution function ρ(x). The goal of ASC is to find an interpolating polynomial I(f) approximating a smooth function f(x) : R^N → R:

I(f)(x) = Σ_{k=1}^P f(x_k) L_k(x)    (6)

for a given set of points {x_k}_{k=1}^P, where L_k are the Lagrange interpolating polynomials. The selection of the collocation points majorly influences the capability of the interpolating polynomial to be close to the original function f. For multivariate problems, the interpolation function is defined as follows using the tensor product grid:

I(f)(x) = (U^{i_1} ⊗ ... ⊗ U^{i_N})(f)(x) = Σ_{j_1=1}^{n_{i_1}} ... Σ_{j_N=1}^{n_{i_N}} f(x^{i_1}_{j_1}, ..., x^{i_N}_{j_N}) (L^{i_1}_{j_1} ⊗ ... ⊗ L^{i_N}_{j_N})(x)    (7)

where U^{i_k} is the univariate interpolation function for the level i_k in the k-th coordinate, x^{i_k}_{j_m} is the j_m-th node, and L^{i_k}_{j_k} are the Lagrange interpolating polynomials. Equation (7) demands n_{i_1} × ... × n_{i_N} nodes, which indicates an exponential growth of the computational cost with the number of dimensions. Adaptive stochastic collocation targets the reduction of this computational effort through an adaptive sparse grid of collocation points: the objective is to wisely place more points of the grid in the important directions to prioritize the collection of highly informative data. This adaptive sparse grid is defined through a subset of the full tensor product grid as follows:

A(q, N)(f) = Σ_{|i| ≤ q} (Δ^{i_1} ⊗ ... ⊗ Δ^{i_N})(f)    (8)

where i = (i_1, ..., i_N) is a multi-index of levels, |i| = i_1 + ... + i_N, q is the sparseness parameter, and the difference formulas are defined by Δ^{i_k} = U^{i_k} − U^{i_k − 1}, with U^0 = 0. Equation (8) leverages the previous results to extend the interpolation from level q − 1 to q through the evaluation of the multivariate function on the sparse grid:

A(q, N)(f) = A(q − 1, N)(f) + Σ_{|i| = q} (Δ^{i_1} ⊗ ... ⊗ Δ^{i_N})(f)    (9)

where the evaluation at level q only requires the function values at the newly added sets of univariate nodes Δϑ^{i_k} = ϑ^{i_k} \ ϑ^{i_k − 1} for level i_k in the k-th coordinate.
This scheme adapts the sampling procedure through the knowledge acquired on the fly, and efficiently leverages data to improve the quality of the interpolation function. In this case, the selection of the collocation points is intended to compute an emulator of the target function, but the adaptive sampling is not driven by the information acquired from this emulator. In addition, the acquisition of data is not used to learn and update the surrogate model. These considerations on ASC can be extended to the general class of adaptive modeling methods: even if the sampling scheme is conceived to construct surrogate models, the selection of promising locations to query is not delegated to a goal-driven learner that leverages a mutual exchange of information with the surrogate.
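To make the adaptive placement of nodes tangible, the snippet below gives a minimal, one-dimensional caricature of surplus-driven refinement (an illustration of the idea, not a full Smolyak sparse grid): the interval whose midpoint exhibits the largest hierarchical surplus, i.e. the gap between a new function evaluation and the current interpolant, is refined first.

```python
import numpy as np

f = lambda x: np.exp(-20.0 * (x - 0.5) ** 2)   # sharply peaked target function

nodes, vals = [0.0, 1.0], [float(f(0.0)), float(f(1.0))]
for _ in range(15):
    order = np.argsort(nodes)
    xs, ys = np.array(nodes)[order], np.array(vals)[order]
    mids = 0.5 * (xs[:-1] + xs[1:])
    # hierarchical surplus: new evaluations vs. the current interpolant
    surplus = np.abs(f(mids) - np.interp(mids, xs, ys))
    i = int(np.argmax(surplus))                # refine the worst interval first
    nodes.append(float(mids[i])); vals.append(float(f(mids[i])))

order = np.argsort(nodes)
xx = np.linspace(0.0, 1.0, 401)
err = float(np.max(np.abs(f(xx) - np.interp(xx, np.array(nodes)[order],
                                            np.array(vals)[order]))))
```

The refinement concentrates nodes around the peak at x = 0.5, where the interpolant error is largest, mirroring how adaptive sparse grids invest points in the important directions.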

Adaptive Learning
Adaptive learning methodologies realize goal-driven learning processes characterized by the mutual exchange of information between the surrogate model and the goal-driven learner: the former is updated and refined after new evaluations of samples, while the latter decides the next query based on the updated approximation given by the emulator. Bayesian optimization and pool-based active learning belong to this specific class of adaptive sampling techniques. Bayesian frameworks constitute a learning process driven by the mutual informative assimilation between an acquisition function (the learner) and a surrogate model [53, 15]. The acquisition function quantifies the benefit of evaluating samples based on the prediction of the surrogate model, and selects the most useful sample to query toward the given goal, either to improve the accuracy of the surrogate over the domain or to effectively inform the optimization search; at the same time, the emulator is enriched with the data from the new query, and is updated to refine the approximation of the objective function over the domain. Similarly, pool-based active learning methods search the domain through a goal-driven learner informed by a classification model of samples [54, 55]. This process is characterized by the reciprocal flow of information between the learner and the emulator: the classification model is updated through the new evaluations of unsampled locations, and the learner uses this information to select the next query. Mathematical details about pool-based active learning are provided in the following section to better clarify the distinction between this class of adaptive learners and the other classes, which do not realize a goal-driven learning procedure.

Pool-Based Active Learning
Pool-based active learning commonly defines an optimal sampling strategy to improve the accuracy of a surrogate model adopted to classify data-points from a target distribution of labels over the domain of samples χ. Considering this general classification task, the pool-based active learning routine is grounded on a probabilistic estimate of the distribution of features f over the entire domain χ through a surrogate model f̂. This emulator is trained on a set of collected data-points, and maps features to labels f̂_N(x_n) = f̂_n through a predicted probability p_N(f̂_n = f | x_n) that estimates the distribution of features over the domain. Suppose we have collected from a large pool of unlabelled data χ the (small) dataset D_N = {x_n, f(x_n)}_{n=1}^N, observing the label values f(x_n) in output from an observation model or oracle at some informative locations x_n. Based on this dataset, the goal-driven procedure learns a surrogate model f̂_N whose predictive framework emulates the behaviour of samples over the domain based on the previously collected information.
At this point, a utility function acts as the goal-driven learner informed by the surrogate model, and identifies the most promising sample to be labelled by the oracle according to a measure of utility with respect to the given goal, i.e. improving the accuracy of the classifier. The next query augments the dataset D_{N+1} = D_N ∪ {x_{N+1}, f(x_{N+1})} and the surrogate model is updated. This utility function defines a learning policy that maps the current predictive distribution to a decision/action on where to sample at the next iteration as follows:

x_{N+1} ∈ argmax_{x ∈ χ} U(x; p_N)    (10)

Equation (10) mathematically formalizes the concept of a goal-driven learning procedure: the learner leverages the predicted probability of the surrogate p_N(f̂_n = f | x_n) to make an action x_{N+1}; at the same time, the decision is used to enrich the dataset D_{N+1} = {x_n, f(x_n)}_{n=1}^{N+1} and update the predicted probability p_{N+1}. This mutual exchange and assimilation between the learner and the surrogate represents the key aspect that defines a goal-driven learning process and the whole class of adaptive learning sampling schemes.
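A compact numerical illustration of this loop is sketched below under simplifying assumptions (the pool, the oracle threshold at 0.6, and the plain logistic-regression surrogate are all hypothetical): the utility is uncertainty sampling, i.e. the learner queries the pool point whose predicted class probability is closest to 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
pool = rng.uniform(0, 1, 200)                 # large pool of unlabelled samples
oracle = lambda x: (np.asarray(x) > 0.6).astype(float)  # hidden labelling rule

X, Y = [0.1, 0.9], [0.0, 1.0]                 # small seed dataset D_N
candidates = list(range(len(pool)))

def fit(xs, ys, steps=3000, lr=0.5):
    """Plain 1-D logistic-regression surrogate trained by gradient descent."""
    w = b = 0.0
    xs, ys = np.array(xs), np.array(ys)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * xs + b)))
        w -= lr * np.mean((p - ys) * xs)
        b -= lr * np.mean(p - ys)
    return w, b

for _ in range(10):
    w, b = fit(X, Y)
    p = 1.0 / (1.0 + np.exp(-(w * pool[candidates] + b)))
    # utility: uncertainty sampling, query the point closest to p = 0.5
    pick = candidates.pop(int(np.argmin(np.abs(p - 0.5))))
    X.append(float(pool[pick])); Y.append(float(oracle(pool[pick])))  # D_{N+1}

w, b = fit(X, Y)
pred = 1.0 / (1.0 + np.exp(-(w * pool + b))) > 0.5
acc = float(np.mean(pred == (oracle(pool) > 0.5)))
```

Queries concentrate near the decision boundary, so the classifier reaches high pool accuracy with only a dozen labels.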

Bayesian Frameworks
Bayesian optimization constitutes the mid-point between adaptive sampling and active learning. This intersection represents the focal point of our work, and motivates the substantial synergy between Bayesian optimization and active learning as adaptive sampling schemes capable of learning from data and accomplishing a certain learning goal. The remainder of this section is dedicated to a general overview of Bayesian optimization, considering both a single source of information (Section 4.1) and the case when multiple sources are available to the learning procedure (Section 4.2). This will guide the reader into the next sections, which make explicit the symbiosis between Bayesian frameworks and active learning through our original perspective of Bayesian optimization as a way to actively learn with acquisition functions (Section 5).

Bayesian Optimization
The birth of Bayesian optimization can be traced back to 1964 with the work of Kushner [56], where unconstrained one-dimensional optimization problems are addressed through a predictive framework based on the Wiener process surrogate model, and a sampling scheme guided by the probability of improvement acquisition function. Further contributions have been proposed by Zhilinskas [57] and Mockus [58], and the methodology has been extended to high-dimensional optimization problems in the works of Stuckman [59] and Elder [60]. Bayesian optimization achieved resounding success after the introduction of the Efficient Global Optimization (EGO) algorithm by Jones et al. [61]. EGO uses a Kriging surrogate model to predict the distribution of the objective function, and adopts the expected improvement acquisition function to measure the improvement of the optimization procedure obtained by evaluating unknown samples.
The EGO methodology paved the way for the application of Bayesian optimization to a wide range of problems in science and engineering. These research fields demand the efficient management of the information from black-box representations of the objective function (the procedure is only aware of the input and output, without a priori knowledge about the function) to guide the optimization search. Engineering has been a pioneer in the adoption of Bayesian optimization: the design optimization of complex systems is frequently characterized by computationally intensive black-box functions which require efficient global optimization methods. Early applications relate to engineering design optimization [62], computer vision [63] and combinatorial problems [64]. Nowadays, the Bayesian framework is widely adopted in many fields including, but not limited to, engineering [65, 66, 67, 68], robotics and reinforcement learning [69, 70, 71], finance and economics [72, 73], automatic machine learning [74, 75], and preference learning [76, 77]. In addition, significant advances have been made in the expansion of BO methodologies to the higher-dimensional search spaces frequently encountered in science and engineering, where the effectiveness of the search procedure is usually correlated with an exponential growth of the required observations of the objective function and the associated demand for computational resources and time. Within this context, BO techniques have been scaled to high-dimensional problems by exploiting potential additive structures of the objective function [78, 79], mapping high-dimensional search spaces into low-dimensional subspaces [80, 81], learning from observations of multiple input points evaluated through parallel computing [82, 83], and through simultaneous local optimization approaches [84].
Given a black-box expensive objective function f : χ → R, Bayesian optimization seeks to identify the input x* ∈ argmin_{x∈χ} f(x) that minimizes the objective f over an admissible set of queries χ at a reduced computational cost. To achieve this goal, Bayesian optimization relies on an adaptive learning scheme based on a surrogate model that provides a probabilistic representation of the objective f, and uses this information to compute an acquisition function U(x) : χ → R+ that drives the selection of the most promising sample to query. Let us consider the available information about the objective function f stored in the dataset D_N = {(x_1, y_1), ..., (x_N, y_N)}, where y_i ∼ N(f(x_i), σ_ε(x_i)) are noisy observations of the objective function and σ_ε is the standard deviation of the normally distributed noise.
At each iteration of the optimization procedure, the surrogate model depicts possible explanations of f as f ∼ p(f | D_N), placing a joint distribution over its behaviour at each sample x ∈ χ. Typically, Gaussian Processes (GPs) are widely used as the surrogate model for Bayesian optimization [85,86]. In GP regression, the prior distribution of the objective p(f) is combined with the likelihood function p(D_N | f) to obtain the posterior f | D_N ∼ GP(µ(x), κ(x, x′)), where µ(x) represents the prediction of the GP model at x and κ(x, x′) the associated uncertainty. BO uses this statistical belief to decide where to sample, assisted by an acquisition function U, which identifies the most informative sample x_new ∈ χ to be evaluated via maximization x_new ∈ argmax_{x∈χ} U(x). Then, the objective function is evaluated at x_new and this information is used to update the dataset. Acquisition functions are designed to guide the search for the optimum according to different infill criteria, which measure the improvement that the next query is likely to provide with respect to the current posterior distribution of the objective function. In engineering applications, different implementations have been proposed for the acquisition function, which differ in the infill schemes adopted to sample in pursuit of the optimization goal. Examples include the Probability of Improvement (PI) [87], Expected Improvement (EI) [61], Entropy Search (ES) [88] and Max-Value Entropy Search (MES) [89], Knowledge-Gradient (KG) [90], and non-myopic acquisition functions [91,92].
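The loop described above (posterior update, acquisition maximization, query, dataset augmentation) can be sketched with a minimal numpy/scipy implementation. The RBF kernel, its length-scale, the toy objective, and the grid-based maximization of the acquisition are illustrative choices, not prescriptions from the text:

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel kappa(x, x')
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean mu(x) and std sigma(x) on test points Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.maximum(np.diag(cov), 1e-12))

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: E[max(y_best - f(x), 0)]
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

f = lambda x: np.sin(3 * x) + 0.5 * x      # toy black-box objective
grid = np.linspace(0, 2, 200)              # discretized admissible set chi
X = np.array([0.2, 1.8]); y = f(X)         # initial dataset D_N

for _ in range(12):                        # BO loop
    mu, sigma = gp_posterior(X, y, grid)
    x_new = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_new), np.append(y, f(x_new))

x_best = X[np.argmin(y)]                   # incumbent after the search
```

The grid maximization is a stand-in for the inner optimization of U(x); in practice gradient-based or multi-start optimizers are used for continuous domains.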
The Probability of Improvement (PI) acquisition function encourages the selection of samples that are likely to yield improvements over the current minimum predicted by the surrogate model, while the Expected Improvement (EI) considers not only the probability of improvement but also the expected magnitude of the gain achieved by evaluating a certain sample. Other popular schemes are entropy-based acquisition functions such as Entropy Search (ES) and Max-Value Entropy Search (MES), which estimate the entropy of the location of the optimum and of the minimum function value, respectively, to maximize the mutual information between the samples and the global optimum. Knowledge-gradient sampling procedures are conceived for applications where evaluations of the objective function are affected by noise, recommending the location that maximizes the expected gain in the value of the solution that would be obtained by sampling it. Through non-myopic acquisition functions, the learner maximizes the predicted improvement over future iterations of the optimization procedure, overcoming myopic schemes where the improvement is measured only at the immediate next step.

Multifidelity Bayesian Optimization
The evaluation of black-box functions in engineering and science frequently requires time-consuming lab experiments or expensive computer-based models, which can dramatically increase the computational burden of the optimization procedure. This is the case for large-scale design optimization problems, where the evaluation of the objective function for a sufficient number of samples cannot be afforded in practice. In many real-world applications, the objective function can be computed using multiple representations at different levels of fidelity {f^(1), ..., f^(L)}, where the lower the level of fidelity, the less accurate but also the less time-consuming the evaluation. Multifidelity methods recognize that these representative levels of fidelity and their associated costs can be used to accelerate the optimization process, and enable a flexible trade-off between computational cost and accuracy of the solution. In particular, multifidelity optimization leverages low-fidelity data to massively query the domain, and uses a reduced number of high-fidelity observations to refine the belief about the objective function toward the optimum [93,94,95].
Accordingly, Multifidelity Bayesian Optimization (MFBO) learns a surrogate model that synthesizes the multiple available levels of fidelity through stochastic approximation, and uses an acquisition function as the learner that selects the most promising sample and the associated level of fidelity to interrogate. This learning procedure can accelerate the optimization, which is reflected in a likely improvement of the surrogate accuracy. According to Godino et al. [96], the improvement in performance usually occurs if the acquisition of large amounts of high-fidelity data is hampered by computational expense, the correlation between high-fidelity and low-fidelity data is high, and low-fidelity models are sufficiently inexpensive. Under different circumstances, multifidelity optimization might not deliver substantial accelerations or surrogate quality: the relationship between the size of the training set and surrogate accuracy is not monotonically increasing, as evidenced by [97]. In recent years, multifidelity Bayesian optimization has been successfully adopted for optimization problems ranging from engineering design optimization [98,99,22,100,101,102] and automatic machine learning [103,104] to applied physics [105,106] and medical applications [107,108]. In the context of high-dimensional problems, multifidelity Bayesian optimization capitalizes on fast low-fidelity models to alleviate the computational burden of the numerous observations of the objective function required to effectively direct the search toward the given goal, and has achieved promising results in terms of accuracy and efficiency for applications in quantum control [109], aerospace engineering [110], and reinforcement learning [111].
Multifidelity Bayesian optimization determines a learning procedure informed by a surrogate model of the objective function constructed on a dataset of noisy objective observations y_i ∼ N(f^(l_i)(x_i), σ_ε(x_i)), where the noise σ_ε is assumed to follow the same distribution across fidelities. This multifidelity surrogate model defines an approximation of the objective f^(l) ∼ p(f^(l) | (x, l), D_N) at each level of fidelity, and represents the belief about the distribution of the objective function over the domain χ based on data. A popular practice for MFBO is to extend the Gaussian process surrogate model to a multifidelity setting through an autoregressive scheme [112]:

f^(l)(x) = ρ^(l−1) f^(l−1)(x) + ζ^(l)(x),

where ρ^(l−1) is a constant scaling factor that includes the contribution of the previous fidelity with respect to the following one, and ζ^(l) ∼ GP(0, κ^(l)(x, x′)) models the discrepancy between two adjoining levels of fidelity. The posterior of the multifidelity Gaussian process is completely specified by the multifidelity mean function µ^(l)(x) = E[f^(l)(x)], which represents the approximation of the objective function at each level of fidelity, and the multifidelity covariance function κ^(l)(x, x′), which defines the associated uncertainty for each level of fidelity.
The availability of multiple representations of the objective function poses a further decision task that has to be handled by the learner when sampling unknown locations: the selection of the most promising sample is made jointly with the designation of the information source to be evaluated. This is achieved through a learner represented by the multifidelity acquisition function U(x, l), which extends the infill criteria of Bayesian optimization and selects the pair of sample and associated level of fidelity to query, (x_new, l_new) ∈ argmax_{x∈χ, l∈L} U(x, l), that is likely to provide the highest gain with regard to the computational expenditure. Among different formulations, well-known multifidelity acquisition functions for optimization problems are the Multifidelity Probability of Improvement (MFPI) [113], Multifidelity Expected Improvement (MFEI) [114], Multifidelity Predictive Entropy Search (MFPES) [115], Multifidelity Max-Value Entropy Search (MFMES) [116], and the non-myopic multifidelity expected improvement [21]. These formulations of the acquisition function define adaptive learning schemes that retain the infill principles characterizing their single-fidelity counterparts, and account for the dual decision task by balancing the gains achieved through accurate queries against the associated cost during the optimization procedure.
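A minimal sketch of the dual decision task follows. Here the utility of a pair (x, l) is taken as an EI-style score divided by the query cost of fidelity l, a common cost-weighting heuristic rather than any specific acquisition from the references above; the posterior means, uncertainties, and costs are made-up stand-ins for a multifidelity GP posterior:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI for minimization under a Gaussian posterior
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

grid = np.linspace(0, 1, 101)
costs = {0: 1.0, 1: 10.0}   # low fidelity (l=0) is 10x cheaper to query

# Hypothetical per-fidelity posteriors, stand-ins for mu^(l), sigma^(l)
mu = {0: np.cos(6 * grid), 1: np.cos(6 * grid) + 0.1 * grid}
sigma = {0: 0.3 * np.ones_like(grid), 1: 0.5 * np.ones_like(grid)}
y_best = -0.8

# Dual decision: maximize a cost-weighted utility jointly over (x, l)
best = max(
    ((i, l) for i in range(len(grid)) for l in (0, 1)),
    key=lambda p: expected_improvement(mu[p[1]][p[0]], sigma[p[1]][p[0]], y_best) / costs[p[1]],
)
x_new, l_new = grid[best[0]], best[1]
```

With these stand-in posteriors the cheap source wins: its expected gain per unit cost dominates the ten-times-costlier accurate query, illustrating the cost/accuracy trade-off described above.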

An Active Learning Perspective
Bayesian frameworks and active learning schemes exhibit a strong synergy: in both cases the learner seeks to design an efficient sampling policy to accomplish the learning goal, and is guided by a surrogate model that informs the learner and is continuously updated during the learning procedure. The active learning literature is vast and includes a multitude of approaches [117,118,119,120,121,122,123,24]. According to the well-accepted classification proposed by Sugiyama and Nakajima [23], active learning strategies can be categorized into population-based and pool-based frameworks, depending on the nature of the sampling scheme defined by the learner. Population-based active learning targets the identification of the optimal density of training samples when the target distribution is known. Conversely, pool-based active learning defines an efficient sampling scheme to improve the accuracy of a surrogate model of the unknown target distribution over the domain of samples.
This paper explicitly formalizes and discusses Bayesian frameworks as active learning procedures realized through acquisition functions. In particular, pool-based active learning shows in essence a strong dualism with Bayesian frameworks. We emphasize this synergy through a discussion of the correspondence between learning criteria and infill criteria; the former drive the sampling procedure in pool-based active learning, while the latter guide the search in Bayesian schemes through the acquisition function. This symbiosis is evidenced both for the case of a single source of information adopted to query samples, and when multiple sources are at the learner's disposal to evaluate new inputs. Accordingly, we review and discuss popular sampling policies commonly adopted in pool-based active learning, and discern the learning criteria used to accomplish a specific learning goal (Section 5.1). Then, attention is dedicated to the identification of the infill criteria realized through popular acquisition functions in Bayesian optimization (Section 5.2). The objective is to explicitly formalize the synergy between Bayesian frameworks and active learning as adaptive sampling schemes guided by common principles. The same avenue is followed to formalize this dualism for the case of multiple sources of information available during the learning procedure: we identify the learning criteria at play, with the objective of clarifying the shared principles and the mutual relationship that characterize the two adaptive learning schemes when the decision of which sample to query also requires the selection of the appropriate source of information to be evaluated.

Learning Criteria
Pool-based active learning determines a tailored sampling policy to ensure the maximum computational efficiency of the adaptive sampling procedure, i.e., a limited and well-selected number of samples to query. This adaptive learning demands principled guidelines to decide whether or not to evaluate a certain sample based on a measure of its goodness.
Learning criteria establish a metric to quantify the gains of all possible learner decisions, and prescribe an optimal decision based on the information acquired from the surrogate model. The vast majority of the literature on pool-based active learning identifies three essential learning criteria: informativeness, representativeness, and diversity [124,125,126,24,127,55]:

1. Informativeness measures the amount of information encoded by a certain sample. The sampling policy is driven by the maximum likely contribution of queries that would significantly benefit the objective of the learning procedure.

2. Representativeness quantifies the similarity of a sample or group of samples with respect to a target sample representative of the target distribution. The sampling policy exploits the structure underlying the domain to direct queries toward locations where a sample can represent a large number of neighbouring samples.

3. Diversity estimates how well the queries are spread over the domain of samples. This is reflected in a sampling policy that selects samples scattered across the full domain, and prevents the concentration of queries in small local regions.
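The three criteria can be made concrete with simple surrogate scores on a candidate pool. The specific score definitions below (a synthetic predictive uncertainty for informativeness, average similarity to the pool for representativeness, distance to already-queried samples for diversity) are illustrative choices, one of many possible instantiations:

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.uniform(0, 1, size=(200, 2))   # unlabeled candidate pool
labeled = pool[:5]                        # samples queried so far

# Informativeness: stand-in predictive uncertainty (synthetic here;
# in practice the surrogate model's sigma(x) would be used)
informativeness = rng.uniform(0, 1, size=len(pool))

# Representativeness: average similarity to the rest of the pool
# (a sample in a dense region can represent many neighbours)
d_pool = np.linalg.norm(pool[:, None] - pool[None, :], axis=-1)
representativeness = np.exp(-d_pool).mean(axis=1)

# Diversity: distance to the closest already-queried sample
d_lab = np.linalg.norm(pool[:, None] - labeled[None, :], axis=-1)
diversity = d_lab.min(axis=1)

# Each criterion alone generally prescribes a different next query
picks = {name: int(np.argmax(s)) for name, s in
         [("informativeness", informativeness),
          ("representativeness", representativeness),
          ("diversity", diversity)]}
```

Running the three selections side by side mirrors the wheat-field example discussed next: each pure criterion concentrates the budget on a different region of the domain.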
Figure 3 illustrates a watering optimization problem that clarifies the peculiarities of each learning criterion. This simple toy problem requires identifying the areas of a wheat field where the crop is ripe and where it is still unripe for irrigation purposes. The learning goal is formalized as the identification of the area where the wheat is lowest, which indicates an unripe cultivation and maximum irrigation requirements. We assume that the learner can explore a maximum of five sites on the field during the procedure. A learner driven by the pure informativeness criterion (Figure 3(a)) would sample only the regions of the wheat field that are likely to provide the maximum amount of information to accomplish the given learning goal; accordingly, observations are placed where the height of the wheat is minimum and the demand for water is maximum. This maximizes the information about where it is strictly necessary to irrigate, but nothing is known about the regions where the wheat is higher and irrigation is not a priority. Conversely, a purely representative sampling (Figure 3(b)) would probe the field by agglomerating observations to ensure the representativeness of the samples. This allows partial knowledge even of areas where copious irrigation is not necessary, but increases the overall uncertainty given the small number of samples in each agglomeration. If the learner pursues only the diversity of queries (Figure 3(c)), samples would scatter over the field, minimizing the maximum distance between measurements. Although this distributes the queries across the entire domain, the uncertainty is high, as only one sample covers each area of the field. The remainder of this section is dedicated to the review and discussion of popular pool-based active learning schemes.
We aim to provide a broad spectrum of approaches that exemplify the implementation of the different learning criteria, both individually and in combination. This highlights the driving principles of learning procedures, and will help to clarify the synergy between active learning and Bayesian optimization addressed in the following sections. Figure 4 summarizes the relationship between the methodologies reviewed below and the three learning criteria.

Informativeness-Based
Learning procedures characterized by a pure informativeness criterion can be traced to uncertainty-based sampling policies. These approaches make the query decision based on the predictive uncertainty of the surrogate model, and seek to increase the density of samples in regions that exhibit the largest uncertainty with respect to a specific learning goal. Popular uncertainty-based active learning algorithms are uncertainty sampling and query-by-committee methods. Uncertainty sampling algorithms probe the domain to improve the overall accuracy of the surrogate model according to a measure of predictive uncertainty. Examples include the direct quantification of the uncertainty associated with samples [128] and its alternatives, such as margin-based [129], least-confident [130], and entropy-based [131] approaches. Other strategies define sampling policies that minimize the predicted variance of the surrogate model [132], maximize the expected decrease of the loss when augmenting the training set [54], or maximize the expected gradient length [133]. Further uncertainty-based strategies are query-by-committee sampling schemes [118,134], where the most informative sample to query is selected by maximizing the disagreement between the predictions of a committee of surrogate models trained on subsets of the data.
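A query-by-committee step can be sketched as follows. The committee of bootstrap polynomial fits and the variance-based disagreement measure are illustrative stand-ins for the model committees and disagreement metrics used in the cited works:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(4 * x)
X_train = rng.uniform(0, 1, 15); y_train = f(X_train)
pool = np.linspace(0, 1, 100)              # candidate query locations

# Committee: polynomial fits trained on bootstrap resamples of the data
committee = []
for _ in range(10):
    idx = rng.integers(0, len(X_train), len(X_train))
    committee.append(np.polyfit(X_train[idx], y_train[idx], deg=5))

# Disagreement: variance of committee predictions at each candidate
preds = np.array([np.polyval(c, pool) for c in committee])
disagreement = preds.var(axis=0)
x_new = pool[np.argmax(disagreement)]      # most contested location
```

The query lands where the committee members disagree most, which is exactly the informativeness-driven behaviour described above: regions where the current models are uncertain about the target.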

Representativeness/Diversity-Based
Other pool-based active learning algorithms rely exclusively on the representativeness and diversity learning criteria; usually these criteria are implemented simultaneously to drive the probing of the domain. This blend is justified by their mutually complementary relationship: pure representativeness might concentrate the sampling in congregated representative regions of the domain without a proper dispersion of queries, while pure diversity might lead to over-querying the domain and divert the learning procedure from the actual goals. Combining both criteria leverages, on one hand, the representativeness of samples to accomplish a certain learning goal, and on the other prevents the selection of redundant samples and high densities of queries confined to circumscribed regions of the domain. Representativeness/diversity-based algorithms include a multitude of approaches commonly classified into two main schemes: clustering methodologies and optimal experimental design. Clustering algorithms identify the most representative locations by exploiting the underlying structure of the domain: the utility of samples is obtained as a function of their distance from the cluster centers. Popular examples include hierarchical clustering, which builds a hierarchy of clusters based on the encoded information and selects samples closer to the cluster centers [135], and k-center clustering, which determines a subset of k clusters that together cover the sampling space with minimized radius, the best sample being the one that minimizes the maximum distance of any point to a center [136]. Optimal experimental design instead defines a sampling policy based on a transductive approach: the learning procedure conducts queries through a data reconstruction framework that measures a sample's representativeness by its capacity to reconstruct the training dataset. The selection of the most representative sample comes from an optimization process that maximizes the local acquisition of information about the parameters of the surrogate model [137,138,139].
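The k-center idea admits a compact greedy sketch (the classical farthest-point 2-approximation); the pool and the choice of the first center below are arbitrary:

```python
import numpy as np

def k_center_greedy(pool, k, start=0):
    """Greedy 2-approximation to the k-center problem: repeatedly add
    the pool point farthest from the current set of centers."""
    centers = [start]
    d = np.linalg.norm(pool - pool[start], axis=1)   # dist to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                      # farthest point so far
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(pool - pool[nxt], axis=1))
    return centers

rng = np.random.default_rng(2)
pool = rng.uniform(0, 1, size=(300, 2))
centers = k_center_greedy(pool, k=5)
```

The selected centers spread across the domain so that every pool point lies close to some center, which is the diversity behaviour the k-center criterion encodes.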

Hybrid
Recent avenues explore the combination of the informativeness and representativeness/diversity learning criteria to merge the goal-oriented querying of the former with the structure-aware, over-density-preventing sampling of the latter. Accordingly, combined algorithms integrate multiple learning criteria to improve the overall sampling performance. These approaches are commonly classified into three main classes [140,55]: serial-form, criteria-selection, and parallel-form approaches. Serial-form algorithms use a switching approach to take advantage of all three learning criteria: informativeness-based techniques select a subset of highly informative samples, and representativeness/diversity techniques then identify the centers of the clusters in this subset as the querying locations [125]. Criteria-selection algorithms rely on a selection parameter, informed by a measure of the learning improvement, that suggests the appropriate learning criterion to be used during the procedure [141]. Both serial-form and criteria-selection strategies combine the learning criteria sequentially, using each criterion consecutively during the learning procedure. Parallel-form methods combine multiple learning criteria simultaneously: the utility of each sample is judged by weighting informativeness and representativeness/diversity at the same time; valuable samples are then selected through a multi-objective optimization of the weights to maximize both the improvement toward the learning goal and the exploitation of potentially useful structures of the domain [142,143,144].

Acquisition Functions and Infill Criteria
The synergy between active learning and Bayesian optimization relies on the substantial analogy between the learning criteria driving the active learning procedure and the infill criteria that characterize the Bayesian learning scheme. Infill criteria provide a measure of the utility gained by evaluating a certain location of the domain.
In Bayesian optimization, the acquisition function is formalized according to a certain infill criterion: this quantifies the merit of each sample with respect to a specific learning goal. Accordingly, the sample that maximizes the querying utility is observed to advance the learning procedure toward this goal.
In particular, Bayesian learning schemes rely on two main infill criteria: global exploration and local exploitation toward the optimum. The exploration criterion concentrates samples in regions of the domain where the uncertainty predicted by the surrogate is higher; this enhances global awareness of the distribution of the objective function over the domain, but resources might not be directed toward the goal of the procedure, e.g., the minimum of the objective function. The exploitation criterion concentrates samples in regions where the surrogate model indicates that the objective is likely to be located, e.g., the minimum of the Gaussian process mean function; exploitation realizes a goal-oriented sampling procedure that privileges the search for the objective without necessarily acquiring accurate knowledge of the overall distribution of interest. The dilemma between exploration and exploitation represents a key challenge to be carefully addressed. On one hand, a learning procedure based on pure exploration might use a large number of samples to improve the overall accuracy of the surrogate model without searching toward the learning goal. On the other hand, an exploitation-based learner might anchor a high density of samples to a suboptimal local solution as a consequence of information from an unreliable surrogate model. These extreme behaviours demonstrate the need for a compromise between the exploration and exploitation criteria.
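The trade-off can be made tangible with a confidence-bound score µ(x) − κσ(x), where κ is an explicit exploration knob; this acquisition is not discussed in the text above but is a standard way to expose the dilemma, and the posterior mean and uncertainty below are synthetic stand-ins:

```python
import numpy as np

grid = np.linspace(0, 1, 201)
# Stand-in GP posterior: mean favours x ~ 0.3, uncertainty peaks at x ~ 0.8
mu = (grid - 0.3) ** 2
sigma = 0.05 + 0.5 * np.exp(-((grid - 0.8) ** 2) / 0.01)

# Lower confidence bound mu - kappa*sigma: kappa = 0 is pure exploitation
# (track the believed minimum of the mean), large kappa is pure
# exploration (chase the largest predictive uncertainty)
pick = lambda kappa: grid[np.argmin(mu - kappa * sigma)]
exploit, explore = pick(0.0), pick(5.0)
```

With κ = 0 the learner queries the believed optimum; with a large κ it is drawn to the uncertainty peak, reproducing the two extreme behaviours described above.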
In principle, the infill criteria of Bayesian optimization are strongly related to the learning criteria commonly adopted in active learning. In particular:

• The concept of exploration is close to the representativeness/diversity criteria: both learning schemes leverage underlying structures of the target distribution predicted by an accurate surrogate model to improve awareness of the objective over the domain.

• The concept of exploitation is close to the informativeness criterion: the learner directs the selection of samples toward the believed optimum without considering the global behaviour of the objective over the domain.
Figure 5 summarizes the mapping between infill criteria and learning criteria. The following sections discuss the formalization of the active learning (infill) criteria for three popular Bayesian acquisition functions, namely the expected improvement (Section 5.2.1), the probability of improvement (Section 5.2.2), and the max-value entropy search (Section 5.2.3).
Figure 5: Mapping of the learning criteria in active learning and infill criteria in Bayesian optimization.

Expected Improvement
The Expected Improvement (EI) acquisition function quantifies the expected value of the improvement in the solution of the optimization problem achieved by evaluating a certain location of the domain [61,145]. EI at a generic location x relies on the predicted improvement over the best solution observed so far, I(x) = max(f(x*) − f(x), 0). Considering the Gaussian process as the surrogate model for Bayesian optimization, EI can be expressed in closed form as:

U_EI(x) = (f(x*) − µ(x)) Φ(z) + σ(x) φ(z), with z = (f(x*) − µ(x)) / σ(x),

where I(x) is the predicted improvement, x* is the current location of the best value of the objective sampled so far, µ is the mean function and σ the standard deviation of the GP, and Φ(·) and φ(·) are the cumulative distribution function and the probability density function of a standard normal distribution, respectively. The computation of U_EI(x) requires limited computational resources, and the first-order derivatives are easy to calculate:

∂U_EI/∂µ(x) = −Φ(z), ∂U_EI/∂σ(x) = φ(z).

These derivatives show that U_EI(x) increases monotonically with the uncertainty of the GP surrogate and decreases monotonically as the predicted mean rises. This highlights a form of trade-off between exploration and exploitation: the formulation of EI balances sampling in locations of the domain likely to yield a significant improvement over the current best solution against observations of regions where the improvement might be modest but the prediction is highly uncertain. In principle, EI is driven by a combination of the informativeness and representativeness/diversity criteria adopted in active learning. On one hand, the learner directs the computational resources toward maximizing the learning contribution and achieving the goal (informativeness); on the other hand, the learner pursues awareness of the objective distribution over the domain to improve the quality of the prediction and better drive the search (representativeness/diversity). The predictive framework of the surrogate model regulates these learning thrusts, privileging one over the other on the basis of the information about the objective function acquired over the iterations.
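The closed-form expression and the sign of its derivatives can be checked numerically; this sketch assumes the standard EI for minimization under a Gaussian posterior:

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, f_best):
    # Closed-form EI for minimization under a Gaussian posterior
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

f_best = 0.0
mus = np.linspace(-1, 1, 50)
sigmas = np.linspace(0.1, 1.0, 50)

# EI decreases as the predicted mean rises (less improvement expected)...
ei_vs_mu = ei(mus, 0.5, f_best)
assert np.all(np.diff(ei_vs_mu) < 0)

# ...and increases with predictive uncertainty (exploration reward)
ei_vs_sigma = ei(0.5, sigmas, f_best)
assert np.all(np.diff(ei_vs_sigma) > 0)
```

The two assertions reproduce numerically the monotonicity argument made above from the derivatives.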

Probability of Improvement
The Probability of Improvement (PI) acquisition function targets the locations with the highest probability of achieving the goal, based on the information from the current surrogate model [87,146]. PI measures the probability that the prediction of the surrogate model at a generic location is lower than the best observation of the objective function so far. Under the Gaussian process surrogate model, the PI acquisition function is computed in closed form as:

U_PI(x) = Φ((f(x*) − µ(x)) / σ(x)),

where Φ(·) is the cumulative distribution function of a standard normal distribution and x* is the current location of the best value of the objective. Similarly to EI, U_PI(x) is inexpensive to compute, and the evaluation of the first-order derivatives requires simple calculations:

∂U_PI/∂µ(x) = −φ(z)/σ(x), ∂U_PI/∂σ(x) = −z φ(z)/σ(x), with z = (f(x*) − µ(x)) / σ(x),

where φ is the standard Gaussian probability density function. The first derivative shows that, at fixed uncertainty of the surrogate, regions of the input space characterized by lower values of the GP posterior mean are preferred for sampling. Moreover, the second derivative shows that if µ(x) < f(x*) the regions characterized by lower uncertainty are preferred, while PI increases with uncertainty otherwise. Overall, the PI acquisition function can be considered an exploitative scheme that identifies the most informative location as the one potentially producing the largest reduction of the minimum value of the objective function observed so far. This is achieved by sampling regions where the surrogate model is reliable and characterized by lower levels of uncertainty. In principle, this sampling scheme makes PI accord with the informativeness criterion: the search toward the optimum is directed solely to regions of the domain that exhibit the highest probability of achieving the goal according to the emulator prediction.
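The closed form and the behaviour discussed above can be verified numerically; a sketch assuming the standard PI for minimization:

```python
import numpy as np
from scipy.stats import norm

def pi(mu, sigma, f_best):
    # Closed-form PI for minimization: P(f(x) < f(x*))
    return norm.cdf((f_best - mu) / sigma)

f_best = 0.0
# When the mean already beats the incumbent (mu < f_best), PI prefers
# low uncertainty: shrinking sigma pushes the probability toward 1
assert pi(-0.2, 0.1, f_best) > pi(-0.2, 0.5, f_best)
# When it does not (mu > f_best), only uncertainty can yield improvement,
# so PI grows with sigma
assert pi(0.2, 0.5, f_best) > pi(0.2, 0.1, f_best)
```

The two assertions mirror the sign analysis of the derivative with respect to σ(x) above.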

Entropy Search and Max-value Entropy Search
The Entropy Search (ES) acquisition function measures the differential entropy of the believed location of the global minimum of the objective function, and targets its reduction by selecting the sample that maximizes the decrease of differential entropy [88]. The ES acquisition function is formulated as follows:

U_ES(x) = H(p(x* | D_N)) − E_{f(x)}[H(p(x* | D_N ∪ {(x, f(x))}))],

where H(p(x* | D_N)) is the entropy of the posterior distribution of the location of the minimum of the objective function x* at the current iteration, and E_{f(x)}[·] is the expectation over f(x) of the entropy of the posterior distribution of x* at the next iteration. Typically, the exact calculation of the second term is not possible and requires complex and expensive computational techniques to approximate U_ES(x).
The Max-Value Entropy Search (MES) acquisition function [89] is derived from the ES acquisition function and reduces the computational effort required to estimate it by measuring the differential entropy of the minimum value of the objective function:

U_MES(x) = H(p(f* | D_N)) − E_{f(x)}[H(p(f* | D_N ∪ {(x, f(x))}))],

where the first and second terms are now computed for the minimum value of the objective function f*. This simplifies the computations and allows the second term to be approximated through a Monte Carlo strategy [89]. An analysis of the derivatives is not possible for the MES acquisition function, since the formulation of the second term is intractable.
As reported by Wang et al. [89] in their experimental analysis, MES targets a balance between the exploration of locations characterized by higher uncertainty of the surrogate model, and the exploitation toward the believed optimum of the objective function. However, Nguyen et al. [147] demonstrate that MES might suffer from an imbalanced exploration/exploitation trade-off due to noisy observations of the objective function and to a discrepancy in the computation of the mutual information in its second term. As a result, MES might over-exploit the domain in the presence of measurement noise, and over-explore when the discrepancy in the evaluation induces a pronounced sensitivity to the uncertainty of the surrogate model. Overall, the adaptive sampling scheme determined by the MES acquisition function follows both the informativeness and the representativeness/diversity learning criteria: the most promising sample is ideally selected by balancing the search toward the believed minimum predicted by the emulator against the decrease of uncertainty about the distribution of the objective function.

Learning Criteria with Multiple Oracles
Most active learning paradigms rely on a unique and supposedly omniscient source of information about the target distribution. This oracle is iteratively queried by the learner to evaluate the value of the distribution at certain locations, and its prediction is assumed to be exact. In many other scenarios, the learner can elicit information from multiple imperfect oracles with different levels of reliability, accuracy, and cost. Accordingly, the active learning community has introduced a multitude of annotator-aware algorithms capable of efficiently learning from multiple sources of information. This requires an additional decision during the learning procedure: at each iteration, the learner has to select the most useful sample and the associated information source to query. In this context, the original learning criteria of informativeness and representativeness/diversity (Section 5.1) evolve and extend to quantify the utility of querying the domain with a certain level of accuracy and associated cost:

1. Informativeness seeks to maximize the amount of information used to decide the sample and information source to query. Thus, the learner might privilege evaluations from accurate yet costly oracles to capitalize on high-quality information and potentially reach the objective.

2. Representativeness attempts to identify underlying structures of the domain to better inform the search procedure. In this case, the decision-making process might prefer to interrogate less expensive sources of information to contain the required effort, especially if cheap predictions of the target distribution exhibit good correlation with the estimates of the accurate oracle.

3. Diversity scatters the sampling effort over the domain to pursue a proper distribution of evaluations and augment the awareness of the target distribution. This might be favored by greater use of less accurate predictions of the target distribution, which are more likely to address the cost/effectiveness trade-off well during diversity sampling.
The remainder of this section provides an overview of multiple-oracle active learning methodologies to present and further clarify popular extensions of the learning criteria to the multi-oracle setting.
Typically, active learning paradigms are extended to the multiple-oracle setting through relabeling, repeated-labeling, probabilistic and transfer-knowledge, and cost-aware algorithms. Relabeling approaches query samples multiple times using the library of available sources of information, and the final query is obtained via majority voting [148]. Popular methodologies following this scheme identify a subset of oracles according to the proximity of their upper confidence bound to the maximum upper confidence bound, and apply the majority voting technique considering only the queries of this informative subset [149]. Other multi-oracle active learning methods use a repeated-labeling procedure: the learner integrates the repeated - often noisy - predictions of the oracles to improve the quality of the evaluation process and the accuracy of the surrogate model learned from data [150]. Both relabeling and repeated-labeling approaches share a common drawback: the same unknown sample is evaluated multiple times with different oracles, which results in a sub-optimal usage of the available sources of information. Probabilistic and transfer learning methodologies attempt to overcome this limitation. Probabilistic frameworks rely on surrogate models specifically conceived for the multi-source scenario that provide a predictive framework to estimate the accuracy of each oracle over the domain [151,152]. Transfer-knowledge approaches enhance the simultaneous selection of the most informative location to sample and the associated most profitable source to query; this is achieved by transferring knowledge from samples not evaluated in auxiliary domains to support the estimate of the oracle reliability [153]. Recent advancements in multiple-oracle active learning are cost-effective algorithms, where the cost of an oracle is evaluated considering both the overall reliability of the prediction and the quality of samples in specific locations [154,155,156]. The cost-effectiveness property enhances the use of computational resources for the evaluation of samples, and targets the search toward the learning objectives while guaranteeing an optimal trade-off between evaluation accuracy and computational cost.
From the examined literature, the three learning criteria appear frequently coupled during the learning procedure with multiple sources to query. This appears as a natural evolution of what has already been observed in the literature for active learning with a single information source: the overall learning procedure usually benefits from a balanced learning scheme driven by informativeness and representativeness/diversity. In particular, informativeness directs the search toward the learning goal, while representativeness/diversity augments the learner's awareness about the target distribution over the domain; the combination of these learning criteria - in different measures - improves the performance of active learning algorithms by efficiently using the computational resources and the information from multiple oracles.
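As a minimal illustration of the cost-aware selection discussed above, the following sketch picks the (sample, oracle) pair with the best utility-per-cost ratio; the interface and the toy utility are hypothetical and not taken from any of the cited algorithms:

```python
import numpy as np

def select_query(candidates, oracles, utility_fn):
    """Pick the (sample, oracle) pair maximizing utility per unit cost.
    `oracles` is a list of (accuracy, cost) pairs; `utility_fn(x, accuracy)`
    is any informativeness score (hypothetical interface, for illustration)."""
    best, best_score = None, -np.inf
    for x in candidates:
        for k, (accuracy, cost) in enumerate(oracles):
            score = utility_fn(x, accuracy) / cost   # cost-normalized utility
            if score > best_score:
                best, best_score = (x, k), score
    return best

# toy utility: an uncertainty proxy scaled by the oracle accuracy
utility = lambda x, acc: acc * np.exp(-x**2)
```

With this scoring, a cheap oracle of moderate accuracy can win over an expensive accurate one whenever its cost advantage outweighs the loss of information, which is exactly the trade-off described in the text.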

Multifidelity Acquisition Functions and Infill Criteria
This section further investigates and highlights the synergy between active learning and Bayesian optimization for the specific case of multiple sources of information used to accomplish the learning goal. Similarly to the single-source setting, this symbiotic relationship is revealed through common principles characterizing the infill criteria in multifidelity Bayesian optimization and the learning criteria in active learning with multiple oracles. The multifidelity scenario imposes an additional decision: the learner has to identify the appropriate information source to query according to an accuracy/cost trade-off. This is reflected in the formalization of infill criteria capable of defining an efficient and balanced sampling policy, targeting both the wise selection of samples and the levels of fidelity that ensure the maximum benefits at the minimum cost. Accordingly, the multifidelity acquisition function formalizes an adaptive sampling scheme based on one or multiple infill criteria to quantify the utility of querying a location of the domain with a specific level of fidelity.
Based on these considerations, the exploration and exploitation infill strategies are extended according to the peculiarities of the multifidelity setting:
• Exploration is close to the representativeness/diversity criterion and defines a sampling policy that incentivizes the overall reduction of the surrogate uncertainty. Accordingly, the selection of the appropriate level of fidelity is driven by a trade-off between accuracy and evaluation cost. This might be accomplished through less expensive low-fidelity information to contain the demand for computational resources during exploration.
• Exploitation is close to the informativeness criterion and concentrates the sampling process in the regions of the domain where optimal solutions are likely to be located. For this purpose, the learner might emphasize the use of accurate evaluations of the target function to refine the solution of the learning procedure toward the specific goal.
Similarly to the acquisition functions in Bayesian optimization (Section 5.2), the symmetry between the informativeness and exploitation criteria, and between the representativeness/diversity and exploration criteria, is preserved in the multifidelity setting. The following sections are dedicated to the review and discussion of popular multifidelity acquisition functions, namely the multifidelity expected improvement (Section 5.4.1), the multifidelity probability of improvement (Section 5.4.2) and the multifidelity max-value entropy search (Section 5.4.3). The goal is to highlight the equivalent principles driving both learning schemes, and to further clarify the elements that encode the symbiotic relationship between multifidelity Bayesian optimization and multi-oracle active learning.

Multifidelity Expected Improvement
The Multifidelity Expected Improvement (MFEI) extends the expected improvement acquisition function to define a learning scheme in the multifidelity setting as follows [114]:

U_MFEI(x, l) = U_EI(x, L) α_1(x, l) α_2(x, l) α_3(l)   (20)

where U_EI(x, L) is the expected improvement illustrated in Equation (12) and evaluated with the highest level of fidelity L, and the utility functions α_1, α_2 and α_3 are defined as follows:

α_1(x, l) = corr[ f^(l)(x), f^(L)(x) ]   (21)
α_2(x, l) = 1 − σ_ε / √(σ²(x, l) + σ_ε²)   (22)
α_3(l) = λ^(L) / λ^(l)   (23)

The first element α_1 is the posterior correlation coefficient between the level of fidelity l and the high-fidelity level L, and accounts for the reduction of the expected improvement when a sample is evaluated with a low-fidelity model. This term reflects a measure of the informativeness of the l-th source of information in the location x, and balances the amount of improvement achievable by evaluating the high-fidelity level L with the reliability of the prediction associated with the level of fidelity l. Accordingly, α_1 modifies the learning scheme by adding a penalty that reduces U_MFEI when 1 ≤ l < L: this includes awareness about the increase of uncertainty associated with a low-fidelity prediction. The second element α_2 is conceived to adjust the expected improvement when the output evaluated with the l-th level of fidelity contains random errors. This is equivalent to considering the reduction of the uncertainty of the Gaussian process prediction after a new evaluation of the objective function is added to the dataset D.
This function improves the robustness of U_MFEI when the representation of f^(l) at different levels of fidelity is affected by measurement noise. The third element α_3 is formulated as the ratio between the computational cost of the high-fidelity level L and that of the l-th level of fidelity. This permits balancing the informative contribution of high- and lower-fidelity observations with the related computational resources required for the evaluation. The effect of this term is to encourage the use of low-fidelity representations if almost the same expected improvement can be achieved as with a high-fidelity evaluation. This wisely directs the use of computational resources to achieve the representativeness/diversity of samples, and prevents a massive use of expensive accurate queries during the exploration phases.
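A compact sketch of the MFEI computation described above may look as follows; the exact forms of the three utility terms are assumptions consistent with the textual description (and with Huang et al.'s formulation), written here for minimization:

```python
import numpy as np
from scipy.stats import norm

def mfei(mu, sigma, f_best, corr_l, sigma_noise, cost_hi, cost_l):
    """Multifidelity expected improvement sketch: the single-fidelity EI at
    the highest fidelity, scaled by the three utility terms described in the
    text (forms assumed from Huang et al., 2006). `corr_l` is the posterior
    correlation between fidelity l and the highest fidelity L."""
    sigma = np.maximum(sigma, 1e-12)
    z = (f_best - mu) / sigma                              # minimization
    ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    alpha1 = corr_l                                        # informativeness of level l
    alpha2 = 1.0 - sigma_noise / np.sqrt(sigma**2 + sigma_noise**2)  # noise penalty
    alpha3 = cost_hi / cost_l                              # cost ratio lambda^(L)/lambda^(l)
    return ei * alpha1 * alpha2 * alpha3
```

Note how a cheap level (small `cost_l`) inflates the utility through `alpha3`, while a poorly correlated or noisy level is penalized through `alpha1` and `alpha2`.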

Multifidelity Probability of Improvement
The Multifidelity Probability of Improvement (MFPI) acquisition function provides an extended formulation of the probability of improvement suitable for the multifidelity scenario as follows [113]:

U_MFPI(x, l) = U_PI(x, L) η_1(x, l) η_2(l) η_3(x, l)   (24)

where the PI acquisition function (Equation (15)) is computed considering the highest fidelity level L available, and the utility functions η_1, η_2 and η_3 are defined as follows:

η_1(x, l) = corr[ f^(l)(x), f^(L)(x) ]   (25)
η_2(l) = λ^(L) / λ^(l)   (26)
η_3(x, l) = ∏_{i=1}^{n_l} [1 − R(x, x_i^(l))]   (27)
The first term η_1 shares the same formalization as the utility function α_1 in Equation (21), and accounts for the increase of uncertainty associated with low-fidelity representations 1 ≤ l < L compared with the high-fidelity output L. This reduces the probability of improvement if a low-fidelity representation is queried in a specific location x of the input space. As already highlighted in Section 5.4.1, η_1 incentivizes a form of informativeness learning where the information source is selected according to its capability to accurately represent the objective function. Similarly, the second utility function η_2 is also included in the multifidelity expected improvement in Equation (23) as the α_3 term. This element balances the computational costs and the informative contributions achieved through the l-th level of fidelity. This prevents the rise of computational demand produced by the over-exploitative nature of the probability of improvement (Section 5.2.2): η_2 encourages the use of fast low-fidelity data if the discrepancy between the l-th level of fidelity and the high-fidelity level L - quantified by η_1 - is not significant. The third element η_3 is the sample density function, computed as the product of the complements to unity of the spatial correlation function R(·) [157] evaluated for the n_l samples of the l-th level of fidelity. This term reduces the probability of improvement in locations with a high sampling density - over-exploitation of the domain - to prevent the clustering of data. Accordingly, η_3 promotes a form of representativeness/diversity learning scheme and encourages the exploration to augment the awareness about the domain structure.
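The MFPI scheme can be sketched analogously; the squared-exponential choice for the spatial correlation R and the interface are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def mfpi(mu, sigma, f_best, corr_l, cost_hi, cost_l, x, X_l, length=0.1):
    """Multifidelity probability of improvement sketch; eta_1 (correlation),
    eta_2 (cost ratio) and eta_3 (sample density) follow the textual
    description, with a squared-exponential spatial correlation R as an
    assumed choice. `X_l` collects the samples already queried at level l."""
    sigma = max(sigma, 1e-12)
    pi = norm.cdf((f_best - mu) / sigma)          # minimization PI at level L
    eta1 = corr_l
    eta2 = cost_hi / cost_l
    # eta_3: product of (1 - R(x, x_i)) over the n_l samples at fidelity l
    R = np.exp(-np.sum((np.atleast_2d(X_l) - x) ** 2, axis=1) / length**2)
    eta3 = np.prod(1.0 - R)
    return pi * eta1 * eta2 * eta3
```

The density term `eta3` vanishes at already-sampled locations, which is how the clustering of data mentioned above is prevented.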

Multifidelity Entropy Search and Multifidelity Max-Value Entropy Search
The Multifidelity Entropy Search (MFES) acquisition function is formulated by extending the entropy search acquisition function to query multiple sources of information [115]:

U_MFES(x, l) = H[ p(x* | D) ] − E_{f^(l)(x)}[ H[ p(x* | D ∪ {(x, f^(l)(x))}) ] ]   (28)

where the expectation term E_{f^(l)(x)}[·] considers multiple levels of fidelity l = 1, ..., L. Similarly to the entropy search acquisition function, the computation of the expectation in Equation (28) is not possible in closed form and requires an intensive procedure to provide a reliable approximation.
The Multifidelity Max-Value Entropy Search (MFMES) acquisition function can be formulated by extending the max-value entropy search to a multifidelity setting as follows [116]:

U_MFMES(x, l) = (1 / λ^(l)) ( H[ f^(l)(x) | D ] − E_{f*^(L)}[ H[ f^(l)(x) | f*^(L), D ] ] )   (29)

where the differential entropy is measured for the minimum value of the objective function f*^(L) considering the high-fidelity representation L. In this case, the approximation of the expectation term in Equation (29) relies on a Monte Carlo strategy that contains the computational cost compared with the procedure used for the MFES acquisition function [116].
In the multifidelity scenario, the MFMES acquisition function measures the information gain obtained by evaluating the objective function f^(l)(x) in a certain location x with the associated level of fidelity l, with respect to the global minimum of the objective function. This can be interpreted as an informativeness-driven learning based on the reduction of the uncertainty associated with the minimum value of the objective f*^(L) through the observation f^(l)(x), where this uncertainty is measured as the differential entropy associated with the l-th level of fidelity. At the same time, the information gain is also sensitive to the accuracy of the surrogate predictive framework, and realizes a form of representativeness/diversity balance to improve the awareness about the distribution of the objective function over the domain. The sensitivity to the computational cost λ^(l) of the l-th level of fidelity is introduced in Equation (29) to balance the quality of the source - quantified by the information gain - and the demand for computational resources.
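A simplified, cost-normalized sketch in the spirit of MFMES is shown below; the exact Monte Carlo conditioning of f^(l) on f*^(L) used in [116] is replaced here by the single-level closed-form entropy reduction, which is an assumption made for brevity:

```python
import numpy as np
from scipy.stats import norm

def mfmes_sketch(mu_l, sigma_l, f_star_samples, cost_l):
    """Cost-normalized information gain in the spirit of MFMES (Takeno et
    al., 2020). The exact conditioning of f^(l)(x) on f*^(L) is simplified
    to the truncated-Gaussian entropy reduction of single-fidelity MES,
    then divided by the query cost lambda^(l)."""
    sigma_l = np.maximum(sigma_l, 1e-12)
    gamma = (mu_l[:, None] - np.asarray(f_star_samples)[None, :]) / sigma_l[:, None]
    cdf = np.clip(norm.cdf(gamma), 1e-12, 1.0)
    gain = (gamma * norm.pdf(gamma) / (2.0 * cdf) - np.log(cdf)).mean(axis=1)
    return gain / cost_l   # utility per unit computational cost
```

Dividing by the cost is what lets cheap low-fidelity queries win during exploration even when their raw information gain is smaller.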

Experiments
This section investigates and compares the performance of the acquisition functions for both single-fidelity and multifidelity Bayesian optimization considering a set of benchmark problems conceived to stress those algorithms.
The objective is to highlight advantages and opportunities offered by different learning principles over challenging mathematical properties of the objective function, which are frequently encountered in real-world engineering and scientific problems [16]. In particular, this comparative study considers the expected improvement (Section 5.2.1), the probability of improvement (Section 5.2.2) and the max-value entropy search (Section 5.2.3) acquisition functions, together with their multifidelity counterparts (Sections 5.4.1, 5.4.2 and 5.4.3). We impose the same initialization conditions for both the single-fidelity and the multifidelity algorithms. This initial setting includes: (i) the initial dataset of N_0^(l) samples for each level of fidelity l used to compute the prior surrogate model of the objective function, (ii) the computational cost λ^(l) assigned to each level of fidelity, and (iii) the maximum computational budget B_max allocated for each benchmark problem, defined linearly with the dimensionality D of the problem as B_max = 100D. The initial dataset of N_0^(l) samples is obtained through Latin hypercube sampling for all the numerical experiments [158] to ensure full coverage of the range of the optimization variables. The computational budget B = Σ_i λ^(l)_i is quantified as the cumulative computational cost accrued during the optimization at each iteration i. All the methods are based on the Gaussian process surrogate model and its extension to the multifidelity setting. We implement squared exponential kernels for all the GP covariances, and use the maximum likelihood estimation approach to optimize the hyperparameters of the kernel and the mean function of the GP [159].
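The initialization protocol above can be sketched as follows; the function names and data layout are illustrative, while the Latin hypercube design, the per-level costs and the B_max = 100D rule come from the text:

```python
import numpy as np
from scipy.stats import qmc

def initialize_experiment(dim, n0_per_level, costs, seed=0):
    """Initial setting used in the experiments: a Latin hypercube design of
    N_0^(l) samples per fidelity level, the per-level costs lambda^(l), and
    a maximum budget growing linearly with dimensionality (B_max = 100 * D)."""
    sampler = qmc.LatinHypercube(d=dim, seed=seed)
    datasets = {l: sampler.random(n=n0) for l, n0 in enumerate(n0_per_level, start=1)}
    budget_max = 100 * dim
    return datasets, costs, budget_max

def budget_spent(queried_levels, costs):
    """Cumulative cost B = sum_i lambda^(l_i) over the fidelities queried so far."""
    return sum(costs[l] for l in queried_levels)
```

The search loop would then keep calling the acquisition function and accumulating `budget_spent` until it exceeds `budget_max`.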

Benchmark Problems
The following set of benchmark problems is specifically conceived to investigate the capabilities of different learning criteria over challenging mathematical properties of the objective function [16]. In particular, the experimental settings include a variety of attributes that can be traced in real-world optimization problems, namely local and global behaviours, non-linearities and discontinuities, multimodality and noise. The set of problems consists of several objective functions: the Forrester function in its continuous and discontinuous versions, the Rosenbrock function with increasing domain dimensionality, the Rastrigin function shifted and rotated, the Agglomeration of Locally Optimized Surrogate (ALOS) function, a coupled spring-mass optimization problem and the noisy Paciorek function.

Forrester Function
The Forrester function is a popular test case to investigate the performance of different learning strategies over a non-linear one-dimensional distribution characterized by local behaviours. This benchmark problem guarantees a high interpretability of the results thanks to the one-dimensional nature of the objective function. The search domain is bounded as χ = [0, 1] and four levels of fidelity are available during the optimization:

f^(4)(x) = (6x − 2)² sin(12x − 4)   (30)
f^(3)(x) = (5.5x − 2.5)² sin(12x − 4)   (31)
f^(2)(x) = 0.75 f^(4)(x) + 5(x − 0.5) − 2   (32)
f^(1)(x) = 0.5 f^(4)(x) + 10(x − 0.5) − 5   (33)

where f^(4) is the high-fidelity function and the levels of fidelity l = 1, 2, 3, 4 increase with the accuracy of the representations. Figure 6(a) reports the four levels of fidelity for the Forrester function over the search domain.
The analytical minimum of the Forrester function is f*^(4) = −6.0207, located at the domain point x* = 0.7572.
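The Forrester levels can be checked numerically; f^(3) is quoted above from the text, while the high-fidelity form f^(4)(x) = (6x − 2)² sin(12x − 4) is the standard Forrester function, which reproduces the reported minimum:

```python
import numpy as np

def forrester_hf(x):
    """Standard high-fidelity Forrester function f^(4)."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def forrester_l3(x):
    """Third fidelity level, as given in the text."""
    return (5.5 * x - 2.5) ** 2 * np.sin(12.0 * x - 4.0)

# a dense grid search recovers the analytical minimum of f^(4)
xs = np.linspace(0.0, 1.0, 100001)
x_star = xs[np.argmin(forrester_hf(xs))]
```

A quick evaluation confirms the minimum near x* ≈ 0.7572 with f* ≈ −6.0207, matching the values stated above.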

Jump Forrester Function
The jump Forrester function introduces a discontinuity in the formulation of the Forrester function to investigate the capabilities of the learning schemes to refine the surrogate model and capture instantaneous variations of the objective function over the domain. This scenario often occurs in real problems where the phenomenon of interest - e.g. a physical quantity of interest in engineering - evolves over the domain and determines large variations of the objective function values. Figure 6(b) reports the two levels of fidelity that are available during the search procedure:

f^(1)(x) = { 0.5 f^(2)(x) + 10(x − 0.5) − 5,   0 ≤ x ≤ 0.5
           { 0.5 f^(2)(x) + 10(x − 0.5) − 2,   0.5 < x ≤ 1   (35)

where f^(2) is the high-fidelity information source. The optimum is located at x* = 0.75724876, corresponding to an objective value of f*^(2) = −0.9863.

Rosenbrock Function
The Rosenbrock function permits investigating the learning criteria over a non-convex objective function that allows for parametric scalability over the domain χ = [−2, 2]^D, where D is the dimensionality of the input space. A library of three levels of fidelity is available (Figure 7).

ALOS Functions
The Agglomeration of Locally Optimized Surrogate (ALOS) is a heterogeneous and non-polynomial function defined on unit hypercubes up to three dimensions, useful to assess the accuracy of surrogate models in the presence of localized behaviours. In particular, the ALOS function reproduces a real-world scenario where the objective function is characterized by oscillatory phenomena at different frequencies distributed along the domain. We consider two levels of fidelity and increasing dimensionality of the input space D = 1, 2, 3. For D = 1 the ALOS function is formalized as follows:

f^(2)(x) = sin[30(x − 0.9)^4] cos[2(x − 0.9)] + (x − 0.9)/2
f^(1)(x) = (f^(2)(x) − 1.0 + x)/(1.0 + 0.25x)   (39)

and an analogous formulation is used for D = 2, 3.
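The two ALOS fidelity levels for D = 1 translate directly into code:

```python
import numpy as np

def alos_hf(x):
    """High-fidelity one-dimensional ALOS function f^(2)."""
    return np.sin(30.0 * (x - 0.9) ** 4) * np.cos(2.0 * (x - 0.9)) + (x - 0.9) / 2.0

def alos_lf(x):
    """Low-fidelity ALOS function f^(1), derived algebraically from f^(2)."""
    return (alos_hf(x) - 1.0 + x) / (1.0 + 0.25 * x)
```

Note that the low fidelity is a smooth algebraic distortion of the high fidelity, so the two levels remain well correlated while differing in local detail.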

Shifted-Rotated Rastrigin Function
The Rastrigin function is commonly used as a test function to represent real-world applications where the objective function might present a highly multimodal behaviour. We adopt a benchmark problem based on the original formulation of the Rastrigin function, shifted and rotated as follows (Figure 9): the shifted-rotated input is z = R(θ)(x − x*), where

R(θ) = [ cos θ   −sin θ
         sin θ    cos θ ]

is the rotation matrix, with the rotation angle fixed at θ = 0.2.
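The shift-and-rotate transformation can be sketched as:

```python
import numpy as np

def shift_rotate(x, x_star, theta=0.2):
    """Shift by the optimum location and rotate by theta, z = R(theta)(x - x*),
    as used in the two-dimensional shifted-rotated Rastrigin benchmark."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])     # 2x2 rotation matrix
    return R @ (np.asarray(x) - np.asarray(x_star))
```

Because R(θ) is orthogonal, the transformation preserves distances: it relocates and reorients the multimodal landscape without deforming it.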

Spring-Mass System
This benchmark problem consists of a coupled spring-mass system composed of two masses connected by two springs.
The challenges associated with this simple physical optimization problem are related to the intrinsic multimodality induced by the elastic behaviour of the system dynamics. We consider the masses m_1 and m_2 concentrated at their centers of gravity, and the elastic behaviour of the two springs modeled through Hooke's law and characterized by the Hooke constants k_1 and k_2, respectively. Considering a friction-less dynamics, it is possible to define the equations of motion (Equation (44)), which can be solved using the fourth-order accurate Runge-Kutta time-marching method, varying the time step dt to define two fidelity levels. Specifically, we define the high-fidelity model f^(2) (dt = 0.01) and the low-fidelity model f^(1) (dt = 0.6). The benchmark problem requires the identification of the combination of masses and Hooke constants x = [m_1, m_2, k_1, k_2] that minimizes h_1(t = 6) considering the domain χ = [1, 4]^4 and the initial conditions of motion h_1 = h_2 = 0 and ḣ_1 = ḣ_2 = 0.
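A sketch of the two-fidelity simulation is given below; since the exact equations of motion (Equation (44)) are not reproduced here, the standard friction-less two-mass, two-spring equations are assumed, and the nonzero initial displacement in the demo is illustrative only (with the all-zero initial conditions stated above, an unforced motion would be trivial):

```python
import numpy as np

def simulate(m1, m2, k1, k2, dt, t_end=6.0, y0=(1.0, 0.0, 0.0, 0.0)):
    """RK4 integration of a friction-less coupled spring-mass system
    (textbook equations assumed). The time step dt acts as the fidelity
    knob: dt = 0.01 for the high-fidelity model, dt = 0.6 for the
    low-fidelity one. State y = [h1, h2, h1_dot, h2_dot]."""
    def rhs(y):
        h1, h2, v1, v2 = y
        a1 = (-k1 * h1 + k2 * (h2 - h1)) / m1   # mass 1: both springs act on it
        a2 = (-k2 * (h2 - h1)) / m2             # mass 2: only the coupling spring
        return np.array([v1, v2, a1, a2])
    y = np.array(y0, dtype=float)
    for _ in range(int(round(t_end / dt))):
        k_1 = rhs(y)
        k_2 = rhs(y + 0.5 * dt * k_1)
        k_3 = rhs(y + 0.5 * dt * k_2)
        k_4 = rhs(y + dt * k_3)
        y = y + dt / 6.0 * (k_1 + 2 * k_2 + 2 * k_3 + k_4)
    return y[0]   # h1 at t = t_end
```

The discrepancy between the dt = 0.01 and dt = 0.6 trajectories is precisely the low-fidelity model error the multifidelity learners must cope with.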

Paciorek Function with Noise
The Paciorek function reproduces an optimization setting where the objective function is affected by measurement noise and a localized multimodal behaviour. This scenario is replicated through a random noise term in the low-fidelity Paciorek function (Figure 10), where f^(2) is the Paciorek function, A = 0.5, α = 0.2, and the input variable is defined across the input domain χ = [0.3, 1]².

Results and Discussion
First, we define the following evaluation metrics to assess the performances of the Bayesian schemes [16]: where x * is the location of the analytical optimum, x * is the optimum identified by the algorithm, and f max and f * are the maximum and minimum of the objective function, respectively.The first metric ε x quantifies the search error in the domain of the objective function, while the second metric ε f evaluates the error associated with the learning goalminimum of the objective function [160].We evaluate the metrics ε x and ε f as functions of the computational budget B defined as the cumulative computational cost associated with observations of the objective function considering the l-th level of fidelity.We run 10 trails for each benchmark problem presented in Section 6.1 to compensate the influence of the random initial design of experiments, and to verify the sensitivity and robustness of the algorithms to the initialization setting.The results for all the experiments are reported in terms of median values of ε x and ε f .show that the multifidelity algorithms identify the optimum solution with a significant reduction of the computational budget if compared with the single fidelity counterparts.The best performing algorithm is the MFPI learner considering only the high-fidelity l = 4 and the lower-fidelity l = 1 levels, while the second best is the MFEI acquisition function considering available the complete spectrum of fidelities l = 1, 2, 3, 4.These outcomes suggest that multifidelity learning paradigms driven majorly by informativeness -MFPI acquisition function -are capable of efficiently directing the computational resources toward the optimum of low-dimensional objective functions in presence of continuous localized behaviours.Moreover, it should be noted that the MFEI capitalizes from all the information sources available and leverages the balance between informativeness and representativeness/diversity to effectively search toward the analytical 
optimum. The single-fidelity Bayesian frameworks exhibit a lower convergence rate with respect to the multifidelity algorithms. The EI and PI use almost the same computational budget to identify the optimum solution, while the MES requires more evaluations of the objective function. This confirms the observations for the multifidelity experiments. PI takes advantage of the pure exploitation of high-fidelity samples in the surroundings of the surrogate minimum to reach the optimum. This can be explained with the computation of an accurate surrogate model - at least close to the optimum - for low-dimensional objective functions. In contrast, EI balances an exploration phase, to improve the overall accuracy of the surrogate, with the exploitation toward the believed optimum. Particular attention should be dedicated to the MES and MFMES outcomes. In the single-fidelity frameworks, MES scores slightly worse both in terms of convergence rate and budget expenditure. This can be interpreted as an overall over-exploration behaviour: MES distributes computational resources to explore the domain and refine the surrogate model, and directs efforts toward the optimum only later. This trend is considerably dampened in the multifidelity scenario, where MFMES shows good capabilities, especially when all the sources of information are available during the search. In this case, cheap low-fidelity observations are used to explore the domain with contained computational expenditure, and high-fidelity data are mostly adopted to search toward the prescribed optimum location.
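The two evaluation metrics, as described, can be sketched as follows; the Euclidean norm and the range normalization are assumptions about the exact definitions used in [16]:

```python
import numpy as np

def eps_x(x_found, x_star):
    """Search error in the input domain (Euclidean distance assumed)."""
    return np.linalg.norm(np.asarray(x_found) - np.asarray(x_star))

def eps_f(f_found, f_star, f_max):
    """Error on the learning goal, normalized by the objective range."""
    return (f_found - f_star) / (f_max - f_star)
```

Both metrics vanish when the learner recovers the analytical optimum exactly, and ε_f is dimensionless, so it can be compared across benchmark problems.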
The discontinuous Forrester problem introduces a discontinuous local property of the objective function that further stresses the learning schemes. This can be explicitly observed in the average increase of the budget required to achieve the optimum. Overall, it is possible to identify the same trends observed for the continuous Forrester function (Figure 11(b) and Figure 11(d)): either balancing exploration and exploitation - EI and MFEI - or a mostly exploitative search - PI and MFPI - leads to an efficient identification of the analytical optimum. In contrast, the over-exploration of MES and MFMES decelerates the optimization procedure with respect to the competing methods. This can be observed mostly for the MES, which uses almost all the available budget to explore the domain before finally reaching the optimum. Figure 12 illustrates the experiments conducted on the Rosenbrock benchmark function with increasing dimensionality D of the domain. This allows us to investigate the performance of the learning schemes as the number of parameters to optimize increases. Overall, the multifidelity schemes deliver better convergence with a fraction of the computational budget required by single-fidelity algorithms for all the dimensions of the domain - D = 2, 5, 10.
For D = 2 (Figure 12(a) and Figure 12(d)), MFEI and MFPI implementing only the highest and lowest levels of fidelity l = 1, 3 are the best performing algorithms, followed by their counterparts considering the full fidelity spectrum and by the MFMES also learning from l = 1, 3. Two major observations can be made in this experimental setting. First, multifidelity learners are not capable of taking advantage of the intermediate fidelity l = 2 during exploration, leading to an increase of the computational expenditure. A possible explanation of this outcome is the local behaviour of the intermediate fidelity, which pushes the exploration into regions far from the optimum. Second, pure exploitation or a balanced search between exploration and exploitation are advantageous in low-dimensional domains, while pure exploration sacrifices valuable computational resources to improve the awareness about the global distribution of the objective instead of searching for the optimum.
Increasing the dimension of the input space to D = 5 (Figure 12(b) and Figure 12(e)), only the MFEI and the MFPI using all the available fidelities are capable of identifying the optimum solution, while the other competing algorithms converge to suboptimal solutions. However, the much faster convergence of MFEI with respect to MFPI in the complete setting should be noted. These outcomes indicate that, as the number of optimization variables increases, both exploration and exploitation are required for an efficient learning procedure. In particular, exploration improves the accuracy of the surrogate over the domain, which permits to better inform the learner during the exploitation phase. The utility of pure exploitation - MFPI - continues to be observed, but its effectiveness is limited by the dimensionality of the domain, which requires an exploration phase to better capture the distribution of the objective function.
Pushing the dimensionality of the domain further to D = 10 (Figure 12(c) and Figure 12(f)), none of the algorithms is capable of reaching the analytical optimum with the allocated budget. This can be explained with the unreliable prediction of the surrogate model, which is not capable of correctly informing the learner with the limited amount of data permitted by the allocated budget. However, the multifidelity paradigms achieve larger reductions of both the error in the domain ε_x and the goal error ε_f compared with the single-fidelity outcomes. This suggests that learners capable of leveraging multiple information sources might produce higher gains in a limited budget scenario thanks to the massive use of cheap low-fidelity models to learn the objective function. Among the competing strategies, MFMES exhibits remarkable outcomes in terms of convergence values of the errors when the whole library of fidelities is available. These results can be justified with the over-exploration properties of the MFMES acquisition function: the learner uses massive low-fidelity data to refine the approximation of the surrogate model and augment its predictive capabilities. This permits to better inform the procedure and direct computational resources toward the optimum.
The results obtained for the ALOS benchmark problem are reported in Figure 13. The outcomes related to the multimodal benchmarks are reported in Figure 14. The multifidelity algorithms are capable of converging toward the analytical optimum with a fraction of the computational cost compared with the single-fidelity results. For the Rastrigin function (Figure 14(a) and Figure 14(d)), the multifidelity methods implementing all the levels of fidelity l = 1, 2, 3 outperform the multifidelity methods with l = 1, 3: the intermediate level of fidelity l = 2 is more accurate than the low-fidelity output l = 3 and allows to improve the reliability of the Gaussian process in the presence of a strong multimodal behaviour. The best performing method is MFPI using l = 1, 2, 3, denoting that the over-exploitation of the input space with the lower fidelity levels l = 2, 3 allows to take full advantage of low-fidelity data, improving the performance of the learning process. In contrast, we observe that the MES algorithm exhibits a more efficient convergence than its MFMES counterpart. This is related to the already noticed over-exploration of the domain: the MES uses accurate high-fidelity observations to refine the surrogate during the exploration, while the MFMES systematically adopts lower levels of fidelity to massively query the domain and delays the exploitation with more accurate information sources. The results achieved for the spring-mass benchmark problem (Figure 14(b) and Figure 14(e)) confirm the superior convergence performance of the multifidelity algorithms in the presence of marked multimodal objective functions. In particular, the balance between exploration and exploitation delivered by the MFEI allows for superior accelerations and a contained demand for computational resources. Similar results can be observed for the Paciorek benchmark problem (Figure 14(c) and Figure 14(f)): the multifidelity learning delivers efficient optimization procedures even in the simultaneous presence of
multi-modality and noise. It should be noticed that, in the presence of noise, both the MES and MFMES show an attenuation of the exploratory behaviour and a greater exploitation of the domain. This result is in agreement with what was observed by Nguyen et al. [147]. The overall outcomes for this subset of benchmark functions demonstrate that a learning scheme characterized by balanced exploration and exploitation phases is essential in the presence of multimodal behaviour and noise in the measurements of the objective function.

Advice on using Learning Criteria
From the experiments in this paper and our research experience, we can summarize several recommendations intended to provide a guideline to apply the different learning criteria in real-world optimization problems. Although this advice may not be suitable in general, due to the vast and natural heterogeneity of the applications where optimization is relevant, we believe that these guidelines can be useful in directing researchers toward the effective use of learning schemes.
1. Pure exploitation/informativeness learning schemes can be beneficial for low-dimensional optimization problems. In our experience, the direct exploitation of data at the beginning of the optimization procedure can produce significant improvements in the solution with relatively contained computational resources. The reason behind this behaviour is the accurate prediction of the emulator with a contained amount of data in low-dimensional domains. This contributes to better inform the learner and effectively direct resources toward the optimum.
2. Pure exploration/representativeness-diversity can considerably improve the optimization results for high-dimensional optimization problems. The exploration reduces the uncertainty of the emulator over the whole domain and leads to a more reliable predictive framework. This better informs the learner and helps direct the computational resources to regions of the domain where benefits in terms of solution are more likely to be achieved.
3. The balance between exploration and exploitation guarantees consistent and satisfactory optimization performance across different mathematical properties of the objective function. In particular, our experiments suggest that pursuing the trade-off between exploration and exploitation often leads to satisfactory, and in many cases better, performance than implementing the learning criteria individually. Although it performs well in general, this balanced strategy should be privileged mainly when there is no prior knowledge about the specific optimization problem considered, to increase the chances of success.
4. When the computational resources are severely limited (e.g. engineering preliminary design phases or trade-off analyses), there is a clear advantage in using multifidelity learning criteria and leveraging a spectrum of information sources at different levels of fidelity. Indeed, the wise combination of fast low-fidelity data with expensive high-fidelity evaluations reduces the overall demand for computational resources, and shows more robust performance for challenging mathematical properties of the objective function such as local/global behaviours, non-linearities and discontinuities, multimodality, and noisy measurements.
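The fourth recommendation can be made concrete with a toy fidelity-selection rule. The sketch below is a simplified, cost-weighted heuristic in the spirit of multifidelity acquisition functions such as MFEI; the discount by correlation and cost, the specific `corr`/`cost` values, and all function names are illustrative assumptions rather than the exact formulation used in the experiments:

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization at a single candidate point."""
    sigma = max(sigma, 1e-12)
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def select_fidelity(mu, sigma, f_best, corr, cost):
    """Pick the fidelity level maximizing a cost-weighted utility.

    corr[l]: assumed correlation of fidelity l with the high-fidelity model
    cost[l]: relative cost of one evaluation at fidelity l
    The weighting EI * corr / cost is illustrative only.
    """
    ei = expected_improvement(mu, sigma, f_best)
    utilities = {l: ei * corr[l] / cost[l] for l in corr}
    return max(utilities, key=utilities.get), utilities

# l = 1 denotes the high-fidelity source; lower levels are cheaper but
# less correlated with it.
corr = {1: 1.0, 2: 0.9, 3: 0.7}
cost = {1: 1.0, 2: 0.2, 3: 0.05}
best_l, u = select_fidelity(mu=0.4, sigma=0.3, f_best=1.0, corr=corr, cost=cost)
```

With a cheap and reasonably well-correlated low-fidelity source, the rule queries level l = 3 first, mirroring the observed tendency of the multifidelity learners to spend low-fidelity evaluations on exploration before committing the high-fidelity budget.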
Conclusions
This paper proposes an original unified perspective on Bayesian optimization and active learning as adaptive sampling schemes guided by common learning principles toward a given optimization goal. Our arguments are based on the recognition of Bayesian optimization and active learning as goal-driven learning procedures characterized by a mutual information exchange between the learner and the surrogate model: the learner makes a decision based on the surrogate information to maximize the sampling utility with respect to the given goal, while the emulator is constantly updated with the results of this decision. Accordingly, we clarify and support our discussion through a general classification of adaptive sampling methodologies, and recognize Bayesian optimization as the logical intersection between active learning and adaptive sampling. This lays the foundations for the explicit formalization of the synergy between Bayesian optimization and active learning, considering both a single information source and a library of representations at different levels of fidelity available to the learner. This unified perspective is based on the dualism between the active learning criteria of informativeness and representativeness/diversity and the Bayesian infill criteria of exploration and exploitation as the driving elements to achieve the learning goal. To support our perspective, we review and analyse popular formulations of the acquisition function for Bayesian optimization in both single-fidelity and multifidelity settings. Accordingly, we formalize this synergy by mapping the informativeness learning criterion to the exploitation infill criterion as driving components that direct the selection of samples toward the learning goal. Similarly, we formulate the substantial analogy between the representativeness-diversity learning criterion and the exploration infill criterion as sampling policies that improve the awareness of the objective function over the domain.
Through stressful analytical benchmark problems, we demonstrate the benefits of each learning/infill criterion over challenging mathematical properties of the objective function typically encountered in real-world applications. The results reveal that the balance between the learning/infill criteria ensures good performance and computational efficiency over all the benchmark problems. In addition, multifidelity learning schemes deliver significant accelerations of the learning procedure, making them particularly attractive when the available computational resources are limited. We also include some advice and guidelines on the use of the different learning criteria based on the experimental results and our own experience in the field.

Figure 2: Where adaptive sampling and active learning meet: this work focuses on the synergies between Bayesian optimization and active learning as goal-driven learning procedures guided by common learning principles.

Figure 4: Mapping methodologies to learning criteria.

Figure 7: Rosenbrock function benchmark problem over the D = 2 dimensional domain.

Figure 8: ALOS function benchmark problems over the D = 1 and D = 2 dimensional domains.

Figure 9: Shifted and rotated Rastrigin function benchmark problem.

Figure 11: Performances of the competing algorithms for the Forrester and Jump Forrester benchmarks.

Figure 11 summarizes the outcomes obtained for the Forrester function and the discontinuous Forrester function. The results for the Forrester benchmark (Figure 11(a) and Figure 11(c)) show that the multifidelity algorithms identify the optimum solution with a significant reduction of the computational budget compared with the single-fidelity counterparts. The best performing algorithm is the MFPI learner considering only the high-fidelity l = 4 and the lower-fidelity l = 1 levels, while the second best is the MFEI acquisition function considering the complete spectrum of fidelities l = 1, 2, 3, 4. These outcomes suggest that multifidelity learning paradigms driven mainly by informativeness (the MFPI acquisition function) are capable of efficiently directing the computational resources toward the optimum of low-dimensional objective functions in the presence of continuous localized behaviours. Moreover, it should be noted that MFEI capitalizes on all the information sources available and leverages the balance between informativeness and representativeness/diversity to effectively search toward the analytical optimum. The single-fidelity Bayesian frameworks exhibit a lower convergence rate than the multifidelity algorithms. EI and PI use almost the same computational budget to identify the optimum solution, while MES requires more evaluations of the objective function. This confirms the observations for the multifidelity experiments. PI benefits from the pure exploitation of high-fidelity samples in the surroundings of the surrogate minimum to reach the optimum. This can be explained by the computation of an accurate surrogate model, at least close to the optimum, for low-dimensional objective functions. In contrast, EI balances an exploration phase, which improves the overall accuracy of the surrogate, with the exploitation toward the believed optimum. Particular attention should be dedicated to the MES and MFMES outcomes. In the single-fidelity frameworks, MES scores slightly worse both in terms of convergence rate and budget expenditure. This can be interpreted as an overall over-exploration behaviour: MES distributes computational resources to explore the domain and refine the surrogate model, and only later directs efforts toward the optimum. This trend is considerably dampened in the multifidelity scenario, where MFMES shows good capabilities, especially when all the sources of information are available during the search. In this case, cheap low-fidelity observations are used to explore the domain with a contained computational expenditure, and high-fidelity data are mostly adopted to search toward the prescribed optimum location.
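The over-exploration attributed to MES above can be traced back to its acquisition value, which grows with posterior uncertainty. The following is a simplified sketch of the MES criterion in its maximization form as introduced by Wang and Jegelka; it assumes samples of the unknown optimum value are already available (the Gumbel-based sampling step is omitted for brevity, and the names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

def mes_acquisition(mu, sigma, y_star_samples):
    """Max-value Entropy Search acquisition (maximization form).

    mu, sigma: GP posterior mean and standard deviation at candidates
    y_star_samples: sampled values of the unknown global maximum
    """
    sigma = np.maximum(sigma, 1e-12)
    acq = np.zeros_like(np.asarray(mu, dtype=float))
    for y_star in y_star_samples:
        gamma = (y_star - mu) / sigma
        cdf = np.maximum(norm.cdf(gamma), 1e-12)
        # Entropy reduction of the max-value distribution at this candidate.
        acq += gamma * norm.pdf(gamma) / (2.0 * cdf) - np.log(cdf)
    return acq / len(y_star_samples)

# Two candidates with equal mean: the high-variance one receives a much
# larger MES value, consistent with the exploratory behaviour observed.
mu = np.array([0.0, 0.0])
sigma = np.array([0.1, 1.0])
acq = mes_acquisition(mu, sigma, y_star_samples=[1.5])
```

In this toy setting `acq[1] > acq[0]`, showing how the criterion systematically rewards uncertain regions until the surrogate is refined, which is the single-fidelity behaviour that MFMES mitigates by delegating exploration to cheaper fidelities.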

Figure 12: Performances of the competing algorithms for the Rosenbrock benchmarks.

Figure 13: Performances of the competing algorithms for the ALOS benchmarks.
The results obtained for the ALOS benchmark problem in Figure 13 confirm the previous observations about the different effectiveness of the learning schemes. In particular, the multifidelity strategies provide larger accelerations of the optimization procedure in the presence of oscillations at different frequencies of the objective function for the one- (Figure 13(a) and Figure 13(d)), two- (Figure 13(b) and Figure 13(e)) and three-dimensional (Figure 13(c) and Figure 13(f)) ALOS problem. We observe that the best performances are delivered either by learners based on the balance between informativeness and representativeness/diversity (MFEI and EI) or by purely informativeness-driven learners (MFPI and PI), while over-exploration (MFMES and MES) performs relatively poorly. These results are justified by the low dimensionality of the objective function.

Figure 14: Performances of the competing algorithms for the multimodal benchmarks.