1 Introduction

Advances in modeling and availability of cheap computational power have enabled the science, engineering, and business research communities to make use of simulations to model phenomena and systems. It is only natural that there is great interest in manipulating the degrees of freedom in these simulations to optimize them.

The term simulation optimization (SO) is an umbrella term for techniques used to optimize stochastic simulations. Simulation optimization involves the search for those specific settings of the input parameters to a stochastic simulation such that a target objective, which is a function of the simulation output, is, without loss of generality, minimized.

As opposed to algebraic model-based mathematical programming, SO does not assume that an algebraic description of the simulation is available—the simulation may be available as a black box that only allows the evaluation of the objective and constraints for a particular input. In fact, many SO algorithmic approaches solely depend on such input–output data from the simulation in their search for optimal input settings.

In addition, many large-scale and/or detailed simulations may be expensive to run, in terms of time, money, or resources. As a result, there is also a need to perform as few simulations as possible in this search for optimal parameters. Outputs from these stochastic simulations are not deterministic, and usually follow some output distribution, which may or may not vary across the parametric space. This uncertainty or variability in output also adds to the challenge of optimization, as it becomes harder to discern the quality of the parametric input in the presence of this output noise. In addition, when an algebraic description of the simulation is not accessible, derivative information is usually unavailable, and the estimation of derivatives from the use of finite differences may not be suitable due to noisy outputs and the expensive nature of simulations.

The nature of the stochastic simulations under study will determine the specific technique chosen to optimize them. The simulations, which are often discrete-event simulations, may be partially accessible to us in algebraic form, or may be purely available as an input–output model (as a black box); they may have single or multiple outputs; they may have deterministic or stochastic output(s); they may involve discrete or continuous parameters; and they may or may not involve explicit, or even implicit/hidden constraints.

A very general simulation optimization problem can be represented by (P1).

$$\begin{aligned} \begin{array}{ll} \min &{} \mathbb {E}_{\omega } [f(x,y,\omega )]\\ \hbox {s.t.} &{} \mathbb {E}_{\omega }[ g(x,y,\omega )] \le 0\\ &{} h(x,y) \le 0\\ &{} x_l \le x \le x_u\\ &{} x \in \mathbb {R}^n,\ y \in \mathbb {D}^m. \end{array} \end{aligned}$$
(P1)

The function f can be evaluated through simulation for a particular instance of the continuous inputs x, discrete inputs y, and a realization of the random variables in the simulation, the vector \(\omega \) (which may or may not be a function of the inputs, x and y). Similarly, the constraints defined by the vector-valued function g are also evaluated with each simulation run. In this formulation, expected values for these stochastic functions are used. There may be other constraints (represented by h) that do not involve random variables, as well as bound constraints on the decision variables.
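To make this concrete, the expectation \(\mathbb {E}_{\omega }[f(x,y,\omega )]\) is typically estimated by averaging repeated simulation replications at fixed inputs. The sketch below is illustrative only: `simulate` is a hypothetical one-dimensional stand-in for a real stochastic simulation, and its algebraic form is assumed to be invisible to the optimizer.

```python
import numpy as np

def simulate(x, rng):
    """Hypothetical stochastic simulation: returns one noisy
    observation f(x, omega) for a scalar input x."""
    omega = rng.normal(0.0, 0.5)      # realization of the randomness
    return (x - 2.0) ** 2 + omega     # true E[f] = (x - 2)^2

def estimate_objective(x, n_reps=1000, seed=0):
    """Sample-average estimate of E_omega[f(x, omega)]."""
    rng = np.random.default_rng(seed)
    return np.mean([simulate(x, rng) for _ in range(n_reps)])
```

Each call to `estimate_objective` is what an SO algorithm treats as a single (still noisy) observation of the objective at a candidate input.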

The relaxation of any of these conditions would constitute a problem that would fall under the purview of SO. Most algorithms focus on problems that either have solely discrete choices, or solely continuous decisions to make. Each constraint may be thought of as representing additional outputs from the simulation that need to be taken into consideration. In addition, there may be bound constraints imposed on decision variables, which may either be readily available or derived from domain-specific knowledge. Relatively few existing algorithms attempt to address both discrete and continuous choices simultaneously, although some broad classes of approaches naturally lend themselves to be applicable in either, and therefore both, settings. Further, the discrete variables may either be binary, integer-ordered, or categorical and lie in some discrete space \(\mathbb {D}\).

As can be seen, the formulation P1 is extremely general, and therefore a wide variety of applications fall under the scope of simulation optimization. Various applications of simulation optimization in diverse research fields are tabulated in Sect. 2.

Another common assumption is that f is a real-valued function and g is a real vector-valued function, both of whose expected values may or may not be smooth or continuous functions. The most common objective in SO is to optimize the expected value of some performance metric, but other objective functions may be appropriate depending on the application. For instance, an objective that minimizes risk could be a possible alternative, in which case one would incorporate some sort of variance measure as well into the objective.

This paper is an updated version of Amaran et al. (2014) and is meant to be a survey of available techniques as well as recent advances in simulation optimization. The remainder of the introduction provides a literature survey of prior reviews, and elaborates on the relationship of simulation optimization to algebraic model-based mathematical programming, derivative-free optimization, and machine learning. Section 2 provides a glimpse into the wide variety of applications of simulation optimization that have appeared in the literature. Section 3 focuses on various algorithms for discrete and continuous simulation optimization, provides basic pseudocode for major categories of algorithms, and provides comprehensive references for each type of algorithm. Section 4 provides a listing of available software for simulation optimization, and Sect. 5 discusses means to compare their performance. Section 6 summarizes the progress of the field, and outlines some current and future topics for research.

1.1 Prior reviews of simulation optimization

Several review papers (cf. Meketon 1987; Jacobson and Schruben 1989; Safizadeh 1990; Azadivar 1992; Fu 1994; Carson and Maria 1997; Andradóttir 1998; Azadivar 1999; Swisher et al. 2000; Fu et al. 2000, 2005; Fu 2002; Tekin and Sabuncuoglu 2004; Andradóttir 2006; Hong and Nelson 2009; Ammeri et al. 2011; Pasupathy and Ghosh 2013), books and research monographs (cf. Spall 2003b; Rubinstein and Kroese 2004; Kleijnen 2008; Chen and Lee 2010), and theses (cf. Angün 2004; Driessen 2006; Deng 2007; Chang 2008; Frazier 2009; Kabirian 2009) have traced the development of simulation optimization.

Meketon (1987) provides a classification of algorithmic approaches for optimization over simulations based on how much information or structure about the underlying model is known. The paper surveys the progress of the field between 1975 and 1987, and focuses on continuous simulation optimization. Andradóttir (1998) provides a tutorial on gradient-based procedures for continuous problems. Carson and Maria (1997) and Azadivar (1999) also give brief outlines of and pointers to prevailing simulation optimization algorithms.

Fu et al. (2000) contains several position statements of eminent researchers and practitioners in the field of simulation, where the integration of simulation with optimization is discussed. The issues addressed include generality versus specificity of an algorithm, the wider scope of problems that simulation optimization methodologies have the potential to address, and the need for integrating provably convergent algorithms proposed by the research community with metaheuristics often used by commercial simulation software packages.

Of the more recent surveys, Fu (1994) provides an excellent tutorial on simulation optimization, and focuses more on continuous than on discrete optimization problems. The paper focuses specifically on discrete-event simulations. Fu (2002) provides a comprehensive survey of the field and its scope—the paper outlines the different ways in which optimization and simulation interact, gives examples of real-world applications, introduces simulation software and the optimization routines that each of them use, provides a very basic tutorial on simulation output analysis and convergence theory for simulation optimization, elaborates on algorithms for both continuous and discrete problems, and provides pointers to many useful sources. Fu et al. (2005) provide a concise, updated version of all of this, and also discuss estimation of distribution algorithms.

Tekin and Sabuncuoglu (2004) provide a table that analyzes past review papers and the techniques they focus on. Apart from providing detailed updates on advances in approaches and algorithms, the paper also lists references that attempt to compare different SO techniques. Hong and Nelson (2009) classify simulation optimization problems into those with (1) a finite number of solutions; (2) continuous decision variables; and (3) discrete variables that are integer-ordered. The paper describes procedures for each of these classes. Perhaps the most recent survey, Ammeri et al. (2011), classifies simulation optimization algorithms and provides a survey of methods as well as applications appearing in the literature between 1995 and 2010.

The present paper provides an overview of techniques, and briefly outlines well-established methods with pointers to more detailed surveys, while expounding on more recent methods in a concise manner. Though several reviews exist, we catalog the most recent developments—the emergence of derivative-free optimization and its relationship with simulation optimization, the appearance of simulation test-beds for comparing algorithms, the recent application of simulation optimization in diverse fields, the development of and interest in related techniques and theory by the machine learning and optimization communities, and the unprecedented growth of interest in optimizing over simulations. This surge in interest is evidenced by the fact that more than half of the works we reference were published in the last decade. The intent is not only to trace the progress of the field, but to provide an update on state-of-the-art methods and implementations, point the familiar as well as the uninitiated reader to relevant sources in the literature, and speculate on future directions in the field.

1.2 A note on terminology and scope

As simulation optimization involves the use of algorithms that arose from widely differing fields (Sect. 3), has relationships to many diverse disciplines (Sect. 1.3), and has been applied to many different practical applications from biology to engineering to logistics (Sect. 2), it is not surprising that it is known by various names in different fields. It has also been referred to as simulation-based optimization, stochastic optimization, parametric optimization, black-box optimization, and optimization via simulation (OvS), where the continuous and discrete versions are accordingly known as continuous optimization via simulation (COvS) and discrete optimization via simulation (DOvS). Each algorithmic technique may also go by different names, and we attempt to reconcile these in Sect. 3.

Inputs to the simulation may be variously referred to as parameter settings, input settings, variables, controls, solutions, designs, experiments (or experimental designs), factors, or configurations. Outputs from the simulation are called measurements, responses, performance metrics, objective values, simulation replications, realizations, or results. The performance of a simulation may also be referred to as an experiment, an objective function evaluation, or simply a function evaluation. We will use the term ‘iteration’ to refer to a fixed number of function evaluations (usually one) performed by a simulation optimization algorithm.

A note of caution when using SO methods: incorporate as much domain-specific knowledge as possible into the use of an SO algorithm. This may be in terms of (1) screening relevant input variables, (2) scaling and range reduction of decision variables, (3) providing good initial guesses for the algorithm, and (4) gleaning information from known problem structure, such as derivative estimates.

Table 1 classifies the techniques that are usually most suitable in practice for different scenarios in the universe of optimization problems. Certain broad classes of algorithms, such as random search methods, may be applicable to all of these types of problems, but they are often most suitable when dealing with pathological problems (e.g., problems with discontinuities, nonsmoothness) and are often used because they are relatively easy to implement.

Table 1 Terminology of optimization problems

The possibilities of combining simulation and optimization procedures are vast: simulation with optimization-based iterations; optimization with simulation-based iterations; sequential simulation and optimization; and alternate simulation and optimization are four such paradigms. A recent paper by Figueira and Almada-Lobo (2014) delves into the taxonomy of such problems, and provides a guide to choosing an appropriate approach for a given problem. As detailed by Meketon (1987), different techniques may be applicable or more suitable depending on how much is known about the underlying simulation, such as its structure or associated probability distributions. We focus on approaches that are applicable in situations where all the optimization scheme has to work with are evaluations of \(f(x,y,\omega )\) and \(g(x,y,\omega )\), or simply, observations with noise.

1.3 Relationship to other fields

Algebraic model-based mathematical programming

As mentioned earlier, most mathematical programming methods rely on the presence of an algebraic model. The availability of an algebraic model has many obvious implications to a mathematical programming expert, including the ability to evaluate a function quickly, the availability of derivative information, and the possibility of formulating a dual problem. None of these is generally possible or available in an SO setting.

In the case with continuous decisions, derivative information is often hard to estimate accurately through finite differences, either due to the stochastic noise associated with objective function evaluations, or due to the large expense associated with obtaining function evaluations, or both. The inherent stochasticity in output also renders automatic differentiation (AD) (Rall 1981; Griewank and Walther 2008) tools not directly applicable. Moreover, automatic differentiation may not be used when one has no access to source code, does not possess an AD interface to proprietary simulation software, and, of course, when one is dealing with a physical experiment. The lack of availability of derivative information has further implications—it complicates the search for descent directions, proofs of convergence, and the characterization of optimal points.
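The difficulty with finite differences under noise can be seen directly: a forward difference of two independent noisy evaluations carries noise variance of roughly \(2\sigma ^2/h^2\), which explodes as the step size h shrinks. The following illustration uses a hypothetical noisy quadratic (the function, noise level, and step sizes are made up for the sketch):

```python
import numpy as np

def noisy_f(x, rng, sigma=0.1):
    """Hypothetical noisy evaluation of f(x) = x^2 (true f'(1) = 2)."""
    return x ** 2 + rng.normal(0.0, sigma)

def fd_gradient(x, h, rng):
    """Forward-difference estimate from two independent noisy evaluations."""
    return (noisy_f(x + h, rng) - noisy_f(x, rng)) / h

rng = np.random.default_rng(0)
small_h = [fd_gradient(1.0, 1e-3, rng) for _ in range(200)]
large_h = [fd_gradient(1.0, 1e-1, rng) for _ in range(200)]
# The noise contributes variance ~ 2*sigma^2 / h^2 to each estimate,
# so shrinking h makes the estimates scatter wildly around the true slope.
```

The small-step estimates are orders of magnitude more scattered, even though a smaller step would be preferable in the deterministic case.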

Simulation optimization, like stochastic programming, also attempts to optimize under uncertainty. However, stochastic programming differs in that it makes heavy use of the model structure itself (Birge and Louveaux 2011). Optimization under uncertainty techniques that make heavy use of mathematical programming are reviewed in Sahinidis (2004).

Derivative-Free Optimization

Both simulation optimization and derivative-free optimization (DFO) are referred to in the literature as black-box optimization methods. Output variability is the key factor that distinguishes SO from DFO, where the output from the simulation is deterministic. However, there are many approaches to DFO that have analogs in SO as well (e.g., response surfaces, direct search methods, metaheuristics), cf. Sect. 3.

Another distinction is that most algorithms in DFO are specifically designed keeping in mind that function evaluations or simulations are expensive. This is not necessarily the case with SO algorithms.

With regard to rates of convergence, SO algorithms are generally inefficient and convergence rates are typically very slow. In general, one would expect SO to have a slower convergence rate than DFO algorithms simply because of the additional complication of uncertainty in function evaluations. As explained in Conn et al. (2009), some DFO algorithms, under certain assumptions, expect rates that are closer to linear than quadratic, and therefore early termination may be suitable. As described in some detail by Fu (1994), the best possible convergence rates for SO algorithms are generally \(\mathcal {O}(1/\sqrt{k})\), where k is the number of samples. This follows from the central limit theorem, which gives the rate at which the best possible estimator converges to the true expected function value at a point. This implies that though one would ideally incorporate rigorous termination criteria in algorithm implementations, most practical applications have a fixed simulation or function evaluation budget that is reached first.
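This canonical rate is easy to observe empirically: the error of a k-sample mean estimator shrinks as \(1/\sqrt{k}\), so quadrupling the sampling effort only halves the error. A quick illustrative check (the sample sizes and noise model are arbitrary):

```python
import numpy as np

def sample_mean_error(k, trials=2000, seed=0):
    """Mean absolute error of a k-sample average of N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, size=(trials, k)).mean(axis=1)
    return np.abs(means).mean()

# Quadrupling the number of samples roughly halves the estimation
# error, consistent with the canonical O(1/sqrt(k)) rate.
e100, e400 = sample_mean_error(100), sample_mean_error(400)
```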

Machine learning

Several subcommunities in the machine learning community address problems closely related to simulation optimization. Traditional machine learning settings assume the availability of a fixed dataset. Active learning methods (Cohn et al. 1996; Settles 2010) extend machine learning algorithms to the case where the algorithms are allowed to query an oracle for additional data to infer better statistical models. Active learning is closely related in that this choice of sampling occurs at every iteration in a simulation optimization setting as well. The focus of active learning is usually to learn better predictive models rather than to perform optimization.

Reinforcement learning (Stephens and Baritompa 1998) is broadly concerned with what set of actions to take in an environment to maximize some notion of cumulative reward. Reinforcement learning methods have strong connections to information theory, optimal control, and statistics. The similarity with simulation optimization is that the common problem of exploration of the search space versus exploitation of known structure of the cost function arises. However, in the reinforcement learning setting, each action usually also incurs a cost, and the task is to maximize the accumulated rewards from all actions—as opposed to finding a good point in the parameter space eventually.

Policy gradient methods (Peters et al. 2003) are a subfield of reinforcement learning, where the set of all possible sequences of actions form the policy space, and a gradient in this policy space is estimated and a gradient ascent-type method is then used to move to a local optimum. Bandit optimization (Gittins 1989) is another subfield of reinforcement learning that involves methods for the solution to the multi-armed bandit problem. The canonical example involves a certain number of slot machines, and a certain total budget to play them. Here, each choice of sample corresponds to which slot machine to play. Each play on a slot machine results in random winnings. This setting is analogous to discrete simulation optimization (DOvS) over finite sets, although with a different objective (Powell and Ryzhov 2012). Again, in DOvS over finite sets, we are only concerned with finding the best alternative eventually, whereas the cumulative winnings is the concern in the multi-armed bandit problem.
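The exploration–exploitation trade-off in the bandit setting can be sketched with a minimal epsilon-greedy strategy (the arm means and parameters below are made up for illustration); note that the quantity being maximized is the accumulated reward over all plays, not merely the final identification of the best arm:

```python
import numpy as np

def epsilon_greedy(true_means, n_plays=5000, eps=0.1, seed=0):
    """Epsilon-greedy play of a multi-armed bandit: with probability
    eps explore a random arm, otherwise exploit the arm with the best
    current estimate. Returns (total reward, index of estimated best arm)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    estimates = np.zeros(k)
    reward = 0.0
    for _ in range(n_plays):
        if rng.random() < eps:
            arm = int(rng.integers(k))        # explore
        else:
            arm = int(np.argmax(estimates))   # exploit
        payout = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (payout - estimates[arm]) / counts[arm]  # running mean
        reward += payout
    return reward, int(np.argmax(estimates))

total, best = epsilon_greedy([0.1, 0.5, 0.9])
```

A DOvS procedure over the same three alternatives would instead spend its budget purely to identify the best arm, with no credit for rewards collected along the way.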

Most, if not all, simulation optimization procedures have elements that are derived from or highly related to several other fields. Direct search procedures and response surface methodologies (RSM) have strong relationships with the field of experimental design. RSM, sample path optimization procedures, and gradient-based methods heavily incorporate ideas from algebraic model-based mathematical programming. RSM also involves the use of nonparametric and Bayesian regression techniques, whereas estimation of distribution algorithms involves probabilistic inference, and therefore these techniques are related to statistics and machine learning. Simulation optimization has been described as being part of a larger field called computational stochastic optimization. More information is available in Powell (2013).

2 Applications

SO techniques are most commonly applied to either (1) discrete-event simulations, or (2) systems of stochastic nonlinear and/or differential equations.

As mentioned in Fu (1994), discrete-event simulations can be used to model many real-world systems such as queues, operations, and networks. Here, the simulation of a system usually involves switching or jumping from one state to another at discrete points in time as events occur. The occurrence of events is modeled using probability distributions that capture the randomness involved.

Stochastic differential equations may be used to model phenomena ranging from financial risk (Merton 1974) to the control of nonlinear systems (Song and Grizzle 1995) to the electrophoretic separation of DNA molecules (Cho and Dorfman 2010).
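As a concrete (hypothetical) instance of such a simulation, the Euler–Maruyama scheme below discretizes geometric Brownian motion, the classic model behind Merton-style financial risk analysis. The drift, volatility, and horizon are illustrative values, not taken from any cited application:

```python
import numpy as np

def euler_maruyama_gbm(s0, mu, sigma, t_end, n_steps, rng):
    """Euler-Maruyama discretization of geometric Brownian motion:
    dS = mu*S dt + sigma*S dW."""
    dt = t_end / n_steps
    s = s0
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))   # Brownian increment
        s += mu * s * dt + sigma * s * dw
    return s

rng = np.random.default_rng(0)
# Monte Carlo estimate of E[S(1)]; for GBM this is s0 * exp(mu * t_end).
paths = [euler_maruyama_gbm(1.0, 0.05, 0.2, 1.0, 100, rng) for _ in range(5000)]
```

Each run of the scheme is one noisy simulation output; an SO method would treat inputs such as these parameters as the degrees of freedom to be tuned.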

With both discrete-event simulations and stochastic differential equation systems, there may be several parameters that one controls that affect some performance measure of the system under consideration, which are essentially degrees of freedom that may be optimized through SO techniques. Several applications of SO from diverse areas have been addressed in the literature and we list some of them in Table 2.

Table 2 Partial list of published works that apply simulation optimization

3 Algorithms

Algorithms for SO are diverse, and their applicability may be highly dependent on the particular application. For instance, algorithms may (1) attempt to find local or global solutions; (2) address discrete or continuous variables; (3) incorporate random elements or not; (4) be tailored for cases where function evaluations are expensive; (5) emphasize exploration or exploitation to different extents; (6) assume that the uncertainty in simulation output is homoscedastic or that it comes from a certain probability distribution; or (7) rely on underlying continuity or differentiability of the expectation (or some function of a chosen moment) of the simulation output. The sheer diversity of these algorithms also makes it somewhat difficult to assert which one is better than another in general, and also makes it hard to compare between algorithms or their implementations.

As mentioned in Sect. 1.3, many algorithms that are available for continuous simulation optimization have analogs in derivative-based optimization and in derivative-free optimization, where function evaluations are deterministic. In any case, the key lies in the statistics of how noise is handled, and how it is integrated into the optimization scheme. We will provide pointers to references that are applicable to simulation optimization in particular. A comprehensive review of methods for derivative-free optimization is available in Rios and Sahinidis (2013).

Each major subsection below is accompanied by pseudocode to give researchers and practitioners unfamiliar with the field an idea of the general approach taken by each of these algorithms. Many of the sections include pointers to convergence proofs for individual algorithms. Optimality in simulation optimization is harder to establish than in mathematical programming or derivative-free optimization due to the presence of output variability. Notions of optimality for simulation optimization are explored in Fu (1994); for the discrete case, Xu et al. (2010), for instance, establish conditions for local convergence, where a point that is 'better' than its \(2m+1\) neighboring solutions is said to be locally optimal. There has also been some work in establishing Karush–Kuhn–Tucker (KKT) optimality conditions for multiresponse simulation optimization (Bettonvil et al. 2009). Globally convergent algorithms will locate the global optimal solution eventually, but assuring this would require all feasible solutions to be evaluated through infinite observations; in practice, a convergence property that translates to a practical stopping criterion may make more sense (Hong and Nelson 2009).

Based on their scope, the broad classes of algorithms are classified in Table 3. Algorithms are classified based on whether they are applicable to problems with discrete/continuous variables, and whether they focus on global or local optimization. However, there may be specific algorithms that have been tweaked to make them applicable to a different class as well, which may not be captured by this table.

Table 3 Classification of simulation optimization algorithms

3.1 Discrete optimization via simulation

Discrete optimization via simulation is concerned with finding optimal settings for variables that can only take discrete values. This may be in the form of integer-ordered variables or categorical variables (Pasupathy and Henderson 2011). Integer-ordered variables are allowed to take on integer or discrete values within a finite interval, where the order of these values translates to some physical interpretation. For example, this could be the number of trucks available for vehicle routing, or the set of standard pipe diameters that are available for the construction of a manufacturing plant. Categorical variables refer to more general kinds of discrete decisions, ranging from conventional on–off (0–1 or binary) variables to more abstract decisions such as the sequence of actions to take given a finite set of actions. It should be noted that though integer-ordered variables, for instance, may be logically represented using binary variables, it may be beneficial to retain them as integer-ordered to exploit correlations in objective function values between adjacent integer values.

A rich literature in DOvS has developed over the last 50 years, with methods tailored to the problem setting at hand. Broadly, methods are designed either for finite parameter spaces or for very large/potentially infinite ones.

3.1.1 Finite parameter spaces

In the finite case, where the number of alternatives is small and fixed, the primary goal is to decide how to allocate the simulation runs among the alternatives. In this setting, there is no emphasis on ‘search’, as the candidate solution pool is small and known; each iteration is used to infer the best, in some statistical sense, simulation run(s) to be performed subsequently.

The optimization that is desired may differ depending on the situation, and could involve:

  1. The selection of the best candidate solution from a finite set of alternatives;
  2. The comparison of simulation performance measures of each alternative to a known standard or control; or
  3. The pairwise comparison between all solution candidates.

Item (1) is referred to as the ranking and selection problem. Items (2) and (3) are addressed under literature on multiple comparison procedures, with the former referred to as multiple comparisons with a control.

Ranking and selection

In traditional ranking and selection, the task is to minimize the number of simulation replications while ensuring a certain probability of correct selection of alternatives. Most procedures try to guarantee that the design ultimately selected is better than all competing alternatives by \(\delta \) with probability at least \(1-\alpha \). Here, \(\delta \) is called the indifference zone, and is the difference deemed sufficient to distinguish between expected performance among solution candidates.

Conventional procedures make use of the Bonferroni inequality which relates probabilities of the occurrence of multiple events with probabilities of each event. Other approaches involve the incorporation of covariance induced by, for example, the use of common random numbers to expedite the algorithmic performance over the more conservative Bonferroni approach. Kim and Nelson (2006), Kim and Nelson (2007) and Chick (2006) provide a detailed review and provide algorithms and procedures for this setting. Extensions of fully sequential ranking and selection procedures to the constrained case have been explored as well, e.g., Andradóttir and Kim (2010).

An alternative formulation of the ranking and selection problem is to do the best one can within a specified computational budget, called the optimal computing budget allocation formulation (Chen 1995). Chen et al. (2009) present more recent work, while the stochastically constrained case is considered in Lee et al. (2012).

Recent work (Hunter and Pasupathy 2013) in the area of DOvS over finite sets provides a quick overview of the field of ranking and selection, and considers general probability distributions and the presence of stochastic constraints simultaneously.

A basic ranking and selection procedure (Kim and Nelson 2007) is outlined in Algorithm 1, where it is assumed that independent data comes from normal distributions with unknown, different variances.

Algorithm 1 A basic ranking and selection procedure
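To convey the flavor of such procedures, the sketch below implements a deliberately simplified two-stage scheme: a first stage estimates each system's variance, and a second stage spends the remaining budget in proportion to those variances before selecting the smallest sample mean. This is illustrative only; the allocation rule is ad hoc and carries none of the probability-of-correct-selection guarantees of Algorithm 1:

```python
import numpy as np

def two_stage_select(simulators, n0=20, budget=400, seed=0):
    """Simplified two-stage selection sketch (minimization).
    Stage 1: n0 replications per system to estimate variances.
    Stage 2: remaining budget allocated proportionally to variance.
    Returns the index of the system with the smallest sample mean."""
    rng = np.random.default_rng(seed)
    k = len(simulators)
    samples = [[sim(rng) for _ in range(n0)] for sim in simulators]
    variances = np.array([np.var(s, ddof=1) for s in samples])
    remaining = budget - n0 * k
    extra = np.floor(remaining * variances / variances.sum()).astype(int)
    for i, sim in enumerate(simulators):
        samples[i].extend(sim(rng) for _ in range(extra[i]))
    means = [np.mean(s) for s in samples]
    return int(np.argmin(means))

# Three hypothetical systems with true means 1.0, 0.5, and 1.5:
sims = [lambda r: r.normal(1.0, 1.0),
        lambda r: r.normal(0.5, 1.0),
        lambda r: r.normal(1.5, 1.0)]
best = two_stage_select(sims)
```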

Multiple comparison procedures

Here, a number of simulation replications are performed on all the potential designs, and conclusions are made by constructing confidence intervals on the performance metric. The main ideas and techniques for multiple comparisons in the context of pairwise comparisons, or against a known standard are presented in Hochberg and Tamhane (1987), Fu (1994), and Hsu (1996). Recent work in multiple comparisons with a control include Kim (2005) and Nelson and Goldsman (2001), which provide fully sequential and two-stage frequentist procedures respectively; and Xie and Frazier (2013), which addresses the problem using a Bayesian approach.

Comprehensive treatment of ranking and selection and multiple comparison procedures may be found in Goldsman and Nelson (1998) and Bechhofer et al. (1995). A detailed survey that traces the development of techniques in simulation optimization over finite sets is available in Tekin and Sabuncuoglu (2004).

3.1.2 Large/infinite parameter spaces

To address DOvS problems with a large number of potential alternatives, algorithms that have a search component are required. Many of the algorithms that are applicable to the continuous optimization via simulation case are, with suitable modifications, applicable to the case with large/infinite parameter spaces. These include (1) ordinal optimization, (2) random search methods, and (3) direct search methods.

Ordinal optimization methods (Ho 1999) are suitable when the number of alternatives is too large to find the globally optimal design in the discrete-event simulation context. Instead, the task is to find a satisfactory solution with some guarantees on quality (called alignment probability) (Lau and Ho 1997). Here, the focus is on sampling a chosen subset of the solutions and evaluating them to determine the best among them. The key lies in choosing this subset such that it is likely to contain at least one satisfactory solution. The quality or satisfaction level of this selected subset can be quantified (Chen 1996). A comparison of subset selection rules is presented in Jia et al. (2006) and the multi-objective case is treated in Teng et al. (2007).
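The flavor of such alignment-probability guarantees can be illustrated with a simple counting argument: if a subset of m designs is sampled uniformly at random from N designs, the probability that it contains at least one of the g best is \(1 - \binom{N-g}{m}/\binom{N}{m}\). The numbers below are purely illustrative:

```python
from math import comb

def alignment_probability(n_total, n_good, n_sampled):
    """Probability that a uniformly sampled subset of size n_sampled
    contains at least one of the n_good best of n_total designs."""
    return 1.0 - comb(n_total - n_good, n_sampled) / comb(n_total, n_sampled)

# Even a blind sample of 50 designs out of 1000 contains one of the
# top 50 with high probability:
p = alignment_probability(1000, 50, 50)
```

Sampling rules that bias the subset toward promising regions can only improve on this blind-sampling baseline, which is the intuition ordinal optimization exploits.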

Random search methods include techniques such as simulated annealing (e.g., Alrefaei and Andradóttir 1999), genetic algorithms, stochastic ruler methods (e.g., Yan and Mukai 1992), stochastic comparison (e.g., Gong et al. 2000), nested partitions (e.g., Shi and Ólafsson 2000), ant colony optimization (e.g., Dorigo and Stützle 2004; Dorigo and Blum 2005), and tabu search (e.g., Glover and Hanafi 2002). Some of these—simulated annealing, genetic algorithms, and tabu search—are described in Sect. 3.6. Ant colony optimization is described under model-based methods (cf. Sect. 3.7.2). Proofs of global convergence, i.e., convergence to a global solution, or local convergence are available for most of these algorithms (Hong and Nelson 2009) (note that these definitions differ from algebraic model-based mathematical programming, where global convergence properties ensure convergence to a local optimum regardless of the starting point).

Nested partition methods (Shi and Ólafsson 2007) attempt to adaptively sample from the feasible region. The feasible region is then partitioned, and sampling is concentrated in regions adjudged to be the most promising by the algorithm from a pre-determined collection of nested sets. Hong and Nelson (2006) propose the COMPASS algorithm, which uses a unique neighborhood structure: the most promising region is fully adaptive rather than pre-determined, and a most promising ‘index’ classifies each candidate solution based on a nearest-neighbor metric. More recently, the Adaptive Hyperbox Algorithm (Xu et al. 2013) claims superior performance on high-dimensional problems (problems with more than ten or fifteen variables); and the R-SPLINE algorithm (Wang et al. 2012), which alternates between a continuous search on a continuous piecewise-linear interpolation and a discrete neighborhood search, compares favorably as well.

A review of random search methods is presented in Andradóttir (2006) and Ólafsson (2006). Recent progress, outlines of basic algorithms, and pointers to specific references for some of these methods are presented in Bianchi et al. (2009), Hong and Nelson (2009), and Nelson (2010).

Direct search methods such as pattern search and Nelder–Mead simplex methods are elaborated on in Sect. 3.5.

3.2 Response surface methodology

Response surface methodology (RSM) is typically useful in the context of continuous optimization problems and focuses on learning input–output relationships to approximate the underlying simulation by a surface (also known as a metamodel or surrogate model) for which we define a functional form. Powerful derivative-based optimization techniques can then be applied to this functional form. The literature in RSM is vast, and equivalent approaches have variously been referred to as multi-disciplinary design optimization, metamodel-based optimization, and sequential parameter optimization. RSM was originally developed in the context of experimental design for physical processes (Box and Wilson 1951), but has since been applied to computer experiments. Metamodel-based optimization is a currently popular technique for addressing simulation optimization problems (Barton and Meckesheimer 2006; Kleijnen 2008).

figure b

Different response surface algorithms differ in the choice between regression and interpolation; the nature of the functional form used for approximation (polynomials, splines, Kriging, radial basis functions, neural networks); the choice of how many and where new samples must be taken; and how they update the response surface.

RSM approaches can either (1) build surrogate models that are effective in local regions and sequentially use these models to guide the search, or (2) build surrogate models for the entire parameter space from space-filling designs and then use them to choose samples in areas of interest, i.e., where the likelihood of finding better solutions is good according to a specified metric. A generic framework for RSM is presented in Algorithm 2.

Classical sequential RSM Originally, RSM consisted of a Phase I, where first-order models were built using samples from a design of experiments. A steepest descent rule was used to move in the estimated descent direction, and this would continue iteratively until the estimated gradient was close to zero. Then, a Phase II procedure that built a more detailed quadratic model would be used to verify optimality. A thorough introduction to response surface methodology is available in Myers et al. (2009). Recent work in the field includes automating RSM (Neddermeijer et al. 2000; Nicolai and Dekker 2009) and the capability to handle stochastic constraints (Angün et al. 2009).
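A minimal sketch of one Phase I move under illustrative choices (a least-squares fit of a first-order model to noisy samples in a box around the incumbent, followed by a fixed-length steepest-descent step; the test function and all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def rsm_phase1_step(f, x, radius=0.5, n_samples=10, step=0.2):
    """One Phase-I move: fit a first-order model y ~ b0 + b.(x' - x) to
    noisy samples around x by least squares, then take a fixed-length
    steepest-descent step along -b."""
    X = x + rng.uniform(-radius, radius, size=(n_samples, len(x)))
    y = np.array([f(xi) for xi in X])
    A = np.column_stack([np.ones(n_samples), X - x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    b = coef[1:]                          # estimated local gradient
    return x - step * b / (np.linalg.norm(b) + 1e-12)

# Hypothetical noisy quadratic with optimum at (2, -1).
def f(x):
    return float(np.sum((x - np.array([2.0, -1.0])) ** 2)) + 0.01 * rng.normal()

x = np.zeros(2)
for _ in range(30):
    x = rsm_phase1_step(f, x)
print(x)  # drifts toward (2, -1)
```

In the classical procedure, stalling of this Phase I descent would trigger the switch to a quadratic Phase II model.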

Bayesian global optimization These methods seek to build a global response surface, commonly using techniques such as Kriging/Gaussian process regression (Sacks et al. 1989; Rasmussen and Williams 2006). Subsequent samples are chosen based on some improvement metric, which may balance exploitation and exploration. The seminal paper by Jones et al. (1998), which introduced the EGO algorithm for simulations with deterministic output, uses Kriging to interpolate between function values and chooses future samples based on an expected improvement metric (Mockus et al. 1978). Examples of analogs to this for simulation optimization are provided in Huang et al. (2006) and Kleijnen et al. (2012). The use of Kriging for simulation metamodeling is explored in van Beers and Kleijnen (2004), Kleijnen and Beers (2005) and Kleijnen (2009). Other criteria that have been used to choose samples are most probable improvement (Mockus 1989), knowledge gradient for continuous parameters (Scott et al. 2011), and maximum information gain (Srinivas et al. 2012).
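The expected improvement metric has a closed form when the surrogate's prediction at a candidate point is normal with mean \(\mu\) and standard deviation \(\sigma\). A standalone sketch for minimization (without the surrounding Kriging machinery, which would supply \(\mu\) and \(\sigma\)):

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected improvement (for minimization) at a candidate point whose
    surrogate posterior is Normal(mu, sigma^2); f_best is the best
    objective value observed so far."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal PDF
    return (f_best - mu) * Phi + sigma * phi

# A point predicted to merely match the incumbent but with high
# uncertainty still has positive expected improvement (exploration).
print(expected_improvement(mu=1.0, sigma=0.5, f_best=1.0))  # ~0.199
```

The next sample is then taken where this metric is maximized over the parameter space.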

Trust region methods Trust region methods (Conn et al. 2000) can be used to implement sequential RSM. Trust regions provide a means of controlling the region of approximation and criteria for updating surrogate models, and are useful in analyzing convergence properties. Once a metamodel or response surface, g, is built around a trust region center \(x_i\), trust region algorithms involve the solution of the trust-region subproblem (\(\min _s g(x_i+s): s \in {\mathcal {B}}(x_i,\varDelta )\)), where \({\mathcal {B}}\) is a ball defined by the center-radius pair \((x_i, \varDelta )\). There are well-defined criteria to update the trust region center and radius (Conn et al. 2000) that will define the subsequent region of approximation.
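The radius update logic can be sketched as follows, using the ratio \(\rho\) of actual to predicted reduction after solving the subproblem; the thresholds shown are typical textbook choices, not the only valid ones:

```python
def trust_region_update(rho, delta, eta1=0.25, eta2=0.75,
                        gamma_dec=0.5, gamma_inc=2.0):
    """Standard trust-region radius update sketch. rho is the ratio of
    actual to predicted reduction; returns (new radius, step accepted).
    Threshold values are illustrative defaults."""
    if rho < eta1:                      # poor model agreement: reject, shrink
        return delta * gamma_dec, False
    if rho > eta2:                      # very good agreement: accept, expand
        return delta * gamma_inc, True
    return delta, True                  # acceptable: accept, keep radius

print(trust_region_update(0.9, 1.0))  # (2.0, True)
```

Accepted steps also move the trust region center to the new iterate, which defines the next region of approximation.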

The use of trust regions in simulation optimization is relatively recent, and has been investigated to some extent (Deng and Ferris 2006; Chang et al. 2013). Trust-region algorithms have been used, for example, to optimize simulations of urban traffic networks (Osorio and Bierlaire 2010).

3.3 Gradient-based methods

Stochastic approximation methods or gradient-based approaches are those that attempt to descend using estimated gradient information. Stochastic approximation techniques are one of the oldest methods for simulation optimization. Robbins and Monro (1951) and Kiefer and Wolfowitz (1952) were the first to develop stochastic approximation schemes in the early 1950s. These procedures initially were meant to be used under very restrictive conditions, but much progress has been made since then.

These methods can be thought of as analogous to steepest descent methods in derivative-based optimization. One may obtain direct gradients or may estimate gradients using some finite difference scheme. Direct gradients may be calculated by a number of methods: (1) Perturbation Analysis (specifically, Infinitesimal Perturbation Analysis) (PA or IPA), (2) Likelihood Ratio/Score Function (LR/SF), and (3) Frequency Domain Analysis (FDA). Detailed books on these methods are available in the literature (Ho and Cao 1991; Glasserman 1991; Rubinstein and Shapiro 1993; Pflug 1996; Fu and Hu 1997) and more high-level descriptions are available in papers (Tekin and Sabuncuoglu 2004; Fu 2002). Most of these direct methods, however, are either applicable to specific kinds of problems, need some information about underlying distributions, or are difficult to apply. Fu (2002) outlines which methods are applicable in which situations, and Tekin and Sabuncuoglu (2004) discuss a number of applications that have used these methods.

Stochastic approximation schemes attempt to estimate a gradient by means of finite differences. Typically, a forward difference estimate would involve sampling at least \(n + 1\) distinct points, but superior performance has been observed with simultaneous perturbation estimates that require samples at just two points (Spall 2003a), a method referred to as Simultaneous Perturbation Stochastic Approximation (SPSA). The advantage gained in SPSA is that the number of samples required is now independent of the problem size, and, interestingly, SPSA has been shown to have the same asymptotic convergence rate as the naive method that requires \(n + 1\) points (Spall 1992). A typical gradient-based scheme is outlined in Algorithm 3.

figure c
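A sketch of the two-evaluation SPSA gradient estimate, followed by a descent loop on a hypothetical noisy quadratic; the gain sequences use commonly suggested decay exponents, but all constants here are illustrative:

```python
import numpy as np

def spsa_gradient(f, x, c, rng):
    """Two-evaluation simultaneous perturbation gradient estimate using
    a random Rademacher (+/-1) perturbation direction of size c."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + c * delta) - f(x - c * delta)) / (2.0 * c) / delta

# Hypothetical noisy quadratic with optimum at [1, 1, 1].
rng = np.random.default_rng(1)
f = lambda x: np.sum((x - 1.0) ** 2) + 0.01 * rng.normal()

x = np.zeros(3)
for k in range(1, 501):
    a_k, c_k = 0.1 / k ** 0.602, 0.1 / k ** 0.101  # decaying gain sequences
    x = x - a_k * spsa_gradient(f, x, c_k, rng)
print(x)  # approaches [1, 1, 1]
```

Note that only two simulation evaluations per iteration are used regardless of the dimension of x.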

Recent extensions of the SPSA method include introducing a global search component to the algorithm by injecting Monte Carlo noise during the update step (Maryak and Chin 2008), and using it to solve combined discrete/continuous optimization problems (Wang and Spall 2011). Recent work also addresses improving Jacobian as well as Hessian estimates in the context of the SPSA algorithm (Spall 2009). Much of the progress in stochastic approximation has been cataloged in the proceedings of the Winter Simulation Conference over the years (http://informs-sim.org/). A recent review of stochastic approximation methods is available in Spall (2012), and an excellent tutorial and review of results in stochastic approximation is presented in Pasupathy and Kim (2011).

3.4 Sample path optimization

Sample path optimization involves working with an estimate of the underlying unknown function, as opposed to the function itself. The estimate is usually a consistent estimator such as the sample mean of independent function evaluations at a point, or replications. For instance, one may work with \(F_n = \frac{1}{n}\sum _{i=1}^n f(x,y,\omega _i)\), instead of the underlying function \({E}[f(x,y,\omega )]\) itself. It should be noted that the functional form of \(F_n\) is still unknown; it is just that \(F_n\) can be observed or evaluated at a point in the search space visited by an algorithm iteration. The alternative name of sample average approximation reflects this use of an estimator.

As the algorithm now has to work with an estimator, a deterministic realization of the underlying stochastic function, sophisticated techniques from traditional mathematical programming can now be leveraged. Sample path methods can be viewed as the use of deterministic optimization techniques within a well-defined stochastic setting. Yet another name for them is stochastic counterpart. Some of the first papers using sample path methods are Healy and Schruben (1991) and Shapiro (1991). Several papers (Rubinstein and Shapiro 1993; Chen and Schmeiser 1994; Gürkan et al. 1994; Shapiro 1996; Deng and Ferris 2006) discuss convergence results and algorithms in this context.
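The essential idea—freezing a sample path so that deterministic machinery applies—can be sketched in a few lines; the objective and the grid-search "optimizer" below are merely illustrative stand-ins:

```python
import numpy as np

def make_sample_average(f, omegas):
    """Freeze a sample path: F_n(x) = (1/n) * sum_i f(x, omega_i).
    Once the omegas are fixed, F_n is deterministic in x, and any
    deterministic optimizer can be applied to it."""
    return lambda x: np.mean([f(x, w) for w in omegas])

# Hypothetical noisy objective: E[f(x, w)] = (x - 2)^2, additive noise w.
f = lambda x, w: (x - 2.0) ** 2 + w
rng = np.random.default_rng(0)
F_n = make_sample_average(f, rng.normal(0.0, 0.1, size=200))

# Crude deterministic minimization of F_n by grid search over [0, 4].
grid = np.linspace(0.0, 4.0, 401)
x_star = grid[np.argmin([F_n(x) for x in grid])]
print(x_star)  # 2.0
```

In practice, the grid search would be replaced by a sophisticated deterministic solver, which is precisely the appeal of the approach.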

3.5 Direct search methods

Direct search can be defined as the sequential examination of trial solutions generated by a certain strategy (Hooke and Jeeves 1961). As opposed to stochastic approximation, direct search methods rely on direct comparison of function values without attempting to approximate derivatives. Direct search methods typically rely on a ranking of the quality of points, rather than on the function values themselves.

Most direct search algorithms developed for simulation optimization are extensions of ideas for derivative-free optimization. A comprehensive review of classical and modern methods is provided in Kolda et al. (2003). A formal theory of direct search methods for stochastic optimization is developed in Trosset (2000). Direct search methods can be tailored for both discrete and continuous optimization settings. Pattern search and Nelder–Mead simplex procedures are the most popular direct search methods. There is some classical as well as relatively recent work done on investigating both pattern search methods (Trosset 2000; Anderson and Ferris 2001; Lucidi and Sciandrone 2002) and Nelder–Mead simplex algorithms (Nelder and Mead 1965; Barton and Ivey 1996; Humphrey and Wilson 2000; Chang 2012) and their convergence in the context of simulation optimization.

These methods remain attractive as they are relatively easy to describe and implement, and are not affected if a gradient does not exist everywhere, as they do not rely on gradient information. Since conventional procedures can be affected by noise, effective sampling schemes to control the noise are required. A basic Nelder–Mead procedure is outlined in Algorithm 4.

figure d
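A compact deterministic version of such a procedure is sketched below, with the standard reflection, expansion, contraction, and shrink coefficients; it does no noise handling, so for stochastic outputs f would have to average sufficiently many replications:

```python
import numpy as np

def nelder_mead(f, x0, step=0.5, iters=200):
    """Minimal Nelder-Mead sketch (reflection, expansion, inside
    contraction, shrink) with standard coefficients; deterministic f."""
    n = len(x0)
    simplex = [np.asarray(x0, dtype=float)]
    simplex += [simplex[0] + step * np.eye(n)[i] for i in range(n)]
    for _ in range(iters):
        simplex.sort(key=f)
        best, second, worst = simplex[0], simplex[-2], simplex[-1]
        centroid = np.mean(simplex[:-1], axis=0)
        xr = centroid + (centroid - worst)             # reflection
        if f(xr) < f(best):
            xe = centroid + 2.0 * (centroid - worst)   # expansion
            simplex[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(second):
            simplex[-1] = xr
        else:
            xc = centroid + 0.5 * (worst - centroid)   # contraction
            if f(xc) < f(worst):
                simplex[-1] = xc
            else:                                      # shrink toward best
                simplex = [best] + [best + 0.5 * (v - best)
                                    for v in simplex[1:]]
    return min(simplex, key=f)

x = nelder_mead(lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2, [0.0, 0.0])
print(x)  # converges to about [1, -2]
```

Variants tailored to simulation optimization, such as those cited above, modify the comparison and re-sampling steps to account for noise.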

3.6 Random search methods

3.6.1 Genetic algorithms

Genetic algorithms use concepts of mutation and selection (Reeves 1997; Whitley 1994). In general, a genetic algorithm works by creating a population of strings, called chromosomes, each of which is a vector encoding a point in the search space. New chromosomes are created using selection, mutation, and crossover functions. The selection process is guided by evaluating the fitness (or objective function value) of each chromosome and selecting chromosomes according to their fitness values (using methods such as mapping onto a roulette wheel). Additional chromosomes are then generated using crossover and mutation functions, which ensure that a diversity of solutions is maintained. Genetic algorithms are popular because they are easy to implement, and they are used in several commercial simulation optimization software packages (Table 4). The GECCO (Genetic and Evolutionary Computation Conference) catalogs progress in genetic algorithms and implementations.
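A minimal real-coded sketch of these ingredients—roulette-wheel selection, one-point crossover, and Gaussian mutation—is shown below; the encoding, operators, and all parameter values are illustrative rather than recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_algorithm(fitness, dim, pop_size=40, generations=60):
    """Minimal real-coded GA for minimization: roulette-wheel selection,
    one-point crossover, Gaussian mutation. Illustrative sketch only."""
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
    for _ in range(generations):
        fit = np.array([fitness(c) for c in pop])
        weights = fit.max() - fit + 1e-9    # invert: low f -> high weight
        parents = pop[rng.choice(pop_size, size=pop_size,
                                 p=weights / weights.sum())]
        cut = int(rng.integers(1, dim)) if dim > 1 else 0
        children = np.concatenate([parents[::2, :cut],
                                   parents[1::2, cut:]], axis=1)
        pop = np.vstack([children, parents[pop_size // 2:]])
        pop = pop + rng.normal(0.0, 0.1, pop.shape)   # mutation
    fit = np.array([fitness(c) for c in pop])
    return pop[np.argmin(fit)]

best = genetic_algorithm(lambda x: float(np.sum(x ** 2)), dim=2)
print(best)  # chromosome near the optimum [0, 0]
```

Production implementations add elitism, tuned operator rates, and, for stochastic simulations, replication schemes for fitness evaluation.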

3.6.2 Simulated annealing

Simulated annealing uses a probabilistic method derived from the physical annealing process, in which a material is slowly cooled so that, while its structure freezes, it reaches a minimum-energy state (Kirkpatrick et al. 1983; Bertsimas and Tsitsiklis 1993). Starting with a current point i in a state j, a neighborhood point \(i'\) of the point i is generated. The algorithm moves from point i to \(i'\) using a probabilistic criterion that depends on the ‘temperature’ in state j. This temperature is analogous to that in physical annealing, and serves here as a control parameter. If the solution at \(i'\) is better than the existing solution, then this new point is accepted. If the new solution is worse than the existing one, then the probability of accepting the point is defined as \(\exp (-(f(i')- f(i))/T(j))\), where f(.) is the value of the objective function at a given point, and T(j) is the temperature in state j. After a certain number of neighborhood points have been evaluated, the temperature is decreased and a new state \(j+1\) is created. Due to the exponential form, the probability of accepting a neighborhood point is higher at high temperatures and lower as the temperature is reduced. In this way, the algorithm explores a large number of neighborhood points in the beginning, but fewer points as the temperature is reduced.

Implementation of simulated annealing procedures requires choosing parameters such as the initial and final temperatures, the rate of cooling, and the number of function evaluations at each temperature. A variety of cooling ‘schedules’ have been suggested in Collins et al. (1988) and Hajek (1988). Though simulated annealing was originally meant for optimizing deterministic functions, the framework has been extended to the case of stochastic simulations (Alkhamis et al. 1999). Simulated annealing is easy to implement, and it remains a popular technique used by several commercial simulation optimization packages.
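A bare-bones version of the procedure with a geometric cooling schedule can be sketched as follows; the objective, neighborhood, and all schedule parameters are illustrative:

```python
import math, random

random.seed(0)

def simulated_annealing(f, x0, neighbor, T0=1.0, cooling=0.95,
                        steps_per_temp=50, T_min=1e-3):
    """Bare-bones simulated annealing with geometric cooling.
    All schedule parameters here are illustrative choices."""
    x, fx = x0, f(x0)
    best, f_best = x, fx
    T = T0
    while T > T_min:
        for _ in range(steps_per_temp):
            y = neighbor(x)
            fy = f(y)
            # Always accept improvements; accept worse moves with
            # probability exp(-(f(y) - f(x)) / T).
            if fy <= fx or random.random() < math.exp(-(fy - fx) / T):
                x, fx = y, fy
                if fx < f_best:
                    best, f_best = x, fx
        T *= cooling
    return best, f_best

# Hypothetical 1-D multimodal objective (global minimum near x = -0.31).
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
best, f_best = simulated_annealing(
    f, x0=4.0, neighbor=lambda x: x + random.uniform(-0.5, 0.5))
print(best, f_best)
```

For stochastic simulations, f would be replaced by an estimator with a suitable replication scheme, as in the extensions cited above.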

3.6.3 Tabu search

Tabu search (Glover 1990) uses special memory structures (short-term and long-term) during the search process that allow the method to go beyond local optimality to explore promising regions of the search space. The basic form of tabu search consists of a modified neighborhood search procedure that employs adaptive memory to keep track of relevant solution history, together with strategies for exploiting this memory (Gendreau and Potvin 2010). More advanced forms of tabu search and its applications are described in Glover and Laguna (1997).

3.6.4 Scatter search

Scatter search and its generalized form, path relinking, were originally introduced by Glover and Laguna (2000). Scatter search differs from other evolutionary approaches [such as Genetic Algorithms (GA)] by using strategic designs and search path construction over a population of solutions, as opposed to randomization (by crossover and mutation in GAs). Like tabu search, scatter search also utilizes adaptive memory to store the best solutions (Glover and Laguna 2000; Martí et al. 2006). Algorithm 5 outlines the scatter search procedure.

figure e

3.7 Model-based methods

Model-based simulation optimization methods attempt to build a probability distribution over the space of solutions and use it to guide the search process.

3.7.1 Estimation of distribution algorithms

Estimation of distribution algorithms (EDAs) (Larrañaga and Lozano 2002) are model-based methods that belong to the evolutionary computation field. However, the generation of new candidate solutions is done by sampling from the inferred probability distribution over the space of solutions, rather than by, say, a genetic operator such as crossover or mutation. A comprehensive review of estimation of distribution algorithms is presented in Fu et al. (1996). EDAs usually consider interactions between the problem variables and exploit them through different probability models.

Cross-entropy methods and Model Reference Adaptive Search (MRAS) are discussed next and can be seen as specific instances of EDAs.

Cross-entropy methods Cross-entropy methods first sample randomly from a chosen probability distribution over the space of decision variables. For each sample, which is a vector defining a point in decision space, a corresponding function evaluation is obtained. Based on the function values observed, a pre-defined percentile of the best samples is picked. A new distribution is built around this ‘elite set’ of points via maximum likelihood estimation or some other fitting method, and the process is repeated. One possible method that implements cross-entropy is formally described in Algorithm 6.

figure f
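One concrete (illustrative) instantiation of this sample–select–refit loop with an independent Gaussian sampling distribution can be sketched as follows; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy_minimize(f, mu, sigma, n_samples=100, elite_frac=0.1,
                           iterations=50):
    """Cross-entropy sketch: sample from an independent Gaussian, keep
    the elite fraction, refit mean and standard deviation by maximum
    likelihood, repeat. Parameter values are illustrative."""
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(iterations):
        X = rng.normal(mu, sigma, size=(n_samples, len(mu)))
        elite = X[np.argsort([f(x) for x in X])[:n_elite]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-8   # guard against degeneracy
    return mu

mu = cross_entropy_minimize(lambda x: float(np.sum((x - 3.0) ** 2)),
                            mu=np.zeros(2), sigma=5.0 * np.ones(2))
print(mu)  # close to [3, 3]
```

The smoothed updates and noise-handling variants discussed below modify the refitting step of this basic loop.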

The method is guaranteed (probabilistically) to converge to a local optimum, but it also incorporates an exploration component, as random samples are obtained at each step. In fact, the selection of subsequent samples can be shown to be analogous to minimizing the Kullback–Leibler divergence (KL-divergence) between the optimal importance sampling distribution and the distribution used in the current iterate (Rubinstein and Kroese 2004).

There exist variants of the cross-entropy method to address both continuous (Kroese et al. 2006) and discrete optimization (Rubinstein 1999) problems. A possible modification is to use mixtures of distributions from current and previous iterations, with the current distribution weighted higher. In the case of Gaussian distributions, this can be done by linearly interpolating the mean and covariance, which also helps in avoiding singular covariance matrices. Cross-entropy can also deal with noisy function evaluations, with irrelevant decision variables, and with constraints (Kroese et al. 2006). If decision variables are correlated, the covariance of the distribution will reflect this.

The immediately apparent merits of cross-entropy methods are that they are easy to implement, require few algorithmic parameters, are based on fundamental principles such as KL-divergence and maximum likelihood, and give consistently accurate results (Kroese et al. 2006). A potential drawback is that cross-entropy may require a significant number of new samples at every iteration. It is not clear as to how this would affect performance if samples were expensive to obtain. The cross-entropy method has analogs in simulated annealing, genetic algorithms, and ant colony optimization, but differs from each of these in important ways (de Boer et al. 2005).

More detailed information on the use of cross-entropy methods for optimization can be found in de Boer et al. (2005), a tutorial on cross-entropy and in Rubinstein and Kroese (2004), a monograph. The cross-entropy webpage, http://iew3.technion.ac.il/CE/ provides up-to-date information on progress in the field.

Model reference adaptive search (MRAS) The MRAS method (Hu et al. 2005, 2007) is closely related to the cross-entropy method. It also works by minimizing the Kullback–Leibler divergence to update the parameters of the inferred probability distribution. However, the parameter update step involves the use of a sequence of implicit probability distributions. In other words, while the cross-entropy method uses the optimal importance sampling distribution for parameter updates, MRAS minimizes the KL-divergence with respect to the distribution in the current iteration, called the reference model.

Covariance matrix adaptation–evolution strategy (CMA-ES) In the CMA-ES algorithm (Hansen 2006), new samples are obtained from a multivariate normal distribution, and inter-variable dependencies are encoded in the covariance matrix. The CMA-ES method provides a way to update the covariance matrix. Updating the covariance matrix is analogous to learning an approximate inverse Hessian, as is used in quasi-Newton methods in mathematical programming. The update of the mean and covariance is done by maximizing the likelihood of previously successful candidate solutions and search steps, respectively. This is in contrast to other EDAs and the cross-entropy method, where the covariance is updated by maximizing the likelihood of the successful points. Other sophistications, such as step-size control and weighting of candidate solutions, are part of modern implementations (Hansen 2011).

3.7.2 Ant colony optimization

Ant colony optimization methods (Dorigo and Stützle 2004; Dorigo and Blum 2005) are heuristic methods that have been used for combinatorial optimization problems. Conceptually, they mimic the behavior of ants finding shortest paths between their colony and food sources. Ants deposit pheromones as they walk, and are more likely to choose paths with higher concentrations of pheromones. This phenomenon is incorporated in a pheromone update rule, which increases the pheromone content in components of high-quality solutions and causes evaporation of pheromones in less favorable regions. Probability distributions are used to make the transition between each iteration. These methods differ from EDAs in that they use an iterative construction of solutions.

This and other algorithms that incorporate self-organization in biological systems are said to use the concept of ‘swarm intelligence’.

3.8 Lipschitzian optimization

Lipschitzian optimization is a class of space-partitioning algorithms for performing global optimization, where the Lipschitz constant is pre-specified. This enables the construction of global search algorithms with convergence guarantees. The caveat of having prior knowledge of the Lipschitz constant is overcome by the DIRECT (DIviding RECTangles) algorithm (Jones et al. 1993) for deterministic continuous optimization problems. An adaptation of this for noisy problems is provided in Deng and Ferris (2007).

4 Software

4.1 Simulation optimization in commercial simulation software

Many discrete-event simulation packages incorporate some methodology for performing optimization. A comprehensive listing of simulation software, the corresponding vendors, and the optimization packages and techniques they use can be found in Table 4. More details on the specific optimization routines can be found in Law and Kelton (2000). OR/MS-Today, the online magazine of INFORMS, conducts a biennial survey of simulation software packages, the latest of which is available at http://www.orms-today.org/surveys/Simulation/Simulation.html. The survey lists 43 simulation software packages, and 31 of these have some sort of optimization routine; fewer still have black-box optimizers that interact with the simulation.

Table 4 Simulation optimization packages in commercial simulation software

4.2 Academic implementations of simulation optimization

Table 5 contains a small subset of academic implementations of SO algorithms, and classifies them by type. Some of these are available for download from the web, some have code with suggested parameters in corresponding papers themselves, and others are available upon request from the authors.

Table 5 Academic simulation optimization implementations

5 Comparison of algorithms

As far as comparisons between algorithms are concerned, the literature does not yet provide a comprehensive survey of the performance of different implementations and approaches on large test beds. In this regard, simulation optimization lags behind other optimization fields such as linear, integer, and nonlinear programming, global optimization, and even derivative-free optimization, where the first comprehensive comparison appeared in 2013 (Rios and Sahinidis 2013). A study of prior comparisons in simulation optimization is provided by Tekin and Sabuncuoglu (2004), but these comparisons are fairly dated, are inconclusive about which algorithms perform better in different situations, and compare only a small subset of available algorithms. One difficulty is inherent to comparing solutions between algorithms over true black-box simulations: one does not usually know the true optimal point and can only compare the noisy estimates observed by the solvers. Less impeding difficulties, but difficulties nonetheless, include the need to interface algorithms to a common wrapper, the objective comparison with solvers that incorporate random elements (as their results may not be reproducible), and the lack of standard test simulations for purposes of benchmarking.

The benchmarking of algorithms in mathematical programming is usually done by performance profiles (Dolan and Moré 2002), where the graphs show the fraction of problems solved after a certain time. For derivative-free algorithms, data profiles are commonly used (Moré and Wild 2009), where the fraction of problems solved after a certain number of iterations (function evaluations) or ‘simplex gradients’ is shown. The definition of when a problem is ‘solved’ may vary: when the true global optimum is known, a problem may be considered solved once a solution within a certain tolerance of the optimal value is found; when this optimum is not known, the solvers that find the best solution for a problem (within a tolerance, with respect to the other solvers being compared) may be said to have solved the problem. The latter metric may also be used when function evaluations are expensive and no solver is able to reach within this tolerance given the limited simulation budget.
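For a single solver, computing the points of a data profile reduces to counting, for each budget, the fraction of problems solved within that budget; a sketch with hypothetical results:

```python
import numpy as np

def data_profile(evals_to_solve, budgets):
    """Fraction of problems a solver has 'solved' within each budget.
    evals_to_solve[i] is the number of function evaluations the solver
    needed on problem i (np.inf if it never solved the problem)."""
    evals = np.asarray(evals_to_solve, dtype=float)
    return [float(np.mean(evals <= b)) for b in budgets]

# Hypothetical results for one solver on five test problems.
profile = data_profile([10, 25, 40, np.inf, 15], budgets=[10, 20, 50])
print(profile)  # [0.2, 0.4, 0.8]
```

Plotting these fractions against the budgets (often normalized by dimension as 'simplex gradients') gives the data profile curve for that solver.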

In both of these cases, the outputs of the simulations are deterministic, and so it is clear which algorithms have performed better than others on a particular problem. In simulation optimization, however, one usually neither knows the true solution for the black-box system nor sees deterministic output. All that one possesses are mean values and sample variances obtained from sample paths at different points. There does not exist a standard method to compare simulation optimization algorithms on large test beds. Many papers perform several macroreplications and report the macroreplicate average of the best sample means (along with the associated sample variance) at the end of the simulation budget. The issue with this is that the performance of the algorithms with different simulation budgets is not seen, as it is in the case of performance or data profiles. Other papers report the average number of evaluations taken to find a sample mean that is within the global tolerance for each problem; here, results are listed for each problem and one does not get an idea of overall performance. In addition, the difference in sample variance estimates is not highlighted. As simulation optimization develops, there is also a need for methods of comparing algorithms on test beds with a statistically significant number of problems.

With regard to standardized simulation testbeds, to our knowledge, the only testbed that provides practical simulations for testing simulation optimization algorithms is available at www.simopt.org (Pasupathy and Henderson 2011). At the time of writing this paper, just 20 continuous optimization problems were available from this repository. Most testing and comparisons happen with classical test problems in nonlinear optimization (many of which have been compiled in Rios and Sahinidis (2013) and are available at http://archimedes.cheme.cmu.edu/?q=dfocomp), to which stochastic noise has been added. There is a need for more such repositories, not only for the testing of algorithms over statistically significant sizes of problem sets, but also for comparison between different classes of algorithms. The need for comparison is evident, given the sheer number of available approaches to solving simulation optimization problems and the lack of clarity and consensus on which types of algorithms are suitable in which contexts.

As observed by several papers (Fu et al. 2000; Tekin and Sabuncuoglu 2004; Hong and Nelson 2009), there continues to exist a significant gap between research and practice in terms of algorithmic approaches. Optimizers bundled with simulation software, as observed in Sect. 4, tend to make use of algorithms which seem to work well but do not come with provable statistical properties or guarantees of local or global convergence. Academic papers, on the other hand, emphasize methods that are more sophisticated and prove convergence properties. One reason that may contribute to this is that very few simulation optimization algorithms arising from the research community are easily accessible. We wholeheartedly encourage researchers to post their executable files, if not their source code. This would not only encourage practitioners to use these techniques in practice, but also allow for comparisons between methods and the development of standardized interfaces between simulations and simulation optimization software.

6 Conclusions

The field of simulation optimization has progressed significantly in the last decade, with several new algorithms, implementations, and applications. Contributions to the field arise from researchers and practitioners in the industrial engineering/operations research, mathematical programming, statistics and machine learning, as well as the computer science communities. The use of simulation to model complex, dynamic, and stochastic systems has only increased with computing power and availability of a wide variety of simulation languages. This increased use is reflected in the identification and application of simulation and simulation optimization methods to diverse fields in science, engineering, and business. There also exist strong analogies between, and ideas that may be borrowed from recent progress in related fields. All of these factors, along with the ever increasing number of publications and rich literature in this area, clearly indicate the interest in the field of simulation optimization, and we have tried to capture this in this paper.

With increased growth and interest in the field, there come also opportunities. Potential directions for the field of simulation optimization are almost immediately apparent. Apart from the ability to handle simulation outputs from any well-defined probability distribution, the effective use of variance reduction techniques when possible, and the improvement in theory and algorithms, there is a requirement to address (1) large-scale problems with combined discrete/continuous variables; (2) the ability to effectively handle stochastic and deterministic constraints of various kinds; (3) the effective utilization of parallel computing at the linear algebra level, sample replication level, iteration level, as well as at the algorithmic level; (4) the effective handling of multiple simulation outputs; (5) the incorporation of performance measures other than expected values, such as risk; (6) the continued consolidation of various techniques and their potential synergy in hybrid algorithms; (7) the use of automatic differentiation techniques in the estimation of simulation derivatives when possible; (8) the continued emphasis on providing guarantees of convergence to optima for local and global optimization routines in general settings; (9) the availability and ease of comparison of the performance of available approaches on different applications; and (10) the continued reflection of sophisticated methodology arising from the literature in commercial simulation packages.