## 1 Introduction

Optimization of Water Distribution Networks (WDNs) operations has been a very important field for the Operation Research (OR) community at least in the last 40 years [1] and many tools from mathematical programming as well as metaheuristics have been proposed. An updated review on optimal water distribution is given in [2] where several classes of existing solutions, including linear programming, nonlinear programming, dynamic programming, metamodeling, heuristics, and metaheuristics are deeply analyzed and referenced. Different issues are considered, such as solutions for sensors placement and leakage detection and localization [3], and more in detail optimization of pumping operations.

In this paper the authors are concerned with pump scheduling optimization (PSO): which pumps are to be operated and with which settings at different periods of the day, so that the energy cost, the largest operational cost for water utilities, is minimized. Nonlinearities, and binary decisions in the case of ON/OFF pumps, make PSO computationally challenging, even for small WDNs [4]. While mathematical programming approaches linearize/convexify the equations regulating the flow distribution in the WDN, the more recent optimization strategies use a hydraulic simulation software which can solve all the equations and provide computed values relatively to objective function (e.g. energy costs) and hydraulic feasibility (e.g. satisfaction of the demand, pressures within a given range, tanks level within min–max range, etc.).

The decision variables can be formulated in 2 ways: the ON/OFF pump state during fixed time intervals or the start/end run times of pumps [5]. Even if the latter results in a decrease in the number of decision variables, the former is the most widely used and will be considered in this paper.

If there are storage tanks in the network, cost reductions can be made by concentrating most of pumping during the time windows when electricity is least expensive and by using this water capacity during the remaining part of the day to meet consumer demand. According to this aim, an accurate demand forecasting system, as the one proposed in [6,7,8], can be used to set the demand, which basically drives every simulation run, in order to optimize pump schedule in advance. The cost is generally associated with the cost of energy for pumping, and the system performance requires satisfying operational constraints as: assuring pumps can satisfy user demand, keeping pressures within certain bounds to reduce leakage and the risk of pipe burst, keeping reservoir levels within bounds to avoid overflow.

As the sizes of the tested example networks increase from simplified hypothetical examples to large real networks, the number of variables and objectives considered grows leading to increase in the number of function evaluations, computational resources (simulation time increases with the number of hydraulic components of the WDN) and memory requirements. Moreover, the headloss equations used to model water flow in pipes lead to nonlinear optimization problems [9].

Linear approaches have been considered extensively in the past, emphasizing their advantage for obtaining fast solutions, however they require linearity for objective function and constraints, and their performance depends on finding a suitable linearization of the problem at hand. Pasha and Lansey in [10], to linearize the problem, rely on the relationship between energy, pump flow, user demand and tank water levels, while in [11] head-loss convex equation from water supply systems was iteratively linearized and incorporated in linear programming optimization models. As the number of decision variables and constraints increases considerably with the number of time-control intervals, this leads to increased computational and memory requirements. Also nonlinear programming has been used, however since the problem is non-convex, there is no guarantee that the global optimum will be found [12].

Since the 1990’s metaheuristics, such as Genetic Algorithms and Simulated Annealing among others, have been increasingly applied given their potential to solve nonlinear, nonconvex and even black-box problems [13] where traditional mathematical programming methods would face difficulties.

In [14] a hybrid approach was applied where genetic algorithm was coupled with two hill-climber search algorithms for improving local search and even though this approach allowed for better performance, in terms of efficiency, than classical genetic algorithms, it still could not be applied in near-real time conditions. Similarly, [12] applied hydraulic network linearization for two stage simulated annealing approach and while the approach finds global near optimal solution it is still limited to off-line optimization problems. Overall, metaheuristic optimization methods offer versatility and a problem-independent global approach that can be applied to complex problems, but over large search spaces, the coupling with the network simulator results in lengthy computation when searching for the optimum solution. Generally, the application of these methods to pump scheduling is rarer since the solutions need to be produced both rapidly and reliably. For these reasons, more efficient deterministic methods have been applied but the simulation runs might still constrain their application to mid-size networks. Thus, all the optimization approaches require to reduce the computational load of the simulator (which is usually the open source EPANET 2.0 [15]): one way to do this is “internal” working on the mathematical model of the hydraulic components, the other is “external” and takes the form of fitting a surrogate function or meta-model which approximates the simulation environment and can be used to drive global optimization strategies.

In this paper we propose Bayesian Optimization (BO) for the solution of PSO, as it is suitable to solve simulation–optimization problems and/or to be applied when the objective function is a black-box as well as it is computationally expensive to evaluate.

The diagonal approach [16] offers global convergence properties and guaranteed accuracy estimation of the global solutions, e.g., in the case of Lipschitz global optimization [17, 18]. A significant application of these methods to hyper-parameter optimization is given, e.g. in [19] for the case of SVM regression or in [20] for the case of signal processing. Random Search [21, 22] and the related metaheuristics like simulated annealing, evolutionary/genetic algorithms, multistart&clustering have recently received fresh attention from the machine learning community and is increasingly considered as a baseline for global optimization as in Hyperband [23]. A further approach is related to the adoption of probabilistic branch and bound for level set estimation [24]. Sequential Model Based Optimization (SMBO), and in particular Bayesian Optimization, came to dominate the landscape of hyper-parameter optimization in the machine learning community [25,26,27]. BO’s key ingredients are a probabilistic model which captures our beliefs—given the evaluations already performed—and an acquisition function which computes the “utility” of each candidate point for the next evaluation of f. The Bayesian framework is particularly useful when evaluations of f are costly, no derivatives are available and/or f is nonconvex/multimodal.

Two alternatives BO approaches are proposed in this paper: a method using Gaussian Processes (GP) to fit the surrogate model and another using Random Forest (RF), which is an ensemble of Decision Trees. Although GP is the most widely adopted approach, RF recently turns out to be more computationally efficient [28], in particular when some decision variables are discrete and/or conditional. A similar framework has been proposed to address Kernel-based Clustering in a leakage localization application [29]. Both GP and RF are presented in Sect. 3.1.

The remainder of the paper is organized as follows. Section 2 presents the pump scheduling optimization (PSO) problem in Water Distribution Networks (WDN), addressing both ON/OFF pumps and Variable Speed Pumps (VSP) and by considering, in the optimization model, the time patterns of demand and energy price. The well-known EPANET 2.0 simulator is used to check, for every pump schedule, that hydraulic constraints are not violated, demand is met and to compute the associated energy cost. Section 3 is devoted to introduce some background on BO, focusing on surrogate models and acquisition functions considered in the implementation of the proposed BO framework. Section 4 presents the computational results obtained on both test functions and the PSO problem, by comparing different alternatives in the BO framework’s set up. Results on WDN refer to two different case studies: a benchmark WDN widely adopted in literature (namely, Anytown) and a real-life large-scale WDN in Milan, Italy. The proposed BO approaches are compared to two commercial products: an “optimization-as-a-service” cloud-based platform (SigOptFootnote 1) and the combination of ALAMOFootnote 2 with BARONFootnote 3 for the generation of an accurate surrogate model of the analyzed system and the successive optimization process. Finally, Sect. 5 summarizes relevant findings.

## 2 Pump scheduling optimization

The goal of PSO is to identify the pump schedule, consisting in the status of each pump over time, associated to the lowest energy cost while satisfying hydraulic feasibility (i.e., water demand satisfaction, pressure level within a given operational range, etc.). Status of a pump is its activation—in the case of an ON/OFF pump—or its speed—in the case of a VSP, leading to discrete or continuous decision variables, respectively. Thus, a general formulation of the PSO, involving both the two types of decision variables (i.e. a WDN with both ON/OFF pumps and VSP), is reported as follows:

Objective function (minimizing the energy costs)

$$\hbox{min} C = \Delta t \mathop \sum \limits_{t = 1}^{T} c_{E}^{t} \left( {\mathop \sum \limits_{{k_{b} = 1}}^{{N_{b} }} \gamma x_{{k_{b} }}^{t} Q_{{k_{b} }}^{t} \frac{{H_{{k_{b} }}^{t} }}{{\eta_{{k_{b} }} }}} \right. + \left. {\mathop \sum \limits_{{k_{v} = 1}}^{{N_{v} }} \gamma x_{{k_{v} }}^{t} Q_{{k_{v} }}^{t} \frac{{H_{{k_{v} }}^{t} }}{{\eta_{{k_{v} }} }}} \right)$$
(1)

where $$C - {\text{total energy cost over }} T {\text{time steps}}, \;\Delta t - {\text{time step width}}, C_{E}^{t} - {\text{energy cost at time }}t$$, $$\gamma - {\text{specific weight of water}}, \;Q_{{k_{j} }}^{t} - {\text{water flow provided through the pump }} k_{j} {\text{ at time}}\,t, \;H_{{k_{j} }}^{t} - {\text{head loss on pump }} k_{j} {\text{ at time }}\, t, \;\eta_{{k_{j} }} - {\text{efficiency of the pump }} k_{j}$$ and the decision variables are:

$$x_{{k_{b} }}^{t} \in \left\{ {0,1} \right\} {\text{with }} t = 1, \ldots , T \;{\text{and}}\;k_{b} = 1, \ldots , N_{b} - {\text{status of pump }} k_{b} {\text{ at time }} t \left( {{\text{i}} . {\text{e}} . ON = 1, OFF = 0} \right)$$
$$x_{{k_{v} }}^{t} \in \left( {0,1} \right)\; {\text{with }} t = 1, \ldots , T \;{\text{and}} \;k_{v} = 1, \ldots , N_{v} - {\text{speed of pump }} k_{v} {\text{ at time }} t$$

where N b is the number of ON/OFF pumps and N v is the number of VSPs.

The time horizon usually adopted in PSO is 1 day with hourly time steps, leading to T = 24 time steps overall. This choice is basically motivated by the energy tariffs which are typically characterized by a different energy price depending on the hour of the day.

Although the PSO objective function (1) has an apparently closed expression, its exact resolution through mathematical programming requires to approximate the flow equations (linearization or convexification) which regulates the hydraulic behaviour of a pressurized WDN. More recent approaches address the optimization of the objective function by considering it as “black-box”, computing the value associated to a given schedule through a hydraulic simulation software which solves all the flow equations. Naturally, hydraulic simulation is performed at the selected time resolution—usually hourly basis for 24 h—which is adopted for the input data (i.e. demand at consumption points) as well as the outputs obtained at the end of a simulation run (i.e. pressure at nodes, flow/velocity on the pipes).

Depending on the size and complexity of the WDN, time required for a complete simulation can significantly change. This means that in the case of very large and complex WDN evaluating a schedule may also become significantly expensive in terms of time.

Although not expressed in the formulation above, there exists a set of hydraulic constraints which are checked by the hydraulic simulation software. Even in papers reporting the analytical formulation of the constraints, such as in [30], this check is usually verified by running a simulation. A violation message is usually provided by the software in the case a given pump schedule does not satisfy at least one of the hydraulic constraints. Some examples are: a negative pressure in some point of the WDN, impossibility to supply the demand at some consumption point, imbalance of the overall hydraulic system, etc. The hydraulic simulation software used in this study is EPANET 2.0 and a detailed list of errors/warnings provided by EPANET, with respect to hydraulic unfeasibility of a pump schedule, is reported in [15].

Moreover, some further “operational” constraints can be also added to the formulation. For instance, a common requirement is to have pressure values within a given range (not just positive values). Usually, when real-time monitoring is considered, this constraint regards a limited set of nodes of the WDN, which are the most critical ones (those showing the lowest and the highest pressure in the WDN). On the other hand, as simulation can estimate pressure at every node, this constraint can be specified for all of them.

$$P_{i}^{\hbox{min} } \le P_{i}^{t} \le P_{i}^{Max} \quad i = 1, \ldots , N_{nodes}$$

where $$P_{i}^{t}$$ is the pressure on node $$i$$ at time $$t$$.

Another typical operational constraint is related to the level of water into tank(s): in many studies the level of the tank(s) at the end of the day is required to be not lower than that at the beginning of the same day. This constrain can be mathematically formulated as:

$$L_{l}^{t = 0} \le L_{l}^{t = 24}$$

where $$L_{l}^{t}$$ is the level of the $$l$$-th tanks at time $$t$$ and $$l = 1, \ldots , N_{L}$$, with $$N_{L}$$ the number of tanks in the WDN.

Again, $$P_{i}^{t}$$ and $$L_{l}^{t}$$ are computed through hydraulic software simulation.

PSO is computationally hard to solve through an exhaustive search, due to the large number of possible pump schedules even for a very tiny WDN. A simple WDN, consisting of just 1 ON/OFF pump, and considering an hourly resolution and a 24-hours horizon, leads to an optimization problem with 24 discrete decision variables and, consequently, 224 (i.e. more than 16 million) possible pump schedules.

An efficient search method to identify promising pump schedule is therefore crucial to implement a practical decision support for PSO, with the aim to identify an optimal schedule within a limited number of objective function evaluations. This is the motivation of the BO framework for PSO proposed in this paper.

## 3 Bayesian optimization

The simulation–optimization task is addressed in this study through Sequential Model Based Optimization (SMBO) and, specifically, Bayesian Optimization. The goal is to find a global minimizer of the (black-box) objective function:

$$x^{*} = \mathop {\arg \hbox{min} }\limits_{x} f\left( x \right)$$

where $$x \in {\mathbb{R}}^{d}$$ is a d-dimensional vector whose components are the value of the decision variables. In the global optimization $$x$$ is usually defined in a limited domain (bounding box). In the most general case decision variables can be continuous, discrete as well as conditional.

The BO framework has 2 main components:

• The first is a probabilistic model of the objective function (also known as “surrogate function”) whose mean and variance are sequentially updated accordingly to observations of the objective function. Two such models are considered in Sect. 3.1, a Gaussian Process (GP) and a Random Forest (RF).

• The second key component is the acquisition function based on the GP model, whose optimization drives the querying process of the model and sets the next observation at the most “promising” point. Several acquisition functions are considered in Sect. 3.2.

BO has become the most promising approach even in the case there is not any information about derivatives as well as $$f\left( x \right)$$ is not convex or multi-modal. However, when gradients of $$f\left( x \right)$$ can be inferred by the surrogate, they can be incorporated in the BO framework to improve the value through a local search [31].

With respect to the PSO problem, $$f\left( x \right)$$ is the energy cost C reported in (1) while the dimension d of the search space is given by $$T \cdot \left( {N_{b} + N_{v} } \right)$$, where $$T$$ is the number of time steps and $$N_{b}$$ and $$N_{V}$$ are the number of ON/OFF and variable speed pumps, respectively. Every component of $$x$$ refers to a specific time step and a specific pump, so $$x_{i} \in \left\{ {0,1} \right\}$$ in case of an ON/OFF pump and $$x_{i} \in \left[ {0,1} \right]$$ in case of a VSP.

This notation holds in the following sections which are related to the probabilistic surrogate models and acquisition functions used in this study.

### 3.1 The probabilistic model

A GP is an extension of the multivariate Gaussian distribution to an infinite dimension stochastic process. Just as a Gaussian distribution is a distribution over a random variable, completely specified by its mean and covariance, a GP is a distribution over functions, completely specified by its mean function $$m$$ and covariance function $$k$$:

$$f\left( x \right) \sim GP\left( {m\left( x \right);k\left( {x,x_{0} } \right)} \right)$$

It is often useful to intuitively think of a GP as analogous to a function, but instead of returning a single numeric value $$f\left( {\bar{x}} \right)$$ for an arbitrary $$\bar{x}$$, it returns the mean and variance of a normal distribution over the possible values of $$f\left( {\bar{x}} \right)$$.

For the covariance function $$k$$, a very popular choice is the squared exponential function:

$$k\left( {x_{i} ,x_{j} } \right) = exp\left( { - \frac{1}{2}x_{i} - x_{j}^{2} } \right)$$

This function approaches 1 as values get close together and 0 as they get further apart. Two points that are close together can be expected to have a very large mutual influence, whereas distant points have almost none. This is a necessary condition for convergence under the assumptions of [32].

The function values are drawn according to a multivariate normal distribution $$N\left( {0,K} \right)$$ where the kernel matrix $$K$$ is given by:

$$K = \left[ {\begin{array}{*{20}c} {k\left( {x_{1} ,x_{1} } \right)} & \ldots & {k\left( {x_{1} ,x_{t} } \right)} \\ \vdots & \ddots & \vdots \\ {k\left( {x_{t} ,x_{1} } \right)} & \ldots & {k\left( {x_{t} ,x_{t} } \right)} \\ \end{array} } \right]$$

Of course, the diagonal values of this matrix are 1 (each point is perfectly correlated with itself), which is only possible in a noise-free environment.

The GP-based surrogate model is fitted according to data obtained in the previous iterations of the SMBO process. Let consider to be at iteration $$t$$, that means $$t$$ evaluations of the objective function have been already performed and stored in the dataset $$D_{1:t} = \left( {x_{i} ,f\left( {x_{i} } \right)} \right)$$, BO will suggest what is the next $$x_{t + 1}$$ to evaluate. Then, by the properties of Gaussian processes, $$\{ f_{1} , \ldots f_{t} \}$$ and $$f_{t + 1}$$ are jointly Gaussian, where, for simplicity, $$f_{i}$$ denotes $$f\left( {x_{i} } \right)$$:

$$\left[ {\begin{array}{*{20}c} {\left\{ {f_{1} , \ldots , f_{t} } \right\}} \\ {f_{t + 1} } \\ \end{array} } \right] \sim N\left( {0,\left[ {\begin{array}{*{20}c} {\mathbf{K}} & {\text{k}} \\ {{\text{k}}^{T} } & {k\left( {x_{t + 1} ,x_{t + 1} } \right)} \\ \end{array} } \right]} \right)$$

where

$${\text{k}} = \left[ {{\text{k}}\left( {x_{t + 1} ,x_{1} } \right) {\text{k}}\left( {x_{t + 1} ,x_{2} } \right) \ldots {\text{k}}\left( {x_{t + 1} ,x_{t} } \right)} \right]$$

It is then easy to derive an expression for the predictive distribution:

$$P\left( {f_{t + 1} |D_{1:t} , x_{t + 1} } \right) = N\left( {\mu_{t} \left( {x_{t + 1} } \right),\sigma_{t}^{2} \left( {x_{t + 1} } \right)} \right)$$

where

$$\mu_{t} \left( {x_{t + 1} } \right) = {\text{k}}^{T} {\mathbf{K}}^{ - 1} \left\{ {f_{1} , \ldots , f_{t} } \right\}$$
$$\sigma_{t}^{2} \left( {x_{t + 1} } \right) = k\left( {x_{t + 1} ,x_{t + 1} } \right) - {\text{k}}^{T} {\mathbf{K}}^{ - 1} {\text{k}}$$

A major problem with GP, and more generally with BO, is that it scales poorly with dimensionality of the problem (i.e. number of decision variables); furthermore, the computational complexity for fitting the GP, at each iteration of the SMBO process, is $$O\left( {n^{3} } \right)$$ where $$n$$ is the number of evaluations performed so far, so time for proposition, as well as overall wall-clock time, increases over the iterations. To reduce the impact of dimensionality some proposals have been recently suggested [33, 34].

When the number of decision variables is high—greater than 20—as well as some of them are discrete and/or conditional, other surrogate models are more promising, such as Random Forest (RF) [35, 36]. RF is an ensemble learning method, based on decision trees, for both classification and regression problems. According to the originally proposed implementation, RF aims at generating a multitude of decision trees, at training time, and providing as output the mode of the classes (classification) the mean/median prediction (regression) of the individual trees. Decision trees are ideal candidates for ensemble methods since they usually have low bias and high variance, making them very likely to benefit from the averaging process. RF mostly differs from other typical ensemble methods (e.g. voting systems) in the way they introduce random perturbations into the training phase. The basic idea is to combines bagging, to sample examples from the original dataset, and random selection of features, in order to train every tree of the forest. Injecting randomness simultaneously with both strategies yields one the most effective off-the-shelf methods in machine learning, working surprisingly well for almost any kind of problems, allowing for generating a collection of decision trees with controlled variance. Let denote with $$S$$ the size of the forest (i.e. the number of decision trees in the forest) and $$T_{i} \left( x \right)$$ the output provided by the i-th decision tree for the input $$x$$, the variance of the RF-based estimator, for the regression case, can be easily computed as follows:

\begin{aligned} Var\left( {\frac{1}{S}\mathop \sum \limits_{i = 1}^{S} T_{i} \left( x \right)} \right) & = Cov\left( {\frac{1}{S}\mathop \sum \limits_{i = 1}^{S} T_{i} \left( x \right),\frac{1}{S}\mathop \sum \limits_{j = 1}^{S} T_{j} \left( x \right)} \right) = \frac{1}{{S^{2} }}\mathop \sum \limits_{i = 1}^{S} \mathop \sum \limits_{j = 1}^{S} Cov\left( {T_{i} \left( x \right),T_{j} \left( x \right)} \right) \\ & = \frac{1}{{S^{2} }}\mathop \sum \limits_{i = 1}^{S} \left( {\mathop \sum \limits_{j \ne i}^{S} Cov\left( {T_{i} \left( x \right),T_{j} \left( x \right)} \right) + Var\left( {T_{i} \left( x \right)} \right) } \right) \le \frac{1}{{S^{2} }}\mathop \sum \limits_{i = 1}^{S} \left( {\left( {S - 1} \right)\rho \sigma^{2} + \sigma^{2} } \right) \\ & = \frac{{S\left( {S - 1} \right)\rho \sigma^{2} + S\sigma^{2} }}{{S^{2} }} = \frac{{S\left( {S - 1} \right)\rho \sigma^{2} }}{S} + \frac{{\sigma^{2} }}{S} = \rho \sigma^{2} - \frac{{\rho \sigma^{2} }}{S} + \frac{{\sigma^{2} }}{S} \\ & = \rho \sigma^{2} + \sigma^{2} \frac{1 - \rho }{S} \\ \end{aligned}

where $$\sigma^{2}$$ is the variance computed over all the $$T_{i} \left( x \right)$$ and $$\rho \sigma^{2} = \mathop {\hbox{max} }\limits_{i,j} Cov\left( {T_{i} \left( x \right),T_{j} \left( x \right)} \right)$$.

Thus, variance of the RF estimator decreases with $$\sigma^{2}$$ and $$\rho$$ decreasing (i.e. if the number m of selected features decreases) and with the size $$S$$ the forest increasing.

Variance is crucial to replace GPs with RF in the SMBO framework; the possibility to compute the variance of the RF allows for keep the overall sequential optimization process as is, since fitting the surrogate model in the SMBO process is a regression task. Therefore, RF can be easily adopted by replacing machine learning concepts with those related to SMBO (e.g., features with decision variables and dataset with set of evaluations performed so far). Following a basic description of the RF learning algorithm.

With respect to PSO, it is important to highlight that using GP or RF as surrogate model may significantly affect effectiveness and efficiency of the overall BO framework proposed. More specifically, a surrogate model based on RF should be, at least ideally, well suited for WDNs with only ON/OFF pumps or mixed (both ON/OFF and VSP) pumps. Furthermore, a RF-based surrogate offers higher scalability with respect to the number of evaluations of the objective functions, leading to a wall-clock time lower than that required by a GP-based surrogate, given the same number of function evaluations. According to these considerations, using RF as probabilistic model of the BO framework should be the most profitable choice.

### 3.2 Acquisition functions

The acquisition function is the mechanism to implement the trade-off between exploration and exploitation in the BO. The basic idea is to improve over the best solution observed so far (“best seen”), even if an improvement is not guaranteed from one iteration to the next one, due to the exploration. In particular, any acquisition function aims to guide the search of the optimum towards points with potential high values of objective function either because the prediction of $$f\left( x \right)$$, based on the surrogate model, is high or the uncertainty is high (or both). While exploiting means to consider the area providing more chance to improve the current solution (with respect to the current surrogate model), exploring means to move towards less explored regions of the search space.

Probability of Improvement (PI) was the first acquisition function proposed in the literature [37], where the parameter ξ ≥ 0 is used to balance between exploration and exploitation (ξ = 0 means pure exploitation).

$$PI\left( x \right) = P\left( {f\left( x \right) \ge f\left( {x^{ + } } \right) + \xi } \right) = \phi \left( {\frac{{\mu \left( x \right) - f\left( {x^{ + } } \right) - \xi }}{\sigma \left( x \right)}} \right)$$

However, one of the drawback of PI is that it is biased towards exploitation.

Another popular acquisition function is the Expected Improvement (EI) [38] which measures the expectation of the improvement on $$f\left( x \right)$$ with respect to the predictive distribution of the surrogate model. Even in this case parameter ξ is used to handle the trade-off between exploitation and exploration. When exploring, points associated to high uncertainty of the surrogate model are chosen, while when exploiting, points associated to high value of the mean of the surrogate model are selected.

$$EI\left( x \right) = \left\{ {\begin{array}{*{20}c} {\left( {\mu \left( x \right) - f\left( {x^{ + } } \right) - \xi } \right)\phi \left( Z \right) + \sigma \left( x \right)\phi \left( Z \right) if \sigma \left( x \right) > 0} \\ {0 if \sigma \left( x \right) = 0} \\ \end{array} } \right.$$

where ϕ and $$\varPhi$$ are the probability and the cumulative distribution functions, respectively, and

$$Z = \left\{ {\begin{array}{*{20}l} {\frac{{\mu \left( x \right) - f\left( {x^{ + } } \right) - \xi }}{{\sigma \left( x \right)}}} \hfill & {~if~\;~\sigma \left( x \right) > 0} \hfill \\ {0~} \hfill & {~if~\;~\sigma \left( x \right) = 0} \hfill \\ \end{array} } \right.$$

Finally, another acquisition function selects the next point to evaluate according to the Confidence Bound concept [39, 40]: Lower Confidence Bound—LCB—and Upper Confidence Bound—UCB—in the minimization and maximization case, respectively. For the case of minimization, LCB is given by:

$${\text{LCB}}\left( x \right) = \mu \left( x \right) - {\text{k}}\sigma \left( x \right)$$

where $${\text{k}} \ge 0$$ is the parameter to manage the exploration/exploitation trade-off ($${\text{k}} = 0$$ means pure exploitation).

To solve the black-box optimization the optimization of the acquisition function is required but, usually, this is cheaper and easier than to optimize the objective function. Methods used are basically deterministic, such as the derivative free DIRECT [41] or multi-start BFGS (Broyden–Fletcher–Goldfarb–Shanno) [42]. Another relevant approach is the so called “focus search”, proposed in the mlrMBO software to handle numeric, categorical, mixed and hierarchical/conditional decision variable spaces, and therefore well suited for a RF based surrogate model.

### 3.3 Software environment

In order to implement the BO framework proposed for solving the pump scheduling optimization problem, the R package named “mlrMBO” has been adopted: it is a flexible and comprehensive toolbox for model-based optimization (MBO). This toolbox has been designed for both single- and multi-objective optimization with mixed continuous, categorical and conditional decision variables, and it has been implemented in a modular fashion, such that single components can be easily replaced or adapted by the user for specific use cases, e.g., any regression learner from the R package “mlr” for machine learning can be used (more than 60 machine learning regression algorithms), and infill criteria and infill optimizers are easily exchangeable.

The mlrMBO toolbox implements the SMBO framework giving the possibility to customize each step of the overall sequential optimization process. An initial set of evaluations has to be performed in order to create the first instance of the surrogate model, which will be then updated over the iterations of the SMBO process. The user can manually specify the initial set of points to be evaluated or generate them either completely at random, coarse grid designs or by using space-filling Latin Hypercube Sampling (LHS) [21].

With respect to the surrogate model, two default choices are available: GP for continuous unconditional decision variables and RF for the other cases. In any case, any other regression algorithm (named “learner” in mlrMBO) can be used to fit the surrogate model. Three different acquisition functions—also named infill criteria in mlrMBO—are available: EI, Augmented EI and LCB/UCB. The toolbox is anyway extendible, so user can implement its own acquisition function.

Finally, different termination criteria can be defined depending on the specific optimization problem: maximum number of function evaluations is reached, the optimum is found (in the case that optimum value of the objective function is known) or “budget” is exhausted (e.g., overall wall-clock time or costs for cloud computing).

## 4 Experimental results

This section presents the results obtained both on a set of test functions and two PSO case studies: a benchmark well known in the PSO literature and a real-life WDN in Milan, Italy. All the experiments, with exception of SigOpt (which is a cloud-based platform), were carried out on a machine with: intel i7 (3.4gHz) processor with 4 cores and 16Gb of RAM.

### 4.1 Experiments on test functions

As preliminary evaluation, 12 different test functions has been considered to compare different SMBO alternatives. Four are test functions well known in global optimization literature: Branin (2D), Hartman-3D, Hartman-6D and Schwefel-20D; the other 8 test functions have been obtained through the GKLS software [43]: GKLS2D, GKLS3D, GKLS6D and GKLS20D are differentiable functions, GKLS2D-ND, GKLS3D-ND, GKLS6D-ND and GKLS20D-ND are not differentiable functions. The generation mechanism is based on the perturbation of a paraboloid, performed by setting up 4 different parameters. These parameters, and the values set up for the different test cases, are reported in the following Table 1. For all the GKLS functions the decision variables are bounded in the unitary hypercube.

For illustrative purposes, the differentiable and not differentiable functions generated through GKLS, in the 2D case, are reported in the following Fig. 1.

Experiments on test functions have been performed with the aim to have a preliminary understanding about the benefits provided by BO and its different alternatives, in particular relatively to the adoption of GP or RF for the surrogate model and the selection of EI or CB as acquisition function. As they are test functions, the optimum is known—both x* and f(x*)—thus the difference between the optimum f(x*) and the “best seen” f(x+), after t evaluations of the objective function, is computed. More specifically, a budget of 1500 evaluations was fixed for each experiment (i.e. t = 1500).

The following tables report, separately, the difference between f(x*) and f(x+) and the function evaluation (also indicated with “iteration)” at which best seen occurred for the first time. Latin Hypercube Sampling (LHS) is used to initialize the design, with only 5 initial evaluations, 7 in the cases of 6D test functions and 21 in the case of 20D test functions, since LHS requires a number of evaluations greater than the dimensions.

With respect to the commercial “optimization-as-a-service” solution there is no possibility to set anything related to the optimization process performed by SigOpt. The experiments with ALAMO in combination with BARON use all the possible base functions for the generation of the surrogate model (i.e., constant, linear, exponential, logarithmic, sine, cosine) (Tables 2, 3).

Although it is not possible to identify a “winner” among the proposed approaches, some relevant insights can be retrieved:

• All the approaches are effective on low-dimensions, in particular on the test function Branin2D;

• BO with RF as surrogate model and LCB as acquisition function is the only approach providing the best solution on 50% of all the test functions considered;

• BO with GP as surrogate model, in some cases, is not able to perform all the function evaluations, resulting in a premature interruption of the SMBO process. This issue arises more frequently when the acquisition function is LCB;

• ALAMO with BARON does not use all the budget (1500 function evaluations) because error between surrogate model and preliminary function evaluations is very low, so the surrogate is considered accurate. However, when optimization is performed on the generated surrogate model, the value of the optimal solution is quite different from the known optimum of the related test function.

### 4.2 Experiments on the Anytown WDN

After the preliminary evaluation on the test functions, we have applied the proposed BO approaches on the PSO problem. As first case study, a benchmark WDN named Anytown, originally proposed in [11], has been considered. This WDN consists of 37 pipes, 19 nodes, 1 tank (having elevation and diameter equal to 65.53 and 12.20 m, respectively) and 1 reservoir, with 4 pumps installed at the source to supply water. The Anytown WDN, modelled through EPANET, is reported in the following Fig. 2.

A recent work has addressed the PSO on the Anytown network through Harmony Search [44], a meta-heuristics search method. It is important to underline that we do not compare directly with Harmony Search: it has been recently proposed to solve the PSO problem on Anytown as the plethora of meta-heuristics—extensively Genetic Algorithms—proposed in the last decade. Our aim is to propose a more mathematically sound approach to solve the PSO problem by exploiting black-box optimization and making it an effective and efficient tool for practical/operational problems, such as PSO in WDNs. However, the more recent, detailed and completely replicable PSO setting on Anytown is—at our knowledge—proper reported in [44].

We have used a penalty on the objective function in the case warnings occur during the corresponding simulation run (i.e. some hydraulic constraint is not satisfied). However, contrary to [44], who used as penalty a default cost equals to 9999, we have decided to adopt, as penalty value, the cost obtained by using the “naïve” schedule consisting in having all the 4 pumps switched ON for all the hours of the day. We have adopted the same setting reported in [44]: minimum and maximum tank heads are set to 67.67 and 76.20 m, respectively. Hourly demand factors ranged from 0.4 to 1.2 and base demand is doubled from the Anytown test case to allow longer pump operation and tank storage, defining a total inflow between 161.51 and 484.53 lps. Finally, an electricity tariff with a price of 0.0244 $/kW h between 0:00–7:00 and 0.1194$/kW h between 8:00 and 24:00 was considered for simulations.

The objective function f(x) is given in (1) as the energy cost associated to the pump schedule, and x t represents the state of the pumps at a specific time step (i.e. hour). If an ON/OFF pump is considered, the domain of the decision variable is {0, 1}, while it is a continuous variable in [0, 1] in the case of VSP. As previously mentioned, f(x) is not available in closed form but just as a black-box whose evaluation requires the run an EPANET simulation. Furthermore, we assume that each evaluation produces an unbiased (not noisy) point-wise estimation of f(x) because we consider that both WDN structure and demand do not change from one simulation to another.

Usually PSO on this WDN has been targeted considering ON/OFF pumps in the system, leading to 4 × 24 = 96 variables. Thus, the total number of possible schedules is 24×24 = 296, with an hourly step and a 24 h horizon; consequently, the total search space is 7.92 × 1028.

The number of overall evaluations has been fixed at 800, where 400 are used for the initial design (i.e. generation of the first surrogate model) and the other 400 are related to the iterations of the SMBO process. For the initial 400 evaluations, the LHS approach has been used.

The following Tables 4 and 5 report, separately, the results obtained when ON/OFF pumps and VSPs are considered. As natural, the adoption of VSPs allows for lower energy costs (due to the nature of the pumps), even lower than the optimal value found in [44]. Furthermore, when VSPs are considered, using a GP-based surrogate model results to be a better choice, than using RF, in minimizing the energy costs. On the contrary, the RF is better than GP in the case of ON/OFF pumps. This information basically confirms that the selection of a strategy to infer the surrogate model should depend on the type of decision variables. The VSPs case has been also optimized through ALAMO in combination with BARON: the energy cost associated to the optimal solution identified on its surrogate model is higher than the BO approaches.

The iteration associated to the occurrence of the best seen and the overall wall clock time are also reported. It is easy to note that, in the case a GP-based surrogate is used, the overall clock time is one order of magnitude higher. This basically proves the most relevant drawback of GPs: they scale poorly with the number of decision variables, since computational complexity is O(n3) with n the number of function evaluations.

Taking into account the ON/OFF pumps case, there is a substantial difference between the optimal schedules we have identified through BO and the one found via Harmony Search in [44]: our optimal schedules use all the 4 pumps while the optimal one obtained in [44] uses just 3 pumps. A possible motivation is that the overall number of function evaluations are not sufficient (800): although the number of function evaluations performed in [44] is not declared, a comparison with 2000 generations of a Genetic Algorithm based approach is reported, leading to suppose that a similar number of evaluations was used also for Harmony Search. We have therefore decided to perform a further experiment by forcing the same pump to be shut down; thus, the decision variables are in this case related to 3 pumps, leading to 3 × 24 = 72 decision variables and 23×24 = 272 possible schedules. The search space is still huge, and BO can be still profitably applied. Again, we consider both the ON/OFF pumps and VSP cases; moreover, the reduction from 96 to 72 variables allows us to make a comparison, in the VSP case, between the BO framework implemented through mlrMBO and the commercial “optimization-as-a-service” platform named SigOpt. The following Table 6 reports the results obtained in the case of 3 ON/OFF pumps.

The pump schedules obtained have a lower energy cost with respect to those obtained from the previous BO process (performed on 4 ON/OFF pumps) leading to results more similar to [44]. In particular, the best result is obtained when a GP surrogate model is used and EI is chosen as acquisition function. However, in terms of energy cost, GP and RF differ only by 2–3%, while in terms of number of iterations needed to identify the best seen, and more important overall wall-clock time, RF is significantly better. The following Fig. 3 shows the main information related to the optimal pump schedule: the pumps status and the level of tank—even with respect to the level reported in [44]—for every hour of the day.

Finally, the following Table 7 summarizes results obtained in the case of VSPs. For this experiment we have also adopted SigOpt as further comparison. Again, considering 3 VSPs instead of 3 ON/OFF pumps gives the possibility to further reduce energy cost. In this specific case, SigOpt is near to the optimal value provided by RF-based BO with LCB as acquisition function. The main advantage provided by SigOpt is the high scalability which allows for increasing the number of evaluations and therefore improving the optimal value.

Finally, the following Fig. 4 shows information related to the optimal pump schedule identified: the speed of the pumps (from 0 to 1) and the tank level—along with that reported in [44]—for every hour of the day.

### 4.3 Experiments on a real-life large-scale WDN

After the experiments on the benchmark WDN, we have addressed a real-life large-scale WDN in Milan, Italy. The water distribution service in the urban area of Milan is managed by Metropolitana Milanese (MM) through a cyber-physical system consisting of: a transmission network (550 wells and pipelines bringing water to 29 storage and treatment facilities located inside the city) and a distribution network (26 different pump stations spread over the city—95 pumps overall—which take water from the storage and treatment facilities and supply it to the final customers, about 1.350.000 habitants). Financial burden for MM is approximately 16.000.000 € for energy costs, with 45% of them due to pumping operations in the distribution network [45]. During the project ICeWater—co-founded by the European Union—a Pressure Management Zone (PMZ), namely “Abbiategrasso”, was identified in the South of the city and isolated from the rest of the network.

The EPANET model of the Abbiategrasso PMZ, reported in the following Figs. 5, 6 was presented in [45]: it consists of 7855 nodes, 6073 pipes, 1961 valves, 1 reservoir and 4 pumps. Two of them, the oldest ones, are ON/OFF, while the other two, the newest ones installed in 2014, are VSPs.

As previously reported in [45], preliminary results showed that two pumps are sufficient to match supply and demand within the PMZ. This consideration allows for limiting the number of decision variables (only two pumps are, at most, active simultaneously) and quantify which is the economic return of MM in using VSPs instead of ON/OFF pumps.

The optimization setting is the same defined for the benchmark WDN. A relevant difference is related to the electricity price: the Time-Of-Use (TOU) tariff in place for the real-life case study is:

• (low-price) 0.0626 €/kW h between 00:00–7:00 and between 23:00–24:00

• (mid-price) 0.0786 €/kW h between 08:00 and 19:00

• (high-price) 0.0806 €/kW h between 07:00–08:00 and between 19:00–23:00

All the available data for the study, including the reported energy tariff, are related to the year 2014.

The other relevant difference, with respect to Anytown, is the number of decision variables (2 instead of 4), and, consequently, the size of the search space. For the ON/OFF pumps case (i.e., binary decision variables) the overall number of possible pump schedules is 22×24 = 248, that is about 2.81 × 1014, in any case too large to perform an exhaustive search. The number of overall objective function evaluations (i.e. EPANET simulation runs) has been fixed at 800, with 400 for the initial design (i.e. generation of the first surrogate model) and the remaining 400 for the iterations of the SMBO process. For the initial 400 evaluations, the LHS approach has been used. The following tables summarize the obtained results for the ON/OFF pumps and the VSPs setting, respectively (Tables 8, 9).

With respect to the ON/OFF pumps case, only BO with RF as surrogate model was able to identify a feasible solution. Moreover, the value of the best seen is independent on the acquisition function.

With respect to the VSPs case, some important considerations are reported:

• Energy costs are naturally lower than those obtained on the ON/OFF pumps case, thanks to the nature of the pumps—and, therefore, nature of the decision variables;

• ALAMO in combination with BARON is not able to identify a feasible solution;

• Although BO with GP as surrogate model and LCB as acquisition function provided the lowest energy cost, using GP is too computational expensive (overall wall clock time is about 10 times that required by RF). This makes BO with GP unsuitable from an operational point of view, since too much time is required to generate a suggestion and turn decision into action timely.

Finally, the real-life large-scale case study allowed to highlight a further and important difficulty in solving the PSO problem. Solutions in the search space are not all feasible; however, the feasible region is not known a priori. As for the evaluation of the objective function, also the feasibility check is performed through EPANET: more specifically, EPANET returns a numeric value (i.e., energy cost) when the considered point in the search space (i.e., pump schedule) is feasible, otherwise it returns a specific error code.

As previously mentioned, we have assigned a penalty to unfeasible points, with the aim to move towards feasible points/regions. This approach is quite naïve because the penalty does not depend on the entity of violation, but it is the approach usually adopted in the PSO literature. This leads to almost “flat” objective functions, especially when the unknown feasible region results extremely narrow with respect to the overall search space. The two WDN use cases considered, but more specifically the real-life WDN, seem to be characterized by this kind of situation: in the following figures it is possible to see the extremely narrow feasible region for the real-life WDN, with x and y speed of the first and the second VSP, respectively. To have a 2D representation, the first hour of the day is considered, only.

Since BO gives, along the sequential optimization process, some chance to explore unseen regions of the search space, there is some opportunity to get to the feasible region. On the contrary, ALAMO in combination with BARON just tries to generate a surrogate model as accurate as possible. Thus, it is deceived by the false flatness of the objective function induced by the penalty and the narrow extent of the feasible region. A possibility to exploit the potential of ALAMO in combination with BARON, in a future work, could consists in using it instead of GP or RF within the SMBO process. Using ALAMO in combination with BARON, within the BO framework, with the aim to create/update the surrogate model requires to understand how to model uncertainty of the surrogate, similarly to the GP and RF approaches.

## 5 Conclusions

Energy saving in WDN is one of the most challenging issues of smart urban water management and PSO is critical to reduce energy costs while guaranteeing a satisfactory water supply service. Simulation–optimization is usually adopted to solve the PSO, where the optimization is mostly performed via meta-heuristics, such as Genetic Algorithms or Harmony Search, and simulation is widely performed via EPANET.

According to the emerging interest on Sequential Model Based Optimization (SMBO), in particular Bayesian Optimization (BO), to solve black-box and expensive optimization processes, we have proposed the SMBO as optimization engine of the simulation–optimization PSO problem and compared different alternative approaches. A set of test functions has been initially used to make preliminary considerations about the different alternative schemes proposed, then two PSO case studies have been addressed: a widely studied benchmark WDN, namely Anytown, and a real-life large-scale WDN in Milan, Italy. Considering different alternative was motivated by the type of decision variables, which are discrete in the ON/OFF pumps case and continuous in the VSPs case.

Although an “always winning” approach cannot be identified, relevant conclusions can be summarized. BO with RF as surrogate model and LCB acquisition function was the only approach providing the best solution in 50% of the test functions considered.

With respect to PSO on the two case studies, SMBO proved to be effective, even using a limited number of function evaluations (400 for initial design and 400 for the BO process) with respect to the overall search space. However, GP-based and RF-based surrogate models alternate in providing the best result. Although this could appear inconclusive, the most relevant conclusion is related to the time needed to obtain a solution: GP-based BO requires from 5 to 10 times the wall clock time needed to RF-based BO, independently on the acquisition function considered. Thus, time-to-decision could be the key for the final choice on the SMBO configuration to use. From an operational point of view, wall clock time is acceptable for the benchmark WDN only: a suitable pump schedule can be identified in few hours (from 1 to 5, according to the set-up of the SMBO framework) which is a reasonable time, in particular when some reliable urban water demand forecasting solution is available. Wall clock time for RF-based SMBO resulted 5 times lower than GP-based one’s (1 vs. 5 h), meaning that, in the case the time required by the GP-based SMBO is anyway acceptable, the number of evaluations for the RF-based SMBO could be increased to raise the chance to find a better pump schedule while keeping wall-clock time acceptable.

A different conclusion arises with respect to the results on the real-life WDN: while RF-based BO requires about 2 h for performing 800 iterations of the SMBO process, in the VSPs case, GP-based BO may require about 22 h, making it unsuitable for optimizing pump schedule in a real-life setting. Therefore, RF-based BO is the most effective and efficient choice.

Finally, the real-life WDN allowed to highlight that PSO is characterized by a feasible region but it is unknown a priori. Like the value of the objective function, feasibility of a solution (i.e., a pump schedule) can only be verified by querying the black-box (i.e., EPANET). A naïve strategy based on assigning a constant penalty value to unfeasible points could be inefficient in BO and completely ineffective in the generation and consecutive optimization of an accurate surrogate model (i.e., ALAMO in combination with BARON). Future works will address how to include an estimation of the feasibility region, as well as a more sophisticated and effective quantification of penalties to unfeasible points.