1 Introduction

Global optimization seeks to address the following problem

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}}{\text {min}}&~~f(\textbf{x}) \\ \text {s.t.}&~~\textbf{g}(\textbf{x}) \ge \textbf{0}, \\&~~\textbf{h}(\textbf{x}) = \textbf{0}, \\&~~~\textbf{x}\in \mathbb {Z}^m \times \mathbb {R}^{n-m}, \end{aligned} \end{aligned}$$
(1)

where f, \(\textbf{g}\) and \(\textbf{h}\) are the objective function, inequality constraints and equality constraints, respectively, and \(\textbf{x}\) is a vector of decision variables. The objective functions and constraints may or may not conform to any specific mathematical structure, unlike linear or convex optimization problems, and variables can be continuous or integer.

Existing global optimizers approximate problem (1) into forms compatible with efficient optimization. These optimizers use three major approaches, which are gradient-based methods, outer approximations, and mixed-integer optimization (MIO) methods. The gradient-based approach is used by popular nonlinear solvers such as IPOPT. These solvers initialize their solution procedure using feasible solutions found via efficient heuristics. Then, they solve a series of gradient descent iterations, confirming optimality via satisfaction of the Karush-Kuhn-Tucker (KKT) conditions. Wächter and Biegler [38] describe IPOPT’s primal-dual barrier approach in greater detail. It relaxes the constrained global optimization problem into an unconstrained optimization problem using a logarithmic barrier function, then uses a damped Newton’s method to reduce the optimality gap to a desired tolerance. These gradient-based optimizers are efficient and effective in the presence of nonlinear constraints that are sparse, being able to solve problems on the order of 1000 variables and constraints on unremarkable personal computers in minutes to local optimality.

A second approach is outer approximation, described by Horst et al. [19]. This approach simplifies a global optimization problem by approximating constraints via linear and nonlinear cuts that preserve the original feasible set over decision variables \(\textbf{x}\). This approach is effective for constraints with certain mathematical structure (e.g., linearity of integer variables and convexity of nonlinear functions considered by Duran and Grossmann [12], or concavity or bilinearity of constraints considered by Bergamini et al. [3]), where mathematically efficient outer approximators exist. While these approaches are effective, they have found less commercial success due to their problem-specific nature.

A final approach, and one that meshes naturally with optimization over integer variables, couples MIO with outer approximations. Ryoo and Sahinidis [30] present an approach called the branch-and-reduce method, which relies on recursively partitioning the domain of each constraint and objective over the decision variables, and bounding their values in each subdomain by examining their mathematical primitives. Similarly, Nagarajan et al. [28] implement an adaptive disjunctive representation of constraints with piecewise-convex outer approximations, which they refine iteratively to find globally optimal solutions. Such recursive partitioning creates a branch-and-bound tree, the solution to which has guarantees of global optimality through the bounding and pruning process inherent in solving MIO problems via branch-and-bound. This method has seen success in Branch-And-Reduce Optimization Navigator (BARON) [32] and Alpine [28], two popular global optimizers.

While the aforementioned approaches are effective in addressing certain classes of global optimization problems, each has weaknesses. In general, gradient-based approaches rely on good initial feasible solutions, and are ineffective in the presence of integer decision variables. Outer approximation approaches fail to generalize to global optimization problems with general nonlinearities. While more general than outer approximation methods, existing MIO approaches do not scale as well due to their combinatorial nature.

Perhaps more importantly, in pursuit of mathematical efficiency, many global optimizers place additional constraints on the forms of constraints, requiring constraints to use a small subset of possible mathematical primitives. For example, BARON “can handle functions that involve \(\textrm{exp}(x)\), \(\textrm{ln}(x)\), \(x^{\alpha }\) for real \(\alpha \), and \(\beta ^x\) for real \(\beta \)” [32]. Other solvers such as Alpine are even more restrictive, addressing global optimization problems over polynomial functions [28]. Constraints from the real world do not always adhere to these forms, and often involve other classes of functions such as trigonometric functions, signomials, and piecewise-discontinuous functions. It is often not possible to transform these functions into forms compatible with existing global optimizers. These optimizers face even greater challenges when dealing with objectives and constraints that are black box. Black box constraints are inexplicit, meaning that they have no analytical representations, such as when constraints are the outcomes of simulations.

In this paper, we propose a new approach to reformulate global optimization problems as MIO problems using machine learning (ML) leveraging work by Bertsimas and Dunn [4, 5] on the optimal classification tree with hyperplanes (OCT-H) and the optimal regression tree with hyperplanes (ORT-H). The approach addresses global optimization with arbitrary explicit and inexplicit constraints. The only requirement for the proposed method is a bounded feasible domain for the subset of decision variables \(\textbf{x}\) present in nonlinear constraints.

In our proposed method, we approximate each constraint that is outside of the scope of efficient mathematical optimization using an OCT-H. More specifically, each nonlinear constraint \(g_i(\textbf{x}) \ge 0\) is approximated by an OCT-H \(T_{i}\) trained on data \(\{(\tilde{\textbf{x}}_{k}, \mathbb {I}(g_i(\tilde{\textbf{x}}_{k}) \ge 0)),~k \in [n]\}\), where \(\tilde{\textbf{x}}_k\) is an outcome of decision variables, \(\mathbb {I}\) is the indicator function, and \(g_i(\tilde{\textbf{x}}_k)\) is the left-hand-side of the constraint evaluated at \(\tilde{\textbf{x}}_k\). Thus, tree \(T_{i}\) makes an approximation of the feasible space of constraint \(g_i(\textbf{x}) \ge 0\), predicting (with some error) whether an outcome of decision variables satisfies the constraint. This approach also extends to approximate each nonlinear equality \(h_j(\textbf{x}) = 0\), and approximates nonlinear objective functions via ORT-Hs.
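As a minimal sketch of how the training data for one tree \(T_i\) could be assembled, the snippet below samples points in a box and labels each with the indicator \(\mathbb {I}(g_i(\tilde{\textbf{x}}_k) \ge 0)\). The constraint `g`, the helper `make_training_data`, and the uniform random sampling are illustrative assumptions only; the sampling schemes actually used are described in Sect. 3.2.

```python
import random


def g(x):
    # Hypothetical nonlinear constraint g(x) >= 0: the unit disk.
    return 1.0 - x[0] ** 2 - x[1] ** 2


def make_training_data(constraint, bounds, n_samples, seed=0):
    """Return samples x_k and labels I(constraint(x_k) >= 0)."""
    rng = random.Random(seed)
    # Uniform sampling is a placeholder for the DoE methods of Sect. 3.2.
    samples = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_samples)]
    labels = [1 if constraint(x) >= 0 else 0 for x in samples]
    return samples, labels


X, y = make_training_data(g, bounds=[(-2.0, 2.0), (-2.0, 2.0)], n_samples=200)
```

The pairs `(X, y)` are what a classifier such as an OCT-H would be trained on; only the ability to evaluate `g` at a point is needed.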

The approximating trees allow for a natural MIO approximation of the underlying constraints. Each feasible leaf of an OCT-H is reached by a decision path defining an intersection of halfspaces, i.e. a polyhedron. Constraints may thus be approximated as a union of feasible polyhedra of the approximating OCT-Hs using disjunctive constraints. We solve this efficient, locally ideal MIO approximation of the original problem to obtain a near-feasible and near-optimal solution, and then use gradient-based methods to repair the solution to be feasible and locally optimal.

While the proposed method is a heuristic and not a true global optimization method, it has several strengths relative to other global optimization methods. It is agnostic of the forms of constraints in the problem; as long as we can query whether a sample \(\tilde{\textbf{x}}\) is feasible to a constraint, we can embed the constraint into the MIO approximation. Once the constraints are learned using decision trees, the solution time of the resulting MIO approximation is low compared to solving the original global optimization problem. The proposed method can also be used to generate constraints from data which may not come from any known function, simulation or distribution. This allows us to simultaneously learn the physics of complex phenomena such as but not limited to social dynamical models or solutions of partial differential equations, and embed them into optimization problems.

In this work, we present our global optimization approach, implemented in our optimizer OCT-H for Global Optimization (OCT-HaGOn), pronounced “octagon”. We demonstrate its promise by considering global optimization problems with explicit nonlinear constraints. This allows us to quantify the performance of our method against existing global optimizers using available benchmarks. In addition, we approximate all nonlinear constraints in the benchmarks regardless of their efficient optimization-representability. The proposed method extends to mixed-integer-convex approaches where we embed efficiently-optimizable nonlinear constraints (e.g., quadratic, second order conic, log-sum-exponential constraints) into the MIO formulation directly, as long as these constraints are supported by the underlying solver.

1.1 Role of machine learning in optimization

The role of optimization in training ML models is well known and studied. Recent review papers in the literature [14, 34] survey the landscape of mathematical optimization and heuristic methods used for a variety of ML applications. However, we are interested in the inverse of the above, and specifically how ML can be used for the purpose of optimization, especially to solve problems that cannot naturally be posed as efficient optimization problems.

There is precedent for using ML methods to improve computational efficiency. A prominent example is the use of ML to accelerate the simulation of nonlinear systems such as those in computational fluid dynamics [21], molecular dynamics [15] and quantum mechanics [27]. There has been some prior work using ML to accelerate optimizations, e.g. using Bayesian optimization [13] or neural networks [17, 35]. While these show that ML-driven optimization is theoretically possible, the proposed methods are computationally expensive and have limited scalability. An interesting parallel use of ML in optimization is in the interpretation of optimal solutions, where ML is used to understand the optimal strategies (i.e. outcomes of all or subsets of decision variables) resulting from an optimization problem under different parameters [6].

In this work, we use ML to find optimal solutions to global optimization problems involving both explicit constraints with arbitrary mathematical primitives, and inexplicit black box functions. For this purpose, ML is used for constraint learning in two capacities. The first is to accelerate optimization over known models. When models and/or constraints are known but their use is prohibitive, e.g. in the case of explicit but nonlinear and nonconvex constraints, learners are used to create surrogates that are more efficient for use in optimization. The second is in modeling. When data is available but models and/or constraints are black box, learners act as interpolants to the data, and allow patterns in the data to be embedded in optimization.

Prior work has recognized the potential for using constraint learning approaches in optimization, especially when the ML models used are compatible with MIO representations. Both Biggs et al. [7] and Mišić [26] use the prediction of tree ensembles as the objective function of optimization problems, given that a subset of tree features are decision variables. Grimstad and Andersson [17] use deep neural networks and their MIO representations to solve optimization problems over non-convex quadratic functions and outcomes of simulations. Maragno et al. [24] go further and present a more general approach for data-driven optimization that leverages decision trees as well as other MIO-compatible ML models such as support vector machines and neural networks.

The aforementioned applications of ML in optimization are restricted in scope and efficacy. Biggs et al. [7] and Mišić [26] limit their applications to optimization over data-driven objective functions, where decision trees are used to regress on a continuous quantity of interest. Grimstad and Andersson [17] report the success of their deep neural network approximations on a small set of problems, while noting that the tractability of the resulting MIO “quickly fades with increasing network sizes” and is sensitive to the tightness of variable bounds. And while Maragno et al. [24] use constraint learning for data-driven constraints, we use constraint learning to make approximations of intractable explicit and inexplicit constraints as well, where we have the capacity to sample the underlying constraints to generate data. Thus we propose a global optimization framework that can accommodate arbitrary explicit, inexplicit and data-driven constraints, leveraging decision trees in regression and classification settings.

While it is possible to use other MIO-compatible ML models for constraint learning in global optimization as proposed by Maragno et al. [24], we choose to rely on ORT-Hs and OCT-Hs since they are tunable, accurate and interpretable [5]. In addition, the MIO representations of decision trees are locally ideal, i.e. their linear relaxations result in solutions that satisfy integrality of the binary variables required to construct the approximations. This allows global optimization via OCT-Hs to be more computationally efficient and scalable than other global optimization methods with embedded ML models, such as in the work of Grimstad and Andersson [17]. In the following sections, we demonstrate that a global optimization method leveraging optimal decision trees makes significant progress in using ML for both acceleration of optimizations and modeling, using the efficient MIO representation of trees.

1.2 Review of decision trees

Decision trees are a popular predictive ML method that partitions data hierarchically according to its features. A class label in a finite set of possible labels is assigned to each leaf node of the tree depending on the most common label of the data falling into the node. The optimization problem that is solved to produce a decision tree \(T \in \mathbb {T}\) over known data \((\textbf{x}, \textbf{y})\) is the following:

$$\begin{aligned} \underset{T}{\textrm{min}}~\textrm{error}(T, \textbf{x}, \textbf{y}) + c_p \cdot \textrm{complexity}(T), \end{aligned}$$

where \(c_p\) is a complexity penalty parameter which attempts to strike a balance between the misclassification error over the training data and complexity (depth and breadth) of the tree. Once trained, decision trees are queried to predict the classes of test points with known features, but unknown class.

Decision trees were pioneered by Breiman et al. [8] with the advent of classification and regression trees (CART). However, CART is a top-down, greedy method of producing decision trees. Each split is only locally optimal since the splits are made recursively on the children of each new split starting from the root node. The ability of decision trees to explore the feature space has improved with the work of Bertsimas and Dunn [4] on the optimal classification tree (OCT). OCTs leverage MIO and local search heuristics to reduce misclassification error relative to CART without overfitting. Furthermore, OCTs are more interpretable, since they can achieve similar misclassification error as trees generated by CART with much less complexity.

OCT-Hs generalize OCTs by allowing for hyperplane splits, i.e. splits in more than one feature at a time. An OCT-H can solve classification problems with higher accuracy and lower complexity than an OCT [5], and is more expressive in an optimization setting due to couplings of decision variables in nonlinear constraints. Thus, our method leverages OCT-Hs exclusively to approximate constraints.

ORT-Hs extend OCT-Hs to regression problems, where the prediction of interest is continuous, i.e. \(\tilde{y} \in \mathbb {R}\). Each leaf of an ORT-H, instead of containing a fixed class prediction, contains a continuous prediction \(\tilde{y}\) as a linear regression over \(\textbf{x}\) in the domain of the leaf. ORT-Hs are particularly useful when approximating nonlinear objective functions.

We rely on software from the company Interpretable AI (IAI) in building, training and storing problem data in the form of OCT-Hs and ORT-Hs [20].

2 Contributions

In this paper, we propose a global optimization approach that generalizes to explicit and inexplicit constraints and objective functions over bounded \(\textrm{dom}({\textbf{x}})\). Our specific contributions are as follows:

  1. We introduce an ensemble of methods for sampling constraints efficiently for the purpose of constraint learning. We leverage synergies of existing design of experiments (DoE) techniques, but also devise a new k-Nearest Neighbors (kNN) based sampling technique for sampling near-feasible points of explicit and inexplicit constraints.

  2. We learn the feasible space of nonlinear objectives, inequalities and equalities using OCT-Hs and ORT-Hs.

  3. We make MIO approximations of global optimization problems using the disjunctive representations of decision trees, and solve them using MIO solvers.

  4. We devise a projected gradient descent method to check and repair the near-feasible, and near-optimal solutions from the MIO approximations.

  5. We apply our method to a set of benchmark and real-world problems, and demonstrate its performance in finding global optima.

2.1 Paper structure

In Sect. 3, we detail our method, followed by a demonstrative example in Sect. 4. In Sect. 5, we test our method on a number of small benchmark problems from the literature, to gain confidence in the approach. In Sect. 6, we use our method to optimize two aerospace systems, one of which cannot be addressed via existing optimization tools. In Sect. 7, we address 93 benchmarks from MINLPLib, and compare our results with those of BARON, a popular global optimizer. In Sect. 8, we discuss the results and avenues for future research. We conclude in Sect. 9 by summarizing our findings and contributions.

3 Method

As mentioned above, our goal is to solve the global optimization problem approximately by making an OCT-H based MIO approximation, and then repairing the solution to be feasible and locally optimal. As an overview of this section, our method takes the following steps:

  1. Generate standard form problem: In order to reduce the global optimization problem to a tractable MIO problem, we first restructure the global optimization problem in (1). The linear constraints are passed directly to the MIO problem, while the nonlinear constraints are approximated in steps 2-6 below. If any variables involved in nonlinear constraints are unbounded from above and/or below, we attempt to compute bounds for the purpose of sampling.

  2. Sample and evaluate nonlinear constraints: The data used in training is important for the accuracy of ML models. For accurate OCT-H approximations of nonlinear constraints, we use fast heuristics and DoE methods to sample variables over \(\textrm{dom}(\textbf{x})\). We evaluate each constraint over the samples, and resample to find additional points near the constraint boundary for local approximation refinement.

  3. Train decision trees over constraint data: The feasibility space of each constraint is classified and approximated by an OCT-H. If the objective function is nonlinear, it is regressed and approximated via an ORT-H.

  4. Generate mixed-integer (MI) approximation: The decision paths and hyperplane splits are extracted from the trees, and used to formulate efficient MIO approximations of the nonlinear constraints using disjunctions.

  5. Solve MIO approximation: The resulting MIO problem is optimized using commercial solvers to get an approximate solution.

  6. Check and repair solution: The MIO problem approximates the global optimization problem, so the optimum is likely to be near-optimal and near-feasible. We evaluate the feasibility of each nonlinear constraint, and compute the gradients of the objective and nonlinear constraints using automatic differentiation. In case of suboptimality or infeasibility, we perform a number of projected gradient descent steps to repair the solution, so that it is feasible and locally optimal.

We describe the steps in greater detail in Sects. 3.1 through 3.6. A step-by-step demonstration of the method, as implemented in our optimizer OCT-HaGOn, can be found in Sect. 4.

3.1 Standard form problem

We restructure the global optimization problem posed in (1) by separating the linear and nonlinear constraints. The linear constraints are passed directly into a MIO model, while the nonlinear constraints are stored for approximation. If constraints are black box, they are assumed to be nonlinear as well. This restructured problem is shown in (2), and referred to as the standard form. Note that the standard form allows for both nonlinear inequalities and equalities.

$$\begin{aligned} \begin{aligned} \underset{x}{\text {min}}&~~ f(\textbf{x}) \\ \text {s.t.}&~~ g_i(\textbf{x}) \ge 0,~ i \in I, \\&~~ h_j(\textbf{x}) = 0,~j \in J, \\&~~\textbf{Ax} \ge \textbf{b},~\textbf{Cx} = \textbf{d}, \\&~~x_k \in [{\underline{x}}_k, {\overline{x}}_k],~k \in [n]. \end{aligned} \end{aligned}$$
(2)

3.1.1 Variable outer-bounding

The proposed method requires boundedness of decision variables \(\textbf{x}\) in each approximated constraint so that we can sample \(\textrm{dom}(\textbf{x})\) for constraint evaluation. When bounds are missing for any variable \(x_k\) in a nonlinear constraint, we pose the following optimization problem over the linear constraints only.

$$\begin{aligned} \begin{aligned} \underset{x}{\text {min/max}}&~~ x_k \\ \text {s.t.}&~~\textbf{Ax} \ge \textbf{b},~\textbf{Cx} = \textbf{d}, \\&~~x_i \in [{\underline{x}}_i, {\overline{x}}_i],~i \subseteq [n]. \end{aligned} \end{aligned}$$
(3)

The solution to this problem is the absolute largest range \([{\underline{x}}_k, {\overline{x}}_k]\) that satisfies all linear constraints as well as bounds on \(x_i\), for those indices i for which \(x_i\) is bounded. We can also solve the above optimization problem to tighten bounds on variables with existing bounds. Tighter bounds can significantly improve solution quality and time by improving the quality of ML approximators.
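As a small worked illustration of (3), with made-up numbers that are not drawn from any benchmark, suppose \(x_1\) has no bounds, \(x_2 \in [1, 3]\), and the linear constraints are \(x_1 + x_2 \le 6\) and \(x_1 - x_2 \ge -2\). Solving (3) in both directions gives

$$\begin{aligned} {\underline{x}}_1&= \min \{x_1 : x_1 \ge x_2 - 2,~x_1 \le 6 - x_2,~x_2 \in [1, 3]\} = -1, \\ {\overline{x}}_1&= \max \{x_1 : x_1 \ge x_2 - 2,~x_1 \le 6 - x_2,~x_2 \in [1, 3]\} = 5, \end{aligned}$$

with both values attained at \(x_2 = 1\), so that \(x_1\) can be sampled over \([-1, 5]\).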

3.2 Sampling and evaluation of nonlinear constraints

For the purpose of constraint learning, we require data over variables and corresponding left-hand-side values of nonlinear constraints. The importance of the quality of data for the accuracy of machine learning tasks has been well known and studied since the 1990s [10]. Thus, the distribution of data points used for constraint learning is critical. The samples over \(\textrm{dom}(\textbf{x})\) should be sufficiently space-filling so that the behavior of each constraint is captured over the whole \(\textrm{dom}(\textbf{x})\). In addition, we require sufficient concentration of points near the constraint boundary so that learners are adequately trained to predict the feasibility of near-feasible points.

To achieve both of these objectives, we take a disciplined approach to sampling, and generate data over \(\textrm{dom}(\textbf{x})\) for each constraint in several stages. Note that the sampling and evaluation steps in the following subsections are performed constraintwise.

3.2.1 Boundary sampling

We first sample the corners of the \(\textbf{x}\) hypercube for the constraint, defined by \(x_k \in [{\underline{x}}_k, {\overline{x}}_k],~k \subseteq [n]\), in an effort to capture extremal points. We call this boundary sampling. This is combinatorial in the number of variables in each nonlinear constraint; a constraint with p bounded variables would require \(2^p\) samples. In practice, we sample a limited combination of corner points, depending on the number of variables in the constraint.
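A minimal sketch of boundary sampling follows. The function name `corner_samples`, the cap `max_corners`, and the choice of drawing random corners once the cap is exceeded are assumptions made for illustration; the text only states that a limited combination of corners is used.

```python
import itertools
import random


def corner_samples(bounds, max_corners=64, seed=0):
    """Return corners of the box defined by bounds, capped at max_corners."""
    p = len(bounds)
    if 2 ** p <= max_corners:
        # Few variables: enumerate all 2^p corners.
        return [list(c) for c in itertools.product(*bounds)]
    # Many variables: draw distinct random corners instead of enumerating.
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_corners:
        chosen.add(tuple(rng.choice(b) for b in bounds))
    return [list(c) for c in chosen]


corners = corner_samples([(0, 1), (2, 3)])
capped = corner_samples([(0, 1)] * 10, max_corners=16)
```

With two variables all four corners are returned, while the ten-variable box (1024 corners) is reduced to 16 distinct corners.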

3.2.2 Optimal Latin hypercube sampling

Next, we implement optimal Latin hypercube (OLH) sampling over the \(\textbf{x}\) hypercube. There is a wealth of literature starting with McKay et al. [25] that demonstrates the strength of Latin hypercube (LH) sampling versus other methods for DoE. However, LH sampling is not in general a maximum entropy sampling scheme [33], i.e. the samples from LHs do not optimize information gained about the underlying system. OLH sampling is the entropy maximizing variant of LH sampling for a uniform prior, where our entropy function is the pairwise Euclidean distances between sample points [1]. The uniform prior assumption is logical since we do not have or require an initial guess for where in the \(\textbf{x}\) hypercube the optimal solution will land, and the constraints are treated as black boxes.

OLH sampling, unlike standard LHs, is space-filling and thus useful for learning the global behavior of constraints using ML models. In practice, OLH generation is time-consuming and impractical. Instead, we use an efficient heuristic proposed by Bates et al. [2], which uses a permutation genetic algorithm to find near-optimal solutions to the OLH problem with low computational cost. We terminate the genetic algorithm prematurely in our optimization scheme, since samples are not required to be optimally distributed.
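The idea of trading optimality for speed can be sketched as below. Note that this is not the permutation genetic algorithm of Bates et al. [2]: as a simpler stand-in, it draws several random Latin hypercubes and keeps the one whose closest pair of points is farthest apart (a maximin criterion, assumed here for illustration). The names `latin_hypercube` and `approx_olh` are hypothetical.

```python
import math
import random


def latin_hypercube(n, bounds, rng):
    """One random Latin hypercube: each of the n strata per variable is used once."""
    columns = []
    for lo, hi in bounds:
        strata = list(range(n))
        rng.shuffle(strata)
        # Place each point at the center of its stratum.
        columns.append([lo + (s + 0.5) * (hi - lo) / n for s in strata])
    return [list(row) for row in zip(*columns)]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points of a design."""
    return min(math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:])


def approx_olh(n, bounds, n_candidates=50, seed=0):
    """Keep the candidate design whose closest pair of points is farthest apart."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n, bounds, rng) for _ in range(n_candidates)]
    return max(candidates, key=min_pairwise_distance)


design = approx_olh(8, [(0.0, 1.0), (0.0, 1.0)])
```

Increasing `n_candidates` plays a role loosely analogous to running the genetic algorithm for more generations: the design becomes more evenly spread at higher cost.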

3.2.3 Constraint evaluation

We use the samples to either compute the left-hand-side of the constraint, or the feasibility of the constraint if the left-hand-side is not available. If the constraint is an equality \(h_j(\textbf{x}) = 0\), we relax it and treat it as an inequality \(h_j(\textbf{x}) \ge 0\) until Sect. 3.4. The result is a \(\{0, 1\}^n\) feasibility vector corresponding to each of the n samples, defining the classes for the classification problem.

If desired, assuming that constraints use a common set of samples, it is possible to lump the feasibility of a set of inequality constraints by taking the row-wise minimum of their joint feasibility over the same data. This can reduce the model complexity, but we currently do not consider this in our method.
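The evaluation step and the optional lumping of constraints can be sketched as follows. The two constraints `g1` and `g2` and the helper names are hypothetical; the joint label of a sample is the minimum of its per-constraint labels, so it equals 1 only if every constraint is satisfied.

```python
def feasibility(constraint, samples):
    """0/1 feasibility vector of one constraint g(x) >= 0 over the samples."""
    return [1 if constraint(x) >= 0 else 0 for x in samples]


def joint_feasibility(constraints, samples):
    """Per-sample minimum of the feasibility vectors of several constraints."""
    per_constraint = [feasibility(c, samples) for c in constraints]
    return [min(labels) for labels in zip(*per_constraint)]


def g1(x):
    # Hypothetical constraint: inside the unit disk.
    return 1.0 - x[0] ** 2 - x[1] ** 2


def g2(x):
    # Hypothetical constraint: nonnegative first coordinate.
    return x[0]


samples = [[0.0, 0.0], [0.5, 0.5], [-0.5, 0.0], [2.0, 0.0]]
joint = joint_feasibility([g1, g2], samples)
```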

3.2.4 kNN quasi-Newton sampling

The previous sampling methods achieve a space-filling distribution of samples in \(\textrm{dom}(\textbf{x})\) to enable approximating OCT-Hs to learn the feasibility of each constraint in a global sense. We still require sufficient concentration of points near the constraint boundary, i.e. points \(\tilde{\textbf{x}}_i\) so that \(g(\tilde{\textbf{x}}_i) \approx 0\), so our OCT-H models are trained to classify such near-feasible points accurately.

Assuming that the first stage sampling and evaluation has found at least one feasible point to the constraint, in this step, we attempt to sample near the constraint boundary using a method we have developed called kNN quasi-Newton sampling. The method hinges on using kNN to generate near-feasible neighborhoods for the constraint over previous data \((\tilde{\textbf{x}}, \tilde{\textbf{y}})\), and using approximate gradients in these neighborhoods to find new near-feasible samples \(\tilde{\textbf{u}}\), with vanishing \(g(\tilde{\textbf{u}}_i) = \epsilon \rightarrow 0\). We present the method in Algorithm 1.

Algorithm 1 kNN quasi-Newton sampling

The method is described as follows. Starting from space-filling data \((\tilde{\textbf{x}},\tilde{\textbf{y}})\) where \(\tilde{y}_{i} = g(\tilde{\textbf{x}}_{i})\), we find the k-nearest points for each sampled point \(\tilde{\textbf{x}}_i\) in the 0-1 normalized \(\textbf{x}\) hypercube. In our particular implementation, we use \(k=p+1\), where p is the number of variables in constraint \(g(\textbf{x}) \ge 0\). For each kNN cluster with index i centered at \(\tilde{\textbf{x}}_i\) with \(k-1\) neighbor indices \(\xi _i\), we determine if all sample points are feasible, all points are infeasible, or points are mixed-feasibility.

In each cluster with mixed-feasibility points, we perform the secant method between points of opposing feasibility. The secant method is an approximate root finding algorithm defined by the following recurrence relation

$$\begin{aligned} \tilde{\textbf{x}}_k = \tilde{\textbf{x}}_j - \tilde{y}_j\frac{\tilde{\textbf{x}}_j - \tilde{\textbf{x}}_{i}}{\tilde{y}_j - \tilde{y}_i}, \end{aligned}$$
(4)

where \(\tilde{\textbf{x}}_i\) and \(\tilde{\textbf{x}}_j\) are points of opposing feasibility in the same mixed-feasibility neighborhood, and \(\tilde{\textbf{x}}_k\) is a new candidate root. The secant method thus allows us to efficiently generate roots \(\tilde{\textbf{x}}_k\) that would be expected to be near the constraint boundary, using combinations of points \(\tilde{\textbf{x}}_i\) and \(\tilde{\textbf{x}}_j\) from the space-filling OLH samples. While we could recursively refine the computed root from (4) with more iterations, it is sufficient to perform one iteration of the algorithm to sample points that are close to the boundary.

We ensure that each pair of kNN-adjacent points on the constraint boundary results in only one new point, by only sampling within mixed-feasibility kNN cells if their centroid is infeasible, and then only sampling between the infeasible centroid and surrounding feasible points in the kNN cell. Once we have performed the kNN sampling process and have new samples \(\tilde{\textbf{u}}\), we evaluate the left-hand-side \(g(\textbf{x})\) over the samples and add them to data \((\tilde{\textbf{x}},\tilde{\textbf{y}})\) before proceeding to the tree training step.
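The procedure described above can be sketched in a few lines. This is an illustrative reading of the description rather than the implementation in OCT-HaGOn: the function name `knn_secant_samples`, the brute-force neighbor search, and the linear test constraint `g` are all assumptions.

```python
import math


def knn_secant_samples(X, y, bounds, k=None):
    """One secant step (4) between each infeasible center and its feasible neighbors."""
    k = k or len(bounds) + 1  # k = p + 1, as in the text
    # Normalize samples to the 0-1 hypercube before the neighbor search.
    Z = [[(x[d] - lo) / (hi - lo) for d, (lo, hi) in enumerate(bounds)] for x in X]
    new_points = []
    for i, zi in enumerate(Z):
        if y[i] >= 0:
            continue  # only infeasible centers generate new points
        others = [j for j in range(len(Z)) if j != i]
        others.sort(key=lambda j: math.dist(zi, Z[j]))
        for j in others[:k - 1]:  # the k - 1 nearest neighbors of point i
            if y[j] < 0:
                continue  # same feasibility as the center: no boundary crossing
            t = y[j] / (y[j] - y[i])  # denominator is positive since y[i] < 0 <= y[j]
            new_points.append([xj - t * (xj - xi) for xi, xj in zip(X[i], X[j])])
    return new_points


def g(x):
    # Hypothetical linear constraint, for which one secant step is exact.
    return x[0] + x[1] - 1.2


grid = [0.25, 0.75, 1.25, 1.75]
X = [[a, b] for a in grid for b in grid]
y = [g(x) for x in X]
U = knn_secant_samples(X, y, bounds=[(0.0, 2.0), (0.0, 2.0)])
```

For a linear constraint the new points lie on the boundary up to rounding; for a nonlinear constraint they are only expected to be close to it, which is why they are re-evaluated before being added to the data.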

3.3 Decision tree training

We use trees to approximate the nonlinear constraints in our global optimization problem due to their MIO representability, which we will demonstrate in Sect. 3.4. We use software from the company IAI in building, training and storing problem data in the form of OCT-Hs and ORT-Hs [20]. We train trees exclusively with hyperplane splits due to their higher approximation accuracy and lower tree complexity.

The trees are trained on all available data instead of a subset of the data as would be expected in traditional ML. In addition, we penalize tree complexity very little. This is because our data is noise-free, and approximation accuracy is important in the global optimization setting. In the case where the constraints are generated on noisy data, we would allow for the splitting of data into training and test sets, and cross-validate over a range of parameters.

We use the base OCT-H and ORT-H parameters in Table 1 within IAI when initializing constraint learning instances. These parameters are used for all computational benchmarks throughout the paper unless stated otherwise. The parameters have been chosen to balance tree accuracy with tree complexity and associated computational cost, and may be tuned by users as they find necessary.

Table 1 Parameters for base decision trees in constraint learning

Our training loss function for OCT-Hs is misclassification error. If a tree is a function that maps feature inputs into classes (\(T: \textbf{x}\xrightarrow {} y\)), the misclassification error is simply the weighted proportion of samples that are misclassified by the tree, where \(\mathbb {I}\) is the indicator function and \(w_i\) are the sample weights. An exact classifier would have a misclassification error of 0.

$$\begin{aligned} \text {misclassification error} = \frac{1}{n}\frac{\sum _{i=1}^{n} w_i \cdot \mathbb {I}(T(\textbf{x}_i) \ne y_i)}{\sum _{i=1}^{n} w_i}. \end{aligned}$$

For ORT-Hs used to approximate objective functions, we use \(1-\textrm{R}^2\) as the loss function, where \(\textrm{R}^2\) is the coefficient of determination. An exact regressor would have a \(1-\textrm{R}^2\) value of 0.

$$\begin{aligned} 1-\textrm{R}^2 = \frac{\sum _{i=1}^n(T(\textbf{x}_i) - y_i)^2}{\sum _{i=1}^n(y_i - {\bar{y}})^2},~\textrm{where}~{\bar{y}} = \frac{1}{n} \sum _{i=1}^n y_i. \end{aligned}$$
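As a brief sketch, the regression loss can be computed as below, assuming the standard definition of the coefficient of determination in which the total sum of squares is taken around the mean of the observed values \(y_i\). The function name `one_minus_r2` is hypothetical.

```python
def one_minus_r2(predictions, targets):
    """Residual sum of squares divided by total sum of squares (0 for an exact fit)."""
    mean_y = sum(targets) / len(targets)
    ss_res = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean_y) ** 2 for t in targets)
    return ss_res / ss_tot


exact = one_minus_r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = one_minus_r2([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

A regressor that always predicts the mean of the data has a loss of 1, which gives a natural scale for judging the quality of an ORT-H approximation.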

3.4 MI approximation

From this section forward, we recognize that the global optimization problem is approximated constraint-wise, and introduce indices \(i \in I\) and \(j \in J\) for the inequality and equality constraints respectively. Having classified the feasible space of nonlinear inequalities \(g_i(\textbf{x}) \ge 0,~i \in I\) and relaxed nonlinear equalities \(h_j(\textbf{x}) \ge 0,~j \in J\) using OCT-Hs, we retighten equalities to \(h_j(\textbf{x}) = 0,~j \in J\), and pose the feasible \(\textbf{x}\)-domains of each tree as unions of polyhedra. In this section, we define mathematically the set of disjunctive MI-linear constraints that represent the trees exactly.

3.4.1 Nonlinear inequalities

The tree \(T_{i}\) that classifies the feasible set of nonlinear inequality \(g_i(\textbf{x}) \ge 0\) has a set of leaves \(L_{i}\), where a subset of leaves \(L_{i,1} \subset L_i\) are classified feasible (where the indicator function \(\mathbb {I}(g_i(\textbf{x}) \ge 0) = 1\)) and \(L_{i,0} \subset L_i\) are classified infeasible (\(\mathbb {I}(g_i(\textbf{x}) \ge 0) = 0\)). The decision path to each leaf defines a set of separating hyperplanes, \(H_{i,l}\), where \(H_{i,l,-}\) and \(H_{i,l,+}\) are the set of leftward (less-than) and rightward (greater-than) splits required to reach leaf l respectively. The feasible polyhedron of tree \(T_i\) at feasible leaf \(l \in L_{i,1}\) is thus defined as

$$\begin{aligned} \textbf{P}_{i,l} = \{\textbf{x}: {\varvec{\alpha }}_h^{\top } \textbf{x}\le \beta _h, ~\forall ~h \in H_{i,l,-}~;~ {\varvec{\alpha }}_h^{\top } \textbf{x}\ge \beta _h, ~\forall ~ h \in H_{i,l,+}\}. \end{aligned}$$
(5)

The feasible set of \(\textbf{x}\) over constraint \(g_i(\textbf{x}) \ge 0\) is approximated by the union of the feasible polyhedra in (5). More formally,

$$\begin{aligned} \textbf{x}\in \bigcup _{l \in L_{i,1}} \textbf{P}_{i,l}. \end{aligned}$$
(6)
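To make the union-of-polyhedra view concrete, the sketch below tests whether a point lies in any feasible leaf. It assumes each leaf is stored as a pair of lists of \(({\varvec{\alpha }}_h, \beta _h)\) tuples, one for the leftward splits and one for the rightward splits on its decision path; this data layout, the function names and the numerical tolerance are assumptions of the example, not part of OCT-HaGOn.

```python
from typing import List, Sequence, Tuple

Split = Tuple[Sequence[float], float]          # (alpha, beta)
Leaf = Tuple[List[Split], List[Split]]         # (leftward splits, rightward splits)


def dot(a: Sequence[float], x: Sequence[float]) -> float:
    return sum(ai * xi for ai, xi in zip(a, x))


def in_polyhedron(x: Sequence[float], leaf: Leaf, tol: float = 1e-9) -> bool:
    """True if x satisfies every split on the path to this leaf."""
    left, right = leaf
    return (all(dot(a, x) <= b + tol for a, b in left)
            and all(dot(a, x) >= b - tol for a, b in right))


def in_union(x: Sequence[float], feasible_leaves: List[Leaf], tol: float = 1e-9) -> bool:
    """True if x lies in at least one feasible leaf polyhedron."""
    return any(in_polyhedron(x, leaf, tol) for leaf in feasible_leaves)


# Hypothetical two-leaf example in R^2 (values chosen for illustration only).
leaf_a: Leaf = ([((1.0, 1.0), 1.0)], [((1.0, 0.0), 0.5)])   # x1 + x2 <= 1 and x1 >= 0.5
leaf_b: Leaf = ([], [((1.0, 1.0), 1.5)])                    # x1 + x2 >= 1.5
print(in_union((0.75, 0.1), [leaf_a, leaf_b]))
```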

This union-of-polyhedra representation can be described by a set of disjunctive constraints involving a big-M formulation. Vielma [37] describes many such “projected” formulations; the specific disjunctive representation of OCT-Hs approximating nonlinear inequalities is as follows:

$$\begin{aligned} \textbf{x}\in \bigcup _{l \in L_{i,1}} \textbf{P}_{i,l} \iff {\left\{ \begin{array}{ll} \begin{aligned} &{}\{~{\varvec{\alpha }}_h^{\top } \textbf{x}\le \beta _h + M(1-z_{i,l}), ~\forall ~h \in H_{i,l,-}~; \\ &{}~\beta _h \le {\varvec{\alpha }}_h^{\top } \textbf{x}+ M(1-z_{i,l}), ~\forall ~ h \in H_{i,l,+}\},~ \forall ~l \in L_{i,1}, \\ ~&{}\sum _{l \in L_{i,1}} z_{i,l} = 1, \\ &{}~ z_{i,l} \in \{0, 1\},~l \in L_{i,1}, \\ &{}~M> |\beta _h|,~M > \underset{\textrm{dom}(\textbf{x})}{\max }~|{\varvec{\alpha }}_h^{\top } \textbf{x}|,~\forall ~h \in H_{i,l},~l \in L_{i,1}. \end{aligned} \end{array}\right. } \end{aligned}$$
(7)

Membership of \(\textbf{x}\) in polyhedron \(\textbf{P}_{i,l}\) is defined by binary variable \(z_{i,l}\). The constraint \(\sum _{l \in L_{i,1}} z_{i,l} = 1\) ensures that \(\textbf{x}\) is in exactly one feasible polyhedron. However, the formulation above requires knowing the value of M with sufficient accuracy, which can be difficult in practice. The value of M is important; too small an M means that the constraint is insufficiently enforced, and too large an M can cause numerical issues. Knowing M to a sufficient tolerance can require solving the inner maximization in (7) over \(\textrm{dom}(\textbf{x})\), and even declaring a separate \(M_h\) for each separating hyperplane \(h \in H_{i,l}\).
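When \(\textrm{dom}(\textbf{x})\) is a box, a valid constant for each hyperplane can be obtained in closed form, because \({\varvec{\alpha }}_h^{\top } \textbf{x}\) is linear and separable across coordinates. The sketch below uses this standard interval bound to compute a separate constant for each side of a split; it is offered as one possible way of obtaining such constants under the box assumption, not as the procedure used by the authors, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def linear_range(alpha: Sequence[float], lower: Sequence[float],
                 upper: Sequence[float]) -> Tuple[float, float]:
    """Minimum and maximum of alpha^T x over the box [lower, upper]."""
    lo = sum(min(a * l, a * u) for a, l, u in zip(alpha, lower, upper))
    hi = sum(max(a * l, a * u) for a, l, u in zip(alpha, lower, upper))
    return lo, hi


def big_m_for_split(alpha: Sequence[float], beta: float, lower: Sequence[float],
                    upper: Sequence[float]) -> Tuple[float, float]:
    """Smallest constants that switch a split off over the whole box when z = 0.

    Returns (m_le, m_ge): alpha^T x <= beta + m_le and beta <= alpha^T x + m_ge
    both hold for every x in the box.
    """
    lo, hi = linear_range(alpha, lower, upper)
    return max(hi - beta, 0.0), max(beta - lo, 0.0)


# Hypothetical split x1 - 2*x2 vs. 0.5 on the box [0, 2] x [0, 1].
print(big_m_for_split((1.0, -2.0), 0.5, (0.0, 0.0), (2.0, 1.0)))
```

If additional linear constraints shrink the domain below the box, these constants remain valid but may no longer be the tightest possible.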

Alternatively, we derive a representation that completely avoids the need to compute big-M values, since we restrict ourselves to \(\textbf{x}\in [{\underline{\textbf{x}}}, {\overline{\textbf{x}}}]\). The tradeoff is that we require the addition of auxiliary variables \(\textbf{y}_l \in \mathbb {R}^{p_i}\), for each leaf \(~l \in L_{i,1}\), where \(p_i\) is the dimension of variables in constraint i. We present the big-M free representation of OCT-Hs used to approximate nonlinear inequalities in (8). The formulation is an application of basic extended disjunctive formulations for defining unions of polyhedra, as detailed by Vielma [37].

$$\begin{aligned} \textbf{x}\in \bigcup _{l \in L_{i,1}} \textbf{P}_{i,l} \iff {\left\{ \begin{array}{ll} \begin{aligned} &{}\{~{\varvec{\alpha }}_h^{\top } \textbf{y}_l \le \beta _h z_{i,l}, ~\forall ~h \in H_{i,l,-}~; \\ &{}~\beta _h z_{i,l} \le {\varvec{\alpha }}_h^{\top } \textbf{y}_l, ~\forall ~ h \in H_{i,l,+}\}~\forall ~l \in L_{i,1}, \\ &{}~\textbf{y}_l \in [{\underline{\textbf{x}}} z_{i,l}, {\overline{\textbf{x}}} z_{i,l}],~l \in L_{i,1}, \\ &{}\sum _{l \in L_{i,1}} \textbf{y}_l = \textbf{x}, \\ ~&{}\sum _{l \in L_{i,1}} z_{i,l} = 1, \\ &{}~ z_{i,l} \in \{0, 1\},~l \in L_{i,1}. \\ \end{aligned} \end{array}\right. } \end{aligned}$$
(8)

Just as in the big-M formulation, whether or not \(\textbf{x}\) lies in polyhedron \(\textbf{P}_{i,l}\) is defined by binary variable \(z_{i,l} \in \{0, 1\}\). If \(\textbf{x}\) is in \(\textbf{P}_{i,l}\), then \(\textbf{x}= \textbf{y}_l\). If not, \(\textbf{y}_l = \textbf{0}\). Thus \(\textbf{x}\) can only lie in the leaves of \(T_i\) that are classified feasible.
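The role of the auxiliary copies \(\textbf{y}_l\) can be checked numerically. The sketch below verifies the constraints of the big-M free formulation for a given binary assignment, and builds the assignment described above (\(\textbf{y}_l = \textbf{x}\) for the selected leaf and \(\textbf{y}_l = \textbf{0}\) otherwise). The leaf data layout (lists of \(({\varvec{\alpha }}_h, \beta _h)\) tuples for leftward and rightward splits), the function names and the tolerance are assumptions of this example.

```python
from typing import List, Optional, Sequence, Tuple

Split = Tuple[Sequence[float], float]
Leaf = Tuple[List[Split], List[Split]]  # (leftward splits, rightward splits)


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def satisfies_extended(x, ys, zs, leaves: List[Leaf], lower, upper, tol=1e-9) -> bool:
    """Check the extended disjunctive constraints for binary zs."""
    if any(z not in (0, 1) for z in zs) or sum(zs) != 1:
        return False
    for y, z, (left, right) in zip(ys, zs, leaves):
        if any(dot(a, y) > b * z + tol for a, b in left):
            return False
        if any(dot(a, y) < b * z - tol for a, b in right):
            return False
        # y_l is confined to the box scaled by z_l, so z_l = 0 forces y_l = 0.
        if any(not (l * z - tol <= yk <= u * z + tol)
               for yk, l, u in zip(y, lower, upper)):
            return False
    # The copies must add up to x coordinate by coordinate.
    return all(abs(sum(y[k] for y in ys) - x[k]) <= tol for k in range(len(x)))


def lift(x, leaves: List[Leaf], lower, upper) -> Optional[Tuple[list, list]]:
    """Return (ys, zs) satisfying the formulation, or None if no leaf contains x."""
    for chosen in range(len(leaves)):
        zs = [1 if l == chosen else 0 for l in range(len(leaves))]
        ys = [list(x) if z else [0.0] * len(x) for z in zs]
        if satisfies_extended(x, ys, zs, leaves, lower, upper):
            return ys, zs
    return None


# Hypothetical feasible leaves on the unit box in R^2.
leaves = [([((1.0, 1.0), 1.0)], [((1.0, 0.0), 0.5)]), ([], [((1.0, 1.0), 1.5)])]
print(lift((0.75, 0.1), leaves, (0.0, 0.0), (1.0, 1.0)))
```

Because a zero binary collapses the corresponding copy to the origin, trying each leaf in turn is exhaustive for binary \(\textbf{z}_i\).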

Notably, formulation (8) is locally ideal; its continuous relaxation has at least one basic feasible solution, and all its basic feasible solutions are integral in \(\textbf{z}_i\) [37]. This confers computational advantages in optimization over such disjunctions compared to its big-M variant. Since disjunctive formulation (8) is tractable and big-M free, we implement it in OCT-HaGOn.

3.4.2 Nonlinear equalities

Nonlinear equalities can also be approximated by OCT-Hs. To do so, we simply relax \(h_j(\textbf{x}) = 0\) to \(h_j(\textbf{x}) \ge 0\) and fit an OCT-H \(T_j\) to the feasible set of this constraint, with polyhedra \(\textbf{P}_{j,l}\), where l can lie in feasible leaves \(L_{j,1}\) and infeasible leaves \(L_{j,0}\). The feasible set of the original equality must be represented by the union of the polyhedral faces between the feasible and infeasible leaves. It is critical to note however that this is not equivalent to the union of polyhedral faces, \(\textbf{x}\in \bigcup _{l \in L_{j}} \text {faces} (\textbf{P}_{j,l})\), since some of the faces separate two feasible spaces from each other, and thus would not be valid constraint boundaries. We are only interested in polyhedral faces that separate feasible polyhedra from infeasible polyhedra, where \(h_j(\textbf{x}) \ge 0\) and \(h_j(\textbf{x}) \le 0\). Therefore the approximate equality is the union of intersections of all permutations of a feasible polyhedron with an infeasible polyhedron,

$$\begin{aligned} \textbf{x}\in \bigcup _{l_0 \in L_{j,0},~l_1 \in L_{j,1}} \{\textbf{P}_{j,l_0} \cap \textbf{P}_{j,l_1}\}. \end{aligned}$$
(9)
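Membership in this set is equivalent to lying simultaneously in some feasible leaf and in some infeasible leaf, which the following sketch checks directly. The leaf layout (lists of \(({\varvec{\alpha }}_h, \beta _h)\) tuples for leftward and rightward splits), the tolerance and the example tree are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Split = Tuple[Sequence[float], float]
Leaf = Tuple[List[Split], List[Split]]  # (leftward splits, rightward splits)


def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))


def in_polyhedron(x, leaf: Leaf, tol: float = 1e-9) -> bool:
    left, right = leaf
    return (all(dot(a, x) <= b + tol for a, b in left)
            and all(dot(a, x) >= b - tol for a, b in right))


def on_approximate_equality(x, feasible: List[Leaf], infeasible: List[Leaf],
                            tol: float = 1e-9) -> bool:
    """True if x lies on a face shared by a feasible and an infeasible leaf."""
    return (any(in_polyhedron(x, leaf, tol) for leaf in feasible)
            and any(in_polyhedron(x, leaf, tol) for leaf in infeasible))


# Hypothetical tree: the root splits on x1 + x2 vs. 1, and the feasible side
# is split again on x1 vs. 0.5, giving two feasible leaves and one infeasible leaf.
feasible = [
    ([((1.0, 0.0), 0.5)], [((1.0, 1.0), 1.0)]),   # x1 + x2 >= 1 and x1 <= 0.5
    ([], [((1.0, 1.0), 1.0), ((1.0, 0.0), 0.5)]),  # x1 + x2 >= 1 and x1 >= 0.5
]
infeasible = [([((1.0, 1.0), 1.0)], [])]           # x1 + x2 <= 1
print(on_approximate_equality((0.5, 0.5), feasible, infeasible))
```

In this example the point \((0.5, 0.9)\) lies on the face between the two feasible leaves but is rejected, which is the distinction drawn above between all polyhedral faces and the faces that separate feasible from infeasible regions.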

To ensure that \(\textbf{x}\) lies on a face between a feasible and an infeasible polyhedron, we allocate a binary variable \(z_{j,l}\) for each leaf \(l \in L_j\). We make sure that \(\textbf{x}\) lies in exactly one feasible and one infeasible polyhedron by having exactly two non-zero \(z_{j,l}\)’s, one in a feasible leaf \(l \in L_{j,1}\) and the other in an infeasible leaf \(l \in L_{j,0}\). Thus we represent the approximate equality as the following set of disjunctive big-M constraints, where \(L_j = L_{j,1} \cup L_{j,0}\) is the combined set of feasible and infeasible leaves of tree \(T_j\).

$$\begin{aligned} \textbf{x}\in \bigcup _{\begin{array}{c} l_0 \in L_{j,0},\\ l_1 \in L_{j,1} \end{array}} \{\textbf{P}_{j,l_0} \cap \textbf{P}_{j,l_1}\} \iff {\left\{ \begin{array}{ll} \begin{aligned} &{}\{~{\varvec{\alpha }}_h^{\top } \textbf{x}\le \beta _h + M(1-z_{j,l}), ~\forall ~h \in H_{j,l,-}~; \\ &{}~~\beta _h \le {\varvec{\alpha }}_h^{\top } \textbf{x}+ M(1-z_{j,l}) , ~\forall ~ h \in H_{j,l,+}\}, \\ ~&{}\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad ~~\forall l \in L_j, \\ ~&{}\sum _{l \in L_{j,0}} z_{j,l} = 1,~~\sum _{l \in L_{j,1}} z_{j,l} = 1, \\ &{}~ z_{j,l} \in \{0, 1\},~~~l \in L_j. \end{aligned} \end{array}\right. } \end{aligned}$$
(10)

This guarantees that \(\textbf{x}\) falls on a polyhedral face that separates a feasible and infeasible polyhedron, thus approximating \(h_j(\textbf{x}) = 0\). As we have done for nonlinear inequalities, we can come up with an equivalent big-M-free formulation as follows, and implement it in OCT-HaGOn.

$$\begin{aligned} \textbf{x}\in \bigcup _{\begin{array}{c} l_0 \in L_{j,0},\\ l_1 \in L_{j,1} \end{array}} \{\textbf{P}_{j,l_0} \cap \textbf{P}_{j,l_1}\} \iff {\left\{ \begin{array}{ll} \begin{aligned} &{}\{~{\varvec{\alpha }}_h^{\top } \textbf{y}_l \le \beta _h z_{j,l}, ~\forall ~h \in H_{j,l,-}~; \\ &{}~\beta _h z_{j,l} \le {\varvec{\alpha }}_h^{\top } \textbf{y}_l, ~\forall ~ h \in H_{j,l,+}\},~ \forall ~l \in L_j, \\ &{}~\textbf{y}_l \in [{\underline{\textbf{x}}} z_{j,l}, {\overline{\textbf{x}}} z_{j,l}],~l \in L_j, \\ &{}\sum _{l \in L_{j,1}} \textbf{y}_l = \textbf{x},~~\sum _{l \in L_{j,0}} \textbf{y}_l = \textbf{x}, \\ ~&{}\sum _{l \in L_{j,1}} z_{j,l} = 1,~~\sum _{l \in L_{j,0}} z_{j,l} = 1, \\ &{}~ z_{j,l} \in \{0, 1\},~~~l \in L_j. \\ \end{aligned} \end{array}\right. } \end{aligned}$$
(11)

Note that nonlinear equalities pose the greatest challenge for any global optimization method, since the \(\epsilon \)-feasible space of equalities is restrictive.

3.4.3 Nonlinear objectives

We treat nonlinear objectives \(f(\textbf{x})\) differently than constraints. Constraints are represented well by classifiers because constraints partition the space of \(\textbf{x}\) into feasible and infeasible classes. Nonlinear objectives however are continuous with respect to \(\textbf{x}\), and are thus better approximated by regressors. To approximate a nonlinear objective function \(f(\textbf{x})\), we train an ORT-H on sample data \(\{\tilde{\textbf{x}}_i, f(\tilde{\textbf{x}}_i)\}_{i=1}^n\), and replace the nonlinear objective with the auxiliary variable \(f^*\). We lower bound the value of \(f^*\) using the disjunctive constraints derived from the ORT-H, thus approximating the original objective function.

We can apply the same logic to constraints of the form \(\textbf{a}^{\top } \textbf{x}+ b \ge g(\textbf{x})\), where the left-hand-side is affine and separable from the nonlinear component \(g(\textbf{x})\). Since \(\textbf{a}^{\top } \textbf{x}+ b\) is linear and MIO-compatible, we instead train an ORT-H on sample data \(\{\tilde{\textbf{x}}_i, g(\tilde{\textbf{x}}_i)\}_{i=1}^n\), and make sure that \(\textbf{a}^{\top } \textbf{x}+ b\) is lower bounded by the approximating ORT-H. It is the choice of the user whether or not to use OCT-Hs or ORT-Hs to approximate separable constraints, but in general an ORT-H is more accurate in these cases. All problems addressed in Sects. 5 and 7 treat constraints as non-separable, and use classifiers to approximate them. To solve the satellite scheduling problem in Sect. 6.2, we take advantage of this separability and choose to train ORT-Hs instead.

Since an ORT-H is an OCT-H with additional regressors added to each leaf, the disjunctive constraints in (8) and (11) apply with minor modifications described as follows. \(L_f\) is the set of leaves of the approximating ORT-H; assuming that \(f(\textbf{x})\) can be evaluated on \(\textrm{dom}(\textbf{x})\), all leaves \(l \in L_f\) of the ORT-H can feasibly contain \(\textbf{x}\), meaning that the disjunctions are applied to all leaves instead of a subset of the leaves of the tree. Each leaf \(l \in L_f\) has a set of separating hyperplanes that is described by its decision path, as well as an additional separating hyperplane described by the regressor in each leaf.

For objectives and separable inequalities, instead of using the regressor within each leaf of the ORT-H directly, we run a secondary linear regression problem on the points within each leaf to find the tightest lower bounding hyperplane on the data. This allows us to have an approximate relaxation of the constraint or objective function, and tighten the relaxation later via solution repair in Sect. 3.6.
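A simple way to see what a lower bounding hyperplane on leaf data looks like is sketched below for a single feature. Note that this is a swapped-in simplification: it fits an ordinary least-squares line and then shifts its intercept down until no sample lies below it, which yields a valid lower bound on the data but not necessarily the tightest one found by the secondary regression described above. The function name and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def lower_bounding_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of a line lying on or below every sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    # Shift down by the most negative residual so that the line touches the
    # lowest sample relative to the fit and stays below all others.
    shift = min(y - (slope * x + intercept) for x, y in zip(xs, ys))
    return slope, intercept + shift


# Hypothetical leaf data sampled from y = x^2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]
slope, intercept = lower_bounding_line(xs, ys)
print(slope, intercept)
```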

3.5 Solution of MIO approximation

Having represented the feasible space of inequality and equality constraints as unions of polyhedra, we have the following final problem.

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}}{\text {min}}&~~ f^*\\ \text {s.t.}&~~ f^*, \textbf{x}\in \bigcup _{l \in L_f} \textbf{P}_{f,l}, \\&~~\textbf{x}\in \bigcup _{l \in L_{i,1}} \textbf{P}_{i,l},~ \forall ~i \in I, \\&~~ \textbf{x}\in \bigcup _{l_0 \in L_{j,0},~l_1 \in L_{j,1}} \{\textbf{P}_{j,l_0} \cap \textbf{P}_{j,l_1}\},~\forall j \in J, \\&~~\textbf{Ax} \ge \textbf{b},~\textbf{Cx} = \textbf{d}, \\&~~ x_k \in [{\underline{x}}_k, {\overline{x}}_k],~k \in [n]. \end{aligned} \end{aligned}$$
(12)

This is a mixed-integer linear optimization (MILO) problem that can be solved efficiently using branch-and-bound methods. We use CPLEX for this purpose, since it is available free of charge for small-scale MILO instances.

3.6 Solution checking and repair

The optimum obtained in Sect. 3.5 is likely to be near-optimal and near-feasible to the original global optimization problem, since the MIO is approximate. To repair the solution in case of suboptimality or infeasibility, we devise and present a local search procedure based on projected gradient descent (PGD). PGD is a method for constrained gradient descent that is reliable, scalable and fast for the local optimization required to restore feasibility and optimality to approximate solutions. It relies on using gradients of the constraints and objective to simultaneously reduce constraint violation (by projecting \(\textbf{x}^*\) onto the feasible space of \(\textbf{x}\)) and the objective function value. Our particular implementation of PGD solves a series of gradient-driven MIO problems to do so, and has the additional benefit of being compatible with integer variables, unlike traditional interior-point methods.

To obtain the gradients of explicit and inexplicit constraints, we leverage automatic differentiation (AD), and specifically forward mode AD. Forward mode AD looks at the fundamental mathematical operations involved in evaluating the constraint functions, and thus computes the gradient of each constraint exactly at any solution \(\textbf{x}^*\) [36]. Unlike finite differencing, AD does not require additional function evaluations or discretization, and unlike symbolic differentiation, it does not require the constraints to be explicit.
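The mechanics of forward mode AD can be illustrated with dual numbers, which carry a value and a directional derivative through each primitive operation. The sketch below is a minimal, self-contained illustration and not the AD library used in OCT-HaGOn; the `Dual` class and helper names are assumptions of the example. It differentiates the constraint function \(g_1\) of problem (17), assuming that log denotes the natural logarithm, by seeding one input direction per pass.

```python
import math


class Dual:
    """Number of the form value + deriv * eps, with eps**2 == 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual._lift(other)
        return Dual(self.value - o.value, self.deriv - o.deriv)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        o = Dual._lift(other)
        # Product rule applied to the derivative part.
        return Dual(self.value * o.value, self.deriv * o.value + self.value * o.deriv)

    __rmul__ = __mul__


def log(u: Dual) -> Dual:
    """Natural logarithm with the chain rule applied to the derivative part."""
    return Dual(math.log(u.value), u.deriv / u.value)


def gradient(func, point):
    """Gradient of func at point, one forward pass per input coordinate."""
    grad = []
    for k in range(len(point)):
        args = [Dual(v, 1.0 if i == k else 0.0) for i, v in enumerate(point)]
        grad.append(func(*args).deriv)
    return grad


def g1(x1, x2, x3):
    return 0.8 * log(x2 + 1) + 0.96 * log(x1 - x2 + 1) - 0.8 * x3


# Evaluation point chosen for illustration only.
print(gradient(g1, (1.0, 0.5, 0.5)))
```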

The proposed PGD method begins by first evaluating the objective and all constraints at \(\textbf{x}^*\), the last known optimum, as well as their gradients. The disjunctive approximations of nonlinear inequality constraints are replaced by linear approximators based on the local constraint gradient, depending on the feasibility of each constraint:

$$\begin{aligned} g_i(\textbf{x}) \ge 0 \rightarrow {\left\{ \begin{array}{ll} \begin{array}{rl} \nabla g_i(\textbf{x}^*)^{\top } \textbf{d}+ g_i(\textbf{x}^*) \ge 0, &{}~\textrm{if}~ g_i(\textbf{x}^*) \ge 0, \\[1ex] \nabla g_i(\textbf{x}^*)^{\top } \textbf{d}+ g_i(\textbf{x}^*) + \lambda _i \ge 0, &{} ~\textrm{if}~ g_i(\textbf{x}^*) \le 0, \\ \end{array} \end{array}\right. } \end{aligned}$$
(13)

where \(\textbf{d}\in \mathbb {R}^n\) is the descent direction, and \(\lambda _i \in \mathbb {R}^+\) is an inequality relaxation variable. Similarly, we replace the MI approximations of equalities with their local linear approximators, but always include relaxation variables regardless of the level of infeasibility of the constraints, as shown in (14).

$$\begin{aligned} h_j(\textbf{x}) = 0 \rightarrow {\left\{ \begin{array}{ll} \begin{array}{ll} \nabla h_j(\textbf{x}^*)^{\top } \textbf{d}+ h_j(\textbf{x}^*) + \mu _j &{}\ge 0, \\[1ex] \nabla h_j(\textbf{x}^*)^{\top } \textbf{d}+ h_j(\textbf{x}^*) &{}\le \mu _j, \\ \end{array} \end{array}\right. } \end{aligned}$$
(14)

where \(\mu _j \in \mathbb {R}^+\) is an equality relaxation variable. We relax the equalities for two reasons. The first is that, in the presence of equalities, the local PGD step may be infeasible due to conflicting equality constraints. The second is that each PGD step involves solving a quadratic program, which can only be solved to a given numerical precision; the resulting error, while small, is non-zero.

Thus we introduce a constraint tightness tolerance parameter \(\phi \), and say that an inequality \(g_i(\textbf{x}) \ge 0\) is feasible at \(\textbf{x}^*\) if \(g_i(\textbf{x}^*) \ge -\phi \). If all inequality constraints are feasible to tolerance, relaxation variables \({\lambda }\) are only required on the inequalities where \(0 \ge g_i(\textbf{x}^*) \ge -\phi ,~i \in I\), by the condition in (13). In that case, we perform a simple gradient descent step. This involves solving the quadratic optimization problem in (15), where \(\gamma \) is the infeasibility penalty coefficient, \(\alpha \) is the step size within a 0-1 normalized \(\textbf{x}\) hypercube, r is the step size decay rate, t is the current PGD iteration and T is the maximum number of iterations.

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}, \textbf{d}, \lambda , \mu }{\text {min}}&~~ \nabla f(\textbf{x}^*)^{\top } \textbf{d}+ \gamma (||\lambda ||^2_2+||\mu ||^2_2) \\ \text {s.t.}&~~ \textbf{x}= \textbf{x}^* + \textbf{d}, \\&~~ \Bigg |\Bigg | \frac{\textbf{d}}{{\overline{\textbf{x}}} - {\underline{\textbf{x}}}} \Bigg |\Bigg |^2_2 \le \alpha \textrm{exp} \Big ( \frac{-rt}{T} \Big ), \\&~~ \left\{ \begin{array}{lr} \nabla g_i(\textbf{x}^*)^{\top } \textbf{d}+ g_i(\textbf{x}^*) \ge 0, &{}~\textrm{if}~ g_i(\textbf{x}^*) \ge 0 \\ \nabla g_i(\textbf{x}^*)^{\top } \textbf{d}+ g_i(\textbf{x}^*) + \lambda _i \ge 0, &{} ~\textrm{if}~ -\phi \le g_i(\textbf{x}^*) \le 0 \end{array}\right\} ,~\forall i \in I, \\&~~ \left\{ \begin{array}{lr} \nabla h_j(\textbf{x}^*)^{\top } \textbf{d}+ h_j(\textbf{x}^*) + \mu _j \ge 0, \\ \nabla h_j(\textbf{x}^*)^{\top } \textbf{d}+ h_j(\textbf{x}^*) \le \mu _j, \end{array}\right\} ,~\forall j \in J, \\&~~\textbf{Ax} \ge \textbf{b},~\textbf{Cx} = \textbf{d}, \\&~~ x_k \in [{\underline{x}}_k, {\overline{x}}_k],~\forall k \in [n], \\&~~ \left\{ \begin{array}{lr} \lambda _i = 0, &{}~\textrm{if}~ g_i(\textbf{x}^*) \ge 0 \\ \lambda _i \ge 0, &{} ~\textrm{if}~ g_i(\textbf{x}^*) \le 0 \end{array}\right\} ,~\forall i \in I, \\&~~ \mu _j \in \mathbb {R}_+,~j \in J. \end{aligned} \end{aligned}$$
(15)

We exponentially decrease the allowed size of the step \(\textbf{d}\), as defined by the norm constraint in (15), to aid convergence and break any cycles that may arise.
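The decaying bound on the normalized step can be sketched as follows. The function names are hypothetical, and the numerical values of \(\alpha \), r and T used in the usage lines are placeholders chosen for illustration rather than the defaults of Table 2.

```python
import math
from typing import Sequence


def step_bound(alpha: float, r: float, t: int, T: int) -> float:
    """Right-hand side of the step size constraint: alpha * exp(-r * t / T)."""
    return alpha * math.exp(-r * t / T)


def within_step_bound(d: Sequence[float], lower: Sequence[float], upper: Sequence[float],
                      alpha: float, r: float, t: int, T: int) -> bool:
    """True if the squared norm of the range-normalized step respects the bound."""
    normalized_sq = sum((dk / (u - l)) ** 2 for dk, l, u in zip(d, lower, upper))
    return normalized_sq <= step_bound(alpha, r, t, T)


# Placeholder parameter values, for illustration only.
for t in (0, 50, 100):
    print(t, step_bound(alpha=0.01, r=2.0, t=t, T=100))
```

Normalizing each component of \(\textbf{d}\) by the range of its variable makes the bound independent of the units of the individual decision variables.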

If the current solution \(\textbf{x}^*\) violates any constraint beyond tolerance, we take a projection-and-descent step. This modifies the objective and first two constraints in (15) by removing the step size constraint on \(\textbf{d}\), and augmenting the objective function with a projection distance penalty with \(\beta \) as a parameter, as shown in (16):

$$\begin{aligned} \begin{aligned} \underset{\textbf{x}, \textbf{d}, \lambda , \mu }{\text {min}}&~~ \nabla f(\textbf{x}^*)^{\top } \textbf{d}+ \beta \Bigg |\Bigg | \frac{\textbf{d}}{{\overline{\textbf{x}}} - {\underline{\textbf{x}}}} \Bigg |\Bigg |^2_2 + \gamma (||\lambda ||^2_2+||\mu ||^2_2) \\ \text {s.t.}&~~ \textbf{x}= \textbf{x}^* + \textbf{d}, \\ \vdots \end{aligned} \end{aligned}$$
(16)

This quadratic optimization problem approximates the closest feasible projection of \(\textbf{x}\) onto the feasible space of nonlinear constraints.

The gradient and projected gradient steps defined above require knowing the maximum range on all variables, \({\overline{\textbf{x}}} - {\underline{\textbf{x}}}\). If this range is not provided for variable \(x_k\), then we assume \({\overline{x}}_k - {\underline{x}}_k = \max ({\overline{\textbf{x}}}) - \min ({\underline{\textbf{x}}})\). The convergence of the PGD is much stronger with user-provided bounds however. We repeat the above PGD steps on new incumbent solutions until the final two solutions are feasible to all constraints, and the improvement in original objective function \(f(\textbf{x})\) is less than absolute tolerance \(\epsilon \).

The PGD algorithm introduces many parameters, whose default values are defined in Table 2. While this adds complexity to the solution procedure, the descent procedure is intuitive to tune, and the current implementation warns the user in case parameters require examination. In addition, the parameters are applied to 0-1 normalized quantities over the \(\textbf{x}\) hypercube wherever possible. For all examples in this paper, the default PGD parameters from Table 2 apply unless stated otherwise.

Table 2 Parameters for PGD repair procedure

4 Demonstrative example

Consider the following modified mixed-integer nonlinear optimization problem from Duran and Grossmann [12]. For demonstrative purposes, the original nonlinear objective has been replaced with a linear objective, and variables \(\textbf{y}\) have been concatenated to \(\textbf{x}\) for consistency of notation.

$$\begin{aligned}{} & {} \text {min} ~ f(\textbf{x}) = 10x_1 - 17x_3 -5x_4 + 6x_5 + 8x_6 \nonumber \\{} & {} \text {s.t.} ~ g_1(\textbf{x}) = 0.8\textrm{log}(x_2 + 1) + 0.96\textrm{log}(x_1 - x_2 + 1) - 0.8x_3 \ge 0, \nonumber \\{} & {} \qquad ~ g_2(\textbf{x}) = \textrm{log}(x_2 + 1) + 1.2\textrm{log}(x_1 - x_2 + 1) - x_3 - 2x_6 + 2 \ge 0, \nonumber \\{} & {} \qquad ~ x_1 - x_2 \ge 0,~~2x_4 - x_2 \ge 0, \nonumber \\{} & {} \qquad ~ 2x_5 - x_1 + x_2 \ge 0,~~1 - x_4 - x_5 \ge 0, \nonumber \\{} & {} \qquad ~ 0 \le x_1 \le 2,~0 \le x_2 \le 2,~0 \le x_3 \le 1, \nonumber \\{} & {} \qquad ~ x_4, x_5, x_6 \in \{0,1\}^3. \end{aligned}$$
(17)

We will focus on the nonlinear inequalities \(g_1(\textbf{x}) \ge 0\) and \(g_2(\textbf{x}) \ge 0\) as we implement the method step by step.
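For readers who wish to experiment with problem (17), the sketch below encodes its objective and constraints and checks a candidate point against them. It assumes that log denotes the natural logarithm; the function names, the tolerance and the two test points are hypothetical and are not solutions reported in this paper.

```python
import math
from typing import Sequence


def objective(x: Sequence[float]) -> float:
    x1, x2, x3, x4, x5, x6 = x
    return 10 * x1 - 17 * x3 - 5 * x4 + 6 * x5 + 8 * x6


def g1(x: Sequence[float]) -> float:
    x1, x2, x3, *_ = x
    return 0.8 * math.log(x2 + 1) + 0.96 * math.log(x1 - x2 + 1) - 0.8 * x3


def g2(x: Sequence[float]) -> float:
    x1, x2, x3, _, _, x6 = x
    return math.log(x2 + 1) + 1.2 * math.log(x1 - x2 + 1) - x3 - 2 * x6 + 2


def is_feasible(x: Sequence[float], tol: float = 1e-8) -> bool:
    x1, x2, x3, x4, x5, x6 = x
    in_bounds = 0 <= x1 <= 2 and 0 <= x2 <= 2 and 0 <= x3 <= 1
    binary = all(v in (0, 1) for v in (x4, x5, x6))
    linear = [x1 - x2, 2 * x4 - x2, 2 * x5 - x1 + x2, 1 - x4 - x5]
    # Check bounds and linear constraints first: x1 - x2 >= 0 also keeps
    # the arguments of the logarithms positive.
    if not (in_bounds and binary and all(v >= -tol for v in linear)):
        return False
    return g1(x) >= -tol and g2(x) >= -tol


# Hypothetical candidate points, for illustration only.
print(is_feasible((0, 0, 0, 0, 0, 0)), is_feasible((1, 1, 1, 1, 0, 0)))
```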

4.1 Standard form problem

Most global optimization problems are compatible with the standard form in Sect. 3.1 by construction. We demonstrate this by partitioning the original problem (17) below.

$$\begin{aligned} \begin{aligned} \text {min}&~ f(\textbf{x}) = 10x_1 - 17x_3 -5x_4 + 6x_5 + 8x_6&\textrm{Objective}&\\ \hline \text {s.t.}&~ g_1(\textbf{x}) = 0.8\textrm{log}(x_2 + 1) + 0.96\textrm{log}(x_1 - x_2 + 1) - 0.8x_3 \ge 0,&\textrm{Nonlinear}&\\&~ g_2(\textbf{x}) = \textrm{log}(x_2 + 1) + 1.2\textrm{log}(x_1 - x_2 + 1) - x_3 - 2x_6 + 2 \ge 0,&\textrm{constraints}&\\ \hline&~ x_1 - x_2 \ge 0,~~2x_4 - x_2 \ge 0,&\textrm{Linear}&\\&~ 2x_5 - x_1 + x_2 \ge 0,~~1 - x_4 - x_5 \ge 0,&\textrm{constraints}&\\ \hline&~ 0 \le x_1 \le 2,~0 \le x_2 \le 2,~0 \le x_3 \le 1,&\textrm{Variables}&\\&~ x_4, x_5, x_6 \in \{0,1\}^3.&\mathrm {and~bounds}&\end{aligned} \end{aligned}$$

We pass the linear constraints, variables and bounds directly to the MIO model, and confirm that all variables in nonlinear constraints, in this case \(x_1\), \(x_2\), \(x_3\) and \(x_6\), are bounded. Note the presence of binary \(x_4\), \(x_5\) and \(x_6\) in the problem as well.

4.2 Sampling and evaluation of nonlinear constraints

Next we generate samples over the nonlinear constraints using the procedure in Sect. 3.2. Note that \(g_1(\textbf{x}) \ge 0\) and \(g_2(\textbf{x}) \ge 0\) have 3 and 4 active variables, so samples are generated in \(\mathbb {R}^3\) and \(\mathbb {R}^4\) respectively. The resulting samples over \(g_1(\textbf{x}) \ge 0\) and their feasibilities are shown in Fig. 1. Note that the samples span the whole \(\textbf{x}\) hypercube, but that there are certain concentrations of points, thanks to the kNN sampling procedure, that approximate the constraint boundary. This improves the ability of the approximating OCT-H to be both globally and locally accurate.

Fig. 1 The distribution of data for constraint \(g_1(\textbf{x}) \ge 0\), generated by sampling procedures defined in Sect. 3.2

4.3 Decision tree training

We train two OCT-Hs to classify the feasible space of constraints \(g_1(\textbf{x}) \ge 0\) and \(g_2(\textbf{x}) \ge 0\). For demonstrative purposes, the trees were limited to a maximum depth of 3, as opposed to the standard depth of 6 used in OCT-HaGOn as defined in Table 1. The approximating OCT-H for \(g_1(\textbf{x}) \ge 0\) and the accuracy of its predictions are presented in Fig. 2. Notably, the OCT-H approximator achieves a high degree of accuracy (97%) throughout \(\textrm{dom}(\textbf{x})\) with only two feasible leaves.

Fig. 2 The approximating OCT-H achieves a high degree of accuracy, capturing both the global and local behavior of the constraint \(g_1(\textbf{x}) \ge 0\)

4.4 MI approximation

We pose the trees in a MIO-compatible form. As a bookkeeping note, auxiliary variables are introduced with two indices, the first indicating the constraint index, and the second indicating the numerical index of the leaf of the approximating OCT-H. This is consistent with the formulation in Sect. 3.4.

Fig. 3 \(g_1(\textbf{x}) \ge 0\) is approximated via 6 continuous and 2 binary auxiliary variables, and 6 linear constraints

Fig. 4 \(g_2(\textbf{x}) \ge 0\) is approximated via 8 continuous and 2 binary auxiliary variables, and 7 linear constraints

Figure 3 shows the approximating tree for constraint \(g_1(\textbf{x})\ge 0\), as well as its disjunctive representation as defined by (8). Since the constraint has three active variables \([x_1, x_2, x_3]\), and the tree has two feasible leaves with node indices 4 and 7, the disjunctive representation requires the definition of 6 auxiliary continuous variables \(\textbf{y}_{1,4} \in \mathbb {R}^3\) and \(\textbf{y}_{1,7} \in \mathbb {R}^3\), and two binary variables \(z_{1,4}\) and \(z_{1,7}\). The number of linear constraints required is 6, which is equal to the sum of the depths of each feasible leaf, plus 2 additional constraints defining the disjunctions.
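The counting rule in the preceding paragraph can be written as a small helper. The function name is hypothetical, and the leaf depths passed in the usage lines are placeholders chosen only so that the totals match the counts stated for Figs. 3 and 4; the actual depths of the feasible leaves are those shown in the figures.

```python
from typing import Dict, Sequence


def formulation_size(num_active_vars: int, feasible_leaf_depths: Sequence[int]) -> Dict[str, int]:
    """Size of the big-M free representation of one OCT-H approximated inequality."""
    n_leaves = len(feasible_leaf_depths)
    return {
        # One copy of the active variables per feasible leaf.
        "continuous": num_active_vars * n_leaves,
        # One leaf-selection binary per feasible leaf.
        "binary": n_leaves,
        # One constraint per split on each decision path, plus the two
        # constraints that define the disjunction.
        "linear_constraints": sum(feasible_leaf_depths) + 2,
    }


# Placeholder depths, assumed for illustration only.
print(formulation_size(3, [2, 2]))
print(formulation_size(4, [2, 3]))
```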

We approximate \(g_2(\textbf{x}) \ge 0\) in Fig. 4, with four active variables \([x_1, x_2, x_3, x_6]\), using the same approach.

4.5 Solution of MIO approximation

As described in Sect. 3.5, once the intractable constraints \(g_1(\textbf{x}) \ge 0\) and \(g_2(\textbf{x}) \ge 0\) are replaced with their tractable disjunctive approximations (18) and (19), the problem turns into a MILO that is tractable using commercial solvers. We solve the problem via CPLEX, and obtain a near-feasible, near-optimal solution with the objective value of \(-\) 7.685 in Fig. 5a.

Fig. 5 The MIO solution to the demonstrative example is successfully repaired to be feasible and locally optimal by the PGD method

4.6 Solution checking and repair

We check whether the approximate solution \(\textbf{x}^{*}\) is feasible to the original optimization problem (17) by evaluating the two nonlinear constraints. Since constraint \(g_1(\textbf{x})\ge 0\) is violated, we initiate the PGD repair procedure from Sect. 3.6. To do so, OCT-HaGOn replaces the MIO approximations of intractable constraints with the auto-differentiated gradients of the constraints at \(\textbf{x}^*\), and takes a local step to close the feasibility gap while descending along the objective. This is done iteratively, evaluating the objective function and nonlinear constraints at each step, until all constraints are feasible, and the change in the objective value falls below an absolute tolerance (\(10^{-4}\)). The path of the PGD algorithm is shown in Fig. 5b, on the surface of constraint \(g_1(\textbf{x}) \ge 0\). Note that this surface is unknown to the method, so it is notable that the first step projects towards it with remarkable accuracy, after which the method moves along the surface in a series of descent steps.

For this problem, the absolute tolerance of \(10^{-4}\) was too small to converge definitively, so the PGD algorithm terminates at its maximum of 100 iterations, with the optimal objective value of \(f(\textbf{x}^*) = -7.021\) and the optimal solution in Fig. 5a.

5 Computational experiments on small benchmarks

In Sects. 5 through 7, we apply OCT-HaGOn to a number of optimization problems, and benchmark it against other global optimizers when possible. The software implementation of OCT-HaGOn can be found via the link in “Appendix 10.1”. For the full list of optimizers used and their capabilities, please refer to “Appendix 10.2”. We lead these sections with a caveat. Since our approach is approximate, different random restarts of the solution procedure may yield different optima. To represent the performance of OCT-HaGOn in the fairest manner, all results in this paper were generated through a single run of the algorithm on the test cases on one computer, with no random restarts or tuning of parameters.

We first apply our method to five small benchmark problems from MINLPLib [9], and compare our results to those of BARON [31], a popular and effective commercial mixed-integer nonlinear program (MINLP) solver. The types and numbers of constraints in the benchmarks are listed in Table 3. The results are shown in Table 4.

Table 3 The five small nonlinear benchmarks from MINLPLib have a combination of nonlinear inequalities, equalities and objective
Table 4 Solutions to the small benchmarks using OCT-HaGOn and BARON

OCT-HaGOn is able to find the global optima for all five small benchmarks, matching the BARON solutions. OCT-HaGOn takes significantly longer to solve the small benchmarks than BARON. This is expected, since these problems have explicit constraints that only contain mathematical primitives BARON supports. Tree training time makes up the vast majority of the solution times for the small benchmarks; the MIO and PGD solution steps are efficient, taking less than 5% of the total time for each benchmark. Within the context of using optimization in design, where the optimization would be run many times to obtain a number of solutions on the Pareto frontier, OCT-HaGOn is competitive with, and even faster than, BARON, since the MIO and PGD steps are solved in a small fraction of the time it takes for the BARON solver to solve a single instance of each MINLP.

6 Real world examples

Given OCT-HaGOn’s ability to address small benchmarks, we test our method on two aerospace problems of varying complexity. We first solve a benchmark from the engineering literature, to show that the method can address real world problems. We then apply OCT-HaGOn to a satellite on-orbit servicing problem that could not be addressed using other global optimizers.

6.1 Speed reducer problem

The speed reducer problem is a nonlinear optimization problem posed in Golinski [16]. The problem aims to design a gearbox for an aircraft engine, subject to 11 specifications, geometry, structural and manufacturability constraints, in addition to variable bounds over \(\textbf{x}\in \mathbb {R}^7\). We apply our method to the problem as written in “Appendix 10.2” in standard form.

Table 5 Both OCT-HaGOn and IPOPT beat the best known (BK) solution of the speed reducer problem

In Table 5, we compare different solutions to the speed reducer problem. Both OCT-HaGOn and IPOPT beat the best known optimum from Lin and Tsai [22]. In addition, OCT-HaGOn allows us to satisfy all constraints with zero error after 4 iterations of the PGD algorithm as shown in “Appendix 10.2.1”, while the other two methods have small but nonzero error tolerances.

IPOPT was able to solve this particular nonlinear program (NLP) in 4.2 s, significantly faster than OCT-HaGOn, which took 32.6 s. However, this required a relaxation of the integrality of \(x_3\). For this particular problem, this was not concerning since \(x_3\) was lower bounded by its optimal value of 17. However, IPOPT cannot in general be used to solve MINLPs.

On a practical note, the OCT-H approximations of the underlying nonlinear constraints vary in complexity. Some constraints, while they look quite complex, have low-complexity tree approximators. For example, constraint \(g_5(\textbf{x}) \ge 0\) is approximated with perfect accuracy over 613 samples by its associated OCT-H approximator, which uses a single hyperplane split, as shown in Fig. 6. Within the bounded \(\textrm{dom}(\textbf{x})\), the nonlinear constraint is thus simplified to a linear constraint.

Fig. 6 The constraint \(g_5(\textbf{x}) \ge 0\) is accurately approximated by a single separating hyperplane over \(\textrm{dom}(\textbf{x})\)

Fig. 7 The objective function \(f(\textbf{x})\) is approximated via an ORT-H with 19 leaves (4 leaves shown) and \(1 - \textrm{R}^2\) error of \(1.4\times 10^{-5}\)

However, not all constraints are straightforward to represent via unions of polyhedra. Consider the objective function, which is a 5th order polynomial (20). In this particular case, the objective is represented by an ORT-H with 19 leaves, each defining a unique feasible polyhedron over \(\textbf{x}\). A truncated version of the tree, with four leaves visible, is shown in Fig. 7. The \(1-\textrm{R}^2\) error of the approximation is \(1.4\times 10^{-5}\) over 532 samples.

6.2 Satellite OOS problem

We test our method on the previously unsolved optimization problem of satellite on-orbit servicing (OOS) scheduling. Satellite OOS is a future technology that seeks to improve the lifetime of existing and next-generation satellites by allowing autonomous servicer spacecraft to perform repairs or refuels in orbit [23]. OOS is a difficult scheduling problem that acts on a highly nonlinear dynamical system. It is a good problem to address via our method since, in its full MINLP form, the problem is a nonconvex combinatorial optimization problem with nonlinear equality constraints. In addition, due to the 11 orders of magnitude difference in the ranges of decision variables, it is numerically challenging. Before this paper, it was addressed only via enumeration [23]. “Appendix 10.3” gives the full list of constraints; a succinct summary of the problem follows.

The dynamical problem is the orbital mechanics of moving a servicer satellite between client satellites in the same orbital plane. Orbital transfers involve using on-board thrusters to get the servicer into a different orbital altitude than the client satellite, called the phasing orbit, in order to reduce the true anomaly (angular phase difference in radians) between the servicer and the client. The servicer then propels itself back onto the client’s orbit to meet the client satellite at the right time and position in space, while obeying conservation of energy, momentum and mass. The scheduling problem involves both choosing the optimal order in which to serve each client satellite (discrete decisions), as well as choosing the optimal phasing orbits (continuous decisions).
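As a rough illustration of the phasing maneuver described above, the sketch below computes the velocity change and duration of a coplanar phasing maneuver from textbook two-body relations (orbital period and the vis-viva equation). It is not the constraint set of “Appendix 10.3”; the function names, the choice of Earth as the central body and the sign convention for the phase angle are assumptions made for this example, and altitude limits on the phasing orbit are not checked.

```python
import math

MU_EARTH = 398600.4418  # km^3/s^2, Earth's gravitational parameter

def circular_period(r):
    """Period (s) of a circular orbit of radius r (km)."""
    return 2.0 * math.pi * math.sqrt(r**3 / MU_EARTH)

def phasing_maneuver(r, theta, n_revs):
    """Delta-v (km/s) and duration (s) needed to close a phase angle theta (rad).

    The servicer leaves a circular orbit of radius r, completes n_revs
    revolutions on a phasing orbit, and re-enters the original orbit.
    theta > 0 means the client is ahead, so the phasing orbit must be
    faster (shorter period, lower altitude) than the client's orbit.
    """
    t_circ = circular_period(r)
    t_phase = t_circ * (1.0 - theta / (2.0 * math.pi * n_revs))
    a_phase = (MU_EARTH * (t_phase / (2.0 * math.pi)) ** 2) ** (1.0 / 3.0)
    v_circ = math.sqrt(MU_EARTH / r)
    v_phase = math.sqrt(MU_EARTH * (2.0 / r - 1.0 / a_phase))  # vis-viva at radius r
    delta_v = 2.0 * abs(v_circ - v_phase)  # one burn to enter, one burn to exit
    return delta_v, n_revs * t_phase
```

Spreading the same phase correction over more revolutions lowers the required velocity change at the cost of a longer maneuver, which is the continuous trade-off that the choice of phasing orbit controls.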

In this section, we consider a simple example of OOS. We schedule a single servicer satellite to refuel 7 client satellites in orbit, traveling between clients using on-board propulsors. Each client requires different amounts of fuel, and we constrain the servicer to fulfill its mission in 0.35 years, with the objective being to minimize the wet mass (the dry mass and fuel) of the servicer. The problem parameters are in Table 6.

Table 6 OOS problem parameters

The fuel requirements shown in Fig. 8 were randomly generated and reflect a possible distribution of fuel needs for client satellites that are part of the same constellation and were launched concurrently at a previous point in time.

Fig. 8 Client satellites require different amounts of fuel, which affects the optimal schedule for servicing

In addressing the OOS problem, we make the following realistic simplifying assumptions, although our method does extend to more general cases. The servicer satellite is delivered by an external rocket to the first client, and uses its own propulsion to perform Hohmann transfers between the subsequent client satellites. Thrusting and refueling steps take a negligible amount of time relative to maneuver steps. All client satellites are in the same orbital plane, at the same altitude, and are evenly spaced around the orbit.

The initial problem of servicing \(n_s = 7\) clients has 141 variables, of which \(n_s^2 = 49\) binary variables denote the servicing order. The continuous decision variables in nonlinear constraints are bounded from above and below to be compatible with OCT-HaGOn as defined in Sect. 3.1. There are 41 linear constraints in the model representing a subset of the system dynamics. On top of the linear constraints, we have \(10(n_s-1) = 60\) nonlinear constraints, all of them equalities. The constraints are presented in detail in “Appendix 10.3”.

We solve the problem in two ways. First we solve it via OCT-HaGOn. Since we know the constraints of this problem explicitly, we use the ORT-H approximation method as described in Sect. 3.4.3, separating nonlinearities from affine components of constraints for improved accuracy, and training a tree for each set of recurrent constraints. The resulting MIO problem has 999 continuous and 349 binary variables, and 3650 linear inequalities and 286 linear equalities.

Other global optimizers such as IPOPT and BARON could not be used as benchmarks for OCT-HaGOn on this particular problem. Since OOS is a mixed-integer problem, gradient-based optimizers such as IPOPT are rendered ineffective. While we attempted to use BARON by reformulating the constraints with \(\max \)-functions as defined in “Appendix 10.3” using binary variables, BARON deemed the problem infeasible. We know this is not the case since an aerospace engineer can construct a suboptimal, but still feasible orbital schedule without the help of optimization.

Instead, we successfully discretize out a subset of the nonlinearities in constraints by restricting the possible transfer orbits into 1 km bins. This reduces the complexity of the OOS problem to a MI-bilinear problem, which we are able to solve via Gurobi’s MI-bilinear optimizer [18]. The MI-bilinear representation has 394 variables, of which 289 variables are binary. 36 of the 60 nonlinear constraints are turned into bilinear equalities, while the rest are transformed into linear constraints. The solution of the discretized problem is globally optimal for that problem, but guaranteed to be no better than the global optimum of the full MINLP formulation, since a discrete set of orbit altitudes is a subset of the continuous set. However, the solution is granular enough to be a good benchmark for OCT-HaGOn.

Table 7 The discretized and OCT-HaGOn formulations come up with the same optimal satellite schedule, although the discretized solution performs \(0.1\%\) better

The results are presented in Table 7, and shown graphically in Fig. 9. Firstly, we look for two important effects, demonstrated well by the MI-bilinear solution and easily seen in Fig. 9. The first is that it is best to refuel satellites with the largest refuel requirements first, since a lighter servicer requires less fuel to transfer between subsequent clients. The second is that it is better to spend more time transferring in the beginning of the mission than the end, since transfers spend less fuel when the servicer is lighter. This is exhibited by a general downward trend in both maneuver times and fuel costs in the MI-bilinear solution.
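The first effect can be reproduced with a small calculation based on the Tsiolkovsky rocket equation. The sketch below is a simplified stand-in for the full OOS model: it assumes the same velocity change for every transfer, and the function name, masses, velocity change and exhaust velocity are hypothetical values chosen for illustration rather than the parameters of Table 6.

```python
import math

def wet_mass(dry_mass, deliveries, delta_v, exhaust_velocity):
    """Initial servicer mass needed to deliver fuel to clients in the given order.

    deliveries[k] is the fuel handed to the k-th client visited. The servicer
    starts at the first client, so there is one transfer per subsequent client,
    each with the same total delta_v (a simplifying assumption).
    """
    ratio = math.exp(delta_v / exhaust_velocity)  # mass ratio of one transfer
    mass = dry_mass
    # Walk the mission backwards: undo each delivery, then undo the burn before it.
    for k in range(len(deliveries) - 1, -1, -1):
        mass += deliveries[k]
        if k > 0:
            mass *= ratio
    return mass

needs = [30.0, 80.0, 50.0, 120.0]  # hypothetical client fuel needs (kg)
largest_first = wet_mass(500.0, sorted(needs, reverse=True), 0.1, 2.3)
smallest_first = wet_mass(500.0, sorted(needs), 0.1, 2.3)
```

Under these assumptions the fuel delivered to the k-th client is carried through k - 1 burns, so serving the largest requirements first gives the smaller wet mass. The full problem additionally couples the order to transfer times, which this sketch ignores.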

Fig. 9 While it captures the orbital dynamics well, OCT-HaGOn is not able to schedule the phasing orbits as well as the MI-bilinear formulation

While OCT-HaGOn properly captures the optimal satellite schedule, it is not able to find the optimal set of phasing orbits. This is easily seen by observing the flat profile of maneuver times in the OCT-HaGOn solution in Fig. 9a, which is suboptimal (by \(<0.1\%\) total fuel) relative to the decreasing profile seen in the discretized solution in Fig. 9b. In addition, due to the presence of many nonlinear equalities, the PGD method was not able to reduce the infeasibility and optimality gaps, getting stuck in a local optimum. With a maximum tree depth of 6, the solution has a maximum relative error of \(3.5 \times 10^{-3}\) and a mean relative error of \(2.5 \times 10^{-4}\) on all nonlinear constraints. While this is sufficiently accurate for conceptual design purposes, greater accuracy and a more robust repair procedure are desired.

In terms of solution time, OCT-HaGOn took 14.2 s when solved using a personal computer with an 8-core Intel i7 processor. That includes all sampling, evaluation, training and optimization steps. In comparison, the MI-bilinear solution took 17.7 s, just for the optimization step. This is in addition to the two days spent by an experienced engineer, reformulating the problem to be compatible with existing efficient optimization formulations.

Despite the suboptimal solution of OCT-HaGOn to the OOS problem, we argue that it is a strong demonstration of the capabilities and promise of the method, especially considering the problem complexity. Notably, OCT-HaGOn successfully finds the optimal satellite servicing schedule, which is arguably the most important decision in the problem. This is despite the fact that the problem is ill-conditioned, with 11 orders of magnitude difference in decision variable values, and has 60 nonlinear equality constraints coupling a majority of the decision variables. While BARON declares the problem infeasible, OCT-HaGOn’s disjunctive formulation is able to find and optimize over many feasible schedules. In addition, discretized reformulations of such complex global optimization problems may not exist in general; even if they do, they may be intractable due to the combinatorial nature of such reformulations. To the best of our knowledge, this makes OCT-HaGOn the only global optimization tool in the literature that can address this problem directly.

7 Computational experiments on MINLPLib benchmarks

We proceed by considering a subset of 357 benchmarks from the global repository of MINLPLib [9]. We restrict ourselves to benchmarks with fewer than 100 decision variables that have bounded variables in each nonlinear constraint. This results in 93 problems, which we solve via OCT-HaGOn. Given the increased difficulty of these larger benchmarks, we make a single modification to OCT-HaGOn to improve its computational time. To reduce training time of ORT-Hs on objective functions with more than 6 variables, we restrict the tree splits to univariate splits in these cases, as in CART (Breiman et al. [8]). Otherwise, we use the same training and PGD parameters included in Tables 1 and 2 respectively.

In addition to OCT-HaGOn, we address the same problems via BARON. To the best of the authors’ knowledge, BARON is the only other global optimizer that can address the wide variety of problems in the global repository without additional manipulation or reformulations. Since in practice BARON can take a long time to prove the global optimality of its solution, we restrict it to solve the optimization instances in 420 s, which is the maximum number of seconds taken by OCT-HaGOn to solve any of the benchmarks, rounded to the nearest minute. The quality of the OCT-HaGOn and BARON solutions is shown in Figs. 10 and 11 respectively, where the optimality gap is computed against the best known optima as documented in MINLPLib [9].

Fig. 10 OCT-HaGOn is able to find global optima in approximately half of the MINLPLib instances, where blue and red markers designate feasible and infeasible solutions respectively

Per Table 8, BARON is able to find the global optima for 88 out of 93 instances within the 420 s time limit, and finds feasible solutions in 91 instances. OCT-HaGOn exhibits weaker performance, finding the global optimum in 46 out of 93, or roughly half of the instances. Given that OCT-HaGOn is a proof of concept, we find these results encouraging. In addition, the benchmarks considered contain only the nonlinear mathematical primitives that BARON supports, giving BARON an advantage, not to mention the additional years of research and development.

OCT-HaGOn solves 6 out of the remaining 47 sub-optimal instances to near-optimality; while the objective values are near the optimum for these instances, the solutions do not achieve the constraint tightness tolerance of \(10^{-8}\). This shows room for substantial improvement in the PGD algorithm, since the method could not sufficiently reduce the infeasibility within the parameters listed in Table 2.

Fig. 11 BARON is able to optimize 88 out of the 93 benchmarks to global optimality within 420 s, and returns an infeasible solution in only 2 instances

The time performances of OCT-HaGOn and BARON are shown in Figs. 12 and 13 respectively, plotted with respect to the number of variables in the instances. OCT-HaGOn has a strong positive correlation between solution time and size of the problems; this is because the number of variables in the approximated constraints tends to drive both tree training time and MIO time, and thus total solution time. For problems with continuous variables, as problem size increases, the solution time is dominated by tree training time. For mixed-integer instances, both tree training and MIO time strongly influence solution time. BARON’s solution time has a more complex relationship with the number of variables. While the types of nonlinearities can affect both OCT-HaGOn’s and BARON’s solution times, the convergence rate of the branch-and-reduce method in BARON is significantly more sensitive to the nonlinearities present in the constraints. In 12 instances, BARON’s branch-and-reduce algorithm was terminated prematurely at 420 s; in all of these instances BARON had found a feasible solution, and in 11 of them it had found the global optimum without confirming optimality. Note that there is slight variation in the solution time of these instances at the 420 s cutoff, since the termination of BARON’s branch-and-bound procedure does not occur instantaneously.

Table 8 The relative performances of OCT-HaGOn and BARON on the 93 MINLPLib instances

Fig. 12 OCT-HaGOn has a clear positive relationship between solution time and number of variables in the optimization instances

Fig. 13 BARON’s solution time is more sensitive to the types of nonlinearities in the optimization problems than to the number of variables

In the instances that are solved to optimality by both BARON and OCT-HaGOn, OCT-HaGOn takes on average 31.0 s per instance, with a standard deviation of 78.8 s. BARON takes on average 34.5 s on the same instances, with a standard deviation of 110.3 s. Given that OCT-HaGOn was substantially slower than BARON in solving the smaller benchmarks in Sect. 5, this indicates the scalability of OCT-HaGOn’s constraint learning approach as the problem size increases, especially due to the local idealness of the MIO approximation of decision trees. While these results are promising in showing that OCT-HaGOn can find global optima and can scale to larger problems, they highlight room for improvement.

8 Discussion

In this section, we discuss the results and limitations, and propose areas for future work.

8.1 Limitations

The proposed method shows promise in solving a variety of global optimization problems, but it is a work in progress. Here we detail some of the limitations of the method as implemented in this paper.

As noted in Sect. 5, the proposed method has no guarantees of global optimality since it is a heuristic. Different iterations of the method generate high-performing solutions that are locally optimal, but do not have guarantees of global optimality such as those provided by BARON. OCT-HaGOn could be augmented in the future with lower- and upper-bounding approximators for constraints involving certain compositions of mathematical primitives, generating optimality bounds for objective functions and feasibility bounds for constraints. However, this is outside the scope of this work and would require the definition of bounds for each mathematical primitive in \(\textrm{dom}(\textbf{x})\) lying in arbitrary polyhedra. This can be challenging in practice.

While the OOS example demonstrates that the method can address problems with a high degree of nonlinear coupling between decision variables, individual nonlinear constraints involving a large number of active variables will pose challenges in both the OCT-H training time, as well as the accuracy of the tree approximations. Tree accuracy directly affects the quality of the approximate optima. Separability of constraints into linear combinations of nonlinear functions, as described in Sect. 3.4, can partially mitigate this problem, by allowing many nonlinear constraints to be decomposed and better approximated via a series of ORT-Hs.

In addition, while the method is agnostic about whether constraints are explicit or inexplicit, the method has so far been tested on explicit constraints only. This is because of the unavailability of numerical benchmarks with black box functions, due to their incompatibility with other existing global optimizers. An implicit assumption of the method is that the intractable constraints are quick to evaluate; if this assumption is not true, then the implementation may need to change to accommodate computational requirements. Additionally, the PGD method requires that the constraint functions are auto-differentiable. While this is a modest assumption, it is possible that constraint evaluations do not allow for AD. This could be overcome by finding gradients approximately, e.g. via finite differencing, but this is not currently implemented.
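For reference, a central-difference gradient estimate of the kind mentioned above could look like the following minimal sketch. It is not part of OCT-HaGOn; the function name, the default step size and the choice to scale the step by the magnitude of each variable are assumptions made for this example.

```python
def finite_difference_gradient(func, x, step=1e-6):
    """Central-difference estimate of the gradient of func at x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        h = step * max(1.0, abs(x[i]))  # scale the step to the variable's magnitude
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2.0 * h))
    return grad

# Usage on a hypothetical black-box constraint g(x) = x0^2 * x1 + 3 * x1.
estimate = finite_difference_gradient(lambda x: x[0] ** 2 * x[1] + 3.0 * x[1], [2.0, 5.0])
```

Each gradient costs two constraint evaluations per variable, so this route is only practical when evaluations are cheap. Near variable bounds the perturbed points may also leave the bounded domain, which a real implementation would need to handle.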

Finally, we have yet to rigorously test how solution time and quality scale with the number of variables and nonlinear constraints, and the sparsity of the nonlinear constraints. Given that the performance of OCT-HaGOn is formulation-dependent, there is much to be gained, both in terms of solution time and quality, through formulations that anticipate where OCT-H approximations need to be used, and use them judiciously. We expect OCT-HaGOn to be particularly effective when a majority of the constraints in the optimization problem are linear or convex and therefore efficiently optimizable, and the constraint learning approach is applied only to the otherwise intractable constraints.

With these limitations outlined, we continue by proposing future work to improve the method.

8.2 Decision tree training

Decision tree training is a key component of OCT-HaGOn in finding both feasible and optimal solutions. Since OCT-HaGOn was not able to find a feasible solution for 26 out of 93 benchmarks, it is clear that there is room for improvement in determining the right balance of number of samples, tree depth and tree complexity in order to best capture the feasible \(\textrm{dom}(\textbf{x})\) over each constraint. Currently, tree training is done in a static manner, and the disjunctive approximations are not refined as the optimizer tries to converge to the global optimum. The best method for finding the right balance is likely a dynamic method for decision tree generation, where new samples are generated and trees iteratively grown in order to refine the constraint approximations near candidate optima. This would improve both the solution quality and time performance of OCT-HaGOn. Since the authors used an off-the-shelf OCT-H algorithm from IAI to build decision trees, such dynamic tree refinement has so far been outside the scope of OCT-HaGOn and this paper.

In addition, the majority of the solution time of OCT-HaGOn is taken by the tree training step. While the computational cost of training is linear in the number of constraints, the results on benchmarks in Sect. 5 show that training time can vary dramatically depending on the complexity of the underlying constraints.

We discuss several ways to manage computational time. The first potential source of training time reduction comes from tuning the base tree parameters described in Table 1 and implemented in IAI. To do so, we can reduce the complexity of the trees, by reducing the maximum depth and increasing the minbucket parameters. Otherwise we can modify the number of random restarts in tree training. Since the local search method used in generating OCT-Hs and ORT-Hs is locally optimal, we can reduce training time by changing the number of random restarts of candidate trees, as well as the number of random hyperplane restarts. However, both methods have a clear negative tradeoff with respect to the accuracy of the OCT-H approximations. In general, we find that using 10 random tree restarts and 5 hyperplane restarts, as described by the base tree parameters in Table 1, we are able to generate trees that are sufficiently accurate for decision making while being efficient enough to use in a real-time optimization setting. One could also speed up the training process by trying a greedy approach, building trees with hyperplanes in a locally optimal manner similar to CART [8], instead of a globally optimal manner via local search heuristics [5].

A potentially large source of training time reduction is from recognizing the common form of constraints in a problem. If a nonlinear constraint \(g(\textbf{u}_i) \ge 0\) is repeated k times with different variables \(\textbf{u}_i \subset \textbf{x},~i \in [k]\), the constraints can be approximated jointly. Specifically, we can train a single OCT-H to approximate the constraint over the domain \(\cup _{i=1}^k \textrm{dom}(\textbf{u}_i)\). We then express the k constraints as k repetitions of the disjunctive representation of the tree with different variables \(\textbf{u}_i\). In this paper, many benchmarks in Sect. 7 exhibit this kind of repeating behavior, but we treat the constraints as black boxes and do not take advantage of potential speed-ups. For the OOS problem however, we use our knowledge of the constraints to train the trees jointly.
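The reuse of one tree across repeated constraints amounts to re-indexing the learned hyperplanes. The sketch below shows that bookkeeping under simplifying assumptions: a leaf is represented by bare hyperplanes \(\textbf{a}^\top \textbf{u} \ge b\), the binary leaf-selection variables are omitted, and all function and argument names are hypothetical rather than taken from the OCT-HaGOn or IAI interfaces.

```python
def instantiate_leaf(coeffs, offset, var_indices, n_vars):
    """Map a hyperplane a^T u >= b, learned over local variables u,
    onto the full decision vector x using the indices that u occupies in x."""
    row = [0.0] * n_vars
    for a, idx in zip(coeffs, var_indices):
        row[idx] = a
    return row, offset

def replicate_constraint(leaf_hyperplanes, index_sets, n_vars):
    """Reuse one trained tree for k repeated constraints.

    leaf_hyperplanes: list of (coeffs, offset) pairs from the single shared tree.
    index_sets: one tuple of x-indices per repetition u_i.
    Returns, for each repetition, the hyperplanes expressed over x.
    """
    return [
        [instantiate_leaf(a, b, idx, n_vars) for a, b in leaf_hyperplanes]
        for idx in index_sets
    ]

# One shared hyperplane u_1 - 5 u_2 >= 0, applied to (x_0, x_1) and to (x_2, x_3).
copies = replicate_constraint([([1.0, -5.0], 0.0)], [(0, 1), (2, 3)], n_vars=4)
```

Training happens once regardless of k; only the cheap re-indexing step is repeated, which is where the potential saving comes from.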

There are also improvements that could be made considering computing architecture. Since individual constraints are learned separately, the training process could be done in parallel, making the best use of available computational resources. For individual problems, the trees can be efficiently stored once trained, allowing the same trees to be used in different instances of the same optimization problem. This avoids the need to retrain trees, and also avoids having to store the samples required to train the trees, saving on memory.

8.3 Complexity of the MIO approximation

As noted previously, the complexity of solving the MIO approximations of global optimization problems is modest, since the scale of the MIO is small compared to the abilities of commercial solvers such as Gurobi or CPLEX. However, it is important to note how the complexity of the MIO can scale depending on the number of nonlinear constraints and the depth of the approximating trees.

We first consider the number of auxiliary variables required to pose the MIO approximation. The number of variables used to approximate a nonlinear constraint is a linear function of the number of disjunctive polyhedra describing the feasible space of \(\textbf{x}\), as well as the number of decision variables in the constraint. More explicitly, the total number of binary variables required to approximate the problem is linear with respect to the number of leaves in the decision trees, and equivalent to

$$\begin{aligned} |L_f| + \sum _{i \in I} |L_{i,1}| + \sum _{j \in J} |L_{j}|, \end{aligned}$$

where \(L_f\), \(L_{i,1}\) and \(L_{j}\) are the set of feasible leaves in the objective-, inequality- and equality-approximating trees respectively. In addition, we introduce a number of continuous auxiliary variables. The number of auxiliary variables is equivalent to:

$$\begin{aligned} 1 + |L_f|(p_f + 1) + \sum _{i \in I} \Big (|L_{i,1}| p_i\Big ) + \sum _{j \in J} \Big (|L_{j}| p_j \Big ), \end{aligned}$$

where \(p_f\), \(p_i\) and \(p_j\) are the numbers of variables in the objective, the ith inequality constraint and the jth equality constraint, respectively. The maximum number of leaves of a tree is \(2^d\), so in the worst case, the number of auxiliary binary variables in the problem is \(\mathcal {O}(2^d(1+|I|+|J|))\), and the number of auxiliary continuous variables is \(\mathcal {O}(2^d(1+|I|+|J|)\textrm{dim}(\textbf{x}))\), equivalent to the number of binary variables augmented by the dimension of \(\textbf{x}\). In practice, however, this worst case is not seen, as the trees are pruned during the training process, and approximated intractable constraints are sparse in \(\textbf{x}\).

The number of disjunctive constraints is more complicated, since the trees are not guaranteed to be of uniform depth, and we do not know a priori the fraction of feasible leaves for a classification tree. However, if we assume that each tree has a depth \(d_i\), we get the following worst case number of disjunctive constraints, not including the univariate bounding constraints for the continuous auxiliary variables:

$$\begin{aligned} \Big (2^{d_f} \times (d_f + 1)\Big ) + 3 + \sum _{i \in I} \Big ((2^{d_i}-1) \times d_i \Big ) + \sum _{j \in J} \Big ((2^{d_j}-1) \times d_j \Big ) + 2|I| + 4|J|. \end{aligned}$$

The above implies that the number of disjunctive constraints in the MIO is \(\mathcal {O}(2^dd(1+|I|+|J|))\), where d is the maximum depth of all approximating trees. This shows the exponential impact of tree depth on MIO complexity, where the need for greater accuracy may result in large computational cost. However, for the small to medium scale instances we have considered in this paper, this is an acceptable tradeoff.
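The three counts above are easy to evaluate for a given set of trees. The following sketch transcribes the expressions directly; the function and argument names are assumptions introduced for this example, with leaf counts, variable counts and depths passed as one entry per constraint.

```python
def binary_variable_count(leaves_f, leaves_ineq, leaves_eq):
    """|L_f| + sum_i |L_{i,1}| + sum_j |L_j|."""
    return leaves_f + sum(leaves_ineq) + sum(leaves_eq)

def continuous_variable_count(leaves_f, p_f, leaves_ineq, p_ineq, leaves_eq, p_eq):
    """1 + |L_f|(p_f + 1) + sum_i |L_{i,1}| p_i + sum_j |L_j| p_j."""
    total = 1 + leaves_f * (p_f + 1)
    total += sum(n * p for n, p in zip(leaves_ineq, p_ineq))
    total += sum(n * p for n, p in zip(leaves_eq, p_eq))
    return total

def worst_case_disjunctive_constraints(d_f, depths_ineq, depths_eq):
    """Worst-case number of disjunctive constraints, excluding univariate bounds."""
    total = 2**d_f * (d_f + 1) + 3
    total += sum((2**d - 1) * d for d in depths_ineq)
    total += sum((2**d - 1) * d for d in depths_eq)
    total += 2 * len(depths_ineq) + 4 * len(depths_eq)
    return total
```

For example, one objective tree of depth 2, one inequality tree of depth 2 and one equality tree of depth 1 give a worst-case count of 28 disjunctive constraints. Increasing any depth by one roughly doubles that tree's contribution.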

Additionally, the number of variables grows linearly with the number of constraints, which could result in the solution time of OCT-HaGOn being exponential in the worst case. Unlike linear or convex optimization problems, where the average solution time can be sublinear with the number of constraints, OCT-HaGOn is expected to have on average super-linear solution time with respect to number of constraints due to the combinatorial nature of the approximations. We have yet to observe problems that exhibit such exponential-time behavior, likely because of the sparsity of the approximating constraints and the local idealness of the formulation. However, tree complexity needs to be investigated as OCT-H approximations are applied to large scale problems.

8.4 Extending to MI-convex formulations

The OCT-HaGOn approach allows us to generate efficient MIO representations of nonlinear constraints that are not efficiently optimizable, i.e. not linear or convex. It opens up the possibility to include these approximations in more general MI-convex formulations, where the efficient convex nonlinear constraints are preserved, either via direct insertion or via outer approximation, while the intractable constraints are approximated via OCT-Hs. This will significantly improve both the speed and accuracy of our method.

8.5 Improved random restarts

As noted previously, since the constraint learning approach is approximate, random restarts may be required to gain confidence in the quality of the locally optimal solutions. Currently, random restarts for OCT-HaGOn involve retraining trees over all nonlinear constraints, and replacing them simultaneously. A better method would be to train an ensemble of trees on each constraint, and permute the tree approximations to generate a set of MI approximations of the problem. The solution of each permutation would provide a near-optimal seed for a new PGD sequence. This would reduce the computational burden of random restarts and result in higher-performing populations of solutions, giving increased confidence in the method.
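One way to read the proposed ensemble scheme is as an enumeration of every combination of one tree per constraint. The sketch below shows that enumeration only; it is an interpretation rather than an implemented feature of OCT-HaGOn, and the function name and the use of plain strings as stand-ins for trained trees are assumptions.

```python
from itertools import product

def mio_approximations(ensembles):
    """Yield every way of picking one tree per constraint.

    ensembles: dict mapping a constraint name to the list of trees trained on it.
    Each yielded dict selects one tree per constraint and therefore defines
    one MIO approximation of the problem.
    """
    names = list(ensembles)
    for choice in product(*(ensembles[name] for name in names)):
        yield dict(zip(names, choice))

# Two trees for a constraint g1 and three for h1 give six approximations.
combos = list(mio_approximations({"g1": ["a", "b"], "h1": ["c", "d", "e"]}))
```

The number of combinations is the product of the ensemble sizes, so for many constraints one would likely sample a subset of combinations rather than enumerate all of them.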

8.6 Optimization over data-driven constraints

There are global optimization contexts where constraints are informed by data, without having access to the underlying models. Some examples are simulation data in the design of engineered systems, outcomes of past experiments, or anthropogenic data such as clinical data and consumer preferences. In theory, OCT-HaGOn is able to learn constraints from arbitrary data and integrate these models in an optimization setting. However, we have yet to perform experiments to confirm the efficacy of OCT-HaGOn in real-world decision making using data-driven constraints. Such an embedding of data into optimization via constraint learning has important implications for a variety of fields, such as healthcare and operations research.

8.7 Integration of other MIO-compatible ML models

While this paper focuses on the use of OCT-Hs and ORT-Hs for constraint learning, there are other ML models that have optimization-compatible representations. Maragno et al. [24] explore the possibility of using linear models, decision trees and their variants, and multi-layer perceptrons to learn constraints and objectives from data. OCT-HaGOn could easily be extended to accommodate such other MIO-representable ML models.

9 Conclusion

In this paper, we have proposed an intuitive new heuristic method for solving global optimization problems leveraging interpretable ML and efficient MIO. Our method approximates explicit and inexplicit nonlinear constraints in global optimization problems using OCT-Hs and ORT-Hs, using the natural disjunctive representation of decision trees. We demonstrate, both theoretically and practically, that the disjunctive MIO approximations are efficiently solvable using modern solvers, and result in near-optimal and near-feasible solutions to global optimization problems. We then improve our solutions using gradient-based methods to obtain feasible and high-performing solutions. We demonstrate that our global optimizer OCT-HaGOn can address a number of benchmark and real-world problems. The Julia implementation of OCT-HaGOn as described in this paper is available via the link in “Appendix 10.1”.

The method we present is more than a new tool in the global optimization literature. Tree-based optimization stands out among existing global optimization tools because it can handle constraints that are explicit and inexplicit, and even learn constraints from arbitrary data. To the best of the authors’ knowledge, it is the most general global optimization method in the literature, since it has no requirements on the mathematical primitives of constraints or variables. Our method only requires a bounded decision variable domain over the nonlinear constraints. This has important implications for a number of fields that can benefit from optimization, but have yet to do so due to lack of efficient mathematical formulations.