1 Introduction

State-of-the-art algorithms for deterministic global optimization of nonconvex problems are mainly based on adaptations of the spatial branch-and-bound algorithm [17, 25, 35], where lower and upper bounds are computed in each node of the branch-and-bound tree. In most cases, established nonlinear local solvers supported by several heuristics are used to compute valid upper bounds. The computation of lower bounds is performed through the application of a general method for the construction of valid convex and concave relaxations of the original functions participating in the optimization problem. For reasons of numerical reliability and computational cost, established deterministic global solvers such as ANTIGONE [27], BARON [22] or SCIP [38] construct a linear outer approximation of the nonlinear convex relaxation and use well-developed linear programming methods for the determination of valid lower bounds instead of applying local convex techniques directly. The selection of linearization points affects the tightness of the linear relaxation and consequently the lower bound obtained through the solution of the resulting linear program.

The auxiliary variable method (AVM) [33, 35, 36] is a general method for the construction of convex and concave relaxations of factorable functions, i.e., functions representable as a finite recursion of addition, multiplication and composition of intrinsic functions. The AVM introduces an auxiliary variable together with a corresponding auxiliary equality constraint for every intermediate nonlinear factor of a given function. Then, the convex and concave envelopes of each factor are constructed, providing a convex relaxation and a concave relaxation of the original function. To linearize the convex and concave envelopes of each factor provided by the AVM, the so-called sandwich algorithm has been developed (Section 4.2 in [35]). The algorithm starts with a small set of linearization points and computes additional points based on a particular rule, e.g., the maximum error rule or the chord rule, for every auxiliary factor until a pre-defined maximum number of linearization points has been computed. This method works very well for the AVM, since each introduced auxiliary factor is defined by a low-dimensional function for which the convex and concave envelopes are known analytically, allowing for an efficient computation of the maximum distances and/or angles used in the sandwich procedure. Moreover, all computations are conducted in the low-dimensional domain space of the intermediate factor, further lowering the computational effort. However, the resulting linear program suffers from an increased dimensionality and a large number of constraints because of the auxiliary variables and equality constraints added.
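As a small illustration (our generic example, not taken from the cited references), consider \(f(x_1,x_2)=\exp (x_1)\,x_2 + x_1^2\). The AVM introduces one auxiliary variable and one equality constraint per nonlinear intermediate factor,

$$\begin{aligned} z_1 = \exp (x_1), \qquad z_2 = z_1\, x_2, \qquad z_3 = x_1^2, \end{aligned}$$

so that the objective becomes the linear expression \(z_2 + z_3\), and each equality constraint is subsequently relaxed using the known envelopes of its univariate or bilinear right-hand side.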

The method of McCormick [26], extended to multivariate compositions in [37], is a different method for the construction of valid convex and concave relaxations of factorable functions. In contrast to the AVM, the McCormick technique does not introduce any auxiliary variables when constructing convex and concave relaxations of a given function, thus always preserving the original dimension of the underlying function. Since the resulting McCormick relaxations may be nonsmooth, we use subgradient propagation [28] in order to construct valid affine under- and overestimators for the convex and concave McCormick relaxations. It is also possible to use the recently developed differentiable McCormick relaxations [23] in order to replace the computation of subgradients by the computation of derivatives, but they are not as tight as the nonsmooth McCormick relaxations and are more costly to compute. Furthermore, Cao et al. [6] have shown that affine subtangents used for the linearization of the convex relaxation of the lower bounding problem benefit from the favorable convergence order of McCormick relaxations. Thus, we stick to the nonsmooth McCormick relaxations and their subgradients throughout this article. As currently no algorithms exist for the selection of suitable linearization points for the McCormick relaxations, the goal of this work lies in the development of algorithms for the computation of a set of linearization points. We briefly discuss the idea of direct propagation and combination of all subgradients in each intermediate factor. However, this approach results in the computation of a combinatorial number of affine functions and thus we focus on methods for computing linearization points in the original variable space either iteratively or a priori. All methods developed for the linearization of multivariate McCormick relaxations are implemented on the basis of the open-source MC++ library [7] and are available in the open-source deterministic global optimization solver MAiNGO [4].

Tawarmalani and Sahinidis [36] showed that for a specific set of linearization points, the linear outer approximation obtained via the AVM can be tighter than the linear outer approximation computed with McCormick relaxations. Later, Tsoukalas and Mitsos [37] demonstrated that the McCormick method extended to multivariate outer functions provides the same convex and concave relaxations if no reoccurring intermediate factors of a given function are recognized by the AVM. However, if the AVM recognizes reoccurring intermediate factors, it provides tighter convex relaxations than the multivariate McCormick approach. This problem can, at least in principle, be solved by adding a sufficient number of auxiliary optimization variables when applying the McCormick method, thus providing relaxations as tight as those of the AVM. The authors thus suggest a combination of the AVM and the multivariate McCormick technique (Section 4 in [37]), but do not provide any computational insights or results. Modern computational representations of factorable functions through directed acyclic graphs (DAGs) allow for the recognition of factors occurring multiple times in a given optimization problem [7, 32]. This allows for the development of a method that introduces common factors as auxiliary optimization variables together with the corresponding auxiliary constraints. Herein, we first briefly summarize the theory discussed in [37], showing that the AVM and the McCormick relaxations provide equally tight relaxations if common factors occurring at least twice are recognized. Then, we present a heuristic for the addition of intermediate factors occurring most often in a given DAG as auxiliary optimization variables in the lower bounding procedure and show that this hybridization of the AVM and the McCormick propagation technique results in large computational speed-ups. We add only a limited number of auxiliary optimization variables in order to get the advantage of tighter relaxations when common factors are recognized and to keep the benefit of lower dimensional optimization problems when using the McCormick method. Last, we apply the linearization algorithms developed herein to the hybridization of the AVM and multivariate McCormick relaxations to compensate for the increased dimensionality and resulting complexity of the lower bounding problems. Again, the implemented heuristic is available in the open-source solver MAiNGO [4].

The remainder of this article is structured as follows. Section 2 introduces the mathematical notation used throughout the manuscript. In Sect. 3, we present alternative algorithms for the computation of linearization points for McCormick relaxations a priori, iteratively, and as a combination of both, supported by numerical results. Section 4 provides results on the application of the presented linearization algorithms to the hybridization of the McCormick technique and the auxiliary variable method. We summarize and discuss future work in Sect. 5.

2 Preliminaries

If not stated otherwise, we consider continuous functions \(f:Z\rightarrow \mathbb {R}\) with \(Z \in \mathbb {I}\mathbb {R}^n\), where \(\mathbb {I}\mathbb {R}\) denotes the set of closed bounded intervals in \(\mathbb {R}\). \(Z\in \mathbb {I}\mathbb {R}^n\), also called box, is defined as \(Z\equiv [\mathbf {z}^L,\mathbf {z}^U]=[z^L_1,z^U_1]\times \dots \times [z^L_n,z^U_n]\) with \(\mathbf {z}^L,\mathbf {z}^U \in \mathbb {R}^n\) where the superscripts L and U always denote a lower and upper bound, respectively. We denote the range of f over Z by \(f(Z)\in \mathbb {I}\mathbb {R}\).

We call a convex function \(f^{cv}:Z\rightarrow \mathbb {R}\) a convex relaxation (or convex underestimator) of f on Z if \(f^{cv}(\mathbf {z})\le f(\mathbf {z})\) for every \(\mathbf {z}\in Z\). Similarly, we call a concave function \(f^{cc}:Z\rightarrow \mathbb {R}\) a concave relaxation (or concave overestimator) of f on Z if \(f^{cc}(\mathbf {z})\ge f(\mathbf {z})\) for every \(\mathbf {z}\in Z\). We call the tightest convex and concave relaxations of f the convex and concave envelopes \(f^{cv}_e, f^{cc}_e\) of f on Z, respectively, i.e., it holds \(f^{cv}(\mathbf {z})\le f^{cv}_e(\mathbf {z})\le f(\mathbf {z})\) and \(f(\mathbf {z})\le f^{cc}_e(\mathbf {z})\le f^{cc}(\mathbf {z})\) for all \(\mathbf {z}\in Z\) and all convex relaxations \(f^{cv}\) and concave relaxations \(f^{cc}\) of f on Z, respectively.
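For instance (our illustration), for \(f(z)=z^2\) on \(Z=[-1,2]\), the function is convex, so \(f^{cv}_e=f\), while the concave envelope is the secant through the interval endpoints,

$$\begin{aligned} f^{cc}_e(z) = f(z^L) + \frac{f(z^U)-f(z^L)}{z^U-z^L}\,\left( z-z^L\right) = z + 2 \ge z^2, \qquad z\in [-1,2]. \end{aligned}$$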

Definition 1

(Subgradients and linearization points). For a convex function \(f^{cv}:Z\rightarrow \mathbb {R}\) and a concave function \(f^{cc}:Z\rightarrow \mathbb {R}\), we call \(\mathbf {s}^{cv}(\bar{\mathbf {z}}),\mathbf {s}^{cc}(\bar{\mathbf {z}})\in \mathbb {R}^n\) a convex and a concave subgradient of \(f^{cv}\) and \(f^{cc}\) at \(\bar{\mathbf {z}}\in Z\), respectively, if

$$\begin{aligned} f^{cv}(\mathbf {z})&\ge f^{cv}(\bar{\mathbf {z}}) + (\mathbf {s}^{cv}(\bar{\mathbf {z}}))^T(\mathbf {z}-\bar{\mathbf {z}}),&~~\forall \mathbf {z}\in Z, \end{aligned}$$
(A1)
$$\begin{aligned} f^{cc}(\mathbf {z})&\le f^{cc}(\bar{\mathbf {z}}) + (\mathbf {s}^{cc}(\bar{\mathbf {z}}))^T(\mathbf {z}-\bar{\mathbf {z}}),&~~\forall \mathbf {z}\in Z. \end{aligned}$$
(A2)

We denote the affine functions on the right-hand side of inequalities (A1), (A2) constructed with the convex and concave subgradient \(\mathbf {s}^{cv}(\bar{\mathbf {z}}),\mathbf {s}^{cc}(\bar{\mathbf {z}})\) by \(f^{cv,sub}(\bar{\mathbf {z}},\mathbf {z})\) and \(f^{cc,sub}(\bar{\mathbf {z}},\mathbf {z})\), respectively.

Note that \(f^{cv,sub}\) and \(f^{cc,sub}\) are valid under- and overestimators of f on Z, respectively. We call the point \(\bar{\mathbf {z}}\) in Definition 1 a linearization point of f.
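As a brief example (ours), consider \(f^{cv}(z)=|z|\) on \(Z=[-1,1]\): at the nonsmooth point \(\bar{z}=0\), every \(s^{cv}(0)\in [-1,1]\) is a valid convex subgradient, and each choice yields a different affine underestimator

$$\begin{aligned} f^{cv,sub}(0,z) = s^{cv}(0)\, z \le |z|, \qquad \forall z \in Z, \end{aligned}$$

whereas at a point where \(f^{cv}\) is differentiable, the subgradient is unique and equals the gradient.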

2.1 McCormick relaxations and subgradient propagation

All relaxations considered herein are constructed with the use of the McCormick propagation rules originally developed by McCormick (Section 3 in [26]) and extended to multivariate compositions of functions by Tsoukalas and Mitsos (Theorem 2 in [37]). The corresponding subgradients are constructed as described by Mitsos et al. (Sections 2.4-3.1 in [28]) and Tsoukalas and Mitsos (Theorem 4 in [37]). We use the open-source library MC++ [7] for the automatic computation of convex and concave McCormick relaxations and their subgradients at a single linearization point. Moreover, we extend the library to handle a set of linearization points directly by adapting the original implementation of McCormick propagation to vector-valued calculations. This extension is available in the MC++ library used in the open-source deterministic global solver MAiNGO [4].
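To make the propagation of relaxations and subgradients concrete, the following self-contained sketch (ours; it does not use the MC++ interface) evaluates the McCormick convex underestimator of a single bilinear term \(xy\) on a box, together with one valid subgradient at a reference point. MC++ performs this kind of computation recursively for every factor of a factorable function.

```cpp
#include <array>
#include <iostream>

// Convex McCormick underestimator of f(x,y) = x*y on [xL,xU] x [yL,yU]:
//   f_cv(x,y) = max( yL*x + xL*y - xL*yL,  yU*x + xU*y - xU*yU ).
// A valid subgradient at (xbar,ybar) is the gradient of an active affine piece.
struct LinEst { double value; std::array<double,2> subgrad; };

LinEst bilinearCv(double xL, double xU, double yL, double yU,
                  double xbar, double ybar) {
  const double a = yL*xbar + xL*ybar - xL*yL;  // first affine underestimator
  const double b = yU*xbar + xU*ybar - xU*yU;  // second affine underestimator
  if (a >= b) return { a, { yL, xL } };        // piece 1 active: gradient (yL, xL)
  return { b, { yU, xU } };                    // piece 2 active: gradient (yU, xU)
}

int main() {
  // Linearization at the mid point of the box [-1,2] x [0,3].
  const LinEst e = bilinearCv(-1.0, 2.0, 0.0, 3.0, 0.5, 1.5);
  // e.value + e.subgrad^T ((x,y)-(0.5,1.5)) underestimates x*y on the whole box.
  std::cout << "cv = " << e.value << ", subgradient = ("
            << e.subgrad[0] << ", " << e.subgrad[1] << ")\n";
  return 0;
}
```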

3 Computation of linearization points for McCormick relaxations

Before describing the methods and algorithms used in this work, we briefly discuss the computation of linearization points for the AVM, its relationship with McCormick relaxations and the associated difficulties. We then begin the presentation of methods with an adaptation of the iterative Kelley’s cutting plane algorithm for convex optimization [21]. Subsequently, we develop a method for a priori computation of linearization points (before calculating relaxations). As a last point of this section, we present numerical results comparing the described methods and the combination of the two methods to a computationally cheap single-point approach and random selection.

3.1 Auxiliary variable method and McCormick relaxations

For the sake of demonstration, let us consider a simple abstract box-constrained minimization of \(f:\mathbb {I}\mathbb {R}^n\ni X\rightarrow \mathbb {R}\)

$$\begin{aligned} \min _{\mathbf {x}}&~~~f(\mathbf {x}) \\ \text {s.t.}&~~~ \mathbf {x}\in X \in \mathbb {I}\mathbb {R}^n. \end{aligned}$$

When applying the AVM to a given optimization problem, each intermediate factor is replaced by an auxiliary variable together with an appropriate equality constraint. For now, let us assume that the AVM introduces exactly one auxiliary variable \(z_1\) for the intermediate factor g with range \(g(X)=[g^L,g^U]\) providing the equivalent optimization problem

$$\begin{aligned} \begin{aligned} \min _{\mathbf {x},z_1}&~~~\tilde{f}(z_1) \\ \text {s.t.}&~~~z_1 = g(\mathbf {x}) \\&~~~ \mathbf {x}\in X \in \mathbb {I}\mathbb {R}^n\\&~~~ z_1 \in [g^L,g^U], \end{aligned} \end{aligned}$$
(AVM)

with \(\tilde{f}:[g^L,g^U]\rightarrow \mathbb {R}, g:X\rightarrow [g^L,g^U]\) and \(\tilde{f}(g(\mathbf {x}))=f(\mathbf {x})\) for all \(\mathbf {x}\in X\). Then, convex and concave relaxations \(\tilde{f}^{cv},g^{cv},g^{cc}\) of \(\tilde{f}\) and g over \([g^L,g^U]\) and X, respectively, are constructed. Note that in general the range interval of g overestimates the true image of X under g, but to keep this example simple, we assume that the exact image is known. Finally, a set of linearization points \(\mathscr {L}_{AVM}\) with \(|\mathscr {L}_{AVM}| = N_\mathscr {L}\) is computed, e.g., with the sandwich algorithm and one of its rules (Section 4.2 in [35]), and all convex and concave relaxations are replaced by their affine under- and overestimators computed at points in \(\mathscr {L}_{AVM}\). For the above optimization problem (AVM), when assuming that every function is linearized the same number of times, we end up with \(N_\mathscr {L}\) affine inequalities: \(\frac{N_\mathscr {L}}{3}\) for the convex relaxation of the objective \(\tilde{f}\), \(\frac{N_\mathscr {L}}{3}\) for the convex relaxation of the auxiliary function g, and \(\frac{N_\mathscr {L}}{3}\) for the concave relaxation of g. The linearized lower bounding problem is given as

$$\begin{aligned} \begin{aligned} \min _{\mathbf {x},z_1,\eta }&~~~\eta \\ \text {s.t.}&~~~\eta \ge \tilde{f}^{cv,sub}(\bar{z}_{1,k},z_1),~~~k=1,\dots ,\tfrac{N_\mathscr {L}}{3}, \\&~~~z_1 \ge g^{cv,sub}(\bar{\mathbf {x}}_k,\mathbf {x}),~~~k=1,\dots ,\tfrac{N_\mathscr {L}}{3}, \\&~~~z_1 \le g^{cc,sub}(\bar{\mathbf {x}}_k,\mathbf {x}),~~~k=1,\dots ,\tfrac{N_\mathscr {L}}{3}, \\&~~~\mathbf {x}\in X,~~z_1 \in [g^L,g^U], \end{aligned} \end{aligned}$$
(\(\hbox {AVM}_{\text {LP}}\))

where \(\bar{z}_{1,k}\in [g^L,g^U]\) and \(\bar{\mathbf {x}}_k\in X\) denote the linearization points of \(\mathscr {L}_{AVM}\) in the respective domains.

It is desirable to achieve the same tightness of the lower bounding linear optimization problem when using McCormick relaxations. Since no auxiliary variables are introduced when McCormick's theorems are applied, this amounts to computing the projection of (\(\hbox {AVM}_{\text {LP}}\)) to the original variable space X. This is done by combining each of the affine under- or overestimators of function g with each of the affine relaxations of f. The choice of whether the affine under- or overestimators of g have to be used depends on the monotonicity of the affine relaxations of f (Corollary 3 in [37]). This combination results in a linear optimization problem with \((N_\mathscr {L})^{2}\) linear inequalities in the original variables \(\mathbf {x}\) and only one (free) auxiliary variable \(\eta \). Generalizing, if \(F>1\) auxiliary variables are introduced in the AVM, we end up with \(\mathscr {O}\left( (N_\mathscr {L})^{(F+1)}\right) \) affine inequalities for the McCormick method. This estimate of the number of inequalities is not exact, since some of the resulting affine inequalities may be redundant. Thus, propagating all affine inequalities with the McCormick method is not a promising idea, as the computation of all affine functions alone would require exponential computational time.
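To see where the combinations come from, consider a single cut \(\eta \ge a_k + b_k z_1\) obtained from linearizing \(\tilde{f}^{cv}\), together with the affine bounds \(g^{cv,sub}(\bar{\mathbf {x}}_j,\mathbf {x}) \le z_1 \le g^{cc,sub}(\bar{\mathbf {x}}_j,\mathbf {x})\). Eliminating \(z_1\) (Fourier-Motzkin) yields, depending on the sign of the slope \(b_k\),

$$\begin{aligned} b_k \ge 0:&\quad \eta \ge a_k + b_k\, g^{cv,sub}(\bar{\mathbf {x}}_j,\mathbf {x}), \\ b_k < 0:&\quad \eta \ge a_k + b_k\, g^{cc,sub}(\bar{\mathbf {x}}_j,\mathbf {x}), \end{aligned}$$

one inequality for every pair of an objective cut and an affine estimator of g, which is exactly how the quadratic number of inequalities arises.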

The AVM thus pays for the lower number of affine inequalities with an increase in the dimension of the final linear problem (\(\hbox {AVM}_{\text {LP}}\)). This increase in the number of variables, however, seems acceptable. The resulting LP is typically solved with (variants of) the simplex method [10]. In practice, such methods perform very well for large problems and do not come anywhere near the worst-case exponential complexity. When propagating all combinations with the McCormick method, in contrast, we have to compute all facets explicitly and thus also all vertices of the underlying polytope. Unlike the simplex algorithm, which does not need to enumerate all vertices, this propagation inevitably results in exponential computational runtime.

In order to achieve a similar tightness without propagating all combinations of the affine functions and thus avoiding the combinatorial complexity, the following possibility can be considered. We choose only a limited number of what we expect to be promising linearizations in each factor and propagate only a small part of all possible combinations. The difficulty emerging with this approach is the decision making in a given factor \(f_j\) of the original function f over its domain \(X_{f_j}\), i.e., telling what a promising linearization point is, since only local information about the factor \(f_j\) on its domain \(X_{f_j}\), and not about the final function f over \(X_f\), is available. This local information is unsatisfactory because it is not possible to know which affine under- and overestimator of \(f_j\) is best used for further propagation, since f may have many nonlinear factors following \(f_j\). Still, several heuristics to overcome this issue can be applied to the convex and concave relaxations. First, we can choose the affine underestimator whose minimum value is closest to the minimum value of the convex relaxation of the factor \(f_j\). As an alternative, we can evaluate all propagated affine underestimators at the minimum point of the factor \(f_j\) over \(X_{f_j}\) and keep the tightest underestimator for further propagation. Finally, we can choose the affine underestimators within each intermediate factor randomly. Unfortunately, in preliminary experiments none of the above methods provides results which could compete with the algorithms and ideas described throughout the rest of this manuscript. Thus, we focus the scope of this paper on the iterative and the direct (a priori) computation of the set of linearization points \(\mathscr {L}_{MC}\), providing the linear lower bounding problem in the space of original variables

$$\begin{aligned} \begin{aligned} \min _{\mathbf {x},\eta }&~~~\eta \\ \text {s.t.}&~~~\eta \ge f^{cv,sub}(\bar{\mathbf {x}},\mathbf {x}),~~~\forall \bar{\mathbf {x}}\in \mathscr {L}_{MC}, \\&~~~\mathbf {x}\in X. \end{aligned} \end{aligned}$$
(\(\hbox {MC}_{\text {LP}}\))

3.2 Adaptation of Kelley’s algorithm

Cutting plane or localization methods are commonly used in continuous convex optimization [16] and are often also called bundle methods. Probably the best-known methods are the algorithm by Kelley [21] and the method by Cheney and Goldstein [8]. These algorithms may be inefficient in the sense that they may require the solution of multiple small linear programs until they stabilize, but each iteration is computationally very cheap. To improve the efficiency, extensions of the methods by Kelley and by Cheney and Goldstein have been developed during the last decade. The center of gravity method proposed by Newman [30] and Levin [24] computes the center of gravity of the polyhedron approximating the convex set of interest. Although it is proven to decrease the volume of the polyhedron by a constant fraction in each iteration, it is computationally not practical. The maximum volume inscribed ellipsoid method [34] and the analytic center method [13] both require the solution of subsequent convex optimization problems in order to obtain a set of linearization points. As we have to compute possibly multiple linearization points in each node of the B&B algorithm, we want to avoid this additional computational overhead. The Chebyshev center cutting-plane method [11] requires the solution of a linear program to compute a linearization point (Chapter 8 in [5]) but is sensitive to scaling and coordinate transformations. Since we have to compute multiple linearization points in each B&B node, the two algorithms by Kelley and by Cheney and Goldstein seem acceptable in terms of computational time and resulting tightness of the lower bounding problem. Next, we present the method we use in this work and stick to the name of Kelley's algorithm. We adapt the original algorithm to directly use McCormick relaxations and the corresponding subgradients instead of computing tangents of (smooth) convex functions.

The idea of bundle methods is to approximate a given convex function by a set or a so-called bundle of affine underestimators. The bundle is constructed by iteratively generating and adding new affine functions. The pseudo-code of the procedure can be found in Algorithm 1. In particular, we start with an initial set of linearization points \(\mathscr {L}\), e.g., the mid-point of the underlying interval domain (Line 4). We construct the lower bounding linear optimization problem L (\(\hbox {MC}_{\text {LP}}\)) with the use of McCormick relaxations by using subgradient propagation [28] with reference points defined in \(\mathscr {L}\) (Line 5). We solve L to obtain the optimal solution point \(\mathbf {x}^*\) and a corresponding solution value \(f^*\) (Line 9). We add the solution point \(\mathbf {x}^*\) to the set of linearization points \(\mathscr {L}\). We then extend the linear problem L with the inequalities obtained by computing subgradients of the McCormick relaxations of every function participating in the original problem at the new point \(\mathbf {x}^*\) (Lines 16-17). We then re-solve L to get a new solution point \(\mathbf {x}^*_{new}\) and its corresponding solution value \(f^*_{new}\) and repeat the procedure until either a maximum number \(N_{K}\) of linearizations has been reached (Line 8) or the difference between the solution values \(f^*\) and \(f^*_{new}\) falls below a given threshold (Line 13). The implementation of the described method is available in the open-source solver MAiNGO [4].

Algorithm 1 Pseudo-code of the adapted Kelley's algorithm for the iterative computation of linearization points
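A compact sketch of the loop is given below (ours; the LP construction and the McCormick/subgradient evaluation are hidden behind hypothetical interfaces, and the actual Algorithm 1 implemented in MAiNGO differs in its details):

```cpp
#include <cmath>
#include <functional>
#include <vector>

// One affine cut  eta >= c + s^T (x - xref)  obtained from a subgradient.
struct Cut { double c; std::vector<double> s; std::vector<double> xref; };

struct LpSolution { std::vector<double> x; double objective; };

// Hypothetical interfaces (assumptions, not MAiNGO's actual API):
//  - linearize(x) returns the cuts of all participating functions at x,
//    computed via McCormick relaxations and subgradient propagation.
//  - solveLp(cuts) solves the lower bounding LP (MC_LP) built from the cuts.
using Linearize = std::function<std::vector<Cut>(const std::vector<double>&)>;
using SolveLp   = std::function<LpSolution(const std::vector<Cut>&)>;

LpSolution adaptedKelley(const std::vector<double>& xMid, const Linearize& linearize,
                         const SolveLp& solveLp, int maxPoints,
                         double absTol, double relTol) {
  std::vector<Cut> cuts = linearize(xMid);   // initial linearization point
  LpSolution sol = solveLp(cuts);
  for (int k = 1; k < maxPoints; ++k) {
    // Add cuts of every function at the current LP minimizer and re-solve.
    const std::vector<Cut> newCuts = linearize(sol.x);
    cuts.insert(cuts.end(), newCuts.begin(), newCuts.end());
    const LpSolution next = solveLp(cuts);
    const double improvement = next.objective - sol.objective;  // increase of the lower bound
    sol = next;
    if (improvement < absTol || improvement < relTol * std::fabs(sol.objective))
      break;  // improvement below the absolute or relative threshold
  }
  return sol;  // lower bound and corresponding LP minimizer
}
```

The loop terminates either after maxPoints linearizations or once the improvement of the lower bound drops below the absolute and relative thresholds, mirroring the stopping criteria described above.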

3.3 Computation of linearization points as vertices of an n-simplex

We have shown how to construct lower bounding linear problems iteratively with a simple adaptation of well-known bundle methods in the previous section. In this section we discuss how to compute a promising set of linearization points a priori, i.e., before the first lower bounding linear optimization problem is solved.

First, we need to clarify what we denote as a promising set of linearization points. A promising set of linearization points consists of points that are well-distributed over the n-dimensional interval domain of the optimization variables and at the same time is not too large. Intuitively, since the nonlinear n-dimensional function \(f^{cv}:X\in \mathbb {I}\mathbb {R}^n\rightarrow \mathbb {R}\), which we want to approximate through affine estimators, is convex, it should be enough to approximate its shape with the use of \(n+1\) affine underestimators. This intuition comes from the fact that a set \(\mathscr {L}\) of \(n+1\) (affinely independent) points is enough to encapsulate an interior point \(\mathbf {x}^* \in X\) within the convex hull of \(\mathscr {L}\) in the domain space X. Moreover, since the function we aim to underestimate is convex, a set of points which is well-distributed over the domain appears more favorable for the computation of the lower bound than a possibly clustered set of linearization points. In practice, we cannot expect any of these linearization points to coincide with the minimum of \(f^{cv}\), but we hope that the affine underestimators defined by a set of linearization points which is well-distributed over the domain of the optimization variables approximate \(f^{cv}\) well enough. Additionally, the computation of only a limited number of affine underestimators does not increase the computational cost too much, allowing for a trade-off between the accuracy of the affine approximation and computational time.

In order to obtain a well-distributed set of linearization points, it is also possible to use well-known space covering algorithms such as Monte-Carlo methods [15] or quasi-random sequences such as the Sobol or the Halton sequence [9, 19]. These methods cover the given space well for a large number of samples. However, we desire only a very limited number of linearization points when constructing lower bounding problems, making these methods unsuitable for our procedure. Additionally, Monte-Carlo methods and quasi-random sequences obviously possess a randomness factor, whereas the methods presented herein do not.

Thus, we propose a different approach based on the computation of all vertices of an n-simplex. An n-simplex is defined as an n-dimensional polytope which is the convex hull of its \(n+1\) vertices. The idea is to compute all \(n+1\) vertices of an n-simplex, with all vertices lying on an n-dimensional ball centered at \(\mathbf {0}\) in \([-1,1]^n\) with radius \(r \in (0,1]\). These points can then be rescaled to the original interval domains of the optimization variables and used for the computation of McCormick relaxations and subgradients. The \(n+1\) points are well-distributed in the sense that each vertex has the same distance to every other vertex in the \([-1,1]^n\) box, namely \(r\sqrt{2(n+1)/n}\) (e.g., \(r\sqrt{3}\) for \(n=2\)). Since subgradients of McCormick relaxations are steeper at the boundary of the domain if the convex relaxation attains its minimum in the interior of the domain, and thus tend to contribute less to the tightness of the resulting linear program, it is favorable to choose \(r<1\). It is also reasonable to choose the radius not too close to 0 in order to avoid a clustering of the linearization set, which would result in many similar subgradients of the affine underestimators and finally in redundant constraints. Thus, we recommend choosing \(r\in (0.7,0.9)\). The pseudo-code of the described procedure can be found in Algorithm 2. In the pseudo-code, the index (ij) of a vertex \(\mathbf {v}\) denotes the ith coordinate of the jth vertex in the set \(\mathscr {V}\). Throughout the algorithm we make use of the fact that the simplex is centered at the origin and all its edges have the same length, meaning that the sum \(\sum _{j=1}^{n+1} v_{i,j}\) has to equal 0 for a fixed i. After the computation of the \(n+1\) vertices of the n-simplex, it is advisable to rotate the polytope by a given angle \(\gamma \) with the help of a standard rotation matrix to avoid the computation of points having multiple equal coordinates. It is not recommended to rotate with respect to each axis due to the high computational effort in high dimensions, e.g., \(n>100\), but rather with respect to every kth axis to avoid a large number of calculations. The case \(n=1\) is a special case, where we always use 3 equally spaced points in the interior of the 1-dimensional interval \(X=[x^L,x^U]\), being

$$\begin{aligned} x_1 = \frac{1}{3}\cdot (x^U-x^L)+x^L,\,x_2 = \frac{1}{2}\cdot (x^U-x^L)+x^L,\text { and } x_3 = \frac{2}{3}\cdot (x^U-x^L)+x^L. \end{aligned}$$
Algorithm 2 Pseudo-code for the computation of the vertices of the n-simplex
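One standard way to generate such vertices (our sketch of the geometric construction; not necessarily identical to Algorithm 2, and the rotation step is omitted) is to build \(n+1\) unit vectors with pairwise inner products \(-1/n\) and to scale them by r:

```cpp
#include <cmath>
#include <vector>

// Vertices of a regular n-simplex inscribed in the ball of radius r centered
// at the origin of [-1,1]^n: n+1 unit vectors with pairwise dot product -1/n,
// scaled by r. Returns vertices[j][i] = i-th coordinate of the j-th vertex.
std::vector<std::vector<double>> simplexVertices(int n, double r) {
  std::vector<std::vector<double>> v(n + 1, std::vector<double>(n, 0.0));
  for (int i = 0; i < n; ++i) {
    // Fix the i-th coordinate of vertex i so that it has unit norm ...
    double sq = 0.0;
    for (int k = 0; k < i; ++k) sq += v[i][k] * v[i][k];
    v[i][i] = std::sqrt(1.0 - sq);
    // ... and the i-th coordinate of all later vertices so that their dot
    // product with vertex i equals -1/n (equal pairwise distances).
    for (int j = i + 1; j <= n; ++j) {
      double dot = 0.0;
      for (int k = 0; k < i; ++k) dot += v[i][k] * v[j][k];
      v[j][i] = (-1.0 / n - dot) / v[i][i];
    }
  }
  for (auto& vertex : v)
    for (double& c : vertex) c *= r;   // scale from the unit sphere to radius r
  return v;
}

// Affine rescaling of a vertex from [-1,1]^n to the box [xL, xU].
std::vector<double> rescale(const std::vector<double>& vertex,
                            const std::vector<double>& xL,
                            const std::vector<double>& xU) {
  std::vector<double> x(vertex.size());
  for (std::size_t i = 0; i < vertex.size(); ++i)
    x[i] = xL[i] + 0.5 * (vertex[i] + 1.0) * (xU[i] - xL[i]);
  return x;
}
```

For \(n=2\) and \(r=0.725\), this reproduces the unrotated points \(\mathbf {v}_1, \mathbf {v}_2, \mathbf {v}_3\) of the example in Sect. 3.4.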

We also observed that using all \(n+1\) points is not optimal if each function in the optimization problem depends only on a subset of the optimization variables. If the maximum number of variables participating nonlinearly in any function of the optimization problem is \(p<n\), we filter the points further and choose only p out of the \(n+1\) vertices. Note that when a p-simplex is computed instead, we obtain \(p+1\) p-dimensional points. Just because there are at most p variables participating nonlinearly in some function \(f_i\), it is not excluded that subsets of these variables and some additional variables participate nonlinearly in some other function \(f_j\) with \(i\ne j\). It is possible to rescale these p-dimensional vertices to the n variables, but then the same simplex coordinates would be used for multiple variables, which could lead to very similar and often redundant affine underestimators. Moreover, since the computation of the n-simplex is performed only once in the global optimization procedure, the cost of computing the larger simplex is negligible. We choose only p points instead of \(p+1\), and one of these points is always the mid point of the domain, since this choice has shown the best results in the numerical case studies performed in this work. Since the mid point is always used, we choose the remaining \(p-1\) points as those with the highest absolute improvement in interval tightness when applying the subgradient interval heuristic presented in our previous work [29] for the computation of McCormick relaxations of the given optimization problem.

3.4 Illustrative example

Fig. 1

Example of the n-simplex algorithm described in Sect. 3.3: 2-simplex vertices on the 2-dimensional ball with radius r. The original points computed by Algorithm 2 are depicted as blue circles. The original points rotated by an angle of \(30^\circ \) are depicted as red crosses

Fig. 2

Figures showing the affine underestimators of the three-hump camel function over \([-0.75,1]\times [-1,0.25]\) at linearization points obtained with the adaptation of Kelley’s algorithm (Figure (a)) described in Sect. 3.2 and the n-simplex algorithm (Figure (b)) described in Sect. 3.3. a The linearizations are constructed starting at the middle point \(\mathbf {m}\) and conducting one iteration \(it_1\) of the adapted Kelley’s algorithm. b The linearizations are constructed at the middle point \(\mathbf {m}\) and the point \(\mathbf {v}_1\) obtained with the n-simplex algorithm

The following example illustrates the application of Algorithm 2 in practice. Consider the three-hump camel function

$$\begin{aligned} f:[-0.75,1]\times [-1,0.25] \rightarrow \mathbb {R}, (x_1,x_2) \mapsto 2\cdot x_1^2 - 1.05 \cdot x_1 ^4 + \frac{x_1^6}{6} + x_1\cdot x_2 + x_2^2. \end{aligned}$$

First, we compute the vertices of the 2-simplex lying on the 2-dimensional ball \(\mathscr {B}\) with radius \(r=0.725\). The points of the 2-simplex on \(\mathscr {B}\) are

$$\begin{aligned} \mathbf {v}_1&= (0.725,0), \\ \mathbf {v}_2&= \left( -0.3625, \sqrt{r^2 - (-0.3625)^2}\right) , \\ \mathbf {v}_3&= \left( -0.3625, -\sqrt{r^2 - (-0.3625)^2}\right) . \end{aligned}$$

Then, we rotate the points by \(30^\circ \) counterclockwise and finally obtain

$$\begin{aligned} \mathbf {v}_1&= \left( \sqrt{r^2 - (-0.3625)^2},-0.3625\right) , \\ \mathbf {v}_2&= (0,0.725), \\ \mathbf {v}_3&= \left( -\sqrt{r^2 - (-0.3625)^2},-0.3625\right) . \end{aligned}$$

Note that the rotated points being this similar to the original points is particular to the 2-dimensional case. Figure 1 shows the original and the rotated points of the 2-simplex computed on the ball with radius \(r=0.725\).

Next, we scale the points from \([-1,1]^2\) to the domain \([-0.75,1]\times [-1,0.25]\) of f via \(x_i = x_i^L + \frac{v_i+1}{2}\,(x_i^U-x_i^L)\) and get

$$\begin{aligned} \mathbf {v}_1&\approx (0.674,-0.601), \\ \mathbf {v}_2&= (0.125,0.078125), \\ \mathbf {v}_3&\approx (-0.424,-0.601). \end{aligned}$$

As final linearization points, we choose the mid point \(\mathbf {m}=(0.125,-0.375)\) and the point \(\mathbf {v}_i\) providing the most improvement when using the subgradient interval heuristic presented in our previous work [29], given by \(\mathbf {v}_1\) in this example. Finally, we compute the linearization of the convex McCormick relaxation of f at \(\mathscr {L} = \{\mathbf {m},\mathbf {v}_1\}\) (cf. Fig. 2b). Figure 2a shows the resulting affine underestimators of f when using the adaptation of Kelley’s algorithm described in the previous subsection. The difference between the linearizations with respect to the resulting lower bound of the respective linear program is only marginal in this example, although different linearization points are used.
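The numbers of this example can be reproduced with a few lines (ours; the sign convention of the rotation matrix is chosen such that it yields the rotated points reported above):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const double pi = 3.14159265358979323846;
  const double r = 0.725, gamma = 30.0 * pi / 180.0;           // radius and rotation angle
  const double xL[2] = { -0.75, -1.0 }, xU[2] = { 1.0, 0.25 }; // domain of f
  // Unrotated 2-simplex vertices on the ball of radius r.
  const double h = std::sqrt(r * r - 0.3625 * 0.3625);
  double v[3][2] = { { r, 0.0 }, { -0.3625, h }, { -0.3625, -h } };
  for (auto& p : v) {
    // Rotate by gamma (sign convention chosen to match the points of Sect. 3.4) ...
    const double x =  std::cos(gamma) * p[0] + std::sin(gamma) * p[1];
    const double y = -std::sin(gamma) * p[0] + std::cos(gamma) * p[1];
    // ... and rescale from [-1,1]^2 to the domain of f.
    p[0] = xL[0] + 0.5 * (x + 1.0) * (xU[0] - xL[0]);
    p[1] = xL[1] + 0.5 * (y + 1.0) * (xU[1] - xL[1]);
    std::printf("(%.3f, %.3f)\n", p[0], p[1]);
  }
  return 0;
}
```

The program prints approximately (0.674, -0.601), (0.125, 0.078) and (-0.424, -0.601), matching the scaled points above.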

3.5 Numerical results

We compare the presented linearization strategies by the solution of various problems with the deterministic global optimization solver MAiNGO [4] using CPLEX v12.8 [18] for the solution of linear lower bounding problems, IPOPT v3.12.0 [39] for local optimization in the pre-processing step, and the SLSQP solver found in the NLOPT package v2.5.0 [20] for local optimization within the B&B algorithm. For range reduction, we use optimization-based bound tightening with the trivial filtering algorithm with filtering value 0.001 [12], duality-based bound tightening using dual values obtained from CPLEX [31], and basic constraint propagation. All calculations are conducted on an Intel\(^{\textregistered }\) Core™ i5-3570 CPU with 3.4 GHz running Fedora Linux 30.

We solve 9 problems of varying sizes with up to 57 variables chosen from the MINLPLIB2 library; 4 case studies of a combined-cycle power plant presented in Sections 5.2 and 5.2 of [1], where we either maximize the net power output or minimize the levelized cost of electricity; 2 case studies of the Williams-Otto process with 4 and 5 degrees of freedom, respectively [2]; and 3 case studies of chemical processes presented in [3], where we consider the example of a flash modeled with the NRTL mixture model as well as the optimization of a methanol production process from \(\hbox {H}_2\) and \(\hbox {CO}_2\), once with the original degrees of freedom (case c) and once additionally using the \(\hbox {CO}_2\) inlet flowrate as a degree of freedom, making the case study somewhat more complex. All problem statistics can be found in Table 6 in Appendix A.

For the 9 problems from the MINLPLIB2 library, we set the absolute and relative optimality tolerances to \(\epsilon =10^{-4}\); for the process engineering case studies, we set both optimality tolerances to \(\epsilon =10^{-2}\). We allow for a maximum computational time of 8 hours. We consider the computational performance of the B&B algorithm for five cases:

– Mid, where every convex and/or concave relaxation is linearized at the middle point of the optimization variable domain. This is the setting used in all of our previous work.

– Kelley, where every convex and/or concave relaxation is linearized iteratively using the adaptation of Kelley’s algorithm described in Sect. 3.2. We allow for at most p linearizations, where p is the maximum number of variables participating nonlinearly in any function of the optimization problem.

– Simplex, where every convex and/or concave relaxation is linearized at points determined through the application of the n-simplex algorithm described in Sect. 3.3. We choose p linearizations, where p is the maximum number of variables participating nonlinearly in any function of the optimization problem, namely the p points with the highest absolute improvement in interval tightness when applying the subgradient interval heuristic presented in our previous work [29] for the computation of McCormick relaxations of the given optimization problem.

– S+K, where every convex and/or concave relaxation is linearized iteratively using the adaptation of Kelley’s algorithm with the initial set of linearization points \(\mathscr {L}\) determined with the use of the n-simplex algorithm. We add at most 5 additional linearizations with Kelley’s algorithm.

– Random, where every convex and/or concave relaxation is linearized at p points determined randomly, where p is the maximum number of optimization variables participating nonlinearly in any function found in the optimization problem. Here we report the results for the best random seed among the 5 different random seeds we used in C++.

In all numerical experiments, we set the maximum number of iterations in the adapted Kelley’s Algorithm 1 to p, denoting the maximum number of optimization variables participating nonlinearly in any function found in the optimization problem, and the minimum difference thresholds to \(\mu ^a=\epsilon ^a \cdot 10\) and \(\mu ^r=10^{-2}\). For the determination of the n-simplex vertices, we set the ball radius to \(r=0.725\) and rotate the n-simplex by \(\gamma =30^\circ \) counterclockwise. Furthermore, in any algorithm, in addition to the at most p or \(n+1\) linearization points, we always use the mid point of the interval domain as a linearization point. The detailed CPU times and numbers of iterations can be found in Table 7 in Appendix A. Since performance plots can be easily misinterpreted [14], we present the computational results with the use of the shifted geometric mean defined as

$$\begin{aligned} \exp \left( \sum _{i=1}^n \frac{\log (\max (1,t_i+s))}{n}\right) - s, \end{aligned}$$
(1)

where n is the number of instances, \(t_i\) the reported computational time, and s a constant shift set to \(s=10\) in this work. We also report the shifted geometric mean with \(t_i\) denoting the number of iterations needed. If a method reaches the maximum computational time of 8 hours, we use \(t_i = 28 800~\hbox {s}\) or \(t_i= 2\cdot 10^6\) iterations (the maximum number of iterations needed among all runs that converged within the time limit of 8 hours).

Table 1 shows the speed-up differences among the considered methods. We see that the simple method using a single linearization point performs worst with respect to the number of iterations. It also performs worst with respect to computational time in all but the smallest case studies. All other methods are at least \(50\%\) faster than the single middle-point linearization and need up to 5 times fewer iterations. The n-simplex method seems to perform best on the considered test set with respect to computational time, but all other presented methods are only slightly weaker. Indeed, the n-simplex method solves more instances than the other considered methods in the given time frame. The hybrid method has the lowest number of iterations but performs slightly worse in terms of computational time. Kelley’s algorithm performs best regarding computational time for the methanol production processes but performs worse than the n-simplex algorithm on the other process engineering case studies. It also almost always needed strictly fewer than p iterations to converge to the preset tolerances. This may be an indicator that Kelley’s algorithm is already good enough for our purposes, but obviously this cannot be guaranteed. Note that for the minimization of the levelized cost of electricity (LCOE) in case 3, the methods which did not converge in time have a very small remaining gap. This may be misleading, as for this particular problem the last few percent needed to close the optimality gap are the most time consuming, meaning that the remaining \(\approx 1\%\) would most likely need more than 1 additional hour to converge. It is also interesting that the random generation of points performed slightly better than the adaptation of Kelley’s algorithm and the hybrid of the n-simplex and Kelley’s algorithms. This can be easily explained by the fact that the random seed we chose in C++ seems to work well for the instances considered. Unfortunately, this behavior cannot be guaranteed if a different machine or a different random seed is chosen, making the random generation of linearization points undesirable in general. We can conclude that except for the easiest instances or the instances with already very tight relaxations, the computation of additional linearization points seems favorable.
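For reference, a direct implementation of the shifted geometric mean (1) used for Table 1 and the tables in Sect. 4.2 (our sketch; runs hitting the time limit enter with \(t_i = 28\,800~\hbox {s}\) or \(t_i = 2\cdot 10^6\) iterations, as described above):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Shifted geometric mean (1) of the reported CPU times or iteration counts t.
double shiftedGeometricMean(const std::vector<double>& t, double s = 10.0) {
  double logSum = 0.0;
  for (const double ti : t) logSum += std::log(std::max(1.0, ti + s));
  return std::exp(logSum / static_cast<double>(t.size())) - s;
}
```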

Table 1 Tables reporting the relation of the shifted geometric means (1) between the considered methods with respect to CPU seconds and number of iterations

4 Hybridization of the auxiliary variable method and McCormick relaxations

Tsoukalas and Mitsos already discussed the relationship between the auxiliary variable method and multivariate McCormick relaxations (Section 4 in [37]). Their conclusion was that McCormick relaxations can be modified to provide equally tight relaxations as the AVM. The modification consists of the introduction of auxiliary optimization variables and corresponding equality constraints for common intermediate factors. We first briefly present the main result again herein. Since the introduction of additional optimization variables together with the corresponding equality constraints increases the complexity of the lower bounding problems, in a B&B framework it is favorable to use the linearization algorithms presented in the previous section. Then, we conduct additional numerical experiments to check the impact of the introduction of common factors as auxiliary optimization variables when using multivariate McCormick relaxations.

4.1 Tightness of McCormick relaxations and the auxiliary variable method

Consider the functions

$$\begin{aligned}&f_1:\mathbb {I}\mathbb {R}^2\ni X_1\rightarrow \mathbb {R},&~f_2:\mathbb {I}\mathbb {R}\ni X_2\rightarrow \mathbb {R},\\&f_3:\mathbb {I}\mathbb {R}\ni X_3\rightarrow \mathbb {R},&~f_4:\mathbb {I}\mathbb {R}\ni X_4\rightarrow \mathbb {R}, \end{aligned}$$

and the composition

$$\begin{aligned} g:\mathbb {I}\mathbb {R}\ni X\rightarrow \mathbb {R}, x \mapsto f_1\left( f_3(x),f_4(x)\right) + f_2\left( f_3(x)\right) , \end{aligned}$$

with \(X\subset X_3, X\subset X_4, f_3(X_3)\times f_4(X_4) \subset X_1\) and \(f_3(X_3)\subset X_2\). When minimizing g, the AVM could formulate the problem in two different ways, depending on whether the intermediate factor \(f_3\) is recognized as a common term (Formulation 2) or not (Formulation 1):

$$\begin{aligned} \begin{aligned} \text {(Formulation 1)}\quad \min _{x,\mathbf {z}}&~~~z_1 + z_2 \\ \text {s.t.}&~~~z_1 = f_1(z_3,z_4),~~z_2 = f_2(z'_3), \\&~~~z_3 = f_3(x),~~z'_3 = f_3(x),~~z_4 = f_4(x), \\&~~~x \in X, \end{aligned} \qquad \begin{aligned} \text {(Formulation 2)}\quad \min _{x,\mathbf {z}}&~~~z_1 + z_2 \\ \text {s.t.}&~~~z_1 = f_1(z_3,z_4),~~z_2 = f_2(z_3), \\&~~~z_3 = f_3(x),~~z_4 = f_4(x), \\&~~~x \in X, \end{aligned} \end{aligned}$$

with the corresponding convex relaxations used for the lower bounding problem

$$\begin{aligned} \begin{aligned} \min _{x,\mathbf {z}}&~~~z_1 + z_2 \\ \text {s.t.}&~~~f_1^{cv}(z_3,z_4) \le z_1 \le f_1^{cc}(z_3,z_4), \\&~~~f_2^{cv}(z'_3) \le z_2 \le f_2^{cc}(z'_3), \\&~~~f_3^{cv}(x) \le z_3 \le f_3^{cc}(x), \\&~~~f_3^{cv}(x) \le z'_3 \le f_3^{cc}(x), \\&~~~f_4^{cv}(x) \le z_4 \le f_4^{cc}(x), \\&~~~x \in X, \end{aligned} \qquad \begin{aligned} \min _{x,\mathbf {z}}&~~~z_1 + z_2 \\ \text {s.t.}&~~~f_1^{cv}(z_3,z_4) \le z_1 \le f_1^{cc}(z_3,z_4), \\&~~~f_2^{cv}(z_3) \le z_2 \le f_2^{cc}(z_3), \\&~~~f_3^{cv}(x) \le z_3 \le f_3^{cc}(x), \\&~~~f_4^{cv}(x) \le z_4 \le f_4^{cc}(x), \\&~~~x \in X, \end{aligned} \end{aligned}$$

where the left and right problems correspond to Formulations 1 and 2, respectively.

The convex relaxation of Formulation 1 is in general a relaxation of that of Formulation 2 and results in a weaker lower bound if the nonlinear convex problems are solved. Although the better lower bound cannot be guaranteed theoretically once the formulations are linearized, in practice it is still expected that the linearization of Formulation 2 results in a tighter bound.

When using multivariate McCormick relaxations we obtain the same bound for Formulation 1

$$\begin{aligned} \underset{x\in X}{\min } \left\{ \begin{array}{cc} \underset{\hat{z}_1,\hat{z}_2}{\min } &{} \hat{z}_1 + \hat{z}_2 \\ \text {s.t.} &{}\begin{array}{ccc} \left( \begin{array}{cc} \underset{\hat{z}_3,\hat{z}_4}{\min } &{} f_1^{cv}(\hat{z}_3,\hat{z}_4) \\ \text {s.t.} &{} f_3^{cv}(x) \le \hat{z}_3 \le f_3^{cc}(x) \\ &{} f_4^{cv}(x) \le \hat{z}_4 \le f_4^{cc}(x) \end{array} \right) &{} \le \hat{z}_1 \le &{} \left( \begin{array}{cc} \underset{\hat{z}_3,\hat{z}_4}{\max } &{} f_1^{cc}(\hat{z}_3,\hat{z}_4) \\ \text {s.t.} &{} f_3^{cv}(x) \le \hat{z}_3 \le f_3^{cc}(x) \\ &{} f_4^{cv}(x) \le \hat{z}_4 \le f_4^{cc}(x) \end{array} \right) \\ \left( \begin{array}{cc} \underset{\hat{z}'_3}{\min } &{} f_2^{cv}(\hat{z}'_3) \\ \text {s.t.} &{} f_3^{cv}(x) \le \hat{z}'_3 \le f_3^{cc}(x) \\ \end{array} \right) &{} \le \hat{z}_2 \le &{} \left( \begin{array}{cc} \underset{\hat{z}'_3}{\max } &{} f_2^{cc}(\hat{z}'_3) \\ \text {s.t.} &{} f_3^{cv}(x) \le \hat{z}'_3 \le f_3^{cc}(x) \\ \end{array} \right) \end{array} \end{array} \right\} . \end{aligned}$$

Note that in the above multivariate McCormick formulation, we have only one optimization variable x. Now, if the intermediate factor \(f_3(x)\) is recognized and treated as a common factor, it is necessary to introduce one additional optimization variable \(z_3\). The alternative formulation we then get is

$$\begin{aligned} \underset{x\in X,z_3 \in Z_3}{\min } \left\{ \begin{array}{cc} \underset{\hat{z}_1,\hat{z}_2}{\min } &{} \hat{z}_1 + \hat{z}_2 \\ \text {s.t.} &{}\begin{array}{ccc} \left( \begin{array}{cc} \underset{\hat{z}_4}{\min } &{} f_1^{cv}(z_3,\hat{z}_4) \\ \text {s.t.} &{} f_4^{cv}(x) \le \hat{z}_4 \le f_4^{cc}(x) \end{array} \right) &{} \le \hat{z}_1 \le &{} \left( \begin{array}{cc} \underset{\hat{z}_4}{\max } &{} f_1^{cc}(z_3,\hat{z}_4) \\ \text {s.t.} &{} f_4^{cv}(x) \le \hat{z}_4 \le f_4^{cc}(x) \end{array} \right) \\ f_2^{cv}(z_3) &{} \le \hat{z}_2 \le &{} f_2^{cc}(z_3) \\ f_3^{cv}(x) &{} \le z_3 \le &{} f_3^{cc}(x) \end{array} \end{array} \right\} \end{aligned}$$

providing the same bound as Formulation 2 in the AVM by only introducing a single additional auxiliary optimization variable. This also means that the tightness of the linear lower bounding problem obtained through linearization of the convex multivariate McCormick relaxation is equal to the bound obtained by the linearized AVM Formulation 2 if the same linearization points are used (which in general we do not recommend due to the combinatorial complexity (Sect. 3.1)). Although not explicitly mentioned in the original publication [37], the above formulation already suggests that a hybridization of the AVM and the multivariate McCormick relaxations may result in computational advantages. In this work, instead of trying to achieve the tightness of the AVM, we experiment with the addition of only a few auxiliary optimization variables in order to achieve best computational results.

4.2 Numerical results

Modern implementations of directed acyclic graphs such as the one in MC++ [7] allow for the recognition of common factors. The recognition is done automatically and has quadratic complexity, since in the worst case all previous factors have to be checked whenever a new factor is introduced. When comparing the AVM with McCormick relaxations, it is of interest to investigate the impact of the intermediate common factors on the computational time in a B&B algorithm. Note that the addition of all intermediate factors, including those occurring only once, results in the AVM.
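A minimal sketch of such a recognition step (ours; MC++ uses its own DAG data structures): each time a new operation node is appended, it is compared against all existing nodes, which is linear per insertion and hence quadratic overall. Factors with the largest occurrence count are the candidates that the heuristic described below replaces by auxiliary optimization variables.

```cpp
#include <string>
#include <vector>

// A DAG node: an operation applied to already existing nodes (children).
struct Node { std::string op; std::vector<int> children; };

struct Dag {
  std::vector<Node> nodes;
  std::vector<int> occurrences;  // how often each factor occurs in the problem

  // Insert a factor, reusing an existing node if an identical one is found.
  int insert(const Node& n) {
    for (std::size_t i = 0; i < nodes.size(); ++i)      // linear scan ...
      if (nodes[i].op == n.op && nodes[i].children == n.children) {
        ++occurrences[i];                               // ... common factor found
        return static_cast<int>(i);
      }
    nodes.push_back(n);                                 // new intermediate factor
    occurrences.push_back(1);
    return static_cast<int>(nodes.size()) - 1;
  }
};
```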

In our experiments, we always choose the intermediate nonlinear factors participating most often in the optimization problem and replace them by an auxiliary optimization variable and a corresponding equality constraint in the lower bounding problem. We use the same optimization problems as in Sect. 3.5 with the exact same computational setup. In our experiments, we add up to 3 auxiliary variables and solve the problems with the same 5 linearization strategies presented in Sect. 3.5. However, we do not branch on these auxiliary variables but rather only apply bound tightening heuristics to them. Initial valid bounds for the introduced variables are obtained through interval arithmetic and constraint propagation. We report only the problems where at least one common intermediate factor has been found. Again, we use the shifted geometric mean (cf. Sect. 3.5) to compare the different linearization strategies for a fixed number of additional auxiliary optimization variables and also to compare the same linearization strategies for the different numbers of introduced auxiliary optimization variables (cf. Tables 2, 4). Table 3 reports the shifted geometric mean comparing the different linearization strategies over all numbers of additional auxiliary variables (1–3). Table 5 reports the shifted geometric mean comparing the different linearization strategies and all different numbers of auxiliary variables (0–3).

Table 2 Tables reporting the relation of the shifted geometric means (1) between the considered methods with respect to CPU seconds and number of iterations when exactly 1, 2 or 3 auxiliary optimization variables are added
Table 3 Tables reporting the relation of the shifted geometric means (1) between the presented methods considering all numbers of additional auxiliary variables (1–3) with respect to CPU seconds and number of iterations
Table 4 Tables reporting the relation of the shifted geometric means (1) when only the mid point of the domain box is used as linearization point; when the adapted Kelley’s algorithm (Sect. 3.2) is used to linearize the lower bounding problem; when the n-simplex algorithm (Sect. 3.3) is used to compute the linearization points; when the adapted Kelley’s algorithm (Sect. 3.2) with the initial set of linearization points computed with the n-simplex algorithm (Sect. 3.3) is used to linearize the lower bounding problem; and when random linearization points are used. We consider the case when 0–3 auxiliary variables are introduced and compare each method individually in terms of CPU seconds and number of iterations
Table 5 Tables reporting the relation of the shifted geometric means (1) when only the mid point of the domain box is used as linearization point; when the adapted Kelley’s algorithm (Sect. 3.2) is used to linearize the lower bounding problem; when the n-simplex algorithm (Sect. 3.3) is used to compute the linearization points; when the adapted Kelley’s algorithm (Sect. 3.2) with the initial set of linearization points computed with the n-simplex algorithm (Sect. 3.3) is used to linearize the lower bounding problem; and when random linearization points are used. We consider the case when 0–3 auxiliary variables are introduced and compare all combinations of method and number of auxiliary variables in terms of CPU seconds and number of iterations. aux denotes auxiliary variable(s)

In the cases where 1 or 2 auxiliary optimization variables are added, the n-simplex method outperforms all other considered procedures and is the only one to solve all considered numerical examples within the time frame of 8 hours (cf. Tables 8, 9, 10 found in “Appendix A”) with up to 2 auxiliary variables. The adaptation of Kelley’s algorithm performs best when 3 auxiliary variables are added, but in general all procedures perform worse than when fewer auxiliaries are used. Overall, Kelley’s and the n-simplex algorithm perform very similarly when auxiliary variables are added (cf. Tables 3, 5). The introduction of auxiliary variables allows an improvement of up to \(\approx 20\%\) with respect to solution times for all procedures (cf. Table 4). It is also plausible that linearization strategies using multiple linearization points gain a computational advantage over the simple single-point linearization when auxiliary optimization variables are introduced, due to the increased dimensionality and thus increased complexity of the optimization problem. Unfortunately, we cannot say that introducing more auxiliary variables always results in the best performance: e.g., when introducing auxiliary variables for the minimization of the levelized cost of electricity in case 3 [1], we achieve the best computational performance when adding only 1 auxiliary variable. However, the number of iterations needed decreases as more auxiliary optimization variables are added, confirming the improved tightness of the resulting lower bounding relaxations. This may be an indicator that replacing the common factors occurring most often in the optimization problem is not always the best heuristic, but rather that other properties are crucial for the computational advantage. It is also possible that branching on the auxiliary variables is required to further improve the computational performance. The improvement of the choice of intermediate factors and the development of specialized heuristics for the hybridization of McCormick relaxations with the AVM require further investigation and remain an active research topic of the authors.

The present results for the NRTL flash case (NRTL_RS_IdealGasFlash) also provide an explanation for the differences in performance of the model formulations considered by Bongartz and Mitsos [3]. In the version of the case study used herein, only the liquid mole fractions of the flash are optimization variables while the vapor mole fractions are calculated from the isofugacity equations. Bongartz and Mitsos [3] observed that adding the vapor mole fractions as optimization variables, thus turning the isofugacity equations into constraints, improved computational performance. When allowing the addition of auxiliary variables, the variables added herein correspond exactly to these vapor mole fractions, showing that their addition to the problem results in tighter relaxations because they are re-occurring factors.

5 Conclusion

Algorithms for the computation of linearization points for the auxiliary variable method are available and widely used in state-of-the-art deterministic global optimization solvers [35]. No such methods have been proposed for the (multivariate) McCormick relaxations. We discuss difficulties and limits of methods for the computation of affine underestimators of McCormick relaxations when the idea of subgradient propagation [28] is considered. We develop an iterative method based on well-known cutting plane algorithms from the field of convex optimization [8, 21] as well as an algorithm of our own, based on the computation of all vertices of an n-simplex, for the a priori computation of linearization points. We combine both methods and compare all procedures in numerical experiments. The n-simplex algorithm performs best on the considered test set, but all methods perform up to \(\approx 60\%\) better than the single-point linearization strategy used in our previous works. Furthermore, we tie in with the discussion on relaxation tightness and the introduction of auxiliary variables in the AVM and McCormick relaxations presented in [37]. The theoretical results are applied in the numerical experiments by recognizing common intermediate factors in the directed acyclic graph and introducing additional optimization variables while still using the McCormick propagation rules. The introduction of a limited number of auxiliary variables provides large improvements in computational times. We deduce that it is not always best (w.r.t. computational time) to introduce all commonly occurring factors as auxiliary optimization variables. The development of methods for the determination of when to replace a common factor by an auxiliary optimization variable remains an active field of research of the authors.