Abstract
We propose a new decomposition method to solve multistage non-convex mixed-integer (stochastic) nonlinear programming problems (MINLPs). We call this algorithm non-convex nested Benders decomposition (NC-NBD). NC-NBD is based on solving dynamically improved mixed-integer linear outer approximations of the MINLP, obtained by piecewise linear relaxations of nonlinear functions. Those MILPs are solved to global optimality using an enhancement of nested Benders decomposition, in which regularization, dynamically refined binary approximations of the state variables and Lagrangian cut techniques are combined to generate Lipschitz continuous non-convex approximations of the value functions. Those approximations are then used to decide whether the approximating MILP has to be dynamically refined and to compute feasible solutions for the original MINLP. We prove that NC-NBD converges to an \(\varepsilon \)-optimal solution in a finite number of steps. We provide promising computational results for some unit commitment problems of moderate size.
Introduction
We propose a new decomposition method to solve multistage non-convex mixed-integer (stochastic) nonlinear programming problems (MINLPs), i.e., optimization problems modeling a sequential decision-making process. Continuous and integer decision variables and possibly non-convex objective functions and constraints are allowed for any of the T stages.
If multistage (stochastic) problems are too large to be solved by off-the-shelf solvers, then tailored solution techniques are required. One example is decomposition algorithms, which make use of the specific sequential and block-diagonal structure of the constraints. The problems are decomposed into a large number of smaller but coupled subproblems which are solved iteratively. One of the most common decomposition methods is Benders decomposition, introduced by Benders [6] for linear programs. Since then, it has been enhanced to several more general cases, such as convex problems (generalized Benders decomposition (GBD) [19]), two-stage stochastic linear problems (L-shaped method [49]) and multistage (stochastic) linear problems (nested Benders decomposition (NBD) [8]). To mitigate the curse of dimensionality related to NBD in the stochastic case, Pereira and Pinto introduced its sampling-based variant stochastic dual dynamic programming (SDDP) [35], which was followed by various extensions [23, 37].
The basic principle of NBD is to use the dynamic programming formulation of a given multistage problem. For each stage \(t \in \left\{ 1,\ldots , T \right\} \), a parametric subproblem is considered. This subproblem contains only those constraints, variables and parts of the objective function related to this specific stage, plus a value function determining the optimal value of all following stages for a given stage-\(t\) solution. Since the value functions are not known in advance, they are iteratively approximated with linear cutting planes. However, this approach requires the value functions to be convex. Therefore, most decomposition methods for multistage problems cover linear programs, as their value functions are guaranteed to be piecewise linear and convex.
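To illustrate the classical cutting-plane idea described above, consider the following minimal sketch (the value function and trial points are illustrative choices, not taken from the paper): a convex value function is underestimated by the pointwise maximum of supporting hyperplanes, which are added iteratively and are tight at the trial points.

```python
# Minimal sketch of the classical Benders cutting-plane idea:
# a convex value function Q(x) is approximated from below by the
# pointwise maximum of supporting hyperplanes ("cuts").
# Q and the trial points are illustrative, not from the paper.

def Q(x):
    # convex "value function" of the following stage (illustrative)
    return (x - 2.0) ** 2 + 1.0

def Q_grad(x):
    return 2.0 * (x - 2.0)

cuts = []  # list of (intercept, slope) pairs

def Q_approx(x):
    # polyhedral lower approximation: maximum over all cuts
    return max((a + b * x for a, b in cuts), default=float("-inf"))

for trial in [0.0, 4.0, 2.5]:  # trial points from "forward passes"
    val, slope = Q(trial), Q_grad(trial)
    # supporting hyperplane: Q(x) >= val + slope * (x - trial)
    cuts.append((val - slope * trial, slope))

# the approximation underestimates Q everywhere and is tight at trial points
for x in [0.0, 1.0, 2.5, 4.0]:
    assert Q_approx(x) <= Q(x) + 1e-9
assert abs(Q_approx(0.0) - Q(0.0)) < 1e-9
```

For non-convex value functions, as discussed next, the pointwise maximum of such linear cuts can no longer be tight everywhere, which is exactly why the classical approach fails.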
However, in many applications, integer variables or nonlinearities also occur naturally. In such cases, the value functions are no longer convex and may also no longer be continuous. Therefore, the classical Benders approach fails, as it is impossible to construct a tight convex polyhedral approximation [47].
Thus, more sophisticated approaches have been developed to use Benders-type decomposition methods for non-convex MINLPs, mostly for the two-stage case. Li et al. propose an extension of GBD to the non-convex case for two-stage stochastic MINLPs with functions separable in integer and continuous variables [29, 30]. In [28], a branch-and-cut framework is presented, where in each node Lagrangian and generalized Benders cuts are constructed. Related methods are proposed in [26, 33]. None of these methods has been generalized to the multistage case yet.
To handle non-convexities in multistage problems, a common idea is to use convex relaxations of the value function, e.g., by relaxing the integrality constraints for MILPs or by convexifying nonlinear terms in a static manner. Dynamically convexifying the non-convex value functions using Lagrangian relaxation techniques allows for a polyhedral approximation by Lagrangian cuts [10, 45, 46]. None of these discussed approaches can guarantee to compute an optimal solution for non-convex multistage problems, though.
Only recently, some substantial progress has been made in generalizing the Benders decomposition idea to multistage problems with non-convex value functions directly. In [36], step functions are used, instead of cutting planes, to approximate the value functions, presuming their monotonicity.
For the mixed-integer linear case, the stochastic dual dynamic integer programming (SDDiP) approach is proposed [55]. SDDiP is an enhancement of NBD and SDDP which allows the solution of multistage (stochastic) MILPs in the case of binary state variables. The method is based on generating special Lagrangian cuts, which reproduce the lower convex envelope of the value function. As the latter is piecewise linear and exact at binary state variables, strong duality is ensured and the problem is solved to global optimality in a finite number of iterations. SDDiP is applied to multistage unit commitment in [54]. It is also applied to a problem containing non-convex functions in the context of hydro power scheduling by using a static binary expansion of the state variables and a Big-M reformulation [22].
As long as the value functions are assured to be Lipschitz continuous and some recourse property is satisfied, the requirement of binary state variables can be dropped, as is shown by the stochastic Lipschitz dynamic programming (SLDP) method in [1]. Here, two types of non-convex Lipschitz continuous cuts are introduced: reverse-norm cuts and augmented Lagrangian cuts.
In [52], Zhang and Sun present a new framework to solve multistage non-convex stochastic MINLPs, generalizing both SDDiP and SLDP. Similarly to [1], nonlinear generalized conjugacy cuts are constructed by solving augmented dual problems. Moreover, as Lipschitz continuity is not assured for the value functions, a Lipschitz continuous regularized value function is considered within the decomposition method.
In this article, we propose a new method to solve multistage non-convex MINLPs to proven global optimality, which we refer to as non-convex nested Benders decomposition (NC-NBD). The method combines piecewise linear relaxations, regularization, binary approximation and the SDDiP Lagrangian cuts in a unique and dynamic fashion. Its basic idea is to solve an MINLP by iteratively improved MILP outer approximations, which in turn are solved using an NBD-based decomposition scheme similar to that in [52]. The binary and piecewise linear approximations are dynamically refined.
In particular, the original MINLP is outer approximated by MILPs, which are iteratively improved in an outer loop. Those MILPs are obtained by piecewise linear approximations of all occurring nonlinear functions, which is an established method in global optimization [50]. More generally, using MILP relaxations is a common approach in global optimization solvers [27, 32, 53].
In an inner loop, the multistage MILPs are solved to approximate optimality in finitely many steps. This is achieved using an NBD-based decomposition method. In a forward pass through the stages, trial solutions for the dynamic programming equations are determined. As Lipschitz continuity of the value functions is not guaranteed, this is done by solving a regularized forward pass problem, as proposed in [52]. For a sufficiently large, but finite parameter, the regularization is exact [14, 52], so that still the desired MILP is solved.
In a backward pass through the stages, nonlinear non-convex cuts are constructed to approximate the non-convex value functions of the MILP. To this end, we make use of a binary approximation of the state variables in the subproblems. As proven in [55], for MILPs with binary state variables we obtain (sufficiently) tight cuts by solving Lagrangian dual problems. The constructed linear cuts are then projected back to the original state space, yielding a nonlinear, non-convex, but Lipschitz continuous approximation of the value functions. The binary approximation is refined dynamically within the inner loop if required. By careful construction, all existing cuts remain valid even with such refinements.
Once the MILP approximation is solved to approximate optimality, the cut approximation of the value functions is used in the outer loop to determine bounds for the optimal value of the original MINLP. If the bounds are sufficiently close, the algorithm terminates with an \(\varepsilon \)-optimal solution. Otherwise, the piecewise linear approximations are refined, and thus the approximating MILP is tightened. Again, by careful construction it is ensured that all previously generated cuts remain valid.
To the best of our knowledge, the above concepts have not been combined in this dynamic way to solve multistage non-convex MINLPs yet. In that regard, our work also differs significantly from the aforementioned solution techniques.
Our proposed decomposition scheme uses the same regularization technique and similar convergence ideas as in [52]. However, a fundamental difference is that we only apply this technique to solve MILP outer approximations of the original MINLP. This has the advantage that in our framework MINLPs have to be solved only occasionally. In contrast, in [52], MINLPs are assumed to be solved by some oracle in each iteration and cuts are generated directly for the MINLP, which is computationally challenging. Moreover, contrary to our approach, the method in [52] does not require recourse assumptions, but in return it only allows for state variables in the objective function.
In contrast to SDDiP [55] and SLDP [1], we solve MINLPs, and thus consider a larger solution framework with an inner and an outer loop. However, even the inner loop, in which MILPs are solved, differs from both approaches.
To solve MILPs with non-binary state variables using SDDiP, it is proposed to apply a static binary approximation [22, 55]. This way, the original MILP is replaced by an approximating problem with only binary state variables. It can be shown that for a sufficiently small approximation precision, i.e., a sufficiently large number of binary variables, an \(\varepsilon \)-optimal solution of an MILP can be determined with this approach under some recourse assumption [55]. However, for a given problem at hand, it is not necessarily clear in advance how this precision has to be chosen, as knowledge of a problem-specific Lipschitz constant is required. This becomes even more challenging in our framework, where an MINLP is iteratively approximated by MILPs, for which the required precision may change. In contrast, within NC-NBD the binary approximation is refined dynamically if required.
More crucially, in NC-NBD the binary approximation is applied only temporarily to derive cuts in the backward pass. These cuts are then projected back to the original state space. This construction has a few key advantages: Firstly, it ensures that cuts remain valid even if the binary precision is refined later on. Secondly, the original state variables remain continuous and are not limited to values which can be exactly represented by the binary approximation. This, in turn, ensures that the true MILPs are solved in the inner loop. Consequently, the generated cuts are valid for the value functions of these MILPs and, due to their relaxation property, also for the original MINLP. Analogously, the obtained lower bounds are valid for the corresponding optimal values. Importantly, this is not true for SDDiP with static binary approximation, where the state space is permanently modified and only approximations of the true MILPs are solved in the inner loop. In our approach to solving MINLPs, it is crucial to determine guaranteed valid cuts for the value functions in both loops. Therefore, SDDiP cannot be used effectively in this setting.
Our cut generation approach also differs from that in SLDP [1] (and also [52]), where augmented Lagrangian problems are solved to determine nonlinear cuts. While our method comes at the cost of introducing additional (binary) variables and constraints compared to those approaches, e.g., for the cut projection, we avoid solving dual problems containing nonlinear penalization in the objective. Such penalization may be disadvantageous as it prevents decomposition of the primal problems which are solved in the solution process of the dual problem. Additionally, in contrast to SLDP [1], we do not assume continuously complete recourse, but only the weaker complete recourse, as we circumvent the requirement of Lipschitz continuity of the true value functions by regularization.
The main contributions of this paper are as follows:

(1)
We present the non-convex nested Benders decomposition (NC-NBD) method to globally solve general multistage non-convex MINLPs. The method combines piecewise linear relaxations, regularization, binary approximation and cutting-plane techniques in a unique way. In contrast to existing approaches, all approximations are improved dynamically where and when it is reasonable. To our knowledge, this is the first decomposition method for general multistage non-convex MINLPs.

(2)
A crucial requirement when using dynamic refinements is to ensure that all previously determined cuts remain valid within the refinement process and do not have to be generated from scratch. We ensure this by a special cut projection and a careful choice of the MILP relaxations.

(3)
We prove that the proposed NC-NBD method converges to an \(\varepsilon \)-optimal solution of \((\varvec{P})\) in a finite number of steps under some mild assumptions.

(4)
We provide first computational results of applying NC-NBD to moderate-sized instances of a unit commitment problem to illustrate its efficacy.
To enhance readability, we focus our discussions solely on deterministic MINLPs. However, the presented NC-NBD idea can also be applied to stochastic programs with stagewise independent and finite random variables.
The remainder of the paper is organized as follows. We present the considered problem formulation and assumptions in Sect. 2. Then, we introduce the NC-NBD with its different steps in Sect. 3, before presenting convergence results in Sect. 4. Afterwards, we provide computational results for instances of a simple unit commitment problem in Sect. 5. We conclude with Sect. 6.
Problem formulation
We consider the following multistage non-convex MINLP problem
Here \(t = 1,\ldots ,T\) denotes the different stages with the final stage \(T \in {\mathbb {N}}\). For each stage t, the decision variables can be separated into mixed-integer state variables \(x_t \in {\mathbb {R}}_+^{n_t^1} \times {\mathbb {Z}}_+^{n_t^2}\) and local variables \(y_t \in {\mathbb {R}}^{n_t^3} \times {\mathbb {Z}}^{n_t^4}\), with \(x_0 = 0\). We define \(n_t := n_t^1 + n_t^2\) as the number of state variables. The sets \(M_t(x_{t-1})\) appearing in the constraints for each stage t are defined by
\(X_t\) and \(Y_t\) denote box constraints; \(X_0 := \{ 0 \}\). As such, \(X_t\) and \(Y_t\) are compact sets for all stage-\(t\) variables. All functions \(f_t: X_t \times Y_t \rightarrow {\mathbb {R}}\), \(g_t: X_{t-1} \times X_t \times Y_t \rightarrow {\mathbb {R}}^{m_t^1}\) and \(h_t: X_{t-1} \times X_t \times Y_t \rightarrow {\mathbb {R}}^{m_t^2}\) are well-defined on their domains.
To exploit its multistage structure, we solve \((\varvec{P})\) by some extension of NBD. NBD makes use of the dynamic programming formulation of \((\varvec{P})\), where each stage-\(t\) subproblem, \(t=1,\ldots ,T\), can be denoted by
with the value function \(Q_t(\cdot )\) of stage t and \(Q_{T+1}(\cdot ) \equiv 0\). Note that \(x_t\) links different stages, i.e., \(x_t\) is a decision variable for \((\varvec{P_t(x_{t-1})})\) and a parameter for \((\varvec{P_{t+1}(x_{t})})\). For the first stage, we obtain that \(Q_1(x_0) = v\) with \(x_0 \equiv 0\). Importantly, subproblem \((\varvec{P_t(x_{t-1})})\) is enhanced by introducing local copies \(z_t\) of the state variables \(x_{t-1}\) and the copy constraints \(z_t = x_{t-1}\). Those copy constraints will prove crucial for the cut generation later on. Taking into account the local copies, we define
As the subproblems \((\varvec{P_t(x_{t-1})})\) are non-convex MINLPs, the value functions \(Q_t(\cdot )\) may be non-continuous and non-convex, two detrimental properties for Benders decomposition approaches. To ensure that the value functions \(Q_t(\cdot )\) are at least lower semicontinuous (l.sc.), we make the following technical assumptions:
 (A1):

For all \(t=1,\ldots ,T\),
 (a):

the functions \(f_t\) are Lipschitz continuous on \(X_t \times Y_t\),
 (b):

the functions \(g_t\) and \(h_t\) are continuous on \(X_{t-1} \times X_t \times Y_t\).
 (A2):

(Complete recourse). For any stage t and any \(\bar{x}_{t-1} \in X_{t-1}\), there exists some \((z_t, x_t,y_t) \in X_{t-1} \times X_t \times Y_t\) which is feasible for \((\varvec{P_t(\bar{x}_{t-1})})\).
As all variables are box-constrained, the feasible set \(M_t(x_{t-1})\) of \((\varvec{P_t(x_{t-1})})\) is bounded. With assumption (A1) and the recourse assumption (A2), all subproblems \((\varvec{P_t(x_{t-1})})\) are feasible and bounded. Analogously, \((\varvec{P})\) is feasible with finite optimal value v. Note that under assumption (A2) we can restrict ourselves to generating optimality cuts in NC-NBD without the need to introduce Benders feasibility cuts.
We obtain our required l.sc. property of the value functions \(Q_t(\cdot )\).
Lemma 2.1
Under assumptions (A1) and (A2) the value functions \(Q_t(\cdot )\) are l.sc. for all \(t=1,\ldots ,T\).
Proof
For any fixed assignment of the integer variables, the l.sc. property follows from Exercise 1.19 in [41]. As \(X_t\) and \(Y_t\) are bounded, only finitely many different values can be attained by the integer variables. The minimum of finitely many l.sc. functions is l.sc. \(\square \)
In the next section, we introduce the NC-NBD method, which combines regularization, piecewise linear approximations, binary expansion and special cutting-plane techniques in a unique way to solve \((\varvec{P})\).
Non-convex nested Benders decomposition
The NC-NBD principle
The basic idea of the NC-NBD algorithm is to exploit that MILP problems can be solved exactly by enhancements of NBD under certain assumptions, and that MINLPs can be iteratively outer approximated by MILPs. Thus, the method consists of two main components. The first component is an inner loop which is used to determine an approximately optimal solution of some MILP outer approximation \((\varvec{\widehat{P}^{\ell }})\) of problem \((\varvec{P})\). This approximation is determined by piecewise linear relaxations of nonlinear functions in \((\varvec{P})\). The second component is an outer loop which refines this outer approximation iteratively (indexed by \(\ell \)) to improve the approximation of the optimal value v of \((\varvec{P})\). The NC-NBD is summarized in Algorithm 1 and illustrated in Fig. 1.
The inner loop follows the general principle of NBD to solve \((\varvec{\widehat{P}^{\ell }})\). It consists of a forward and a backward pass through the stages \(t=1,\ldots ,T\) in each iteration i. In the forward pass, the stage-\(t\) subproblem \((\varvec{\widehat{P}^{\ell }_t(x_{t-1})})\) is approximated in two different ways: The value function \(\widehat{Q}_{t+1}(\cdot )\) of the following stage is replaced by some outer approximation \(\mathfrak {Q}_{t+1}^{\ell i}(\cdot )\). Moreover, a regularization is added to ensure Lipschitz continuity of the corresponding value functions. Thus, regularized subproblems \((\varvec{\widehat{P}_t^{R,\ell i}(x_{t-1})})\) are solved, as proposed in [52], yielding trial solutions \(\widehat{x}_{t-1}^{\ell i}\) and an upper bound \(\overline{\widehat{v}}^{\ell i}\) for \((\varvec{\widehat{P}^{\ell }})\).
In the backward pass, the approximations \(\mathfrak {Q}_{t+1}^{\ell i}(\cdot )\) of \(\widehat{Q}_{t+1}(\cdot )\) are improved iteratively by constructing additional cuts. As the value functions are possibly non-convex, those cuts are nonlinear. Importantly, cuts for \(\widehat{Q}_{t+1}(\cdot )\) are also valid for \(Q_{t+1}(\cdot )\), as the former is an outer approximation of the latter.
In the literature, different ways are proposed to obtain nonlinear optimality cuts and to ensure that the inner loop converges to the optimal value \(\widehat{v}^{\ell }\) of \((\varvec{\widehat{P}^{\ell }})\). One method is to generate reverse-norm cuts [1]. However, this only works if the value functions themselves are Lipschitz continuous, which is not guaranteed in our setting. Another, more general method is to solve some augmented Lagrangian dual problem, as proposed in [1, 52].
We propose a third and new method, based on the SDDiP technique [55]. We exploit that we can generate sufficiently tight cuts by solving a Lagrangian dual in a lifted space, in which all state variables are binary. Thus, we (temporarily) approximate the state variables with binary ones, construct cuts in the binary space and then project those cuts back to the original space. As we show, these projections can be modeled by mixed-integer linear constraints in the original space. By careful construction, these cuts remain valid even if the binary approximation is refined in later iterations.
In this way, we circumvent solving an augmented Lagrangian dual, which may be even more expensive than solving the classical Lagrangian dual, since with the additional nonlinear term in the objective, the primal problems lose their decomposability. In return, we require more (binary) variables and constraints in the Lagrangian duals and for an MILP representation of our cuts than the approach in [1].
In principle, the MILPs as they occur in the inner loop could also be solved by using SDDiP with a static binary approximation of the state variables [55]. As discussed in Sect. 1, this approach has some properties which prevent an efficient integration into our algorithmic framework, though.
As we show in the next section, for a sufficiently fine binary approximation, the obtained cuts in the NC-NBD provide a sufficiently good approximation at the trial solutions \(\widehat{x}_{t-1}^{\ell i}\). Additionally, the cut approximations \(\mathfrak {Q}_t^{\ell }(\cdot )\) are generated in such a way that they are Lipschitz continuous. This is sufficient to ensure convergence to a globally optimal solution of \((\varvec{\widehat{P}^{\ell }})\).
At the end of the backward pass, a lower bound \(\underline{\widehat{v}}^{\ell i}\) is determined. If \(\overline{\widehat{v}}^{\ell i}\) and \(\underline{\widehat{v}}^{\ell i}\) are sufficiently close to each other, an approximate globally minimal point \(\big ( (\widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell } ) \big )_{t=1,\ldots ,T}\) of \((\varvec{\widehat{P}^{\ell }})\) has been identified and the inner loop is left. Otherwise, further cuts have to be constructed or the binary approximation has to be refined. We discuss this decision in more detail in Sect. 3.3.6.
Once the inner loop is left, subproblems \((\varvec{P_t(x_{t-1}, \mathfrak {Q}_{t+1}^{\ell })})\) are solved to determine trial points \(x_{t-1}^{\ell }\) and an upper bound \(\overline{v}^{\ell }\) on v for the original problem \((\varvec{P})\). If this upper bound is sufficiently close to \(\underline{\widehat{v}}^{\ell }\), the solution \(\big ( (z_t^{\ell }, x_{t}^{\ell }, y_t^{\ell }) \big )_{t=1,\ldots ,T}\) is approximately optimal for problem \((\varvec{P})\). If not, the MILP relaxation \((\varvec{\widehat{P}^{\ell +1}})\) is created by refining \((\varvec{\widehat{P}^{\ell }})\) in the neighborhood of \(\big ( (\widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell } ) \big )_{t=1,\ldots ,T}\) and a new inner loop is started.
As in the inner loop, it is crucial that all previously generated cuts remain valid under these outer-loop refinements. Otherwise, the cut approximation \(\mathfrak {Q}_{t+1}^{\ell }(\cdot )\) would have to be built from scratch, counteracting the idea of a dynamic solution framework. In the following subsections, we show how such persistent validity can be achieved by careful design. Note that, even though we make use of the same regularization idea, our framework with nested loops and dynamic refinements also differs from the method presented in [52].
We explain the different steps of NC-NBD in more detail in the following subsections, before we discuss convergence results in Sect. 4. As long as the index \(\ell \) is not needed for the discussions of the inner loop, we omit it for notational convenience. Moreover, we note that several of the considered subproblems require the introduction of additional decision variables, e.g., for piecewise linear approximation or cut projection. For reasons of clarity and comprehensibility, by the terms optimal point or optimal solution we refer to the projection of their actual optimal points to the space \(X_{t-1} \times X_t \times Y_t\), which we are interested in.
Piecewise linear relaxations
In the outer loop of NC-NBD, all nonlinear functions \(\gamma \in \varGamma \) in problem \((\varvec{P})\) are approximated by piecewise linear functions. This is achieved by determining a triangulation of their domain, which in our box-constrained setting is always possible. Then, the piecewise linear functions can be defined on the simplices of this triangulation using the function values of \(\gamma \) at their vertices. For a thorough discussion and state-of-the-art approaches to construct piecewise linear approximations and triangulations, see [18, 39, 40].
The piecewise linear approximations can then be reformulated as mixedinteger linear constraints using auxiliary continuous and binary variables. In the literature, several modeling techniques have been proposed, such as the convex combination model, the incremental model and some logarithmic variants [4, 18, 38, 51]. Later on, we draw on refinement and convergence ideas from [9], which work for several of these models, such as the generalized incremental model [9] or the disaggregated logarithmic convex combination model [51].
By shifting the approximations appropriately, it can be ensured that the obtained MILP \((\varvec{\widehat{P}^{\ell }})\) is indeed a relaxation of the original problem \((\varvec{P})\) [18]. Alternatively, one can construct piecewise linear underestimators and overestimators, yielding tubes for nonlinear equations [25].
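The shifting idea can be sketched in one dimension as follows (the function \(\gamma\), the breakpoints and the evaluation grid are illustrative choices, not from the paper): interpolate \(\gamma\) linearly on a grid, bound the maximal interpolation error, and shift the interpolant down by this error to obtain a piecewise linear underestimator of \(\gamma\).

```python
# One-dimensional sketch of a piecewise linear relaxation via shifting.
# The function gamma, breakpoints and test grid are illustrative.

import bisect

def gamma(x):
    return x * x  # nonlinear function to be approximated

breakpoints = [0.0, 0.5, 1.0, 1.5, 2.0]

def interpolant(x):
    # piecewise linear interpolation of gamma on the breakpoint grid
    j = min(bisect.bisect_right(breakpoints, x), len(breakpoints) - 1)
    i = j - 1
    x0, x1 = breakpoints[i], breakpoints[j]
    t = (x - x0) / (x1 - x0)
    return (1 - t) * gamma(x0) + t * gamma(x1)

# maximal interpolation error, estimated on a fine grid
grid = [k * 0.01 for k in range(201)]  # [0, 2]
delta = max(interpolant(x) - gamma(x) for x in grid)

def underestimator(x):
    # shifted interpolant: a piecewise linear relaxation of gamma
    return interpolant(x) - delta

assert all(underestimator(x) <= gamma(x) + 1e-9 for x in grid)
```

Refining the breakpoint grid shrinks `delta`, so the shifted approximation tightens, which mirrors the dynamic refinement of the MILP relaxations in the outer loop.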
Applying the piecewise linear approximations to problem \((\varvec{P})\), we obtain the MILP outer approximation with copy constraints
For reasons of clarity, we denote the piecewise linear relaxations of \(f_t(\cdot ), g_t(\cdot )\) and \(h_t(\cdot )\) by \(\widehat{f}_t(\cdot ), \widehat{g}_t(\cdot )\) and \(\widehat{h}_t(\cdot )\), although they are modeled using auxiliary constraints and variables. The set \(\widehat{M}_t\) is defined by replacing the functions \(g_t(\cdot )\) and \(h_t(\cdot )\) in \(M_t\) or \(M_t(x_{t1})\), respectively, with \(\widehat{g}_t(\cdot )\) and \(\widehat{h}_t(\cdot )\).
The dynamic programming equations for \(t=1, \ldots , T\) are given by
For the MILP subproblems \((\varvec{\widehat{P}_t}(\cdot ))\), we obtain the following properties.
Lemma 3.1
Under assumption (A2), subproblem \((\varvec{\widehat{P}_t}(\cdot ))\) has complete recourse and the value function \(\widehat{Q}_t(\cdot )\) is l.sc. for all \(t=1,\ldots ,T\).
Proof
The complete recourse follows from the complete recourse of \((\varvec{P_t}(\cdot ))\) by construction. The l.sc. property then follows from Theorem 3.1 in [31]. \(\square \)
The inner loop
In the inner loop of NC-NBD, the MILP subproblems \((\varvec{\widehat{P}_t(x_{t-1})})\) are considered. As stated before, we omit the index \(\ell \) for its discussion.
The copy constraints are crucial for all problems solved in the inner loop. In the forward pass, to ensure Lipschitz continuity, we consider regularized subproblems. The regularization is based on relaxing and penalizing the copy constraints. In the backward pass, to generate cuts, a special Lagrangian dual subproblem is solved based on dualizing the copy constraints. This is effective, since combined with a binary expansion of the state variables, the copy constraints yield a local convexification [55].
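Why dualizing the copy constraint yields valid bounds can be illustrated numerically (a toy sketch with an artificial finite stage problem, not the paper's subproblems): relaxing \(z = x\) with a multiplier \(\pi\) gives, by weak duality, a linear function of the state that underestimates the value function everywhere, i.e., a Lagrangian cut.

```python
# Minimal sketch of a Lagrangian cut obtained from a dualized copy
# constraint. The finite feasible set and the cost are artificial.

import itertools

Z = [0, 1, 2]  # possible values of the state copy z
U = [0, 1]     # local decision

def cost(z, u):
    # non-convex toy stage cost (illustrative)
    return (z - 1) ** 2 + 3 * u * (2 - z)

def Q(x):
    # true value function: copy constraint z = x enforced
    return min(cost(x, u) for u in U)

def lagrangian_cut(pi):
    # dualize z = x:  L(pi, x) = pi*x + min_{z,u} (cost(z,u) - pi*z)
    const = min(cost(z, u) - pi * z for z, u in itertools.product(Z, U))
    return lambda x: pi * x + const

# weak duality: every cut underestimates Q at every state value
for pi in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    cut = lagrangian_cut(pi)
    for x in Z:
        assert cut(x) <= Q(x) + 1e-9
```

In the paper these duals are solved in a lifted space with binary state variables, where the cuts become (sufficiently) tight; the sketch only shows the underlying weak-duality argument.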
Regularization
Lipschitz continuity of the value functions is difficult to ensure in the general non-convex case. However, as shown recently in [52], for l.sc. value functions, it is possible to determine some underestimating Lipschitz continuous function by enhancing the original subproblem with an appropriate penalty function \(\psi _t\). In contrast to the more general regularization approach in [52], we require only so-called sharp penalty functions \(\psi _t(x_{t-1}) = \Vert x_{t-1} \Vert \) to regularize the subproblems \((\varvec{\widehat{P}_t(x_{t-1})})\), for some norm \(\Vert \cdot \Vert \).
Definition 3.2
(Regularized subproblem and value function) Let \(\sigma _t > 0\) for \(t=2,\ldots ,T\), \(\sigma _1=0\) and define
\((\varvec{\widehat{P}_t^R})\) is called regularized subproblem and \(\widehat{Q}_t^R(\cdot )\) regularized value function.
By recursion, this approach yields the regularized optimal value \(\widehat{v}^R := \widehat{Q}^R_1(x_{0})\) for the first stage. Lemma 3.1 implies that under assumption (A2), the function \(\widehat{Q}_t(\cdot )\) is l.sc. Then, the regularized value function \(\widehat{Q}_t^R(\cdot )\) has the following properties.
Lemma 3.3
(Proposition 2 in [52]) For all \(t=1,\ldots ,T\) we have:

(a)
\(\widehat{Q}^R_t(x_{t-1}) \le \widehat{Q}_t(x_{t-1})\) for all \(x_{t-1} \in X_{t-1}\),

(b)
Under assumptions (A1) and (A2), the regularized value function \(\widehat{Q}^R_t(\cdot )\) is Lipschitz continuous on \(X_{t-1}\).
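Both properties can be illustrated numerically (a toy sketch with an artificial discontinuous value function on a grid, not the paper's subproblems): relaxing and penalizing the copy constraint amounts, on the value-function level, to \(\widehat{Q}^R(x) = \min_z \{\widehat{Q}(z) + \sigma \Vert x - z\Vert \}\), which underestimates \(\widehat{Q}\) and is \(\sigma\)-Lipschitz even if \(\widehat{Q}\) is discontinuous.

```python
# Toy illustration of sharp-penalty regularization:
# QR(x) = min_z { Q(z) + sigma*|x - z| } underestimates Q and is
# sigma-Lipschitz, even for a discontinuous (l.sc.) Q.
# Q, the grid and sigma are illustrative choices.

GRID = [k * 0.05 for k in range(41)]  # z-grid on [0, 2]
SIGMA = 3.0

def Q(z):
    # lower semicontinuous, discontinuous toy "value function"
    return 1.0 if z < 1.0 else 0.0

def QR(x):
    # inf-convolution with the sharp penalty, evaluated on the grid
    return min(Q(z) + SIGMA * abs(x - z) for z in GRID)

# (a) underestimation: QR <= Q on the grid
assert all(QR(x) <= Q(x) + 1e-9 for x in GRID)

# (b) Lipschitz continuity with constant sigma (checked on the grid)
assert all(abs(QR(a) - QR(b)) <= SIGMA * abs(a - b) + 1e-9
           for a, b in zip(GRID, GRID[1:]))
```

For large enough \(\sigma\) the minimum is attained at \(z = x\) wherever that is optimal, which is the intuition behind the exact-penalization result (Lemma 3.4) quoted below.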
As also stated in [52], using sharp penalty functions as in Definition 3.2, the penalization is exact for sufficiently large (but finite) \(\sigma _t > 0\). For such \(\sigma _t\), the problems \((\varvec{\widehat{P}})\) and \((\varvec{\widehat{P}^R})\) have the same optimal points and \(\widehat{v}^R = \widehat{v}\). This result goes back to [14], in which augmented Lagrangian problems are analyzed for MILPs. It is shown that using sharp penalty functions and a sufficiently large augmenting parameter, strong duality holds. As this result holds for any value of the dual multipliers, it is also valid for the regularized subproblems.
Lemma 3.4
(Proposition 8 in [14]) Using sharp penalty functions \(\psi _t\), there exists some \(\bar{\sigma }_t > 0\) such that the penalty reformulation in \((\varvec{\widehat{P}^R_t(x_{t-1})})\) is exact for all \(\sigma _t > \bar{\sigma }_t\).
Lemma 3.4 indicates that, using the regularized subproblems within our decomposition method NC-NBD, we obtain convergence to \(\widehat{v}\) in the inner loop. To exploit this, we make the following assumption:
 (A3):

All \(\sigma _t > 0\) are chosen sufficiently large such that Lemma 3.4 is satisfied.
If (A3) is not satisfied, \(\sigma _t\) has to be increased gradually in the course of the NC-NBD method to ensure convergence.
Forward pass
In the forward pass of the inner loop we solve approximations of the regularized subproblems \((\varvec{\widehat{P}_t^R(x_{t-1})})\).
For iteration i, the stage-\(t\) forward pass problem is defined as follows
for the trial state variable \(\widehat{x}_{t-1}^i\), with \(\widehat{x}_0^i \equiv 0\). Function \(\mathfrak {Q}^{i}_{t+1}(\cdot )\), in some sense, approximates the value functions \(\underline{\widehat{Q}}^{R,i}_{t+1}(\cdot , \mathfrak {Q}_{t+2}^{i})\) and \(\underline{\widehat{Q}}^{i}_{t+1}(\cdot , \mathfrak {Q}_{t+2}^{i})\). This approximation is constructed in the backward pass, see Sect. 3.3.4. As those value functions are non-convex, the cut approximation \(\mathfrak {Q}_{t+1}^{i}(\cdot )\) is required to be nonlinear and non-convex. However, as we show later, it can be expressed with mixed-integer linear constraints by lifting the problems to a higher dimension. Therefore, in addition to \(x_t, y_t\) and \(z_t\), the forward pass problem contains further decision variables, which are hidden in \(\mathfrak {Q}_{t+1}^{i}(\cdot )\) and the piecewise linear relaxations \(\widehat{f}_t, \widehat{g}_t\) and \(\widehat{h}_t\).
Note that since \(\mathfrak {Q}_{t+1}^i(\cdot )\) is expressed by mixed-integer linear constraints with bounded integer variables, the same reasoning as in Lemma 3.1 can be applied to show that \(\underline{\widehat{Q}}^{i}_t(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i})\) is l.s.c. and, therefore, that \(\underline{\widehat{Q}}^{R,i}_t(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i})\) is Lipschitz continuous.
Even with a mixed-integer linear representation of \(\mathfrak {Q}_{t+1}^i(\cdot )\), subproblem \((\varvec{\widehat{P}_t^{R,i}(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i})})\) is a MINLP due to the regularization. However, for \(\Vert \cdot \Vert _1\) or \(\Vert \cdot \Vert _\infty \), it can be modeled by MILP constraints using standard reformulation techniques for absolute values.
The optimal point \((\widehat{z}_t^i, \widehat{x}_t^i, \widehat{y}_t^i)\) of each subproblem \((\varvec{\widehat{P}_t^{R,i}(\widehat{x}_{t1}^i)})\) is stored and \(\widehat{x}_t^i\) is passed to the following stage. Since \(\big ( (\widehat{z}_t^i, \widehat{x}_t^i, \widehat{y}_t^i) \big )_{t=1,\ldots ,T}\) satisfies all constraints of \((\varvec{\widehat{P}^R})\), after all stages have been considered, an upper bound \(\overline{\widehat{v}}\) on the optimal value \(\widehat{v}^R\) of the regularized problem can be determined by
With assumption (A3) and Lemma 3.4, this is also an upper bound to \(\widehat{v}\).
Backward pass–Part 1: binary approximation
The aim of the backward pass of an inner loop iteration i is twofold: Firstly, a lower bound \(\underline{\widehat{v}}^i\) on \(\widehat{v}\) is determined. Secondly, cuts for \(Q_{t}(\cdot )\) are derived to improve and update the current approximation \(\mathfrak {Q}_{t}^{i}(\cdot )\).
As mentioned before, we use a dynamically refined binary approximation of the state variables and then apply cutting-plane techniques from the SDDiP algorithm [55]. This approximation is based on static binary expansion [21].
Binary expansion can be applied componentwise to some vector \(x_t\). Some integer component \(x_{tj} \in \left\{ 0, ..., U_j \right\} \) can be exactly and uniquely expressed as
with variables \(\lambda _{tkj} \in \left\{ 0,1 \right\} \) and \({K_{tj} = \lfloor \log _2 U_j \rfloor + 1}\). Some continuous component \(x_{tj} \in [0, U_j]\) can be expressed by discretizing the interval with precision \(\beta _{t j} \in (0,1)\). We then have
with \(K_{tj} = \lfloor \log _2 \left( \frac{U_j}{\beta _{tj}} \right) \rfloor + 1\) and some error \(r_{tj} \in \left[ -\frac{\beta _{t j}}{2}, \frac{\beta _{t j}}{2} \right] \).
For vector \(x_t\), this yields \(K_t = \sum _{j=1}^{n_t} K_{tj}\) binary variables in total. Defining an \((n_t \times K_t)\)-matrix \(B_t\) containing all the coefficients of the binary expansion and collecting all binary variables in one large vector \(\lambda _{t} \in {\mathbb {B}}^{K_t}\), the binary expansion can then be written compactly as \(x_t = B_t \lambda _{t} + r_t\).
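To make the expansion concrete, the following is a minimal Python sketch (our own illustration, not part of the NCNBD implementation) of the componentwise expansion of a continuous component with upper bound \(U_j\) and precision \(\beta_{tj}\); the helper name and the nearest-multiple rounding strategy are our own choices.

```python
import math

def binary_expansion(x, U, beta):
    """Approximate a continuous component x in [0, U] with precision beta,
    i.e., x = sum_k beta * 2^(k-1) * lambda_k + r with |r| <= beta / 2."""
    K = math.floor(math.log2(U / beta)) + 1            # number of binary variables
    weights = [beta * 2 ** (k - 1) for k in range(1, K + 1)]
    # round x to the nearest representable multiple of beta, then
    # extract the binary digits greedily from the largest weight down
    rem = round(x / beta) * beta
    lam = []
    for w in reversed(weights):
        if rem >= w - 1e-12:
            lam.append(1)
            rem -= w
        else:
            lam.append(0)
    lam.reverse()                                      # align with ascending weights
    approx = sum(w * l for w, l in zip(weights, lam))  # B * lambda
    return lam, weights, approx, x - approx            # residual r = x - B * lambda
```

The maximal weight \(\beta_{tj} 2^{K_{tj}-1}\) never exceeds \(U_j\), which is the observation behind the precision-independent bound on \(\Vert B_{t1} \Vert_1\) in Remark 3.9.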
Based on this definition, to generate cuts, for each stage t and iteration i, a binary approximation of \(\widehat{x}_{t1}^i\) is used, i.e., it is replaced by \(B_{t1} \lambda _{t1}^i\). Note that the approximation is not necessarily exact for continuous components of \(\widehat{x}_{t1}^i\). Therefore, the cuts are not necessarily constructed at the trial point \(\widehat{x}_{t1}^i\) but at the deviating anchor point \(\widehat{x}_{{\mathbb {B}}, t1}^i := B_{t1} \lambda _{t1}^i\).
In the backward pass, we start from the following subproblem, where due to the binary approximation of the state variables, we also adapt the copy constraint to \(\lambda _{t1}^i = \mathfrak {z}_t\) with variables \(\mathfrak {z}_t \in [0,1]^{K_{t1}}\).
Remark 3.5
Subproblem \((\varvec{\widehat{P}_{{\mathbb {B}}t}^{i}(\lambda _{t1}^i, \mathfrak {Q}_{t+1}^{i+1})})\) is equivalent to subproblem \((\varvec{\widehat{P}_{t}^{i}(\widehat{x}_{{\mathbb {B}}, t1}^i, \mathfrak {Q}_{t+1}^{i+1})})\) because \(z_t = B_{t1} \mathfrak {z}_t = B_{t1} \lambda _{t1}^i = \widehat{x}_{{\mathbb {B}}, t1}^i\).
Asymptotically, i.e., for an infinitely fine binary approximation, the anchor point converges to the actual trial point.
Lemma 3.6
We have \(\lim _{\beta _{t1} \rightarrow 0} \widehat{x}_{{\mathbb {B}}t1}^i = \widehat{x}_{t1}^i\).
With Lemma 3.6, asymptotically, the cuts are constructed at \(\widehat{x}_{t1}^i\). While this is not directly useful in practice, since it requires an infinite number of binary variables, it also implies that for componentwise sufficiently small \(\beta _{t1} \in (0,1)\), the cuts are constructed very close to \(\widehat{x}_{t1}^i\). As NCNBD constructs Lipschitz continuous cuts, this guarantees a sufficiently good approximation of the value function at \(\widehat{x}_{t1}^i\), as we show in Sect. 4.
Importantly, in our framework the binary approximation is only applied temporarily to derive cuts, while the state variables \(x_{t1}\) in the forward pass remain continuous. In other words, the anchor points determine where cuts can be constructed, but do not limit where they can be evaluated. This is a crucial difference from applying a static binary expansion, as suggested in the original SDDiP work to solve MILPs with continuous state variables [55].
Moreover, let us emphasize again that applying such static approximation is not appropriate in our inner loop, as the obtained lower bounds are not guaranteed to be valid for \(\widehat{v}\) or v. Similarly, the obtained cuts are not guaranteed to be valid for \(\widehat{Q}_t(\cdot )\) or \(Q_t(\cdot )\), and therefore cannot be reused within the outer loop. Our proposed inner loop method does not share these issues. We follow a dynamic approach where the binary precision is dynamically refined if required and, as we show later, all cuts remain valid with later refinements.
Backward pass–Part 2: cut generation
As proposed in [55], the copy constraint is dualized to generate cuts. Applied to our context, the following Lagrangian dual subproblem has to be solved
where \(\mathcal {L}_{{\mathbb {B}}t}^i(\cdot )\) denotes the Lagrangian function for \(\pi _t\) defined by
and \(\Vert \cdot \Vert _*\) denotes the dual norm to the norm used in the regularized forward pass problems \((\varvec{\widehat{P}_t^{R,i}(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i})})\).
A linear (optimality) cut in binary space \(\{0,1\}^{K_{t1}}\) is then given by
where \(\pi _t^i\) is an optimal solution of the Lagrangian dual subproblem \((\varvec{D_{{\mathbb {B}}t}^i(\lambda _{t1}^i,\mathfrak {Q}_{t+1}^{i+1})})\). Those Lagrangian cuts are introduced in [55] and shown to be finite, valid and tight in the SDDiP setting. In our setting, we obtain the following validity result.
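To illustrate the dual step, the following hedged Python sketch maximizes a toy Lagrangian dual by projected subgradient ascent for a single scalar copy constraint \(z = \widehat{\lambda}\); the finite candidate set, the step rule and the multiplier bound are our own simplifications of the actual subproblem, not the paper's implementation.

```python
def lagrangian_cut(Z, f, anchor, bound=10.0, steps=200):
    """Toy Lagrangian dual max_pi [pi*anchor + min_{z in Z}(f(z) - pi*z)]
    with the multiplier pi restricted to [-bound, bound]."""
    pi, best_pi, best_val = 0.0, 0.0, float("-inf")
    for i in range(1, steps + 1):
        # inner minimization of the Lagrangian for the current multiplier
        z_star = min(Z, key=lambda z: f(z) - pi * z)
        val = pi * anchor + f(z_star) - pi * z_star
        if val > best_val:
            best_val, best_pi = val, pi
        g = anchor - z_star                           # subgradient of the dual
        pi = max(-bound, min(bound, pi + g / i))      # projected ascent step
    # resulting cut: phi(lmb) = intercept + best_pi * lmb
    z_star = min(Z, key=lambda z: f(z) - best_pi * z)
    intercept = f(z_star) - best_pi * z_star
    return best_pi, intercept
```

By weak duality, the resulting affine function underestimates the value function at every point, regardless of how well the dual was solved, mirroring the validity result below.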
Lemma 3.7
Let \(\widehat{Q}_{{\mathbb {B}}t}(\cdot )\) denote the MILP value function of stage t with additional binary approximations. Then,

(a)
for all \(\lambda _{t1} \in [0,1]^{K_{t1}}\)
$$\begin{aligned} \widehat{Q}_{{\mathbb {B}}t}(\lambda _{t1}) \ge \phi _{{\mathbb {B}}t}(\lambda _{t1}), \end{aligned}$$ 
(b)
for all \(x_{t1}\)
$$\begin{aligned} \widehat{Q}_{t}(x_{t1}) \ge \phi _{{\mathbb {B}}t}(\lambda _{t1}) \end{aligned}$$for any \(\lambda _{t1} \in [0,1]^{K_{t1}}\), such that \(x_{t1} = B_{t1} \lambda _{t1}\).
Lemma 3.7 a) follows directly from the validity proof for the SDDiP cuts, which also holds for \(\lambda _{t1} \in [ 0, 1 ]^{K_{t1}}\) instead of \(\lambda _{t1} \in \left\{ 0, 1 \right\} ^{K_{t1}}\) (see Theorem 3 in [55]). Part b) then follows using arguments similar to those in Remark 3.5. Hence, \(\phi _{{\mathbb {B}}t}\) is, in fact, a valid cut in \([0,1]^{K_{t1}}\). This enables us to obtain valid underapproximations also for those points which are not exactly approximated by the current binary expansion. Since the MILP is an outer approximation, \(\widehat{Q}_t(\cdot )\) underestimates the original MINLP value function \(Q_t(\cdot )\). Thus, the obtained cuts are valid for \(Q_t(\cdot )\) as well.
Contrary to [55], but following [52], we bound the dual variable \(\pi _t\) in the Lagrangian dual subproblem. Therefore, tightness for \(\underline{\widehat{Q}}^i_{{\mathbb {B}}t}(\cdot , \mathfrak {Q}_{t+1}^{i+1})\) is not guaranteed. However, the cuts are at least guaranteed to overestimate the value function \(\underline{\widehat{Q}}^{R,i}_{{\mathbb {B}}t}(\cdot , \mathfrak {Q}_{t+1}^{i+1})\) at \(\lambda _{t1}^i\). This value function is obtained by regularizing \(\underline{\widehat{Q}}^i_{{\mathbb {B}}t}(\cdot , \mathfrak {Q}_{t+1}^{i+1})\) in the binary space using the same norm as in the forward pass problem. With a careful choice of the regularization factor, the regularized value function \(\underline{\widehat{Q}}^{R,i}_{t}(\cdot , \mathfrak {Q}_{t+1}^{i+1})\) in the original space is then also overestimated at \(\widehat{x}_{{\mathbb {B}}, t1}^i\). This result is formalized in the following lemma.
Lemma 3.8
Assume that we use \(\Vert \cdot \Vert _1\) for regularization and its dual norm \(\Vert \cdot \Vert _\infty \) for bounding the dual multipliers. Then, as long as \(l_t \ge \sigma _t \Vert B_{t1} \Vert \), where the latter denotes the induced matrix norm of \(B_{t1}\), we have
Proof
See Appendix A. \(\square \)
Remark 3.9
The induced matrix norm \(\Vert B_{t1} \Vert \) depends on the chosen precision of the binary approximation. It can be bounded from above independently of the precision, e.g., \(\Vert B_{t1} \Vert _1 \le U_{t1, \max }\) with \(U_{t1, \max }\) the largest component of the upper bounds in \(X_{t1}\).
Backward pass–Part 3: cut projection
Solving the forward pass problems \((\varvec{\widehat{P}_t^{R,i}(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i})})\) and the backward pass dual problems \((\varvec{D_{{\mathbb {B}}t}^{i}(\lambda _{t1}^i, \mathfrak {Q}_{t+1}^{i+1})})\) requires expressing the cut approximation \(\mathfrak {Q}_{t+1}^i(\cdot )\) in the original state variables \(x_t\). Recall that the computed cut \(\phi _{{\mathbb {B}},t+1}(\cdot )\) is a function on \([ 0,1 ]^{K_{t}}\).
According to Lemma 3.7 a), the obtained cuts \(\phi _{{\mathbb {B}}, t+1} (\cdot )\) are not only valid for all binary points, but for all values in \([0,1]^{K_{t}}\). However, allowing \(\lambda _t \in [0,1]^{K_t}\) in the binary approximation, there exist infinitely many choices of \(\lambda _t\) that exactly describe some point \(x_t \in X_t\). Therefore, following from Lemma 3.7 b), one cut in binary space entails infinitely many underestimators of \(Q_{t+1}(\cdot )\) at \(x_{t}\) in the original space \(X_t\). Including infinitely many inequalities in \(\mathfrak {Q}_{t+1}(\cdot )\) is computationally infeasible. Instead, we consider the pointwise maximum of the projection of the cuts to \(X_t\). That way, only the best underestimation for each point \(x_t\) is taken into account. In doing so, we obtain a nonlinear, more precisely piecewise linear, cut in the original state space. For simplicity, in the following, by cut projection we always mean the pointwise maximum of the actual projection.
The projection of some cut \(\phi _{{\mathbb {B}}, t+1}(\cdot )\) to \(X_t\) can be described as the value function
of a linear program where e denotes a vector of ones of dimension \(K_t\). The dual problem to (2) yields
Note that the dual feasible region does not depend on \(x_t\) and has a finite number of extreme points. Therefore, the cut projection is piecewise linear and concave.
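For intuition, the projection (2) can be computed explicitly in small cases. The following Python sketch is our own illustration under the simplifying assumption of a one-dimensional state, so that \(B_t\) reduces to a single row of weights w: it enumerates the vertices of \(\{\lambda \in [0,1]^K : w^\top \lambda = x\}\), at which the linear objective attains its maximum; with a single equality constraint, each vertex has at most one fractional entry.

```python
from itertools import combinations

def project_cut(c, pi, w, x, tol=1e-9):
    """Pointwise maximum of the binary-space cut phi(lmb) = c + pi^T lmb
    over {lmb in [0,1]^K : w^T lmb = x}, by vertex enumeration."""
    K = len(w)
    best = None
    for r in range(K + 1):
        for S in combinations(range(K), r):        # coordinates fixed to 1
            rem = x - sum(w[k] for k in S)
            if abs(rem) <= tol:                    # all-integer vertex
                val = c + sum(pi[k] for k in S)
                best = val if best is None else max(best, val)
            elif rem > 0:
                for j in range(K):                 # one fractional coordinate
                    if j not in S and rem <= w[j] + tol:
                        val = c + sum(pi[k] for k in S) + pi[j] * (rem / w[j])
                        best = val if best is None else max(best, val)
    return best
```

The exponential enumeration is only for illustration; in NCNBD the projection is never computed explicitly, but encoded via the KKT conditions of (2) and (3), see Part 3 of the backward pass.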
As problem (2) is feasible and bounded for any \(x_t \in X_t\), this also holds for the dual problem (3). Therefore, in a dual optimal solution, \(\eta _t\) and \(\mu _t\) are bounded. Note that this bound may change with the binary approximation precision \(\beta _t\), though, and that, if we were to generate tight cuts for \(\underline{\widehat{Q}}_{t+1}^i(\cdot , \mathfrak {Q}_{t+2}^{i+1})\), those cuts could become infinitely steep close to discontinuities. However, as we can bound \(\pi _t\) in the Lagrangian dual subproblem independently of \(\beta _t\), see Remark 3.9, and thus construct cuts which at least overestimate the regularized value function \(\underline{\widehat{Q}}^{R,i}_{t+1}(\cdot , \mathfrak {Q}_{t+2}^{i+1})\) at the anchor point \(x_{{\mathbb {B}}, t}^i\), such cases should be ruled out.
We formalize this by assuming the existence of a global bound for \(\eta _t\).
 (A4).:

There exists some \(\rho _t > 0\), such that for all \(t=1,\ldots ,T\), any binary precision \(\beta _t\) and any \(x_t\), the optimal dual variable \(\eta _t\) in problem (3) can be bounded, i.e., \(\Vert \eta _t \Vert \le \rho _t\).
For example, if we obtain cuts which are, in fact, tight for \(\underline{\widehat{Q}}^{R,i}_{t+1}(\cdot , \mathfrak {Q}_{t+2}^{i+1})\) at \(x_{{\mathbb {B}}, t}^i\) and consider only basic solutions in the Lagrangian dual, the gradient of the cuts is bounded by \(\sigma _{t+1}\). With Assumption (A4) it follows that the linear cuts \(\phi _{{\mathbb {B}}, t+1}(\cdot )\) derived in the binary space yield a nonlinear, but Lipschitz continuous projection \(\phi _{t+1}(\cdot )\) in the original space.
To express this projection by mixedinteger linear constraints, we use the KKT conditions to problems (2) and (3). To emphasize that these conditions are considered for the projection of one specific cut r (the index denoting the rth cut constructed), we index all occurring variables and coefficients by r.
The complementary slackness constraints (9) and (10) are nonlinear, but can be expressed linearly componentwise using a big-\(\mathcal {M}\) formulation (alternatively, SOS1 constraints may be used):
For all components k, we can choose \(\mathcal {M}_{1k} = 1\) and \(\mathcal {M}_{3k} = 1\) due to \(\lambda _{tk} \in [0,1]\). Moreover, using (A4), we are able to obtain explicit choices for \(\mathcal {M}_{2k}\) and \(\mathcal {M}_{4k}\) as well.
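The paper's constraints (9)–(10) are not reproduced here, but the underlying linearization pattern is standard: a complementarity condition \(u \cdot v = 0\) with \(0 \le u \le \mathcal {M}_u\) and \(0 \le v \le \mathcal {M}_v\) is replaced by \(u \le \mathcal {M}_u w\) and \(v \le \mathcal {M}_v (1-w)\) with a binary switch w. A minimal Python check of this equivalence (our own illustration, with generic names):

```python
def comp_feasible(u, v, Mu, Mv):
    """True iff (u, v) satisfies the big-M linearization of u * v = 0,
    i.e., u <= Mu * w and v <= Mv * (1 - w) for some binary w."""
    return any(u <= Mu * w + 1e-9 and v <= Mv * (1 - w) + 1e-9 for w in (0, 1))
```

With valid bounds \(\mathcal {M}_u, \mathcal {M}_v\), the linearization admits exactly the pairs with \(u = 0\) or \(v = 0\), which is why the explicit bounds from Lemma 3.10 matter.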
Lemma 3.10
Under (A4), there exist explicit, finite bounds for \(\nu _{tk}^r\) and \(\mu _{tk}^r\).
Proof
See Appendix B. \(\square \)
The cut approximation \(\mathfrak {Q}_{t+1}^{i+1}(\cdot )\) is then defined as the maximum of all cuts \(\phi _{{\mathbb {B}}, t+1}^r = c_{t+1}^r + (\pi _{t+1}^r)^\top \lambda _t^r\) where the variable \(\lambda _t^r\) satisfies the linearized KKT conditions (4)–(8) and (11)–(12) for the rth cut. With Assumption (A4), it is Lipschitz continuous.
Lemma 3.11
The cut approximation \(\mathfrak {Q}_{t+1}(\cdot )\) is Lipschitz continuous in \(X_t\) with Lipschitz constant \(\rho _t\).
The cut projection requires introducing the variables \(\lambda _t^r, \nu _t^r, \mu _t^r, w_t^r, u_t^r, \eta _t^r\) and constraints (4)–(8) and (11)–(12) for each cut r. In particular, each cut is associated with a variable \(\lambda _t^r \in [0,1]^{K_t^r}\) where \(K_t^r\) corresponds to the number of binary variables at the time of the cut’s generation. This increases the problem size considerably, as the number of variables and constraints to be added per cut is in \(\mathcal {O} \big (n_t \log \big ( \frac{1}{\beta _t} \big ) \big )\). In return, it ensures that cuts do not have to be generated from scratch after each refinement.
Stopping and refining
At the end of the backward pass, a lower bound \(\underline{\widehat{v}}^{i}\) is determined by solving the first-stage subproblem \((\varvec{\widehat{P}_1^{i}(0, \mathfrak {Q}_{2}^{i+1})})\). Here, no Lagrangian dual is solved, since no cuts have to be derived. The lower bound is nondecreasing because the cut approximation is only ever improved.
If the updated bounds are sufficiently close to each other, i.e., if \(\overline{\widehat{v}}^{i} - \underline{\widehat{v}}^{i} \le \widehat{\varepsilon }\) for some predefined tolerance \(\widehat{\varepsilon } > 0\), an approximately optimal point of problem \((\varvec{\widehat{P}})\) has been determined. We show in the following section that this is the case after finitely many iterations i.
If the gap between the bounds does not yet meet the stopping criterion, two cases are possible: In the first case, the algorithm has not yet determined the best possible approximation for the given binary approximation precision. New cuts have been determined in iteration i such that the lower bound \(\underline{\widehat{v}}^i\) has been updated, and the forward solution will change in iteration \(i+1\) as the previous one is cut away.
In the second case, despite not meeting the stopping criterion, the forward solution does not change at the beginning of iteration \(i+1\). This case is related to the binary approximation. It can occur if the binary approximation is too coarse and therefore, for all t, the determined cuts at \(\widehat{x}_{{\mathbb {B}}t}^i\) do not improve the approximation at \(\widehat{x}_t^i\). Moreover, it can occur if in subsequent iterations the same cuts are constructed, since \(\widehat{x}_{{\mathbb {B}}, t1}^i = \widehat{x}_{{\mathbb {B}}, t1}^{i+1}\). Finally, it can also occur if all possible cuts have been generated: For a fixed binary approximation, there exist only finitely many points \(\widehat{x}_{{\mathbb {B}}t}\). If we restrict the Lagrangian dual subproblem to basic solutions, then only finitely many different cuts can be determined [55].
In the second case, at the beginning of the backward pass of iteration i, the binary approximation is refined. The refinement is computed by increasing \(K_{tj}\) by +1 for all components j and all stages t with
For simplicity, in Algorithm 1 we refine all stages and components equally by one. Note that each refinement requires the introduction of an additional vector \(\lambda _t\), as described in the previous subsection.
As all previously generated cuts have been projected to the original space \(X_t\), they remain valid and do not have to be recomputed when the binary approximation is refined. This is computationally important.
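The stopping and refinement logic described above can be summarized schematically. The following Python sketch is our own simplified rendering (not the actual Algorithm 1): it contains only the gap test, the stall test and the uniform increment of the \(K_{tj}\).

```python
def inner_loop_step(lower, upper, trial, prev_trial, K, eps_hat):
    """One decision step of the inner loop (schematic).
    lower/upper: current bounds, trial/prev_trial: forward solutions,
    K: list of binary expansion sizes K_tj, eps_hat: inner tolerance."""
    if upper - lower <= eps_hat:
        return "stop", K                       # approximately optimal for the MILP
    if prev_trial is not None and trial == prev_trial:
        return "refine", [k + 1 for k in K]    # forward solution stalled: K_tj += 1
    return "iterate", K                        # bounds still improving: next pass
```

In the actual method, the refinement is only applied to the stages and components satisfying the (omitted) condition below, and each refinement introduces an additional vector \(\lambda _t\) as described above.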
The outer loop
The outer loop problem
Once the inner loop is left, we set \(\underline{\widehat{v}}^{\ell } := \underline{\widehat{v}}^{\ell i}, \overline{\widehat{v}}^{\ell } := \overline{\widehat{v}}^{\ell i}\) and \(\mathfrak {Q}_t^{\ell }(\cdot ) := \mathfrak {Q}_t^{\ell i}(\cdot )\) for all \(t=2,...,T\). Note that \(\overline{\widehat{v}}^{\ell }\) is not guaranteed to be a valid upper bound for v because \(\widehat{v}^\ell \le v\). Moreover, we set \(\big (( \widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell } ) \big )_{t = 1,...,T} := \big (( \widehat{z}_t^{\ell i}, \widehat{x}_t^{\ell i}, \widehat{y}_t^{\ell i} ) \big )_{t = 1,...,T}\).
To approximate the optimal value v of \((\varvec{P})\), we solve subproblems
in a forward manner for \(t=1, \ldots , T\) with \(x_0^{\ell } \equiv 0\) and \(x_{t}^\ell := x_t\), where \(x_t\) is an optimal solution of \((\varvec{P_t^{\ell }(x_{t1}^{\ell }, \mathfrak {Q}_{t+1}^{\ell })})\) for stage t. Here, we exploit that the cut approximation \(\mathfrak {Q}_t^{\ell }(\cdot )\), constructed in the inner loop, is by design valid for \(Q_t(\cdot )\) as well. By solving these subproblems, we obtain a feasible solution \(\big ( (z_t^{\ell }, x_t^{\ell }, y_t^{\ell }) \big )_{t=1,\ldots ,T}\) for \((\varvec{P})\) and can determine a valid upper bound for v as \(\overline{v}^{\ell } = \min \left\{ \overline{v}^{\ell -1}, \sum _{t=1}^T f_t(x_t^{\ell }, y_t^{\ell }) \right\} \).
The subproblems \((\varvec{P_t^{\ell }(x_{t1}^{\ell }, \mathfrak {Q}_{t+1}^{\ell })})\) are nonconvex MINLP problems. This means that in order to solve the original nonconvex problem \((\varvec{P})\), easier, but still nonconvex, subproblems have to be solved to optimality for each stage t in each outer loop iteration \(\ell \). This may be a hard challenge in itself. We make the following assumption for the remainder of this article:
 (A5).:

An oracle exists that is able to solve subproblems \((\varvec{P_t^{\ell }(x_{t1}^{\ell }, \mathfrak {Q}_{t+1}^{\ell })})\) to global optimality.
If no such global optimization algorithm is available, one can solve appropriate inner approximations of \((\varvec{P_t^{\ell }(x_{t1}^{\ell }, \mathfrak {Q}_{t+1}^{\ell })})\), which are improved in the course of the algorithm.
If \(\overline{v}^{\ell } - \underline{\widehat{v}}^{\ell } \le \varepsilon \), then NCNBD terminates and \(\big ( (z_t^{\ell }, x_t^{\ell }, y_t^{\ell }) \big )_{t=1,\ldots ,T}\) is an \(\varepsilon \)-optimal solution for \((\varvec{P})\). Otherwise, the cut approximations \(\mathfrak {Q}_{t+1}^{\ell }(\cdot )\) are not sufficiently good underestimators of the true value functions, even though they give a good approximation of \(\widehat{Q}_t^{\ell }(\cdot )\). This implies that the piecewise linear relaxations have to be improved. Instead of refining them everywhere, they are refined dynamically where it is promising, i.e., in a neighborhood of the approximate optimal solution \(\big ((\widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell }) \big )_{t=1,\ldots ,T}\) of \((\varvec{\widehat{P}^{\ell }})\). By refining the piecewise linear relaxations in this neighborhood, the current solution can be excluded in the next inner loop and the lower bound \(\underline{\widehat{v}}^{\ell }\) improves.
Remark 3.12
Instead of \(\underline{\widehat{v}}^{\ell }\), an even better lower bound for v is given by the optimal value of the first stage subproblem \((\varvec{P_1^{\ell }(x_0^{\ell }, \mathfrak {Q}_2^{\ell })})\).
Refining the piecewise linear relaxations
The refinement consists of two steps: (1) the piecewise linear approximations are refined and (2) the corresponding MILP \((\varvec{\widehat{P}^{\ell }})\) is updated – in such a way that the new MILP \((\varvec{\widehat{P}^{\ell +1}})\) again yields a relaxation of \((\varvec{P})\).
Different strategies are possible to achieve this. For a thorough overview, we refer to [18]. In the following, we make use of a specific adaptive refinement scheme for triangulations from [9] for any nonlinear function \(\gamma _t \in \varGamma _t\). The given piecewise linear approximation at iteration \(\ell \) is defined by a triangulation \(\mathcal {T}\) of \(X_{t1} \times X_t \times Y_t\) (or a subspace) and the corresponding function values of \(\gamma _t\). Instead of refining this triangulation everywhere now, the main idea is to only refine it in a neighborhood of \(\big ((\widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell }) \big )_{t=1,\ldots ,T}\). Therefore, first, the simplex in \(\mathcal {T}\) containing this point is identified. It is then divided by a longest-edge bisection, yielding a refined triangulation, for which a new MILP model can be set up. As proven in [9], this refinement strategy has some favorable properties with respect to convergence, see Sect. 4.2.
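A minimal Python sketch of the longest-edge bisection step (our own illustration; the triangulation bookkeeping and the MILP update are omitted): the simplex containing the incumbent point is split at the midpoint of its longest edge.

```python
import math

def bisect_longest_edge(simplex):
    """Split a simplex (list of vertex tuples) at the midpoint of its
    longest edge, returning the two child simplices."""
    n = len(simplex)
    # find the pair of vertices with maximal distance (the longest edge)
    i, j = max(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda e: math.dist(simplex[e[0]], simplex[e[1]]))
    mid = tuple((p + q) / 2 for p, q in zip(simplex[i], simplex[j]))
    # each child replaces one endpoint of the longest edge by the midpoint
    child1 = [mid if k == i else v for k, v in enumerate(simplex)]
    child2 = [mid if k == j else v for k, v in enumerate(simplex)]
    return child1, child2
```

Repeatedly applying this step shrinks the diameters of all simplices that are refined often enough, which is the property behind the convergence results from [9] used in Sect. 4.2.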
It is important that the obtained relaxation \((\varvec{\widehat{P}^{\ell +1}})\) is tighter than \((\varvec{\widehat{P}^{\ell }})\) so that the corresponding value functions improve monotonically. This is required to ensure that previously generated cuts remain valid in later iterations. For concave functions, this is always satisfied using the presented refinement strategy. For other functions, e.g., convex ones, a more careful determination of the relaxation is required or the MILP models for earlier relaxations have to be kept instead of being replaced. For our theoretical results, it is sufficient that such monotonically improving relaxations can always be determined.
After refining the piecewise linear relaxations, a new iteration \(\ell +1\) is started, beginning with the inner loop.
Convergence results
In this section, we prove the convergence of the NCNBD algorithm. We start by proving the convergence of the inner loop to an optimal solution of \((\varvec{\widehat{P}^{\ell }})\), based on some results on the binary refinements. Afterwards, we prove that the outer loop converges to an optimal solution of the original problem \((\varvec{P})\).
Convergence of the inner loop
As explained in Sect. 3.3.3, within NCNBD the cuts are not generated at the trial points \(\widehat{x}_{t1}^i\), but instead at anchor points \(\widehat{x}_{{\mathbb {B}}, t1}^i := B_{t1} \lambda _{t1}^i\). This means that the generated cuts, and with that also the cut approximations \(\mathfrak {Q}_t(\cdot )\), implicitly depend on the binary approximation precision \(\beta _t\).
However, Lemma 3.6 implies that \(\widehat{x}_{t1}^i\) and \(\widehat{x}_{{\mathbb {B}}, t1}^i\) become equal asymptotically as the binary approximation is refined. Therefore, asymptotically, the cuts are guaranteed to overestimate \(\underline{\widehat{Q}}^{R,i}_{t}(\widehat{x}_{t1}^i, \mathfrak {Q}_{t+1}^{i+1})\) and, due to their Lipschitz continuity, for some sufficiently small precision, they are at least \(\varepsilon _{{\mathbb {B}}t}\)-close. This, in turn, leads to convergence of the inner loop, as we formalize and prove below.
Prior to this, let us introduce two useful ideas. Firstly, using the Lipschitz continuity results from Lemma 3.3, page 12 and Lemma 3.11, we can bound the cut approximation error in \(\widehat{x}_{t1}^i\) as follows:
Lemma 4.1
With Assumption (A4), for any iteration i and stage t it follows
Proof
See Appendix C. \(\square \)
Secondly, for any stage t and any fixed binary approximation, if we restrict to basic solutions in the Lagrangian duals, only finitely many different realizations of cut approximations \(\mathfrak {Q}_t(\cdot )\) can be generated. Thus, after a finite number of iterations, the binary approximation is refined. Assuming that the inner loop does not terminate for \(\widehat{\varepsilon } = 0\), we can then observe infinitely many such refinements. Hence, indexing the refinements by j, we also get \(\beta _t \rightarrow 0\) for all \(t=1,\ldots ,T\) as \(j \rightarrow \infty \).
Now, we address convergence of the inner loop of NCNBD to an optimal solution of \((\varvec{\widehat{P}})\). First, we provide a preliminary result, which can be proven by backward induction using Lemmas 3.11 and 4.1.
Lemma 4.2
Suppose that the inner loop does not terminate for \(\widehat{\varepsilon } = 0\). Then, the infinite sequence of forward pass trial solutions \(( \widehat{x}^i )_{i \in {\mathbb {N}}}\) possesses some cluster point \(\widehat{x}^*\) with a corresponding convergent subsequence \(( \widehat{x}^{i_j} )_{j \in {\mathbb {N}}}\). This subsequence satisfies
Proof
See Appendix D. \(\square \)
Using this result, convergence can be proven.
Theorem 4.3
Suppose that the inner loop does not terminate for \(\widehat{\varepsilon } = 0\). Then, the sequence \(( \underline{\widehat{v}}^i )_{i \in {\mathbb {N}}}\) of lower bounds determined by the algorithm converges to \(\widehat{v}\) and every cluster point of the sequence of feasible forward pass solutions generated by the inner loop is an optimal solution of \((\varvec{\widehat{P}})\).
Note that with a similar argument it can be shown that the inner loop terminates as soon as \(\mathfrak {Q}_t^i(\widehat{x}_{t1}^i) \ge \widehat{Q}_t^R(\widehat{x}_{t1}^{i})\) for all \(t=2,...,T\).
Considering that the inner loop is integrated into an outer loop improving the MILP approximations of \((\varvec{P})\), infinite convergence is not directly useful. Moreover, infinitely many binary refinements are not computationally feasible. However, we can deduce that an approximately optimal solution of \((\varvec{\widehat{P}})\) is determined in a finite number of iterations.
Corollary 4.4
For any stopping tolerance \(\widehat{\varepsilon } > 0\), the inner loop stops in a finite number of iterations with an \(\widehat{\varepsilon }\)-optimal solution of \((\varvec{\widehat{P}})\).
Convergence of the outer loop
We start our convergence analysis of the outer loop with a feasibility result for the solutions determined in the inner loop, which follows from the convergence results in [9]. The main idea is that, as the domain is bounded for all functions \(\gamma \in \varGamma \), using a longest-edge bisection, after a finite number of steps, all considered simplices become sufficiently small (since in the worst case all simplices have been refined sufficiently often).
Lemma 4.5
([9]) Using longest-edge bisection for the piecewise linear relaxation refinements within NCNBD yields optimal solutions \(\big ((\widehat{z}_t^{\ell }, \widehat{x}_t^{\ell }, \widehat{y}_t^{\ell }) \big )_{t=1,\ldots ,T}\) for \((\varvec{\widehat{P}^{\ell }})\) in the inner loop, which

(a)
are approximately feasible for \((\varvec{P})\) after a finite number of steps \(\ell \),

(b)
become feasible for \((\varvec{P})\) asymptotically in the number of refinements \(\ell \).
Next, we show that by decreasing the feasibility error, the deviation in the optimal value between \((\varvec{\widehat{P}^{\ell }})\) and \((\varvec{P})\) can also be controlled.
As a preliminary result, we obtain that for sufficiently small feasibility tolerances \(\widehat{\varepsilon }_{\gamma }\) for all \(\gamma \in \varGamma \), there exists a neighborhood of the optimal solution \(\widehat{\mathrm {x}}^{\ell } := \big ((\widehat{z}_t^\ell , \widehat{x}_t^\ell , \widehat{y}_t^\ell ) \big )_{t=1,\ldots ,T}\) of problem \((\varvec{\widehat{P}^{\ell }})\) containing a feasible point \(\widetilde{\mathrm {x}}^{\ell } := \big ((\widetilde{z}_t^\ell , \widetilde{x}_t^\ell , \widetilde{y}_t^\ell ) \big )_{t=1,\ldots ,T}\) of \((\varvec{P})\). This follows primarily from the continuity of all functions in \((\varvec{P})\).
Lemma 4.6
For any \(\delta > 0\), there exists some \(\hat{\ell } \in {\mathbb {N}}\) such that for all \(\ell \ge \hat{\ell }\) there exists some feasible point \(\widetilde{\mathrm {x}}^{\ell }\) of \((\varvec{P})\) with
\(\Vert \widetilde{\mathrm {x}}^{\ell }  \widehat{\mathrm {x}}^{\ell } \Vert _2 \le \delta \).
Applying Lemma 4.6 yields the following result with respect to the deviation in the optimal value between \((\varvec{\widehat{P}^{\ell }})\) and \((\varvec{P})\).
Theorem 4.7
For any \(\varepsilon > 0\), there exists some \(\hat{\ell } \in {\mathbb {N}}\) such that for all \(\ell \ge \hat{\ell }\) we have \(v - \widehat{v}^{\ell } \le \varepsilon \).
Proof
The proof makes use of the Lipschitz continuity of \(f_t\), Lemma 4.5 and Lemma 4.6 to bound \(v - \widehat{v}^\ell \) from above by \(L_f \delta + \sum _{t=1}^T \widehat{\varepsilon }_{f_t}\) (with \(\widehat{\varepsilon }_{f_t}\) deduced from \(\widehat{\varepsilon }_\gamma \) with \(\gamma = f_t\)). The assertion then follows with \(\varepsilon := L_{f} \delta + \sum _{t=1}^T \widehat{\varepsilon }_{f_t}\). For a detailed proof see Appendix F. \(\square \)
We obtain the central convergence result for NCNBD:
Theorem 4.8
NCNBD has the following convergence properties:

(a)
Assume that for all \(\ell \) the MILP \((\varvec{\widehat{P}^\ell })\) is solved to global optimality in a finite number of steps. Then, if NCNBD does not terminate with \(\varepsilon = 0\), the sequence of lower bounds \((\underline{\widehat{v}}^\ell )_{\ell \in {\mathbb {N}}}\) converges to v and the outer loop solutions converge to an optimal solution of \((\varvec{P})\).

(b)
Let \(\varepsilon = \widehat{\varepsilon } > 0\). Then, if NCNBD does not terminate, it converges to an \(\widehat{\varepsilon }\)-optimal solution of \((\varvec{P})\).

(c)
For any \(\varepsilon > \widehat{\varepsilon } > 0\), NCNBD terminates with an \(\varepsilon \)-optimal solution of \((\varvec{P})\) after a finite number of steps.
Proof
See Appendix G. \(\square \)
Computational results
We illustrate the adequacy of NCNBD for solving multistage nonconvex MINLPs by applying it to moderate-sized instances of a unit commitment problem. NCNBD is implemented in Julia 1.5.3 [7] based on the SDDP.jl package [12], which provides an existing implementation of SDDP. More implementation details are presented in Appendix H.
The considered unit commitment problem is formally described in detail in Appendix I. Importantly, the considered problem contains binary state variables, but also continuous state variables, such that a binary approximation of the state variables is required in the backward pass of NCNBD. Additionally, all instances contain a nonlinear function in the objective. In the base instances, we consider a concave quadratic emission cost curve in the objective. In the valve-point instances, additionally, we consider a nonconvex fuel cost curve with a sinusoidal term. In both cases, we analyze instances with 2 to 36 stages and 3 to 10 generators, resulting in 6 to 20 state variables. More details on our parameter settings and the complete test results for all instances are presented in Appendix I.
The results show that NCNBD succeeds in solving multistage nonconvex MINLPs with a moderate number of stages and state variables to (approximate) global optimality. It converges to the globally minimal point for each of the instances and, considering our 1% tolerance, terminates with valid upper and lower bounds for v.
For the base instances, we observe computation times of several minutes, whereas state-of-the-art MINLP solvers solve the same problems in a few seconds. We address some of the reasons for this behavior, and possible remedies, at the end of this section. As the results for problems with a small number of state variables but many stages look most promising, we focus on such instances in our valve-point tests.
For these instances, the sinusoidal terms in the objective exclude many existing general-purpose solvers from application. A sample of the obtained results is presented in Table 1; for the complete results, see Appendix I. The results show that NCNBD is less efficient than existing solvers for problems with few stages, but becomes competitive as the number of stages grows. Especially for the instances with 36 stages, conventional solvers have difficulties closing the optimality gap, while NCNBD manages to solve the instances in reasonable time.
These results confirm that NCNBD should be best suited for multistage problems with a large number of stages but a relatively small number of state variables, as the subproblems then remain sufficiently small even for a larger number of iterations, while general-purpose solvers may start to struggle due to the combination of many stages and nonlinear terms. Therefore, NCNBD may also be useful for stochastic programs whose deterministic equivalent is computationally intractable for monolithic approaches. Investigating this is left for future research.
While some of the test results look promising, we still see substantial potential for improvement. This should also help to make NCNBD more efficient and competitive for problems with a larger number of state variables. It is a known drawback of SDDiP [55], inherited by NCNBD, that existing methods to solve the Lagrangian dual problems may take extremely long to converge. To some extent, this could be mitigated by additionally using different cut types such as strengthened Benders cuts [55], thus constructing tight cuts only every few iterations. Yet, developing more efficient solution methods is an important open research question.
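Part of the cost of solving the Lagrangian duals is that each dual-function evaluation requires a full mixed-integer subproblem solve, and many evaluations may be needed before convergence. The toy Python sketch below uses projected subgradient ascent as a simple stand-in for Kelley's method on a single-constraint dual (the problem data are invented for illustration); it also exhibits the duality gap that is typical for mixed-integer subproblems:

```python
# Toy Lagrangian dual: relax the constraint x >= b of min {c*x : x in X, x >= b}
# with multiplier lam >= 0, giving L(lam) = min_{x in X} c*x + lam*(b - x).
X = [0.0, 1.0, 2.0, 3.0]   # small discrete feasible set (invented data)
c, b = 2.0, 1.5

def dual_function(lam):
    # Each evaluation is one (here trivial) mixed-integer subproblem solve.
    vals = [(c * x + lam * (b - x), x) for x in X]
    return min(vals)  # (dual value, inner minimizer)

lam = 0.0
for k in range(1, 201):
    _, x_star = dual_function(lam)
    g = b - x_star                       # subgradient of the concave dual
    lam = max(0.0, lam + (1.0 / k) * g)  # projected ascent, diminishing steps
dual_value, _ = dual_function(lam)
# dual_value approaches 3.0, while the primal optimum is 4.0 (at x = 2):
# a duality gap remains because X is discrete.
```

Kelley's method and level bundle methods, as used in the paper's implementation, replace the diminishing-step update by a cutting-plane model of the dual function, but the per-iteration cost of evaluating it remains.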
Additionally, with each projected cut, the considered subproblems become considerably larger. While we implemented a simple cut selection scheme to reduce the subproblem size, more sophisticated approaches may be required to keep the subproblems tractable for applications with many state variables.
Finally, so far, we assume that the outer loop MINLPs are solved to global optimality (A5) directly. In a more efficient implementation of NCNBD, these subproblems should be approximated as well.
Conclusion
We propose the nonconvex nested Benders decomposition (NCNBD) method to solve multistage nonconvex MINLPs. The method is based on combining piecewise linear relaxations, regularization, binary approximation and cutting-plane techniques in a unique and dynamic way. We are able to prove that NCNBD is guaranteed to compute an \(\varepsilon \)-optimal solution for the original MINLP in a finite number of steps. Computational results for some moderate-sized instances of a unit commitment problem demonstrate its applicability to multistage problems.
We require all constraints to be continuous and the objective function to be Lipschitz continuous, which are common assumptions in nonlinear optimization. We also assume complete recourse for the multistage problem. Moreover, the regularization factors are assumed to be sufficiently large to ensure exact penalization in the regularized subproblems. If this is not the case, the factors can be increased gradually within NCNBD.
In contrast to previous approaches to solve multistage nonconvex problems, we do not require the value functions to be monotonic in the state variables [36] and allow the state variables to enter not only the objective function, but also the constraints. The latter avoids the assumption of oracles to handle indicator functions [52].
In NCNBD, we combine dynamic binary approximation of the state variables, cutting-plane techniques tailor-made for binary state variables and a projection from the binary to the original space. This way, we are able to obtain nonconvex, piecewise linear cuts to approximate the nonconvex value functions of multistage MILPs. Using some additional regularization, this is even possible if those value functions are not (Lipschitz) continuous. Together with piecewise linear relaxations, this yields nonconvex underestimators for the nonconvex value functions of MINLPs. All approximations are refined dynamically and, by careful design, it is ensured that all cuts remain valid even with such refinements.
The proposed method can be enhanced to solve stochastic MINLPs as well. In particular, a sampling-based approach as in SDDP could be used. In that case, however, some adaptations are required with respect to the refinement criteria (forward solutions may remain unchanged for several iterations until the right scenarios are sampled) and the convergence checks.
While the presented version of NCNBD already uses approximations which are dynamically refined, different strategies may be even more dynamic and efficient in practice. For instance, the piecewise linear relaxations could be refined dynamically in the inner loop as well.
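As a simplified illustration of such piecewise linear relaxations, the following Python sketch encloses a univariate nonlinear function between two shifted piecewise linear functions. The maximum interpolation error is estimated here by dense sampling; the construction in the paper guarantees validity of the relaxation exactly, so this is only a conceptual sketch:

```python
def pwl_relaxation(f, a, b, n, samples=1000):
    """Piecewise linear interpolant of f on n+1 uniform breakpoints over [a, b],
    shifted down/up by the (sampled) maximum interpolation error delta, so that
    interp(x) - delta <= f(x) <= interp(x) + delta (approximately)."""
    xs = [a + i * (b - a) / n for i in range(n + 1)]
    ys = [f(x) for x in xs]

    def interp(x):
        i = min(int((x - a) / (b - a) * n), n - 1)   # segment index
        t = (x - xs[i]) / (xs[i + 1] - xs[i])        # position in segment
        return (1 - t) * ys[i] + t * ys[i + 1]

    grid = [a + j * (b - a) / samples for j in range(samples + 1)]
    delta = max(abs(f(x) - interp(x)) for x in grid)  # sampled max deviation
    return interp, delta
```

Refining the relaxation, as discussed above, corresponds to increasing `n` locally, which shrinks `delta` and tightens the MILP outer approximation.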
The main drawback of NCNBD is that the considered subproblems can become very large, since for the binary approximation, the piecewise linear approximations and the cut projection, a large number of additional variables and constraints may have to be introduced. This can become problematic, especially if a very high binary expansion precision is required to approximate the value functions sufficiently well in the forward solutions. Recent results show that the required number of binary variables K grows linearly with the dimension \(n_t\) of the state variables and logarithmically with the inverse of the binary precision \(\beta _t\) [55].
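The cited growth rate can be made concrete: to represent a continuous state component \(x \in [0, U]\) up to precision \(\beta _t\) as \(x \approx \beta _t \sum _k 2^k \lambda _k\) with \(\lambda _k \in \{0,1\}\), about \(K = \lceil \log _2(U/\beta _t + 1) \rceil \) binary variables suffice, and one such expansion is needed per state dimension, hence the linear growth in \(n_t\). A Python illustration (the formula for K follows [55]; the nearest-multiple rounding is our simplification):

```python
import math

def binary_expansion(x, U, beta):
    """Approximate x in [0, U] by beta * sum_k 2^k * lam_k with lam_k in {0, 1}.
    Returns the number of binaries K, the bits, and the approximated value."""
    K = math.ceil(math.log2(U / beta + 1))   # binaries needed: log in 1/beta
    n = round(x / beta)                      # nearest representable multiple
    lams = [(n >> k) & 1 for k in range(K)]  # binary digits of n
    approx = beta * sum(2 ** k * lam for k, lam in enumerate(lams))
    return K, lams, approx
```

Halving \(\beta _t\) adds roughly one binary variable per state component, while the approximation error stays bounded by \(\beta _t/2\).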
Therefore, in its current form, NCNBD is best applicable to multistage MINLPs which are too large to solve in their extensive form, but for which each subproblem is sufficiently small and contains only a few nonlinear functions.
References
Ahmed, S., Cabral, F.G., Freitas Paulo da Costa, B.: Stochastic Lipschitz dynamic programming. Math. Program. (2020)
Bacci, T., Frangioni, A., Gentile, C., TavlaridisGyparakis, K.: New MINLP formulations for the unit commitment problems with ramping constraints. Preprint, http://www.optimizationonline.org/DB_FILE/2019/10/7426.pdf (2019)
Beasley, J.E.: ORlibrary: distributing test problems by electronic mail. J. Oper. Res. Soc. 41(11), 1069 (1990)
Belotti, P., Kirches, C., Leyffer, S., Linderoth, J., Luedtke, J., Mahajan, A.: Mixed-integer nonlinear optimization. Acta Numer. 22, 1–131 (2013)
Belotti, P., Lee, J., Liberti, L., Margot, F., Wächter, A.: Branching and bounds tightening techniques for nonconvex MINLP. Optim. Methods Softw. 24, 597–634 (2009)
Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numer. Math. 4(1), 238–252 (1962)
Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to numerical computing. SIAM Rev. 59(1), 65–98 (2017)
Birge, J.R.: Decomposition and partitioning methods for multistage stochastic linear programs. Oper. Res. 33(5), 989–1007 (1985)
Burlacu, R., Geißler, B., Schewe, L.: Solving mixed-integer nonlinear programmes using adaptively refined mixed-integer linear programmes. Optim. Methods Softw. 35(1), 37–64 (2020)
Cerisola, S., Latorre, J.M., Ramos, A.: Stochastic dual dynamic programming applied to nonconvex hydrothermal models. Eur. J. Oper. Res. 218(3), 687–697 (2012)
Van Dinter, J., Rebennack, S., Kallrath, J., Denholm, P., Newman, A.: The unit commitment model with concave emissions costs: a hybrid Benders' decomposition with nonconvex master problems. Ann. Oper. Res. 210(1), 361–386 (2013)
Dowson, O., Kapelevich, L.: SDDP.jl: a Julia package for stochastic dual dynamic programming. INFORMS J. Comput. (2020)
Dunning, I., Huchette, J., Lubin, M.: JuMP: a modeling language for mathematical optimization. SIAM Rev. 59(2), 295–320 (2017)
Feizollahi, M.J., Ahmed, S., Sun, A.: Exact augmented Lagrangian duality for mixed integer linear programming. Math. Program. 161(1–2), 365–387 (2017)
Füllner, C.: NCNBD.jl. Code released on GitHub https://github.com/ChrisFuelOR/NCNBD.jl (2021)
Gamrath, G., Anderson, D., Bestuzheva, K., Chen, W.K., Eifler, L., Gasse, M., Gemander, P., Gleixner, A., Gottwald, L., Halbig, K., Hendel, G., Hojny, C., Koch, T., Le Bodic, P., Maher, S. J., Matter, F., Miltenberger, M., Mühmer, E., Müller, B., Pfetsch, M. E., Schlösser, F., Serrano, F., Shinano, Y., Tawfik, C., Vigerske, S., Wegscheider, F., Weninger, D., Witzig, J.: The SCIP optimization suite 7.0. Technical report, Optimization Online (2020)
GAMS Software GmbH. GAMS.jl. Code released on GitHub https://github.com/GAMSdev/gams.jl (2020)
Geißler, B.: Towards globally optimal solutions for MINLPs by discretization techniques with applications in gas network optimization. PhD thesis, FriedrichAlexanderUniversität ErlangenNürnberg (2011)
Geoffrion, A.M.: Generalized Benders decomposition. J. Optim. Theory Appl. 10(4), 237–260 (1972)
LLC Gurobi Optimization. Gurobi optimization reference manual (2021)
Glover, F.: Improved linear integer programming formulations of nonlinear integer problems. Manage. Sci. 22(4), 455–460 (1975)
Hjelmeland, M.N., Zou, J., Helseth, A., Ahmed, S.: Nonconvex medium-term hydropower scheduling by stochastic dual dynamic integer programming. IEEE Trans. Sustain. Energy 10(1) (2019)
Infanger, G., Morton, D.P.: Cut sharing for multistage stochastic linear programs with interstage dependency. Math. Program. 75(2), 241–256 (1996)
Kapelevich, L.: SDDiP.jl. Code released on GitHub https://github.com/lkapelevich/SDDiP.jl (2018)
Kallrath, J., Rebennack, S.: Computing area-tight piecewise linear overestimators, underestimators and tubes for univariate functions. In: Optimization in Science and Engineering, pp. 273–292. Springer (2014)
Kannan, R.: Algorithms, analysis and software for the global optimization of twostage stochastic programs. PhD thesis, Massachusetts Institute of Technology (2018)
Kilinç, M.R., Sahinidis, N.V.: Exploiting integrality in the global optimization of mixed-integer nonlinear programming problems with BARON. Optim. Methods Softw. 33(3), 540–562 (2018)
Li, C., Grossmann, I.E.: A generalized Benders decomposition-based branch and cut algorithm for two-stage stochastic programs with nonconvex constraints and mixed-binary first and second stage variables. J. Global Optim. 75(2), 247–272 (2019)
Li, X., Chen, Y., Barton, P.I.: Nonconvex generalized Benders decomposition with piecewise convex relaxations for global optimization of integrated process design and operation problems. Ind. Eng. Chem. Res. 51(21), 7287–7299 (2012)
Li, X., Tomasgard, A., Barton, P.I.: Nonconvex generalized Benders decomposition for stochastic separable mixed-integer nonlinear programs. J. Optim. Theory Appl. 151, 425–454 (2011)
Meyer, R.R.: Integer and mixedinteger programming models: general properties. J. Optim. Theory Appl. 3(4) (1975)
Misener, R., Floudas, C.A.: ANTIGONE: algorithms for continuous/integer global optimization of nonlinear equations. J. Global Optim. 59(2–3), 503–526 (2014)
Ogbe, E., Li, X.: A joint decomposition method for global optimization of multiscenario nonconvex mixed-integer nonlinear programs. J. Global Optim. 75, 595–629 (2018)
Pedroso, J.P., Kubo, M., Viana, A.: Unit commitment with valve-point loading effect. Technical report, Universidade do Porto (2014)
Pereira, M.V.F., Pinto, L.M.V.G.: Multistage stochastic optimization applied to energy planning. Math. Program. 52(1–3), 359–375 (1991)
Philpott, A.B., Wahid, F., Bonnans, F.: MIDAS: a mixed integer dynamic approximation scheme. Math. Program. (2019)
Rebennack, S.: Combining samplingbased and scenariobased nested Benders decomposition methods: application to stochastic dual dynamic programming. Math. Program. 156(1–2), 343–389 (2016)
Rebennack, S.: Computing tight bounds via piecewise linear functions through the example of circle cutting problems. Math. Methods Oper. Res. 84(1), 3–57 (2016)
Rebennack, S., Kallrath, J.: Continuous piecewise linear deltaapproximations for univariate functions: computing minimal breakpoint systems. J. Optim. Theory Appl. 167(2), 617–643 (2015)
Rebennack, S., Krasko, V.: Piecewise linear function fitting via mixedinteger linear programming. INFORMS J. Comput. (2020)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (2009)
Sahinidis, N.V.: BARON 17.8.9: global optimization of mixedinteger nonlinear programs, User’s Manual (2017)
Schnetter, E.: Delaunay.jl. Code released on GitHub https://github.com/eschnett/Delaunay.jl (2020)
Schrage, L.: LindoSystems: LindoAPI (2004)
Steeger, G., Lohmann, T., Rebennack, S.: Strategic bidding for a pricemaker hydroelectric producer: stochastic dual dynamic programming and Lagrangian relaxation. IISE Trans. 47, 1–14 (2018)
Steeger, G., Rebennack, S.: Dynamic convexification within nested Benders decomposition using Lagrangian relaxation: an application to the strategic bidding problem. Eur. J. Oper. Res. 257(2), 669–686 (2017)
Tawarmalani, M., Sahinidis, N.: Convex extensions and envelopes of lower semicontinuous functions. Math. Program. 93, 247–263 (2002)
Tawarmalani, M., Sahinidis, N.V.: A polyhedral branchandcut approach to global optimization. Math. Program. 103, 225–249 (2005)
van Slyke, R.M., Wets, R.: L-shaped linear programs with applications to optimal control and stochastic programming. J. SIAM Appl. Math. 17(4), 638–663 (1969)
Vielma, J.P., Ahmed, S., Nemhauser, G.: Mixed-integer models for nonseparable piecewise-linear optimization: unifying framework and extensions. Oper. Res. 58(2), 303–315 (2010)
Vielma, J.P., Nemhauser, G.L.: Modeling disjunctive constraints with a logarithmic number of binary variables and constraints. Math. Program. 128(1–2), 49–72 (2011)
Zhang, S., Sun, X. A.: Stochastic dual dynamic programming for multistage stochastic mixedinteger nonlinear optimization. (2021). Available at https://arxiv.org/abs/1912.13278. Accessed 29 Nov 2021
Zhou, K., Kilinç, M.R., Chen, X., Sahinidis, N.V.: An efficient strategy for the activation of MIP relaxations in a multicore global MINLP solver. J. Global Optim. 70(3), 497–516 (2018)
Zou, J., Ahmed, S., Sun, X.: Multistage stochastic unit commitment using stochastic dual dynamic integer programming. IEEE Trans. Power Syst. 34(3) (2019)
Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program. 175(1–2), 461–502 (2019)
Acknowledgements
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)–445857709. The authors are grateful to the anonymous reviewers and the editor for their thoughtful and substantial remarks on earlier versions of this manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Appendices
Proof of Lemma 3.8
Proof
We start by proving the second inequality. We have
The inequality follows from \(\Vert B_{t-1} \Vert \) being the matrix norm induced by the used vector norm. The last equality is obtained by choosing the same norm and \(\alpha _t := \sigma _t \Vert B_{t-1} \Vert \) as regularization factor in \((\varvec{\widehat{P}^{R,i}_{{\mathbb {B}}t}(\lambda _{t-1}^i, \mathfrak {Q}_{t+1}^{i+1})})\).
To show the first inequality, we construct a dual vector componentwise by
This vector is feasible, as it satisfies \(\Vert \widehat{\pi }_{t} \Vert _\infty \le l_t\). By feasibility and by definition of the Lagrangian dual \((\varvec{D_{{\mathbb {B}}t}^{i}(\lambda _{t-1}^i, \mathfrak {Q}_{t+1}^{i+1})})\) it follows
Moreover, by construction we have \(\widehat{\pi }_{tj} (\lambda _{t-1,j}^i - \mathfrak {z}_{tj}) = l_t |\lambda _{t-1,j}^i - \mathfrak {z}_{tj}| \) in each component j. Inserting this into (14) and choosing \(l_t \ge \alpha _t\), we obtain the first inequality.
Proof of Lemma 3.10
Proof
Consider line (4) in the KKT conditions. By rearranging and taking norms on both sides, we obtain
The inequalities follow with the triangle inequality and with the compatibility of the matrix norm.
We can bound all three norms in (15) individually. In the Lagrangian dual, we have \(\Vert \pi _{t+1} \Vert \le \sigma _{t+1} \Vert B_{t} \Vert \). With Remark 3.9, we can bound \(\Vert B^\top _t \Vert \) and with Assumption (A4), we have \(\Vert \eta _t \Vert \le \rho _t\).
For example, using the \(\infty \)-norm, we obtain
By the equivalence of norms, we obtain similar bounds using other norms. This means that every entry of \(\nu _t\mu _t\) is bounded by this constant. Moreover, since in each component only \(\nu _t\) or \(\mu _t\) can be nonzero, this also implies that the components of \(\nu _t\) and \(\mu _t\) are bounded by this constant. \(\square \)
Proof of Lemma 4.1
Proof
From the Lipschitz continuity of \(\underline{Q}_t^{R,i}\) we have
Analogously, using Assumption (A4) and Lemma 3.11, for the cut approximation we obtain
Starting with (17) it follows
The second inequality follows from the definition of \(\mathfrak {Q}_t^{i+1}(\cdot )\). The third inequality follows from Lemma 3.8 and the last one is obtained using (16). \(\square \)
Proof of Lemma 4.2
Proof
The structure of the proof is inspired by the proof of Lemma 4 in [1].
As the inner loop does not terminate and X is compact, there exists an infinite sequence of forward pass trial solutions \(( \widehat{x}^i )_{i \in {\mathbb {N}}}\) with cluster points. Let \(\widehat{x}^* \in X\) be such a cluster point and \((\widehat{x}^{i_j})_{j \in {\mathbb {N}}}\) a subsequence of \((\widehat{x}^i)_{i \in {\mathbb {N}}}\) with \(\lim _{j \rightarrow \infty } \widehat{x}^{i_j} = \widehat{x}^*\).
We show that \(\lim _{j} \mathfrak {Q}_t^{i_j}(\widehat{x}_{t-1}^{i_j}) \ge \widehat{Q}_t^R(\widehat{x}_{t-1}^{*})\) holds by backward induction.
For \(t=T+1\), this relation is trivially true, since no stage follows after T. Now, assume it already holds for stage \(t+1\), i.e.,
We consider two subsequent indices in the subsequence \((\widehat{x}^{i_j})_{j \in {\mathbb {N}}}\).
where the first inequality follows from the monotonicity of \(\mathfrak {Q}_t(\cdot )\) in i and the second inequality uses Lemma 3.11.
By adding zero, we obtain
We can now also use the monotonicity of \(\underline{\widehat{Q}}_t^{R}(\cdot )\) in i and apply Lemma 4.1 to obtain
Moreover, we expand
and
with corresponding optimal points \((\widehat{z}_t^{i_{j-1}}, \widehat{x}_t^{i_{j-1}}, \widehat{y}_t^{i_{j-1}})\) and \((\tilde{z}_t, \tilde{x}_t, \tilde{y}_t)\).
Then with (20) it follows
as the solution from (19) is feasible. Thus,
Rearranging yields
With inserting (21) in (18), we obtain
We take limits on both sides. \((*)\) converges to \(\widehat{Q}_t^{R} (\widehat{x}_{t-1}^*)\), since the function is continuous. \((\#)\) becomes greater than or equal to zero by the induction hypothesis. \((+)\) tends to zero by Lemma 3.6, since as j tends to \(\infty \), the binary precision \(\beta _t\) goes to 0. \((-)\) tends to zero as \(\widehat{x}_{t-1}^{i_j}\) and \(\widehat{x}_{t-1}^{i_{j-1}}\) both converge to \(\widehat{x}_{t-1}^*\).
Thus, the induction is proven for t. As this result holds for any cluster point of \((\widehat{x}^i)_{i \in {\mathbb {N}}}\), the assertion follows. \(\square \)
Proof of Theorem 4.3
Proof
Consider the first-stage optimal value \(\widehat{v}^{FP, i}\) of the forward pass. By recursion we obtain
and hence
As in the proof of Lemma 4.2, let \((\widehat{x}^{i_j})_{j \in {\mathbb {N}}}\) denote a convergent subsequence of \((\widehat{x}^i)_{i \in {\mathbb {N}}}\) with \(\lim _j \widehat{x}^{i_j} = \widehat{x}^*\). Applying (22) to this subsequence and taking limits on both sides yields
The second inequality here stems from Equation (13). Using Lemma 3.4 and Assumption (A3) yields \(\lim _j \widehat{v}^{FP,i_j} \ge \widehat{v}\).
As \(\widehat{v}^{FP,i_j}\) is also a lower bound to \(\widehat{v}\), we have \(\lim _j \widehat{v}^{FP,i_j} \le \widehat{v}\). Thus, \(\lim _j \widehat{v}^{FP,i_j} = \widehat{v}\). Since this is true for any cluster point \(\widehat{x}^*\) of \((\widehat{x}^i)_{i \in {\mathbb {N}}}\), the inner loop converges to the optimal value \(\widehat{v}\). With a similar reasoning it follows that every such cluster point is an optimal point of \((\widehat{P})\). \(\square \)
Proof of Theorem 4.7
Proof
Let \(\mathrm {x}^{*} := \big ((z_t^*, x_t^*, y_t^*) \big )_{t=1,\ldots ,T}\) be an optimal point of \((\varvec{P})\) and let \(\big ( (\widehat{z}_t^\ell , \widehat{x}_t^\ell , \widehat{y}_t^\ell ) \big )_{t=1,\ldots ,T}\) be an optimal point of its outer approximation \((\varvec{\widehat{P}^\ell })\). Then, we have
As \((\varvec{\widehat{P}^\ell })\) is a relaxation of \((\varvec{P})\), this expression is clearly nonnegative for all \(\ell \). Moreover, analogously to the feasibility result in Lemma 4.5, for sufficiently large \(\ell \) and all t we have
We distinguish two cases: First, let \(\sum _{t=1}^T f_t(x_t^*, y_t^*) \le \sum _{t=1}^T f_t(\widehat{x}_t^\ell , \widehat{y}_t^\ell )\), e.g., because \((\widehat{z}_t^\ell , \widehat{x}_t^\ell , \widehat{y}_t^\ell )\) is feasible for \((\varvec{P})\). Then, inserting this into (23) and using (24), it directly follows that \(v - \widehat{v}^\ell \le \sum _{t=1}^T \varepsilon _{f_t}\).
Now let \(\sum _{t=1}^T f_t(x_t^*, y_t^*) > \sum _{t=1}^T f_t(\widehat{x}_t^\ell , \widehat{y}_t^\ell )\). With Lemma 4.6, for any \(\delta > 0\), there exists some \(\widehat{\ell } \in {\mathbb {N}}\) such that for all \(\ell \ge \widehat{\ell }\) there exists some feasible point \(\widetilde{\mathrm {x}}^{\ell }:= \big ((\widetilde{z}_t^\ell , \widetilde{x}_t^\ell , \widetilde{y}_t^\ell ) \big )_{t=1,\ldots ,T}\) of \((\varvec{P})\) with \(\Vert \widetilde{\mathrm {x}}^{\ell } - \widehat{\mathrm {x}}^{\ell } \Vert _2 \le \delta \).
Clearly, \(\sum _{t=1}^T f_t(x_t^*, y_t^*) \le \sum _{t=1}^T f_t(\widetilde{x}_t^\ell , \widetilde{y}_t^\ell )\). Therefore,
With Assumption (A1) \(f_t\) is Lipschitz continuous with some constant \(L_{f_t} > 0\). Thus, \(\sum _{t=1}^T f_t\) is Lipschitz continuous with constant \(L_f := \sum _{t=1}^T L_{f_t}\) and (25) can be bounded from above by \(L_f \delta \).
We can write the righthand side of (23) as
Then, with (24) and the previous result, it follows that \(v - \widehat{v}^\ell \le L_f \delta + \sum _{t=1}^T \varepsilon _{f_t}\).
Choosing \(\varepsilon := L_{f} \delta + \sum _{t=1}^T \varepsilon _{f_t}\) proves the assertion. \(\square \)
Proof of Theorem 4.8
Proof

(a)
From Theorem 4.7 it follows that if NCNBD does not terminate for \(\varepsilon =0\), infinitely many piecewise linear relaxation refinements occur and \(\widehat{v}^\ell \) converges to v. Using the premise, we have \(\widehat{v}^\ell = \underline{\widehat{v}}^\ell \). Therefore, \(\underline{\widehat{v}}^\ell \) converges to v. This also implies that the cut approximations \(\mathfrak {Q}_{t+1}^\ell (\cdot )\) become tight at \(x_t^\ell \) asymptotically. Thus, the solutions \(\big ( (z_t^\ell , x_t^\ell , y_t^\ell ) \big )_{t=1,\ldots ,T}\) converge to an optimal solution for \((\varvec{P})\).

(b)
For sufficiently large \(\ell \), as in the proof of Theorem 4.7, we have
$$\begin{aligned} \begin{aligned} \sum _{t=1}^T f_t (\widehat{x}_t^\ell , \widehat{y}_t^\ell )&\le \sum _{t=1}^T \widehat{f}^\ell _t (\widehat{x}_t^\ell , \widehat{y}_t^\ell ) + \sum _{t=1}^T \varepsilon _{f_t}. \end{aligned} \end{aligned}$$Using this, and the termination of the inner loop, it follows
$$\begin{aligned} \begin{aligned} \sum _{t=1}^T f_t (\widehat{x}_t^\ell , \widehat{y}_t^\ell )&\le \overline{\widehat{v}}^\ell + \sum _{t=1}^T \varepsilon _{f_t} \le \underline{\widehat{v}}^\ell + \widehat{\varepsilon } + \sum _{t=1}^T \varepsilon _{f_t} \le v + \widehat{\varepsilon } + \sum _{t=1}^T \varepsilon _{f_t}. \end{aligned} \end{aligned}$$For \(\ell \) approaching infinity, \(\widehat{x}^\ell \) becomes feasible. Thus,
$$\begin{aligned} v \le \lim _\ell \overline{v}^\ell \le \lim _\ell \sum _{t=1}^T f_t (\widehat{x}_t^\ell , \widehat{y}_t^\ell ) \le \lim _\ell v + \widehat{\varepsilon } + \sum _{t=1}^T \varepsilon _{f_t} = v + \widehat{\varepsilon }. \end{aligned}$$Since \(\overline{v}^\ell \) is bounded from above and nonincreasing, the limit exists. This proves the assertion.

(c)
This follows directly from (b).
\(\square \)
Implementation details
The NCNBD method is implemented in Julia 1.5.3 [7] using the JuMP.jl package [13] for optimization. The implementation is mainly derived from the package SDDP.jl [12], which is enhanced by extensions specific to NCNBD. To model piecewise linear approximations of multidimensional functions, we draw on Delaunay.jl [43] to determine triangulations. All MILP subproblems are solved with CPLEX and all MINLP subproblems are solved with appropriate MINLP solvers, both accessed using GAMS.jl [17]. The Lagrangian duals are solved using Kelley's cutting-plane method or a level bundle method, as implemented in SDDiP.jl [24]. To reduce the size of the considered subproblems, a very basic cut selection technique based on SDDP.jl is used. In our case, however, not only the previously visited trial points but also the anchor points are used to determine dominated cuts. Our code is available on GitHub [15].
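The dominance idea behind such cut selection can be sketched simply: a cut is kept if it attains the pointwise maximum of the cut collection at at least one stored trial or anchor point; otherwise it is dominated at all stored points. A simplified, hypothetical Python sketch (not the actual SDDP.jl code):

```python
def select_cuts(cuts, points):
    """Keep only cuts that are active (attain the pointwise maximum of the
    cut collection) at some stored trial/anchor point; drop the rest as
    dominated at all stored points."""
    keep = set()
    for p in points:
        values = [cut(p) for cut in cuts]
        keep.add(max(range(len(cuts)), key=values.__getitem__))
    return [cuts[i] for i in sorted(keep)]
```

Including the anchor points in `points`, as described above, reduces the risk of discarding cuts that are only active between trial points.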
All computations are performed on a machine with Intel(R) Xeon(R) E5-1630 v4 CPU and 128 GB RAM. All benchmark runs using state-of-the-art solvers are executed in GAMS 32.1.0.
Unit commitment problem formulation and results
We consider a unit commitment problem with thermal generators based on [2] for some first tests of NCNBD. To formulate this problem, we define the following elements:
Sets:

\(\mathcal {G}\): set of thermal generators
Data:

\(p_g^f\): price of fuel for generator g [EUR/MWh]

\(p_g^{\overline{s}}\): price of start-up for generator g [EUR]

\(p_g^{\underline{s}}\): price of shut-down for generator g [EUR]

\(p_d\): price for not meeting demand or load shedding [EUR/MWh]

\(p_e\): tax on emissions from generators [EUR/kg]

\(d_t\): demand at hour t [MWh]

\(\overline{c}_g\): maximum hourly generation of generator g [MWh]

\(\underline{c}_g\): minimum hourly generation of generator g [MWh]

\(\overline{r}_g\): ramp-up rate of generator g [MWh/h]

\(\underline{r}_g\): ramp-down rate of generator g [MWh/h]

\(a_g, b_g, c_g\): coefficients of the emission cost curve

\(v^a_g, v^b_g, v^c_g, v^d_g, v^e_g\): coefficients of the fuel cost curve
Decision Variables:

\(x_{gt}:\) electricity production from generator g at time t [MWh]

\(y_{gt}:\) binary variable modeling commitment of generator g at time t

\(\overline{y}_{gt}:\) binary variable modeling start-up of generator g at time t

\(\underline{y}_{gt}:\) binary variable modeling shut-down of generator g at time t

\(\underline{d}_{t}, \overline{d}_{t}:\) variables modeling demand slack at time t
The objective is to minimize the total costs of electricity generation, which consists of different cost components. For all instances, the objective function is nonlinear due to a concave quadratic function modeling emission costs with \(a_g <0\) for all \(g \in \mathcal {G}\) [11].
Additionally, we consider two different types of fuel cost function. In the first case (base instances), the fuel cost function is linear
with a static fuel price \(p_g^f\). In the second case (valvepoint instances), we consider the more sophisticated cost function
including a convex quadratic term and a sinusoidal term, modeling the so-called valve-point effect of steam turbines [34].
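Valve-point fuel cost curves commonly take the classic rectified-sinusoid form; a Python sketch follows, where the roles assigned to the coefficients \(v^a_g, \ldots , v^e_g\) are our assumption for illustration and not taken from Appendix I:

```python
import math

def valve_point_cost(x, va, vb, vc, vd, ve, x_min):
    """Classic valve-point fuel cost: convex quadratic plus a rectified
    sinusoidal ripple caused by steam admission valves (coefficient roles
    assumed for illustration)."""
    return va + vb * x + vc * x ** 2 + abs(vd * math.sin(ve * (x_min - x)))
```

The absolute value makes the function nonsmooth and nonconvex, which is why such terms exclude many general-purpose solvers from application.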
Then, the unit commitment model reads
In this model, both the continuous generation variables \(x_{gt}\) and the commitment variables \(y_{gt}\), for all \(g \in \mathcal {G}, t=1,\ldots ,T\), act as state variables. For the states at time \(t=0\), we use some fixed inputs. (30) denotes the balance between generation and demand. It also contains slack variables for unmet or overfulfilled demand, which are penalized in the objective. By this construction, Assumption (A2) is satisfied. (31)–(32) denote limits on the generator output, while (33)–(34) define ramping constraints. Those ramping constraints require \(x_{gt}\) to be a state variable. (35)–(36) are required to model start-up and shut-down costs.
For our input data, we draw on unit commitment instances created by Frangioni and published in the OR-Library [3]. The data is enhanced by our own assumptions, as it does not cover all inputs of our problem formulation. We consider between 2 and 36 stages and between 3 and 10 generators.
We solve all instances using NCNBD with a relative optimality tolerance of 1% for the outer loop. All MILP subproblems are solved exactly, while the outer loop MINLPs are solved to an optimality tolerance of \(10^{-3}\). We use the lower bound provided by the MINLP solver to ensure that a valid lower bound is still obtained in the outer loop. The Lagrangian duals are solved with an optimality tolerance of \(10^{-4}\). If \(\sigma _t\) is not chosen sufficiently large for some stage t from the beginning, it is increased iteratively within the solution procedure once this is identified. For the base instances, we use BARON [42, 48] to solve the outer loop subproblems; for the valve-point instances, we draw on LINDOGlobal [44], as BARON does not support sinusoidal functions. For the same reason, ANTIGONE [32], SCIP [16] and Gurobi [20] cannot be applied to the valve-point instances, so that we refer to LINDOGlobal and Couenne [5] as benchmarks.
The obtained upper bounds (UB) and lower bounds (LB) for the base instances are summarized in Table 2 and compared with the optimal point obtained by BARON.
All test instances can be solved by the benchmark solvers in a few seconds, thus outperforming NCNBD. Still, these results can be regarded as a proof of concept for applying NCNBD to multistage nonconvex MINLPs, as in each case the globally minimal point is successfully approximated.
For the valve-point instances, we consider a larger number of stages, but only 3 or 4 generators, i.e., 6 or 8 state variables, to focus on cases in which NCNBD looks most promising. For cases with many stages, we test differently scaled demand time series, as this seems to have a considerable effect on solution times. All instances are solved with a maximum solution time of two hours. The results are summarized in Table 3. If a solver does not terminate within the time limit, this is indicated by "-".
For a small number of stages, NCNBD takes significantly more time than conventional solvers. With a larger number of stages, however, this difference vanishes. For 36 stages, NCNBD manages to solve all considered instances in less than two hours, while LINDOGlobal and Couenne show more variance in computation time. For one instance, they still show a 5% optimality gap when terminated after two hours.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Füllner, C., Rebennack, S. Nonconvex nested Benders decomposition. Math. Program. 196, 987–1024 (2022). https://doi.org/10.1007/s10107021017400
Keywords
 Nested Benders decomposition
 Mixedinteger nonlinear programming (MINLP)
 Global optimization
 Nonconvexities
 Nonconvex value functions
Mathematics Subject Classification
 90C26
 90C11
 49M27